Chapter 2
Reading and saving sequences
How do I get a sequence into GPMAW, and how do I save it afterwards? Handling of protein sequences from/to disk and clipboard.
GPMAW normally reads and saves protein sequences in its own format (see Appendix A), but is also able to read a number of other file formats as well as write in FastA format (Export). Furthermore, you can import sequences from almost any source through files and clipboard (Import) as long as the sequence is in standard 1-letter code (see appendix C.3). The sequences can be read from disk, CD-ROM and the Internet.
If you want to enter a sequence manually or edit an already entered sequence, please see Chapter 4.1 for details.
For information on how to work with sequences please read Chapter 3.
The File|Open command is the standard way of reading protein sequences from disk. GPMAW enables you to read sequences in GPMAW and FastA format. For an explanation of the GPMAW format please see the section below. For information on FastA format please see appendix B.

In the 'Open sequence library' dialog box, only files with the extension '.SEQ' will initially be shown. Alternatively you can select 'Old GPMAW format' (no extension) or 'All files' in the 'File type' drop-down box. You select a file by double-clicking on a library name (which opens the file directly), by selecting a library name, or by entering any file name in the ‘File name’ field. In the last two cases you have to press <enter> or the ‘Open’ button. You can also change to a different directory or disk drive in the usual File Manager/Explorer style.
The initial directory displayed will be the one entered as ‘Default working directory’ in the Setup system dialog (see Chapter 5.4). You may also check Appendix D on how to set up GPMAW for multiple users.
If the selected file only contains a single sequence, it will open on the desktop immediately.
Hint: The bottom of the File menu shows the five most recently opened sequence libraries, which can be opened directly. The drop-down arrow next to the ‘File open’ icon opens a menu with the same five file names. Shortcut: Alt + F followed by a number (1-5) will open the corresponding sequence library.
If the file contains multiple sequences, the 'Select sequence' dialog box will list the name and length in residues (in square brackets) of all sequences contained in the file:

Open sequence options:
· You may select all sequences by pressing the button. Alternatively you can select multiple sequences by holding down the Shift key when selecting (continuous selection) or the Ctrl key for a discontinuous selection. All selected sequences will be opened when pressing the button, each sequence in a window by itself.
· Press the button to open the selected sequence and close the dialog box.
· Press the button to close the ‘Select sequence’ dialog without opening further sequences. Any sequences already opened will stay opened.
· Check the 'Sort by name' box to sort the contents of the sequence list alphabetically. Uncheck the box to display the list in file order.
The first part of the currently selected sequence can be viewed just above the status line at the bottom of the dialog.
The status line at the bottom of the dialog shows from left to right:
· the total file size
· the size of the current file as percentage of maximum file size. This box will be green when the file is less than 80% full, and will then turn red, to warn you that you cannot save many more sequences. The maximum file size is 128 kByte or 250 sequences, whichever limit is reached first.
· the mass of currently selected protein
Dragging the edges of the dialog with the mouse can expand the dialog, if you need to see more sequences and/or more of the sequence names.
The primary information saved contains the name and sequence of the protein (in 1-letter code). In addition you should also save the accession number of the protein when available (when reading a FastA indexed database the accession number is retrieved automatically).
In addition you can enter the following information manually (described in detail in Chapter 2 and 3): Cross-links (usually between Cys residues, but can be between any residue), modified residues, N- and C-terminal modifications and annotation. The annotation is a free text field where you can enter any information you want saved with the sequence (see chapter 3.9). When reading SWISS-PROT sequences from the EMBL CD-ROM or PIR sequences from the Atlas CD-ROM, the complete database entry is saved in the annotation. Also when you import data (either from file or clipboard) you have the option of placing the entire record on the annotation page (see chapters 2.5-2.7).
The sequence files of GPMAW can contain up to 300 sequences or 60.000 bytes (characters), whichever limit is reached first. The files are saved in a proprietary format that in addition to the name and the sequence can contain information on accession number, modified residues, cross-links and annotation.
The sequence files are ASCII (text) files that can be edited in a text editor. However, you should only consider opening the sequence files in a text editor if the file gets corrupted. All input and output is much easier handled by GPMAW.
When the file size gets to be more than 90% of capacity you should either delete sequences or create a new file.
For a full description of the file format, please refer to appendix A.
Protein sequences can be acquired from a large number of sources; the most important of these will be discussed.
Note: GPMAW only accepts sequences in 1-letter code. If your sequence is available only in 3-letter code you have to enter the sequence manually using the Edit|Edit new sequence command.
You have to enter the sequence manually using the sequence editor as detailed in Chapter 4.1. If you have a scanner with OCR software you can input through your text editor, or a disk file (use File|Import ASCII). Beware: as most OCR software cannot translate text with 100% accuracy, you must check the entered sequence carefully.
If data are in print, they will usually also be available on-line through the World Wide Web. Please see section below ‘Searching the World Wide Web’.
If the file is in FastA format and smaller than 30000 bytes, GPMAW can read it directly using the File|Open command. If the file is in an ASCII (text) format you can use the File|Import ASCII|From file command (see below). If the file is in a proprietary format (word processor, html etc.) you have to open the sequence in the relevant program and transfer the sequence using ‘cut and paste’.
If you have your sequence as 1-letter code in a word processor (Word, WordPerfect etc.) the easiest way of getting your sequence into GPMAW is to copy it to the clipboard and select File|Import ASCII|from clipboard (see below). Alternatively you can paste into a new sequence (Edit|Edit new sequence): highlight the name in the word processor, copy it, change focus to GPMAW and paste it into the name field of the sequence editor. Then switch back to the word processor, copy the sequence to the clipboard, switch to GPMAW and paste it into the sequence field. If the sequence is in lower case and/or contains extra formatting characters, you can remove these using the relevant buttons in the editor (see Chapter 4.1).
Most sequences can be found through the File|Web Entrez search (see below), but if you find a sequence elsewhere you may proceed as follows:
When you have a sequence loaded into your browser, you can most easily transfer it to GPMAW using ‘copy and paste’. The fastest way of transfer is:
1. Highlight the complete record (Edit|Select all or <Ctrl-A> if the record takes up the whole page).
2. Change focus to GPMAW and select File|Import ASCII|from clipboard.
3. Proceed as detailed in the ‘Import ASCII’ section described below.
Alternatively you can highlight and transfer name, accession number and sequence individually.
You may also download complete databases by FTP. Databases in either FastA or Swiss-Prot format can be indexed with the ‘DBIndex’ program (Chapter 12.3). This utility can be downloaded from the GPMAW web site (see also Appendix B).
If you use the client server version of Entrez, you may transfer sequences either through the clipboard or through a file on disk. In this case you import through the File|Import ASCII|From file command.
If you use the World Wide Web search option, you can cut and paste sequences in 1-letter code through the clipboard.
Although the Web Entrez Access option of GPMAW (Chapter 2.7) uses the same search engine, the search options are not so sophisticated. However, if you need to retrieve a sequence of known name or accession number it is much faster and easier to use this option.
You can type the accession number of your protein of interest into the web retrieval field in the tool bar to retrieve a sequence directly from Expasy or NCBI (Entrez). For details see Chapter 2.7.
GPMAW interfaces directly to the SWISS-PROT and the PIR (Atlas) sequence CD-ROMs (see below and Appendix B).
The GPMAW installation CD from Lighthouse data contains the entire Swiss-Prot database and a protein non-redundant database (EMBL or NCBI). Both databases are already indexed but need to be installed on the user's hard drive before access, see chapter 2.6. All other databases (local or downloaded from the internet) can be indexed and accessed from GPMAW (see Appendix B for details).
You can also get a CD-ROM from Lighthouse data containing current protein databases available on the Internet.
If you have a nucleotide sequence it can be imported and translated to protein sequence through the File|Import ASCII|From file or File|Import ASCII|From clipboard command (see below, ch 2.5).
When you have entered new information in a sequence window (the actual sequence, the name or post-translational modification information), you can save the information to a GPMAW sequence library using either of two commands. The information saved contains at least the name and the protein sequence and may additionally contain chemical modifications on either terminal and individual residues, cross-linked residues and annotation. Information on multiple peptide chains is saved as part of the protein sequence.
For a complete list of the information that can be saved with a sequence please see Appendix A.
The save command saves the currently selected sequence and all modifications to the file and position occupied by a previous instance of the sequence. This means that the command only works when the sequence has been read from a GPMAW sequence library. The program looks for a sequence with the same name and position in the library. If you have changed the name of the sequence or you have just entered or imported the sequence file you will automatically be transferred to the ‘Save as’ command (below).
Note: If you make changes to a sequence (e.g. edit it, change modifications, cross-links etc.) you will be prompted to save the changes when you close the sequence window.
The ‘Save as’ command is used when you want to save your sequence to a new file/position or when you have just entered or imported a new sequence and want to save the information in a GPMAW sequence library.
If you have made changes to the sequence and do not need to save the sequence to a new position, you should use the ‘Save’ command (above).
1) Select the relevant sequence window.
2) Select File|Save as.
3) The 'Save sequence' dialog box will open in the currently selected user directory. You can change to a different directory. By default only sequence libraries (files with the .seq extension) will be displayed.
4) You now either select an existing file or enter a new name. The .SEQ extension is automatically added to the filename.
5) If you select an existing file, the sequence in the active window will be appended to the ones already present in the file.
Note: If you save a sequence with a name that is already present in the sequence file, the name will get a ‘rev.1’ (or ‘rev2’ etc.) attached to the end of the name.
This command is similar to the ‘Save’ command above, but information on the underlined areas of the sequences (chapter 3.4) is saved along with the other information.
Note: This information is not saved dynamically; you have to save it specifically when you make changes. Neither are you informed about changes in underlining that need to be saved.
When you want to remove a sequence from a file you:
1) Select File|Delete sequence.
2) From the 'Select file' dialog box you select the file containing the sequence to be deleted. By default only sequence libraries (files with the .seq extension) will be displayed.
3) In the 'Select sequence' dialog box you select the sequence to be deleted. If you have multiple occurrences of the same sequence name you should remember that new sequences are always appended to the end of the file.
4) From the 'Delete sequence' dialog box you select when you have verified that the selected sequence is correct.
When you remove the last sequence from a file, the file will be removed completely.
When you delete a sequence, the previous sequence file is saved with the same name but with the extension ‘.BAK’. This ensures that if you delete a sequence by accident, you will be able to retrieve it (until the next save or delete operation on the file).
If you are unable to read a sequence file using the normal File|Open command you will be able to import the sequence using the File|Import ASCII|From file command. The only limitations are:
1. The sequence has to be in 1-letter amino acid or nucleotide code.
2. The file has to be an ASCII (text) file.
3. The file is smaller than 30000 bytes (characters).
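The two file-level limits (items 2 and 3) are easy to check before attempting an import. The following is a minimal sketch, not GPMAW's own code; the function name `check_import` is hypothetical, and it assumes the file contents have already been read as raw bytes. Item 1 (1-letter code) is not checked here.

```python
MAX_IMPORT_BYTES = 30000  # size limit quoted in the list above


def check_import(data):
    """Return a list of problems found in the raw file contents (bytes).

    An empty list means the file passes the ASCII and size limits.
    """
    problems = []
    if len(data) >= MAX_IMPORT_BYTES:
        problems.append("file is not smaller than 30000 bytes")
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        problems.append("file is not plain ASCII text")
    return problems
```

To check a file on disk, read it in binary mode (for example with `open(path, "rb").read()`) and pass the result to `check_import`.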
The 'Open text file' dialog box is identical to the 'Open sequence file' for opening standard sequences except that files with the extension '.TXT' are displayed by default. In the drop-down list 'List files of type' you can select 'All files' or you can type any name into the 'File name' box. If the file selected is not an ASCII file, an error message will appear and you will be unable to proceed with the import. If the requested file is a text file, the following dialog box will open showing the file contents:
Note: The ‘Import ASCII’ dialog recognizes the FastA, the Swiss-Prot and the GenPept (Entrez) formats. This means that the name, sequence and accession numbers are pasted directly into the respective fields of the dialog. However, remember to include only the complete record and not any extra lines of text or graphics (e.g. when you copy from a web page).
To import a sequence you have to carry out the following steps:
1. Highlight the name and press the button. This will copy the highlighted text into the name field at the bottom of the dialog box (only the text that fits into the dialog box will be displayed).
2. Highlight the accession number and press the button.
3. Scroll to the sequence.
4. Highlight the sequence.
5. If the sequence is written in lower case press the button to transform lower case letters to upper case. The sequence will stay highlighted.

6. If you are importing a nucleotide sequence press the button (see import nucleotide sequence below) otherwise press the button. Do not worry about numbers, space characters or line breaks - these will not be imported.
7. The first part of the sequence will be displayed in the 'Sequence' line below. The length of the sequence read will be shown in the 'Length' line.
8. Press to open a sequence window containing the imported sequence. If necessary, you can edit the sequence later by selecting Edit|Edit sequence option (Chapter 4.1).
9. If you check the ‘Save all as annotation’ checkbox, the whole content of the import box will be saved in the annotation page of the sequence. For more information on the annotation see ‘Annotation’ (Chapter 3.9).
Hint: The text field in the top part of the dialog box is an edit control. This means that you can edit the name and sequence before importing into GPMAW. You can also use ‘cut and paste’ from the pop-up menu or you can use the standard keyboard shortcuts (Ctrl-X, Ctrl-C, Ctrl-V).
Note: Only amino acid residues that are defined in the currently selected mass file will be imported as part of the sequence. If you need to import unusual 1-letter codes please make certain that you have the appropriate mass table loaded before import.
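The filtering described in steps 5 and 6 and in the note above (numbers, spaces and line breaks are skipped, and only residues known to the mass table are kept) can be illustrated with a short sketch. This is an assumption-laden illustration rather than GPMAW's implementation: `STANDARD_RESIDUES` stands in for the contents of the loaded mass file, and `clean_sequence` is a hypothetical name.

```python
# Assumption: the 20 standard residues stand in for the currently loaded mass table.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw, allowed=STANDARD_RESIDUES, to_upper=True):
    """Return only the characters of `raw` that are defined residues.

    Digits, spaces and line breaks are never in `allowed`, so they drop out.
    With to_upper=True, lower-case letters are converted first (step 5).
    """
    if to_upper:
        raw = raw.upper()
    return "".join(ch for ch in raw if ch in allowed)
```

For example, a block copied from a database page with position numbers and spacing, such as `"  1 mkvl aagi\n 11 CDE"`, is reduced to the bare residue string.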
If, in step 6 above, you press the button you will be presented with the 'Convert DNA sequence' dialog box:
The six green lines along the top of the dialog box represent the translated nucleotide sequence (three forward reading frames and three backward) with the red dots representing stop codons. If a name has already been selected for the sequence it will be shown above the green lines.
The six buttons below control the display and selection of the reading frame. The protein sequence of the currently selected reading frame will be shown in the large list-box. Stop codons are shown as 'X' and will be imported into GPMAW as chain terminators (a maximum of six chains can be imported). To the right of each reading frame button the number of ORFs (open reading frames) is shown along with the size of the largest ORF in the current frame.
If you check the 'Longest ORF only' checkbox, only the longest ORF will be displayed in the list-box and translated into GPMAW.

Usually, the longest open reading frame (in this example frame 3) is the correct one. Checking the 'Longest ORF only' and pressing will return you to the 'Import ASCII' dialog box above with the translated protein sequence displayed in the sequence line (step 7).
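The six-frame translation shown in the dialog can be sketched as follows. This is an illustrative sketch only: it assumes the standard genetic code, treats an ORF simply as a stretch between stop codons (GPMAW's exact ORF definition is not stated here), and all function names are hypothetical. Stop codons are written as 'X' to match the dialog.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; stop codons become 'X', unknown codons '?'."""
    out = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3], "?")
        out.append("X" if aa == "*" else aa)
    return "".join(out)


def six_frames(dna):
    """Return the three forward frames followed by the three reverse frames."""
    dna = "".join(ch for ch in dna.upper() if ch in "ACGTU").replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


def orf_summary(protein):
    """Return (number of stop-free stretches, longest stretch) for one frame."""
    segments = [s for s in protein.split("X") if s]
    return len(segments), max(segments, key=len, default="")
```

Calling `orf_summary` on each of the six strings from `six_frames` gives the ORF count and longest ORF per frame, which corresponds to the figures shown next to the reading frame buttons.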
The ‘Import from clipboard’ dialog box is identical to the ‘Import from file’ except that the content of the clipboard is pasted directly in the text box (top part of the dialog box).
If the sequence on the clipboard is in FastA format, it will be parsed immediately and the name and the sequence will be entered directly into their respective lines (i.e. you only need to press the button).
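As an illustration of what parsing a FastA record involves, the sketch below splits a single record into name and sequence. It is not GPMAW's parser; the function name is hypothetical and the record is assumed to contain exactly one sequence.

```python
def parse_fasta_record(text):
    """Split a single FastA record into (name, sequence).

    The first non-empty line must start with '>' and holds the name;
    all following lines are joined into one upper-case sequence string.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FastA record: first line must start with '>'")
    name = lines[0][1:].strip()
    sequence = "".join("".join(lines[1:]).split()).upper()
    return name, sequence
```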
Note: The ‘Import ASCII’ dialog box is also used as a transit station when reading protein sequences from a number of other sources such as BLAST search (Chapter 7.2), Internet accession number retrieval (Chapter 2.7) etc.
You can read a sequence from a local database (this section) or directly by accessing certain databases on the Internet (next section, 2.7).
GPMAW can access local databases in two formats:
· General FastA format indexed with the database-indexing tool (DBIndex – Chapter 12.3) available from Lighthouse data. If the DBIndex program was not part of your installation, you can download it free of charge (see Chapter 1.9).
· SWISS-PROT as available on CD-ROM
· PIR as available on the ATLAS CD-ROM
See Appendix B for how to acquire the databases and Chapter 5.4, ‘System directories’, on how to set up the CD-ROMs.
When you select either File|Open database|SWISS-PROT or File|Open database|PIR (ATLAS) GPMAW will search for the index files in the directory specified in Setup. If the index files are not found you will be informed by a dialog box and the action will terminate. Otherwise a 'Search for' dialog box will open.
Note: The Swiss-Prot database is no longer in the public domain (freeware). If you are part of a commercial company or institution you need a license agreement. In this case, please contact Swiss-Prot (www.ebi.ac.uk).
Protein databases in FastA format are available from a number of sources, particularly on the Internet. Some of the available databases that have been tested with GPMAW:
· PIR (Protein Identification Resource),
· Swiss-Prot,
· EMBL non-redundant,
· TREMBL (translated EMBL),
· GenPept (translated GenBank),
· OWL (non-redundant combined database),
· NCBI nr (non-redundant).
Appendix B contains more information on the databases and how to obtain them. Before you can access any of these databases from GPMAW you have to create index files. For this purpose you can download a utility ‘DBIndex’ from Lighthouse data. In addition to indexing straight FastA databases, ‘DBIndex’ can perform the following functions (see Chapter 12.3):
· Convert the Swiss-Prot database into FastA format before use.
· Rewrite the NCBI nr and EMBL nr databases to a simpler FastA format before indexing.
· Extract sequences with a certain composition to a new database.
A CD-ROM containing several of the above-mentioned databases including indices is available from Lighthouse data.

Selecting File|Open database|FastA opens the ‘Search FastA database’ dialog box.
Note: As soon as you have accessed a FastA database once, a new menu item will be added to the File menu (File|Open FastA database) allowing a more direct database access.
To search an indexed FastA database:
1. Select File|Open database or press the button. In the ‘Open file’ dialog box select an index file with the .trg extension. The dialog ‘remembers’ the five most recently opened databases, which are shown at the bottom of the File menu. Three buttons at the bottom of the dialog show the initial letter of the three most recently opened databases. If you let the mouse cursor rest on a button for a few seconds, the fly-by help shows the full name of the database connected to the button. The full path and name of the database opened will be shown in the bottom status line.
2. You can enter up to three search words and combine them with ‘and’/‘or’ as appropriate. Alternatively you can enter an accession number. If you enter an accession number, it takes precedence over search words.
3. Press the button and any matches to the search criteria will be displayed in the bottom yellow result box. When you enter search words, you should enter the least common word first in order to speed the search and not overload the search engine (e.g. searching for ‘human prothrombin’ you should enter ‘prothrombin’ as the first search word and ‘human’ as the second).
Note: The words ‘protein’, ‘sequence’, ‘temporary’, and ‘tentative’ have been removed from the search list as being too common and not providing enough information. Furthermore, the following symbols have been removed and replaced by word delimiters: ‘[]",.;()/’
4. Highlight any sequence you want to retrieve from the database and press the button to load the sequence into GPMAW. Alternatively you can load the sequence by double-clicking on the name.
As FastA formatted databases contain no annotation, only the name, the accession number, and the sequence will be loaded. You can highlight several sequences while holding down the <ctrl> key. Pressing the button will load each highlighted sequence into its own sequence window.
Note: If you have several databases installed on your system you do not have to reenter search words when switching between the databases.
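The word filtering described in the note to step 3 (common words dropped, listed symbols treated as word delimiters) can be sketched as below. This is only an illustration of the rule, not the DBIndex implementation; the function name is hypothetical and the lower-casing of words is an assumption.

```python
STOP_WORDS = {"protein", "sequence", "temporary", "tentative"}
DELIMITERS = '[]",.;()/'
# Map every listed symbol to a space so that it acts as a word delimiter.
_TO_SPACE = str.maketrans(DELIMITERS, " " * len(DELIMITERS))


def index_words(description):
    """Return the searchable words of a FastA description line.

    Assumption: matching is case-insensitive, so words are lower-cased.
    """
    lowered = description.lower().translate(_TO_SPACE)
    return [word for word in lowered.split() if word not in STOP_WORDS]
```

A search word such as ‘protein’ would therefore never match anything in this sketch, which is consistent with the advice to start with the least common word.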
The Swiss-Prot and the GenPept databases are published as infrequently updated primary releases and frequently released update databases (SwissNew and GpNew). When both the main and the updates are installed in the same directory, GPMAW will recognize the updates and the button will be enabled when loading the main database. You can now search the main database by pressing the button and the database update by pressing the button.
Swiss-Prot: If you are searching the FastA version of the Swiss-Prot database (‘SPROT37.SEQ’) and you have all the accessory files in the same directory, you can read the full database record into the annotation page of the sequence window if the read annotation box is checked (see Chapter 3.9 for more information on the annotation page).
In order for the annotation retrieval to work you need the following: the full Swiss-Prot database release 37 (the file has to be called ‘SPROT37.SEQ’); the FastA version of Swiss-Prot (‘SWISS.SEQ’) generated by the ‘DBIndex’ utility (version 1.02 or later) and the corresponding index file ‘SWISS.IDX’; and the FastA index files ‘SWISS.TRG’, ‘SWISS.NDX’, ‘SWISS.ACC’, and ‘SWISS.FAC’, also generated with the ‘DBIndex’ utility (see Chapter 12.3). The ‘DBIndex’ program can be freely downloaded from Lighthouse data.

In the 'Search for' dialog box you can enter:
Free text: Any text - usually the name of the protein, but the search result will include all entries where the search text occurs anywhere in the database entry.
Species: Source organism of the entry.
Accession #: Each entry has a specific accession number (e.g. P01966).
Entry name: Each entry has a specific entry name (e.g. HBA_BOVIN).
In the case of accession number and entry name only a single entry will be shown (if it exists), while using text and/or species searches usually reveals several entries. The search engine can find a maximum of 500 entries.
Highlighting and pressing ‘Open’ (the dialog box will remain open) or ‘OK’ (the dialog box will close) will open the selected entries in GPMAW. You can only use multiple instances of when the index files have been copied to a hard disk (or network drive, see Chapter 5, Setup directories).
The resulting sequence window will display the first 40 characters of the name field (DE) and the first part of the species field (OS). The sequence will be read from the sequence field (SQ).
The complete entry will be read into the Annotation page (see below).
The 'Search for PIR' dialog box is identical to the 'Search for SWISS-PROT' dialog box described above, except that the field 'Entry name' does not exist in the PIR database.
The ATLAS CD-ROM is a single CD-ROM and contains both the indices and the databases on the same disk.
As for SWISS-PROT entries, the annotation is read into the Annotation page (Chapter 3). However, the sequence is not part of the annotation.
If you know the accession number of your protein of interest you can enter it into the web access field in the main toolbar.
When you press the button GPMAW will either access the Swiss-Prot database on the Expasy server (if the accession number starts with O, P or Q) or the Entrez server hosted by NCBI. If the first query is unsuccessful, the other server will be queried. The result will open in the ‘Import ASCII’ dialog box (section 2.5). Both the Swiss-Prot and the Entrez (GenPept) format are recognized by GPMAW so you can import the sequence just by pressing the ‘OK’ button. The complete database entry will be saved in the ‘annotation’ page (section 3.9).
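The server selection and fallback rule can be expressed as a small sketch. It makes no network calls: `fetchers` is a hypothetical mapping from server name to a function that returns the record text or `None`, and both function names are invented for illustration. Treating the first letter case-insensitively is an assumption.

```python
def server_order(accession):
    """Return the servers in the order they would be queried for this accession number."""
    first_letter = accession.strip()[:1].upper()
    if first_letter in ("O", "P", "Q"):
        return ["Expasy", "Entrez"]
    return ["Entrez", "Expasy"]


def retrieve(accession, fetchers):
    """Query each server in turn and return (server, record) for the first hit.

    `fetchers` maps a server name to a callable taking the accession number
    and returning the record text, or None when nothing was found.
    """
    for server in server_order(accession):
        record = fetchers[server](accession)
        if record:
            return server, record
    return None, None
```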
Through the File|Web Entrez search you can search for protein sequences on the web using the Entrez search engine. The search in GPMAW implements only part of the Entrez search engine and is not meant as a replacement for searching the WWW, but rather as a quick way to retrieve protein sequences into GPMAW. This option is particularly useful if you do not have access to protein databases on CD-ROM or do not wish to download a complete database yourself.

The Internet address http://www.ncbi.nlm.nih.gov/ contains a full implementation of an Entrez search engine and information on the databases. You can also download a client/server version of the search engine with enhanced functionality and greatly increased speed. Appendices A and B contain further information on database formats and content.
Note: You need an Internet connection in order to use these functions! They may not work through a firewall. If you have problems, please consult your local network specialists before contacting Lighthouse data.
The Entrez web dialog allows you to enter two search terms, search in three different databases and display in four formats.
The results of the search are shown in the two large list boxes in the bottom half of the dialog box. The top shows the proteins found while the bottom one shows the sequence of the currently selected protein in the chosen display format.
Up to 100 sequences are listed in a single search. If the 100 sequence limit is exceeded, you will have to narrow your search.
Protein: This is the NCBI non-redundant protein database containing over 705.000 proteins (July 2001). The database is compiled by combining Swiss-Prot, PIR, GenPept and additional databases. The NCBI non-redundant database is updated almost daily.
Nucleotide: The NCBI non-redundant nucleotide database, based on GenBank supplied with other databases. This database is not normally used for retrieval of protein sequences but can be used for cross-references.
MedLine: Allows you to search the complete MedLine database. Like the nucleotide database it is not useful for retrieval of protein sequences, but can give a fast short cut to references.
The ‘Protein’ and ‘Nucleotide’ databases share the same display formats while the ‘MedLine’ database has ‘FastA’ and ‘GenPept’ replaced by ‘Abstract’ and ‘MEDLARS’.
FastA: A compact format mostly used for storage of sequences used in homology searches. The first line starts with ‘>’ and contains the accession number and name of the sequence. The following lines (usually formatted with 60 characters per line) contain the sequence in 1-letter code.
GenPept: Considerably more information is available in the GenPept format. In addition to name and species, information on literature references, post-translational modifications etc. is included.
Report: Format similar in details to ‘GenPept’. For MedLine it is similar to ‘Abstract’ but includes MeSH terms.
Entrez: Short listing of the headlines in all database formats. Useful for browsing when the search is expected to yield numerous hits.
Abstract: Reference and full abstract of the MedLine entry (some entries, usually old or very new, do not have abstracts).
MEDLARS: Standard literature database reference format.
Two search terms can be entered. The terms are always ‘AND’ed, meaning that both term 1 and term 2 have to be present in a database entry in order for Entrez to retrieve it. The ‘Field’ drop-down list boxes enable you to narrow the search to specific fields of the database entries.
Field: The default selection of the ‘Field’ box is ‘All fields’, where the complete database entry is searched. This is sufficiently specific in most cases, but as the database entries often contain cross-references to other database entries you will often retrieve a number of homologous proteins.
The protein and nucleotide databases allow you to narrow each term to the fields Protein name, Author name, Organism and Accession # (number). The MedLine database allows the search to be narrowed to Title word, Text word (in abstract), Author name and PubMed ID.
Note: Whenever you change database the ‘Field’ boxes are reset to ‘All fields’.
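The way two terms and their field restrictions combine can be illustrated with a small local sketch. It does not call Entrez and says nothing about how Entrez matches text internally; the function name, the dictionary layout of an entry and the use of simple case-insensitive substring matching are all assumptions made for illustration.

```python
def matches(entry, terms):
    """Return True if every non-empty term is found in the entry (terms are ANDed).

    entry: dict mapping a field name to its text.
    terms: list of (word, field) pairs; field None means 'All fields'.
    """
    for word, field in terms:
        if not word:
            continue  # an empty term box places no restriction on the search
        if field is None:
            haystack = " ".join(entry.values())
        else:
            haystack = entry.get(field, "")
        if word.lower() not in haystack.lower():
            return False
    return True
```

Restricting a term to a field can only reduce the number of hits, which is why ‘All fields’ is the safer choice when you want to be certain of finding every relevant entry.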
Selecting a sequence name in the top list box and pressing the button retrieves a sequence. Alternatively you can double-click on a sequence name.
If the display format is ‘FastA’, the sequence is directly copied into GPMAW as a sequence window. If another format is chosen (e.g. ‘GenPept’), the complete sequence record is copied into the ‘Import ASCII’ dialog box (see 2.5) from where you can select whether you want just the sequence read into GPMAW or you want to save the annotation as well.
Pressing the button will print the complete search results unless part of the results is highlighted. In this case the program will ask whether to print only the selected part. If you answer , the complete result list will be printed.
The printout will also list the search terms, fields and database.
The dialog supports two local menus:
Right-clicking in the top part of the dialog displays a menu with the options: Clear Terms, Retrieve sequence, Copy to clipboard and Print. ‘Clear terms’ clears the two term input boxes, ‘Retrieve sequence’ and ‘Print’ duplicate the corresponding buttons and ‘Copy to clipboard’ copies the entire memo box content at the bottom to the clipboard.
The bottom edit box supports a local pop-up menu that enables you to copy and paste part or all of the contents.
Search for ‘Surfactant protein A’:
Select Database: Protein; Display: Entrez; Term 1: Surfactant protein A; Field 1: All fields (make certain to get all hits);
The result is 20 hits, four of which have no relation to SP-A. Find human SP-A (id: PSPA_HUMAN), highlight the Swiss-Prot accession number (P07714 – in the bottom edit box), copy it to the clipboard (Ctrl-C), move to Term 1 and paste it in (Ctrl-V). Change Field 1 to ‘Accession number’ and press ‘Search’.
The result is a single hit, human surfactant protein A.
Change ‘Display’ to ‘FastA’ and search again to get the FastA formatted sequence.
Highlight the sequence and press ‘Retrieve’ to open a new sequence window in GPMAW with human SP-A. To get more information, change ‘Display’ to ‘GenPept’ and search again. This gives you the database entry in GenPept format, which you can either print or copy to the clipboard for later incorporation into the ‘Annotation’ page. The reason for not importing the GenPept format in the first place is that importing the FastA format is much faster (semi-automatic).
In the GenPept formatted entry you notice an interesting reference by Floros et al. with the MedLine reference 86250832. Copy this number to ‘Term 1’, switch the database to ‘MedLine’, Field 1 to ‘PubMed ID’ and ‘Display’ to ‘Abstract’. Pressing ‘Search’ gives you the reference and abstract of the requested article (ready for printing and ordering from the library).
The Export sequence command yields five options. The first two options save the sequence to a file on disk and the last three copy the sequence(s) to the clipboard:
You can save your sequence to a disk file in the basic GPMAW format, that is name and sequence only, without information about cross-links, modified residues, annotation etc. See Appendix A for details.
The sequence is saved to disk in FastA format containing name and sequence only (see Appendices A and B for details). This format is very useful for interchange with other programs, transfer to the Internet etc. For transfer to input boxes in other programs (e.g. on the Internet) the ‘to clipboard’ function is usually more convenient, see below.
A report often requires a sequence to be formatted in a particular way, and for this purpose you can choose File|Export sequence|to clipboard.
The ‘Export sequence to clipboard’ dialog box displays the sequence name in the top part (for verification) and presents a number of options below:
Residues per line: In 1-letter code 60 is the default; in 3-letter code 20 is the default. Range is 10-100.
Residue type: 1- or 3-letter code. Default is like the current sequence window.
Numbering:
On - each line ends with the number of the last residue.
Haptoglobin-2 precursor - Human (406 res.)
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIA 40
HGYVEHSVRYQCKNYYKLRTEGDGVYTLNDKKQWINKAVG 80
Off - no numbering.
Haptoglobin-2 precursor - Human (406 res.)
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIA
HGYVEHSVRYQCKNYYKLRTEGDGVYTLNDKKQWINKAVG
Detailed - sequence residue numbers appear beneath every 10th residue of the sequence (remember to display the sequence in a monospaced font, like Courier, for this function to be effective).
Haptoglobin-2 precursor - Human (406 res.)
        10        20        30        40
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIA
        50        60        70        80
HGYVEHSVRYQCKNYYKLRTEGDGVYTLNDKKQWINKAVG
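The three numbering options can be illustrated with a short sketch. This is not GPMAW's own code: the function name `format_sequence` and its parameters are hypothetical, the header is simply the name followed by the residue count, and in ‘detailed’ mode the number line is assumed to precede each sequence line, as in the sample output above.

```python
def format_sequence(name, seq, per_line=60, numbering="on"):
    """Lay out a 1-letter sequence; numbering is 'on', 'off' or 'detailed'."""
    if not 10 <= per_line <= 100:
        raise ValueError("residues per line must be between 10 and 100")
    lines = [f"{name} ({len(seq)} res.)"]
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        if numbering == "on":
            # End the line with the number of its last residue.
            lines.append(f"{chunk} {start + len(chunk)}")
        elif numbering == "detailed":
            # One right-aligned number per 10 residues, so each number
            # ends in the same column as the residue it counts.
            marks = range(10, len(chunk) + 1, 10)
            lines.append("".join(f"{start + m:>10}" for m in marks))
            lines.append(chunk)
        else:
            lines.append(chunk)
    return "\n".join(lines)


if __name__ == "__main__":
    fragment = ("MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIA"
                "HGYVEHSVRYQCKNYYKLRTEGDGVYTLNDKKQWINKAVG")
    for mode in ("on", "off", "detailed"):
        print(format_sequence("Haptoglobin-2 fragment", fragment, 40, mode))
        print()
```

Because the number line relies on column positions, the output only lines up when shown in a monospaced font.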

The button will put a ‘>’ in front of the name, switch residues per line to 50 and turn numbering ‘off’ in order to place a FastA formatted sequence on the clipboard. This is a common way of presenting data in input boxes on the Internet, and most other sequence analysis programs will recognize this format.
i Note: When you transfer a sequence to a report, remember to print it in a monospaced font (e.g. Courier New) in order for the numbering and amino acid residues to line up correctly.
You can also copy your sequence to the clipboard (copy and paste) for transfer to other programs by selecting Edit|Copy (or pressing Ctrl + C), which places a copy of the sequence in the currently selected format (1- or 3-letter code) on the clipboard.
i Note: When copying this way, only the sequence, not the name, is copied to the clipboard. If you need both name and sequence, use the Export option.
The currently selected protein will be copied to the clipboard in standard FastA format (60 residues/line). By using the shortcut <Ctrl-F> you can quickly export sequences when you need them for transfer to other programs or to the web.
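As an illustration of what ends up on the clipboard, a FastA record of this kind can be sketched in a few lines. The function name `to_fasta` is hypothetical and the sketch covers only the layout described here (a ‘>’ followed by the name, then the sequence at 60 residues per line); it makes no claim about how GPMAW itself produces the text.

```python
def to_fasta(name, seq, per_line=60):
    """Return a FastA record: '>' plus the name, then fixed-width sequence lines."""
    body = [seq[i:i + per_line] for i in range(0, len(seq), per_line)]
    return "\n".join([f">{name}"] + body)


if __name__ == "__main__":
    fragment = ("MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIA"
                "HGYVEHSVRYQCKNYYKLRTEGDGVYTLNDKKQWINKAVG")
    print(to_fasta("Haptoglobin-2 fragment", fragment))
```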
All protein sequences open on the desktop (i.e. content of all currently opened sequence windows) will be copied to the clipboard. The order of the sequences will be in the Z-order of the respective sequence windows (i.e. the topmost sequence window will be first).
This option is handy, for example, for copying all sequences to a multiple alignment input box on the Internet.