Chapter 12 Utilities
This section contains a few utilities that do not fit into other categories.
Several of the functions presented here (e.g. mass comparison and composition calculator) are usually called as auxiliary functions from other windows, but as each function can sometimes be useful on its own, they are presented here as individual windows.
The MS peak analysis displays mass peak difference data in an x/y table, much like a spreadsheet.
You start by entering the peak data in a table (see right). The table can be edited manually, but you can also load tables in PerSeptive, Bruker and HP MALDI MS format (peak file) or paste from the clipboard. GPMAW’s own PEP format is of course also supported.
Once the data have been entered you can edit them (see Chapter 4.1 for an explanation of the button), copy them to the clipboard or save them to a disk file in PEP format, just like data entry in the ‘Mass search’ (Chapter 6.1) and ‘Digest database search’ (Chapter 8).
The table information is automatically set to the file name of the most recent file read.
The local pop-up menu supports the same functions as the right-hand buttons.
The table accepts a maximum of 150 entries.
Once the data has been accepted by pressing the button, a new dialog opens that shows the mass difference data.

The table shows the left hand column minus the top row.
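The layout of the difference table can be sketched in a few lines of Python. This is only an illustration and not GPMAW's own code; the function name `difference_table` and the sample peak masses are made up for the example.

```python
def difference_table(masses):
    """Return a square table of mass differences.

    Entry [i][j] is the mass in the left-hand column (row i) minus the
    mass in the top row (column j), mirroring the dialog layout.
    """
    return [[round(row - col, 4) for col in masses] for row in masses]


# Hypothetical peak list in Da; not taken from a real spectrum.
peaks = [1000.50, 1071.54, 1184.62]
table = difference_table(peaks)

# Print the table with the peak masses as the left-hand column.
for mass, row in zip(peaks, table):
    print(f"{mass:10.2f}", " ".join(f"{delta:9.2f}" for delta in row))
```

The diagonal is always zero, and the values below the diagonal are the positive differences that are usually of interest.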
The table by itself can be interesting, as it is relatively easy to spot repetitive occurrences of the same or similar mass. The real advantage comes when comparing to predefined mass tables as defined in the bottom status line.
Activating the button highlights all mass differences that correspond to amino acid residue differences. The amino acid masses are taken from the currently loaded mass file. When you move the mouse cursor onto a highlighted field, a fly-by hint will open showing the name of the amino acid residue. The local pop-up menu allows you to switch the mass value of the highlighted fields to three-letter residue names. In cases where you have isobaric residues (Leu/Ile, Lys/Gln) both names will be shown.
The ‘Precision’ box defines the precision by which the highlights are found. The box is in Dalton and can be edited directly or you may use the mouse to activate the up/down arrows.
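How the precision setting decides which fields are highlighted can be sketched as follows. The residue masses are standard monoisotopic values for a handful of amino acids (GPMAW takes its values from the loaded mass file), and both the function name `matching_residues` and the use of the absolute value of the difference are assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for a few amino acids only.
# GPMAW reads its values from the currently loaded mass file instead.
RESIDUES = {
    "Gly": 57.02146, "Ala": 71.03711, "Ser": 87.03203,
    "Leu": 113.08406, "Ile": 113.08406,
    "Gln": 128.05858, "Lys": 128.09496,
}


def matching_residues(delta, precision=0.05):
    """Return the names of all residues within `precision` Da of |delta|.

    Using the absolute value (so that negative table entries also match)
    is an assumption of this sketch.
    """
    return sorted(name for name, mass in RESIDUES.items()
                  if abs(abs(delta) - mass) <= precision)


print(matching_residues(113.08))        # Leu and Ile are isobaric
print(matching_residues(128.08, 0.05))  # Gln and Lys both fall in the window
print(matching_residues(128.08, 0.01))  # neither matches at 0.01 Da
```

The last two calls show why the precision matters: with a wide window Lys and Gln cannot be told apart, while a window that is too narrow for the accuracy of the data highlights nothing.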
The sequence tag button will draw lines between successive amino acid residues, i.e. a sequence tag:

The button highlights mass differences corresponding to sugar (carbohydrate) residue differences. The mass values are taken from a modification file (in the ‘System’ directory) called ‘SUGARS.MOD’. This file can be modified like all other modification files (see Chapter 4.3) and can in principle contain any data. To stay consistent with the label of the button, however, you should keep it restricted to sugar residues.
The button uses the currently loaded modification file (Chapter 4.3) to search for mass differences.
The three mass difference types use the same precision but different background colors for highlighting. Only the amino acid differences can be switched to show names in the table; the other two show names only in the fly-by hint.
The bottom left button prints the table (across several pages if needed) and the button to the right of this toggles between showing average and monoisotopic mass values.
The local pop-up menu has four entries:
Draw aa links: Same as the ‘Sequence tag’ button.
AA as text: Toggles between showing highlighted amino acid masses as text and as masses.
Compressed table: Toggles the table display to a compressed format in order to show more cells.
Print: Same as the ‘Print’ button.
The composition calculator is usually called from dialog boxes that demand input of compositions (e.g. ‘Edit mass file’, ‘Edit modification file’ etc.). Through the ‘Utilities’ section you have direct access to the dialog for calculating the mass of a given composition or to get the composition string in GPMAW format.
For more information on the GPMAW composition formula strings, see Chapter 4.4.
The Elemental composition dialog contains an edit box with a corresponding up/down spin control for each atom defined in the ‘Atomic masses’ section of the ‘Edit mass’ dialog (see Chapter 4.2).

The number of each atom is controlled by entering a number in the edit boxes or by using the spin control with the mouse. Only integer values can be entered. The composition box is updated for every change made in the composition. The composition line can be copied (highlight and press <Ctrl+C> or use the pop-up menu) but cannot be edited directly.
Negative compositions or compositions that are part negative are accepted.
The button zeroes both the composition and the edit boxes.
Both the average and the monoisotopic mass are reported.
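The calculation performed by the dialog can be sketched as a sum over atom counts. The atomic masses below are standard values for a few elements, whereas GPMAW uses the masses defined in its own mass file; the function name `composition_mass` is hypothetical, and the composition is given as a plain dictionary rather than as a GPMAW composition string (see Chapter 4.4 for that format).

```python
# (average, monoisotopic) atomic masses in Da for a few common elements.
ATOMS = {
    "C": (12.0107, 12.0),
    "H": (1.00794, 1.0078250),
    "N": (14.0067, 14.0030740),
    "O": (15.9994, 15.9949146),
    "S": (32.065, 31.9720707),
    "P": (30.973762, 30.973762),
}


def composition_mass(counts):
    """Return (average, monoisotopic) mass for a dict of integer atom counts.

    Negative counts are allowed, which makes it possible to describe a
    loss such as the removal of water.
    """
    average = sum(n * ATOMS[element][0] for element, n in counts.items())
    mono = sum(n * ATOMS[element][1] for element, n in counts.items())
    return average, mono


glycine = {"C": 2, "H": 5, "N": 1, "O": 2}   # free amino acid, C2H5NO2
water_loss = {"H": -2, "O": -1}              # a purely negative composition

print("Glycine:    %.4f (average)  %.4f (monoisotopic)" % composition_mass(glycine))
print("Water loss: %.4f (average)  %.4f (monoisotopic)" % composition_mass(water_loss))
```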
The database indexer, called DBindex, is a separate program that is either bundled with GPMAW or can be freely downloaded from the Lighthouse data website (see Chapter 1.9). Because the program is separate from GPMAW, once it has been called you can run it independently (i.e. switch back and forth without one program interfering with the other). Its version numbers are independent of GPMAW’s, and you should consult the web site or Lighthouse data for the latest version. The current version (August 2000) is 1.11.
The program (DBINDEX.EXE) has to be present in the same directory as the GPMAW executable file (GPMAW3.EXE) in order to be called from the menu.
i Note: If the menu entry Utilities|Call DBindex is disabled (grayed) it is because GPMAW could not find the database indexing utility. You should then copy the program and the help file (DBINDEX.HLP) to the GPMAW\BIN directory.
When the program opens, you are greeted with a dialog stating copyright and version number.

From the copyright dialog you can select which section of DBindex to enter. Independent of the section you choose here, you can switch between sections on different tabs in the running program.
Index: Create indices from a FastA formatted database that enable GPMAW to search the database on the basis of protein name or accession number. Combine databases (i.e. add one database to the end of another).
Convert: Convert the Swiss-Prot database to FastA format while still enabling retrieval of the full Swiss-Prot entry through GPMAW. Reduce the complexity of some non-redundant databases that have very long (>255 character) name lines (NCBI-nr and EMBL-nr). Convert files in VMS format to DOS (Windows) format. The VMS format is used on most UNIX systems and needs to be converted for GPMAW to access the databases.
Filter: Filter a FastA formatted database with regard to amino acid composition and molecular size.
Download: Download databases from the Internet using the FTP protocol. This section is still at the experimental stage, and if you experience any problems, you should use a dedicated FTP download program.
Other: The program opens on the first page of the window without asking for a database.
The ‘Index’, ‘Convert’ and ‘Filter’ commands start by asking for a database file to work on. If you later need to open a new database, just click on the ‘Load database’ button in the bottom command line.
When you have opened a database, the next button in the command line, ‘Db Peek’, becomes active. Pressing this button opens a text dialog with the first couple of thousand characters of the database. This enables you to check the format and content of the database.

In the database file viewer (above) you can only view the content of the file; you cannot edit it.
The view starts with the full file name, followed by whether the file conforms to the GPMAW FastA format, and finally the type of file present (ASCII vs. binary format, DOS vs. VMS format).
After the dotted line come the first records of the database.
Click on the button to return to DBindex.
Start by loading a FastA formatted database. If you have not selected one when opening the program, or if you want to select a different database, press the button and select a new database. The database loaded will be shown in the yellow list box in the right hand part of the dialog and in the status bar above the progress bar at the bottom of the display.
In the drop-down list ‘Format of FastA database’ you select the kind of database loaded.

Press the button and database indices will be generated. The progress of the index creation can be followed in the progress bar, the ‘Record number’ and the ‘Record position’. The yellow list box will show statistics and files generated.
The index files generated will be placed in the same directory as the database.
i Note: Make sure that you have sufficient space on your hard disk, as the index files are quite voluminous. For example, the current version of EMBL-nr takes up 280 MB of space (after conversion from VMS to DOS and reduction of name lines), while the indices take up almost 30 MB in addition.
Four files are generated, each characterized by its extension:
.ACC Accession number
.FAC Facts file. This is a text file that contains general information on the database and the indices.
.NDX Index into the main database.
.TRG Target database. Contains the search words.
The user should not modify any of the files generated. Only the FAC file can be of any use and can be read into any text editor.
When the program finishes indexing (a rather lengthy procedure, about 20 min), you can copy all databases to a different drive or medium (e.g. CD-ROM) for easier access. You may choose to copy only the indices and leave the database itself on a networked or slow drive. In this case you should modify the reference to the original database in the FAC file using Notepad.
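The general idea behind an index into a flat-file database can be sketched as a table of byte offsets, one per name line. This is not the format of the .ACC, .NDX or .TRG files, which is not described here; the function names and the choice of the first word of the name line as the key are assumptions made for the illustration.

```python
import os
import tempfile


def build_offset_index(path):
    """Map the first word of each FastA name line to its byte offset."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii", "replace")] = offset
            offset += len(line)
    return index


def fetch_record(path, offset):
    """Read one record (name line plus sequence lines) starting at `offset`."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii", "replace")


# Demo on a tiny, made-up database in DOS format (CR+LF line ends).
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.fasta")
    with open(demo_path, "wb") as handle:
        handle.write(b">P1 first\r\nACDE\r\n>P2 second\r\nFGHI\r\nKLMN\r\n")
    index = build_offset_index(demo_path)
    record = fetch_record(demo_path, index["P2"])

print(index)
print(record)
```

Because the index stores positions in the database file, it becomes invalid as soon as the database itself is changed, which is one reason why a converted or combined database has to be indexed again.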
The command ‘Add database’ enables you to combine two databases (or any files) into a single database. This is typically useful with the PIR database (which comes as 4-5 separate files), genome databases (which are often distributed as one database per chromosome), the TREMBL database (one database per species), or Swiss-Prot and GenPept, which are published as a main database with an update.
Start by loading a database. Press the ‘Add database’ button and you will be asked for a database to add. When you select a file, it will be added to the original database (file).
i Note: The combined database will have the name of the database first loaded. If you want to preserve this file you may make a copy of the first database and rename it to reflect the nature of the final database.
The files added to the first file are not changed in any way.
When all the files have been added, you can proceed with indexing the database.
As the addition of files is a straight binary combination of files, it does not matter whether you convert from VMS to DOS before or after combining the files.
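A straight binary combination of two files, as described above, can be sketched like this. The function name `append_database` and the file names in the demo are made up for the example.

```python
import os
import shutil
import tempfile


def append_database(target_path, extra_path):
    """Append the raw bytes of `extra_path` to the end of `target_path`.

    As in the text, the combined database keeps the name of the first
    file, and the added file is left unchanged.
    """
    with open(target_path, "ab") as target, open(extra_path, "rb") as extra:
        shutil.copyfileobj(extra, target)


# Demo with two tiny, made-up FastA files in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "part1.seq")
    second = os.path.join(folder, "part2.seq")
    with open(first, "wb") as handle:
        handle.write(b">A1 demo\nACDE\n")
    with open(second, "wb") as handle:
        handle.write(b">B2 demo\nFGHI\n")

    append_database(first, second)

    with open(first, "rb") as handle:
        combined = handle.read()
    with open(second, "rb") as handle:
        untouched = handle.read()

print(combined.decode("ascii"))
```

Since no bytes are interpreted or altered, line-end conversion gives the same result whether it is done before or after the files are combined.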
Several databases on the Internet are in a format that is not directly accessible by GPMAW. In most cases the transformation is rather trivial (e.g. converting from VMS to DOS) and serves mainly to speed up access and simplify coding. In the case of the Swiss-Prot database, however, the information is so valuable that it is well worth being able to access the full entries directly from GPMAW.
Use the ‘Convert file’ tab of the program to access these conversion routines.

A few of the non-redundant databases (NCBI-nr and EMBL-nr) are created with name lines that exceed 255 characters, which is the limit for accepted names in GPMAW.
You have to reduce these databases by selecting the relevant radio button in the ‘Non-standard to standard FastA’ panel and then pressing the button.
You are then asked to open a database, and then to give a name to the new database. GPMAW suggests a name depending on the selection of the radio-buttons above.
The new database will be placed in the same directory as the original database.
i Note: Make sure you have enough disk space for the operation, as the new database will only be about 10% smaller than the original.
The ‘Convert’ command also takes care of converting from VMS to DOS format if necessary.
When you are through converting the database, you can delete the old (original) database.
If you want to create indices from the newly created database you have to re-load it through the button.
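How DBindex shortens the name lines is not described here. As a rough illustration of the 255-character limit only, the sketch below simply cuts over-long name lines; the function name `shorten_name_lines` and the plain truncation strategy are assumptions, and the real program may well select the retained text more carefully.

```python
MAX_NAME = 255  # name-line limit quoted in the text


def shorten_name_lines(lines, limit=MAX_NAME):
    """Yield FastA lines (without line ends), cutting long name lines.

    Only lines starting with '>' are shortened; sequence lines are
    passed through unchanged.
    """
    for line in lines:
        text = line.rstrip("\r\n")
        if text.startswith(">") and len(text) > limit:
            text = text[:limit]
        yield text


# Demo with a made-up 301-character name line.
long_name = ">" + "x" * 300
result = list(shorten_name_lines([long_name + "\n", "ACDE\n", ">short\n"]))
print([len(text) for text in result])
```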
Text files, and thus also flat file databases, are represented differently in UNIX and DOS (Windows). The DOS format ends each line with carriage return and line feed characters (#13#10), whereas the format referred to here as VMS ends each line with a line feed character only (#10). The VMS to DOS file conversion routine takes care of this.
When pressing the ‘VMS to DOS’ button you are asked for the file to convert, and when accepted, the file is converted.
The new file replaces the old one (unlike the ‘Convert’ command above).
You can follow the file conversion process in the ‘File position’ label and in the progress bar.
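The line-end conversion itself is simple and can be sketched as follows. The function name `to_dos_line_endings` is hypothetical, and the sketch works on a block of bytes held in memory, whereas a tool that handles databases of several hundred MB would normally process the file in pieces.

```python
import re


def to_dos_line_endings(data: bytes) -> bytes:
    """Convert bare LF (#10) line ends to CR+LF (#13#10).

    Line ends that already are CR+LF are left untouched, so running the
    conversion twice does no harm.
    """
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


# Made-up sample with mixed line ends.
sample = b">P1 demo\nACDE\r\nFGHI\n"
converted = to_dos_line_endings(sample)
print(converted)
```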
Press the ‘Swiss-Prot’ button and select the Swiss-Prot sequence database (typically named sprot37.seq for the main database or new_seq.seq for the update). You are then asked whether you are converting the main, the update or another database. The default names for the converted databases are SWISS.SEQ for the main database and SWISSNEW.SEQ for the update.
If you keep all the files in the same directory, GPMAW is able to search the FastA indexed database and retrieve the full entry in the original Swiss-Prot database. Please see ‘Reading CD-ROM based sequences - FastA’ in Chapter 2.6 for details.
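The principle of the conversion can be sketched from the line codes of the Swiss-Prot flat-file format (ID, AC, DE, SQ and the terminating //). The layout of the name line written by DBindex is not documented here, so the header built below (accession number, entry name, description) is an assumption, as are the function name and the made-up demo entry.

```python
def swissprot_to_fasta(lines):
    """Yield (header, sequence) pairs from Swiss-Prot flat-file lines."""
    entry_id, accession, description, sequence = "", "", [], []
    in_sequence = False
    for line in lines:
        code, rest = line[:2], line[5:].strip()
        if code == "ID":
            entry_id = rest.split()[0]
        elif code == "AC" and not accession:
            accession = rest.split(";")[0]      # primary accession number
        elif code == "DE":
            description.append(rest)
        elif code == "SQ":
            in_sequence = True                  # sequence lines follow
        elif code == "//":
            header = f">{accession} {entry_id} {' '.join(description)}"
            yield header, "".join(sequence)
            entry_id, accession, description, sequence = "", "", [], []
            in_sequence = False
        elif in_sequence:
            sequence.append(line.replace(" ", "").strip())


# Hypothetical entry; the accession number and protein are invented.
demo = """\
ID   TEST_HUMAN     STANDARD;      PRT;    15 AA.
AC   P00000;
DE   HYPOTHETICAL TEST PROTEIN.
SQ   SEQUENCE   15 AA;  1500 MW;  00000000 CRC32;
     MKTAYIAKQR QISFV
//
""".splitlines()

entries = list(swissprot_to_fasta(demo))
for header, seq in entries:
    print(header)
    print(seq)
```

All other line types (references, features, comments) are skipped, which is why the original Swiss-Prot file has to be kept if you want to retrieve the full entry later.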
i Note: The Swiss-Prot database is no longer free-ware. If you are a non-commercial organization you need a licensing agreement. Please see http://www.expasy.hcuge.ch/announce/ for further details.
The ‘Composition filter’ page of DBindex enables you to filter a FastA formatted database based on amino acid composition and/or molecular mass.
The input is a database that has to be in FastA format. The operation is based on a composition range (in %) for each amino acid residue and/or a mass range specified by a lower and an upper limit. The result is a new database where each entry conforms to the filter specification.

· Select a database if the proper database has not been selected already (the currently opened database is shown in the bottom status bar).
· Enter minimum and maximum % values for each residue in the top edit boxes. Select an amino acid residue in the left-hand list box, whereupon its low and high % can be changed. The edit box is updated whenever the focus moves away from an edit box. By default all residues are set to a composition between 0 and 20%.
· Enter low and high mass values (in kDa) for the proteins to be selected.
· You may check either ‘Ignore composition’ or ‘Ignore mass’ to generate a database that does not take composition or mass, respectively, into consideration when filtering.
· The name of the resulting database can be edited (by default DBSORT.DAT).
· Click on the button to start the conversion process.
The filtering process can be followed on the progress bar.
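The filter criteria described above can be sketched as a simple test applied to each sequence. The residue masses are standard average values, the function names are hypothetical, and passing `None` for a criterion plays the role of the ‘Ignore composition’ and ‘Ignore mass’ check boxes; how GPMAW treats unknown residue letters is not known, so they are simply given a mass of zero here.

```python
# Average residue masses in Da; one water is added for the free termini.
AVG = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def protein_mass_kda(seq):
    """Average mass in kDa; unknown letters contribute no mass (assumption)."""
    return (sum(AVG.get(aa, 0.0) for aa in seq) + WATER) / 1000.0


def passes_filter(seq, limits=None, mass_range=None):
    """Check a sequence against composition and mass limits.

    limits:     {residue: (low %, high %)} or None to ignore composition
    mass_range: (low kDa, high kDa) or None to ignore mass
    """
    if not seq:
        return False
    if limits is not None:
        for aa, (low, high) in limits.items():
            percent = 100.0 * seq.count(aa) / len(seq)
            if not low <= percent <= high:
                return False
    if mass_range is not None:
        low, high = mass_range
        if not low <= protein_mass_kda(seq) <= high:
            return False
    return True


# Made-up sequence that is 90% glycine.
glycine_rich = "GGGGGGGGGA"
print(passes_filter(glycine_rich, {"G": (0, 20)}))        # fails the default 0-20% window
print(passes_filter(glycine_rich, {"G": (50, 100)}))      # passes a glycine-rich window
print(passes_filter(glycine_rich, None, (0.5, 1.0)))      # mass only, about 0.6 kDa
```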
After the filtered database has been generated it can be searched, indexed and viewed like a normal FastA formatted database. If the file is small enough (< 32,000 bytes for Win95/98) it can be viewed directly in Notepad; if it is larger, it can be viewed in a word processor.

The FTP download command is a simple way to obtain databases that can be used with GPMAW after indexing (see above, or Digest search). It is possible, but not very efficient, to use the download command to download other files.
Start by selecting the relevant database in ‘File to download’, then select the options you want in ‘Post download command’. Decide whether you want to ‘Overwrite destination file’. Check ‘Destination’ (can be edited) and then press the button. The actual file downloaded can be seen in the bottom status bar.
The progress of the download can be followed in the horizontal ‘Download progress’ bar.
The ‘Index database’ and ‘Reduce complexity’ post-download commands are the same as the ‘Make index’ and ‘Convert’ commands described above. When downloading EMBL and NCBI non-redundant databases you should always have the ‘Reduce complexity’ checked if you check the ‘Index database’.
The actual destination directory and download file can be edited by pressing the button.
i Note: The FTP download may not work under all circumstances; particularly it may have trouble going through firewalls.
The simulated 2-D gel shows a graphical representation of a number of proteins presented as dots in a graph where the X-axis is the pI of the protein and the Y-axis is the mass. This is the typical setup when using 2-D gel electrophoresis.
i Note: The pI calculated by GPMAW is a theoretical calculation based on the input sequence and as such is quite approximate. You should be aware that the trimming of signal sequences and other post-translational modifications have a considerable influence on the pI. The mass of the protein is influenced to a lesser degree by modifications. Even when the protein contains no modifications, the three-dimensional structure can influence both the pI and the electrophoretic mobility in the gel.
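A theoretical pI of this kind is usually obtained by finding the pH at which the summed charge of all ionizable groups is zero. The sketch below shows the principle only: the pKa values are illustrative textbook-style numbers and not necessarily those used by GPMAW, and the function names are made up.

```python
# Illustrative pKa values for the ionizable groups (assumptions).
POSITIVE_PKA = {"N-term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"C-term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of the peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    for group, pka in POSITIVE_PKA.items():
        count = 1 if group == "N-term" else sequence.count(group)
        charge += count / (1.0 + 10.0 ** (ph - pka))
    for group, pka in NEGATIVE_PKA.items():
        count = 1 if group == "C-term" else sequence.count(group)
        charge -= count / (1.0 + 10.0 ** (pka - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Find the pH of zero net charge by bisection.

    The net charge falls steadily with increasing pH, so the interval
    can be halved until it is narrower than `tolerance`.
    """
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0


acidic = isoelectric_point("DDDD")   # made-up acidic peptide
basic = isoelectric_point("KKKK")    # made-up basic peptide
print(f"DDDD: pI {acidic:.2f}   KKKK: pI {basic:.2f}")
```

Removing residues from either end changes the counts of charged groups, which is why trimmed forms of a protein move sideways in the simulated gel.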
The initial window of the ‘Simulated 2-D gel’ shows a display with a mass range of 10 – 200 kDa and a pI range of 3 – 10. Green lines mark each whole pI unit and every 25 kDa of mass. The 100 kDa mass line and the pI 7.0 line are shown in dark green.
The ‘2D-gel’ can show either
1. the proteins present in a FastA formatted database
2. the proteins opened on the desktop
3. a combination of 1. and 2.
The parameters can be accessed through the toolbar and the right mouse popup menu.
The toolbar shows from left to right:
Open database: Select a database in FastA format. As each protein has to be read and its pI calculated, reading a large database can take quite some time. The database proteins are shown as blue dots.
Save graph to disk: The graph (‘2-D gel’) is saved to disk in bitmap format. It can then be imported into a report or modified further in a graphics program. You can also copy the graph to the clipboard through the main menu (Edit | Copy) or with the Ctrl+C keyboard shortcut.
Import sequences: Load all protein sequences that are open on the desktop into the ‘2-D gel’. The imported proteins are shown as red dots.
Set scale: Enables you to redefine the display limits. This dialog can also be invoked by double-clicking in the ‘gel’ display area. You can zoom in on part of the display by clicking and dragging the mouse cursor across the required part of the graphics.
Display grid: Turns the green grid on and off. Lines are drawn for every 25 kDa and every pI unit; the 100 kDa and pI 7 lines are drawn in a darker color. The grid is on by default.
Show tails as lines: Shows N- and C-terminal trimmings as lines. N-terminal trimmings are green and C-terminal trimmings are red. Up to 300 residues are trimmed from either terminus. If you combine the lines with trimming dots (see below), you can navigate by positioning the mouse cursor above the dots. This function works only on imported sequences.
Exit: Close the ‘2D-gel’ window.
To the right of the command buttons you can select various options for the display.
Dots: The dots are either single pixels or 2x2 pixel dots. The large dots are on by default. You will probably only need the small dots when displaying a very large database.
The following options only work on imported sequences (not on database proteins) and are only enabled when sequences have been imported from the desktop. Each option works through a drop-down menu activated by pressing the down-arrow. You can get information on the resulting trimmed or ‘phosphorylated’ dots by resting the mouse cursor on top of the dot in question (see ‘Information panel’ below).
Trim size: Defines the number of residues cleaved from either end of the imported protein. The ‘trim’ parameter is combined with the ‘Numbers’ parameter below. Trim size can be defined as 1, 2, 3, 5, 10 or 20. The last option on the menu, ‘Labels’, will turn name labels on and off for proteins imported from the desktop.
Numbers: The number of ‘tails’ shown, i.e. the number of ‘trimmed’ spots generated for each protein. You can specify 1, 5, 10, 20, 30 or 50 tails. With a ‘trim size’ of 2 and a ‘tail number’ of 5, the N-terminal tails of a 400-residue protein will be 3-400, 5-400 … 11-400, and the C-terminal tails will be 1-398, 1-396 … 1-390.
Phospho: Lets you simulate the addition of phosphorylations to the imported proteins. You can specify 1-4 phosphorylations. These will show up as dots trailing towards the acidic part of the ‘2D-gel’. Only the charge is considered in the ‘gel’ representation, as the mass will usually be insignificant. The ‘phosphorylations’ will only be carried out on the intact protein, not on the ‘trimmed’ proteins.
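How the ‘Trim size’ and ‘Numbers’ settings combine can be sketched as a short calculation. The sketch uses 1-based, inclusive residue numbers, so cleaving k residues from the N-terminus leaves residues k+1 to n, and cleaving k residues from the C-terminus leaves residues 1 to n-k. The function name `trimmed_ranges` is made up for the example.

```python
def trimmed_ranges(length, trim_size, tail_count):
    """Return the residue ranges of the N- and C-terminal tails.

    Each successive tail removes a further `trim_size` residues; ranges
    are (first, last) in 1-based, inclusive numbering.
    """
    n_tails, c_tails = [], []
    for step in range(1, tail_count + 1):
        removed = step * trim_size
        if removed >= length:
            break                      # nothing would be left of the protein
        n_tails.append((removed + 1, length))
        c_tails.append((1, length - removed))
    return n_tails, c_tails


# Trim size 2 and five tails on a 400-residue protein.
n_tails, c_tails = trimmed_ranges(400, 2, 5)
print("N-terminal tails:", n_tails)
print("C-terminal tails:", c_tails)
```

Each range would then be treated as a protein of its own, with pI and mass calculated for the shortened sequence, to give one ‘tail’ dot in the display.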
Spot colors:
· Blue: Database proteins
· Red: Proteins imported from the desktop
· Green: N-terminal ‘tails’
· Pink: C-terminal ‘tails’
· Light blue: ‘Phosphorylations’
Information panel: The bottom of the ‘2D-gel’ window contains three information panels that show from left to right:
1. pI and mass of the mouse cursor position. If the mouse cursor points to an imported protein or one of the tail dots, the name and modification of that protein will be shown. If the cursor rests for a couple of seconds on the spot, the name will also be shown in a fly-by help window.
2. pI and mass range of the entire window.
3. Number of proteins imported from a database
You can zoom the view of the ‘2D-gel’ in the usual way for graphs by clicking and dragging the mouse cursor across the required part of the picture in order to enlarge that portion. If you double-click in the graph area, you reset the graph to the default values.

The picture shows the ‘2-D gel’ window with the proteins from the E. coli protein database.
Any protein database can be displayed in the ‘2D-gel’ as long as it is in FastA format. If you have problems reading the database, you may have to convert it from VMS to DOS format or the name lines may be too long (please see ‘Database indexer’ in section 12.3 and Appendix B for more information on how to treat FastA formatted databases).
i Note: Both the mass and the pI of each spot in the ‘2D-gel’ are the result of theoretical calculations based on the sequence in the database. In vivo a large part of the proteins are likely to be post-translationally modified, either by trimming and/or by chemical addition/removal of groups, which can affect both pI and mass.

A zoomed view of the ‘2D-gel’ after import of the prothrombin sequence. 5 N-terminal and 5 C-terminal trimmings of 10-50 residues are shown in the graph. N-terminal trimming results in a move of the protein towards the acidic part of the ‘gel’, and C-terminal trimming results in a smaller move towards the basic part of the ‘2D-gel’. The legend in the lower left panel shows that the mouse cursor rests on top of the spot representing the N-terminal trimming of 20 residues (green spot labeled 20 in the ‘2D-gel’).
Combining the above view with the ‘Tails as lines’ option gives you a view of how the protein ‘travels’ through a 2D-gel when being trimmed from either end. The dots of the tails can be seen as bulges on the lines. When the mouse cursor points to a dot, the trimming and sequence will be shown as fly-by help and in the bottom left-hand panel. The legend to each dot can be turned on and off through the ‘Trim size | Labels’ menu option.
If you combine imported sequences with a database, you should read the database as the last operation, as every other operation redraws the database on the screen, which can be quite time-consuming for a large database.

Using the button enables you to simulate up to four negative charges (phosphorylations) on the intact protein. These show up as dots towards the acidic end of the ‘2D-gel’. As these spots can lie quite close to the intact protein, their labels are printed below the dots (all other labels appear just above their respective dots). The spacing of these dots relative to the ‘native’ protein gives a good indication of how ‘resistant’ the protein is towards single charge changes.
You can print the ‘2D-gel’ by selecting ‘File | Print’ in the main menu, pressing the ‘Print’ button in the main toolbar or selecting ‘Print’ from the pop-up menu. Only the displayed part of the ‘2D-gel’ will be printed.
i Note: Printing complex ‘2D-gels’, particularly with N- and C-terminal tails should be done on a color printer, as a black and white print can be very confusing.