Chapter 12

Utilities

This section contains a few utilities that do not fit into other categories.

Several of the functions presented here (e.g. mass comparison and composition calculator) are usually called as auxiliary functions from other windows, but as each function can sometimes be useful on its own, they are presented here as individual windows.

12.1  MS peak analysis

The MS peak analysis displays mass peak difference data in an x/y table, much like a spreadsheet.

Data input

You start by entering the peak data in a table (see right). The table can be edited manually, and you can also load peak files in PerSeptive, Bruker and HP MALDI MS formats or paste data from the clipboard. GPMAW’s own PEP format is of course also supported.

Once the data have been entered you can edit them (see Chapter 4.1 for an explanation of the ‘Edit’ button), copy them to the clipboard or save them to a disk file in PEP format, just as for data entry in ‘Mass search’ (Chapter 6.1) and ‘Digest database search’ (Chapter 8).

The table information is automatically set to the file name of the most recent file read.

The local pop-up menu supports the same functions as the right-hand buttons.

The table accepts a maximum of 150 entries.

X/Y table

Once the data has been accepted by pressing the ‘OK’ button, a new dialog opens that shows the mass difference data.

Each cell of the table shows the mass in the left-hand column minus the mass in the top row.

The table by itself can be interesting, as it is relatively easy to spot repetitive occurrences of the same or similar mass. The real advantage comes when comparing to predefined mass tables as defined in the bottom status line.
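The layout of the difference table can be illustrated with a minimal Python sketch. It is not GPMAW code; the function name and the peak values are made up for the example. Each cell holds the row mass (left-hand column) minus the column mass (top row).

```python
def difference_table(masses):
    """Return a square table where cell [i][j] = masses[i] - masses[j].

    Row i corresponds to the entry in the left-hand column and column j
    to the entry in the top row.
    """
    return [[row_mass - col_mass for col_mass in masses] for row_mass in masses]


# Hypothetical peak list in Da (not taken from the manual).
peaks = [1000.50, 1057.52, 1128.56]
table = difference_table(peaks)

# Print the table with the peak masses as row labels.
for mass, row in zip(peaks, table):
    print(f"{mass:10.2f}", " ".join(f"{diff:9.2f}" for diff in row))
```

The diagonal is always zero, and the values above the diagonal are the negatives of those below it.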

Activating the ‘Amino acid’ button highlights all mass differences that correspond to amino acid residue differences. The amino acid masses are taken from the currently loaded mass file. When you move the mouse cursor onto a highlighted field, a fly-by hint opens showing the name of the amino acid residue. The local pop-up menu allows you to switch the mass value of the highlighted fields to three-letter residue names. In cases where you have isobaric residues (Leu/Ile, Lys/Gln) both names will be shown.

The ‘Precision’ box defines the precision by which the highlights are found. The box is in Dalton and can be edited directly or you may use the mouse to activate the up/down arrows.
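How a precision window selects residue names can be sketched as follows. This is only an illustration: the short mass table stands in for the currently loaded mass file, the function name is hypothetical, and comparing the absolute value of the difference is an assumption about how negative cells are treated.

```python
# Monoisotopic residue masses (Da) for a few amino acids. In GPMAW the
# values come from the currently loaded mass file; this table is only
# an illustrative stand-in.
RESIDUE_MASSES = {
    "Gly": 57.02146, "Ala": 71.03711, "Ser": 87.03203,
    "Leu": 113.08406, "Ile": 113.08406,
    "Gln": 128.05858, "Lys": 128.09496,
}


def matching_residues(difference, precision=0.05, table=RESIDUE_MASSES):
    """Return the residues whose mass lies within +/- precision of |difference|."""
    target = abs(difference)  # assumption: the sign of the difference is ignored
    return sorted(name for name, mass in table.items()
                  if abs(mass - target) <= precision)


print(matching_residues(113.084))                  # Leu and Ile are isobaric
print(matching_residues(128.07))                   # Gln and Lys both fall inside 0.05 Da
print(matching_residues(128.095, precision=0.01))  # a tighter window separates them
```

With a wide precision, near-isobaric residues such as Lys and Gln are reported together; narrowing the precision separates them, whereas Leu and Ile can never be distinguished by mass.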

The ‘Sequence tag’ button draws lines between successive amino acid residues, i.e. a sequence tag.

The ‘Sugar’ button highlights mass differences corresponding to sugar (carbohydrate) residue differences. The mass values are taken from a modification file (in the ‘System’ directory) called ‘SUGARS.MOD’. You can modify this file like any other modification file (see Chapter 4.3), so it can in principle contain any data. To stay consistent with the label of the button, however, you should keep it restricted to sugar residues.

The ‘Modificat.’ button uses the currently loaded modification file (Chapter 4.3) to search for mass differences.

The three mass difference types use the same precision but different background colors for highlighting. Only the amino acid differences can be switched to show names in the table; the other two show their names only in the fly-by hint.

The bottom left button, ‘Print’, prints the table (across several pages if needed), and the button to its right toggles between showing average and monoisotopic mass values.

Pop-up menu

The local pop-up menu has four entries:

Draw aa links:       Same as the ‘Sequence tag’ button.

AA as text:          Toggles between showing highlighted amino acid masses as text and as masses.

Compressed table:    Toggles display of the table in a compressed format in order to show more cells.

Print:               Same as the ‘Print’ button.

12.2  Composition calculator

The composition calculator is usually called from dialog boxes that require input of compositions (e.g. ‘Edit mass file’ and ‘Edit modification file’). Through the ‘Utilities’ section you have direct access to the dialog, either to calculate the mass of a given composition or to obtain the composition string in GPMAW format.

For more information on the GPMAW composition formula strings, see Chapter 4.4.

The ‘Elemental composition’ dialog contains an edit box with a corresponding up/down spin control for each atom defined in the ‘Atomic masses’ section of the ‘Edit mass’ dialog (see Chapter 4.2).

The number of each atom is controlled by entering a number in the edit boxes or by using the spin control with the mouse. Only integer values can be entered. The composition box is updated for every change made in the composition. The composition line can be copied (highlight and press <Ctrl+C> or use the pop-up menu) but cannot be edited directly.

Negative compositions or compositions that are part negative are accepted.

The ‘Clear’ button zeroes both the composition and the edit boxes.

Both the average and the monoisotopic mass are reported.
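The calculation behind the dialog can be sketched in a few lines of Python. The atomic masses below are common reference values for six elements, entered here as an assumption; GPMAW uses whatever is defined in its ‘Atomic masses’ section. The function name is hypothetical, and the sketch does not attempt to produce a GPMAW composition string.

```python
# element: (average mass, monoisotopic mass) in Da. Illustrative values;
# GPMAW reads its own values from the 'Atomic masses' section.
ATOMIC_MASSES = {
    "H": (1.00794, 1.0078250),
    "C": (12.0107, 12.0000000),
    "N": (14.0067, 14.0030740),
    "O": (15.9994, 15.9949146),
    "P": (30.973762, 30.9737615),
    "S": (32.065, 31.9720707),
}


def composition_mass(composition):
    """Return (average, monoisotopic) mass for a dict of element -> atom count.

    Counts must be integers; negative counts are allowed, which makes it
    possible to describe a loss such as -H2O.
    """
    average = monoisotopic = 0.0
    for element, count in composition.items():
        if not isinstance(count, int):
            raise TypeError("atom counts must be integers")
        avg_mass, mono_mass = ATOMIC_MASSES[element]
        average += count * avg_mass
        monoisotopic += count * mono_mass
    return average, monoisotopic


water = composition_mass({"H": 2, "O": 1})
water_loss = composition_mass({"H": -2, "O": -1})   # negative composition
print(f"H2O  average {water[0]:.4f}  monoisotopic {water[1]:.4f}")
print(f"-H2O average {water_loss[0]:.4f}  monoisotopic {water_loss[1]:.4f}")
```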

12.3  Database indexer (DBindex)

The database indexer, called DBindex, is a separate program that is either bundled with GPMAW or can be freely downloaded from the Lighthouse data website (see Chapter 1.9). Because the program is separate from GPMAW, once it has been called it runs independently (i.e. you can switch back and forth without one program interfering with the other). The version numbers are independent of those of GPMAW, and you should consult the website or Lighthouse data for the latest version. The current version, as of August 2000, is 1.11.

The program (DBINDEX.EXE) has to be present in the same directory as the GPMAW executable file (GPMAW3.EXE) in order to be called from the menu.

Note: If the menu entry Utilities|Call DBindex is disabled (grayed), GPMAW could not find the database indexing utility. You should then copy the program and the help file (DBINDEX.HLP) to the GPMAW\BIN directory.

When the program opens, you are greeted with a dialog stating copyright and version number.

From the copyright dialog you can select which section of DBindex to enter. Regardless of the section you choose here, you can switch between sections on different tabs in the running program.

Index:       Create indices from a FastA formatted database that enable GPMAW to search the database on the basis of protein name or accession number.
             Combine databases (i.e. add one database to the end of another).

Convert:     Convert the Swiss-Prot database to FastA format and still enable retrieval of the full Swiss-Prot entry through GPMAW.
             Reduce the complexity of some non-redundant databases that create very long (>255 character) name lines (NCBI-nr and EMBL-nr).
             Convert files in VMS format to DOS (Windows) format. The VMS format is used on most UNIX systems and needs to be converted for GPMAW to access the databases.

Filter:      Filter a FastA formatted database with regard to amino acid composition and molecular size.

Download:    Download databases from the Internet using the FTP protocol. This section is still at the experimental stage, and if you experience any problems you should use a dedicated FTP download program.

Other:       The program opens on the first page of the window without asking for a database.

The ‘Index’, ‘Convert’ and ‘Filter’ commands start by asking for a database file to work on. If you later need to open a new database, just click on the ‘Load database’ button in the bottom command line.

When you have opened a database, the next button in the command line, ‘Db Peek’, becomes active. Pressing this button opens a text dialog with the first couple of thousand characters of the database. This enables you to check the format and content of the database.

In the database file viewer (above) you can only view the content of the file; you cannot edit it.

The view starts with the full file name, followed by whether the file conforms to the GPMAW FastA format, and finally what type of file is present (ASCII vs. binary format, DOS vs. VMS format).

After the dotted line come the first records of the database.

Click on ‘OK’ to return to DBindex.
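The kind of check performed by ‘Db Peek’ can be approximated with a short Python sketch. The heuristics are assumptions made for illustration (a NUL byte means binary, a leading ‘>’ suggests FastA) and are not the rules DBindex applies; the function name is hypothetical.

```python
def peek(path, n_bytes=2000):
    """Read the start of a file and guess its type from the raw bytes."""
    with open(path, "rb") as handle:
        head = handle.read(n_bytes)
    return {
        "file": path,
        # Crude ASCII-vs-binary heuristic: text files should not contain NUL bytes.
        "is_text": b"\x00" not in head,
        # DOS files end lines with CR+LF; the 'VMS' format of this chapter uses LF only.
        "line_endings": "DOS (CR+LF)" if b"\r\n" in head else "LF only",
        # Assumption: a FastA file starts with a '>' name line.
        "looks_like_fasta": head.startswith(b">"),
        "preview": head.decode("ascii", errors="replace"),
    }
```

Reading in binary mode matters here, because a text-mode read would silently translate the line endings that the check is trying to detect.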

Indexing a FastA formatted database

Start by loading a FastA formatted database. If you have not selected one when opening the program, or if you want to select a different database, press the ‘Load database’ button and select a new database. The database loaded will be shown in the yellow list box in the right hand part of the dialog and in the status bar above the progress bar at the bottom of the display.

In the drop-down list ‘Format of FastA database’ you select the kind of database loaded.

Press the ‘Make index’ button and database indices will be generated. The progress of the index creation can be followed in the progress bar, the ‘Record number’ and the ‘Record position’. The yellow list box will show statistics and files generated.

The index files generated will be placed in the same directory as the database.

Note: Make sure that you have sufficient space on your hard disk, as the index files are quite voluminous. For example, the current version of EMBL-nr takes up 280 MB of space (after conversion from VMS to DOS and reduction of name lines), while the indices take up almost 30 MB in addition.

Four files are generated, each characterized by its extension:

.ACC    Accession number.

.FAC    Facts file. This is a text file that contains general information on the database and the indices.

.NDX    Index into the main database.

.TRG    Target database. Contains the search words.

The user should not modify any of the files generated. Only the FAC file is of direct use to the user; it can be opened in any text editor.
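The general idea of an index into a flat FastA file can be sketched as follows. This is not the DBindex file format, which is not documented here; the sketch simply keeps a dictionary from the first word of each name line to the byte offset of the record, and both function names are hypothetical.

```python
def build_offset_index(fasta_path):
    """Map the first word of each FastA name line to the record's byte offset."""
    index = {}
    offset = 0
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                words = line[1:].split()
                if words:
                    index[words[0].decode("ascii", errors="replace")] = offset
            offset += len(line)  # byte position of the next line
    return index


def fetch_record(fasta_path, index, key):
    """Jump straight to a record via its offset and return it as text."""
    with open(fasta_path, "rb") as handle:
        handle.seek(index[key])
        lines = [handle.readline()]          # the name line
        for line in handle:
            if line.startswith(b">"):        # start of the next record
                break
            lines.append(line)
    return b"".join(lines).decode("ascii", errors="replace")
```

Because the offsets are byte positions in one particular file, an index of this kind becomes invalid as soon as the database is changed, for instance by a VMS to DOS conversion. That is one reason to finish all conversions before indexing.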

When the program finishes indexing (a rather lengthy procedure ~20 min.) you can copy all databases to a different drive or medium (e.g. CD-ROM) for easier access. You may choose only to copy the indices and leave the database as such on a networked or slow drive. In this case you should modify the reference to the original database in the FAC file using Notepad.

Combine databases

The command ‘Add database’ enables you to combine two databases (or any files) into a single database. This is typically used with the PIR database (which comes as 4-5 separate files), genome databases (which often have one database per chromosome), the TREMBL database (one database per species), or Swiss-Prot and GenPept, which are published as a main database with an update.

Start by loading a database. Press the ‘Add database’ button and you will be asked for a database to add. When you select a file, it will be added to the original database (file).

Note: The combined database will have the name of the database first loaded. If you want to preserve the original file, make a copy of the first database and rename the copy to reflect the nature of the final database.
The files added to the first file are not changed in any way.

When all the files have been added, you can proceed with indexing the database.

As the addition of files is a straight binary combination of files, it does not matter whether you convert from VMS to DOS before or after combining the files.
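A straight binary combination of this kind corresponds to appending the bytes of one file to the end of another. A minimal Python sketch, with a hypothetical function name:

```python
import shutil


def append_database(first_path, added_path):
    """Append the bytes of added_path to the end of first_path.

    The added file is read only and is left unchanged; the first file
    grows and keeps its name.
    """
    with open(first_path, "ab") as target, open(added_path, "rb") as source:
        shutil.copyfileobj(source, target)
```

Since no bytes are interpreted or altered, the line-ending format of the combined file is simply that of its parts.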


Converting files

Several databases on the Internet are in a format that is not directly accessible by GPMAW. In most cases the transformation is rather trivial (e.g. converting from VMS to DOS) and serves mainly to speed up access and simplify coding. In the case of the Swiss-Prot database, however, the information is so valuable that it is worth being able to access the full entries directly from GPMAW.

Use the ‘Convert file’ tab of the program to access these conversion routines.

Reducing the complexity of a database

A few of the non-redundant databases (NCBI-nr and EMBL-nr) are created with name lines that exceed 255 characters, which is the limit for accepted names in GPMAW.

You have to reduce these databases by selecting the relevant radio button in the ‘Non-standard to standard FastA’ panel and then pressing ‘Convert’.

You are then asked to open a database, and then to give a name to the new database. GPMAW suggests a name depending on the selection of the radio-buttons above.

The new database will be placed in the same directory as the original database.

Note: Make sure you have enough disk space for the operation, as the new database will only be about 10% smaller than the original.

The ‘Convert’ command also takes care of converting from VMS to DOS format if necessary.

When you are through converting the database, you can delete the old (original) database.

If you want to create indices from the newly created database you have to re-load it through the ‘Load database’ button.

VMS to DOS conversion

Text files, and thus also flat-file databases, are represented differently internally in UNIX and DOS (Windows). Where the DOS format specifies that each line ends in carriage return and line feed characters (#13#10), the VMS file system specifies only a line feed character (#10). The VMS to DOS file conversion routine takes care of this.

When pressing the ‘VMS to DOS’ button you are asked for the file to convert, and when accepted, the file is converted.

The new file replaces the old one (unlike the ‘Convert’ command above).

You can follow the file conversion process in the ‘File position’ label and in the progress bar.
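The conversion itself amounts to rewriting every bare line feed as a carriage return plus line feed. The following Python sketch shows one way to do this in place; it is an illustration under that assumption, not the DBindex routine, and the function name is hypothetical.

```python
import os
import tempfile


def lf_to_crlf(path):
    """Rewrite a file so that every line ends in CR+LF, replacing the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
        for line in src:                      # binary iteration splits on LF
            if line.endswith(b"\r\n"):
                out.write(line)               # already in DOS format
            elif line.endswith(b"\n"):
                out.write(line[:-1] + b"\r\n")
            else:
                out.write(line)               # final line without a terminator
    os.replace(tmp_path, path)                # the new file replaces the old one
```

Lines that already end in CR+LF are left alone, so running the conversion on a file that is already in DOS format does no harm.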

Convert Swiss-Prot to FastA

Press the ‘Swiss-Prot’ button and select the Swiss-Prot sequence database (typically named sprot37.seq for the main database or new_seq.seq for the update). You are then asked whether you are converting the main, the update or another database. The default names for the converted databases are SWISS.SEQ for the main and SWISSNEW.SEQ for the update.

If you keep all the files in the same directory, GPMAW is able to search the FastA indexed database and retrieve the full entry in the original Swiss-Prot database. Please see ‘Reading CD-ROM based sequences - FastA’ in Chapter 2.6 for details.

Note: The Swiss-Prot database is no longer freeware. If you are a non-commercial organization you need a licensing agreement. Please see http://www.expasy.hcuge.ch/announce/ for further details.

Composition filter a FastA database

The ‘Composition filter’ page of DBindex enables you to filter a FastA formatted database based on amino acid composition and/or molecular mass.

The input is a database that has to be in FastA format. The operation is based on a composition range (in %) for each amino acid residue and/or a mass range specified by a lower and an upper limit. The result is a new database where each entry conforms to the filter specification.

·       Select a database if the proper database has not been selected already (the currently opened database is shown in the bottom status bar).

·       Enter minimum and maximum % values for each residue in the top edit boxes. The amino acid residues are selected from the left-hand list box, whereupon the low and high % can be changed. The values are updated whenever the focus leaves an edit box. By default all residues are set to a composition between 0 and 20%.

·       Enter low and high mass values (in kDa) for the proteins to be selected.

·       You may check either ‘Ignore composition’ or ‘Ignore mass’ to generate a database that does not take composition or mass, respectively, into consideration when filtering.

·       The name of the resulting database can be edited (by default DBSORT.DAT).

·       Click on the ‘Create filtered database’ button to start the filtering process.

The filtering process can be followed on the progress bar.
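The filter logic described above can be sketched in Python as follows. The average residue masses are common reference values entered as an assumption (GPMAW uses its own mass file), unknown residue letters are assumed to contribute no mass, and all function names are hypothetical.

```python
# Average residue masses in Da (illustrative reference values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FastA text lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif line:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def mass_kda(seq):
    """Average mass in kDa; unknown letters count as zero (an assumption)."""
    return (sum(AVERAGE_RESIDUE_MASS.get(aa, 0.0) for aa in seq) + WATER) / 1000.0


def passes(seq, limits, mass_range, ignore_composition=False, ignore_mass=False):
    """Check one sequence against per-residue % limits and a kDa mass range."""
    if not seq:
        return False
    if not ignore_composition:
        for aa in AVERAGE_RESIDUE_MASS:
            low, high = limits.get(aa, (0.0, 20.0))   # default range 0-20 %
            percent = 100.0 * seq.count(aa) / len(seq)
            if not low <= percent <= high:
                return False
    if not ignore_mass:
        low, high = mass_range
        if not low <= mass_kda(seq) <= high:
            return False
    return True


def filter_fasta(lines, limits, mass_range, **options):
    """Yield only the (name, sequence) records that pass the filter."""
    for name, seq in read_fasta(lines):
        if passes(seq, limits, mass_range, **options):
            yield name, seq
```

For example, `filter_fasta(open("db.fasta"), {"C": (5.0, 100.0)}, (5.0, 30.0))` would keep proteins of 5 to 30 kDa with at least 5% cysteine, where `db.fasta` is a placeholder file name.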

After the filtered database has been generated it can be searched, indexed and viewed like a normal FastA formatted database. If the file is small enough (< 32.000 bytes for Win95/98) it can be viewed directly in Notepad; larger files can be viewed in a word processor.

FTP download

The FTP download command is a simple way to obtain databases that can be used with GPMAW after indexing (see above, or Digest search). It is possible, but not very efficient, to use the download command to download other files.

Start by selecting the relevant database in ‘File to download’, then select the options you want in ‘Post download command’. Determine whether you want to ‘Overwrite destination file’. Check ‘Destination’ (can be edited) and then press ‘Download’. The actual file downloaded can be seen in the bottom status bar.

The progress of the download can be followed in the horizontal ‘Download progress’ bar.

The ‘Index database’ and ‘Reduce complexity’ post-download commands are the same as the ‘Make index’ and ‘Convert’ commands described above. When downloading EMBL and NCBI non-redundant databases you should always have the ‘Reduce complexity’ checked if you check the ‘Index database’.

The actual destination directory and download file can be edited by pressing the ‘Setup’ button.

Note: The FTP download may not work under all circumstances; in particular, it may have trouble going through firewalls.

 

12.4  Simulated 2-D gel

The simulated 2-D gel shows a graphical representation of a number of proteins presented as dots in a graph where the X-axis is the pI of the protein and the Y-axis is the mass. This is the typical setup when using 2-D gel electrophoresis.

Note: The pI calculated by GPMAW is a theoretical value based on the input sequence and as such is quite approximate. You should be aware that trimming of signal sequences and other post-translational modifications have a considerable influence on the pI. The mass of the protein is influenced to a lesser degree by modifications. Even when the protein contains no modifications, the three-dimensional structure can influence both the pI and the electrophoretic mobility in the gel.

The initial window of the ‘Simulated 2-D gel’ shows a display with a mass range of 10 – 200 kDa and a pI range of 3 – 10. Green lines mark each pI unit and every 25 kDa of mass. The 100 kDa mass line and the pI 7.0 line are shown in dark green.
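To illustrate what a theoretical pI calculation involves, here is a minimal Python sketch that sums Henderson-Hasselbalch charge terms and finds the pH of zero net charge by bisection. The pK values are illustrative assumptions and are not those used by GPMAW; the function names are hypothetical.

```python
# Illustrative pK values; an assumption for this sketch, not GPMAW's values.
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_N_TERM, PK_C_TERM = 8.6, 3.6


def net_charge(seq, ph):
    """Net charge of the unmodified sequence at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PK_N_TERM))       # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERM - ph))      # free C-terminus
    for aa, pk in PK_POSITIVE.items():
        charge += seq.count(aa) / (1.0 + 10 ** (ph - pk))
    for aa, pk in PK_NEGATIVE.items():
        charge -= seq.count(aa) / (1.0 + 10 ** (pk - ph))
    return charge


def theoretical_pi(seq, tolerance=0.001):
    """Find the pH where the net charge crosses zero, by bisection."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid      # still positively charged: the pI lies higher
        else:
            high = mid
    return (low + high) / 2.0
```

The sketch considers only the sequence, which is exactly why such values are approximate: it knows nothing about modifications, signal peptide removal or the influence of the folded structure on the pK of individual groups.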

The ‘2D-gel’ can show either

1.       the proteins present in a FastA formatted database

2.       the proteins opened on the desktop

3.       a combination of 1. and 2.

The parameters can be accessed through the toolbar and the right mouse popup menu.

The toolbar shows from left to right:

Open database: Select a database in FastA format. As each protein has to be read and its pI calculated, reading a large database can take quite some time. The database proteins are shown as blue dots.

Save graph to disk: The graph (‘2-D gel’) is saved to disk in bitmap format. It can then be imported into a report or modified further in a graphics program. You can also copy the graph to the clipboard through the main menu (Edit | Copy) or with the Ctrl+C keyboard shortcut.

Import sequences: Load all protein sequences that are opened on the desktop into the ‘2-D gel’. The imported proteins are shown as red dots.

Set scale: Enables you to redefine the display limits. This dialog can also be invoked by double-clicking in the ‘gel’ display area. You can zoom in on part of the display by clicking and dragging the mouse cursor across the required part of the graph.

Display grid: Turns the green grid on and off. The lines are drawn for every 25 kDa and 1 pI unit. 100 kDa and pI 7 are drawn in a darker color. The grid is on by default.

Show tails as lines: Shows N- and C-terminal trimmings as lines. N-terminal trimmings are green and C-terminal trimmings are red. Up to 300 residues are trimmed from either terminal. If you combine the lines with trimming dots (see below) you will be able to navigate by positioning the mouse cursor above the dots. This function works only on imported sequences.

Exit: Closes the ‘2D-gel’ window.

To the right of the command buttons you can select various options for the display:

Dots: The dots are either single pixels or 2x2 pixel dots. The large dots are on by default. You will probably only need the small dots when displaying a very large database.

The following options work only on imported sequences (not on database proteins) and are only enabled when sequences have been imported from the desktop. Each option works through a drop-down menu activated by pressing the down-arrow. You can get information on the resulting trimmed or ‘phosphorylated’ dots by resting the mouse cursor on top of the dot in question (see ‘Information panel’ below).

Trim size:  Defines the number of residues cleaved from either end of the imported protein. The ‘trim’ parameter is combined with the ‘Numbers’ parameter below. Trim size can be defined as 1, 2, 3, 5, 10 or 20. The last option on the menu, ‘Labels’, will turn name labels on and off for proteins imported from the desktop.

Numbers: The number of ‘tails’ shown, i.e. the number of ‘trimmed’ spots generated for each protein. You can specify 1, 5, 10, 20, 30 or 50 tails. With a ‘trim size’ of 2 and a ‘tail number’ of 5, the N-terminal tails of a 400 residue protein will be 3-400, 5-400 … 11-400, and the C-terminal tails will be 1-398, 1-396 … 1-390.

Phospho: Lets you simulate the addition of phosphorylations to the imported proteins.  You can specify 1-4 phosphorylations. These will show up as dots trailing towards the acidic part of the ‘2D-gel’. Only the charge is considered in the ‘gel’ representation, as the mass will usually be insignificant. The ‘phosphorylations’ will only be carried out on the intact protein, not on the ‘trimmed’ proteins.
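The way ‘Trim size’ and ‘Numbers’ combine can be sketched in Python: tail number k has k times the trim size removed from one end. This is an illustration with a hypothetical function name, not the GPMAW implementation, and stopping when the whole sequence would be removed is an assumption.

```python
def trimmed_tails(seq, trim_size, number):
    """Return (n_terminal_tails, c_terminal_tails) as lists of sequences.

    Tail k (k = 1..number) has k * trim_size residues removed from the
    N-terminal or the C-terminal end, respectively.
    """
    n_tails, c_tails = [], []
    for k in range(1, number + 1):
        cut = k * trim_size
        if cut >= len(seq):        # nothing would be left of the protein
            break
        n_tails.append(seq[cut:])  # residues removed from the N-terminal
        c_tails.append(seq[:-cut]) # residues removed from the C-terminal
    return n_tails, c_tails


n_tails, c_tails = trimmed_tails("ACDEFGHIKL", trim_size=2, number=3)
print(n_tails)
print(c_tails)
```

Each trimmed sequence would then get its own mass and pI and be plotted as a separate spot.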

Spot colors:

·          Blue: Database proteins

·          Red: Proteins imported from the desktop

·          Green: N-terminal ‘tails’

·          Pink: C-terminal ‘tails’

·          Light blue: ‘Phosphorylations’

Information panel: The bottom of the ‘2D-gel’ window contains three information panels that show from left to right:

1.       pI and mass of the mouse cursor position. If the mouse cursor points to an imported protein or one of the tail dots, the name and modification of that protein will be shown. If the cursor rests for a couple of seconds on the spot, the name will also be shown in a fly-by help window.

2.       pI and mass range of the entire window.

3.       Number of proteins imported from a database

You can zoom the view of the ‘2D-gel’ in the usual way for graphs by clicking and dragging the mouse cursor across the required part of the picture in order to enlarge that portion. If you double-click in the graph area, the graph is reset to the default values.

The picture shows the ‘2-D gel’ window with the proteins from the E. coli protein database.

Any protein database can be displayed in the ‘2D-gel’ as long as it is in FastA format. If you have problems reading the database, you may have to convert it from VMS to DOS format or the name lines may be too long (please see ‘Database indexer’ in section 12.3 and Appendix B for more information on how to treat FastA formatted databases).

Note: Both the mass and the pI of each spot in the ‘2D-gel’ are the result of theoretical calculations based on the sequence in the database. In vivo a large part of the proteins are likely to be post-translationally modified, by trimming and/or by chemical addition or removal of groups, which can affect both pI and mass.

A zoomed view of the ‘2D-gel’ after import of the prothrombin sequence. 5 N-terminal and 5 C-terminal trimmings of 10-50 residues are shown in the graph. N-terminal trimming moves the protein towards the acidic part of the ‘gel’, and C-terminal trimming results in a smaller move towards the basic part of the ‘2D-gel’. The legend in the lower left panel shows that the mouse cursor rests on top of the spot representing the N-terminal trimming of 20 residues (green spot labeled 20 in the ‘2D-gel’).

Combining the above view with the ‘Tails as lines’ option gives you a view of how the protein ‘travels’ through a 2D-gel when being trimmed from either end. The dots of the tails can be seen as bulges on the lines. When the mouse cursor points to a dot, the trimming and sequence will be shown as fly-by help and in the bottom left-hand panel. The legend to each dot can be turned on and off through the ‘Trim size | Labels’ menu option.

If you combine imported sequences with a database, you should read the database as the last operation, as every other operation re-draws the database on the screen which can be quite time-consuming for a large database.

 

Using the ‘Phospho’ button enables you to simulate up to four negative charges (phosphorylations) on the intact protein. These will show up as dots towards the acidic end of the ‘2D-gel’. As these spots can lie quite close to the intact proteins, their labels are printed below the dots (all other labels appear just above their respective dots). The spacing of these dots relative to the ‘native’ protein gives a good indication of how ‘resistant’ the protein is towards single charge changes.

 

Print

You can print the ‘2D-gel’ by selecting ‘File | Print’ in the main menu, pressing the ‘Print’ button in the main toolbar or selecting ‘Print’ from the pop-up menu. Only the displayed part of the ‘2D-gel’ will be printed.

Note: Printing complex ‘2D-gels’, particularly those with N- and C-terminal tails, should be done on a color printer, as a black and white print can be very confusing.