Chapter 7. Searches
Search for a given composition in a sequence or local BLAST homology searching of a database
As with mass search, you can locate a peptide in a protein if you know its amino acid composition. However, because amino acid compositions are determined with a precision at least an order of magnitude worse than that of an average mass spectrum, the search can never be as precise as 'Mass search' (Chapter 6.1). The low precision is to some degree compensated by the fact that you search for a combination of (at most) 18 amino acids (Asn -> Asp and Gln -> Glu due to hydrolysis), not just a single value.

Fill in the composition search table with the expected number of residues of the search peptide (not the % composition). You can use decimal numbers, as the calculations are carried out with one decimal place.
The table is persistent between searches (it remembers the table values), enabling you to carry out several searches with slight modifications. The button clears the table, and the ‘Number of extra residues to add:’ field adds the required number of 'unknown' residues to the search, resulting in a smoothing effect.

The results of the search are shown as a graph of the deviation index (DI) of a sliding search window along the entire sequence.
Low points in the graph show areas of the sequence that have a composition similar to that given in the input dialog box. If the given composition reflects an actual peptide, this will usually be quite evident from the graph (around residue 66 in the above graph). Please note that the graph starts and stops with a Y-value of zero. If the composition fits a terminal peptide, the graph has to be horizontal or make a sharp downward bend at the terminus.
The position of the cursor is shown in the first panel of the status bar.
For details about general handling of the graph, please see Chapter 11.1.
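The sliding-window idea can be illustrated with a minimal Python sketch. The deviation measure used here (sum of squared differences in residue counts) is an assumption made for illustration only; the exact deviation index calculated by GPMAW is not given in this chapter, and the 'extra residues' option is not modelled. The names `composition` and `deviation_profile` are hypothetical.

```python
from collections import Counter

# Acid hydrolysis converts Asn to Asp and Gln to Glu, so they are pooled.
POOLED = {"N": "D", "Q": "E"}

def composition(peptide):
    """Count residues after pooling N into D and Q into E."""
    return Counter(POOLED.get(aa, aa) for aa in peptide.upper())

def deviation_profile(sequence, target):
    """Return one deviation score per window start (0-based).

    ASSUMPTION: the score is the sum of squared count differences;
    this is not necessarily the deviation index used by GPMAW.
    """
    width = round(sum(target.values()))  # window length = residues in target
    scores = []
    for start in range(len(sequence) - width + 1):
        window = composition(sequence[start:start + width])
        residues = set(window) | set(target)
        scores.append(sum((window.get(r, 0) - target.get(r, 0)) ** 2
                          for r in residues))
    return scores

# Made-up example: the target matches the window KDNEQ exactly.
sequence = "AAAGKDNEQWAAA"
target = {"K": 1, "D": 2, "E": 2}
scores = deviation_profile(sequence, target)
print(scores.index(min(scores)))  # 4 (0-based start of the best window)
```

A score of zero marks a window whose pooled composition is identical to the target; low points in the resulting profile correspond to the low points of the graph described above.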
H. Metzger, M.B. Shapiro, J.E. Mosimann & J.E. Vinton, Nature 219, 1166-1168 (1968)
R.J.T. Corbett & R.S. Roche, Anal. Biochem. 162, 546-552 (1987).
The GPMAW BLAST is a local implementation of the BLAST sequence homology search available on the NCBI server (http://www.ncbi.nlm.nih.gov/BLAST) and uses the same code compiled for Windows. The search runs as a separate program, but all communication and interface elements are implemented in GPMAW, so from a user's point of view it works like an integrated part of GPMAW.
The main reasons for using a local implementation of BLAST could be:
1) A slow Internet connection (or none at all). When searching a large database, the NCBI server is faster than a local implementation. However, at regular intervals the NCBI BLAST server slows to a crawl.
2) A specialized or proprietary database. If you are searching a small genome database (e.g. E. coli or A. thaliana) a local implementation can be very fast (2-3 seconds).
3) Security concerns – communications across the Internet may be compromised.
4) Convenience – the local homology search of a protein is just a click away.
The first thing is to make sure that the BLAST homology search program is installed and recognized by GPMAW. Open System setup (Setup | System setup) and click on the BLAST page.

In the bottom left corner is the legend ‘BLAST program location’. The line below should show the location of the blastall.exe program. If not, you have to press the ‘Install BLAST’ button and navigate to the blastall.exe program in order to make GPMAW aware of the location. By default the ‘blastall.exe’ program will be installed in the \gpmaw\bin\ directory along with the other executable files.
Tech: In addition to ‘blastall.exe’ you need the file ‘formatdb.exe’ and a subdirectory called ‘DATA’ with the following files: ‘seqcode.val’, ‘blosum42’, ‘blosum62’, ‘blosum80’, ‘pam30’ and ‘pam70’. All these files will normally be installed by GPMAW, but may also be downloaded from the NCBI web server.
When the BLAST program is installed, you need a protein database in FastA format. This can be the same database used for sequence retrieval (chapter 2.6) and/or for digest database searching (chapter 8). The databases on the GPMAW installation CD-ROM can be used, and chapter 12.3 and appendix B contain information on how to retrieve sequences from the Internet.
When you have the database you need to reformat it for BLAST:
Click on the button and in the ‘Open file’ dialog navigate to and select the FastA database (e.g. swiss.seq from the GPMAW CD-ROM – it needs to be installed on the hard drive). When selected, the database is quickly converted (note that the converted database takes up approximately the same amount of space as the original database) and you are asked to add it to the list of databases available for local BLAST.
You may add several databases to the list.
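For reference, the conversion step corresponds to a call to the NCBI ‘formatdb.exe’ program. The following Python sketch builds such a call by hand; it is an illustration under stated assumptions, not a description of what GPMAW does internally. The helper name `formatdb_command` is hypothetical, and the default directory is taken from the installation location mentioned above.

```python
from pathlib import PureWindowsPath

def formatdb_command(fasta_path, bin_dir=r"\gpmaw\bin"):
    """Build a formatdb call for a protein FastA database.

    ASSUMPTION: formatdb.exe sits in bin_dir; adjust to your installation.
    -i names the input FastA file, -p T marks it as a protein database.
    """
    exe = str(PureWindowsPath(bin_dir) / "formatdb.exe")
    return [exe, "-i", str(fasta_path), "-p", "T"]

cmd = formatdb_command("swiss.seq")
print(" ".join(cmd))
```

On a Windows machine with BLAST installed, the list could be passed to `subprocess.run(cmd, check=True)`; the converted index files are then written next to the FastA file.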
The button may be used to add already converted databases to the list (e.g. if they are shared across a network) and with the button you can remove databases from the list.
The local BLAST can be called from all sequence windows, either through the main menu Search | Local BLAST or the same command from the pop-up menu.
This will open the BLAST dialog box with the name of the sequence in the top edit box and the sequence in the ‘Input sequence’ multiline edit box below. If you have installed one or more databases you can select them in the drop-down selection box ‘Sequence database’. If no BLAST databases are installed you can go to the setup page by pressing the button. The program will remember the most recently used database.
Both the name and the sequence can be edited before performing the BLAST search.
Several parameters can be set to fine-tune the search:
Expect value: The expected number of hits from a given database with the given input sequence. Hits with E-values up to this value may be reported. When searching with protein sequences, values of 1-10 are common; when searching with peptides you can increase the value to 1000-10000.
Substitution matrix: The amino acid substitution matrix used to calculate the score in the homology search. The matrix to use depends on what you are looking for: BLOSUM matrices are based on the comparison of blocks of homologous sequences, while the PAM matrices are based on the total alignment of protein sequences. If you are looking for highly divergent proteins, you should use low BLOSUM or high PAM values. If you are looking for closely similar proteins, use high BLOSUM or low PAM values. BLOSUM62 is usually a good compromise for general searches.
Hits to report: Determines the maximum number of hits to show in the results window.
Filter sequence for low complexity: If checked, parts of the input sequence that have a low complexity (e.g. skewed composition or simple repetitive regions) will be masked during the search. This box should not be checked when using short sequences for the homology search.
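The parameters above correspond to command-line options of the ‘blastall.exe’ program. The Python sketch below shows one way the dialog settings could be translated into such a call. The function name `blastall_command`, its parameter names and its default values are hypothetical, and the exact command line that GPMAW issues is an assumption.

```python
from pathlib import PureWindowsPath

def blastall_command(query_fasta, database, expect=10, matrix="BLOSUM62",
                     hits=50, filter_low_complexity=True,
                     bin_dir=r"\gpmaw\bin"):
    """Build a blastall protein search call from the dialog settings.

    ASSUMPTION: blastall.exe sits in bin_dir; the defaults are examples only.
    """
    exe = str(PureWindowsPath(bin_dir) / "blastall.exe")
    return [
        exe,
        "-p", "blastp",        # protein query against protein database
        "-d", database,        # database converted with formatdb
        "-i", query_fasta,     # input sequence in FastA format
        "-e", str(expect),     # Expect value
        "-M", matrix,          # Substitution matrix
        "-v", str(hits),       # number of one-line descriptions to report
        "-b", str(hits),       # number of alignments to report
        "-F", "T" if filter_low_complexity else "F",  # low-complexity filter
    ]

# Short peptide: raise the Expect value and switch the filter off.
peptide_cmd = blastall_command("peptide.fasta", "swiss.seq",
                               expect=1000, filter_low_complexity=False)
print(" ".join(peptide_cmd))
```

The peptide example follows the advice given above: a high Expect value and no low-complexity filtering for short input sequences.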

Selecting calls the external BLAST search program and opens the BLAST result window. This window gives the message ‘>Searching database<’, ‘>Please wait<’ and displays a counter that shows the elapsed search time.
Search times depend largely on the size of the database, but the size of the input sequence also has a minor influence.

The result of the search is presented in the same window as the search timer. At the top is the date, the name of the input sequence and the name of the database, followed by the list of the highest scoring comparisons. Each hit is accompanied by a score and an E-value. The E-value is the number of hits with at least this score that you would expect to find by chance in a database of this size. Significant similarities are usually taken as E-values below 10^-4, but the lower the better. Homology is a theory that can be difficult to prove.
Below the summary list, the highest scoring segments of each hit are presented. At the bottom of the display are search statistics and a reference to the article presenting the algorithm and search program (Altschul et al., 1997).
For an in-depth treatment of homologies please consult the articles referenced at the end of this chapter or some of the many books on bioinformatics.
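To illustrate why the E-value depends on the size of the database, the following Python sketch uses the standard relation between a normalized (bit) score S' and the E-value, E = m · n · 2^(-S'), where m is the query length and n the total database length. This is a simplification: BLAST itself uses corrected (effective) lengths, so the numbers will not match the program output exactly. The function name `expected_hits` is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    ASSUMPTION: raw lengths are used; BLAST applies effective-length
    corrections that are ignored here.
    """
    return query_length * database_length * 2.0 ** (-bit_score)

# The same score is less significant in a larger database.
small_db = expected_hits(40, 100, 1_000_000)
large_db = expected_hits(40, 100, 10_000_000)
print(small_db < 1e-4, large_db < 1e-4)
```

In this example the same alignment score falls below the 10^-4 level in the smaller database but not in the ten times larger one.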
The toolbar at the top of the window contains the following commands from left to right:
1) Copy result table to clipboard.
2) Save result list to a file on disk (in text format).
3) Move the high scoring segment comparison to the top of the window. This option is only highlighted when a line in the summary list is selected. You can accomplish the same thing by double-clicking on the line.
4) Scroll the top of the summary list to the top of the window. Only available when no line in the summary list is selected.
5) Retrieve sequence into GPMAW. This button works in conjunction with retrieval method drop-down list discussed below. This button is only active when a name line is selected. As the sequence retrieval option works through the accession number, it is only active when the BLAST database used for searching is in a format where the accession number is listed in the name line. The accession number used for retrieval is listed in the panel to the right of the toolbar. If there is no accession listed, the function is unable to retrieve a sequence.
6) Print the list.
7) Close BLAST result window.
The retrieval method drop-down list enables you to specify where the protein should be retrieved from:
Smart retrieval: If the accession number starts with O, P or Q the search starts with the Swiss-Prot database (Expasy). If no result, the Entrez database is searched (NCBI). Finally the sequence is retrieved from the local FastA formatted database.
Entrez->Swiss Prot: The Entrez database is always searched before the Swiss-Prot database.
Swiss-Prot->Entrez: The Swiss-Prot database is always searched before the Entrez database.
Local FastA: Only the local database is searched. You should choose this option only if you are not connected to the Internet or have restricted access (firewall).
The advantage of retrieving the sequence from the Internet is that in addition to the sequence you will retrieve the complete database record (see Chapter 3.9 and Appendix B).
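The order in which the sources are tried can be summarised in a short Python sketch. The function name `retrieval_order` is hypothetical, and two points are assumptions not stated above: the order used by smart retrieval when the accession number does not start with O, P or Q, and the absence of a local fallback for the two fixed-order methods.

```python
def retrieval_order(method, accession):
    """Return the sources in the order they are tried for a retrieval method."""
    if method == "Smart retrieval":
        if accession[:1].upper() in ("O", "P", "Q"):
            return ["Swiss-Prot", "Entrez", "Local FastA"]
        # ASSUMPTION: other accession numbers are looked up in Entrez first.
        return ["Entrez", "Swiss-Prot", "Local FastA"]
    if method == "Entrez->Swiss Prot":
        return ["Entrez", "Swiss-Prot"]
    if method == "Swiss-Prot->Entrez":
        return ["Swiss-Prot", "Entrez"]
    if method == "Local FastA":
        return ["Local FastA"]
    raise ValueError(f"unknown retrieval method: {method}")

print(retrieval_order("Smart retrieval", "P12345"))
# ['Swiss-Prot', 'Entrez', 'Local FastA']
```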
S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
S. Henikoff and J.G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915 (1992).
S.E. Brenner, C. Chothia, and T.J.P. Hubbard. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA 95, 6073 (1998).