This article originally appeared on the BeyeNETWORK
The Basic Local Alignment Search Tool (BLAST) is one of the best-known and most widely used bioinformatics tools available. Many experts agree that BLAST, together with its many variants, is used by more scientists than almost any other bioinformatics application. It is popular because gene and protein sequences are fundamentally important in molecular biology, evolutionary biology and drug discovery, and BLAST is well suited to analyzing them.
This article briefly examines BLAST and the key issues surrounding its use in sequence analysis.
Purpose of BLAST
BLAST is used to compare gene or protein sequences and find regions of local similarity between them. BLAST first compares a query sequence, either nucleotides or protein amino acids, against target databases. These target databases can contain anywhere from hundreds to many millions of archived sequences. BLAST then calculates the statistical significance of the sequence matches it finds. The results of a BLAST analysis can be used to infer functional and evolutionary relationships between sequences, and to discover similar or related sequences. This can be extremely useful in determining whether the query sequences are derived from related species, genomes or proteins.
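As a rough illustration of how the statistical significance of a match is commonly expressed, the sketch below computes an expect value (E-value) from a raw alignment score using the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The function names are hypothetical, and the lambda and K values in the usage lines are placeholder assumptions; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters (assumptions for illustration only).
LAMBDA, K = 0.267, 0.041

for score in (40, 80, 120):
    e = expect_value(score, query_len=250, db_len=1_000_000, lam=LAMBDA, k=K)
    print(score, round(bit_score(score, LAMBDA, K), 1), e)
```

The smaller the E-value, the less likely it is that a match of that score arose by chance; note that the same raw score becomes less significant as the database grows.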
Brief History of BLAST
The first BLAST algorithm was developed in 1990 by Steve Altschul, Warren Gish and Dave Lipman at the National Center for Biotechnology Information (NCBI), Webb Miller at Penn State University and Gene Myers at the University of Arizona. Interestingly, the original paper describing BLAST (Altschul et al., 1990, “Basic local alignment search tool,” Journal of Molecular Biology 215(3):403-10) was one of the most highly cited papers published in the scientific literature in the 1990s. This is a testament to the broad applicability of BLAST in the bioinformatics community.
The BLAST Pedigree
Over the past 15 years, the original BLAST algorithm has spawned a set of increasingly specialized variant algorithms. In addition to gene sequences, these new forms of BLAST can handle protein amino acid sequences, immunoglobulin sequences and gene expression data. They were also designed to handle gaps in sequence data and detect subtle differences in sequence alignments. Table 1 lists some of the current BLAST analogues and their functions:
|Program|Function|
|---|---|
|BLASTN|Compares a nucleotide query sequence against a nucleotide sequence database|
|BLASTP|Compares an amino acid query sequence against a protein sequence database|
|BLASTX|Finds proteins homologous to a query nucleotide coding region|
|BLASTZ|Compares mouse genome sequences to human genome sequences|
|CDART|Conserved domain architecture retrieval tool|
|GEOBLAST|Searches gene expression data|
|IgBLAST|Searches immunoglobulin sequences in the GenBank database|
|MEGABLAST|Rapid searches of highly similar DNA sequences|
|NCBI BLAST|Original version of BLAST|
|PHI-BLAST|Pattern hit initiated BLAST; detects specific sequence patterns|
|PSI-BLAST|Position specific iterated BLAST; detects weak sequence similarities|
|RPSBLAST|Reverse position specific BLAST; searches a conserved domain database|
|SNP BLAST|Single nucleotide polymorphism BLAST; detects SNPs|
|TBLASTN|Compares an amino acid query sequence against a translated nucleotide database|
|TBLASTX|Compares a translated nucleotide query sequence against a translated nucleotide database|
|WU BLAST|Enhanced BLAST that includes gapped alignments|
Table 1. BLAST and its many variants. Notice the increasing levels of specialization relative to the original NCBI BLAST program.
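BLAST is a fast heuristic approximation of the famous Smith-Waterman algorithm, which performs optimal local alignments of two sequences. Rather than filling in a complete alignment matrix for every database sequence, BLAST looks for short matching words between the query and the target and extends those matches into longer local alignments, trading a small loss in sensitivity for a large gain in speed. There are many variants of Smith-Waterman that improve its performance on large sequences, highly similar sequences and highly divergent sequences. When performing BLAST alignments, it's wise to have some understanding of the Smith-Waterman internal logic and its limitations.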
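To make the Smith-Waterman logic concrete, here is a minimal Python sketch of its dynamic programming recurrence with traceback. The function name and the match, mismatch and gap values are illustrative assumptions rather than BLAST defaults, and the single linear gap penalty is a simplification.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment restart anywhere (local, not global).
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAACGTAAA", "CCCCGTCCC"))
```

The running time grows with the product of the two sequence lengths, which is one limitation to keep in mind when comparing a query against a very large database.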
Be Careful Using BLAST
BLAST programs are seductively easy to use. They are usually hosted on websites with slick graphical interfaces, a handful of default values that can be overridden by end-users, and appealing reports and aligned sequences as output. But looks can be deceiving. Using BLAST effectively requires caution and insight into the algorithmic details of the program.
The output of BLAST runs includes a statistical analysis of the likely relationships between two gene or two protein sequences. BLAST derives the level of statistical significance in part from internal scoring matrices. For protein sequences these matrices have the fanciful names of BLOSUM and PAM, which are acronyms for “Blocks Substitution Matrix” and “Point Accepted Mutation” (also seen expanded as “Percent Accepted Mutations”) respectively. For DNA sequences, DNA substitution matrices are used instead. Without going into excruciating detail on how these matrices are generated, they assign a score to every possible pairing of aligned residues. Each score reflects how likely it is that one nucleotide is substituted for another in two related gene sequences, or one amino acid for another in two related protein sequences, compared with how often that pairing would occur by chance.
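The idea behind a substitution score can be sketched as a log-odds ratio: the observed frequency of a residue pair in trusted alignments divided by the frequency expected if the two residues paired by chance. The Python snippet below is a minimal illustration; the function name is hypothetical, and the frequencies are made-up numbers chosen for easy arithmetic, not values from any published matrix.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score; scale=2.0 expresses the score in half-bit units."""
    expected = freq_a * freq_b          # chance of pairing if residues were independent
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies: each residue makes up 10% of all residues.
print(log_odds_score(0.02, 0.1, 0.1))    # pair seen twice as often as chance
print(log_odds_score(0.005, 0.1, 0.1))   # pair seen half as often as chance
```

A positive score marks a pairing seen more often than chance in related sequences, and a negative score marks one seen less often.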
There are numerous variants of these scoring matrices, and most BLAST implementations offer users a choice of which matrix to use. This choice can have a profound effect on the output of aligned sequences. For example, choosing the so-called BLOSUM45 matrix will yield one alignment, while choosing BLOSUM62 will yield another (most of the time). BLOSUM45 is generally more appropriate for sequences from divergent species, whereas BLOSUM62 is generally more appropriate for closely related species (but there are exceptions to the rule!). The bottom line for using BLAST is to investigate the evolutionary relationships for the species under consideration and then choose the most appropriate scoring matrix, not just the default matrix listed on a BLAST web site.
Another point of contention about BLAST is the assignment of gap penalties in a sequence analysis. It is quite rare for two sequences to align perfectly with one another. Usually, there are gaps in one sequence or the other. The way BLAST treats these gaps can significantly affect the outcome of sequence analysis. There are so-called gap penalties for opening up or extending gaps in a sequence in order to get them aligned. The exact values chosen for these gap penalties will usually alter how two gene or protein sequences line up. Although I will not discuss them here, there are many guidelines and rules-of-thumb for assigning gap penalties.
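As a small illustration of how gap penalties shape an alignment score, the Python sketch below scores two already-aligned sequences under an affine scheme, in which opening a gap costs more than extending one. The function name and all numeric values (match, mismatch, gap opening and gap extension) are assumptions chosen for the example, and the convention that a gap of length k costs gap_open + k * gap_extend is one common choice rather than the only one.

```python
def score_alignment(aligned_a, aligned_b, match=5, mismatch=-4,
                    gap_open=11, gap_extend=1):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_in = None                      # which sequence the current gap is in, if any
    for x, y in zip(aligned_a, aligned_b):
        side = "a" if x == "-" else "b" if y == "-" else None
        if side is not None:
            score -= gap_extend        # every gap position pays the extension cost
            if side != gap_in:
                score -= gap_open      # a new gap also pays the opening cost
        else:
            score += match if x == y else mismatch
        gap_in = side
    return score


one_long_gap = score_alignment("AC--GT", "ACTTGT")
two_short_gaps = score_alignment("A-C-GT", "ATCTGT")
print(one_long_gap, two_short_gaps)
```

Both alignments contain four matches and two gap positions, yet the one with a single two-residue gap scores higher than the one with two separate gaps, because the opening cost is paid only once.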
Obviously, the combination of algorithm logic, scoring matrices and gap penalties will often affect how sequences are aligned, and they must be carefully considered in any sequence analysis.
Example of BLAST
As an example of using BLAST we will examine the protein coat (capsid) of a well-known emergent Flavivirus, the West Nile Virus. This virus, which is transmitted by mosquitoes, infects humans, horses and birds. A ribbon diagram of the 3D protein structure of the viral capsid is shown in Fig. 1:
Fig. 1. 3D structure of the protein coat of West Nile Virus.
The amino acid sequence for this protein is:
The letters in this sequence signify individual amino acids. Instead of going into details about this, just note that we will BLAST this protein sequence against a target database. Running BLAST from the NCBI site reveals another protein sequence:
This sequence is associated with the Kunjin Virus. By placing both sequences alongside each other, you get:
This alignment highlights their similarities. It reveals that the Kunjin Virus is closely related to the West Nile Virus. Not surprisingly, the two viruses share similar proteins in their protein coats. BLAST correctly identified this relationship (and others), which could provide a useful starting point for further studies on the nature of viral protein structure and potential vaccines against these viruses.
Where to Find BLAST
If you are interested in using BLAST or its progeny, see the NCBI web site. This web site includes the original BLAST program and its many variants, target databases for BLAST’ing sequences and links to other BLAST web sites.