ABSTRACT:
In this article, we will discuss about the pairwise alignment algorithm for biological sequencing. In the field of bioinformatics, sequence alignment plays a crucial role in comparing and analyzing biological sequences such as DNA and protein sequences. Alignment algorithms facilitate the identification of similarities, differences, and patterns within sequences, providing valuable insights into the evolutionary relationships, functional annotations, and structural features of biological macromolecules. We will describe the dot-matrix method, dynamic programming and word match method in detail to understand the biological sequencing.
INTRODUCTION OF PAIRWISE ALIGNMENT ALGORITHM FOR BIOLOGICAL SEQUENCING:
Pairwise alignment algorithms aim to optimize the alignment of two or more sequences such that they share as many matching characters as possible. These algorithms employ various techniques to compare sequences and determine the optimal alignment based on specific scoring criteria. The result of an alignment algorithm usually presented in the form of a visual representation, known as a sequence alignment, which highlights the matched and mismatched regions of the sequences.
TYPES OF PAIRWISE ALIGNMENT ALGORITHM FOR BIOLOGICAL SEQUENCING:
1. DOT MATRIX METHOD-PAIRWISE ALIGNMENT ALGORITHM:
The most basic sequence alignment method is the Dot Plot method. Graphical way of comparing two sequences. Two dimensional way of comparison. In a dot matrix, two sequences to be compared written in the horizontal and vertical axes of the matrix. The comparison done by scanning each residue of one sequence for similarity with all residues in the other sequence. If a residue match found, a dot placed within the graph. Otherwise, the matrix positions left blank. When the two sequences have substantial regions of similarity, many dots line up to form contiguous diagonal lines, which reveal the sequence alignment. If there are interruptions in the middle of a diagonal line, they indicate insertions or deletions. Parallel diagonal lines within the matrix represent repetitive regions of the sequences.
ADVANTAGES OF DOT MATRIX METHOD:
A direct visual statement of the relationship between two sequences. Identification of Sequence Repeat Regions. This method has some applications in genomics. Identifying Chromosomal Repeats and in comparing gene order conservation between two closely related genomes. It can also be used in identifying nucleic acid secondary structures through detecting self-complementarity of a sequence. The dot matrix method displays all possible sequence matches. However, it is often up to the user to construct a full alignment with insertions and deletions by linking nearby diagonals.
DISADVANTAGES OF DOT MATRIX METHOD:
It lacks statistical rigor in assessing the quality of the alignment. The method is also restricted to pairwise alignment. It is difficult for the method to scale up to multiple alignment. The following are examples of web servers that provide pairwise sequence comparison using dot plots. Dot-matcher, Dott-up and dot helix.
2. DYNAMIC PROGRAMMING METHOD-PAIRWISE ALIGNMENT ALGORITHM:
Dynamic programming is a method that determines optimal alignment by matching two sequences for all possible pairs of characters between the two sequences. It is fundamentally similar to the dot matrix method in that it also creates a two dimensional alignment grid. A more quantitative way by converting a dot matrix into a scoring matrix to account for matches and mismatches between sequences. By searching for the set of highest scores in this matrix, the best alignment can be accurately obtained.
PROCEDURE OF DYNAMIC PROGRAMMING METHOD:
Dynamic programming works by first constructing a two-dimensional matrix whose axes are the two sequences to compare. The residue matching is according to a particular scoring matrix. The scores calculated one row at a time. This starts with the first row of one sequence, which used to scan through the entire length of the other sequence. Scanning of the second row. The matching scores calculated. The scanning of the second row takes into account the scores already obtained in the first round. The best score put into the upper right corner of an intermediate matrix. This process is incomplete until values for all the cells filled. Thus, the scores accumulated along the diagonal going from the lower left corner to the upper right corner.
Once the scores have accumulated in matrix, the next step is to find the path that represents the optimal alignment. This is done by tracing back through the matrix in reverse order from the upper right-hand corner of the matrix towards the origin of the matrix in the lower left-hand corner.
The best matching path is the one that has the maximum total score. If two or more paths reach the same highest score, one chosen arbitrarily to represent the best alignment. The path can also move horizontally or vertically at a certain point, which corresponds to introduction of a gap.
DYNAMIC PROGRAMMING IN GLOBAL SEQUENCE ALIGNMENT:
The classical global pairwise alignment algorithm using dynamic programming is the Needleman–Wunsch algorithm. In this algorithm, an optimal alignment obtained over the entire lengths of the two sequences. It must extend from the beginning to the end of both sequences to achieve the highest total score. In other words, the alignment path has to go from the upper right corner of the matrix to the lower left corner. The drawback of focusing on getting a maximum score for the full-length sequence alignment is the risk of missing the best local similarity. This strategy is only suitable for aligning two closely related sequences that are of the same length. For divergent sequences or sequences with different domain structures, the approach does not produce optimal alignment. One of the few web servers dedicated to global pairwise alignment is GAP.
DYNAMIC PROGRAMMING IN LOCAL ALIGNMENT:
In regular sequence alignment, the divergence level between the two sequences to be aligned is not easily known. The sequence lengths of the two sequences may also be unequal. In such cases, identification of regional sequence similarity may be of greater significance than finding a match that includes all residues. The first application of dynamic programming in local alignment is the Smith–Waterman algorithm. In this algorithm, positive scores are assigned for matching residues and zeros for mismatches. No negative scores are used. This approach maybe suitable for aligning divergent sequences or sequences with multiple domains that may be of different origins. Most commonly used pairwise alignment web servers apply the local alignment strategy, which include SIM, SSEARCH, and LALIGN.
CONCLUSION:
The developments in sequencing technologies have enabled unprecedentedly fast sequencing speeds and large-scale sequencing capabilities. The increasing number of sequences are challenging the automated sequence analysis procedures. Sequence alignment is one of the basic tasks in the processing of biological sequences, and the accuracy of alignment affects the subsequent analyses. Phylogenetics, comparative genomics, and protein structure and function prediction all depend on sequence alignment to look for conserved regions.
REFERENCES:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9024764/
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0030123
https://www.sciencedirect.com/science/article/pii/S0888754317300551
Bowie JU, Lüthy R, Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 1991 Jul 12;253(5016):164–170. https://pubmed.ncbi.nlm.nih.gov/1853201/
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443–453. https://pubmed.ncbi.nlm.nih.gov/5420325/