ABSTRACT:
In this article, we discuss biological sequence alignment and its types. Biological sequence alignment is a fundamental technique used in bioinformatics to compare and analyze DNA, RNA, and protein sequences. It plays a crucial role in understanding the structure, function, and evolution of biological molecules. We also introduce the methods of biological sequence alignment, such as the dot-matrix method, dynamic programming, and word matching, and describe the concepts of homology, sequence similarity, sequence identity, and global and local alignment.
INTRODUCTION:
Sequence alignment refers to the process of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. By aligning sequences, researchers can identify similarities, differences, and patterns that provide valuable insights into genetic relationships, evolutionary history, and protein structure-function relationships.
PURPOSES OF SEQUENCE ALIGNMENT:
Biological sequence alignment serves two main purposes:
• Functional inference
• Evolutionary inference
Functional inference: When we compare protein or DNA/RNA sequences, similarity in sequence suggests similarity in structure and, in turn, in function. If a sequence of unknown function aligns well with a sequence whose function is known, we can infer that the two probably share that function. For example, if a newly sequenced chimpanzee protein aligns closely with a known growth hormone, we can predict that it has a similar structure and plays a similar role.
Evolutionary inference: By comparing the protein or DNA/RNA sequences of different organisms, we can infer how closely those organisms are related. Sequences that are very similar have diverged little, which suggests a recent common ancestor. Sequences with more differences have accumulated more mutations, which suggests that they diverged longer ago.
CONCEPT OF SEQUENCE SIMILARITY AND SEQUENCE IDENTITY:
Sequence similarity refers to the degree of likeness between two aligned sequences. Two aligned residues count as similar if they are identical or if they have similar physicochemical properties, for example two acidic amino acids. Sequence identity refers to the occurrence of exactly the same nucleotide or amino acid at the same position in the aligned sequences. In the case of DNA/RNA, sequence similarity is equal to sequence identity, because a nucleotide either matches or it does not. In proteins, sequence similarity is usually higher than sequence identity, because different amino acids can still be similar to each other.
IN CASE OF DNA/RNA:
{ATCGGCTATACG, ATCGGCTATACG} These two sequences are 100 percent similar and identical.
{ATCGCGTATACG, ATCGCGTATACC} These two sequences differ at one of 12 positions, so they are about 92 percent (11 of 12) similar and identical.
IN CASE OF PROTEINS:
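{VAPHDE, VAPHDV} These two sequences differ only at the last position, where E (glutamic acid, acidic) and V (valine, hydrophobic) are neither similar nor identical.
{VAPHDE, VAPHDD} These two sequences also differ only at the last position, but E (glutamic acid) and D (aspartic acid) are both acidic, so the residues are similar but not identical.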
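The position-by-position comparison behind these examples can be sketched in Python. This is a minimal illustration, not a standard tool: the grouping of amino acids in SIMILAR_GROUPS and the function names is_similar and compare are assumptions made for this sketch. Real programs usually judge similarity with a substitution matrix such as BLOSUM62 rather than with fixed groups.

```python
# Hypothetical grouping of amino acids by side-chain chemistry (an assumption
# for illustration; real tools use substitution matrices such as BLOSUM62).
SIMILAR_GROUPS = [
    set("DE"),     # acidic
    set("KRH"),    # basic
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("STNQ"),   # polar, uncharged
]


def is_similar(a, b):
    """Return True if residues a and b are identical or share a group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def compare(seq_a, seq_b):
    """Count identical and similar positions in two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    identical = sum(a == b for a, b in zip(seq_a, seq_b))
    similar = sum(is_similar(a, b) for a, b in zip(seq_a, seq_b))
    return identical, similar


for pair in [("VAPHDE", "VAPHDV"), ("VAPHDE", "VAPHDD")]:
    identical, similar = compare(*pair)
    print(pair, "identical:", identical, "similar:", similar)
```

With this grouping, the E/V pair adds nothing to either count, while the E/D pair adds one to the similar count but not to the identical count.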
CONCEPT OF HOMOLOGY:
If two sequences are derived from a common ancestor, they are said to be homologous. Homologous sequences are often similar in structure and function, but homology itself is defined by shared ancestry. There are two types of homologous sequences: a) paralogous b) orthologous.
Paralogous: If the homologous sequences are present in the same genome or in the same organism, they are called paralogous. They arise from the duplication of a gene within a genome. For example: alpha hemoglobin and beta hemoglobin in humans.
Orthologous: If the homologous sequences are present in different organisms or in different genomes, they are called orthologous. They arise from speciation (the emergence of new species from an ancestral species). For example: alpha hemoglobin in humans and in chimpanzees.
Orthologous sequences are often more similar to each other than paralogous sequences. The usual explanation is that orthologs tend to keep the same function in each species, so selection limits the mutations they can accumulate. After a duplication, by contrast, one paralogous copy is freer to accumulate mutations and may take on a new function, so paralogs tend to diverge more widely.
HOW TO DETERMINE HOMOLOGY?
We can determine homology with the help of a graph that plots percent sequence identity against alignment length and is divided into three zones: 1. Midnight zone, 2. Twilight zone and 3. Safe zone.
If a pair of aligned sequences falls in the midnight zone, the percent identity is so low that homology cannot be reliably determined from the sequences alone. If the pair falls in the twilight zone (often quoted as roughly 20 to 30 percent identity for proteins), it is ambiguous whether the sequences are homologous or not. If the pair falls in the safe zone, the identity is high enough to infer homology with confidence. The boundaries between the zones depend on alignment length: short alignments can match well by chance, so they need a higher percent identity to reach the safe zone, while long alignments can support homology at a lower percent identity.
FORMULAS TO FIND SIMILARITY AND IDENTITY:
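% S = [(Ls x 2) / (La + Lb)] x 100
% I = [(Li x 2) / (La + Lb)] x 100
Here, S is the sequence similarity, I is the sequence identity, Ls is the number of aligned residues that are similar, Li is the number of aligned residues that are identical, and La and Lb are the total lengths of the two sequences.
Another formula divides by the length of the shorter sequence only:
% I (or % S) = [Li (or Ls) / L shorter] x 100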
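The formulas above can be checked with a short Python sketch. It assumes that Ls and Li are the numbers of aligned similar and identical residues and that La and Lb are the lengths of the two sequences. The function names and the example counts are hypothetical, chosen only to show the arithmetic.

```python
def percent_by_total_length(matches, len_a, len_b):
    """[(L x 2) / (La + Lb)] x 100, where L is Ls (similar) or Li (identical)."""
    return (matches * 2) / (len_a + len_b) * 100


def percent_by_shorter_length(matches, len_a, len_b):
    """(L / length of the shorter sequence) x 100."""
    return matches / min(len_a, len_b) * 100


# Made-up example: sequences of length 100 and 80 with
# 40 identical and 60 similar aligned residues.
len_a, len_b, identical, similar = 100, 80, 40, 60

print(round(percent_by_total_length(similar, len_a, len_b), 1))      # % S
print(round(percent_by_total_length(identical, len_a, len_b), 1))    # % I
print(round(percent_by_shorter_length(identical, len_a, len_b), 1))  # % I, shorter-length form
```

The second form gives a higher value whenever the two sequences differ in length, because the unmatched part of the longer sequence no longer counts against the score.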
TYPES OF BIOLOGICAL SEQUENCE ALIGNMENT:
1. PAIRWISE SEQUENCE ALIGNMENT:
Pairwise sequence alignment is the most basic form of sequence alignment, involving the comparison of two sequences. It aims to identify regions of similarity and dissimilarity between the sequences. The most commonly used algorithms for pairwise alignment are the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. Both are dynamic programming algorithms that guarantee an optimal alignment under a given scoring scheme: the Needleman-Wunsch algorithm finds the optimal global alignment across the full length of both sequences, while the Smith-Waterman algorithm finds the optimal local alignment, which identifies the most similar regions within the sequences.
2. MULTIPLE SEQUENCE ALIGNMENT:
Multiple sequence alignment (MSA) involves aligning three or more sequences simultaneously. MSA is essential for comparing and analyzing sequences from related organisms or identifying conserved regions across a set of sequences. It helps in understanding evolutionary relationships, identifying functional motifs, and predicting protein structures. Popular algorithms for MSA include ClustalW, MUSCLE, and T-Coffee. However, MSA is computationally intensive and becomes more challenging as the number of sequences increases.
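Once a multiple sequence alignment exists, finding conserved regions reduces to inspecting its columns. The sketch below assumes the alignment has already been produced by an MSA tool and is supplied as equal-length strings with "-" for gaps. The function name conserved_columns and the threshold parameter are hypothetical, and the sketch does not perform the alignment itself.

```python
from collections import Counter


def conserved_columns(aligned, threshold=1.0):
    """Return indices of columns whose most common residue reaches the threshold.

    aligned:   equal-length strings from an existing alignment ("-" marks a gap)
    threshold: minimum fraction of sequences that must share the residue
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        column = [seq[col] for seq in aligned]
        residue, count = Counter(column).most_common(1)[0]
        # A column dominated by gaps is not treated as conserved.
        if residue != "-" and count / len(column) >= threshold:
            conserved.append(col)
    return conserved


alignment = ["ATG-CA", "ATGACA", "ATGTCA"]
print(conserved_columns(alignment))
```

Lowering the threshold, for example to 0.8, would also report columns in which most but not all sequences agree.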
3. PROFILE BASED SEQUENCE ALIGNMENT:
Profile-based sequence alignment is an advanced technique that uses a profile or position-specific scoring matrix (PSSM) derived from a multiple sequence alignment. The PSSM represents the conservation of amino acids at each position in the alignment. By aligning a new sequence against the PSSM, researchers can identify conserved regions and predict functional motifs. Profile Hidden Markov Models (HMMs) are also commonly used for profile-based alignment. Tools like PSI-BLAST (which builds a PSSM) and HMMER (which builds profile HMMs) utilize profile-based alignment for protein sequence analysis.
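To make the idea of a PSSM concrete, the following Python sketch builds a simple log-odds matrix from a small gap-free alignment and scores new sequences against it. Several simplifications are assumed here: a uniform background frequency for all 20 amino acids, a pseudocount of 1, and no handling of gaps. The function names build_pssm and score_sequence are hypothetical. PSI-BLAST and HMMER use considerably more elaborate weighting and statistics.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from gap-free aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for symbol in alphabet:
            freq = (column.count(symbol) + pseudocount) / total
            scores[symbol] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-position scores of seq, which must match the PSSM length."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(column[symbol] for column, symbol in zip(pssm, seq))


family = ["VAPHDE", "VAPHDD", "VAPKDE", "IAPHDE"]
pssm = build_pssm(family)
print(score_sequence(pssm, "VAPHDE") > score_sequence(pssm, "WWWWWW"))
```

A sequence that matches the conserved positions of the family scores higher than an unrelated one, which is the property that profile searches exploit.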
4. STRUCTURAL ALIGNMENT:
Structural alignment involves aligning protein sequences based on their three-dimensional structures rather than their primary sequences. It helps in identifying structural similarities and inferring functional relationships between proteins. Structural alignment algorithms, such as DALI and CE, use geometric and statistical methods to align protein structures. It is particularly useful when primary sequence similarity is low but structural similarity is high.
TYPES OF PAIRWISE BIOLOGICAL SEQUENCE ALIGNMENT:
1. GLOBAL ALIGNMENT:
In this method, we align one whole sequence with another whole sequence. We use this method when we want to compare two sequences over their full length, for example two versions of the same gene. In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across their entire length. This method is most applicable for aligning two closely related sequences of roughly the same length. For divergent sequences and sequences of variable lengths, this method may not generate useful results because it fails to recognize highly similar local regions between the two sequences.
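Global alignment can be illustrated with a compact Python sketch of the Needleman-Wunsch dynamic programming approach. The scoring values (match +1, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for simplicity. Practical tools typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

Because the traceback always runs from one corner of the matrix to the other, every residue of both sequences appears in the result, which is what makes the alignment global.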
2. LOCAL ALIGNMENT:
In this method, we align the most similar part of one sequence with the most similar part of another sequence. We use this method when the sequences differ in length or are expected to share only some regions. It does not assume that the two sequences in question have similarity over their entire length. It only finds local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence regions. This approach can be used for aligning more divergent sequences with the goal of searching for conserved patterns in DNA or protein sequences. It is most appropriate for divergent biological sequences in which only certain modules, referred to as domains or motifs, are similar.
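Local alignment can be illustrated with a Python sketch of the Smith-Waterman dynamic programming approach. The scoring values (match +2, mismatch -1, gap -2) and the linear gap penalty are assumptions for this example. The key differences from global alignment are that no cell score may fall below zero and that the traceback starts at the highest-scoring cell and stops at a zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Locally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a new local alignment
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a cell with score zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTTACGTACGTTTT", "GGGACGTACGGG"))
```

In this example only the shared segment is reported and the unrelated flanking residues of both sequences are ignored, which is the behavior the local method is designed for.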
CONCLUSION:
Biological sequence alignment is a powerful tool for understanding the relationships and functions of biological molecules. Pairwise sequence alignment is used to compare two sequences, while multiple sequence alignment allows the comparison of three or more sequences. Profile-based alignment utilizes position-specific scoring matrices derived from multiple alignments, and structural alignment aligns proteins based on their three-dimensional structures. Each type of alignment has its own advantages and applications, contributing to our understanding of genetics, evolution, and protein structure-function relationships. In the next article, we will discuss the algorithms used to align biological sequences.
REFERENCES:
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3), 443-453. https://pubmed.ncbi.nlm.nih.gov/5420325/
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of molecular biology, 147(1), 195-197. https://pubmed.ncbi.nlm.nih.gov/7265238/
Thompson, J. D., Higgins, D. G., & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 2. https://pubmed.ncbi.nlm.nih.gov/7984417/