On the use of suffix trees in multiple sequence alignment

COMP 156

Kal Renganathan Sharma, jyoti_kalpika@yahoo.com, Formerly Principal, Mathur Post, Anna University - Sakthi Engineering College, Mount, Ramapuram (opp. MGR Gardens), Chennai, 602105, India
The sequencing of the 3 billion basepairs in human genome, was completed two years ahead of schedule. Three mammalian genomes have been sequenced in its entirety. Projections of the lifespan of an individual increasing to 200 years have been drawn up given the information available from these projects. Disease mechanism models and its relation with gene expression and the therapeutic action and the alteration of the gene expression can be used to devise cures for every human disease there is by the year 2050 according to the inventors of biochip and microarray analyses. The time taken to align two complete human genome sequences by the current methods of dynamic programming using a 1 GHz computer can run to over 60 years and run in half-a day on the fasted 20 Terra flop supercomputer. The scope is immense in the area of seeking global alignment in less time. VLDB, Very Large Database storage can be improved using approaches such as the Parallel Disk Model. Representing sequence information in the form of suffix tree taps into the inherent microstructure of the polynucleotide or polypetide. An approximate alignment either for two or sequences or for k sequences can be obtained in O(n) time. The error bounds are calculated and compared with the optimal solution obtained in O(n^2) time for two sequences by the Smith and Waterman algorithm. Multiple sequence optimal alignment problem has been shown to be NP complete. Hence the O(n) approximate solution may have applications in protein family alignment, study of repetitive sequences. ALU repeats for example, occurs 600,000 times in human genome.