dbBLAST


The extensible and algorithmic functionality embedded in modern DBMSs is not widely recognised, in particular the opportunity to extend DBMSs with new specialised datatypes and to implement algorithms and searches inside database systems as 'stored procedures'. The key problem is that these individual components, while potentially powerful, lack the integration required to optimise their support of bioinformatics databases. Once these components are integrated, DBMSs are an obvious platform for bioinformatics databases.

In our dbBLAST project, we aim to improve the performance of sequence alignment searches in sequence databases. Our idea is to use the facilities of today's database systems to process huge amounts of data efficiently.

The Basic Local Alignment Search Tool (BLAST) is one of the central algorithms in bioinformatics. Its purpose is to find, for a given gene or protein sequence, the best matches ('local alignments') in a large collection of already known protein or nucleotide sequences. The algorithm employs dynamic programming techniques and typically runs as a standalone program that parses and processes huge text files holding lists of nucleotide sequences. Typical sequence collections are on the order of several hundred megabytes, a magnitude that should clearly benefit from state-of-the-art database management technologies.
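To make the notion of a 'local alignment' concrete, the following minimal Python sketch computes the best local alignment score of two sequences with the classic Smith-Waterman dynamic programming recurrence. It is an illustration only, not the dbBLAST code: BLAST itself is a faster heuristic that approximates this exact computation, and the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    # Only the previous row of the DP matrix is needed to compute the next one.
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0: aligning against an empty prefix scores 0
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            # Clamping at 0 lets an alignment start anywhere: this is what
            # makes the alignment local rather than global.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The running time is proportional to the product of the two sequence lengths, which is why exhaustive dynamic programming over a collection of several hundred megabytes is costly and why heuristic searches such as BLAST are used instead.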

We have implemented the standard BLAST algorithm using state-of-the-art stored procedure languages in several relational database management systems and evaluated its performance characteristics. This work includes developing an appropriate physical design for the sequence database and identifying which parts of the BLAST algorithm can take advantage of the set processing capabilities of SQL.
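As a rough illustration of what set processing can mean for a BLAST-style search, the sketch below expresses the word-matching ('seeding') step as a single SQL join. Everything in it is an assumption made for the example: the table names (sequence, word_index, query_word), the word length, and the use of SQLite through Python's sqlite3 module as a stand-in for a server DBMS with stored procedures. It does not reproduce the physical design or the stored procedures developed in the project.

```python
import sqlite3

WORD_SIZE = 4  # assumed word length, kept small for illustration


def kmers(seq, k=WORD_SIZE):
    """Yield (position, word) for every overlapping k-letter word of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(conn, sequences):
    """Load sequences and their word index into hypothetical tables."""
    conn.execute("CREATE TABLE sequence (seq_id INTEGER PRIMARY KEY, residues TEXT)")
    conn.execute("CREATE TABLE word_index (word TEXT, seq_id INTEGER, pos INTEGER)")
    conn.execute("CREATE INDEX word_idx ON word_index (word)")
    for seq_id, residues in sequences.items():
        conn.execute("INSERT INTO sequence VALUES (?, ?)", (seq_id, residues))
        conn.executemany(
            "INSERT INTO word_index VALUES (?, ?, ?)",
            [(word, seq_id, pos) for pos, word in kmers(residues)],
        )


def find_seeds(conn, query):
    """Return (seq_id, diagonal, hits) rows computed by one set-oriented join."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS query_word (word TEXT, pos INTEGER)")
    conn.execute("DELETE FROM query_word")
    conn.executemany(
        "INSERT INTO query_word VALUES (?, ?)",
        [(word, pos) for pos, word in kmers(query)],
    )
    # Word hits on the same diagonal (subject position minus query position)
    # belong to the same ungapped region, so they are grouped and counted.
    return conn.execute(
        """
        SELECT w.seq_id, w.pos - q.pos AS diagonal, COUNT(*) AS hits
        FROM query_word AS q
        JOIN word_index AS w ON w.word = q.word
        GROUP BY w.seq_id, w.pos - q.pos
        ORDER BY hits DESC, w.seq_id, diagonal
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {1: "GGACGTACGG", 2: "TTTTTTTT"})
    print(find_seeds(conn, "ACGTAC"))
```

The join lets the database engine use its index on the word column to find all word hits at once, instead of scanning a text file sequence by sequence. Extending the seeds into full alignments is a row-by-row computation and is a more natural fit for procedural code.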

We are currently working on:

  • designing bio-datatypes for the storage of biodata in relational databases;
  • tightly integrating biodata and bioinformatics algorithms, using .NET-based stored procedures; and
  • creating novel methods to efficiently parallelise bioinformatics algorithms, using a database cluster.

The Team

This project would not have been possible without the contributions of our project students:
Chun-Wu Chen, Thanh-Mai Diep, Alexander Bolodurin, Sujit George, and Harshana Randeni.

Acknowledgements

This work is funded by Microsoft Research Bay Area and the Australian Research Council as part of the ARC Linkage project LP0669685.

Publications

Uwe Röhm and Thanh-Mai Diep. How to BLAST your Database - A Comparison Study of Stored Procedures for BLAST Searches. In: Proceedings of the 11th International Conference on Database Systems and Advanced Applications (DASFAA'2006), 12-15 April, Singapore, 2006.

Chun-Wu Chen and Uwe Röhm. A Service-oriented Approach for the Parallelization of Data-intensive Algorithms in a Grid-enabled Cluster. In: Proceedings of the First International Workshop on Biomedical Data Engineering (BMDE) in conjunction with ICDE2005, Tokyo, Japan, April 3-4, pages 22-29, 2005.