The greatest challenge facing the molecular biology community today is to make sense of the wealth of data that has been produced by the genome sequencing projects. Traditionally, molecular biology research was carried out entirely at the experimental laboratory bench, but the huge increase in the scale of data being produced in this genomic era has created a need to incorporate computers into the research process.
Sequence generation, and its subsequent storage, interpretation and analysis, are entirely computer-dependent tasks. However, the molecular biology of an organism is very complex, with research being carried out at different levels including the genome, proteome, transcriptome and metabolome. Following the explosion in the volume of genomic data, similar increases in data have been observed in the fields of proteomics, transcriptomics and metabolomics.
The first challenge facing the bioinformatics community today is the intelligent and efficient storage of this mass of data. It is then their responsibility to provide easy and reliable access to this data. The data itself is meaningless before analysis and the sheer volume present makes it impossible for even a trained biologist to begin to interpret it manually. Therefore, incisive computer tools must be developed to allow the extraction of meaningful biological information.
There are three central biological processes around which bioinformatics tools must be developed:
- DNA sequence determines protein sequence
- Protein sequence determines protein structure
- Protein structure determines protein function
The integration of information learned about these key biological processes should allow us to achieve the long term goal of the complete understanding of the biology of organisms.
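As an illustration of the first process (DNA sequence determines protein sequence), the following is a minimal Python sketch that translates a coding DNA sequence into a protein sequence using the standard genetic code. The function name `translate` and the choice to stop at the first stop codon are assumptions made for this example; they do not correspond to any particular EBI tool.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# (third base varies fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, reading frame 1, up to the first stop codon.

    Codons containing characters other than A, C, G, T are reported as 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGA"))  # prints "MA"
```

The other two processes (sequence to structure, structure to function) have no comparably simple rule, which is why they depend on the database searches and analysis tools described in the rest of this article.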
Biological databases
Biological databases are archives of consistent data that are stored in a uniform and efficient manner. These databases contain data from a broad spectrum of molecular biology areas. Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and protein structures and DNA and protein expression profiles.
Secondary or derived databases are so called because they contain the results of analysis of the primary resources, including information on sequence patterns or motifs, variants and mutations, and evolutionary relationships. Information from the literature is contained in bibliographic databases, such as Medline.
It is essential that these databases are easily accessible and that an intuitive query system is provided to allow researchers to obtain very specific information on a particular biological subject. The data should be provided in a clear, consistent manner with some visualisation tools to aid biological interpretation.
Specialist databases have been set up for particular subjects, for example the EMBL database for nucleotide sequence data, the UniProtKB/Swiss-Prot protein database and PDBe, a 3D protein structure database.
Scientists also need to be able to integrate the information obtained from the underlying heterogeneous databases in a sensible manner to get a clear overview of their biological subject. SRS (Sequence Retrieval System) is a powerful querying tool provided by the EBI that links information from more than 150 heterogeneous resources.
Biological applications
Once all of the biological data is stored consistently and is easily available to the scientific community, the requirement is then to provide methods for extracting the meaningful information from the mass of data. Bioinformatic tools are software programs that are designed to carry out this analysis step.
Factors that must be taken into consideration when designing these tools are:
- The end user (the biologist) may not be a frequent user of computer technology
- These software tools must be made available over the internet given the global distribution of the scientific research community
The EBI provides a wide range of biological data analysis tools that fall into the following four major categories:
The first category covers homology and similarity searching. Homologous sequences are sequences that are related by divergence from a common ancestor. The degree of similarity between two sequences can be measured, whereas homology is either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
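To make the distinction concrete, similarity is a number that can be computed, while homology is a conclusion drawn from it. The sketch below computes percent identity between two sequences that have already been aligned. The function name `percent_identity`, the use of `-` as the gap character and the decision to skip gapped columns are all assumptions for this example; real search tools use scoring matrices and statistical significance rather than raw identity.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return the percentage of identical residues over ungapped alignment columns.

    Both inputs must be rows of the same alignment, so they must have equal length.
    Columns where either sequence has a gap are not counted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1

    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("AC-GT", "ACTGA"))  # prints 75.0
```

A high value suggests, but does not prove, that two sequences share a common ancestor; no identity threshold is implied here.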
The second category covers protein function analysis. This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.
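As a rough illustration of motif matching, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. The helper names `prosite_to_regex` and `find_motif` and the example sequence are hypothetical, and the conversion handles only the common syntax elements (`x`, `[...]`, `{...}`, `(n)` or `(n,m)` repeats, and `<` / `>` anchors outside brackets). The pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by hyphens
        element = element.replace("<", "^").replace(">", "$")   # sequence anchors
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        parts.append(element)
    return "".join(parts)


def find_motif(pattern, protein):
    """Return (1-based start, matched text) for every hit, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]


hits = find_motif("N-{P}-[ST]-{P}", "MKNATSWNPSAANGTV")
print(hits)  # prints [(3, 'NATS'), (13, 'NGTV')]
```

A short pattern like this matches by chance quite often, which is why the text above stresses highly significant hits rather than any hit.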
The third category covers structural analysis. This set of tools allows you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. The determination of a protein's 2D/3D structure is crucial in the study of its function.
The fourth category covers sequence analysis. This set of tools allows you to carry out further, more detailed analysis on your query sequence, including evolutionary analysis and the identification of mutations, hydropathy regions, CpG islands and compositional biases. The identification of these and other biological properties provides clues that aid the search to elucidate the specific function of your sequence.
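Two of the properties mentioned above, compositional bias and CpG content, can be sketched with simple counting. The example below computes the GC fraction and the observed/expected CpG ratio of a DNA sequence, where the expected count assumes C and G occur independently. The function name `cpg_stats` is an assumption for this example, and no cut-off values are applied; tools that call CpG islands typically slide a window along the sequence and apply thresholds on length, GC content and this ratio.

```python
def cpg_stats(dna):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Observed/expected is computed as (count of 'CG' * length) / (count of C * count of G).
    Returns 0.0 for the ratio when the sequence contains no C or no G.
    """
    seq = dna.upper()
    length = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact

    gc_fraction = (c_count + g_count) / length if length else 0.0
    if c_count and g_count:
        obs_exp = (cpg_count * length) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


print(cpg_stats("CGCGCGCG"))  # prints (1.0, 2.0)
```

In practice this would be applied to successive windows of a longer sequence so that locally CpG-rich regions stand out from the background.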
Credit: EBI
Source: http://abc.cbi.pku.edu.cn/