Systems biologists collect large quantities of data from wet lab experiments, high-throughput platforms, and public database resources in order to probe the biological processes they study. Technologies must therefore be in place for extracting useful information from these data. These technologies include:
- Laboratory Information Management Systems (LIMS)
- Bioinformatics pipelines
- Database frameworks
A LIMS is used to manage laboratory workflow, track samples through a multi-step data collection protocol, perform quality control, monitor supply usage, and capture costs. The goal of a LIMS is to ensure that the best possible data are collected, with minimal errors due to sample mix-up or insufficient maintenance of the equipment. Commercial providers of LIMS exist, but each group has to tailor the procedures to its own laboratory context.
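As a rough illustration of sample tracking, the sketch below records each sample's progress through a fixed protocol and rejects steps performed out of order. The step names, the `Sample` class, and the sample identifier are hypothetical and are not taken from any particular LIMS product.

```python
from dataclasses import dataclass, field

# Hypothetical protocol; a real laboratory would define its own steps.
PROTOCOL_STEPS = ["received", "extracted", "quality_checked", "sequenced"]


@dataclass
class Sample:
    """A tracked sample and the protocol steps it has completed so far."""
    sample_id: str
    completed_steps: list = field(default_factory=list)

    def advance(self, step: str) -> None:
        """Record a step, refusing any step that is out of protocol order."""
        if len(self.completed_steps) == len(PROTOCOL_STEPS):
            raise ValueError(f"{self.sample_id}: protocol already complete")
        expected = PROTOCOL_STEPS[len(self.completed_steps)]
        if step != expected:
            raise ValueError(
                f"{self.sample_id}: expected step {expected!r}, got {step!r}"
            )
        self.completed_steps.append(step)


sample = Sample("TISSUE-001")  # hypothetical identifier
sample.advance("received")
sample.advance("extracted")
```

Enforcing step order at the point of data entry is one simple way to catch a skipped step before it affects downstream data.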
Bioinformatics pipelines are used to convert "raw" data into data types that researchers can use for analyses. Data that are considered "raw" from one perspective might be rather "cooked" from another, so a wide variety of bioinformatics pipelines are used to collect, extract, store, and interpret data at several different levels of analysis.
For example, consider a DNA sequence. One pipeline extracts base-calls from samples run on the automated sequencing machine. This involves measuring fluorescent dye signals; calculating signal/noise ratios; presenting a graphical view of the sequence read ("chromatogram"); and producing a sequence summary (a file of As, Cs, Gs and Ts). Another pipeline produces a genomic sequence from individual sequence reads. In this pipeline, reads are assembled into longer sequence "contigs" based on overlaps among the reads, and a consensus sequence is determined for each contig based on analysis of the quality of the read data. Several additional bioinformatics pipelines are applied to the DNA sequence to extract features such as genes; transposable elements; GC content; CpG islands; transcription factor binding sites; or conserved sequence regions across different species.
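The overlap-based assembly step described above can be sketched with a toy greedy merge. This is a minimal illustration, not how production assemblers work: it assumes error-free reads on a single strand, ignores base quality scores (so no real consensus calling), and the function names and example reads are invented for the sketch.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap: int = 3):
    """Repeatedly merge the pair of sequences sharing the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining pair overlaps by at least min_overlap
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy reads, made up for illustration.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
contigs = greedy_assemble(toy_reads)
```

With these three toy reads, each adjacent pair shares four bases, so the merge yields a single contig.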
Thus, wet lab-driven bioinformatics pipelines are needed to convert output produced by a machine, such as a DNA sequencer or mass spectrometer, into a biologically useful input such as a sequence read, gene expression profile table, or peptide composition analysis. Some of these pipelines involve a rather arcane and complex series of data-processing steps. For example, the National Heart, Lung and Blood Institute-funded Seattle Proteome Center (NHLBI/SPC) has established an extensive proteomics pipeline for extracting peptide composition information from mass spectrometry data (see http://tools.proteomecenter.org/software.php).
Steps in this pipeline include:
- Input processing/Data handling
- Probability Assignment and Validation
- Protein Quantification
- Protein ID Curation
Once data have been transferred from the bench to the computer, an array of data analysis pipelines is mobilized to produce tables and graphical representations of features the systems biologist considers meaningful or significant in the context of inquiry. Some of these pipelines are securely grounded in established knowledge (e.g., creating a predicted restriction digest fingerprint from a DNA sequence) whereas others are on the cutting edge of theoretical development (e.g., predicting cis-regulatory elements from a DNA sequence). Developing effective data analysis pipelines thus constitutes an area of systems biology research.
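To make the restriction digest example concrete, the sketch below predicts fragment lengths for a linear DNA sequence cut by EcoRI, which recognizes GAATTC and cuts between the G and the first A. The `digest` function and the toy sequence are invented for illustration; the sketch assumes a linear molecule and complete digestion, and it scans only the given strand (sufficient here because the site is palindromic).

```python
def digest(sequence: str, site: str, cut_offset: int):
    """Return fragment lengths from cutting a linear sequence at every site.

    `cut_offset` is the number of bases into the recognition site at which
    the enzyme cuts the given strand.
    """
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts G^AATTC

# Toy sequence with two EcoRI sites, made up for illustration.
toy_sequence = "AAGAATTCCCCCGAATTCTT"

# The "fingerprint" is the list of fragment sizes, largest first,
# in the order the bands would appear on a gel.
fingerprint = sorted(digest(toy_sequence, ECORI_SITE, ECORI_CUT_OFFSET), reverse=True)
```

For this toy sequence the cuts fall at positions 3 and 13, giving fragments of 10, 7, and 3 bases.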
Given the large number of different data types that systems biologists handle, a daunting number of bioinformatics pipelines must be built, imported, maintained, and replaced when superseded by improved procedures. Public resources such as BioPerl (see http://bio.perl.org/) are available to make the task easier. Open source bioinformatics and computational biology software allows each laboratory to customize applications to its own research context.
Database frameworks serve to store data, allow data access by query, and, in some cases, facilitate data curation. Using a basic framework such as MySQL, the database is populated with resource material that is relevant for a given context of investigation. For example, the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu) allows access to a wealth of features derived from the genome sequences of multiple vertebrate and invertebrate species. Researchers in comparative genomics both contribute to it and use it heavily. The UCSC genome browser is also incorporating data being produced by numerous laboratories for the purpose of systematically discovering and understanding all of the functional elements that exist within about 1 percent of the human genome (the "ENCODE" project; http://www.genome.gov/10005107). Although concerned primarily with genomics, the genome browser exemplifies how useful the combination of a database and effective visualization tools can be for facilitating data integration.
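The store-and-query role of a database framework can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server; the `feature` table, its columns, and the example rows are all hypothetical and do not reflect the UCSC genome browser schema.

```python
import sqlite3

# In-memory database used as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE feature (
        id INTEGER PRIMARY KEY,
        chrom TEXT NOT NULL,
        start_pos INTEGER NOT NULL,
        end_pos INTEGER NOT NULL,
        feature_type TEXT NOT NULL,
        name TEXT NOT NULL
    )
    """
)

# Made-up annotation rows for illustration only.
rows = [
    ("chr1", 1000, 5000, "gene", "geneA"),
    ("chr1", 1200, 1300, "cpg_island", "cpg1"),
    ("chr1", 8000, 9500, "gene", "geneB"),
    ("chr2", 1000, 2000, "gene", "geneC"),
]
conn.executemany(
    "INSERT INTO feature (chrom, start_pos, end_pos, feature_type, name) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def features_in_region(connection, chrom, region_start, region_end):
    """Return (name, feature_type) for features overlapping the region."""
    cursor = connection.execute(
        "SELECT name, feature_type FROM feature "
        "WHERE chrom = ? AND start_pos < ? AND end_pos > ? "
        "ORDER BY start_pos",
        (chrom, region_end, region_start),
    )
    return cursor.fetchall()


hits = features_in_region(conn, "chr1", 1100, 1500)
```

Storing different feature types in one table keyed by genomic coordinates is one simple way to let a single region query integrate several kinds of annotation.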
At the ISB, we are developing SBEAMS (Systems Biology Experiment Analysis Management System) as a platform for managing data derived primarily from microarray and proteomics experiments (http://www.sbeams.org/project_description.php). Within the SBEAMS framework, each investigator first stores and manages the data unique to his or her experiment. The experimental data products are loaded into the database, and an automated pipeline processes the raw data into gene expression measures with data quality estimates, or into protein matches with quality scores. The investigator may then use the SBEAMS built-in tools, or custom scripts built on top of the framework, to correlate the experimental results with the experimental conditions and to interpret the results further. Investigator annotations are also captured in the database for later analysis and correlation with other experiments.
Several challenges face the development of useful biological databases. These include agreeing on the standards that apply to each data type, finding efficient ways to store and access large datasets within the database, integrating different types of data, improving query capabilities, and, not to be underestimated, recruiting and satisfying a group of users. The sociological issues around data management can be as challenging as the technical aspects.