A goal of systems biology research is the statistical modeling of molecular interactions so that the effects of perturbations on the networks can be predicted with confidence. Prior to statistical modeling, considerable descriptive or qualitative information about the networks is typically acquired. Sources for this information include:
- Pre-existing knowledge from the literature and curated databases: This might include biochemical pathways, annotated genomes, known protein complexes, or gene ontology tables.
- Large-scale computational analyses: This might include gene prediction programs, transcription factor binding site predictions, location of genome-wide repetitive elements, or protein structure predictions.
- Findings from large-scale wet lab data collection endeavors: This might include tables based on microarrays, yeast two-hybrid screens, or genome sequencing.
One promise of the systems biology approach is that large-scale experimentally or computationally derived datasets, when integrated with pre-existing knowledge, will reveal profound new insights about a molecular network. Unfortunately, the result may instead be an increased level of confusion.
The difficulty is that many of the datasets used by systems biologists are error prone, leading to both false positives and false negatives. Some of the experimental error is due to technical irreproducibility and some to biological features, such as the heterogeneity of gene expression levels in cell populations, even under well-controlled conditions. Computational analyses based on faulty assumptions, improper statistical validation, or incomplete datasets can also be misleading. Nor is the pre-existing biological literature perfect. At any given time, biological conclusions are drawn within a context from which key features might well be missing because they have not yet been discovered or conceptualized. Imagine innate immunity without Toll receptors or adaptive immunity without T-cells.
A major technical challenge for systems biology is developing procedures for handling error so that legitimate interpretations can be made from the available data and knowledge base. Some of this can be accomplished by improving the bioinformatics pipelines for handling data, adopting standards, and having internal controls and benchmarks to gauge data quality. Statistical programs that model error also are helpful.
Systems biologists also approach error another way. They assume that integration across diverse data types will mitigate the effects of sporadic or even systematic error from any one data type. For example, data from gene expression arrays, ChIP-chip arrays and proteomics analysis may converge on a limited set of genes that might be involved in a regulatory network.
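The convergence idea described above can be illustrated with a minimal sketch. It assumes each data type has already been reduced to a set of candidate genes. The function name `genes_supported_by`, the dictionary keys, and the gene names are hypothetical placeholders, not output from any real pipeline or experiment.

```python
from collections import Counter


def genes_supported_by(evidence, min_sources=2):
    """Return the genes reported by at least `min_sources` data types.

    `evidence` maps a data-type label to a collection of candidate genes.
    """
    # Count each gene once per data type, however often it is listed there.
    counts = Counter(gene for genes in evidence.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_sources}


# Hypothetical candidate lists (placeholder names, not real results).
evidence = {
    "expression_array": {"geneA", "geneB", "geneC", "geneD"},
    "chip_chip": {"geneB", "geneC", "geneE"},
    "proteomics": {"geneC", "geneD", "geneF"},
}

all_three = genes_supported_by(evidence, min_sources=3)
at_least_two = genes_supported_by(evidence, min_sources=2)
print(sorted(all_three))
print(sorted(at_least_two))
```

Requiring agreement across more data types trades false positives for false negatives: a gene missed by one noisy assay drops out of the strictest list, so the threshold is itself a modeling choice.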
Thus, a major focus of systems biology is to develop computational tools that facilitate data integration. Some of this can be done through comprehensive databases such as the Systems Biology Experimental Analysis Management System (SBEAMS), a database being developed in-house at the ISB for managing and retrieving the large datasets produced by genomics and proteomics methodologies. The idea is to allow the user to combine retrieval of targeted sets of experimental data with information from public databases such as KEGG pathways (http://www.genome.jp/kegg/kegg2.html).
Graphical interfaces for portraying and visualizing interaction data also are popular with systems biologists (see Cytoscape). Components of a network, such as genes or proteins, are portrayed as nodes, and the interactions between the components as edges. Different types of interactions are overlaid on the same network graph, allowing an integrated view. The results of perturbation experiments can be compared using plug-in applications that allow changes in gene expression to be visualized as variations in color or as smaller or larger circles for the nodes. The behavior of the network over time can be visualized as a series of "snapshots". Although descriptive, these types of graphical models are useful for suggesting hypotheses because they help the researcher to organize the data and see patterns within it. Their utility relies upon the validity of the data extraction procedures and clustering algorithms that underlie the graphical interfaces.
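As a rough illustration of the node-and-edge representation, the sketch below stores typed interactions in one graph and maps expression changes to node colors for a series of snapshots. It is plain Python and does not use or imitate any Cytoscape API. The class `Network`, the function `node_color`, the threshold, the color scheme, and all gene names and values are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """Undirected interaction graph with a type label on every edge."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, target, kind) tuples

    def add_interaction(self, source, target, kind):
        self.nodes.update((source, target))
        self.edges.append((source, target, kind))

    def neighbors(self, node, kind=None):
        """Partners of `node`, optionally restricted to one interaction type."""
        result = set()
        for source, target, edge_kind in self.edges:
            if kind is not None and edge_kind != kind:
                continue
            if source == node:
                result.add(target)
            elif target == node:
                result.add(source)
        return result


def node_color(log2_fold_change, threshold=1.0):
    """Map an expression change to a display color (arbitrary scheme)."""
    if log2_fold_change >= threshold:
        return "red"    # induced
    if log2_fold_change <= -threshold:
        return "green"  # repressed
    return "grey"       # little or no change


# Two interaction types overlaid on the same graph (placeholder names).
net = Network()
net.add_interaction("geneA", "geneB", "protein-protein")
net.add_interaction("geneA", "geneC", "protein-DNA")

# Hypothetical log2 fold changes at two time points after a perturbation.
snapshots = {
    "t0": {"geneA": 0.0, "geneB": 0.2, "geneC": -0.1},
    "t1": {"geneA": 1.8, "geneB": 0.4, "geneC": -1.5},
}
colors = {
    time: {gene: node_color(value) for gene, value in values.items()}
    for time, values in snapshots.items()
}
print(colors["t1"])
```

Keeping the interaction type on the edge rather than in separate graphs is what lets one view be filtered by type, as in `net.neighbors("geneA", kind="protein-DNA")`.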
In the long run, descriptive network models are not ideal. Dependency relationships are not represented easily, nor is kinetic behavior, especially non-linear responses in the measurement of the network's key components as the network is perturbed. The stochastic variation in measurements obtained from heterogeneous cell populations also must be considered. Quantitative statistical modeling of biological networks is a key area of needed technology development. Ultimately, mathematically-based models will be used to simulate a network's responses to genetic or environmental perturbations, and thereby allow the hypothesis-validation phase of experimentation to proceed in a more rational and directed fashion. To get to this desired outcome, formalistic rules governing the state changes occurring in dynamic networks need to be articulated and tested in a wide variety of biological systems. This science is in its infancy but constitutes an essential component of systems biology research.
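To make the contrast with descriptive models concrete, here is a toy quantitative sketch of a non-linear response to perturbation. It assumes a single target gene B whose production is driven by an activator A through a Hill function, with first-order degradation: dB/dt = beta * A^n / (k^n + A^n) - gamma * B. The model form, the function names, and every parameter value are illustrative assumptions rather than measurements, and the sketch is deterministic, so it ignores the stochastic variation mentioned above.

```python
def hill(x, k, n):
    """Hill activation function: fraction of maximal production at level x."""
    return x**n / (k**n + x**n)


def simulate_target(activator, beta=1.0, gamma=0.1, k=1.0, n=4,
                    dt=0.01, t_end=200.0):
    """Forward-Euler integration of dB/dt = beta*hill(A) - gamma*B from B = 0.

    The activator level is held constant, so B relaxes toward the steady
    state beta * hill(A) / gamma.
    """
    b = 0.0
    for _ in range(int(t_end / dt)):
        b += dt * (beta * hill(activator, k, n) - gamma * b)
    return b


# Simulated "knockdowns": each step halves the activator level.
for level in (2.0, 1.0, 0.5):
    print(f"A = {level:.1f} -> B = {simulate_target(level):.3f}")
```

Because the Hill term is sigmoidal, successive halvings of the activator do not produce proportional changes in the target, which is the kind of non-linear behavior a static graph cannot express.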