General Technique
The Gaggle takes a minimalist approach to integrating data and software. It is written in Java and uses standard Java libraries. It is simple to install and easy to update, and new data sources and software tools can be added at minimal implementation cost. A small server program (the "Gaggle Boss") relays messages among analysis and display programs (the "geese"), which are modest adaptations of existing or novel bioinformatics and computational biology programs and web resources. The Boss and the geese all run as separate programs on the user's desktop computer and communicate with each other, at the user's behest, by passing simple messages.
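The Boss-and-geese arrangement described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the Gaggle's actual code: the names `Goose`, `Boss`, `RecordingGoose`, `handleNameList`, and `broadcast` are hypothetical, and everything runs inside one JVM for brevity, whereas the real Boss and geese run as separate programs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface for illustration; not the Gaggle's real API.
interface Goose {
    String getName();
    void handleNameList(String source, List<String> names);
}

// Hypothetical Boss: a registry of geese plus a message relay.
class Boss {
    private final Map<String, Goose> geese = new LinkedHashMap<>();

    void register(Goose goose) {
        geese.put(goose.getName(), goose);
    }

    // Relay a list of names from one goose to every other registered goose.
    void broadcast(String source, List<String> names) {
        for (Goose goose : geese.values()) {
            if (!goose.getName().equals(source)) {
                goose.handleNameList(source, names);
            }
        }
    }
}

// A stand-in goose that only records what it receives.
class RecordingGoose implements Goose {
    private final String name;
    final List<String> received = new ArrayList<>();

    RecordingGoose(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }

    @Override
    public void handleNameList(String source, List<String> names) {
        received.add(source + " sent " + names);
    }
}

class BossSketch {
    public static void main(String[] args) {
        Boss boss = new Boss();
        RecordingGoose viewer = new RecordingGoose("DMV");
        RecordingGoose network = new RecordingGoose("Cytoscape");
        boss.register(viewer);
        boss.register(network);

        // The user selects a gene in one goose and broadcasts it to the others.
        boss.broadcast("DMV", Arrays.asList("fliG"));
        System.out.println(network.received);
    }
}
```

In this sketch the Boss knows nothing about what a name means; it only relays the message, which is what keeps the cost of adding a new goose low.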
In the Gaggle, semantic flexibility¹ (the notion that "word meanings are not fixed and unchanging, but tend to vary according to the context of their use") is seen as the solution to the complexities of data integration, rather than as a problem that must be solved before integration can begin. For example, the gene name "fliG" identifies
- a flagellar motor switch protein in three KEGG (5) pathway maps.
- a node in a Cytoscape (6) association network.
- a row in a matrix to a microarray data viewer.
- a set of PubMed (7) abstracts to a literature search tool.
The biological semantics attached to the gene name in each of these environments are rich, significant, and different. But in the Gaggle's approach to software and data integration, no formal mapping and no explicit integration are needed. It suffices to pass the unadorned gene name to each environment, where in each case a different web of meanings is invoked. It is in the biologist's mind, rather than in an external semantic map, that the meanings may coalesce and combine.
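The idea that one unadorned name invokes a different meaning in each environment can be illustrated with a small Java sketch. It is an assumption-laden toy, not the Gaggle's implementation: the class `SemanticFlexibilitySketch`, its methods, and the placeholder strings standing in for each environment's data are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class SemanticFlexibilitySketch {

    // Each environment interprets the bare name with its own private logic.
    // The returned strings are placeholders for real lookups (assumption).
    static Map<String, Function<String, String>> environments() {
        Map<String, Function<String, String>> envs = new LinkedHashMap<>();
        envs.put("KEGG", name -> "pathway maps containing the protein " + name);
        envs.put("Cytoscape", name -> "node " + name + " in an association network");
        envs.put("Microarray viewer", name -> "matrix row for " + name);
        envs.put("Literature search", name -> "PubMed abstracts mentioning " + name);
        return envs;
    }

    // Pass the same unadorned name to every environment; no shared schema
    // or mapping between environments is consulted.
    static Map<String, String> interpret(String geneName) {
        Map<String, String> meanings = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, String>> e : environments().entrySet()) {
            meanings.put(e.getKey(), e.getValue().apply(geneName));
        }
        return meanings;
    }

    public static void main(String[] args) {
        interpret("fliG").forEach((env, meaning) ->
                System.out.println(env + " -> " + meaning));
    }
}
```

The only thing the environments share here is the string itself; each lambda could be replaced by an arbitrarily rich lookup without the others needing to change.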
The Gaggle does not, however, preclude the use of applications and data repositories that are built upon, and offer the benefits of, careful semantic mapping. The Gaggle's KEGG goose demonstrates this: KEGG's carefully curated semantic mappings (of gene name to metabolic pathways, biochemical reactions, cellular structures, DNA sequences, protein functions, and orthology groups) are obtained by passing simple gene names to the goose. As systems biology matures, we predict that many more such semantically rigorous resources will become available, and that they, too, will be easy to add to the Gaggle using this same approach. Similarly, large-scale efforts such as SBML (8) and BioPAX (9) are complementary to, and not competitive with, the Gaggle. Given the heterogeneity of systems biology, we think it unlikely that a single unifying language or scheme will emerge. Valuable work will continue to be done in more or less restricted domains, and semantic flexibility will always be required to integrate them.
Purpose/use/application of the technique:
The Gaggle integrates the diverse data types, data sources, and software tools used in systems biology. Among the current "geese":
- DMV: The DataMatrixViewer is used for navigating and selecting from experiments (microarray, ChIP-chip, proteomics), and for displaying and plotting their numerical data (Reference: Johnson et al.).
- Cytoscape, with assorted plugins: This is used for viewing protein-protein interactions, protein-DNA interactions, association networks, biclusters, and Inferelator results.
- TIGR's MeV: This is a popular tool for microarray analysis.
- The R Goose: This provides full access to R and its packages, including Bioconductor.
- Simple Bioinformatics Web Browser: This provides easy access to web-based bioinformatics resources, e.g., KEGG, EMBL's STRING, BioCyc.
Example(s) of projects at ISB that use this technique:
The Baliga laboratory's effort to decipher the gene regulatory circuit of Halobacterium relies upon the Gaggle.
Ongoing area of technology development:
Future work on the Gaggle includes the addition of new geese (for Robetta and GO), the ability to save state, and possible integration with a web-based collaboration tool.
Representative publication(s):
The Gaggle: An open-source software system for integrating bioinformatics software and data sources
Paul T Shannon, David J Reiss, Richard Bonneau, and Nitin S Baliga
BMC Bioinformatics 2006, 7:176 doi:10.1186/1471-2105-7-176.
G. A. Stolovitzky, A. Kundaje, G. A. Held, K. H. Duggar, C. D. Haudenschild, D. Zhou, T. J. Vasicek, K. D. Smith, A. Aderem, and J. C. Roach. Statistical analysis of MPSS measurements: Application to the study of LPS-activated macrophage gene expression. PNAS, February 2005.