 Gaggle - Seamless Data Integration
Goals of the project

There are two main goals of this project. The first is to ensure a flexible and extensible data schema that faithfully represents experimental design. This allows information regarding genetic makeup, culture condition, perturbation, etc. to be stored in a machine-readable format for use with downstream data integration and processing. The second is to pass these data to the software programs that are used to interrogate them. Our focus in this project is on scientific data standardization and management for seamless integration and exploration.

Primary methodologies/approaches/strategies used to accomplish the goals

A. Data Standards

An important consideration in all systems biology projects is large-scale data integration and analysis in the context of experimental design. Unfortunately, as systems biology datasets keep increasing in both size and number, a purely descriptive approach to tracking experimental design is not effective when the data are analyzed with unsupervised algorithms. This is the motivation for developing data standards that help digitize experimental design.

We have devised an XML (eXtensible Markup Language) data standard schema for representing global data sets such as:

  • microarray ratios
  • protein levels
  • protein-protein interactions
  • protein-DNA interactions

The power of this data schema is in its simplicity: easy extensibility to accommodate new types of information and intuitive graphical interfaces that help experimental biologists to record their experimental designs. This schema is also supported in graphical viewers developed as part of the Gaggle project.
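
As a rough illustration of what a machine-readable experimental design record makes possible, the sketch below parses a small XML fragment with Python's standard library. The element and attribute names (`experiment`, `condition`, `variable`, `alias`, `units`) and the sample values are invented for this example; they are assumptions, not the actual Gaggle schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: tag and attribute names are illustrative assumptions,
# not the real schema described in the text.
SAMPLE = """
<experiment name="example">
  <condition alias="uv_60min">
    <variable name="genetic background" value="wild type"/>
    <variable name="perturbation" value="UV radiation"/>
    <variable name="time" value="60" units="minutes"/>
  </condition>
  <condition alias="control_60min">
    <variable name="genetic background" value="wild type"/>
    <variable name="perturbation" value="none"/>
    <variable name="time" value="60" units="minutes"/>
  </condition>
</experiment>
"""

def load_conditions(xml_text):
    """Return {condition alias: {variable name: (value, units)}}."""
    root = ET.fromstring(xml_text.strip())
    design = {}
    for cond in root.findall("condition"):
        variables = {}
        for var in cond.findall("variable"):
            # 'units' is optional, so var.get returns None when it is absent.
            variables[var.get("name")] = (var.get("value"), var.get("units"))
        design[cond.get("alias")] = variables
    return design

design = load_conditions(SAMPLE)
```

Once the design is held in a structure like `design`, downstream code can select conditions by perturbation or time point instead of relying on free-text descriptions.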

B. Seamless Integration and Exploration

Systems biology datasets are of different types, dimensions and formats. To complicate matters further, many data are stored in databases at various locations worldwide. To fully comprehend the information in these large datasets, they need to be integrated, explored and visualized simultaneously in a seamless manner. This was our motivation for developing the Gaggle platform.

The Gaggle is a collection of software tools for exploring the diverse range of systems biology data. Each software tool can be operated independently and is useful for exploring a specific type of data. However, the benefits of these software tools are most evident when they are connected and operated together. The 'Gaggle' is our name for this ensemble of loosely coupled programs, and for the underlying communication protocol.

We first discuss the separate programs, beginning with a few network viewers, each of which is a Cytoscape plugin tailored to a specific type of data. Following this we introduce the DataMatrixViewer, the 'R Goose', and finally the Gaggle Boss, which links all the programs together.

Bicluster & Inferelator Viewer
This Cytoscape plugin is customized to display the results of cMonkey and the Inferelator. Biclustered genes and conditions appear as single aggregated nodes, with edges to regulator nodes (transcription factors). A variety of attributes (e.g. bicluster p values, residuals, COG (Clusters of Orthologous Groups) annotations, and metabolic pathway relationships) are included, and graphical user interface (GUI) controls are provided to filter and otherwise navigate through the data.

IP viewer
This Cytoscape plugin displays the results of immunoprecipitation (IP) experiments, with visual cues to protein probability and peptide count for the individual protein-protein bindings. A variety of user interface controls are provided, for example to:

  • hide edges identified in control experiments,
  • hide edges above or below a probability threshold, or
  • apply minimum or maximum peptide count thresholds.
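
The filtering controls listed above can be sketched as a single function over a list of interaction edges. This is a minimal sketch under stated assumptions: the `Interaction` record, its field names, the `visible_edges` function, and the sample bait and prey names are all hypothetical and are not taken from the plugin itself.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    # Hypothetical edge record; field names are assumptions for illustration.
    bait: str
    prey: str
    probability: float
    peptide_count: int
    seen_in_control: bool

def visible_edges(edges, hide_control=True, min_probability=0.0,
                  max_probability=1.0, min_peptides=0, max_peptides=None):
    """Return the edges that pass every active filter."""
    kept = []
    for e in edges:
        if hide_control and e.seen_in_control:
            continue  # edge also appeared in a control experiment
        if not (min_probability <= e.probability <= max_probability):
            continue  # outside the probability window
        if e.peptide_count < min_peptides:
            continue  # too few peptides
        if max_peptides is not None and e.peptide_count > max_peptides:
            continue  # too many peptides
        kept.append(e)
    return kept

edges = [
    Interaction("baitA", "preyB", 0.95, 12, False),
    Interaction("baitA", "preyC", 0.40, 3, False),
    Interaction("baitA", "preyD", 0.99, 20, True),
]
strong = visible_edges(edges, min_probability=0.9, min_peptides=5)
```

With these made-up values, only the `baitA`/`preyB` edge survives: `preyC` fails both thresholds and `preyD` is hidden because it was seen in a control.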

ChIP-chip viewer
This Cytoscape plugin is a genome viewer specialized for the display and exploration of ChIP-chip* data, in the topological context of the genome map. Using the standard visual mapping technique offered by Cytoscape with the overlay of Transcription Factor Binding Site (TFBS) enrichment values, the binding of a transcriptional regulator to gene promoter regions can be studied. This is accomplished using the Gaggle and the DataMatrixViewer running in movie mode. User interface controls make it easy to navigate these data and study evidence for protein-DNA binding upstream of genes.

* Chromatin Immunoprecipitation followed by microarray chip analysis. The procedure is used for experimentally mapping protein-DNA interactions.

DataMatrixViewer (DMV)
This Java program provides many ways to load, display, and explore numerical data from high-throughput experiments (microarray, proteomics, ChIP-chip, MPSS [massively parallel signature sequencing]). Two kinds of related data are used: numerical matrices (in which there is typically a row for every measured gene and a column for every experimental condition) and metadata (which captures the details of each experimental condition; see above). The DMV provides a spreadsheet view, as well as a number of plotting and simple selection tools. When used in the Gaggle, the DMV has a 'movie mode' in which the per-column values are broadcast for every gene, one column at a time, creating an animated display of experimental measurements in any of the Cytoscape plugins listed above. The DMV also provides the user interface controls for selecting subsets or combinations of your experimental data. When 'gaggled', the DMV can export any selection of data to the 'R Goose' (which is discussed below) for many kinds of statistical analysis.
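
The idea behind movie mode, stepping through a gene-by-condition matrix one column at a time, can be sketched in a few lines. The function name `movie_frames`, the gene and condition labels, and the numbers are all made up for illustration; the DMV itself is a Java program and this is not its code.

```python
def movie_frames(gene_names, condition_names, matrix):
    """Yield (condition, {gene: value}) for one column at a time.

    `matrix` has one row per gene and one column per condition.
    """
    for col, condition in enumerate(condition_names):
        frame = {gene: row[col] for gene, row in zip(gene_names, matrix)}
        yield condition, frame

# Made-up example data: three genes measured under three conditions.
genes = ["geneA", "geneB", "geneC"]
timepoints = ["t0", "t30", "t60"]
ratios = [
    [0.0, 0.8, 1.5],
    [0.1, -0.4, -1.2],
    [0.0, 0.1, 0.0],
]

frames = list(movie_frames(genes, timepoints, ratios))
```

Each frame holds one value per gene, which is the shape a network viewer would need in order to recolor its nodes for that condition.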

The R Goose
The 'R Goose' makes the powerful exploratory data analysis environment 'R' conveniently available within the Gaggle. Many kinds of data (experimental measurements, network associations, selected gene names) can be painlessly passed to R (http://www.r-project.org; http://stats.math.uni-augsburg.de/JGR), where limitless possibilities exist for data manipulation, display, and analysis. Results are easily passed back to any other Gaggle program.

The Gaggle Boss
The Gaggle Boss allows the user to monitor and control the programs currently running in the Gaggle. Cytoscape windows, the DMV, or the R Goose may be hidden or brought into view, and they may be instructed to listen to, or ignore, data broadcast by other 'geese'. The Boss may optionally have plugins, appearing as separate 'tabs', which (for example) provide access to databases, and shared notepads in which a lab may keep notes of hypotheses or questions. Behind the scenes, the Boss provides the mechanism, built upon Java RMI (Remote Method Invocation), by which the geese communicate with each other.
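
The relay role of the Boss can be illustrated with a small in-process sketch. It replaces Java RMI with plain method calls, and the class names, method names, and payload are hypothetical; it shows only the listen/ignore routing idea, not the actual Gaggle protocol.

```python
class Goose:
    """A stand-in for any Gaggle program that can receive broadcasts."""

    def __init__(self, name):
        self.name = name
        self.listening = True
        self.received = []

    def handle(self, source, payload):
        self.received.append((source, payload))


class Boss:
    """Relays each broadcast to every other goose that is listening."""

    def __init__(self):
        self.geese = {}

    def register(self, goose):
        self.geese[goose.name] = goose

    def set_listening(self, name, listening):
        self.geese[name].listening = listening

    def broadcast(self, source, payload):
        for name, goose in self.geese.items():
            # Never echo back to the sender; skip geese told to ignore.
            if name != source and goose.listening:
                goose.handle(source, payload)


boss = Boss()
viewer, dmv, r_goose = Goose("network viewer"), Goose("DMV"), Goose("R Goose")
for goose in (viewer, dmv, r_goose):
    boss.register(goose)

boss.set_listening("R Goose", False)        # R Goose ignores broadcasts
boss.broadcast("DMV", ["geneA", "geneB"])   # made-up gene selection
```

Routing everything through one relay keeps the individual programs loosely coupled: each one only needs to know how to reach the Boss, not how to reach every other program.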

Future directions of this project
Besides improving visualization capabilities and user interfaces, an important future goal for this project is to build intelligent query tools that tease out, from the extensive and diverse data, novel biological insights that can be experimentally verified. Another future goal is to enable archiving of the Gaggle environment that led to a hypothesis, so that the environment can be recreated for future reference when that hypothesis is evaluated.

Group members involved with Project

Nitin Baliga (PI)
Paul Shannon
Michael Johnson

Representative publication(s):

Baliga, N.S., Bjork, S.J., Bonneau, R., Pan, M., Iloanusi, C., Kottemann, M.C.H., Hood, L., and DiRuggiero, J. (2004) Systems Level Insights Into the Stress Response to UV Radiation in the Halophilic Archaeon Halobacterium NRC-1. Genome Res 14: 1025-1035.

Bonneau, R., Baliga, N.S., Deutsch, E.W., Shannon, P., and Hood, L. (2004) Comprehensive de novo structure prediction in a systems-biology context for the archaea Halobacterium sp. NRC-1. Genome Biol 5: R52.

Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498-2504.
