Connect the Dots is a general data integration tool focused on translating identifiers between public biological databases. The software includes nearly 40 parsers for large and small public datasets, as well as for many Affymetrix chip annotation files. A simple query language allows the user to create translation tables programmatically. The results of queries can be stored in database tables or exported to several formats, including XML.
A recurrent problem in bioinformatics is the need to translate or connect identifiers for the same or related entities based on information stored in databases or other data sources. A simple example is translating among the many identifiers for a human gene, such as its HUGO-approved official name and symbol; its many unapproved alternate names and aliases; and its accession numbers in the National Center for Biotechnology Information (NCBI), Ensembl, and other gene databases. A more complex example is connecting Affymetrix probeset IDs, Agilent probe names, and gene identifiers for probesets and probes that represent the same gene. Another example is linking a gene identifier to the PubMed IDs of publications about the gene, using information in various gene databases, OMIM, or SWISSPROT.
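The probeset-to-gene example can be sketched as a join of two mapping tables on a shared identifier. The following Python snippet is only an illustration of the idea under stated assumptions: it is not the Connect the Dots query language or its Perl API, and the identifiers, table names, and the functions build_index and translate are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy mappings. In practice these pairs would come from
# parsed public datasets; the identifiers here are made up.
PROBESET_TO_GENE_ID = [
    ("probeset_A", "gene_1"),
    ("probeset_B", "gene_1"),
    ("probeset_C", "gene_2"),
]
GENE_ID_TO_SYMBOL = [
    ("gene_1", "SYM1"),
    ("gene_2", "SYM2"),
    ("gene_2", "SYM2_ALIAS"),
]


def build_index(pairs):
    """Group (source, target) pairs into a source -> set of targets index."""
    index = defaultdict(set)
    for source, target in pairs:
        index[source].add(target)
    return index


def translate(pairs_ab, pairs_bc):
    """Join A->B and B->C mappings on the shared B identifier.

    Returns a sorted list of (A, C) rows. Many-to-many relationships
    are preserved: one A may map to several C values and vice versa.
    """
    b_to_c = build_index(pairs_bc)
    rows = set()
    for a, b in pairs_ab:
        # .get avoids creating empty entries for unmatched identifiers
        for c in b_to_c.get(b, ()):
            rows.add((a, c))
    return sorted(rows)


if __name__ == "__main__":
    # Print the resulting translation table as tab-separated rows.
    for probeset, symbol in translate(PROBESET_TO_GENE_ID, GENE_ID_TO_SYMBOL):
        print(f"{probeset}\t{symbol}")
```

Keeping the mappings as sets of pairs, rather than one-to-one dictionaries, reflects the fact that a gene usually has several aliases and several probesets; a translation table therefore needs to hold many-to-many rows.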
Numerous websites exist that solve specific instances of this problem. In addition, bioinformatics programmers frequently craft special-purpose programs to connect identifiers. We approach the problem differently and have developed a software system that offers a more general solution. Instead of focusing directly on the needs of end-users, Connect the Dots provides capabilities that bioinformatics programmers can use to create translation services for specific purposes. We also provide an interactive website for end-users who wish to translate up to a few hundred identifiers at a time using translation tables that we have pre-built.
Connect the Dots divides the problem into four phases:
The software is open source and licensed under the GNU General Public License or Perl Artistic License.
The software is written in Perl and uses the PostgreSQL relational database system.
Funded by: Juvenile Diabetes Research Foundation
Example(s) of projects at ISB that use this technique:
T1DBase, HDBase, GDxBase, Cytoscape
Ongoing area of technology development:
Connect the Dots is under continual development because we use it with our other software. Short-term development plans include a means of comparing the version of each data source loaded in the Connect the Dots database with the latest public version, and the ability to download that version and update the database automatically.
Representative publication(s):
Available on CPAN under the name Bio::ConnectDots::ConnectDots. CPAN is the standard repository for publicly available Perl modules.
