This article originally appeared on the
“Find all single nucleotide polymorphisms for TP53 gene in human and mouse orthologs with epidemiologic assessments of cisplatin treatments in adenocarcinoma cells.”
“Find all structural homologs of putative N-acetyltransferase SR144 in B. subtilis in SCOP family Acyl-CoA N-acetyltransferase.”
“Find 3-D structure of hypothetical protein UPF0301 in V. cholerae, SWISS-PROT homology models, PFAM/SCOP classifications, and recent PUBMED entries.”
These are difficult tasks posed by biomedical researchers.
If there is any hope of providing answers for them, bioinformaticians will have to provide coherent access to a wide variety of data and information sources. Database federations are one promising solution. These federations act as integration points to bring together biological and biomedical information from widely distributed sources. Let’s briefly examine database federations and their benefits for biomedical research.
What is a Federated Database?
A federated database is a logical association of independent databases that provides a single, integrated, coherent view of all resources in the federation. The federation architecture makes several distinct physical databases appear as one logical database to end users, so data derived from multiple sources can be queried as a unified whole. The data sources for federated systems can include databases (relational or object-based), flat files, text documents, spreadsheets and various other forms of structured and unstructured data. Moreover, the federation can be largely vendor neutral, since databases from almost any vendor, commercial or open source, can be supported (e.g., MySQL, PostgreSQL, Oracle, SQL Server, DB2).
Remote databases (federates) in the system expose a subset of their resources to the federation. These resources can include metadata (database schemas), raw data, summary data, utilities for management of remote data and APIs (application programming interfaces) for direct manipulation of remote data. The union of these exposed resources, in conjunction with a central integration database, constitutes the federation infrastructure. Federates respond to queries from other federation members while remaining largely autonomous from them.
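The relationship between federates and the central integration point can be sketched in a few lines of Python. This is a minimal illustration, not a real federation product: the `Federate` and `Federation` classes, their method names, and the in-memory tables are all hypothetical. It shows a federate exposing only a subset of its tables, and the federation catalog being the union of what each federate exposes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Federate:
    """A remote database that shares only part of its contents (hypothetical model)."""

    name: str
    tables: dict[str, list[dict]]                   # every local table, keyed by name
    exposed: set[str] = field(default_factory=set)  # tables shared with the federation

    def schema(self) -> dict[str, list[str]]:
        """Return column names (metadata) for the exposed tables only."""
        result = {}
        for table in sorted(self.exposed):
            rows = self.tables[table]
            result[table] = sorted(rows[0].keys()) if rows else []
        return result

    def query(self, table: str) -> list[dict]:
        """Answer a federation query; tables that are not exposed stay private."""
        if table not in self.exposed:
            raise PermissionError(f"{self.name} does not expose {table!r}")
        return list(self.tables[table])


class Federation:
    """Central integration point: the union of the resources each federate exposes."""

    def __init__(self) -> None:
        self.federates: dict[str, Federate] = {}

    def register(self, federate: Federate) -> None:
        self.federates[federate.name] = federate

    def catalog(self) -> dict[str, dict[str, list[str]]]:
        """Collect the exposed schemas of all registered federates."""
        return {name: fed.schema() for name, fed in self.federates.items()}
```

In this sketch each federate decides what goes into `exposed`, which is one simple way to model the autonomy described above: the federation can only see and query what the federate has chosen to share.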
Some key features of federated systems include:
- Autonomous data sources
- Heterogeneity of data sources
- Data sources often geographically distributed
- Data sources controlled by independent administrative domains
- Logical integration of distributed datasets
- Coherent, unified, and integrated view of data from multiple resources
- Largely vendor neutral
The federated database system is actually a collection of integrated resources. These resources include remote databases (federates), other remote data sources (typically flat files, spreadsheet files, text documents, etc.), database schemas for exposing remote data to the federation, mediators and wrappers for managing data exchange, a central database and associated applications for managing federation data and, perhaps most importantly, bioinformatics tools for data analysis. The figure below shows a highly schematic view of a typical federated system.
Figure 1: Schematic architecture diagram of a federated database system. Remote data sources (federates) are integrated through a central federation management system. Integrated data is then presented to bioinformatics analytical tools in a coherent, unified interface.
The federation architecture often includes mediators, which are software agents that translate queries from a global format to local formats for specific databases. Mediators thus run queries against distributed, in-situ data, and return the results to a single federated dataset. Some mediators return both metadata (remote database schemas) and the remote data itself, while others return just the metadata, leaving remote data intact.
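As a rough illustration of query translation, the sketch below uses Python's standard `sqlite3` module. The `Mediator` class, the `federated_query` function, and the idea of a per-source `field_map` from global field names to local column names are assumptions made for this example, not part of any specific federation product. Each mediator rewrites a global filter into parameterized SQL for its own source, runs it in place, and returns rows in the global vocabulary.

```python
import sqlite3


class Mediator:
    """Translates a global query into one source's local SQL (hypothetical design)."""

    def __init__(self, conn, table, field_map):
        self.conn = conn            # connection to the local data source
        self.table = table          # local table name
        self.field_map = field_map  # global field name -> local column name

    def translate(self, filters):
        """Build a parameterized local query from filters on global field names.

        Table and column names come from the trusted field map; only the
        filter values are user input, and they are passed as parameters.
        """
        columns = ", ".join(
            f"{local} AS {glob}" for glob, local in self.field_map.items()
        )
        sql = f"SELECT {columns} FROM {self.table}"
        clauses = [f"{self.field_map[glob]} = ?" for glob in filters]
        if clauses:
            sql += " WHERE " + " AND ".join(clauses)
        return sql, list(filters.values())

    def run(self, filters):
        """Execute the translated query and return rows keyed by global names."""
        sql, params = self.translate(filters)
        cursor = self.conn.execute(sql, params)
        names = [column[0] for column in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]


def federated_query(mediators, filters):
    """Send one global query to every source and merge the rows into one dataset."""
    merged = []
    for source, mediator in mediators.items():
        for row in mediator.run(filters):
            merged.append({"source": source, **row})
    return merged
```

With two sources that store the same information under different column names (say `gene_symbol` in one and `gene` in the other), a call such as `federated_query(mediators, {"gene": "TP53"})` would return rows from both, each tagged with its source.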
New data sources can be added quickly to the federation by creating wrappers for those sources. Wrappers are relatively simple parsing algorithms that mediate data exchange between the federation’s data query engines and remote data resources. Bioinformaticians can create wrappers with any convenient programming language, typically Perl, Java, or C++. Almost any type of data source can be wrapped, including relational database systems, flat files, spreadsheet files, Web-based data and numerous bioinformatics applications, such as BLAST and FASTA.
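To give a sense of how small a wrapper can be, here is a sketch in Python of two of them: one for a tab-delimited flat file and one for FASTA-formatted sequence text. The function names and the shape of the returned records (a list of dictionaries tagged with a `source` key) are choices made for this illustration; a real federation would define its own record format.

```python
import csv
import io


def wrap_tsv(text, source):
    """Wrap a tab-delimited flat file: one record per row, keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{"source": source, **row} for row in reader]


def wrap_fasta(text, source):
    """Wrap FASTA text: a '>' line starts a record, the lines after it hold sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            # The first token of the header is used as the record identifier.
            identifier = header[0] if header else ""
            records.append({"source": source, "id": identifier, "sequence": ""})
        elif records:
            # Sequence data may span several lines; join them into one string.
            records[-1]["sequence"] += line
    return records
```

Both wrappers hand back the same kind of record, so the query engine downstream does not need to know whether the data came from a spreadsheet export or a sequence file.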
For example, many of the bioinformatics data sources provided by the Entrez Life Sciences Search Engine (managed by the NIH National Center for Biotechnology Information) are available to federations. The Entrez suite includes databases for genomics, proteomics, pharmacology, and macromolecular structures, as well as biomedical literature citations and a wealth of additional data sources for the biomedical community. With Entrez tools and utilities, which are based on XML schemas and various programming languages (Perl, PHP, Java, etc.), bioinformaticians can build mediators and wrappers to pull data from Entrez into local federations. This is an elegant way of integrating biomedical information for life science research teams.
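One way such a wrapper might look is sketched below in Python. It assumes the NCBI E-utilities ESearch endpoint (`esearch.fcgi`) with its `db`, `term`, `retmax` and `retmode` parameters, which the article does not name explicitly. The helper names are hypothetical, and the network call in `search_entrez` is shown for completeness only. The parser reads the `IdList/Id` elements of an ESearch XML response.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities (an assumption of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL for one Entrez database and search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax, "retmode": "xml"})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def parse_esearch(xml_text):
    """Extract the list of record identifiers from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


def search_entrez(db, term, retmax=20):
    """Fetch matching identifiers over the network (requires Internet access)."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        return parse_esearch(response.read().decode("utf-8"))
```

A wrapper built this way would typically pass the returned identifiers to a second request that retrieves the full records, then map those records into the federation's own format. Anyone deploying it should consult NCBI's current usage guidelines before sending automated queries.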
Issues in Database Federation
As with any integration technology, there are pros and cons to deploying a federated database system. Here are a few issues that should be considered:
- Query performance: With many geographically and administratively distributed data sources contributing to the federation, the query algorithms for federated databases should be carefully designed for optimal performance. Otherwise, end users could experience painfully slow response times to their queries.
- Dependence on autonomous data sources: The federation is highly dependent on data sources that are largely beyond direct control of the federation management system. These sources can change quickly and unpredictably. Thus, bioinformaticians who manage the federation should be capable of rapidly responding to modifications in source systems.
- Scalability: Federated systems can add new data sources relatively quickly and with minimal cost, and are thus reasonably scalable architectures. However, each new source adds to the complexity of the infrastructure, places new demands on network and query performance, and introduces data integrity issues.
- Cost reduction: There can be considerable cost savings from deploying federated database systems. Instead of building large data warehouses, providing the supporting infrastructure for them and migrating data to the warehouses, federated systems leave the data sources intact and simply collate data from them. In general, considerably less infrastructure is required to build federations than warehouses.
- Timeliness: In general, any updates that occur in remote data sources are immediately available to the federation. This is a significant benefit for life science researchers, as current and timely information is always available through the federation.
- Schema evolution: Database schemas in remote data sources may evolve and change quickly, without warning or communication about the modifications. Bioinformaticians managing the federation should react quickly to schema evolution in federates and implement processes to quickly adapt the federation infrastructure to those changes.
- Technical skills: Depending on the scale and complexity of the federated architecture, considerable expertise may be required to design and manage federated database systems. With the growing popularity of federations, the life sciences and information technology communities must offer more training opportunities in this technology.
- Data replication: The federation approach minimizes or eliminates data replication. Instead of copying remote source data and storing it in a centralized database, federation servers simply read remote data and integrate it in a federation database. If replication is needed to improve query performance, it remains an option for select data elements.
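The schema evolution issue in the list above lends itself to a small automated check. The Python sketch below is one possible approach, assuming for simplicity that the remote source is reachable as a SQLite database; the function names are hypothetical. It takes a snapshot of table and column names and compares it with the snapshot the federation last recorded, so that unannounced changes surface as a report rather than as failing queries.

```python
import sqlite3


def snapshot_schema(conn):
    """Record the current schema as {table name: set of column names}."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = {row[1] for row in info}
    return schema


def schema_drift(expected, observed):
    """Compare two snapshots and list the columns added or removed per table."""
    report = {}
    for table in sorted(set(expected) | set(observed)):
        old = set(expected.get(table, ()))
        new = set(observed.get(table, ()))
        if old != new:
            report[table] = {
                "added": sorted(new - old),
                "removed": sorted(old - new),
            }
    return report
```

Running a check like this on a schedule would let the federation's managers learn about a changed federate before end users do. An empty report means the mediators and wrappers for that source still match its schema.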
Examples of Federated Database Systems
The number of federated databases is growing steadily in the life sciences community. These are a few notable examples:
- The Comparative Mouse Genomics Centers Consortium (CMGCC) at the National Institutes of Health (NIH) supports a federated database system for managing mouse genetic information. The federation provides an integrated, coherent view of multiple, heterogeneous, geographically distributed database systems, which are hosted at several university research centers and managed through a bioinformatics working group.
- The Cell Centered Database, another NIH supported system and part of the Biomedical Informatics Research Network (BIRN), is a federated database system that provides an integrated view of cellular imaging data for biomedical research.
- The Structural Proteomics in the Northeast (SPINE) project, sponsored by the Northeast Structural Genomics Consortium, uses a federated system of resources for protein structure determination. Participating universities in the consortium provide data resources to the federation, which are then integrated into a single coherent bioinformatics interface for use by all consortium members.
Federated database systems are rapidly becoming a mainstream approach for providing integrated resources to the life sciences community. Today is definitely an exciting time for bioinformaticians with an interest in promoting this technology.