Code of conduct to help integrated approach to bioinformatics

At last a code of conduct for bioinformatic data providers has been formulated. As Adrian Fergus explains, this will help the industry handle its ever increasing volumes of complex and often interrelated data.

Bioinformatics: "A mechanism for acquisition, processing, structured analysis and storage of biological data."

If the above is accepted as a satisfactory definition, then the creation of a single solution for bioinformatics for even a modest-sized life science institution appears to be a formidable task, especially given the sheer volume of data and its exponential growth, the complexity of biological data sets, and the multitude of different data types.

Slowly, we (and I mean the global life sciences community along with the vendor industry here) are recognising that we cannot begin to solve the whole challenge of bioinformatics in one fell swoop. If we are to satisfy the role of bioinformatics, 'enabling the integration of biological data with knowledge resources from other domains', there's a viewpoint gathering support that we must look to an integrated approach where we borrow, adapt, and reuse existing solutions wherever possible.

Furthermore, with the emergence of a new computing architecture known as web services, there is now a technological platform that could potentially deliver the integration and accessibility of biological data.

There are a number of key business-scientific drivers behind the requirement for an integrated approach to bioinformatics:

* The data deluge.

* Data islands.

* Differing data requirements.

* A multiplicity of data types.

* 'A land of city states?'

* Knowledge from data.

The drivers

The growth of GenBank, in terms of DNA sequence data alone, parallels the growth of Internet nodes globally over the past 20 years. The Internet community recognised early that the development of standards for data exchange, node naming conventions and related low-level protocols was required to allow graceful handling of an explosion in the number of nodes.

The Internet was designed from the beginning to be robust, scalable and accessible. Even so, there is now considerable consternation over having to deal with the constraints imposed by the 32-bit IPv4 addressing scheme. The bottom line is that it is very difficult to plan for changes in data volume of several orders of magnitude, but the underlying development must at least be scalable.

Each different field of bioscience is producing its own data in isolation from the others. However, a sea change is anticipated. In spite of a traditional reputation for insularity, researchers recognise that no discipline exists in isolation from the others; rather, there is interaction and feedback between and among each of the levels. Bioinformatics must rise to the challenge of providing a unified view of the data behind the interactions. New tools are required that transcend the primary data storage and analysis.

Organism-specific researchers generate and use data locally, annotate the data as required and use it to answer very specific experimental questions. However, this data is often shared with the global community, which requires access to large data sets to address questions that may be of little interest to the original producers of the data. The tools and data formats used at these two levels may be very different. At the lowest level, laboratory information management systems (LIMS) allow information management and data processing of interest to individual laboratories. At the global level, however, there are agreements, standards, and protocols that allow data to be shared between researchers. An ideal integrated bioinformatics solution should allow these different levels to translate and exchange data efficiently and seamlessly.

The modern researcher is faced with a multiplicity of data types, relating to, for example, MS, 2D gels, 3D molecular structures, microarrays, and DNA sequences. An integrated solution is required to allow comparison and interpretation of this data. To take a proteomics example, a researcher may have 2D gel data that is associated with MS data and peptide sequence data for a protein but have no single picture of the whole pipeline of knowledge. Instead, the researcher is only able to view the data types in isolation.

Lincoln Stein of the Cold Spring Harbor Laboratory recently delivered a keynote presentation* at the O'Reilly Bioinformatics Technology Conference, Tucson, Arizona, in which he compared the current status of bioinformatics to that of Italy in the Middle Ages.

The Italian city-states were a disparate group with different legal and political systems, dialects, cultures, weights and measures, taxation, and currencies. Even though Italy had brilliant thinkers and scientists, its technological and industrial development lagged because of the difficulty in overcoming these differences. Stein argued that today's bioinformatics data providers are also suffering from too many differences and that these differences are hindering the advancement of science. "We see a lot of fragmentation in the landscape of data providers," he said, "and each of these data sites has its own view of the world."

Bioinformatics databases such as NCBI, EnsEMBL, FlyBase, SGD, WormBase, and UCSC are all providing relevant data, but unfortunately they are using a wide range of different systems and formats. Stein stressed that there is a clear need for a more integrated approach to bioinformatics.

We do not collect data for its own sake. If it is to have any value, we must derive knowledge from it. An integrated approach is required to derive knowledge from the myriad biological data sources. A LIMS is increasingly seen as the tool that can offer a central repository for bioinformatic data while also providing integration with lab robotics and instruments, management of information on assays, reagents, and protocols, and a means of exchanging data and integrating with other applications and databases. Using Thermo's Nautilus LIMS as an example, Fig. 1 illustrates the pivotal role of a LIMS in the management of bioinformatic data.

A bioinformatics code of conduct

Lincoln Stein and others recently proposed a code of conduct for biological data providers, with the intention of allowing easy integration of bioinformatics resources. It provides a good framework within which suppliers and customers can progress their work, together with a means of developing relationships for the good of the bioinformatics community. The six tenets can be summarised as follows:

* Reuse existing code: make use of open source resources, for example code for parsing BLAST reports or the NCBI toolkit code library.

* Use existing data formats: avoid 'reinvention of the wheel', for example for sequences, microarrays, and 3D structures.

* Design sensible data formats: use simple new data formats where no existing format exists, and avoid proprietary binary data types.

* Interfaces represent contracts: the interface between an application and the data source is a formal agreement between the data provider and the data consumer, and must be well documented.

* Encourage choice: give data consumers a choice when designing interfaces. Examples include COM, delimited text files, HTML, CORBA and SOAP-XML.

* Support ad hoc queries: data providers should recognise that customers will use the data in unexpected ways, and should support ad hoc queries.
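To make the 'existing formats', 'contracts' and 'choice' tenets a little more concrete, here is a minimal Python sketch. The record type, its field names and the export function are all hypothetical and chosen purely for illustration; they are not taken from the code of conduct itself. The idea is simply that one documented function serves the same record in whichever format the consumer asks for.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Hypothetical record type; the field names are illustrative only."""
    accession: str
    organism: str
    sequence: str


def to_fasta(rec: SequenceRecord) -> str:
    """Reuse an existing format: a FASTA header line followed by the sequence."""
    return f">{rec.accession} {rec.organism}\n{rec.sequence}\n"


def to_tsv(rec: SequenceRecord) -> str:
    """Delimited text: one tab-separated row with columns in a fixed order."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(
        [rec.accession, rec.organism, rec.sequence]
    )
    return buf.getvalue()


def to_xml(rec: SequenceRecord) -> str:
    """XML: the same three fields as child elements of <record>."""
    root = ET.Element("record")
    for name in ("accession", "organism", "sequence"):
        ET.SubElement(root, name).text = getattr(rec, name)
    return ET.tostring(root, encoding="unicode")


# The set of supported formats is part of the documented contract.
EXPORTERS = {"fasta": to_fasta, "tsv": to_tsv, "xml": to_xml}


def export(rec: SequenceRecord, fmt: str) -> str:
    """Same record, consumer's choice of format; unknown formats are rejected."""
    try:
        return EXPORTERS[fmt](rec)
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
```

A consumer who wants FASTA calls export(record, "fasta"), while another can ask for "tsv" or "xml" without the provider changing anything on its side.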

Database federations and data warehouses have traditionally been used to integrate disparate data sources. However, the advent of Web Services is offering new possibilities.

A database federation can have a global (federation) schema that provides users with a uniform view of the federation and thus insulates them from the component databases, or local views that provide users with multiple views of the federation.
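As a rough illustration of the global-schema idea, the following Python sketch treats two in-memory lists as stand-ins for component databases. The source names, field names and the Federation class are all assumptions made up for this example. Each source keeps its own local field names, and a small mapping function translates rows into the global schema at query time, so nothing is copied.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

# Two hypothetical component databases with different local field names.
LAB_A: List[Row] = [
    {"acc": "P1", "name": "kinase A"},
    {"acc": "P2", "name": "protease B"},
]
LAB_B: List[Row] = [{"protein_id": "P3", "description": "kinase C"}]


def from_lab_a(row: Row) -> Row:
    """Map a LAB_A row onto the global schema (accession, description)."""
    return {"accession": row["acc"], "description": row["name"]}


def from_lab_b(row: Row) -> Row:
    """Map a LAB_B row onto the global schema (accession, description)."""
    return {"accession": row["protein_id"], "description": row["description"]}


class Federation:
    """Presents a uniform view; the data stays in the component sources."""

    def __init__(self) -> None:
        self._sources: List[Tuple[List[Row], Callable[[Row], Row]]] = []

    def register(self, rows: List[Row], mapper: Callable[[Row], Row]) -> None:
        self._sources.append((rows, mapper))

    def query(self, keyword: str) -> List[Row]:
        """Case-insensitive keyword search, evaluated against live sources."""
        needle = keyword.lower()
        hits: List[Row] = []
        for rows, mapper in self._sources:
            for row in rows:
                mapped = mapper(row)
                if needle in mapped["description"].lower():
                    hits.append(mapped)
        return hits
```

Because the query runs against the sources themselves, a row added to a component database is visible to the next federated query without any reload step.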

A data warehouse represents the materialisation of a global schema, ie the warehouse database, defined by the global schema, is loaded periodically with data from the component databases. It organises disparate databases into a data warehouse with or without a common schema.
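The periodic load can be sketched in a few lines of Python using the standard sqlite3 module. The table layout, source names and field names below are hypothetical, and a real warehouse would of course load from real component databases rather than Python lists. The point is that the global schema is materialised as an actual table, which is refreshed only when the load step runs.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical component sources with different native layouts.
SOURCE_A: List[Tuple[str, str]] = [("P1", "kinase A")]
SOURCE_B: List[Dict[str, str]] = [{"protein_id": "P3", "description": "kinase C"}]


def build_warehouse() -> sqlite3.Connection:
    """Create the warehouse database defined by the global schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "accession TEXT PRIMARY KEY, description TEXT, source TEXT)"
    )
    return conn


def load(
    conn: sqlite3.Connection,
    source_a: List[Tuple[str, str]],
    source_b: List[Dict[str, str]],
) -> int:
    """Periodic load: replace the warehouse contents with a fresh copy."""
    rows = [(acc, desc, "A") for acc, desc in source_a]
    rows += [(r["protein_id"], r["description"], "B") for r in source_b]
    with conn:  # one transaction: old rows out, new rows in
        conn.execute("DELETE FROM protein")
        conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", rows)
    return len(rows)
```

The trade-off is visible in the design: queries against the warehouse are fast and independent of the sources, but the data is only as fresh as the last load.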

Some of the more established examples are GUS (Genomic Unified Schema), a data warehouse that attempts to predict protein function based on protein domains, and EnsEMBL, a collaborative project between EMBL (European Molecular Biology Laboratory), EBI (European Bioinformatics Institute) and the Sanger Centre to automatically track sequenced fragments of the human genome and assemble them into longer stretches.

Web services are intended to enable the exchange of data between heterogeneous systems in the form of XML (eXtensible Markup Language) messages. Two key qualities of XML are that it is human readable and platform neutral. Web services architecture represents an attempt to allow remote access to data and application logic in a loosely coupled fashion. Previous attempts at achieving this (such as DCOM and Java/RMI) required tight integration between the client and server and used platform- and implementation-specific binary data formats.
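The flavour of this message exchange can be shown with a short Python sketch. It is a simplification under stated assumptions: the element names (getSequenceRequest, getSequenceResponse and so on) are invented for illustration, there is no SOAP envelope, and the 'service' is an ordinary function call rather than a network request. What it does show is that client and server share nothing except the text of the XML messages.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_request(accession: str) -> str:
    """Client side: serialise a lookup request as a plain XML message."""
    root = ET.Element("getSequenceRequest")
    ET.SubElement(root, "accession").text = accession
    return ET.tostring(root, encoding="unicode")


def handle_request(message: str, store: Dict[str, str]) -> str:
    """Server side: parse the request and answer with another XML message."""
    accession = ET.fromstring(message).findtext("accession")
    response = ET.Element("getSequenceResponse")
    ET.SubElement(response, "accession").text = accession
    sequence = store.get(accession) if accession is not None else None
    if sequence is None:
        ET.SubElement(response, "error").text = "not found"
    else:
        ET.SubElement(response, "sequence").text = sequence
    return ET.tostring(response, encoding="unicode")


def read_response(message: str) -> Dict[str, Optional[str]]:
    """Client side: turn the response message back into a dictionary."""
    return {child.tag: child.text for child in ET.fromstring(message)}
```

Either side could be rewritten in another language or moved to another operating system without affecting the other, provided the messages keep the same shape.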

An advantage of web services is that they are not 'owned' by any one company or organisation. Programs written in any language, using any component model and running on any operating system can all access web services.

As mergers, data sharing and communal resources become more accepted in the biotech and pharmaceutical industry, data compatibility and system integration become difficult, and often expensive, considerations. Here, web services can offer significant benefits.

Exposing the functionality of a laboratory information management system (LIMS), electronic record-keeping system or other scientific information system using web services allows scientists to share data more effectively.

By using common schemas and transforming the information, data held in the different systems can be searched, queried and displayed via XML documents, using common interfaces.

A proteomics researcher could, for example, perform a keyword search for related samples, spectra, annotated sequences, and 2D gel images of a particular protein across multiple systems within an organisation. This search could be performed from a page in the corporate intranet or portal.

Some portals offer organisations and users the ability to create a site that is personalised for individual interests; as such, a laboratory portal could be used to bring together sources, databases and functionality pertinent to a laboratory user. This might include sample registration and tracking, access to analytical results, spectral and chemical information and even documents and procedures.

Having a central access point for this type of information helps eliminate barriers between departments and functions, improves internal collaboration and communication, and ultimately delivers operational efficiency by making better use of internal resources and knowledge.

Microsoft clearly rates web services as extremely important. Approximately 80 per cent of its R&D budget was allocated to the .NET framework and web services. Now that Visual Studio .NET has been released, use of web services architecture is set to explode.

There are a number of good examples of the use of web services in bioinformatics, such as EBI's Bibliographic Query Service (BQS), which provides web service access to life science publications, thereby fulfilling a similar role to PubMed but with a richer interface for querying and retrieving publications.

A further example is EBI's XEMBL, which offers access to the EMBL nucleotide database as a web service for the first time.

Summary

The need for integration in the management of bioinformatic data has never been as stark as it is today, as the industry struggles to deal with ever greater volumes of complex and often interrelated data.

In order to maximise the usefulness and re-usability of a data source, a code of conduct for data providers has been formulated, the principles of which are now gaining widespread support. From a technological standpoint, web services appear to offer the architecture to support the integration and accessibility of data that researchers and bioinformaticians have been waiting for. The elements now appear to be in place for truly integrated bioinformatics to become a reality.

Adrian Fergus is with International Marketing, Thermo Electron Informatics.

References: *Building a Nation from a Land of City States, Lincoln D Stein, Cold Spring Harbor Laboratory. http://www.nature.com/cgiaf/dynapage.taf?file=/nature/journal/v417/n6885/full/417119a_fs.html
