Integration and consolidation: key to analysis and data mining

1st April 2013

Jack Elands reports on the strategies needed to deliver maximum value from the amounts of electronic data now used by pharmaceutical and biotech companies.

Electronic data is rapidly becoming the biggest asset of pharmaceutical and biotech companies, and its volume is growing at an accelerating pace. The relationships and links between these vast amounts of data can no longer be grasped by any individual in an organisation.

Thus, in order to put this data to use effectively, strategies need to be developed that allow maximum value to be extracted both from existing data and from newly generated and future data.

Data integration is often considered at a high level: integration strategies often start at the top, ie at the querying and reporting level, and multiple approaches are available to integrate different data sources (querying multiple databases at run time, data warehousing, database federation). While these strategies are very effective, they tap into only a subset of the potential for data integration.

IDBS' vision of data integration is based around a dynamic, data-centric architecture and starts at the very beginning, where the data is captured using a data management system. These systems, such as IDBS' ActivityBase, are typically transactional or operational systems and are optimised for data capture. For convenience these systems are often run at the local level (sites, labs, etc.), although they are also frequently run at a global level. And while the data can be associated with different experimental approaches, or be of different types (chemical structures and their related physicochemical parameters, biological data and, more recently, genomics and proteomics data), a data management system such as ActivityBase can easily manage these very different data types and thereby provide the first level of integration.

The screen captures from ActivityBase in this article provide examples of assays in different phases of discovery research, where different types of data are generated.

All these data types can easily be captured and managed by ActivityBase. By providing the flexibility to manage very different data types from structurally different assays and experimental protocols, ActivityBase effectively facilitates data integration. Data integration at this level not only provides a common mechanism for data capture and management; it also ensures consistent data quality.

Data quality is also relevant for effective data analysis and successful data mining. All too often, data is stored with little contextual information. The data context, describing how and under which conditions the data was generated, is of crucial importance for data interpretation.

When sufficient contextual information is stored with the data, the data retains its value over time. Mechanisms for storing contextual information are a key part of ActivityBase, which makes it the preferred solution for capturing and managing discovery research data.
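The principle of storing context alongside results can be sketched in a few lines. The following is an illustrative example only: ActivityBase's internal schema is not public, and every table and column name here is hypothetical.

```python
import sqlite3

# Hypothetical sketch: an assay result stored together with the contextual
# metadata (protocol, biological system, date, operator) needed to
# interpret it years later. Not IDBS's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assay_results (
        compound_id TEXT,
        ic50_nM     REAL,
        -- contextual metadata captured with every result:
        protocol    TEXT,  -- which experimental protocol was followed
        cell_line   TEXT,  -- biological system used
        run_date    TEXT,  -- when the data was generated
        operator    TEXT   -- who ran the assay
    )
""")
conn.execute(
    "INSERT INTO assay_results VALUES (?, ?, ?, ?, ?, ?)",
    ("CMPD-0001", 42.5, "kinase-inhibition-v2", "HEK293", "2013-03-15", "je"),
)

# The result remains interpretable because its context travels with it.
row = conn.execute(
    "SELECT compound_id, ic50_nM, protocol, cell_line FROM assay_results"
).fetchone()
print(row)  # ('CMPD-0001', 42.5, 'kinase-inhibition-v2', 'HEK293')
```

A bare IC50 value of 42.5 nM is nearly meaningless without knowing which protocol and cell line produced it; storing both together is what preserves the data's value over time.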

Systems like ActivityBase have been used very successfully for data reporting and analysis. But with ever-increasing numbers of data points being captured, and more complex data analysis requirements, different strategies can be adopted to organise the data more effectively for efficient capture and analysis. A key component is the data warehouse strategy, such as the Discovery Warehouse from IDBS. A data warehouse is another essential element of the IDBS dynamic, data-centric architecture, and Discovery Warehouses are a powerful approach to data consolidation.

When multiple transactional databases are in operation they can effectively be consolidated into a single data repository, which then becomes the preferred point for querying and analysis. At this level, data from other sources can also be uploaded to the Discovery Warehouse. Consolidating several data sources provides another level of data integration.
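The consolidation step can be illustrated with a minimal sketch, assuming two site-level transactional databases feeding one warehouse table. All names are made up for the example; this is the general pattern, not the Discovery Warehouse implementation.

```python
import sqlite3

def make_site_db(rows):
    """Create a site-level transactional database (in memory for the demo)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE results (compound_id TEXT, activity REAL)")
    db.executemany("INSERT INTO results VALUES (?, ?)", rows)
    return db

site_a = make_site_db([("CMPD-0001", 87.2), ("CMPD-0002", 12.9)])
site_b = make_site_db([("CMPD-0003", 55.0)])

# The warehouse consolidates both sources into one table, tagging each
# row with its origin so provenance is not lost.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE consolidated (source TEXT, compound_id TEXT, activity REAL)"
)
for name, src in [("site_a", site_a), ("site_b", site_b)]:
    rows = src.execute("SELECT compound_id, activity FROM results").fetchall()
    warehouse.executemany(
        "INSERT INTO consolidated VALUES (?, ?, ?)",
        [(name, cid, act) for cid, act in rows],
    )

# Queries now hit one repository instead of N transactional systems.
count = warehouse.execute("SELECT COUNT(*) FROM consolidated").fetchone()[0]
print(count)  # 3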

Corporate portals or data federation layers are much more effective when they can rely on a few consolidated data sources that are organised for analysis, such as the Discovery Warehouse, rather than having to depend on a large number of transactional databases, whose primary function is to capture data.

The majority of data reporting deals with questions that are highly repetitive: reports on screening, structure-activity relationships, project data, etc.

While data mining requires access to the larger dataset, the majority of reporting can be done more effectively using data marts: data sources that are optimised for specific reporting purposes.

Such data marts, like the IDBS Discovery Mart, can be designed to address specific reporting needs such as compound projects, etc.

Updated automatically by warehouses, marts can be created and retired as needed; they serve for as long as they are required.

Data marts will become important sources for many query applications: they are highly integrated and provide data in very easily accessible formats, making access for query tools and federation layers very straightforward.
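A data mart in this sense is a small, denormalised extract of the warehouse, built for one repetitive reporting question. The sketch below assumes a hypothetical per-project summary mart; the schema and names are illustrative, not the IDBS Discovery Mart.

```python
import sqlite3

# Hypothetical warehouse table (names invented for the example).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE consolidated (project TEXT, compound_id TEXT, activity REAL)"
)
warehouse.executemany(
    "INSERT INTO consolidated VALUES (?, ?, ?)",
    [
        ("kinase", "CMPD-0001", 87.2),
        ("kinase", "CMPD-0002", 12.9),
        ("gpcr",   "CMPD-0003", 55.0),
    ],
)

# The mart pre-aggregates per project, so routine project reports avoid
# scanning the full warehouse. Re-running this refresh keeps it current.
warehouse.execute("""
    CREATE TABLE project_mart AS
    SELECT project,
           COUNT(*)      AS n_compounds,
           AVG(activity) AS mean_activity
    FROM consolidated
    GROUP BY project
""")

rows = dict(
    (project, n) for project, n, _ in
    warehouse.execute("SELECT * FROM project_mart ORDER BY project")
)
print(rows)  # {'gpcr': 1, 'kinase': 2}
```

Because the mart is derived entirely from the warehouse, it can be dropped and rebuilt at any time, which is what makes creating and retiring marts on demand cheap.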

The IDBS dynamic, data-centric architecture is not about applications. It provides a consistent way to organise data for capture, consolidation and integration, as well as for querying, reporting, analysing and mining. The applications play a vital role in ensuring that data and its contextual information are captured and maintained to ensure the highest quality and reliability.


Jack Elands is with IDBS, Guildford, Surrey, UK. Tel: +44 (0)1483 595000. Fax: +44 (0)1483 595001.





