Developments in data visualisation pave the way in genetic research

With such vast amounts of data to consider, it can be difficult for scientists to understand the true biological meaning of their research, says Carl-Johan Ivarsson. However, new data visualisation techniques are now making it much easier to uncover new and unexpected results.

As recently as 10 years ago, many biologists were still working with glass slides that revealed a few thousand features of the genes that they were studying, but that number has grown dramatically in recent years, thanks to advances in technology. As a result, it has become much more difficult for biologists to identify which genes are being expressed, and to what level.

With such a large volume of data to consider, it is often impossible for these scientists to derive any real biological meaning from their findings with the naked eye alone, which means that sophisticated data algorithms need to be developed in order to interpret this data effectively. As a result, much of the computer software that has been designed for use in this area has focussed on being able to handle increasingly vast amounts of data.

Unfortunately, this shift in focus has (unintentionally) pushed scientists and researchers to one side, since a lot of data analysis must now be performed by specialist bioinformaticians and biostatisticians, especially when complicated algorithms are required for the analysis. This model has several drawbacks, however, since it is typically the scientist who knows the most about the specific subject area being studied.

The good news for scientists is that the latest data visualisation techniques and imaging technologies are already making it much easier for the researchers themselves to examine this enormous quantity of data, to test different hypotheses, and to explore alternative scenarios within seconds, since important findings can now be displayed in an easy-to-interpret graphical form.

Data visualisation

During the last decade, research into molecular biology has helped to identify a large number of disease-associated genes, and is therefore helping researchers to unpick the fundamental biology of major illnesses. Gene expression profiling, for example, is now regularly being used for the study of many serious diseases.

Gene expression experiments help to measure the activity (the expression) of tens of thousands of genes at once, in order to create a global picture of cellular function. These findings can then be used to distinguish between cells that are actively dividing, for example, or to show how the cells react to a particular treatment. As part of this process, researchers often must consider sub-groups (such as patients who are in remission versus patients who have suffered a relapse), whilst also examining the different types of cell abnormalities related to clinical conditions such as diabetes and cancer.

Difficulties can arise, however, as a result of the vast amount of data that is created by experiments like these. This 'data overload' can present a serious problem for researchers, since it is essential to capture, explore, and analyse this kind of data effectively in order to obtain meaningful results.

To address this issue, a new generation of data visualisation tools has been designed to take full advantage of the most powerful pattern recogniser that exists: the human brain. Indeed, powerful software engines are already being used to help researchers to visualise their data in 3D, so that they can identify hidden structures and patterns more easily, and therefore find any interesting or significant results by themselves, without having to rely on specialist bioinformaticians and biostatisticians.

Identifying patterns

Data visualisation works by projecting high-dimensional data down to lower dimensions, which can then be plotted in 3D on a computer screen, rotated manually or automatically, and examined by the naked eye. With the benefit of instant user feedback on all of these actions, scientists studying diseases like diabetes and leukaemia can now easily analyse their findings in real time, directly on their computer screen, in an easy-to-interpret graphical form.

Scientists are already making use of this exciting new technology in a real-world setting. For example, a large EU-funded research project is attempting to develop and optimise in vitro test strategies that could reduce or replace animal testing for sensitisation studies.

The project, known as Sens-it-iv, combines both private and public research institutions, as well as several industrial and societal interest organisations. Dr Ann-Sofie Albrekt is currently using advanced data analysis software for her work in this important area, under the supervision of Professor Carl Borrebaeck, a sub-coordinator of Sens-it-iv.

"In our studies, we are dealing with very large amounts of data, sometimes between 10 and 100 million data points, which we tend to view as graphics. With earlier applications, these graphics would take a long time to appear, but with the latest data analysis tools, these 3D images are presented instantly," Dr Albrekt says. "As a result, we can be much more creative with our theories, as we can easily test any number of hypotheses in rapid succession, and see the results at a glance."

When used during research in this way, the ability to visualise data in 3D represents a very powerful tool for scientists, since the human brain is very good at detecting structures and patterns. The idea behind this approach is that highly complex data becomes easier to understand when it is given a graphic form. Information visualisation therefore offers a way to transform raw data into a comprehensible graphical format, so that scientists can make decisions based on information that they can identify and understand easily.

New imaging functions contained within the latest data analysis applications are currently allowing scientists to analyse very large data sets by using a combination of different visualisation techniques, such as heat maps and principal component analysis (PCA). With visualisation tools like these, it is possible to investigate large and complex data sets without being a statistics expert, since visualising information reduces the time required to take in data, make sense of it, and draw conclusions from it.

The process begins by reducing high-dimensional data down to lower dimensions so that it can be plotted in 3D. Principal component analysis (PCA) is often used for this purpose, as it uses a mathematical procedure to transform a number of possibly correlated variables into a number of uncorrelated variables (called principal components).
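For readers who want to see the idea in concrete terms, the following is a minimal sketch of PCA in Python using NumPy. It is an illustration only, not the method used by any particular software package: the function name `pca_project`, the random test data, and the choice of three components are all assumptions made for this example.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project rows of `data` (samples x variables) onto the top principal components."""
    # Centre each variable (column) so that the components describe variance
    centred = data - data.mean(axis=0)
    # Singular value decomposition: the rows of vt are the principal directions
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    # Coordinates of every sample in the reduced component space
    scores = centred @ vt[:n_components].T
    # Share of the total variance carried by each retained component
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

# Hypothetical data: 20 samples measured on 500 variables ("genes"),
# with the first 10 samples shifted on 50 variables to form a group
rng = np.random.default_rng(0)
expression = rng.normal(size=(20, 500))
expression[:10, :50] += 2.0

scores, explained = pca_project(expression)
print(scores.shape)   # one 3D point per sample, ready to plot
print(explained)      # fraction of variance kept by each component
```

Each sample is reduced from 500 values to three coordinates, and the three new variables are uncorrelated with one another, which is what allows them to be used as the axes of a 3D plot.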

Key breakthrough

One of the key breakthroughs in the latest generation of bioinformatics software is the introduction of dynamic PCA, an innovative way of combining PCA with immediate user interaction. This unique feature allows scientists to manipulate different PCA plots - interactively and in real time - directly on the computer screen, and at the same time work with all annotations and other links in a fully integrated way. With this approach, researchers are given full freedom to explore all possible versions of the presented view, and are therefore able to visualise, analyse, and explore a large dataset easily.

By using a tool known as a 'heat map' alongside dynamic PCA, scientists have yet another way of visualising their data, since heat maps can take the values of a variable in a two-dimensional map and represent them as different colours. Because modern heat maps use sophisticated mapping techniques to represent this data (as opposed to standard charting and graphing techniques), they can provide a view of data that is simply not possible to achieve with simple charts and graphs.

In biology, heat maps are often used to represent the level of expression of many genes across a number of comparable samples, such as cells in different states or samples from different patients, with the underlying values typically obtained from DNA microarrays. Heat maps are also popular because they can be updated dynamically whenever any filter parameters are changed.
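The colour mapping behind a heat map can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the blue-white-red scale, the function names `value_to_rgb` and `heat_map`, and the small table of expression values are all invented for this example and are not taken from any real experiment or software.

```python
def value_to_rgb(value, low, high):
    """Map a value between `low` and `high` onto a blue-white-red colour scale."""
    if high == low:
        return (255, 255, 255)  # no spread in the data: use white throughout
    # Position of the value on a 0..1 scale, clipped to the valid range
    t = min(1.0, max(0.0, (value - low) / (high - low)))
    if t < 0.5:
        s = t / 0.5             # lower half: blue fading to white
        return (round(255 * s), round(255 * s), 255)
    s = (t - 0.5) / 0.5         # upper half: white deepening to red
    return (255, round(255 * (1 - s)), round(255 * (1 - s)))

def heat_map(matrix):
    """Turn a table of numbers (rows = genes, columns = samples) into RGB colours."""
    values = [v for row in matrix for v in row]
    low, high = min(values), max(values)
    return [[value_to_rgb(v, low, high) for v in row] for row in matrix]

# Hypothetical expression levels: 3 genes measured in 4 samples
expression = [
    [2.1, 2.3, 7.9, 8.2],
    [5.0, 5.2, 5.1, 4.9],
    [8.4, 7.7, 1.9, 2.0],
]
colours = heat_map(expression)
for row in colours:
    print(row)
```

In this sketch the lowest value in the table is drawn in pure blue and the highest in pure red, so genes that behave differently between the first two and the last two samples stand out as blocks of contrasting colour.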

A group of scientists studying the human eye at the Division of Ophthalmology and Visual Sciences at Queen's Medical Centre (QMC), part of the University of Nottingham, regularly use heat maps as part of their study of the limbal stem cells on the ocular surface.

"Compared to what is possible with modern data analysis software, previous studies were more complex to analyse and difficult for biologists to understand," according to Dr Bina Kulkarni, one of the researchers working at QMC. "Data analysis is now much easier, as the latest software in this area provides instant graphical visualisation of the statistical tests in the form of heat maps, as well as variable and sample PCA plots, which really helps us to understand the analysis and the changes in gene expression patterns across different samples."

What's next?

As computer technology improves - with greater processing power, better graphics applications and more sophisticated analysis software - data visualisation will continue to develop as well. As a result, these new methods of visualising data are likely to make traditional forms of data presentation (such as spreadsheets and basic graphics) obsolete in the future.

Epigenetic alterations

Already, a team of scientists at the Institute of Human Genetics of the Christian-Albrechts-University in Kiel, Germany, is using data visualisation to support a number of national and international projects on the epigenetic alterations associated with several cancers, including malignant lymphoma, colorectal cancer, and hepatocellular carcinoma, as well as developmental disorders and other diseases.

"Larger studies, especially those which include multiple samples that need to be analysed on comprehensive array platforms, have traditionally been very time-consuming, and have also required a considerable amount of computer power," says lead researcher Dr Ole Ammerpohl. "As humans, we are all used to interpreting 3D pictures in our environment, and so our brain is able to find structures in complex 3D figures very quickly. Therefore, it's no wonder that a 3D presentation of complex mathematical/statistical coherences makes its interpretation much easier for us."

Even though the exploration and analysis of large data sets can be challenging, the use of tools like PCA and heat maps can provide a powerful way of identifying important structures and patterns very quickly, especially as visualisation typically provides the user with instant feedback, and with results that present themselves as they are being generated.

The latest technological advances in this area are already making it much easier for scientists to compare the vast quantity of data generated by epigenetic studies and to test different hypotheses very quickly. As a result, the latest generation of data analysis software is helping scientists to regain control of this analysis, and to realise the true potential of the important research being conducted in this area.

Qlucore started as a collaborative research project at Lund University, Sweden, supported by researchers at the Departments of Mathematics and Clinical Genetics, in order to address the vast amount of high-dimensional data generated with microarray gene expression analysis. As a result, it was recognised that an interactive scientific software tool was needed to conceptualise the ideas evolving from the research collaboration.

The basic concept behind the software is to provide a tool that can take full advantage of the most powerful pattern recogniser that exists - the human brain. The result is a core software engine that visualises the data in 3D and aids the user in identifying hidden structures and patterns. Over the past two years the major efforts have been to optimise the early ideas and to develop a core software engine that is extremely fast, allowing the user to explore and analyse high-dimensional data sets interactively and in real time on a normal PC.

Qlucore was founded in early 2007, and the first product released was the Qlucore Gene Expression Explorer 1.0. The latest version of this software, Version 1.1, represents a major step forward, with advanced statistics support. Every user action is at most two mouse clicks away. The company's early customers are mainly from the life science and biotech industries, but solutions for other industries are currently under development.

One of the key methods used by Qlucore Gene Expression Explorer to visualise data is dynamic principal component analysis (PCA), an innovative way of combining PCA with immediate user interaction.

Dynamic PCA combines PCA with instant user response. Because the user is given full freedom to explore all possible versions of the presented view, this combination provides an effective way to visualise and analyse a large data set while keeping a comprehensive view of the data.

PCA works by projecting high-dimensional data down to lower dimensions. The specific projections of the high-dimensional data are chosen in order to maintain as much variance as possible in the projected data set. With Qlucore Gene Expression Explorer, data is projected and plotted on the two-dimensional computer screen and then rotated manually or automatically and examined by the naked eye.
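The variance-preserving choice of projection described above can be written compactly in standard textbook form. The notation is an assumption introduced here for illustration and is not taken from the Qlucore software: X is the centred data matrix with n samples as rows and p variables as columns, S is its sample covariance matrix, and the eigenvalues of S are ordered from largest to smallest.

```latex
\begin{align}
  S &= \frac{1}{n-1}\, X^{\top} X \\
  \mathbf{w}_1 &= \arg\max_{\|\mathbf{w}\| = 1} \operatorname{Var}(X\mathbf{w})
                = \arg\max_{\|\mathbf{w}\| = 1} \mathbf{w}^{\top} S\, \mathbf{w} \\
  \text{fraction of variance retained by } k \text{ components}
               &= \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i},
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
\end{align}
```

The first direction is the eigenvector of S with the largest eigenvalue, and each later direction maximises the same quantity while remaining orthogonal to the earlier ones. For a 3D plot, k = 3.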
