Proteomics goes on the march and outstrips computerised systems

Eric Russell looks at the technical advances in genomics and proteomics which have spawned a related revolution in the growing field of bioinformatics. This can be described as the acquisition, analysis, and storage of biological information, specifically nucleic acid and protein sequences. Such growth is also fuelling the demand for more software development.

While the commercial and economic world is slowing down, with recession still a possibility, the world of proteomics is moving forward very rapidly. But the rate of progress is outstripping the ability of computerised systems to keep up.

A brief survey of leading academics suggests that further software development would be welcome. For a start, the amount of data being generated by research laboratories is huge. But it is not so much the data on individual proteins and genes that is the problem; it is the data deriving from the interactions and modifications of these elements, which is growing exponentially.

It is also felt that much of the available genomic and proteomic data is incomplete, poorly annotated and unorganised, while most existing analytical tools provide limited assistance in transforming it into useful information.

Biological diversity

The situation is further complicated because the biological diversity associated with disease may also be influenced by post-translational processes. These are structural modifications of the initial protein transcript, controlled by the cellular environment, that cannot be inferred from known DNA variation.

Bioinformatics has a major role in identifying relationships between different genes and proteins. Many proteins arise from a common ancestor and still bear resemblances to each other, sometimes even at the DNA sequence level. The task becomes one of trying to align two or more sequences of letters from a given alphabet, allowing gaps in each of them, so the problem is algorithmic and statistical.
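
To make that concrete, below is a minimal sketch of gapped pairwise alignment using the classic Needleman-Wunsch dynamic-programming scheme. The scoring values (match +1, mismatch -1, gap -1) are arbitrary choices for illustration only.

```python
# Minimal Needleman-Wunsch global alignment: aligns two sequences
# over an alphabet while allowing gaps. Scores are illustrative.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + d,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        d = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + d:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b)), score[n][m]

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
```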

Computational biology needs to generate programs that allow the user to extract more information than was originally entered into the system. Data mining is a software technique that enables complex and varied questions to be asked of a database, making it possible to infer structural, functional and evolutionary relationships from protein sequence alignments.
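
As a toy illustration of asking compound questions of a database, the following sketch builds a small in-memory protein table with SQLite and queries it; the schema and every row are entirely hypothetical.

```python
# A toy "data mining" query over a small protein table.
# The schema and all rows are hypothetical, purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE proteins
                (name TEXT, family TEXT, mass_da REAL, organism TEXT)""")
conn.executemany("INSERT INTO proteins VALUES (?, ?, ?, ?)", [
    ("insulin",   "hormone",   5808.0,  "human"),
    ("lysozyme",  "hydrolase", 14300.0, "chicken"),
    ("myoglobin", "globin",    16700.0, "sperm whale"),
])

# A compound question: which families contain proteins under 15 kDa,
# and what is the average mass of those members?
for row in conn.execute("""SELECT family, COUNT(*), AVG(mass_da)
                           FROM proteins
                           WHERE mass_da < 15000
                           GROUP BY family"""):
    print(row)
```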

Another challenge is to close the gap between the number of proteins characterised directly and the number of entries in protein sequence databases derived from DNA sequencing. In 1997, says one source, the difference was 507 versus 428 814.

Traditional protein techniques cannot hope to bridge this gap, so the biologist must rely on predictive methods. This approach uses amino acid compositional data such as molecular weights, hydrophobicity values, isoelectric points, peptide masses, and the presence of helices or sheets. Such information is used in various algorithms to assign new protein sequences to known protein families.
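
A minimal sketch of such a predictive method, under stated assumptions: derive two compositional features (mean Kyte-Doolittle hydropathy and an approximate molecular weight) and assign a sequence to the nearest family centroid. The centroids here are invented for illustration; a real system would use curated family profiles and many more features.

```python
# Compositional features for a protein sequence, plus a naive
# nearest-centroid family assignment. Family centroids are invented.

# Kyte-Doolittle hydropathy values and average residue masses (Da).
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}
MASS = {'A': 71.08, 'R': 156.19, 'N': 114.10, 'D': 115.09, 'C': 103.14,
        'Q': 128.13, 'E': 129.12, 'G': 57.05, 'H': 137.14, 'I': 113.16,
        'L': 113.16, 'K': 128.17, 'M': 131.19, 'F': 147.18, 'P': 97.12,
        'S': 87.08, 'T': 101.10, 'W': 186.21, 'Y': 163.18, 'V': 99.13}

def features(seq):
    gravy = sum(KD[r] for r in seq) / len(seq)   # mean hydropathy
    mw = sum(MASS[r] for r in seq) + 18.02       # residues + one water
    return gravy, mw

# Hypothetical family centroids: (mean hydropathy, molecular weight).
FAMILIES = {"membrane-like": (1.5, 4000.0), "soluble-like": (-0.8, 4000.0)}

def assign(seq):
    g, mw = features(seq)
    return min(FAMILIES, key=lambda f: (FAMILIES[f][0] - g) ** 2
                                     + ((FAMILIES[f][1] - mw) / 1000) ** 2)

print(assign("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))  # -> soluble-like
```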

New drug targets

Many pharmaceutical companies in search of new drug targets are using a combination of genomic and proteomic information to drive screening programs for broad-spectrum targets that are conserved across many genomes. The two kinds of information must be combined because the presence of a given DNA sequence does not guarantee the synthesis of a corresponding protein.

In addition, DNA information is insufficient to describe protein structure and function, because much of a protein's complexity arises from cellular, context-dependent, post-translational processes. The genetic code also gives no information about protein-protein interactions that generate functional networks.

All this demonstrates that there is not a simple one-to-one correlation between DNA and protein codes. Thus, the study of proteomics is an essential part of the chemical definition of biological systems.

In the past, significant differences in DNA, RNA, and protein analytical methodologies led to the development of separate disciplines for capturing the information residing in DNA and protein structures.

These separate analytical regimens, with their own staff, laboratories, methods, and genomic and proteomic data systems, have been maintained at the cost of expensive duplication. Now, however, it is vital to have a single analytical platform for gathering genomic and proteomic data. A common data system to collect, access and merge collateral genomic and proteomic information would improve resource utilisation and increase productivity.

In principle, the analytical logic is the same: fragmentation, analysis of the DNA fragment or peptide sequences, then reconstruction of the unfragmented DNA or protein.
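
As a toy example of that shared fragment-and-reconstruct logic, the sketch below reassembles a string from overlapping fragments by repeatedly merging the pair with the largest overlap. Real assembly and protein-reconstruction software is far more sophisticated; this only illustrates the principle.

```python
# Greedy reconstruction of a sequence from overlapping fragments.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble(frags):
    frags = list(frags)
    while len(frags) > 1:
        # Merge the pair of fragments with the largest overlap.
        k, a, b = max((overlap(a, b), a, b)
                      for a in frags for b in frags if a is not b)
        frags.remove(a); frags.remove(b)
        frags.append(a + b[k:])
    return frags[0]

print(assemble(["GATTAC", "TACGAT", "CGATTT"]))  # -> GATTACGATTT
```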

Emerging techniques

One suggestion is that mass spectrometry (MS) should take on a larger role. Although its current mass range is limited, the accuracy and resolution of MS make it the detector of choice in emerging techniques for DNA analysis.

Currently, MALDI-TOF and electrospray ionisation (ESI) quadrupole MS are the most popular MS instruments for direct DNA and RNA analysis.
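
One common way of turning such MS data into protein identifications is peptide-mass fingerprinting. A hedged sketch, under stated assumptions: digest candidate sequences in silico with trypsin and count how many observed peptide masses each candidate explains. The candidate sequences, the peak list, and the 0.5 Da tolerance are all illustrative.

```python
# Toy peptide-mass fingerprinting against an in-silico tryptic digest.
import re

# Average amino acid residue masses (Da); a peptide adds one water.
MASS = {'A': 71.08, 'R': 156.19, 'N': 114.10, 'D': 115.09, 'C': 103.14,
        'Q': 128.13, 'E': 129.12, 'G': 57.05, 'H': 137.14, 'I': 113.16,
        'L': 113.16, 'K': 128.17, 'M': 131.19, 'F': 147.18, 'P': 97.12,
        'S': 87.08, 'T': 101.10, 'W': 186.21, 'Y': 163.18, 'V': 99.13}

def tryptic_peptides(seq):
    # Trypsin cleaves after K or R, except when the next residue is P.
    return [p for p in re.split(r'(?<=[KR])(?!P)', seq) if p]

def peptide_mass(p):
    return sum(MASS[r] for r in p) + 18.02

def matches(observed, candidate, tol=0.5):
    theoretical = [peptide_mass(p) for p in tryptic_peptides(candidate)]
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

candidates = {"protA": "MKWVTFISLLK", "protB": "GIVEQCCTSICSLYQLENYCN"}
observed = [277.4, 1106.4]   # hypothetical spectrum peaks (Da)
for name, seq in candidates.items():
    print(name, matches(observed, seq), "of", len(observed), "peaks matched")
```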

There is also a need to increase sample throughput in the laboratory, and more automation is needed. Fortunately, MS instrumentation is developing rapidly, with novel configurations appearing all the time.

Biochips are also felt to be the way to go for increased throughput, but better links are sought between 2D gel electrophoresis and MS or MALDI-TOF analytical equipment. More flexible robots would also increase throughput, which is generally accepted as a key area for improvement.

DNA microarrays, also known as DNA chips or biochips, help by performing a vast number of parallel experiments in a single pass. The technology provides an inexpensive tool for rapid exploration of genome structure, gene expression profiling, gene function and cell biology.
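
The parallelism is easy to picture in code: each gene contributes one readout per condition, so comparing conditions reduces to an element-wise calculation over thousands of genes at once. A minimal sketch with made-up spot intensities:

```python
# Comparing expression between two conditions, gene by gene.
# All gene names and intensities below are invented for illustration.
import math

genes    = ["geneA", "geneB", "geneC"]
healthy  = [120.0, 800.0, 45.0]   # hypothetical spot intensities
diseased = [480.0, 790.0, 11.0]

for g, h, d in zip(genes, healthy, diseased):
    log2_ratio = math.log2(d / h)  # > 0 means up-regulated in disease
    print(f"{g}: log2 fold change = {log2_ratio:+.2f}")
```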

The key is using the genome to identify genetic problems and develop patentable treatments. Assays are a powerful screening technology that identifies the faulty genetic patterns that trigger disease and cause protein imbalances; proteomics, the study of the proteome (the PROTEins expressed by a genOME), enlists proteins such as antibodies to target genes and act on them.

The insertion of genes into bacterial hosts is a powerful way to make proteins for medicinal purposes. The first recombinant protein used therapeutically was insulin, for example.

Scientists would also like to see a quantum leap in the recognition software used in vision systems. It is felt that the human eye can still be better than software at picking spots out of gels, although it suffers from fatigue and subjectivity.
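
For a flavour of what such recognition software attempts, the sketch below picks spots out of a synthetic gel image by keeping local intensity maxima above a threshold. Real 2D-gel packages add background correction, spot quantitation and matching across gels; the image, filter size and threshold here are all assumptions.

```python
# Naive spot picking on a synthetic 2D-gel image.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

# Synthetic "gel": two blurred spots on a faint noisy background.
rng = np.random.default_rng(0)
gel = rng.normal(0.0, 0.01, (64, 64))
gel[20, 30] = 1.0
gel[45, 10] = 0.8
gel = gaussian_filter(gel, sigma=2)

# A pixel is called a spot centre if it is the maximum of its
# neighbourhood and clears an intensity threshold.
local_max = (gel == maximum_filter(gel, size=9)) & (gel > 0.02)
for y, x in zip(*np.nonzero(local_max)):
    print(f"spot at ({y}, {x}), intensity {gel[y, x]:.3f}")
```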

These issues have to be addressed because proteomics is an increasingly important technology that enables mass screening of proteins and their post-translational modifications. Proteomics research aims to characterise the hundreds or thousands of proteins expressed by organisms, in the context of whole organisms, specific tissues, or normal versus diseased states.

But, as one professor says, the rate of progress is always limited by the rate at which funds can be raised. With pieces of equipment and suites of software now costing in the 100 000 euro range, university revenue does not go far.

The country that succeeds in proteomic research will be the one that funds its industry best. But it is not just a matter of money; speed is of the essence. The science of proteomics is developing so fast that governments will have to move at an unaccustomed pace to fuel the industry. The will and the ability to proceed rapidly are there in academia, and if governments do not respond, frustration will limit what can be achieved.
