New DNA sequencing

US company Solexa has completed its first genome sequence, that of the virus Phi-X174. The company announced genome coverage of 100 per cent, accuracy of at least 99.93 per cent and the detection of at least three mutations, subsequently confirmed by conventional DNA sequencing techniques.

This accuracy was achieved despite the presence of a number of sub-sequences that are particularly difficult to sequence with certain other chemistries.

At the heart of this breakthrough is the company's novel sequencing biochemistry (see 'How it works' below). This work provides an end-to-end demonstration of a technology expected to sequence the DNA of individual humans for the detection of key disease-predisposing mutations. The genome sequence has already been repeated a number of times.

Generated data

While the Phi-X174 genome sequenced was small at just over 5000 bases, the amount of sequence data generated was considerably larger. Whereas conventional DNA sequencing equipment typically delivers no more than approximately 1200 bases per sample preparation, this experiment delivered more than three million bases from a single sample preparation. As a result, says the company, sample preparation, which can be a major effort in large-scale DNA sequencing projects, can potentially be reduced by over 1000-fold.
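
The fold reduction quoted above follows from simple arithmetic on the figures in the text. The sketch below is a minimal illustration only: the constants are the article's round numbers, not exact values, and the variable names are chosen here for clarity.

```python
# Figures quoted in the article (rounded; the exact values are not given).
GENOME_LENGTH_BASES = 5_000           # Phi-X174, "just over 5000 bases"
CONVENTIONAL_BASES_PER_PREP = 1_200   # upper bound for conventional equipment
SOLEXA_BASES_PER_PREP = 3_000_000     # "more than three million bases"

# How many conventional sample preparations one Solexa preparation replaces.
prep_reduction = SOLEXA_BASES_PER_PREP / CONVENTIONAL_BASES_PER_PREP

# Average number of times each genome position was read (fold coverage).
fold_coverage = SOLEXA_BASES_PER_PREP / GENOME_LENGTH_BASES

print(f"Sample preparations replaced: about {prep_reduction:.0f}")
print(f"Average coverage of the genome: about {fold_coverage:.0f}x")
```

On these round numbers the ratio comes to about 2500, which is consistent with the company's claim of a reduction of over 1000-fold.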

Although these results are impressive, Solexa maintains that they significantly underestimate the amount of data available. This is because the prototype hardware allowed the instrumentation system to image only three per cent of the available area in its flow cell. The company estimates that more than 100 million bases of data were represented in a single flow cell, all from a single sample preparation.
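
The company's estimate can be reproduced by scaling the observed yield up to the full flow-cell area. This is a rough sketch that assumes the data are spread evenly across the flow cell, which the article does not state explicitly; the names are illustrative.

```python
OBSERVED_BASES = 3_000_000   # bases recovered in the experiment (from the article)
IMAGED_PERCENT = 3           # share of the flow-cell area that was imaged

# Assumption: yield scales in direct proportion to the imaged area.
estimated_total_bases = OBSERVED_BASES * 100 / IMAGED_PERCENT

print(f"Estimated bases in the whole flow cell: {estimated_total_bases:,.0f}")
```

Three million bases from three per cent of the area scales to roughly 100 million bases for the whole flow cell, in line with the figure quoted.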

Future experiments are expected to substantially increase the fraction of data to be recovered. Later this year, fully automated instrumentation is expected to allow hands-off capture of almost all available data.

Lower sequencing costs

The genome experiments were conducted using DNA cluster technology that the company acquired in early 2004 and has significantly refined and developed since. The results were achieved with proprietary surface chemistry developed by the company.

This approach has successfully achieved clusters so small that they are beyond the resolving power of the research microscope used to observe them. According to the company, this justifies its decision to move from earlier bead-based work to clusters.

The new approach provides high fluorescence signals while achieving submicron feature sizes, thereby enabling rapid and inexpensive detection of large numbers of DNA sequencing data points.

Since instrument depreciation is a major contributor to the cost per data point, this is an important advancement.

By lowering instrument costs per data point while simultaneously achieving extremely low reagent usage, the company anticipates that cluster technology may result in substantially lower sequencing costs.

While companies with competing technologies have developed novel DNA sequencing technologies based on beads spaced as much as 50 microns apart, Solexa is now working with clusters as small as half a micron in radius. This density of sequence reads is up to 500 times higher than the bead-based approach. If reagent costs scale in parallel, the company believes that it will gain a substantial long-term cost advantage.
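
The article does not say how the 500-fold figure was derived. As a rough illustration only, the sketch below assumes features laid out on a regular square grid (real clusters are not) and shows that density scales with the inverse square of the spacing. The function name and the grid model are assumptions made for this example.

```python
import math

BEAD_SPACING_UM = 50.0        # bead-to-bead spacing quoted in the article
CLAIMED_DENSITY_RATIO = 500   # "up to 500 times higher"

def features_per_mm2(spacing_um: float) -> float:
    """Features per square millimetre on a square grid with the given pitch."""
    per_mm = 1000.0 / spacing_um
    return per_mm ** 2

bead_density = features_per_mm2(BEAD_SPACING_UM)

# Mean cluster spacing that would give the claimed 500-fold density
# under the same square-grid assumption.
implied_cluster_spacing_um = BEAD_SPACING_UM / math.sqrt(CLAIMED_DENSITY_RATIO)

print(f"Bead density: {bead_density:.0f} per mm^2")
print(f"Implied mean cluster spacing: {implied_cluster_spacing_um:.2f} microns")
```

Under this simplified model, a 500-fold gain over beads at 50 microns corresponds to clusters a little over two microns apart on average, which leaves room between clusters of half a micron radius.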

Human re-sequencing

Bioinformatics analysis of the human genome reference sequence has shown that read lengths of 25 base pairs are the point of diminishing returns for increasing read length in genome-scale re-sequencing work.

At this level, up to 82 per cent of the human genome can be uniquely associated with specific reads, even when they record mutations.

Above this level, the percentage of the genome covered increases very slowly with increasing read length, due to the content of highly repetitive sequences, which are those of least importance to most researchers.

This read length has now been achieved in the Phi-X experiment, and more than a hundred thousand reads of this length on a wide range of sequence contexts have been obtained. The sequencing technology is not fundamentally limited to this read length.
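
The kind of analysis described can be sketched on a toy scale by counting how many positions in a reference start a read that occurs nowhere else. This is not the company's method: the function name and the short example sequence are hypothetical, exact matches only are considered, and the reverse strand is ignored.

```python
from collections import Counter

def unique_fraction(reference: str, read_length: int) -> float:
    """Fraction of read start positions whose read occurs only once in the reference."""
    reads = [
        reference[start:start + read_length]
        for start in range(len(reference) - read_length + 1)
    ]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / len(reads)

# Toy reference containing the repeated motif "ACG".
toy_reference = "ACGTACGA"
for length in (2, 3, 4):
    print(length, round(unique_fraction(toy_reference, length), 2))
```

In the toy sequence, longer reads span the repeat and become unique. In a real genome the gain flattens once reads are long enough to resolve everything except the highly repetitive regions, which is the diminishing return described above.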

The sequence covered by Solexa in this demonstration of the Cluster-SBS technology includes a number of cases in which the same nucleotide occurs for many consecutive positions, a type of subsequence that can be problematic for other sequencing chemistries. SBS chemistry reads through these by analysis of each incremental base in a stepwise fashion.

This focus on accuracy is expected to be a key competitive advantage for the company. Re-sequencing is often used to look for very rare mutations, particularly in cancer samples. In these and other cases, even a modest error rate can create more false positives than real detected mutations.
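
A back-of-envelope calculation shows why error rate matters for rare mutations. The sketch below treats the 99.93 per cent accuracy figure as a per-base error rate purely for illustration, and the mutation frequency is a hypothetical value not taken from the article. It also ignores the error reduction that repeated reads of the same position can provide.

```python
# 99.93 per cent accuracy leaves 0.07 per cent errors, or 700 per million bases.
ERRORS_PER_MILLION_BASES = 700
# Hypothetical frequency of genuine rare mutations (an assumption for this example).
MUTATIONS_PER_MILLION_BASES = 10

def false_to_true_ratio(errors_per_million: int, mutations_per_million: int) -> float:
    """Expected number of false calls for every genuine mutation."""
    return errors_per_million / mutations_per_million

ratio = false_to_true_ratio(ERRORS_PER_MILLION_BASES, MUTATIONS_PER_MILLION_BASES)
print(f"False calls per genuine mutation: {ratio:.0f}")
```

With these illustrative numbers, uncorrected errors would outnumber genuine mutations many times over, which is why accuracy and redundant coverage both matter in re-sequencing.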

Solexa's core technology, called the single molecule array (SMA), allows simultaneous analysis of hundreds of millions of individual molecules. It is being applied to the measurement of individual genetic variation.

Randomly distributed

Unlike conventional high-density bio-arrays, the sites on SMAs are randomly distributed and at each there is only one single molecule of DNA. As a result it is possible to create arrays of very high site density, around 10⁸ sites per cm² or more, allowing massively parallel processing.
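
To put the site density in physical terms, the sketch below converts it to an average area and spacing per site, reading the quoted density as 10^8 sites per square centimetre. The sites are randomly distributed, so the spacing is only an average; the variable names are illustrative.

```python
import math

SITES_PER_CM2 = 10**8    # site density, read as 10^8 per square centimetre
MICRONS_PER_CM = 10**4   # 1 cm = 10,000 microns

# Average area available to each site, in square microns.
area_per_site_um2 = MICRONS_PER_CM**2 / SITES_PER_CM2

# Side of a square with that area, used as a rough mean spacing.
mean_spacing_um = math.sqrt(area_per_site_um2)

print(f"Average area per site: {area_per_site_um2} square microns")
print(f"Approximate mean spacing: {mean_spacing_um} microns")
```

At this density each molecule has about one square micron to itself, so neighbouring sites are on average about one micron apart.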

By working at the single molecule level, this method also avoids the need for amplification of target sequence, allowing one-tube sample preparation for a whole genome analysis. It is the combination of these two features of ultra-high site density and amplification-free, one-tube sample preparation that creates the breakthrough in economics and throughput.

The company's goal is to determine individual subject sequence variation compared to a reference sequence, rather than de novo sequencing. For a human reference sequence, this requires read lengths of approximately 25 bases.

Solexa has developed a proprietary sequencing biochemistry, SmaSeq, that is compatible with its SMAs. It has also developed a proprietary bioinformatics system that aligns the 25-base sequencing output against a reference system.

How it works

The process begins with the extraction of genomic DNA from an individual's sample cells. In a single tube reaction, this genomic DNA is processed into single-stranded oligonucleotide fragments. These are prepared for attachment to SMAs using proprietary primer and anchor molecules.

Hundreds of millions of molecules, representing the entire genome of the individual, are then deposited and attached at discrete sites on an SMA.

Fluorescently labelled nucleotides and a polymerase enzyme are added to the array. Complementary nucleotides base-pair to the first base of each oligonucleotide fragment and are added to the primer by the enzyme. Remaining free nucleotides are removed.

Laser light of a specific wavelength for each base excites the label on the incorporated nucleotides, which fluoresce. This fluorescence is detected by a CCD camera that rapidly scans the entire array to identify the incorporated nucleotides on each fragment. Fluorescence is then removed.

The identity of the incorporated nucleotide reveals the identity of the base in the sample sequence to which it is paired.

This cycle of incorporation, detection and identification is repeated approximately 25 times to determine the first 25 bases in each oligonucleotide fragment. By simultaneously sequencing all molecules on the array, the first 25 bases for the hundreds of millions of oligonucleotide fragments are determined.
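
The cycle described above can be sketched as a simple simulation. This is a conceptual model only, not Solexa's software: the function name is hypothetical, fragments are plain strings, strand orientation is ignored, and every incorporation and detection step is assumed to be error-free.

```python
# Watson-Crick pairing used when the enzyme extends the primer.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_synthesis(fragments, cycles=25):
    """Return one read per fragment, built up one base per cycle."""
    reads = [[] for _ in fragments]
    for cycle in range(cycles):
        # Every fragment on the array is processed in the same cycle.
        for read, fragment in zip(reads, fragments):
            if cycle >= len(fragment):
                continue  # fragment is shorter than the number of cycles
            # Incorporation: the labelled nucleotide that pairs with the template base.
            incorporated = COMPLEMENT[fragment[cycle]]
            # Detection and identification: the label reveals the incorporated
            # nucleotide, and its complement is the base in the sample.
            read.append(COMPLEMENT[incorporated])
    return ["".join(read) for read in reads]

example_reads = sequence_by_synthesis(["ACGTTGCA", "GGGGGGAT"], cycles=5)
print(example_reads)
```

Because exactly one base is added per cycle, a run of identical bases such as the second example fragment is read one position at a time rather than as a single combined signal.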

These hundreds of millions of sequences are aligned and compared to the reference sequence using Solexa's proprietary bioinformatics system. Known and unknown single nucleotide polymorphisms (SNPs), together with other genetic variations, can then be readily determined.
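
Solexa's bioinformatics system is proprietary and is not described in the article. The sketch below only illustrates the general idea with a naive approach: each read is placed where it matches the reference with at most one mismatch, reads that fit more than one place are discarded, and any mismatch is reported as a candidate variant. All function names and the example sequences are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def align_read(reference: str, read: str, max_mismatches: int = 1):
    """Return the single position where the read fits, or None if absent or ambiguous."""
    hits = [
        pos
        for pos in range(len(reference) - len(read) + 1)
        if mismatches(reference[pos:pos + len(read)], read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None

def call_variants(reference: str, reads, max_mismatches: int = 1):
    """Return sorted (position, reference_base, observed_base) candidates."""
    variants = set()
    for read in reads:
        pos = align_read(reference, read, max_mismatches)
        if pos is None:
            continue  # unplaced or repetitive read: skip it
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                variants.add((pos + offset, reference[pos + offset], base))
    return sorted(variants)

toy_reference = "ACGTTGCAAGGCTA"
toy_reads = ["ACGTTG", "TGCTAG"]  # the second read carries an A-to-T change
print(call_variants(toy_reference, toy_reads))
```

A practical system would use an indexed search rather than scanning every position, and would require several independent reads to agree before reporting a variant.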

The result, maintains the company, is a technique with the fundamental economics and scope of methodology to enable valuable target discovery, pharmacogenomics and personalised healthcare applications.
