Validating next-generation sequencing-based genetic tests

8th June 2015

Posted By Paul Boughton

Fig. 1. DNASTAR’s dedicated data flow. Screenshot from Lasergene Genomics Suite.

Frederick R. Blattner & Tim Durfee showcase software for auto-analysis of internal benchmarking controls

As next-generation sequencing (NGS) makes genetic testing for a wide range of human diseases increasingly commonplace, facile methods for validating the efficacy of those tests are essential.

In the USA, federal regulatory standards embodied in the Clinical Laboratory Improvement Amendments (CLIA), for instance, are designed to ensure that these tests reliably achieve certain performance specifications in terms of accuracy, precision and analytical sensitivity and specificity.

To facilitate the validation process, the National Institute of Standards and Technology (NIST), through the Genome in a Bottle Consortium (GIAB), developed a highly curated set of genome-wide reference materials for the HapMap/1000 Genomes CEU female, NA12878. These materials include BED and VCF files of high confidence sequence regions and variant calls, respectively. NA12878 genomic DNA and a cell line are available (Coriell Institute), providing laboratories with an internal control for their processing and analysis pipeline. Comparing testing results to the GIAB call sets establishes both the analytical performance needed for regulatory certification and the appropriate assembly thresholds to apply when considering potential variants in patient samples.

Computational challenge

For clinical sequencing laboratories to efficiently leverage resources such as the GIAB call sets, assembly and analysis software must support the unique aspects of validation control processing.

For example, analyses should be restricted to the intersection between the GIAB high confidence regions and the target regions of the specific test. Variant reporting should also match GIAB call set conventions wherever possible to avoid underestimating the accuracy.
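As an illustration of the intersection step, the following is a minimal Python sketch, not DNASTAR's implementation. It assumes each BED file has already been parsed into a list of (chromosome, start, end) tuples in BED's 0-based, half-open coordinates, and that intervals within each list do not overlap one another. The function name `intersect_bed` and the example coordinates are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end) in BED-style 0-based, half-open coordinates
Interval = Tuple[str, int, int]


def intersect_bed(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both interval lists.

    Assumes the intervals within each list do not overlap one another.
    """
    a = sorted(a)
    b = sorted(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is behind in chromosome sort order.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start = max(start_a, start_b)
        end = min(end_a, end_b)
        if start < end:
            out.append((chrom_a, start, end))
        # Advance the interval that ends first.
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return out


# Hypothetical coordinates, for illustration only.
high_confidence = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
test_targets = [("chr1", 150, 350), ("chr2", 0, 60)]
intersected = intersect_bed(high_confidence, test_targets)
```

In practice, laboratories commonly perform this step with an established interval tool rather than custom code; the sketch only shows the logic.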

Additionally, data processing and statistical calculations should be rapid, automated and reported in an easily interpreted form. Most clinical laboratories lack the bioinformatics expertise needed to build software pipelines capable of handling these challenges.

Software workflow

DNASTAR has developed a dedicated workflow within its Lasergene Genomics suite for clinical sequencing labs to utilise the NA12878 reference materials to validate their NGS-based genetic tests (Fig. 1). Within the SeqMan NGen wizard, users specify: NGS reads from their processed NA12878 sample; the human genome reference version (NA12878 reference materials are in GRCh37 coordinates); an intersected BED file between their targeted regions and GIAB’s NA12878 high confidence regions; and GIAB’s NA12878 VCF file of high confidence variant calls. At runtime, the VCF is filtered down to those positions delineated by the intersected BED file.
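The runtime filtering of the VCF can be pictured with a short Python sketch. This is an assumption about how such a filter might look, not a description of SeqMan NGen's internals. It relies on two format facts: VCF positions are 1-based, while BED intervals are 0-based and half-open. The helper names (`build_index`, `in_regions`, `filter_vcf_lines`) are hypothetical, and the BED intervals are assumed not to overlap.

```python
import bisect
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) BED intervals by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in regions:
        index[chrom].append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index


def in_regions(index, chrom, pos):
    """True if a 1-based VCF position lies inside a 0-based half-open BED interval."""
    intervals = index.get(chrom, [])
    zero_based = pos - 1
    # Find the last interval whose start is <= the position.
    k = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return k >= 0 and intervals[k][0] <= zero_based < intervals[k][1]


def filter_vcf_lines(lines, index):
    """Yield header lines plus records whose position falls in the indexed regions."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(index, chrom, int(pos)):
            yield line


# Hypothetical records, for illustration only.
index = build_index([("chr1", 150, 200)])
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t150\t.\tA\tG",   # zero-based 149: outside the interval
    "chr1\t151\t.\tC\tT",   # zero-based 150: first base inside
    "chr1\t200\t.\tG\tA",   # zero-based 199: last base inside
    "chr1\t201\t.\tT\tC",   # zero-based 200: outside
]
kept = list(filter_vcf_lines(vcf_lines, index))
```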

The data is then assembled against the human genome reference sequence using SeqMan NGen running on a standard desktop computer. Fully gapped alignments are analysed in-stream using a modified version of the MAQ variant caller to produce variant and reference call files for each position in the intersected BED file. Key metrics, including the depth of coverage and probability scores, are recorded for each position. Assemblies can be visualised in SeqMan Pro, allowing evidence for variants and regions of low coverage to be assessed.

For accuracy calculations, variant and reference call files are automatically loaded with the filtered VCF file into ArrayStar post-assembly. Only positions within the intersected BED are considered. Positions are classified as: true positives (TP), called variants also present in the VCF file; false positives (FP), called variants not in the VCF file; true negatives (TN), called reference bases not in the VCF file; and false negatives (FN), called reference bases that are present in the VCF file. The counts of each class are then used to calculate various accuracy metrics, including the true positive rate (TPR, 'sensitivity'), the true negative rate (TNR, 'specificity') and the false discovery rate (FDR).
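The classification and the three named metrics can be sketched in a few lines of Python. The formulas are the standard definitions: TPR = TP/(TP+FN), TNR = TN/(TN+FP) and FDR = FP/(FP+TP). The data structures are simplifying assumptions. Here each position carries only a variant-or-reference flag, whereas a real comparison would also need to check that the called allele and genotype match the VCF entry. The function names are hypothetical and do not reflect ArrayStar's interface.

```python
def classify(calls, truth_variants):
    """Count TP, FP, TN and FN over the called positions.

    calls: dict mapping position -> True (variant called) or False (reference called).
    truth_variants: set of positions present in the filtered VCF.
    Simplification: matches on position only, not on allele or genotype.
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for pos, is_variant in calls.items():
        in_truth = pos in truth_variants
        if is_variant:
            counts["TP" if in_truth else "FP"] += 1
        else:
            counts["FN" if in_truth else "TN"] += 1
    return counts


def accuracy_metrics(counts):
    """Compute TPR (sensitivity), TNR (specificity) and FDR from class counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    tp, fp, tn, fn = (counts[k] for k in ("TP", "FP", "TN", "FN"))
    return {
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, tn + fp),
        "FDR": ratio(fp, fp + tp),
    }


# Hypothetical positions, for illustration only.
calls = {1: True, 2: True, 3: False, 4: False, 5: False}
truth = {1, 4}
counts = classify(calls, truth)
metrics = accuracy_metrics(counts)
```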

Summary report

A summary report is produced with the absolute number of positions in each class and the corresponding statistics stratified based on two thresholds: minimum depth of coverage that a position must match or exceed to be considered in the analysis; and minimum p-value a position must have to be considered a variant. Three p-values ('Pnotref' of 90.0%, 99.0% and 99.9% corresponding to phred-like scores of 10, 20 and 30, respectively) are employed for each of 12 different depths of coverage cutoffs (ranging from 1 to 100). The number and percentage of targeted bases meeting each depth cutoff are also presented.
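The correspondence between the 'Pnotref' thresholds and phred-like scores follows the usual phred relationship, Q = -10 log10(1 - P). The Python sketch below checks the three quoted values and shows how a position might be tested against one pair of thresholds. The function names and the example depth cutoff are hypothetical, since the 12 depth cutoffs are not listed here.

```python
import math


def phred_from_pnotref(p_not_ref):
    """Convert a probability of being non-reference into a phred-like score."""
    if p_not_ref >= 1.0:
        return float("inf")
    return -10.0 * math.log10(1.0 - p_not_ref)


def passes_thresholds(depth, p_not_ref, min_depth, min_p_not_ref):
    """True if a position meets both the depth and the probability threshold."""
    return depth >= min_depth and p_not_ref >= min_p_not_ref


# The three probability thresholds quoted in the text.
scores = {p: round(phred_from_pnotref(p)) for p in (0.90, 0.99, 0.999)}

# Hypothetical position and depth cutoff, for illustration only.
accepted = passes_thresholds(depth=35, p_not_ref=0.995, min_depth=20, min_p_not_ref=0.99)
rejected = passes_thresholds(depth=35, p_not_ref=0.995, min_depth=20, min_p_not_ref=0.999)
```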

Lessons learned

NGS-based genetic tests promise to greatly enhance patient care in this era of personalised medicine. Given their potential impact on treatment decisions, it is critical that those tests be appropriately validated as part of their regulatory approval and as a measure of routine assessment of the test’s performance. DNASTAR’s validation control workflow using NIST/GIAB’s NA12878 reference materials greatly facilitates this process.

For more information at

Frederick R. Blattner & Tim Durfee are with DNASTAR.



