Whole genome MLST analysis

A universal typing approach for bacteria with many advantages. By Katrien De Bruyne, Bruno Pot & Hannes Pouseele

Next-generation sequencing (NGS) technologies are rapidly replacing classical Sanger sequencing. The falling cost and increasing speed of NGS make whole genome sequencing (WGS) a very attractive alternative, not only for research but also for routine analyses such as epidemiology and outbreak surveillance. From a single assay, WGS can deliver many traditional typing results, such as MLST, rMLST, spa typing, spoligotyping and SNP-set based typing. This provides a seamless link between classical and new knowledge bases, without the additional cost and effort of retrospective sequencing. Moreover, as the volume of available sequencing data increases, functional prediction (for example, resistance or virulence prediction based on the presence or absence of the genes involved) is becoming more reliable, yielding critical information for surveillance and for therapeutic decisions.

Starting from WGS data, there are two main methodologies to obtain interpretable information: whole genome single nucleotide polymorphism (wgSNP) analysis and whole genome multilocus sequence typing (wgMLST) analysis. Because the latter provides functional information more readily and is more stable for long-term applications, wgMLST is increasingly being considered for subtyping at any desired taxonomic level.

wgMLST uses WGS data (assembled or not) to perform MLST [1] on a genome-wide scale. For each sample, locus presence is analysed and, when present, allele variants are determined. For each locus, new sequences are assigned new consecutive allele numbers (see Fig. 1).
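The allele numbering described above can be illustrated with a minimal sketch. The class name `AlleleDatabase`, the function `profile` and the locus names are hypothetical and chosen only for illustration. Exact sequence matching is assumed, and a missing locus is encoded as `None`. A production allele database would also handle quality checks and curation.

```python
from collections import defaultdict


class AlleleDatabase:
    """Toy allele registry (hypothetical): one numbering sequence per locus."""

    def __init__(self):
        # locus name -> {allele sequence -> allele number}
        self._alleles = defaultdict(dict)

    def call(self, locus, sequence):
        """Return the allele number of `sequence` at `locus`.

        None means the locus is absent in the sample. A sequence that has
        not been seen before receives the next consecutive number.
        """
        if sequence is None:
            return None
        known = self._alleles[locus]
        seq = sequence.upper()
        if seq not in known:
            known[seq] = len(known) + 1
        return known[seq]


def profile(db, sample_loci, scheme):
    """Build a wgMLST profile: locus -> allele number, or None if absent."""
    return {locus: db.call(locus, sample_loci.get(locus)) for locus in scheme}


if __name__ == "__main__":
    db = AlleleDatabase()
    scheme = ["locA", "locB"]  # hypothetical locus names
    print(profile(db, {"locA": "ATGAAA", "locB": "ATGCCC"}, scheme))
    print(profile(db, {"locA": "ATGAAG"}, scheme))
```

The first sample receives allele 1 at both loci. The second sample carries a new sequence at `locA`, which becomes allele 2, and lacks `locB` altogether.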

In contrast to a wgSNP analysis, wgMLST is based on the concept of allelic variation, as the identity or non-identity of complete coding regions between strains is considered. This implies that recombination and deletions or insertions of multiple positions are counted as single evolutionary events. This approach might be biologically more relevant than approaches that consider only point mutations. 
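The difference between counting alleles and counting point mutations can be made concrete with a small sketch. Both function names are hypothetical. Skipping loci that are missing in either profile is an assumption made here, as the article does not state how absent loci are scored.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci called in both profiles whose allele numbers differ.

    Loci absent (None) or missing in either profile are skipped; this is
    an assumption of the sketch, not a rule taken from the article.
    """
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


def point_differences(seq_a, seq_b):
    """Count mismatching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Two alleles of one hypothetical locus that differ in a 3-base stretch,
    # as might result from a single recombination event.
    allele_1 = "ATGAAACCC"
    allele_2 = "ATGTTTCCC"
    print(point_differences(allele_1, allele_2))  # three point differences
    print(allelic_distance({"locA": 1, "locB": 4}, {"locA": 2, "locB": 4}))  # one allelic difference
```

A point-based count sees three changes, whereas the allelic comparison records one event at `locA`.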

The advantages of wgMLST over wgSNP

As loci generally correspond to functional genes, a wgMLST profile can be used as the basis for preliminary genotypic and (to some extent) phenotypic interpretations: starting from the complete wgMLST profile, different subschemes can be defined, composed of loci that have known functionality or mimic traditional typing schemes.
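Deriving a subscheme from a complete profile amounts to selecting a subset of loci. The sketch below assumes a profile stored as a mapping from locus name to allele number. The locus names and subscheme contents are placeholders, not real scheme definitions.

```python
def apply_subscheme(profile, subscheme_loci):
    """Restrict a wgMLST profile to the loci of a subscheme, in scheme order.

    Loci of the subscheme that were not called in the sample map to None.
    """
    return {locus: profile.get(locus) for locus in subscheme_loci}


# Hypothetical subschemes: placeholder locus names, for illustration only.
SUBSCHEMES = {
    "classical_mlst": ["locus_0003", "locus_0017", "locus_0042"],
    "resistance": ["locus_1201", "locus_1202"],
}

if __name__ == "__main__":
    wgmlst_profile = {
        "locus_0003": 5,
        "locus_0017": 2,
        "locus_0042": 11,
        "locus_0100": 1,
        "locus_1201": 3,
    }
    for name, loci in SUBSCHEMES.items():
        print(name, apply_subscheme(wgmlst_profile, loci))
```

In this example the resistance subscheme reports one locus as present with allele 3 and the other as absent, which is the kind of presence or absence information a preliminary phenotypic interpretation could build on.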

In contrast to wgSNP, where reference sequence(s) used to analyse a particular set of samples should be phylogenetically as close as possible to the samples, wgMLST is based on a single reference set that captures the complete known diversity of a taxon. The wgMLST scheme, that is, the set of loci used in a wgMLST analysis, can be extended as more genomes are analysed and new genes (or loci) are detected (Fig. 2).

The upfront determination of the wgMLST scheme leads to stability, in the sense that, in contrast to wgSNP, adding new samples does not have an influence on the existing information. The ‘pan-genome’ approach has the advantage of maximising resolution in any sample comparison. When comparing closely related isolates, the pan-genome scheme aims at covering over 95% of the genes in each isolate, whereas for taxon-wide comparisons the pan-genome scheme naturally reduces to the ‘core genome’ subset, that is, a set of loci common to over, for instance, 95% of the strains belonging to the taxon considered.
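The reduction from a pan-genome scheme to a core genome subset can be sketched as a simple presence filter. The function name `core_loci` is hypothetical, and the 95% threshold is the illustrative figure from the text rather than a fixed standard. Profiles are assumed to map locus names to allele numbers, with `None` for absent loci.

```python
from collections import Counter


def core_loci(profiles, threshold=0.95):
    """Return the loci present in more than `threshold` of the profiles.

    A locus counts as present when its allele call is not None.
    """
    if not profiles:
        return set()
    presence = Counter()
    for prof in profiles:
        presence.update(locus for locus, allele in prof.items() if allele is not None)
    total = len(profiles)
    return {locus for locus, count in presence.items() if count / total > threshold}


if __name__ == "__main__":
    # Ten hypothetical strains: locA is always present, locB is absent in one.
    strains = [{"locA": 1, "locB": 1} for _ in range(9)]
    strains.append({"locA": 2, "locB": None})
    print(core_loci(strains))
```

Here `locB` is present in only 90% of the strains and therefore drops out of the core subset, while it remains part of the pan-genome scheme.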

A core scheme has been shown to have high epidemiological relevance and to be extremely stable over time. It is therefore becoming the preferred tool for long-term monitoring, surveillance and outbreak investigation [2, 3].

The possible pitfalls

As was the case with classical MLST, the choice and definition of the loci contained in the scheme are of paramount importance for a noise-free analysis of WGS data. While the wgMLST scheme itself is based on the genes present in the taxon, the whole genome perspective extends the number of available loci far beyond what can be managed manually. There is therefore a need for an automated tool able to create and evaluate a reliable wgMLST scheme [3] (Fig. 2).

Moreover, any scheme creation procedure not only needs to reliably define loci, but also needs to be able to predict and avoid possible locus convergence, as it is likely that, upon addition of new sequences, biological variation might lead to allelic variants with high similarity, albeit derived from different loci.

The locus detection procedures are also of great importance (Fig. 2). We therefore propose a two-tier approach. Assembly-based methodologies identify alleles from de novo assembled genomes using BLAST [1]. The assembly procedure itself is computationally intensive, but it is especially useful for extrinsic validation of the allele calls, such as in silico PCR, in silico hybridisation or synteny-based validation. However, because de novo assemblies are drafts, some loci can and will be missed. Moreover, de novo assembly does not reconstruct multi-copy loci in a well-defined way, so such loci are poorly detected from de novo contigs. An additional assembly-free method [4] should therefore be used to compensate for these assembly artefacts (Fig. 2). This method is computationally less intensive, gives a clearer definition of missing loci (as they are now missing from the reads rather than from the de novo assembly), and is designed to be exhaustive (as multi-copy loci are picked up as separate allele calls). It also provides invaluable quality control metrics for detecting contamination.
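The article does not specify how the calls from the two tiers are combined. The sketch below is therefore purely an assumption: one possible reconciliation rule in which agreeing calls are marked as confirmed, calls found by a single tier are kept with a note, and disagreements are flagged for review. The function name `reconcile` and the status labels are hypothetical. Each tier is assumed to report a set of allele numbers per locus, so that multi-copy loci can carry more than one call.

```python
def reconcile(assembly_based, assembly_free):
    """Merge per-locus allele calls from two detection tiers (hypothetical rule).

    Each input maps locus -> set of allele numbers; an empty set or a missing
    key means the tier did not find the locus.
    Returns locus -> (set of allele numbers, status label).
    """
    merged = {}
    for locus in sorted(assembly_based.keys() | assembly_free.keys()):
        from_assembly = set(assembly_based.get(locus, ()))
        from_reads = set(assembly_free.get(locus, ()))
        if not from_assembly and not from_reads:
            merged[locus] = (set(), "absent")
        elif from_assembly == from_reads:
            merged[locus] = (from_assembly, "confirmed")
        elif not from_assembly:
            merged[locus] = (from_reads, "assembly-free only")
        elif not from_reads:
            merged[locus] = (from_assembly, "assembly-based only")
        else:
            # The tiers disagree: keep every call and flag the locus.
            merged[locus] = (from_assembly | from_reads, "conflict")
    return merged


if __name__ == "__main__":
    assembly_calls = {"locA": {1}, "locB": {3}, "locC": set()}
    read_calls = {"locA": {1}, "locB": {4}, "locC": {2, 7}}
    for locus, (alleles, status) in reconcile(assembly_calls, read_calls).items():
        print(locus, sorted(alleles), status)
```

In this example `locC` is missed by the assembly but found twice in the reads, the situation the text describes for multi-copy loci.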

Once locus definitions are standardised, wgMLST is a highly unambiguous and portable method. Materials required for sequence typing can easily be exchanged between laboratories, as the method is reproducible and scalable. wgMLST can be automated, combining advances in high throughput sequencing and bioinformatics with established population genetics techniques. Most importantly, wgMLST data can be used to investigate evolutionary relationships among bacteria and provide close-to-ultimate discriminatory power to differentiate isolates.

For more information, visit www.scientistlive.com/eurolab

Katrien De Bruyne, Bruno Pot & Hannes Pouseele are with Applied Maths in Belgium.

References:
1. Jolley K.A., Maiden M.C. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics. 2010 Dec 10;11:595. doi: 10.1186/1471-2105-11-595
2. Leopold S.R., Goering R.V., et al. Bacterial whole-genome sequencing revisited: portable, scalable, and standardized analysis for typing and detection of virulence and antibiotic resistance genes. J Clin Microbiol. 2014 Jul;52(7):2365-70
3. Roisin S., Gaudin C., et al. Abstract O252, Session Staphylococcus, ESCMID, April 29 2015, Copenhagen. https://www.escmid.org/escmid_library/online_lecture_library/material/?mid=22445
4. Pouseele H. Method of typing nucleic acid or amino acid sequences based on sequence analysis. European patent 2502593 (pending).
