A note to explain the context of the new paper by Heng Li, myself, and others, “New synthetic-diploid benchmark for accurate variant calling evaluation”, available as a preprint on bioRxiv.
Developing new tools and algorithms for genome analysis relies heavily on the availability of so-called “truth sets” that are used to evaluate performance (accuracy, sensitivity, etc.). This has long been a sticking point, though recently the situation has improved dramatically with the availability of several public, high-quality truth sets such as Genome In A Bottle from NIST and Platinum Genomes from Illumina. Yet even these resources, which have been produced through painstaking analysis and curation, are not immune to the lack of “orthogonality” that plagues most available truth sets: chief among their problems is that the failure modes of Illumina sequencing are usually masked out, so the resulting data are biased towards the easier parts of the genome.
The paper I linked above introduces a new dataset that we developed to be less biased. It is based solely on PacBio sequencing, and thus its error modes are less correlated with Illumina’s. Using this dataset for benchmarking has given us high confidence in the accuracy of our validations and has enabled us to improve our methods with less concern about overfitting.
Truth data (for germline DNA methods) tend to be derived from two sources: synthetic data (that is, computer-generated), or Illumina (and other) sequencing of a particular sample called NA12878. Both of these sources are deeply flawed and, ultimately, not good enough. First, it is virtually impossible to create synthetic data that truly resemble the results of sequencing actual biological tissue, for several reasons: the reference is an approximation, and the effects of sample extraction, library construction, and sequencing are really hard to model accurately. As for NA12878, our biggest issue is that we simply love this sample too much! Nearly all of NA12878’s variants are present in our resource files (dbSNP, the training files for VQSR, etc.). When we evaluate our methods’ performance on NA12878, we cannot really trust the results, since we have been using the answer all along.

Furthermore, the NIST and Platinum Genomes truth sets are each restricted to a subset of the genome that they consider the “confidence region”. This region is defined differently in the two datasets, but in both cases it depends on the performance of Illumina sequencing of NA12878 (among other things). This has the perverse effect that the results reflect performance only in the easier-to-sequence-and-analyze parts of the genome, falsely inflating our self-confidence and giving no blame or credit for performance in the harder regions of the genome.
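To make the confidence-region issue concrete, here is a minimal sketch (with made-up intervals and calls, not the actual NIST or Platinum Genomes tooling) of what restricting an evaluation to a confidence region does: any call that falls outside the region, right or wrong, is simply dropped before scoring.

```python
# Hypothetical illustration of confidence-region masking; intervals and calls are made up.

confidence_region = {
    "chr1": [(10_000, 50_000), (70_000, 120_000)],  # "easy" intervals only
}

calls = [               # (chrom, pos) of calls from some short-read pipeline
    ("chr1", 12_345),   # inside the region    -> will be scored
    ("chr1", 60_000),   # hard region, outside -> silently ignored
    ("chr1", 80_000),   # inside the region    -> will be scored
]

def in_confidence_region(chrom, pos, region=confidence_region):
    return any(start <= pos < end for start, end in region.get(chrom, []))

scored  = [c for c in calls if in_confidence_region(*c)]
ignored = [c for c in calls if not in_confidence_region(*c)]

# Calls in `ignored` earn neither blame nor credit, no matter how the pipeline
# actually behaved there -- which is exactly the bias described above.
print(f"scored: {len(scored)} calls, ignored: {len(ignored)} calls")
```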
The “Synthetic-diploid” (or, as we affectionately call it, SynDip) is generated from two human cell lines (CHM1 and CHM13, PacBio-sequenced and assembled by others) that were derived from Complete Hydatidiform Moles. This rare and devastating condition results in a non-viable collection of cells that is almost entirely homozygous. The homozygosity means that the PacBio sequencing and assembly are much more trustworthy, as there are no heterozygous sites to confuse the assembly: any confusion is almost certainly due to sequencing error and can therefore be masked out. To make use of this, we aligned the CHM1 and CHM13 assemblies to the hg38 reference and created a VCF and a confidence region that characterize the variation that a 50-50 mixture of the two cell lines would have. At the same time, we also sequenced and aligned such a 50-50 mixture using our WEx and WGS protocols on Illumina. So, to be clear, the name is somewhat misleading: the only “synthetic” part of SynDip is that it is synthetically diploid; in all other respects it is as natural as can be, since it was generated from live cells using regular sequencing protocols.
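To illustrate the idea (this is a toy sketch, not the actual assembly-to-VCF pipeline used in the paper), once each essentially-haploid assembly has been aligned to hg38 and its allele at a given site is known, the diploid truth genotype of the 50-50 mixture follows mechanically from the two haploid alleles:

```python
# Toy sketch: deriving the diploid truth genotype of a 50-50 CHM1/CHM13 mixture
# from the two (haploid) assembly-vs-reference alleles at one site.
# Function name and the tiny example data are hypothetical.

def mixture_genotype(ref, chm1_allele, chm13_allele):
    """Return (alleles, GT string) for the synthetic diploid at one site."""
    if chm1_allele == chm13_allele:
        if chm1_allele == ref:
            return [ref], "0/0"              # both match the reference: hom-ref
        return [ref, chm1_allele], "1/1"     # same non-reference allele: hom-alt
    if chm1_allele == ref or chm13_allele == ref:
        alt = chm13_allele if chm1_allele == ref else chm1_allele
        return [ref, alt], "0/1"             # one cell line carries the alt: het
    # the two cell lines carry two different non-reference alleles
    return [ref, chm1_allele, chm13_allele], "1/2"

# Example: CHM1 matches the reference, CHM13 carries a SNP -> heterozygous truth call.
print(mixture_genotype("A", "A", "G"))   # (['A', 'G'], '0/1')
print(mixture_genotype("A", "T", "T"))   # (['A', 'T'], '1/1')
print(mixture_genotype("A", "C", "G"))   # (['A', 'C', 'G'], '1/2')
```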
Since the CHM dataset was generated using PacBio data alone, with no consideration for the flaws of Illumina’s short-read technology, there should be less correlation between the failure modes of our methods on the short-read data and SynDip’s confidence regions. This gives us better, more trustworthy truth data and removes much of the uncertainty, defusing our natural tendency to “look under the lamp” and to overfit our methods.
And beyond that, it empowers us to push our method development further by exposing large tracts of the reference where our methods (and not only ours!) do not perform well -- and provides us with a more truthful picture of what lies in those regions. Here are the main ways we have used this resource to that end:
- We have used the insights gained from applying our filtering methods to the SynDip data, which reveal flaws in their performance, to design better filtering architectures and to fine-tune existing ones. (More on this in a future post...)
- We have used the dataset to assess new variant calling methods for CNVs and SVs.
- We have used it to compare different analysis pipelines and determine whether there is a significant difference between them (e.g., what is the effect of running BQSR over and over again? Answer: not much beyond the first run). A sketch of this kind of comparison appears after this list.
- We are currently using it to develop the next version of our joint-calling pipeline, which will be able to joint-call more than 100K genomes (!!!)
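Here, as promised above, is a hedged sketch of the pipeline-comparison use case. The file names are placeholders, and the position-plus-allele matching is deliberately naive; a real comparison against the SynDip truth should normalize variant representations first (e.g. with tools such as hap.py or RTG vcfeval) and should be restricted to the confidence region.

```python
import gzip

def load_variant_keys(vcf_path):
    """Collect simple (chrom, pos, alt) keys from a possibly-gzipped VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    keys = set()
    with opener(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, _ref, alt = line.rstrip("\n").split("\t")[:5]
            keys.add((chrom, int(pos), alt))
    return keys

def precision_sensitivity(calls, truth):
    """Naive precision/sensitivity from exact key overlap."""
    tp = len(calls & truth)
    precision = tp / len(calls) if calls else 0.0
    sensitivity = tp / len(truth) if truth else 0.0
    return precision, sensitivity

# Placeholder file names: a truth VCF and two pipelines' call sets, all assumed
# to be already subset to the SynDip confidence region.
truth = load_variant_keys("syndip.truth.confident.vcf.gz")
for call_set in ("pipeline_A.confident.vcf.gz", "pipeline_B.confident.vcf.gz"):
    p, s = precision_sensitivity(load_variant_keys(call_set), truth)
    print(f"{call_set}: precision={p:.4f} sensitivity={s:.4f}")
```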
One thing that the current CHM dataset doesn’t help us do is develop better lab methods. This is because the CHM cell lines are not currently commercially available, so the technology companies cannot test their new protocols and technologies on them. Hopefully this will eventually become possible, which would enable us to explore hard-to-sequence regions of the genome.
If you are a method developer, or you are in a position to evaluate the performance of various pipelines, we encourage you to check out the CHM dataset, and we hope it will help you develop new methods and pipelines! In the future we plan to share more data from the CHM cell lines and to make publicly available the methods we use to evaluate our own methods and data.