We are starting official support of GRCh38, a reference genome with alternate contigs.
In fact, going forward all of our new projects will use GRCh38. During this transition over the coming year, we will keep supporting GRCh37/hg19. Here are nine takeaways to help you get started in using the latest reference.
1. GRCh38 is special because it has alternate contigs that represent population haplotypes.
Don’t know alternate contig from alternate dimension? Spend five minutes now to review terminology in our Dictionary entry Reference Genome Components. At the least, you should understand the distinction between the primary assembly and alternate contigs.
Long BAM headers notwithstanding, GRCh38 alternate contig sequences are only ~3.6% of the primary assembly length (see table). They encompass alternate haplotypes for which we cannot easily represent variants on the primary assembly. According to my estimation, roughly a tenth of a percent (101,845 basepairs) of the alternate sequence appears highly divergent.
2. The GRCh38 analysis set hard-masks regions and provides decoy contigs for optimal read mapping.
Download your own analysis reference set from the GATK resource bundle. Be certain you are mapping to a version of the genome that hard-masks, i.e. replaces with Ns, the Y chromosome pseudoautosomal regions (PARs). Imagine the SHOX of not being able to call variants for PARs.
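If you want to confirm the hard-masking for yourself, a minimal Python sketch along these lines could help. It assumes you have already loaded the chromosome sequence into a string and that you supply the PAR coordinates from your assembly's documentation. The function name and the toy sequence are made up for illustration.

```python
def is_hard_masked(sequence: str, start: int, end: int) -> bool:
    """Return True if every base in the 1-based, inclusive interval is N."""
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("interval falls outside the sequence")
    region = sequence[start - 1:end]
    return set(region.upper()) == {"N"}


# Toy 'chromosome' (not real data): positions 1-10 are masked, the rest are not.
toy_chrY = "N" * 10 + "ACGTACGTAC"

print(is_hard_masked(toy_chrY, 1, 10))  # interval lies fully inside the masked stretch
print(is_hard_masked(toy_chrY, 5, 15))  # interval runs into unmasked bases
```

For a real reference you would read the Y chromosome record from the FASTA and call the function once per PAR interval.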
3. The challenge alternate contigs present is a familiar one.
Conceptually it rewraps and regifts the challenge of calling variants for paralogous regions of the genome. The difference is that alternate contigs encompass sequence that is homologous to, as well as highly divergent from, the primary assembly, and that this variation is for loci across a population instead of across a genome. By definition, we cannot easily represent the variants alternate haplotypes generate against the primary assembly. And so GRCh38 arms us with named alternate contigs that beg to be used when we call their variants. How folks choose to do this with the leeway given by VCF specifications will depend on research aims.
4. Latest versions of BWA-MEM handle GRCh38 alternate contig mappings.
You want to map in an alt-aware manner, i.e. you want your alts handled. Without the handling, you’ll just get a bunch of MAPQ zero ghost reads mapping to both (i) the primary assembly regions that have alternate contigs and (ii) the homologous alternate contig regions. Just as you cannot eat ghost chips, GATK tools refuse to consider zero (and low) MAPQ alignments. No. You. Do. Not. Want. This. Make sure to update to BWA-MEM version 0.7.13+ to be able to map with alt-handling. I’m partial to calling it ghost-busting. Alt-handling enables two things. First, because it prioritizes alignments on the primary assembly by disappearing alignments from the alternate contigs, it effectively lets you avoid redundantly calling variants on homologous regions of alternate loci. Second, it allows for an additional postalt-processing step that populates the alt contig(s) of multiple alt loci with nonzero MAPQ alignments. This enables super-charged variant calling on all the alt contigs. For details, read BWA’s alt-specific README-alt. Although the README currently is marked for an earlier version of the tool, its concepts still apply.
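To see whether an alignment file suffers from ghost reads, you could tally the MAPQ zero alignments by contig type. Here is a minimal Python sketch that works on SAM text lines. It assumes alternate contig names end in "_alt", as they do in the GRCh38 analysis set, and the toy records below are invented for illustration.

```python
from collections import Counter


def tally_mapq_zero(sam_lines):
    """Count MAPQ-zero alignments on alt contigs versus all other contigs."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & 0x4:  # skip unmapped reads
            continue
        if mapq == 0:
            # Assumption: alt contig names end in "_alt".
            counts["alt" if rname.endswith("_alt") else "primary"] += 1
    return counts


# Toy SAM records (not real data): one ghost on the primary assembly,
# one on an alt contig, one confident alignment and one unmapped read.
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr19\t1000\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tchr19_KI270866v1_alt\t100\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t0\tchr19\t5000\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

print(dict(tally_mapq_zero(toy_sam)))
```

A large MAPQ zero count on both sides for loci that have alternate contigs would hint that the mapping was not alt-aware.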
5. Alt-handling requires the SAM format ALT index file.
Special handling requires a special index file: alt-handling requires that an ALT index be available alongside the other BWA indexes. Heng Li provides the ALT index for GRCh38 in the Linux bwa.kit v0.7.15. Find the hs38DH.fa.alt file in the resource-GRCh38 folder and explore it using Samtools to confirm the following.
- 3,177 total records
- 792 mapped records, of which six are supplementary, that correspond to alternate contigs
  - 528 records for HLA contigs (3 supplementary)
  - 264 records for non-HLA alt contigs (3 supplementary)
- 2,385 unmapped records that correspond to decoy contigs. These exclude the EBV contig, which the index considers a part of the primary assembly.
Each mapped record lists a CIGAR string, some of which are rather convoluted, that aligns the alternate contig back to its primary assembly locus. Six of the alternate contigs have two alignments each, hence the six supplementary records.
The decoys contain transposable and alpha satellite elements including diverged variants. Why are they represented in the ALT index? See the next takeaway.
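If you prefer scripting to Samtools, the same tallies can be sketched in a few lines of Python, because the ALT index is plain SAM text. This sketch assumes HLA contig names start with "HLA-" and relies on the standard SAM FLAG bits for unmapped (0x4) and supplementary (0x800) records. The toy records are invented and stand in for the real file.

```python
def summarize_alt_index(lines):
    """Tally ALT index (SAM format) records by mapping status."""
    summary = {"total": 0, "mapped": 0, "supplementary": 0, "hla": 0, "unmapped": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip blanks and headers
            continue
        fields = line.split("\t")
        qname, flag = fields[0], int(fields[1])
        summary["total"] += 1
        if flag & 0x4:  # unmapped record, i.e. a decoy in this index
            summary["unmapped"] += 1
            continue
        summary["mapped"] += 1
        if flag & 0x800:  # supplementary alignment
            summary["supplementary"] += 1
        if qname.startswith("HLA-"):  # assumption: HLA contig naming
            summary["hla"] += 1
    return summary


# Toy records (not real data): an alt contig with two alignments,
# one HLA contig and one decoy.
toy_alt_index = [
    "toyA_alt\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "toyA_alt\t2048\tchr1\t500\t60\t20M30H\t*\t0\t0\t*\t*",
    "HLA-TOY*01:01\t0\tchr6\t200\t60\t40M\t*\t0\t0\t*\t*",
    "toyB_decoy\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

print(summarize_alt_index(toy_alt_index))

# The counts quoted in the post are self-consistent:
assert 528 + 264 == 792 and 3177 - 792 == 2385
```

To run it on the real index, pass an open file handle for hs38DH.fa.alt instead of the toy list.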
6. New Tutorial#8017 shows how to map to GRCh38 with alt-handling and then some.
Tutorial#8017 starts with indexing the reference, reiterates that the ALT index is essential and then maps simulated reads to a miniature reference in an alt-aware manner. It then goes on to show how to postalt-process alignments using the bwa-postalt.js script. The tutorial does not tell you what to do per se, but rather shows what happens when you use certain options. You definitely want to read sections 5–6 if you plan on calling variants on alternate contigs.
During postalt-processing, two reshufflings take place. First, reads that can map to both a primary locus and an alternate locus receive non-zero MAPQ alignments at both. These multimappers are supplementary on the alt. Second, if an alignment on the primary assembly aligns better on a decoy contig, then its alignment on the primary assembly is deprioritized with a zero MAPQ score. The tutorial gives an example of the first reshuffle. For those interested in seeing the second reshuffle, I have a suggestion. Change the mini-reference’s single ALT index record to mimic that of a decoy, i.e. change it to an unmapped record, then see what happens when you postalt-process.
If your research aims require one of the reshufflings but not the other, or selective handling for particular loci, then one approach could be to modify the ALT index for the selective postalt-processing.
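As a starting point for such experiments, here is a minimal Python sketch that rewrites a mapped ALT index record as an unmapped one. It assumes that an unmapped SAM record (FLAG 4, no reference name, position or CIGAR) is what mimics a decoy, and it drops optional tags because they describe the alignment being removed. The function name and the toy record are hypothetical, and how bwa-postalt.js treats the edited index is exactly what the experiment would show.

```python
def to_unmapped_record(sam_line: str) -> str:
    """Rewrite one mapped SAM record as an unmapped record."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs at least 11 tab-separated fields")
    fields[1] = "4"  # FLAG: segment unmapped
    fields[2] = "*"  # RNAME: no reference sequence
    fields[3] = "0"  # POS: no position
    fields[4] = "0"  # MAPQ: zero
    fields[5] = "*"  # CIGAR: unavailable
    # Keep only the 11 mandatory fields; optional tags such as NM are dropped.
    return "\t".join(fields[:11])


# Toy record (not copied from the real index): an alt contig aligned to chr19.
toy_record = "chr19_KI270866v1_alt\t0\tchr19\t1000\t60\t50M\t*\t0\t0\t*\t*\tNM:i:3"

print(to_unmapped_record(toy_record))
```

Write the edited record back to a copy of the ALT index, rerun the postalt-processing and compare the alignments before and after.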
7. Simulate read mapping for your favorite alternate haplotype.
Tutorial#7859 shows how to generate simulated reads so you can see results akin to those in Tutorial#8017 for your favorite alternate contig. For both tutorials, I use the GPI gene’s singular alternate contig as the example.
Using the liberty the blog format provides, I will digress here. The GPI locus encodes glucose-6-phosphate isomerase, a protein that has an intracellular role in sugar metabolism and also moonlights extracellularly as Neuroleukin, a factor involved in nerve tissue growth. I chose this locus because (i) it is one of the smallest alternate contigs not near a telomere, (ii) I used to study metabolism and (iii) I worked on an identically named, unrelated molecule. Yes, really.
So, how significant are the alternate contigs? To start answering this question, I asked another. What story can I find for the GPI locus?
I did a little digging last Saturday afternoon for evidence of the alternate haplotype in data resources. In GTEx, a project that measures healthy tissue-specific RNA isoform expression, I found that the GPI locus provides cis-eQTLs for WTIP in lung tissue. WTIP encodes Wilms tumor 1 interacting protein and is three genes down from the GPI locus. Eight of the 11 eQTL sites on the GPI gene match SNPs that my simulated reads, representing the alternate haplotype, generate on the primary assembly. These sites, when I look them up in dbSNP, are all listed as minor alleles and intronic variants. The average global minor allele frequency for the eight SNPs is 38.7% (+/- 0.90%), with 1936 (+/- 45.0) observations in the 1000 Genomes Project phase 3 data. It looks like the GPI locus alternate haplotype is not uncommon and it already has some observed associations.
8. Our production workflow for single sample variant calling on GRCh38 is public and uses shiny new features.
Check it out in our Broad pipelines WDL scripts repository. The document describing the workflow has the .md extension in the set named PairedEndSingleSampleWf. Even if you are unfamiliar with WDL, no worries. The document focuses on explaining the data transformation steps from alignment to single-sample SNP and indel variant calling. The workflow maps paired reads in an alt-aware manner to GRCh38 and then uses HaplotypeCaller to generate a GVCF callset for the primary assembly. New features the workflow uses include query-grouped alignments through duplicate marking and the addition of NM and UQ tags with SetNmAndUqTags.
9. Finally, there is no better time than now to start learning WDL.
It’s pretty straightforward. Using instructions provided by our WDL documentation, even yours truly has written her first three scripts for Tutorial#8017’s workflows. These we share via our new GATK Tutorials WDL scripts repo. WDL scripts will become more prevalent going forward. In conjunction with Docker, these process-centric pipeline scripts enable better provenance and reproducibility in research. If you are a complete newb to WDL, e.g. don’t know how to pronounce the acronym, then start with Blog#7349.
Want to help build our GRCh38 resources? Share your findings by posting a comment.