Is it necessary to process 1000 Genomes data for exome variant calling training?

I have an independently sequenced human exome with 100x coverage. I would like to call variants using the GATK best practices guidelines, and have been following the guide to do so. However, I am confused about using 1000 Genomes data to create training files to improve the accuracy of my variant calling.

I remember that before the GVCF best practices were written, the previous guide suggested processing ~35 exomes from the 1000 Genomes Project to use as a training data set. Therefore, as an experiment, I took 50 exomes from the 1000 Genomes Project and created GVCF files, which I then combined and genotyped into a single "total.vcf" file. Next, I will run VQSR using this "total.vcf" as input, together with the training resources listed in the documentation. I believe this will leverage both the 50-exome combination and the resource training sets, and that I will get a highly filtered set of SNPs from my sequenced exome as output. I will then run SelectVariants with my one experimental exome's sample name to extract just the high-quality SNPs that pertain to my experimental exome.
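To make the plan concrete, here is a rough sketch of the joint genotyping and sample extraction steps. It only builds the command lines and does not run anything. I am assuming GATK 3.x-style arguments (`-T`, `--variant`, `-sn`, `-o`); the jar path, the reference file, all file names, and the sample name are placeholders I made up for illustration, and the VQSR step that sits between the two commands is left out.

```python
from typing import List

# Assumption: GATK 3.x invoked through its jar; adjust the path for your install.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def genotype_gvcfs_cmd(reference: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Build a joint genotyping command over one or more GVCF files."""
    cmd = GATK + ["-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        # One --variant argument per input GVCF.
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


def select_sample_cmd(reference: str, in_vcf: str, sample_name: str,
                      out_vcf: str) -> List[str]:
    """Build a SelectVariants command that keeps a single sample's calls."""
    return GATK + ["-T", "SelectVariants", "-R", reference,
                   "-V", in_vcf, "-sn", sample_name, "-o", out_vcf]


if __name__ == "__main__":
    # Placeholder inputs: 50 hypothetical 1000 Genomes GVCFs plus my own exome.
    gvcfs = ["kg_exome_%02d.g.vcf" % i for i in range(1, 51)] + ["my_exome.g.vcf"]
    joint = genotype_gvcfs_cmd("reference.fasta", gvcfs, "total.vcf")
    # VQSR would run here on total.vcf; "total.recalibrated.vcf" is a
    # hypothetical name for its output.
    select = select_sample_cmd("reference.fasta", "total.recalibrated.vcf",
                               "MY_SAMPLE", "my_exome.filtered.vcf")
    print(" ".join(joint))
    print(" ".join(select))
```

Here I include my own exome's GVCF in the joint genotyping step, as the quoted documentation below recommends, so that its sample name is present in the VCF that SelectVariants reads.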

(EDIT: I am referring to the documentation I found here: "Add additional samples for variant calling, either by sequencing additional samples or using publicly available exome bams from the 1000 Genomes Project (this option is used by the Broad exome production pipeline). Be aware that you cannot simply add VCFs from the 1000 Genomes Project. You must either call variants from the original BAMs jointly with your own samples, or (better) use the reference model workflow to generate GVCFs from the original BAMs, and perform joint genotyping on those GVCFs along with your own samples' GVCFs with GenotypeGVCFs.")

My questions are as follows:

1) Am I correct in understanding that calling variants in numerous exomes from the 1000 Genomes Project to create a training data set is good practice, with the goal of achieving the best possible variant calling results for my single exome of interest?

2) If so, will my training set produce better results the larger it is (meaning that using all ~3,500 exomes from the 1000 Genomes Project would create the best possible training set)?

3) If more is better, is there a resource somewhere with all ~3,500 exomes already processed into GVCFs, or should I do that myself?

Thank you for your help as I learn more about exome sequencing!
