Hi!
To phase haplotypes from the SNPs of a single individual, I used HaplotypeCaller, which performs ReadBackedPhasing automatically (the accuracy of SNP calling is beyond the scope of this question). However, among all phased heterozygous SNPs, I observed far more 0|1 (98%) than 1|0 (2%).
What I don't understand is this: the reference is built from a mixture of the two haplotypes of a diploid genome, so when a phased haplotype in the .vcf starts with 0|1, the next SNP should by chance have a 50% probability of being 0|1 and a 50% probability of being 1|0. In other words, because the reference is itself an unphased haplotype, phasing SNPs against it should give similar numbers of 0|1 and 1|0 calls.
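To make that expectation concrete, here is a minimal simulation sketch. It only encodes the assumption stated above (at each SNP the reference allele comes from either haplotype with equal probability, independently of the other SNPs); it does not model what HaplotypeCaller actually does. The function name `second_snp_fraction` and its parameters are made up for illustration.

```python
import random


def second_snp_fraction(n_blocks=10_000, seed=0):
    """Fraction of two-SNP blocks whose 2nd SNP is 0|1 when the 1st is 0|1.

    Assumption (from the question, not from GATK): at each SNP the
    reference allele was taken from haplotype 0 or 1 with equal
    probability, independently of the neighbouring SNP.
    """
    rng = random.Random(seed)
    same_orientation = 0
    for _ in range(n_blocks):
        # Which haplotype carries the reference allele at each SNP.
        first_ref_hap = rng.randint(0, 1)
        second_ref_hap = rng.randint(0, 1)
        # Orient the block so the 1st SNP reads 0|1; the 2nd SNP is then
        # 0|1 only if its reference allele sits on the same haplotype.
        if first_ref_hap == second_ref_hap:
            same_orientation += 1
    return same_orientation / n_blocks


if __name__ == "__main__":
    print(second_snp_fraction())
```

Under that assumption the fraction should come out close to one half, which is what I expected to see in my data.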
For example, in any phased haplotype containing 2 SNPs, the 1st SNP always starts as 0|1. For the 2nd SNP, I expect similar numbers of 0|1 and 1|0, but I see many more 0|1 than 1|0.
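For anyone who wants to check the same thing on their own data, this is a minimal sketch of how the two genotypes can be tallied from an uncompressed, single-sample VCF. It is not the exact script I ran; the function name `count_phased_hets` and the three example records are hypothetical, and it only looks at the GT field of biallelic phased heterozygous calls.

```python
from collections import Counter


def count_phased_hets(vcf_lines, sample_index=0):
    """Count 0|1 and 1|0 genotypes for one sample in an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A record with genotypes has 8 fixed columns, FORMAT, then samples.
        if len(fields) < 10 + sample_index:
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample_values = fields[9 + sample_index].split(":")
        genotype = sample_values[format_keys.index("GT")]
        if genotype in ("0|1", "1|0"):
            counts[genotype] += 1
    return counts


if __name__ == "__main__":
    # Made-up records, only to show the expected input shape.
    example = [
        "##fileformat=VCFv4.2",
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1"]),
        "\t".join(["chr1", "100", ".", "A", "G", "50", ".", ".", "GT:DP", "0|1:20"]),
        "\t".join(["chr1", "150", ".", "C", "T", "50", ".", ".", "GT:DP", "1|0:18"]),
        "\t".join(["chr1", "210", ".", "G", "A", "50", ".", ".", "GT:DP", "0/1:25"]),
    ]
    print(count_phased_hets(example))
```

With a real file, the lines can be passed straight from `open("sample.vcf")` (a placeholder filename); unphased calls such as 0/1 are ignored.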
I have tried datasets from 5 different species and multiple individuals, including human, birds, and fish. The results are very similar.
I think I may be misunderstanding something about read-backed phasing. Can anyone help me with that?
Thanks.