Hey all, newbie here.
tl;dr:
I have a FASTA file containing two sequences of my region of interest (~5.5 kbp) that differ at ~100 SNPs. What is the fastest way to generate a ROD file from these sequences, as an input to BQSR?
So, hey.
I'm trying to determine the frequency of a genetic fragment I introduced into a bacterial strain, in several different samples. As I wrote, my current challenge is to create the aforementioned ROD file; however, my project is a bit different from 'usual' variant-calling projects, and any advice regarding processing and analysis would be appreciated.
- I have a WT bacterial strain. I introduced a 5.5 kbp genetic fragment into it by electroporation and homologous recombination. It is safe to assume that different parts of the fragment have recombined into the host's genome with different efficiencies (so I may have 'hybrid' variants that are half WT and half mutated). The introduced fragment has ~100 SNPs compared to the WT fragment.
- I took that sample and grew it under different conditions, in order to determine whether the fragment I introduced is beneficial to the bacteria.
- The fragments were PCR-amplified, sheared into smaller DNA fragments (~300-500 bp), and sequenced (150 bp per read, paired-end). I have a coverage of 10^6 reads per base for each sample.
- I'd like to determine the frequency of each SNP in each sample and, ideally, the identity and frequency of each variant.
I have:
The sequencing samples (1 sample of the initial pool, 6 biological replicates for one condition, and 3 biological replicates for the second condition), the sequence of the WT's genome, and the sequence of the fragment I introduced.
My questions:
1. How do I turn the FASTA file containing my WT and modified fragments into a ROD file (type doesn't matter) for the BQSR procedure? I do not need to rely on the sequenced samples to determine the differences between the sequences; I already know them.
2. Since all my reads originate from a PCR-amplified fragment, can de-duplication introduce bias or underestimation into my data?
3. I have a huge coverage. Does it require any different processing methods?
4. Any other advice?
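To make question 1 concrete, here is a rough sketch of what I have in mind. It is not a tested pipeline. It assumes the two sequences in the FASTA file have the same length and differ only by substitutions (no indels), so they can be compared position by position. The function names, the `offset` parameter, and the file and contig names in the comments are all placeholders I made up. The output is a bare-bones VCF, which as far as I understand is one of the formats BQSR accepts as known sites.

```python
def read_fasta(path):
    """Return a list of (name, sequence) tuples from a FASTA file."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # Keep only the first word of the header as the record name.
                name, chunks = (line[1:].split() or [""])[0], []
            else:
                chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def snps_to_vcf(ref_seq, alt_seq, chrom, offset=0):
    """Build minimal VCF text listing every position where the sequences differ.

    Assumption: both sequences are already aligned base-for-base (SNPs only).
    `chrom` must match the contig name in the reference the reads were aligned to.
    `offset` is the 0-based start of the fragment within that contig
    (0 if the reads were aligned to the fragment itself).
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("sequences differ in length; align them first")
    lines = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for i, (ref_base, alt_base) in enumerate(zip(ref_seq, alt_seq)):
        # Skip identical bases and anything ambiguous (N, gaps, etc.).
        if ref_base == alt_base or ref_base not in "ACGT" or alt_base not in "ACGT":
            continue
        pos = offset + i + 1  # VCF positions are 1-based
        lines.append(f"{chrom}\t{pos}\t.\t{ref_base}\t{alt_base}\t.\t.\t.")
    return "\n".join(lines) + "\n"


# Hypothetical usage (file and contig names are placeholders):
# (wt_name, wt_seq), (mut_name, mut_seq) = read_fasta("fragments.fasta")
# with open("known_sites.vcf", "w") as out:
#     out.write(snps_to_vcf(wt_seq, mut_seq, chrom=wt_name))
```

If the reads are aligned to the whole WT genome rather than to the fragment alone, I assume `chrom` and `offset` would have to be set to the genome's contig name and the fragment's start coordinate. Would something like this be acceptable, or is there an existing tool that does it properly?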
Thanks,
Omer