This post summarizes the first steps in the sensitivity analysis of uptake experiments using sheared genomic fragments.
Once the sequencing results of the sheared genomic fragments recovered from periplasm of Rec2 knockout competent cells is obtained, two possible analysis are possible.
1. Reads recovered can be aligned to the genome and coverage per base or coverage per candidate motif can be examined. Then coverage per base (or motif) will be compared with coverage in input.
2. frequency of motifs in the uptake read pool can be compared with the frequency of motifs to the input read pool
Based on this two points, I need to predict how coverage per base (or per motif) will change depending with the motif strength and motif length, as well as the size of the sheared genomic fragments (1). As well as how the frequency of motifs in the uptake pool changes compared with the input pool depending on factor described earlier.
To solve this questions I started simulating a simple scenario where I have a four bases motif AAAA with a 2 fold increase in uptake with 1 or more motifs per fragment in a genome generated with random sequences with a 50%GC content.
I first calculated the distribution frequency of having n number of motifs in the input pool (finput) (assuming to simplify analysis that fragments are 250bp; in reality input pool will have a distribution with a 250bp mean). A small change to previous analysis (in previous post), is that distribution frequency was calculated using a poisson model and not binomial (even though both distributions were the same). Distribution frequency of the uptake pool (fuptake) when probability of a fragment to be taken up increase 2 fold with one or more motifs was calculated (see figure below).
Next step in my calculations was to estimate the amount of reads necessary to reach a certain coverage. For example, using the formula C = LN / G (from illumina tech support sheet)
• C stands for coverage
• G is the haploid genome length
• L is the read length
• N is the number of reads
With this formula with a 1000X coverage, 1.8e+7 reads are generated for a 1e+6 bp genome.
If systematic bias in sequencing (PCR bias effect) are not taking considered it is expected that the input pool of reads have a similar distribution frequency as the finput. So, I multiply finput(n =0) to number of reads (1.8e+7) and I got an estimate of the number of reads with n = 0 motifs. Next, calculated the fraction in bases (on 250bp fragments) of the genome that have n = 0 motifs by multiplying finput to genome size (1.8e+6).
Both of this estimates were used to calculate the expected coverage per position with n = 0 motifs by using the illumina coverage formula C = LN / G where
• N (the number of reads) is replaced with the number of reads in input with n = 0
• G (haploid genome length) is replaced with fraction in bases (on 250bp fragments) of the genome with n=0
As expected, coverage when n = 0 in input was 1000 reads (on average).
This same procedure was applied with the uptake pool frequency with n = 0 and with n >= 1.
Coverage when n = 0 in uptake was 618 reads (on average).
Coverage when n >= 1 in uptake was 1235 reads (on average).
If all this calculations are too complicated, there is a much simpler way to obtained the same results. First we calculate the fold change between finput and fuptake (fuptake/finput). This fold change between Finput and fuptake should be proportional to the fold change observed in regions of the genome with a given number of motifs. For example the fold difference between Finput and Fuptake (with 2 fold increase in uptake) was 1.2354; when this number was multiplied to the mean coverage of 1000 reads, we get the 1235 reads (we calculated before). Of coarse this is a simplification of a real scenario since I still haven’t consider illumina bias or variation in GC content, but it is a first approximation to understand the limitations of using sheared genomic fragments.