In the last post, I described the sensitivity analysis I did to find out how long and how strong the uptake motifs have to be for me to detect them using synthetic degenerate fragments. Now that the basic calculations, based on an example with the four-base motif AAAA, are done, I have to start thinking about caveats and possible complications, such as how systematic Illumina errors would influence the input frequency distribution and how those changes would affect the expected uptake frequency distribution. Additionally, I have to consider the probability of having the uptake motif within the flanking regions of the degenerate fragment (see picture below).
First, I tested the effect that a systematic bias would have on my ability to detect uptake bias. Assume that the motif analyzed is a four-base motif (ACGA) and that the bias increases the probability of having an A by 10 percentage points (from 0.25 to 0.35). The probability of having ACGA will then be 0.35*0.25*0.25*0.35 = 0.00765625. With this probability I calculated the frequency distribution of finding this motif in a 30 bp degenerate fragment (Finput). Then I calculated the frequency distributions after a 2-fold (see figure below), 1.1-fold, and 1.01-fold increase in fragments taken up (Fuptake) (see below).
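The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily how Finput was computed for the post: the T frequency of 0.15 is inferred so that the four base probabilities sum to 1, the 27 possible start positions in a 30 bp fragment are treated as independent (they overlap, so this is an approximation), and the function names are made up for the example.

```python
from math import comb, prod

# Assumed base composition: A raised to 0.35, C and G left at 0.25,
# so T must drop to 0.15 for the probabilities to sum to 1.
BIASED = {"A": 0.35, "C": 0.25, "G": 0.25, "T": 0.15}


def motif_probability(motif, base_probs):
    """Probability of the motif at one fixed position, bases independent."""
    return prod(base_probs[base] for base in motif)


def occurrence_distribution(p, fragment_len, motif_len):
    """Binomial P(k occurrences) for k = 0..n over n start positions.

    Treats the start positions as independent, which is only an
    approximation because neighbouring windows overlap.
    """
    n = fragment_len - motif_len + 1
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


p_acga = motif_probability("ACGA", BIASED)
f_input = occurrence_distribution(p_acga, fragment_len=30, motif_len=4)

print(f"P(ACGA at one position) = {p_acga:.8f}")
for k in range(4):
    print(f"P({k} occurrences in 30 bp) = {f_input[k]:.6f}")
```

Swapping in a uniform composition (0.25 for every base) gives the unbiased Finput for comparison.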
The Finput and expected Fuptake distributions were similar to the frequencies calculated when there is no systematic bias.
When the Finput and Fuptake distributions with the 10% bias (on A) were compared using a chi-square goodness-of-fit test, the results showed highly significant differences when the probability of being taken up increased 2-fold (as in the figures above), as well as 1.1-fold and 1.01-fold. The same results were observed when comparing frequency distributions without bias (see previous post).
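As a rough illustration of this comparison, the sketch below collapses each distribution into two classes (fragments with no motif, fragments with at least one) and computes the goodness-of-fit statistic by hand. Several things here are assumptions rather than the method used in the post: the two-class simplification, the uptake model (fragments carrying the motif are recovered `fold` times more often), the independence of start positions, and the use of expected counts for 1e+6 fragments with no sampling noise. The threshold 3.841 is the 5% critical value for one degree of freedom.

```python
CHI2_CRIT_DF1 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def p_any_motif(p, fragment_len=30, motif_len=4):
    """P(at least one occurrence), start positions treated as independent."""
    return 1 - (1 - p) ** (fragment_len - motif_len + 1)


def uptake_fraction(q, fold):
    """Fraction of recovered fragments carrying the motif when such
    fragments are taken up `fold` times more often (assumed model)."""
    return fold * q / (1 - q + fold * q)


def chi_square(q_input, q_uptake, n_fragments):
    """Goodness-of-fit statistic over two classes: motif / no motif."""
    observed = (n_fragments * q_uptake, n_fragments * (1 - q_uptake))
    expected = (n_fragments * q_input, n_fragments * (1 - q_input))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


q_in = p_any_motif(0.35 * 0.25 * 0.25 * 0.35)  # ACGA with A at 0.35
results = {}
for fold in (2, 1.1, 1.01):
    stat = chi_square(q_in, uptake_fraction(q_in, fold), n_fragments=1e6)
    results[fold] = stat
    print(f"{fold}-fold: chi2 = {stat:.1f}, significant = {stat > CHI2_CRIT_DF1}")
```

Under these assumptions all three fold changes exceed the threshold, with the 1.01-fold case by far the closest to it.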
Next, I evaluated the effect that a systematic bias decreasing the probability of T, which occupies two positions in the four-base motif TCGT, would have on detecting differences between the Finput and Fuptake frequencies. The probability of having TCGT will be 0.15*0.25*0.25*0.15 = 0.00140625. With this probability, I calculated the frequency distribution of finding this motif in a 30 bp degenerate fragment (Finput). Then I calculated the frequency distributions after a 2-fold (see figure below), 1.1-fold, and 1.01-fold increase in fragments taken up (Fuptake) (see below).
The chi-square goodness-of-fit test showed significant differences between the Finput and Fuptake distributions with a 2-fold and a 1.1-fold increase in uptake probability, but not with a 1.01-fold increase. These results suggest that any bias that decreases the probability of seeing a motif (compared to random expectations) would decrease the ability to recognize very subtle motifs. Still, I have to consider:
1. That a 1.01 fold increase is a very subtle difference.
2. I doubt that the Illumina HiSeq will have a bias so strong that it lowers the chance of seeing a nucleotide by 10%.
3. I ran my analysis with 1e+6 expected fragments, but I expect to have at least 1e+8 fragments after sequencing the input and uptake periplasmic recoveries.
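The effect of the number of fragments in point 3 can be made concrete with a back-of-the-envelope sketch. It rests on assumptions that are not from the post: a two-class comparison (motif present or absent) with one degree of freedom, independent start positions, an uptake model in which motif-carrying fragments are recovered `fold` times more often, and expected counts with no sampling noise. Under those assumptions the chi-square statistic grows linearly with the number of fragments, so one can ask how many fragments are needed to reach the 5% threshold.

```python
CHI2_CRIT_DF1 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi_square_per_fragment(p, fold, fragment_len=30, motif_len=4):
    """Expected chi-square contribution of a single fragment.

    q is the input fraction of fragments with at least one motif;
    shift is how far the uptake fraction moves away from q.
    """
    q = 1 - (1 - p) ** (fragment_len - motif_len + 1)
    shift = q * (1 - q) * (fold - 1) / (1 + q * (fold - 1))
    return shift**2 / (q * (1 - q))


p_tcgt = 0.15 * 0.25 * 0.25 * 0.15  # TCGT with T at 0.15
per_fragment = chi_square_per_fragment(p_tcgt, fold=1.01)
min_fragments = CHI2_CRIT_DF1 / per_fragment

for n in (1e6, 1e8):
    print(f"N = {n:.0e}: chi2 = {n * per_fragment:.1f}")
print(f"fragments needed to reach the 5% threshold: {min_fragments:.3g}")
```

Because sampling noise is ignored, a real experiment would need more fragments than this estimate to detect the difference reliably.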
The next step was to evaluate the probability of finding a motif in the entire 200 bp synthetic fragment (the degenerate region plus the flanking regions, see figure above). Since the uptake motif is unknown, I assumed that the 200 bp synthetic fragment is random. The frequency distribution for a 200 bp random fragment (Finput) is shown below.
The Finput frequency at n=0 was lower in a 200 bp fragment than in a 30 bp fragment, given the higher chance of seeing a four-base motif in a longer fragment. Despite the higher probability of seeing the motif in a 200 bp fragment, I was still able to compare Finput with Fuptake for 2-fold, 1.1-fold, and 1.01-fold increases in uptake and find significant differences between the two distributions. However, for shorter motifs the probability of finding the motif will increase to the point where most fragments will have one. In those circumstances, it might be harder to find differences between the distributions.
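To get a feel for how quickly fragments saturate with motifs, the sketch below estimates the probability of at least one occurrence for several motif lengths in 30 bp and 200 bp fragments. It assumes uniform base frequencies (0.25 each), a fully random fragment as in the text above, and independent start positions, so the numbers are approximations; the function name is invented for the example.

```python
def p_any_motif(motif_len, fragment_len, base_prob=0.25):
    """P(at least one occurrence) of one specific motif in a random fragment.

    Start positions are treated as independent, which is an approximation.
    """
    p = base_prob**motif_len
    windows = fragment_len - motif_len + 1
    return 1 - (1 - p) ** windows


table = {}
for motif_len in (2, 3, 4, 5, 6):
    table[motif_len] = (p_any_motif(motif_len, 30), p_any_motif(motif_len, 200))
    short, long_ = table[motif_len]
    print(f"{motif_len}-base motif: 30 bp = {short:.4f}, 200 bp = {long_:.4f}")
```

With these assumptions roughly half of the 200 bp fragments contain a given four-base motif, and nearly all of them contain a given two- or three-base motif, which leaves very few motif-free fragments to compare against.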