Today, I spend the day working through calculation necessary to infer the probability of a simple uptake motif (lets say a dimer AA) be in a 30bp degenerate fragment and a 50bp degenerate fragment. The objective is to be able to chose the appropriate size of the degenerate region in my synthetic fragments.
First, my supervisor and I did a basic calculation to figure out the probability of finding an AA in a 30bp fragment.
Given 4 bases = A, C, G, T probability of having an A is 0.25 (1/4)
The probability of having two AA is 0.25*0.25 = 0.0625, or of the 16 dimer combinations, we have 1 success (1/16).
0.0625*29 positions (or events) = 1.8125
But first, to understand number of positions lets imagine we have a 5 base pair sequence:
If we consider each nucleotide position as an independent event we have 5 events.
if we consider each dimer as an event we have 4 events
0.0625*29 positions (or events) = 1.8125 probability in one strand
1.8125*2 = 3.625 probability in 2 strand (since in random proportions probability of AA is equal to probability of TT and complementary strand should have the same proportions)
We thought (or at least I thought) that this result mean that we have 3.6 AAs on average in a 30bp fragment
However, after one day of calculations and frustration I realized that this calculations are wrong.
I went back to my statistics books from my undergraduate statistics classes (10 years ago) and I realize that this is a classic binomial distribution probability. The biggest mistake in the calculations above is that we only took in account the probability of “success” or “p”, but not the probability of failure or “q”.
the formula to calculate this probability comes from the formula:
P(x.n.p) = nCx (p^x) * q^(n-x)
where n = # of essays
x = binomial random variables (0,1,2,3….n)
p = success probability
q = failure probability (1 – p)
nCx can be resolves two ways. The first is using binomial coefficients of Pascal triangle and the second one is resolving nCx using factorials. If factorials are used then the formula above turn into:
Using this formula (multiplying per 2 to account for two strands), I was able to calculate frequency distribution of probability of AA in a 30bp fragments and in a 50bp fragments. Note: If I am statistically strict I would not multipky the frequencies by 2 since they will not add to 1 (100%) anymore, they would add to 200% (100% each strand). Instead I would have to multiply by 2 the estimation of the number of fragments given a certain number of reads. This would just be an observation since It wouldn’t change my results.