Monthly Archives: April 2015

Sensitivity analysis of genomic fragments

This post summarizes the first steps in the sensitivity analysis of uptake experiments using sheared genomic fragments.

Once the sequencing results of the sheared genomic fragments recovered from the periplasm of Rec2 knockout competent cells are obtained, two analyses are possible.

1. Recovered reads can be aligned to the genome, and coverage per base or per candidate motif can be examined. Coverage per base (or per motif) in the uptake sample can then be compared with coverage in the input.

2. The frequency of motifs in the uptake read pool can be compared with the frequency of motifs in the input read pool.

Based on these two points, I need to predict how coverage per base (or per motif) will change depending on motif strength, motif length, and the size of the sheared genomic fragments (1), as well as how the frequency of motifs in the uptake pool changes relative to the input pool depending on the same factors (2).

To address these questions I started by simulating a simple scenario: a four-base motif AAAA that gives a 2-fold increase in uptake for fragments carrying 1 or more motifs, in a genome of random sequence with 50% GC content.

I first calculated the frequency distribution of the number of motifs (n) per fragment in the input pool (Finput), assuming for simplicity that fragments are 250bp (in reality the input pool will have a size distribution with a 250bp mean). A small change from the previous post is that the frequency distribution was calculated using a Poisson model rather than a binomial (although both gave essentially the same distribution). I then calculated the frequency distribution of the uptake pool (Fuptake) when the probability of a fragment being taken up increases 2-fold with one or more motifs (see figure below).

[Figure: distribution of motif counts per 250bp fragment (AAAA, 2-fold uptake bias)]
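These distribution calculations can be sketched as below (the original work was done in R; this Python sketch assumes 247 possible motif start positions in a 250bp fragment and the Poisson model described above):

```python
from math import exp, factorial

# AAAA in a random 50% GC genome: probability at any one start position
p_site = 0.25 ** 4
n_sites = 250 - 4 + 1        # possible motif start positions in a 250 bp fragment
lam = n_sites * p_site       # Poisson mean: expected motifs per fragment

def poisson_pmf(n, lam):
    return lam ** n * exp(-lam) / factorial(n)

# Finput: frequency of fragments with n = 0, 1, 2, ... motifs
finput = [poisson_pmf(n, lam) for n in range(10)]

# Fuptake: 2-fold uptake bias for fragments with >= 1 motif
weights = [1 if n == 0 else 2 for n in range(10)]
norm = sum(f * w for f, w in zip(finput, weights))
fuptake = [f * w / norm for f, w in zip(finput, weights)]
```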

The next step in my calculations was to estimate the number of reads necessary to reach a certain coverage, using the formula C = LN / G (from the Illumina tech support sheet), where:
• C stands for coverage
• G is the haploid genome length
• L is the read length
• N is the number of reads

With this formula, reaching 1000X coverage of the 1.8e+6 bp genome requires 1.8e+7 reads (assuming 100bp reads).
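In code, this is just the coverage formula solved for N (the 100bp read length is my assumption; the tech sheet formula itself does not fix it):

```python
# Illumina coverage formula: C = L * N / G, rearranged to N = C * G / L
C = 1000        # target coverage
G = 1.8e6       # H. influenzae KW20 genome size in bp
L = 100         # read length in bp (assumed)
N = C * G / L   # number of reads required
```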

If systematic sequencing biases (e.g., PCR bias) are not taken into account, the input pool of reads is expected to have a frequency distribution similar to Finput. So I multiplied Finput(n = 0) by the number of reads (1.8e+7) to estimate the number of reads with n = 0 motifs. Next, I calculated the fraction of the genome (in bases, on 250bp fragments) that has n = 0 motifs by multiplying Finput(n = 0) by the genome size (1.8e+6).

Both of these estimates were used to calculate the expected coverage per position with n = 0 motifs, using the Illumina coverage formula C = LN / G, where:

• N (the number of reads) is replaced with the number of reads in input with n = 0

• G (haploid genome length) is replaced with the fraction of the genome (in bases, on 250bp fragments) with n = 0

As expected, coverage at positions with n = 0 in the input was 1000X (on average).

The same procedure was applied to the uptake pool frequencies with n = 0 and with n >= 1.

Coverage at positions with n = 0 in the uptake pool was 618X (on average).

Coverage at positions with n >= 1 in the uptake pool was 1235X (on average).

If all these calculations seem too complicated, there is a much simpler way to obtain the same results. First calculate the fold change between Finput and Fuptake (Fuptake/Finput). This fold change should be proportional to the fold change observed in regions of the genome with a given number of motifs. For example, the fold difference between Finput and Fuptake (with a 2-fold increase in uptake) for n >= 1 was 1.2354; multiplying this by the mean coverage of 1000X gives the 1235X we calculated before. Of course this is a simplification of a real scenario, since I still haven't considered Illumina bias or variation in GC content, but it is a first approximation to understand the limitations of using sheared genomic fragments.
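A minimal sketch of this shortcut, reproducing the numbers above (again assuming 247 motif start positions per 250bp fragment and the Poisson model):

```python
from math import exp

p_site = 0.25 ** 4               # probability of AAAA at one start position
lam = (250 - 4 + 1) * p_site     # expected motifs per 250 bp fragment
f0 = exp(-lam)                   # Finput(n = 0) under the Poisson model

# Normalization constant for a 2-fold uptake bias when n >= 1
norm = f0 * 1 + (1 - f0) * 2

ratio_n0 = 1 / norm              # Fuptake/Finput for fragments with n = 0
ratio_ge1 = 2 / norm             # Fuptake/Finput for fragments with n >= 1

cov_n0 = 1000 * ratio_n0         # expected coverage at n = 0 positions
cov_ge1 = 1000 * ratio_ge1       # expected coverage at n >= 1 positions
```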



In the last post, I described the sensitivity analysis I did to find out how long and how strong the uptake motifs have to be in order for us to detect them using synthetic degenerate fragments. Now that the basic calculations based on an example four-base motif (AAAA) are done, I have to start thinking about caveats and possible complications, such as how systematic Illumina biases would influence the input frequency distribution and how these changes would affect the expected uptake frequency distribution. Additionally, I have to consider the probability of having the uptake motif within the flanking regions of the degenerate fragment (see picture below).

[Figure: synthetic fragment]

First I tested the effect that a systematic bias would have on my ability to detect uptake bias. Assume the motif analyzed is a four-base motif (ACGA), with a bias that increases the probability of an A by 10 percentage points (from 0.25 to 0.35). The probability of ACGA is then 0.35*0.25*0.25*0.35 = 0.00765625. With this probability I calculated the frequency distribution of the number of motifs in a 30bp degenerate fragment (Finput). Then I calculated the frequency distributions after a 2-fold (see figure below), 1.1-fold, and 1.01-fold increase in uptake of fragments carrying the motif (Fuptake).
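This biased motif probability and the resulting Finput can be checked with a short sketch (binomial over the 27 possible start positions in a 30bp fragment; Python here rather than the R I actually used):

```python
from math import comb

# Systematic bias: probability of A raised from 0.25 to 0.35
p_base = {'A': 0.35, 'C': 0.25, 'G': 0.25, 'T': 0.25}
p_motif = p_base['A'] * p_base['C'] * p_base['G'] * p_base['A']  # ACGA

n_sites = 30 - 4 + 1   # possible motif start positions in a 30 bp fragment

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Finput: distribution of the number of ACGA motifs per fragment
finput = [binom_pmf(k, n_sites, p_motif) for k in range(n_sites + 1)]
```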

[Figure: ACGA, 2-fold uptake bias]

The Finput and expected Fuptake were similar to the frequencies calculated when there is no systematic bias.

[Figure: AAAA, 30bp fragments, 2-fold uptake bias]

When the Finput and Fuptake distributions with the 10% bias (toward A) were compared using a goodness-of-fit chi-square test, the results showed highly significant differences when the probability of being taken up increased 2-fold (as in the figures above), as well as 1.1-fold and 1.01-fold. The same results were observed when comparing the frequency distributions without bias (see previous post).

Next I evaluated the effect that a systematic bias decreasing the probability of two bases (the Ts) in a four-base motif (TCGT) would have on detecting differences between the Finput and Fuptake frequencies. The probability of TCGT is then 0.15*0.25*0.25*0.15 = 0.00140625. With this probability I calculated the frequency distribution of the number of motifs in a 30bp degenerate fragment (Finput). Then I calculated the frequency distributions after a 2-fold (see figure below), 1.1-fold, and 1.01-fold increase in uptake (Fuptake).

[Figure: TCGT, 2-fold uptake bias]

The goodness-of-fit chi-square test showed significant differences between the Finput and Fuptake distributions with a 2-fold and a 1.1-fold increase in uptake probability, but not with a 1.01-fold increase. These results suggest that any bias decreasing the probability of seeing a motif (compared to random expectation) will decrease the ability to recognize very subtle motifs. Still, I have to consider:

1. A 1.01-fold increase is a very subtle difference.

2. I doubt that an Illumina HiSeq run will have a bias so strong that it decreases the chance of seeing a nucleotide by 10 percentage points.

3. I tested my analysis by simulating an expected number of fragments of 1e+6; I expect to have at least 1e+8 fragments after sequencing the input and the uptake periplasmic recoveries.
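The goodness-of-fit comparison above can be sketched as follows. This is not R's chisq.test: as an assumption of the sketch, I bin motif counts into n = 0, 1, and >= 2, so the test has 2 degrees of freedom (for which the chi-square p-value is exactly exp(-chi2/2)); exact p-values differ, but the qualitative result is reproduced.

```python
from math import comb, exp

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def uptake_freqs(finput, fold):
    # fragments with >= 1 motif are taken up `fold` times more often
    w = [1 if n == 0 else fold for n in range(len(finput))]
    norm = sum(f * x for f, x in zip(finput, w))
    return [f * x / norm for f, x in zip(finput, w)]

def chi2_pvalue(f_exp, f_obs, n_frag):
    # bin frequencies into n = 0, 1, >= 2 (df = 2, so p = exp(-chi2 / 2))
    def bin3(f):
        return [f[0], f[1], sum(f[2:])]
    expected = [x * n_frag for x in bin3(f_exp)]
    observed = [x * n_frag for x in bin3(f_obs)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return exp(-chi2 / 2)

# TCGT with the probability of T lowered from 0.25 to 0.15
p_motif = 0.15 * 0.25 * 0.25 * 0.15
sites = 30 - 4 + 1
finput = [binom_pmf(k, sites, p_motif) for k in range(sites + 1)]

p_2fold = chi2_pvalue(finput, uptake_freqs(finput, 2.0), 1e6)
p_101fold = chi2_pvalue(finput, uptake_freqs(finput, 1.01), 1e6)
```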


The next step was to evaluate the probability of finding a motif in the entire 200bp synthetic fragment (the degenerate region + flanking regions, see figure above). Since the uptake motif is unknown, I assumed that the 200bp synthetic fragment is random. The frequency distribution for a 200bp random fragment (Finput) is shown below.

[Figure: AAAA, 200bp fragments]

The frequency of Finput(n = 0) in a 200bp fragment was lower than in a 30bp fragment, given the higher chance of seeing a four-base motif in a longer fragment. Despite this higher probability, I was still able to compare Finput with Fuptake for 2-fold, 1.1-fold, and 1.01-fold increases in uptake and find significant differences between the distributions. However, for shorter motifs the probability of finding the motif will increase to the point where most fragments carry one; in those circumstances it may be harder to find differences between the distributions.
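The drop in Finput(n = 0) with fragment length is easy to verify (one Bernoulli trial per possible start position, assuming unbiased base frequencies):

```python
# Probability of zero AAAA motifs in a random fragment (50% GC)
p_motif = 0.25 ** 4
f0_30 = (1 - p_motif) ** (30 - 4 + 1)     # 30 bp degenerate region
f0_200 = (1 - p_motif) ** (200 - 4 + 1)   # full 200 bp synthetic fragment
```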

Synthetic fragments analysis

Last week in a laboratory meeting my supervisor gave me two pieces of advice regarding my calculations to predict how long and how strong the uptake motif has to be in order to be detectable by the technique I am going to use.

1. I should generate an equation that calculates how the expected distribution of the number of motifs in synthetic degenerate fragments changes. Differences between the input DNA and the fragments taken up occur because of preferential uptake of fragments carrying a given motif.

2. Start with only one simple scenario, with a motif of 4 bases, and finish the calculations for that motif first, before working on more complex scenarios.

Following this advice, I calculated the frequency distributions of a 4-base motif (AAAA) in 30-base degenerate fragments. At first I did this manually using the formulas described in previous posts; however, this time I found an R function (dbinom) that did the job without the pain of manual calculations.

After I generated the frequency distribution in degenerate fragments, I generated two equations to calculate:

1. The frequency of fragments with each number of motifs in a 30-base fragment after uptake:

Fuptake(n) = ( i(n) × Finput(n) ) / Σk ( i(k) × Finput(k) )

n = number of occurrences of the motif in a fragment

Finput (n) = matrix of frequency distribution of n in the input pool

Fuptake (n) = matrix of frequency distribution of n in the uptake pool

i = vector of strengths of the bias favoring uptake of fragments (e.g., 1, 2, 2, 2, 2, 2)
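The equation above can be sketched in a few lines (Python here rather than R) and checked against the worked example later in this post (input counts 100, 200, 300, 200, 100 with a 2-fold bias for n >= 1):

```python
def fuptake(finput, i):
    # Fuptake(n) = i(n) * Finput(n) / sum_k( i(k) * Finput(k) )
    norm = sum(w * f for w, f in zip(i, finput))
    return [w * f / norm for w, f in zip(i, finput)]

# Worked example: input counts for n = 0..4 motifs, normalized to frequencies
finput = [c / 900 for c in (100, 200, 300, 200, 100)]
i = [1, 2, 2, 2, 2]          # 2-fold uptake bias for n >= 1
out = fuptake(finput, i)
```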

With these equations I was able to calculate the frequency distribution when having 1 or more motifs increases the chances of being taken up 2-fold (see graph below). I could also calculate it for increases of 1.5-fold, 1.1-fold, and even 1.01-fold.

[Figure: AAAA, 30bp fragments, 2-fold uptake bias]

In this graph, for fragments with 0 motifs we see a decrease in frequency of approximately 10% between the input (grey) and the fragments taken up (white). For fragments with one motif there is an increase that nearly doubles the frequency between input and fragments taken up. I expected to see the same rate of increase between Finput and Fuptake whenever n >= 1. This expectation was met, since 0.173/0.095 = 1.82 (n = 1), as was 0.00882791/0.00485660 = 1.82 (n = 2).

It is important to clarify that the two-fold increase refers to uptake (the chance of being taken up), not to the increase in frequency. For instance, using the example from a previous post:

if my input is:

n =          0      1      2      3      4
Finput =   100    200    300    200    100

and 10% are taken up (90/900):

# =         10     20     30     20     10
Fuptake = 0.111  0.222  0.333  0.222  0.111

Now, what if having n >= 1 increases uptake 2-fold?

# =         10     40     60     40     20
Fuptake = 0.058  0.235  0.353  0.235  0.118

(These frequencies were calculated by dividing each count by the new total, e.g., 10/170. When I used my formula, the frequencies matched.)

The difference between Finput and Fuptake when n >= 1 is 1.06-fold, not 2-fold.

Given that the differences between Finput and Fuptake are the same whenever n >= 1, plus the fact that the frequencies increase proportionally to their Finput, I can be confident that the calculations really reflect what I am looking for (the expected frequency of fragments taken up).

In the same way I was able to calculate the frequency distribution when the chance of being taken up increases by 50% with each additional motif (the difference in the equation is that "i" is a vector of increasing values (e.g., 1, 1.5, 2, 2.5, 3) instead of a single fold value).

[Figure: AAAA, 30bp fragments, uptake increasing 0.5-fold per motif]

In this graph, for fragments with 0 motifs we see a decrease in frequency of approximately 5% between the input (grey) and the fragments taken up (white). For fragments with one motif the frequency increases by 43%, while for fragments with two motifs the increase is 90%. In this example I expected to see a gradual increase between Finput and Fuptake as the number of motifs increases. This expectation was also met, since the increase grows by roughly 47% (90% - 43%) with each additional motif.


Now that I have accomplished this, I need to determine whether the differences in frequency between the input and the fragments taken up will be statistically detectable.

I first used a goodness-of-fit chi-square test, but unfortunately this test does not work with frequencies (e.g., 0.2, 0.04, 0.5), only with absolute counts (e.g., 6, 8, 10, 100), and the expected counts have to be higher than 5. So I multiplied my frequencies by 1e+6 to simulate absolute counts. The results showed what is expected of p-value-based tests when numbers are very large: the test considers even minimal deviations between frequencies significant. For example, even a 1.01-fold uptake bias generated a significant difference (p-value ~1e-16) between the input and uptake distributions, because that 1.01-fold difference corresponded to about 1e+4 fragments. If I instead multiplied my relative frequencies by 1000, the 1.01-fold increase was no longer significant (p-value: 0.9569).
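This depth effect can be reproduced with a short sketch: the same frequencies, scaled to different totals, flip between significant and not. As an assumption of the sketch, counts are binned into n = 0, 1, >= 2 (df = 2, where p = exp(-chi2/2)), so the exact p-values differ from R's chisq.test, but the conclusion is the same.

```python
from math import comb, exp

p_motif = 0.25 ** 4          # AAAA, no base bias
sites = 30 - 4 + 1

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

finput = [binom_pmf(k, sites, p_motif) for k in range(sites + 1)]

# 1.01-fold uptake bias for fragments with >= 1 motif
w = [1 if k == 0 else 1.01 for k in range(len(finput))]
norm = sum(f * x for f, x in zip(finput, w))
fup = [f * x / norm for f, x in zip(finput, w)]

def chi2_pvalue(f_exp, f_obs, total):
    bin3 = lambda f: [f[0], f[1], sum(f[2:])]
    e = [x * total for x in bin3(f_exp)]
    o = [x * total for x in bin3(f_obs)]
    return exp(-sum((a - b) ** 2 / b for a, b in zip(o, e)) / 2)

p_deep = chi2_pvalue(finput, fup, 1e6)      # frequencies scaled to 1e6 fragments
p_shallow = chi2_pvalue(finput, fup, 1e3)   # same frequencies, 1000 fragments
```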


The results of this preliminary analysis show that the power to detect significant differences in synthetic fragments depends on the read depth to which the fragments are sequenced. Higher read coverage provides more discrimination power, detecting significant differences even for marginal effects such as a 1.01-fold increase in uptake. However, I have to be careful about false positives, since at high depth the tests have a great deal of power.

In an attempt to find an alternative way of calculating the p-values (one that deals better with the stochastic changes in the expected number of fragments when large numbers are compared), I calculated goodness-of-fit chi-square p-values using Monte Carlo simulations, with the expected frequencies used to estimate the null distribution (using the option simulate.p.value = TRUE of the chisq.test function).
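The idea behind simulate.p.value can be sketched directly (a hypothetical simulate_pvalue helper, not R's implementation): draw multinomial samples under the expected frequencies, and count how often the simulated chi-square statistic is at least as extreme as the observed one.

```python
import random

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_pvalue(obs_counts, exp_freqs, n_sims=500, seed=1):
    # Monte Carlo null distribution: multinomial draws under exp_freqs
    total = sum(obs_counts)
    exp_counts = [f * total for f in exp_freqs]
    observed_stat = chi2_stat(obs_counts, exp_counts)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        draw = [0] * len(exp_freqs)
        for idx in rng.choices(range(len(exp_freqs)), weights=exp_freqs, k=total):
            draw[idx] += 1
        if chi2_stat(draw, exp_counts) >= observed_stat:
            hits += 1
    return (hits + 1) / (n_sims + 1)   # conservative estimator, as in R

# Toy example: 200 fragments, expected 50/50, observed 120/80
p = simulate_pvalue([120, 80], [0.5, 0.5], n_sims=500)
```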

When a fold increase of 1.01 was tested, the chi-square test with simulated p-values gave a p-value of 0.02 (still significant, but more conservative than the ~1e-16 I calculated before).

I personally think that simulating p-values deals better with the stochasticity expected given the high power of the test at the large numbers of a high read depth. I will expand on this in a separate blog post.

A more complex problem (genomic fragments)

A more complex problem (at least in my mind) is the preliminary analysis of sheared genomic fragments.

The smallest genomic fragment size is 200bp, which in reality is just the mean of a size distribution ranging from 100bp to 1000bp.

I am not sure exactly how I am going to do this analysis, but this post is an attempt to clear my mind.

First, what do I know:

Size of the genome of the species to be studied (e.g., H. influenzae KW20 = 1.8e+6 bp).

The probability distribution of a motif in random fragments of 200bp.

The frequency of a motif in a random genome; I can even estimate this frequency given a GC content.

What do I need to estimate:

Once I get the sequencing data, the simplest analysis will be to estimate the coverage per base or per motif. One of the questions to ask is:

How much do the reads increase at a certain position (or motif), given how strong the motif is and the distribution of the motif in the 200bp fragments?

Of course I will have to account for the size distribution of the sheared genomic fragments, but for simplicity I will first consider only a 200bp size.

How would coverage per position increase if there is 1 motif?

How would coverage per position increase if there are two motifs (if two motifs confer a higher probability than one)?

How would coverage per position increase given a certain motif strength?

How would coverage per position increase given a GC-content bias (the probability of a nucleotide is not 0.25)?

How would coverage per position increase if the sheared fragment were 100bp? 500bp? 1000bp?

How would coverage per position increase given a certain coverage in the input (coverage distribution)?

I think what I will do is calculate the increase in motif frequency in 200bp fragments (as I showed in previous posts) and then calculate how this would increase the reads in the output compared to the reads in the input (based on the coverage-per-base distribution).

Increment of probability with number of motifs in random fragments

More calculations….

This post is part of my thought process to try to understand the preliminary analysis that I have to do (read last two posts).

In the previous post I calculated the distribution of the number of motifs in random fragments of a given size according to a binomial distribution.

Then I calculated the increase expected in the distribution if having 1 or more motifs increases the uptake probability at a certain rate. What I did was:

First, decrease the frequency when there are 0 motifs (x = 0).

Then add the decreased amount, divided by the total number of motif-containing classes, to each frequency with one or more motifs.

This calculation, however, is wrong, since it does not take into account that the increase in frequency has to be proportional to the input frequency.

For instance:

if my input is:

X =     0      1      2      3      4
# =   100    200    300    200    100

and 10% are taken up (90/900):

# =     10     20     30     20     10
Fx =  0.111  0.222  0.333  0.222  0.111

Now, what if having X >= 1 increases uptake 2-fold?

# =     10     40     60     40     20
Fx =  0.058  0.235  0.353  0.235  0.118

This same strategy can be used when the uptake probability increases with the number of motifs, as if each motif increments the uptake probability by 5 percentage points:

prob   10%    15%    20%    25%    30%
# =     10     30     60     50     30
Fx =  0.055  0.166  0.333  0.277  0.166
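This last table can be reproduced directly (a small sketch; the column labels follow the table above):

```python
counts = [100, 200, 300, 200, 100]              # fragments with X = 0..4 motifs
uptake_prob = [0.10, 0.15, 0.20, 0.25, 0.30]    # +5 percentage points per motif
taken = [c * p for c, p in zip(counts, uptake_prob)]
total = sum(taken)                              # fragments taken up in total
fx = [t / total for t in taken]                 # normalized frequencies
```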