Monthly Archives: March 2015

Sensitivity analysis

This post is a plan of the steps I need to follow for the sensitivity analysis that will identify the experimental limitations of my study. In other words, the sensitivity analysis will help me determine what kinds of motifs I will be able to detect.

My thesis study will use synthetic degenerate fragments flanked with Illumina tags, as well as sheared genomic DNA (input DNA fragments) of naturally competent species with (Campylobacter jejuni) or without self-specificity (Thermus thermophilus, Acinetobacter baylyi). Input fragments will then be recovered from the periplasm of Rec2 knockout mutants using an organic extraction technique and sequenced to high coverage (~1000X) to determine whether there is uptake bias.

For degenerate fragments I need to:

1. Determine how small the uptake motifs can be and still be detected.

I have already calculated the frequency distribution of a dimer (AA) and a trimer (AAA) motif in 30bp, 50bp and 100bp fragments.

[Figures: AA frequency distributions (3 figures); AAA frequency distributions (3 figures)]

An important consideration is the probability that the motif is also found in the rest of the synthetic fragment (the spacers and the Illumina flanking sequences).

2. Determine how strong the uptake bias needs to be to be detectable.

2.1. Assuming that a fragment with one motif has the same probability of being taken up as a fragment with more than one.

This probability distribution seems easy to calculate: the frequency of fragments without a motif would be expected to decrease, while the frequencies of fragments with 1 or more motifs would increase evenly, according to the amount by which the 0-motif frequency decreased.

2.2. Assuming that the probability of a fragment being taken up increases as its number of motifs increases.

The idea I have to solve point 2.2 is that first I can assume that fragments without a motif would decrease in frequency. The frequency that was subtracted from the fragments with 0 motifs is then distributed unevenly among the other classes. I am not sure how to calculate this, since the increases have to be scaled so that the total frequency still adds up to 1 (100%).
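One possible way to handle the "total must stay at 1" constraint in points 2.1 and 2.2 is to give each motif class a relative uptake weight, multiply the input frequencies by those weights, and renormalize. The sketch below is only an illustration of that idea: the function name, the input frequencies and the weights are all hypothetical, and the real relationship between motif number and uptake is not known at this point.

```python
def output_frequencies(input_freqs, weights):
    """Reweight input frequencies by relative uptake weights, then renormalize.

    input_freqs[k] is the input frequency of fragments with k motifs.
    weights[k] is the (hypothetical) relative uptake of fragments with k motifs.
    """
    weighted = [f * w for f, w in zip(input_freqs, weights)]
    total = sum(weighted)
    return [x / total for x in weighted]


# Made-up input: fraction of fragments carrying 0, 1, 2 and 3 motifs.
input_freqs = [0.4, 0.3, 0.2, 0.1]

# Case 2.1: any fragment with at least one motif is taken up twice as well
# as a fragment with none, regardless of how many motifs it has.
equal_bias = output_frequencies(input_freqs, [1, 2, 2, 2])

# Case 2.2: uptake grows with the number of motifs (weight 1 + k is an
# arbitrary choice made only for this example).
graded_bias = output_frequencies(input_freqs, [1 + k for k in range(4)])

print(equal_bias)
print(graded_bias)
```

Under this model the classes with one or more motifs in case 2.1 all grow by the same factor rather than by the same absolute amount, and the output frequencies add up to 1 by construction.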

For sheared genomic fragments I need to:

1. Determine how small the uptake motifs can be and still be detected.

1.1 How many uptake motifs (of different sizes) are expected for different average sizes of sheared genomic fragments.

1.2 How coverage will affect the ability to detect different uptake motifs.

For a given genome coverage I need to calculate the coverage per base and estimate the number of bases covered by few reads (below a threshold, for example 10 reads).

1.3 What the average number of reads per uptake motif will be, given that the motif increases the chance of being taken up by different amounts.
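For point 1.2, a rough first estimate of the number of poorly covered bases can be sketched by assuming that the number of reads covering each base follows a Poisson distribution around the mean coverage. This is an assumption, not something established above, and real coverage is usually more uneven than Poisson. The genome size and the function names below are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean); terms are computed in log space."""
    return sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def expected_low_coverage_bases(genome_size, mean_coverage, threshold):
    """Expected number of bases covered by fewer than `threshold` reads."""
    return genome_size * poisson_cdf(threshold - 1, mean_coverage)


genome_size = 1_800_000  # hypothetical genome size in bp
for mean_coverage in (10, 20, 50, 1000):
    low = expected_low_coverage_bases(genome_size, mean_coverage, threshold=10)
    print(mean_coverage, low)
```

The mean coverage itself can be estimated as (number of reads * read length) / genome size. Under the Poisson assumption the expected number of bases below 10 reads becomes negligible long before 1000X, so any low-coverage regions seen in practice would point to other sources of unevenness.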



Fragments and probability law correction

This post is a correction of my previous post. Previously I described the calculations that my supervisor and I made to estimate the probability that a simple uptake motif (let's say the dimer AA) is present in a 30bp degenerate fragment.

The calculation was:

Given 4 bases (A, C, G, T), the probability of having an A is 0.25 (1/4).

The probability of having the dimer AA is 0.25*0.25 = 0.0625; that is, of the 16 possible dimers, 1 is a success (1/16).

0.0625*29 base positions = 1.8125

The calculated value 1.8125 is the mean number of AA present in a 30bp random fragment (when only one strand is considered).

Previously I stated that this calculation was wrong; however, it is in fact correct.

I verified this by calculating the mean number of AA’s based on the frequency distribution that I calculated before:

[Figure: 30bp frequency distribution, one strand]

I did this by calculating: sum(AAs per fragment * frequency per 30bp fragment).

The mean number of AAs was 1.8125, equal to what was previously calculated.
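The same check can be sketched in code. The version below does not reuse my frequency table; instead it builds the exact distribution of the number of AA dimers on one strand of a random fragment (counting overlapping occurrences, so AAA contains two AAs, which matches the 29 positions used above) and then takes the mean as sum(count * frequency). The function name is hypothetical, and equal base frequencies of 1/4 are assumed.

```python
from fractions import Fraction


def aa_count_distribution(length):
    """Exact distribution of overlapping AA counts on one strand.

    Tracks the probability of each (last base is A, AA count so far) state
    while extending a random sequence one base at a time.
    """
    p_a = Fraction(1, 4)
    state = {(True, 0): p_a, (False, 0): 1 - p_a}
    for _ in range(length - 1):
        nxt = {}
        for (last_is_a, count), prob in state.items():
            # Next base is A: this completes an AA only if the previous base was A.
            key = (True, count + 1 if last_is_a else count)
            nxt[key] = nxt.get(key, 0) + prob * p_a
            # Next base is not A: the count is unchanged.
            key = (False, count)
            nxt[key] = nxt.get(key, 0) + prob * (1 - p_a)
        state = nxt
    dist = {}
    for (_, count), prob in state.items():
        dist[count] = dist.get(count, 0) + prob
    return dist


dist = aa_count_distribution(30)
mean = sum(count * prob for count, prob in dist.items())
print(float(mean))  # 1.8125
```

The mean comes out as exactly 29/16 = 1.8125, in agreement with 0.0625*29, even though neighbouring dimer positions overlap and are therefore not independent.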

My confusion originated from not clearly stating what I was calculating.

Fragments and laws of probability

Today I spent the day working through the calculations necessary to infer the probability that a simple uptake motif (let's say the dimer AA) is present in a 30bp degenerate fragment and in a 50bp degenerate fragment. The objective is to be able to choose the appropriate size of the degenerate region in my synthetic fragments.

First, my supervisor and I did a basic calculation to figure out the probability of finding an AA in a 30bp fragment.

Given 4 bases (A, C, G, T), the probability of having an A is 0.25 (1/4).

The probability of having the dimer AA is 0.25*0.25 = 0.0625; that is, of the 16 possible dimers, 1 is a success (1/16).

0.0625*29 positions (or events) = 1.8125

But first, to understand the number of positions, let's imagine we have a 5 base pair sequence:

[Figure: number of events]

If we consider each nucleotide position as an independent event, we have 5 events.

If we consider each dimer as an event, we have 4 events.

0.0625*29 positions (or events) = 1.8125 probability in one strand

1.8125*2 = 3.625 probability in 2 strands (since in random sequence the probability of AA is equal to the probability of TT, and the complementary strand should have the same proportions)

We thought (or at least I thought) that this result meant that we have 3.6 AAs on average in a 30bp fragment.

However, after one day of calculations and frustration I realized that these calculations are wrong.

I went back to the statistics books from my undergraduate classes (10 years ago) and realized that this is a classic binomial distribution problem. The biggest mistake in the calculations above is that we only took into account the probability of success (p), but not the probability of failure (q).

The probability is given by the formula:

P(x; n, p) = nCx * p^x * q^(n-x)

where n = number of trials

x = number of successes, the binomial random variable (0, 1, 2, 3, ..., n)

p = success probability

q = failure probability (1 - p)

nCx can be resolved in two ways. The first uses the binomial coefficients of Pascal's triangle, and the second computes nCx using factorials. If factorials are used, the formula above turns into:

P(x; n, p) = [n! / (x! * (n-x)!)] * p^x * q^(n-x)

Using this formula (multiplying by 2 to account for two strands), I was able to calculate the frequency distribution of the number of AAs in 30bp fragments and in 50bp fragments. Note: if I am statistically strict, I should not multiply the frequencies by 2, since they would no longer add to 1 (100%); they would add to 200% (100% for each strand). Instead I would have to multiply by 2 the estimated number of fragments given a certain number of reads. This is just an observation, since it wouldn't change my results.
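A minimal sketch of the binomial calculation for one strand, assuming n is the number of dimer positions (fragment length minus 1) and p = 1/16. The function names are hypothetical. Treating the dimer positions as independent trials is an approximation, because neighbouring dimers share a base.

```python
import math


def binomial_pmf(x, n, p):
    """P(x; n, p) = nCx * p^x * q^(n-x), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, x) * p**x * q ** (n - x)


def aa_binomial_distribution(fragment_length, p=1 / 16):
    """Binomial frequency distribution of AA counts on one strand."""
    n = fragment_length - 1  # number of dimer positions
    return [binomial_pmf(x, n, p) for x in range(n + 1)]


for length in (30, 50):
    dist = aa_binomial_distribution(length)
    p_at_least_one = 1 - dist[0]
    mean = sum(x * prob for x, prob in enumerate(dist))
    print(length, round(p_at_least_one, 4), round(mean, 4))
```

The frequencies for each fragment length add up to 1, and the mean of the distribution equals n*p, which for a 30bp fragment is 29/16 = 1.8125.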



Summary of last week's meeting with the former post-doc of the laboratory

Last week, my supervisor and I met with the former post-doc of the lab (now a PI at another university) to discuss the plans for the experimental design of my thesis.

My experiments analyze DNA uptake bias in gram-negative bacteria using a methodology that allows us to extract DNA taken up by a Rec2 knockout (in this mutant, DNA is stuck in the periplasm) and sequence it at high coverage. The fragments taken up (output) are then compared to the input DNA to find motifs that might be taken up more frequently than others. As input fragments for my uptake experiments I will use sheared genomic DNA and a synthetic fragment with a 30-50 bp degenerate region.

One part of the discussion with the former post-doc and my supervisor was how the degenerate fragments will be synthesized. These fragments would contain Illumina adapters, an Illumina priming site, the degenerate region, and an Illumina barcode (or index). The main question was where to locate the barcodes. It seems that the most effective strategy is to have a left fragment containing an adapter, the primer site, the degenerate region, and a spacer complementary to the right fragment. Then we can have 12 right fragments, each with a different barcode as well as the other adapter.


Each sample tested would have 3 biological replicates and 3 input replicates. Each replicate would use a different barcode, which would allow us to determine whether a barcode accidentally matches an uptake sequence. I was thinking that the mix-and-match of samples and replicates could follow a randomized block design using two Illumina lanes.

[Figure: block design]

Finally, we discussed that it would not be a good idea to sequence all the samples from my thesis in the same run (even though this would make more sense economically), since it could be risky if something does not go according to plan. So, first I will sequence the H. influenzae samples on a MiSeq.

The second part of the discussion concerned the short-term plans for the preliminary analyses I have to do:

The first challenge is understanding what the input will look like. In other words, what would the random variation (noise) in the number of fragments taken up be if all fragments are taken up equally (the null hypothesis)? Once I have figured this out, the next step is understanding how strong or weak the bias has to be to be detectable using our current methodology. This point is important since my thesis will use species that take up their own DNA as efficiently as DNA from distantly related species, and we expect to find very simple uptake biases that might not be as strong as in species that take up only DNA from closely related species (Haemophilus influenzae and Neisseria spp.). Another challenge is to determine how useful each technique (degenerate vs. genomic fragments) is and what its limitations are. This relates to the previously explained analysis of the limitations of each technique.
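As a starting point for the null hypothesis, the noise in the number of reads for one class of fragments can be sketched with simple binomial sampling: if a class makes up a fraction f of the input and all fragments are taken up equally, each of the N output reads falls in that class with probability f. The sketch below assumes sampling is the only source of noise (it ignores PCR, extraction and sequencing biases). The function names and the z cutoff are hypothetical choices for illustration.

```python
import math


def null_count_summary(n_reads, input_freq):
    """Expected read count and standard deviation under equal uptake."""
    expected = n_reads * input_freq
    sd = math.sqrt(n_reads * input_freq * (1 - input_freq))
    return expected, sd


def smallest_fold_change_above_noise(n_reads, input_freq, z=3.0):
    """Fold enrichment that sits z standard deviations above the null mean."""
    expected, sd = null_count_summary(n_reads, input_freq)
    return (expected + z * sd) / expected


# Made-up numbers: a fragment class at 1% of the input, at several read depths.
for n_reads in (1_000, 100_000, 10_000_000):
    print(n_reads, smallest_fold_change_above_noise(n_reads, 0.01))
```

With more reads, the fold change needed to rise above sampling noise shrinks towards 1, which is one way to relate sequencing depth to how weak a bias could still be detected.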