Monthly Archives: May 2016


Lately I have been working on generating a 200 bp fragment  that will contain a 69 bp randomized region. I will do this by using two large oligos with a 30bp overlapping region. As shown in the figure below, the right oligo has a illumina sequencing primer site, extra bases and the overlapping region. On the other hand, the left oligo has a illumina sequencing primer site, 3 fixed bases, 69 randomized base, and the overlapping region with the left oligo.



The illumina Nextera sequencing  priming sites on the flanks (see figure below) will allow the fragment to be sequenced in both directions,if needed. Additionally this region will act as template for the primers, used by Nextera, to attach the barcodes and the adapters to the fragments to be sequenced (see figure below, barcodes are shown in grey and light blue, adapters in yellow and dark blue). Generating this 200 bp fragment is an essential part of my thesis, since I will use them as input DNA for uptake experiments using competent Acinetobacter baylyi and Thermus thermophilus. By comparing the sequences from the fragments that were taken up and from the input DNA fragments, I will try to determine if this bacteria have an uptake biases for a particular sequence motif.



In a first experiment, I annealed the two oligos using 1.1 ug of the right (38pmol) and left (26.5 pmol) oligos, as well as Tris and water in a 20 ul reaction. The oligos were then denatured at 95 degrees, then the temperature was decrease at 0.1 degree/second until it reached 65 degrees for 5 minutes.

Then, I used 2 ul of this annealed oligos reaction as part of a 6 PCR reactions using primers that will amplify the 200 bp fragment.

Besides the 2 ul of anneal oligos, 4 of the 6 reactions used: 200uM dNTPs, 0.2uM primers, 1X buffer, 2 units of Onetaq (NEB).

One reaction (called extension only) used: 2 ul of anneal oligos, 200uM dNTPs, 1X buffer, 2 units of Onetaq (NEB), but no PCR primers.

One reaction (No taq) used: 2 ul of anneal oligos, 200uM dNTPs, 1X buffer, 0.2uM primers, but no taq polymerase.

Amplification program includes an extension step:

65 degrees   10 min

Followed by a standard PCR protocol:

95 degrees 2 min

95 degrees 30 sec ]

55 degrees 30 sec ]    1, 5, 10, 20 cycles

68 degrees 30 sec ]

95 degrees 5 min

The “taq only” and “extension only”samples were taken out of the thermocycler after the extension step. The rest of the 4 reactions were amplified for 1, 5, 10, 20 cycles.



Results showed that:

  1. Comparing notaq and extension only samples we can see that the extension works, producing a 200 bp fragment
  2. PCR cycles seem to be amplifying an unspecific band at 250 bp.


In a next experiment, I was looking to determine why I am getting this 250 bp band. My, hypothesis was that not-fully synthesized “intermediate” sequences, which are common in long oligo synthesis, were responsible of the unspecific band. With that in mind,  I tested if denaturing, re-annealing and re-extending oligos several times (without any PCR amplification, only extension) would remove, at least partially this “intermediate” oligos that could be responsible of  the 250 bp band observed.

In this experiment, I used  ~220 ug of the right (38pmol) and left (26.5 pmol) oligos, as well as Tris and water in a 20 ul reaction.

The oligos were then anneal and extended using three distinct protocols:

  1. Samples were denatured at 94 degrees, then the temperature was decrease at 0.1 degree/second (ramp) until it reached 58 degrees for 2 minutes and then temperature was increased by 10 minutes. Next, I used 5 cycles of: a 94 degrees denaturation (45 seconds), 58 degrees (1 minute) and 65 degrees (2 minutes).
  2.   Samples were denatured at 94 degrees, then the temperature was decrease at 0.1 degree/second (ramp) until it reached 58 degrees for 2 minutes and then temperature was increased by 10 minutes
  3. Samples were denatured at 94 degrees and then anneal and extended at 65 degrees 10 minutes.



Results showed that:

  1. Denaturing and re-annealing/extending the oligos (1 protocol) did not help, and as a matter of fact it produced a 250bp unspecific fuzzy band that looks very similar to what I saw in the first experiment.
  2. This unspecific band seems to be produced by mis-priming of the long oligos and not mis-priming of PCR primers.


In this point I still believed that oligo mis-priming originated because of not-fully synthesized oligo intermediate, so I gel purified the anneal/extended product of the second reaction.  Next, I did a PCR using a 1/100 dilution of the anneal extended gel purified and non-purified product with 4 different annealing temperatures each (62, 60, 58, 55.7 degrees).



Results showed a 250 bp band with a very faint 200bp band.  It is surprising that even after gel purification I still get a 250bp band, that it is even more intense that the expected 200bp band.

At this point, it seems that there is any problem with the oligo sequences themselves, so I looked at the oligo design carefully in order to discover if there is any region of the oligos prone to miss-priming.

By looking at the sequences of the oligos, I realized that a 16 bases region of the overlaping region from the left primer (third row of the figure below) anneal perfectly (region in black rectangule) with extra bases from the extended forward strand (second row of the figure below)


This mis-priming would leave a 5 ‘ overhang that could easily get extended from the 3′ end of the forward strand and the 3’ end of the reverse strand, generating a 245bp sequence.


Additionally the sequence located in the extra bases of the left primer is partially complementary to each other (ccgcatcAGGTGGCACGAGgatgcgg).

This mis-priming is consistent with the results from the second experiment in which I tested three annealing/extension protocols. The first and second protocols had different results, despite that the first part of them is identical (94 degree denaturation followed by a rap decrease of 0.1 degree/sec to 58 degrees annealing). However, when I denaturate and re-anneal/extend the oligos in the first protocol, then I see the ~250bp band. Maybe the slow ramp gave enough time at higher temperatures (~65 degrees) to anneal the oligos correctly.

Eventually I will have to re-design and re-order at least one of both oligos, since even if I am successful amplifying the 200 bp fragment, I do not like any strange mis-priming effects when barcodes and adapters are introduced later on during the library prep step previous to sequencing the fragments.











Peak Finder

As discussed extensively in previous posts, our analysis of DNA uptake data in Haemophilus influenzae is going very well. So far it seems that most of the variation in DNA uptake across the genome are explained by the proximity to an uptake signal sequence (USS).  We have scored the genome using two distinct uptake motif models. The first motif model or “genomic uptake motif” is based on the USS sequences enriched the genome, while the second model or “uptake bias motif” is based on experiments characterizing the biases on DNA uptake machinery.



Scoring the genome with both of these models results in a score per genomic position that allowed us to identify regions of the genome with USS or USS-like sequences.  After several distinct analysis, that I am not going to discuss in this post, we were able to determined: 1. the most adequate cuttoff  score to identify which sequences could be classified as USS. 2. which uptake motif model performed better, generating less false positives. Additionally, we now have a good idea of how well the proximity of a genomic region to a USS sequences could explain DNA uptake (at least for small donor DNA fragment sizes).

The next step in our analysis is to find an independent way to call for uptake peaks, seen as  positions with local peaks in uptake ratios across the genome. Once we identify this peaks, we will be able to compare them with the positions that we predicted as USS based on our scoring method (genomic motif or uptake bias motif) and in our chosen cutoff. I predict that this comparison will be able to determine:

  1. how many uptake peaks are explained by the presence of a USS position nearby.
  2. how many uptake peaks have high uptake despite been far away from a USS position.
  3. Among uptake peaks, does a higher uptake score also means a higher uptake peak.

The strategy that I will apply in this analysis is to first find all the peaks of the data by using the peak finder function created in R. This step was already done, and it was not easy to implement. The function depends on two parameters “w” and “span”.  As explained in here, the function works by smoothing the data  to then find local peaks by comparing local maxima with the smoothed points. By changing these parameters (w = 200, span = 0.02), I was able to optimize the function finding each local peak within the first 10 kb of the genome.

However as has can be seen,  some false positives were seen in position with low depth called as peaks.


Note: x represent the genomic positions, while y represent uptake ratio. Dots in red are peaks identified by the peak finder function. Black line is the smoothed line made by the function. Grey dots are the actual data points showing uptake ratio at each position

To eliminate this false positives, I introduced another parameter called “cut”. This parameter allowed me to filter all peaks in positions with uptake ratios lower than a certain threshold (cut = 0.5). Next, the entire genome has to be screened in order to generate a list of positions with uptake peaks. This was done by breaking the genome into small pieces (10kb/each) and then screening each piece for local peaks. The genome was splitted, since the function performed much better when smaller sections of the genome were used. In other words, when larger sections of the genome were used,  the peaks found were farther away from the real peaks.

This could be seen both graphically and by examining the peak lists generated.

For instance, the two figures above shows the DNA uptake profile of a section of genome. Blue point represent positions with more than 10 reads in the input, red point are sites with less than 10 reads in the input. Black dots represent positions identified as peaks, when the genome was split into 10kb pieces (first figure), and when the genome was split into 100kb pieces (second figure)



It is clear by seen both figures than when the genome was split into larger sections, the peak finder function was less precise.

In the table below we can see a list of the first 7 positions of the genome classified as USS by our uptake bias model , aligned by the focal position (column 1) and by the central C of the core of the motif( column 2) ; as well as the first 6 -7 positions identified as uptake peaks when the genome was split into 10kb fractions of the genome (column 3), into 100kb fractions (column 4), and to the entire non-split genome (column 5)




As can be seen, the smaller the genome fractions used to feed to the function, the closer the positions are to either the USS focal or to central C positions. For instance, when the genome was analyzed breaking it in 10kb pieces, most uptake peaks were just a few bases away from the USS focal or central C position. On the other hand, when the genome was analyzed breaking it in 100kb pieces, one position around 1023 bases was not recognized as a peak and the rest of the peaks were further away from the USS positions. As an extreme, we can see that running the function to the entire genome (without fractioning it first) resulted in a list with many missing peaks (no peaks in the first 80kb of the genome).

The table above also tells us that when the uptake peak list generated breaking the genome in 10 kb fractions correlates very well with the USS positions list. Now we only need a graphical way to visualize how well both of this lists match together.

One way to graphically analyze the uptake peak data is to plot simply the distribution of mean uptake ratio of the uptake peak list. Of course this analysis does not tell us how well uptake peaks match with the USS list, however it does give us an idea of the distribution of uptake ratios from peak positions.

I predict that the distribution should look like something similar to the figure below, where most peaks are located between 3 and 4, with a few peaks with higher or lower uptake ratios (red arrows). Positions with lower uptake ratios (less than 2) could represent peaks with a either a non-optimum USS sequence or artefacts of the peak finder that were not removed by the “cut” argument (of 0.5). On the other hand peaks with uptake ratios higher than 5 could represent positions with either a extremely good USS sequence or a overestimated uptake ratio given positions in the input with low sequencing depth.

predicted distribution


The actual distribution of mean uptake ratios of uptake peaks, shown below, looked somewhat similar to what I predicted. Now by simply extracting positions with higher and lower uptake ratios I could look carefully at the sequences of this positions in order to determine why they have those uptake ratios.


However, This analisis still does not tell me directly how well the uptake peak list correlate with the USS list. In order to graphically see how both list match each other I will calculate the distance of each uptake peak of the list to the nearest uptake USS (using the central C position). Then, I can plot this distance vs the uptake ratios of the uptake peaks.

I expect that this plot will look like the figure below:



I expect that most peaks will be just a few bases from the nearest USS (1 – 40 bases). Artefacts that were clasified as peaks but that are not real peaks are expected to be farther away (red arrows) from the nearest USS and are expected to have low uptake ratios, since they might not be eliminated by the cutoff argument chosen (0.5). If there were peaks that are not explained by a USS sequence, we would expect that this peaks should be farther away from a USS, but they would have a high uptake ratio (green arrows). This analisis, would tells us right away:

  1. how many uptake peaks are explained by the presence of a USS position nearby.
  2. how many uptake peaks have high uptake despite been far away from a USS position


Finally, to answer the question if a higher uptake score also means a higher uptake peak, I could plot a simple regression of both parameters and see how assess the slope of the regression. Based on what we have seen so far, it seems that increasing the USS score beyond a certain point does not increase uptake ratio any further. Of course this analysis would gives us only a rough idea since other factors such as interaction between more than one USS or interactions between positions in the USS motif are not included in the analysis.



Update (03-08-2016)

The plot of distance to the nearest USS to uptake ratios of the uptake peak list is shown below.




By looking at the figure, I realized that there are a few things I did not taken in account.

  1. There are more points than what I expected further away from 200 bases. Some of those points have an uptake ratio between 3-5. This point could represent either true peaks that are far away from a USS or true peaks that are actually close to a USS, but that USS was not identified as such given the cutoff used to classify a position as a USSs using the uptake scoring function.
  2. When I identified uptake peaks using the R function, I trimmed all peak lower than a variable called “cut”. The cuttoff chosen was 0.5. However, by comparing the number of USS identified by the scoring fuction and the number of peaks identified by the peak finder function, I can see that the number of USS (1391) is higher than the number of peaks (1273). This means that some peaks could have been eliminated by setting the cuttoff too high. This might not change the analysis much (in any case it is better to be conservative with any analysis), but I will repeat the analysis using a lower cutoff to see what happen.

As a result I can see than 80.2% of the peaks identified are closer than 200 bases from a USS position, and 77.7% of the uptake peaks are closer than 100 bases from a USS position. This results show that USSs explained a big percent of the uptake peaks seen in the data. Now the question is what happen with the rest 20%. Are they a result of artefacts of the analysis? or are they really far away from a USS?

The other analysis that I did was the regression of uptake scores vs the uptake ratios of the uptake peak list. The regression clearly show no association. In other words among peaks, increasing the USS score further does not increase uptake.


Note: In this figure I used the re-calculated uptake ratios (summing 1 read to the input to avoid having missing values) in order to mke the regression work.