Synthetic fragment troubleshooting

A crucial step in the next steps in my thesis is to be able to synthesize a 200bp fragment with a randomize section. This fragment is planned to be synthesized by using two oligos, 135 bp oligo and a 95 bp oligo.


The strategy to synthesize the fragment includes anneal and extend the oligos and finally amplify the fragment using two 15-16 bases primers.


Annealing and extension steps show a 200bp band, as expected, with a lot of crude showing as a tail in the agarose gel.  When I used this anneal/extension product as template for a PCR, two bands are visible in an agarose gel as well as a lot of smear, a 200 bp band and a 250 bp band.





‘Anneal oligo’ showed the gel migration of a anneal only product, ‘extension only’ shows migration of anneal/extended product. 1 to 20 are products of PCR amplification using the anneal oligos as template using 1,5,10 and 20 cycles.






Looking at this result, my next strategy was to first gel extract the 200 bp band of the  anneal/extended oligo products, and I used it as template for a PCR. Results showed that now the 200 bp band was stronger and the 200 bp band was fainter.














When I saw this result I decided to look closer at my oligo design. After closer analysis, I looked that a possible misspriming event was possible.


A hypothesis that I had at the momment was that maybe the PCR oligos were too small (15 – 16 bases) and they were mispriming somewhere in the template. I designed new primers 22 – 25 bases long that were more specific to the end of the annealed/extended fragment. However, amplification with this primers should a strong 250 bp band, with  very faint bands at higher molecular weight (e.g. 300, 350 bp)


This misspriming event was already described in past posts. I sequenced the 250 bp fragment to see if I can discover any evidence of misspriming, however, traces showed not any evidence of missprimming.  What they did showed was that traces were not very good. peaks were very close to each other, with shoulders overlapping, and in some cases, there were small peaks in the bottom.


My next strategy was to clone the PCR amplified 250 bp band and the 200 bp band into a pgem vector. Then I could sequence a clone with each band size using primers that would bind in the vector. This strategy would allow me to sequence the ends of the fragment accurately.

The entire fragment of the clone from the 250 bp band was sequenced, and the traces were very clear. This fragment was in reality 234 bp long and it had a duplication of the last 34 bases from the fragment right end.





Now that I know exactly why my synthetic fragment showed extra 34 bases, I only need to understand why it happened and how can I fix it.

The first idea that comes to my mind is a polymerase slippage caused by some type of hairpin structure in the template or the copied DNA strand. But I have no idea exactly of how this could be happening. I can only think of something like the polymerase slippage model that happens in minisatellite/microsatellite evolution


Sorry I could get a better figure!

My next idea was to test is there are places within the fragment where hairpins could happen. So, I took the right oligo and I run a harpin analysis using the oligoanalyzer program from IDT.

I took the hairpin structure with highest free energy, -15.03 ΔG(kcal.mole-1),   and  I saw several places that could potentially form hairpins within the right oligo.




Three potential places where hairpin could occur are shown. Place 1 hairpin happens in the illumina sequencing priming site (purple). If this hairpin would be happening then I would be in big trouble since I cannot change that region without changing the design entirely. The second place, the longest, happen within the extra bases of the oligo (dark green). This hairpin is key because is very close to the place that was duplicated  (pos 38). If this hairpin was the cause of the extra bases, then the solution would be to redesign only one oligo. The last hairpin is located in the overlapping region with the other oligo. If this hairpin were responsible of the  extra 34 bases, less likely since it is farther away, the solution would be to redesign both oligos.

Finally, I can try is to use a PCR additive that will reduce secondary structures in the template DNA. Biosizebio website describes two additives that I could use: DMSO and glycerol. However, even if these additives work, I might have to redesign at least one oligo since the hairpin structures might interfere with further steps in which adapter and barcodes will be added to the fragments by low cycle PCR .

primNote:Primers with Illumina adapters (in yellow and blue) and barcodes (in grey and light blue) are shown annealed to the strands of the synthetic fragment




Randomized fragment

I am back from vacation and what a better way to shake my mind than a blog post!

Before my vacation break, I was trying to make a double stranded 200 bp fragment with a randomized region by using two different long oligos. One 135 bp with a 70 randomized bases and a 30 bases overlapping region with the other 95 bases oligo.

The 200 bp fragment is intended to make by annealing the two oligos and then extending/amplifying them by a modified PCR protocol.

In my last post, I described all the different strategies that I did in order to obtain that fragment. However, despite my several attempts, I always obtained I 250 bp product, instead of the 200 bp expected product.

new post


After analyzing the oligos sequence in detail, I realized that a un-specific annealing between regions of the oligos could be responsible of the extra 50 bases.

In order to test if this misspriming event was responsible of the extra bases, I sequenced the purified 250 fragment using both left and right primers.

Analysis of the sequence traces from the forward primer and from the reverse primer showed no evidence of misspriming at all.







In fact the sequenced fragment coincide with the expected 200 base pair fragment sequence and not to the product expected if this misspriming event had taken place.

In conclusion, I have no idea why the synthesized fragment migrated as a 250 bp fragment in the agarose gel electrophoresis.   However, I was able to show that the misspriming event did not happen. Two possibilities come to my mind to explain this problem:

2 pos.png


It is possible that some extra bases were added at the 3’ end of the fragment. This extra bases might not be detected in the sequenced fragment since the sequence trace lost resolution near the end of the fragment. However, even if this is the case the extra bases are not going to have negative consequences in the illumina barcoding attachment protocol step.

In this step, a set of primers containing a barcode and illumina flow-cell adapter sequences will anneal with both ends of the fragment. Then, the fragments will be amplified using a short cycle PCR protocol. Since the sequencing results show that the priming sites of this barcoding oligos are intact, then the extra bases won’t be expected to have any effect.


Since no evidence of misspriming was detected, I consider that is safe to try some uptake experiments using the synthesized 250 base pair fragment and then quantify how much I can recover from the wild type Acinetobacter baylyi BD413 and from the no-uptake ADP239 comP::nptII mutant.








Lately I have been working on generating a 200 bp fragment  that will contain a 69 bp randomized region. I will do this by using two large oligos with a 30bp overlapping region. As shown in the figure below, the right oligo has a illumina sequencing primer site, extra bases and the overlapping region. On the other hand, the left oligo has a illumina sequencing primer site, 3 fixed bases, 69 randomized base, and the overlapping region with the left oligo.



The illumina Nextera sequencing  priming sites on the flanks (see figure below) will allow the fragment to be sequenced in both directions,if needed. Additionally this region will act as template for the primers, used by Nextera, to attach the barcodes and the adapters to the fragments to be sequenced (see figure below, barcodes are shown in grey and light blue, adapters in yellow and dark blue). Generating this 200 bp fragment is an essential part of my thesis, since I will use them as input DNA for uptake experiments using competent Acinetobacter baylyi and Thermus thermophilus. By comparing the sequences from the fragments that were taken up and from the input DNA fragments, I will try to determine if this bacteria have an uptake biases for a particular sequence motif.



In a first experiment, I annealed the two oligos using 1.1 ug of the right (38pmol) and left (26.5 pmol) oligos, as well as Tris and water in a 20 ul reaction. The oligos were then denatured at 95 degrees, then the temperature was decrease at 0.1 degree/second until it reached 65 degrees for 5 minutes.

Then, I used 2 ul of this annealed oligos reaction as part of a 6 PCR reactions using primers that will amplify the 200 bp fragment.

Besides the 2 ul of anneal oligos, 4 of the 6 reactions used: 200uM dNTPs, 0.2uM primers, 1X buffer, 2 units of Onetaq (NEB).

One reaction (called extension only) used: 2 ul of anneal oligos, 200uM dNTPs, 1X buffer, 2 units of Onetaq (NEB), but no PCR primers.

One reaction (No taq) used: 2 ul of anneal oligos, 200uM dNTPs, 1X buffer, 0.2uM primers, but no taq polymerase.

Amplification program includes an extension step:

65 degrees   10 min

Followed by a standard PCR protocol:

95 degrees 2 min

95 degrees 30 sec ]

55 degrees 30 sec ]    1, 5, 10, 20 cycles

68 degrees 30 sec ]

95 degrees 5 min

The “taq only” and “extension only”samples were taken out of the thermocycler after the extension step. The rest of the 4 reactions were amplified for 1, 5, 10, 20 cycles.



Results showed that:

  1. Comparing notaq and extension only samples we can see that the extension works, producing a 200 bp fragment
  2. PCR cycles seem to be amplifying an unspecific band at 250 bp.


In a next experiment, I was looking to determine why I am getting this 250 bp band. My, hypothesis was that not-fully synthesized “intermediate” sequences, which are common in long oligo synthesis, were responsible of the unspecific band. With that in mind,  I tested if denaturing, re-annealing and re-extending oligos several times (without any PCR amplification, only extension) would remove, at least partially this “intermediate” oligos that could be responsible of  the 250 bp band observed.

In this experiment, I used  ~220 ug of the right (38pmol) and left (26.5 pmol) oligos, as well as Tris and water in a 20 ul reaction.

The oligos were then anneal and extended using three distinct protocols:

  1. Samples were denatured at 94 degrees, then the temperature was decrease at 0.1 degree/second (ramp) until it reached 58 degrees for 2 minutes and then temperature was increased by 10 minutes. Next, I used 5 cycles of: a 94 degrees denaturation (45 seconds), 58 degrees (1 minute) and 65 degrees (2 minutes).
  2.   Samples were denatured at 94 degrees, then the temperature was decrease at 0.1 degree/second (ramp) until it reached 58 degrees for 2 minutes and then temperature was increased by 10 minutes
  3. Samples were denatured at 94 degrees and then anneal and extended at 65 degrees 10 minutes.



Results showed that:

  1. Denaturing and re-annealing/extending the oligos (1 protocol) did not help, and as a matter of fact it produced a 250bp unspecific fuzzy band that looks very similar to what I saw in the first experiment.
  2. This unspecific band seems to be produced by mis-priming of the long oligos and not mis-priming of PCR primers.


In this point I still believed that oligo mis-priming originated because of not-fully synthesized oligo intermediate, so I gel purified the anneal/extended product of the second reaction.  Next, I did a PCR using a 1/100 dilution of the anneal extended gel purified and non-purified product with 4 different annealing temperatures each (62, 60, 58, 55.7 degrees).



Results showed a 250 bp band with a very faint 200bp band.  It is surprising that even after gel purification I still get a 250bp band, that it is even more intense that the expected 200bp band.

At this point, it seems that there is any problem with the oligo sequences themselves, so I looked at the oligo design carefully in order to discover if there is any region of the oligos prone to miss-priming.

By looking at the sequences of the oligos, I realized that a 16 bases region of the overlaping region from the left primer (third row of the figure below) anneal perfectly (region in black rectangule) with extra bases from the extended forward strand (second row of the figure below)


This mis-priming would leave a 5 ‘ overhang that could easily get extended from the 3′ end of the forward strand and the 3’ end of the reverse strand, generating a 245bp sequence.


Additionally the sequence located in the extra bases of the left primer is partially complementary to each other (ccgcatcAGGTGGCACGAGgatgcgg).

This mis-priming is consistent with the results from the second experiment in which I tested three annealing/extension protocols. The first and second protocols had different results, despite that the first part of them is identical (94 degree denaturation followed by a rap decrease of 0.1 degree/sec to 58 degrees annealing). However, when I denaturate and re-anneal/extend the oligos in the first protocol, then I see the ~250bp band. Maybe the slow ramp gave enough time at higher temperatures (~65 degrees) to anneal the oligos correctly.

Eventually I will have to re-design and re-order at least one of both oligos, since even if I am successful amplifying the 200 bp fragment, I do not like any strange mis-priming effects when barcodes and adapters are introduced later on during the library prep step previous to sequencing the fragments.










Peak Finder

As discussed extensively in previous posts, our analysis of DNA uptake data in Haemophilus influenzae is going very well. So far it seems that most of the variation in DNA uptake across the genome are explained by the proximity to an uptake signal sequence (USS).  We have scored the genome using two distinct uptake motif models. The first motif model or “genomic uptake motif” is based on the USS sequences enriched the genome, while the second model or “uptake bias motif” is based on experiments characterizing the biases on DNA uptake machinery.



Scoring the genome with both of these models results in a score per genomic position that allowed us to identify regions of the genome with USS or USS-like sequences.  After several distinct analysis, that I am not going to discuss in this post, we were able to determined: 1. the most adequate cuttoff  score to identify which sequences could be classified as USS. 2. which uptake motif model performed better, generating less false positives. Additionally, we now have a good idea of how well the proximity of a genomic region to a USS sequences could explain DNA uptake (at least for small donor DNA fragment sizes).

The next step in our analysis is to find an independent way to call for uptake peaks, seen as  positions with local peaks in uptake ratios across the genome. Once we identify this peaks, we will be able to compare them with the positions that we predicted as USS based on our scoring method (genomic motif or uptake bias motif) and in our chosen cutoff. I predict that this comparison will be able to determine:

  1. how many uptake peaks are explained by the presence of a USS position nearby.
  2. how many uptake peaks have high uptake despite been far away from a USS position.
  3. Among uptake peaks, does a higher uptake score also means a higher uptake peak.

The strategy that I will apply in this analysis is to first find all the peaks of the data by using the peak finder function created in R. This step was already done, and it was not easy to implement. The function depends on two parameters “w” and “span”.  As explained in here, the function works by smoothing the data  to then find local peaks by comparing local maxima with the smoothed points. By changing these parameters (w = 200, span = 0.02), I was able to optimize the function finding each local peak within the first 10 kb of the genome.

However as has can be seen,  some false positives were seen in position with low depth called as peaks.


Note: x represent the genomic positions, while y represent uptake ratio. Dots in red are peaks identified by the peak finder function. Black line is the smoothed line made by the function. Grey dots are the actual data points showing uptake ratio at each position

To eliminate this false positives, I introduced another parameter called “cut”. This parameter allowed me to filter all peaks in positions with uptake ratios lower than a certain threshold (cut = 0.5). Next, the entire genome has to be screened in order to generate a list of positions with uptake peaks. This was done by breaking the genome into small pieces (10kb/each) and then screening each piece for local peaks. The genome was splitted, since the function performed much better when smaller sections of the genome were used. In other words, when larger sections of the genome were used,  the peaks found were farther away from the real peaks.

This could be seen both graphically and by examining the peak lists generated.

For instance, the two figures above shows the DNA uptake profile of a section of genome. Blue point represent positions with more than 10 reads in the input, red point are sites with less than 10 reads in the input. Black dots represent positions identified as peaks, when the genome was split into 10kb pieces (first figure), and when the genome was split into 100kb pieces (second figure)



It is clear by seen both figures than when the genome was split into larger sections, the peak finder function was less precise.

In the table below we can see a list of the first 7 positions of the genome classified as USS by our uptake bias model , aligned by the focal position (column 1) and by the central C of the core of the motif( column 2) ; as well as the first 6 -7 positions identified as uptake peaks when the genome was split into 10kb fractions of the genome (column 3), into 100kb fractions (column 4), and to the entire non-split genome (column 5)




As can be seen, the smaller the genome fractions used to feed to the function, the closer the positions are to either the USS focal or to central C positions. For instance, when the genome was analyzed breaking it in 10kb pieces, most uptake peaks were just a few bases away from the USS focal or central C position. On the other hand, when the genome was analyzed breaking it in 100kb pieces, one position around 1023 bases was not recognized as a peak and the rest of the peaks were further away from the USS positions. As an extreme, we can see that running the function to the entire genome (without fractioning it first) resulted in a list with many missing peaks (no peaks in the first 80kb of the genome).

The table above also tells us that when the uptake peak list generated breaking the genome in 10 kb fractions correlates very well with the USS positions list. Now we only need a graphical way to visualize how well both of this lists match together.

One way to graphically analyze the uptake peak data is to plot simply the distribution of mean uptake ratio of the uptake peak list. Of course this analysis does not tell us how well uptake peaks match with the USS list, however it does give us an idea of the distribution of uptake ratios from peak positions.

I predict that the distribution should look like something similar to the figure below, where most peaks are located between 3 and 4, with a few peaks with higher or lower uptake ratios (red arrows). Positions with lower uptake ratios (less than 2) could represent peaks with a either a non-optimum USS sequence or artefacts of the peak finder that were not removed by the “cut” argument (of 0.5). On the other hand peaks with uptake ratios higher than 5 could represent positions with either a extremely good USS sequence or a overestimated uptake ratio given positions in the input with low sequencing depth.

predicted distribution


The actual distribution of mean uptake ratios of uptake peaks, shown below, looked somewhat similar to what I predicted. Now by simply extracting positions with higher and lower uptake ratios I could look carefully at the sequences of this positions in order to determine why they have those uptake ratios.


However, This analisis still does not tell me directly how well the uptake peak list correlate with the USS list. In order to graphically see how both list match each other I will calculate the distance of each uptake peak of the list to the nearest uptake USS (using the central C position). Then, I can plot this distance vs the uptake ratios of the uptake peaks.

I expect that this plot will look like the figure below:



I expect that most peaks will be just a few bases from the nearest USS (1 – 40 bases). Artefacts that were clasified as peaks but that are not real peaks are expected to be farther away (red arrows) from the nearest USS and are expected to have low uptake ratios, since they might not be eliminated by the cutoff argument chosen (0.5). If there were peaks that are not explained by a USS sequence, we would expect that this peaks should be farther away from a USS, but they would have a high uptake ratio (green arrows). This analisis, would tells us right away:

  1. how many uptake peaks are explained by the presence of a USS position nearby.
  2. how many uptake peaks have high uptake despite been far away from a USS position


Finally, to answer the question if a higher uptake score also means a higher uptake peak, I could plot a simple regression of both parameters and see how assess the slope of the regression. Based on what we have seen so far, it seems that increasing the USS score beyond a certain point does not increase uptake ratio any further. Of course this analysis would gives us only a rough idea since other factors such as interaction between more than one USS or interactions between positions in the USS motif are not included in the analysis.



Update (03-08-2016)

The plot of distance to the nearest USS to uptake ratios of the uptake peak list is shown below.




By looking at the figure, I realized that there are a few things I did not taken in account.

  1. There are more points than what I expected further away from 200 bases. Some of those points have an uptake ratio between 3-5. This point could represent either true peaks that are far away from a USS or true peaks that are actually close to a USS, but that USS was not identified as such given the cutoff used to classify a position as a USSs using the uptake scoring function.
  2. When I identified uptake peaks using the R function, I trimmed all peak lower than a variable called “cut”. The cuttoff chosen was 0.5. However, by comparing the number of USS identified by the scoring fuction and the number of peaks identified by the peak finder function, I can see that the number of USS (1391) is higher than the number of peaks (1273). This means that some peaks could have been eliminated by setting the cuttoff too high. This might not change the analysis much (in any case it is better to be conservative with any analysis), but I will repeat the analysis using a lower cutoff to see what happen.

As a result I can see than 80.2% of the peaks identified are closer than 200 bases from a USS position, and 77.7% of the uptake peaks are closer than 100 bases from a USS position. This results show that USSs explained a big percent of the uptake peaks seen in the data. Now the question is what happen with the rest 20%. Are they a result of artefacts of the analysis? or are they really far away from a USS?

The other analysis that I did was the regression of uptake scores vs the uptake ratios of the uptake peak list. The regression clearly show no association. In other words among peaks, increasing the USS score further does not increase uptake.


Note: In this figure I used the re-calculated uptake ratios (summing 1 read to the input to avoid having missing values) in order to mke the regression work.





Preliminary data analysis

In my last post, I explain the plan that I had to analyze the data from the project I am working on. This project is looking to determine  why some regions of Haemophilus influenzae genome are taken up more than others by competent cells in DNA uptake experiments.

As explained in that post, the first step in my analysis was to determine which positions correspond to USS or USS-like sequences. This is done by scoring the genome using a position specific scoring matrix (pssm), using a similar  approach as the one used by various software to create sequence logos (such as the sequence logo of a genomic USS region shown below).


However, as explain in a post from my supervisor, the USS scoring can be done in multiple ways (using the genome enriched motif, the uptake motif, taking in account positions interactions, weighting positions within the motif according to how important they are for uptake, taking in account GC content).

In the analysis showed in this post, I used the most simplistic approach to score the genome. That approach was to score the genome using the genome enriched motif and giving all positions the same relative weight to the overall score (no position is more important than the other). Of course, this approach is not realistic, but it will give a first glims at the data and will allow further comparisons with more realistic and complicated scoring models.

The second step in my data analysis plan was to plot the scores against the uptake ratios. However, as explained in the post that analysis is not very useful, so I will not discuss it any further. Instead, I will focus  on the third analysis described in my post which is to plot the uptake ratios to the distance to the nearest USS at each position in the genome.

My supervisor did a figure (included below) that explain what we expected to see in this analysis plot:

post Rosie



As explained in my supervisor post, In this figure, we can see the points for a band that goes down from an uptake ratio of 4 (enriched presumably because of a USS nearby) to a uptake ratio < 1 when distances from a USS are more than 500 bases. It is important to remember that the donor DNA in this data was sheared to an average size of 250 bp. A wide band of points would mean that there are a lot of points where the USS score does not capture all aspects explaining the influence of USS on uptake. She stated on her blog that there are three reasons why this could happen:

(Cited directly from her blog post)

  1. whether the USS’s orientation on the chromosome affects uptake (USS motifs are asymmetric)
  2. how well the USS’s sequence matches the several different ways we can score sequences as possible USS (genome-based, uptake-based, and with or without internal interaction effects between positions)
  3. how much the presence and relative locations of additional USSs adds to uptake


As part of this analysis, I have not only to calculate the distance of each position to a USS in the genome but also, I need to choose a threshold to determine what is the lowest score that can be identified as a USS. The figure below shows the distribution of scores in the genome. It is evident that only a small fraction of the score (see second figure below) have score higher than  15.



The threshold (red line) that is chosen, to discriminate what is a USS and what is not, could have an impact on the data because true USS’s with a score just below the threshold could be identified as non-USS sequences. This points will appear as outliers to the right of the curve (below) and can be confounded as non-USS sequence factors that influence uptake.  This effect is exactly what I saw the data from the figure below:



In contrast to the expected figure made by my supervisor, we can see a very broad band goes down from an uptake ratio of 6 (with some points higher) to a distance ~500 bp. Additionally, there are a few peaks (red arrows) that have increased uptake ratios at distances > 500 bp. Even though these peaks uptake ratios are not very high (an uptake ratio of 1 represent the same coverage as the input), they are considerably higher than the background. The threshold to determine the lowest score that could be considered as a USS in this analysis was 19.04. This score was chosen because it was the lowest score among the sequences chosen to build the pssm matrix, but there is no a stronger reason to have chosen that number. As a result, when I evaluated the positions that belong to those peaks closer, I saw that they belong to descrite narrow regions (~ 200 bp), and that they had one position centered around the middle of the range with a score slightly lower than the threshold chosen. When the same data was re-analyzed chosing a lower threshold (18) then most of these peaks were eliminated, as it shows the figure below:

distance 2

It can also be seen that when the threshold of 19.04 was chosen, the points farther away to a USS were at > 6000 bases, while when the threshold of 18 was chosen, the points farther away to a USS were at < 4000 bases. However there are still a couple more peaks that can be seen closer to a distance of 1000bp.

As a conclusion from this analysis, I can see that despite the simplistic scoring methodology that I chose to identify a USS,  the distances to the closest USS are apparently able to explain a lot of the variation in uptake. However, ther are still many questions that remain, such as what are the effects of interaction between USS’s?, why some USS have higher influence in uptake than others?, and which percentaje of the variance can be explained by USS related variables?. However, from this analysis there is a question that need to be answered first:

  1. Why the main band of points observed in the data figure is more width than the  expected from the figure from my supervisor (above). This broad band means that there are several points that are close to a position identified as a USS that have low uptake ratios. This points can be caused by two factors:
    • Since the scoring method chosen was very simplistic there are many positions that do not contribute to uptake, and well as others that contribute a lot to uptake but are scored equally than other non-important positions. This could cause positions with a high score that are identified as a USS, but that have crucial mutations in the core of the uptake motif (positions 7-9, GCGG, see sequence logo above) that would cause decrease uptake. The contrary scenario is also posible, positions that are different from the USS consensus, lowering the score, but that do not contribute to uptake.
    • Positions close a true USS, but that are close to another un-identified sequence that decreases uptake.

To answer this question I need to determine how low are the uptake ratios in the regions depleated of uptake. In other words, how low are the valleys shown when uptake ratios are plotted vs genomic positions

0.25kb and 6kb

In red, the mean uptake ratios from donor DNA sheared at an average size of 250bp are shown. Peaks represent position enriched in DNA uptake while valleys are positions depleted in DNA uptake. To calculate the mean uptake score of the valleys, I took all positions with a ratio lowers than 0.3 (depleted 5 fold) and I calculated the mean uptake ratio, which was 0.044. With this in mind, I wanted to determine if there are positions in this range that have high USS scores (At least higher than the threshold)

The next figure I explain the expectations of this analysis. First, I calculated the mean SS score of positions with an uptake ratio below 0.3, which was 0.044. Then I wondered, what would I see if I plotted the mean uptake ratio of positions with ratios lower than 0.3 over the USS scores? Since the mean of this points is 0.044, I expect most points to be located closer to 0. Additionally, I expect that most point should have a low score (below the threshold), and that points above this threshold (red arrows) should be either positions with low uptake despite having a true USS, or positions with a sequence incorrectly classified as a USS.

predicted USS_0.3.png

The actual data is shown below:


The red line is set indicating the threshold of 18. The figure shows how there are still some points with USS scores above the threshold. This points were identified as USS’s, however they clearly have low uptake ratios. As explained earlier, these points could very well be positions that have USS-like sequences, but that that have mutations in the core of the motif that decrease uptake; also these points could be true USS’s with low uptake (for an unknown reason). One way to answers this question is to repeat the analysis using a more complex scoring model.

However, before jumping into that, there is one more analysis I need to make. The former post-doc of the lab has suggested me to calculate the maximum score of each position to a USS within different distances (e.g. 100 bp, 200bp,  500bp). Then, I could plot that max score against uptake ratios. Before analyzing the data, I expected to see that most points would increase in uptake ratios as the maximum score within  200bp


Arrows show positions that, either have high uptake ratios despite being close to a USS (two arrows on the left), or have extremely high uptake ratios (arrow on the right).  I expect the band of points to be broader close to a max score of 10 and narrower at the max score increases. I also expect to see this pattern reversed when I plot a max score within 500 bp instead of 100 or 200 bp (more points should now cluster close to a higher max score).



When we evaluate the data using this approach we can see:




  1. The mean uptake ratio increase as the max score also increase, but not as much as I thought. This in part is because a mistake in judgement by my part, since I didn’t realize that most peaks (in the genome uptake profile figure shown above) go until 5 or 6 and not until 10.
  2. Outliers are shown as vertical lines starting at the broad line of points. This could mean that they belong to a discrete narrow region of the genome with possibly a USS misrepresented.


At this point the more smartest thing to do would be to reanalyze the data including a better scoring model. I could keep writting about strategies to do this but I am starting to get tired and this post is getting too long.

Data analysis plan

Finally, I got the sequencing data from a project I had been working on. This project is trying to determine the factors explaining why some regions of Haemophilus influenzae genome are taken up more than others by competent cells in DNA uptake experiments. To answer this question, I did a series of uptake experiments using donor DNA from two distinct strains 86-028NP and PittGG, sheared to average sizes of 0.25kb and 6 kb. Those donor DNA’s were incubated with competent Rd cells.  Donor DNA was later recovered using an organic extraction method that allowed us to isolate periplasmic DNA from (most)  chromosomal DNA. The recovered donor DNA’s, as well as the inputs from each experiment, were deep sequenced using Illumina technology.

The former post-doc of the lab, who is now a PI, with one of his students have already done the job of aligning the reads to the donor and recipient genomes as well as normalize reads by sequencing depth. Reads aligned to recipient genome represent chromosomal contamination that was able to co-extracted (despite of the agarose gel purification clean-up methods used) . Additionally, they divided the normalize reads per each position to the normalized reads of the input to generate an uptake ratio where 1 represent  indicates no enrichment or depletion of the position in periplasm relative to the controls.  A number > 1 indicate enrichment and < 1 indicate depletion. When plotted over the genome positions, this uptake scores show positions of the genome that are taken up more than others and that are enriched in the samples compared with the input control.

0.25kb and 6kb

In red, mean uptake ratios for each genomic position are shown from the 86-028NP 0.25kb (average) sheared donor DNA. In blue mean uptake ratios for each genomic position are shown from the 86-028NP 6kb (average) sheared donor DNA.

My objective is to determine if the variance in uptake across regions of the genome can be explained by the presence of a ‘uptake-signal-sequence (USS), as well as distance from a certain point in the genome to the nearest USS, or even interactions between several USS (when this are close together). This analysis will also help us to determine regions that have high uptake but no nearby USS as well as regions with low uptake despite having a USS closeby. These regions in which uptake is not explained by USS presence can be studied in more detail to try to identify non-USS motifs that might explain low or high uptake.

As it is shown in the figure uptake ratios of 0.25kb donor DNA (in red) are sharper and more spaced than 6kb donor DNA (blue). For this reason, I will start my analysis using the uptake ratios from the 0.25kb 86-028NP donor DNA.

My first task in this analysis is to determine if USS-dependent variables (e.g. presence of USS, distance to the closest USS, interactions between more than one USS) would explain regions of the genome that are enriched (peaks) of depleted in the uptake genomic maps.

  1. To do this I need to first which positions can be called as USS or are USS-like. This will be done by scoring the genome of the donor DNA using a position-specific scoring matrix (pssm) that was derived from a set of 2206 sequences found by Rosie that held a USS motif. These sequences were obtained by using the Gibbs motif sampler on Rd H. influenzae genome. 
  2. The second step is to plot the scores against the uptake ratios. But there is a problem, since the scoring  is done following a sliding window of the same size as of the pssm matrix (37 bp), each score represents how well the window of 37 bp, starting at that position, matches with the USS sequence. Then, a position immediately upstream or downstream from a USS site (high score) will have a low score. This happens because now the window would be out of frame from the motif. This is the reason this analysis is expected to not fully represent the relationship between a nearby USS in a genomic region and uptake.
  3. To have a better graphical representation of the correlation between USS presence in a genomic region with DNA uptake, I will have to plot the uptake ratios to the distance to the nearest USS at each position in the genome. However, the distance to the nearest USS in genome can be calculated in several ways: distance to any nearby USS (both strands included), distance to the nearest USS on the top strand, distance to the nearest USS on the bottom strand, distance to the nearest USS upstream or downstream from the position under question. I will start using distance to any USS (both strands included) plotted against uptake ratios. Later as I am able to evaluate the results I can try other options. If USS explain fully DNA uptake, then I expect that the positions with higher uptake ratios will have the shorter distances to a USS.
  4. Another analysis that can be done is to try to identify the peaks (highly enriched positions) and valleys (depleted positions) positions in order to determine the positions that highly enriched despite not having USS positions nearby. To do this I must first identify the peaks, which will be easier when working with the 0.25kb donor DNA data. I am still not sure how exactly I will do this but the former post-doc of the lab has found a solution using an R code. Once the peaks are identified the next step is to determine the uptake ratio of each peak, so we know which peaks are ‘taller’ than others. Later, I could plot the peak uptake ratios against the USS score or against the distance to a nearest USS score. In the first case I would expect that the ‘taller’ peaks (with highest uptake ratios)  would have the highest USS score; or in the second case, I expect that the ‘taller’ would be closer to the nearest USS than shortest peaks. In this analysis will be very interesting to determine which ‘taller peaks’ do not have high USS scores and are not close to any USS nearby. Rosie has suggested that we can take this genomic regions and run a Gibbs motif sampler analysis to try to figure out is there is an unknown motif that might explain this regios with high uptake.

Problems and aspects to consider:

This first set of analysis will give us just a preliminary of how well USS presence/distance correlate to uptake ratios. It will also allow us to determine genomic positions with high uptake that are not explained by USS presence/distance. However, several aspects have to be considered:

  1. Not every position in the USS motif is equally important to DNA uptake. During the first analysis I plan to simply score the genome as explained before assuming that all positions are equally important to uptake. However this is not the case, as some positions (GCGG) within the core of the USS motif AAGTGCGGT are critical for DNA uptake, while other have a weaker effect. Additionally, I have also not taken in account yet the interactions between different positions in the motif. All this interactions has been very well characterize before by Mell et. al 2012.
  2. Something else that was discovered in this paper (Mell et. al 2012) is that the uptake bias motif is not exactly the same as the genome enriched motif that Rosie found using the Gibbs motif sampler.  However, the former post-doc has told me that I should not worry about this yet. He estimate that the analysis results should not change much when his uptake bias motif model is used to score the genome instead of the genome enriched model.
  3. What is a peak and what is not. Determine a threshold of what is consider a peak is not a trivial task. Even though, it can be easier for 0.25kb donor DNA data (see picture above) to determine what is a peak (since their peaks are sharper and taller). For 6kb donor DNA data, identifying peaks might not be very simple. Of course, I won’t have to worry about this until I start thinking about interaction between multiple USS’s, which is when the 6kb donor DNA data is more useful.
  4. Positions next to a USS will have a low USS score given the fact that they are out-of-frame regarding the motif. This problem will complicate the analysis if I use the USS score directly. I have thought in complicated solutions such as calculating for each position the probability that a read including that position included at least one USS. For instance,  if the distribution of the  donor sheared DNA fragments goes from 0.1kb to 1 kb and closest USS to a given position is >1 kb then the probability that a read came from a fragment containing at least one USS will be 0; but if the distance to the closest USS  is 1 bp, then the probability will be very high.  Maybe this will have to be done eventually, the complications are: 1. the implementation and 2. we need to know the exact distribution of the sheared fragments. As a starting point using the raw distances to the closest USS will be a good idea, since they are not expected to change much if, for instance, different positions in the USS motif are weighted differently.

I could work in this text forever since there is so many things that can be done with the data, but I have been working in this for a long time and it is time to start putting the plan in accion.