Monthly Archives: March 2016

Data analysis plan

Finally, I got the sequencing data from a project I had been working on. This project is trying to determine the factors explaining why some regions of Haemophilus influenzae genome are taken up more than others by competent cells in DNA uptake experiments. To answer this question, I did a series of uptake experiments using donor DNA from two distinct strains 86-028NP and PittGG, sheared to average sizes of 0.25kb and 6 kb. Those donor DNA’s were incubated with competent Rd cells.  Donor DNA was later recovered using an organic extraction method that allowed us to isolate periplasmic DNA from (most)  chromosomal DNA. The recovered donor DNA’s, as well as the inputs from each experiment, were deep sequenced using Illumina technology.

The former post-doc of the lab, who is now a PI, with one of his students have already done the job of aligning the reads to the donor and recipient genomes as well as normalize reads by sequencing depth. Reads aligned to recipient genome represent chromosomal contamination that was able to co-extracted (despite of the agarose gel purification clean-up methods used) . Additionally, they divided the normalize reads per each position to the normalized reads of the input to generate an uptake ratio where 1 represent  indicates no enrichment or depletion of the position in periplasm relative to the controls.  A number > 1 indicate enrichment and < 1 indicate depletion. When plotted over the genome positions, this uptake scores show positions of the genome that are taken up more than others and that are enriched in the samples compared with the input control.

0.25kb and 6kb

In red, mean uptake ratios for each genomic position are shown from the 86-028NP 0.25kb (average) sheared donor DNA. In blue mean uptake ratios for each genomic position are shown from the 86-028NP 6kb (average) sheared donor DNA.

My objective is to determine if the variance in uptake across regions of the genome can be explained by the presence of a ‘uptake-signal-sequence (USS), as well as distance from a certain point in the genome to the nearest USS, or even interactions between several USS (when this are close together). This analysis will also help us to determine regions that have high uptake but no nearby USS as well as regions with low uptake despite having a USS closeby. These regions in which uptake is not explained by USS presence can be studied in more detail to try to identify non-USS motifs that might explain low or high uptake.

As it is shown in the figure uptake ratios of 0.25kb donor DNA (in red) are sharper and more spaced than 6kb donor DNA (blue). For this reason, I will start my analysis using the uptake ratios from the 0.25kb 86-028NP donor DNA.

My first task in this analysis is to determine if USS-dependent variables (e.g. presence of USS, distance to the closest USS, interactions between more than one USS) would explain regions of the genome that are enriched (peaks) of depleted in the uptake genomic maps.

  1. To do this I need to first which positions can be called as USS or are USS-like. This will be done by scoring the genome of the donor DNA using a position-specific scoring matrix (pssm) that was derived from a set of 2206 sequences found by Rosie that held a USS motif. These sequences were obtained by using the Gibbs motif sampler on Rd H. influenzae genome. 
  2. The second step is to plot the scores against the uptake ratios. But there is a problem, since the scoring  is done following a sliding window of the same size as of the pssm matrix (37 bp), each score represents how well the window of 37 bp, starting at that position, matches with the USS sequence. Then, a position immediately upstream or downstream from a USS site (high score) will have a low score. This happens because now the window would be out of frame from the motif. This is the reason this analysis is expected to not fully represent the relationship between a nearby USS in a genomic region and uptake.
  3. To have a better graphical representation of the correlation between USS presence in a genomic region with DNA uptake, I will have to plot the uptake ratios to the distance to the nearest USS at each position in the genome. However, the distance to the nearest USS in genome can be calculated in several ways: distance to any nearby USS (both strands included), distance to the nearest USS on the top strand, distance to the nearest USS on the bottom strand, distance to the nearest USS upstream or downstream from the position under question. I will start using distance to any USS (both strands included) plotted against uptake ratios. Later as I am able to evaluate the results I can try other options. If USS explain fully DNA uptake, then I expect that the positions with higher uptake ratios will have the shorter distances to a USS.
  4. Another analysis that can be done is to try to identify the peaks (highly enriched positions) and valleys (depleted positions) positions in order to determine the positions that highly enriched despite not having USS positions nearby. To do this I must first identify the peaks, which will be easier when working with the 0.25kb donor DNA data. I am still not sure how exactly I will do this but the former post-doc of the lab has found a solution using an R code. Once the peaks are identified the next step is to determine the uptake ratio of each peak, so we know which peaks are ‘taller’ than others. Later, I could plot the peak uptake ratios against the USS score or against the distance to a nearest USS score. In the first case I would expect that the ‘taller’ peaks (with highest uptake ratios)  would have the highest USS score; or in the second case, I expect that the ‘taller’ would be closer to the nearest USS than shortest peaks. In this analysis will be very interesting to determine which ‘taller peaks’ do not have high USS scores and are not close to any USS nearby. Rosie has suggested that we can take this genomic regions and run a Gibbs motif sampler analysis to try to figure out is there is an unknown motif that might explain this regios with high uptake.

Problems and aspects to consider:

This first set of analysis will give us just a preliminary of how well USS presence/distance correlate to uptake ratios. It will also allow us to determine genomic positions with high uptake that are not explained by USS presence/distance. However, several aspects have to be considered:

  1. Not every position in the USS motif is equally important to DNA uptake. During the first analysis I plan to simply score the genome as explained before assuming that all positions are equally important to uptake. However this is not the case, as some positions (GCGG) within the core of the USS motif AAGTGCGGT are critical for DNA uptake, while other have a weaker effect. Additionally, I have also not taken in account yet the interactions between different positions in the motif. All this interactions has been very well characterize before by Mell et. al 2012.
  2. Something else that was discovered in this paper (Mell et. al 2012) is that the uptake bias motif is not exactly the same as the genome enriched motif that Rosie found using the Gibbs motif sampler.  However, the former post-doc has told me that I should not worry about this yet. He estimate that the analysis results should not change much when his uptake bias motif model is used to score the genome instead of the genome enriched model.
  3. What is a peak and what is not. Determine a threshold of what is consider a peak is not a trivial task. Even though, it can be easier for 0.25kb donor DNA data (see picture above) to determine what is a peak (since their peaks are sharper and taller). For 6kb donor DNA data, identifying peaks might not be very simple. Of course, I won’t have to worry about this until I start thinking about interaction between multiple USS’s, which is when the 6kb donor DNA data is more useful.
  4. Positions next to a USS will have a low USS score given the fact that they are out-of-frame regarding the motif. This problem will complicate the analysis if I use the USS score directly. I have thought in complicated solutions such as calculating for each position the probability that a read including that position included at least one USS. For instance,  if the distribution of the  donor sheared DNA fragments goes from 0.1kb to 1 kb and closest USS to a given position is >1 kb then the probability that a read came from a fragment containing at least one USS will be 0; but if the distance to the closest USS  is 1 bp, then the probability will be very high.  Maybe this will have to be done eventually, the complications are: 1. the implementation and 2. we need to know the exact distribution of the sheared fragments. As a starting point using the raw distances to the closest USS will be a good idea, since they are not expected to change much if, for instance, different positions in the USS motif are weighted differently.

I could work in this text forever since there is so many things that can be done with the data, but I have been working in this for a long time and it is time to start putting the plan in accion.