Preliminary data analysis

In my last post, I explain the plan that I had to analyze the data from the project I am working on. This project is looking to determine  why some regions of Haemophilus influenzae genome are taken up more than others by competent cells in DNA uptake experiments.

As explained in that post, the first step in my analysis was to determine which positions correspond to USS or USS-like sequences. This is done by scoring the genome using a position specific scoring matrix (pssm), using a similar  approach as the one used by various software to create sequence logos (such as the sequence logo of a genomic USS region shown below).


However, as explain in a post from my supervisor, the USS scoring can be done in multiple ways (using the genome enriched motif, the uptake motif, taking in account positions interactions, weighting positions within the motif according to how important they are for uptake, taking in account GC content).

In the analysis showed in this post, I used the most simplistic approach to score the genome. That approach was to score the genome using the genome enriched motif and giving all positions the same relative weight to the overall score (no position is more important than the other). Of course, this approach is not realistic, but it will give a first glims at the data and will allow further comparisons with more realistic and complicated scoring models.

The second step in my data analysis plan was to plot the scores against the uptake ratios. However, as explained in the post that analysis is not very useful, so I will not discuss it any further. Instead, I will focus  on the third analysis described in my post which is to plot the uptake ratios to the distance to the nearest USS at each position in the genome.

My supervisor did a figure (included below) that explain what we expected to see in this analysis plot:

post Rosie



As explained in my supervisor post, In this figure, we can see the points for a band that goes down from an uptake ratio of 4 (enriched presumably because of a USS nearby) to a uptake ratio < 1 when distances from a USS are more than 500 bases. It is important to remember that the donor DNA in this data was sheared to an average size of 250 bp. A wide band of points would mean that there are a lot of points where the USS score does not capture all aspects explaining the influence of USS on uptake. She stated on her blog that there are three reasons why this could happen:

(Cited directly from her blog post)

  1. whether the USS’s orientation on the chromosome affects uptake (USS motifs are asymmetric)
  2. how well the USS’s sequence matches the several different ways we can score sequences as possible USS (genome-based, uptake-based, and with or without internal interaction effects between positions)
  3. how much the presence and relative locations of additional USSs adds to uptake


As part of this analysis, I have not only to calculate the distance of each position to a USS in the genome but also, I need to choose a threshold to determine what is the lowest score that can be identified as a USS. The figure below shows the distribution of scores in the genome. It is evident that only a small fraction of the score (see second figure below) have score higher than  15.



The threshold (red line) that is chosen, to discriminate what is a USS and what is not, could have an impact on the data because true USS’s with a score just below the threshold could be identified as non-USS sequences. This points will appear as outliers to the right of the curve (below) and can be confounded as non-USS sequence factors that influence uptake.  This effect is exactly what I saw the data from the figure below:



In contrast to the expected figure made by my supervisor, we can see a very broad band goes down from an uptake ratio of 6 (with some points higher) to a distance ~500 bp. Additionally, there are a few peaks (red arrows) that have increased uptake ratios at distances > 500 bp. Even though these peaks uptake ratios are not very high (an uptake ratio of 1 represent the same coverage as the input), they are considerably higher than the background. The threshold to determine the lowest score that could be considered as a USS in this analysis was 19.04. This score was chosen because it was the lowest score among the sequences chosen to build the pssm matrix, but there is no a stronger reason to have chosen that number. As a result, when I evaluated the positions that belong to those peaks closer, I saw that they belong to descrite narrow regions (~ 200 bp), and that they had one position centered around the middle of the range with a score slightly lower than the threshold chosen. When the same data was re-analyzed chosing a lower threshold (18) then most of these peaks were eliminated, as it shows the figure below:

distance 2

It can also be seen that when the threshold of 19.04 was chosen, the points farther away to a USS were at > 6000 bases, while when the threshold of 18 was chosen, the points farther away to a USS were at < 4000 bases. However there are still a couple more peaks that can be seen closer to a distance of 1000bp.

As a conclusion from this analysis, I can see that despite the simplistic scoring methodology that I chose to identify a USS,  the distances to the closest USS are apparently able to explain a lot of the variation in uptake. However, ther are still many questions that remain, such as what are the effects of interaction between USS’s?, why some USS have higher influence in uptake than others?, and which percentaje of the variance can be explained by USS related variables?. However, from this analysis there is a question that need to be answered first:

  1. Why the main band of points observed in the data figure is more width than the  expected from the figure from my supervisor (above). This broad band means that there are several points that are close to a position identified as a USS that have low uptake ratios. This points can be caused by two factors:
    • Since the scoring method chosen was very simplistic there are many positions that do not contribute to uptake, and well as others that contribute a lot to uptake but are scored equally than other non-important positions. This could cause positions with a high score that are identified as a USS, but that have crucial mutations in the core of the uptake motif (positions 7-9, GCGG, see sequence logo above) that would cause decrease uptake. The contrary scenario is also posible, positions that are different from the USS consensus, lowering the score, but that do not contribute to uptake.
    • Positions close a true USS, but that are close to another un-identified sequence that decreases uptake.

To answer this question I need to determine how low are the uptake ratios in the regions depleated of uptake. In other words, how low are the valleys shown when uptake ratios are plotted vs genomic positions

0.25kb and 6kb

In red, the mean uptake ratios from donor DNA sheared at an average size of 250bp are shown. Peaks represent position enriched in DNA uptake while valleys are positions depleted in DNA uptake. To calculate the mean uptake score of the valleys, I took all positions with a ratio lowers than 0.3 (depleted 5 fold) and I calculated the mean uptake ratio, which was 0.044. With this in mind, I wanted to determine if there are positions in this range that have high USS scores (At least higher than the threshold)

The next figure I explain the expectations of this analysis. First, I calculated the mean SS score of positions with an uptake ratio below 0.3, which was 0.044. Then I wondered, what would I see if I plotted the mean uptake ratio of positions with ratios lower than 0.3 over the USS scores? Since the mean of this points is 0.044, I expect most points to be located closer to 0. Additionally, I expect that most point should have a low score (below the threshold), and that points above this threshold (red arrows) should be either positions with low uptake despite having a true USS, or positions with a sequence incorrectly classified as a USS.

predicted USS_0.3.png

The actual data is shown below:


The red line is set indicating the threshold of 18. The figure shows how there are still some points with USS scores above the threshold. This points were identified as USS’s, however they clearly have low uptake ratios. As explained earlier, these points could very well be positions that have USS-like sequences, but that that have mutations in the core of the motif that decrease uptake; also these points could be true USS’s with low uptake (for an unknown reason). One way to answers this question is to repeat the analysis using a more complex scoring model.

However, before jumping into that, there is one more analysis I need to make. The former post-doc of the lab has suggested me to calculate the maximum score of each position to a USS within different distances (e.g. 100 bp, 200bp,  500bp). Then, I could plot that max score against uptake ratios. Before analyzing the data, I expected to see that most points would increase in uptake ratios as the maximum score within  200bp


Arrows show positions that, either have high uptake ratios despite being close to a USS (two arrows on the left), or have extremely high uptake ratios (arrow on the right).  I expect the band of points to be broader close to a max score of 10 and narrower at the max score increases. I also expect to see this pattern reversed when I plot a max score within 500 bp instead of 100 or 200 bp (more points should now cluster close to a higher max score).



When we evaluate the data using this approach we can see:




  1. The mean uptake ratio increase as the max score also increase, but not as much as I thought. This in part is because a mistake in judgement by my part, since I didn’t realize that most peaks (in the genome uptake profile figure shown above) go until 5 or 6 and not until 10.
  2. Outliers are shown as vertical lines starting at the broad line of points. This could mean that they belong to a discrete narrow region of the genome with possibly a USS misrepresented.


At this point the more smartest thing to do would be to reanalyze the data including a better scoring model. I could keep writting about strategies to do this but I am starting to get tired and this post is getting too long.


One thought on “Preliminary data analysis

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s