As discussed extensively in previous posts, our analysis of DNA uptake data in Haemophilus influenzae is going very well. So far it seems that most of the variation in DNA uptake across the genome are explained by the proximity to an uptake signal sequence (USS). We have scored the genome using two distinct uptake motif models. The first motif model or “genomic uptake motif” is based on the USS sequences enriched the genome, while the second model or “uptake bias motif” is based on experiments characterizing the biases on DNA uptake machinery.
Scoring the genome with both of these models results in a score per genomic position that allowed us to identify regions of the genome with USS or USS-like sequences. After several distinct analysis, that I am not going to discuss in this post, we were able to determined: 1. the most adequate cuttoff score to identify which sequences could be classified as USS. 2. which uptake motif model performed better, generating less false positives. Additionally, we now have a good idea of how well the proximity of a genomic region to a USS sequences could explain DNA uptake (at least for small donor DNA fragment sizes).
The next step in our analysis is to find an independent way to call for uptake peaks, seen as positions with local peaks in uptake ratios across the genome. Once we identify this peaks, we will be able to compare them with the positions that we predicted as USS based on our scoring method (genomic motif or uptake bias motif) and in our chosen cutoff. I predict that this comparison will be able to determine:
- how many uptake peaks are explained by the presence of a USS position nearby.
- how many uptake peaks have high uptake despite been far away from a USS position.
- Among uptake peaks, does a higher uptake score also means a higher uptake peak.
The strategy that I will apply in this analysis is to first find all the peaks of the data by using the peak finder function created in R. This step was already done, and it was not easy to implement. The function depends on two parameters “w” and “span”. As explained in here, the function works by smoothing the data to then find local peaks by comparing local maxima with the smoothed points. By changing these parameters (w = 200, span = 0.02), I was able to optimize the function finding each local peak within the first 10 kb of the genome.
However as has can be seen, some false positives were seen in position with low depth called as peaks.
Note: x represent the genomic positions, while y represent uptake ratio. Dots in red are peaks identified by the peak finder function. Black line is the smoothed line made by the function. Grey dots are the actual data points showing uptake ratio at each position
To eliminate this false positives, I introduced another parameter called “cut”. This parameter allowed me to filter all peaks in positions with uptake ratios lower than a certain threshold (cut = 0.5). Next, the entire genome has to be screened in order to generate a list of positions with uptake peaks. This was done by breaking the genome into small pieces (10kb/each) and then screening each piece for local peaks. The genome was splitted, since the function performed much better when smaller sections of the genome were used. In other words, when larger sections of the genome were used, the peaks found were farther away from the real peaks.
This could be seen both graphically and by examining the peak lists generated.
For instance, the two figures above shows the DNA uptake profile of a section of genome. Blue point represent positions with more than 10 reads in the input, red point are sites with less than 10 reads in the input. Black dots represent positions identified as peaks, when the genome was split into 10kb pieces (first figure), and when the genome was split into 100kb pieces (second figure)
It is clear by seen both figures than when the genome was split into larger sections, the peak finder function was less precise.
In the table below we can see a list of the first 7 positions of the genome classified as USS by our uptake bias model , aligned by the focal position (column 1) and by the central C of the core of the motif( column 2) ; as well as the first 6 -7 positions identified as uptake peaks when the genome was split into 10kb fractions of the genome (column 3), into 100kb fractions (column 4), and to the entire non-split genome (column 5)
As can be seen, the smaller the genome fractions used to feed to the function, the closer the positions are to either the USS focal or to central C positions. For instance, when the genome was analyzed breaking it in 10kb pieces, most uptake peaks were just a few bases away from the USS focal or central C position. On the other hand, when the genome was analyzed breaking it in 100kb pieces, one position around 1023 bases was not recognized as a peak and the rest of the peaks were further away from the USS positions. As an extreme, we can see that running the function to the entire genome (without fractioning it first) resulted in a list with many missing peaks (no peaks in the first 80kb of the genome).
The table above also tells us that when the uptake peak list generated breaking the genome in 10 kb fractions correlates very well with the USS positions list. Now we only need a graphical way to visualize how well both of this lists match together.
One way to graphically analyze the uptake peak data is to plot simply the distribution of mean uptake ratio of the uptake peak list. Of course this analysis does not tell us how well uptake peaks match with the USS list, however it does give us an idea of the distribution of uptake ratios from peak positions.
I predict that the distribution should look like something similar to the figure below, where most peaks are located between 3 and 4, with a few peaks with higher or lower uptake ratios (red arrows). Positions with lower uptake ratios (less than 2) could represent peaks with a either a non-optimum USS sequence or artefacts of the peak finder that were not removed by the “cut” argument (of 0.5). On the other hand peaks with uptake ratios higher than 5 could represent positions with either a extremely good USS sequence or a overestimated uptake ratio given positions in the input with low sequencing depth.
The actual distribution of mean uptake ratios of uptake peaks, shown below, looked somewhat similar to what I predicted. Now by simply extracting positions with higher and lower uptake ratios I could look carefully at the sequences of this positions in order to determine why they have those uptake ratios.
However, This analisis still does not tell me directly how well the uptake peak list correlate with the USS list. In order to graphically see how both list match each other I will calculate the distance of each uptake peak of the list to the nearest uptake USS (using the central C position). Then, I can plot this distance vs the uptake ratios of the uptake peaks.
I expect that this plot will look like the figure below:
I expect that most peaks will be just a few bases from the nearest USS (1 – 40 bases). Artefacts that were clasified as peaks but that are not real peaks are expected to be farther away (red arrows) from the nearest USS and are expected to have low uptake ratios, since they might not be eliminated by the cutoff argument chosen (0.5). If there were peaks that are not explained by a USS sequence, we would expect that this peaks should be farther away from a USS, but they would have a high uptake ratio (green arrows). This analisis, would tells us right away:
- how many uptake peaks are explained by the presence of a USS position nearby.
- how many uptake peaks have high uptake despite been far away from a USS position
Finally, to answer the question if a higher uptake score also means a higher uptake peak, I could plot a simple regression of both parameters and see how assess the slope of the regression. Based on what we have seen so far, it seems that increasing the USS score beyond a certain point does not increase uptake ratio any further. Of course this analysis would gives us only a rough idea since other factors such as interaction between more than one USS or interactions between positions in the USS motif are not included in the analysis.
The plot of distance to the nearest USS to uptake ratios of the uptake peak list is shown below.
By looking at the figure, I realized that there are a few things I did not taken in account.
- There are more points than what I expected further away from 200 bases. Some of those points have an uptake ratio between 3-5. This point could represent either true peaks that are far away from a USS or true peaks that are actually close to a USS, but that USS was not identified as such given the cutoff used to classify a position as a USSs using the uptake scoring function.
- When I identified uptake peaks using the R function, I trimmed all peak lower than a variable called “cut”. The cuttoff chosen was 0.5. However, by comparing the number of USS identified by the scoring fuction and the number of peaks identified by the peak finder function, I can see that the number of USS (1391) is higher than the number of peaks (1273). This means that some peaks could have been eliminated by setting the cuttoff too high. This might not change the analysis much (in any case it is better to be conservative with any analysis), but I will repeat the analysis using a lower cutoff to see what happen.
As a result I can see than 80.2% of the peaks identified are closer than 200 bases from a USS position, and 77.7% of the uptake peaks are closer than 100 bases from a USS position. This results show that USSs explained a big percent of the uptake peaks seen in the data. Now the question is what happen with the rest 20%. Are they a result of artefacts of the analysis? or are they really far away from a USS?
The other analysis that I did was the regression of uptake scores vs the uptake ratios of the uptake peak list. The regression clearly show no association. In other words among peaks, increasing the USS score further does not increase uptake.
Note: In this figure I used the re-calculated uptake ratios (summing 1 read to the input to avoid having missing values) in order to mke the regression work.