A more complex problem (at least in my mind) is the preliminary analysis of sheared genomic fragments.
The smaller size of genomic fragments is 200bp, which in reality is just the average of a size distribution ranging from 100bp to 1000bp.
I am not sure exactly how I am going to do this analysis, but this post is an attempt to clear my mind.
First, what do I know:
Size of the genome of the species to be studied (e.g. H. influenzae KW20 = 1.8e+6 bp).
Probability distribution of a motif in a random 200bp fragment.
Frequency of a motif in a random genome, which I can even estimate for a given GC content.
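To make the last two points concrete, here is a minimal sketch of how these quantities could be estimated. It assumes a random genome in which bases are independent, with G and C each at half the GC content and A and T each at half the remainder. The function names and the 9-mer used in the usage note are illustrative placeholders, not anything fixed by the analysis.

```python
def motif_probability(motif, gc=0.5):
    """Probability that a random site matches `motif` exactly,
    assuming independent bases with the given GC content."""
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in motif.upper():
        prob *= p[base]
    return prob


def expected_occurrences(motif, length, gc=0.5, both_strands=True):
    """Expected number of exact matches in a random sequence of `length` bp.
    Under this model the reverse complement has the same probability, so
    counting both strands doubles the expectation (this over-counts
    palindromic motifs)."""
    sites = max(length - len(motif) + 1, 0)
    n = sites * motif_probability(motif, gc)
    return 2 * n if both_strands else n


def prob_in_fragment(motif, frag_len=200, gc=0.5):
    """Approximate probability that a random fragment contains the motif at
    least once on one strand, treating start sites as independent."""
    sites = max(frag_len - len(motif) + 1, 0)
    return 1 - (1 - motif_probability(motif, gc)) ** sites
```

For example, `expected_occurrences("AAGTGCGGT", 1_800_000)` gives about 14 expected matches on both strands at 50% GC, and passing a different `gc` value shows how much the estimate shifts with base composition.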
What do I need to estimate:
Once I get the sequencing data, the simplest analysis will be to estimate the coverage per base or per motif. One of the questions to be asked is:
How much do the reads increase at a certain position (or motif?), given how strong the motif is and given the distribution of the motif within a 200bp fragment?
Of course, I will have to account for the size distribution of the sheared genomic fragments, but for simplicity I will first consider only a 200bp size.
What would the increment per position be if there is one motif?
What would the increment per position be if there are two motifs (if two motifs have a higher probability than one)?
What would the increment per position be given a certain motif strength?
What would the increment per position be given a GC content bias (the probability of a nucleotide is not 0.25)?
What would the increment per position be if the sheared fragment were 100bp? 500bp? 1000bp?
What would the increment per position be given a certain coverage in the input (coverage distribution)?
I think what I will do is calculate the increase in the frequency of a motif in 200bp fragments (as I showed in the previous posts) and then calculate how this would increase the reads in the output compared to the reads in the input (based on the coverage-per-base distribution).
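The last step, comparing output to input, could look something like the sketch below. It assumes I already have, for each position, the input coverage and the fraction of covering fragments that contain a motif, and that the output is sequenced to the same total depth as the input. The function name and the `enrichment` factor are hypothetical.

```python
def expected_output_coverage(input_cov, motif_fraction, enrichment):
    """Expected output coverage per position, rescaled so the output has the
    same total coverage as the input (i.e. the same sequencing depth).

    input_cov      : per-position coverage in the input
    motif_fraction : per-position fraction of covering fragments with a motif
    enrichment     : assumed recovery advantage of motif-containing fragments
    """
    raw = [c * ((1 - f) + enrichment * f)
           for c, f in zip(input_cov, motif_fraction)]
    total_raw = sum(raw)
    scale = sum(input_cov) / total_raw if total_raw else 0.0
    return [r * scale for r in raw]


# Toy example: flat input coverage, one fully motif-linked position.
input_cov = [10, 10, 10, 10]
motif_fraction = [0.0, 0.5, 1.0, 0.0]
output_cov = expected_output_coverage(input_cov, motif_fraction, enrichment=3.0)
ratios = [o / i for o, i in zip(output_cov, input_cov)]
```

One consequence of fixing the total depth is that positions far from any motif end up with an output/input ratio below 1, so the ratio between motif and background positions is more informative than the raw output coverage.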