Ramensky V. E., Makeev V. Ju., Tumanyan V. G.
Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 117984, Moscow, Vavilov St. 32, Russia
We consider the problem of segmenting DNA into blocks of uniform
nucleotide composition. In doing so, one has to overcome two main obstacles.
First, it is a nontrivial task to determine what constitutes a block in a DNA sequence;
it is difficult to separate two DNA regions in which nucleotides of all
four types are present. Second, for short blocks containing
only a few letters, the frequency-count estimator of composition is highly
sensitive to nucleotide substitutions. We argue that the second problem
may be solved by a Bayesian estimator of the composition. As the optimal
segmentation we take the one for which the segmented sequence, that is, the
set of blocks, has the highest probability of being generated by a series of
independent trials with multinomial probabilities of nucleotide occurrence
that are constant within each block. Our approach yields results consistent
with the segmentation produced by DNA complexity analysis for long
sequences, while also allowing short blocks to be obtained in a
straightforward way.
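
The two ideas above, a Bayesian estimate of block composition and a segmentation chosen by maximum probability, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: a symmetric Dirichlet prior with a hypothetical `pseudocount` parameter (1.0 gives a uniform prior), block probabilities taken as the multinomial likelihood integrated over that prior, and an exhaustive O(n^2) dynamic program as the search procedure. The function names are illustrative only and do not reproduce the authors' implementation.

```python
import math

ALPHABET = "ACGT"

def bayes_composition(block, pseudocount=1.0):
    """Posterior-mean composition under a symmetric Dirichlet prior (assumed)."""
    counts = [block.count(ch) for ch in ALPHABET]
    total = len(block) + pseudocount * len(ALPHABET)
    return [(c + pseudocount) / total for c in counts]

def log_block_evidence(counts, pseudocount=1.0):
    """Log probability of one block's letters, with the multinomial
    probabilities integrated out over the Dirichlet prior (assumed)."""
    k = len(counts)
    n = sum(counts)
    out = math.lgamma(k * pseudocount) - math.lgamma(n + k * pseudocount)
    for c in counts:
        out += math.lgamma(c + pseudocount) - math.lgamma(pseudocount)
    return out

def segment(seq, pseudocount=1.0):
    """Return (boundaries, log probability) of the best segmentation found
    by dynamic programming over all possible block boundaries."""
    n = len(seq)
    # prefix[i] holds nucleotide counts of seq[:i], so any block's
    # counts are a difference of two rows.
    prefix = [[0] * len(ALPHABET)]
    for ch in seq:
        row = prefix[-1][:]
        row[ALPHABET.index(ch)] += 1
        prefix.append(row)
    best = [0.0] + [-math.inf] * n   # best[j]: best log probability of seq[:j]
    back = [0] * (n + 1)             # back[j]: start of the last block in seq[:j]
    for j in range(1, n + 1):
        for i in range(j):
            counts = [prefix[j][a] - prefix[i][a] for a in range(len(ALPHABET))]
            score = best[i] + log_block_evidence(counts, pseudocount)
            if score > best[j]:
                best[j], back[j] = score, i
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(back[bounds[-1]])
    return bounds[::-1], best[n]
```

For a one-letter block such as "A", the frequency-count estimate assigns probability 1 to A and 0 to the other letters, so a single substitution changes it completely, whereas `bayes_composition("A")` stays at a moderate 0.4 for A and 0.2 for each of the others. Calling `segment("A" * 8 + "C" * 8)` places the single internal boundary at position 8, because the integrated likelihood penalizes both mixed blocks and unnecessary splits.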
This work was supported by the Russian Human Genome Program.