Shepelev V. A.
Institute of Molecular Genetics, Russian Acad. Sci., 123182, Moscow, Kurchatov Sq., Russia; Fax: 1960221; E-mail: spl@img.ras.ru
The following problem is addressed. A number of sequence fragments are
given, the majority of them presumably containing one or more functional
signals of one or several types. The following assumptions are made about
the functional signals: i) a signal is a segment of a sequence no more
than W letters long; ii) the signals are rare in the genome but relatively
frequent in the given sample; iii) signals of the same type show textual
similarity, namely, at least M letters out of W match in any two instances
of the signal. An algorithm to solve
this problem has been suggested. At the first stage, sets of segments that
occur anomalously frequently in the sample are constructed. Each set is
then considered separately: a consensus is determined for the set, and a
score is computed for every segment. At the second stage, the set of
segments is filtered according to one or more tests: i) score value
distribution; ii) distance matrix; iii) the expected genome frequency. As
a result, a model of the functional signal is obtained, which may be used
to examine new sequences not present in the original sample.
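
The similarity criterion of assumption iii) and the consensus and score
step can be illustrated with a minimal Python sketch. It is not the
algorithm itself: the abstract does not define how the consensus or the
score is computed, so the column-majority consensus, the score taken as the
number of matches to that consensus, the function names and the example
segments are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def matches(a, b):
    """Count positions at which two equal-length segments carry the same letter."""
    return sum(x == y for x, y in zip(a, b))


def is_similar_set(segments, m):
    """Assumption iii): every pair of segments matches in at least m letters."""
    return all(matches(a, b) >= m for a, b in combinations(segments, 2))


def consensus(segments):
    """ASSUMPTION: consensus = most frequent letter in each column.
    Ties go to the letter seen first in that column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*segments))


def scores(segments):
    """ASSUMPTION: score of a segment = number of matches to the consensus."""
    cons = consensus(segments)
    return {seg: matches(seg, cons) for seg in segments}


# Hypothetical segments of length W = 6, invented for this example.
example = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(is_similar_set(example, 4))  # every pair shares at least M = 4 letters
print(consensus(example))
print(scores(example))
```

With W = 6 and M = 4 the four example segments satisfy the pairwise
criterion, while M = 5 would reject the set. The resulting scores are the
kind of quantity to which the score distribution test of the second stage
could be applied.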
This method is shown to be applicable to functional signals that display
a marked consensus, e.g. the late promoters of phage T7 and the SOS boxes
of E. coli.