IDENTIFICATION OF FUNCTIONAL SIGNALS IN A SET OF SEQUENCES

The following problem is addressed. Given a number of sequence fragments, most of which presumably contain one or more functional signals of one or several types, identify those signals. The following assumptions are made about the functional signals: i) a signal is a segment of a sequence no more than W letters long; ii) signals are rare in the genome but relatively frequent in the given sample; iii) signals of the same type show textual similarity, namely, at least M letters out of W match in any two instances of the signal. An algorithm for solving this problem is proposed. At the first stage, sets of segments that occur anomalously frequently in the sample are constructed. Each set is then considered separately: a consensus is determined for the set, and a score is assigned to every segment. At the second stage, the set of segments is filtered according to one or more tests: i) the distribution of score values; ii) the distance matrix; iii) the expected genome frequency. The result is a model of the functional signal that can be used to examine new sequences not present in the original sample.
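The similarity criterion (at least M matching letters out of W) and the consensus and score step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the segments of one set are already aligned and of equal length W, takes the consensus to be the most frequent letter in each column, and takes the score of a segment to be its number of matches to the consensus; the abstract does not define the score, so that choice is an assumption. The function names and the example segments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def matches(a, b):
    """Count the positions at which two equal-length segments agree."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length W")
    return sum(x == y for x, y in zip(a, b))


def pairwise_similar(segments, m):
    """Check assumption iii): every pair of segments shares at least m letters."""
    return all(matches(a, b) >= m for a, b in combinations(segments, 2))


def consensus(segments):
    """Most frequent letter in each column (assumed definition of the consensus)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*segments))


def scores(segments):
    """Score each segment by its matches to the consensus (assumed scoring rule)."""
    cons = consensus(segments)
    return {seg: matches(seg, cons) for seg in segments}


# Hypothetical aligned segments with W = 6, used only for illustration.
example = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
example_consensus = consensus(example)
example_scores = scores(example)
```

With these example segments the consensus is TATAAT, every pair of segments matches in at least 4 of the 6 positions, and the segment scores could then be passed to a filtering test such as one based on the distribution of score values.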

The method is shown to be applicable to functional signals displaying a marked consensus, e.g. late promoters of phage T7, SOS boxes of E. coli, etc.