Hatzigeorgiou A., Sanida P., Papakostas E., Reczko M.
Synaptic Ltd., P.O. Box 51340, Athens 14510, Greece, E-mail: synaptic@brainlink.com
In this work we describe an improved method for finding sequencing errors
in coding regions based only on nucleotide sequence information, which is
useful when no close homologue is known. To this end, several coding
measures are compared, and the measure giving the best
in-frame prediction on a test set of coding sequences is selected. The
best results are obtained by combining codon usage statistics, which
transform the sequence into a frequency vector, with an artificial neural
network, which separates the coding from the non-coding vectors. On an independent
set, the coding frame in the cDNA is correctly predicted for 90% of the
nucleotides. The results of this prediction are processed using a dynamic
programming algorithm to find the optimal assignment of frames and to
pinpoint the sequencing errors. Frameshifts of length at least
40 bases can be exactly located.
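The codon usage step can be illustrated with a minimal Python sketch that turns a sequence, read in a given frame, into a 64-element vector of relative codon frequencies. The function name, the ordering of the codons, the handling of ambiguous bases and the normalisation are assumptions made for illustration only; the window length used in the actual method and the neural network that classifies the vectors are not specified here and are not shown.

```python
from itertools import product

# Assumed ordering of the 64 codons (alphabetical over A, C, G, T).
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequency_vector(seq, frame=0):
    """Return relative codon frequencies of `seq` read in `frame` (0, 1 or 2).

    Hypothetical helper: codons containing characters other than
    A, C, G, T are skipped rather than counted.
    """
    counts = [0] * 64
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = codon_frequency_vector("ATGAAAATG", frame=0)
    print(vec[CODON_INDEX["ATG"]], vec[CODON_INDEX["AAA"]])
```

Computing one such vector per frame for the same stretch of sequence gives three candidate inputs, of which a classifier would be expected to rate the in-frame one as coding.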
Not only the predicted frame but also the reliability of the prediction
varies along the sequence. For this reason the output of the program includes
a very detailed prediction: for each nucleotide a score is given, with
a high score indicating that the given nucleotide is the first one of a
codon. If the prediction is "0 9 0 0 9 0 0 9", the user gets an
accurate prediction for the particular frame. The sequence
"0 6 3 4 7 5 0" indicates a region with very low prediction accuracy. If
the prediction is "0 9 0 0 9 0 0 8 1 7 1 1 8 0 0 9 0 0 9", the high scores
change frame in the middle of the sequence, so the
probability of a sequencing error is very high.
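How a dynamic programming step might turn such per-nucleotide scores into a frame assignment can be sketched as follows. This is a generic Viterbi-style illustration, not the algorithm used by the program: the reward scheme (a position contributes its score only when it is a codon start in the assumed frame), the fixed `switch_penalty` and the function names are all assumptions. Frame f here means that codons start at 0-based positions i with i % 3 == f.

```python
def assign_frames(scores, switch_penalty=10):
    """Return one frame (0, 1 or 2) per position, maximising the summed
    codon-start scores minus a penalty for every change of frame.

    Illustrative assumption: the penalty value and scoring are hypothetical.
    """
    n = len(scores)
    if n == 0:
        return []
    # best[f]: best total of any path ending at the current position in frame f
    best = [scores[0] if f == 0 else 0 for f in range(3)]
    back = []  # back[i - 1][f]: frame at position i - 1 on the best path to (i, f)
    for i in range(1, n):
        new_best, pointers = [], []
        for f in range(3):
            prev = max(range(3),
                       key=lambda g: best[g] - (switch_penalty if g != f else 0))
            total = best[prev] - (switch_penalty if prev != f else 0)
            reward = scores[i] if i % 3 == f else 0
            new_best.append(total + reward)
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back from the best final frame.
    f = max(range(3), key=lambda g: best[g])
    path = [f]
    for pointers in reversed(back):
        f = pointers[f]
        path.append(f)
    path.reverse()
    return path


def frame_breaks(path):
    """Positions at which the assigned frame differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


if __name__ == "__main__":
    clean = [int(x) for x in "0 9 0 0 9 0 0 9".split()]
    shifted = [int(x) for x in "0 9 0 0 9 0 0 8 1 7 1 1 8 0 0 9 0 0 9".split()]
    print(assign_frames(clean), frame_breaks(assign_frames(clean)))
    print(assign_frames(shifted), frame_breaks(assign_frames(shifted)))
```

For the first example string the sketch keeps a single frame throughout; for the third it reports one change of frame between the high scores at 0-based positions 7 and 9. A single change of frame of this kind is consistent with either a deleted base or two inserted bases, so deciding between them needs additional information that this sketch does not model.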
The output contains:
- an assignment of reading frames to the sequence
- information about the overall coding potential of the sequence
- for each nucleotide, the coding potential of the codon starting with
that nucleotide and the reading frame surrounding that codon
- an assignment of insertions and deletions that maximizes the overall
coding potential of the sequence
- an amino acid sequence translation of this corrected sequence
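The last two items of the output can be illustrated by a short sketch that applies a list of insertion and deletion corrections to a sequence and translates the result with the standard genetic code. The representation of a correction as a `(position, op)` pair, the use of "N" as a placeholder for a missing base and of "X" for an untranslatable codon are assumptions for illustration and do not describe the program's actual output format.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
CODE_BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(CODE_BASES)
    for j, b in enumerate(CODE_BASES)
    for k, c in enumerate(CODE_BASES)
}


def apply_corrections(seq, corrections):
    """Apply hypothetical corrections given as (position, op) pairs.

    op == "del": remove the base at `position` (a spurious extra base).
    op == "ins": insert placeholder "N" before `position` (a missing base).
    Positions are 0-based and refer to the uncorrected sequence.
    """
    bases = list(seq.upper())
    # Work from right to left so earlier positions stay valid.
    for pos, op in sorted(corrections, reverse=True):
        if op == "del":
            del bases[pos]
        elif op == "ins":
            bases.insert(pos, "N")
        else:
            raise ValueError(f"unknown correction: {op!r}")
    return "".join(bases)


def translate(seq, frame=0):
    """Translate `seq` in `frame`; codons with ambiguous bases become "X"."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(frame, len(seq) - 2, 3)
    )


if __name__ == "__main__":
    corrected = apply_corrections("ATGGGCCAAA", [(3, "del")])
    print(corrected, translate(corrected))
```

Inserting a placeholder rather than guessing the missing base keeps the downstream codons in frame while marking the affected residue as unknown.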
The program is available at: http://www.imbb.forth.gr/seqerr.html