Dubchak I., Muchnik I.
E. O. Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Tel. (510)486-4338, Fax (510)486-6059, E-mail ildubchak@lbl.gov; Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, Piscataway, NJ 08855-1179, USA
Predicting a protein fold and implied function from the amino acid sequence
is a problem of great interest. We have developed a neural network (NN)-based
expert system which, given a classification of protein folds, can
assign a protein to a folding class using primary sequence data. It addresses
the inverse protein folding problem from a taxonomic rather than a threading
perspective. Recent classifications suggest the existence of ~80-350 different
folds. The occurrence of several representatives for each fold allows extraction
of the common features of its members. Our method (i) provides a global
description of a protein sequence in terms of the biochemical and structural
properties of the constituent amino acids, (ii) combines the descriptors
using NNs, allowing discrimination of members of a given folding class from
members of all other folding classes, and (iii) uses a voting procedure
among predictions based on different descriptors to decide on the final
assignment. The level of generalization in this method is higher than in
the direct sequence-sequence and sequence-structure comparison approaches.
Two sequences belonging to the same folding class can differ significantly
at the amino acid level but the vectors of their global descriptors will
be located very close in parameter space. Thus, utilizing these aggregate
properties for fold recognition has an advantage over using detailed sequence
comparisons.
All proteins in the non-redundant database of folds were transformed
into inputs for the learning system in two steps:
(a) The sequence of amino acids was replaced by a sequence expressed
in terms of a particular local physico-chemical or structural property of each residue,
such as predicted secondary structure, predicted solvent accessibility,
polarity, polarizability, van der Waals volume, and hydrophobicity;
(b) Three descriptors, "composition" (C), "transition" (T), and
"distribution" (D), were calculated to describe, respectively, the
global composition of a given local amino acid property in the protein,
the frequencies with which the property changes along the entire length
of the protein, and the distribution pattern of the property along the
sequence. Parameter vectors containing 21 scalar components (C, T, and D
combined) were constructed for each of the six properties and used as
independent inputs to the NNs. The percent composition of amino acids
was used as a seventh parameter set.
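Steps (a) and (b) can be illustrated with a minimal Python sketch. The abstract does not state how the 21 components are split among C, T, and D; the sketch assumes three classes per property, giving 3 composition values, 3 transition values, and 15 distribution values (the sequence positions, in percent of length, of the first residue and of 25%, 50%, 75%, and 100% of the residues of each class). The hydrophobicity grouping, the function names `encode` and `ctd_descriptor`, and the rounding rule for the distribution positions are illustrative assumptions, not details taken from this text.

```python
import math
from itertools import combinations
from typing import Dict, List

# Illustrative three-class hydrophobicity grouping (an assumption for this sketch).
HYDROPHOBICITY_CLASSES: Dict[str, str] = {
    "1": "RKEDQN",   # polar
    "2": "GASTPHY",  # neutral
    "3": "CLVIMFW",  # hydrophobic
}


def encode(sequence: str, classes: Dict[str, str]) -> str:
    """Step (a): replace each residue by the label of its property class."""
    lookup = {aa: label for label, members in classes.items() for aa in members}
    # Raises KeyError for residues outside the 20 standard amino acids.
    return "".join(lookup[aa] for aa in sequence.upper())


def ctd_descriptor(encoded: str, labels: str = "123") -> List[float]:
    """Step (b): 3 composition + 3 transition + 15 distribution values."""
    n = len(encoded)
    if n < 2:
        raise ValueError("need at least two residues")

    # C: percentage of residues in each class.
    composition = [100.0 * encoded.count(label) / n for label in labels]

    # T: percentage of adjacent residue pairs that switch between two given classes.
    neighbours = list(zip(encoded, encoded[1:]))
    transition = [
        100.0 * sum(1 for x, y in neighbours if {x, y} == {a, b}) / (n - 1)
        for a, b in combinations(labels, 2)
    ]

    # D: position (percent of length) of the first, 25%, 50%, 75%, and last
    # residue of each class; 0.0 if the class is absent.
    distribution: List[float] = []
    for label in labels:
        positions = [i + 1 for i, c in enumerate(encoded) if c == label]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)
                continue
            k = max(1, math.ceil(fraction * len(positions)))
            distribution.append(100.0 * positions[k - 1] / n)

    return composition + transition + distribution
```

Calling `ctd_descriptor(encode(seq, HYDROPHOBICITY_CLASSES))` yields one 21-component vector; repeating this with a different class table per property would give the six property-based parameter sets.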
To distinguish a particular fold from all other folds, seven
neural networks (NNs), one for each of the seven parameter sets, were trained.
Each query sequence thus received seven individual predictions, and
a majority rule was used to make the final decision. This procedure is simple,
efficient, and incorporated into easy-to-use software. It was applied to
the fold predictions in the context of fine-grained classifications 3D_ALI
[1] and the Structural Classification of Proteins, SCOP [2]. In an attempt
to simplify the fold recognition problem and to increase the reliability
of predictions, we also approached a reduced fold recognition problem,
in which the choice is limited to two folds. Our prediction scheme demonstrated
high accuracy in extensive testing on independent sets of proteins.
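The majority rule over the seven per-descriptor predictions can be sketched as follows. This is a minimal illustration under stated assumptions: each NN is taken to return a yes/no vote for "belongs to this fold", and the way competing folds are resolved in `assign_fold` (most positive votes wins, ties left unassigned) is a hypothetical choice that the abstract does not specify. Both function names are illustrative.

```python
from typing import Dict, Optional, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True if more than half of the individual predictions are positive."""
    return sum(votes) > len(votes) / 2


def assign_fold(votes_by_fold: Dict[str, Sequence[bool]]) -> Optional[str]:
    """Pick the fold accepted by a majority of its networks.

    Assumption for this sketch: if several folds are accepted, the one with
    the most positive votes wins; a tie or no accepted fold returns None.
    """
    accepted = {
        fold: sum(votes)
        for fold, votes in votes_by_fold.items()
        if majority_vote(votes)
    }
    if not accepted:
        return None
    best = max(accepted.values())
    winners = [fold for fold, count in accepted.items() if count == best]
    return winners[0] if len(winners) == 1 else None
```

In the reduced two-fold problem, `votes_by_fold` would simply contain two entries, one per candidate fold.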
[1] Pascarella, S., Argos, P. (1992). Prot. Engng., 5: 121-137.
[2] Murzin, A. G., Brenner, S. E., Hubbard, T., Chothia, C. (1995). J. Molec. Biol., 247: 536-540.