MODELS AND METHODS IN BIOSEQUENCE ANALYSIS

With the Human Genome Project and other model organism genome sequencing projects now in full swing, databases of DNA, RNA and protein sequences are growing at an explosive rate, and the need for statistical methods for biosequence analysis has become acute. Right now we need effective methods for locating genes in DNA sequences, along with their splice sites and regulatory binding sites, and for classifying new proteins by their predicted structure or function. Hidden Markov Models (HMMs) have proven to be useful tools for these tasks. We will describe what HMMs are and how they are used in biosequence analysis. Then we will report how they performed in comparison to other methods in the CASP2 international test of protein structure prediction methods and in a recent, larger test conducted at the Laboratory of Molecular Biology in Cambridge.

Finally, we will discuss a new, as yet untested method of biosequence classification called the Fisher kernel method. Here an HMM (or any parametric generative model for a family of biosequences) is used to embed the sequences into a linear space with a natural inner product defined using the Fisher information matrix. One can then employ a variety of classification methods to discriminate members of the family from nonmembers.
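
To make the embedding concrete, here is a minimal sketch in Python. It is not the method as it will be presented in the talk. In place of an HMM it assumes a much simpler generative model, an independent-letter (multinomial) model over the DNA alphabet with softmax parameters, since the abstract allows any parametric generative model. Each sequence is mapped to its Fisher score, the gradient of its log-likelihood with respect to the model parameters, and two sequences are compared through the inner product weighted by the inverse Fisher information matrix. The function names, the toy sequences, and the choice to estimate the Fisher information empirically from a sample (with a pseudo-inverse, because the matrix is singular under this parameterization) are all illustrative assumptions.

```python
import numpy as np

ALPHABET = "ACGT"  # assumption: DNA sequences over a four-letter alphabet


def fisher_score(seq, theta):
    """Fisher score of seq under an independent-letter model.

    theta[a] is the probability of letter a, written as a softmax of
    unconstrained logits w. The gradient of log P(seq | w) with respect
    to w is (observed letter counts) - len(seq) * theta.
    """
    counts = np.array([seq.count(a) for a in ALPHABET], dtype=float)
    return counts - len(seq) * theta


def empirical_fisher_information(seqs, theta):
    """Estimate the Fisher information as the average outer product of scores.

    Assumption: a sample average stands in for the exact expectation
    under the model.
    """
    scores = np.array([fisher_score(s, theta) for s in seqs])
    return scores.T @ scores / len(seqs)


def fisher_kernel(x, y, theta, info_pinv):
    """Inner product of two Fisher scores, weighted by the inverse information."""
    return float(fisher_score(x, theta) @ info_pinv @ fisher_score(y, theta))


if __name__ == "__main__":
    theta = np.full(len(ALPHABET), 0.25)       # hypothetical fitted model
    family = ["AAAT", "AATA", "CAAA", "AAGA"]  # hypothetical family members
    info = empirical_fisher_information(family, theta)
    # The scores always sum to zero, so the matrix is singular:
    # use a pseudo-inverse rather than a true inverse.
    info_pinv = np.linalg.pinv(info, rcond=1e-10)
    print(fisher_kernel("AAAT", "AATA", theta, info_pinv))
    print(fisher_kernel("AAAT", "GGGC", theta, info_pinv))
```

The resulting kernel values, or the score vectors themselves, could then be passed to any classifier that works with inner products in order to separate family members from nonmembers.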