Haussler D.
Computer Science Department, University of California, Santa Cruz, CA 95064, USA. Tel: 408 459 2105, Fax: 408 459 4829
With the Human Genome Project and other model organism genome sequencing
projects now in full swing, databases of DNA, RNA and protein sequences
are growing at an explosive rate, and the need for statistical methods
for biosequence analysis has become acute. Right now we need effective
methods for locating genes in DNA sequences, along with their splice sites
and regulatory binding sites, and for classifying new proteins by their
predicted structure or function. Hidden Markov Models (HMMs) have proven
to be useful tools for these tasks. We will describe what HMMs are and how
they are used in biosequence analysis. Then we will report how they performed
in comparison to other methods in the CASP2 international test of protein
structure prediction methods and in a more recent, larger test conducted at the
Laboratory of Molecular Biology in Cambridge.
Finally, we will discuss a new, as yet untested method of biosequence
classification called the Fisher kernel method. Here an HMM (or any parametric
generative model for a family of biosequences) is used to embed the sequences
into a linear space with a natural inner product defined using the Fisher
information matrix. One can then employ a variety of classification methods
to discriminate members of the family from nonmembers.
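
As a rough illustration of the embedding just described, the sketch below maps each sequence x to its Fisher score U_x (the gradient of log P(x | theta) with respect to the model parameters) and takes the inner product K(x, y) = U_x^T I^{-1} U_y, where I is the Fisher information matrix. To stay short it does not use an HMM. It assumes a much simpler generative model, in which every symbol is drawn independently from a fixed distribution, and it assumes both sequences have the same length. The alphabet, the function names and the uniform `background` distribution are hypothetical choices made for this example only.

```python
from collections import Counter

ALPHABET = "ACGT"  # hypothetical alphabet; a protein model would use 20 residues


def fisher_score(seq, probs):
    """Gradient of log P(seq | theta) with respect to the free logits.

    Assumed model: symbols are independent with probability probs[a],
    parameterised by a softmax whose last logit is fixed at 0, leaving
    len(ALPHABET) - 1 free parameters.
    """
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] - n * probs[a] for a in ALPHABET[:-1]]


def inverse_fisher_information(probs, length):
    """Inverse of I = length * (diag(p) - p p^T) over the free symbols.

    Uses the Sherman-Morrison closed form:
    I^{-1} = (diag(1/p) + (1/p_last) * all_ones) / length.
    """
    free = ALPHABET[:-1]
    p_last = probs[ALPHABET[-1]]
    inverse = []
    for a in free:
        row = []
        for b in free:
            entry = 1.0 / p_last
            if a == b:
                entry += 1.0 / probs[a]
            row.append(entry / length)
        inverse.append(row)
    return inverse


def fisher_kernel(x, y, probs):
    """Inner product U_x^T I^{-1} U_y for two sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes sequences of equal length")
    ux = fisher_score(x, probs)
    uy = fisher_score(y, probs)
    inv = inverse_fisher_information(probs, len(x))
    return sum(
        ux[i] * inv[i][j] * uy[j]
        for i in range(len(ux))
        for j in range(len(uy))
    )


# Hypothetical model parameters: a uniform distribution over the alphabet.
background = {a: 0.25 for a in ALPHABET}
k_same = fisher_kernel("AAAA", "AAAA", background)
k_diff = fisher_kernel("AAAA", "TTTT", background)
print(k_same, k_diff)
```

Under these assumptions, sequences whose composition departs from the model in the same direction get a positive kernel value, and sequences departing in opposite directions get a negative one. A matrix of such values could then be handed to any classifier that works from inner products. With an HMM, the score would instead be computed from the expected counts given by the forward-backward procedure.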