PREDICTION OF GERMLINE SET OF SEQUENCES FOR THE IMMUNOGLOBULIN FAMILY
Galitsky B., Gelfand I., Kister A.
Mathematics Department, Rutgers University, Piscataway, NJ, 08854, USA; E-mail: galitsky@dimacs.rutgers.edu, igelfand@math.rutgers.edu, akister@math.rutgers.edu
Immunoglobulin (human heavy chain) sequences from the Kabat database are
analyzed in terms of keywords (motifs) of small amino acid fragments (blocks).
Representation of the sequences as a combination of 17 fragment keywords
reveals that 6 principal combinations describe the majority of
sequences (60% exactly, 40% with deviations in 1-3 fragments). Furthermore,
an exhaustive sequence classification is built that relates each sequence to
a class, subclass and sub-subclass. Class determination is based on
the residues at three positions, and subclass determination on the residues
at 8 other positions. An important feature of this classification principle
is that knowledge of a few keywords, or even of the residues at several key
positions, allows one to predict the residue or residue type at almost
any position of a sequence. The classification graph is drawn with
three levels: class-determining nodes (strand E), subclass-determining
nodes (strand A) and sub-subclass-determining nodes (loops). Edges link
the first level with the second and the second with the third. The suggested
classification is verified on the set of germline sequences. The keywords,
which were obtained from the Kabat sequences, are found to fit
the germline sequences to an even higher degree. Germline sequences are
split into the same classes and subclasses as the Kabat sequences. The corresponding
classification graphs for germline and Kabat sequences are similar, except for
extra sub-subclasses in the latter. It seems plausible that under
natural sequence modification (somatic mutations) a sequence remains
within the same class and subclass but may change its sub-subclass.
The purpose of this report is to predict the set of germline sequences
from the Kabat sequences for various immunoglobulin families. Comparison of
the classification graphs for germline and Kabat sequences allowed us to define
a formal procedure for transforming (simplifying) the latter
graph into the former. For each subclass, all of its sub-subclass nodes
are merged so that they become identical. The residues with the highest likelihood
are assigned to the resultant nodes. (In other words, we chose the graph
edges that represent the largest number of sequences.) The study included
the following three stages. First, the prediction algorithm was developed for
human heavy chains. Second, the accuracy of the germline prediction was estimated
with respect to the whole repertoire of reconstructed sequences
and with respect to individual sequence matches for the kappa and lambda chains
of human immunoglobulin. Third, germline predictions for the other
immunoglobulin families, for which no experimental data are available,
are presented on our homepage.
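
The graph simplification step (merging the sub-subclass nodes of each subclass and assigning the most likely residue to each merged node) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the sequences are already aligned to equal length and already labelled with a class and subclass, and it uses the most frequent residue in each column as a stand-in for "highest likelihood". The function name predict_germline and the toy fragments are hypothetical and are not taken from the Kabat database.

```python
from collections import Counter, defaultdict


def predict_germline(labelled_seqs):
    """Return one predicted sequence per (class, subclass) pair.

    labelled_seqs: iterable of (class_id, subclass_id, aligned_sequence).
    Assumption: all sequences within a subclass have the same length.
    """
    # Group by class and subclass only; ignoring the sub-subclass label
    # is what merges all sub-subclass nodes of a subclass.
    groups = defaultdict(list)
    for class_id, subclass_id, seq in labelled_seqs:
        groups[(class_id, subclass_id)].append(seq)

    predicted = {}
    for key, seqs in groups.items():
        if len({len(s) for s in seqs}) != 1:
            raise ValueError(f"sequences in {key} are not aligned")
        # For every position keep the residue seen in the most sequences
        # (a simple proxy for the most likely residue).
        consensus = [Counter(column).most_common(1)[0][0]
                     for column in zip(*seqs)]
        predicted[key] = "".join(consensus)
    return predicted


# Hypothetical toy data: (class, subclass, aligned fragment).
toy = [
    ("I", "a", "QVQLV"),
    ("I", "a", "QVQLQ"),
    ("I", "a", "QVKLV"),
    ("II", "b", "EVQLL"),
    ("II", "b", "EVQLL"),
]

for key, seq in predict_germline(toy).items():
    print(key, seq)
```

In this sketch ties between equally frequent residues are resolved by the order in which they are first seen; a real procedure would need an explicit rule for such cases.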