Automated Protein Subfamily Identification
and Classification
Duncan P. Brown, Nandini Krishnamurthy, Kimmen Sjölander*
Department of Bioengineering, University of California, Berkeley, California, United States of America
Function prediction by homology is widely used to provide preliminary functional annotations for genes for which
experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic
error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these
errors in function prediction but has been difficult to automate for high-throughput application. To address this
limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline
uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification,
followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring
scheme using family and subfamily HMMs enables classification of novel sequences to protein families and
subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to
subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an
information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation
patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional
subtypes defined by experts and to conserved cl