A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis.

Abstract:

BACKGROUND: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is finding a suitable representation of protein sequences.

RESULTS: In this paper, a novel building block of proteins called the Top-n-gram is presented, which contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments produced by PSI-BLAST and converted into Top-n-grams. Each protein sequence is then transformed into a fixed-dimension feature vector by counting the number of occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams with latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods.

CONCLUSION: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. The Top-n-gram is therefore a good building block of protein sequences and can be widely used in many tasks in computational biology, such as sequence alignment, the prediction of domain boundaries, the design of knowledge-based potentials and the prediction of protein binding sites.
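
The feature-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the details are assumptions the abstract does not state: each profile position is taken to be a mapping from amino acid to frequency, a Top-n-gram is taken to be the n most frequent amino acids at a position joined in descending order of frequency, ties are broken alphabetically, and the feature vector is indexed over all 20^n ordered strings. The names `top_n_gram` and `top_n_gram_vector` are hypothetical. The PSI-BLAST, LSA and SVM stages are not covered.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_gram(column, n):
    """Return the n most frequent amino acids at one profile position.

    `column` maps amino acid -> frequency; missing entries count as 0.0.
    Assumption: ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
    return "".join(ranked[:n])


def top_n_gram_vector(profile, n):
    """Count Top-n-gram occurrences over all positions of a profile.

    Assumption: the vector has one entry per ordered string of length n
    over the 20 amino acids, i.e. 20**n dimensions.
    """
    counts = Counter(top_n_gram(column, n) for column in profile)
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    return [counts[gram] for gram in vocabulary]


# Toy three-position frequency profile (made-up values for illustration).
profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
]
vector = top_n_gram_vector(profile, 2)
```

In this toy profile the first and third positions both yield the Top-2-gram `AG`, so that entry of the vector is counted twice. Vectors built this way have the same length for every sequence, which is what lets them be passed to a downstream classifier.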

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Liu B,Wang X,Lin L,Dong Q,Wang X

doi

10.1186/1471-2105-9-510

subject

Has Abstract

pub_date

2008-12-01 00:00:00

pages

510

issn

1471-2105

pii

1471-2105-9-510

journal_volume

pub_type

Journal article
