A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis.

Abstract:

BACKGROUND:Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. RESULTS:In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. CONCLUSION:The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Liu B,Wang X,Lin L,Dong Q,Wang X

doi

10.1186/1471-2105-9-510

subject

Has Abstract

pub_date

2008-12-01 00:00:00

pages

510

issn

1471-2105

pii

1471-2105-9-510

journal_volume

9

pub_type

杂志文章
  • A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm.

    abstract:BACKGROUND:The process of horizontal gene transfer (HGT) is believed to be widespread in Bacteria and Archaea, but little comparative data is available addressing its occurrence in complete microbial genomes. Collection of high-quality, automated HGT prediction data based on phylogenetic evidence has previously been im...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-419

    authors: Podell S,Gaasterland T,Allen EE

    更新日期:2008-10-07 00:00:00

  • Computational approaches to protein inference in shotgun proteomics.

    abstract::Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughp...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章,评审

    doi:10.1186/1471-2105-13-S16-S4

    authors: Li YF,Radivojac P

    更新日期:2012-01-01 00:00:00

  • Assessing and predicting protein interactions by combining manifold embedding with multiple information integration.

    abstract:BACKGROUND:Protein-protein interactions (PPIs) play crucial roles in virtually every aspect of cellular function within an organism. Over the last decade, the development of novel high-throughput techniques has resulted in enormous amounts of data and provided valuable resources for studying protein interactions. Howev...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-S7-S3

    authors: Lei YK,You ZH,Ji Z,Zhu L,Huang DS

    更新日期:2012-05-08 00:00:00

  • Characterization and sequence prediction of structural variations in α-helix.

    abstract:BACKGROUND:The structure conservation in various α-helix subclasses reveals the sequence and context dependent factors causing distortions in the α-helix. The sequence-structure relationship in these subclasses can be used to predict structural variations in α-helix purely based on its sequence. We train support vector...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-S1-S20

    authors: Tendulkar AV,Wangikar PP

    更新日期:2011-02-15 00:00:00

  • SDA: a semi-parametric differential abundance analysis method for metabolomics and proteomics data.

    abstract:BACKGROUND:Identifying differentially abundant features between different experimental groups is a common goal for many metabolomics and proteomics studies. However, analyzing data from mass spectrometry (MS) is difficult because the data may not be normally distributed and there is often a large fraction of zero value...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3067-z

    authors: Li Y,Fan TWM,Lane AN,Kang WY,Arnold SM,Stromberg AJ,Wang C,Chen L

    更新日期:2019-10-17 00:00:00

  • Structural characterization of genomes by large scale sequence-structure threading: application of reliability analysis in structural genomics.

    abstract:BACKGROUND:We establish that the occurrence of protein folds among genomes can be accurately described with a Weibull function. Systems which exhibit Weibull character can be interpreted with reliability theory commonly used in engineering analysis. For instance, Weibull distributions are widely used in reliability, ma...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-5-101

    authors: Cherkasov A,Ho Sui SJ,Brunham RC,Jones SJ

    更新日期:2004-07-26 00:00:00

  • A multifaceted analysis of HIV-1 protease multidrug resistance phenotypes.

    abstract:BACKGROUND:Great strides have been made in the effective treatment of HIV-1 with the development of second-generation protease inhibitors (PIs) that are effective against historically multi-PI-resistant HIV-1 variants. Nevertheless, mutation patterns that confer decreasing susceptibility to available PIs continue to ar...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-477

    authors: Doherty KM,Nakka P,King BM,Rhee SY,Holmes SP,Shafer RW,Radhakrishnan ML

    更新日期:2011-12-15 00:00:00

  • Predicting nucleosome positioning using a duration Hidden Markov Model.

    abstract:BACKGROUND:The nucleosome is the fundamental packing unit of DNAs in eukaryotic cells. Its detailed positioning on the genome is closely related to chromosome functions. Increasing evidence has shown that genomic DNA sequence itself is highly predictive of nucleosome positioning genome-wide. Therefore a fast software t...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-346

    authors: Xi L,Fondufe-Mittendorf Y,Xia L,Flatow J,Widom J,Wang JP

    更新日期:2010-06-24 00:00:00

  • Correlation analysis reveals the emergence of coherence in the gene expression dynamics following system perturbation.

    abstract::Time course gene expression experiments are a popular means to infer co-expression. Many methods have been proposed to cluster genes or to build networks based on similarity measures of their expression dynamics. In this paper we apply a correlation based approach to network reconstruction to three datasets of time se...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-S1-S16

    authors: Neretti N,Remondini D,Tatar M,Sedivy JM,Pierini M,Mazzatti D,Powell J,Franceschi C,Castellani GC

    更新日期:2007-03-08 00:00:00

  • Linear space string correction algorithm using the Damerau-Levenshtein distance.

    abstract:BACKGROUND:The Damerau-Levenshtein (DL) distance metric has been widely used in the biological science. It tries to identify the similar region of DNA,RNA and protein sequences by transforming one sequence to the another using the substitution, insertion, deletion and transposition operations. Lowrance and Wagner have ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3184-8

    authors: Zhao C,Sahni S

    更新日期:2020-12-09 00:00:00

  • AnyExpress: integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm.

    abstract:BACKGROUND:Cross-platform analysis of gene express data requires multiple, intricate processes at different layers with various platforms. However, existing tools handle only a single platform and are not flexible enough to support custom changes, which arise from the new statistical methods, updated versions of refere...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-75

    authors: Kim J,Patel K,Jung H,Kuo WP,Ohno-Machado L

    更新日期:2011-03-17 00:00:00

  • Automatic localization and identification of mitochondria in cellular electron cryo-tomography using faster-RCNN.

    abstract:BACKGROUND:Cryo-electron tomography (cryo-ET) enables the 3D visualization of cellular organization in near-native state which plays important roles in the field of structural cell biology. However, due to the low signal-to-noise ratio (SNR), large volume and high content complexity within cells, it remains difficult a...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-2650-7

    authors: Li R,Zeng X,Sigmund SE,Lin R,Zhou B,Liu C,Wang K,Jiang R,Freyberg Z,Lv H,Xu M

    更新日期:2019-03-29 00:00:00

  • Quick, "imputation-free" meta-analysis with proxy-SNPs.

    abstract:BACKGROUND:Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to a) increase the power to detect strong or weak genotype effects or b) as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-231

    authors: Meesters C,Leber M,Herold C,Angisch M,Mattheisen M,Drichel D,Lacour A,Becker T

    更新日期:2012-09-12 00:00:00

  • Towards barcode markers in Fungi: an intron map of Ascomycota mitochondria.

    abstract:BACKGROUND:A standardized and cost-effective molecular identification system is now an urgent need for Fungi owing to their wide involvement in human life quality. In particular the potential use of mitochondrial DNA species markers has been taken in account. Unfortunately, a serious difficulty in the PCR and bioinform...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-S6-S15

    authors: Santamaria M,Vicario S,Pappadà G,Scioscia G,Scazzocchio C,Saccone C

    更新日期:2009-06-16 00:00:00

  • Bayesian semiparametric regression models to characterize molecular evolution.

    abstract:BACKGROUND:Statistical models and methods that associate changes in the physicochemical properties of amino acids with natural selection at the molecular level typically do not take into account the correlations between such properties. We propose a Bayesian hierarchical regression model with a generalization of the Di...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-278

    authors: Datta S,Rodriguez A,Prado R

    更新日期:2012-10-30 00:00:00

  • SIMAT: GC-SIM-MS data analysis tool.

    abstract:BACKGROUND:Gas chromatography coupled with mass spectrometry (GC-MS) is one of the technologies widely used for qualitative and quantitative analysis of small molecules. In particular, GC coupled to single quadrupole MS can be utilized for targeted analysis by selected ion monitoring (SIM). However, to our knowledge, t...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0681-2

    authors: Ranjbar MR,Di Poto C,Wang Y,Ressom HW

    更新日期:2015-08-19 00:00:00

  • Mining locus tags in PubMed Central to improve microbial gene annotation.

    abstract:BACKGROUND:The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to utilize the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed a...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-43

    authors: Stubben CJ,Challacombe JF

    更新日期:2014-02-05 00:00:00

  • On reliable discovery of molecular signatures.

    abstract:BACKGROUND:Molecular signatures are sets of genes, proteins, genetic variants or other variables that can be used as markers for a particular phenotype. Reliable signature discovery methods could yield valuable insight into cell biology and mechanisms of human disease. However, it is currently not clear how to control ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-38

    authors: Nilsson R,Björkegren J,Tegnér J

    更新日期:2009-01-29 00:00:00

  • An integrative method to normalize RNA-Seq data.

    abstract:BACKGROUND:Transcriptome sequencing is a powerful tool for measuring gene expression, but as well as some other technologies, various artifacts and biases affect the quantification. In order to correct some of them, several normalization approaches have emerged, differing both in the statistical strategy employed and i...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-188

    authors: Filloux C,Cédric M,Romain P,Lionel F,Christophe K,Dominique R,Abderrahman M,Daniel P

    更新日期:2014-06-14 00:00:00

  • Large scale analysis of protein conformational transitions from aqueous to non-aqueous media.

    abstract:BACKGROUND:Biocatalysis in organic solvents is nowadays a common practice with a large potential in Biotechnology. Several studies report that proteins which are co-crystallized or soaked in organic solvents preserve their fold integrity showing almost identical arrangements when compared to their aqueous forms. Howeve...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2044-2

    authors: Rueda AJV,Monzon AM,Ardanaz SM,Iglesias LE,Parisi G

    更新日期:2018-01-30 00:00:00

  • Roar: detecting alternative polyadenylation with standard mRNA sequencing libraries.

    abstract:BACKGROUND:Post-transcriptional regulation is a complex mechanism that plays a central role in defining multiple cellular identities starting from a common genome. Modifications in the length of 3'UTRs have been found to play an important role in this context, since alternative 3' UTRs could lead to differences for exa...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1254-8

    authors: Grassi E,Mariella E,Lembo A,Molineris I,Provero P

    更新日期:2016-10-18 00:00:00

  • Efficient computation of absent words in genomic sequences.

    abstract:BACKGROUND:Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relat...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-167

    authors: Herold J,Kurtz S,Giegerich R

    更新日期:2008-03-26 00:00:00

  • PuFFIN--a parameter-free method to build nucleosome maps from paired-end reads.

    abstract:BACKGROUND:We introduce a novel method, called PuFFIN, that takes advantage of paired-end short reads to build genome-wide nucleosome maps with larger numbers of detected nucleosomes and higher accuracy than existing tools. In contrast to other approaches that require users to optimize several parameters according to t...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-S9-S11

    authors: Polishko A,Bunnik EM,Le Roch KG,Lonardi S

    更新日期:2014-01-01 00:00:00

  • Bias detection and correction in RNA-Sequencing data.

    abstract:BACKGROUND:High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and ability to identify novel transcripts. Moreover, for genes with mult...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-290

    authors: Zheng W,Chung LM,Zhao H

    更新日期:2011-07-19 00:00:00

  • JISTIC: identification of significant targets in cancer.

    abstract:BACKGROUND:Cancer is caused through a multistep process, in which a succession of genetic changes, each conferring a competitive advantage for growth and proliferation, leads to the progressive conversion of normal human cells into malignant cancer cells. Interrogation of cancer genomes holds the promise of understandi...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-189

    authors: Sanchez-Garcia F,Akavia UD,Mozes E,Pe'er D

    更新日期:2010-04-14 00:00:00

  • GenNon-h: generating multiple sequence alignments on nonhomogeneous phylogenetic trees.

    abstract:BACKGROUND:A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-216

    authors: Kedzierska AM,Casanellas M

    更新日期:2012-08-28 00:00:00

  • Analysis of genomic and transcriptomic variations as prognostic signature for lung adenocarcinoma.

    abstract:BACKGROUND:Lung cancer is the leading cause of the largest number of deaths worldwide and lung adenocarcinoma is the most common form of lung cancer. In order to understand the molecular basis of lung adenocarcinoma, integrative analysis have been performed by using genomics, transcriptomics, epigenomics, proteomics an...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-020-03691-3

    authors: Zengin T,Önal-Süzek T

    更新日期:2020-09-30 00:00:00

  • Subfamily specific conservation profiles for proteins based on n-gram patterns.

    abstract:BACKGROUND:A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profile...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-72

    authors: Vries JK,Liu X

    更新日期:2008-01-30 00:00:00

  • Statistical shape analysis of tap roots: a methodological case study on laser scanned sugar beets.

    abstract:BACKGROUND:The efficient and robust statistical analysis of the shape of plant organs of different cultivars is an important investigation issue in plant breeding and enables a robust cultivar description within the breeding progress. Laserscanning is a highly accurate and high resolution technique to acquire the 3D sh...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-020-03654-8

    authors: Heeren B,Paulus S,Goldbach H,Kuhlmann H,Mahlein AK,Rumpf M,Wirth B

    更新日期:2020-07-29 00:00:00

  • Functional relevance of dynamic properties of Dimeric NADP-dependent Isocitrate Dehydrogenases.

    abstract:BACKGROUND:Isocitrate Dehydrogenases (IDHs) are important enzymes present in all living cells. Three subfamilies of functionally dimeric IDHs (subfamilies I, II, III) are known. Subfamily I are well-studied bacterial IDHs, like that of Escherischia coli. Subfamily II has predominantly eukaryotic members, but it also ha...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-S17-S2

    authors: Vinekar R,Verma C,Ghosh I

    更新日期:2012-01-01 00:00:00