A weighted string kernel for protein fold recognition.

Abstract:

BACKGROUND: Alignment-free methods for comparing protein sequences have proved to be viable alternatives to approaches that first rely on an alignment of the sequences to be compared. Much work, however, needs to be done before those methods provide reliable fold recognition for proteins whose sequences share little similarity. We recently proposed an alignment-free method based on the concept of string kernels, SeqKernel (Nojoomi and Koehl, BMC Bioinformatics, 2017, 18:137). In that study, we showed that while SeqKernel performs better than standard alignment-based methods, its applications are potentially limited because of biases due mostly to sequence length effects.

METHODS: In this study, we propose improvements to SeqKernel that follow two directions. First, we develop a weighted version of the kernel, WSeqKernel. Second, we expand the concept of string kernels into a novel framework for deriving information on amino acids from protein sequences.

RESULTS: Using a dataset that contains only remote homologs, we show that WSeqKernel performs remarkably well in fold recognition experiments. With the appropriate weighting scheme, the length effects on the kernel values can be removed. WSeqKernel, just like any alignment-based sequence comparison method, depends on a substitution matrix. We show that this matrix can be optimized so that sequence similarity scores correlate well with structure similarity scores. Starting from no information on amino acid similarity, we can derive a scoring matrix that echoes the physico-chemical properties of amino acids.

CONCLUSION: We have made progress in characterizing and parametrizing string kernels as alignment-free methods for comparing protein sequences, and we have shown that they provide a framework for extracting sequence information from structure.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Nojoomi S,Koehl P

doi

10.1186/s12859-017-1795-5

subject

Has Abstract

pub_date

2017-08-25 00:00:00

pages

378

issue

1

issn

1471-2105

pii

10.1186/s12859-017-1795-5

journal_volume

18

pub_type

Journal Article
  • Progressive multiple sequence alignment with indel evolution.

    abstract:BACKGROUND:Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modell...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2357-1

    authors: Maiolo M,Zhang X,Gil M,Anisimova M

    update_date:2018-09-21 00:00:00

  • Coverage statistics for sequence census methods.

    abstract:BACKGROUND:We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how thi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-430

    authors: Evans SN,Hower V,Pachter L

    update_date:2010-08-18 00:00:00

  • Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance.

    abstract:BACKGROUND:PSI-BLAST, an extremely popular tool for sequence similarity search, features the utilization of Position-Specific Scoring Matrix (PSSM) constructed from a multiple sequence alignment (MSA). PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. An accurate est...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1686-9

    authors: Oda T,Lim K,Tomii K

    update_date:2017-06-02 00:00:00

  • Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses.

    abstract:BACKGROUND:The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S1-S7

    authors: Miotto O,Tan TW,Brusic V

    update_date:2008-01-01 00:00:00

  • Prediction of virus-host infectious association by supervised learning methods.

    abstract:BACKGROUND:The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the simila...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1473-7

    authors: Zhang M,Yang L,Ren J,Ahlgren NA,Fuhrman JA,Sun F

    update_date:2017-03-14 00:00:00

  • Insertion and deletion correcting DNA barcodes based on watermarks.

    abstract:BACKGROUND:Barcode multiplexing is a key strategy for sharing the rising capacity of next-generation sequencing devices: Synthetic DNA tags, called barcodes, are attached to natural DNA fragments within the library preparation procedure. Different libraries can individually be labeled with barcodes for a joint sequenc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0482-7

    authors: Kracht D,Schober S

    update_date:2015-02-18 00:00:00

  • Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks.

    abstract:BACKGROUND:Protein function prediction is an important problem in the post-genomic era. Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data. Thus, using PPI data to functionally annotate proteins has been extensively studied. However, most existi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S12-S4

    authors: Xiong W,Liu H,Guan J,Zhou S

    update_date:2013-01-01 00:00:00

  • Challenges in estimating percent inclusion of alternatively spliced junctions from RNA-seq data.

    abstract:Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S6-S11

    authors: Kakaradov B,Xiong HY,Lee LJ,Jojic N,Frey BJ

    update_date:2012-04-19 00:00:00

  • A universal genomic coordinate translator for comparative genomics.

    abstract:BACKGROUND:Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic seq...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-227

    authors: Zamani N,Sundström G,Meadows JR,Höppner MP,Dainat J,Lantz H,Haas BJ,Grabherr MG

    update_date:2014-06-30 00:00:00

  • ICoVax 2013: the 3rd ISV Pre-conference Computational Vaccinology Workshop.

    abstract:Following last year's computational vaccinology workshop in Shanghai, China, the third ISV Pre-conference Computational Vaccinology Workshop (ICoVax 2013) was held in Barcelona, Spain. ICoVax 2013 provided an international platform for the attendees to showcase their research and discuss problems and solutions in the ...

    journal_title:BMC bioinformatics

    pub_type:

    doi:10.1186/1471-2105-15-S4-I1

    authors: De Groot AS,De Groot P,He Y

    update_date:2014-01-01 00:00:00

  • Simulating variance heterogeneity in quantitative genome wide association studies.

    abstract:BACKGROUND:Analyzing Variance heterogeneity in genome wide association studies (vGWAS) is an emerging approach for detecting genetic loci involved in gene-gene and gene-environment interactions. vGWAS analysis detects variability in phenotype values across genotypes, as opposed to typical GWAS analysis, which detects v...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2061-1

    authors: Al Kawam A,Alshawaqfeh M,Cai JJ,Serpedin E,Datta A

    update_date:2018-03-21 00:00:00

  • LDpop: an interactive online tool to calculate and visualize geographic LD patterns.

    abstract:BACKGROUND:Linkage disequilibrium (LD)-the non-random association of alleles at different loci-defines population-specific haplotypes which vary by genomic ancestry. Assessment of allelic frequencies and LD patterns from a variety of ancestral populations enables researchers to better understand population histories as...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3340-1

    authors: Alexander TA,Machiela MJ

    update_date:2020-01-10 00:00:00

  • Promoting ranking diversity for genomics search with relevance-novelty combined model.

    abstract:BACKGROUND:In the biomedical domain, the desired information of a question (query) asked by biologists usually is a list of a certain type of entities covering different aspects that are related to the question, such as genes, proteins, diseases, mutations, etc. Hence it is important for a biomedical information retrie...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S5-S8

    authors: Yin X,Li Z,Huang JX,Hu X

    update_date:2011-01-01 00:00:00

  • Mining differential top-k co-expression patterns from time course comparative gene expression datasets.

    abstract:BACKGROUND:Frequent pattern mining analysis applied on microarray dataset appears to be a promising strategy for identifying relationships between gene expression levels. Unfortunately, too many itemsets (co-expressed genes) are identified by this analysis method since it does not consider the importance of each gene w...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-230

    authors: Liu YC,Cheng CP,Tseng VS

    update_date:2013-07-21 00:00:00

  • Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model.

    abstract:BACKGROUND:Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-272

    authors: He X,Sarma MS,Ling X,Chee B,Zhai C,Schatz B

    update_date:2010-05-20 00:00:00

  • TooT-T: discrimination of transport proteins from non-transport proteins.

    abstract:BACKGROUND:Membrane transport proteins (transporters) play an essential role in every living cell by transporting hydrophilic molecules across the hydrophobic membranes. While the sequences of many membrane proteins are known, their structure and function is still not well characterized and understood, owing to the imm...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3311-6

    authors: Alballa M,Butler G

    update_date:2020-04-23 00:00:00

  • Maximum expected accuracy structural neighbors of an RNA secondary structure.

    abstract:BACKGROUND:Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detec...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S5-S6

    authors: Clote P,Lou F,Lorenz WA

    update_date:2012-04-12 00:00:00

  • Partition-based optimization model for generative anatomy modeling language (POM-GAML).

    abstract:BACKGROUND:This paper presents a novel approach for Generative Anatomy Modeling Language (GAML). This approach automatically detects the geometric partitions in 3D anatomy that in turn speeds up integrated non-linear optimization model in GAML for 3D anatomy modeling with constraints (e.g. joints). This integrated non-...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2626-7

    authors: Demirel D,Cetinsaya B,Halic T,Kockara S,Ahmadi S

    update_date:2019-03-14 00:00:00

  • A web services choreography scenario for interoperating bioinformatics applications.

    abstract:BACKGROUND:Very often genome-wide data analysis requires the interoperation of multiple databases and analytic tools. A large number of genome databases and bioinformatics applications are available through the web, but it is difficult to automate interoperation because: 1) the platforms on which the applications run a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-25

    authors: de Knikker R,Guo Y,Li JL,Kwan AK,Yip KY,Cheung DW,Cheung KH

    update_date:2004-03-10 00:00:00

  • Simultaneous phylogeny reconstruction and multiple sequence alignment.

    abstract:BACKGROUND:A phylogeny is the evolutionary history of a group of organisms. To date, sequence data is still the most used data type for phylogenetic reconstruction. Before any sequences can be used for phylogeny reconstruction, they must be aligned, and the quality of the multiple sequence alignment has been shown to a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S1-S11

    authors: Yue F,Shi J,Tang J

    update_date:2009-01-30 00:00:00

  • HMM Logos for visualization of protein families.

    abstract:BACKGROUND:Profile Hidden Markov Models (pHMMs) are a widely used tool for protein family research. Up to now, however, there exists no method to visualize all of their central aspects graphically in an intuitively understandable way. RESULTS:We present a visualization method that incorporates both emission and transi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-7

    authors: Schuster-Böckler B,Schultz J,Rahmann S

    update_date:2004-01-21 00:00:00

  • Island method for estimating the statistical significance of profile-profile alignment scores.

    abstract:BACKGROUND:In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many exp...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-112

    authors: Poleksic A

    update_date:2009-04-20 00:00:00

  • Combining sequence and network information to enhance protein-protein interaction prediction.

    abstract:BACKGROUND:Protein-protein interactions (PPIs) are of great importance in cellular systems of organisms, since they are the basis of cellular structure and function and many essential cellular processes are related to that. Most proteins perform their functions by interacting with other proteins, so predicting PPIs acc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03896-6

    authors: Liu L,Zhu X,Ma Y,Piao H,Yang Y,Hao X,Fu Y,Wang L,Peng J

    update_date:2020-12-16 00:00:00

  • Texture based skin lesion abruptness quantification to detect malignancy.

    abstract:BACKGROUND:Abruptness of pigment patterns at the periphery of a skin lesion is one of the most important dermoscopic features for detection of malignancy. In the current clinical setting, abrupt cutoff of a skin lesion is determined by an examination of a dermatologist. This process is subjective, nonquantitative, and error-p...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1892-5

    authors: Erol R,Bayraktar M,Kockara S,Kaya S,Halic T

    update_date:2017-12-28 00:00:00

  • Mutation status coupled with RNA-sequencing data can efficiently identify important non-significantly mutated genes serving as diagnostic biomarkers of endometrial cancer.

    abstract:BACKGROUND:Endometrial cancers (ECs) are one of the most common types of malignant tumor in females. Substantial efforts had been made to identify significantly mutated genes (SMGs) in ECs and use them as biomarkers for the classification of histological subtypes and the prediction of clinical outcomes. However, the im...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1891-6

    authors: Liu K,He L,Liu Z,Xu J,Liu Y,Kuang Q,Wen Z,Li M

    update_date:2017-12-28 00:00:00

  • πBUSS: a parallel BEAST/BEAGLE utility for sequence simulation under complex evolutionary scenarios.

    abstract:BACKGROUND:Simulated nucleotide or amino acid sequences are frequently used to assess the performance of phylogenetic reconstruction methods. BEAST, a Bayesian statistical framework that focuses on reconstructing time-calibrated molecular evolutionary processes, supports a wide array of evolutionary models, but lacked ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-133

    authors: Bielejec F,Lemey P,Carvalho LM,Baele G,Rambaut A,Suchard MA

    update_date:2014-05-07 00:00:00

  • BioIMAX: a Web 2.0 approach for easy exploratory and collaborative access to multivariate bioimage data.

    abstract:BACKGROUND:Innovations in biological and biomedical imaging produce complex high-content and multivariate image data. For decision-making and generation of hypotheses, scientists need novel information technology tools that enable them to visually explore and analyze the data and to discuss and communicate results or f...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-297

    authors: Loyek C,Rajpoot NM,Khan M,Nattkemper TW

    update_date:2011-07-21 00:00:00

  • Random forest versus logistic regression: a large-scale benchmark experiment.

    abstract:BACKGROUND AND GOAL:The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. RESULTS:In this conte...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2264-5

    authors: Couronné R,Probst P,Boulesteix AL

    update_date:2018-07-17 00:00:00

  • Asymmetric bagging and feature selection for activities prediction of drug molecules.

    abstract:BACKGROUND:Activities of drug molecules can be predicted by QSAR (quantitative structure activity relationship) models, which overcomes the disadvantages of high cost and long cycle by employing the traditional experimental method. With the fact that the number of drug molecules with positive activity is rather fewer t...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S6-S7

    authors: Li GZ,Meng HH,Lu WC,Yang JY,Yang MQ

    update_date:2008-05-28 00:00:00

  • Identification and utilization of inter-species conserved (ISC) probesets on Affymetrix human GeneChip platforms for the optimization of the assessment of expression patterns in non human primate (NHP) samples.

    abstract:BACKGROUND:While researchers have utilized versions of the Affymetrix human GeneChip for the assessment of expression patterns in non human primate (NHP) samples, there has been no comprehensive sequence analysis study undertaken to demonstrate that the probe sequences designed to detect human transcripts are reliably ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-165

    authors: Wang Z,Lewis MG,Nau ME,Arnold A,Vahey MT

    update_date:2004-10-26 00:00:00