Predicting and improving the protein sequence alignment quality by support vector regression.

Abstract:

BACKGROUND: Successful protein structure prediction by comparative modeling requires not only a good template protein of known structure but also an accurate sequence alignment between the query protein and the template. Alignment accuracy is known to vary significantly with the choice of alignment parameters such as the gap opening penalty and the gap extension penalty. Because the accuracy of a sequence alignment is typically measured by comparing it with the corresponding structure alignment, it cannot be evaluated directly without the structure of the query protein, which is not available at the time of structure prediction. Moreover, no single alignment parameter option always yields the optimal alignment.
RESULTS: In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub score as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as input to the trained SVR model to predict the alignment quality. Performance is evaluated by several measures, including the Pearson correlation coefficient between the observed and predicted MaxSub scores. The results show a high correlation coefficient of 0.945. For each query-template pair, 48 alignments are generated by varying the alignment options. The trained SVR models are then applied to predict the MaxSub scores of these alignments and to select the best alignment option specifically for that query-template pair. This adaptive selection procedure improves MaxSub scores by 7.4% compared with using the single best parameter option for all query-template pairs.
CONCLUSION: The present work demonstrates that alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are unsuitable for structure prediction because of poor alignment accuracy. The method is implemented as part of FORECAST, a fold-recognition server, and is freely available on the web at http://pbil.kaist.ac.kr/forecast.
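The evaluation and selection steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `pearson` and `select_best_option`, the option labels, and all scores below are hypothetical and invented for illustration, and the SVR predictor itself is not shown (its predicted scores are assumed to be given).

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def select_best_option(predicted: Dict[str, float]) -> Tuple[str, float]:
    """Return the alignment option with the highest predicted score.

    `predicted` maps a (hypothetical) option label to the score predicted
    for the alignment generated with that option.
    """
    if not predicted:
        raise ValueError("no candidate alignments")
    best = max(predicted, key=predicted.get)
    return best, predicted[best]


if __name__ == "__main__":
    # Toy numbers only; they are NOT results from the paper.
    observed = [0.10, 0.35, 0.50, 0.80]
    predicted = [0.15, 0.30, 0.55, 0.75]
    print("correlation:", pearson(observed, predicted))

    # Hypothetical predicted scores for three alignment options of one pair.
    scores = {"gap_open=10,gap_ext=1": 0.42,
              "gap_open=12,gap_ext=2": 0.57,
              "gap_open=8,gap_ext=1": 0.49}
    print("selected:", select_best_option(scores))
```

In this sketch the selection is made per query-template pair, which mirrors the adaptive procedure in the abstract; using one fixed option for every pair would correspond to picking the same dictionary key regardless of the predicted scores.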

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Lee M,Jeong CS,Kim D

doi

10.1186/1471-2105-8-471

pub_date

2007-12-03 00:00:00

pages

471

issn

1471-2105

pii

1471-2105-8-471

journal_volume

8

pub_type

Journal Article
  • Cyclic nucleotide binding proteins in the Arabidopsis thaliana and Oryza sativa genomes.

    abstract:BACKGROUND:Cyclic nucleotides are ubiquitous intracellular messengers. Until recently, the roles of cyclic nucleotides in plant cells have proven difficult to uncover. With an understanding of the protein domains which can bind cyclic nucleotides (CNB and GAF domains) we scanned the completed genomes of the higher plan...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-6

    authors: Bridges D,Fraser ME,Moorhead GB

    update_date:2005-01-11 00:00:00

  • Class prediction for high-dimensional class-imbalanced data.

    abstract:BACKGROUND:The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-523

    authors: Blagus R,Lusa L

    update_date:2010-10-20 00:00:00

  • ILP-based maximum likelihood genome scaffolding.

    abstract:BACKGROUND:Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies which generate relatively short reads resulting in highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S9-S9

    authors: Lindsay J,Salooti H,Măndoiu I,Zelikovsky A

    update_date:2014-01-01 00:00:00

  • Gene ontology based transfer learning for protein subcellular localization.

    abstract:BACKGROUND:Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-44

    authors: Mei S,Fei W,Zhou S

    update_date:2011-02-02 00:00:00

  • Phylogenomics and sequence-structure-function relationships in the GmrSD family of Type IV restriction enzymes.

    abstract:BACKGROUND:GmrSD is a modification-dependent restriction endonuclease that specifically targets and cleaves glucosylated hydroxymethylcytosine (glc-HMC) modified DNA. It is encoded either as two separate single-domain GmrS and GmrD proteins or as a single protein carrying both domains. Previous studies suggested that G...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0773-z

    authors: Machnicka MA,Kaminska KH,Dunin-Horkawicz S,Bujnicki JM

    update_date:2015-10-23 00:00:00

  • Analyzing miRNA co-expression networks to explore TF-miRNA regulation.

    abstract:BACKGROUND:Current microRNA (miRNA) research in progress has engendered rapid accumulation of expression data evolving from microarray experiments. Such experiments are generally performed over different tissues belonging to a specific species of metazoan. For disease diagnosis, microarray probes are also prepared with...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-163

    authors: Bandyopadhyay S,Bhattacharyya M

    update_date:2009-05-28 00:00:00

  • Identification of sequence motifs significantly associated with antisense activity.

    abstract:BACKGROUND:Predicting the suppression activity of antisense oligonucleotide sequences is the main goal of the rational design of nucleic acids. To create an effective predictive model, it is important to know what properties of an oligonucleotide sequence associate significantly with antisense activity. Also, for the m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-184

    authors: McQuisten KA,Peek AS

    update_date:2007-06-07 00:00:00

  • Machine learning for discovering missing or wrong protein function annotations : A comparison using updated benchmark datasets.

    abstract:BACKGROUND:A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article, Review

    doi:10.1186/s12859-019-3060-6

    authors: Nakano FK,Lietaert M,Vens C

    update_date:2019-09-23 00:00:00

  • Predicting substrates of the human breast cancer resistance protein using a support vector machine method.

    abstract:BACKGROUND:Human breast cancer resistance protein (BCRP) is an ATP-binding cassette (ABC) efflux transporter that confers multidrug resistance in cancers and also plays an important role in the absorption, distribution and elimination of drugs. Prediction as to if drugs or new molecular entities are BCRP substrates sho...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-130

    authors: Hazai E,Hazai I,Ragueneau-Majlessi I,Chung SP,Bikadi Z,Mao Q

    update_date:2013-04-15 00:00:00

  • Linear space string correction algorithm using the Damerau-Levenshtein distance.

    abstract:BACKGROUND:The Damerau-Levenshtein (DL) distance metric has been widely used in the biological science. It tries to identify the similar region of DNA,RNA and protein sequences by transforming one sequence to the another using the substitution, insertion, deletion and transposition operations. Lowrance and Wagner have ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3184-8

    authors: Zhao C,Sahni S

    update_date:2020-12-09 00:00:00

  • OpWise: operons aid the identification of differentially expressed genes in bacterial microarray experiments.

    abstract:BACKGROUND:Differentially expressed genes are typically identified by analyzing the variation between replicate measurements. These procedures implicitly assume that there are no systematic errors in the data even though several sources of systematic error are known. RESULTS:OpWise estimates the amount of systematic e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-19

    authors: Price MN,Arkin AP,Alm EJ

    update_date:2006-01-13 00:00:00

  • Performance of a genetic algorithm for mass spectrometry proteomics.

    abstract:BACKGROUND:Recently, mass spectrometry data have been mined using a genetic algorithm to produce discriminatory models that distinguish healthy individuals from those with cancer. This algorithm is the basis for claims of 100% sensitivity and specificity in two related publicly available datasets. To date, no detailed ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-180

    authors: Jeffries NO

    update_date:2004-11-19 00:00:00

  • Critique of the pairwise method for estimating qPCR amplification efficiency: beware of correlated data!

    abstract:BACKGROUND:A recently proposed method for estimating qPCR amplification efficiency E analyzes fluorescence intensity ratios from pairs of points deemed to lie in the exponential growth region on the amplification curves for all reactions in a dilution series. This method suffers from a serious problem: The resulting ra...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03604-4

    authors: Tellinghuisen J

    update_date:2020-07-08 00:00:00

  • Identification of common coexpression modules based on quantitative network comparison.

    abstract:BACKGROUND:Finding common molecular interactions from different samples is essential work to understanding diseases and other biological processes. Coexpression networks and their modules directly reflect sample-specific interactions among genes. Therefore, identification of common coexpression network or modules may r...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2193-3

    authors: Jo Y,Kim S,Lee D

    update_date:2018-06-13 00:00:00

  • GeneBins: a database for classifying gene expression data, with application to plant genome arrays.

    abstract:BACKGROUND:To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms. RESULTS:We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on KEGG o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-87

    authors: Goffard N,Weiller G

    update_date:2007-03-12 00:00:00

  • MapMi: automated mapping of microRNA loci.

    abstract:BACKGROUND:A large effort to discover microRNAs (miRNAs) has been under way. Currently miRBase is their primary repository, providing annotations of primary sequences, precursors and probable genomic loci. In many cases miRNAs are identical or very similar between related (or in some cases more distant) species. Howeve...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-133

    authors: Guerra-Assunção JA,Enright AJ

    update_date:2010-03-16 00:00:00

  • The scoring of poses in protein-protein docking: current capabilities and future directions.

    abstract:BACKGROUND:Protein-protein docking, which aims to predict the structure of a protein-protein complex from its unbound components, remains an unresolved challenge in structural bioinformatics. An important step is the ranking of docked poses using a scoring function, for which many methods have been developed. There is ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-286

    authors: Moal IH,Torchala M,Bates PA,Fernández-Recio J

    update_date:2013-10-01 00:00:00

  • Venn-diaNet : venn diagram based network propagation analysis framework for comparing multiple biological experiments.

    abstract:BACKGROUND:The main research topic in this paper is how to compare multiple biological experiments using transcriptome data, where each experiment is measured and designed to compare control and treated samples. Comparison of multiple biological experiments is usually performed in terms of the number of DEGs in an arbi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3302-7

    authors: Hur B,Kang D,Lee S,Moon JH,Lee G,Kim S

    update_date:2019-12-27 00:00:00

  • Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks.

    abstract:BACKGROUND:Protein function prediction is an important problem in the post-genomic era. Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data. Thus, using PPI data to functionally annotate proteins has been extensively studied. However, most existi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S12-S4

    authors: Xiong W,Liu H,Guan J,Zhou S

    update_date:2013-01-01 00:00:00

  • An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    abstract:BACKGROUND:Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-121

    authors: Tech M,Meinicke P

    update_date:2006-03-09 00:00:00

  • A novel method to identify high order gene-gene interactions in genome-wide association studies: gene-based MDR.

    abstract:BACKGROUND:Because common complex diseases are affected by multiple genes and environmental factors, it is essential to investigate gene-gene and/or gene-environment interactions to understand genetic architecture of complex diseases. After the great success of large scale genome-wide association (GWA) studies using th...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S9-S5

    authors: Oh S,Lee J,Kwon MS,Weir B,Ha K,Park T

    update_date:2012-06-11 00:00:00

  • Genoviz Software Development Kit: Java tool kit for building genomics visualization applications.

    abstract:BACKGROUND:Visualization software can expose previously undiscovered patterns in genomic data and advance biological science. RESULTS:The Genoviz Software Development Kit (SDK) is an open source, Java-based framework designed for rapid assembly of visualization software applications for genomics. The Genoviz SDK frame...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-266

    authors: Helt GA,Nicol JW,Erwin E,Blossom E,Blanchard SG Jr,Chervitz SA,Harmon C,Loraine AE

    update_date:2009-08-25 00:00:00

  • The rise and fall of breakpoint reuse depending on genome resolution.

    abstract:BACKGROUND:During evolution, large-scale genome rearrangements of chromosomes shuffle the order of homologous genome sequences ("synteny blocks") across species. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrangements recur, causing breakpoints to be reused. METHODS:We investi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S9-S1

    authors: Attie O,Darling AE,Yancopoulos S

    update_date:2011-10-05 00:00:00

  • MOSBIE: a tool for comparison and analysis of rule-based biochemical models.

    abstract:BACKGROUND:Mechanistic models that describe the dynamical behaviors of biochemical systems are common in computational systems biology, especially in the realm of cellular signaling. The development of families of such models, either by a single research group or by different groups working within the same area, presen...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-316

    authors: Wenskovitch JE Jr,Harris LA,Tapia JJ,Faeder JR,Marai GE

    update_date:2014-09-25 00:00:00

  • Identification and utilization of inter-species conserved (ISC) probesets on Affymetrix human GeneChip platforms for the optimization of the assessment of expression patterns in non human primate (NHP) samples.

    abstract:BACKGROUND:While researchers have utilized versions of the Affymetrix human GeneChip for the assessment of expression patterns in non human primate (NHP) samples, there has been no comprehensive sequence analysis study undertaken to demonstrate that the probe sequences designed to detect human transcripts are reliably ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-165

    authors: Wang Z,Lewis MG,Nau ME,Arnold A,Vahey MT

    update_date:2004-10-26 00:00:00

  • Compartmentalization of the Edinburgh Human Metabolic Network.

    abstract:BACKGROUND:Direct in vivo investigation of human metabolism is complicated by the distinct metabolic functions of various sub-cellular organelles. Diverse micro-environments in different organelles may lead to distinct functions of the same protein and the use of different enzymes for the same metabolic reaction. To be...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-393

    authors: Hao T,Ma HW,Zhao XM,Goryanin I

    update_date:2010-07-22 00:00:00

  • Prediction of protein structural class with Rough Sets.

    abstract:BACKGROUND:A new method for the prediction of protein structural classes is constructed based on Rough Sets algorithm, which is a rule-based data mining method. Amino acid compositions and 8 physicochemical properties data are used as conditional attributes for the construction of decision system. After reducing the de...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-20

    authors: Cao Y,Liu S,Zhang L,Qin J,Wang J,Tang K

    update_date:2006-01-14 00:00:00

  • Rigorous assessment and integration of the sequence and structure based features to predict hot spots.

    abstract:BACKGROUND:Systematic mutagenesis studies have shown that only a few interface residues termed hot spots contribute significantly to the binding free energy of protein-protein interactions. Therefore, hot spots prediction becomes increasingly important for well understanding the essence of proteins interactions and hel...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-311

    authors: Chen R,Chen W,Yang S,Wu D,Wang Y,Tian Y,Shi Y

    update_date:2011-07-29 00:00:00

  • Blazing Signature Filter: a library for fast pairwise similarity comparisons.

    abstract:BACKGROUND:Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phe...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2210-6

    authors: Lee JY,Fujimoto GM,Wilson R,Wiley HS,Payne SH

    update_date:2018-06-11 00:00:00

  • NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data.

    abstract:BACKGROUND:Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is ne...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2207-1

    authors: Fang L,Hu J,Wang D,Wang K

    update_date:2018-05-23 00:00:00