Multi-label literature classification based on the Gene Ontology graph.

Abstract:

BACKGROUND: The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.
RESULTS: In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community.
CONCLUSION: Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Jin B,Muller B,Zhai C,Lu X

doi

10.1186/1471-2105-9-525

subject

Has Abstract

pub_date

2008-12-08 00:00:00

pages

525

issn

1471-2105

pii

1471-2105-9-525

journal_volume

9

pub_type

Journal article
  • Extended analysis of benchmark datasets for Agilent two-color microarrays.

    abstract:BACKGROUND:As part of its broad and ambitious mission, the MicroArray Quality Control (MAQC) project reported the results of experiments using External RNA Controls (ERCs) on five microarray platforms. For most platforms, several different methods of data processing were considered. However, there was no similar consid...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-8-371

    authors: Kerr KF

    update_date:2007-10-03 00:00:00

  • A new method for 2D gel spot alignment: application to the analysis of large sample sets in clinical proteomics.

    abstract:BACKGROUND:In current comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. Unfortunately, differences in gel migration and sample variability make efficient spot alignment very difficult to obtain, and, as consequence most of the software ...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-9-460

    authors: Pérès S,Molina L,Salvetat N,Granier C,Molina F

    update_date:2008-10-28 00:00:00

  • Sample entropy analysis of cervical neoplasia gene-expression signatures.

    abstract:BACKGROUND:We introduce Approximate Entropy as a mathematical method of analysis for microarray data. Approximate entropy is applied here as a method to classify the complex gene expression patterns resultant of a clinical sample set. Since Entropy is a measure of disorder in a system, we believe that by choosing genes...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-10-66

    authors: Botting SK,Trzeciakowski JP,Benoit MF,Salama SA,Diaz-Arrastia CR

    update_date:2009-02-20 00:00:00

  • A simple method for assessing sample sizes in microarray experiments.

    abstract:BACKGROUND:In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. RESULTS:Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-7-106

    authors: Tibshirani R

    update_date:2006-03-02 00:00:00

  • Improving the prediction of mRNA extremities in the parasitic protozoan Leishmania.

    abstract:BACKGROUND:Leishmania and other members of the Trypanosomatidae family diverged early on in eukaryotic evolution and consequently display unique cellular properties. Their apparent lack of transcriptional regulation is compensated by complex post-transcriptional control mechanisms, including the processing of polycistr...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-9-158

    authors: Smith M,Blanchette M,Papadopoulou B

    update_date:2008-03-20 00:00:00

  • Combining calls from multiple somatic mutation-callers.

    abstract:BACKGROUND:Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still challenging and there is unlikely to be one established caller that systematically outperfor...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-15-154

    authors: Kim SY,Jacob L,Speed TP

    update_date:2014-05-21 00:00:00

  • HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding.

    abstract:BACKGROUND:Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-020-03939-y

    authors: Solares EA,Tao Y,Long AD,Gaut BS

    update_date:2021-01-06 00:00:00

  • Model based analysis of real-time PCR data from DNA binding dye protocols.

    abstract:BACKGROUND:Reverse transcription followed by real-time PCR is widely used for quantification of specific mRNA, and with the use of double-stranded DNA binding dyes it is becoming a standard for microarray data validation. Despite the kinetic information generated by real-time PCR, most popular analysis methods assume c...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-8-85

    authors: Alvarez MJ,Vila-Ortiz GJ,Salibe MC,Podhajcer OL,Pitossi FJ

    update_date:2007-03-09 00:00:00

  • Bioinformatics approach to predict target genes for dysregulated microRNAs in hepatocellular carcinoma: study on a chemically-induced HCC mouse model.

    abstract:BACKGROUND:Hepatocellular carcinoma (HCC) is an aggressive epithelial tumor which shows very poor prognosis and high rate of recurrence, representing an urgent problem for public healthcare. MicroRNAs (miRNAs/miRs) are a class of small, non-coding RNAs that attract great attention because of their role in regulation of...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-015-0836-1

    authors: Del Vecchio F,Gallo F,Di Marco A,Mastroiaco V,Caianiello P,Zazzeroni F,Alesse E,Tessitore A

    update_date:2015-12-10 00:00:00

  • Trees on networks: resolving statistical patterns of phylogenetic similarities among interacting proteins.

    abstract:BACKGROUND:Phylogenies capture the evolutionary ancestry linking extant species. Correlations and similarities among a set of species are mediated by and need to be understood in terms of the phylogenic tree. In a similar way it has been argued that biological networks also induce correlations among sets of interacting...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-11-470

    authors: Kelly WP,Stumpf MP

    update_date:2010-09-20 00:00:00

  • An iterative block-shifting approach to retention time alignment that preserves the shape and area of gas chromatography-mass spectrometry peaks.

    abstract:BACKGROUND:Metabolomics, petroleum and biodiesel chemistry, biomarker discovery, and other fields which rely on high-resolution profiling of complex chemical mixtures generate datasets which contain millions of detector intensity readings, each uniquely addressed along dimensions of time (e.g., retention time of chemic...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-9-S9-S15

    authors: Chae M,Shmookler Reis RJ,Thaden JJ

    update_date:2008-08-12 00:00:00

  • svapls: an R package to correct for hidden factors of variability in gene expression studies.

    abstract:BACKGROUND:Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distort...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-14-236

    authors: Chakraborty S,Datta S,Datta S

    update_date:2013-07-24 00:00:00

  • GenNon-h: generating multiple sequence alignments on nonhomogeneous phylogenetic trees.

    abstract:BACKGROUND:A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts ...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-13-216

    authors: Kedzierska AM,Casanellas M

    update_date:2012-08-28 00:00:00

  • Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data.

    abstract:BACKGROUND:Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to i...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-15-58

    authors: Sariyar M,Hoffmann I,Binder H

    update_date:2014-02-26 00:00:00

  • Texture based skin lesion abruptness quantification to detect malignancy.

    abstract:BACKGROUND:Abruptness of pigment patterns at the periphery of a skin lesion is one of the most important dermoscopic features for detection of malignancy. In current clinical setting, abrupt cutoff of a skin lesion determined by an examination of a dermatologist. This process is subjective, nonquantitative, and error-p...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-017-1892-5

    authors: Erol R,Bayraktar M,Kockara S,Kaya S,Halic T

    update_date:2017-12-28 00:00:00

  • Discovering functional interaction patterns in protein-protein interaction networks.

    abstract:BACKGROUND:In recent years, a considerable amount of research effort has been directed to the analysis of biological networks with the availability of genome-scale networks of genes and/or proteins of an increasing number of organisms. A protein-protein interaction (PPI) network is a particular biological network which...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-9-276

    authors: Turanalp ME,Can T

    update_date:2008-06-11 00:00:00

  • Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature.

    abstract:BACKGROUND:The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. RESULTS:...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-018-2103-8

    authors: Müller HM,Van Auken KM,Li Y,Sternberg PW

    update_date:2018-03-09 00:00:00

  • A knowledge discovery object model API for Java.

    abstract:BACKGROUND:Biological data resources have become heterogeneous and derive from multiple sources. This introduces challenges in the management and utilization of this data in software development. Although efforts are underway to create a standard format for the transmission and storage of biological data, this objectiv...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-4-51

    authors: Zuyderduyn SD,Jones SJ

    update_date:2003-10-28 00:00:00

  • Critique of the pairwise method for estimating qPCR amplification efficiency: beware of correlated data!

    abstract:BACKGROUND:A recently proposed method for estimating qPCR amplification efficiency E analyzes fluorescence intensity ratios from pairs of points deemed to lie in the exponential growth region on the amplification curves for all reactions in a dilution series. This method suffers from a serious problem: The resulting ra...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-020-03604-4

    authors: Tellinghuisen J

    update_date:2020-07-08 00:00:00

  • Progressive multiple sequence alignment with indel evolution.

    abstract:BACKGROUND:Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modell...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-018-2357-1

    authors: Maiolo M,Zhang X,Gil M,Anisimova M

    update_date:2018-09-21 00:00:00

  • Bayesian Unidimensional Scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations.

    abstract:BACKGROUND:Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-017-1790-x

    authors: Nguyen LH,Holmes S

    update_date:2017-09-13 00:00:00

  • PVT: an efficient computational procedure to speed up next-generation sequence analysis.

    abstract:BACKGROUND:High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates substantially large data which puts up a major challenge to the scientists for an efficient, cost and time effective solution to analyse such data. Further, for the dif...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-15-167

    authors: Maji RK,Sarkar A,Khatua S,Dasgupta S,Ghosh Z

    update_date:2014-06-04 00:00:00

  • Uncovering packaging features of co-regulated modules based on human protein interaction and transcriptional regulatory networks.

    abstract:BACKGROUND:Network co-regulated modules are believed to have the functionality of packaging multiple biological entities, and can thus be assumed to coordinate many biological functions in their network neighbouring regions. RESULTS:Here, we weighted edges of a human protein interaction network and a transcriptional r...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-11-392

    authors: Chen L,Wang H,Zhang L,Li W,Wang Q,Shang Y,He Y,He W,Li X,Tai J,Li X

    update_date:2010-07-22 00:00:00

  • dupRadar: a Bioconductor package for the assessment of PCR artifacts in RNA-Seq data.

    abstract:BACKGROUND:PCR clonal artefacts originating from NGS library preparation can affect both genomic as well as RNA-Seq applications when protocols are pushed to their limits. In RNA-Seq however the artifactual reads are not easy to tell apart from normal read duplication due to natural over-sequencing of highly expressed ...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-016-1276-2

    authors: Sayols S,Scherzinger D,Klein H

    update_date:2016-10-21 00:00:00

  • LAVA: an open-source approach to designing LAMP (loop-mediated isothermal amplification) DNA signatures.

    abstract:BACKGROUND:We developed an extendable open-source Loop-mediated isothermal AMPlification (LAMP) signature design program called LAVA (LAMP Assay Versatile Analysis). LAVA was created in response to limitations of existing LAMP signature programs. RESULTS:LAVA identifies combinations of six primer regions for basic LAM...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-12-240

    authors: Torres C,Vitalis EA,Baker BR,Gardner SN,Torres MW,Dzenitis JM

    update_date:2011-06-16 00:00:00

  • Genome Projector: zoomable genome map with multiple views.

    abstract:BACKGROUND:Molecular biology data exist on diverse scales, from the level of molecules to -omics. At the same time, the data at each scale can be categorised into multiple layers, such as the genome, transcriptome, proteome, metabolome, and biochemical pathways. Due to the highly multi-layer and multi-dimensional natur...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/1471-2105-10-31

    authors: Arakawa K,Tamaki S,Kono N,Kido N,Ikegami K,Ogawa R,Tomita M

    update_date:2009-01-23 00:00:00

  • Detecting broad domains and narrow peaks in ChIP-seq data with hiddenDomains.

    abstract:BACKGROUND:Correctly identifying genomic regions enriched with histone modifications and transcription factors is key to understanding their regulatory and developmental roles. Conceptually, these regions are divided into two categories, narrow peaks and broad domains, and different algorithms are used to identify each...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-016-0991-z

    authors: Starmer J,Magnuson T

    update_date:2016-03-24 00:00:00

  • A two-phase procedure for non-normal quantitative trait genetic association study.

    abstract:BACKGROUND:The nonparametric trend test (NPT) is well suitable for identifying the genetic variants associated with quantitative traits when the trait values do not satisfy the normal distribution assumption. If the genetic model, defined according to the mode of inheritance, is known, the NPT derived under the given g...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-016-0888-x

    authors: Zhang W,Li H,Li Z,Li Q

    update_date:2016-01-28 00:00:00

  • Predicting peptide presentation by major histocompatibility complex class I: an improved machine learning approach to the immunopeptidome.

    abstract:BACKGROUND:To further our understanding of immunopeptidomics, improved tools are needed to identify peptides presented by major histocompatibility complex class I (MHC-I). Many existing tools are limited by their reliance upon chemical affinity data, which is less biologically relevant than sampling by mass spectrometr...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-018-2561-z

    authors: Boehm KM,Bhinder B,Raja VJ,Dephoure N,Elemento O

    update_date:2019-01-05 00:00:00

  • CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies.

    abstract:BACKGROUND:Current taxonomic classification tools use exact string matching algorithms that are effective to tackle the data from the next generation sequencing technology. However, the unique error patterns in the third generation sequencing (TGS) technologies could reduce the accuracy of these programs. RESULTS:We d...

    journal_title:BMC bioinformatics

    pub_type: Journal article

    doi:10.1186/s12859-020-03777-y

    authors: Bui VK,Wei C

    update_date:2020-10-20 00:00:00