Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy.

Abstract:

BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Metadata is another useful source of information for disambiguation. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.

RESULTS: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path from the co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but it does require training data, which the other methods do not. To evaluate these approaches, we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Averaged over all approaches and conditions, the success rate is 80%. The 'MetaData' approach performed best, with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average.

CONCLUSION: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but it does require an ontology that is both large and consistently modelled, and these two conditions work against each other. Term Cooc achieves a success rate greater than 90% given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.

AVAILABILITY: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
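
The 'Closest Sense' idea described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the toy graph, the term names, and the functions `build_graph`, `shortest_path_length` and `closest_sense` are hypothetical, and the sketch assumes that the ontology is treated as an undirected graph and that path length is counted in edges.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (term, term) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Pick the sense with the shortest path to any co-occurring term.

    Returns (sense, distance), or (None, None) if no sense is reachable.
    Ties are resolved in favour of the sense listed first (an assumption).
    """
    best_sense, best_dist = None, None
    for sense in senses:
        for term in context_terms:
            dist = shortest_path_length(graph, term, sense)
            if dist is not None and (best_dist is None or dist < best_dist):
                best_sense, best_dist = sense, dist
    return best_sense, best_dist


# Hypothetical toy ontology with two senses of the ambiguous label "nucleus".
GRAPH = build_graph([
    ("cell nucleus", "organelle"),
    ("organelle", "mitochondrion"),
    ("brain nucleus", "brain region"),
    ("brain region", "hypothalamus"),
])
SENSES = ["cell nucleus", "brain nucleus"]
```

Calling `closest_sense(GRAPH, SENSES, ["mitochondrion"])` selects the sense that lies nearest to the terms found in the document. A real ontology would need a decision on which relation types count as edges, which the abstract does not specify.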

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Alexopoulou D,Andreopoulos B,Dietze H,Doms A,Gandon F,Hakenberg J,Khelif K,Schroeder M,Wächter T

doi

10.1186/1471-2105-10-28

pub_date

2009-01-21 00:00:00

pages

28

issn

1471-2105

pii

1471-2105-10-28

journal_volume

10

pub_type

Journal Article
  • Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework.

    abstract:BACKGROUND:In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multip...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1273-5

    authors: Voillet V,Besse P,Liaubet L,San Cristobal M,González I

    update_date:2016-10-03 00:00:00

  • Epiviz: a view inside the design of an integrated visual analysis software for genomics.

    abstract:BACKGROUND:Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous consist of genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S11-S4

    authors: Chelaru F,Corrada Bravo H

    update_date:2015-01-01 00:00:00

  • FocAn: automated 3D analysis of DNA repair foci in image stacks acquired by confocal fluorescence microscopy.

    abstract:BACKGROUND:Phosphorylated histone H2AX, also known as γH2AX, forms μm-sized nuclear foci at the sites of DNA double-strand breaks (DSBs) induced by ionizing radiation and other agents. Due to their specificity and sensitivity, γH2AX immunoassays have become the gold standard for studying DSB induction and repair. One o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3370-8

    authors: Memmel S,Sisario D,Zimmermann H,Sauer M,Sukhorukov VL,Djuzenova CS,Flentje M

    update_date:2020-01-28 00:00:00

  • A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies.

    abstract:BACKGROUND:Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-284

    authors: Zuber V,Duarte Silva AP,Strimmer K

    update_date:2012-10-31 00:00:00

  • IPRStats: visualization of the functional potential of an InterProScan run.

    abstract:BACKGROUND:InterPro is a collection of protein signatures for the classification and automated annotation of proteins. Interproscan is a software tool that scans protein sequences against Interpro member databases using a variety of profile-based, hidden markov model and positional specific score matrix methods. It not...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S12-S13

    authors: Kelly RJ,Vincent DE,Friedberg I

    update_date:2010-12-21 00:00:00

  • A multifaceted analysis of HIV-1 protease multidrug resistance phenotypes.

    abstract:BACKGROUND:Great strides have been made in the effective treatment of HIV-1 with the development of second-generation protease inhibitors (PIs) that are effective against historically multi-PI-resistant HIV-1 variants. Nevertheless, mutation patterns that confer decreasing susceptibility to available PIs continue to ar...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-477

    authors: Doherty KM,Nakka P,King BM,Rhee SY,Holmes SP,Shafer RW,Radhakrishnan ML

    update_date:2011-12-15 00:00:00

  • Membrane protein orientation and refinement using a knowledge-based statistical potential.

    abstract:BACKGROUND:Recent increases in the number of deposited membrane protein crystal structures necessitate the use of automated computational tools to position them within the lipid bilayer. Identifying the correct orientation allows us to study the complex relationship between sequence, structure and the lipid environment...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-276

    authors: Nugent T,Jones DT

    update_date:2013-09-18 00:00:00

  • MOSBIE: a tool for comparison and analysis of rule-based biochemical models.

    abstract:BACKGROUND:Mechanistic models that describe the dynamical behaviors of biochemical systems are common in computational systems biology, especially in the realm of cellular signaling. The development of families of such models, either by a single research group or by different groups working within the same area, presen...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-316

    authors: Wenskovitch JE Jr,Harris LA,Tapia JJ,Faeder JR,Marai GE

    update_date:2014-09-25 00:00:00

  • MultiLoc2: integrating phylogeny and Gene Ontology terms improves subcellular protein localization prediction.

    abstract:BACKGROUND:Knowledge of subcellular localization of proteins is crucial to proteomics, drug target discovery and systems biology since localization and biological function are highly correlated. In recent years, numerous computational prediction methods have been developed. Nevertheless, there is still a need for predi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-274

    authors: Blum T,Briesemeister S,Kohlbacher O

    update_date:2009-09-01 00:00:00

  • An improved classification of G-protein-coupled receptors using sequence-derived features.

    abstract:BACKGROUND:G-protein-coupled receptors (GPCRs) play a key role in diverse physiological processes and are the targets of almost two-thirds of the marketed drugs. The 3D structures of GPCRs are largely unavailable; however, a large number of GPCR primary sequences are known. To facilitate the identification and charact...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-420

    authors: Peng ZL,Yang JY,Chen X

    update_date:2010-08-09 00:00:00

  • Performance of a genetic algorithm for mass spectrometry proteomics.

    abstract:BACKGROUND:Recently, mass spectrometry data have been mined using a genetic algorithm to produce discriminatory models that distinguish healthy individuals from those with cancer. This algorithm is the basis for claims of 100% sensitivity and specificity in two related publicly available datasets. To date, no detailed ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-180

    authors: Jeffries NO

    update_date:2004-11-19 00:00:00

  • Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model.

    abstract:BACKGROUND:Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-272

    authors: He X,Sarma MS,Ling X,Chee B,Zhai C,Schatz B

    update_date:2010-05-20 00:00:00

  • A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

    abstract:BACKGROUND:The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire geno...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2077-6

    authors: Wang W,Sun W,Wang W,Szatkiewicz J

    update_date:2018-03-01 00:00:00

  • libgapmis: extending short-read alignments.

    abstract:BACKGROUND:A wide variety of short-read alignment programmes have been published recently to tackle the problem of mapping millions of short reads to a reference genome, focusing on different aspects of the procedure such as time and memory efficiency, sensitivity, and accuracy. These tools allow for a small number of ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S11-S4

    authors: Alachiotis N,Berger S,Flouri T,Pissis SP,Stamatakis A

    update_date:2013-01-01 00:00:00

  • Approaching the taxonomic affiliation of unidentified sequences in public databases--an example from the mycorrhizal fungi.

    abstract:BACKGROUND:During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly so for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-178

    authors: Nilsson RH,Kristiansson E,Ryberg M,Larsson KH

    update_date:2005-07-18 00:00:00

  • CollapsABEL: an R library for detecting compound heterozygote alleles in genome-wide association studies.

    abstract:BACKGROUND:Compound Heterozygosity (CH) in classical genetics is the presence of two different recessive mutations at a particular gene locus. A relaxed form of CH alleles may account for an essential proportion of the missing heritability, i.e. heritability of phenotypes so far not accounted for by single genetic vari...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1006-9

    authors: Zhong K,Karssen LC,Kayser M,Liu F

    update_date:2016-04-08 00:00:00

  • Repliscan: a tool for classifying replication timing regions.

    abstract:BACKGROUND:Replication timing experiments that use label incorporation and high throughput sequencing produce peaked data similar to ChIP-Seq experiments. However, the differences in experimental design, coverage density, and possible results make traditional ChIP-Seq analysis methods inappropriate for use with replica...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1774-x

    authors: Zynda GJ,Song J,Concia L,Wear EE,Hanley-Bowdoin L,Thompson WF,Vaughn MW

    update_date:2017-08-07 00:00:00

  • PESM: predicting the essentiality of miRNAs based on gradient boosting machines and sequences.

    abstract:BACKGROUND:MicroRNAs (miRNAs) are a kind of small noncoding RNA molecules that are direct posttranscriptional regulations of mRNA targets. Studies have indicated that miRNAs play key roles in complex diseases by taking part in many biological processes, such as cell growth, cell death and so on. Therefore, in order to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3426-9

    authors: Yan C,Wu FX,Wang J,Duan G

    update_date:2020-03-18 00:00:00

  • The Lair: a resource for exploratory analysis of published RNA-Seq data.

    abstract:Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1357-2

    authors: Pimentel H,Sturmfels P,Bray N,Melsted P,Pachter L

    update_date:2016-12-01 00:00:00

  • Mutation status coupled with RNA-sequencing data can efficiently identify important non-significantly mutated genes serving as diagnostic biomarkers of endometrial cancer.

    abstract:BACKGROUND:Endometrial cancers (ECs) are one of the most common types of malignant tumor in females. Substantial efforts had been made to identify significantly mutated genes (SMGs) in ECs and use them as biomarkers for the classification of histological subtypes and the prediction of clinical outcomes. However, the im...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1891-6

    authors: Liu K,He L,Liu Z,Xu J,Liu Y,Kuang Q,Wen Z,Li M

    update_date:2017-12-28 00:00:00

  • A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm.

    abstract:BACKGROUND:The process of horizontal gene transfer (HGT) is believed to be widespread in Bacteria and Archaea, but little comparative data is available addressing its occurrence in complete microbial genomes. Collection of high-quality, automated HGT prediction data based on phylogenetic evidence has previously been im...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-419

    authors: Podell S,Gaasterland T,Allen EE

    update_date:2008-10-07 00:00:00

  • GenNon-h: generating multiple sequence alignments on nonhomogeneous phylogenetic trees.

    abstract:BACKGROUND:A number of software packages are available to generate DNA multiple sequence alignments (MSAs) evolved under continuous-time Markov processes on phylogenetic trees. On the other hand, methods of simulating the DNA MSA directly from the transition matrices do not exist. Moreover, existing software restricts ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-216

    authors: Kedzierska AM,Casanellas M

    update_date:2012-08-28 00:00:00

  • Natural computation meta-heuristics for the in silico optimization of microbial strains.

    abstract:BACKGROUND:One of the greatest challenges in Metabolic Engineering is to develop quantitative models and algorithms to identify a set of genetic manipulations that will result in a microbial strain with a desirable metabolic phenotype which typically means having a high yield/productivity. This challenge is not only du...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-499

    authors: Rocha M,Maia P,Mendes R,Pinto JP,Ferreira EC,Nielsen J,Patil KR,Rocha I

    update_date:2008-11-27 00:00:00

  • Rigorous assessment and integration of the sequence and structure based features to predict hot spots.

    abstract:BACKGROUND:Systematic mutagenesis studies have shown that only a few interface residues termed hot spots contribute significantly to the binding free energy of protein-protein interactions. Therefore, hot spots prediction becomes increasingly important for well understanding the essence of proteins interactions and hel...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-311

    authors: Chen R,Chen W,Yang S,Wu D,Wang Y,Tian Y,Shi Y

    update_date:2011-07-29 00:00:00

  • Inferring latent task structure for Multitask Learning by Multiple Kernel Learning.

    abstract:BACKGROUND:The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S8-S5

    authors: Widmer C,Toussaint NC,Altun Y,Rätsch G

    update_date:2010-10-26 00:00:00

  • Snpdat: easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms.

    abstract:BACKGROUND:Single nucleotide polymorphisms (SNPs) are the most abundant genetic variant found in vertebrates and invertebrates. SNP discovery has become a highly automated, robust and relatively inexpensive process allowing the identification of many thousands of mutations for model and non-model organisms. Annotating ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-45

    authors: Doran AG,Creevey CJ

    update_date:2013-02-08 00:00:00

  • Promoter prediction in E. coli based on SIDD profiles and Artificial Neural Networks.

    abstract:BACKGROUND:One of the major challenges in biology is the correct identification of promoter regions. Computational methods based on motif searching have been the traditional approach taken. Recent studies have shown that DNA structural properties, such as curvature, stacking energy, and stress-induced duplex destabiliz...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S6-S17

    authors: Bland C,Newsome AS,Markovets AA

    update_date:2010-10-07 00:00:00

  • IRSS: a web-based tool for automatic layout and analysis of IRES secondary structure prediction and searching system in silico.

    abstract:BACKGROUND:Internal ribosomal entry sites (IRESs) provide alternative, cap-independent translation initiation sites in eukaryotic cells. IRES elements are important factors in viral genomes and are also useful tools for bi-cistronic expression vectors. Most existing RNA structure prediction programs are unable to deal ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-160

    authors: Wu TY,Hsieh CC,Hong JJ,Chen CY,Tsai YS

    update_date:2009-05-27 00:00:00

  • MultiDCoX: Multi-factor analysis of differential co-expression.

    abstract:BACKGROUND:Differential co-expression (DCX) signifies change in degree of co-expression of a set of genes among different biological conditions. It has been used to identify differential co-expression networks or interactomes. Many algorithms have been developed for single-factor differential co-expression analysis and...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1963-7

    authors: Liany H,Rajapakse JC,Karuturi RKM

    update_date:2017-12-28 00:00:00

  • Missing genes in the annotation of prokaryotic genomes.

    abstract:BACKGROUND:Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question ari...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-131

    authors: Warren AS,Archuleta J,Feng WC,Setubal JC

    update_date:2010-03-15 00:00:00