ChemEx: information extraction system for chemical data curation.

Abstract:

BACKGROUND: Manual curation of chemical data from publications is error-prone and time-consuming, and it makes data sets hard to keep up to date. Automatic information extraction can be used as a tool to reduce these problems. Since chemical structures are usually described in images, information extraction needs to combine structure image recognition with text mining. RESULTS: We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. The text annotator extracts compound, organism, and assay entities from text content, while structure image recognition translates chemical raster images into a machine-readable format. A user can view annotated text along with summarized information on compounds, the organisms that produce those compounds, and assay tests. CONCLUSIONS: ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from http://www.biotec.or.th/isl/ChemEx.

journal_name

BMC Bioinformatics

authors

Tharatipyakul A,Numnark S,Wichadakul D,Ingsriswang S

doi

10.1186/1471-2105-13-S17-S9

pub_date

2012-01-01 00:00:00

pages

S9

issn

1471-2105

pii

1471-2105-13-S17-S9

journal_volume

13 Suppl 17

pub_type

Journal Article
  • NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

    abstract:BACKGROUND:Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-014-0357-3

    authors: McCorrison JM,Venepally P,Singh I,Fouts DE,Lasken RS,Methé BA

    update_date:2014-11-19 00:00:00

  • Metabolic network alignment in large scale by network compression.

    abstract:Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far lim...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S3-S2

    authors: Ay F,Dang M,Kahveci T

    update_date:2012-03-21 00:00:00

  • FastGroup: a program to dereplicate libraries of 16S rDNA sequences.

    abstract:BACKGROUND:Ribosomal 16S DNA sequences are an essential tool for identifying and classifying microbes. High-throughput DNA sequencing now makes it economically possible to produce very large datasets of 16S rDNA sequences in short time periods, necessitating new computer tools for analyses. Here we describe FastGroup, ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-2-9

    authors: Seguritan V,Rohwer F

    update_date:2001-01-01 00:00:00

  • Coordinates and intervals in graph-based reference genomes.

    abstract:BACKGROUND:It has been proposed that future reference genomes should be graph structures in order to better represent the sequence diversity present in a species. However, there is currently no standard method to represent genomic intervals, such as the positions of genes or transcription factor binding sites, on graph...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1678-9

    authors: Rand KD,Grytten I,Nederbragt AJ,Storvik GO,Glad IK,Sandve GK

    update_date:2017-05-18 00:00:00

  • Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns.

    abstract:BACKGROUND:With the advent of Next-Generation Sequencing technologies (NGS), a large amount of short read data has been generated. If a reference genome is not available, the assembly of a template sequence is usually challenging because of repeats and the short length of reads. When NGS reads cannot be mapped onto a r...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S9-S1

    authors: Comin M,Schimd M

    update_date:2014-01-01 00:00:00

  • Cell subset prediction for blood genomic studies.

    abstract:BACKGROUND:Genome-wide transcriptional profiling of patient blood samples offers a powerful tool to investigate underlying disease mechanisms and personalized treatment decisions. Most studies are based on analysis of total peripheral blood mononuclear cells (PBMCs), a mixed population. In this case, accuracy is inhere...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-258

    authors: Bolen CR,Uduman M,Kleinstein SH

    update_date:2011-06-24 00:00:00

  • Computational identification of ubiquitylation sites from protein sequences.

    abstract:BACKGROUND:Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods were developed toward effective identification of ubiquitylation sites. To efficiently explore more undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-310

    authors: Tung CW,Ho SY

    update_date:2008-07-15 00:00:00

  • HAT: hypergeometric analysis of tiling-arrays with application to promoter-GeneChip data.

    abstract:BACKGROUND:Tiling-arrays are applicable to multiple types of biological research questions. Due to its advantages (high sensitivity, resolution, unbiased), the technology is often employed in genome-wide investigations. A major challenge in the analysis of tiling-array data is to define regions-of-interest, i.e., conti...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-275

    authors: Taskesen E,Beekman R,de Ridder J,Wouters BJ,Peeters JK,Touw IP,Reinders MJ,Delwel R

    update_date:2010-05-21 00:00:00

  • Meta-aligner: long-read alignment based on genome statistics.

    abstract:BACKGROUND:Current development of sequencing technologies is towards generating longer and noisier reads. Evidently, accurate alignment of these reads plays an important role in any downstream analysis. Similarly, reducing the overall cost of sequencing is related to the time consumption of the aligner. The tradeoff bet...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1518-y

    authors: Nashta-Ali D,Aliyari A,Ahmadian Moghadam A,Edrisi MA,Motahari SA,Hossein Khalaj B

    update_date:2017-02-23 00:00:00

  • Ab-origin: an enhanced tool to identify the sourcing gene segments in germline for rearranged antibodies.

    abstract:BACKGROUND:In the adaptive immune system, variable regions of immunoglobulin (IG) are encoded by random recombination of variable (V), diversity (D), and joining (J) gene segments in the germline. Partitioning the functional antibody sequences to their sourcing germline gene segments is vital not only for understanding...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S12-S20

    authors: Wang X,Wu D,Zheng S,Sun J,Tao L,Li Y,Cao Z

    update_date:2008-12-12 00:00:00

  • Texture based skin lesion abruptness quantification to detect malignancy.

    abstract:BACKGROUND:Abruptness of pigment patterns at the periphery of a skin lesion is one of the most important dermoscopic features for detection of malignancy. In the current clinical setting, abrupt cutoff of a skin lesion is determined by a dermatologist's examination. This process is subjective, nonquantitative, and error-p...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1892-5

    authors: Erol R,Bayraktar M,Kockara S,Kaya S,Halic T

    update_date:2017-12-28 00:00:00

  • Predicting and improving the protein sequence alignment quality by support vector regression.

    abstract:BACKGROUND:For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significant...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-471

    authors: Lee M,Jeong CS,Kim D

    update_date:2007-12-03 00:00:00

  • Logical development of the cell ontology.

    abstract:BACKGROUND:The Cell Ontology (CL) is an ontology for the representation of in vivo cell types. As biological ontologies such as the CL grow in complexity, they become increasingly difficult to use and maintain. By making the information in the ontology computable, we can use automated reasoners to detect errors and ass...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-6

    authors: Meehan TF,Masci AM,Abdulla A,Cowell LG,Blake JA,Mungall CJ,Diehl AD

    update_date:2011-01-05 00:00:00

  • Random forest versus logistic regression: a large-scale benchmark experiment.

    abstract:BACKGROUND AND GOAL:The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. RESULTS:In this conte...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2264-5

    authors: Couronné R,Probst P,Boulesteix AL

    update_date:2018-07-17 00:00:00

  • Evaluation of methods for differential expression analysis on multi-group RNA-seq count data.

    abstract:BACKGROUND:RNA-seq is a powerful tool for measuring transcriptomes, especially for identifying differentially expressed genes or transcripts (DEGs) between sample groups. A number of methods have been developed for this task, and several evaluation studies have also been reported. However, those evaluations so far have...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0794-7

    authors: Tang M,Sun J,Shimizu K,Kadota K

    update_date:2015-11-04 00:00:00

  • Detecting transitions in protein dynamics using a recurrence quantification analysis based bootstrap method.

    abstract:BACKGROUND:Proteins undergo conformational transitions over different time scales. These transitions are closely intertwined with the protein's function. Numerous standard techniques such as principal component analysis are used to detect these transitions in molecular dynamics simulations. In this work, we add a new m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1943-y

    authors: Karain WI

    update_date:2017-11-28 00:00:00

  • Sequencing error correction without a reference genome.

    abstract:BACKGROUND:Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology, however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors due to the high number of nucleotides being sequenced. Identifying sequencing errors fr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-367

    authors: Sleep JA,Schreiber AW,Baumann U

    update_date:2013-12-18 00:00:00

  • Graph regularized L2,1-nonnegative matrix factorization for miRNA-disease association prediction.

    abstract:BACKGROUND:The aberrant expression of microRNAs is closely connected to the occurrence and development of a great deal of human diseases. To study human diseases, numerous effective computational models that are valuable and meaningful have been presented by researchers. RESULTS:Here, we present a computational framew...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3409-x

    authors: Gao Z,Wang YT,Wu QW,Ni JC,Zheng CH

    update_date:2020-02-18 00:00:00

  • BatchPrimer3: a high throughput web application for PCR and sequencing primer design.

    abstract:BACKGROUND:Microsatellite (simple sequence repeat - SSR) and single nucleotide polymorphism (SNP) markers are two types of important genetic markers useful in genetic mapping and genotyping. Often, large-scale genomic research projects require high-throughput computer-assisted primer design. Numerous such web-based or ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-253

    authors: You FM,Huo N,Gu YQ,Luo MC,Ma Y,Hane D,Lazo GR,Dvorak J,Anderson OD

    update_date:2008-05-29 00:00:00

  • A framework for space-efficient read clustering in metagenomic samples.

    abstract:BACKGROUND:A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Si...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1466-6

    authors: Alanko J,Cunial F,Belazzougui D,Mäkinen V

    update_date:2017-03-14 00:00:00

  • LSX: automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference.

    abstract:BACKGROUND:Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We had previously tackled this issue by developing LS3, a data subselection algorithm that, by removing fast-evolving sequences in a gene-specific manner, identifies subsets of sequences that evolve at a r...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3020-1

    authors: Rivera-Rivera CJ,Montoya-Burgos JI

    update_date:2019-08-13 00:00:00

  • Bioinformatics approach to predict target genes for dysregulated microRNAs in hepatocellular carcinoma: study on a chemically-induced HCC mouse model.

    abstract:BACKGROUND:Hepatocellular carcinoma (HCC) is an aggressive epithelial tumor which shows very poor prognosis and high rate of recurrence, representing an urgent problem for public healthcare. MicroRNAs (miRNAs/miRs) are a class of small, non-coding RNAs that attract great attention because of their role in regulation of...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0836-1

    authors: Del Vecchio F,Gallo F,Di Marco A,Mastroiaco V,Caianiello P,Zazzeroni F,Alesse E,Tessitore A

    update_date:2015-12-10 00:00:00

  • ILP-based maximum likelihood genome scaffolding.

    abstract:BACKGROUND:Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies which generate relatively short reads resulting in highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S9-S9

    authors: Lindsay J,Salooti H,Măndoiu I,Zelikovsky A

    update_date:2014-01-01 00:00:00

  • In silico design of targeted SRM-based experiments.

    abstract:Selected reaction monitoring (SRM)-based proteomics approaches enable highly sensitive and reproducible assays for profiling of thousands of peptides in one experiment. The development of such assays involves the determination of retention time, detectability and fragmentation properties of peptides, followed by an op...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S16-S8

    authors: Nahnsen S,Kohlbacher O

    update_date:2012-01-01 00:00:00

  • Leveraging TCGA gene expression data to build predictive models for cancer drug response.

    abstract:BACKGROUND:Machine learning has been utilized to predict cancer drug response from multi-omics data generated from sensitivities of cancer cell lines to different therapeutic compounds. Here, we build machine learning models using gene expression data from patients' primary tumor tissues to predict whether a patient wi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03690-4

    authors: Clayton EA,Pujol TA,McDonald JF,Qiu P

    update_date:2020-09-30 00:00:00

  • NeurphologyJ: an automatic neuronal morphology quantification method and its application in pharmacological discovery.

    abstract:BACKGROUND:Automatic quantification of neuronal morphology from images of fluorescence microscopy plays an increasingly important role in high-content screenings. However, there exist very few freeware tools and methods which provide automatic neuronal morphology quantification for pharmacological discovery. RESULTS:T...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-230

    authors: Ho SY,Chao CY,Huang HL,Chiu TW,Charoenkwan P,Hwang E

    update_date:2011-06-08 00:00:00

  • Francisella tularensis novicida proteomic and transcriptomic data integration and annotation based on semantic web technologies.

    abstract:BACKGROUND:This paper summarises the lessons and experiences gained from a case study of the application of semantic web technologies to the integration of data from the bacterial species Francisella tularensis novicida (Fn). Fn data sources are disparate and heterogeneous, as multiple laboratories across the world, us...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S10-S3

    authors: Anwar N,Hunt E

    update_date:2009-10-01 00:00:00

  • WellInverter: a web application for the analysis of fluorescent reporter gene data.

    abstract:BACKGROUND:Fluorescent reporter genes have become widely used for monitoring gene expression in living cells. When a microbial strain carrying a reporter gene is grown in a microplate reader, the fluorescence and the absorbance (optical density) of the culture can be automatically measured every few minutes in a highly...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2920-4

    authors: Martin Y,Page M,Blanchet C,de Jong H

    update_date:2019-06-11 00:00:00

  • Efficient use of unlabeled data for protein sequence classification: a comparative study.

    abstract:BACKGROUND:Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved acc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S4-S2

    authors: Kuksa P,Huang PH,Pavlovic V

    update_date:2009-04-29 00:00:00

  • SplicerAV: a tool for mining microarray expression data for changes in RNA processing.

    abstract:BACKGROUND:Over the past two decades more than fifty thousand unique clinical and biological samples have been assayed using the Affymetrix HG-U133 and HG-U95 GeneChip microarray platforms. This substantial repository has been used extensively to characterize changes in gene expression between biological samples, but h...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-108

    authors: Robinson TJ,Dinan MA,Dewhirst M,Garcia-Blanco MA,Pearson JL

    update_date:2010-02-25 00:00:00