Random generalized linear model: a highly accurate and interpretable ensemble predictor.

Abstract:

BACKGROUND: Ensemble predictors such as the random forest are known to have superior accuracy, but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM-based ensemble predictors. Because limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.

RESULTS: Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM-based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods, including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a "thinned" ensemble predictor (involving few features) that retains excellent predictive accuracy.

CONCLUSION: RGLM is a state-of-the-art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
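The ensemble construction described in the abstract (bootstrap aggregation, random subspace, forward variable selection, selection-based variable importance) can be illustrated with a minimal Python sketch. This is not the randomGLM package or the authors' implementation. It assumes a Gaussian outcome fitted by ordinary least squares, omits the optional interaction terms, and uses correlation screening and an AIC stopping rule as illustrative choices. All function and parameter names (`fit_rglm`, `predict_rglm`, `forward_select`, `n_features_in_bag`, `n_candidates`, `max_terms`) are hypothetical.

```python
import numpy as np


def _fit_ols(X, y, cols):
    """Least-squares fit with an intercept on the given columns; returns (beta, RSS)."""
    A = np.hstack([np.ones((len(y), 1)), X[:, np.asarray(cols, dtype=int)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, float(resid @ resid)


def _aic(rss, n, k):
    # Gaussian AIC up to an additive constant; k counts fitted coefficients.
    return n * np.log(max(rss, 1e-12) / n) + 2 * k


def forward_select(X, y, candidates, max_terms):
    """Greedy forward selection: add the column that lowers AIC most, stop when none does."""
    n = len(y)
    chosen = []
    _, rss = _fit_ols(X, y, chosen)
    best_aic = _aic(rss, n, 1)
    while len(chosen) < max_terms:
        trials = []
        for j in candidates:
            if j in chosen:
                continue
            _, rss = _fit_ols(X, y, chosen + [j])
            trials.append((_aic(rss, n, len(chosen) + 2), j))
        if not trials:
            break
        aic, j = min(trials)
        if aic >= best_aic:
            break
        best_aic = aic
        chosen.append(j)
    return chosen


def fit_rglm(X, y, n_bags=50, n_features_in_bag=None, n_candidates=10,
             max_terms=5, seed=0):
    """Fit one forward-selected linear model per bootstrap sample and feature subset."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = n_features_in_bag or p  # assumption: use all features unless told otherwise
    models = []
    counts = np.zeros(p, dtype=int)  # importance: how often each feature is selected
    for _ in range(n_bags):
        rows = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        cols = rng.choice(p, size=m, replace=False)  # random subspace of features
        Xb, yb = X[rows], y[rows]
        # Assumed screening step: keep the features most correlated with the outcome.
        corr = np.nan_to_num(np.abs([np.corrcoef(Xb[:, j], yb)[0, 1] for j in cols]))
        top = [int(j) for j in cols[np.argsort(-corr)[:n_candidates]]]
        chosen = forward_select(Xb, yb, top, max_terms)
        beta, _ = _fit_ols(Xb, yb, chosen)
        models.append((chosen, beta))
        for j in chosen:
            counts[j] += 1
    return models, counts


def predict_rglm(models, X):
    """Average the predictions of all bagged models."""
    preds = [beta[0] + X[:, np.asarray(ch, dtype=int)] @ beta[1:] for ch, beta in models]
    return np.mean(preds, axis=0)
```

In this sketch, the selection counts returned by `fit_rglm` play the role of a variable importance measure. Refitting with only the most frequently selected features would correspond loosely to the "thinned" predictor mentioned above.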

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Song L,Langfelder P,Horvath S

doi

10.1186/1471-2105-14-5

pub_date

2013-01-16 00:00:00

pages

5

issn

1471-2105

pii

1471-2105-14-5

journal_volume

14

pub_type

Journal Article
  • Model based heritability scores for high-throughput sequencing data.

    abstract:BACKGROUND:Heritability of a phenotypic or molecular trait measures the proportion of variance that is attributable to genotypic variance. It is an important concept in breeding and genetics. Few methods are available for calculating heritability for traits derived from high-throughput sequencing. RESULTS:We propose s...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1539-6

    authors: Rudra P,Shi WJ,Vestal B,Russell PH,Odell A,Dowell RD,Radcliffe RA,Saba LM,Kechris K

    update_date:2017-03-02 00:00:00

  • iMEGES: integrated mental-disorder GEnome score by deep neural network for prioritizing the susceptibility genes for mental disorders in personal genomes.

    abstract:BACKGROUND:A range of rare and common genetic variants have been discovered to be potentially associated with mental diseases, but many more have not been uncovered. Powerful integrative methods are needed to systematically prioritize both variants and genes that confer susceptibility to mental diseases in personal gen...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2469-7

    authors: Khan A,Liu Q,Wang K

    update_date:2018-12-28 00:00:00

  • Bioinformatics research in the Asia Pacific: a 2007 update.

    abstract:We provide a 2007 update on the bioinformatics research in the Asia-Pacific from the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation set up in 1998. From 2002, APBioNet has organized the first International Conference on Bioinformatics (InCoB) bringing together scientists work...

    journal_title:BMC bioinformatics

    pub_type:

    doi:10.1186/1471-2105-9-S1-S1

    authors: Ranganathan S,Gribskov M,Tan TW

    update_date:2008-01-01 00:00:00

  • Restricted DCJ-indel model: sorting linear genomes with DCJ and indels.

    abstract:BACKGROUND:The double-cut-and-join (DCJ) is a model that is able to efficiently sort a genome into another, generalizing the typical mutations (inversions, fusions, fissions, translocations) to which genomes are subject, but allowing the existence of circular chromosomes at the intermediate steps. In the general model ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S19-S14

    authors: da Silva PH,Machado R,Dantas S,Braga MD

    update_date:2012-01-01 00:00:00

  • MiRFinder: an improved approach and software implementation for genome-wide fast microRNA precursor scans.

    abstract:BACKGROUND:MicroRNAs (miRNAs) are recognized as one of the most important families of non-coding RNAs that serve as important sequence-specific post-transcriptional regulators of gene expression. Identification of miRNAs is an important requirement for understanding the mechanisms of post-transcriptional regulation. Hu...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-341

    authors: Huang TH,Fan B,Rothschild MF,Hu ZL,Li K,Zhao SH

    update_date:2007-09-17 00:00:00

  • The rise and fall of breakpoint reuse depending on genome resolution.

    abstract:BACKGROUND:During evolution, large-scale genome rearrangements of chromosomes shuffle the order of homologous genome sequences ("synteny blocks") across species. Some years ago, a controversy erupted in genome rearrangement studies over whether rearrangements recur, causing breakpoints to be reused. METHODS:We investi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S9-S1

    authors: Attie O,Darling AE,Yancopoulos S

    update_date:2011-10-05 00:00:00

  • AntiBP2: improved version of antibacterial peptide prediction.

    abstract:BACKGROUND:Antibacterial peptides are one of the effecter molecules of innate immune system. Over the last few decades several antibacterial peptides have successfully approved as drug by FDA, which has prompted an interest in these antibacterial peptides. In our recent study we analyzed 999 antibacterial peptides, whi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S1-S19

    authors: Lata S,Mishra NK,Raghava GP

    update_date:2010-01-18 00:00:00

  • Improved identification of conserved cassette exons using Bayesian networks.

    abstract:BACKGROUND:Alternative splicing is a major contributor to the diversity of eukaryotic transcriptomes and proteomes. Currently, large scale detection of alternative splicing using expressed sequence tags (ESTs) or microarrays does not capture all alternative splicing events. Moreover, for many species genomic data is be...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-477

    authors: Sinha R,Hiller M,Pudimat R,Gausmann U,Platzer M,Backofen R

    update_date:2008-11-12 00:00:00

  • Graph based fusion of miRNA and mRNA expression data improves clinical outcome prediction in prostate cancer.

    abstract:BACKGROUND:One of the main goals in cancer studies including high-throughput microRNA (miRNA) and mRNA data is to find and assess prognostic signatures capable of predicting clinical outcome. Both mRNA and miRNA expression changes in cancer diseases are described to reflect clinical characteristics like staging and pro...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-488

    authors: Gade S,Porzelius C,Fälth M,Brase JC,Wuttig D,Kuner R,Binder H,Sültmann H,Beissbarth T

    update_date:2011-12-21 00:00:00

  • An iterative block-shifting approach to retention time alignment that preserves the shape and area of gas chromatography-mass spectrometry peaks.

    abstract:BACKGROUND:Metabolomics, petroleum and biodiesel chemistry, biomarker discovery, and other fields which rely on high-resolution profiling of complex chemical mixtures generate datasets which contain millions of detector intensity readings, each uniquely addressed along dimensions of time (e.g., retention time of chemic...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S9-S15

    authors: Chae M,Shmookler Reis RJ,Thaden JJ

    update_date:2008-08-12 00:00:00

  • Connectivity independent protein-structure alignment: a hierarchical approach.

    abstract:BACKGROUND:Protein-structure alignment is a fundamental tool to study protein function, evolution and model building. In the last decade several methods for structure alignment were introduced, but most of them ignore that structurally similar proteins can share the same spatial arrangement of secondary structure eleme...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-510

    authors: Kolbeck B,May P,Schmidt-Goenner T,Steinke T,Knapp EW

    update_date:2006-11-21 00:00:00

  • Computational algorithms to predict Gene Ontology annotations.

    abstract:BACKGROUND:Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biologi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S6-S4

    authors: Pinoli P,Chicco D,Masseroli M

    update_date:2015-01-01 00:00:00

  • Partitioning of functional gene expression data using principal points.

    abstract:BACKGROUND:DNA microarrays offer motivation and hope for the simultaneous study of variations in multiple genes. Gene expression is a temporal process that allows variations in expression levels with a characterized gene function over a period of time. Temporal gene expression curves can be treated as functional data s...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1860-0

    authors: Kim J,Kim H

    update_date:2017-10-12 00:00:00

  • Comparative study of discretization methods of microarray data for inferring transcriptional regulatory networks.

    abstract:BACKGROUND:Microarray data discretization is a basic preprocess for many algorithms of gene regulatory network inference. Some common discretization methods in informatics are used to discretize microarray data. Selection of the discretization method is often arbitrary and no systematic comparison of different discreti...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-520

    authors: Li Y,Liu L,Bai X,Cai H,Ji W,Guo D,Zhu Y

    update_date:2010-10-19 00:00:00

  • New mini- zincin structures provide a minimal scaffold for members of this metallopeptidase superfamily.

    abstract:BACKGROUND:The Acel_2062 protein from Acidothermus cellulolyticus is a protein of unknown function. Initial sequence analysis predicted that it was a metallopeptidase from the presence of a motif conserved amongst the Asp-zincins, which are peptidases that contain a single, catalytic zinc ion ligated by the histidines ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-1

    authors: Trame CB,Chang Y,Axelrod HL,Eberhardt RY,Coggill P,Punta M,Rawlings ND

    update_date:2014-01-03 00:00:00

  • A novel similarity-measure for the analysis of genetic data in complex phenotypes.

    abstract:BACKGROUND:Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effective processing of these data has not been equally as fast. In particular, Machin...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S6-S24

    authors: Lagani V,Montesanto A,Di Cianni F,Moreno V,Landi S,Conforti D,Rose G,Passarino G

    update_date:2009-06-16 00:00:00

  • Insertion and deletion correcting DNA barcodes based on watermarks.

    abstract:BACKGROUND:Barcode multiplexing is a key strategy for sharing the rising capacity of next-generation sequencing devices: Synthetic DNA tags, called barcodes, are attached to natural DNA fragments within the library preparation procedure. Different libraries, can individually be labeled with barcodes for a joint sequenc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0482-7

    authors: Kracht D,Schober S

    update_date:2015-02-18 00:00:00

  • Thresher: determining the number of clusters while removing outliers.

    abstract:BACKGROUND:Cluster analysis is the most common unsupervised method for finding hidden groups in data. Clustering presents two main challenges: (1) finding the optimal number of clusters, and (2) removing "outliers" among the objects being clustered. Few clustering algorithms currently deal directly with the outlier pro...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1998-9

    authors: Wang M,Abrams ZB,Kornblau SM,Coombes KR

    update_date:2018-01-08 00:00:00

  • Coverage statistics for sequence census methods.

    abstract:BACKGROUND:We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how thi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-430

    authors: Evans SN,Hower V,Pachter L

    update_date:2010-08-18 00:00:00

  • E-CAI: a novel server to estimate an expected value of Codon Adaptation Index (eCAI).

    abstract:BACKGROUND:The Codon Adaptation Index (CAI) is a measure of the synonymous codon usage bias for a DNA or RNA sequence. It quantifies the similarity between the synonymous codon usage of a gene and the synonymous codon frequency of a reference set. Extreme values in the nucleotide or in the amino acid composition have a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-65

    authors: Puigbò P,Bravo IG,Garcia-Vallvé S

    update_date:2008-01-29 00:00:00

  • CNV Workshop: an integrated platform for high-throughput copy number variation discovery and clinical diagnostics.

    abstract:BACKGROUND:Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly ass...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-74

    authors: Gai X,Perin JC,Murphy K,O'Hara R,D'arcy M,Wenocur A,Xie HM,Rappaport EF,Shaikh TH,White PS

    update_date:2010-02-04 00:00:00

  • MACSIMS: multiple alignment of complete sequences information management system.

    abstract:BACKGROUND:In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-318

    authors: Thompson JD,Muller A,Waterhouse A,Procter J,Barton GJ,Plewniak F,Poch O

    update_date:2006-06-23 00:00:00

  • GenomeBlast: a web tool for small genome comparison.

    abstract:BACKGROUND:Comparative genomics has become an essential approach for identifying homologous gene candidates and their functions, and for studying genome evolution. There are many tools available for genome comparisons. Unfortunately, most of them are not applicable for the identification of unique genes and the inferen...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-S4-S18

    authors: Lu G,Jiang L,Helikar RM,Rowley TW,Zhang L,Chen X,Moriyama EN

    update_date:2006-12-12 00:00:00

  • A molecular model of the full-length human NOD-like receptor family CARD domain containing 5 (NLRC5) protein.

    abstract:BACKGROUND:Pattern recognition receptors of the immune system have key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-275

    authors: Mótyán JA,Bagossi P,Benkő S,Tőzsér J

    update_date:2013-09-17 00:00:00

  • BicPAMS: software for biological data analysis with pattern-based biclustering.

    abstract:BACKGROUND:Biclustering has been largely applied for the unsupervised analysis of biological data, being recognised today as a key technique to discover putative modules in both expression data (subsets of genes correlated in subsets of conditions) and network data (groups of coherently interconnected biological entiti...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1493-3

    authors: Henriques R,Ferreira FL,Madeira SC

    update_date:2017-02-02 00:00:00

  • pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties.

    abstract:BACKGROUND:Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. Ho...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-152

    authors: Sarda D,Chua GH,Li KB,Krishnan A

    update_date:2005-06-17 00:00:00

  • Informative gene selection and the direct classification of tumors based on relative simplicity.

    abstract:BACKGROUND:Selecting a parsimonious set of informative genes to build highly generalized performance classifier is the most important task for the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs only used one strategy of vertical c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-0893-0

    authors: Chen Y,Wang L,Li L,Zhang H,Yuan Z

    update_date:2016-01-20 00:00:00

  • libcov: a C++ bioinformatic library to manipulate protein structures, sequence alignments and phylogeny.

    abstract:BACKGROUND:An increasing number of bioinformatics methods are considering the phylogenetic relationships between biological sequences. Implementing new methodologies using the maximum likelihood phylogenetic framework can be a time consuming task. RESULTS:The bioinformatics library libcov is a collection of C++ classe...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-138

    authors: Butt D,Roger AJ,Blouin C

    update_date:2005-06-06 00:00:00

  • Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy.

    abstract:BACKGROUND:Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as si...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-28

    authors: Alexopoulou D,Andreopoulos B,Dietze H,Doms A,Gandon F,Hakenberg J,Khelif K,Schroeder M,Wächter T

    update_date:2009-01-21 00:00:00

  • Statistical assessment and visualization of synergies for large-scale sparse drug combination datasets.

    abstract:BACKGROUND:Drug combinations have the potential to improve efficacy while limiting toxicity. To robustly identify synergistic combinations, high-throughput screens using full dose-response surface are desirable but require an impractical number of data points. Screening of a sparse number of doses per drug allows to sc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2642-7

    authors: Amzallag A,Ramaswamy S,Benes CH

    update_date:2019-02-18 00:00:00