Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection.

Abstract:

BACKGROUND: When constructing new biomarker or gene signature scores for time-to-event outcomes, the underlying aims are to develop a discrimination model that helps to predict whether patients have a poor or good prognosis and to identify the most influential variables for this task. In practice, this is often done by fitting Cox models. These are, however, not necessarily optimal with respect to the resulting discriminatory power and are based on restrictive assumptions. We present a combined approach to automatically select and fit sparse discrimination models for potentially high-dimensional survival data based on boosting a smooth version of the concordance index (C-index). Due to this objective function, the resulting prediction models are optimal with respect to their ability to discriminate between patients with longer and shorter survival times. The gradient boosting algorithm is combined with the stability selection approach to enhance and control its variable selection properties.

RESULTS: The resulting algorithm fits prediction models based on the rankings of the survival times and automatically selects only the most stable predictors. The performance of the approach, which works best for small numbers of informative predictors, is demonstrated in a large-scale simulation study: C-index boosting in combination with stability selection is able to identify a small subset of informative predictors from a much larger set of non-informative ones while controlling the per-family error rate. In an application to discover biomarkers for breast cancer patients based on gene expression data, stability selection yielded sparser models and the resulting discriminatory power was higher than with lasso-penalized Cox regression models.

CONCLUSION: The combination of stability selection and C-index boosting can be used to select small numbers of informative biomarkers and to derive new prediction rules that are optimal with respect to their discriminatory power. Stability selection controls the per-family error rate, which makes the new approach also appealing from an inferential point of view, as it provides an alternative to classical hypothesis tests for single predictor effects. Due to the shrinkage and variable selection properties of statistical boosting algorithms, the latter tests are typically infeasible for prediction models fitted by boosting.
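To make the idea of a "smooth version of the concordance index" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a higher score means higher risk (shorter survival), replaces the pairwise indicator with a sigmoid controlled by a hypothetical smoothing parameter `sigma`, and omits any weighting for censoring. All function names are illustrative.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def comparable_pairs(times, events):
    """Pairs (i, j) where i has an observed event and a shorter time than j."""
    n = len(times)
    return [(i, j)
            for i in range(n) if events[i]
            for j in range(n) if times[i] < times[j]]


def c_index(times, events, risk):
    """Unweighted concordance: share of comparable pairs ranked correctly.

    Assumption: a larger `risk` value predicts a shorter survival time.
    """
    pairs = comparable_pairs(times, events)
    if not pairs:
        raise ValueError("no comparable pairs")
    hits = 0.0
    for i, j in pairs:
        if risk[i] > risk[j]:
            hits += 1.0      # correctly ordered pair
        elif risk[i] == risk[j]:
            hits += 0.5      # ties count half
    return hits / len(pairs)


def smooth_c_index(times, events, risk, sigma=0.1):
    """Differentiable surrogate: the indicator is replaced by a sigmoid.

    `sigma` (hypothetical name) controls the smoothness; as sigma -> 0
    the surrogate approaches the hard pairwise indicator.
    """
    pairs = comparable_pairs(times, events)
    if not pairs:
        raise ValueError("no comparable pairs")
    total = sum(_sigmoid((risk[i] - risk[j]) / sigma) for i, j in pairs)
    return total / len(pairs)
```

Because the smoothed criterion is differentiable in the scores, it can in principle serve as the objective of a gradient-based procedure such as gradient boosting, whereas the hard indicator version cannot.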

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Mayr A,Hofner B,Schmid M

doi

10.1186/s12859-016-1149-8

pub_date

2016-07-22

pages

288

issn

1471-2105

pii

10.1186/s12859-016-1149-8

journal_volume

17

pub_type

Journal Article
  • Automatic localization and identification of mitochondria in cellular electron cryo-tomography using faster-RCNN.

    abstract:BACKGROUND:Cryo-electron tomography (cryo-ET) enables the 3D visualization of cellular organization in near-native state which plays important roles in the field of structural cell biology. However, due to the low signal-to-noise ratio (SNR), large volume and high content complexity within cells, it remains difficult a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2650-7

    authors: Li R,Zeng X,Sigmund SE,Lin R,Zhou B,Liu C,Wang K,Jiang R,Freyberg Z,Lv H,Xu M

    update_date:2019-03-29

  • REGULATOR: a database of metazoan transcription factors and maternal factors for developmental studies.

    abstract:BACKGROUND:Genes encoding transcription factors that constitute gene-regulatory networks and maternal factors accumulating in egg cytoplasm are two classes of essential genes that play crucial roles in developmental processes. Transcription factors control the expression of their downstream target genes by interacting ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0552-x

    authors: Wang K,Nishida H

    update_date:2015-04-10

  • GObar: a gene ontology based analysis and visualization tool for gene sets.

    abstract:BACKGROUND:Microarray experiments, as well as other genomic analyses, often result in large gene sets containing up to several hundred genes. The biological significance of such sets of genes is, usually, not readily apparent. Identification of the functions of the genes in the set can help highlight features of intere...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-189

    authors: Lee JS,Katari G,Sachidanandam R

    update_date:2005-07-25

  • The PowerAtlas: a power and sample size atlas for microarray experimental design and research.

    abstract:BACKGROUND:Microarrays permit biologists to simultaneously measure the mRNA abundance of thousands of genes. An important issue facing investigators planning microarray experiments is how to estimate the sample size required for good statistical power. What is the projected sample size or number of replicate chips need...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-84

    authors: Page GP,Edwards JW,Gadbury GL,Yelisetti P,Wang J,Trivedi P,Allison DB

    update_date:2006-02-22

  • Supervised segmentation of phenotype descriptions for the human skeletal phenome using hybrid methods.

    abstract:BACKGROUND:Over the course of the last few years there has been a significant amount of research performed on ontology-based formalization of phenotype descriptions. In order to fully capture the intrinsic value and knowledge expressed within them, we need to take advantage of their inner structure, which implicitly co...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-265

    authors: Groza T,Hunter J,Zankl A

    update_date:2012-10-15

  • A multiobjective approach to the genetic code adaptability problem.

    abstract:BACKGROUND:The organization of the canonical code has intrigued researchers since it was first described. If we consider all codes mapping the 64 codons into 20 amino acids and one stop codon, there are more than 1.51×10^84 possible genetic codes. The main question related to the organization of the genetic code is why ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0480-9

    authors: de Oliveira LL,de Oliveira PS,Tinós R

    update_date:2015-02-19

  • Probe-level linear model fitting and mixture modeling results in high accuracy detection of differential gene expression.

    abstract:BACKGROUND:The identification of differentially expressed genes (DEGs) from Affymetrix GeneChips arrays is currently done by first computing expression levels from the low-level probe intensities, then deriving significance by comparing these expression levels between conditions. The proposed PL-LM (Probe-Level Linear ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-391

    authors: Lemieux S

    update_date:2006-08-25

  • Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots.

    abstract:BACKGROUND:Analyses of molecular high-throughput data often lack robustness, i.e. results are very sensitive to the addition or removal of a single observation. Therefore, the identification of extreme observations is an important step of quality control before doing further data analysis. Standard outlier detection...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1645-5

    authors: Kruppa J,Jung K

    update_date:2017-05-02

  • Shared data science infrastructure for genomics data.

    abstract:BACKGROUND:Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data co...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2967-2

    authors: Bagheri H,Muppirala U,Masonbrink RE,Severin AJ,Rajan H

    update_date:2019-08-22

  • Graph-representation of oxidative folding pathways.

    abstract:BACKGROUND:The process of oxidative folding combines the formation of native disulfide bond with conformational folding resulting in the native three-dimensional fold. Oxidative folding pathways can be described in terms of disulfide intermediate species (DIS) which can also be isolated and characterized. Each DIS corr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-19

    authors: Agoston V,Cemazar M,Kaján L,Pongor S

    update_date:2005-01-27

  • Inferring the role of transcription factors in regulatory networks.

    abstract:BACKGROUND:Expression profiles obtained from multiple perturbation experiments are increasingly used to reconstruct transcriptional regulatory networks, from well studied, simple organisms up to higher eukaryotes. Admittedly, a key ingredient in developing a reconstruction method is its ability to integrate heterogeneo...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-228

    authors: Veber P,Guziolowski C,Le Borgne M,Radulescu O,Siegel A

    update_date:2008-05-06

  • Software for selecting the most informative sets of genomic loci for multi-target microbial typing.

    abstract:BACKGROUND:High-throughput sequencing can identify numerous potential genomic targets for microbial strain typing, but identification of the most informative combinations requires the use of computational screening tools. This paper describes novel software-- Automated Selection of Typing Target Subsets (AuSeTTS)--that...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-148

    authors: O'Sullivan MV,Sintchenko V,Gilbert GL

    update_date:2013-05-01

  • Primary orthologs from local sequence context.

    abstract:BACKGROUND:The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes don't code for proteins, yielding a need to infer evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-, as opposed to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3384-2

    authors: Gao K,Miller J

    update_date:2020-02-06

  • MicroSyn: a user friendly tool for detection of microsynteny in a gene family.

    abstract:BACKGROUND:The traditional phylogeny analysis within a gene family is mainly based on DNA or amino acid sequence homologies. However, these phylogenetic tree analyses are not suitable for "non-traditional" gene families like microRNA with very short sequences. For the normal protein-coding gene families, low bootst...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-79

    authors: Cai B,Yang X,Tuskan GA,Cheng ZM

    update_date:2011-03-18

  • Cell subset prediction for blood genomic studies.

    abstract:BACKGROUND:Genome-wide transcriptional profiling of patient blood samples offers a powerful tool to investigate underlying disease mechanisms and personalized treatment decisions. Most studies are based on analysis of total peripheral blood mononuclear cells (PBMCs), a mixed population. In this case, accuracy is inhere...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-258

    authors: Bolen CR,Uduman M,Kleinstein SH

    update_date:2011-06-24

  • Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis.

    abstract:BACKGROUND:Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret genes of interest after retrieving f...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-178

    authors: Baty F,Rüdiger J,Miglino N,Kern L,Borger P,Brutsche M

    update_date:2013-06-06

  • DART: Denoising Algorithm based on Relevance network Topology improves molecular pathway activity inference.

    abstract:BACKGROUND:Inferring molecular pathway activity is an important step towards reducing the complexity of genomic data, understanding the heterogeneity in clinical outcome, and obtaining molecular correlates of cancer imaging traits. Increasingly, approaches towards pathway activity inference combine molecular profiles (...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-403

    authors: Jiao Y,Lawler K,Patel GS,Purushotham A,Jones AF,Grigoriadis A,Tutt A,Ng T,Teschendorff AE

    update_date:2011-10-19

  • EGenBio: a data management system for evolutionary genomics and biodiversity.

    abstract:BACKGROUND:Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; http://egenbio.lsu.edu) to begin to address this....

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-S2-S7

    authors: Nahum LA,Reynolds MT,Wang ZO,Faith JJ,Jonna R,Jiang ZJ,Meyer TJ,Pollock DD

    update_date:2006-09-06

  • FastGroup: a program to dereplicate libraries of 16S rDNA sequences.

    abstract:BACKGROUND:Ribosomal 16S DNA sequences are an essential tool for identifying and classifying microbes. High-throughput DNA sequencing now makes it economically possible to produce very large datasets of 16S rDNA sequences in short time periods, necessitating new computer tools for analyses. Here we describe FastGroup, ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-2-9

    authors: Seguritan V,Rohwer F

    update_date:2001-01-01

  • ProLego: tool for extracting and visualizing topological modules in protein structures.

    abstract:BACKGROUND:In protein design, correct use of topology is among the initial and most critical features. Meticulous selection of backbone topology aids in drastically reducing the structure search space. With ProLego, we present a server application to explore the component aspect of protein structures and provide an intu...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2171-9

    authors: Khan T,Panday SK,Ghosh I

    update_date:2018-05-04

  • Accelerating a cross-correlation score function to search modifications using a single GPU.

    abstract:BACKGROUND:A cross-correlation (XCorr) score function is one of the most popular score functions utilized to search peptide identifications in databases, and many computer programs, such as SEQUEST, Comet, and Tide, currently use this score function. Recently, the HiXCorr algorithm was developed to speed up this score ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2559-6

    authors: Kim H,Han S,Um JH,Park K

    update_date:2018-12-12

  • GLIDERS--a web-based search engine for genome-wide linkage disequilibrium between HapMap SNPs.

    abstract:BACKGROUND:A number of tools for the examination of linkage disequilibrium (LD) patterns between nearby alleles exist, but none are available for quickly and easily investigating LD at longer ranges (>500 kb). We have developed a web-based query tool (GLIDERS: Genome-wide LInkage DisEquilibrium Repository and Search en...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-367

    authors: Lawrence R,Day-Williams AG,Mott R,Broxholme J,Cardon LR,Zeggini E

    update_date:2009-10-31

  • Assessing and predicting protein interactions by combining manifold embedding with multiple information integration.

    abstract:BACKGROUND:Protein-protein interactions (PPIs) play crucial roles in virtually every aspect of cellular function within an organism. Over the last decade, the development of novel high-throughput techniques has resulted in enormous amounts of data and provided valuable resources for studying protein interactions. Howev...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S7-S3

    authors: Lei YK,You ZH,Ji Z,Zhu L,Huang DS

    update_date:2012-05-08

  • A CoD-based stationary control policy for intervening in large gene regulatory networks.

    abstract:BACKGROUND:One of the most important goals of the mathematical modeling of gene regulatory networks is to alter their behavior toward desirable phenotypes. Therapeutic techniques are derived for intervention in terms of stationary control policies. In large networks, it becomes computationally burdensome to derive an o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S10-S10

    authors: Ghaffari N,Ivanov I,Qian X,Dougherty ER

    update_date:2011-10-18

  • Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses.

    abstract:BACKGROUND:The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S1-S7

    authors: Miotto O,Tan TW,Brusic V

    update_date:2008-01-01

  • Metabolite coupling in genome-scale metabolic networks.

    abstract:BACKGROUND:Biochemically detailed stoichiometric matrices have now been reconstructed for various bacteria, yeast, and for the human cardiac mitochondrion based on genomic and proteomic data. These networks have been manually curated based on legacy data and elementally and charge balanced. Comparative analysis of thes...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-111

    authors: Becker SA,Price ND,Palsson BØ

    update_date:2006-03-06

  • Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis.

    abstract:BACKGROUND:In mass spectrometry (MS) based proteomic data analysis, peak detection is an essential step for subsequent analysis. Recently, there has been significant progress in the development of various peak detection algorithms. However, neither a comprehensive survey nor an experimental comparison of these algorith...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-4

    authors: Yang C,He Z,Yu W

    update_date:2009-01-06

  • GeneBins: a database for classifying gene expression data, with application to plant genome arrays.

    abstract:BACKGROUND:To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms. RESULTS:We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on KEGG o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-87

    authors: Goffard N,Weiller G

    update_date:2007-03-12

  • LncRNA HOTAIR-mediated Wnt/β-catenin network modeling to predict and validate therapeutic targets for cartilage damage.

    abstract:BACKGROUND:Cartilage damage is a crucial feature involved in several pathological conditions characterized by joint disorders, such as osteoarthritis and rheumatoid arthritis. Accumulating evidence shows that the Wnt/β-catenin pathway plays a role in the pathogenesis of cartilage damage. In addition, it is experimentally ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2981-4

    authors: Zhou W,He X,Chen Z,Fan D,Wang Y,Feng H,Zhang G,Lu A,Xiao L

    update_date:2019-07-31

  • IPRStats: visualization of the functional potential of an InterProScan run.

    abstract:BACKGROUND:InterPro is a collection of protein signatures for the classification and automated annotation of proteins. InterProScan is a software tool that scans protein sequences against InterPro member databases using a variety of profile-based, hidden Markov model and position-specific scoring matrix methods. It not...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S12-S13

    authors: Kelly RJ,Vincent DE,Friedberg I

    update_date:2010-12-21