Identifying representative trees from ensembles.

Abstract:

Tree-based methods have become popular for analyzing complex data structures where the primary goal is risk stratification of patients. Ensemble techniques improve the accuracy in prediction and address the instability in a single tree by growing an ensemble of trees and aggregating. However, in the process, individual trees get lost. In this paper, we propose a methodology for identifying the most representative trees in an ensemble on the basis of several tree distance metrics. Although our focus is on binary outcomes, the methods are applicable to censored data as well. For any two trees, the distance metrics are chosen to (1) measure similarity of the covariates used to split the trees; (2) reflect similar clustering of patients in the terminal nodes of the trees; and (3) measure similarity in predictions from the two trees. Whereas the latter focuses on prediction, the first two metrics focus on the architectural similarity between two trees. The most representative trees in the ensemble are chosen on the basis of the average distance between a tree and all other trees in the ensemble. Out-of-bag estimate of error rate is obtained using neighborhoods of representative trees. Simulations and data examples show gains in predictive accuracy when averaging over such neighborhoods. We illustrate our methods using a dataset of kidney cancer treatment receipt (binary outcome) and a second dataset of breast cancer survival (censored outcome).
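
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the three distance functions below are assumed stand-ins for the three kinds of metric it lists (covariate use, terminal-node clustering, predictions), and every function and variable name is hypothetical. Each tree is reduced to plain Python data: the set of covariates it splits on, the terminal-node label of each patient, and the predicted risk for each patient.

```python
from itertools import combinations


def covariate_distance(used_a, used_b, n_covariates):
    """Share of candidate covariates used by exactly one of the two trees (assumed form)."""
    return len(set(used_a) ^ set(used_b)) / n_covariates


def clustering_distance(leaves_a, leaves_b):
    """Share of patient pairs that one tree puts in the same terminal node and the other does not."""
    pairs = list(combinations(range(len(leaves_a)), 2))
    disagreements = sum(
        (leaves_a[i] == leaves_a[j]) != (leaves_b[i] == leaves_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def prediction_distance(pred_a, pred_b):
    """Mean squared difference between the two trees' predicted risks (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)


def most_representative(items, distance):
    """Return the index with the smallest average distance to all other items, plus all averages."""
    n = len(items)
    averages = [
        sum(distance(items[i], items[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    return min(range(n), key=averages.__getitem__), averages


# Toy ensemble (made-up numbers): predicted risks from three trees for four patients.
toy_predictions = [
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.3, 0.7, 0.9],
    [0.6, 0.6, 0.4, 0.4],
]
best_index, average_distances = most_representative(toy_predictions, prediction_distance)
print(best_index, average_distances)
```

In this toy ensemble the second tree has the smallest average prediction distance to the other two, so it would be the representative tree under this metric. Passing `clustering_distance` with terminal-node labels instead of predictions gives the analogous choice for the clustering metric.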

journal_name

Stat Med

journal_title

Statistics in medicine

authors

Banerjee M,Ding Y,Noone AM

doi

10.1002/sim.4492

subject

Has Abstract

pub_date

2012-07-10 00:00:00

pages

1601-16

issue

15

eissn

1097-0258

issn

0277-6715

journal_volume

31

pub_type

Journal Article
  • Binary partitioning for continuous longitudinal data: categorizing a prognostic variable.

    abstract::We investigate a binary partitioning algorithm in the case of a continuous repeated measures outcome. The procedure is based on the use of the likelihood ratio statistic to evaluate the performance of individual splits. The procedure partitions a set of longitudinal data into two mutually exclusive groups based on an ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.1266

    authors: Abdolell M,LeBlanc M,Stephens D,Harrison RV

    update_date:2002-11-30 00:00:00

  • R2: a useful measure of model performance when predicting a dichotomous outcome.

    abstract::R2 has been criticized as a measure of model performance when predicting a dichotomous outcome, both because its value is often low and because it is sensitive to the prevalence of the event of interest. The C statistic is more widely used to measure model performance in a 0/1 setting. We use a simple parametric famil...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19990228)18:4<375::aid-sim

    authors: Ash A,Shwartz M

    update_date:1999-02-28 00:00:00

  • Ratio of geometric means to analyze continuous outcomes in meta-analysis: comparison to mean differences and ratio of arithmetic means using empiric data and simulation.

    abstract::Meta-analyses pooling continuous outcomes can use mean differences (MD), standardized MD (MD in pooled standard deviation units, SMD), or ratio of arithmetic means (RoM). Recently, ratio of geometric means using ad hoc (RoGM (ad hoc) ) or Taylor series (RoGM (Taylor) ) methods for estimating variances have been propos...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4501

    authors: Friedrich JO,Adhikari NK,Beyene J

    update_date:2012-07-30 00:00:00

  • Comparative calibration without a gold standard.

    abstract::Comparative calibration is the broad statistical methodology used to assess the calibration of a set of p instruments, each designed to measure the same characteristic, on a common group of individuals. Different from the usual calibration problem, the true underlying quantity measured is unobservable. Many authors ha...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19970830)16:16<1889::aid-s

    authors: Lu Y,Ye K,Mathur AK,Hui S,Fuerst TP,Genant HK

    update_date:1997-08-30 00:00:00

  • Weighted estimation for confounded binary outcomes subject to misclassification.

    abstract::In the presence of confounding, the consistency assumption required for identification of causal effects may be violated due to misclassification of the outcome variable. We introduce an inverse probability weighted approach to rebalance covariates across treatment groups while mitigating the influence of differential...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7522

    authors: Gravel CA,Platt RW

    update_date:2018-02-10 00:00:00

  • A frailty model approach for regression analysis of multivariate current status data.

    abstract::This paper discusses regression analysis of multivariate current status failure time data (The Statistical Analysis of Interval-censoring Failure Time Data. Springer: New York, 2006), which occur quite often in, for example, tumorigenicity experiments and epidemiologic investigations of the natural history of a diseas...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3715

    authors: Chen MH,Tong X,Sun J

    update_date:2009-11-30 00:00:00

  • Model diagnostics for censored regression via randomized survival probabilities.

    abstract::Residuals in normal regression are used to assess a model's goodness-of-fit (GOF) and discover directions for improving the model. However, there is a lack of residuals with a characterized reference distribution for censored regression. In this article, we propose to diagnose censored regression with normalized rando...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8852

    authors: Li L,Wu T,Feng C

    update_date:2020-12-13 00:00:00

  • A penalized robust semiparametric approach for gene-environment interactions.

    abstract::In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6609

    authors: Wu C,Shi X,Cui Y,Ma S

    update_date:2015-12-30 00:00:00

  • An empirical Bayes method for studying variation in knee replacement rates.

    abstract::Knee replacement is the most commonly used surgical treatment for knee arthritis. It has been reported that knee replacement rates vary across both regions and counties. This paper used data from Medicare patients to develop explanations for the variation. One problem with our data is that we do not have patient level...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(SICI)1097-0258(19960915)15:17<1875::AID-S

    authors: Zhou XH,Katz BP,Holleman E,Melfi CA,Dittus R

    update_date:1996-09-15 00:00:00

  • Fast linear mixed model computations for genome-wide association studies with longitudinal data.

    abstract::Genome-wide association studies are characterized by a huge number of statistical tests performed to discover new disease-related genetic variants [in the form of single-nucleotide polymorphisms (SNPs)] in human DNA. Many SNPs have been identified for cross-sectionally measured phenotypes. However, there is a growing ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5517

    authors: Sikorska K,Rivadeneira F,Groenen PJ,Hofman A,Uitterlinden AG,Eilers PH,Lesaffre E

    update_date:2013-01-15 00:00:00

  • Estimating heterogeneous treatment effects for latent subgroups in observational studies.

    abstract::Individuals may vary in their responses to treatment, and identification of subgroups differentially affected by a treatment is an important issue in medical research. The risk of misleading subgroup analyses has become well known, and some exploratory analyses can be helpful in clarifying how covariates potentially i...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7970

    authors: Kim HJ,Lu B,Nehus EJ,Kim MO

    update_date:2019-02-10 00:00:00

  • Estimation of secondary endpoints in two-stage phase II oncology trials.

    abstract::In the development of a new treatment in oncology, phase II trials play a key role. On the basis of the data obtained during phase II, it is decided whether the treatment should be studied further. Therefore, the decision to be made on the basis of the data of a phase II trial must be as accurate as possible. For ethi...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5585

    authors: Kunz CU,Kieser M

    update_date:2012-12-30 00:00:00

  • Mean square error of estimates of HIV prevalence and short-term AIDS projections derived by backcalculation.

    abstract::We simulated multinomial AIDS incidence counts from 27 'representative' AIDS epidemics that spanned a period corresponding to previous applications of backcalculation (1 January 1977 to 1 July 1987) and assessed mean square error for several back-calculated estimators of HIV prevalence and short-term AIDS projections....

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780100802

    authors: Rosenberg PS,Gail MH,Pee D

    update_date:1991-08-01 00:00:00

  • Fisher's game with the devil.

    abstract::The publication of Fisher's correspondence on statistics has shed new light on his views on randomization. Quotations from this correspondence and from other works of Fisher are used to illustrate the role of randomization in clinical trials. It is concluded that Fisher's views not only are coherent but, despite havin...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780130305

    authors: Senn S

    update_date:1994-02-15 00:00:00

  • Goodness-of-fit test for proportional subdistribution hazards model.

    abstract::This paper concerns using modified weighted Schoenfeld residuals to test the proportionality of subdistribution hazards for the Fine-Gray model, similar to the tests proposed by Grambsch and Therneau for independently censored data. We develop a score test for the time-varying coefficients based on the modified Schoen...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5815

    authors: Zhou B,Fine J,Laird G

    update_date:2013-09-30 00:00:00

  • A comparison of arm-based and contrast-based models for network meta-analysis.

    abstract::Differences between arm-based (AB) and contrast-based (CB) models for network meta-analysis (NMA) are controversial. We compare the CB model of Lu and Ades (2006), the AB model of Hong et al(2016), and two intermediate models, using hypothetical data and a selected real data set. Differences between models arise prima...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8360

    authors: White IR,Turner RM,Karahalios A,Salanti G

    update_date:2019-11-30 00:00:00

  • Statistical issues related to dietary intake as the response variable in intervention trials.

    abstract::The focus of this paper is dietary intervention trials. We explore the statistical issues involved when the response variable, intake of a food or nutrient, is based on self-report data that are subject to inherent measurement error. There has been little work on handling error in this context. A particular feature of...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7011

    authors: Keogh RH,Carroll RJ,Tooze JA,Kirkpatrick SI,Freedman LS

    update_date:2016-11-10 00:00:00

  • Estimating the stage-specific numbers of HIV infection using a Markov model and back-calculation.

    abstract::The back-calculation method has been used to estimate the number of HIV infections from AIDS incidence data in a particular population. We present an extension of back calculation that provides estimates of the numbers of HIV infectives in different stages of infection. We model the staging process with a time-depende...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780110612

    authors: Longini IM Jr,Byers RH,Hessol NA,Tan WY

    update_date:1992-04-01 00:00:00

  • Predictive value of statistical models.

    abstract::A review is given of different ways of estimating the error rate of a prediction rule based on a statistical model. A distinction is drawn between apparent, optimum and actual error rates. Moreover it is shown how cross-validation can be used to obtain an adjusted predictor with smaller error rate. A detailed discussi...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780091109

    authors: Van Houwelingen JC,Le Cessie S

    update_date:1990-11-01 00:00:00

  • Risk-adjusted CUSUM charts under model error.

    abstract::In recent years, quality control charts have been increasingly applied in the healthcare environment, for example, to monitor surgical performance. Risk-adjusted cumulative (CUSUM) charts that utilize risk scores like the Parsonnet score to estimate the probability of death of a patient from an operation turn out to b...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8104

    authors: Knoth S,Wittenberg P,Gan FF

    update_date:2019-05-30 00:00:00

  • Comparison of methods for the analysis of longitudinal interval count data.

    abstract::Longitudinal studies are often concerned with estimating the recurrence rate of a non-fatal event. In many cases, only the total number of events occurring during successive time intervals is known. We compared a mixed Poisson-gamma regression method proposed by Thall and a quasi-likelihood method proposed by Zeger an...

    journal_title:Statistics in medicine

    pub_type: Clinical Trial, Journal Article, Randomized Controlled Trial

    doi:10.1002/sim.4780121406

    authors: Stukel TA

    update_date:1993-07-30 00:00:00

  • Joint modeling of repeated multivariate cognitive measures and competing risks of dementia and death: a latent process and latent class approach.

    abstract::Joint models initially dedicated to a single longitudinal marker and a single time-to-event need to be extended to account for the rich longitudinal data of cohort studies. Multiple causes of clinical progression are indeed usually observed, and multiple longitudinal markers are collected when the true latent trait of...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6731

    authors: Proust-Lima C,Dartigues JF,Jacqmin-Gadda H

    update_date:2016-02-10 00:00:00

  • Biostatistical concepts and methods in the legal setting.

    abstract::Biostatistical concepts and methods apply to various problems arising in actual U.S. legal cases. These involve: measures of association, assessing the potential effect of omitted variables and the Peters-Belson approach to regression. In particular, we present the inapplicability of Fisher's exact test in the case wh...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780141505

    authors: Gastwirth JL,Greenhouse SW

    update_date:1995-08-15 00:00:00

  • Statistical methods for multivariate interval-censored recurrent events.

    abstract::Multi-type recurrent event data arise when two or more different kinds of events may occur repeatedly over a period of observation. The scientific objectives in such settings are often to describe features of the marginal processes and to study the association between the different types of events. Interval-censored m...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.1936

    authors: Chen BE,Cook RJ,Lawless JF,Zhan M

    update_date:2005-03-15 00:00:00

  • Estimation of haemophilia-associated AIDS incidence in Japan using individual dates of diagnosis.

    abstract::This paper presents a procedure for obtaining short-term projections and lower bounds on the size of the acquired immunodeficiency syndrome (AIDS) epidemic. The method is similar to that proposed by Brookmeyer and Gail but adapted to the situation where individual dates of AIDS diagnosis are available. It gives result...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780081210

    authors: Tango T

    update_date:1989-12-01 00:00:00

  • Missing data and sensitivity analysis for binary data with implications for sample size and power of randomized clinical trials.

    abstract::Despite our best efforts, missing outcomes are common in randomized controlled clinical trials. The National Research Council's Committee on National Statistics panel report titled The Prevention and Treatment of Missing Data in Clinical Trials noted that further research is required to assess the impact of missing da...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8428

    authors: Cook T,Zea R

    update_date:2020-01-30 00:00:00

  • The ghosts of departed quantities: approaches to dealing with observations below the limit of quantitation.

    abstract::A common but not necessarily logical requirement in drug development is that a 'limit of quantitation' be set for chemical assays and that observations that fall below the limit should not be treated as real data but should be labelled as below the limit and set aside for special treatment. We examine five of seven ap...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5515

    authors: Senn S,Holford N,Hockey H

    update_date:2012-12-30 00:00:00

  • Reducing false alarms in syndromic surveillance.

    abstract::Algorithms for identifying public health threats or disease outbreaks are vulnerable to false alarms arising from sudden shifts in health-care utilization or data participation. This paper describes a method of reducing false alerts in automated public health surveillance algorithms, and in particular, automated syndr...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4204

    authors: Peter W,Najmi AH,Burkom HS

    update_date:2011-06-30 00:00:00

  • An empirical comparison of univariate and multivariate meta-analyses for categorical outcomes.

    abstract::Treatment effects for multiple outcomes can be meta-analyzed separately or jointly, but no systematic empirical comparison of the two approaches exists. From the Cochrane Library of Systematic Reviews, we identified 45 reviews, including 1473 trials and 258,675 patients, that contained two or three univariate meta-ana...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6044

    authors: Trikalinos TA,Hoaglin DC,Schmid CH

    update_date:2014-04-30 00:00:00

  • Seasonal and other short-term influences on United States AIDS incidence.

    abstract::This paper models monthly AIDS diagnosis counts in terms of smooth secular trend, calendar month effects, and the number of workdays per month. A parameterization of month effects allows separation of true seasonal effects from a linear trend over the calendar year and an arbitrary June effect. There is strong evidenc...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780131905

    authors: Bacchetti P

    update_date:1994-10-15 00:00:00