Abstract:
Tree-based methods have become popular for analyzing complex data structures where the primary goal is risk stratification of patients. Ensemble techniques improve the accuracy in prediction and address the instability in a single tree by growing an ensemble of trees and aggregating. However, in the process, individual trees get lost. In this paper, we propose a methodology for identifying the most representative trees in an ensemble on the basis of several tree distance metrics. Although our focus is on binary outcomes, the methods are applicable to censored data as well. For any two trees, the distance metrics are chosen to (1) measure similarity of the covariates used to split the trees; (2) reflect similar clustering of patients in the terminal nodes of the trees; and (3) measure similarity in predictions from the two trees. Whereas the last metric focuses on prediction, the first two focus on the architectural similarity between two trees. The most representative trees in the ensemble are chosen on the basis of the average distance between a tree and all other trees in the ensemble. An out-of-bag estimate of the error rate is obtained using neighborhoods of representative trees. Simulations and data examples show gains in predictive accuracy when averaging over such neighborhoods. We illustrate our methods using a dataset of kidney cancer treatment receipt (binary outcome) and a second dataset of breast cancer survival (censored outcome).
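The selection rule described in the abstract (pick the trees with the smallest average distance to all other trees) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each tree is represented only by its predicted probabilities on a common set of cases, and it uses mean squared difference as a stand-in for the prediction-based distance, whose exact form is not given here. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def prediction_distance(p, q):
    """Assumed metric: mean squared difference between two trees' predictions."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def average_distances(preds):
    """For each tree, the average distance to every other tree in the ensemble."""
    n = len(preds)
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        d = prediction_distance(preds[i], preds[j])
        totals[i] += d
        totals[j] += d
    return [t / (n - 1) for t in totals]


def most_representative(preds, k=1):
    """Indices of the k trees with the smallest average distance."""
    scores = average_distances(preds)
    return sorted(range(len(preds)), key=lambda i: scores[i])[:k]


# Hypothetical toy ensemble: each row holds one tree's predicted
# probabilities for the same four patients.
toy_preds = [
    [0.1, 0.2, 0.8, 0.9],
    [0.2, 0.2, 0.7, 0.9],
    [0.9, 0.8, 0.1, 0.2],
]
```

Calling `most_representative(toy_preds)` on this toy ensemble selects the second tree (index 1), since it lies closest on average to the other two. The covariate-based and clustering-based metrics mentioned in the abstract would slot in by swapping out `prediction_distance`.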
journal_name: Stat Med
journal_title: Statistics in medicine
authors: Banerjee M, Ding Y, Noone AM
doi: 10.1002/sim.4492
subject: Has Abstract
pub_date: 2012-07-10 00:00:00
pages: 1601-16
issue: 15
eissn: 0277-6715
issn: 1097-0258
journal_volume: 31
pub_type: Journal article
abstract::We investigate a binary partitioning algorithm in the case of a continuous repeated measures outcome. The procedure is based on the use of the likelihood ratio statistic to evaluate the performance of individual splits. The procedure partitions a set of longitudinal data into two mutually exclusive groups based on an ...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.1266
update_date:2002-11-30 00:00:00
abstract::R2 has been criticized as a measure of model performance when predicting a dichotomous outcome, both because its value is often low and because it is sensitive to the prevalence of the event of interest. The C statistic is more widely used to measure model performance in a 0/1 setting. We use a simple parametric famil...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/(sici)1097-0258(19990228)18:4<375::aid-sim
update_date:1999-02-28 00:00:00
abstract::Meta-analyses pooling continuous outcomes can use mean differences (MD), standardized MD (MD in pooled standard deviation units, SMD), or ratio of arithmetic means (RoM). Recently, ratio of geometric means using ad hoc (RoGM(ad hoc)) or Taylor series (RoGM(Taylor)) methods for estimating variances have been propos...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4501
update_date:2012-07-30 00:00:00
abstract::Comparative calibration is the broad statistical methodology used to assess the calibration of a set of p instruments, each designed to measure the same characteristic, on a common group of individuals. Different from the usual calibration problem, the true underlying quantity measured is unobservable. Many authors ha...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/(sici)1097-0258(19970830)16:16<1889::aid-s
update_date:1997-08-30 00:00:00
abstract::In the presence of confounding, the consistency assumption required for identification of causal effects may be violated due to misclassification of the outcome variable. We introduce an inverse probability weighted approach to rebalance covariates across treatment groups while mitigating the influence of differential...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.7522
update_date:2018-02-10 00:00:00
abstract::This paper discusses regression analysis of multivariate current status failure time data (The Statistical Analysis of Interval-censoring Failure Time Data. Springer: New York, 2006), which occur quite often in, for example, tumorigenicity experiments and epidemiologic investigations of the natural history of a diseas...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.3715
update_date:2009-11-30 00:00:00
abstract::Residuals in normal regression are used to assess a model's goodness-of-fit (GOF) and discover directions for improving the model. However, there is a lack of residuals with a characterized reference distribution for censored regression. In this article, we propose to diagnose censored regression with normalized rando...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.8852
update_date:2020-12-13 00:00:00
abstract::In genetic and genomic studies, gene-environment (G×E) interactions have important implications. Some of the existing G×E interaction methods are limited by analyzing a small number of G factors at a time, by assuming linear effects of E factors, by assuming no data contamination, and by adopting ineffective selection...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.6609
update_date:2015-12-30 00:00:00
abstract::Knee replacement is the most commonly used surgical treatment for knee arthritis. It has been reported that knee replacement rates vary across both regions and counties. This paper used data from Medicare patients to develop explanations for the variation. One problem with our data is that we do not have patient level...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/(SICI)1097-0258(19960915)15:17<1875::AID-S
update_date:1996-09-15 00:00:00
abstract::Genome-wide association studies are characterized by a huge number of statistical tests performed to discover new disease-related genetic variants [in the form of single-nucleotide polymorphisms (SNPs)] in human DNA. Many SNPs have been identified for cross-sectionally measured phenotypes. However, there is a growing ...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.5517
update_date:2013-01-15 00:00:00
abstract::Individuals may vary in their responses to treatment, and identification of subgroups differentially affected by a treatment is an important issue in medical research. The risk of misleading subgroup analyses has become well known, and some exploratory analyses can be helpful in clarifying how covariates potentially i...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.7970
update_date:2019-02-10 00:00:00
abstract::In the development of a new treatment in oncology, phase II trials play a key role. On the basis of the data obtained during phase II, it is decided whether the treatment should be studied further. Therefore, the decision to be made on the basis of the data of a phase II trial must be as accurate as possible. For ethi...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.5585
update_date:2012-12-30 00:00:00
abstract::We simulated multinomial AIDS incidence counts from 27 'representative' AIDS epidemics that spanned a period corresponding to previous applications of backcalculation (1 January 1977 to 1 July 1987) and assessed mean square error for several back-calculated estimators of HIV prevalence and short-term AIDS projections....
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780100802
update_date:1991-08-01 00:00:00
abstract::The publication of Fisher's correspondence on statistics has shed new light on his views on randomization. Quotations from this correspondence and from other works of Fisher are used to illustrate the role of randomization in clinical trials. It is concluded that Fisher's views not only are coherent but, despite havin...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780130305
update_date:1994-02-15 00:00:00
abstract::This paper concerns using modified weighted Schoenfeld residuals to test the proportionality of subdistribution hazards for the Fine-Gray model, similar to the tests proposed by Grambsch and Therneau for independently censored data. We develop a score test for the time-varying coefficients based on the modified Schoen...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.5815
update_date:2013-09-30 00:00:00
abstract::Differences between arm-based (AB) and contrast-based (CB) models for network meta-analysis (NMA) are controversial. We compare the CB model of Lu and Ades (2006), the AB model of Hong et al. (2016), and two intermediate models, using hypothetical data and a selected real data set. Differences between models arise prima...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.8360
update_date:2019-11-30 00:00:00
abstract::The focus of this paper is dietary intervention trials. We explore the statistical issues involved when the response variable, intake of a food or nutrient, is based on self-report data that are subject to inherent measurement error. There has been little work on handling error in this context. A particular feature of...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.7011
update_date:2016-11-10 00:00:00
abstract::The back-calculation method has been used to estimate the number of HIV infections from AIDS incidence data in a particular population. We present an extension of back calculation that provides estimates of the numbers of HIV infectives in different stages of infection. We model the staging process with a time-depende...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780110612
update_date:1992-04-01 00:00:00
abstract::A review is given of different ways of estimating the error rate of a prediction rule based on a statistical model. A distinction is drawn between apparent, optimum and actual error rates. Moreover it is shown how cross-validation can be used to obtain an adjusted predictor with smaller error rate. A detailed discussi...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780091109
update_date:1990-11-01 00:00:00
abstract::In recent years, quality control charts have been increasingly applied in the healthcare environment, for example, to monitor surgical performance. Risk-adjusted cumulative (CUSUM) charts that utilize risk scores like the Parsonnet score to estimate the probability of death of a patient from an operation turn out to b...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.8104
update_date:2019-05-30 00:00:00
abstract::Longitudinal studies are often concerned with estimating the recurrence rate of a non-fatal event. In many cases, only the total number of events occurring during successive time intervals is known. We compared a mixed Poisson-gamma regression method proposed by Thall and a quasi-likelihood method proposed by Zeger an...
journal_title:Statistics in medicine
pub_type: Clinical trial, Journal article, Randomized controlled trial
doi:10.1002/sim.4780121406
update_date:1993-07-30 00:00:00
abstract::Joint models initially dedicated to a single longitudinal marker and a single time-to-event need to be extended to account for the rich longitudinal data of cohort studies. Multiple causes of clinical progression are indeed usually observed, and multiple longitudinal markers are collected when the true latent trait of...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.6731
update_date:2016-02-10 00:00:00
abstract::Biostatistical concepts and methods apply to various problems arising in actual U.S. legal cases. These involve: measures of association, assessing the potential effect of omitted variables and the Peters-Belson approach to regression. In particular, we present the inapplicability of Fisher's exact test in the case wh...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780141505
update_date:1995-08-15 00:00:00
abstract::Multi-type recurrent event data arise when two or more different kinds of events may occur repeatedly over a period of observation. The scientific objectives in such settings are often to describe features of the marginal processes and to study the association between the different types of events. Interval-censored m...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.1936
update_date:2005-03-15 00:00:00
abstract::This paper presents a procedure for obtaining short-term projections and lower bounds on the size of the acquired immunodeficiency syndrome (AIDS) epidemic. The method is similar to that proposed by Brookmeyer and Gail but adapted to the situation where individual dates of AIDS diagnosis are available. It gives result...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780081210
update_date:1989-12-01 00:00:00
abstract::Despite our best efforts, missing outcomes are common in randomized controlled clinical trials. The National Research Council's Committee on National Statistics panel report titled The Prevention and Treatment of Missing Data in Clinical Trials noted that further research is required to assess the impact of missing da...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.8428
update_date:2020-01-30 00:00:00
abstract::A common but not necessarily logical requirement in drug development is that a 'limit of quantitation' be set for chemical assays and that observations that fall below the limit should not be treated as real data but should be labelled as below the limit and set aside for special treatment. We examine five of seven ap...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.5515
update_date:2012-12-30 00:00:00
abstract::Algorithms for identifying public health threats or disease outbreaks are vulnerable to false alarms arising from sudden shifts in health-care utilization or data participation. This paper describes a method of reducing false alerts in automated public health surveillance algorithms, and in particular, automated syndr...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4204
update_date:2011-06-30 00:00:00
abstract::Treatment effects for multiple outcomes can be meta-analyzed separately or jointly, but no systematic empirical comparison of the two approaches exists. From the Cochrane Library of Systematic Reviews, we identified 45 reviews, including 1473 trials and 258,675 patients, that contained two or three univariate meta-ana...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.6044
update_date:2014-04-30 00:00:00
abstract::This paper models monthly AIDS diagnosis counts in terms of smooth secular trend, calendar month effects, and the number of workdays per month. A parameterization of month effects allows separation of true seasonal effects from a linear trend over the calendar year and an arbitrary June effect. There is strong evidenc...
journal_title:Statistics in medicine
pub_type: Journal article
doi:10.1002/sim.4780131905
update_date:1994-10-15 00:00:00