Minimum sample size for developing a multivariable prediction model: Part I - Continuous outcomes.

Abstract:

In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) for potential inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥ 0.9; (ii) small absolute difference of ≤ 0.05 in the apparent and adjusted R²; (iii) precise estimation (a margin of error ≤ 10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R² as informed by previous studies. The value of n that meets all four criteria provides the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.
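As a rough illustration of how criteria (i) and (ii) translate into a number, the sketch below searches for the smallest n that satisfies each. Several things here are assumptions, not statements from the abstract: the anticipated adjusted R² of 0.2, the shrinkage expression S = 1 + (p - 2) / (n ln(1 - R²_app)), and the function names, which are hypothetical. The link between apparent and adjusted R² is the standard adjusted-R² identity. Criteria (iii) and (iv) are omitted because the abstract does not give enough detail to reconstruct them.

```python
import math


def n_for_shrinkage(p, r2_adj, target=0.9):
    """Smallest n with expected global shrinkage >= target (criterion i).

    ASSUMPTION: shrinkage is approximated by
        S = 1 + (p - 2) / (n * ln(1 - R2_app)),
    where R2_app is the apparent R2 implied by the anticipated adjusted R2.
    """
    n = p + 2  # smallest n for which R2_app < 1, so the log is defined
    while True:
        # Invert the usual adjusted-R2 formula to get the apparent R2.
        r2_app = (r2_adj * (n - p - 1) + p) / (n - 1)
        shrinkage = 1 + (p - 2) / (n * math.log(1 - r2_app))
        if shrinkage >= target:
            return n
        n += 1


def n_for_r2_gap(p, r2_adj, delta=0.05):
    """Smallest n with (apparent R2 - adjusted R2) <= delta (criterion ii).

    From the adjusted-R2 identity the gap equals p * (1 - R2_adj) / (n - 1),
    so n >= 1 + p * (1 - R2_adj) / delta.
    """
    # The small tolerance guards against floating-point noise before ceil.
    return math.ceil(1 + p * (1 - r2_adj) / delta - 1e-9)


if __name__ == "__main__":
    p = 25        # predictor parameters, as in the applied example
    r2_adj = 0.2  # ASSUMED anticipated adjusted R2 (not given in the abstract)
    n1 = n_for_shrinkage(p, r2_adj)
    n2 = n_for_r2_gap(p, r2_adj)
    n_min = max(n1, n2)
    print(f"criterion (i): n >= {n1}")
    print(f"criterion (ii): n >= {n2}")
    print(f"minimum over (i)-(ii): {n_min} ({n_min / p:.1f} subjects per parameter)")
```

Under these assumed inputs, criterion (i) is the binding one and gives n = 918, which agrees with the 918 subjects (36.7 per predictor parameter) quoted in the applied example; a different anticipated R² would change the result.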

journal_name

Stat Med

journal_title

Statistics in medicine

authors

Riley RD,Snell KIE,Ensor J,Burke DL,Harrell FE Jr,Moons KGM,Collins GS

doi

10.1002/sim.7993

subject

Has Abstract

pub_date

2019-03-30 00:00:00

pages

1262-1275

issue

7

eissn

1097-0258

issn

0277-6715

journal_volume

38

pub_type

Journal Article
  • Cancer immunotherapy trial design with cure rate and delayed treatment effect.

    abstract::Cancer immunotherapy trials have two special features: a delayed treatment effect and a cure rate. Both features violate the proportional hazard model assumption and ignoring either one of the two features in an immunotherapy trial design will result in substantial loss of statistical power. To properly design immunot...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8440

    authors: Wei J,Wu J

    update_date:2020-03-15 00:00:00

  • Robust and efficient estimation in the parametric proportional hazards model under random censoring.

    abstract::Cox proportional hazard regression model is a popular tool to analyze the relationship between a censored lifetime variable with other relevant factors. The semiparametric Cox model is widely used to study different types of data arising from applied disciplines such as medical science, biology, and reliability studie...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8377

    authors: Ghosh A,Basu A

    update_date:2019-11-30 00:00:00

  • A spatial scan statistic for ordinal data.

    abstract::Spatial scan statistics are widely used for count data to detect geographical disease clusters of high or low incidence, mortality or prevalence and to evaluate their statistical significance. Some data are ordinal or continuous in nature, however, so that it is necessary to dichotomize the data to use a traditional s...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2607

    authors: Jung I,Kulldorff M,Klassen AC

    update_date:2007-03-30 00:00:00

  • The impact of heterogeneity on the comparison of survival times.

    abstract::We consider several sources of heterogeneity in a clinical trial with patients' survival time as the main response criterion: differences in prognosis which can be attributed to a latent or ignored prognostic factor; differences in treatment efficacy in subgroups of patients, and differences in treatment combinations ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780060708

    authors: Schumacher M,Olschewski M,Schmoor C

    update_date:1987-10-01 00:00:00

  • Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis.

    abstract::The purpose of this paper is to show that the sensitivity and specificity estimates obtained by 'discrepant analysis' are biased. Discrepant analysis is a widely used technique that attempts to provide estimates of sensitivity and specificity in the presence of an imperfect gold standard. Many researchers have applied...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19970630)16:12<1391::aid-s

    authors: Hadgu A

    update_date:1997-06-30 00:00:00

  • Combining individual and aggregated data to investigate the role of socioeconomic disparities on cancer burden in Italy.

    abstract::Quantifying socioeconomic disparities and understanding the roots of inequalities are growing topics in cancer research. However, socioeconomic differences are challenging to investigate mainly due to the lack of accurate data at individual-level, while aggregate indicators are only partially informative. We implement...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8392

    authors: Mezzetti M,Palli D,Dominici F

    update_date:2020-01-15 00:00:00

  • Construction, validation and updating of a prognostic model for kidney graft survival.

    abstract::The construction, validation and updating of a prognostic model for kidney graft survival is reported using data from the Eurotransplant database. First, a model is constructed for data from transplantations in the period 1984 to 1987. The model is later updated for the 1988-1990 data. The first data set was randomly ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780141806

    authors: Van Houwelingen HC,Thorogood J

    update_date:1995-09-30 00:00:00

  • Assurance calculations for planning clinical trials with time-to-event outcomes.

    abstract::We consider the use of the assurance method in clinical trial planning. In the assurance method, which is an alternative to a power calculation, we calculate the probability of a clinical trial resulting in a successful outcome, via eliciting a prior probability distribution about the relevant treatment effect. This i...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5916

    authors: Ren S,Oakley JE

    update_date:2014-01-15 00:00:00

  • Some considerations in the analysis of rates of change in longitudinal studies.

    abstract::This paper discusses and compares several estimators of mean rate of change in unbalanced longitudinal data based on a model with randomly distributed regression coefficients across individuals. The estimators are unweighted and weighted means of these coefficients. The paper also evaluates commonly used variance esti...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780060509

    authors: Palta M,Cook T

    update_date:1987-07-01 00:00:00

  • Multi-state models for colon cancer recurrence and death with a cured fraction.

    abstract::In cancer clinical trials, patients often experience a recurrence of disease prior to the outcome of interest, overall survival. Additionally, for many cancers, there is a cured fraction of the population who will never experience a recurrence. There is often interest in how different covariates affect the probability...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6056

    authors: Conlon AS,Taylor JM,Sargent DJ

    update_date:2014-05-10 00:00:00

  • Individualizing drug dosage with longitudinal data.

    abstract::We propose a two-step procedure to personalize drug dosage over time under the framework of a log-linear mixed-effect model. We model patients' heterogeneity using subject-specific random effects, which are treated as the realizations of an unspecified stochastic process. We extend the conditional quadratic inference ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7016

    authors: Zhu X,Qu A

    update_date:2016-10-30 00:00:00

  • A random forest approach for competing risks based on pseudo-values.

    abstract::Random forest is a supervised learning method that combines many classification or regression trees for prediction. Here we describe an extension of the random forest method for building event risk prediction models in survival analysis with competing risks. In case of right-censored data, the event status at the pred...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5775

    authors: Mogensen UB,Gerds TA

    update_date:2013-08-15 00:00:00

  • A recycling framework for the construction of Bonferroni-based multiple tests.

    abstract::In this paper we describe Bonferroni-based multiple testing procedures (MTPs) as strategies to split and recycle test mass. Here, 'test mass' refers to (parts of) the nominal level alpha at which the family-wise error rate is controlled. Briefly, test mass is split between different null hypotheses, and whenever a nul...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3513

    authors: Burman CF,Sonesson C,Guilbaud O

    update_date:2009-02-28 00:00:00

  • A review of methods for futility stopping based on conditional power.

    abstract::Conditional power (CP) is the probability that the final study result will be statistically significant, given the data observed thus far and a specific assumption about the pattern of the data to be observed in the remainder of the study, such as assuming the original design effect, or the effect estimated from the c...

    journal_title:Statistics in medicine

    pub_type: Journal Article, Review

    doi:10.1002/sim.2151

    authors: Lachin JM

    update_date:2005-09-30 00:00:00

  • Parametric multistate survival models: Flexible modelling allowing transition-specific distributions with application to estimating clinically useful measures of effect differences.

    abstract::Multistate models are increasingly being used to model complex disease profiles. By modelling transitions between disease states, accounting for competing events at each transition, we can gain a much richer understanding of patient trajectories and how risk factors impact over the entire disease pathway. In this arti...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7448

    authors: Crowther MJ,Lambert PC

    update_date:2017-12-20 00:00:00

  • A model for space-time cluster detection using spatial clusters with flexible temporal risk patterns.

    abstract::Maps of estimated disease rates over multiple time periods are useful tools for gaining etiologic insights regarding potential exposures associated with specific locations and times. In this paper, we describe an extension of the Gangnon-Clayton model for spatial clustering to spatio-temporal data. As in the purely sp...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3984

    authors: Gangnon RE

    update_date:2010-09-30 00:00:00

  • Efficient evaluation of treatment effects in the presence of missing covariate values.

    abstract::In clinical trials, treatment comparisons are often performed by models that incorporate important prognostic factors. Since these models require complete covariate information on all patients, statisticians frequently resort to complete case analysis or to omission of an important covariate. A probability imputation ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780090707

    authors: Schemper M,Smith TL

    update_date:1990-07-01 00:00:00

  • Causal conclusions are most sensitive to unobserved binary covariates.

    abstract::There is a rich literature that considers whether an observed relation between treatment and response is due to an unobserved covariate. In order to quantify this unmeasured bias, an assumption is made about the distribution of this unobserved covariate; typically that it is either binary or at least confined to the u...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2344

    authors: Wang L,Krieger AM

    update_date:2006-07-15 00:00:00

  • Measurement error in continuous endpoints in randomised trials: Problems and solutions.

    abstract::In randomised trials, continuous endpoints are often measured with some degree of error. This study explores the impact of ignoring measurement error and proposes methods to improve statistical inference in the presence of measurement error. Three main types of measurement error in continuous endpoints are considered:...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8359

    authors: Nab L,Groenwold RHH,Welsing PMJ,van Smeden M

    update_date:2019-11-30 00:00:00

  • Correcting for regression in assessing the response to treatment in a selected population.

    abstract::Previous work on the consequences of regression to the mean for the interpretation of responses to treatment is extended to the situation where the response measured is the proportional change in some variable. Methods for correcting for the bias are discussed. ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780060203

    authors: Curnow RN

    update_date:1987-03-01 00:00:00

  • Bias resulting from the use of 'assay sensitivity' as an inclusion criterion for meta-analysis.

    abstract::Assay sensitivity has been proposed as a criterion for including psychiatric clinical outcome studies in meta-analyses. The authors assess the performance of assay sensitivity as a method for determining study appropriateness for meta-analysis by calculating expected standard drug vs placebo effect sizes for various c...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2240

    authors: Gelfand LA,Strunk DR,Tu XM,Noble RE,Derubeis RJ

    update_date:2006-03-30 00:00:00

  • Interval estimation for rank correlation coefficients based on the probit transformation with extension to measurement error correction of correlated ranked data.

    abstract::The Spearman (rho(s)) and Kendall (tau) rank correlation coefficient are routinely used as measures of association between non-normally distributed random variables. However, confidence limits for rho(s) are only available under the assumption of bivariate normality and for tau under the assumption of asymptotic norma...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2547

    authors: Rosner B,Glynn RJ

    update_date:2007-02-10 00:00:00

  • Estimating population effects of vaccination using large, routinely collected data.

    abstract::Vaccination in populations can have several kinds of effects. Establishing that vaccination produces population-level effects beyond the direct effects in the vaccinated individuals can have important consequences for public health policy. Formal methods have been developed for study designs and analysis that can esti...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7392

    authors: Halloran ME,Hudgens MG

    update_date:2018-01-30 00:00:00

  • A linear exponent AR(1) family of correlation structures.

    abstract::In repeated measures settings, modeling the correlation pattern of the data can be immensely important for proper analyses. Accurate inference requires proper choice of the correlation model. Optimal efficiency of the estimation procedure demands a parsimonious parameterization of the correlation structure, with suffi...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3928

    authors: Simpson SL,Edwards LJ,Muller KE,Sen PK,Styner MA

    update_date:2010-07-30 00:00:00

  • Assessing the incremental predictive performance of novel biomarkers over standard predictors.

    abstract::It is unclear to what extent the incremental predictive performance of a novel biomarker is impacted by the method used to control for standard predictors. We investigated whether adding a biomarker to a model with a published risk score overestimates its incremental performance as compared to adding it to a multivari...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6165

    authors: Xanthakis V,Sullivan LM,Vasan RS,Benjamin EJ,Massaro JM,D'Agostino RB Sr,Pencina MJ

    update_date:2014-07-10 00:00:00

  • Additive and multiplicative covariate regression models for relative survival incorporating fractional polynomials for time-dependent effects.

    abstract::Relative survival is used to estimate patient survival excluding causes of death not related to the disease of interest. Rather than using cause of death information from death certificates, which is often poorly recorded, relative survival compares the observed survival to that expected in a matched group from the ge...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2399

    authors: Lambert PC,Smith LK,Jones DR,Botha JL

    update_date:2005-12-30 00:00:00

  • A robust goodness-of-fit test statistic with application to ordinal regression models.

    abstract::We propose a goodness-of-fit test statistic for linear regression with heterogeneous variance, which is asymptotically chi-square if the given model is correct. The test statistic is computed as a quadratic form of observed minus predicted responses. We apply the method to a linear regression for an ordinal categorica...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780130205

    authors: Lipsitz SR,Buoncristiani JF

    update_date:1994-01-30 00:00:00

  • Assessing heterogeneity and correlation of paired failure times with the bivariate frailty model.

    abstract::We consider bivariate survival times for heterogeneous populations, where heterogeneity induces deviations in an individual's risk of an event as well as associations between survival times. The heterogeneity is characterized by a bivariate frailty model. We measure the heterogeneity effects through deviations associa...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19990430)18:8<907::aid-sim

    authors: Xue X,Ding Y

    update_date:1999-04-30 00:00:00

  • Estimating probit models with self-selected treatments.

    abstract::Outcomes research often requires estimating the impact of a binary treatment on a binary outcome in a non-randomized setting, such as the effect of taking a drug on mortality. The data often come from self-selected samples, leading to a spurious correlation between the treatment and outcome when standard binary depend...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2226

    authors: Bhattacharya J,Goldman D,McCaffrey D

    update_date:2006-02-15 00:00:00

  • Survival analyses of randomized clinical trials adjusted for patients who switch treatments.

    abstract::Patients who switch treatment groups in randomized clinical trials can cause problems in the interpretation of the results. Although the intention-to-treat method is recognized as being the most reliable analysis, it may result in an underestimate of the treatment effect if there have been patients who switch treatmen...

    journal_title:Statistics in medicine

    pub_type: Clinical Trial, Journal Article, Randomized Controlled Trial

    doi:10.1002/(SICI)1097-0258(19961015)15:19<2069::AID-S

    authors: Law MG,Kaldor JM

    update_date:1996-10-15 00:00:00