Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets.

Abstract:

Logistic regression analysis may well be used to develop a prognostic model for a dichotomous outcome. Especially when limited data are available, it is difficult to determine an appropriate selection of covariables for inclusion in such models. Also, predictions may be improved by applying some sort of shrinkage in the estimation of regression coefficients. In this study we compare the performance of several selection and shrinkage methods in small data sets of patients with acute myocardial infarction, where we aim to predict 30-day mortality. Selection methods included backward stepwise selection with significance levels alpha of 0.01, 0.05, 0.157 (the AIC criterion) or 0.50, and the use of qualitative external information on the sign of regression coefficients in the model. Estimation methods included standard maximum likelihood, the use of a linear shrinkage factor, penalized maximum likelihood, the Lasso, or quantitative external information on univariable regression coefficients. We found that stepwise selection with a low alpha (for example, 0.05) led to a relatively poor model performance, when evaluated on independent data. Substantially better performance was obtained with full models with a limited number of important predictors, where regression coefficients were reduced with any of the shrinkage methods. Incorporation of external information for selection and estimation improved the stability and quality of the prognostic models. We therefore recommend shrinkage methods in full models including prespecified predictors and incorporation of external information, when prognostic models are constructed in small data sets.
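
The abstract equates a significance level of 0.157 with the AIC criterion. A minimal sketch of where that number comes from, assuming each candidate predictor contributes a single parameter (1 degree of freedom): AIC keeps a predictor when its likelihood-ratio chi-square statistic exceeds 2, and the upper-tail probability of a 1-df chi-square at 2 is about 0.157. The function name `chi2_1df_upper_tail` is a hypothetical helper introduced for illustration and is not taken from the paper.

```python
import math


def chi2_1df_upper_tail(x: float) -> float:
    """Return P(X > x) for a chi-square variable with 1 degree of freedom.

    For 1 df, X is the square of a standard normal, so the upper-tail
    probability equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


# AIC penalises each extra parameter by 2 on the deviance scale, so a
# 1-df predictor is retained when its chi-square statistic exceeds 2.
aic_alpha = chi2_1df_upper_tail(2.0)
print(round(aic_alpha, 3))  # 0.157

# For comparison, the conventional 0.05 level corresponds to a
# chi-square threshold of about 3.84 (1.96 squared).
print(round(chi2_1df_upper_tail(1.959964 ** 2), 3))  # 0.05
```

Under this reading, selection at alpha = 0.157 is more liberal than at 0.05, because it requires a chi-square of only 2 rather than about 3.84 to keep a predictor.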

journal_name

Stat Med

journal_title

Statistics in medicine

authors

Steyerberg EW,Eijkemans MJ,Harrell FE Jr,Habbema JD

doi

10.1002/(sici)1097-0258(20000430)19:8<1059::aid-si

subject

Has Abstract

pub_date

2000-04-30 00:00:00

pages

1059-79

issue

8

eissn

1097-0258

issn

0277-6715

pii

10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SI

journal_volume

19

pub_type

Journal Article
  • REML and ML estimation for clustered grouped survival data.

    abstract::Clustered grouped survival data arise naturally in clinical medicine and biological research. For example, in a randomized clinical trial, the variable of interest is the time to occurrence of a certain event with or without a new treatment and the data are collected from possibly correlated subjects from independent ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.1323

    authors: Lam KF,Ip D

    update_date:2003-06-30 00:00:00

  • A Markov mixed effect regression model for drug compliance.

    abstract::Patient compliance (adherence) with prescribed medication is often erratic, while clinical outcomes are causally linked to actual, rather than nominal medication dosage. We propose here a hierarchical Markov model for patient compliance. At the first stage, conditional upon individual random effects and a set of indiv...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19981030)17:20<2313::aid-s

    authors: Girard P,Blaschke TF,Kastrissios H,Sheiner LB

    update_date:1998-10-30 00:00:00

  • A cluster model for space-time disease counts.

    abstract::Modelling disease clustering over space and time can be helpful in providing indications of possible exposures and planning corresponding public health practices. Though a considerable number of studies focus on modelling spatio-temporal patterns of disease, most of them do not directly model a spatio-temporal cluster...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2424

    authors: Yan P,Clayton MK

    update_date:2006-03-15 00:00:00

  • Sample size calculation for stepped wedge and other longitudinal cluster randomised trials.

    abstract::The sample size required for a cluster randomised trial is inflated compared with an individually randomised trial because outcomes of participants from the same cluster are correlated. Sample size calculations for longitudinal cluster randomised trials (including stepped wedge trials) need to take account of at least...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7028

    authors: Hooper R,Teerenstra S,de Hoop E,Eldridge S

    update_date:2016-11-20 00:00:00

  • Analysis of incomplete multivariate data using linear models with structured covariance matrices.

    abstract::Incomplete and unbalanced multivariate data often arise in longitudinal studies due to missing or unequally-timed repeated measurements and/or the presence of time-varying covariates. A general approach to analysing such data is through maximum likelihood analysis using a linear model for the expected responses, and s...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780070132

    authors: Schluchter MD

    update_date:1988-01-01 00:00:00

  • Model-checking techniques for stratified case-control studies.

    abstract::We present graphical and numerical methods for assessing the adequacy of the logistic regression model for stratified case-control data. The proposed methods are derived from the cumulative sum of residuals over the covariate or linear predictor. Under the assumed model, the cumulative residual process converges weakl...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.1932

    authors: Arbogast PG,Lin DY

    update_date:2005-01-30 00:00:00

  • Joint estimation of multiple disease-specific sensitivities and specificities via crossed random effects models for correlated reader-based diagnostic data: application of data cloning.

    abstract::We present a model for describing correlated binocular data from reader-based diagnostic studies, where the same group of readers evaluates the presence or absence of certain diseases on binocular organs (e.g., fellow eyes) of patients. Multiple random effects are incorporated to meaningfully delineate various associa...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6584

    authors: Withanage N,de Leon AR,Rudnisky CJ

    update_date:2015-12-20 00:00:00

  • A restricted mixture model for dietary pattern analysis in small samples.

    abstract::Multivariate finite mixture models have been applied to the identification of dietary patterns. These models are known to have many parameters, and consequently large samples are usually required. We present a special case of a multivariate mixture model that reduces the number of parameters to be estimated and seems ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5336

    authors: Rita Gaio A,Costa JP,Santos AC,Ramos E,Lopes C

    update_date:2012-08-30 00:00:00

  • Correction of sampling bias in a cross-sectional study of post-surgical complications.

    abstract::Cross-sectional designs are often used to monitor the proportion of infections and other post-surgical complications acquired in hospitals. However, conventional methods for estimating incidence proportions when applied to cross-sectional data may provide estimators that are highly biased, as cross-sectional designs t...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5608

    authors: Fluss R,Mandel M,Freedman LS,Weiss IS,Zohar AE,Haklai Z,Gordon ES,Simchen E

    update_date:2013-06-30 00:00:00

  • Statistical methods for multivariate interval-censored recurrent events.

    abstract::Multi-type recurrent event data arise when two or more different kinds of events may occur repeatedly over a period of observation. The scientific objectives in such settings are often to describe features of the marginal processes and to study the association between the different types of events. Interval-censored m...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.1936

    authors: Chen BE,Cook RJ,Lawless JF,Zhan M

    update_date:2005-03-15 00:00:00

  • A simulation-free approach to assessing the performance of the continual reassessment method.

    abstract::The continual reassessment method (CRM) is an adaptive design for Phase I trials whose operating characteristics, including appropriate sample size, probability of correctly identifying the maximum tolerated dose, and the expected proportion of participants assigned to each dose, can only be determined via simulation....

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8746

    authors: Braun TM

    update_date:2020-09-16 00:00:00

  • Power and sample size calculation for log-rank test with a time lag in treatment effect.

    abstract::The log-rank test is the most powerful non-parametric test for detecting a proportional hazards alternative and thus is the most commonly used testing procedure for comparing time-to-event distributions between different treatments in clinical trials. When the log-rank test is used for the primary data analysis, the s...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3501

    authors: Zhang D,Quan H

    update_date:2009-02-28 00:00:00

  • Association models for periodontal disease progression: a comparison of methods for clustered binary data.

    abstract::We investigate population-averaged (PA) and cluster-specific (CS) associations for clustered binary logistic regression in the context of a longitudinal clinical trial that investigated the association between tooth-specific visual elastase kit results and periodontal disease progression within 26 weeks of follow-up. ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780140407

    authors: Ten Have TR,Landis JR,Weaver SL

    update_date:1995-02-28 00:00:00

  • On the use of the generalized t and generalized rank-sum statistics in medical research.

    abstract::We have used Monte Carlo methods to compare the type I error properties of the conditional and unconditional versions of the generalized t and the generalized rank-sum tests to those of the independent samples t and Wilcoxon rank-sum tests. Results showed inflated type I errors for the conditional generalized tests bu...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780110410

    authors: Blair RC,Morel JG

    update_date:1992-02-28 00:00:00

  • Global goodness-of-fit tests for group testing regression models.

    abstract::In a variety of biomedical applications, particularly those involving screening for infectious diseases, testing individuals (e.g. blood/urine samples, etc.) in pools has become a standard method of data collection. This experimental design, known as group testing (or pooled testing), can provide a large reduction in ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.3678

    authors: Chen P,Tebbs JM,Bilder CR

    update_date:2009-10-15 00:00:00

  • The power prior: theory and applications.

    abstract::The power prior has been widely used in many applications covering a large number of disciplines. The power prior is intended to be an informative prior constructed from historical data. It has been used in clinical trials, genetics, health care, psychology, environmental health, engineering, economics, and business. ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6728

    authors: Ibrahim JG,Chen MH,Gwon Y,Chen F

    update_date:2015-12-10 00:00:00

  • Some extensions and applications of a Bayesian strategy for monitoring multiple outcomes in clinical trials.

    abstract::We present some practical extensions and applications of a strategy proposed by Thall, Simon and Estey for designing and monitoring single-arm clinical trials with multiple outcomes. We show by application how the strategy may be applied to construct designs for phase IIA activity trials and phase II equivalence trial...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/(sici)1097-0258(19980730)17:14<1563::aid-s

    authors: Thall PF,Sung HG

    update_date:1998-07-30 00:00:00

  • Estimation of sojourn time distributions and false negative rates in screening programmes which use two modalities.

    abstract::Day and Walter derived methods of joint maximum likelihood estimation for the sojourn time distribution and the false negative rate for a screening programme. Their methods are not directly applicable to a programme which uses alternate screening by two modalities whose sojourn times and false negative rates will diff...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780080611

    authors: Alexander FE

    update_date:1989-06-01 00:00:00

  • Network-based regularization for matched case-control analysis of high-dimensional DNA methylation data.

    abstract::The matched case-control designs are commonly used to control for potential confounding factors in genetic epidemiology studies especially epigenetic studies with DNA methylation. Compared with unmatched case-control studies with high-dimensional genomic or epigenetic data, there have been few variable selection metho...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5694

    authors: Sun H,Wang S

    update_date:2013-05-30 00:00:00

  • Cluster without fluster: The effect of correlated outcomes on inference in randomized clinical trials.

    abstract::Inference for randomized clinical trials is generally based on the assumption that outcomes are independently and identically distributed under the null hypothesis. In some trials, particularly in infectious disease, outcomes may be correlated. This may be known in advance (e.g. allowing randomization of family member...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2977

    authors: Proschan M,Follmann D

    update_date:2008-03-15 00:00:00

  • Survival time models for analysing drug combination treatments.

    abstract::Several relative risk models for survival time data in drug combination therapy are derived and their properties are discussed. The main intention of this paper is to clarify the differences among the models in order to help to choose the appropriate one in a given situation. The models are motivated by discussing the...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780091216

    authors: Kübler J,Schumacher M

    update_date:1990-12-01 00:00:00

  • Correcting for the dependent competing risk of treatment using inverse probability of censoring weighting and copulas in the estimation of natural conception chances.

    abstract::When estimating the probability of natural conception from observational data on couples with an unfulfilled child wish, the start of assisted reproductive therapy (ART) is a competing event that cannot be assumed to be independent of natural conception. In clinical practice, interest lies in the probability of natura...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6280

    authors: van Geloven N,Geskus RB,Mol BW,Zwinderman AH

    update_date:2014-11-20 00:00:00

  • Memory and other properties of multiple test procedures generated by entangled graphs.

    abstract::Methods for addressing multiplicity in clinical trials have attracted much attention during the past 20 years. They include the investigation of new classes of multiple test procedures, such as fixed sequence, fallback and gatekeeping procedures. More recently, sequentially rejective graphical test procedures have bee...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.5711

    authors: Maurer W,Bretz F

    update_date:2013-05-10 00:00:00

  • Power and money in cluster randomized trials: when is it worth measuring a covariate?

    abstract::The power to detect a treatment effect in cluster randomized trials can be increased by increasing the number of clusters. An alternative is to include covariates into the regression model that relates treatment condition to outcome. In this paper, formulae are derived in order to evaluate both strategies on basis of ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.2297

    authors: Moerbeek M

    update_date:2006-08-15 00:00:00

  • Assessing the robustness of sisVIVE in a Mendelian randomization study to estimate the causal effect of body mass index on income using multiple SNPs from understanding society.

    abstract::The "some invalid, some valid instrumental variable estimator" (sisVIVE) is a lasso-based method for instrumental variables (IVs) regression of outcome on an exposure. In principle, sisVIVE is robust to some of the IVs in the analysis being invalid, in the sense of being related to the outcome variable through pathway...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.8066

    authors: Bao Y,Clarke PS,Smart M,Kumari M

    update_date:2019-04-30 00:00:00

  • Dropouts in the AB/BA crossover design.

    abstract::Missing data arise in crossover trials, as they do in any form of clinical trial. Several papers have addressed the problems that missing data create, although almost all of these assume that the probability that a planned observation is missing does not depend on the value that would have been observed; that is, the ...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4497

    authors: Ho WK,Matthews JN,Henderson R,Farewell D,Rodgers LR

    update_date:2012-07-20 00:00:00

  • The standard error of Cohen's Kappa.

    abstract::This paper gives a standard error for Cohen's Kappa, conditional on the margins of the observed r x r table. An explicit formula is given for the 2 x 2 table, and a procedure for the more general situation. A parsimonious log-linear model is suggested for the general case and an approximate confidence interval for kap...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780100512

    authors: Garner JB

    update_date:1991-05-01 00:00:00

  • Seasonal and other short-term influences on United States AIDS incidence.

    abstract::This paper models monthly AIDS diagnosis counts in terms of smooth secular trend, calendar month effects, and the number of workdays per month. A parameterization of month effects allows separation of true seasonal effects from a linear trend over the calendar year and an arbitrary June effect. There is strong evidenc...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.4780131905

    authors: Bacchetti P

    update_date:1994-10-15 00:00:00

  • Minimum sample size for developing a multivariable prediction model: Part I - Continuous outcomes.

    abstract::In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model developm...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.7993

    authors: Riley RD,Snell KIE,Ensor J,Burke DL,Harrell FE Jr,Moons KGM,Collins GS

    update_date:2019-03-30 00:00:00

  • Survival probabilities with time-dependent treatment indicator: quantities and non-parametric estimators.

    abstract::The 'landmark' and 'Simon and Makuch' non-parametric estimators of the survival function are commonly used to contrast the survival experience of time-dependent treatment groups in applications such as stem cell transplant versus chemotherapy in leukemia. However, the theoretical survival functions corresponding to th...

    journal_title:Statistics in medicine

    pub_type: Journal Article

    doi:10.1002/sim.6765

    authors: Bernasconi DP,Rebora P,Iacobelli S,Valsecchi MG,Antolini L

    update_date:2016-03-30 00:00:00