A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies.

Abstract:

BACKGROUND: Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in performance measures estimated from single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at the population level, and simulation studies may be a better alternative for objectively comparing the performance of machine learning algorithms.

METHODS: We compare the classification performance of a number of important and widely used machine learning algorithms, namely Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features.

RESULTS: For a smaller number of correlated features, with the number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation error as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger, provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and surpasses that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which cases it also provides more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study.
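
The simulation idea described in the abstract (estimating average generalisation error and its stability over many simulated datasets, rather than from a single sample) can be illustrated with a minimal sketch. This is not the authors' design or code: the data model (independent Gaussian features with a common mean shift), the function names, and the parameter values are all assumptions made for illustration. A nearest-centroid rule is used as a simple stand-in for LDA (it matches LDA only under an identity covariance and equal priors), and only kNN is implemented alongside it; RF and SVM are omitted to keep the sketch dependency-free.

```python
import random
import statistics


def simulate(n_per_class, n_features, effect, sd, rng):
    """Draw n_per_class samples for each of two classes.

    Assumed data model: independent Gaussian features; class 1 has its
    mean shifted by `effect` on every feature.
    """
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * effect, sd) for _ in range(n_features)])
            y.append(label)
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def predict_centroid(train_X, train_y, test_X):
    """Nearest-centroid rule (a stand-in for LDA, not full LDA)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(train_X, train_y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return [min((0, 1), key=lambda c: sq_dist(x, centroids[c])) for x in test_X]


def predict_knn(train_X, train_y, test_X, k=3):
    """k-nearest-neighbour majority vote with Euclidean distance."""
    preds = []
    for x in test_X:
        nearest = sorted(zip(train_X, train_y),
                         key=lambda pair: sq_dist(x, pair[0]))[:k]
        votes = sum(lab for _, lab in nearest)
        preds.append(1 if 2 * votes > k else 0)
    return preds


def mean_error(predict, n_train, n_features, effect, sd,
               n_rep=50, n_test=100, seed=0):
    """Average test error and its standard deviation over n_rep replicates.

    Each replicate draws a fresh training set (n_train per class) and a
    fresh test set (n_test per class) from the same simulated population.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_rep):
        train_X, train_y = simulate(n_train, n_features, effect, sd, rng)
        test_X, test_y = simulate(n_test, n_features, effect, sd, rng)
        preds = predict(train_X, train_y, test_X)
        wrong = sum(pred != truth for pred, truth in zip(preds, test_y))
        errors.append(wrong / len(test_y))
    return statistics.fmean(errors), statistics.stdev(errors)


if __name__ == "__main__":
    # Vary one factor (number of features) while holding the others fixed.
    for n_features in (2, 10):
        for name, clf in (("centroid", predict_centroid), ("kNN", predict_knn)):
            avg, spread = mean_error(clf, n_train=10, n_features=n_features,
                                     effect=0.5, sd=1.0)
            print(f"p={n_features:2d} {name:8s} mean error={avg:.3f} sd={spread:.3f}")
```

The mean of the replicate errors corresponds to the average generalisation error, and their standard deviation to the stability (precision) of the error estimate. A fuller study would cross all the factors listed in the abstract, including correlation between features, which this sketch does not model.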

journal_name

Stat Methods Med Res

authors

Khondoker M,Dobson R,Skirrow C,Simmons A,Stahl D

doi

10.1177/0962280213502437

subject

Has Abstract

pub_date

2016-10-01 00:00:00

pages

1804-1823

issue

5

eissn

1477-0334

issn

0962-2802

pii

0962280213502437

journal_volume

25

pub_type

Journal Article
  • Analysis of phase II methodologies for single-arm clinical trials with multiple endpoints in rare cancers: An example in Ewing's sarcoma.

    abstract::Trials run in either rare diseases, such as rare cancers, or rare sub-populations of common diseases are challenging in terms of identifying, recruiting and treating sufficient patients in a sensible period. Treatments for rare diseases are often designed for other disease areas and then later proposed as possible tre...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280216662070

    authors: Dutton P,Love SB,Billingham L,Hassan AB

    update_date:2018-05-01 00:00:00

  • Latent mixture models for multivariate and longitudinal outcomes.

    abstract::Repeated measures and multivariate outcomes are an increasingly common feature of trials. Their joint analysis by means of random effects and latent variable models is appealing but patterns of heterogeneity in outcome profile may not conform to standard multivariate normal assumptions. In addition, there is much inte...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article, Review

    doi:10.1177/0962280209105016

    authors: Pickles A,Croudace T

    update_date:2010-06-01 00:00:00

  • A curvilinear bivariate random changepoint model to assess temporal order of markers.

    abstract::In biomedical research, various longitudinal markers measuring different quantities are often collected over time. For example, repeated measures of psychometric scores are very informative about the degradation process toward dementia. These trajectories are generally nonlinear with an acceleration of the decline a f...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280219898719

    authors: Segalas C,Helmer C,Jacqmin-Gadda H

    update_date:2020-09-01 00:00:00

  • The EM algorithm in medical imaging.

    abstract::This article outlines the statistical developments that have taken place in the use of the EM algorithm in emission and transmission tomography during the past decade or so. We discuss the statistical aspects of the modelling of the projection data for both the emission and transmission cases and define the relevant p...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/096228029700600105

    authors: Kay J

    update_date:1997-03-01 00:00:00

  • Mixture modelling for cluster analysis.

    abstract::Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of clusters g by first fitting a mixture model with g components. An outright clustering of the data is then obtained by assigning an observation to the component to...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1191/0962280204sm372ra

    authors: McLachlan GJ,Chang SU

    update_date:2004-10-01 00:00:00

  • Joint modelling for organ transplantation outcomes for patients with diabetes and the end-stage renal disease.

    abstract::This article is motivated by jointly modelling longitudinal and time-to-event clinical data of patients with diabetes and end-stage renal disease. All patients are on the waiting list for the pancreas transplant after kidney transplant, and some of them have a pancreas transplant before kidney transplant failure or de...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280218786980

    authors: Dong JJ,Wang S,Wang L,Gill J,Cao J

    update_date:2019-09-01 00:00:00

  • Statistical methods in computational anatomy.

    abstract::This paper reviews recent developments by the Washington/Brown groups for the study of anatomical shape in the emerging new discipline of computational anatomy. Parametric representations of anatomical variation for computational anatomy are reviewed, restricted to the assumption of small deformations. The generation ...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article, Review

    doi:10.1177/096228029700600305

    authors: Miller M,Banerjee A,Christensen G,Joshi S,Khaneja N,Grenander U,Matejic L

    update_date:1997-09-01 00:00:00

  • Power and sample size for multivariate logistic modeling of unmatched case-control studies.

    abstract::Sample size calculations are needed to design and assess the feasibility of case-control studies. Although such calculations are readily available for simple case-control designs and univariate analyses, there is limited theory and software for multivariate unconditional logistic analysis of case-control data. Here we...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217737157

    authors: Gail MH,Haneuse S

    update_date:2019-03-01 00:00:00

  • Testing for association in case-control genome-wide association studies with shared controls.

    abstract::The statistical analysis of genome-wide association studies (GWASs) with multiple diseases and shared controls (SCs) is discussed. The usual method for analyzing data from these studies is to compare each individual disease with either the SCs or the pooled controls which include other diseases. We observed that apply...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280212474061

    authors: Chen Z,Huang H,Ng HK

    update_date:2016-04-01 00:00:00

  • A comparison of power analysis methods for evaluating effects of a predictor on slopes in longitudinal designs with missing data.

    abstract::In many longitudinal studies, evaluating the effect of a binary or continuous predictor variable on the rate of change of the outcome, i.e. slope, is often of primary interest. Sample size determination of these studies, however, is complicated by the expectation that missing data will occur due to missed visits, earl...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280212437452

    authors: Wang C,Hall CB,Kim M

    update_date:2015-12-01 00:00:00

  • Parametric models for incomplete continuous and categorical longitudinal data.

    abstract::This paper reviews models for incomplete continuous and categorical longitudinal data. In terms of Rubin's classification of missing value processes we are specifically concerned with the problem of nonrandom missingness. A distinction is drawn between the classes of selection and pattern-mixture models and, using sev...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article, Review

    doi:10.1177/096228029900800105

    authors: Kenward MG,Molenberghs G

    update_date:1999-03-01 00:00:00

  • Measuring agreement in method comparison studies.

    abstract::Agreement between two methods of clinical measurement can be quantified using the differences between observations made using the two methods on the same subjects. The 95% limits of agreement, estimated by mean difference +/- 1.96 standard deviation of the differences, provide an interval within which 95% of differenc...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article, Review

    doi:10.1177/096228029900800204

    authors: Bland JM,Altman DG

    update_date:1999-06-01 00:00:00

  • Statistical challenges in assessing potential efficacy of complex interventions in pilot or feasibility studies.

    abstract::Early phase trials of complex interventions currently focus on assessing the feasibility of a large randomised control trial and on conducting pilot work. Assessing the efficacy of the proposed intervention is generally discouraged, due to concerns of underpowered hypothesis testing. In contrast, early assessment of e...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280215589507

    authors: Wilson DT,Walwyn RE,Brown J,Farrin AJ,Brown SR

    update_date:2016-06-01 00:00:00

  • A monotone data augmentation algorithm for longitudinal data analysis via multivariate skew-t, skew-normal or t distributions.

    abstract::The mixed effects model for repeated measures has been widely used for the analysis of longitudinal clinical data collected at a number of fixed time points. We propose a robust extension of the mixed effects model for repeated measures for skewed and heavy-tailed data on the basis of the multivariate skew-t distribution,...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280219865579

    authors: Tang Y

    update_date:2020-06-01 00:00:00

  • Assessing the reliability of ordered categorical scales using kappa-type statistics.

    abstract::Methods for the analysis of reliability of ordered categorical scales are discussed, focussing on the limitation of the single summary-weighted kappa coefficients. A symmetric matrix of kappa-type coefficients is suggested as an alternative. The method is proposed as being suitable for ordinal scale where there is no ...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1191/0962280205sm413oa

    authors: Roberts C,McNamee R

    update_date:2005-10-01 00:00:00

  • Promoting structural effects of covariates in the cure rate model with penalization.

    abstract::Cure rate models have been widely adopted for characterizing survival data that have long-term survivors. Under a mixture cure rate model where the population is a mixture of cured and susceptible subjects, a primary goal is to study covariate effects on the cure probability and survival function of the susceptible su...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217708684

    authors: Fan X,Liu M,Fang K,Huang Y,Ma S

    update_date:2017-10-01 00:00:00

  • Evaluation of software for multiple imputation of semi-continuous data.

    abstract::It is now widely accepted that multiple imputation (MI) methods properly handle the uncertainty of missing data over single imputation methods. Several standard statistical software packages, such as SAS, R and STATA, have standard procedures or user-written programs to perform MI. The performance of these packages is...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280206074464

    authors: Yu LM,Burton A,Rivero-Arias O

    update_date:2007-06-01 00:00:00

  • Comparing cluster-level dynamic treatment regimens using sequential, multiple assignment, randomized trials: Regression estimation and sample size considerations.

    abstract::Cluster-level dynamic treatment regimens can be used to guide sequential treatment decision-making at the cluster level in order to improve outcomes at the individual or patient-level. In a cluster-level dynamic treatment regimen, the treatment is potentially adapted and re-adapted over time based on changes in the cl...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217708654

    authors: NeCamp T,Kilbourne A,Almirall D

    update_date:2017-08-01 00:00:00

  • Separating variability in healthcare practice patterns from random error.

    abstract::Improving the quality of care that patients receive is a major focus of clinical research, particularly in the setting of cardiovascular hospitalization. Quality improvement studies seek to estimate and visualize the degree of variability in dichotomous treatment patterns and outcomes across different providers, where...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217754230

    authors: Thomas LE,Schulte PJ

    update_date:2019-04-01 00:00:00

  • Optimal scheduling of post-therapeutic follow-up of patients treated for cancer for early detection of relapses.

    abstract::Post-therapeutic surveillance is one important component of cancer care. However, there are still no evidence-based strategies to schedule patients' follow-up examinations. Our approach is based on the modeling of the probability of the onset of relapse at an early asymptomatic or preclinical stage and its transition to ...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280214524178

    authors: Somda SM,Leconte E,Boher JM,Asselain B,Kramar A,Filleron T

    update_date:2016-12-01 00:00:00

  • Inferences about a linear combination of proportions.

    abstract::Statistical methods for carrying out asymptotic inferences (tests or confidence intervals) relative to one or two independent binomial proportions are very frequent. However, inferences about a linear combination of K independent proportions L = Σβ(i)p(i) (in which the first two are special cases) have had very little...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280209347953

    authors: Martín Andrés A,Alvarez Hernández M,Herranz Tejedor I

    update_date:2011-08-01 00:00:00

  • Letter to the editor: Fitting truncated normal distributions.

    abstract::I comment here on a recent paper in this journal, on the fitting of truncated normal distributions by the EM algorithm. I show that the fitting of such distributions by direct numerical maximization of likelihood (rather than EM) is straightforward, contrary to an assertion made by the authors of that paper. ...

    journal_title:Statistical methods in medical research

    pub_type: Comment, Letter

    doi:10.1177/0962280217712089

    authors: MacDonald IL

    update_date:2018-12-01 00:00:00

  • Linear time-dependent reference intervals where there is measurement error in the time variable-a parametric approach.

    abstract::This article re-examines parametric methods for the calculation of time specific reference intervals where there is measurement error present in the time covariate. Previous published work has commonly been based on the standard ordinary least squares approach, weighted where appropriate. In fact, this is an incorrect...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280211426617

    authors: Gillard J

    update_date:2015-12-01 00:00:00

  • Change-point detection for infinite horizon dynamic treatment regimes.

    abstract::A dynamic treatment regime is a set of decision rules for how to treat a patient at multiple time points. At each time point, a treatment decision is made depending on the patient's medical history up to that point. We consider the infinite-horizon setting in which the number of decision points is very large. Specific...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217708655

    authors: Goldberg Y,Pollak M,Mitelpunkt A,Orlovsky M,Weiss-Meilik A,Gorfine M

    update_date:2017-08-01 00:00:00

  • Exposure-response modelling approaches for determining optimal dosing rules in children.

    abstract::Within paediatric populations, there may be distinct age groups characterised by different exposure-response relationships. Several regulatory guidance documents have suggested general age groupings. However, it is not clear whether these categorisations will be suitable for all new medicines and in all disease areas....

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280220903751

    authors: Wadsworth I,Hampson LV,Bornkamp B,Jaki T

    update_date:2020-09-01 00:00:00

  • The application of methods to quantify attributable risk in medical practice.

    abstract::Several epidemiological parameters have been introduced for quantifying the population impact of a certain exposure on morbidity on a population level, termed 'attributable risk' (AR). Of these definitions, the AR as suggested by Levin in 1953 or some algebraic transformations of it are most commonly used. A structure...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/096228020101000305

    authors: Uter W,Pfahlberg A

    update_date:2001-06-01 00:00:00

  • Multilevel growth curve models that incorporate a random coefficient model for the level 1 variance function.

    abstract::Aim: To present a flexible model for repeated measures longitudinal growth data within individuals that allows trends over time to incorporate individual-specific random effects. These may reflect the timing of growth events and characterise within-individual variability which can be modelled as a function of age. Subj...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280217706728

    authors: Goldstein H,Leckie G,Charlton C,Tilling K,Browne WJ

    update_date:2018-11-01 00:00:00

  • Fitting competing risks with an assumed copula.

    abstract::We propose a fully parametric model for the analysis of competing risks data where the types of failure may not be independent. We show how the dependence between the cause-specific survival times can be modelled with a copula function. Features include: identifiability of the problem; accessible understanding of the ...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1191/0962280203sm335ra

    authors: Escarela G,Carrière JF

    update_date:2003-08-01 00:00:00

  • Evaluation of change in CD4+ cell counts in AIDS clinical trials.

    abstract::To evaluate the antiretroviral activity of antiretroviral agents and to compare the effects of two different antiretroviral agents, we propose a non-parametric mixed-effects model to investigate change of CD4+ counts. The proposed model and methods are applied to analyse the data from PACTG345 study. Population and in...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280206075524

    authors: Liang H

    update_date:2008-04-01 00:00:00

  • Bayesian spatially dependent variable selection for small area health modeling.

    abstract::Statistical methods for spatial health data to identify the significant covariates associated with the health outcomes are of critical importance. Most studies have developed variable selection approaches in which the covariates included appear within the spatial domain and their effects are fixed across space. Howeve...

    journal_title:Statistical methods in medical research

    pub_type: Journal Article

    doi:10.1177/0962280215627184

    authors: Choi J,Lawson AB

    update_date:2018-01-01 00:00:00