Abstract:
BACKGROUND: Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in performance measures estimated from single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at the population level, and simulation studies may be a better alternative for objectively comparing the performance of machine learning algorithms.
METHODS: We compare the classification performance of a number of important and widely used machine learning algorithms, namely Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features.
RESULTS: For a smaller number of correlated features (not exceeding approximately half the sample size), LDA was found to be the method of choice in terms of average generalisation error as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger, provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and exceeds that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN, in some instances where the data are more variable and have smaller effect sizes; in those cases it also provides more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study.
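The simulation idea described in the abstract (repeatedly generating data at fixed factor levels and averaging the test error, rather than judging methods on a single sample) can be sketched as follows. This is a minimal illustration only, not the authors' design: the two classifiers are simplified stand-ins (a nearest-centroid rule, which matches LDA only under an identity covariance assumption, and a plain kNN), the features are independent Gaussians, and all function names, parameters and default values are hypothetical.

```python
import random
import statistics


def simulate(n_per_class, p, effect, rng):
    """Draw n_per_class samples per class; class 1 is shifted by `effect` in every feature."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * effect, 1.0) for _ in range(p)])
            y.append(label)
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid_fit(X, y):
    """Per-class feature means (a stand-in for LDA with identity covariance)."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [statistics.fmean(col) for col in zip(*rows)]
    return cents


def centroid_predict(cents, x):
    """Assign x to the class whose centroid is closest."""
    return min(cents, key=lambda c: sq_dist(cents[c], x))


def knn_predict(X, y, x, k=3):
    """Majority vote among the k nearest training points."""
    nearest = sorted(range(len(X)), key=lambda i: sq_dist(X[i], x))[:k]
    votes = sum(y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def compare(n_train=20, p=10, effect=0.5, n_test=200, reps=30, seed=1):
    """Return {method: (mean test error, SD of test error)} over `reps` replications."""
    rng = random.Random(seed)
    errors = {"centroid": [], "knn": []}
    for _ in range(reps):
        Xtr, ytr = simulate(n_train // 2, p, effect, rng)
        Xte, yte = simulate(n_test // 2, p, effect, rng)
        cents = centroid_fit(Xtr, ytr)
        wrong_c = sum(centroid_predict(cents, x) != t for x, t in zip(Xte, yte))
        wrong_k = sum(knn_predict(Xtr, ytr, x) != t for x, t in zip(Xte, yte))
        errors["centroid"].append(wrong_c / len(yte))
        errors["knn"].append(wrong_k / len(yte))
    return {m: (statistics.fmean(v), statistics.stdev(v)) for m, v in errors.items()}
```

Calling `compare()` at several values of `p`, `n_train` and `effect` gives, for each method, an average error and a spread across replications, which loosely corresponds to the "average generalisation error" and "stability of error estimates" the abstract refers to.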
journal_name: Stat Methods Med Res
journal_title: Statistical methods in medical research
authors: Khondoker M, Dobson R, Skirrow C, Simmons A, Stahl D
doi: 10.1177/0962280213502437
subject: Has Abstract
pub_date: 2016-10-01 00:00:00
pages: 1804-1823
issue: 5
eissn: 0962-2802
issn: 1477-0334
pii: 0962280213502437
journal_volume: 25
pub_type: Journal Article
abstract::Trials run in either rare diseases, such as rare cancers, or rare sub-populations of common diseases are challenging in terms of identifying, recruiting and treating sufficient patients in a sensible period. Treatments for rare diseases are often designed for other disease areas and then later proposed as possible tre...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280216662070
update_date:2018-05-01 00:00:00
abstract::Repeated measures and multivariate outcomes are an increasingly common feature of trials. Their joint analysis by means of random effects and latent variable models is appealing but patterns of heterogeneity in outcome profile may not conform to standard multivariate normal assumptions. In addition, there is much inte...
journal_title:Statistical methods in medical research
pub_type: Journal Article, Review
doi:10.1177/0962280209105016
update_date:2010-06-01 00:00:00
abstract::In biomedical research, various longitudinal markers measuring different quantities are often collected over time. For example, repeated measures of psychometric scores are very informative about the degradation process toward dementia. These trajectories are generally nonlinear with an acceleration of the decline a f...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280219898719
update_date:2020-09-01 00:00:00
abstract::This article outlines the statistical developments that have taken place in the use of the EM algorithm in emission and transmission tomography during the past decade or so. We discuss the statistical aspects of the modelling of the projection data for both the emission and transmission cases and define the relevant p...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/096228029700600105
update_date:1997-03-01 00:00:00
abstract::Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of clusters g by first fitting a mixture model with g components. An outright clustering of the data is then obtained by assigning an observation to the component to...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1191/0962280204sm372ra
update_date:2004-10-01 00:00:00
abstract::This article is motivated by jointly modelling longitudinal and time-to-event clinical data of patients with diabetes and end-stage renal disease. All patients are on the waiting list for the pancreas transplant after kidney transplant, and some of them have a pancreas transplant before kidney transplant failure or de...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280218786980
update_date:2019-09-01 00:00:00
abstract::This paper reviews recent developments by the Washington/Brown groups for the study of anatomical shape in the emerging new discipline of computational anatomy. Parametric representations of anatomical variation for computational anatomy are reviewed, restricted to the assumption of small deformations. The generation ...
journal_title:Statistical methods in medical research
pub_type: Journal Article, Review
doi:10.1177/096228029700600305
update_date:1997-09-01 00:00:00
abstract::Sample size calculations are needed to design and assess the feasibility of case-control studies. Although such calculations are readily available for simple case-control designs and univariate analyses, there is limited theory and software for multivariate unconditional logistic analysis of case-control data. Here we...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217737157
update_date:2019-03-01 00:00:00
abstract::The statistical analysis of genome-wide association studies (GWASs) with multiple diseases and shared controls (SCs) is discussed. The usual method for analyzing data from these studies is to compare each individual disease with either the SCs or the pooled controls which include other diseases. We observed that apply...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280212474061
update_date:2016-04-01 00:00:00
abstract::In many longitudinal studies, evaluating the effect of a binary or continuous predictor variable on the rate of change of the outcome, i.e. slope, is often of primary interest. Sample size determination of these studies, however, is complicated by the expectation that missing data will occur due to missed visits, earl...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280212437452
update_date:2015-12-01 00:00:00
abstract::This paper reviews models for incomplete continuous and categorical longitudinal data. In terms of Rubin's classification of missing value processes we are specifically concerned with the problem of nonrandom missingness. A distinction is drawn between the classes of selection and pattern-mixture models and, using sev...
journal_title:Statistical methods in medical research
pub_type: Journal Article, Review
doi:10.1177/096228029900800105
update_date:1999-03-01 00:00:00
abstract::Agreement between two methods of clinical measurement can be quantified using the differences between observations made using the two methods on the same subjects. The 95% limits of agreement, estimated by mean difference +/- 1.96 standard deviation of the differences, provide an interval within which 95% of differenc...
journal_title:Statistical methods in medical research
pub_type: Journal Article, Review
doi:10.1177/096228029900800204
update_date:1999-06-01 00:00:00
abstract::Early phase trials of complex interventions currently focus on assessing the feasibility of a large randomised control trial and on conducting pilot work. Assessing the efficacy of the proposed intervention is generally discouraged, due to concerns of underpowered hypothesis testing. In contrast, early assessment of e...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280215589507
update_date:2016-06-01 00:00:00
abstract::The mixed effects model for repeated measures has been widely used for the analysis of longitudinal clinical data collected at a number of fixed time points. We propose a robust extension of the mixed effects model for repeated measures for skewed and heavy-tailed data on the basis of the multivariate skew-t distribution,...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280219865579
update_date:2020-06-01 00:00:00
abstract::Methods for the analysis of reliability of ordered categorical scales are discussed, focussing on the limitation of the single summary-weighted kappa coefficients. A symmetric matrix of kappa-type coefficients is suggested as an alternative. The method is proposed as being suitable for ordinal scale where there is no ...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1191/0962280205sm413oa
update_date:2005-10-01 00:00:00
abstract::Cure rate models have been widely adopted for characterizing survival data that have long-term survivors. Under a mixture cure rate model where the population is a mixture of cured and susceptible subjects, a primary goal is to study covariate effects on the cure probability and survival function of the susceptible su...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217708684
update_date:2017-10-01 00:00:00
abstract::It is now widely accepted that multiple imputation (MI) methods properly handle the uncertainty of missing data over single imputation methods. Several standard statistical software packages, such as SAS, R and STATA, have standard procedures or user-written programs to perform MI. The performance of these packages is...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280206074464
update_date:2007-06-01 00:00:00
abstract::Cluster-level dynamic treatment regimens can be used to guide sequential treatment decision-making at the cluster level in order to improve outcomes at the individual or patient-level. In a cluster-level dynamic treatment regimen, the treatment is potentially adapted and re-adapted over time based on changes in the cl...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217708654
update_date:2017-08-01 00:00:00
abstract::Improving the quality of care that patients receive is a major focus of clinical research, particularly in the setting of cardiovascular hospitalization. Quality improvement studies seek to estimate and visualize the degree of variability in dichotomous treatment patterns and outcomes across different providers, where...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217754230
update_date:2019-04-01 00:00:00
abstract::Post-therapeutic surveillance is one important component of cancer care. However, there are still no evidence-based strategies to schedule patients' follow-up examinations. Our approach is based on the modeling of the probability of the onset of relapse at an early asymptomatic or preclinical stage and its transition to ...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280214524178
update_date:2016-12-01 00:00:00
abstract::Statistical methods for carrying out asymptotic inferences (tests or confidence intervals) relative to one or two independent binomial proportions are very frequent. However, inferences about a linear combination of K independent proportions L = Σβ(i)p(i) (in which the first two are special cases) have had very little...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280209347953
update_date:2011-08-01 00:00:00
abstract::I comment here on a recent paper in this journal, on the fitting of truncated normal distributions by the EM algorithm. I show that the fitting of such distributions by direct numerical maximization of likelihood (rather than EM) is straightforward, contrary to an assertion made by the authors of that paper. ...
journal_title:Statistical methods in medical research
pub_type: Comment, Letter
doi:10.1177/0962280217712089
update_date:2018-12-01 00:00:00
abstract::This article re-examines parametric methods for the calculation of time specific reference intervals where there is measurement error present in the time covariate. Previous published work has commonly been based on the standard ordinary least squares approach, weighted where appropriate. In fact, this is an incorrect...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280211426617
update_date:2015-12-01 00:00:00
abstract::A dynamic treatment regime is a set of decision rules for how to treat a patient at multiple time points. At each time point, a treatment decision is made depending on the patient's medical history up to that point. We consider the infinite-horizon setting in which the number of decision points is very large. Specific...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217708655
update_date:2017-08-01 00:00:00
abstract::Within paediatric populations, there may be distinct age groups characterised by different exposure-response relationships. Several regulatory guidance documents have suggested general age groupings. However, it is not clear whether these categorisations will be suitable for all new medicines and in all disease areas....
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280220903751
update_date:2020-09-01 00:00:00
abstract::Several epidemiological parameters have been introduced for quantifying the population impact of a certain exposure on morbidity on a population level, termed 'attributable risk' (AR). Of these definitions, the AR as suggested by Levin in 1953 or some algebraic transformations of it are most commonly used. A structure...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/096228020101000305
update_date:2001-06-01 00:00:00
abstract::Aim: To present a flexible model for repeated measures longitudinal growth data within individuals that allows trends over time to incorporate individual-specific random effects. These may reflect the timing of growth events and characterise within-individual variability which can be modelled as a function of age. Subj...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280217706728
update_date:2018-11-01 00:00:00
abstract::We propose a fully parametric model for the analysis of competing risks data where the types of failure may not be independent. We show how the dependence between the cause-specific survival times can be modelled with a copula function. Features include: identifiability of the problem; accessible understanding of the ...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1191/0962280203sm335ra
update_date:2003-08-01 00:00:00
abstract::To evaluate the antiretroviral activity of antiretroviral agents and to compare the effects of two different antiretroviral agents, we propose a non-parametric mixed-effects model to investigate change of CD4+ counts. The proposed model and methods are applied to analyse the data from PACTG345 study. Population and in...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280206075524
update_date:2008-04-01 00:00:00
abstract::Statistical methods for spatial health data to identify the significant covariates associated with the health outcomes are of critical importance. Most studies have developed variable selection approaches in which the covariates included appear within the spatial domain and their effects are fixed across space. Howeve...
journal_title:Statistical methods in medical research
pub_type: Journal Article
doi:10.1177/0962280215627184
update_date:2018-01-01 00:00:00