Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches.

Abstract:

BACKGROUND: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but the full dataset is rarely used for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.

METHODS: We compared several popular machine learning methods, including penalized regressions (e.g., lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.

RESULTS: In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.

CONCLUSIONS: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.
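
The simulation design in METHODS (10 true associations among 1000 variables, scored by true and false positives) can be illustrated with a minimal sketch. Everything beyond those two numbers is an assumption made for illustration and is not taken from the paper: the sample size (n = 300), the block correlation structure, the effect sizes, the penalty value `lam`, and the helper names `simulate`, `elastic_net` and `score`. The solver is a plain coordinate-descent stand-in written with NumPy, not the authors' software, and the sparse group lasso and tree ensemble methods are not covered.

```python
import numpy as np

def simulate(n=300, p=1000, n_true=10, rho=0.5, block=10, seed=0):
    """Simulate block-correlated predictors and a continuous outcome (illustrative settings)."""
    rng = np.random.default_rng(seed)
    # Variables in the same block share a latent factor, so they correlate at about rho.
    shared = np.repeat(rng.standard_normal((n, p // block)), block, axis=1)
    X = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal((n, p))
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize columns
    true_idx = np.arange(0, n_true * block, block)    # one true variable per block
    beta = np.zeros(p)
    beta[true_idx] = 1.0                              # assumed effect size
    y = X @ beta + rng.standard_normal(n)
    return X, y - y.mean(), true_idx

def elastic_net(X, y, lam, alpha=1.0, n_iter=50):
    """Coordinate descent for the elastic net; alpha=1.0 gives the lasso.

    Minimizes (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||^2),
    assuming the columns of X are standardized.
    """
    n, p = X.shape
    b = np.zeros(p)
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            rho_j = X[:, j] @ resid / n + b[j]
            # Soft-threshold for the L1 part, then shrink for the L2 part.
            new = np.sign(rho_j) * max(abs(rho_j) - lam * alpha, 0.0)
            new /= 1.0 + lam * (1.0 - alpha)
            if new != b[j]:
                resid -= X[:, j] * (new - b[j])
                b[j] = new
    return b

def score(coef, true_idx):
    """Return (true positives, false positives) among variables with nonzero coefficients."""
    selected = np.flatnonzero(coef)
    tp = np.intersect1d(selected, true_idx).size
    return tp, selected.size - tp

X, y, true_idx = simulate()
results = {}
for name, alpha in [("lasso", 1.0), ("elastic net", 0.5)]:
    results[name] = score(elastic_net(X, y, lam=0.3, alpha=alpha), true_idx)
    tp, fp = results[name]
    print(f"{name}: {tp} true positives, {fp} false positives")
```

In practice the penalty would be tuned (for example by cross-validation) and the simulation repeated over many seeds before comparing methods; a single run with a fixed `lam` only shows how the true-positive and false-positive counts are computed.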

journal_name

BMC Med Res Methodol

authors

Handorf E,Yin Y,Slifker M,Lynch S

doi

10.1186/s12874-020-01183-9

pub_date

2020-12-10 00:00:00

pages

302

issue

1

issn

1471-2288

pii

10.1186/s12874-020-01183-9

journal_volume

20

pub_type

Journal Article
  • Consensus workshops on the development of an ADHD medication management protocol using QbTest: developing a clinical trial protocol with multidisciplinary stakeholders.

    abstract:BACKGROUND:The study design and protocol that underpin a randomised controlled trial (RCT) are critical for the ultimate success of the trial. Although RCTs are considered the gold standard for research, there are multiple threats to their validity such as participant recruitment and retention, identifying a meaningful...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-019-0772-2

    authors: Hall CL,Brown S,James M,Martin JL,Brown N,Selby K,Clarke J,Williams L,Sayal K,Hollis C,Groom MJ

    update_date:2019-06-18 00:00:00

  • Quality of cause-of-death reporting using ICD-10 drowning codes: a descriptive study of 69 countries.

    abstract:BACKGROUND:The systematic collection of high-quality mortality data is a prerequisite in designing relevant drowning prevention programmes. This descriptive study aimed to assess the quality (i.e., level of specificity) of cause-of-death reporting using ICD-10 drowning codes across 69 countries. METHODS:World Health O...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-10-30

    authors: Lu TH,Lunetta P,Walker S

    update_date:2010-04-08 00:00:00

  • Psychometric properties of a short version of the Job Anxiety Scale.

    abstract:BACKGROUND:Occupational stress and specifically job anxiety are crucial factors in determining health outcomes, job satisfaction as well as performance. In order to assess this phenomenon, the Job Anxiety Scale is one of the instruments available. It consists of 70 items that are clustered in 14 subscales and five dime...

    journal_title:BMC medical research methodology

    pub_type: Journal Article, Indexed Publication

    doi:10.1186/s12874-020-00974-4

    authors: Schmalbach B,Kalkbrenner A,Bassler M,Hinz A,Petrowski K

    update_date:2020-04-21 00:00:00

  • A factorial cluster-randomised controlled trial combining home-environmental and early child development interventions to improve child health and development: rationale, trial design and baseline findings.

    abstract:BACKGROUND:Exposure to unhealthy environments and inadequate child stimulation are main risk factors that affect children's health and wellbeing in low- and middle-income countries. Interventions that simultaneously address several risk factors at the household level have great potential to reduce these negative effect...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-020-00950-y

    authors: Hartinger SM,Nuño N,Hattendorf J,Verastegui H,Karlen W,Ortiz M,Mäusezahl D

    update_date:2020-04-02 00:00:00

  • A statistical model to assess the risk of communicable diseases associated with multiple exposures in healthcare settings.

    abstract:BACKGROUND:The occurrence of communicable diseases (CD) depends on exposure to contagious persons. The effects of exposure to CD are delayed in time and contagious persons remain contagious for several days during which their contagiousness varies. Moreover when multiple exposures occur, it is difficult to know which e...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-13-26

    authors: Payet C,Voirin N,Vanhems P,Ecochard R

    update_date:2013-02-20 00:00:00

  • Random allocation software for parallel group randomized trials.

    abstract:BACKGROUND:Typically, randomization software should allow users to exert control over the different aspects of randomization including block design, provision of unique identifiers and control over the format and type of program output. While some of these characteristics have been addressed by available software, none...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-4-26

    authors: Saghaei M

    update_date:2004-11-09 00:00:00

  • Sample size calculations for cluster randomised controlled trials with a fixed number of clusters.

    abstract:BACKGROUND:Cluster randomised controlled trials (CRCTs) are frequently used in health service evaluation. Assuming an average cluster size, required sample sizes are readily computed for both binary and continuous outcomes, by estimating a design effect or inflation factor. However, where the number of clusters are fix...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-11-102

    authors: Hemming K,Girling AJ,Sitch AJ,Marsh J,Lilford RJ

    update_date:2011-06-30 00:00:00

  • Which resources should be used to identify RCT/CCTs for systematic reviews: a systematic review.

    abstract:BACKGROUND:Systematic reviewers seek to comprehensively search for relevant studies and summarize these to present the most valid estimate of intervention effectiveness. The more resources searched, the higher the yield, and thus time and costs required to conduct a systematic review. While there is an abundance of evi...

    journal_title:BMC medical research methodology

    pub_type: Journal Article, Review

    doi:10.1186/1471-2288-5-24

    authors: Crumley ET,Wiebe N,Cramer K,Klassen TP,Hartling L

    update_date:2005-08-10 00:00:00

  • A proof of principle for using adaptive testing in Routine Outcome Monitoring: the efficiency of the Mood and Anxiety Symptoms Questionnaire -Anhedonic Depression CAT.

    abstract:BACKGROUND:In Routine Outcome Monitoring (ROM) there is a high demand for short assessments. Computerized Adaptive Testing (CAT) is a promising method for efficient assessment. In this article, the efficiency of a CAT version of the Mood and Anxiety Symptom Questionnaire, - Anhedonic Depression scale (MASQ-AD) for use ...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-12-4

    authors: Smits N,Zitman FG,Cuijpers P,den Hollander-Gijsman ME,Carlier IV

    update_date:2012-01-10 00:00:00

  • Estimating cardiovascular disease incidence from prevalence: a spreadsheet based model.

    abstract:BACKGROUND:Disease incidence and prevalence are both core indicators of population health. Incidence is generally not as readily accessible as prevalence. Cohort studies and electronic health record systems are two major ways to estimate disease incidence. The former is time-consuming and expensive; the latter is not av...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-016-0288-y

    authors: Hu XF,Young K,Chan HM

    update_date:2017-01-23 00:00:00

  • Dealing with missing data in a multi-question depression scale: a comparison of imputation methods.

    abstract:BACKGROUND:Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputatio...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-6-57

    authors: Shrive FM,Stuart H,Quan H,Ghali WA

    update_date:2006-12-13 00:00:00

  • Estimating required information size by quantifying diversity in random-effects model meta-analyses.

    abstract:BACKGROUND:There is increasing awareness that meta-analyses require a sufficiently large information size to detect or reject an anticipated intervention effect. The required information size in a meta-analysis may be calculated from an anticipated a priori intervention effect or from an intervention effect suggested b...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-9-86

    authors: Wetterslev J,Thorlund K,Brok J,Gluud C

    update_date:2009-12-30 00:00:00

  • Error in statistical tests of error in statistical tests.

    abstract:BACKGROUND:A recent paper found that terminal digits of statistical values in Nature deviated significantly from an equiprobable distribution, indicating errors or inconsistencies in rounding. This finding, as well as the discovery that a large percentage of p values were inconsistent with reported test statistics, led...

    journal_title:BMC medical research methodology

    pub_type: Comment, Journal Article

    doi:10.1186/1471-2288-6-45

    authors: Jeng M

    update_date:2006-09-13 00:00:00

  • Comparison of retention in observational cohorts and nested simulated HIV vaccine efficacy trials in the key populations in Uganda.

    abstract:BACKGROUND:Outcomes in observational studies may not best estimate those expected in the HIV vaccine efficacy trials. We compared retention in Simulated HIV Vaccine Efficacy Trials (SiVETs) and observational cohorts drawn from two key populations in Uganda. METHODS:Two SiVETs were nested within two observational cohor...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-020-00920-4

    authors: Abaasa A,Todd J,Nash S,Mayanja Y,Kaleebu P,Fast PE,Price M

    update_date:2020-02-12 00:00:00

  • Methods to increase response rates to a population-based maternity survey: a comparison of two pilot studies.

    abstract:BACKGROUND:Surveys are established methods for collecting population data that are unavailable from other sources; however, response rates to surveys are declining. A number of methods have been identified to increase survey returns yet response rates remain low. This paper evaluates the impact of five selected methods...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-019-0702-3

    authors: Harrison S,Henderson J,Alderdice F,Quigley MA

    update_date:2019-03-20 00:00:00

  • Quasi-linear Cox proportional hazards model with cross-L1 penalty.

    abstract:BACKGROUND:To accurately predict the response to treatment, we need a stable and effective risk score that can be calculated from patient characteristics. When we evaluate such risks from time-to-event data with right-censoring, Cox's proportional hazards model is the most popular for estimating the linear risk score. ...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-020-01063-2

    authors: Omae K,Eguchi S

    update_date:2020-07-06 00:00:00

  • Practical considerations for sensitivity analysis after multiple imputation applied to epidemiological studies with incomplete data.

    abstract:BACKGROUND:Multiple Imputation as usually implemented assumes that data are Missing At Random (MAR), meaning that the underlying missing data mechanism, given the observed data, is independent of the unobserved data. To explore the sensitivity of the inferences to departures from the MAR assumption, we applied the meth...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-12-73

    authors: Héraud-Bousquet V,Larsen C,Carpenter J,Desenclos JC,Le Strat Y

    update_date:2012-06-08 00:00:00

  • Network-meta analysis made easy: detection of inconsistency using factorial analysis-of-variance models.

    abstract:BACKGROUND:Network meta-analysis can be used to combine results from several randomized trials involving more than two treatments. Potential inconsistency among different types of trial (designs) differing in the set of treatments tested is a major challenge, and application of procedures for detecting and locating inc...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-14-61

    authors: Piepho HP

    update_date:2014-05-10 00:00:00

  • Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints.

    abstract:BACKGROUND:Modern modelling techniques may potentially provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size ("data hungriness"). METHODS:We performed simulation studies bas...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-14-137

    authors: van der Ploeg T,Austin PC,Steyerberg EW

    update_date:2014-12-22 00:00:00

  • Scratch lottery tickets are a poor incentive to respond to mailed questionnaires.

    abstract:BACKGROUND:It has been demonstrated that the enclosure of money with a mailed questionnaire increases the response rate significantly. We evaluated scratch lottery tickets as an alternative to cash. METHODS:1500 randomly selected Norwegians between the ages of 40 and 65 years were sent a short questionnaire. 250 recei...

    journal_title:BMC medical research methodology

    pub_type: Journal Article, Randomized Controlled Trial

    doi:10.1186/1471-2288-6-19

    authors: Finsen V,Storeheier AH

    update_date:2006-04-28 00:00:00

  • On the censored cost-effectiveness analysis using copula information.

    abstract:BACKGROUND:Information and theory beyond copula concepts are essential to understand the dependence relationship between several marginal covariates distributions. In a therapeutic trial data scheme, most of the time, censoring occurs. That could lead to a biased interpretation of the dependence relationship between ma...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-017-0305-9

    authors: Fontaine C,Daurès JP,Landais P

    update_date:2017-02-15 00:00:00

  • Measuring health-related quality of life in chronic obstructive pulmonary disease: properties of the EQ-5D-5L and PROMIS-43 short form.

    abstract:BACKGROUND:The Patient Reported Outcomes Measurement Information System 43-item short form (PROMIS-43) and the five-level EQ-5D (EQ-5D-5L) are recently developed measures of health-related quality of life (HRQL) that have potentially broad application in evaluating treatments and capturing burden of respiratory-related...

    journal_title:BMC medical research methodology

    pub_type: Journal Article, Multicenter Study

    doi:10.1186/1471-2288-14-78

    authors: Lin FJ,Pickard AS,Krishnan JA,Joo MJ,Au DH,Carson SS,Gillespie S,Henderson AG,Lindenauer PK,McBurnie MA,Mularski RA,Naureckas ET,Vollmer WM,Lee TA,CONCERT Consortium.

    update_date:2014-06-16 00:00:00

  • Recruitment of adolescents with suicidal ideation in the emergency department: lessons from a randomized controlled pilot trial of a youth suicide prevention intervention.

    abstract:BACKGROUND:Emergency Departments (EDs) are a first point-of-contact for many youth with mental health and suicidality concerns and can serve as an effective recruitment source for randomized controlled trials (RCTs) of mental health interventions. However, recruitment in acute care settings is impeded by several challe...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-020-01117-5

    authors: Tracey M,Finkelstein Y,Schachter R,Cleverley K,Monga S,Barwick M,Szatmari P,Moretti ME,Willan A,Henderson J,Korczak DJ

    update_date:2020-09-14 00:00:00

  • Using an onset-anchored Bayesian hierarchical model to improve predictions for amyotrophic lateral sclerosis disease progression.

    abstract:BACKGROUND:Amyotrophic Lateral Sclerosis (ALS), also known as Lou Gehrig's disease, is a rare disease with extreme between-subject variability, especially with respect to rate of disease progression. This makes modelling a subject's disease progression, which is measured by the ALS Functional Rating Scale (ALSFRS), ver...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-018-0479-9

    authors: Karanevich AG,Statland JM,Gajewski BJ,He J

    update_date:2018-02-06 00:00:00

  • Structure formats of randomised controlled trial abstracts: a cross-sectional analysis of their current usage and association with methodology reporting.

    abstract:BACKGROUND:The reporting of randomised controlled trial (RCT) abstracts is of vital importance. The primary objective of this study was to investigate the association between structure format and RCT abstracts' quality of methodology reporting, informed by the current requirement and usage of structure formats by leadi...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-017-0469-3

    authors: Hua F,Walsh T,Glenny AM,Worthington H

    update_date:2018-01-10 00:00:00

  • Adaptive propensity score procedure improves matching in prospective observational trials.

    abstract:BACKGROUND:Randomized controlled trials are the gold-standard for clinical trials. However, randomization is not always feasible. In this article we propose a prospective and adaptive matched case-control trial design assuming that a control group already exists. METHODS:We propose and discuss an interim analysis step...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-019-0763-3

    authors: Weber D,Uhlmann L,Schönenberger S,Kieser M

    update_date:2019-07-16 00:00:00

  • Investigating hospital heterogeneity with a multi-state frailty model: application to nosocomial pneumonia disease in intensive care units.

    abstract:BACKGROUND:Multistate models have become increasingly useful to study the evolution of a patient's state over time in intensive care units (ICUs), e.g. admission, infections, alive discharge or death in ICU. In addition, in critically-ill patients, data come from different ICUs, and because observations are clustered int...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/1471-2288-12-79

    authors: Liquet B,Timsit JF,Rondeau V

    update_date:2012-06-15 00:00:00

  • Awareness of wearing an accelerometer does not affect physical activity in youth.

    abstract:BACKGROUND:This study aimed to investigate whether awareness of being monitored by an accelerometer has an effect on physical activity in young people. METHODS:Eighty healthy participants aged 10-18 years were randomized between blinded and nonblinded groups. The blinded participants were informed that we were testing...

    journal_title:BMC medical research methodology

    pub_type: Journal Article, Randomized Controlled Trial

    doi:10.1186/s12874-017-0378-5

    authors: Vanhelst J,Béghin L,Drumez E,Coopman S,Gottrand F

    update_date:2017-07-11 00:00:00

  • A mixed methods case study investigating how randomised controlled trials (RCTs) are reported, understood and interpreted in practice.

    abstract:BACKGROUND:While randomised controlled trials (RCTs) provide high-quality evidence to guide practice, much routine care is not based upon available RCTs. This disconnect between evidence and practice is not sufficiently well understood. This case study explores this relationship using a novel approach. Better understan...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-020-01009-8

    authors: Byrne BE,Rooshenas L,Lambert HS,Blazeby JM

    update_date:2020-05-12 00:00:00

  • Reference effect measures for quantifying, comparing and visualizing variation from random and fixed effects in non-normal multilevel models, with applications to site variation in medical procedure use and outcomes.

    abstract:BACKGROUND:Multilevel models for non-normal outcomes are widely used in medical and health sciences research. While methods for interpreting fixed effects are well-developed, methods to quantify and interpret random cluster variation and compare it with other sources of variation are less established. Random cluster va...

    journal_title:BMC medical research methodology

    pub_type: Journal Article

    doi:10.1186/s12874-018-0517-7

    authors: Glorioso TJ,Grunwald GK,Ho PM,Maddox TM

    update_date:2018-07-06 00:00:00