Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes.

Abstract:

INTRODUCTION: Machine learning (ML) and natural language processing have great potential to improve information extraction (IE) within electronic medical records (EMRs) for a wide variety of clinical search and summarization tools. Despite ML advancements, clinical adoption of real time IE tools for patient care remains low. Clinically motivated IE task definitions, publicly available annotated clinical datasets, and inclusion of subtasks such as coreference resolution and named entity normalization are critical for the development of useful clinical tools.

MATERIALS AND METHODS: We provide a task definition and comprehensive annotation requirements for a clinically motivated symptom extraction task. Four annotators labeled symptom mentions within 1108 discharge summaries from two public clinical note datasets for the tasks of named entity recognition, coreference resolution, and named entity normalization; these annotations will be released to the public. Baseline human performance was assessed and two ML models were evaluated on the symptom extraction task.

RESULTS: 16,922 symptom mentions were identified within the discharge summaries, with 11,944 symptom instances after coreference resolution and 1255 unique normalized answer forms. Human annotator performance averaged 92.2% F1. Recurrent network model performance was 85.6% F1 (recall 85.8%, precision 85.4%), and Transformer-based model performance was 86.3% F1 (recall 86.6%, precision 86.1%). Our models extracted vague symptoms, acronyms, typographical errors, and grouping statements. The models generalized effectively to a separate clinical note corpus and can run in real time.

CONCLUSION: To our knowledge, this dataset will be the largest and most comprehensive publicly released, annotated dataset for clinically motivated symptom extraction, as it includes annotations for named entity recognition, coreference, and normalization for more than 1000 clinical documents. Our neural network models extracted symptoms from unstructured clinical free text at near human performance in real time. In this paper, we present a clinically motivated task definition, dataset, and simple supervised natural language processing models to demonstrate the feasibility of building clinically applicable information extraction tools.
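The F1 figures quoted in the abstract can be sanity-checked against the reported precision and recall. The sketch below assumes F1 is the usual harmonic mean of the aggregate precision and recall; the abstract does not say how the authors averaged their scores (for example micro vs. macro, or exact vs. partial span matching), so this is only a consistency check, not a reproduction of their evaluation. The function name `f1_score` and the `reported` dictionary are illustrative, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F1 as quoted in the abstract (expressed as fractions).
reported = {
    "recurrent": {"precision": 0.854, "recall": 0.858, "f1": 0.856},
    "transformer": {"precision": 0.861, "recall": 0.866, "f1": 0.863},
}

for name, m in reported.items():
    computed = f1_score(m["precision"], m["recall"])
    # Compare the recomputed value with the reported one at three decimals.
    print(f"{name}: computed F1 = {computed:.3f}, reported F1 = {m['f1']:.3f}")
```

Under that assumption, both recomputed values round to the reported 85.6% and 86.3%, so the three numbers given for each model are mutually consistent.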

journal_name

J Biomed Inform

authors

Steinkamp JM,Bala W,Sharma A,Kantrowitz JJ

doi

10.1016/j.jbi.2019.103354

subject

Has Abstract

pub_date

2020-02-01 00:00:00

pages

103354

eissn

1532-0480

issn

1532-0464

pii

S1532-0464(19)30276-X

journal_volume

102

pub_type

Journal Article
  • Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction.

    abstract:OBJECTIVE:To develop an effective and scalable individual-level patient cost prediction method by automatically learning hidden temporal patterns from multivariate time series data in patient insurance claims using a convolutional neural network (CNN) architecture. METHODS:We used three years of medical and pharmacy c...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2020.103565

    authors: Morid MA,Sheng ORL,Kawamoto K,Abdelrahman S

    update_date: 2020-11-01 00:00:00

  • A survey on literature based discovery approaches in biomedical domain.

    abstract::Literature Based Discovery (LBD) refers to the problem of inferring new and interesting knowledge by logically connecting independent fragments of information units through explicit or implicit means. This area of research, which incorporates techniques from Natural Language Processing (NLP), Information Retrieval and...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Review

    doi:10.1016/j.jbi.2019.103141

    authors: Gopalakrishnan V,Jha K,Jin W,Zhang A

    update_date: 2019-05-01 00:00:00

  • Developing EHR-driven heart failure risk prediction models using CPXR(Log) with the probabilistic loss function.

    abstract::Computerized survival prediction in healthcare identifying the risk of disease mortality, helps healthcare providers to effectively manage their patients by providing appropriate treatment options. In this study, we propose to apply a classification algorithm, Contrast Pattern Aided Logistic Regression (CPXR(Log)) wit...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2016.01.009

    authors: Taslimitehrani V,Dong G,Pereira NL,Panahiazar M,Pathak J

    update_date: 2016-04-01 00:00:00

  • The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships.

    abstract::Corpora with specific entities and relationships annotated are essential to train and evaluate text-mining systems that are developed to extract specific structured information from a large corpus. In this paper we describe an approach where a named-entity recognition system produces a first annotation and annotators ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2012.04.004

    authors: van Mulligen EM,Fourrier-Reglat A,Gurwitz D,Molokhia M,Nieto A,Trifiro G,Kors JA,Furlong LI

    update_date: 2012-10-01 00:00:00

  • MorphoCol: An ontology-based knowledgebase for the characterisation of clinically significant bacterial colony morphologies.

    abstract:BACKGROUND:One of the major concerns of the biomedical community is the increasing prevalence of antimicrobial resistant microorganisms. Recent findings show that the diversification of colony morphology may be indicative of the expression of virulence factors and increased resistance to antibiotic therapeutics. To tra...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.03.007

    authors: Sousa AM,Pereira MO,Lourenço A

    update_date: 2015-06-01 00:00:00

  • Mapping high-dimensional data onto a relative distance plane--an exact method for visualizing and characterizing high-dimensional patterns.

    abstract::We introduce a distance (similarity)-based mapping for the visualization of high-dimensional patterns and their relative relationships. The mapping preserves exactly the original distances between points with respect to any two reference patterns in a special two-dimensional coordinate system, the relative distance pl...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2004.07.005

    authors: Somorjai RL,Dolenko B,Demko A,Mandelzweig M,Nikulin AE,Baumgartner R,Pizzi NJ

    update_date: 2004-10-01 00:00:00

  • A knowledge-based system to find over-the-counter medicines for self-medication.

    abstract::This study developed a medicine query system based on Semantic Web and open data especially for self-medication users to search over-the-counter (OTC) medicines. Most existing medicine query systems are based on keyword searches. If users are uncertain about the exact search words, these query systems do not offer eff...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2020.103504

    authors: Sung HY,Chi YL

    update_date: 2020-08-01 00:00:00

  • A Bayesian system to detect and characterize overlapping outbreaks.

    abstract::Outbreaks of infectious diseases such as influenza are a significant threat to human health. Because there are different strains of influenza which can cause independent outbreaks, and influenza can affect demographic groups at different rates and times, there is a need to recognize and characterize multiple outbreaks...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.08.003

    authors: Aronis JM,Millett NE,Wagner MM,Tsui F,Ye Y,Ferraro JP,Haug PJ,Gesteland PH,Cooper GF

    update_date: 2017-09-01 00:00:00

  • Homology assessment and molecular sequence alignment.

    abstract::Hypotheses of homology are the basis of phylogenetic analysis. All character data are considered to be equivalent regardless of the source of those characters. Putative homology statements are designated based on observations of similarity. Pairwise sequence alignment using the Needleman-Wunsch algorithm is the basis ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Review

    doi:10.1016/j.jbi.2005.11.005

    authors: Phillips AJ

    update_date: 2006-02-01 00:00:00

  • Serum cancer biomarker discovery through analysis of gene expression data sets across multiple tumor and normal tissues.

    abstract::The development of convenient serum bioassays for cancer screening, diagnosis, prognosis, and monitoring of treatment is one of top priorities in cancer research community. Although numerous biomarker candidates have been generated by applying high-throughput technologies such as transcriptomics, proteomics, and metab...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2011.08.010

    authors: Jin H,Lee HC,Park SS,Jeong YS,Kim SY

    update_date: 2011-12-01 00:00:00

  • Knowledge-based personalized search engine for the Web-based Human Musculoskeletal System Resources (HMSR) in biomechanics.

    abstract::Human musculoskeletal system resources of the human body are valuable for the learning and medical purposes. Internet-based information from conventional search engines such as Google or Yahoo cannot response to the need of useful, accurate, reliable and good-quality human musculoskeletal resources related to medical ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2012.11.001

    authors: Dao TT,Hoang TN,Ta XH,Tho MC

    update_date: 2013-02-01 00:00:00

  • A hybrid of whale optimization and late acceptance hill climbing based imputation to enhance classification performance in electronic health records.

    abstract::Electronic health records (EHR) are a major source of information in biomedical informatics. Yet, missing values are prominent characteristics of EHR. Prediction on dataset with missing values results in inaccurate inferences. Nearest neighbour imputation based on lazy learning approach is a proven technique for missi...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103190

    authors: Nagarajan G,Dhinesh Babu LD

    update_date: 2019-06-01 00:00:00

  • PharmActa: Personalized pharmaceutical care eHealth platform for patients and pharmacists.

    abstract::Community pharmacists are critically placed in the patient care chain being an extended frontline within primary healthcare networks across Europe. They are trained to ensure safe and effective medication use, a crucial and responsible role, extending beyond the common misconception limited to just providing timely ac...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103336

    authors: Spanakis M,Sfakianakis S,Kallergis G,Spanakis EG,Sakkalis V

    update_date: 2019-12-01 00:00:00

  • Molecular property diagnostic suite for diabetes mellitus (MPDSDM): An integrated web portal for drug discovery and drug repurposing.

    abstract::Molecular Property Diagnostic Suite - Diabetes Mellitus (MPDSDM) is a Galaxy-based, open source disease-specific web portal for diabetes. It consists of three modules namely (i) data library (ii) data processing and (iii) data analysis tools. The data library (target library and literature) module provide extensive an...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2018.08.003

    authors: Gaur AS,Nagamani S,Tanneeru K,Druzhilovskiy D,Rudik A,Poroikov V,Narahari Sastry G

    update_date: 2018-09-01 00:00:00

  • A kernel-based clustering method for gene selection with gene expression data.

    abstract::Gene selection is important for cancer classification based on gene expression data, because of high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It searches for best weights of ge...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2016.05.007

    authors: Chen H,Zhang Y,Gutman I

    update_date: 2016-08-01 00:00:00

  • A flexible approach to distributed data anonymization.

    abstract::Sensitive biomedical data is often collected from distributed sources, involving different information systems and different organizational units. Local autonomy and legal reasons lead to the need of privacy preserving integration concepts. In this article, we focus on anonymization, which plays an important role for ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2013.12.002

    authors: Kohlmayer F,Prasser F,Eckert C,Kuhn KA

    update_date: 2014-08-01 00:00:00

  • A pilot study of a heuristic algorithm for novel template identification from VA electronic medical record text.

    abstract:RATIONALE:Templates in text notes pose challenges for automated information extraction algorithms. We propose a method that identifies novel templates in plain text medical notes. The identification can then be used to either include or exclude templates when processing notes for information extraction. METHODS:The tw...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2016.07.019

    authors: Redd AM,Gundlapalli AV,Divita G,Carter ME,Tran LT,Samore MH

    update_date: 2017-07-01 00:00:00

  • Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.

    abstract::The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.07.020

    authors: Stubbs A,Uzuner Ö

    update_date: 2015-12-01 00:00:00

  • 3D interactive surgical visualization system using mobile spatial information acquisition and autostereoscopic display.

    abstract::Three-dimensional (3D) visualization of preoperative and intraoperative medical information becomes more and more important in minimally invasive surgery. We develop a 3D interactive surgical visualization system using mobile spatial information acquisition and autostereoscopic display for surgeons to observe surgical...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.05.014

    authors: Fan Z,Weng Y,Chen G,Liao H

    update_date: 2017-07-01 00:00:00

  • Making sense: sensor-based investigation of clinician activities in complex critical care environments.

    abstract::In many respects, the critical care workplace resembles a paradigmatic complex system: on account of the dynamic and interactive nature of collaborative clinical work, these settings are characterized by non-linear, inter-dependent and emergent activities. Developing a comprehensive understanding of the work activitie...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2011.02.007

    authors: Kannampallil T,Li Z,Zhang M,Cohen T,Robinson DJ,Franklin A,Zhang J,Patel VL

    update_date: 2011-06-01 00:00:00

  • Predicting severe clinical events by learning about life-saving actions and outcomes using distant supervision.

    abstract::Medical error is a leading cause of patient death in the United States. Among the different types of medical errors, harm to patients caused by doctors missing early signs of deterioration is especially challenging to address due to the heterogeneity of patients' physiological patterns. In this study, we implemented r...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2020.103425

    authors: Lee DH,Yetisgen M,Vanderwende L,Horvitz E

    update_date: 2020-07-01 00:00:00

  • Spectral-dynamic representation of DNA sequences.

    abstract::A graphical representation of DNA sequences in which the distribution of a particular base B=A,C,G,T is represented by a set of discrete lines has been formulated. The methodology of this approach has been borrowed from two areas of physics: spectroscopy and dynamics. Consequently, the set of discrete lines is referre...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.06.001

    authors: Bielińska-Wąż D,Wąż P

    update_date: 2017-08-01 00:00:00

  • Lessons learnt from the DDIExtraction-2013 Shared Task.

    abstract::The DDIExtraction Shared Task 2013 is the second edition of the DDIExtraction Shared Task series, a community-wide effort to promote the implementation and comparative assessment of natural language processing (NLP) techniques in the field of the pharmacovigilance domain, in particular, to address the extraction of dr...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2014.05.007

    authors: Segura-Bedmar I,Martínez P,Herrero-Zazo M

    update_date: 2014-10-01 00:00:00

  • Enhancing phylogeography by improving geographical information from GenBank.

    abstract::Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as location of a species or viral host is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2011.06.005

    authors: Scotch M,Sarkar IN,Mei C,Leaman R,Cheung KH,Ortiz P,Singraur A,Gonzalez G

    update_date: 2011-12-01 00:00:00

  • Exploring the associations between drug side-effects and therapeutic indications.

    abstract::Drug therapeutic indications and side-effects are both measurable patient phenotype changes in response to the treatment. Inferring potential drug therapeutic indications and identifying clinically interesting drug side-effects are both important and challenging tasks. Previous studies have utilized either chemical st...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2014.03.014

    authors: Wang F,Zhang P,Cao N,Hu J,Sorrentino R

    update_date: 2014-10-01 00:00:00

  • Semi-supervised medical entity recognition: A study on Spanish and Swedish clinical corpora.

    abstract:OBJECTIVE:The goal of this study is to investigate entity recognition within Electronic Health Records (EHRs) focusing on Spanish and Swedish. Of particular importance is a robust representation of the entities. In our case, we utilized unsupervised methods to generate such representations. METHODS:The significance of...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.05.009

    authors: Pérez A,Weegar R,Casillas A,Gojenola K,Oronoz M,Dalianis H

    update_date: 2017-07-01 00:00:00

  • An image score inference system for RNAi genome-wide screening based on fuzzy mixture regression modeling.

    abstract::With recent advances in fluorescence microscopy imaging techniques and methods of gene knock down by RNA interference (RNAi), genome-scale high-content screening (HCS) has emerged as a powerful approach to systematically identify all parts of complex biological processes. However, a critical barrier preventing fulfill...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2008.04.007

    authors: Wang J,Zhou X,Li F,Bradley PL,Chang SF,Perrimon N,Wong ST

    update_date: 2009-02-01 00:00:00

  • Automatic signal extraction, prioritizing and filtering approaches in detecting post-marketing cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS).

    abstract:OBJECTIVE:Targeted drugs dramatically improve the treatment outcomes in cancer patients; however, these innovative drugs are often associated with unexpectedly high cardiovascular toxicity. Currently, cardiovascular safety represents both a challenging issue for drug developers, regulators, researchers, and clinicians ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2013.10.008

    authors: Xu R,Wang Q

    update_date: 2014-02-01 00:00:00

  • Medical speciality classification system based on binary particle swarms and ensemble of one vs. rest support vector machines.

    abstract::Nowadays, artificial intelligence plays an integral role in medical and healthcare informatics. Developing an automatic question classification and answering system is essential for coping with constant advancements in science and technology. However, efficient online medical services are required to promote offline m...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2020.103525

    authors: Faris H,Habib M,Faris M,Alomari M,Alomari A

    update_date: 2020-09-01 00:00:00

  • Selecting significant genes by randomization test for cancer classification using gene expression data.

    abstract::Gene selection is an important task in bioinformatics studies, because the accuracy of cancer classification generally depends upon the genes that have biological relevance to the classifying problems. In this work, randomization test (RT) is used as a gene selection method for dealing with gene expression data. In th...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2013.03.009

    authors: Mao Z,Cai W,Shao X

    update_date: 2013-08-01 00:00:00