An unsupervised and customizable misspelling generator for mining noisy health-related text sources.

Abstract:

BACKGROUND: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources.
MATERIALS AND METHODS: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find terms semantically close to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system.
RESULTS: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms demonstrated an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included.
DISCUSSION: Our proposed spelling variant generator has several advantages over past spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.
CONCLUSION: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
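
The procedure outlined under MATERIALS AND METHODS can be illustrated with a minimal Python sketch. It is not the authors' released code: the function names, the toy three-dimensional vectors, the two threshold values, and the use of a length-normalized Levenshtein distance as the lexical measure are all assumptions made for illustration. In particular, the sketch measures lexical similarity against the seed keyword and omits the weighting of intra-word character sequences that the abstract mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def levenshtein(a, b):
    """Edit distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lexical_similarity(a, b):
    """Length-normalized similarity in [0, 1]; an assumed stand-in measure."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


def generate_variants(seed, vectors, sem_threshold=0.8, lex_threshold=0.7):
    """Expand from the seed until no new qualifying term is found.

    vectors: dict mapping term -> embedding (hypothetical toy model).
    Both thresholds are illustrative values, not taken from the paper.
    """
    variants = set()
    frontier = [seed]
    while frontier:
        term = frontier.pop()
        if term not in vectors:
            continue
        for cand, vec in vectors.items():
            if cand == seed or cand in variants:
                continue
            # Semantic filter: candidate must be close to the current term.
            if cosine(vectors[term], vec) < sem_threshold:
                continue
            # Lexical filter: candidate must still resemble the seed keyword.
            if lexical_similarity(seed, cand) < lex_threshold:
                continue
            variants.add(cand)
            frontier.append(cand)  # recurse on the newly found variant
    return variants


# Toy embeddings, invented for this example only.
TOY_VECTORS = {
    "chemotherapy": [0.90, 0.10, 0.00],
    "chemotherpy": [0.88, 0.12, 0.02],    # misspelling, same context
    "chemotheraphy": [0.85, 0.15, 0.00],  # misspelling, same context
    "radiation": [0.80, 0.40, 0.00],      # related meaning, different spelling
    "chemistry": [0.00, 0.20, 0.95],      # similar spelling, different meaning
}

result = generate_variants("chemotherapy", TOY_VECTORS)
```

With these toy inputs, "radiation" passes the semantic filter but is rejected by the lexical one, while "chemistry" is rejected by the semantic filter, so only the two misspellings remain. In practice the vectors would come from a model trained on a large unlabeled corpus, as the abstract describes.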

journal_name

J Biomed Inform

authors

Sarker A,Gonzalez-Hernandez G

doi

10.1016/j.jbi.2018.11.007

subject

Has Abstract

pub_date

2018-12-01 00:00:00

pages

98-107

eissn

1532-0480

issn

1532-0464

pii

S1532-0464(18)30216-8

journal_volume

88

pub_type

Journal Article
  • Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

    abstract::A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metada...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.06.017

    authors: Panahiazar M,Dumontier M,Gevaert O

    update_date:2017-08-01 00:00:00

  • Exploiting the contextual cues for bio-entity name recognition in biomedical literature.

    abstract::To extract biomedical information about bio-entities from the huge amount of biomedical literature, the first key step is recognizing their names in these literatures, which remains a challenging task due to the irregularities and ambiguities in bio-entities nomenclature. The recognition performances of the current po...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2008.01.002

    authors: Yang Z,Lin H,Li Y

    update_date:2008-08-01 00:00:00

  • Unstructured medical image query using big data - An epilepsy case study.

    abstract::Big data technologies are critical to the medical field which requires new frameworks to leverage them. Such frameworks would benefit medical experts to test hypotheses by querying huge volumes of unstructured medical data to provide better patient care. The objective of this work is to implement and examine the feasi...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.12.005

    authors: Istephan S,Siadat MR

    update_date:2016-02-01 00:00:00

  • Neural network-based approaches for biomedical relation classification: A review.

    abstract::The explosive growth of biomedical literature has created a rich source of knowledge, such as that on protein-protein interactions (PPIs) and drug-drug interactions (DDIs), locked in unstructured free text. Biomedical relation classification aims to automatically detect and classify biomedical relations, which has gre...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Review

    doi:10.1016/j.jbi.2019.103294

    authors: Zhang Y,Lin H,Yang Z,Wang J,Sun Y,Xu B,Zhao Z

    update_date:2019-11-01 00:00:00

  • Homology assessment and molecular sequence alignment.

    abstract::Hypotheses of homology are the basis of phylogenetic analysis. All character data are considered to be equivalent regardless of the source of those characters. Putative homology statements are designated based on observations of similarity. Pairwise sequence alignment using the Needleman-Wunsch algorithm is the basis ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Review

    doi:10.1016/j.jbi.2005.11.005

    authors: Phillips AJ

    update_date:2006-02-01 00:00:00

  • Algorithms for rapid outbreak detection: a research synthesis.

    abstract::The threat of bioterrorism has stimulated interest in enhancing public health surveillance to detect disease outbreaks more rapidly than is currently possible. To advance research on improving the timeliness of outbreak detection, the Defense Advanced Research Project Agency sponsored the Bio-event Advanced Leading In...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2004.11.007

    authors: Buckeridge DL,Burkom H,Campbell M,Hogan WR,Moore AW

    update_date:2005-04-01 00:00:00

  • A controlled greedy supervised approach for co-reference resolution on clinical text.

    abstract::Identification of co-referent entity mentions inside text has significant importance for other natural language processing (NLP) tasks (e.g. event linking). However, this task, known as co-reference resolution, remains a complex problem, partly because of the confusion over different evaluation metrics and partly beca...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2013.03.007

    authors: Chowdhury MF,Zweigenbaum P

    update_date:2013-06-01 00:00:00

  • Personal discovery in diabetes self-management: Discovering cause and effect using self-monitoring data.

    abstract:OBJECTIVE:To outline new design directions for informatics solutions that facilitate personal discovery with self-monitoring data. We investigate this question in the context of chronic disease self-management with the focus on type 2 diabetes. MATERIALS AND METHODS:We conducted an observational qualitative study of d...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2017.09.013

    authors: Mamykina L,Heitkemper EM,Smaldone AM,Kukafka R,Cole-Lewis HJ,Davidson PG,Mynatt ED,Cassells A,Tobin JN,Hripcsak G

    update_date:2017-12-01 00:00:00

  • Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review.

    abstract::We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Review

    doi:10.1016/j.jbi.2017.07.012

    authors: Kreimeyer K,Foster M,Pandey A,Arya N,Halford G,Jones SF,Forshee R,Walderhaug M,Botsis T

    update_date:2017-09-01 00:00:00

  • Role of OpenEHR as an open source solution for the regional modelling of patient data in obstetrics.

    abstract::This work investigates, whether openEHR with its reference model, archetypes and templates is suitable for the digital representation of demographic as well as clinical data. Moreover, it elaborates openEHR as a tool for modelling Hospital Information Systems on a regional level based on a national logical infrastruct...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.04.004

    authors: Pahl C,Zare M,Nilashi M,de Faria Borges MA,Weingaertner D,Detschew V,Supriyanto E,Ibrahim O

    update_date:2015-06-01 00:00:00

  • Does the use of structured reporting improve usability? A comparative evaluation of the usability of two approaches for findings reporting in a large-scale telecardiology context.

    abstract::One of the main reasons that leads to a low adoption rate of telemedicine systems is poor usability. An aspect that influences usability during the reporting of findings is the input mode, e.g., if a free-text (FT) or a structured report (SR) interface is employed. The objective of our study is to compare the usabilit...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2014.07.002

    authors: Lacerda TC,von Wangenheim CG,von Wangenheim A,Giuliano I

    update_date:2014-12-01 00:00:00

  • R.A.P.I.D. (Root Aggregated Prioritized Information Display): A single screen display for efficient digital triaging of medical reports.

    abstract:OBJECTIVE:The timely acknowledgement of critical patient clinical reports is vital for the delivery of safe patient care. With current EHR systems, critical reports reside on different screens. This leads to treatment delays and inefficient work flows. As a remedy, the R.A.P.I.D. (Root Aggregated Prioritized Informatio...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article, Randomized Controlled Trial

    doi:10.1016/j.jbi.2016.04.001

    authors: Ford JP,Huang L,Richards DS,Ambinder EP,Rosenberger JL

    update_date:2016-06-01 00:00:00

  • Modified Needleman-Wunsch algorithm for clinical pathway clustering.

    abstract::Clinical pathways are used to guide clinicians to provide a standardised delivery of care. Because of their standardisation, the aim of clinical pathways is to reduce variation in both care process and patient outcomes. When learning clinical pathways from data through data mining, it is common practice to represent e...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2020.103668

    authors: Aspland E,Harper PR,Gartner D,Webb P,Barrett-Lee P

    update_date:2021-01-27 00:00:00

  • Risk factor detection for heart disease by applying text analytics in electronic medical records.

    abstract::In the United States, about 600,000 people die of heart disease every year. The annual cost of care services, medications, and lost productivity reportedly exceeds 108.9 billion dollars. Effective disease risk assessment is critical to prevention, care, and treatment planning. Recent advancements in text analytics hav...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.08.011

    authors: Torii M,Fan JW,Yang WL,Lee T,Wiley MT,Zisook DS,Huang Y

    update_date:2015-12-01 00:00:00

  • Tracking a moving user in indoor environments using Bluetooth low energy beacons.

    abstract:BACKGROUND:Bluetooth low energy (BLE) beacons have been used to track the locations of individuals in indoor environments for clinical applications such as workflow analysis and infectious disease modelling. Most current approaches use the received signal strength indicator (RSSI) to track locations. When using the RSS...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103288

    authors: Surian D,Kim V,Menon R,Dunn AG,Sintchenko V,Coiera E

    update_date:2019-10-01 00:00:00

  • Predicting the function of transplanted kidney in long-term care processes: Application of a hybrid model.

    abstract:BACKGROUND:A tool that can predict the estimated glomerular filtration rate (eGFR) in routine daily care can help clinicians to make better decisions for kidney transplant patients and to improve transplantation outcome. In this paper, we proposed a hybrid prediction model for predicting a future value for eGFR during ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103116

    authors: Rashidi Khazaee P,Bagherzadeh M J,Niazkhani Z,Pirnejad H

    update_date:2019-03-01 00:00:00

  • On the reproducibility of results of pathway analysis in genome-wide expression studies of colorectal cancers.

    abstract::One of the major problems in genomics and medicine is the identification of gene networks and pathways deregulated in complex and polygenic diseases, like cancer. In this paper, we address the problem of assessing the variability of results of pathways analysis identified in different and independent genome wide expre...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2009.09.005

    authors: Maglietta R,Distaso A,Piepoli A,Palumbo O,Carella M,D'Addabbo A,Mukherjee S,Ancona N

    update_date:2010-06-01 00:00:00

  • Knowledge-based personalized search engine for the Web-based Human Musculoskeletal System Resources (HMSR) in biomechanics.

    abstract::Human musculoskeletal system resources of the human body are valuable for the learning and medical purposes. Internet-based information from conventional search engines such as Google or Yahoo cannot response to the need of useful, accurate, reliable and good-quality human musculoskeletal resources related to medical ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2012.11.001

    authors: Dao TT,Hoang TN,Ta XH,Tho MC

    update_date:2013-02-01 00:00:00

  • A genetic algorithm-support vector machine method with parameter optimization for selecting the tag SNPs.

    abstract::SNPs (Single Nucleotide Polymorphisms) include millions of changes in human genome, and therefore, are promising tools for disease-gene association studies. However, this kind of studies is constrained by the high expense of genotyping millions of SNPs. For this reason, it is required to obtain a suitable subset of SN...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2012.12.002

    authors: Ilhan I,Tezel G

    update_date:2013-04-01 00:00:00

  • Personal health information in research: Perceived risk, trustworthiness and opinions from patients attending a tertiary healthcare facility.

    abstract:BACKGROUND:Personal health information is a valuable resource to the advancement of research. In order to achieve a comprehensive reform of data infrastructure in Australia, both public engagement and building social trust is vital. In light of this, we conducted a study to explore the opinions, perceived risks and tru...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103222

    authors: Krahe M,Milligan E,Reilly S

    update_date:2019-07-01 00:00:00

  • Annotating risk factors for heart disease in clinical narratives for diabetic patients.

    abstract::The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on identifying risk factors for heart disease (specifically, Cardiac Artery Disease) in clinical narratives. For this track, we used a "light" annotation paradigm to annotate a set of 1304 longitudinal medical records describing 29...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.05.009

    authors: Stubbs A,Uzuner Ö

    update_date:2015-12-01 00:00:00

  • Characterizing and optimizing human anticancer drug targets based on topological properties in the context of biological pathways.

    abstract::One of the challenging problems in drug discovery is to identify the novel targets for drugs. Most of the traditional methods for drug targets optimization focused on identifying the particular families of "druggable targets", but ignored their topological properties based on the biological pathways. In this study, we...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2015.02.007

    authors: Zhang J,Wang Y,Shang D,Yu F,Liu W,Zhang Y,Feng C,Wang Q,Xu Y,Liu Y,Bai X,Li X,Li C

    update_date:2015-04-01 00:00:00

  • Towards an on-demand peer feedback system for a clinical knowledge base: a case study with order sets.

    abstract:OBJECTIVE:We have developed an automated knowledge base peer feedback system as part of an effort to facilitate the creation and refinement of sound clinical knowledge content within an enterprise-wide knowledge base. The program collects clinical data stored in our Clinical Data Repository during usage of a physician ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2007.05.006

    authors: Hulse NC,Del Fiol G,Bradshaw RL,Roemer LK,Rocha RA

    update_date:2008-02-01 00:00:00

  • Genome-wide analysis of multi-view data of miRNA-seq to identify miRNA biomarkers for stomach cancer.

    abstract::Stomach cancer is one of the leading causes of cancer-related deaths worldwide. More than 80% diagnosis of this cancer occur at later stages leading to low 5-year survival rate. This emphasizes the need to have better prognostic techniques for stomach cancer. In this regard, the Next-Generation Sequencing of whole gen...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2019.103254

    authors: Pant N,Rakshit S,Paul S,Saha I

    update_date:2019-09-01 00:00:00

  • Medical diagnosis of atherosclerosis from Carotid Artery Doppler Signals using principal component analysis (PCA), k-NN based weighting pre-processing and Artificial Immune Recognition System (AIRS).

    abstract::In this study, we proposed a new medical diagnosis system based on principal component analysis (PCA), k-NN based weighting pre-processing, and Artificial Immune Recognition System (AIRS) for diagnosis of atherosclerosis from Carotid Artery Doppler Signals. The suggested system consists of four stages. First, in the f...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2007.04.001

    authors: Latifoğlu F,Polat K,Kara S,Güneş S

    update_date:2008-02-01 00:00:00

  • GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines.

    abstract::The Guideline Interchange Format (GLIF) is a model for representation of sharable computer-interpretable guidelines. The current version of GLIF (GLIF3) is a substantial update and enhancement of the model since the previous version (GLIF2). GLIF3 enables encoding of a guideline at three levels: a conceptual flowchart...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2004.04.002

    authors: Boxwala AA,Peleg M,Tu S,Ogunyemi O,Zeng QT,Wang D,Patel VL,Greenes RA,Shortliffe EH

    update_date:2004-06-01 00:00:00

  • In defense of the Desiderata.

    abstract::A 1998 paper that delineated desirable characteristics, or desiderata for controlled medical terminologies attempted to summarize emerging consensus regarding structural issues of such terminologies. Among the Desiderata was a call for terminologies to be "concept oriented." Since then, research has trended toward the...

    journal_title:Journal of biomedical informatics

    pub_type: Comment, Journal Article

    doi:10.1016/j.jbi.2005.11.008

    authors: Cimino JJ

    update_date:2006-06-01 00:00:00

  • Making sense: sensor-based investigation of clinician activities in complex critical care environments.

    abstract::In many respects, the critical care workplace resembles a paradigmatic complex system: on account of the dynamic and interactive nature of collaborative clinical work, these settings are characterized by non-linear, inter-dependent and emergent activities. Developing a comprehensive understanding of the work activitie...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2011.02.007

    authors: Kannampallil T,Li Z,Zhang M,Cohen T,Robinson DJ,Franklin A,Zhang J,Patel VL

    update_date:2011-06-01 00:00:00

  • A flexible approach to distributed data anonymization.

    abstract::Sensitive biomedical data is often collected from distributed sources, involving different information systems and different organizational units. Local autonomy and legal reasons lead to the need of privacy preserving integration concepts. In this article, we focus on anonymization, which plays an important role for ...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2013.12.002

    authors: Kohlmayer F,Prasser F,Eckert C,Kuhn KA

    update_date:2014-08-01 00:00:00

  • Applying semantic-based probabilistic context-free grammar to medical language processing--a preliminary study on parsing medication sentences.

    abstract::Semantic-based sublanguage grammars have been shown to be an efficient method for medical language processing. However, given the complexity of the medical domain, parsers using such grammars inevitably encounter ambiguous sentences, which could be interpreted by different groups of production rules and consequently r...

    journal_title:Journal of biomedical informatics

    pub_type: Journal Article

    doi:10.1016/j.jbi.2011.08.009

    authors: Xu H,AbdelRahman S,Lu Y,Denny JC,Doan S

    update_date:2011-12-01 00:00:00