Comparing enrichment analysis and machine learning for identifying gene properties that discriminate between gene classes.

Abstract:

Biologists very often use enrichment methods based on statistical hypothesis tests to identify gene properties that are significantly over-represented in a given set of genes of interest, by comparison with a 'background' set of genes. These enrichment methods, although based on rigorous statistical foundations, are not always the best single option to identify patterns in biological data. In many cases, one can also use classification algorithms from the machine-learning field. Unlike enrichment methods, classification algorithms are designed to maximize measures of predictive performance and are capable of analysing combinations of gene properties, instead of one property at a time. In practice, however, the majority of studies use either enrichment or classification methods (rather than both), and there is a lack of literature discussing the pros and cons of both types of method. The goal of this paper is to compare and contrast enrichment and classification methods, offering two contributions. First, we discuss the (to some extent complementary) advantages and disadvantages of both types of methods for identifying gene properties that discriminate between gene classes. Second, we provide a set of high-level recommendations for using enrichment and classification methods. Overall, by highlighting the strengths and the weaknesses of both types of methods we argue that both should be used in bioinformatics analyses.
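To make the enrichment side of this comparison concrete, the following is a minimal sketch of a one-sided hypergeometric over-representation test for a single gene property. It is an illustration under stated assumptions, not the procedure used in the paper: the function name `enrichment_p_value`, its parameters, and the toy counts are all hypothetical, and the abstract does not say which statistical test the authors have in mind.

```python
from math import comb


def enrichment_p_value(n_background, n_with_property, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background    -- number of genes in the background set (hypothetical input)
    n_with_property -- background genes annotated with the property
    n_selected      -- genes in the set of interest (a subset of the background)
    n_overlap       -- genes of interest that carry the property
    """
    max_overlap = min(n_with_property, n_selected)
    # Number of ways to draw the set of interest from the background.
    total = comb(n_background, n_selected)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(
        comb(n_with_property, k)
        * comb(n_background - n_with_property, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (made up): 1000 background genes, 50 with the property,
# 40 genes of interest, 8 of which carry the property.
p = enrichment_p_value(1000, 50, 40, 8)
```

A test of this kind scores one property at a time, which is the limitation the abstract contrasts with classification algorithms that can analyse combinations of properties. In a real analysis, the p-values for many properties would also need a multiple-testing correction.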

journal_name

Brief Bioinform

authors

Fabris F,Palmer D,de Magalhães JP,Freitas AA

doi

10.1093/bib/bbz028

subject

Has Abstract

pub_date

2020-05-21 00:00:00

pages

803-814

issue

3

eissn

1477-4054

issn

1467-5463

pii

5380425

journal_volume

21

pub_type

Journal Article
  • HpQTL: a geometric morphometric platform to compute the genetic architecture of heterophylly.

    abstract::Heterophylly, i.e. morphological changes in leaves along the axis of an individual plant, is regarded as a strategy used by plants to cope with environmental change. However, little is known of the extent to which heterophylly is controlled by genes and how each underlying gene exerts its effect on heterophyllous vari...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbx011

    authors: Sun L,Wang J,Zhu X,Jiang L,Gosik K,Sang M,Sun F,Cheng T,Zhang Q,Wu R

    update_date:2018-07-20 00:00:00

  • RNA-mediated translation regulation in viral genomes: computational advances in the recognition of sequences and structures.

    abstract::RNA structures are widely distributed across all life forms. The global conformation of these structures is defined by a variety of constituent structural units such as helices, hairpin loops, kissing-loop motifs and pseudoknots, which often behave in a modular way. Their ubiquitous distribution is associated with a v...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbz054

    authors: Gupta A,Bansal M

    update_date:2020-07-15 00:00:00

  • New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

    abstract::With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads ...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbt067

    authors: Song K,Ren J,Reinert G,Deng M,Waterman MS,Sun F

    update_date:2014-05-01 00:00:00

  • TrimNet: learning molecular representation from triplet messages for biomedicine.

    abstract:MOTIVATION:Computational methods accelerate drug discovery and play an important role in biomedicine, such as molecular property prediction and compound-protein interaction (CPI) identification. A key challenge is to learn useful molecular representation. In the early years, molecular properties are mainly calculated b...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa266

    authors: Li P,Li Y,Hsieh CY,Zhang S,Liu X,Liu H,Song S,Yao X

    update_date:2020-11-04 00:00:00

  • Opportunities for community awareness platforms in personal genomics and bioinformatics education.

    abstract::Precision and personalized medicine will be increasingly based on the integration of various type of information, particularly electronic health records and genome sequences. The availability of cheap genome sequencing services and the information interoperability will increase the role of online bioinformatics analys...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbw078

    authors: Bianchi L,Liò P

    update_date:2017-11-01 00:00:00

  • Pathway enrichment analysis approach based on topological structure and updated annotation of pathway.

    abstract::Pathway enrichment analysis has been widely used to identify cancer risk pathways, and contributes to elucidating the mechanism of tumorigenesis. However, most of the existing approaches use the outdated pathway information and neglect the complex gene interactions in pathway. Here, we first reviewed the existing wide...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbx091

    authors: Yang Q,Wang S,Dai E,Zhou S,Liu D,Liu H,Meng Q,Jiang B,Jiang W

    update_date:2019-01-18 00:00:00

  • Shaping the nebulous enhancer in the era of high-throughput assays and genome editing.

    abstract::Since the 1st discovery of transcriptional enhancers in 1981, their textbook definition has remained largely unchanged in the past 37 years. With the emergence of high-throughput assays and genome editing, which are switching the paradigm from bottom-up discovery and testing of individual enhancers to top-down profili...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbz030

    authors: Ho EY,Cao Q,Gu M,Chan RW,Wu Q,Gerstein M,Yip KY

    update_date:2020-05-21 00:00:00

  • Structural database resources for biological macromolecules.

    abstract::This Briefing reviews the widely used, currently active, up-to-date databases derived from the worldwide Protein Data Bank (PDB) to facilitate browsing, finding and exploring its entries. These databases contain visualization and analysis tools tailored to specific kinds of molecules and interactions, often including ...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbw049

    authors: Abriata LA

    update_date:2017-07-01 00:00:00

  • A statistical framework for predicting critical regions of p53-dependent enhancers.

    abstract::P53 is the 'guardian of the genome' and is responsible for regulating cell cycle and apoptosis. The genomic p53 binding regions, where activating transcriptional factors and cofactors like p300 simultaneously bind, are called 'p53-dependent enhancers', which play an important role in tumorigenesis. Current experimenta...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa053

    authors: Niu X,Deng K,Liu L,Yang K,Hu X

    update_date:2020-05-11 00:00:00

  • A solid quality-control analysis of AB SOLiD short-read sequencing data.

    abstract::Next generation sequencers have greatly improved our ability to mine polymorphisms and mutations out of entire (or portions of) genomes. The reliability of their outputs, though, showed to be very related to the sequencing chemistry and to deeply affect the quality of the downstream analyses. We focus here on the two-...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbs048

    authors: Castellana S,Romani M,Valente EM,Mazza T

    update_date:2013-11-01 00:00:00

  • Probe mapping across multiple microarray platforms.

    abstract::Access to gene expression data has become increasingly common in recent years; however, analysis has become more difficult as it is often desirable to integrate data from different platforms. Probe mapping across microarray platforms is the first and most crucial step for data integration. In this article, we systemat...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbr076

    authors: Allen JD,Wang S,Chen M,Girard L,Minna JD,Xie Y,Xiao G

    update_date:2012-09-01 00:00:00

  • Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing.

    abstract::Technical advances such as the development of molecular cloning, Sanger sequencing, PCR and oligonucleotide microarrays are key to our current capacity to sequence, annotate and study complete organismal genomes. Recent years have seen the development of a variety of so-called 'next-generation' sequencing platforms, w...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbp046

    authors: Horner DS,Pavesi G,Castrignanò T,De Meo PD,Liuni S,Sammeth M,Picardi E,Pesole G

    update_date:2010-03-01 00:00:00

  • Class-imbalanced classifiers for high-dimensional data.

    abstract::A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class resulting in poor accuracy in the ...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbs006

    authors: Lin WJ,Chen JJ

    update_date:2013-01-01 00:00:00

  • DeepAtomicCharge: a new graph convolutional network-based architecture for accurate prediction of atomic charges.

    abstract::Atomic charges play a very important role in drug-target recognition. However, computation of atomic charges with high-level quantum mechanics (QM) calculations is very time-consuming. A number of machine learning (ML)-based atomic charge prediction methods have been proposed to speed up the calculation of high-accura...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa183

    authors: Wang J,Cao D,Tang C,Xu L,He Q,Yang B,Chen X,Sun H,Hou T

    update_date:2020-08-25 00:00:00

  • Accounting for differential variability in detecting differentially methylated regions.

    abstract::DNA methylation plays an essential role in cancer. Differential variability (DV) in cancer was recently observed that contributes to cancer heterogeneity and has been shown to be crucial in detecting epigenetic field defects, DNA methylation alterations happening early in carcinogenesis. As neighboring CpG sites are h...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbx097

    authors: Wang Y,Teschendorff AE,Widschwendter M,Wang S

    update_date:2019-01-18 00:00:00

  • iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites.

    abstract::Regulation of proteolysis plays a critical role in a myriad of important cellular processes. The key to better understanding the mechanisms that control this process is to identify the specific substrates that each protease targets. To address this, we have developed iProt-Sub, a powerful bioinformatics tool for the a...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bby028

    authors: Song J,Wang Y,Li F,Akutsu T,Rawlings ND,Webb GI,Chou KC

    update_date:2019-03-25 00:00:00

  • A feature-based approach to predict hot spots in protein-DNA binding interfaces.

    abstract::DNA-binding hot spot residues of proteins are dominant and fundamental interface residues that contribute most of the binding free energy of protein-DNA interfaces. As experimental methods for identifying hot spots are expensive and time consuming, computational approaches are urgently required in predicting hot spots...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbz037

    authors: Zhang S,Zhao L,Zheng CH,Xia J

    update_date:2020-05-21 00:00:00

  • Computational prediction of species-specific yeast DNA replication origin via iterative feature representation.

    abstract::Deoxyribonucleic acid replication is one of the most crucial tasks taking place in the cell, and it has to be precisely regulated. This process is initiated in the replication origins (ORIs), and thus it is essential to identify such sites for a deeper understanding of the cellular processes and functions related to t...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa304

    authors: Manavalan B,Basith S,Shin TH,Lee G

    update_date:2020-11-25 00:00:00

  • A brief history of bioinformatics.

    abstract::It is easy for today's students and researchers to believe that modern bioinformatics emerged recently to assist next-generation sequencing data analysis. However, the very beginnings of bioinformatics occurred more than 50 years ago, when desktop computers were still a hypothesis and DNA could not yet be sequenced. T...

    journal_title:Briefings in bioinformatics

    pub_type: Historical Article,Journal Article,Review

    doi:10.1093/bib/bby063

    authors: Gauthier J,Vincent AT,Charette SJ,Derome N

    update_date:2019-11-27 00:00:00

  • Discovery of G-quadruplex-forming sequences in SARS-CoV-2.

    abstract::The outbreak caused by the novel coronavirus SARS-CoV-2 has been declared a global health emergency. G-quadruplex structures in genomes have long been considered essential for regulating a number of biological processes in a plethora of organisms. We have analyzed and identified 25 four contiguous GG runs (G2NxG2NyG2N...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa114

    authors: Ji D,Juhas M,Tsang CM,Kwok CK,Li Y,Zhang Y

    update_date:2020-06-01 00:00:00

  • Exploring the function of genetic variants in the non-coding genomic regions: approaches for identifying human regulatory variants affecting gene expression.

    abstract::Understanding the genetic basis of human traits/diseases and the underlying mechanisms of how these traits/diseases are affected by genetic variations is critical for public health. Current genome-wide functional genomics data uncovered a large number of functional elements in the noncoding regions of human genome, pr...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbu018

    authors: Li MJ,Yan B,Sham PC,Wang J

    update_date:2015-05-01 00:00:00

  • Federating data with Information Integrator.

    abstract::Information Integrator is an extension to IBM's relational database DB2, which uses data federation to provide benefits to molecular biology researchers through two unique capabilities: increased flexibility in combining data from disparate sources, and SQL access to non-SQL data, easing the task of automating data an...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/4.4.375

    authors: Arenson AD

    update_date:2003-12-01 00:00:00

  • Deep learning for brain disorders: from data processing to disease treatment.

    abstract::In order to reach precision medicine and improve patients' quality of life, machine learning is increasingly used in medicine. Brain disorders are often complex and heterogeneous, and several modalities such as demographic, clinical, imaging, genetics and environmental data have been studied to improve their understan...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa310

    authors: Burgos N,Bottani S,Faouzi J,Thibeau-Sutre E,Colliot O

    update_date:2020-12-15 00:00:00

  • Automated glycopeptide analysis--review of current state and future directions.

    abstract::Glycosylation of proteins is involved in immune defense, cell-cell adhesion, cellular recognition and pathogen binding and is one of the most common and complex post-translational modifications. Science is still struggling to assign detailed mechanisms and functions to this form of conjugation. Even the structural ana...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbs045

    authors: Dallas DC,Martin WF,Hua S,German JB

    update_date:2013-05-01 00:00:00

  • Pathogenicity phenomena in three model systems: from network mining to emerging system-level properties.

    abstract::Understanding the interconnections of microbial pathogenicity phenomena, such as biofilm formation, quorum sensing and antimicrobial resistance, is a tremendous open challenge for biomedical research. Progress made by wet-lab researchers and bioinformaticians in understanding the underlying regulatory phenomena has be...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbt071

    authors: Castelhano Santos N,Pereira MO,Lourenço A

    update_date:2015-01-01 00:00:00

  • Exploration of cellular reaction systems.

    abstract::We discuss and review different ways to map cellular components and their temporal interaction with other such components to different non-spatially explicit mathematical models. The essential choices made in the literature are between discrete and continuous state spaces, between rule and event-based state updates an...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article,Review

    doi:10.1093/bib/bbp062

    authors: Kirkilionis M

    update_date:2010-01-01 00:00:00

  • Current development of integrated web servers for preclinical safety and pharmacokinetics assessments in drug development.

    abstract::In drug development, preclinical safety and pharmacokinetics assessments of candidate drugs to ensure the safety profile are a must. While in vivo and in vitro tests are traditionally used, experimental determinations have disadvantages, as they are usually time-consuming and costly. In silico predictions of these pre...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbaa160

    authors: Hsiao Y,Su BH,Tseng YJ

    update_date:2020-08-07 00:00:00

  • Multiple Testing of Gene Sets from Gene Ontology: Possibilities and Pitfalls.

    abstract::The use of multiple testing procedures in the context of gene-set testing is an important but relatively underexposed topic. If a multiple testing method is used, this is usually a standard familywise error rate (FWER) or false discovery rate (FDR) controlling procedure in which the logical relationships that exist be...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbv091

    authors: Meijer RJ,Goeman JJ

    update_date:2016-09-01 00:00:00

  • Reproducible probe-level analysis of the Affymetrix Exon 1.0 ST array with R/Bioconductor.

    abstract::The presence of different transcripts of a gene across samples can be analysed by whole-transcriptome microarrays. Reproducing results from published microarray data represents a challenge owing to the vast amounts of data and the large variety of preprocessing and filtering steps used before the actual analysis is ca...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbt011

    authors: Rodrigo-Domingo M,Waagepetersen R,Bødker JS,Falgreen S,Kjeldsen MK,Johnsen HE,Dybkær K,Bøgsted M

    update_date:2014-07-01 00:00:00

  • A practical guide for the functional annotation of genetic variations using SNPnexus.

    abstract::Broader functional annotation of known as well as putative genetic variations is a valuable mean for prioritizing targets in disease studies and large-scale genotyping projects. In this article, we present a practical guide to SNPnexus, a web-based tool that provides an aggregate set of functional annotations for geno...

    journal_title:Briefings in bioinformatics

    pub_type: Journal Article

    doi:10.1093/bib/bbt004

    authors: Dayem Ullah AZ,Lemoine NR,Chelala C

    update_date:2013-07-01 00:00:00