Comparative evaluation of set-level techniques in predictive classification of gene expression samples.

Abstract:

BACKGROUND: Analysis of gene expression data in terms of a priori-defined gene sets has recently received significant attention, as this approach typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set-level strategy can also be adopted with similar benefits in predictive classification tasks accomplished with machine learning algorithms. Initial studies into the predictive performance of set-level classifiers have yielded rather controversial results. The goal of this study is to provide a more conclusive evaluation by testing various components of the set-level framework within a large collection of machine learning experiments.

RESULTS: Genuine curated gene sets constitute better features for classification than sets assembled without biological relevance. For identifying the best gene sets for classification, the Global test outperforms the gene-set methods GSEA and SAM-GS as well as two generic feature selection methods. To aggregate expressions of genes into a feature value, the singular value decomposition (SVD) method as well as the SetSig technique improve on simple arithmetic averaging. Set-level classifiers learned with 10 features constituted by the Global test slightly outperform baseline gene-level classifiers learned with all original data features, although they are slightly less accurate than gene-level classifiers learned with a prior feature-selection step.

CONCLUSION: Set-level classifiers do not boost predictive accuracy; however, they do achieve competitive accuracy if learned with the right combination of ingredients.

AVAILABILITY: Open-source, publicly available software was used for classifier learning and testing. The gene expression datasets and the gene set database used are also publicly available. The full tabulation of experimental results is available at http://ida.felk.cvut.cz/CESLT.
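The aggregation step named in the abstract (turning the expressions of a gene set's members into one feature value per sample) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the toy gene names, and the use of power iteration in place of a full SVD routine are all assumptions made here, and the SetSig technique is not shown. The SVD variant centers each gene and scores every sample on the first singular component; the sign of that score is arbitrary.

```python
import math


def _rows(expr, gene_set):
    """Collect expression rows for the genes of a set that occur in the data."""
    rows = [expr[g] for g in gene_set if g in expr]
    if not rows:
        raise ValueError("no gene of the set is present in the expression data")
    return rows


def mean_aggregate(expr, gene_set):
    """Per-sample arithmetic mean over the genes in the set."""
    rows = _rows(expr, gene_set)
    n_samples = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(n_samples)]


def svd_aggregate(expr, gene_set, iters=200):
    """Per-sample score on the first singular component (hypothetical helper).

    Uses power iteration on the gene-centered submatrix instead of a full SVD.
    """
    rows = _rows(expr, gene_set)
    n_genes, n_samples = len(rows), len(rows[0])
    # Center each gene so the component captures variation, not the offset.
    centered = []
    for r in rows:
        m = sum(r) / n_samples
        centered.append([x - m for x in r])
    # Non-uniform start vector over genes, normalized to unit length.
    u = [float(i + 1) for i in range(n_genes)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    for _ in range(iters):
        # v = X^T u (one value per sample), then w = X v (one value per gene).
        v = [sum(centered[i][j] * u[i] for i in range(n_genes))
             for j in range(n_samples)]
        w = [sum(centered[i][j] * v[j] for j in range(n_samples))
             for i in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate case: nothing left to iterate on
            break
        u = [x / norm for x in w]
    # Project every sample onto the dominant gene-space direction.
    return [sum(centered[i][j] * u[i] for i in range(n_genes))
            for j in range(n_samples)]


# Toy data: gene -> expression values across four samples (made-up numbers).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [5.0, 5.0, 5.0, 5.0],
}
toy_set = ["geneA", "geneB"]
print(mean_aggregate(expr, toy_set))
print(svd_aggregate(expr, toy_set))
```

Either function yields one feature value per sample for a given gene set, so a set-level classifier would be trained on a matrix with one such column per selected set.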

journal_name

BMC Bioinformatics

authors

Holec M,Kléma J,Zelezný F,Tolar J

doi

10.1186/1471-2105-13-S10-S15

pub_date

2012-06-25 00:00:00

pages

S15

issn

1471-2105

pii

1471-2105-13-S10-S15

journal_volume

13 Suppl 10

pub_type

Journal Article
  • Progressive multiple sequence alignment with indel evolution.

    abstract:BACKGROUND:Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modell...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2357-1

    authors: Maiolo M,Zhang X,Gil M,Anisimova M

    update_date:2018-09-21 00:00:00

  • Moiety modeling framework for deriving moiety abundances from mass spectrometry measured isotopologues.

    abstract:BACKGROUND:Stable isotope tracing can follow individual atoms through metabolic transformations through the detection of the incorporation of stable isotope within metabolites. This resulting data can be interpreted in terms related to metabolic flux. However, detection of a stable isotope in metabolites by mass spectr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3096-7

    authors: Jin H,Moseley HNB

    update_date:2019-10-28 00:00:00

  • IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning.

    abstract:BACKGROUND:Viral infectious diseases are a serious threat to human health. Receptor binding is the first step in the viral infection of hosts. To more effectively treat human viral infectious diseases, the hidden virus-receptor interactions must be discovered. However, current computational methods for predicti...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3278-3

    authors: Yan C,Duan G,Wu FX,Wang J

    update_date:2019-12-27 00:00:00

  • Large scale statistical inference of signaling pathways from RNAi and microarray data.

    abstract:BACKGROUND:The advent of RNA interference techniques enables the selective silencing of biologically interesting genes in an efficient way. In combination with DNA microarray technology this enables researchers to gain insights into signaling pathways by observing downstream effects of individual knock-downs on gene ex...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-386

    authors: Froehlich H,Fellmann M,Sueltmann H,Poustka A,Beissbarth T

    update_date:2007-10-15 00:00:00

  • MetaMIS: a metagenomic microbial interaction simulator based on microbial community profiles.

    abstract:BACKGROUND:The complexity and dynamics of microbial communities are major factors in the ecology of a system. With the NGS technique, metagenomics data provides a new way to explore microbial interactions. Lotka-Volterra models, which have been widely used to infer animal interactions in dynamic systems, have recently ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1359-0

    authors: Shaw GT,Pao YY,Wang D

    update_date:2016-11-25 00:00:00

  • Prototype semantic infrastructure for automated small molecule classification and annotation in lipidomics.

    abstract:BACKGROUND:The development of high-throughput experimentation has led to astronomical growth in biologically relevant lipids and lipid derivatives identified, screened, and deposited in numerous online databases. Unfortunately, efforts to annotate, classify, and analyze these chemical entities have largely remained in ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-303

    authors: Chepelev LL,Riazanov A,Kouznetsov A,Low HS,Dumontier M,Baker CJ

    update_date:2011-07-26 00:00:00

  • KinMap: a web-based tool for interactive navigation through human kinome data.

    abstract:BACKGROUND:Annotation of the phylogenetic tree of the human kinome is an intuitive way to visualize compound profiling data, structural features of kinases or functional relationships within this important class of proteins. The increasing volume and complexity of kinase-related data underlines the need for a tool tha...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1433-7

    authors: Eid S,Turk S,Volkamer A,Rippmann F,Fulle S

    update_date:2017-01-05 00:00:00

  • Computational approaches to protein inference in shotgun proteomics.

    abstract:Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughp...

    journal_title:BMC bioinformatics

    pub_type: Journal Article, Review

    doi:10.1186/1471-2105-13-S16-S4

    authors: Li YF,Radivojac P

    update_date:2012-01-01 00:00:00

  • Conceptual-level workflow modeling of scientific experiments using NMR as a case study.

    abstract:BACKGROUND:Scientific workflows improve the process of scientific experiments by making computations explicit, underscoring data flow, and emphasizing the participation of humans in the process when intuition and human reasoning are required. Workflows for experiments also highlight transitions among experimental phase...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-31

    authors: Verdi KK,Ellis HJ,Gryk MR

    update_date:2007-01-30 00:00:00

  • ILP-based maximum likelihood genome scaffolding.

    abstract:BACKGROUND:Interest in de novo genome assembly has been renewed in the past decade due to rapid advances in high-throughput sequencing (HTS) technologies which generate relatively short reads resulting in highly fragmented assemblies consisting of contigs. Additional long-range linkage information is typically used to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S9-S9

    authors: Lindsay J,Salooti H,Măndoiu I,Zelikovsky A

    update_date:2014-01-01 00:00:00

  • Integration of shot-gun proteomics and bioinformatics analysis to explore plant hormone responses.

    abstract:BACKGROUND:Multidimensional protein identification technology (MudPIT)-based shot-gun proteomics has been proven to be an effective platform for functional proteomics. In particular, the various sample preparation methods and bioinformatics tools can be integrated to improve the proteomics platform for applications lik...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S15-S8

    authors: Zhang Y,Liu S,Dai SY,Yuan JS

    update_date:2012-01-01 00:00:00

  • Prediction of dinucleotide-specific RNA-binding sites in proteins.

    abstract:BACKGROUND:Regulation of gene expression, protein synthesis, replication and assembly of many viruses involve RNA-protein interactions. Although some successful computational tools have been reported to recognize RNA binding sites in proteins, the problem of specificity remains poorly investigated. After the nucleotide...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S13-S5

    authors: Fernandez M,Kumagai Y,Standley DM,Sarai A,Mizuguchi K,Ahmad S

    update_date:2011-01-01 00:00:00

  • Machine learning for discovering missing or wrong protein function annotations : A comparison using updated benchmark datasets.

    abstract:BACKGROUND:A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often infeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article, Review

    doi:10.1186/s12859-019-3060-6

    authors: Nakano FK,Lietaert M,Vens C

    update_date:2019-09-23 00:00:00

  • An integrated approach to the prediction of domain-domain interactions.

    abstract:BACKGROUND:The development of high-throughput technologies has produced several large scale protein interaction data sets for multiple species, and significant efforts have been made to analyze the data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain i...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-269

    authors: Lee H,Deng M,Sun F,Chen T

    update_date:2006-05-25 00:00:00

  • LAVA: an open-source approach to designing LAMP (loop-mediated isothermal amplification) DNA signatures.

    abstract:BACKGROUND:We developed an extendable open-source Loop-mediated isothermal AMPlification (LAMP) signature design program called LAVA (LAMP Assay Versatile Analysis). LAVA was created in response to limitations of existing LAMP signature programs. RESULTS:LAVA identifies combinations of six primer regions for basic LAM...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-240

    authors: Torres C,Vitalis EA,Baker BR,Gardner SN,Torres MW,Dzenitis JM

    update_date:2011-06-16 00:00:00

  • Combining calls from multiple somatic mutation-callers.

    abstract:BACKGROUND:Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still challenging and there is unlikely to be one established caller that systematically outperfor...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-154

    authors: Kim SY,Jacob L,Speed TP

    update_date:2014-05-21 00:00:00

  • Genomic prediction of tuberculosis drug-resistance: benchmarking existing databases and prediction algorithms.

    abstract:BACKGROUND:It is possible to predict whether a tuberculosis (TB) patient will fail to respond to specific antibiotics by sequencing the genome of the infecting Mycobacterium tuberculosis (Mtb) and observing whether the pathogen carries specific mutations at drug-resistance sites. This advancement has led to the collati...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2658-z

    authors: Ngo TM,Teo YY

    update_date:2019-02-08 00:00:00

  • Network-based analysis of comorbidities risk during an infection: SARS and HIV case studies.

    abstract:BACKGROUND:Infections are often associated with comorbidity that increases the risk of medical conditions which can lead to further morbidity and mortality. SARS is a threat which is similar to the MERS virus, but the comorbidity is the key aspect to underline their different impacts. One UK doctor says "I'd rather have HIV ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-333

    authors: Moni MA,Liò P

    update_date:2014-10-24 00:00:00

  • Inclusion of the fitness sharing technique in an evolutionary algorithm to analyze the fitness landscape of the genetic code adaptability.

    abstract:BACKGROUND:The canonical code, although prevailing in complex genomes, is not universal. The canonical genetic code has been shown to be more robust than random codes, but it is not clearly determined how it evolved towards its current form. The error minimization theory considers the minimization of point mutat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1608-x

    authors: Santos J,Monteagudo Á

    update_date:2017-03-27 00:00:00

  • Learning by aggregating experts and filtering novices: a solution to crowdsourcing problems in bioinformatics.

    abstract:BACKGROUND:In many biomedical applications, there is a need for developing classification models based on noisy annotations. Recently, various methods addressed this scenario by relying on unreliable annotations obtained from multiple sources. RESULTS:We proposed a probabilistic classification algorithm based on labe...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S12-S5

    authors: Zhang P,Cao W,Obradovic Z

    update_date:2013-01-01 00:00:00

  • Identification of novel alternative splicing biomarkers for breast cancer with LC/MS/MS and RNA-Seq.

    abstract:BACKGROUND:Alternative splicing isoforms have been reported as a new and robust class of diagnostic biomarkers. Over 95% of human genes are estimated to be alternatively spliced as a powerful means of producing functionally diverse proteins from a single gene. The emergence of next-generation sequencing technologies, e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03824-8

    authors: Zhang F,Deng CK,Wang M,Deng B,Barber R,Huang G

    update_date:2020-12-03 00:00:00

  • A new pooling strategy for high-throughput screening: the Shifted Transversal Design.

    abstract:BACKGROUND:In binary high-throughput screening projects where the goal is the identification of low-frequency events, beyond the obvious issue of efficiency, false positives and false negatives are a major concern. Pooling constitutes a natural solution: it reduces the number of tests, while providing critical duplicat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-28

    authors: Thierry-Mieg N

    update_date:2006-01-19 00:00:00

  • RWRMTN: a tool for predicting disease-associated microRNAs based on a microRNA-target gene network.

    abstract:BACKGROUND:The misregulation of microRNA (miRNA) has been shown to cause diseases. Recently, we have proposed a computational method based on a random walk framework on a miRNA-target gene network to predict disease-associated miRNAs. The prediction performance of our method is better than that of some existing state-o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03578-3

    authors: Le DH,Tran TTH

    update_date:2020-06-15 00:00:00

  • A molecular model of the full-length human NOD-like receptor family CARD domain containing 5 (NLRC5) protein.

    abstract:BACKGROUND:Pattern recognition receptors of the immune system have key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-275

    authors: Mótyán JA,Bagossi P,Benkő S,Tőzsér J

    update_date:2013-09-17 00:00:00

  • Notos - a galaxy tool to analyze CpN observed expected ratios for inferring DNA methylation types.

    abstract:BACKGROUND:DNA methylation patterns store epigenetic information in the vast majority of eukaryotic species. The relatively high costs and technical challenges associated with the detection of DNA methylation however have created a bias in the number of methylation studies towards model organisms. Consequently, it rema...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2115-4

    authors: Bulla I,Aliaga B,Lacal V,Bulla J,Grunau C,Chaparro C

    update_date:2018-03-27 00:00:00

  • OscoNet: inferring oscillatory gene networks.

    abstract:BACKGROUND:Oscillatory genes, with periodic expression at the mRNA and/or protein level, have been shown to play a pivotal role in many biological contexts. However, with the exception of the circadian clock and cell cycle, only a few such genes are known. Detecting oscillatory genes from snapshot single-cell experimen...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03561-y

    authors: Cutillo L,Boukouvalas A,Marinopoulou E,Papalopulu N,Rattray M

    update_date:2020-08-21 00:00:00

  • Evaluation of gene-expression clustering via mutual information distance measure.

    abstract:BACKGROUND:The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pears...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-111

    authors: Priness I,Maimon O,Ben-Gal I

    update_date:2007-03-30 00:00:00

  • hsegHMM: hidden Markov model-based allele-specific copy number alteration analysis accounting for hypersegmentation.

    abstract:BACKGROUND:Somatic copy number alteration (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. The allele-specific SCNA analysis of a tumor sample aims to identify the allele-specific copy numbers of both alleles, adjusting for the ploidy and the tumor purity. Next gene...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2412-y

    authors: Choo-Wosoba H,Albert PS,Zhu B

    update_date:2018-11-14 00:00:00

  • Pushing the accuracy limit of shape complementarity for protein-protein docking.

    abstract:BACKGROUND:Protein-protein docking is a valuable computational approach for investigating protein-protein interactions. Shape complementarity is the most basic component of a scoring function and plays an important role in protein-protein docking. Despite significant progress, shape representation remains an open que...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3270-y

    authors: Yan Y,Huang SY

    update_date:2019-12-24 00:00:00

  • A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases.

    abstract:BACKGROUND:The PathoLogic program constructs Pathway/Genome databases by using a genome's annotation to predict the set of metabolic pathways present in an organism. PathoLogic determines the set of reactions composing those pathways from the enzymes annotated in the organism's genome. Most annotation efforts fail to a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-76

    authors: Green ML,Karp PD

    update_date:2004-06-09 00:00:00