Class prediction for high-dimensional class-imbalanced data.

Abstract:

BACKGROUND:The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. RESULTS:Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. CONCLUSIONS:Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Blagus R,Lusa L

doi

10.1186/1471-2105-11-523

subject

Has Abstract

pub_date

2010-10-20 00:00:00

pages

523

issn

1471-2105

pii

1471-2105-11-523

journal_volume

11

pub_type

杂志文章
  • R/BHC: fast Bayesian hierarchical clustering for microarray data.

    abstract:BACKGROUND:Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained. RESULTS:We present an R/Bioconductor port of a fast novel algorithm for...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-242

    authors: Savage RS,Heller K,Xu Y,Ghahramani Z,Truman WM,Grant M,Denby KJ,Wild DL

    更新日期:2009-08-06 00:00:00

  • AntiBP2: improved version of antibacterial peptide prediction.

    abstract:BACKGROUND:Antibacterial peptides are one of the effecter molecules of innate immune system. Over the last few decades several antibacterial peptides have successfully approved as drug by FDA, which has prompted an interest in these antibacterial peptides. In our recent study we analyzed 999 antibacterial peptides, whi...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-S1-S19

    authors: Lata S,Mishra NK,Raghava GP

    更新日期:2010-01-18 00:00:00

  • Components of the antigen processing and presentation pathway revealed by gene expression microarray analysis following B cell antigen receptor (BCR) stimulation.

    abstract:BACKGROUND:Activation of naïve B lymphocytes by extracellular ligands, e.g. antigen, lipopolysaccharide (LPS) and CD40 ligand, induces a combination of common and ligand-specific phenotypic changes through complex signal transduction pathways. For example, although all three of these ligands induce proliferation, only ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-237

    authors: Lee JA,Sinkovits RS,Mock D,Rab EL,Cai J,Yang P,Saunders B,Hsueh RC,Choi S,Subramaniam S,Scheuermann RH,Alliance for Cellular Signaling.

    更新日期:2006-05-02 00:00:00

  • The Korean Bird Information System (KBIS) through open and public participation.

    abstract:BACKGROUND:The importance of biodiversity conservation has been increasing steadily due to its benefits to human beings. Recently, producing and managing biodiversity databases have become much easier because of the information technology (IT) advancement. This made the general public's participation in biodiversity co...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-S15-S11

    authors: Paik IH,Lim J,Chun BS,Jin SD,Yu JP,Lee JW,Bhak J,Paek WK

    更新日期:2009-12-03 00:00:00

  • A stochastic context free grammar based framework for analysis of protein sequences.

    abstract:BACKGROUND:In the last decade, there have been many applications of formal language theory in bioinformatics such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of relationship between amino acids have mainly limited...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-323

    authors: Dyrka W,Nebel JC

    更新日期:2009-10-08 00:00:00

  • Error, reproducibility and sensitivity: a pipeline for data processing of Agilent oligonucleotide expression arrays.

    abstract:BACKGROUND:Expression microarrays are increasingly used to obtain large scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, to design experiments and analyse the output. Furthermore, many of the more sophisticated mathematica...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-344

    authors: Chain B,Bowen H,Hammond J,Posch W,Rasaiyaah J,Tsang J,Noursadeghi M

    更新日期:2010-06-24 00:00:00

  • GraphDNA: a Java program for graphical display of DNA composition analyses.

    abstract:BACKGROUND:Under conditions of no strand bias the number of Gs is equal to that of Cs for each DNA strand; similarly, the total number of Ts is equal to that of As. However, within each strand there are considerable local deviations from the A = T and G = C equality. These asymmetries in nucleotide composition have bee...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-21

    authors: Thomas JM,Horspool D,Brown G,Tcherepanov V,Upton C

    更新日期:2007-01-23 00:00:00

  • A sensitive short read homology search tool for paired-end read sequencing data.

    abstract:BACKGROUND:Homology search is still a significant step in functional analysis for genomic data. Profile Hidden Markov Model-based homology search has been widely used in protein domain analysis in many different species. In particular, with the fast accumulation of transcriptomic data of non-model species and metagenom...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1826-2

    authors: Techa-Angkoon P,Sun Y,Lei J

    更新日期:2017-10-16 00:00:00

  • Reuse of imputed data in microarray analysis increases imputation efficiency.

    abstract:BACKGROUND:The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analysis require a complete data set. A few imputation methods for DNA microarray data have been introduced, but the efficiency of the methods was low and the va...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-5-160

    authors: Kim KY,Kim BJ,Yi GS

    更新日期:2004-10-26 00:00:00

  • scDC: single cell differential composition analysis.

    abstract:BACKGROUND:Differences in cell-type composition across subjects and conditions often carry biological significance. Recent advancements in single cell sequencing technologies enable cell-types to be identified at the single cell level, and as a result, cell-type composition of tissues can now be studied in exquisite de...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3211-9

    authors: Cao Y,Lin Y,Ormerod JT,Yang P,Yang JYH,Lo KK

    更新日期:2019-12-24 00:00:00

  • Bayesian Unidimensional Scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations.

    abstract:BACKGROUND:Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1790-x

    authors: Nguyen LH,Holmes S

    更新日期:2017-09-13 00:00:00

  • How large B-factors can be in protein crystal structures.

    abstract:BACKGROUND:Protein crystal structures are potentially over-interpreted since they are routinely refined without any restraint on the upper limit of atomic B-factors. Consequently, some of their atoms, undetected in the electron density maps, are allowed to reach extremely large B-factors, even above 100 square Angstrom...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2083-8

    authors: Carugo O

    更新日期:2018-02-23 00:00:00

  • Prediction of dinucleotide-specific RNA-binding sites in proteins.

    abstract:BACKGROUND:Regulation of gene expression, protein synthesis, replication and assembly of many viruses involve RNA-protein interactions. Although some successful computational tools have been reported to recognize RNA binding sites in proteins, the problem of specificity remains poorly investigated. After the nucleotide...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-S13-S5

    authors: Fernandez M,Kumagai Y,Standley DM,Sarai A,Mizuguchi K,Ahmad S

    更新日期:2011-01-01 00:00:00

  • Attenuating dependence on structural data in computing protein energy landscapes.

    abstract:BACKGROUND:Nearly all cellular processes involve proteins structurally rearranging to accommodate molecular partners. The energy landscape underscores the inherent nature of proteins as dynamic molecules interconverting between structures with varying energies. In principle, reconstructing a protein's energy landscape ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-2822-5

    authors: Morris D,Maximova T,Plaku E,Shehu A

    更新日期:2019-06-06 00:00:00

  • Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets.

    abstract:BACKGROUND:Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology all...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-241

    authors: Aubry M,Monnier A,Chicault C,de Tayrac M,Galibert MD,Burgun A,Mosser J

    更新日期:2006-05-04 00:00:00

  • OMeta: an ontology-based, data-driven metadata tracking system.

    abstract:BACKGROUND:The development of high-throughput sequencing and analysis has accelerated multi-omics studies of thousands of microbial species, metagenomes, and infectious disease pathogens. Omics studies are enabling genotype-phenotype association studies which identify genetic determinants of pathogen virulence and drug...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2580-9

    authors: Singh I,Kuscuoglu M,Harkins DM,Sutton G,Fouts DE,Nelson KE

    更新日期:2019-01-07 00:00:00

  • Evaluation of gene-expression clustering via mutual information distance measure.

    abstract:BACKGROUND:The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pears...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-111

    authors: Priness I,Maimon O,Ben-Gal I

    更新日期:2007-03-30 00:00:00

  • Knowledge discovery of drug data on the example of adverse reaction prediction.

    abstract:BACKGROUND:Antibiotics are the widely prescribed drugs for children and most likely to be related with adverse reactions. Record on adverse reactions and allergies from antibiotics considerably affect the prescription choices. We consider this a biomedical decision-making problem and explore hidden knowledge in survey ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-S6-S7

    authors: Yildirim P,Majnarić L,Ekmekci O,Holzinger A

    更新日期:2014-01-01 00:00:00

  • Computational method for reducing variance with Affymetrix microarrays.

    abstract:BACKGROUND:Affymetrix microarrays are used by many laboratories to generate gene expression profiles. Generally, only large differences (> 1.7-fold) between conditions have been reported. Computational methods to reduce inter-array variability might be of value when attempting to detect smaller differences. We examined...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-3-23

    authors: Welle S,Brooks AI,Thornton CA

    更新日期:2002-08-30 00:00:00

  • Analysis and prediction of antibacterial peptides.

    abstract:BACKGROUND:Antibacterial peptides are important components of the innate immune system, used by the host to protect itself from different types of pathogenic bacteria. Over the last few decades, the search for new drugs and drug targets has prompted an interest in these antibacterial peptides. We analyzed 486 antibacte...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-263

    authors: Lata S,Sharma BK,Raghava GP

    更新日期:2007-07-23 00:00:00

  • Mining published lists of cancer related microarray experiments: identification of a gene expression signature having a critical role in cell-cycle control.

    abstract:BACKGROUND:Routine application of gene expression microarray technology is rapidly producing large amounts of data that necessitate new approaches of analysis. The analysis of a specific microarray experiment profits enormously from cross-comparing to other experiments. This process is generally performed by numerical ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-6-S4-S14

    authors: Finocchiaro G,Mancuso F,Muller H

    更新日期:2005-12-01 00:00:00

  • Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction.

    abstract:BACKGROUND:One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to th...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-410

    authors: Scheeff ED,Bourne PE

    更新日期:2006-09-14 00:00:00

  • New challenges for text mining: mapping between text and manually curated pathways.

    abstract:BACKGROUND:Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactio...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-S3-S5

    authors: Oda K,Kim JD,Ohta T,Okanohara D,Matsuzaki T,Tateisi Y,Tsujii J

    更新日期:2008-04-11 00:00:00

  • Quality determination and the repair of poor quality spots in array experiments.

    abstract:BACKGROUND:A common feature of microarray experiments is the occurrence of missing gene expression data. These missing values occur for a variety of reasons, in particular, because of the filtering of poor quality spots and the removal of undefined values when a logarithmic transformation is applied to negative backgro...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-6-234

    authors: Tom BD,Gilks WR,Brooke-Powell ET,Ajioka JW

    更新日期:2005-09-26 00:00:00

  • Promoter prediction in E. coli based on SIDD profiles and Artificial Neural Networks.

    abstract:BACKGROUND:One of the major challenges in biology is the correct identification of promoter regions. Computational methods based on motif searching have been the traditional approach taken. Recent studies have shown that DNA structural properties, such as curvature, stacking energy, and stress-induced duplex destabiliz...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-S6-S17

    authors: Bland C,Newsome AS,Markovets AA

    更新日期:2010-10-07 00:00:00

  • MultiDCoX: Multi-factor analysis of differential co-expression.

    abstract:BACKGROUND:Differential co-expression (DCX) signifies change in degree of co-expression of a set of genes among different biological conditions. It has been used to identify differential co-expression networks or interactomes. Many algorithms have been developed for single-factor differential co-expression analysis and...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1963-7

    authors: Liany H,Rajapakse JC,Karuturi RKM

    更新日期:2017-12-28 00:00:00

  • Detection of transposable elements by their compositional bias.

    abstract:BACKGROUND:Transposable elements (TE) are mobile genetic entities present in nearly all genomes. Previous work has shown that TEs tend to have a different nucleotide composition than the host genes, either considering codon usage bias or dinucleotide frequencies. We show here how these compositional differences can be ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-5-94

    authors: Andrieu O,Fiston AS,Anxolabéhère D,Quesneville H

    更新日期:2004-07-13 00:00:00

  • Local functional descriptors for surface comparison based binding prediction.

    abstract:BACKGROUND:Molecular recognition in proteins occurs due to appropriate arrangements of physical, chemical, and geometric properties of an atomic surface. Similar surface regions should create similar binding interfaces. Effective methods for comparing surface regions can be used in identifying similar regions, and to p...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-314

    authors: Cipriano GM,Phillips GN Jr,Gleicher M

    更新日期:2012-11-24 00:00:00

  • Prediction of specificity-determining residues for small-molecule kinase inhibitors.

    abstract:BACKGROUND:Designing small-molecule kinase inhibitors with desirable selectivity profiles is a major challenge in drug discovery. A high-throughput screen for inhibitors of a given kinase will typically yield many compounds that inhibit more than one kinase. A series of chemical modifications are usually required befor...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-491

    authors: Caffrey DR,Lunney EA,Moshinsky DJ

    更新日期:2008-11-25 00:00:00

  • In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation.

    abstract:BACKGROUND:Microarray co-expression signatures are an important tool for studying gene function and relations between genes. In addition to genuine biological co-expression, correlated signals can result from technical deficiencies like hybridization of reporters with off-target transcripts. An approach that is able to...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-461

    authors: Casneuf T,Van de Peer Y,Huber W

    更新日期:2007-11-26 00:00:00