Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots.

Abstract:

BACKGROUND:Analyses of molecular high-throughput data often lack in robustness, i.e. results are very sensitive to the addition or removal of a single observation. Therefore, the identification of extreme observations is an important step of quality control before doing further data analysis. Standard outlier detection methods for univariate data are however not applicable, since the considered data are high-dimensional, i.e. multiple hundreds or thousands of features are observed in small samples. Usually, outliers in high-dimensional data are solely detected by visual inspection of a graphical representation of the data by the analyst. Typical graphical representation for high-dimensional data are hierarchical cluster tree or principal component plots. Pure visual approaches depend, however, on the individual judgement of the analyst and are hard to automate. Existing methods for automated outlier detection are only dedicated to data of a single experimental groups. RESULTS:In this work we propose to use bagplots, the 2-dimensional extension of the boxplot, to automatically identify outliers in the subspace of the first two principal components of the data. Furthermore, we present for the first time the gemplot, the 3-dimensional extension of boxplot and bagplot, which can be used in the subspace of the first three principal components. Bagplot and gemplot surround the regular observations with convex hulls and observations outside these hulls are regarded as outliers. The convex hulls are determined separately for the observations of each experimental group while the observations of all groups can be displayed in the same subspace of principal components. We demonstrate the usefulness of this approach on multiple sets of artificial data as well as one set of gene expression data from a next-generation sequencing experiment, and compare the new method to other common approaches. Furthermore, we provide an implementation of the gemplot in the package 'gemPlot' for the R programming environment. CONCLUSIONS:Bagplots and gemplots in subspaces of principal components are useful for automated and objective outlier identification in high-dimensional data from molecular high-throughput experiments. A clear advantage over other methods is that multiple experimental groups can be displayed in the same figure although outlier detection is performed for each individual group.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Kruppa J,Jung K

doi

10.1186/s12859-017-1645-5

subject

Has Abstract

pub_date

2017-05-02 00:00:00

pages

232

issue

1

issn

1471-2105

pii

10.1186/s12859-017-1645-5

journal_volume

18

pub_type

杂志文章
  • Analysis of Bovine Viral Diarrhea Viruses-infected monocytes: identification of cytopathic and non-cytopathic biotype differences.

    abstract:BACKGROUND:Bovine Viral Diarrhea Virus (BVDV) infection is widespread in cattle worldwide, causing important economic losses. Pathogenesis of the disease caused by BVDV is complex, as each BVDV strain has two biotypes: non-cytopathic (ncp) and cytopathic (cp). BVDV can cause a persistent latent infection and immune sup...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-S6-S9

    authors: Ammari M,McCarthy FM,Nanduri B,Pinchuk LM

    更新日期:2010-10-07 00:00:00

  • VKCDB: voltage-gated potassium channel database.

    abstract:BACKGROUND:The family of voltage-gated potassium channels comprises a functionally diverse group of membrane proteins. They help maintain and regulate the potassium ion-based component of the membrane potential and are thus central to many critical physiological processes. VKCDB (Voltage-gated potassium [K] Channel Dat...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章,评审

    doi:10.1186/1471-2105-5-3

    authors: Li B,Gallin WJ

    更新日期:2004-01-09 00:00:00

  • EGNAS: an exhaustive DNA sequence design algorithm.

    abstract:BACKGROUND:The molecular recognition based on the complementary base pairing of deoxyribonucleic acid (DNA) is the fundamental principle in the fields of genetics, DNA nanotechnology and DNA computing. We present an exhaustive DNA sequence design algorithm that allows to generate sets containing a maximum number of seq...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-138

    authors: Kick A,Bönsch M,Mertig M

    更新日期:2012-06-20 00:00:00

  • Predicting MoRFs in protein sequences using HMM profiles.

    abstract:BACKGROUND:Intrinsically Disordered Proteins (IDPs) lack an ordered three-dimensional structure and are enriched in various biological processes. The Molecular Recognition Features (MoRFs) are functional regions within IDPs that undergo a disorder-to-order transition on binding to a partner protein. Identifying MoRFs i...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1375-0

    authors: Sharma R,Kumar S,Tsunoda T,Patil A,Sharma A

    更新日期:2016-12-22 00:00:00

  • Fast batch searching for protein homology based on compression and clustering.

    abstract:BACKGROUND:In bioinformatics community, many tasks associate with matching a set of protein query sequences in large sequence datasets. To conduct multiple queries in the database, a common used method is to run BLAST on each original querey or on the concatenated queries. It is inefficient since it doesn't exploit the...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1938-8

    authors: Ge H,Sun L,Yu J

    更新日期:2017-11-21 00:00:00

  • Exploring community structure in biological networks with random graphs.

    abstract:BACKGROUND:Community structure is ubiquitous in biological networks. There has been an increased interest in unraveling the community structure of biological systems as it may provide important insights into a system's functional components and the impact of local structures on dynamics at a global scale. Choosing an a...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-220

    authors: Sah P,Singh LO,Clauset A,Bansal S

    更新日期:2014-06-25 00:00:00

  • Determining significance of pairwise co-occurrences of events in bursty sequences.

    abstract:BACKGROUND:Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-336

    authors: Haiminen N,Mannila H,Terzi E

    更新日期:2008-08-08 00:00:00

  • SAMPI: protein identification with mass spectra alignments.

    abstract:BACKGROUND:Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs oft...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-102

    authors: Kaltenbach HM,Wilke A,Böcker S

    更新日期:2007-03-26 00:00:00

  • Investigating the concordance of Gene Ontology terms reveals the intra- and inter-platform reproducibility of enrichment analysis.

    abstract:BACKGROUND:Reliability and Reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The microarray quality control (MAQC) project launched by US Food and Drug Administration (FDA) elucidated that the lists of DEGs generated by intra- and inter-platform...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-14-143

    authors: Zhang L,Zhang J,Yang G,Wu D,Jiang L,Wen Z,Li M

    更新日期:2013-04-29 00:00:00

  • SAMSA: a comprehensive metatranscriptome analysis pipeline.

    abstract:BACKGROUND:Although metatranscriptomics-the study of diverse microbial population activity based on RNA-seq data-is rapidly growing in popularity, there are limited options for biologists to analyze this type of data. Current approaches for processing metatranscriptomes rely on restricted databases and a dedicated comp...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1270-8

    authors: Westreich ST,Korf I,Mills DA,Lemay DG

    更新日期:2016-09-29 00:00:00

  • An evaluation of copy number variation detection tools for cancer using whole exome sequencing data.

    abstract:BACKGROUND:Recently copy number variation (CNV) has gained considerable interest as a type of genomic/genetic variation that plays an important role in disease susceptibility. Advances in sequencing technology have created an opportunity for detecting CNVs more accurately. Recently whole exome sequencing (WES) has beco...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1705-x

    authors: Zare F,Dow M,Monteleone N,Hosny A,Nabavi S

    更新日期:2017-05-31 00:00:00

  • Identification of markers associated with global changes in DNA methylation regulation in cancers.

    abstract::DNA methylation exhibits different patterns in different cancers. DNA methylation rates at different genomic loci appear to be highly correlated in some samples but not in others. We call such phenomena conditional concordant relationships (CCRs). In this study, we explored DNA methylation patterns in 12 common cancer...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-S13-S7

    authors: Qiu P,Zhang L

    更新日期:2012-01-01 00:00:00

  • Graph-representation of oxidative folding pathways.

    abstract:BACKGROUND:The process of oxidative folding combines the formation of native disulfide bond with conformational folding resulting in the native three-dimensional fold. Oxidative folding pathways can be described in terms of disulfide intermediate species (DIS) which can also be isolated and characterized. Each DIS corr...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-6-19

    authors: Agoston V,Cemazar M,Kaján L,Pongor S

    更新日期:2005-01-27 00:00:00

  • Extending the evaluation of Genia Event task toward knowledge base construction and comparison to Gene Regulation Ontology task.

    abstract:BACKGROUND:The third edition of the BioNLP Shared Task was held with the grand theme "knowledge base construction (KB)". The Genia Event (GE) task was re-designed and implemented in light of this theme. For its final report, the participating systems were evaluated from a perspective of annotation. To further explore t...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-16-S10-S3

    authors: Kim JD,Kim JJ,Han X,Rebholz-Schuhmann D

    更新日期:2015-01-01 00:00:00

  • Determination of strongly overlapping signaling activity from microarray data.

    abstract:BACKGROUND:As numerous diseases involve errors in signal transduction, modern therapeutics often target proteins involved in cellular signaling. Interpretation of the activity of signaling pathways during disease development or therapeutic intervention would assist in drug development, design of therapy, and target ide...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-99

    authors: Bidaut G,Suhre K,Claverie JM,Ochs MF

    更新日期:2006-02-28 00:00:00

  • Multi-scale structural community organisation of the human genome.

    abstract:BACKGROUND:Structural interaction frequency matrices between all genome loci are now experimentally achievable thanks to high-throughput chromosome conformation capture technologies. This ensues a new methodological challenge for computational biology which consists in objectively extracting from these data the structu...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1616-x

    authors: Boulos RE,Tremblay N,Arneodo A,Borgnat P,Audit B

    更新日期:2017-04-11 00:00:00

  • Identification of novel alternative splicing biomarkers for breast cancer with LC/MS/MS and RNA-Seq.

    abstract:BACKGROUND:Alternative splicing isoforms have been reported as a new and robust class of diagnostic biomarkers. Over 95% of human genes are estimated to be alternatively spliced as a powerful means of producing functionally diverse proteins from a single gene. The emergence of next-generation sequencing technologies, e...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-020-03824-8

    authors: Zhang F,Deng CK,Wang M,Deng B,Barber R,Huang G

    更新日期:2020-12-03 00:00:00

  • Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes.

    abstract:BACKGROUND:A major goal of the analysis of high-dimensional RNA expression data from tumor tissue is to identify prognostic signatures for discriminating patient subgroups. For this purpose genome-wide identification of bimodally expressed genes from gene array data is relevant because distinguishability of high and lo...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-276

    authors: Hellwig B,Hengstler JG,Schmidt M,Gehrmann MC,Schormann W,Rahnenführer J

    更新日期:2010-05-25 00:00:00

  • Detecting transitions in protein dynamics using a recurrence quantification analysis based bootstrap method.

    abstract:BACKGROUND:Proteins undergo conformational transitions over different time scales. These transitions are closely intertwined with the protein's function. Numerous standard techniques such as principal component analysis are used to detect these transitions in molecular dynamics simulations. In this work, we add a new m...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1943-y

    authors: Karain WI

    更新日期:2017-11-28 00:00:00

  • MeDEStrand: an improved method to infer genome-wide absolute methylation levels from DNA enrichment data.

    abstract:BACKGROUND:DNA methylation of CpG dinucleotides is an essential epigenetic modification that plays a key role in transcription. Widely used DNA enrichment-based methods offer high coverage for measuring methylated CpG dinucleotides, with the lowest cost per CpG covered genome-wide. However, these methods measure the DN...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2574-7

    authors: Xu J,Liu S,Yin P,Bulun S,Dai Y

    更新日期:2018-12-22 00:00:00

  • TableButler - a Windows based tool for processing large data tables generated with high-throughput methods.

    abstract:BACKGROUND:High-throughput "omics" based data analysis play emerging roles in life sciences and molecular diagnostics. This emphasizes the urgent need for user-friendly windows-based software interfaces that could process the diversity of large tab-delimited raw data files generated by these methods. Depending on the s...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-235

    authors: Schwager C,Wirkner U,Abdollahi A,Huber PE

    更新日期:2009-07-29 00:00:00

  • Ferret: a sentence-based literature scanning system.

    abstract:BACKGROUND:The rapid pace of bioscience research makes it very challenging to track relevant articles in one's area of interest. MEDLINE, a primary source for biomedical literature, offers access to more than 20 million citations with three-quarters of a million new ones added each year. Thus it is not surprising to se...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0630-0

    authors: Srinivasan P,Zhang XN,Bouten R,Chang C

    更新日期:2015-06-20 00:00:00

  • Extracting transcription factor binding sites from unaligned gene sequences with statistical models.

    abstract:BACKGROUND:Transcription factor binding sites (TFBSs) are crucial in the regulation of gene transcription. Recently, chromatin immunoprecipitation followed by cDNA microarray hybridization (ChIP-chip array) has been used to identify potential regulatory sequences, but the procedure can only map the probable protein-DNA...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-S12-S7

    authors: Lu CC,Yuan WH,Chen TM

    更新日期:2008-12-12 00:00:00

  • Knowledge discovery of drug data on the example of adverse reaction prediction.

    abstract:BACKGROUND:Antibiotics are the widely prescribed drugs for children and most likely to be related with adverse reactions. Record on adverse reactions and allergies from antibiotics considerably affect the prescription choices. We consider this a biomedical decision-making problem and explore hidden knowledge in survey ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-S6-S7

    authors: Yildirim P,Majnarić L,Ekmekci O,Holzinger A

    更新日期:2014-01-01 00:00:00

  • MultiDCoX: Multi-factor analysis of differential co-expression.

    abstract:BACKGROUND:Differential co-expression (DCX) signifies change in degree of co-expression of a set of genes among different biological conditions. It has been used to identify differential co-expression networks or interactomes. Many algorithms have been developed for single-factor differential co-expression analysis and...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1963-7

    authors: Liany H,Rajapakse JC,Karuturi RKM

    更新日期:2017-12-28 00:00:00

  • ProbPS: a new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity.

    abstract:BACKGROUND:The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity an...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-346

    authors: Zhang S,Wang Y,Bu D,Zhang H,Sun S

    更新日期:2011-08-17 00:00:00

  • EGenBio: a data management system for evolutionary genomics and biodiversity.

    abstract:BACKGROUND:Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; http://egenbio.lsu.edu) to begin to address this....

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-S2-S7

    authors: Nahum LA,Reynolds MT,Wang ZO,Faith JJ,Jonna R,Jiang ZJ,Meyer TJ,Pollock DD

    更新日期:2006-09-06 00:00:00

  • Components of the antigen processing and presentation pathway revealed by gene expression microarray analysis following B cell antigen receptor (BCR) stimulation.

    abstract:BACKGROUND:Activation of naïve B lymphocytes by extracellular ligands, e.g. antigen, lipopolysaccharide (LPS) and CD40 ligand, induces a combination of common and ligand-specific phenotypic changes through complex signal transduction pathways. For example, although all three of these ligands induce proliferation, only ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-237

    authors: Lee JA,Sinkovits RS,Mock D,Rab EL,Cai J,Yang P,Saunders B,Hsueh RC,Choi S,Subramaniam S,Scheuermann RH,Alliance for Cellular Signaling.

    更新日期:2006-05-02 00:00:00

  • Compromise or optimize? The breakpoint anti-median.

    abstract:BACKGROUND:The median of k≥3 genomes was originally defined to find a compromise genome indicative of a common ancestor. However, in gene order comparisons, the usual definitions based on minimizing the sum of distances to the input genomes lead to degenerate medians reflecting only one of the input genomes. "Near-medi...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1340-y

    authors: Larlee CA,Brandts A,Sankoff D

    更新日期:2016-12-15 00:00:00

  • Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations.

    abstract:BACKGROUND:Estimation of individual ancestry from genetic data is useful for the analysis of disease association studies, understanding human population history and interpreting personal genomic variation. New, computationally efficient methods are needed for ancestry inference that can effectively utilize existing inf...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-014-0418-7

    authors: Bansal V,Libiger O

    更新日期:2015-01-16 00:00:00