Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots.


BACKGROUND:Analyses of molecular high-throughput data often lack robustness, i.e. results are very sensitive to the addition or removal of a single observation. Therefore, the identification of extreme observations is an important step of quality control before doing further data analysis. Standard outlier detection methods for univariate data are, however, not applicable, since the considered data are high-dimensional, i.e. multiple hundreds or thousands of features are observed in small samples. Usually, outliers in high-dimensional data are detected solely by the analyst's visual inspection of a graphical representation of the data. Typical graphical representations of high-dimensional data are hierarchical cluster trees or principal component plots. Purely visual approaches depend, however, on the individual judgement of the analyst and are hard to automate. Existing methods for automated outlier detection are only dedicated to data of a single experimental group. RESULTS:In this work we propose to use bagplots, the 2-dimensional extension of the boxplot, to automatically identify outliers in the subspace of the first two principal components of the data. Furthermore, we present for the first time the gemplot, the 3-dimensional extension of the boxplot and bagplot, which can be used in the subspace of the first three principal components. Bagplot and gemplot surround the regular observations with convex hulls, and observations outside these hulls are regarded as outliers. The convex hulls are determined separately for the observations of each experimental group, while the observations of all groups can be displayed in the same subspace of principal components. We demonstrate the usefulness of this approach on multiple sets of artificial data as well as one set of gene expression data from a next-generation sequencing experiment, and compare the new method to other common approaches. Furthermore, we provide an implementation of the gemplot in the package 'gemPlot' for the R programming environment. CONCLUSIONS:Bagplots and gemplots in subspaces of principal components are useful for automated and objective outlier identification in high-dimensional data from molecular high-throughput experiments. A clear advantage over other methods is that multiple experimental groups can be displayed in the same figure although outlier detection is performed for each individual group.
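The procedure outlined in the abstract (project onto the first two principal components, then flag outliers separately per experimental group with a bagplot-style rule) can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' 'gemPlot' R package or its API. All function names are hypothetical. The halfspace depth and the convex hull are only approximated over a finite grid of directions, and the inflation factor of 3 for the fence is the usual bagplot convention, assumed here rather than taken from the abstract.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the first principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def _directions(n_dir=360):
    """Unit vectors on an evenly spaced grid of angles (an approximation)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    return np.column_stack([np.cos(angles), np.sin(angles)])


def halfspace_depth(P, n_dir=360):
    """Approximate halfspace (Tukey) depth of each 2-D point within P."""
    proj = P @ _directions(n_dir).T  # shape (n_points, n_dir)
    # counts[i, k]: number of points at or beyond point i in direction k.
    counts = (proj[None, :, :] >= proj[:, None, :]).sum(axis=1)
    return counts.min(axis=1)


def bagplot_outliers(P, factor=3.0, n_dir=360):
    """Flag points outside the fence (the bag inflated by `factor`)."""
    U = _directions(n_dir)
    depth = halfspace_depth(P, n_dir)
    # Depth median: centroid of the deepest points.
    center = P[depth == depth.max()].mean(axis=0)
    # Bag: roughly the deepest half of the points (ties are included).
    n_bag = int(np.ceil(P.shape[0] / 2))
    threshold = np.sort(depth)[::-1][n_bag - 1]
    bag = P[depth >= threshold]
    # Support function of the bag's convex hull, relative to the center.
    support = ((bag - center) @ U.T).max(axis=0)
    rel = (P - center) @ U.T
    # A point lies outside the fence if it exceeds the inflated hull
    # in at least one grid direction.
    return (rel > factor * support + 1e-12).any(axis=1)


def multigroup_outliers(X, groups, factor=3.0):
    """Shared PCA for all samples, outlier detection within each group."""
    scores = pca_scores(X, n_components=2)
    groups = np.asarray(groups)
    flags = np.zeros(len(groups), dtype=bool)
    for g in np.unique(groups):
        idx = groups == g
        flags[idx] = bagplot_outliers(scores[idx], factor=factor)
    return scores, flags
```

Calling `multigroup_outliers(X, groups)` on a samples-by-features matrix `X` with one group label per sample returns the two-dimensional scores together with a boolean outlier flag per sample. Because the principal components are computed once for all samples, every group can be drawn in the same plot while the flags are still derived group by group.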


BMC Bioinformatics




Kruppa J,Jung K






2017-05-02 00:00:00












  • The tumor as an organ: comprehensive spatial and temporal modeling of the tumor and its microenvironment.

    abstract:BACKGROUND:Research related to cancer is vast, and continues in earnest in many directions. Due to the complexity of cancer, a better understanding of tumor growth dynamics can be gleaned from a dynamic computational model. We present a comprehensive, fully executable, spatial and temporal 3D computational model of the...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Bloch N,Harel D

    update_date:2016-08-24 00:00:00

  • A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

    abstract:BACKGROUND:The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire geno...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Wang W,Sun W,Wang W,Szatkiewicz J

    update_date:2018-03-01 00:00:00

  • TAMEE: data management and analysis for tissue microarrays.

    abstract:BACKGROUND:With the introduction of tissue microarrays (TMAs) researchers can investigate gene and protein expression in tissues on a high-throughput scale. TMAs generate a wealth of data calling for extended, high level data management. Enhanced data analysis and systematic data management are required for traceabilit...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Thallinger GG,Baumgartner K,Pirklbauer M,Uray M,Pauritsch E,Mehes G,Buck CR,Zatloukal K,Trajanoski Z

    update_date:2007-03-07 00:00:00

  • Application of text-mining for updating protein post-translational modification annotation in UniProtKB.

    abstract:BACKGROUND:The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Veuthey AL,Bridge A,Gobeill J,Ruch P,McEntyre JR,Bougueleret L,Xenarios I

    update_date:2013-03-22 00:00:00

  • An SVM-based system for predicting protein subnuclear localizations.

    abstract:BACKGROUND:The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for the development of a fast computational tool for the prediction of subnuclear and subcellular localizations generally applicable to protein sequences. The information on localiza...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Lei Z,Dai Y

    update_date:2005-12-07 00:00:00

  • A novel parametric approach to mine gene regulatory relationship from microarray datasets.

    abstract:BACKGROUND:Microarray has been widely used to measure the gene expression level on the genome scale in the current decade. Many algorithms have been developed to reconstruct gene regulatory networks based on microarray data. Unfortunately, most of these models and algorithms focus on global properties of the expression...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Liu W,Li D,Liu Q,Zhu Y,He F

    update_date:2010-12-14 00:00:00

  • Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data.

    abstract:BACKGROUND:Integrating data from multiple global assays and curated databases is essential to understand the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published da...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Zhang Y,Xuan J,de los Reyes BG,Clarke R,Ressom HW

    update_date:2008-04-21 00:00:00

  • SHIVA - a web application for drug resistance and tropism testing in HIV.

    abstract:BACKGROUND:Drug resistance testing is mandatory in antiretroviral therapy in human immunodeficiency virus (HIV) infected patients for successful treatment. The emergence of resistances against antiretroviral agents remains the major obstacle in inhibition of viral replication and thus to control infection. Due to the h...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Riemenschneider M,Hummel T,Heider D

    update_date:2016-08-22 00:00:00

  • MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments.

    abstract:BACKGROUND:The generation of multiple sequence alignments (MSAs) is a crucial step for many bioinformatic analyses. Thus improving MSA accuracy and identifying potential errors in MSAs is important for a wide range of post-genomic research. We present a novel method called MergeAlign which constructs consensus MSAs fro...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Collingridge PW,Kelly S

    update_date:2012-05-30 00:00:00

  • Principal components analysis based methodology to identify differentially expressed genes in time-course microarray data.

    abstract:BACKGROUND:Time-course microarray experiments are being increasingly used to characterize dynamic biological processes. In these experiments, the goal is to identify genes differentially expressed in time-course data, measured between different biological conditions. These differentially expressed genes can reveal the ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Jonnalagadda S,Srinivasan R

    update_date:2008-06-06 00:00:00

  • Efficient inference of homologs in large eukaryotic pan-proteomes.

    abstract:BACKGROUND:Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, th...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Sheikhizadeh Anari S,de Ridder D,Schranz ME,Smit S

    update_date:2018-09-26 00:00:00

  • A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies.

    abstract:BACKGROUND:Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Zuber V,Duarte Silva AP,Strimmer K

    update_date:2012-10-31 00:00:00

  • GeneLibrarian: an effective gene-information summarization and visualization system.

    abstract:BACKGROUND:Abundant information about gene products is stored in online searchable databases such as annotation or literature. To efficiently obtain and digest such information, there is a pressing need for automated information-summarization and functional-similarity clustering of genes. RESULTS:We have developed a n...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Chiang JH,Shin JW,Liu HH,Chin CL

    update_date:2006-08-29 00:00:00

  • ProCKSI: a decision support system for Protein (structure) Comparison, Knowledge, Similarity and Information.

    abstract:BACKGROUND:We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy to use interface that allows the comparison of multiple proteins simultaneously. It employs the Universal Simila...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Barthel D,Hirst JD,Błazewicz J,Burke EK,Krasnogor N

    update_date:2007-10-26 00:00:00

  • Bioinformatics research in the Asia Pacific: a 2007 update.

    abstract:We provide a 2007 update on the bioinformatics research in the Asia-Pacific from the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation set up in 1998. From 2002, APBioNet has organized the first International Conference on Bioinformatics (InCoB) bringing together scientists work...

    journal_title:BMC bioinformatics



    authors: Ranganathan S,Gribskov M,Tan TW

    update_date:2008-01-01 00:00:00

  • Spot quantification in two dimensional gel electrophoresis image analysis: comparison of different approaches and presentation of a novel compound fitting algorithm.

    abstract:BACKGROUND:Various computer-based methods exist for the detection and quantification of protein spots in two dimensional gel electrophoresis images. Area-based methods are commonly used for spot quantification: an area is assigned to each spot and the sum of the pixel intensities in that area, the so-called volume, is ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Brauner JM,Groemer TW,Stroebel A,Grosse-Holz S,Oberstein T,Wiltfang J,Kornhuber J,Maler JM

    update_date:2014-06-11 00:00:00

  • SDAR: a practical tool for graphical analysis of two-dimensional data.

    abstract:BACKGROUND:Two-dimensional data needs to be processed and analysed in almost any experimental laboratory. Some tasks in this context may be performed with generic software such as spreadsheet programs which are available ubiquitously, others may require more specialised software that requires paid licences. Additionall...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Weeratunga S,Hu NJ,Simon A,Hofmann A

    update_date:2012-08-14 00:00:00

  • A semi-parametric statistical model for integrating gene expression profiles across different platforms.

    abstract:BACKGROUND:Determining differentially expressed genes (DEGs) between biological samples is the key to understand how genotype gives rise to phenotype. RNA-seq and microarray are two main technologies for profiling gene expression levels. However, considerable discrepancy has been found between DEGs detected using the t...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Lyu Y,Li Q

    update_date:2016-01-11 00:00:00

  • Conservation of regulatory elements between two species of Drosophila.

    abstract:BACKGROUND:One of the important goals in the post-genomic era is to determine the regulatory elements within the non-coding DNA of a given organism's genome. The identification of functional cis-regulatory modules has proven difficult since the component factor binding sites are small and the rules governing their arra...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Emberly E,Rajewsky N,Siggia ED

    update_date:2003-11-20 00:00:00

  • NIFTI: an evolutionary approach for finding number of clusters in microarray data.

    abstract:BACKGROUND:Clustering techniques are routinely used in gene expression data analysis to organize the massive data. Clustering techniques arrange a large number of genes or assays into a few clusters while maximizing the intra-cluster similarity and inter-cluster separation. While clustering of genes facilitates learnin...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Jonnalagadda S,Srinivasan R

    update_date:2009-01-30 00:00:00

  • Frnakenstein: multiple target inverse RNA folding.

    abstract:BACKGROUND:RNA secondary structure prediction, or folding, is a classic problem in bioinformatics: given a sequence of nucleotides, the aim is to predict the base pairs formed in its three dimensional conformation. The inverse problem of designing a sequence folding into a particular target structure has only more rece...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Lyngsø RB,Anderson JW,Sizikova E,Badugu A,Hyland T,Hein J

    update_date:2012-10-09 00:00:00

  • Recodon: coalescent simulation of coding DNA sequences with recombination, migration and demography.

    abstract:BACKGROUND:Coalescent simulations have proven very useful in many population genetics studies. In order to arrive to meaningful conclusions, it is important that these simulations resemble the process of molecular evolution as much as possible. To date, no single coalescent program is able to simulate codon sequences s...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Arenas M,Posada D

    update_date:2007-11-20 00:00:00

  • Species-specific analysis of protein sequence motifs using mutual information.

    abstract:BACKGROUND:Protein sequence motifs are by definition short fragments of conserved amino acids, often associated with a specific function. Accordingly protein sequence profiles derived from multiple sequence alignments provide an alternative description of functional motifs characterizing families of related sequences. ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Hummel J,Keshvari N,Weckwerth W,Selbig J

    update_date:2005-06-29 00:00:00

  • FANTOM: Functional and taxonomic analysis of metagenomes.

    abstract:BACKGROUND:Interpretation of quantitative metagenomics data is important for our understanding of ecosystem functioning and assessing differences between various environmental samples. There is a need for an easy to use tool to explore the often complex metagenomics data in taxonomic and functional context. RESULTS:He...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Sanli K,Karlsson FH,Nookaew I,Nielsen J

    update_date:2013-02-01 00:00:00

  • Ferret: a sentence-based literature scanning system.

    abstract:BACKGROUND:The rapid pace of bioscience research makes it very challenging to track relevant articles in one's area of interest. MEDLINE, a primary source for biomedical literature, offers access to more than 20 million citations with three-quarters of a million new ones added each year. Thus it is not surprising to se...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Srinivasan P,Zhang XN,Bouten R,Chang C

    update_date:2015-06-20 00:00:00

  • Exploring community structure in biological networks with random graphs.

    abstract:BACKGROUND:Community structure is ubiquitous in biological networks. There has been an increased interest in unraveling the community structure of biological systems as it may provide important insights into a system's functional components and the impact of local structures on dynamics at a global scale. Choosing an a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Sah P,Singh LO,Clauset A,Bansal S

    update_date:2014-06-25 00:00:00

  • Genotype calling in tetraploid species from bi-allelic marker data using mixture models.

    abstract:BACKGROUND:Automated genotype calling in tetraploid species was until recently not possible, which hampered genetic analysis. Modern genotyping assays often produce two signals, one for each allele of a bi-allelic marker. While ample software is available to obtain genotypes (homozygous for either allele, or heterozygo...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Voorrips RE,Gort G,Vosman B

    update_date:2011-05-19 00:00:00

  • TPMS: a set of utilities for querying collections of gene trees.

    abstract:BACKGROUND:The information in large collections of phylogenetic trees is useful for many comparative genomic studies. Therefore, there is a need for flexible tools that allow exploration of such collections in order to retrieve relevant data as quickly as possible. RESULTS:In this paper, we present TPMS (Tree Pattern-...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Bigot T,Daubin V,Lassalle F,Perrière G

    update_date:2013-03-27 00:00:00

  • Bacterial protein meta-interactomes predict cross-species interactions and protein function.

    abstract:BACKGROUND:Protein-protein interactions (PPIs) can offer compelling evidence for protein function, especially when viewed in the context of proteome-wide interactomes. Bacteria have been popular subjects of interactome studies: more than six different bacterial species have been the subjects of comprehensive interactom...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Caufield JH,Wimble C,Shary S,Wuchty S,Uetz P

    update_date:2017-03-16 00:00:00

  • Metabolic network alignment in large scale by network compression.

    abstract:Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far lim...

    journal_title:BMC bioinformatics

    pub_type: Journal Article


    authors: Ay F,Dang M,Kahveci T

    update_date:2012-03-21 00:00:00