Abstract:
BACKGROUND: Analyses of molecular high-throughput data often lack robustness, i.e. results are very sensitive to the addition or removal of a single observation. Therefore, the identification of extreme observations is an important quality-control step before further data analysis. Standard outlier detection methods for univariate data are, however, not applicable, since the data considered are high-dimensional, i.e. several hundred or several thousand features are observed in small samples. Usually, outliers in high-dimensional data are detected solely by the analyst's visual inspection of a graphical representation of the data. Typical graphical representations for high-dimensional data are hierarchical cluster trees or principal component plots. Purely visual approaches depend, however, on the individual judgement of the analyst and are hard to automate. Existing methods for automated outlier detection are dedicated only to data from a single experimental group.
RESULTS: In this work we propose to use bagplots, the 2-dimensional extension of the boxplot, to automatically identify outliers in the subspace of the first two principal components of the data. Furthermore, we present for the first time the gemplot, the 3-dimensional extension of the boxplot and bagplot, which can be used in the subspace of the first three principal components. Bagplot and gemplot surround the regular observations with convex hulls, and observations outside these hulls are regarded as outliers. The convex hulls are determined separately for the observations of each experimental group, while the observations of all groups can be displayed in the same subspace of principal components. We demonstrate the usefulness of this approach on multiple sets of artificial data as well as one set of gene expression data from a next-generation sequencing experiment, and compare the new method to other common approaches. Furthermore, we provide an implementation of the gemplot in the package 'gemPlot' for the R programming environment.
CONCLUSIONS: Bagplots and gemplots in subspaces of principal components are useful for automated and objective outlier identification in high-dimensional data from molecular high-throughput experiments. A clear advantage over other methods is that multiple experimental groups can be displayed in the same figure although outlier detection is performed for each individual group.
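The 2-dimensional variant described in the abstract can be illustrated with a rough Python sketch. This is not the 'gemPlot' R package and not the authors' implementation. The function names are hypothetical, and halfspace depth and the hulls are only approximated by scanning a fixed number of directions. The choices of "bag = deepest 50 % of the points" and "fence = bag inflated by a factor of 3" are assumptions taken from the usual bagplot definition, not from the abstract. The 3-dimensional gemplot is not covered.

```python
import numpy as np

def pc_scores(X, k=2):
    """Scores of the samples (rows of X) on the first k principal components."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]

def unit_directions(n_dir=360):
    """Evenly spaced unit vectors in the plane (assumed resolution of the sketch)."""
    ang = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    return np.column_stack([np.cos(ang), np.sin(ang)])

def halfspace_depth(P, U):
    """Approximate Tukey depth: fewest points in a closed half-plane through each point."""
    proj = P @ U.T                                   # shape (n_points, n_dir)
    counts = (proj[None, :, :] >= proj[:, None, :]).sum(axis=1)
    return counts.min(axis=1)

def bagplot_outliers(P, factor=3.0, n_dir=360):
    """Flag points of one group that lie outside the bag inflated by `factor`."""
    U = unit_directions(n_dir)
    depth = halfspace_depth(P, U)
    order = np.argsort(-depth, kind="stable")        # deepest points first
    half = order[: int(np.ceil(len(P) / 2))]
    bag = P[half]                                    # simplified bag: deepest 50 %
    center = bag[depth[half] == depth.max()].mean(axis=0)
    support = ((bag - center) @ U.T).max(axis=0)     # extent of the bag per direction
    # A point is outside the fence if it exceeds the inflated extent in any direction.
    return (((P - center) @ U.T) > factor * support + 1e-12).any(axis=1)

def flag_outliers_by_group(X, groups, factor=3.0):
    """One common PC subspace for all samples, one fence per experimental group."""
    scores = pc_scores(X, k=2)
    groups = np.asarray(groups)
    flags = np.zeros(len(scores), dtype=bool)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        flags[idx] = bagplot_outliers(scores[idx], factor)
    return scores, flags
```

Calling `flag_outliers_by_group(X, groups)` with a samples-by-features matrix `X` and one group label per sample returns the 2-dimensional scores and a boolean outlier flag per sample. The scores of all groups live in the same subspace and can be drawn in one figure, while each flag depends only on the sample's own group.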
journal_name: BMC Bioinformatics
journal_title: BMC bioinformatics
authors: Kruppa J, Jung K
doi: 10.1186/s12859-017-1645-5
subject: Has Abstract
pub_date: 2017-05-02 00:00:00
pages: 232
issue: 1
issn: 1471-2105
pii: 10.1186/s12859-017-1645-5
journal_volume: 18
pub_type: Journal Article
abstract:BACKGROUND:Research related to cancer is vast, and continues in earnest in many directions. Due to the complexity of cancer, a better understanding of tumor growth dynamics can be gleaned from a dynamic computational model. We present a comprehensive, fully executable, spatial and temporal 3D computational model of the...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-1168-5
update_date:2016-08-24 00:00:00
abstract:BACKGROUND:The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire geno...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2077-6
update_date:2018-03-01 00:00:00
abstract:BACKGROUND:With the introduction of tissue microarrays (TMAs) researchers can investigate gene and protein expression in tissues on a high-throughput scale. TMAs generate a wealth of data calling for extended, high level data management. Enhanced data analysis and systematic data management are required for traceabilit...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-81
update_date:2007-03-07 00:00:00
abstract:BACKGROUND:The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-14-104
update_date:2013-03-22 00:00:00
abstract:BACKGROUND:The large gap between the number of protein sequences in databases and the number of functionally characterized proteins calls for the development of a fast computational tool for the prediction of subnuclear and subcellular localizations generally applicable to protein sequences. The information on localiza...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-6-291
update_date:2005-12-07 00:00:00
abstract:BACKGROUND:Microarray has been widely used to measure the gene expression level on the genome scale in the current decade. Many algorithms have been developed to reconstruct gene regulatory networks based on microarray data. Unfortunately, most of these models and algorithms focus on global properties of the expression...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-S11-S15
update_date:2010-12-14 00:00:00
abstract:BACKGROUND:Integrating data from multiple global assays and curated databases is essential to understand the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published da...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-203
update_date:2008-04-21 00:00:00
abstract:BACKGROUND:Drug resistance testing is mandatory in antiretroviral therapy in human immunodeficiency virus (HIV) infected patients for successful treatment. The emergence of resistances against antiretroviral agents remains the major obstacle in inhibition of viral replication and thus to control infection. Due to the h...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-1179-2
update_date:2016-08-22 00:00:00
abstract:BACKGROUND:The generation of multiple sequence alignments (MSAs) is a crucial step for many bioinformatic analyses. Thus improving MSA accuracy and identifying potential errors in MSAs is important for a wide range of post-genomic research. We present a novel method called MergeAlign which constructs consensus MSAs fro...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-117
update_date:2012-05-30 00:00:00
abstract:BACKGROUND:Time-course microarray experiments are being increasingly used to characterize dynamic biological processes. In these experiments, the goal is to identify genes differentially expressed in time-course data, measured between different biological conditions. These differentially expressed genes can reveal the ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-267
update_date:2008-06-06 00:00:00
abstract:BACKGROUND:Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, th...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2362-4
update_date:2018-09-26 00:00:00
abstract:BACKGROUND:Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-284
update_date:2012-10-31 00:00:00
abstract:BACKGROUND:Abundant information about gene products is stored in online searchable databases such as annotation or literature. To efficiently obtain and digest such information, there is a pressing need for automated information-summarization and functional-similarity clustering of genes. RESULTS:We have developed a n...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-7-392
update_date:2006-08-29 00:00:00
abstract:BACKGROUND:We introduce the decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information (ProCKSI). ProCKSI integrates various protein similarity measures through an easy to use interface that allows the comparison of multiple proteins simultaneously. It employs the Universal Simila...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-416
update_date:2007-10-26 00:00:00
abstract:We provide a 2007 update on the bioinformatics research in the Asia-Pacific from the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation set up in 1998. From 2002, APBioNet has organized the first International Conference on Bioinformatics (InCoB) bringing together scientists work...
journal_title:BMC bioinformatics
pub_type:
doi:10.1186/1471-2105-9-S1-S1
update_date:2008-01-01 00:00:00
abstract:BACKGROUND:Various computer-based methods exist for the detection and quantification of protein spots in two dimensional gel electrophoresis images. Area-based methods are commonly used for spot quantification: an area is assigned to each spot and the sum of the pixel intensities in that area, the so-called volume, is ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-15-181
update_date:2014-06-11 00:00:00
abstract:BACKGROUND:Two-dimensional data needs to be processed and analysed in almost any experimental laboratory. Some tasks in this context may be performed with generic software such as spreadsheet programs which are available ubiquitously, others may require more specialised software that requires paid licences. Additionall...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-201
update_date:2012-08-14 00:00:00
abstract:BACKGROUND:Determining differentially expressed genes (DEGs) between biological samples is key to understanding how genotype gives rise to phenotype. RNA-seq and microarray are two main technologies for profiling gene expression levels. However, considerable discrepancy has been found between DEGs detected using the t...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0847-y
update_date:2016-01-11 00:00:00
abstract:BACKGROUND:One of the important goals in the post-genomic era is to determine the regulatory elements within the non-coding DNA of a given organism's genome. The identification of functional cis-regulatory modules has proven difficult since the component factor binding sites are small and the rules governing their arra...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-4-57
update_date:2003-11-20 00:00:00
abstract:BACKGROUND:Clustering techniques are routinely used in gene expression data analysis to organize the massive data. Clustering techniques arrange a large number of genes or assays into a few clusters while maximizing the intra-cluster similarity and inter-cluster separation. While clustering of genes facilitates learnin...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-10-40
update_date:2009-01-30 00:00:00
abstract:BACKGROUND:RNA secondary structure prediction, or folding, is a classic problem in bioinformatics: given a sequence of nucleotides, the aim is to predict the base pairs formed in its three dimensional conformation. The inverse problem of designing a sequence folding into a particular target structure has only more rece...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-260
update_date:2012-10-09 00:00:00
abstract:BACKGROUND:Coalescent simulations have proven very useful in many population genetics studies. In order to arrive at meaningful conclusions, it is important that these simulations resemble the process of molecular evolution as much as possible. To date, no single coalescent program is able to simulate codon sequences s...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-458
update_date:2007-11-20 00:00:00
abstract:BACKGROUND:Protein sequence motifs are by definition short fragments of conserved amino acids, often associated with a specific function. Accordingly protein sequence profiles derived from multiple sequence alignments provide an alternative description of functional motifs characterizing families of related sequences. ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-6-164
update_date:2005-06-29 00:00:00
abstract:BACKGROUND:Interpretation of quantitative metagenomics data is important for our understanding of ecosystem functioning and assessing differences between various environmental samples. There is a need for an easy to use tool to explore the often complex metagenomics data in taxonomic and functional context. RESULTS:He...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-14-38
update_date:2013-02-01 00:00:00
abstract:BACKGROUND:The rapid pace of bioscience research makes it very challenging to track relevant articles in one's area of interest. MEDLINE, a primary source for biomedical literature, offers access to more than 20 million citations with three-quarters of a million new ones added each year. Thus it is not surprising to se...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0630-0
update_date:2015-06-20 00:00:00
abstract:BACKGROUND:Community structure is ubiquitous in biological networks. There has been an increased interest in unraveling the community structure of biological systems as it may provide important insights into a system's functional components and the impact of local structures on dynamics at a global scale. Choosing an a...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-15-220
update_date:2014-06-25 00:00:00
abstract:BACKGROUND:Automated genotype calling in tetraploid species was until recently not possible, which hampered genetic analysis. Modern genotyping assays often produce two signals, one for each allele of a bi-allelic marker. While ample software is available to obtain genotypes (homozygous for either allele, or heterozygo...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-172
update_date:2011-05-19 00:00:00
abstract:BACKGROUND:The information in large collections of phylogenetic trees is useful for many comparative genomic studies. Therefore, there is a need for flexible tools that allow exploration of such collections in order to retrieve relevant data as quickly as possible. RESULTS:In this paper, we present TPMS (Tree Pattern-...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-14-109
update_date:2013-03-27 00:00:00
abstract:BACKGROUND:Protein-protein interactions (PPIs) can offer compelling evidence for protein function, especially when viewed in the context of proteome-wide interactomes. Bacteria have been popular subjects of interactome studies: more than six different bacterial species have been the subjects of comprehensive interactom...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-017-1585-0
update_date:2017-03-16 00:00:00
abstract:Metabolic network alignment is a system scale comparative analysis that discovers important similarities and differences across different metabolisms and organisms. Although the problem of aligning metabolic networks has been considered in the past, the computational complexity of the existing solutions has so far lim...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S3-S2
update_date:2012-03-21 00:00:00