Abstract:
BACKGROUND:The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has been dependent upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods which aim to identify novel and biological relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks, i.e. restoring pre-defined biological meaningful phenotypes, differentiating novel phenotypes from known ones and clarifying novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens. RESULTS:Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype, and then used as reference distribution in gap statistics. This method is broadly applicable to a number of different types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable to adaptively process new images which are continuously added to existing datasets. Validations were carried out on different dataset, including published RNAi screening using Drosophila embryos [Additional files 1, 2], dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and synthetic dataset using polygons, our methods tackled three aforementioned tasks effectively with an accuracy range of 85%-90%. When our method is implemented in the context of a Drosophila genome-scale RNAi image-based screening of cultured cells aimed to identifying the contribution of individual genes towards the regulation of cell-shape, it efficiently discovers meaningful new phenotypes and provides novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used to online phenotype discovery. In different conditions, we compared the SVM based method with our method using various datasets and our methods consistently outperformed SVM based method in at least two of three tasks by 2% to 5%. These results demonstrate that our methods can be used to better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms. CONCLUSION:We demonstrate that our method can detect various novel phenotypes effectively in complex datasets. Experiment results also validate that our method performs consistently under different order of image input, variation of starting conditions including the number and composition of existing phenotypes, and dataset from different screens. In our findings, the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens.
journal_name
BMC Bioinformaticsjournal_title
BMC bioinformaticsauthors
Yin Z,Zhou X,Bakal C,Li F,Sun Y,Perrimon N,Wong STdoi
10.1186/1471-2105-9-264subject
Has Abstractpub_date
2008-06-05 00:00:00pages
264issn
1471-2105pii
1471-2105-9-264journal_volume
9pub_type
杂志文章abstract:BACKGROUND:Prostate carcinoma is among the most common types of cancer affecting hundreds of thousands people every year. Once the metastatic form of prostate carcinoma is documented, the majority of patients die from their tumors as opposed to other causes. The key to successful treatment is in the earliest possible d...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-S11-S6
更新日期:2009-10-08 00:00:00
abstract:BACKGROUND:The organization of the canonical code has intrigued researches since it was first described. If we consider all codes mapping the 64 codes into 20 amino acids and one stop codon, there are more than 1.51×10(84) possible genetic codes. The main question related to the organization of the genetic code is why ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-015-0480-9
更新日期:2015-02-19 00:00:00
abstract:BACKGROUND:An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biolo...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-13-94
更新日期:2012-05-11 00:00:00
abstract:BACKGROUND:Formal classification of a large collection of protein structures aids the understanding of evolutionary relationships among them. Classifications involving manual steps, such as SCOP and CATH, face the challenge of increasing volume of available structures. Automatic methods such as FSSP or Dali Domain Dict...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-9-74
更新日期:2008-01-31 00:00:00
abstract::We developed a new computational technique called Step-Level Differential Response (SLDR) to identify genetic regulatory relationships. Our technique takes advantages of functional genomics data for the same species under different perturbation conditions, therefore complementary to current popular computational techn...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-15-S11-S1
更新日期:2014-01-01 00:00:00
abstract:BACKGROUND:Single-pass, partial sequencing of complementary DNA (cDNA) libraries generates thousands of chromatograms that are processed into high quality expressed sequence tags (ESTs), and then assembled into contigs representative of putative genes. Usually, to be of value, ESTs and contigs must be associated with m...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-5-176
更新日期:2004-11-04 00:00:00
abstract:BACKGROUND:Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effective processing of these data has not been equally as fast. In particular, Machin...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-S6-S24
更新日期:2009-06-16 00:00:00
abstract:BACKGROUND:The incorporation of biological knowledge can enhance the analysis of biomedical data. We present a novel method that uses a proteomic knowledge base to enhance the performance of a rule-learning algorithm in identifying putative biomarkers of disease from high-dimensional proteomic mass spectral data. In pa...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-S9-S16
更新日期:2009-09-17 00:00:00
abstract:BACKGROUND:Prioritizing genes according to their associations with a cancer allows researchers to explore genes in more informed ways. By far, Gene-centric or network-centric gene prioritization methods are predominated. Genes and their protein products carry out cellular processes in the context of functional modules....
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-018-2216-0
更新日期:2018-06-05 00:00:00
abstract:BACKGROUND:Genetic variations predispose individuals to hereditary diseases, play important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-015-0651-8
更新日期:2015-07-16 00:00:00
abstract:BACKGROUND:Human immunology studies often rely on the isolation and quantification of cell populations from an input sample based on flow cytometry and related techniques. Such techniques classify cells into populations based on the detection of a pattern of markers. The description of the cell populations targeted in ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-019-2725-5
更新日期:2019-04-25 00:00:00
abstract:BACKGROUND:In current comparative proteomics studies, the large number of images generated by 2D gels is currently compared using spot matching algorithms. Unfortunately, differences in gel migration and sample variability make efficient spot alignment very difficult to obtain, and, as consequence most of the software ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-9-460
更新日期:2008-10-28 00:00:00
abstract:BACKGROUND:The goal of metabolomics analyses is a comprehensive and systematic understanding of all metabolites in biological samples. Many useful platforms have been developed to achieve this goal. Gas chromatography coupled to mass spectrometry (GC/MS) is a well-established analytical method in metabolomics study, an...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-12-131
更新日期:2011-05-04 00:00:00
abstract:BACKGROUND:Visualization tools for deep learning models typically focus on discovering key input features without considering how such low level features are combined in intermediate layers to make decisions. Moreover, many of these methods examine a network's response to specific input examples that may be insufficien...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-019-2957-4
更新日期:2019-07-19 00:00:00
abstract:BACKGROUND:Over the course of the last few years there has been a significant amount of research performed on ontology-based formalization of phenotype descriptions. In order to fully capture the intrinsic value and knowledge expressed within them, we need to take advantage of their inner structure, which implicitly co...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-13-265
更新日期:2012-10-15 00:00:00
abstract:BACKGROUND:When constructing new biomarker or gene signature scores for time-to-event outcomes, the underlying aims are to develop a discrimination model that helps to predict whether patients have a poor or good prognosis and to identify the most influential variables for this task. In practice, this is often done fit...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-016-1149-8
更新日期:2016-07-22 00:00:00
abstract:BACKGROUND:Somatic copy number alternation (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. The allele-specific SCNA analysis of a tumor sample aims to identify the allele-specific copy numbers of both alleles, adjusting for the ploidy and the tumor purity. Next gene...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-018-2412-y
更新日期:2018-11-14 00:00:00
abstract:BACKGROUND:Atomic details of protein-DNA complexes can provide insightful information for better understanding of the function and binding specificity of DNA binding proteins. In addition to experimental methods for solving protein-DNA complex structures, protein-DNA docking can be used to predict native or near-native...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-018-2538-y
更新日期:2018-12-21 00:00:00
abstract:BACKGROUND:Dynamic programming algorithms provide exact solutions to many problems in computational biology, such as sequence alignment, RNA folding, hidden Markov models (HMMs), and scoring of phylogenetic trees. Structurally analogous algorithms compute optimal solutions, evaluate score distributions, and perform sto...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-16-S19-S2
更新日期:2015-01-01 00:00:00
abstract:BACKGROUND:Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous consist of genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-16-S11-S4
更新日期:2015-01-01 00:00:00
abstract:BACKGROUND:Direct in vivo investigation of human metabolism is complicated by the distinct metabolic functions of various sub-cellular organelles. Diverse micro-environments in different organelles may lead to distinct functions of the same protein and the use of different enzymes for the same metabolic reaction. To be...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-11-393
更新日期:2010-07-22 00:00:00
abstract:BACKGROUND:Recently differential variability has been showed to be valuable in evaluating the association of DNA methylation to the risks of complex human diseases. The statistical tests based on both differential methylation level and differential variability can be more powerful than those based only on differential ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-018-2185-3
更新日期:2018-05-18 00:00:00
abstract:BACKGROUND:To infer the tree of life requires knowledge of the common characteristics of each species descended from a common ancestor as the measuring criteria and a method to calculate the distance between the resulting values of each measure. Conventional phylogenetic analysis based on genomic sequences provides inf...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-7-284
更新日期:2006-06-06 00:00:00
abstract:BACKGROUND:Techniques for reconstruction of biological networks which are based on perturbation experiments often predict direct interactions between nodes that do not exist. Transitive reduction removes such relations if they can be explained by an indirect path of influences. The existing algorithms for transitive re...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-13-281
更新日期:2012-10-30 00:00:00
abstract:BACKGROUND:Long short-term memory (LSTM) is one of the most attractive deep learning methods to learn time series or contexts of input data. Increasing studies, including biological sequence analyses in bioinformatics, utilize this architecture. Amino acid sequence profiles are widely used for bioinformatics studies, s...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-018-2284-1
更新日期:2018-07-18 00:00:00
abstract:BACKGROUND:Genome scale data on protein interactions are generally represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another. Since proteins tend to function in groups, or complexes, an important goal has been to reliably identify protein complexes from these graphs. Th...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-99
更新日期:2009-03-30 00:00:00
abstract:BACKGROUND:Patient records contain valuable information regarding explanation of diagnosis, progression of disease, prescription and/or effectiveness of treatment, and more. Automatic recognition of clinically important concepts and the identification of relationships between those concepts in patient records are preli...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-12-S3-S3
更新日期:2011-06-09 00:00:00
abstract:BACKGROUND:Differential co-expression (DCX) signifies change in degree of co-expression of a set of genes among different biological conditions. It has been used to identify differential co-expression networks or interactomes. Many algorithms have been developed for single-factor differential co-expression analysis and...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-017-1963-7
更新日期:2017-12-28 00:00:00
abstract:BACKGROUND:Many centrality measures have been proposed to mine and characterize the correlations between network topological properties and protein essentiality. However, most of them show limited prediction accuracy, and the number of common predicted essential proteins by different methods is very small. RESULTS:In ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-016-1166-7
更新日期:2016-08-25 00:00:00
abstract:BACKGROUND:Cartilage damage is a crucial feature involved in several pathological conditions characterized by joint disorders, such as osteoarthritis and rheumatoid arthritis. Accumulated evidences showed that Wnt/β-catenin pathway plays a role in the pathogenesis of cartilage damage. In addition, it is experimentally ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-019-2981-4
更新日期:2019-07-31 00:00:00