Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering.

Abstract:

BACKGROUND: Microarray technologies produce large amounts of data. Hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs), which are a major drawback for clustering methods. Usually the MVs are left untreated, replaced by zero, or estimated by the k-Nearest Neighbor (kNN) approach. This paper studies the stability of gene clusters, defined by various hierarchical clustering algorithms, in microarray experiments with and without MVs.

RESULTS: In this study, we show that MVs have important effects on the stability of gene clusters. Moreover, the magnitude of gene misallocation depends on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, surprisingly even at a very small proportion of MVs (e.g. 1%). In most cases, the MVs must be replaced by expected values. Replacing MVs with the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that the kNN approach is less suitable for extreme values of gene expression.

CONCLUSION: The presence of MVs, even at a low rate, is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used, and some methods should be used carefully. Nevertheless, the kNN approach is an efficient method for restoring missing gene expression values, with a low error level. Our study highlights the need for statistical treatment of microarray data to avoid misinterpretation.
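The two replacement strategies named in the abstract, zero filling and kNN estimation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`knn_impute`, `zero_impute`, `rmse`), the choice of `k`, the distance normalisation, and the toy expression matrix are all assumptions made for illustration only.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries of a genes x conditions matrix from the k nearest genes.

    Distance between two genes is the root mean squared difference over the
    conditions observed in both (an assumption of this sketch).
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for n, other in enumerate(matrix):
                # A neighbour must be a different gene with condition j observed.
                if n == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Fall back to zero only when no usable neighbour exists.
            filled[i][j] = (sum(v for _, v in nearest) / len(nearest)) if nearest else 0.0
    return filled


def zero_impute(matrix):
    """Replace every None entry by zero."""
    return [[0.0 if v is None else v for v in row] for row in matrix]


def rmse(truth, estimate, positions):
    """Root mean squared error restricted to the masked positions."""
    errors = [(truth[i][j] - estimate[i][j]) ** 2 for i, j in positions]
    return math.sqrt(sum(errors) / len(errors))


# Toy data (invented): three over-expressed and three under-expressed genes.
complete = [
    [2.0, 2.1, 1.9, 2.2],
    [2.1, 2.0, 2.0, 2.1],
    [1.9, 2.2, 2.1, 2.0],
    [-2.0, -2.1, -1.9, -2.0],
    [-2.1, -1.9, -2.0, -2.2],
    [-1.9, -2.0, -2.1, -2.1],
]

# Hide two known values to simulate missing values.
mask = [(0, 2), (4, 1)]
observed = [row[:] for row in complete]
for i, j in mask:
    observed[i][j] = None

knn_filled = knn_impute(observed, k=2)
zero_filled = zero_impute(observed)

knn_error = rmse(complete, knn_filled, mask)
zero_error = rmse(complete, zero_filled, mask)
print(f"kNN RMSE:  {knn_error:.3f}")
print(f"zero RMSE: {zero_error:.3f}")
```

On this toy matrix the kNN estimate stays close to the hidden values, while zero filling pulls strongly expressed genes toward the centre. Either filled matrix could then be passed to a hierarchical clustering routine to compare the resulting gene groups with those from the complete data.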

journal_name

BMC Bioinformatics

authors

de Brevern AG,Hazout S,Malpertuy A

doi

10.1186/1471-2105-5-114

pub_date

2004-08-23 00:00:00

pages

114

issn

1471-2105

pii

1471-2105-5-114

journal_volume

5

pub_type

Journal Article
  • GLOSSI: a method to assess the association of genetic loci-sets with complex diseases.

    abstract:BACKGROUND:The developments of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNP) have the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases th...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-102

    authors: Chai HS,Sicotte H,Bailey KR,Turner ST,Asmann YW,Kocher JP

    update_date:2009-04-03 00:00:00

  • Inferring latent task structure for Multitask Learning by Multiple Kernel Learning.

    abstract:BACKGROUND:The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S8-S5

    authors: Widmer C,Toussaint NC,Altun Y,Rätsch G

    update_date:2010-10-26 00:00:00

  • Comparative evaluation of gene set analysis approaches for RNA-Seq data.

    abstract:BACKGROUND:Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-014-0397-8

    authors: Rahmatallah Y,Emmert-Streib F,Glazko G

    update_date:2014-12-05 00:00:00

  • Subfamily specific conservation profiles for proteins based on n-gram patterns.

    abstract:BACKGROUND:A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profile...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-72

    authors: Vries JK,Liu X

    update_date:2008-01-30 00:00:00

  • High-order dynamic Bayesian Network learning with hidden common causes for causal gene regulatory network.

    abstract:BACKGROUND:Inferring gene regulatory network (GRN) has been an important topic in Bioinformatics. Many computational methods infer the GRN from high-throughput expression data. Due to the presence of time delays in the regulatory relationships, High-Order Dynamic Bayesian Network (HO-DBN) is a good model of GRN. Howeve...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0823-6

    authors: Lo LY,Wong ML,Lee KH,Leung KS

    update_date:2015-11-25 00:00:00

  • Machine learning for discovering missing or wrong protein function annotations : A comparison using updated benchmark datasets.

    abstract:BACKGROUND:A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article, Review

    doi:10.1186/s12859-019-3060-6

    authors: Nakano FK,Lietaert M,Vens C

    update_date:2019-09-23 00:00:00

  • Development and tuning of an original search engine for patent libraries in medicinal chemistry.

    abstract:BACKGROUND:The large increase in the size of patent collections has led to the need of efficient search strategies. But the development of advanced text-mining applications dedicated to patents of the biomedical field remains rare, in particular to address the needs of the pharmaceutical & biotech industry, which inten...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S1-S15

    authors: Pasche E,Gobeill J,Kreim O,Oezdemir-Zaech F,Vachon T,Lovis C,Ruch P

    update_date:2014-01-01 00:00:00

  • pSLIP: SVM based protein subcellular localization prediction using multiple physicochemical properties.

    abstract:BACKGROUND:Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. Ho...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-152

    authors: Sarda D,Chua GH,Li KB,Krishnan A

    update_date:2005-06-17 00:00:00

  • Assessment of the relationship between pre-chip and post-chip quality measures for Affymetrix GeneChip expression data.

    abstract:BACKGROUND:Gene expression microarray experiments are expensive to conduct and guidelines for acceptable quality control at intermediate steps before and after the samples are hybridised to chips are vague. We conducted an experiment hybridising RNA from human brain to 117 U133A Affymetrix GeneChips and used these data...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-211

    authors: Jones L,Goldstein DR,Hughes G,Strand AD,Collin F,Dunnett SB,Kooperberg C,Aragaki A,Olson JM,Augood SJ,Faull RL,Luthi-Carter R,Moskvina V,Hodges AK

    update_date:2006-04-19 00:00:00

  • Inferring gene expression dynamics via functional regression analysis.

    abstract:BACKGROUND:Temporal gene expression profiles characterize the time-dynamics of expression of specific genes and are increasingly collected in current gene expression experiments. In the analysis of experiments where gene expression is obtained over the life cycle, it is of interest to relate temporal patterns of gene e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-60

    authors: Müller HG,Chiou JM,Leng X

    update_date:2008-01-28 00:00:00

  • Finite mixture clustering of human tissues with different levels of IGF-1 splice variants mRNA transcripts.

    abstract:BACKGROUND:This study addresses a recurrent biological problem, that is to define a formal clustering structure for a set of tissues on the basis of the relative abundance of multiple alternatively spliced isoforms mRNAs generated by the same gene. To this aim, we have used a model-based clustering approach, based on a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0689-7

    authors: Pelosi M,Alfò M,Martella F,Pappalardo E,Musarò A

    update_date:2015-09-15 00:00:00

  • CONSTAX: a tool for improved taxonomic resolution of environmental fungal ITS sequences.

    abstract:BACKGROUND:One of the most crucial steps in high-throughput sequence-based microbiome studies is the taxonomic assignment of sequences belonging to operational taxonomic units (OTUs). Without taxonomic classification, functional and biological information of microbial communities cannot be inferred or interpreted. The ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1952-x

    authors: Gdanetz K,Benucci GMN,Vande Pol N,Bonito G

    update_date:2017-12-06 00:00:00

  • Protein subcellular localization prediction based on compartment-specific features and structure conservation.

    abstract:BACKGROUND:Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction h...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-330

    authors: Su EC,Chiu HS,Lo A,Hwang JK,Sung TY,Hsu WL

    update_date:2007-09-08 00:00:00

  • CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies.

    abstract:BACKGROUND:Current taxonomic classification tools use exact string matching algorithms that are effective to tackle the data from the next generation sequencing technology. However, the unique error patterns in the third generation sequencing (TGS) technologies could reduce the accuracy of these programs. RESULTS:We d...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03777-y

    authors: Bui VK,Wei C

    update_date:2020-10-20 00:00:00

  • In silico modelling of hormone response elements.

    abstract:BACKGROUND:An important step in understanding the conditions that specify gene expression is the recognition of gene regulatory elements. Due to high diversity of different types of transcription factors and their DNA binding preferences, it is a challenging problem to establish an accurate model for recognition of fun...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-S4-S27

    authors: Stepanova M,Lin F,Lin VC

    update_date:2006-12-12 00:00:00

  • Structural alignment of protein descriptors - a combinatorial model.

    abstract:BACKGROUND:Structural alignment of proteins is one of the most challenging problems in molecular biology. The tertiary structure of a protein strictly correlates with its function and computationally predicted structures are nowadays a main premise for understanding the latter. However, computationally derived 3D model...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1237-9

    authors: Antczak M,Kasprzak M,Lukasiak P,Blazewicz J

    update_date:2016-09-17 00:00:00

  • Privacy-preserving search for chemical compound databases.

    abstract:BACKGROUND:Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the databas...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S18-S6

    authors: Shimizu K,Nuida K,Arai H,Mitsunari S,Attrapadung N,Hamada M,Tsuda K,Hirokawa T,Sakuma J,Hanaoka G,Asai K

    update_date:2015-01-01 00:00:00

  • R2R--software to speed the depiction of aesthetic consensus RNA secondary structures.

    abstract:BACKGROUND:With continuing identification of novel structured noncoding RNAs, there is an increasing need to create schematic diagrams showing the consensus features of these molecules. RNA structural diagrams are typically made either with general-purpose drawing programs like Adobe Illustrator, or with automated or i...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-3

    authors: Weinberg Z,Breaker RR

    update_date:2011-01-04 00:00:00

  • The Korean Bird Information System (KBIS) through open and public participation.

    abstract:BACKGROUND:The importance of biodiversity conservation has been increasing steadily due to its benefits to human beings. Recently, producing and managing biodiversity databases have become much easier because of the information technology (IT) advancement. This made the general public's participation in biodiversity co...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S15-S11

    authors: Paik IH,Lim J,Chun BS,Jin SD,Yu JP,Lee JW,Bhak J,Paek WK

    update_date:2009-12-03 00:00:00

  • An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    abstract:BACKGROUND:Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-121

    authors: Tech M,Meinicke P

    update_date:2006-03-09 00:00:00

  • An automatic device for detection and classification of malaria parasite species in thick blood film.

    abstract:BACKGROUND:Current malaria diagnosis relies primarily on microscopic examination of Giemsa-stained thick and thin blood films. This method requires vigorously trained technicians to efficiently detect and classify the malaria parasite species such as Plasmodium falciparum (Pf) and Plasmodium vivax (Pv) for an appropria...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S17-S18

    authors: Kaewkamnerd S,Uthaipibull C,Intarapanich A,Pannarut M,Chaotheing S,Tongsima S

    update_date:2012-01-01 00:00:00

  • Googling DNA sequences on the World Wide Web.

    abstract:BACKGROUND:New web-based technologies provide an excellent opportunity for sharing and accessing information and using web as a platform for interaction and collaboration. Although several specialized tools are available for analyzing DNA sequence information, conventional web-based tools have not been utilized for bio...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S14-S4

    authors: Hajibabaei M,Singer GA

    update_date:2009-11-10 00:00:00

  • Characterization and sequence prediction of structural variations in α-helix.

    abstract:BACKGROUND:The structure conservation in various α-helix subclasses reveals the sequence and context dependent factors causing distortions in the α-helix. The sequence-structure relationship in these subclasses can be used to predict structural variations in α-helix purely based on its sequence. We train support vector...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S1-S20

    authors: Tendulkar AV,Wangikar PP

    update_date:2011-02-15 00:00:00

  • Prediction of MHC class I binding peptides, using SVMHC.

    abstract:BACKGROUND:T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-3-25

    authors: Dönnes P,Elofsson A

    update_date:2002-09-11 00:00:00

  • Sample size calculation while controlling false discovery rate for differential expression analysis with RNA-sequencing experiments.

    abstract:BACKGROUND:RNA-Sequencing (RNA-seq) experiments have been popularly applied to transcriptome studies in recent years. Such experiments are still relatively costly. As a result, RNA-seq experiments often employ a small number of replicates. Power analysis and sample size calculation are challenging in the context of dif...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-0994-9

    authors: Bi R,Liu P

    update_date:2016-03-31 00:00:00

  • KRLMM: an adaptive genotype calling method for common and low frequency variants.

    abstract:BACKGROUND:SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the under...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-158

    authors: Liu R,Dai Z,Yeager M,Irizarry RA,Ritchie ME

    update_date:2014-05-23 00:00:00

  • Evaluation of gene-expression clustering via mutual information distance measure.

    abstract:BACKGROUND:The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pears...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-111

    authors: Priness I,Maimon O,Ben-Gal I

    update_date:2007-03-30 00:00:00

  • Prediction of virus-host infectious association by supervised learning methods.

    abstract:BACKGROUND:The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relative simple methods based on the simila...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1473-7

    authors: Zhang M,Yang L,Ren J,Ahlgren NA,Fuhrman JA,Sun F

    update_date:2017-03-14 00:00:00

  • Assessing stationary distributions derived from chromatin contact maps.

    abstract:BACKGROUND:The spatial configuration of chromosomes is essential to various cellular processes, notably gene regulation, while architecture related alterations, such as translocations and gene fusions, are often cancer drivers. Thus, eliciting chromatin conformation is important, yet challenging due to compaction, dyna...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3424-y

    authors: Segal MR,Fletez-Brant K

    update_date:2020-02-24 00:00:00

  • BiPOm: a rule-based ontology to represent and infer molecule knowledge from a biological process-centered viewpoint.

    abstract:BACKGROUND:Managing and organizing biological knowledge remains a major challenge, due to the complexity of living systems. Recently, systemic representations have been promising in tackling such a challenge at the whole-cell scale. In such representations, the cell is considered as a system composed of interlocked sub...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03637-9

    authors: Henry V,Saïs F,Inizan O,Marchadier E,Dibie J,Goelzer A,Fromion V

    update_date:2020-07-23 00:00:00