Abstract:
BACKGROUND:Microarray technologies produced large amount of data. The hierarchical clustering is commonly used to identify clusters of co-expressed genes. However, microarray datasets often contain missing values (MVs) representing a major drawback for the use of the clustering methods. Usually the MVs are not treated, or replaced by zero or estimated by the k-Nearest Neighbor (kNN) approach. The topic of the paper is to study the stability of gene clusters, defined by various hierarchical clustering algorithms, of microarrays experiments including or not MVs. RESULTS:In this study, we show that the MVs have important effects on the stability of the gene clusters. Moreover, the magnitude of the gene misallocations is depending on the aggregation algorithm. The most appropriate aggregation methods (e.g. complete-linkage and Ward) are highly sensitive to MVs, and surprisingly, for a very tiny proportion of MVs (e.g. 1%). In most of the case, the MVs must be replaced by expected values. The MVs replacement by the kNN approach clearly improves the identification of co-expressed gene clusters. Nevertheless, we observe that kNN approach is less suitable for the extreme values of gene expression. CONCLUSION:The presence of MVs (even at a low rate) is a major factor of gene cluster instability. In addition, the impact depends on the hierarchical clustering algorithm used. Some methods should be used carefully. Nevertheless, the kNN approach constitutes one efficient method for restoring the missing expression gene values, with a low error level. Our study highlights the need of statistical treatments in microarray data to avoid misinterpretation.
journal_name
BMC Bioinformaticsjournal_title
BMC bioinformaticsauthors
de Brevern AG,Hazout S,Malpertuy Adoi
10.1186/1471-2105-5-114keywords:
subject
Has Abstractpub_date
2004-08-23 00:00:00pages
114issn
1471-2105pii
1471-2105-5-114journal_volume
5pub_type
杂志文章abstract:BACKGROUND:The developments of high-throughput genotyping technologies, which enable the simultaneous genotyping of hundreds of thousands of single nucleotide polymorphisms (SNP) have the potential to increase the benefits of genetic epidemiology studies. Although the enhanced resolution of these platforms increases th...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-102
更新日期:2009-04-03 00:00:00
abstract:BACKGROUND:The lack of sufficient training data is the limiting factor for many Machine Learning applications in Computational Biology. If data is available for several different but related problem domains, Multitask Learning algorithms can be used to learn a model based on all available information. In Bioinformatics...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-11-S8-S5
更新日期:2010-10-26 00:00:00
abstract:BACKGROUND:Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-014-0397-8
更新日期:2014-12-05 00:00:00
abstract:BACKGROUND:A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profile...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-9-72
更新日期:2008-01-30 00:00:00
abstract:BACKGROUND:Inferring gene regulatory network (GRN) has been an important topic in Bioinformatics. Many computational methods infer the GRN from high-throughput expression data. Due to the presence of time delays in the regulatory relationships, High-Order Dynamic Bayesian Network (HO-DBN) is a good model of GRN. Howeve...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-015-0823-6
更新日期:2015-11-25 00:00:00
abstract:BACKGROUND:A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical m...
journal_title:BMC bioinformatics
pub_type: 杂志文章,评审
doi:10.1186/s12859-019-3060-6
更新日期:2019-09-23 00:00:00
abstract:BACKGROUND:The large increase in the size of patent collections has led to the need of efficient search strategies. But the development of advanced text-mining applications dedicated to patents of the biomedical field remains rare, in particular to address the needs of the pharmaceutical & biotech industry, which inten...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-15-S1-S15
更新日期:2014-01-01 00:00:00
abstract:BACKGROUND:Protein subcellular localization is an important determinant of protein function and hence, reliable methods for prediction of localization are needed. A number of prediction algorithms have been developed based on amino acid compositions or on the N-terminal characteristics (signal peptides) of proteins. Ho...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-6-152
更新日期:2005-06-17 00:00:00
abstract:BACKGROUND:Gene expression microarray experiments are expensive to conduct and guidelines for acceptable quality control at intermediate steps before and after the samples are hybridised to chips are vague. We conducted an experiment hybridising RNA from human brain to 117 U133A Affymetrix GeneChips and used these data...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-7-211
更新日期:2006-04-19 00:00:00
abstract:BACKGROUND:Temporal gene expression profiles characterize the time-dynamics of expression of specific genes and are increasingly collected in current gene expression experiments. In the analysis of experiments where gene expression is obtained over the life cycle, it is of interest to relate temporal patterns of gene e...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-9-60
更新日期:2008-01-28 00:00:00
abstract:BACKGROUND:This study addresses a recurrent biological problem, that is to define a formal clustering structure for a set of tissues on the basis of the relative abundance of multiple alternatively spliced isoforms mRNAs generated by the same gene. To this aim, we have used a model-based clustering approach, based on a...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-015-0689-7
更新日期:2015-09-15 00:00:00
abstract:BACKGROUND:One of the most crucial steps in high-throughput sequence-based microbiome studies is the taxonomic assignment of sequences belonging to operational taxonomic units (OTUs). Without taxonomic classification, functional and biological information of microbial communities cannot be inferred or interpreted. The ...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-017-1952-x
更新日期:2017-12-06 00:00:00
abstract:BACKGROUND:Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction h...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-8-330
更新日期:2007-09-08 00:00:00
abstract:BACKGROUND:Current taxonomic classification tools use exact string matching algorithms that are effective to tackle the data from the next generation sequencing technology. However, the unique error patterns in the third generation sequencing (TGS) technologies could reduce the accuracy of these programs. RESULTS:We d...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-020-03777-y
更新日期:2020-10-20 00:00:00
abstract:BACKGROUND:An important step in understanding the conditions that specify gene expression is the recognition of gene regulatory elements. Due to high diversity of different types of transcription factors and their DNA binding preferences, it is a challenging problem to establish an accurate model for recognition of fun...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-7-S4-S27
更新日期:2006-12-12 00:00:00
abstract:BACKGROUND:Structural alignment of proteins is one of the most challenging problems in molecular biology. The tertiary structure of a protein strictly correlates with its function and computationally predicted structures are nowadays a main premise for understanding the latter. However, computationally derived 3D model...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-016-1237-9
更新日期:2016-09-17 00:00:00
abstract:BACKGROUND:Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the databas...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-16-S18-S6
更新日期:2015-01-01 00:00:00
abstract:BACKGROUND:With continuing identification of novel structured noncoding RNAs, there is an increasing need to create schematic diagrams showing the consensus features of these molecules. RNA structural diagrams are typically made either with general-purpose drawing programs like Adobe Illustrator, or with automated or i...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-12-3
更新日期:2011-01-04 00:00:00
abstract:BACKGROUND:The importance of biodiversity conservation has been increasing steadily due to its benefits to human beings. Recently, producing and managing biodiversity databases have become much easier because of the information technology (IT) advancement. This made the general public's participation in biodiversity co...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-S15-S11
更新日期:2009-12-03 00:00:00
abstract:BACKGROUND:Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-7-121
更新日期:2006-03-09 00:00:00
abstract:BACKGROUND:Current malaria diagnosis relies primarily on microscopic examination of Giemsa-stained thick and thin blood films. This method requires vigorously trained technicians to efficiently detect and classify the malaria parasite species such as Plasmodium falciparum (Pf) and Plasmodium vivax (Pv) for an appropria...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-13-S17-S18
更新日期:2012-01-01 00:00:00
abstract:BACKGROUND:New web-based technologies provide an excellent opportunity for sharing and accessing information and using web as a platform for interaction and collaboration. Although several specialized tools are available for analyzing DNA sequence information, conventional web-based tools have not been utilized for bio...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-10-S14-S4
更新日期:2009-11-10 00:00:00
abstract:BACKGROUND:The structure conservation in various α-helix subclasses reveals the sequence and context dependent factors causing distortions in the α-helix. The sequence-structure relationship in these subclasses can be used to predict structural variations in α-helix purely based on its sequence. We train support vector...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-12-S1-S20
更新日期:2011-02-15 00:00:00
abstract:BACKGROUND:T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. MHC-peptide complexes are potential tools for diagnosis and treatment of pathogens and cancer, as well...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-3-25
更新日期:2002-09-11 00:00:00
abstract:BACKGROUND:RNA-Sequencing (RNA-seq) experiments have been popularly applied to transcriptome studies in recent years. Such experiments are still relatively costly. As a result, RNA-seq experiments often employ a small number of replicates. Power analysis and sample size calculation are challenging in the context of dif...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-016-0994-9
更新日期:2016-03-31 00:00:00
abstract:BACKGROUND:SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the under...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-15-158
更新日期:2014-05-23 00:00:00
abstract:BACKGROUND:The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pears...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/1471-2105-8-111
更新日期:2007-03-30 00:00:00
abstract:BACKGROUND:The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relative simple methods based on the simila...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-017-1473-7
更新日期:2017-03-14 00:00:00
abstract:BACKGROUND:The spatial configuration of chromosomes is essential to various cellular processes, notably gene regulation, while architecture related alterations, such as translocations and gene fusions, are often cancer drivers. Thus, eliciting chromatin conformation is important, yet challenging due to compaction, dyna...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-020-3424-y
更新日期:2020-02-24 00:00:00
abstract:BACKGROUND:Managing and organizing biological knowledge remains a major challenge, due to the complexity of living systems. Recently, systemic representations have been promising in tackling such a challenge at the whole-cell scale. In such representations, the cell is considered as a system composed of interlocked sub...
journal_title:BMC bioinformatics
pub_type: 杂志文章
doi:10.1186/s12859-020-03637-9
更新日期:2020-07-23 00:00:00