Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes.

Abstract:

BACKGROUND: The landscape of biological and biomedical research is changing rapidly with the invention of microarrays, which enable a simultaneous view of the transcription levels of a huge number of genes across different experimental conditions or time points. Clustering algorithms have been actively applied to microarray data sets to identify groups of co-expressed genes. This article poses fuzzy clustering of microarray data as a multiobjective optimization problem in which two internal fuzzy cluster validity indices are optimized simultaneously to yield a set of Pareto-optimal clustering solutions. Each of these solutions carries some information about the clustering structure of the input data. Motivated by this fact, a novel fuzzy majority voting approach is proposed to combine the clustering information from all the solutions in the resulting Pareto-optimal set. The approach first identifies the genes that most of the Pareto-optimal solutions assign to a particular cluster with a high membership degree. Using this set of genes as the training set, the remaining genes are classified by a supervised learning algorithm; in this work, a Support Vector Machine (SVM) classifier is used for this purpose.

RESULTS: The performance of the proposed clustering technique has been demonstrated on five publicly available benchmark microarray data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis thaliana, Human Fibroblasts Serum and Rat Central Nervous System. Comparative studies of different SVM kernels and of several widely used microarray clustering techniques are reported. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. Finally, biological significance tests have been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of co-expressed genes.

CONCLUSION: The proposed clustering method has been shown to perform better than other well-known clustering algorithms in finding clusters of co-expressed genes. The clusters of genes produced by the proposed technique are also found to be biologically significant, i.e., they consist of genes that belong to the same functional groups. This indicates that the proposed clustering method can be used to identify co-expressed genes in microarray gene expression data.

Supplementary Website: The pre-processed and normalized data sets, the MATLAB code and other related materials are available at http://anirbanmukhopadhyay.50webs.com/mogasvm.html.
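
As an illustration of the fuzzy majority voting step described in the abstract, the following minimal Python sketch selects training genes from a set of fuzzy membership matrices. It is not the authors' MATLAB implementation. The function name `select_training_genes`, the thresholds `alpha` (minimum membership degree) and `beta` (minimum fraction of agreeing solutions) and the toy data are all hypothetical, and the sketch assumes that cluster labels have already been aligned across the Pareto-optimal solutions. Training the SVM on the selected genes is omitted.

```python
from collections import Counter


def select_training_genes(memberships, alpha=0.5, beta=0.5):
    """Fuzzy majority voting over several clustering solutions (illustrative sketch).

    memberships[s][g][k] is the fuzzy membership of gene g in cluster k under
    solution s. Cluster indices are ASSUMED to be aligned across solutions.
    Returns (train, rest): train maps gene index -> voted cluster label,
    rest lists the genes left for a supervised classifier (e.g. an SVM).
    """
    n_solutions = len(memberships)
    n_genes = len(memberships[0])
    train, rest = {}, []
    for g in range(n_genes):
        votes = Counter()
        for solution in memberships:
            row = solution[g]
            # Cluster with the highest membership for this gene in this solution.
            best = max(range(len(row)), key=row.__getitem__)
            # A solution votes only if that membership is high enough.
            if row[best] >= alpha:
                votes[best] += 1
        if votes:
            label, count = votes.most_common(1)[0]
            # Keep the gene only if more than a fraction beta of solutions agree.
            if count / n_solutions > beta:
                train[g] = label
                continue
        rest.append(g)
    return train, rest


# Toy data (hypothetical): 3 solutions, 3 genes, 2 clusters.
solutions = [
    [[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]],
    [[0.80, 0.20], [0.30, 0.70], [0.40, 0.60]],
    [[0.95, 0.05], [0.60, 0.40], [0.50, 0.50]],
]
train, rest = select_training_genes(solutions, alpha=0.7, beta=0.5)
print(train, rest)
```

In this toy run, genes 0 and 1 receive confident votes from a majority of solutions and form the training set, while gene 2 is left for the classifier. In practice the expression profiles of the training genes, labeled with their voted clusters, would be used to fit the classifier that assigns the remaining genes.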

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Maulik U,Mukhopadhyay A,Bandyopadhyay S

doi

10.1186/1471-2105-10-27

subject

Has Abstract

pub_date

2009-01-20 00:00:00

pages

27

issn

1471-2105

pii

1471-2105-10-27

journal_volume

10

pub_type

Journal Article
  • Variable cellular decision-making behavior in a constant synthetic network topology.

    abstract:BACKGROUND:Modules of interacting components arranged in specific network topologies have evolved to perform a diverse array of cellular functions. For a network with a constant topological structure, its function within a cell may still be tuned by changing the number of instances of a particular component (e.g., gene...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2866-6

    authors: Shah NA,Sarkar CA

    update_date:2019-05-14 00:00:00

  • In silico design of targeted SRM-based experiments.

    abstract:Selected reaction monitoring (SRM)-based proteomics approaches enable highly sensitive and reproducible assays for profiling of thousands of peptides in one experiment. The development of such assays involves the determination of retention time, detectability and fragmentation properties of peptides, followed by an op...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S16-S8

    authors: Nahnsen S,Kohlbacher O

    update_date:2012-01-01 00:00:00

  • Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis.

    abstract:BACKGROUND:Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret genes of interest after retrieving f...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-178

    authors: Baty F,Rüdiger J,Miglino N,Kern L,Borger P,Brutsche M

    update_date:2013-06-06 00:00:00

  • Bounded search for de novo identification of degenerate cis-regulatory elements.

    abstract:BACKGROUND:The identification of statistically overrepresented sequences in the upstream regions of coregulated genes should theoretically permit the identification of potential cis-regulatory elements. However, in practice many cis-regulatory elements are highly degenerate, precluding the use of an exhaustive word-cou...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-254

    authors: Carlson JM,Chakravarty A,Khetani RS,Gross RH

    update_date:2006-05-15 00:00:00

  • CellSim: a novel software to calculate cell similarity and identify their co-regulation networks.

    abstract:BACKGROUND:Direct cell reprogramming technology has developed rapidly owing to its low tumor risk and its avoidance of the ethical issues raised by stem cells, but it is still limited to specific cell types. Direct reprogramming from an original cell to target cell type needs the cell similarity and cell specific regu...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2699-3

    authors: Li L,Che D,Wang X,Zhang P,Rahman SU,Zhao J,Yu J,Tao S,Lu H,Liao M

    update_date:2019-03-04 00:00:00

  • Recursive model for dose-time responses in pharmacological studies.

    abstract:BACKGROUND:Clinical studies often track dose-response curves of subjects over time. One can easily model the dose-response curve at each time point with Hill equation, but such a model fails to capture the temporal evolution of the curves. On the other hand, one can use Gompertz equation to model the temporal behaviors...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2831-4

    authors: Dhruba SR,Rahman A,Rahman R,Ghosh S,Pal R

    update_date:2019-06-20 00:00:00

  • Decoding HMMs using the k best paths: algorithms and applications.

    abstract:BACKGROUND:Traditional algorithms for hidden Markov model decoding seek to maximize either the probability of a state path or the number of positions of a sequence assigned to the correct state. These algorithms provide only a single answer and in practice do not produce good results. RESULTS:We explore an alternative...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S1-S28

    authors: Brown DG,Golod D

    update_date:2010-01-18 00:00:00

  • phyloXML: XML for evolutionary biology and comparative genomics.

    abstract:BACKGROUND:Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree b...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-356

    authors: Han MV,Zmasek CM

    update_date:2009-10-27 00:00:00

  • Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data.

    abstract:BACKGROUND:Surveys of presences and absences of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3118-5

    authors: Chung NC,Miasojedow B,Startek M,Gambin A

    update_date:2019-12-24 00:00:00

  • A method of predicting changes in human gene splicing induced by genetic variants in context of cis-acting elements.

    abstract:BACKGROUND:Polymorphic variants and mutations disrupting canonical splicing isoforms are among the leading causes of human hereditary disorders. While there is a substantial evidence of aberrant splicing causing Mendelian diseases, the implication of such events in multi-genic disorders is yet to be well understood. We...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-22

    authors: Churbanov A,Vorechovský I,Hicks C

    update_date:2010-01-12 00:00:00

  • DART: Denoising Algorithm based on Relevance network Topology improves molecular pathway activity inference.

    abstract:BACKGROUND:Inferring molecular pathway activity is an important step towards reducing the complexity of genomic data, understanding the heterogeneity in clinical outcome, and obtaining molecular correlates of cancer imaging traits. Increasingly, approaches towards pathway activity inference combine molecular profiles (...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-403

    authors: Jiao Y,Lawler K,Patel GS,Purushotham A,Jones AF,Grigoriadis A,Tutt A,Ng T,Teschendorff AE

    update_date:2011-10-19 00:00:00

  • Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins.

    abstract:BACKGROUND:Molecular docking is a widely-employed method in structure-based drug design. An essential component of molecular docking programs is a scoring function (SF) that can be used to identify the most stable binding pose of a ligand, when bound to a receptor protein, from among a large set of candidate poses. Des...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S6-S3

    authors: Ashtawy HM,Mahapatra NR

    update_date:2015-01-01 00:00:00

  • Challenges in estimating percent inclusion of alternatively spliced junctions from RNA-seq data.

    abstract:Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S6-S11

    authors: Kakaradov B,Xiong HY,Lee LJ,Jojic N,Frey BJ

    update_date:2012-04-19 00:00:00

  • Developing optimal input design strategies in cancer systems biology with applications to microfluidic device engineering.

    abstract:BACKGROUND:Mechanistic models are becoming more and more popular in Systems Biology; identification and control of models underlying biochemical pathways of interest in oncology is a primary goal in this field. Unfortunately the scarce availability of data still limits our understanding of the intrinsic characteristics...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S12-S4

    authors: Menolascina F,Bellomo D,Maiwald T,Bevilacqua V,Ciminelli C,Paradiso A,Tommasi S

    update_date:2009-10-15 00:00:00

  • Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme.

    abstract:BACKGROUND:Bioluminescent proteins (BLPs) exist widely in many living organisms. Because BLPs are characterized by their ability to emit light, they can serve as biomarkers and be easily detected in biomedical research, such as gene expression analysis and signal transduction pathways. Therefore, accurate identification o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1709-6

    authors: Zhang J,Chai H,Yang G,Ma Z

    update_date:2017-06-05 00:00:00

  • Predicting and improving the protein sequence alignment quality by support vector regression.

    abstract:BACKGROUND:For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significant...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-471

    authors: Lee M,Jeong CS,Kim D

    update_date:2007-12-03 00:00:00

  • Coverage statistics for sequence census methods.

    abstract:BACKGROUND:We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how thi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-430

    authors: Evans SN,Hower V,Pachter L

    update_date:2010-08-18 00:00:00

  • Integrating gene expression and GO classification for PCA by preclustering.

    abstract:BACKGROUND:Gene expression data can be analyzed by summarizing groups of individual gene expression profiles based on GO annotation information. The mean expression profile per group can then be used to identify interesting GO categories in relation to the experimental settings. However, the expression profiles present...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-158

    authors: De Haan JR,Piek E,van Schaik RC,de Vlieg J,Bauerschmidt S,Buydens LM,Wehrens R

    update_date:2010-03-26 00:00:00

  • Detecting transitions in protein dynamics using a recurrence quantification analysis based bootstrap method.

    abstract:BACKGROUND:Proteins undergo conformational transitions over different time scales. These transitions are closely intertwined with the protein's function. Numerous standard techniques such as principal component analysis are used to detect these transitions in molecular dynamics simulations. In this work, we add a new m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1943-y

    authors: Karain WI

    update_date:2017-11-28 00:00:00

  • Markov clustering versus affinity propagation for the partitioning of protein interaction graphs.

    abstract:BACKGROUND:Genome scale data on protein interactions are generally represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another. Since proteins tend to function in groups, or complexes, an important goal has been to reliably identify protein complexes from these graphs. Th...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-99

    authors: Vlasblom J,Wodak SJ

    update_date:2009-03-30 00:00:00

  • Impact of polymorphic transposable elements on transcription in lymphoblastoid cell lines from public data.

    abstract:BACKGROUND:Transposable elements (TEs) are DNA sequences able to mobilize themselves and to increase their copy-number in the host genome. In the past, they have been considered mainly selfish DNA without evident functions. Nevertheless, currently they are believed to have been extensively involved in the evolution of ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3113-x

    authors: Spirito G,Mangoni D,Sanges R,Gustincich S

    update_date:2019-11-22 00:00:00

  • svapls: an R package to correct for hidden factors of variability in gene expression studies.

    abstract:BACKGROUND:Hidden variability is a fundamentally important issue in the context of gene expression studies. Collected tissue samples may have a wide variety of hidden effects that may alter their transcriptional landscape significantly. As a result their actual differential expression pattern can be potentially distort...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-236

    authors: Chakraborty S,Datta S,Datta S

    update_date:2013-07-24 00:00:00

  • PIGS: improved estimates of identity-by-descent probabilities by probabilistic IBD graph sampling.

    abstract:Identifying segments in the genome of different individuals that are identical-by-descent (IBD) is a fundamental element of genetics. IBD data is used for numerous applications including demographic inference, heritability estimation, and mapping disease loci. Simultaneous detection of IBD over multiple haplotypes has...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S5-S9

    authors: Park DS,Baran Y,Hormozdiari F,Eng C,Torgerson DG,Burchard EG,Zaitlen N

    update_date:2015-01-01 00:00:00

  • Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets.

    abstract:BACKGROUND:Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology all...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-241

    authors: Aubry M,Monnier A,Chicault C,de Tayrac M,Galibert MD,Burgun A,Mosser J

    update_date:2006-05-04 00:00:00

  • Multi-omic analysis of signalling factors in inflammatory comorbidities.

    abstract:BACKGROUND:Inflammation is a core element of many different, systemic and chronic diseases that usually involve an important autoimmune component. The clinical phase of inflammatory diseases is often the culmination of a long series of pathologic events that started years before. The systemic characteristics and relate...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2413-x

    authors: Xiao H,Bartoszek K,Lio' P

    update_date:2018-11-30 00:00:00

  • LDNFSGB: prediction of long non-coding RNA and disease association using network feature similarity and gradient boosting.

    abstract:BACKGROUND:A large number of experimental studies show that the mutation and regulation of long non-coding RNAs (lncRNAs) are associated with various human diseases. Accurate prediction of lncRNA-disease associations can provide a new perspective for the diagnosis and treatment of diseases. The main function of many ln...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03721-0

    authors: Zhang Y,Ye F,Xiong D,Gao X

    update_date:2020-09-03 00:00:00

  • Primary orthologs from local sequence context.

    abstract:BACKGROUND:The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes don't code for proteins, yielding a need to infer evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-, as opposed to ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3384-2

    authors: Gao K,Miller J

    update_date:2020-02-06 00:00:00

  • Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.

    abstract:BACKGROUND:High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-94

    authors: Bullard JH,Purdom E,Hansen KD,Dudoit S

    update_date:2010-02-18 00:00:00

  • A stochastic context free grammar based framework for analysis of protein sequences.

    abstract:BACKGROUND:In the last decade, there have been many applications of formal language theory in bioinformatics such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of relationship between amino acids have mainly limited...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-323

    authors: Dyrka W,Nebel JC

    update_date:2009-10-08 00:00:00

  • Repliscan: a tool for classifying replication timing regions.

    abstract:BACKGROUND:Replication timing experiments that use label incorporation and high throughput sequencing produce peaked data similar to ChIP-Seq experiments. However, the differences in experimental design, coverage density, and possible results make traditional ChIP-Seq analysis methods inappropriate for use with replica...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1774-x

    authors: Zynda GJ,Song J,Concia L,Wear EE,Hanley-Bowdoin L,Thompson WF,Vaughn MW

    update_date:2017-08-07 00:00:00