PFClust: a novel parameter free clustering algorithm.

Abstract:

BACKGROUND:We present the algorithm PFClust (Parameter Free Clustering), which automatically clusters data and identifies a suitable number of clusters to group them into, without requiring the user to specify any parameters. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and many different clustering methodologies (hierarchical and partitional, agglomerative and divisive) are available for doing this. Nonetheless, automatically determining the number of clusters present in a dataset remains a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. We therefore test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.
RESULTS:We validate PFClust first on a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies, even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three-dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
CONCLUSIONS:We show that PFClust clusters the test datasets a little better, on average, than any of the other algorithms, and does so without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, and the algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the part-manually curated CATH classification of protein domain structures.
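The abstract evaluates clusterings against an external gold standard but does not name the agreement measure used. As a minimal sketch of how such a comparison can be made, the following computes the Rand index, which is the fraction of object pairs that two clusterings treat the same way (both together or both apart). The choice of the Rand index, the function name `rand_index`, and the example labels are assumptions made here for illustration; they are not taken from the abstract.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of object pairs on which two clusterings agree.

    Illustrative helper (an assumption, not the paper's code): a pair counts
    as an agreement when both clusterings put the two objects in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        # Zero or one object: there is nothing to disagree about.
        return 1.0
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical gold-standard and predicted cluster labels for six objects.
gold = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 0, 0, 2, 2]
score = rand_index(gold, predicted)
```

Because the measure depends only on pair co-membership, it is unaffected by how the clusters are numbered and can compare clusterings with different numbers of clusters, which matters when one method chooses the cluster count itself.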

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Mavridis L,Nath N,Mitchell JB

doi

10.1186/1471-2105-14-213

subject

Has Abstract

pub_date

2013-07-03 00:00:00

pages

213

issn

1471-2105

pii

1471-2105-14-213

journal_volume

14

pub_type

Journal Article
  • SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    abstract:BACKGROUND:Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequenc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2242-y

    authors: Yu Q,Wei D,Huo H

    update_date:2018-06-18 00:00:00

  • MIR@NT@N: a framework integrating transcription factors, microRNAs and their targets to identify sub-network motifs in a meta-regulation network model.

    abstract:BACKGROUND:To understand biological processes and diseases, it is crucial to unravel the concerted interplay of transcription factors (TFs), microRNAs (miRNAs) and their targets within regulatory networks and fundamental sub-networks. An integrative computational resource generating a comprehensive view of these regula...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-67

    authors: Le Béchec A,Portales-Casamar E,Vetter G,Moes M,Zindy PJ,Saumet A,Arenillas D,Theillet C,Wasserman WW,Lecellier CH,Friederich E

    update_date:2011-03-04 00:00:00

  • IRSS: a web-based tool for automatic layout and analysis of IRES secondary structure prediction and searching system in silico.

    abstract:BACKGROUND:Internal ribosomal entry sites (IRESs) provide alternative, cap-independent translation initiation sites in eukaryotic cells. IRES elements are important factors in viral genomes and are also useful tools for bi-cistronic expression vectors. Most existing RNA structure prediction programs are unable to deal ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-160

    authors: Wu TY,Hsieh CC,Hong JJ,Chen CY,Tsai YS

    update_date:2009-05-27 00:00:00

  • Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span.

    abstract:BACKGROUND:The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caeno...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-250

    authors: Blei DM,Franks K,Jordan MI,Mian IS

    update_date:2006-05-08 00:00:00

  • SDA: a semi-parametric differential abundance analysis method for metabolomics and proteomics data.

    abstract:BACKGROUND:Identifying differentially abundant features between different experimental groups is a common goal for many metabolomics and proteomics studies. However, analyzing data from mass spectrometry (MS) is difficult because the data may not be normally distributed and there is often a large fraction of zero value...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3067-z

    authors: Li Y,Fan TWM,Lane AN,Kang WY,Arnold SM,Stromberg AJ,Wang C,Chen L

    update_date:2019-10-17 00:00:00

  • CONFOLD2: improved contact-driven ab initio protein structure modeling.

    abstract:BACKGROUND:Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted cont...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2032-6

    authors: Adhikari B,Cheng J

    update_date:2018-01-25 00:00:00

  • Hotspot Hunter: a computational system for large-scale screening and selection of candidate immunological hotspots in pathogen proteomes.

    abstract:BACKGROUND:T-cell epitopes that promiscuously bind to multiple alleles of a human leukocyte antigen (HLA) supertype are prime targets for development of vaccines and immunotherapies because they are relevant to a large proportion of the human population. The presence of clusters of promiscuous T-cell epitopes, immunolo...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S1-S19

    authors: Zhang GL,Khan AM,Srinivasan KN,Heiny A,Lee K,Kwoh CK,August JT,Brusic V

    update_date:2008-01-01 00:00:00

  • GC/MS based metabolomics: development of a data mining system for metabolite identification by using soft independent modeling of class analogy (SIMCA).

    abstract:BACKGROUND:The goal of metabolomics analyses is a comprehensive and systematic understanding of all metabolites in biological samples. Many useful platforms have been developed to achieve this goal. Gas chromatography coupled to mass spectrometry (GC/MS) is a well-established analytical method in metabolomics study, an...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-131

    authors: Tsugawa H,Tsujimoto Y,Arita M,Bamba T,Fukusaki E

    update_date:2011-05-04 00:00:00

  • BatchPrimer3: a high throughput web application for PCR and sequencing primer design.

    abstract:BACKGROUND:Microsatellite (simple sequence repeat - SSR) and single nucleotide polymorphism (SNP) markers are two types of important genetic markers useful in genetic mapping and genotyping. Often, large-scale genomic research projects require high-throughput computer-assisted primer design. Numerous such web-based or ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-253

    authors: You FM,Huo N,Gu YQ,Luo MC,Ma Y,Hane D,Lazo GR,Dvorak J,Anderson OD

    update_date:2008-05-29 00:00:00

  • XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data.

    abstract:BACKGROUND:Mouse xenografts from (patient-derived) tumors (PDX) or tumor cell lines are widely used as models to study various biological and preclinical aspects of cancer. However, analyses of their RNA and DNA profiles are challenging, because they comprise reads not only from the grafted human cancer but also from t...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2353-5

    authors: Kluin RJC,Kemper K,Kuilman T,de Ruiter JR,Iyer V,Forment JV,Cornelissen-Steijger P,de Rink I,Ter Brugge P,Song JY,Klarenbeek S,McDermott U,Jonkers J,Velds A,Adams DJ,Peeper DS,Krijgsman O

    update_date:2018-10-04 00:00:00

  • A molecular model of the full-length human NOD-like receptor family CARD domain containing 5 (NLRC5) protein.

    abstract:BACKGROUND:Pattern recognition receptors of the immune system have key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-275

    authors: Mótyán JA,Bagossi P,Benkő S,Tőzsér J

    update_date:2013-09-17 00:00:00

  • Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation.

    abstract:BACKGROUND:High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of g...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-14

    authors: Jelier R,Jenster G,Dorssers LC,Wouters BJ,Hendriksen PJ,Mons B,Delwel R,Kors JA

    update_date:2007-01-18 00:00:00

  • Inferring the role of transcription factors in regulatory networks.

    abstract:BACKGROUND:Expression profiles obtained from multiple perturbation experiments are increasingly used to reconstruct transcriptional regulatory networks, from well studied, simple organisms up to higher eukaryotes. Admittedly, a key ingredient in developing a reconstruction method is its ability to integrate heterogeneo...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-228

    authors: Veber P,Guziolowski C,Le Borgne M,Radulescu O,Siegel A

    update_date:2008-05-06 00:00:00

  • Recursive model for dose-time responses in pharmacological studies.

    abstract:BACKGROUND:Clinical studies often track dose-response curves of subjects over time. One can easily model the dose-response curve at each time point with Hill equation, but such a model fails to capture the temporal evolution of the curves. On the other hand, one can use Gompertz equation to model the temporal behaviors...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2831-4

    authors: Dhruba SR,Rahman A,Rahman R,Ghosh S,Pal R

    update_date:2019-06-20 00:00:00

  • Discovering motifs that induce sequencing errors.

    abstract:BACKGROUND:Elevated sequencing error rates are the most predominant obstacle in single-nucleotide polymorphism (SNP) detection, which is a major goal in the bulk of current studies using next-generation sequencing (NGS). Beyond routinely handled generic sources of errors, certain base calling errors relate to specific ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S5-S1

    authors: Allhoff M,Schönhuth A,Martin M,Costa IG,Rahmann S,Marschall T

    update_date:2013-01-01 00:00:00

  • LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates.

    abstract:BACKGROUND:Non-negative matrix factorisation (NMF), a machine learning algorithm, has been applied to the analysis of microarray data. A key feature of NMF is the ability to identify patterns that together explain the data as a linear combination of expression signatures. Microarray data generally includes individual e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-175

    authors: Wang G,Kossenkov AV,Ochs MF

    update_date:2006-03-28 00:00:00

  • Moiety modeling framework for deriving moiety abundances from mass spectrometry measured isotopologues.

    abstract:BACKGROUND:Stable isotope tracing can follow individual atoms through metabolic transformations through the detection of the incorporation of stable isotope within metabolites. This resulting data can be interpreted in terms related to metabolic flux. However, detection of a stable isotope in metabolites by mass spectr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-3096-7

    authors: Jin H,Moseley HNB

    update_date:2019-10-28 00:00:00

  • Automated multigroup outlier identification in molecular high-throughput data using bagplots and gemplots.

    abstract:BACKGROUND:Analyses of molecular high-throughput data often lack in robustness, i.e. results are very sensitive to the addition or removal of a single observation. Therefore, the identification of extreme observations is an important step of quality control before doing further data analysis. Standard outlier detection...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1645-5

    authors: Kruppa J,Jung K

    update_date:2017-05-02 00:00:00

  • Drug-target interaction prediction using semi-bipartite graph model and deep learning.

    abstract:BACKGROUND:Identifying drug-target interaction is a key element in drug discovery. In silico prediction of drug-target interaction can speed up the process of identifying unknown interactions between drugs and target proteins. In recent studies, handcrafted features, similarity metrics and machine learning methods have...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3518-6

    authors: Eslami Manoochehri H,Nourani M

    update_date:2020-07-06 00:00:00

  • Advances in translational bioinformatics facilitate revealing the landscape of complex disease mechanisms.

    abstract:Advances of high-throughput technologies have rapidly produced more and more data from DNAs and RNAs to proteins, especially large volumes of genome-scale data. However, connection of the genomic information to cellular functions and biological behaviours relies on the development of effective approaches at higher sys...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S17-I1

    authors: Yang JY,Dunker A,Liu JS,Qin X,Arabnia HR,Yang W,Niemierko A,Chen Z,Luo Z,Wang L,Liu Y,Xu D,Deng Y,Tong W,Yang M

    update_date:2014-01-01 00:00:00

  • circRNAprofiler: an R-based computational framework for the downstream analysis of circular RNAs.

    abstract:BACKGROUND:Circular RNAs (circRNAs) are a newly appreciated class of non-coding RNA molecules. Numerous tools have been developed for the detection of circRNAs, however computational tools to perform downstream functional analysis of circRNAs are scarce. RESULTS:We present circRNAprofiler, an R-based computational fra...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3500-3

    authors: Aufiero S,Reckman YJ,Tijsen AJ,Pinto YM,Creemers EE

    update_date:2020-04-29 00:00:00

  • Quantiprot - a Python package for quantitative analysis of protein sequences.

    abstract:BACKGROUND:The field of protein sequence analysis is dominated by tools rooted in substitution matrices and alignments. A complementary approach is provided by methods of quantitative characterization. A major advantage of the approach is that quantitative properties defines a multidimensional solution space, where seq...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1751-4

    authors: Konopka BM,Marciniak M,Dyrka W

    update_date:2017-07-17 00:00:00

  • Subfamily specific conservation profiles for proteins based on n-gram patterns.

    abstract:BACKGROUND:A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profile...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-72

    authors: Vries JK,Liu X

    update_date:2008-01-30 00:00:00

  • The effect of rare variants on inflation of the test statistics in case-control analyses.

    abstract:BACKGROUND:The detection of bias due to cryptic population structure is an important step in the evaluation of findings of genetic association studies. The standard method of measuring this bias in a genetic association study is to compare the observed median association test statistic to the expected median test stati...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0496-1

    authors: Pirie A,Wood A,Lush M,Tyrer J,Pharoah PD

    update_date:2015-02-20 00:00:00

  • CNN-based ranking for biomedical entity normalization.

    abstract:BACKGROUND:Most state-of-the-art biomedical entity normalization systems, such as rule-based systems, merely rely on morphological information of entity mentions, but rarely consider their semantic information. In this paper, we introduce a novel convolutional neural network (CNN) architecture that regards biomedical e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1805-7

    authors: Li H,Chen Q,Tang B,Wang X,Xu H,Wang B,Huang D

    update_date:2017-10-03 00:00:00

  • WellInverter: a web application for the analysis of fluorescent reporter gene data.

    abstract:BACKGROUND:Fluorescent reporter genes have become widely used for monitoring gene expression in living cells. When a microbial strain carrying a reporter gene is grown in a microplate reader, the fluorescence and the absorbance (optical density) of the culture can be automatically measured every few minutes in a highly...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2920-4

    authors: Martin Y,Page M,Blanchet C,de Jong H

    update_date:2019-06-11 00:00:00

  • 'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools.

    abstract:BACKGROUND:Knowing the subcellular location of proteins provides clues to their function as well as the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-420

    authors: Shen YQ,Burger G

    update_date:2007-10-29 00:00:00

  • SPECS: a non-parametric method to identify tissue-specific molecular features for unbalanced sample groups.

    abstract:BACKGROUND:To understand biology and differences among various tissues or cell types, one typically searches for molecular features that display characteristic abundance patterns. Several specificity metrics have been introduced to identify tissue-specific molecular features, but these either require an equal number of...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3407-z

    authors: Everaert C,Volders PJ,Morlion A,Thas O,Mestdagh P

    update_date:2020-02-17 00:00:00

  • TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins.

    abstract:BACKGROUND:Beta-barrel transmembrane (bbtm) proteins are a functionally important and diverse group of proteins expressed in the outer membranes of bacteria (both gram negative and acid fast gram positive), mitochondria and chloroplasts. Despite recent publications describing reasonable levels of accuracy for discrimin...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-56

    authors: Garrow AG,Agnew A,Westhead DR

    update_date:2005-03-15 00:00:00

  • Large scale tissue histopathology image classification, segmentation, and visualization via deep convolutional activation features.

    abstract:BACKGROUND:Histopathology image analysis is a gold standard for cancer recognition and diagnosis. Automatic analysis of histopathology images can help pathologists diagnose tumor and cancer subtypes, alleviating the workload of pathologists. There are two basic types of tasks in digital histopathology image analysis: i...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1685-x

    authors: Xu Y,Jia Z,Wang LB,Ai Y,Zhang F,Lai M,Chang EI

    update_date:2017-05-26 00:00:00