Fast batch searching for protein homology based on compression and clustering.

Abstract:

BACKGROUND: In the bioinformatics community, many tasks involve matching a set of protein query sequences against large sequence datasets. To run multiple queries against a database, a commonly used method is to run BLAST on each original query or on the concatenated queries. This is inefficient because it does not exploit the common subsequences shared by the queries. RESULTS: We propose a compression- and clustering-based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. First, the queries and the database are compressed in turn through the procedures of redundancy analysis, redundancy removal and distinction recording. Second, the database is clustered according to the Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. The hit-finding operator is then applied to the clustered database. Furthermore, an execution database is constructed from the potential hits found, with the objective of mitigating the effect of the increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on the NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch homology searching in sequence databases. The results are evaluated in terms of homology accuracy, search speed and memory usage. CONCLUSIONS: C2-BLASTP achieves competitive results compared with some state-of-the-art methods.
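
The clustering step named in the abstract (grouping subsequences by Hamming distance after mapping residues to a reduced amino acid alphabet) can be illustrated with a minimal sketch. This is not the authors' implementation: the alphabet grouping in `REDUCED_GROUPS`, the greedy first-fit clustering strategy, and the names `reduce_sequence`, `kmers` and `cluster_words` are all assumptions chosen for illustration, and the abstract does not say which ten alphabets or which clustering procedure C2-BLASTP uses.

```python
from typing import Dict, List

# Hypothetical reduced alphabet, for illustration only. It is NOT one of the
# ten groupings referred to in the abstract.
REDUCED_GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]

# Map every residue to the first letter of its group.
REDUCE: Dict[str, str] = {aa: group[0] for group in REDUCED_GROUPS for aa in group}


def reduce_sequence(seq: str) -> str:
    """Rewrite a protein sequence in the reduced alphabet (unknown residues -> 'X')."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length words."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length words")
    return sum(x != y for x, y in zip(a, b))


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping subsequences of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cluster_words(words: List[str], max_dist: int) -> Dict[str, List[str]]:
    """Greedy first-fit clustering: a word joins the first cluster whose
    representative is within max_dist, otherwise it starts a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for word in words:
        for representative in clusters:
            if hamming(representative, word) <= max_dist:
                clusters[representative].append(word)
                break
        else:
            clusters[word] = [word]
    return clusters


if __name__ == "__main__":
    reduced = reduce_sequence("AVLKRDEAVIKKDD")
    for representative, members in cluster_words(kmers(reduced, 4), max_dist=1).items():
        print(representative, members)
```

In a scheme of this kind, a query word would only need to be compared against cluster representatives rather than against every database subsequence, which is one plausible reading of how clustering can reduce the cost of hit finding.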

journal_name

BMC Bioinformatics

authors

Ge H,Sun L,Yu J

doi

10.1186/s12859-017-1938-8

pub_date

2017-11-21 00:00:00

pages

508

issue

1

issn

1471-2105

pii

10.1186/s12859-017-1938-8

journal_volume

18

pub_type

Journal Article
  • DLAD4U: deriving and prioritizing disease lists from PubMed literature.

    abstract:BACKGROUND:Due to recent technology advancements, disease related knowledge is growing rapidly. It becomes nontrivial to go through all published literature to identify associations between human diseases and genetic, environmental, and life style factors, disease symptoms, and treatment strategies. Here we report DLAD...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2463-0

    authors: Shen J,Vasaikar S,Zhang B

    update_date:2018-12-28 00:00:00

  • Bioinformatics research in the Asia Pacific: a 2007 update.

    abstract:We provide a 2007 update on the bioinformatics research in the Asia-Pacific from the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation set up in 1998. From 2002, APBioNet has organized the first International Conference on Bioinformatics (InCoB) bringing together scientists work...

    journal_title:BMC bioinformatics

    pub_type:

    doi:10.1186/1471-2105-9-S1-S1

    authors: Ranganathan S,Gribskov M,Tan TW

    update_date:2008-01-01 00:00:00

  • Simultaneous phylogeny reconstruction and multiple sequence alignment.

    abstract:BACKGROUND:A phylogeny is the evolutionary history of a group of organisms. To date, sequence data is still the most used data type for phylogenetic reconstruction. Before any sequences can be used for phylogeny reconstruction, they must be aligned, and the quality of the multiple sequence alignment has been shown to a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S1-S11

    authors: Yue F,Shi J,Tang J

    update_date:2009-01-30 00:00:00

  • Using mechanistic Bayesian networks to identify downstream targets of the sonic hedgehog pathway.

    abstract:BACKGROUND:The topology of a biological pathway provides clues as to how a pathway operates, but rationally using this topology information with observed gene expression data remains a challenge. RESULTS:We introduce a new general-purpose analytic method called Mechanistic Bayesian Networks (MBNs) that allows for the ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-433

    authors: Shah A,Tenzen T,McMahon AP,Woolf PJ

    update_date:2009-12-18 00:00:00

  • High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID).

    abstract:BACKGROUND:We previously developed GoMiner, an application that organizes lists of 'interesting' genes (for example, under-and overexpressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology. The original version of GoMiner was oriented toward visualization and interp...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-168

    authors: Zeeberg BR,Qin H,Narasimhan S,Sunshine M,Cao H,Kane DW,Reimers M,Stephens RM,Bryant D,Burt SK,Elnekave E,Hari DM,Wynn TA,Cunningham-Rundles C,Stewart DM,Nelson D,Weinstein JN

    update_date:2005-07-05 00:00:00

  • Finding sRNA generative locales from high-throughput sequencing data with NiBLS.

    abstract:BACKGROUND:Next-generation sequencing technologies allow researchers to obtain millions of sequence reads in a single experiment. One important use of the technology is the sequencing of small non-coding regulatory RNAs and the identification of the genomic locales from which they originate. Currently, there is a pauci...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-93

    authors: MacLean D,Moulton V,Studholme DJ

    update_date:2010-02-18 00:00:00

  • GlyStruct: glycation prediction using structural properties of amino acid residues.

    abstract:BACKGROUND:Glycation is one of the post-translational modifications (PTM) where sugar molecules and residues in protein sequences are covalently bonded. It has become one of the clinically important PTM in recent times attributed to many chronic and age related complications. Being a non-enzymatic reaction, it is a g...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2547-x

    authors: Reddy HM,Sharma A,Dehzangi A,Shigemizu D,Chandra AA,Tsunoda T

    update_date:2019-02-04 00:00:00

  • An algorithm for automated closure during assembly.

    abstract:BACKGROUND:Finishing is the process of improving the quality and utility of draft genome sequences generated by shotgun sequencing and computational assembly. Finishing can involve targeted sequencing. Finishing reads may be incorporated by manual or automated means. One automated method uses targeted addition by local...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-457

    authors: Koren S,Miller JR,Walenz BP,Sutton G

    update_date:2010-09-10 00:00:00

  • A novel parametric approach to mine gene regulatory relationship from microarray datasets.

    abstract:BACKGROUND:Microarray has been widely used to measure the gene expression level on the genome scale in the current decade. Many algorithms have been developed to reconstruct gene regulatory networks based on microarray data. Unfortunately, most of these models and algorithms focus on global properties of the expression...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S11-S15

    authors: Liu W,Li D,Liu Q,Zhu Y,He F

    update_date:2010-12-14 00:00:00

  • Integrated olfactory receptor and microarray gene expression databases.

    abstract:BACKGROUND:Gene expression patterns of olfactory receptors (ORs) are an important component of the signal encoding mechanism in the olfactory system since they determine the interactions between odorant ligands and sensory neurons. We have developed the Olfactory Receptor Microarray Database (ORMD) to house OR gene exp...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-231

    authors: Liu N,Crasto CJ,Ma M

    update_date:2007-06-30 00:00:00

  • An improved string composition method for sequence comparison.

    abstract:BACKGROUND:Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison-one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-S6-S15

    authors: Lu G,Zhang S,Fang X

    update_date:2008-05-28 00:00:00

  • Protein Sequence Annotation Tool (PSAT): a centralized web-based meta-server for high-throughput sequence annotations.

    abstract:BACKGROUND:Here we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-0887-y

    authors: Leung E,Huang A,Cadag E,Montana A,Soliman JL,Zhou CL

    update_date:2016-01-20 00:00:00

  • How large B-factors can be in protein crystal structures.

    abstract:BACKGROUND:Protein crystal structures are potentially over-interpreted since they are routinely refined without any restraint on the upper limit of atomic B-factors. Consequently, some of their atoms, undetected in the electron density maps, are allowed to reach extremely large B-factors, even above 100 square Angstrom...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2083-8

    authors: Carugo O

    update_date:2018-02-23 00:00:00

  • EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries.

    abstract:BACKGROUND:Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of seque...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-186

    authors: Smith RP,Buchser WJ,Lemmon MB,Pardinas JR,Bixby JL,Lemmon VP

    update_date:2008-04-10 00:00:00

  • FocAn: automated 3D analysis of DNA repair foci in image stacks acquired by confocal fluorescence microscopy.

    abstract:BACKGROUND:Phosphorylated histone H2AX, also known as γH2AX, forms μm-sized nuclear foci at the sites of DNA double-strand breaks (DSBs) induced by ionizing radiation and other agents. Due to their specificity and sensitivity, γH2AX immunoassays have become the gold standard for studying DSB induction and repair. One o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3370-8

    authors: Memmel S,Sisario D,Zimmermann H,Sauer M,Sukhorukov VL,Djuzenova CS,Flentje M

    update_date:2020-01-28 00:00:00

  • A knowledge discovery object model API for Java.

    abstract:BACKGROUND:Biological data resources have become heterogeneous and derive from multiple sources. This introduces challenges in the management and utilization of this data in software development. Although efforts are underway to create a standard format for the transmission and storage of biological data, this objectiv...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-4-51

    authors: Zuyderduyn SD,Jones SJ

    update_date:2003-10-28 00:00:00

  • Smith-Waterman peak alignment for comprehensive two-dimensional gas chromatography-mass spectrometry.

    abstract:BACKGROUND:Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC × GC-MS) is a powerful technique which has gained increasing attention over the last two decades. The GC × GC-MS provides much increased separation capacity, chemical selectivity and sensitivity for complex sample analysis an...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-235

    authors: Kim S,Koo I,Fang A,Zhang X

    update_date:2011-06-15 00:00:00

  • Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays.

    abstract:BACKGROUND:Reproducibility of results can have a significant impact on the acceptance of new technologies in gene expression analysis. With the recent introduction of the so-called next-generation sequencing (NGS) technology and established microarrays, one is able to choose between two completely different platforms f...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-176

    authors: Stiglic G,Bajgot M,Kokol P

    update_date:2010-04-08 00:00:00

  • Improving the prediction of mRNA extremities in the parasitic protozoan Leishmania.

    abstract:BACKGROUND:Leishmania and other members of the Trypanosomatidae family diverged early on in eukaryotic evolution and consequently display unique cellular properties. Their apparent lack of transcriptional regulation is compensated by complex post-transcriptional control mechanisms, including the processing of polycistr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-158

    authors: Smith M,Blanchette M,Papadopoulou B

    update_date:2008-03-20 00:00:00

  • GraphDNA: a Java program for graphical display of DNA composition analyses.

    abstract:BACKGROUND:Under conditions of no strand bias the number of Gs is equal to that of Cs for each DNA strand; similarly, the total number of Ts is equal to that of As. However, within each strand there are considerable local deviations from the A = T and G = C equality. These asymmetries in nucleotide composition have bee...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-21

    authors: Thomas JM,Horspool D,Brown G,Tcherepanov V,Upton C

    update_date:2007-01-23 00:00:00

  • Prior knowledge guided eQTL mapping for identifying candidate genes.

    abstract:BACKGROUND:Expression quantitative trait loci (eQTL) mapping is often used to identify genetic loci and candidate genes correlated with traits. Although usually a group of genes affect complex traits, genes in most eQTL mapping methods are considered as independent. Recently, some eQTL mapping methods have accounted fo...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1387-9

    authors: Wang Y,Richard R,Pan Y

    update_date:2016-12-13 00:00:00

  • An SVD-based comparison of nine whole eukaryotic genomes supports a coelomate rather than ecdysozoan lineage.

    abstract:BACKGROUND:Eukaryotic whole genome sequences are accumulating at an impressive rate. Effective methods for comparing multiple whole eukaryotic genomes on a large scale are needed. Most attempted solutions involve the production of large scale alignments, and many of these require a high stringency pre-screen for putati...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-204

    authors: Stuart GW,Berry MW

    update_date:2004-12-17 00:00:00

  • Bayesian Unidimensional Scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations.

    abstract:BACKGROUND:Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1790-x

    authors: Nguyen LH,Holmes S

    update_date:2017-09-13 00:00:00

  • Detection of nuclei in 4D Nomarski DIC microscope images of early Caenorhabditis elegans embryos using local image entropy and object tracking.

    abstract:BACKGROUND:The ability to detect nuclei in embryos is essential for studying the development of multicellular organisms. A system of automated nuclear detection has already been tested on a set of four-dimensional (4D) Nomarski differential interference contrast (DIC) microscope images of Caenorhabditis elegans embryos...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-125

    authors: Hamahashi S,Onami S,Kitano H

    update_date:2005-05-24 00:00:00

  • Measuring similarities between transcription factor binding sites.

    abstract:BACKGROUND:Collections of transcription factor binding profiles (Transfac, Jaspar) are essential to identify regulatory elements in DNA sequences. Subsets of highly similar profiles complicate large scale analysis of transcription factor binding sites. RESULTS:We propose to identify and group similar profiles using tw...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-237

    authors: Kielbasa SM,Gonze D,Herzel H

    update_date:2005-09-28 00:00:00

  • Automated peptide mapping and protein-topographical annotation of proteomics data.

    abstract:BACKGROUND:In quantitative proteomics, peptide mapping is a valuable approach to combine positional quantitative information with topographical and domain information of proteins. Quantitative proteomic analysis of cell surface shedding is an exemplary application area of this approach. RESULTS:We developed ImproViser...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-207

    authors: Videm P,Gunasekaran D,Schröder B,Mayer B,Biniossek ML,Schilling O

    update_date:2014-06-19 00:00:00

  • Local search for the generalized tree alignment problem.

    abstract:BACKGROUND:A phylogeny postulates shared ancestry relationships among organisms in the form of a binary tree. Phylogenies attempt to answer an important question posed in biology: what are the ancestor-descendent relationships between organisms? At the core of every biological problem lies a phylogenetic component. The...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-66

    authors: Varón A,Wheeler WC

    update_date:2013-02-26 00:00:00

  • Domain fusion analysis by applying relational algebra to protein sequence and domain databases.

    abstract:BACKGROUND:Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain datab...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-4-16

    authors: Truong K,Ikura M

    update_date:2003-05-06 00:00:00

  • Use of a structural alphabet for analysis of short loops connecting repetitive structures.

    abstract:BACKGROUND:Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we call...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-5-58

    authors: Fourrier L,Benros C,de Brevern AG

    update_date:2004-05-12 00:00:00

  • Bison: bisulfite alignment on nodes of a cluster.

    abstract:BACKGROUND:DNA methylation changes are associated with a wide array of biological processes. Bisulfite conversion of DNA followed by high-throughput sequencing is increasingly being used to assess genome-wide methylation at single-base resolution. The relative slowness of most commonly used aligners for processing such...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-337

    authors: Ryan DP,Ehninger D

    update_date:2014-10-18 00:00:00