CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies.

Abstract:

BACKGROUND: Current taxonomic classification tools use exact string matching algorithms, which are effective for data from next-generation sequencing technology. However, the distinctive error patterns of third-generation sequencing (TGS) technologies can reduce the accuracy of these programs.

RESULTS: We developed CDKAM, a Classification tool using Discriminative K-mers and Approximate Matching. Its approximate matching method searches for k-mers in two phases: a quick mapping phase and a dynamic programming phase. Simulated datasets as well as real TGS datasets were used to compare the performance of CDKAM with existing methods. CDKAM performed better in many aspects, especially when classifying TGS data with an average length of 1000-1500 bases.

CONCLUSIONS: CDKAM is an effective program with higher accuracy and a lower memory requirement for TGS metagenome sequence classification. It achieves high species-level accuracy.
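
The two-phase k-mer search described in the abstract can be illustrated with a minimal sketch. This is not CDKAM's implementation: the abstract does not specify either phase in detail, so the sketch assumes the quick mapping phase is an exact hash lookup and the dynamic programming phase is a plain Levenshtein distance against every reference k-mer. The names `edit_distance`, `classify_kmer`, `classify_read`, `index` and `max_errors` are all hypothetical, as is the majority vote over k-mer hits.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def classify_kmer(kmer, index, max_errors=1):
    """Return the taxon for a k-mer, tolerating up to max_errors edits."""
    # Phase 1 (assumed): quick mapping via exact lookup in a hash table.
    if kmer in index:
        return index[kmer]
    # Phase 2 (assumed): dynamic programming against each reference k-mer.
    best_taxon, best_dist = None, max_errors + 1
    for ref, taxon in index.items():
        dist = edit_distance(kmer, ref)
        if dist < best_dist:
            best_taxon, best_dist = taxon, dist
    return best_taxon


def classify_read(read, index, k, max_errors=1):
    """Assign a read to the taxon hit by most of its k-mers (hypothetical vote)."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        taxon = classify_kmer(read[i:i + k], index, max_errors)
        if taxon is not None:
            votes[taxon] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy index of discriminative k-mers; sequences and taxon labels are made up.
index = {"ACGTA": "taxA", "TTGCA": "taxB"}
```

With this toy index, `classify_kmer("ACGAA", index)` falls through to the second phase because the k-mer has one substitution relative to `ACGTA`. The linear scan in phase 2 is only for readability; a practical tool would restrict the dynamic programming step to a small candidate set.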

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Bui VK,Wei C

doi

10.1186/s12859-020-03777-y

pub_date

2020-10-20 00:00:00

pages

468

issue

1

issn

1471-2105

pii

10.1186/s12859-020-03777-y

journal_volume

21

pub_type

Journal Article
  • Integration of open access literature into the RCSB Protein Data Bank using BioLit.

    abstract:BACKGROUND:Biological data have traditionally been stored and made publicly available through a variety of on-line databases, whereas biological knowledge has traditionally been found in the printed literature. With journals now on-line and providing an increasing amount of open access content, often free of copyright ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-220

    authors: Prlić A,Martinez MA,Dimitropoulos D,Beran B,Yukich BT,Rose PW,Bourne PE,Fink JL

    updated:2010-04-29 00:00:00

  • Improved functional prediction of proteins by learning kernel combinations in multilabel settings.

    abstract:BACKGROUND:We develop a probabilistic model for combining kernel matrices to predict the function of proteins. It extends previous approaches in that it can handle multiple labels which naturally appear in the context of protein function. RESULTS:Explicit modeling of multilabels significantly improves the capability o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-S2-S12

    authors: Roth V,Fischer B

    updated:2007-05-03 00:00:00

  • iSeg: an efficient algorithm for segmentation of genomic and epigenomic data.

    abstract:BACKGROUND:Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2140-3

    authors: Girimurugan SB,Liu Y,Lung PY,Vera DL,Dennis JH,Bass HW,Zhang J

    updated:2018-04-11 00:00:00

  • Prediction of dinucleotide-specific RNA-binding sites in proteins.

    abstract:BACKGROUND:Regulation of gene expression, protein synthesis, replication and assembly of many viruses involve RNA-protein interactions. Although some successful computational tools have been reported to recognize RNA binding sites in proteins, the problem of specificity remains poorly investigated. After the nucleotide...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S13-S5

    authors: Fernandez M,Kumagai Y,Standley DM,Sarai A,Mizuguchi K,Ahmad S

    updated:2011-01-01 00:00:00

  • Robust joint analysis allowing for model uncertainty in two-stage genetic association studies.

    abstract:BACKGROUND:The cost efficient two-stage design is often used in genome-wide association studies (GWASs) in searching for genetic loci underlying the susceptibility for complex diseases. Replication-based analysis, which considers data from each stage separately, often suffers from loss of efficiency. Joint test that co...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-9

    authors: Pan D,Li Q,Jiang N,Liu A,Yu K

    updated:2011-01-07 00:00:00

  • Using Gene Ontology to describe the role of the neurexin-neuroligin-SHANK complex in human, mouse and rat and its relevance to autism.

    abstract:BACKGROUND:People with an autistic spectrum disorder (ASD) display a variety of characteristic behavioral traits, including impaired social interaction, communication difficulties and repetitive behavior. This complex neurodevelopment disorder is known to be associated with a combination of genetic and environmental fa...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0622-0

    authors: Patel S,Roncaglia P,Lovering RC

    updated:2015-06-06 00:00:00

  • LDpop: an interactive online tool to calculate and visualize geographic LD patterns.

    abstract:BACKGROUND:Linkage disequilibrium (LD)-the non-random association of alleles at different loci-defines population-specific haplotypes which vary by genomic ancestry. Assessment of allelic frequencies and LD patterns from a variety of ancestral populations enables researchers to better understand population histories as...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-3340-1

    authors: Alexander TA,Machiela MJ

    updated:2020-01-10 00:00:00

  • FastqPuri: high-performance preprocessing of RNA-seq data.

    abstract:BACKGROUND:RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expres...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-019-2799-0

    authors: Pérez-Rubio P,Lottaz C,Engelmann JC

    updated:2019-05-03 00:00:00

  • μHEM for identification of differentially expressed miRNAs using hypercuboid equivalence partition matrix.

    abstract:BACKGROUND:The miRNAs, a class of short approximately 22-nucleotide non-coding RNAs, often act post-transcriptionally to inhibit mRNA expression. In effect, they control gene expression by targeting mRNA. They also help in carrying out normal functioning of a cell as they play an important role in various cellular proc...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-266

    authors: Paul S,Maji P

    updated:2013-09-04 00:00:00

  • LEON-BIS: multiple alignment evaluation of sequence neighbours using a Bayesian inference system.

    abstract:BACKGROUND:A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sop...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-016-1146-y

    authors: Vanhoutreve R,Kress A,Legrand B,Gass H,Poch O,Thompson JD

    updated:2016-07-07 00:00:00

  • A novel approach for predicting protein S-glutathionylation.

    abstract:BACKGROUND:S-glutathionylation is the formation of disulfide bonds between the tripeptide glutathione and cysteine residues of the protein, protecting them from irreversible oxidation and in some cases causing change in their functions. Regulatory glutathionylation of proteins is a controllable and reversible process a...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03571-w

    authors: Anashkina AA,Poluektov YM,Dmitriev VA,Kuznetsov EN,Mitkevich VA,Makarov AA,Petrushanko IY

    updated:2020-09-14 00:00:00

  • Cyclic nucleotide binding proteins in the Arabidopsis thaliana and Oryza sativa genomes.

    abstract:BACKGROUND:Cyclic nucleotides are ubiquitous intracellular messengers. Until recently, the roles of cyclic nucleotides in plant cells have proven difficult to uncover. With an understanding of the protein domains which can bind cyclic nucleotides (CNB and GAF domains) we scanned the completed genomes of the higher plan...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-6-6

    authors: Bridges D,Fraser ME,Moorhead GB

    updated:2005-01-11 00:00:00

  • Efficient computation of motif discovery on Intel Many Integrated Core (MIC) Architecture.

    abstract:BACKGROUND:Novel sequence motifs detection is becoming increasingly essential in computational biology. However, the high computational cost greatly constrains the efficiency of most motif discovery algorithms. RESULTS:In this paper, we accelerate MEME algorithm targeted on Intel Many Integrated Core (MIC) Architectur...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2276-1

    authors: Peng S,Cheng M,Huang K,Cui Y,Zhang Z,Guo R,Zhang X,Yang S,Liao X,Lu Y,Zou Q,Shi B

    updated:2018-08-13 00:00:00

  • Predicting anatomic therapeutic chemical classification codes using tiered learning.

    abstract:BACKGROUND:The low success rate and high cost of drug discovery requires the development of new paradigms to identify molecules of therapeutic value. The Anatomical Therapeutic Chemical (ATC) Code System is a World Health Organization (WHO) proposed classification that assigns multi-level codes to compounds based on th...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-017-1660-6

    authors: Olson T,Singh R

    updated:2017-06-07 00:00:00

  • Enhancing HMM-based protein profile-profile alignment with structural features and evolutionary coupling information.

    abstract:BACKGROUND:Protein sequence profile-profile alignment is an important approach to recognizing remote homologs and generating accurate pairwise alignments. It plays an important role in protein sequence database search, protein structure prediction, protein function prediction, and phylogenetic analysis. RESULTS:In thi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-252

    authors: Deng X,Cheng J

    updated:2014-07-25 00:00:00

  • Model based analysis of real-time PCR data from DNA binding dye protocols.

    abstract:BACKGROUND:Reverse transcription followed by real-time PCR is widely used for quantification of specific mRNA, and with the use of double-stranded DNA binding dyes it is becoming a standard for microarray data validation. Despite the kinetic information generated by real-time PCR, most popular analysis methods assume c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-85

    authors: Alvarez MJ,Vila-Ortiz GJ,Salibe MC,Podhajcer OL,Pitossi FJ

    updated:2007-03-09 00:00:00

  • JNets: exploring networks by integrating annotation.

    abstract:BACKGROUND:A common method for presenting and studying biological interaction networks is visualization. Software tools can enhance our ability to explore network visualizations and improve our understanding of biological systems, particularly when these tools offer analysis capabilities. However, most published networ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-95

    authors: Macpherson JI,Pinney JW,Robertson DL

    updated:2009-03-26 00:00:00

  • A comparative study of conservation and variation scores.

    abstract:BACKGROUND:Conservation and variation scores are used when evaluating sites in a multiple sequence alignment, in order to identify residues critical for structure or function. A variety of scores are available today but it is not clear how different scores relate to each other. RESULTS:We applied 25 conservation and v...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-388

    authors: Johansson F,Toh H

    updated:2010-07-21 00:00:00

  • A platform for processing expression of short time series (PESTS).

    abstract:BACKGROUND:Time course microarray profiles examine the expression of genes over a time domain. They are necessary in order to determine the complete set of genes that are dynamically expressed under given conditions, and to determine the interaction between these genes. Because of cost and resource issues, most time se...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-13

    authors: Sinha A,Markatou M

    updated:2011-01-11 00:00:00

  • Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets.

    abstract:BACKGROUND:Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology all...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-241

    authors: Aubry M,Monnier A,Chicault C,de Tayrac M,Galibert MD,Burgun A,Mosser J

    updated:2006-05-04 00:00:00

  • Novel domain expansion methods to improve the computational efficiency of the Chemical Master Equation solution for large biological networks.

    abstract:BACKGROUND:Numerical solutions of the chemical master equation (CME) are important for understanding the stochasticity of biochemical systems. However, solving CMEs is a formidable task. This task is complicated due to the nonlinear nature of the reactions and the size of the networks which result in different realizat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03668-2

    authors: Kosarwal R,Kulasiri D,Samarasinghe S

    updated:2020-11-11 00:00:00

  • IRSS: a web-based tool for automatic layout and analysis of IRES secondary structure prediction and searching system in silico.

    abstract:BACKGROUND:Internal ribosomal entry sites (IRESs) provide alternative, cap-independent translation initiation sites in eukaryotic cells. IRES elements are important factors in viral genomes and are also useful tools for bi-cistronic expression vectors. Most existing RNA structure prediction programs are unable to deal ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-160

    authors: Wu TY,Hsieh CC,Hong JJ,Chen CY,Tsai YS

    updated:2009-05-27 00:00:00

  • Linear predictive coding representation of correlated mutation for protein sequence alignment.

    abstract:BACKGROUND:Although both conservation and correlated mutation (CM) are important information reflecting the different sorts of context in multiple sequence alignment, most of alignment methods use sequence profiles that only represent conservation. There is no general way to represent correlated mutation and incorporat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S2-S2

    authors: Jeong CS,Kim D

    updated:2010-04-16 00:00:00

  • A new pooling strategy for high-throughput screening: the Shifted Transversal Design.

    abstract:BACKGROUND:In binary high-throughput screening projects where the goal is the identification of low-frequency events, beyond the obvious issue of efficiency, false positives and false negatives are a major concern. Pooling constitutes a natural solution: it reduces the number of tests, while providing critical duplicat...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-28

    authors: Thierry-Mieg N

    updated:2006-01-19 00:00:00

  • Identifying cancer mutation targets across thousands of samples: MuteProc, a high throughput mutation analysis pipeline.

    abstract:BACKGROUND:In the past decade, bioinformatics tools have matured enough to reliably perform sophisticated primary data analysis on Next Generation Sequencing (NGS) data, such as mapping, assemblies and variant calling, however, there is still a dire need for improvements in the higher level analysis such as NGS data or...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-167

    authors: Hadj Khodabakhshi A,Fejes AP,Birol I,Jones SJ

    updated:2013-05-28 00:00:00

  • 3off2: A network reconstruction algorithm based on 2-point and 3-point information statistics.

    abstract:BACKGROUND:The reconstruction of reliable graphical models from observational data is important in bioinformatics and other computational fields applying network reconstruction methods to large, yet finite datasets. The main network reconstruction approaches are either based on Bayesian scores, which enable the ranking...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0856-x

    authors: Affeldt S,Verny L,Isambert H

    updated:2016-01-20 00:00:00

  • Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts.

    abstract:BACKGROUND:We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. METHODS:We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S3-S4

    authors: Duan W,Song M,Yates A

    updated:2009-03-19 00:00:00

  • MGC: a metagenomic gene caller.

    abstract:BACKGROUND:Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incompl...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-S9-S6

    authors: El Allali A,Rose JR

    updated:2013-01-01 00:00:00

  • A multiresolution approach to automated classification of protein subcellular location images.

    abstract:BACKGROUND:Fluorescence microscopy is widely used to determine the subcellular location of proteins. Efforts to determine location on a proteome-wide basis create a need for automated methods to analyze the resulting images. Over the past ten years, the feasibility of using machine learning methods to recognize all maj...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-210

    authors: Chebira A,Barbotin Y,Jackson C,Merryman T,Srinivasa G,Murphy RF,Kovacević J

    updated:2007-06-19 00:00:00

  • HMMvar-func: a new method for predicting the functional outcome of genetic variants.

    abstract:BACKGROUND:Numerous tools have been developed to predict the fitness effects (i.e., neutral, deleterious, or beneficial) of genetic variants on corresponding proteins. However, prediction in terms of whether a variant causes the variant bearing protein to lose the original function or gain new function is also needed f...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-015-0781-z

    authors: Liu M,Watson LT,Zhang L

    updated:2015-10-30 00:00:00