Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads.

Abstract:

BACKGROUND: Adapter trimming is a prerequisite step for analyzing next-generation sequencing (NGS) data when the reads are longer than the target DNA/RNA fragments. Although typically used in small RNA sequencing, adapter trimming is also widely used in other applications, such as genomic DNA sequencing and transcriptome RNA/cDNA sequencing, where fragments shorter than a read are sometimes obtained because of the limitations of NGS protocols. For the newly emerged Nextera long mate-pair (LMP) protocol, junction adapters are located in the middle of all properly constructed fragments; hence, adapter trimming is essential to obtain correctly paired reads. However, our investigations have shown that few adapter trimming tools meet both efficiency and accuracy requirements simultaneously. The performance of these tools can be even worse for paired-end and/or mate-pair sequencing.

RESULTS: To improve the efficiency of adapter trimming, we devised a novel algorithm, the bit-masked k-difference matching algorithm, which has O(kn) expected time with O(m) space, where k is the maximum number of differences allowed, n is the read length, and m is the adapter length. This algorithm makes it possible to fully enumerate all candidates that meet a specified threshold, e.g. error ratio, within a short period of time. To improve the accuracy of this algorithm, we designed a simple and easy-to-explain statistical scoring scheme to evaluate candidates in the pattern matching step. We also devised scoring schemes to fully exploit the paired-end/mate-pair information when it is applicable. All these features have been implemented in an industry-standard tool named Skewer (https://sourceforge.net/projects/skewer). Experiments on simulated data, real data of small RNA sequencing, paired-end RNA sequencing, and Nextera LMP sequencing showed that Skewer outperforms all other similar tools that have the same utility. Further, Skewer is considerably faster than other tools that have comparable accuracies; namely, one times faster for single-end sequencing, more than 12 times faster for paired-end sequencing, and 49% faster for LMP sequencing.

CONCLUSIONS: Skewer achieved as yet unmatched accuracies for adapter trimming with a low time bound.
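The k-difference matching problem named in the abstract can be illustrated with a short sketch. The code below is not Skewer's bit-masked algorithm; it is the classic column-wise dynamic-programming formulation, which runs in O(mn) time and O(m) space, and is shown only to make the problem statement concrete. The function name `k_difference_ends` and the example sequences are hypothetical, chosen for illustration.

```python
from __future__ import annotations


def k_difference_ends(adapter: str, read: str, k: int) -> list[tuple[int, int]]:
    """List (end_position, differences) for each 1-based read position at which
    the whole adapter matches a read substring ending there with at most k
    differences (substitutions, insertions, or deletions).

    Illustrative textbook dynamic programming, not the bit-masked algorithm.
    """
    m = len(adapter)
    # col[i] = fewest differences aligning adapter[:i] to a read substring
    # that ends at the current read position; only one column is stored.
    col = list(range(m + 1))
    hits = []
    for j, base in enumerate(read, start=1):
        prev_diag = col[0]  # cell (i-1, j-1) of the full table
        col[0] = 0          # a match may start anywhere in the read
        for i in range(1, m + 1):
            cost = 0 if adapter[i - 1] == base else 1
            best = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra base in the read
                col[i - 1] + 1,    # base missing from the read
            )
            prev_diag = col[i]
            col[i] = best
        if col[m] <= k:
            hits.append((j, col[m]))
    return hits


if __name__ == "__main__":
    adapter = "AGATCGGAAG"  # hypothetical adapter fragment
    read = "ACGTACGT" + adapter + "TT"
    print(k_difference_ends(adapter, read, k=1))
```

Every reported position is a candidate trimming point; a scoring scheme such as the one the abstract mentions would then choose among the candidates. This sketch only finds complete adapter occurrences and does not handle an adapter truncated at the 3' end of the read.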

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Jiang H,Lei R,Ding SW,Zhu S

doi

10.1186/1471-2105-15-182

pub_date

2014-06-12 00:00:00

pages

182

issn

1471-2105

pii

1471-2105-15-182

journal_volume

15

pub_type

Journal Article

  • A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs.

    abstract:BACKGROUND:Accumulating evidence has demonstrated that long non-coding RNAs (lncRNAs) are closely associated with human diseases, and it is useful for the diagnosis and treatment of diseases to get the relationships between lncRNAs and diseases. Due to the high costs and time complexity of traditional bio-experiments, ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03906-7

    authors: Xiao Y,Xiao Z,Feng X,Chen Z,Kuang L,Wang L

    update_date:2020-12-02 00:00:00

  • Clustering analysis of tumor metabolic networks.

    abstract:BACKGROUND:Biological networks are representative of the diverse molecular interactions that occur within cells. Some of the commonly studied biological networks are modeled through protein-protein interactions, gene regulatory, and metabolic pathways. Among these, metabolic networks are probably the most studied, as t...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03564-9

    authors: Manipur I,Granata I,Maddalena L,Guarracino MR

    update_date:2020-08-21 00:00:00

  • Identification of a small optimal subset of CpG sites as bio-markers from high-throughput DNA methylation profiles.

    abstract:BACKGROUND:DNA methylation patterns have been shown to significantly correlate with different tissue types and disease states. High-throughput methylation arrays enable large-scale DNA methylation analysis to identify informative DNA methylation biomarkers. The identification of disease-specific methylation signatures ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-457

    authors: Meng H,Murrelle EL,Li G

    update_date:2008-10-27 00:00:00

  • Methodology capture: discriminating between the "best" and the rest of community practice.

    abstract:BACKGROUND:The methodologies we use both enable and help define our research. However, as experimental complexity has increased the choice of appropriate methodologies has become an increasingly difficult task. This makes it difficult to keep track of available bioinformatics software, let alone the most suitable proto...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-359

    authors: Eales JM,Pinney JW,Stevens RD,Robertson DL

    update_date:2008-09-01 00:00:00

  • Information extraction from full text scientific articles: where are the keywords?

    abstract:BACKGROUND:To date, many of the methods for information extraction of biological information from scientific articles are restricted to the abstract of the article. However, full text articles in electronic version, which offer larger sources of data, are currently available. Several questions arise as to whether the e...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-4-20

    authors: Shah PK,Perez-Iratxeta C,Bork P,Andrade MA

    update_date:2003-05-29 00:00:00

  • Finding motif pairs in the interactions between heterogeneous proteins via bootstrapping and boosting.

    abstract:BACKGROUND:Supervised learning and many stochastic methods for predicting protein-protein interactions require both negative and positive interactions in the training data set. Unlike positive interactions, negative interactions cannot be readily obtained from interaction data, so these must be generated. In protein-pr...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-S1-S57

    authors: Kim J,Huang DS,Han K

    update_date:2009-01-30 00:00:00

  • PuFFIN--a parameter-free method to build nucleosome maps from paired-end reads.

    abstract:BACKGROUND:We introduce a novel method, called PuFFIN, that takes advantage of paired-end short reads to build genome-wide nucleosome maps with larger numbers of detected nucleosomes and higher accuracy than existing tools. In contrast to other approaches that require users to optimize several parameters according to t...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-15-S9-S11

    authors: Polishko A,Bunnik EM,Le Roch KG,Lonardi S

    update_date:2014-01-01 00:00:00

  • TableButler - a Windows based tool for processing large data tables generated with high-throughput methods.

    abstract:BACKGROUND:High-throughput "omics" based data analysis play emerging roles in life sciences and molecular diagnostics. This emphasizes the urgent need for user-friendly windows-based software interfaces that could process the diversity of large tab-delimited raw data files generated by these methods. Depending on the s...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-235

    authors: Schwager C,Wirkner U,Abdollahi A,Huber PE

    update_date:2009-07-29 00:00:00

  • "METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI's sequence read archive".

    abstract:BACKGROUND:The improvements in genomics methods coupled with readily accessible high-throughput sequencing have contributed to our understanding of microbial species, metagenomes, infectious diseases and more. To maximize the impact of these genomics studies, it is important that data from biological samples will becom...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03694-0

    authors: Quiñones M,Liou DT,Shyu C,Kim W,Vujkovic-Cvijin I,Belkaid Y,Hurt DE

    update_date:2020-09-03 00:00:00

  • Measure of synonymous codon usage diversity among genes in bacteria.

    abstract:BACKGROUND:In many bacteria, intragenomic diversity in synonymous codon usage among genes has been reported. However, no quantitative attempt has been made to compare the diversity levels among different genomes. Here, we introduce a mean dissimilarity-based index (Dmean) for quantifying the level of diversity in synon...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-10-167

    authors: Suzuki H,Saito R,Tomita M

    update_date:2009-06-01 00:00:00

  • Learning smoothing models of copy number profiles using breakpoint annotations.

    abstract:BACKGROUND:Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious to decide which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using v...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-164

    authors: Hocking TD,Schleiermacher G,Janoueix-Lerosey I,Boeva V,Cappo J,Delattre O,Bach F,Vert JP

    update_date:2013-05-22 00:00:00

  • Computational approaches to protein inference in shotgun proteomics.

    abstract:Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughp...

    journal_title:BMC bioinformatics

    pub_type: Journal Article, Review

    doi:10.1186/1471-2105-13-S16-S4

    authors: Li YF,Radivojac P

    update_date:2012-01-01 00:00:00

  • ProbPS: a new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity.

    abstract:BACKGROUND:The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity an...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-346

    authors: Zhang S,Wang Y,Bu D,Zhang H,Sun S

    update_date:2011-08-17 00:00:00

  • Towards mainstreaming of biodiversity data publishing: recommendations of the GBIF Data Publishing Framework Task Group.

    abstract:BACKGROUND:Data are the evidentiary basis for scientific hypotheses, analyses and publication, for policy formation and for decision-making. They are essential to the evaluation and testing of results by peer scientists both present and future. There is broad consensus in the scientific and conservation communities tha...

    journal_title:BMC bioinformatics

    pub_type: Guideline, Journal Article

    doi:10.1186/1471-2105-12-S15-S1

    authors: Moritz T,Krishnan S,Roberts D,Ingwersen P,Agosti D,Penev L,Cockerill M,Chavan V,Data Publishing Framework Task Group.

    update_date:2011-01-01 00:00:00

  • NASQAR: a web-based platform for high-throughput sequencing data analysis and visualization.

    abstract:BACKGROUND:As high-throughput sequencing applications continue to evolve, the rapid growth in quantity and variety of sequence-based data calls for the development of new software libraries and tools for data analysis and visualization. Often, effective use of these tools requires computational skills beyond those of m...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-020-03577-4

    authors: Yousif A,Drou N,Rowe J,Khalfan M,Gunsalus KC

    update_date:2020-06-29 00:00:00

  • Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach.

    abstract:BACKGROUND:The identification of genes responsible for human inherited diseases is one of the most challenging tasks in human genetics. Recent studies based on phenotype similarity and gene proximity have demonstrated great success in prioritizing candidate genes for human diseases. However, most of these methods rely ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-S1-S11

    authors: Zhang W,Sun F,Jiang R

    update_date:2011-02-15 00:00:00

  • Tandem repeats discovery service (TReaDS) applied to finding novel cis-acting factors in repeat expansion diseases.

    abstract:BACKGROUND:Tandem repeats are multiple duplications of substrings in the DNA that occur contiguously, or at a short distance, and may involve some mutations (such as substitutions, insertions, and deletions). Tandem repeats have been extensively studied also for their association with the class of repeat expansion dise...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-13-S4-S3

    authors: Pellegrini M,Renda ME,Vecchio A

    update_date:2012-03-28 00:00:00

  • BIOZON: a system for unification, management and analysis of heterogeneous biological data.

    abstract:BACKGROUND:Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increase rapidly. Amongst the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability. DESCRIPTION:Here we present...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-7-70

    authors: Birkland A,Yona G

    update_date:2006-02-15 00:00:00

  • Bayesian neural networks for detecting epistasis in genetic association studies.

    abstract:BACKGROUND:Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. RESULTS:A non-parametric Bayesian ap...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-014-0368-0

    authors: Beam AL,Motsinger-Reif A,Doyle J

    update_date:2014-11-21 00:00:00

  • A multifaceted analysis of HIV-1 protease multidrug resistance phenotypes.

    abstract:BACKGROUND:Great strides have been made in the effective treatment of HIV-1 with the development of second-generation protease inhibitors (PIs) that are effective against historically multi-PI-resistant HIV-1 variants. Nevertheless, mutation patterns that confer decreasing susceptibility to available PIs continue to ar...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-12-477

    authors: Doherty KM,Nakka P,King BM,Rhee SY,Holmes SP,Shafer RW,Radhakrishnan ML

    update_date:2011-12-15 00:00:00

  • Software for selecting the most informative sets of genomic loci for multi-target microbial typing.

    abstract:BACKGROUND:High-throughput sequencing can identify numerous potential genomic targets for microbial strain typing, but identification of the most informative combinations requires the use of computational screening tools. This paper describes novel software-- Automated Selection of Typing Target Subsets (AuSeTTS)--that...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-148

    authors: O'Sullivan MV,Sintchenko V,Gilbert GL

    update_date:2013-05-01 00:00:00

  • Analysis of Bovine Viral Diarrhea Viruses-infected monocytes: identification of cytopathic and non-cytopathic biotype differences.

    abstract:BACKGROUND:Bovine Viral Diarrhea Virus (BVDV) infection is widespread in cattle worldwide, causing important economic losses. Pathogenesis of the disease caused by BVDV is complex, as each BVDV strain has two biotypes: non-cytopathic (ncp) and cytopathic (cp). BVDV can cause a persistent latent infection and immune sup...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-S6-S9

    authors: Ammari M,McCarthy FM,Nanduri B,Pinchuk LM

    update_date:2010-10-07 00:00:00

  • A molecular model of the full-length human NOD-like receptor family CARD domain containing 5 (NLRC5) protein.

    abstract:BACKGROUND:Pattern recognition receptors of the immune system have key roles in the regulation of pathways after the recognition of microbial- and danger-associated molecular patterns in vertebrates. Members of NOD-like receptor (NLR) family typically function intracellularly. The NOD-like receptor family CARD domain c...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-14-275

    authors: Mótyán JA,Bagossi P,Benkő S,Tőzsér J

    update_date:2013-09-17 00:00:00

  • Computational identification of ubiquitylation sites from protein sequences.

    abstract:BACKGROUND:Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods were developed toward effective identification of ubiquitylation sites. To efficiently explore more undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-9-310

    authors: Tung CW,Ho SY

    update_date:2008-07-15 00:00:00

  • Domain fusion analysis by applying relational algebra to protein sequence and domain databases.

    abstract:BACKGROUND:Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain datab...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-4-16

    authors: Truong K,Ikura M

    update_date:2003-05-06 00:00:00

  • How large B-factors can be in protein crystal structures.

    abstract:BACKGROUND:Protein crystal structures are potentially over-interpreted since they are routinely refined without any restraint on the upper limit of atomic B-factors. Consequently, some of their atoms, undetected in the electron density maps, are allowed to reach extremely large B-factors, even above 100 square Angstrom...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-018-2083-8

    authors: Carugo O

    update_date:2018-02-23 00:00:00

  • Pre-processing Agilent microarray data.

    abstract:BACKGROUND:Pre-processing methods for two-sample long oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays and to compare Agilent's Feature Extraction ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-8-142

    authors: Zahurak M,Parmigiani G,Yu W,Scharpf RB,Berman D,Schaeffer E,Shabbeer S,Cope L

    update_date:2007-05-01 00:00:00

  • NeatFreq: reference-free data reduction and coverage normalization for De Novo sequence assembly.

    abstract:BACKGROUND:Deep shotgun sequencing on next generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and high sequencing error rates o...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/s12859-014-0357-3

    authors: McCorrison JM,Venepally P,Singh I,Fouts DE,Lasken RS,Methé BA

    update_date:2014-11-19 00:00:00

  • Computational algorithms to predict Gene Ontology annotations.

    abstract:BACKGROUND:Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biologi...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-16-S6-S4

    authors: Pinoli P,Chicco D,Masseroli M

    update_date:2015-01-01 00:00:00

  • A global optimization algorithm for protein surface alignment.

    abstract:BACKGROUND:A relevant problem in drug design is the comparison and recognition of protein binding sites. Binding sites recognition is generally based on geometry often combined with physico-chemical properties of the site since the conformation, size and chemical composition of the protein surface are all relevant for ...

    journal_title:BMC bioinformatics

    pub_type: Journal Article

    doi:10.1186/1471-2105-11-488

    authors: Bertolazzi P,Guerra C,Liuzzi G

    update_date:2010-09-29 00:00:00