Bias detection and correction in RNA-Sequencing data.

Abstract:

BACKGROUND:High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and ability to identify novel transcripts. Moreover, for genes with multiple isoforms, expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work revealed that base level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear though how the base level read count bias may affect gene level expression estimates. RESULTS:In this paper, by using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased in terms of gene length, GC content and dinucleotide frequencies. We directly examined the biases at the gene-level, and proposed a simple generalized-additive-model based approach to correct different sources of biases simultaneously. Compared to previously proposed base level correction methods, our method reduces bias in gene-level expression estimates more effectively. CONCLUSIONS:Our method identifies and corrects different sources of biases in gene-level expression measures from RNA-Seq data, and provides more accurate estimates of gene expression levels from RNA-Seq. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Zheng W,Chung LM,Zhao H

doi

10.1186/1471-2105-12-290

subject

Has Abstract

pub_date

2011-07-19 00:00:00

pages

290

issn

1471-2105

pii

1471-2105-12-290

journal_volume

12

pub_type

杂志文章
  • Investigating the concordance of Gene Ontology terms reveals the intra- and inter-platform reproducibility of enrichment analysis.

    abstract:BACKGROUND:Reliability and Reproducibility of differentially expressed genes (DEGs) are essential for the biological interpretation of microarray data. The microarray quality control (MAQC) project launched by US Food and Drug Administration (FDA) elucidated that the lists of DEGs generated by intra- and inter-platform...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-14-143

    authors: Zhang L,Zhang J,Yang G,Wu D,Jiang L,Wen Z,Li M

    更新日期:2013-04-29 00:00:00

  • PVT: an efficient computational procedure to speed up next-generation sequence analysis.

    abstract:BACKGROUND:High-throughput Next-Generation Sequencing (NGS) techniques are advancing genomics and molecular biology research. This technology generates substantially large data which puts up a major challenge to the scientists for an efficient, cost and time effective solution to analyse such data. Further, for the dif...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-167

    authors: Maji RK,Sarkar A,Khatua S,Dasgupta S,Ghosh Z

    更新日期:2014-06-04 00:00:00

  • ElTetrado: a tool for identification and classification of tetrads and quadruplexes.

    abstract:BACKGROUND:Quadruplexes are specific structure motifs occurring, e.g., in telomeres and transcriptional regulatory regions. Recent discoveries confirmed their importance in biomedicine and led to an intensified examination of their properties. So far, the study of these motifs has focused mainly on the sequence and the...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-020-3385-1

    authors: Zok T,Popenda M,Szachniuk M

    更新日期:2020-01-31 00:00:00

  • A machine learning strategy for predicting localization of post-translational modification sites in protein-protein interacting regions.

    abstract:BACKGROUND:One very important functional domain of proteins is the protein-protein interacting region (PPIR), which forms the binding interface between interacting polypeptide chains. Post-translational modifications (PTMs) that occur in the PPIR can either interfere with or facilitate the interaction between proteins....

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1165-8

    authors: Saethang T,Payne DM,Avihingsanon Y,Pisitkun T

    更新日期:2016-08-17 00:00:00

  • The G protein-coupled receptors in the pufferfish Takifugu rubripes.

    abstract:BACKGROUND:Guanine protein-coupled receptors (GPCRs) constitute a eukaryotic transmembrane protein family and function as "molecular switches" in the second messenger cascades and are found in all organisms between yeast and humans. They form the single, biggest drug-target family due to their versatility of action and...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-S1-S3

    authors: Sarkar A,Kumar S,Sundar D

    更新日期:2011-02-15 00:00:00

  • Discovering functional interaction patterns in protein-protein interaction networks.

    abstract:BACKGROUND:In recent years, a considerable amount of research effort has been directed to the analysis of biological networks with the availability of genome-scale networks of genes and/or proteins of an increasing number of organisms. A protein-protein interaction (PPI) network is a particular biological network which...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-276

    authors: Turanalp ME,Can T

    更新日期:2008-06-11 00:00:00

  • LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates.

    abstract:BACKGROUND:Non-negative matrix factorisation (NMF), a machine learning algorithm, has been applied to the analysis of microarray data. A key feature of NMF is the ability to identify patterns that together explain the data as a linear combination of expression signatures. Microarray data generally includes individual e...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-175

    authors: Wang G,Kossenkov AV,Ochs MF

    更新日期:2006-03-28 00:00:00

  • Spectral estimation in unevenly sampled space of periodically expressed microarray time series data.

    abstract:BACKGROUND:Periodogram analysis of time-series is widespread in biology. A new challenge for analyzing the microarray time series data is to identify genes that are periodically expressed. Such challenge occurs due to the fact that the observed time series usually exhibit non-idealities, such as noise, short length, an...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-137

    authors: Liew AW,Xian J,Wu S,Smith D,Yan H

    更新日期:2007-04-24 00:00:00

  • A novel computational strategy for DNA methylation imputation using mixture regression model (MRM).

    abstract:BACKGROUND:DNA methylation is an important heritable epigenetic mark that plays a crucial role in transcriptional regulation and the pathogenesis of various human disorders. The commonly used DNA methylation measurement approaches, e.g., Illumina Infinium HumanMethylation-27 and -450 BeadChip arrays (27 K and 450 K arr...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-020-03865-z

    authors: Yu F,Xu C,Deng HW,Shen H

    更新日期:2020-12-01 00:00:00

  • SHIVA - a web application for drug resistance and tropism testing in HIV.

    abstract:BACKGROUND:Drug resistance testing is mandatory in antiretroviral therapy in human immunodeficiency virus (HIV) infected patients for successful treatment. The emergence of resistances against antiretroviral agents remains the major obstacle in inhibition of viral replication and thus to control infection. Due to the h...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1179-2

    authors: Riemenschneider M,Hummel T,Heider D

    更新日期:2016-08-22 00:00:00

  • Conservation of regulatory elements between two species of Drosophila.

    abstract:BACKGROUND:One of the important goals in the post-genomic era is to determine the regulatory elements within the non-coding DNA of a given organism's genome. The identification of functional cis-regulatory modules has proven difficult since the component factor binding sites are small and the rules governing their arra...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-4-57

    authors: Emberly E,Rajewsky N,Siggia ED

    更新日期:2003-11-20 00:00:00

  • A CoD-based stationary control policy for intervening in large gene regulatory networks.

    abstract:BACKGROUND:One of the most important goals of the mathematical modeling of gene regulatory networks is to alter their behavior toward desirable phenotypes. Therapeutic techniques are derived for intervention in terms of stationary control policies. In large networks, it becomes computationally burdensome to derive an o...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-S10-S10

    authors: Ghaffari N,Ivanov I,Qian X,Dougherty ER

    更新日期:2011-10-18 00:00:00

  • Identifying and quantifying metabolites by scoring peaks of GC-MS data.

    abstract:BACKGROUND:Metabolomics is one of most recent omics technologies. It has been applied on fields such as food science, nutrition, drug discovery and systems biology. For this, gas chromatography-mass spectrometry (GC-MS) has been largely applied and many computational tools have been developed to support the analysis of...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-014-0374-2

    authors: Aggio RB,Mayor A,Reade S,Probert CS,Ruggiero K

    更新日期:2014-12-10 00:00:00

  • Use of a structural alphabet for analysis of short loops connecting repetitive structures.

    abstract:BACKGROUND:Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we call...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-5-58

    authors: Fourrier L,Benros C,de Brevern AG

    更新日期:2004-05-12 00:00:00

  • Prediction of protein structural class with Rough Sets.

    abstract:BACKGROUND:A new method for the prediction of protein structural classes is constructed based on Rough Sets algorithm, which is a rule-based data mining method. Amino acid compositions and 8 physicochemical properties data are used as conditional attributes for the construction of decision system. After reducing the de...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-20

    authors: Cao Y,Liu S,Zhang L,Qin J,Wang J,Tang K

    更新日期:2006-01-14 00:00:00

  • Functional clustering of yeast proteins from the protein-protein interaction network.

    abstract:BACKGROUND:The abundant data available for protein interaction networks have not yet been fully understood. New types of analyses are needed to reveal organizational principles of these networks to investigate the details of functional and regulatory clusters of proteins. RESULTS:In the present work, individual cluste...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-355

    authors: Sen TZ,Kloczkowski A,Jernigan RL

    更新日期:2006-07-24 00:00:00

  • Fregene: simulation of realistic sequence-level data in populations and ascertained samples.

    abstract:BACKGROUND:FREGENE simulates sequence-level data over large genomic regions in large populations. Because, unlike coalescent simulators, it works forwards through time, it allows complex scenarios of selection, demography, and recombination to be modelled simultaneously. Detailed tracking of sites under selection is im...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-364

    authors: Chadeau-Hyam M,Hoggart CJ,O'Reilly PF,Whittaker JC,De Iorio M,Balding DJ

    更新日期:2008-09-08 00:00:00

  • VITCOMIC: visualization tool for taxonomic compositions of microbial communities based on 16S rRNA gene sequences.

    abstract:BACKGROUND:Understanding the community structure of microbes is typically accomplished by sequencing 16S ribosomal RNA (16S rRNA) genes. These community data can be represented by constructing a phylogenetic tree and comparing it with other samples using statistical methods. However, owing to high computational complex...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-11-332

    authors: Mori H,Maruyama F,Kurokawa K

    更新日期:2010-06-18 00:00:00

  • Predicting anatomic therapeutic chemical classification codes using tiered learning.

    abstract:BACKGROUND:The low success rate and high cost of drug discovery requires the development of new paradigms to identify molecules of therapeutic value. The Anatomical Therapeutic Chemical (ATC) Code System is a World Health Organization (WHO) proposed classification that assigns multi-level codes to compounds based on th...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1660-6

    authors: Olson T,Singh R

    更新日期:2017-06-07 00:00:00

  • Texture based skin lesion abruptness quantification to detect malignancy.

    abstract:BACKGROUND:Abruptness of pigment patterns at the periphery of a skin lesion is one of the most important dermoscopic features for detection of malignancy. In current clinical setting, abrupt cutoff of a skin lesion determined by an examination of a dermatologist. This process is subjective, nonquantitative, and error-p...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1892-5

    authors: Erol R,Bayraktar M,Kockara S,Kaya S,Halic T

    更新日期:2017-12-28 00:00:00

  • Bioinformatics Resource Manager: a systems biology web tool for microRNA and omics data integration.

    abstract:BACKGROUND:The Bioinformatics Resource Manager (BRM) is a web-based tool developed to facilitate identifier conversion and data integration for Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Danio rerio (zebrafish), and Macaca mulatta (macaque), as well as perform orthologous conversions among the...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-2805-6

    authors: Brown J,Phillips AR,Lewis DA,Mans MA,Chang Y,Tanguay RL,Peterson ES,Waters KM,Tilton SC

    更新日期:2019-05-17 00:00:00

  • Normalized N50 assembly metric using gap-restricted co-linear chaining.

    abstract:BACKGROUND:For the development of genome assembly tools, some comprehensive and efficiently computable validation measures are required to assess the quality of the assembly. The mostly used N50 measure summarizes the assembly results by the length of the scaffold (or contig) overlapping the midpoint of the length-orde...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-255

    authors: Mäkinen V,Salmela L,Ylinen J

    更新日期:2012-10-03 00:00:00

  • Natural computation meta-heuristics for the in silico optimization of microbial strains.

    abstract:BACKGROUND:One of the greatest challenges in Metabolic Engineering is to develop quantitative models and algorithms to identify a set of genetic manipulations that will result in a microbial strain with a desirable metabolic phenotype which typically means having a high yield/productivity. This challenge is not only du...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-499

    authors: Rocha M,Maia P,Mendes R,Pinto JP,Ferreira EC,Nielsen J,Patil KR,Rocha I

    更新日期:2008-11-27 00:00:00

  • OmicsARules: a R package for integration of multi-omics datasets via association rules mining.

    abstract:BACKGROUND:The improvements of high throughput technologies have produced large amounts of multi-omics experiments datasets. Initial analysis of these data has revealed many concurrent gene alterations within single dataset or/and among multiple omics datasets. Although powerful bioinformatics pipelines have been devel...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3171-0

    authors: Chen D,Zhang F,Zhao Q,Xu J

    更新日期:2019-11-08 00:00:00

  • An unsupervised classification scheme for improving predictions of prokaryotic TIS.

    abstract:BACKGROUND:Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-121

    authors: Tech M,Meinicke P

    更新日期:2006-03-09 00:00:00

  • Roar: detecting alternative polyadenylation with standard mRNA sequencing libraries.

    abstract:BACKGROUND:Post-transcriptional regulation is a complex mechanism that plays a central role in defining multiple cellular identities starting from a common genome. Modifications in the length of 3'UTRs have been found to play an important role in this context, since alternative 3' UTRs could lead to differences for exa...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1254-8

    authors: Grassi E,Mariella E,Lembo A,Molineris I,Provero P

    更新日期:2016-10-18 00:00:00

  • Ferret: a sentence-based literature scanning system.

    abstract:BACKGROUND:The rapid pace of bioscience research makes it very challenging to track relevant articles in one's area of interest. MEDLINE, a primary source for biomedical literature, offers access to more than 20 million citations with three-quarters of a million new ones added each year. Thus it is not surprising to se...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0630-0

    authors: Srinivasan P,Zhang XN,Bouten R,Chang C

    更新日期:2015-06-20 00:00:00

  • A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies.

    abstract:BACKGROUND:Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are number of excellent tools developed for performing this task, ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1142-2

    authors: Thakur S,Guttman DS

    更新日期:2016-06-30 00:00:00

  • hsegHMM: hidden Markov model-based allele-specific copy number alteration analysis accounting for hypersegmentation.

    abstract:BACKGROUND:Somatic copy number alternation (SCNA) is a common feature of the cancer genome and is associated with cancer etiology and prognosis. The allele-specific SCNA analysis of a tumor sample aims to identify the allele-specific copy numbers of both alleles, adjusting for the ploidy and the tumor purity. Next gene...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2412-y

    authors: Choo-Wosoba H,Albert PS,Zhu B

    更新日期:2018-11-14 00:00:00

  • Rigorous assessment and integration of the sequence and structure based features to predict hot spots.

    abstract:BACKGROUND:Systematic mutagenesis studies have shown that only a few interface residues termed hot spots contribute significantly to the binding free energy of protein-protein interactions. Therefore, hot spots prediction becomes increasingly important for well understanding the essence of proteins interactions and hel...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-12-311

    authors: Chen R,Chen W,Yang S,Wu D,Wang Y,Tian Y,Shi Y

    更新日期:2011-07-29 00:00:00