Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data.

Abstract:

BACKGROUND: Metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated how varying technical, analytical and machine learning approaches influence result interpretation and novel source prediction.
RESULTS: Comparison between 16S rRNA amplicon and shotgun sequencing approaches, as well as between metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed using Kraken2 and Bracken for taxonomic annotation had higher detection sensitivity. Because classification models are limited to labeling pre-trained origins, we took an alternative approach using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, the prediction errors were much higher in leave-1-city-out than in 10-fold cross validation; the former more realistically reflected the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when applying the model to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates were higher for samples from new origins, we provide an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
CONCLUSIONS: Herein, we highlight the capacity to predict sample origin accurately for pre-trained origins and the challenge of predicting new origins with both regression and classification models. Overall, this work summarizes the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for prediction of sample origin.
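
The difference between the two validation schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the city labels and the sample counts are hypothetical, and only the splitting logic is shown, not the Lasso regression or the classifier. In 10-fold cross validation a test sample's city is normally still represented in the training folds, whereas leave-1-city-out withholds every sample of a city, so the model is scored on an origin it has never seen.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) for shuffled k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def leave_one_city_out_splits(cities):
    """Yield (city, train_idx, test_idx), holding out every sample of one city."""
    by_city = defaultdict(list)
    for idx, city in enumerate(cities):
        by_city[city].append(idx)
    for city, test_idx in by_city.items():
        train_idx = [i for i, c in enumerate(cities) if c != city]
        yield city, train_idx, test_idx


# Hypothetical labels: four cities with five samples each.
cities = [city for city in ("A", "B", "C", "D") for _ in range(5)]

# Leave-1-city-out: the held-out city is never seen during training.
for city, train_idx, test_idx in leave_one_city_out_splits(cities):
    assert city not in {cities[i] for i in train_idx}

# 10-fold: each test fold holds two samples, so every test city still has
# other samples in the training set.
for train_idx, test_idx in k_fold_splits(len(cities), k=10):
    train_cities = {cities[i] for i in train_idx}
    assert all(cities[i] in train_cities for i in test_idx)
```

Any model can be fitted on `train_idx` and scored on `test_idx`; the gap between the two error estimates is what the abstract reports as the added difficulty of predicting new origins.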

journal_name: Biol Direct
journal_title: Biology direct
authors: Chen JC,Tyler AD
doi: 10.1186/s13062-020-00287-y
pub_date: 2020-12-10 00:00:00
pages: 29
issue: 1
issn: 1745-6150
pii: 10.1186/s13062-020-00287-y
journal_volume: 15
pub_type: Journal Article

  • A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis.

    abstract:BACKGROUND:Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches--the examination of similarities to known disease genes and/or the evaluation of functional annotation of...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-6-30

    authors: Lombard Z,Park C,Makova KD,Ramsay M

    update_date:2011-06-13 00:00:00

  • Human gammadelta T cell recognition of lipid A is predominately presented by CD1b or CD1c on dendritic cells.

    abstract:BACKGROUND:The gammadelta T cells serve as early immune defense against certain encountered microbes. Only a few gammadelta T cell-recognized ligands from microbial antigens have been identified so far and the mechanisms by which gammadelta T cells recognize these ligands remain unknown. Here we explored the mechanism ...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-4-47

    authors: Cui Y,Kang L,Cui L,He W

    update_date:2009-12-01 00:00:00

  • On origin of genetic code and tRNA before translation.

    abstract:BACKGROUND:Synthesis of proteins is based on the genetic code - a nearly universal assignment of codons to amino acids (aas). A major challenge to the understanding of the origins of this assignment is the archetypal "key-lock vs. frozen accident" dilemma. Here we re-examine this dilemma in light of 1) the fundamental ...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-6-14

    authors: Rodin AS,Szathmáry E,Rodin SN

    update_date:2011-02-22 00:00:00

  • Origin of the nuclear proteome on the basis of pre-existing nuclear localization signals in prokaryotic proteins.

    abstract:BACKGROUND:The origin of the selective nuclear protein import machinery, which consists of nuclear pore complexes and adaptor molecules interacting with the nuclear localization signals (NLSs) of cargo molecules, is one of the most important events in the evolution of eukaryotic cells. How proteins were selected for im...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-020-00263-6

    authors: Lisitsyna OM,Kurnaeva MA,Arifulin EA,Shubina MY,Musinova YR,Mironov AA,Sheval EV

    update_date:2020-04-28 00:00:00

  • Prokaryotic homologs of Argonaute proteins are predicted to function as key components of a novel system of defense against mobile genetic elements.

    abstract:BACKGROUND:In eukaryotes, RNA interference (RNAi) is a major mechanism of defense against viruses and transposable elements as well of regulating translation of endogenous mRNAs. The RNAi systems recognize the target RNA molecules via small guide RNAs that are completely or partially complementary to a region of the ta...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-4-29

    authors: Makarova KS,Wolf YI,van der Oost J,Koonin EV

    update_date:2009-08-25 00:00:00

  • The origins of phagocytosis and eukaryogenesis.

    abstract:BACKGROUND:Phagocytosis, that is, engulfment of large particles by eukaryotic cells, is found in diverse organisms and is often thought to be central to the very origin of the eukaryotic cell, in particular, for the acquisition of bacterial endosymbionts including the ancestor of the mitochondrion. RESULTS:Comparisons...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-4-9

    authors: Yutin N,Wolf MY,Wolf YI,Koonin EV

    update_date:2009-02-26 00:00:00

  • Issues associated with the use of phosphospecific antibodies to localise active and inactive pools of GSK-3 in cells.

    abstract:BACKGROUND:Glycogen synthase kinase-3 (GSK-3) is a ubiquitously expressed serine/threonine (Ser/Thr) kinase comprising two isoforms, GSK-3α and GSK-3β. Both enzymes are similarly inactivated by serine phosphorylation (GSK-3α at Ser21 and GSK-3β at Ser9) and activated by tyrosine phosphorylation (GSK-3α at Tyr279 and GS...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-6-4

    authors: Campa VM,Kypta RM

    update_date:2011-01-24 00:00:00

  • The manoeuvrability hypothesis to explain the maintenance of bilateral symmetry in animal evolution.

    abstract:BACKGROUND:The overwhelming majority of animal species exhibit bilateral symmetry. However, the precise evolutionary importance of bilateral symmetry is unknown, although elements of the understanding of the phenomenon have been present within the scientific community for decades. PRESENTATION OF THE HYPOTHESIS:Here w...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-7-22

    authors: Holló G,Novák M

    update_date:2012-07-12 00:00:00

  • Use of designed sequences in protein structure recognition.

    abstract:BACKGROUND:Knowledge of the protein structure is a pre-requisite for improved understanding of molecular function. The gap in the sequence-structure space has increased in the post-genomic era. Grouping related protein sequences into families can aid in narrowing the gap. In the Pfam database, structure description is ...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-018-0209-6

    authors: Kumar G,Mudgal R,Srinivasan N,Sandhya S

    update_date:2018-05-09 00:00:00

  • Comparative genomic analysis of the DUF71/COG2102 family predicts roles in diphthamide biosynthesis and B12 salvage.

    abstract:BACKGROUND:The availability of over 3000 published genome sequences has enabled the use of comparative genomic approaches to drive the biological function discovery process. Classically, one used to link gene with function by genetic or biochemical approaches, a lengthy process that often took years. Phylogenetic distr...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-7-32

    authors: de Crécy-Lagard V,Forouhar F,Brochier-Armanet C,Tong L,Hunt JF

    update_date:2012-09-26 00:00:00

  • A novel superfamily containing the beta-grasp fold involved in binding diverse soluble ligands.

    abstract:BACKGROUND:Domains containing the beta-grasp fold are utilized in a great diversity of physiological functions but their role, if any, in soluble or small molecule ligand recognition is poorly studied. RESULTS:Using sensitive sequence and structure similarity searches we identify a novel superfamily containing the bet...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-2-4

    authors: Burroughs AM,Balaji S,Iyer LM,Aravind L

    update_date:2007-01-24 00:00:00

  • Description of plant tRNA-derived RNA fragments (tRFs) associated with argonaute and identification of their putative targets.

    abstract:tRNA-derived RNA fragments (tRFs) are 19mer small RNAs that associate with Argonaute (AGO) proteins in humans. However, in plants, it is unknown if tRFs bind with AGO proteins. Here, using public deep sequencing libraries of immunoprecipitated Argonaute proteins (AGO-IP) and bioinformatics approaches, we identified th...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-8-6

    authors: Loss-Morais G,Waterhouse PM,Margis R

    update_date:2013-02-12 00:00:00

  • IPC - Isoelectric Point Calculator.

    abstract:BACKGROUND:Accurate estimation of the isoelectric point (pI) based on the amino acid sequence is useful for many analytical biochemistry and proteomics techniques such as 2-D polyacrylamide gel electrophoresis, or capillary isoelectric focusing used in combination with high-throughput mass spectrometry. Additionally, p...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-016-0159-9

    authors: Kozlowski LP

    update_date:2016-10-21 00:00:00

  • Stringent homology-based prediction of H. sapiens-M. tuberculosis H37Rv protein-protein interactions.

    abstract:BACKGROUND:H. sapiens-M. tuberculosis H37Rv protein-protein interaction (PPI) data are essential for understanding the infection mechanism of the formidable pathogen M. tuberculosis H37Rv. Computational prediction is an important strategy to fill the gap in experimental H. sapiens-M. tuberculosis H37Rv PPI data. Homolo...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-9-5

    authors: Zhou H,Gao S,Nguyen NN,Fan M,Jin J,Liu B,Zhao L,Xiong G,Tan M,Li S,Wong L

    update_date:2014-04-08 00:00:00

  • MutL homologs in restriction-modification systems and the origin of eukaryotic MORC ATPases.

    abstract:The provenance and biochemical roles of eukaryotic MORC proteins have remained poorly understood since the discovery of their prototype MORC1, which is required for meiotic nuclear division in animals. The MORC family contains a combination of a gyrase, histidine kinase, and MutL (GHKL) and S5 domains that together co...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-3-8

    authors: Iyer LM,Abhiman S,Aravind L

    update_date:2008-03-17 00:00:00

  • Is pre-Darwinian evolution plausible?

    abstract:BACKGROUND:This essay highlights critical aspects of the plausibility of pre-Darwinian evolution. It is based on a critical review of some better-known open, far-from-equilibrium system-based scenarios supposed to explain processes that took place before Darwinian evolution had emerged and that resulted in the origin o...

    journal_title:Biology direct

    pub_type: Journal Article,Review

    doi:10.1186/s13062-018-0216-7

    authors: Tessera M

    update_date:2018-09-21 00:00:00

  • LINEs of evidence: noncanonical DNA replication as an epigenetic determinant.

    abstract:LINE-1 (L1) retrotransposons are repetitive elements in mammalian genomes. They are capable of synthesizing DNA on their own RNA templates by harnessing reverse transcriptase (RT) that they encode. Abundantly expressed full-length L1s and their RT are found to globally influence gene expression profiles, differentiati...

    journal_title:Biology direct

    pub_type: Journal Article,Review

    doi:10.1186/1745-6150-8-22

    authors: Belan E

    update_date:2013-09-13 00:00:00

  • Hereditary profiles of disorderly transcription?

    abstract:BACKGROUND:Microscopic examination of living cells often reveals that cells from some cell strains appear to be in a permanent state of disarray without obvious reason. In all probability such a disorderly state affects cell functioning. The aim of this study was to establish whether a disorderly state could occur that...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-1-9

    authors: Simons JW

    update_date:2006-04-02 00:00:00

  • Pseudo-chaotic oscillations in CRISPR-virus coevolution predicted by bifurcation analysis.

    abstract:BACKGROUND:The CRISPR-Cas systems of adaptive antivirus immunity are present in most archaea and many bacteria, and provide resistance to specific viruses or plasmids by inserting fragments of foreign DNA into the host genome and then utilizing transcripts of these spacers to inactivate the cognate foreign genome. The ...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-9-13

    authors: Berezovskaya FS,Wolf YI,Koonin EV,Karev GP

    update_date:2014-07-02 00:00:00

  • Stable feature selection and classification algorithms for multiclass microarray data.

    abstract:BACKGROUND:Recent studies suggest that gene expression profiles are a promising alternative for clinical cancer classification. One major problem in applying DNA microarrays for classification is the dimension of obtained data sets. In this paper we propose a multiclass gene selection method based on Partial Least Squa...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-7-33

    authors: Student S,Fujarewicz K

    update_date:2012-10-02 00:00:00

  • A web server for analysis, comparison and prediction of protein ligand binding sites.

    abstract:BACKGROUND:One of the major challenges in the field of system biology is to understand the interaction between a wide range of proteins and ligands. In the past, methods have been developed for predicting binding sites in a protein for a limited number of ligands. RESULTS:In order to address this problem, we developed...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-016-0118-5

    authors: Singh H,Srivastava HK,Raghava GP

    update_date:2016-03-25 00:00:00

  • Strong association between pseudogenization mechanisms and gene sequence length.

    abstract:UNLABELLED:Pseudogenes arise from the decay of gene copies following either RNA-mediated duplication (processed pseudogenes) or DNA-mediated duplication (nonprocessed pseudogenes). Here, we show that long protein-coding genes tend to produce more nonprocessed pseudogenes than short genes, whereas the opposite is true f...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-4-38

    authors: Khachane AN,Harrison PM

    update_date:2009-10-06 00:00:00

  • xHMMER3x2: Utilizing HMMER3's speed and HMMER2's sensitivity and specificity in the glocal alignment mode for improved large-scale protein domain annotation.

    abstract:BACKGROUND:While the local-mode HMMER3 is notable for its massive speed improvement, the slower glocal-mode HMMER2 is more exact for domain annotation by enforcing full domain-to-sequence alignments. Since a unit of domain necessarily implies a unit of function, local-mode HMMER3 alone remains insufficient for precise ...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-016-0163-0

    authors: Yap CK,Eisenhaber B,Eisenhaber F,Wong WC

    update_date:2016-11-29 00:00:00

  • The mechanistic and evolutionary aspects of the 2'- and 3'-OH paradigm in biosynthetic machinery.

    abstract:BACKGROUND:The translation machinery underlies a multitude of biological processes within the cell. The design and implementation of the modern translation apparatus on even the simplest course of action is extremely complex, and involves different RNA and protein factors. According to the "RNA world" idea, the critica...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-8-17

    authors: Safro M,Klipcan L

    update_date:2013-07-08 00:00:00

  • Biochemistry and physiology within the framework of the extended synthesis of evolutionary biology.

    abstract:Functional biologists, like Claude Bernard, ask "How?", meaning that they investigate the mechanisms underlying the emergence of biological functions (proximal causes), while evolutionary biologists, like Charles Darwin, ask "Why?", meaning that they search the causes of adaptation, survival and evolution (remote cau...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-016-0109-6

    authors: Vianello A,Passamonti S

    update_date:2016-02-09 00:00:00

  • Diverse bacterial genomes encode an operon of two genes, one of which is an unusual class-I release factor that potentially recognizes atypical mRNA signals other than normal stop codons.

    abstract:BACKGROUND:While all codons that specify amino acids are universally recognized by tRNA molecules, codons signaling termination of translation are recognized by proteins known as class-I release factors (RF). In most eukaryotes and archaea a single RF accomplishes termination at all three stop codons. In most bacteria,...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-1-28

    authors: Baranov PV,Vestergaard B,Hamelryck T,Gesteland RF,Nyborg J,Atkins JF

    update_date:2006-09-13 00:00:00

  • Trees and networks before and after Darwin.

    abstract:It is well-known that Charles Darwin sketched abstract trees of relationship in his 1837 notebook, and depicted a tree in the Origin of Species (1859). Here I attempt to place Darwin's trees in historical context. By the mid-Eighteenth century the Great Chain of Being was increasingly seen to be an inadequate descript...

    journal_title:Biology direct

    pub_type: Historical Article,Journal Article,Review

    doi:10.1186/1745-6150-4-43

    authors: Ragan MA

    update_date:2009-11-16 00:00:00

  • Outer membrane protein genes and their small non-coding RNA regulator genes in Photorhabdus luminescens.

    abstract:INTRODUCTION:Three major outer membrane protein genes of Escherichia coli, ompF, ompC, and ompA respond to stress factors. Transcripts from these genes are regulated by the small non-coding RNAs micF, micC, and micA, respectively. Here we examine Photorhabdus luminescens, an organism that has a different habitat from E...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-1-12

    authors: Papamichail D,Delihas N

    update_date:2006-05-22 00:00:00

  • Biased gene transfer and its implications for the concept of lineage.

    abstract:BACKGROUND:In the presence of horizontal gene transfer (HGT), the concepts of lineage and genealogy in the microbial world become more ambiguous because chimeric genomes trace their ancestry from a myriad of sources, both living and extinct. RESULTS:We present the evolutionary histories of three aminoacyl-tRNA synthet...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/1745-6150-6-47

    authors: Andam CP,Gogarten JP

    update_date:2011-09-23 00:00:00

  • PEPstrMOD: structure prediction of peptides containing natural, non-natural and modified residues.

    abstract:BACKGROUND:In the past, many methods have been developed for peptide tertiary structure prediction but they are limited to peptides having natural amino acids. This study describes a method PEPstrMOD, which is an updated version of PEPstr, developed specifically for predicting the structure of peptides containing natur...

    journal_title:Biology direct

    pub_type: Journal Article

    doi:10.1186/s13062-015-0103-4

    authors: Singh S,Singh H,Tuknait A,Chaudhary K,Singh B,Kumaran S,Raghava GP

    update_date:2015-12-21 00:00:00