Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.

Abstract:

The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.

journal_name

Genome Res

journal_title

Genome research

authors

Mudge JM,Jungreis I,Hunt T,Gonzalez JM,Wright JC,Kay M,Davidson C,Fitzgerald S,Seal R,Tweedie S,He L,Waterhouse RM,Li Y,Bruford E,Choudhary JS,Frankish A,Kellis M

doi

10.1101/gr.246462.118

subject

Has Abstract

pub_date

2019-12-01 00:00:00

pages

2073-2087

issue

12

eissn

1549-5469

issn

1088-9051

pii

gr.246462.118

journal_volume

29

pub_type

Journal Article
  • Genome-wide A-to-I RNA editing in fungi independent of ADAR enzymes.

    abstract::Yeasts and filamentous fungi do not have adenosine deaminase acting on RNA (ADAR) orthologs and are believed to lack A-to-I RNA editing, which is the most prevalent editing of mRNA in animals. However, during this study with the PUK1(FGRRES_01058) pseudokinase gene important for sexual reproduction in Fusarium gramine...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.199877.115

    authors: Liu H,Wang Q,He Y,Chen L,Hao C,Jiang C,Li Y,Dai Y,Kang Z,Xu JR

    update_date:2016-04-01 00:00:00

  • Novel susceptibility locus for mouse hepatomas: evidence for a conserved tumor suppressor gene.

    abstract::We have identified previously a putative tumor suppressor gene (TSG) locus at human chromosome (hchr) 7q31 showing that it is altered in a variety of human epithelial tumors. To determine whether this TSG is conserved in mice, we studied loss of heterozygosity (LOH) in chemically induced mouse liver adenomas. The LOH ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.6.11.1070

    authors: Zenklusen JC,Rodriguez LV,LaCava M,Wang Z,Goldstein LS,Conti CJ

    update_date:1996-11-01 00:00:00

  • lobSTR: A short tandem repeat profiler for personal genomes.

    abstract::Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat S...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.135780.111

    authors: Gymrek M,Golan D,Rosset S,Erlich Y

    update_date:2012-06-01 00:00:00

  • Shuffling of genes within low-copy repeats on 22q11 (LCR22) by Alu-mediated recombination events during evolution.

    abstract::Low-copy repeats, or segmental duplications, are highly dynamic regions in the genome. The low-copy repeats on chromosome 22q11.2 (LCR22) are a complex mosaic of genes and pseudogenes formed by duplication processes; they mediate chromosome rearrangements associated with velo-cardio-facial syndrome/DiGeorge syndrome, ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.1549503

    authors: Babcock M,Pavlicek A,Spiteri E,Kashork CD,Ioshikhes I,Shaffer LG,Jurka J,Morrow BE

    update_date:2003-12-01 00:00:00

  • Recompleting the Caenorhabditis elegans genome.

    abstract::Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. el...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.244830.118

    authors: Yoshimura J,Ichikawa K,Shoura MJ,Artiles KL,Gabdank I,Wahba L,Smith CL,Edgley ML,Rougvie AE,Fire AZ,Morishita S,Schwarz EM

    update_date:2019-06-01 00:00:00

  • The repetitive landscape of the chicken genome.

    abstract::Cot-based cloning and sequencing (CBCS) is a powerful tool for isolating and characterizing the various repetitive components of any genome, combining the established principles of DNA reassociation kinetics with high-throughput sequencing. CBCS was used to generate sequence libraries representing the high, middle, an...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.2438004

    authors: Wicker T,Robertson JS,Schulze SR,Feltus FA,Magrini V,Morrison JA,Mardis ER,Wilson RK,Peterson DG,Paterson AH,Ivarie R

    update_date:2005-01-01 00:00:00

  • Arboretum: reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules.

    abstract::Comparative functional genomics studies the evolution of biological processes by analyzing functional data, such as gene expression profiles, across species. A major challenge is to compare profiles collected in a complex phylogeny. Here, we present Arboretum, a novel scalable computational algorithm that integrates e...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.146233.112

    authors: Roy S,Wapinski I,Pfiffner J,French C,Socha A,Konieczka J,Habib N,Kellis M,Thompson D,Regev A

    update_date:2013-06-01 00:00:00

  • Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine.

    abstract::When transcription is to the right of the promoter, the "top," mRNA-synonymous strand of DNA tends to be purine-rich. When transcription is to the left of the promoter, the top, mRNA-template strand tends to be pyrimidine-rich. This transcription-direction rule suggests that there has been an evolutionary selection pr...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.10.2.228

    authors: Lao PJ,Forsdyke DR

    update_date:2000-02-01 00:00:00

  • Genomic evolution, patterns of global dissemination, and interspecies transmission of human and simian T-cell leukemia/lymphotropic viruses.

    abstract::Using both env and long terminal repeat (LTR) sequences, with maximal representation of genetic diversity within primate strains, we revise and expand the unique evolutionary history of human and simian T-cell leukemia/lymphotropic viruses (HTLV/STLV). Based on the robust application of three different phylogenetic al...

    journal_title:Genome research

    pub_type: Journal Article, Review

    doi:

    authors: Slattery JP,Franchini G,Gessain A

    update_date:1999-06-01 00:00:00

  • HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient.

    abstract::Hi-C is a powerful technology for studying genome-wide chromatin interactions. However, current methods for assessing Hi-C data reproducibility can produce misleading results because they ignore spatial features in Hi-C data, such as domain structure and distance dependence. We present HiCRep, a framework for assessin...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.220640.117

    authors: Yang T,Zhang F,Yardımcı GG,Song F,Hardison RC,Noble WS,Yue F,Li Q

    update_date:2017-11-01 00:00:00

  • Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context.

    abstract::Gene order in prokaryotes is conserved to a much lesser extent than protein sequences. Only several operons, primarily those that code for physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes. Nevertheless, even the limited conservation of operon organization that is obse...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.gr-1619r

    authors: Wolf YI,Rogozin IB,Kondrashov AS,Koonin EV

    update_date:2001-03-01 00:00:00

  • The origins and evolution of chromosomes, dosage compensation, and mechanisms underlying venom regulation in snakes.

    abstract::Here we use a chromosome-level genome assembly of a prairie rattlesnake (Crotalus viridis), together with Hi-C, RNA-seq, and whole-genome resequencing data, to study key features of genome biology and evolution in reptiles. We identify the rattlesnake Z Chromosome, including the recombining pseudoautosomal region, and...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.240952.118

    authors: Schield DR,Card DC,Hales NR,Perry BW,Pasquesi GM,Blackmon H,Adams RH,Corbin AB,Smith CF,Ramesh B,Demuth JP,Betrán E,Tollis M,Meik JM,Mackessy SP,Castoe TA

    update_date:2019-04-01 00:00:00

  • Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing.

    abstract::Transcript leaders (TLs) can have profound effects on mRNA translation and stability. To map TL boundaries genome-wide, we developed TL-sequencing (TL-seq), a technique combining enzymatic capture of m(7)G-capped mRNA 5' ends with high-throughput sequencing. TL-seq identified mRNA start sites for the majority of yeast...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.150342.112

    authors: Arribere JA,Gilbert WV

    update_date:2013-06-01 00:00:00

  • Copy number variation at the breakpoint region of isochromosome 17q.

    abstract::Isochromosome 17q, or i(17q), is one of the most frequent nonrandom changes occurring in human neoplasia. Most of the i(17q) breakpoints cluster within an approximately 240-kb interval located in the Smith-Magenis syndrome common deletion region in 17p11.2. The breakpoint cluster region is characterized by a complex ar...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.080697.108

    authors: Carvalho CM,Lupski JR

    update_date:2008-11-01 00:00:00

  • Noncoding origins of anthropoid traits and a new null model of transposon functionalization.

    abstract::Little is known about novel genetic elements that drove the emergence of anthropoid primates. We exploited the sequencing of the marmoset genome to identify 23,849 anthropoid-specific constrained (ASC) regions and confirmed their robust functional signatures. Of the ASC base pairs, 99.7% were noncoding, suggesting tha...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.168963.113

    authors: del Rosario RC,Rayan NA,Prabhakar S

    update_date:2014-09-01 00:00:00

  • Screening of gene-associated polymorphisms by use of in-gel competitive reassociation and EST (cDNA) array hybridization.

    abstract::In-gel competitive reassociation (IGCR) is a method of differential subtraction to enrich polymorphic DNA restriction fragments between two DNA samples without probes or specific sequence information. Here, we show that by combining IGCR and expressed sequence tags (EST) array hybridization, polymorphic DNA fragments ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.434103

    authors: Gotoh K,Oishi M

    update_date:2003-03-01 00:00:00

  • Evolutionary features of the 4-Mb Xq21.3 XY homology region revealed by a map at 60-kb resolution.

    abstract::Forty-three yeast artificial chromosomes (YACs) from the X chromosome have been overlapped across the 4-Mb Xq21.3 region, which is homologous to a segment in Yp11.1. The region is formatted to 60-kb resolution with 57 STSs and is merged at its edges with contigs specific for X. This allows a direct comparison of marke...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.7.4.307

    authors: Mumm S,Molini B,Terrell J,Srivastava A,Schlessinger D

    update_date:1997-04-01 00:00:00

  • The portability of tagSNPs across populations: a worldwide survey.

    abstract::In the search for common genetic variants that contribute to prevalent human diseases, patterns of linkage disequilibrium (LD) among linked markers should be considered when selecting SNPs. Genotyping efficiency can be increased by choosing tagging SNPs (tagSNPs) in LD with other SNPs. However, it remains to be seen w...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.4138406

    authors: González-Neira A,Ke X,Lao O,Calafell F,Navarro A,Comas D,Cann H,Bumpstead S,Ghori J,Hunt S,Deloukas P,Dunham I,Cardon LR,Bertranpetit J

    update_date:2006-03-01 00:00:00

  • Arabidopsis-rice: will colinearity allow gene prediction across the eudicot-monocot divide?

    abstract::With the genomic sequencing of Arabidopsis nearing completion and rice sequencing very much in its infancy, a key question is whether we can exploit the Arabidopsis sequence to identify candidate genes for traits in cereal crops using a map-based approach. This requires the existence of colinearity between the Arabido...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.9.9.825

    authors: Devos KM,Beales J,Nagamura Y,Sasaki T

    update_date:1999-09-01 00:00:00

  • DNA profiling of B chromosomes from the yellow-necked mouse Apodemus flavicollis (Rodentia, Mammalia).

    abstract::Using AP-PCR-based DNA profiling we examined some structural features of B chromosomes from yellow-necked mice Apodemus flavicollis. Mice harboring one, two, or three or lacking B chromosomes were examined. Chromosomal structure was scanned for variant bands by using a series of arbitrary primers and from these, infor...

    journal_title:Genome research

    pub_type: Journal Article

    doi:

    authors: Tanic N,Dedovic N,Vujosevic M,Dimitrijevic B

    update_date:2000-01-01 00:00:00

  • X chromosome cDNA microarray screening identifies a functional PLP2 promoter polymorphism enriched in patients with X-linked mental retardation.

    abstract::X-linked Mental Retardation (XLMR) occurs in 1 in 600 males and is highly genetically heterogeneous. We used a novel human X chromosome cDNA microarray (XCA) to survey the expression profile of X-linked genes in lymphoblasts of XLMR males. Genes with altered expression verified by Northern blot and/or quantitative PCR...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.5336307

    authors: Zhang L,Jie C,Obie C,Abidi F,Schwartz CE,Stevenson RE,Valle D,Wang T

    update_date:2007-05-01 00:00:00

  • Impact of genomics on research in the rat.

    abstract::The need to translate genes to function has positioned the rat as an invaluable animal model for genomic research. The significant increase in genomic resources in recent years has had an immediate functional application in the rat. Many of the resources for translational research are already in place and are ready to...

    journal_title:Genome research

    pub_type: Journal Article, Review

    doi:10.1101/gr.3744005

    authors: Lazar J,Moreno C,Jacob HJ,Kwitek AE

    update_date:2005-12-01 00:00:00

  • Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes.

    abstract::By analyzing 1,780,295 5'-end sequences of human full-length cDNAs derived from 164 kinds of oligo-cap cDNA libraries, we identified 269,774 independent positions of transcriptional start sites (TSSs) for 14,628 human RefSeq genes. These TSSs were clustered into 30,964 clusters that were separated from each other by m...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.4039406

    authors: Kimura K,Wakamatsu A,Suzuki Y,Ota T,Nishikawa T,Yamashita R,Yamamoto J,Sekine M,Tsuritani K,Wakaguri H,Ishii S,Sugiyama T,Saito K,Isono Y,Irie R,Kushida N,Yoneyama T,Otsuka R,Kanda K,Yokoi T,Kondo H,Wagatsuma M

    update_date:2006-01-01 00:00:00

  • Evolutionary dynamics of segmental duplications from human Y-chromosomal euchromatin/heterochromatin transition regions.

    abstract::Human chromosomal regions enriched in segmental duplications are subject to extensive genomic reorganization. Such regions are particularly informative for illuminating the evolutionary history of a given chromosome. We have analyzed 866 kb of Y-chromosomal non-palindromic segmental duplications delineating four euchr...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.076711.108

    authors: Kirsch S,Münch C,Jiang Z,Cheng Z,Chen L,Batz C,Eichler EE,Schempp W

    update_date:2008-07-01 00:00:00

  • CRISPR RNAs trigger innate immune responses in human cells.

    abstract::Here, we report that CRISPR guide RNAs (gRNAs) with a 5'-triphosphate group (5'-ppp gRNAs) produced via in vitro transcription trigger RNA-sensing innate immune responses in human and murine cells, leading to cytotoxicity. 5'-ppp gRNAs in the cytosol are recognized by DDX58, which in turn activates type I interferon r...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.231936.117

    authors: Kim S,Koo T,Jee HG,Cho HY,Lee G,Lim DG,Shin HS,Kim JS

    update_date:2018-02-22 00:00:00

  • Antisense transcripts with FANTOM2 clone set and their implications for gene regulation.

    abstract::We have used the FANTOM2 mouse cDNA set (60,770 clones), public mRNA data, and mouse genome sequence data to identify 2481 pairs of sense-antisense transcripts and 899 further pairs of nonantisense bidirectional transcription based upon genomic mapping. The analysis greatly expands the number of known examples of sens...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.982903

    authors: Kiyosawa H,Yamanaka I,Osato N,Kondo S,Hayashizaki Y,RIKEN GER Group.,GSL Members.

    update_date:2003-06-01 00:00:00

  • Nucleosome occupancy as a novel chromatin parameter for replication origin functions.

    abstract::Eukaryotic DNA replication initiates from multiple discrete sites in the genome, termed origins of replication (origins). Prior to S phase, multiple origins are poised to initiate replication by recruitment of the pre-replicative complex (pre-RC). For proper replication to occur, origin activation must be tightly regu...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.209940.116

    authors: Rodriguez J,Lee L,Lynch B,Tsukiyama T

    update_date:2017-02-01 00:00:00

  • A cross-platform analysis of 14,177 expression quantitative trait loci derived from lymphoblastoid cell lines.

    abstract::Gene expression levels can be an important link between DNA variation and phenotypic manifestations. Our previous map of global gene expression, based on ~400K single nucleotide polymorphisms (SNPs) and 50K transcripts in 400 sib pairs from the MRCA family panel, has been widely used to interpret the results of genome...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.142521.112

    authors: Liang L,Morar N,Dixon AL,Lathrop GM,Abecasis GR,Moffatt MF,Cookson WO

    update_date:2013-04-01 00:00:00

  • Integrated mapping, chromosomal sequencing and sequence analysis of Cryptosporidium parvum.

    abstract::The apicomplexan Cryptosporidium parvum is one of the most prevalent protozoan parasites of humans. We report the physical mapping of the genome of the Iowa isolate, sequencing and analysis of chromosome 6, and approximately 0.9 Mbp of sequence sampled from the remainder of the genome. To construct a robust physical m...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.1555203

    authors: Bankier AT,Spriggs HF,Fartmann B,Konfortov BA,Madera M,Vogel C,Teichmann SA,Ivens A,Dear PH

    update_date:2003-08-01 00:00:00

  • Comparative methylome analysis of benign and malignant peripheral nerve sheath tumors.

    abstract::Aberrant DNA methylation (DNAm) was first linked to cancer over 25 yr ago. Since then, many studies have associated hypermethylation of tumor suppressor genes and hypomethylation of oncogenes to the tumorigenic process. However, most of these studies have been limited to the analysis of promoters and CpG islands (CGIs...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.109678.110

    authors: Feber A,Wilson GA,Zhang L,Presneau N,Idowu B,Down TA,Rakyan VK,Noon LA,Lloyd AC,Stupka E,Schiza V,Teschendorff AE,Schroth GP,Flanagan A,Beck S

    update_date:2011-04-01 00:00:00