An assessment of gene prediction accuracy in large DNA sequences.

Abstract:

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments.
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
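
The abstract's two central ideas, a semiartificial test sequence and sensitivity/specificity as accuracy measures, can be illustrated with a minimal Python sketch. It is not the authors' procedure. The function names, the uniform A/C/G/T spacer model, and the fixed spacer length are assumptions made for illustration, and specificity is computed as TP / (TP + FP), the convention commonly used in gene-prediction evaluation, which the abstract itself does not define.

```python
import random


def build_semiartificial(genes, intergenic_len, seed=0):
    """Join single-gene sequences, each preceded by a random spacer.

    genes: list of (sequence, exons), where exons is a list of
    (start, end) pairs, 0-based and end-exclusive, relative to that sequence.
    The uniform A/C/G/T spacer is an assumption, not the paper's model.
    Returns the long sequence and the exon coordinates shifted into it.
    """
    rng = random.Random(seed)
    parts, shifted, offset = [], [], 0
    for seq, exons in genes:
        spacer = "".join(rng.choice("ACGT") for _ in range(intergenic_len))
        parts.append(spacer)
        offset += len(spacer)
        # Shift this gene's exons by everything placed before it.
        shifted.extend((s + offset, e + offset) for s, e in exons)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), shifted


def coding_positions(exons):
    """Expand exon intervals into the set of coding nucleotide positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / (TP + FN); specificity = TP / (TP + FP) (assumed
    convention). Either value is 0.0 when its denominator is empty.
    """
    truth = coding_positions(true_exons)
    pred = coding_positions(predicted_exons)
    tp = len(truth & pred)
    sensitivity = tp / len(truth) if truth else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity
```

A prediction that places extra exons in the spacer lowers specificity without changing sensitivity, while a missed gene lowers sensitivity; comparing the two numbers on short and long inputs is the kind of contrast the abstract reports.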

journal_name

Genome Res

journal_title

Genome research

authors

Guigó R,Agarwal P,Abril JF,Burset M,Fickett JW

doi

10.1101/gr.122800

subject

Has Abstract

pub_date

2000-10-01 00:00:00

pages

1631-42

issue

10

eissn

1549-5469

issn

1088-9051

journal_volume

10

pub_type

Journal Article
  • Exploring expression data: identification and analysis of coexpressed genes.

    abstract::Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of fal...

    journal_title:Genome research

    pub_type: Journal Article, Review

    doi:10.1101/gr.9.11.1106

    authors: Heyer LJ,Kruglyak S,Yooseph S

    update_date:1999-11-01 00:00:00

  • Ancient duplicated conserved noncoding elements in vertebrates: a genomic and functional analysis.

    abstract::Fish-mammal genomic comparisons have proved powerful in identifying conserved noncoding elements likely to be cis-regulatory in nature, and the majority of those tested in vivo have been shown to act as tissue-specific enhancers associated with genes involved in transcriptional regulation of development. Although most...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.4143406

    authors: McEwen GK,Woolfe A,Goode D,Vavouri T,Callaway H,Elgar G

    update_date:2006-04-01 00:00:00

  • A model for postzygotic mosaicisms quantifies the allele fraction drift, mutation rate, and contribution to de novo mutations.

    abstract::The allele fraction (AF) distribution, occurrence rate, and evolutionary contribution of postzygotic single-nucleotide mosaicisms (pSNMs) remain largely unknown. In this study, we developed a mathematical model to describe the accumulation and AF drift of pSNMs during the development of multicellular organisms. By app...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.230003.117

    authors: Ye AY,Dou Y,Yang X,Wang S,Huang AY,Wei L

    update_date:2018-07-01 00:00:00

  • Whole population, genome-wide mapping of hidden relatedness.

    abstract::We present GERMLINE, a robust algorithm for identifying segmental sharing indicative of recent common ancestry between pairs of individuals. Unlike methods with comparable objectives, GERMLINE scales linearly with the number of samples, enabling analysis of whole-genome data in large cohorts. Our approach is based on ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.081398.108

    authors: Gusev A,Lowe JK,Stoffel M,Daly MJ,Altshuler D,Breslow JL,Friedman JM,Pe'er I

    update_date:2009-02-01 00:00:00

  • Evolution and multilevel optimization of the genetic code.

    abstract::The discovery of the genetic code was one of the most important advances of modern biology. But there is more to a DNA code than protein sequence; DNA carries signals for splicing, localization, folding, and regulation that are often embedded within the protein-coding sequence. In this issue, Itzkovitz and Alon show t...

    journal_title:Genome research

    pub_type: Comment, Journal Article, Review

    doi:10.1101/gr.6144007

    authors: Bollenbach T,Vetsigian K,Kishony R

    update_date:2007-04-01 00:00:00

  • Arboretum: reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules.

    abstract::Comparative functional genomics studies the evolution of biological processes by analyzing functional data, such as gene expression profiles, across species. A major challenge is to compare profiles collected in a complex phylogeny. Here, we present Arboretum, a novel scalable computational algorithm that integrates e...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.146233.112

    authors: Roy S,Wapinski I,Pfiffner J,French C,Socha A,Konieczka J,Habib N,Kellis M,Thompson D,Regev A

    update_date:2013-06-01 00:00:00

  • Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell.

    abstract::Comparative analysis of the protein sequences encoded in the four euryarchaeal species whose genomes have been sequenced completely (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, and Pyrococcus horikoshii) revealed 1326 orthologous sets, of which 543 are represented in all fou...

    journal_title:Genome research

    pub_type: Journal Article

    doi:

    authors: Makarova KS,Aravind L,Galperin MY,Grishin NV,Tatusov RL,Wolf YI,Koonin EV

    update_date:1999-07-01 00:00:00

  • The Release 6 reference sequence of the Drosophila melanogaster genome.

    abstract::Drosophila melanogaster plays an important role in molecular, genetic, and genomic studies of heredity, development, metabolism, behavior, and human disease. The initial reference genome sequence reported more than a decade ago had a profound impact on progress in Drosophila research, and improving the accuracy and co...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.185579.114

    authors: Hoskins RA,Carlson JW,Wan KH,Park S,Mendez I,Galle SE,Booth BW,Pfeiffer BD,George RA,Svirskas R,Krzywinski M,Schein J,Accardo MC,Damia E,Messina G,Méndez-Lago M,de Pablos B,Demakova OV,Andreyeva EN,Boldyreva LV,Ma

    update_date:2015-03-01 00:00:00

  • Comparing genomes within the species Mycobacterium tuberculosis.

    abstract::The study of genetic variability within natural populations of pathogens may provide insight into their evolution and pathogenesis. We used a Mycobacterium tuberculosis high-density oligonucleotide microarray to detect small-scale genomic deletions among 19 clinically and epidemiologically well-characterized isolates ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.166401

    authors: Kato-Maeda M,Rhee JT,Gingeras TR,Salamon H,Drenkow J,Smittipat N,Small PM

    update_date:2001-04-01 00:00:00

  • A human cDNA expression library in yeast enriched for open reading frames.

    abstract::We developed a high-throughput technique for the generation of cDNA libraries in the yeast Saccharomyces cerevisiae which enables the selection of cloned cDNA inserts containing open reading frames (ORFs). For direct screening of random-primed cDNA libraries, we have constructed a yeast shuttle/expression vector, the ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.181501

    authors: Holz C,Lueking A,Bovekamp L,Gutjahr C,Bolotina N,Lehrach H,Cahill DJ

    update_date:2001-10-01 00:00:00

  • A matter of life or death: how microsatellites emerge in and vanish from the human genome.

    abstract::Microsatellites--tandem repeats of short DNA motifs--are abundant in the human genome and have high mutation rates. While microsatellite instability is implicated in numerous genetic diseases, the molecular processes involved in their emergence and disappearance are still not well understood. Microsatellites are hypot...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.122937.111

    authors: Kelkar YD,Eckert KA,Chiaromonte F,Makova KD

    update_date:2011-12-01 00:00:00

  • Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.

    abstract::The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.246462.118

    authors: Mudge JM,Jungreis I,Hunt T,Gonzalez JM,Wright JC,Kay M,Davidson C,Fitzgerald S,Seal R,Tweedie S,He L,Waterhouse RM,Li Y,Bruford E,Choudhary JS,Frankish A,Kellis M

    update_date:2019-12-01 00:00:00

  • A transposon-based strategy for sequencing repetitive DNA in eukaryotic genomes.

    abstract::Repetitive DNA is a significant component of eukaryotic genomes. We have developed a strategy to efficiently and accurately sequence repetitive DNA in the nematode Caenorhabditis elegans using integrated artificial transposons and automated fluorescent sequencing. Mapping and assembly tools represent important compone...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.7.5.551

    authors: Devine SE,Chissoe SL,Eby Y,Wilson RK,Boeke JD

    update_date:1997-05-01 00:00:00

  • Evolution of transcript modification by N6-methyladenosine in primates.

    abstract::Phenotypic differences within populations and between closely related species are often driven by variation and evolution of gene expression. However, most analyses have focused on the effects of genomic variation at cis-regulatory elements such as promoters and enhancers that control transcriptional activity, and lit...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.212563.116

    authors: Ma L,Zhao B,Chen K,Thomas A,Tuteja JH,He X,He C,White KP

    update_date:2017-03-01 00:00:00

  • The Arabidopsis genome: a foundation for plant research.

    abstract::The sequence of the first plant genome was completed and published at the end of 2000. This spawned a series of large-scale projects aimed at discovering the functions of the 25,000+ genes identified in Arabidopsis thaliana (Arabidopsis). This review summarizes progress made in the past five years and speculates about...

    journal_title:Genome research

    pub_type: Journal Article, Review

    doi:10.1101/gr.3723405

    authors: Bevan M,Walsh S

    update_date:2005-12-01 00:00:00

  • Efficient identification of Y chromosome sequences in the human and Drosophila genomes.

    abstract::Notwithstanding their biological importance, Y chromosomes remain poorly known in most species. A major obstacle to their study is the identification of Y chromosome sequences; due to its high content of repetitive DNA, in most genome projects, the Y chromosome sequence is fragmented into a large number of small, unma...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.156034.113

    authors: Carvalho AB,Clark AG

    update_date:2013-11-01 00:00:00

  • Rescue of targeted regions of mammalian chromosomes by in vivo recombination in yeast.

    abstract::In contrast to other animal cell lines, the chicken pre-B cell lymphoma line, DT40, exhibits a high level of homologous recombination, which can be exploited to generate site-specific alterations in defined target genes or regions. In addition, the ability to generate human/chicken monochromosomal hybrids in the DT40 ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.8.6.666

    authors: Kouprina N,Kawamoto K,Barrett JC,Larionov V,Koi M

    update_date:1998-06-01 00:00:00

  • Delineation of key regulatory elements identifies points of vulnerability in the mitogen-activated signaling network.

    abstract::Drug development efforts against cancer are often hampered by the complex properties of signaling networks. Here we combined the results of an RNAi screen targeting the cellular signaling machinery, with graph theoretical analysis to extract the core modules that process both mitogenic and oncogenic signals to drive c...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.116145.110

    authors: Jailkhani N,Ravichandran S,Hegde SR,Siddiqui Z,Mande SC,Rao KV

    update_date:2011-12-01 00:00:00

  • Genomic evolution, patterns of global dissemination, and interspecies transmission of human and simian T-cell leukemia/lymphotropic viruses.

    abstract::Using both env and long terminal repeat (LTR) sequences, with maximal representation of genetic diversity within primate strains, we revise and expand the unique evolutionary history of human and simian T-cell leukemia/lymphotropic viruses (HTLV/STLV). Based on the robust application of three different phylogenetic al...

    journal_title:Genome research

    pub_type: Journal Article, Review

    doi:

    authors: Slattery JP,Franchini G,Gessain A

    update_date:1999-06-01 00:00:00

  • Integrated single-cell genetic and transcriptional analysis suggests novel drivers of chronic lymphocytic leukemia.

    abstract::Intra-tumoral genetic heterogeneity has been characterized across cancers by genome sequencing of bulk tumors, including chronic lymphocytic leukemia (CLL). In order to more accurately identify subclones, define phylogenetic relationships, and probe genotype-phenotype relationships, we developed methods for targeted m...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.217331.116

    authors: Wang L,Fan J,Francis JM,Georghiou G,Hergert S,Li S,Gambe R,Zhou CW,Yang C,Xiao S,Cin PD,Bowden M,Kotliar D,Shukla SA,Brown JR,Neuberg D,Alessi DR,Zhang CZ,Kharchenko PV,Livak KJ,Wu CJ

    update_date:2017-08-01 00:00:00

  • CG dinucleotides enhance promoter activity independent of DNA methylation.

    abstract::Most mammalian RNA polymerase II initiation events occur at CpG islands, which are rich in CpGs and devoid of DNA methylation. Despite their relevance for gene regulation, it is unknown to what extent the CpG dinucleotide itself actually contributes to promoter activity. To address this question, we determined the tra...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.241653.118

    authors: Hartl D,Krebs AR,Grand RS,Baubec T,Isbel L,Wirbelauer C,Burger L,Schübeler D

    update_date:2019-04-01 00:00:00

  • ATAC-seq reveals regional differences in enhancer accessibility during the establishment of spatial coordinates in the Drosophila blastoderm.

    abstract::Establishment of spatial coordinates during Drosophila embryogenesis relies on differential regulatory activity of axis patterning enhancers. Concentration gradients of activator and repressor transcription factors (TFs) provide positional information to each enhancer, which in turn promotes transcription of a target ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.242362.118

    authors: Bozek M,Cortini R,Storti AE,Unnerstall U,Gaul U,Gompel N

    update_date:2019-05-01 00:00:00

  • A complexity reduction algorithm for analysis and annotation of large genomic sequences.

    abstract::DNA is a universal language encrypted with biological instruction for life. In higher organisms, the genetic information is preserved predominantly in an organized exon/intron structure. When a gene is expressed, the exons are spliced together to form the transcript for protein synthesis. We have developed a complexit...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.313703

    authors: Chuang TJ,Lin WC,Lee HC,Wang CW,Hsiao KL,Wang ZH,Shieh D,Lin SC,Ch'ang LY

    update_date:2003-02-01 00:00:00

  • Reconstructing complex regions of genomes using long-read sequencing technology.

    abstract::Obtaining high-quality sequence continuity of complex regions of recent segmental duplication remains one of the major challenges of finishing genome assemblies. In the human and mouse genomes, this was achieved by targeting large-insert clones using costly and laborious capillary-based sequencing approaches. Sanger s...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.168450.113

    authors: Huddleston J,Ranade S,Malig M,Antonacci F,Chaisson M,Hon L,Sudmant PH,Graves TA,Alkan C,Dennis MY,Wilson RK,Turner SW,Korlach J,Eichler EE

    update_date:2014-04-01 00:00:00

  • A generic, cost-effective, and scalable cell lineage analysis platform.

    abstract::Advances in single-cell genomics enable commensurate improvements in methods for uncovering lineage relations among individual cells. Current sequencing-based methods for cell lineage analysis depend on low-resolution bulk analysis or rely on extensive single-cell sequencing, which is not scalable and could be biased ...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.202903.115

    authors: Biezuner T,Spiro A,Raz O,Amir S,Milo L,Adar R,Chapal-Ilani N,Berman V,Fried Y,Ainbinder E,Cohen G,Barr HM,Halaban R,Shapiro E

    update_date:2016-11-01 00:00:00

  • Turnover of ribosome-associated transcripts from de novo ORFs produces gene-like characteristics available for de novo gene emergence in wild yeast populations.

    abstract::Little is known about the rate of emergence of de novo genes, what their initial properties are, and how they spread in populations. We examined wild yeast populations (Saccharomyces paradoxus) to characterize the diversity and turnover of intergenic ORFs over short evolutionary timescales. We find that hundreds of in...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.239822.118

    authors: Durand É,Gagnon-Arsenault I,Hallin J,Hatin I,Dubé AK,Nielly-Thibault L,Namy O,Landry CR

    update_date:2019-06-01 00:00:00

  • Centromere repositioning.

    abstract::Primate pericentromeric regions recently have been shown to exhibit extraordinary evolutionary plasticity. In this paper we report an additional peculiar feature of these regions that we discovered while analyzing, by FISH, the evolutionary conservation of primate phylogenetic chromosome IX. If the position of the cen...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.9.12.1184

    authors: Montefalcone G,Tempesta S,Rocchi M,Archidiacono N

    update_date:1999-12-01 00:00:00

  • Domain regulation of imprinting cluster in Kip2/Lit1 subdomain on mouse chromosome 7F4/F5: large-scale DNA methylation analysis reveals that DMR-Lit1 is a putative imprinting control region.

    abstract::Mouse chromosome 7F4/F5, where the imprinting domain is located, is syntenic to human 11p15.5, the locus for Beckwith-Wiedemann syndrome. The domain is thought to consist of the two subdomains Kip2 (p57(kip2))/Lit1 and Igf2/H19. Because DNA methylation is believed to be a key factor in genomic imprinting, we performed...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.110702

    authors: Yatsuki H,Joh K,Higashimoto K,Soejima H,Arai Y,Wang Y,Hatada I,Obata Y,Morisaki H,Zhang Z,Nakagawachi T,Satoh Y,Mukai T

    update_date:2002-12-01 00:00:00

  • Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment.

    abstract::A new algorithm, WABA, was developed for doing large-scale alignments between genomic DNA of different species. WABA was used to align 8 million bases of Caenorhabditis briggsae genomic DNA against the entire 97-million-base Caenorhabditis elegans genome. The alignment, including C. briggsae homologs of 154 geneticall...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.10.8.1115

    authors: Kent WJ,Zahler AM

    update_date:2000-08-01 00:00:00

  • Unique DNA methylome profiles in CpG island methylator phenotype colon cancers.

    abstract::A subset of colorectal cancers was postulated to have the CpG island methylator phenotype (CIMP), a higher propensity for CpG island DNA methylation. The validity of CIMP, its molecular basis, and its prognostic value remain highly controversial. Using MBD-isolated genome sequencing, we mapped and compared genome-wide...

    journal_title:Genome research

    pub_type: Journal Article

    doi:10.1101/gr.122788.111

    authors: Xu Y,Hu B,Choi AJ,Gopalan B,Lee BH,Kalady MF,Church JM,Ting AH

    update_date:2012-02-01 00:00:00