Primary orthologs from local sequence context.

Abstract:

BACKGROUND:The evolutionary history of genes serves as a cornerstone of contemporary biology. Most conserved sequences in mammalian genomes don't code for proteins, yielding a need to infer evolutionary history of sequences irrespective of what kind of functional element they may encode. Thus, sequence-, as opposed to gene-, centric modes of inferring paths of sequence evolution are increasingly relevant. Customarily, homologous sequences derived from the same direct ancestor, whose ancestral position in two genomes is usually conserved, are termed "primary" (or "positional") orthologs. Methods based solely on similarity don't reliably distinguish primary orthologs from other homologs; for this, genomic context is often essential. Context-dependent identification of orthologs traditionally relies on genomic context over length scales characteristic of conserved gene order or whole-genome sequence alignment, and can be computationally intensive. RESULTS:We demonstrate that short-range sequence context-as short as a single "maximal" match- distinguishes primary orthologs from other homologs across whole genomes. On mammalian whole genomes not preprocessed by repeat-masker, potential orthologs are extracted by genome intersection as "non-nested maximal matches:" maximal matches that are not nested into other maximal matches. It emerges that on both nucleotide and gene scales, non-nested maximal matches recapitulate primary or positional orthologs with high precision and high recall, while the corresponding computation consumes less than one thirtieth of the computation time required by commonly applied whole-genome alignment methods. In regions of genomes that would be masked by repeat-masker, non-nested maximal matches recover orthologs that are inaccessible to Lastz net alignment, for which repeat-masking is a prerequisite. mmRBHs, reciprocal best hits of genes containing non-nested maximal matches, yield novel putative orthologs, e.g. around 1000 pairs of genes for human-chimpanzee. CONCLUSIONS:We describe an intersection-based method that requires neither repeat-masking nor alignment to infer evolutionary history of sequences based on short-range genomic sequence context. Ortholog identification based on non-nested maximal matches is parameter-free, and less computationally intensive than many alignment-based methods. It is especially suitable for genome-wide identification of orthologs, and may be applicable to unassembled genomes. We are agnostic as to the reasons for its effectiveness, which may reflect local variation of mean mutation rate.

journal_name

BMC Bioinformatics

journal_title

BMC bioinformatics

authors

Gao K,Miller J

doi

10.1186/s12859-020-3384-2

subject

Has Abstract

pub_date

2020-02-06 00:00:00

pages

48

issue

1

issn

1471-2105

pii

10.1186/s12859-020-3384-2

journal_volume

21

pub_type

杂志文章
  • A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm.

    abstract:BACKGROUND:The process of horizontal gene transfer (HGT) is believed to be widespread in Bacteria and Archaea, but little comparative data is available addressing its occurrence in complete microbial genomes. Collection of high-quality, automated HGT prediction data based on phylogenetic evidence has previously been im...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-419

    authors: Podell S,Gaasterland T,Allen EE

    更新日期:2008-10-07 00:00:00

  • InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes.

    abstract:BACKGROUND:Although many genomic features have been used in the prediction of protein-protein interactions (PPIs), frequently only one is used in a computational method. After realizing the limited power in the prediction using only one genomic feature, investigators are now moving toward integration. So far, there hav...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-414

    authors: Sun J,Sun Y,Ding G,Liu Q,Wang C,He Y,Shi T,Li Y,Zhao Z

    更新日期:2007-10-26 00:00:00

  • Prediction of scaffold proteins based on protein interaction and domain architectures.

    abstract:BACKGROUND:Scaffold proteins are known for being crucial regulators of various cellular functions by assembling multiple proteins involved in signaling and metabolic pathways. Identification of scaffold proteins and the study of their molecular mechanisms can open a new aspect of cellular systemic regulation and the re...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1079-5

    authors: Oh K,Yi GS

    更新日期:2016-07-28 00:00:00

  • Epiviz: a view inside the design of an integrated visual analysis software for genomics.

    abstract:BACKGROUND:Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous consist of genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-16-S11-S4

    authors: Chelaru F,Corrada Bravo H

    更新日期:2015-01-01 00:00:00

  • GeneBins: a database for classifying gene expression data, with application to plant genome arrays.

    abstract:BACKGROUND:To interpret microarray experiments, several ontological analysis tools have been developed. However, current tools are limited to specific organisms. RESULTS:We developed a bioinformatics system to assign the probe set sequences of any organism to a hierarchical functional classification modelled on KEGG o...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-87

    authors: Goffard N,Weiller G

    更新日期:2007-03-12 00:00:00

  • CLU: a new algorithm for EST clustering.

    abstract:BACKGROUND:The continuous flow of EST data remains one of the richest sources for discoveries in modern biology. The first step in EST data mining is usually associated with EST clustering, the process of grouping of original fragments according to their annotation, similarity to known genomic DNA or each other. Cluste...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-6-S2-S3

    authors: Ptitsyn A,Hide W

    更新日期:2005-07-15 00:00:00

  • CaPSID: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes.

    abstract:BACKGROUND:It is now well established that nearly 20% of human cancers are caused by infectious agents, and the list of human oncogenic pathogens will grow in the future for a variety of cancer types. Whole tumor transcriptome and genome sequencing by next-generation sequencing technologies presents an unparalleled opp...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-206

    authors: Borozan I,Wilson S,Blanchette P,Laflamme P,Watt SN,Krzyzanowski PM,Sircoulomb F,Rottapel R,Branton PE,Ferretti V

    更新日期:2012-08-17 00:00:00

  • VKCDB: voltage-gated potassium channel database.

    abstract:BACKGROUND:The family of voltage-gated potassium channels comprises a functionally diverse group of membrane proteins. They help maintain and regulate the potassium ion-based component of the membrane potential and are thus central to many critical physiological processes. VKCDB (Voltage-gated potassium [K] Channel Dat...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章,评审

    doi:10.1186/1471-2105-5-3

    authors: Li B,Gallin WJ

    更新日期:2004-01-09 00:00:00

  • Pathogenic Bacillus anthracis in the progressive gene losses and gains in adaptive evolution.

    abstract:BACKGROUND:Sequence mutations represent a driving force of adaptive evolution in bacterial pathogens. It is especially evident in reductive genome evolution where bacteria underwent lifestyles shifting from a free-living to a strictly intracellular or host-depending life. It resulted in loss-of-function mutations and/o...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-10-S1-S3

    authors: Yu GX

    更新日期:2009-01-30 00:00:00

  • Improved functional prediction of proteins by learning kernel combinations in multilabel settings.

    abstract:BACKGROUND:We develop a probabilistic model for combining kernel matrices to predict the function of proteins. It extends previous approaches in that it can handle multiple labels which naturally appear in the context of protein function. RESULTS:Explicit modeling of multilabels significantly improves the capability o...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-S2-S12

    authors: Roth V,Fischer B

    更新日期:2007-05-03 00:00:00

  • Inferring gene expression dynamics via functional regression analysis.

    abstract:BACKGROUND:Temporal gene expression profiles characterize the time-dynamics of expression of specific genes and are increasingly collected in current gene expression experiments. In the analysis of experiments where gene expression is obtained over the life cycle, it is of interest to relate temporal patterns of gene e...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-60

    authors: Müller HG,Chiou JM,Leng X

    更新日期:2008-01-28 00:00:00

  • TooT-T: discrimination of transport proteins from non-transport proteins.

    abstract:BACKGROUND:Membrane transport proteins (transporters) play an essential role in every living cell by transporting hydrophilic molecules across the hydrophobic membranes. While the sequences of many membrane proteins are known, their structure and function is still not well characterized and understood, owing to the imm...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3311-6

    authors: Alballa M,Butler G

    更新日期:2020-04-23 00:00:00

  • Verification and validation of bioinformatics software without a gold standard: a case study of BWA and Bowtie.

    abstract:BACKGROUND:Bioinformatics software quality assurance is essential in genomic medicine. Systematic verification and validation of bioinformatics software is difficult because it is often not possible to obtain a realistic "gold standard" for systematic evaluation. Here we apply a technique that originates from the softw...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-15-S16-S15

    authors: Giannoulatou E,Park SH,Humphreys DT,Ho JW

    更新日期:2014-01-01 00:00:00

  • Prioritizing disease genes with an improved dual label propagation framework.

    abstract:BACKGROUND:Prioritizing disease genes is trying to identify potential disease causing genes for a given phenotype, which can be applied to reveal the inherited basis of human diseases and facilitate drug development. Our motivation is inspired by label propagation algorithm and the false positive protein-protein intera...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2040-6

    authors: Zhang Y,Liu J,Liu X,Fan X,Hong Y,Wang Y,Huang Y,Xie M

    更新日期:2018-02-08 00:00:00

  • Homology modeling, molecular docking, and molecular dynamics simulations elucidated α-fetoprotein binding modes.

    abstract:BACKGROUND:An important mechanism of endocrine activity is chemicals entering target cells via transport proteins and then interacting with hormone receptors such as the estrogen receptor (ER). α-Fetoprotein (AFP) is a major transport protein in rodent serum that can bind and sequester estrogens, thus preventing entry ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-14-S14-S6

    authors: Shen J,Zhang W,Fang H,Perkins R,Tong W,Hong H

    更新日期:2013-01-01 00:00:00

  • Use of a structural alphabet for analysis of short loops connecting repetitive structures.

    abstract:BACKGROUND:Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we call...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-5-58

    authors: Fourrier L,Benros C,de Brevern AG

    更新日期:2004-05-12 00:00:00

  • Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals.

    abstract:BACKGROUND:Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for the evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0810-y

    authors: Wu SH,Rodrigo AG

    更新日期:2015-11-04 00:00:00

  • CoryneRegNet 4.0 - A reference database for corynebacterial gene regulatory networks.

    abstract:BACKGROUND:Detailed information on DNA-binding transcription factors (the key players in the regulation of gene expression) and on transcriptional regulatory interactions of microorganisms deduced from literature-derived knowledge, computer predictions and global DNA microarray hybridization experiments, has opened the...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-8-429

    authors: Baumbach J

    更新日期:2007-11-06 00:00:00

  • InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams.

    abstract:BACKGROUND:Set comparisons permeate a large number of data analysis workflows, in particular workflows in biological sciences. Venn diagrams are frequently employed for such analysis but current tools are limited. RESULTS:We have developed InteractiVenn, a more flexible tool for interacting with Venn diagrams includin...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0611-3

    authors: Heberle H,Meirelles GV,da Silva FR,Telles GP,Minghim R

    更新日期:2015-05-22 00:00:00

  • Prediction of novel long non-coding RNAs based on RNA-Seq data of mouse Klf1 knockout study.

    abstract:BACKGROUND:Study on long non-coding RNAs (lncRNAs) has been promoted by high-throughput RNA sequencing (RNA-Seq). However, it is still not trivial to identify lncRNAs from the RNA-Seq data and it remains a challenge to uncover their functions. RESULTS:We present a computational pipeline for detecting novel lncRNAs fro...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-13-331

    authors: Sun L,Zhang Z,Bailey TL,Perkins AC,Tallack MR,Xu Z,Liu H

    更新日期:2012-12-13 00:00:00

  • MIENTURNET: an interactive web tool for microRNA-target enrichment and network-based analysis.

    abstract:BACKGROUND:miRNAs regulate the expression of several genes with one miRNA able to target multiple genes and with one gene able to be simultaneously targeted by more than one miRNA. Therefore, it has become indispensable to shorten the long list of miRNA-target interactions to put in the spotlight in order to gain insig...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-019-3105-x

    authors: Licursi V,Conte F,Fiscon G,Paci P

    更新日期:2019-11-04 00:00:00

  • Principal components analysis based methodology to identify differentially expressed genes in time-course microarray data.

    abstract:BACKGROUND:Time-course microarray experiments are being increasingly used to characterize dynamic biological processes. In these experiments, the goal is to identify genes differentially expressed in time-course data, measured between different biological conditions. These differentially expressed genes can reveal the ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-267

    authors: Jonnalagadda S,Srinivasan R

    更新日期:2008-06-06 00:00:00

  • EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries.

    abstract:BACKGROUND:Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of seque...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-186

    authors: Smith RP,Buchser WJ,Lemmon MB,Pardinas JR,Bixby JL,Lemmon VP

    更新日期:2008-04-10 00:00:00

  • Determination of strongly overlapping signaling activity from microarray data.

    abstract:BACKGROUND:As numerous diseases involve errors in signal transduction, modern therapeutics often target proteins involved in cellular signaling. Interpretation of the activity of signaling pathways during disease development or therapeutic intervention would assist in drug development, design of therapy, and target ide...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-7-99

    authors: Bidaut G,Suhre K,Claverie JM,Ochs MF

    更新日期:2006-02-28 00:00:00

  • NextSV: a meta-caller for structural variants from low-coverage long-read sequencing data.

    abstract:BACKGROUND:Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is ne...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-018-2207-1

    authors: Fang L,Hu J,Wang D,Wang K

    更新日期:2018-05-23 00:00:00

  • Prediction of TF target sites based on atomistic models of protein-DNA complexes.

    abstract:BACKGROUND:The specific recognition of genomic cis-regulatory elements by transcription factors (TFs) plays an essential role in the regulation of coordinated gene expression. Studying the mechanisms determining binding specificity in protein-DNA interactions is thus an important goal. Most current approaches for model...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-9-436

    authors: Angarica VE,Pérez AG,Vasconcelos AT,Collado-Vides J,Contreras-Moreira B

    更新日期:2008-10-16 00:00:00

  • Random generalized linear model: a highly accurate and interpretable ensemble predictor.

    abstract:BACKGROUND:Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable especially when forward feature selection is used to construct the model. However, forward feature ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/1471-2105-14-5

    authors: Song L,Langfelder P,Horvath S

    更新日期:2013-01-16 00:00:00

  • REGULATOR: a database of metazoan transcription factors and maternal factors for developmental studies.

    abstract:BACKGROUND:Genes encoding transcription factors that constitute gene-regulatory networks and maternal factors accumulating in egg cytoplasm are two classes of essential genes that play crucial roles in developmental processes. Transcription factors control the expression of their downstream target genes by interacting ...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-015-0552-x

    authors: Wang K,Nishida H

    更新日期:2015-04-10 00:00:00

  • Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme.

    abstract:BACKGROUND:Bioluminescent proteins (BLPs) widely exist in many living organisms. As BLPs are featured by the capability of emitting lights, they can be served as biomarkers and easily detected in biomedical research, such as gene expression analysis and signal transduction pathways. Therefore, accurate identification o...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-017-1709-6

    authors: Zhang J,Chai H,Yang G,Ma Z

    更新日期:2017-06-05 00:00:00

  • Prediction of HIV-1 protease cleavage site using a combination of sequence, structural, and physicochemical features.

    abstract:BACKGROUND:The human immunodeficiency virus type 1 (HIV-1) aspartic protease is an important enzyme owing to its imperative part in viral development and a causative agent of deadliest disease known as acquired immune deficiency syndrome (AIDS). Development of HIV-1 protease inhibitors can help understand the specifici...

    journal_title:BMC bioinformatics

    pub_type: 杂志文章

    doi:10.1186/s12859-016-1337-6

    authors: Singh O,Su EC

    更新日期:2016-12-23 00:00:00