Abstract:
BACKGROUND: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.
RESULTS: We have developed an R extension package, fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses. The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log n). The computational efficiency is demonstrated through extensive benchmarking, and example applications to real data are presented.
CONCLUSIONS: fastJT is an open-source R extension package, applying the Jonckheere-Terpstra statistic for robust feature selection for machine learning and association studies. The package implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.
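The idea of replacing pairwise counting with sorting and searching can be illustrated with a minimal Python sketch. This is not the fastJT implementation (the package is written for R, and its internals are not described in this abstract); the function names `jt_naive` and `jt_sorted` and the half-count convention for ties are assumptions made for illustration. For groups listed in their hypothesized order, the statistic counts, over every pair of groups i < j, the pairs in which the value from the earlier group is smaller than the value from the later group.

```python
from bisect import bisect_left, bisect_right


def jt_naive(groups):
    """Reference version: compare every cross-group pair, O(n^2) comparisons."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        total += 1.0
                    elif x == y:
                        total += 0.5  # assumed convention: a tie counts one half
    return total


def jt_sorted(groups):
    """Same statistic, but each count comes from a binary search in a sorted group."""
    sorted_groups = [sorted(g) for g in groups]  # sort each group once
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            later = sorted_groups[j]
            for x in groups[i]:
                lo = bisect_left(later, x)   # values in `later` strictly below x
                hi = bisect_right(later, x)  # values in `later` at or below x
                total += (len(later) - hi) + 0.5 * (hi - lo)
    return total


# Hypothetical phenotype values grouped by an ordered genotype (0, 1, 2 copies).
example_groups = [[1.2, 3.4, 2.2], [2.2, 4.1], [5.0, 3.9, 4.1]]
print(jt_naive(example_groups), jt_sorted(example_groups))
```

With a fixed, small number of groups (three genotype classes in an additive model), the sorted version performs one sort plus a number of binary searches proportional to n, which is consistent with the O(n log n) behaviour described in the abstract; the naive version is kept only as a correctness check.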
journal_name: BMC Bioinformatics
journal_title: BMC bioinformatics
authors: Lin J, Sibley A, Shterev I, Nixon A, Innocenti F, Chan C, Owzar K
doi: 10.1186/s12859-019-2869-3
subject: Has Abstract
pub_date: 2019-06-13 00:00:00
pages: 333
issue: 1
issn: 1471-2105
pii: 10.1186/s12859-019-2869-3
journal_volume: 20
pub_type: Journal Article
abstract:BACKGROUND:Biochemically detailed stoichiometric matrices have now been reconstructed for various bacteria, yeast, and for the human cardiac mitochondrion based on genomic and proteomic data. These networks have been manually curated based on legacy data and elementally and charge balanced. Comparative analysis of thes...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-7-111
update_date:2006-03-06 00:00:00
abstract:BACKGROUND:The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running ti...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-214
update_date:2011-05-27 00:00:00
abstract:BACKGROUND:Time- and dose-to-event phenotypes used in basic science and translational studies are commonly measured imprecisely or incompletely due to limitations of the experimental design or data collection schema. For example, drug-induced toxicities are not reported by the actual time or dose triggering the event, ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-019-2899-x
update_date:2019-05-28 00:00:00
abstract:BACKGROUND:The frequent exchange of genetic material among prokaryotes means that extracting a majority or plurality phylogenetic signal from many gene families, and the identification of gene families that are in significant conflict with the plurality signal is a frequent task in comparative genomics, and especially ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-123
update_date:2012-06-07 00:00:00
abstract:BACKGROUND:Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to a) increase the power to detect strong or weak genotype effects or b) as a result verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-231
update_date:2012-09-12 00:00:00
abstract:BACKGROUND:Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigene...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-017-1742-5
update_date:2017-07-11 00:00:00
abstract:BACKGROUND:Inferring gene regulatory network (GRN) has been an important topic in Bioinformatics. Many computational methods infer the GRN from high-throughput expression data. Due to the presence of time delays in the regulatory relationships, High-Order Dynamic Bayesian Network (HO-DBN) is a good model of GRN. Howeve...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0823-6
update_date:2015-11-25 00:00:00
abstract:BACKGROUND:An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biolo...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-94
update_date:2012-05-11 00:00:00
abstract:BACKGROUND:REX1 and REX2 are protein components of the RNA editing complex (the editosome) and function as exouridylylases. The exact roles of REX1 and REX2 in the editosome are unclear and the consequences of the presence of two related proteins are not fully understood. Here, a variety of computational studies were p...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-7-305
update_date:2006-06-16 00:00:00
abstract:BACKGROUND:Fluorescence microscopy is widely used to determine the subcellular location of proteins. Efforts to determine location on a proteome-wide basis create a need for automated methods to analyze the resulting images. Over the past ten years, the feasibility of using machine learning methods to recognize all maj...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-210
update_date:2007-06-19 00:00:00
abstract:BACKGROUND:Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as p...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-14-S11-S2
update_date:2013-01-01 00:00:00
abstract:BACKGROUND:Gene expression data can be analyzed by summarizing groups of individual gene expression profiles based on GO annotation information. The mean expression profile per group can then be used to identify interesting GO categories in relation to the experimental settings. However, the expression profiles present...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-158
update_date:2010-03-26 00:00:00
abstract:BACKGROUND:Typical evolutionary events like recombination, hybridization or gene transfer make necessary the use of phylogenetic networks to properly depict the evolution of DNA and protein sequences. Although several theoretical classes have been proposed to characterize these networks, they make stringent assumptions...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-268
update_date:2010-05-20 00:00:00
abstract:BACKGROUND:Viral infection by dengue virus is a major public health problem in tropical countries. Early diagnosis and detection are increasingly based on quantitative reverse transcriptase real-time polymerase chain reaction (RT-qPCR) directed against genomic regions conserved between different isolates. Genetic varia...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2313-0
update_date:2018-09-04 00:00:00
abstract:BACKGROUND:Proteins are dynamic molecules with motions ranging from picoseconds to longer than seconds. Many protein functions, however, appear to occur on the micro to millisecond timescale and therefore there has been intense research of the importance of these motions in catalysis and molecular interactions. Nuclear...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-421
update_date:2011-10-27 00:00:00
abstract:BACKGROUND:Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for the evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0810-y
update_date:2015-11-04 00:00:00
abstract:BACKGROUND:We previously developed GoMiner, an application that organizes lists of 'interesting' genes (for example, under-and overexpressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology. The original version of GoMiner was oriented toward visualization and interp...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-6-168
update_date:2005-07-05 00:00:00
abstract:BACKGROUND:Protein-protein interactions (PPIs) are of great importance in cellular systems of organisms, since they are the basis of cellular structure and function and many essential cellular processes are related to that. Most proteins perform their functions by interacting with other proteins, so predicting PPIs acc...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-020-03896-6
update_date:2020-12-16 00:00:00
abstract:DNA methylation exhibits different patterns in different cancers. DNA methylation rates at different genomic loci appear to be highly correlated in some samples but not in others. We call such phenomena conditional concordant relationships (CCRs). In this study, we explored DNA methylation patterns in 12 common cancer...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S13-S7
update_date:2012-01-01 00:00:00
abstract:BACKGROUND:Cyclic nucleotides are ubiquitous intracellular messengers. Until recently, the roles of cyclic nucleotides in plant cells have proven difficult to uncover. With an understanding of the protein domains which can bind cyclic nucleotides (CNB and GAF domains) we scanned the completed genomes of the higher plan...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-6-6
update_date:2005-01-11 00:00:00
abstract:BACKGROUND:Designing small-molecule kinase inhibitors with desirable selectivity profiles is a major challenge in drug discovery. A high-throughput screen for inhibitors of a given kinase will typically yield many compounds that inhibit more than one kinase. A series of chemical modifications are usually required befor...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-491
update_date:2008-11-25 00:00:00
abstract:BACKGROUND:Microbial electrosynthesis and electro fermentation are techniques that aim to optimize microbial production of chemicals and fuels by regulating the cellular redox balance via interaction with electrodes. While the concept is known for decades major knowledge gaps remain, which make it hard to evaluate its ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-014-0410-2
update_date:2014-12-30 00:00:00
abstract:BACKGROUND:Computer-aided segmentation and border detection in dermoscopic images is one of the core components of diagnostic procedures and therapeutic interventions for skin cancer. Automated assessment tools for dermoscopy images have become an important research field mainly because of inter- and intra-observer var...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-S6-S26
update_date:2010-10-07 00:00:00
abstract:BACKGROUND:The detection of bias due to cryptic population structure is an important step in the evaluation of findings of genetic association studies. The standard method of measuring this bias in a genetic association study is to compare the observed median association test statistic to the expected median test stati...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0496-1
update_date:2015-02-20 00:00:00
abstract:Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large a...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S6-S11
update_date:2012-04-19 00:00:00
abstract:BACKGROUND:The creation of a complete genome-wide map of transcription factor binding sites is essential for understanding gene regulatory networks in vivo. However, current prediction methods generally rely on statistical models that imperfectly model transcription factor binding. Generation of new prediction methods ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-62
update_date:2011-02-25 00:00:00
abstract:BACKGROUND:SSWAP (Simple Semantic Web Architecture and Protocol; pronounced "swap") is an architecture, protocol, and platform for using reasoning to semantically integrate heterogeneous disparate data and services on the web. SSWAP was developed as a hybrid semantic web services technology to overcome limitations foun...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-10-309
update_date:2009-09-23 00:00:00
abstract:BACKGROUND:The landscape of biological and biomedical research is being changed rapidly with the invention of microarrays which enables simultaneous view on the transcription levels of a huge number of genes across different experimental conditions or time points. Using microarray data sets, clustering algorithms have ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-10-27
update_date:2009-01-20 00:00:00
abstract:BACKGROUND:The locations and shapes of synapses are important in reconstructing connectomes and analyzing synaptic plasticity. However, current synapse detection and segmentation methods are still not adequate for accurately acquiring the synaptic connectivity, and they cannot effectively alleviate the burden of synaps...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2232-0
update_date:2018-07-13 00:00:00
abstract:BACKGROUND:Polymorphic variants and mutations disrupting canonical splicing isoforms are among the leading causes of human hereditary disorders. While there is a substantial evidence of aberrant splicing causing Mendelian diseases, the implication of such events in multi-genic disorders is yet to be well understood. We...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-22
update_date:2010-01-12 00:00:00