Abstract:
BACKGROUND: The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to use the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package called pmcXML to automatically mine and extract locus tags from full text, tables and supplements.
RESULTS: We mined locus tags from 1835 OA publications in ten microbial genomes and extracted tags mentioned in 30,891 sentences in main text and 20,489 rows in tables. We identified locus tag pairs marking the start and end of a region such as an operon or genomic island and expanded these ranges to add another 13,043 tags. For comparison, we also searched for locus tags in supplementary tables and in publications outside the OA subset for Burkholderia pseudomallei K96243. There were 168 publications containing 48,470 locus tags; 83% of mentions were from supplementary materials and 9% from publications outside the OA subset.
CONCLUSIONS: B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, the locus tags mentioned in supplementary tables and within ranges like genomic islands account for the majority of locus tags. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC.
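The two mining steps described in the abstract (finding locus tags in a sentence, then expanding a start/end pair into the full range) can be sketched as follows. This is a minimal illustration in Python, not the pmcXML implementation, which is an R package whose functions are not shown here. The tag pattern (prefix BPSL or BPSS plus four digits, with an optional letter suffix) is an assumption about B. pseudomallei K96243 tags, and the function names are hypothetical.

```python
import re

# Assumed tag pattern for B. pseudomallei K96243: prefix BPSL or BPSS,
# four digits, optional letter suffix. Other genomes need other patterns.
TAG = r"BPS[LS]\d{4}[A-Za-z]?"
TAG_RE = re.compile(rf"\b{TAG}\b")
# A range is two tags joined by a hyphen, an en dash, or the word "to".
RANGE_RE = re.compile(rf"\b({TAG})\s*(?:-|\u2013|to)\s*({TAG})\b")


def find_tags(sentence):
    """Return every locus tag mentioned literally in the sentence."""
    return TAG_RE.findall(sentence)


def expand_range(start, end):
    """Return all numbered tags from start to end, inclusive.

    Returns an empty list when the prefixes differ or the range is
    reversed. Letter suffixes are ignored, and the result may include
    numbers that are not annotated in the genome, so real use would
    filter against the genome's tag list.
    """
    m1 = re.fullmatch(r"([A-Za-z]+)(\d+)([A-Za-z]?)", start)
    m2 = re.fullmatch(r"([A-Za-z]+)(\d+)([A-Za-z]?)", end)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        return []
    first, last = int(m1.group(2)), int(m2.group(2))
    if first > last:
        return []
    prefix, width = m1.group(1), len(m1.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]


def mine_sentence(sentence):
    """Combine literal mentions with tags implied by ranges."""
    tags = set(find_tags(sentence))
    for start, end in RANGE_RE.findall(sentence):
        tags.update(expand_range(start, end))
    return sorted(tags)
```

For example, `mine_sentence("The island spans BPSL0001-BPSL0004 and includes BPSS1400.")` returns the two endpoint tags, the two tags between them, and BPSS1400. Range expansion of this kind is how a single sentence can contribute many more tags than it names explicitly.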
journal_name: BMC Bioinformatics
journal_title: BMC bioinformatics
authors: Stubben CJ, Challacombe JF
doi: 10.1186/1471-2105-15-43
subject: Has Abstract
pub_date: 2014-02-05 00:00:00
pages: 43
issn: 1471-2105
pii: 1471-2105-15-43
journal_volume: 15
pub_type: Journal Article
abstract:BACKGROUND:Phylogenies capture the evolutionary ancestry linking extant species. Correlations and similarities among a set of species are mediated by and need to be understood in terms of the phylogenic tree. In a similar way it has been argued that biological networks also induce correlations among sets of interacting...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-11-470
update_date:2010-09-20 00:00:00
abstract:BACKGROUND:An important application of high dimensional gene expression measurements is the risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selec...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-478
update_date:2011-12-16 00:00:00
abstract:BACKGROUND:Bioinformatics research for finding biological mechanisms can be done by analysis of transcriptome data with pathway based interpretation. Therefore, researchers have tried to develop tools to analyze transcriptome data with pathway based interpretation. Over the years, the amount of omics data has become hu...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2016-6
update_date:2018-02-19 00:00:00
abstract:BACKGROUND:Isocitrate Dehydrogenases (IDHs) are important enzymes present in all living cells. Three subfamilies of functionally dimeric IDHs (subfamilies I, II, III) are known. Subfamily I are well-studied bacterial IDHs, like that of Escherichia coli. Subfamily II has predominantly eukaryotic members, but it also ha...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S17-S2
update_date:2012-01-01 00:00:00
abstract:BACKGROUND:Protein structure prediction has achieved a lot of progress during the last few decades and a greater number of models for a certain sequence can be predicted. Consequently, assessing the qualities of predicted protein models in perspective is one of the key components of successful protein structure predict...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-017-1691-z
update_date:2017-05-25 00:00:00
abstract:BACKGROUND:Managing and organizing biological knowledge remains a major challenge, due to the complexity of living systems. Recently, systemic representations have been promising in tackling such a challenge at the whole-cell scale. In such representations, the cell is considered as a system composed of interlocked sub...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-020-03637-9
update_date:2020-07-23 00:00:00
abstract:BACKGROUND:Sparse principal component analysis (PCA) is a popular tool for dimensionality reduction, pattern recognition, and visualization of high dimensional data. It has been recognized that complex biological mechanisms occur through concerted relationships of multiple genes working in networks that are often repre...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-017-1740-7
update_date:2017-07-11 00:00:00
abstract:BACKGROUND:An approach to molecular classification based on the comparative expression of protein pairs is presented. The method overcomes some of the present limitations in using peptide intensity data for class prediction for problems such as the detection of a disease, disease prognosis, or for predicting treatment ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-191
update_date:2012-08-07 00:00:00
abstract:BACKGROUND:Designing small-molecule kinase inhibitors with desirable selectivity profiles is a major challenge in drug discovery. A high-throughput screen for inhibitors of a given kinase will typically yield many compounds that inhibit more than one kinase. A series of chemical modifications are usually required befor...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-491
update_date:2008-11-25 00:00:00
abstract:BACKGROUND:Microarrays permit biologists to simultaneously measure the mRNA abundance of thousands of genes. An important issue facing investigators planning microarray experiments is how to estimate the sample size required for good statistical power. What is the projected sample size or number of replicate chips need...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-7-84
update_date:2006-02-22 00:00:00
abstract:BACKGROUND:In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multip...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-1273-5
update_date:2016-10-03 00:00:00
abstract:We provide a 2007 update on the bioinformatics research in the Asia-Pacific from the Asia Pacific Bioinformatics Network (APBioNet), Asia's oldest bioinformatics organisation set up in 1998. From 2002, APBioNet has organized the first International Conference on Bioinformatics (InCoB) bringing together scientists work...
journal_title:BMC bioinformatics
pub_type:
doi:10.1186/1471-2105-9-S1-S1
update_date:2008-01-01 00:00:00
abstract:BACKGROUND:Temporal gene expression profiles characterize the time-dynamics of expression of specific genes and are increasingly collected in current gene expression experiments. In the analysis of experiments where gene expression is obtained over the life cycle, it is of interest to relate temporal patterns of gene e...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-60
update_date:2008-01-28 00:00:00
abstract:BACKGROUND:The locations and shapes of synapses are important in reconstructing connectomes and analyzing synaptic plasticity. However, current synapse detection and segmentation methods are still not adequate for accurately acquiring the synaptic connectivity, and they cannot effectively alleviate the burden of synaps...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2232-0
update_date:2018-07-13 00:00:00
abstract:BACKGROUND:Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-1419-5
update_date:2017-01-03 00:00:00
abstract:BACKGROUND:Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs oft...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-102
update_date:2007-03-26 00:00:00
abstract:BACKGROUND:Genes encoding transcription factors that constitute gene-regulatory networks and maternal factors accumulating in egg cytoplasm are two classes of essential genes that play crucial roles in developmental processes. Transcription factors control the expression of their downstream target genes by interacting ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-015-0552-x
update_date:2015-04-10 00:00:00
abstract:BACKGROUND:The process of horizontal gene transfer (HGT) is believed to be widespread in Bacteria and Archaea, but little comparative data is available addressing its occurrence in complete microbial genomes. Collection of high-quality, automated HGT prediction data based on phylogenetic evidence has previously been im...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-9-419
update_date:2008-10-07 00:00:00
abstract:BACKGROUND:Pattern matching is the core of bioinformatics; it is used in database searching, restriction enzyme mapping, and finding open reading frames. It is done repeatedly over increasingly long sequences, thus codes must be efficient and insensitive to sequence length. Such patterns of interest include simple moti...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-3-8
update_date:2002-01-01 00:00:00
abstract:BACKGROUND:High-throughput technology allows for genome-wide measurements at different molecular levels for the same patient, e.g. single nucleotide polymorphisms (SNPs) and gene expression. Correspondingly, it might be beneficial to also integrate complementary information from different molecular levels when building...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-1183-6
update_date:2016-08-30 00:00:00
abstract:BACKGROUND:The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity an...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-12-346
update_date:2011-08-17 00:00:00
abstract:BACKGROUND:Tandem repeats are multiple duplications of substrings in the DNA that occur contiguously, or at a short distance, and may involve some mutations (such as substitutions, insertions, and deletions). Tandem repeats have been extensively studied also for their association with the class of repeat expansion dise...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S4-S3
update_date:2012-03-28 00:00:00
abstract:BACKGROUND:Predicting the suppression activity of antisense oligonucleotide sequences is the main goal of the rational design of nucleic acids. To create an effective predictive model, it is important to know what properties of an oligonucleotide sequence associate significantly with antisense activity. Also, for the m...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-8-184
update_date:2007-06-07 00:00:00
abstract:BACKGROUND:Analysis of gene expression data in terms of a priori-defined gene sets has recently received significant attention as this approach typically yields more compact and interpretable results than those produced by traditional methods that rely on individual genes. The set-level strategy can also be adopted wit...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-13-S10-S15
update_date:2012-06-25 00:00:00
abstract:BACKGROUND:Improvements in technology have been accompanied by the generation of large amounts of complex data. This same technology must be harnessed effectively if the knowledge stored within the data is to be retrieved. Storing data in ontologies aids its management; ontologies serve as controlled vocabularies that ...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-6-74
update_date:2005-03-24 00:00:00
abstract:BACKGROUND:Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is ne...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-018-2207-1
update_date:2018-05-23 00:00:00
abstract:BACKGROUND:Reverse engineering of transcriptional regulatory networks (TRN) from genomics data has always represented a computational challenge in System Biology. The major issue is modeling the complex crosstalk among transcription factors (TFs) and their target genes, with a method able to handle both the high number...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-020-3510-1
update_date:2020-05-29 00:00:00
abstract:BACKGROUND:Here we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/s12859-016-0887-y
update_date:2016-01-20 00:00:00
abstract:BACKGROUND:The identification of statistically overrepresented sequences in the upstream regions of coregulated genes should theoretically permit the identification of potential cis-regulatory elements. However, in practice many cis-regulatory elements are highly degenerate, precluding the use of an exhaustive word-cou...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-7-254
update_date:2006-05-15 00:00:00
abstract:BACKGROUND:Human breast cancer resistance protein (BCRP) is an ATP-binding cassette (ABC) efflux transporter that confers multidrug resistance in cancers and also plays an important role in the absorption, distribution and elimination of drugs. Prediction as to if drugs or new molecular entities are BCRP substrates sho...
journal_title:BMC bioinformatics
pub_type: Journal Article
doi:10.1186/1471-2105-14-130
update_date:2013-04-15 00:00:00