Outlier analyses of the Protein Data Bank archive using a probability-density-ranking approach.

Abstract:

Outlier analyses are central to scientific data assessments. Conventional outlier identification methods do not work effectively for Protein Data Bank (PDB) data, which are characterized by heavy skewness and the presence of bounds and/or long tails. We have developed a data-driven nonparametric method to identify outliers in PDB data based on kernel probability density estimation. Unlike conventional outlier analyses based on location and scale, Probability Density Ranking can be used for robust assessments of distance from other observations. Analyzing PDB data from the vantage points of probability and frequency enables proper outlier identification, which is important for quality control during deposition-validation-biocuration of new three-dimensional structure data. Ranking of Probability Density also permits use of Most Probable Range as a robust measure of data dispersion that is more compact than Interquartile Range. The Probability-Density-Ranking approach can be employed to analyze outliers and data-spread on any large data set with continuous distribution.
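The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the normal-reference bandwidth rule, the 5% outlier fraction, the 50% coverage used for the Most Probable Range, and all function names are assumptions made here for illustration. Each observation is scored by its estimated probability density, observations are ranked by that score, the least probable ones are flagged, and the range spanned by the most probable half is reported as a measure of spread.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function that estimates the density of `samples` (Gaussian kernel)."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumed default, not taken from the paper.
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def density_ranking(samples, bandwidth=None):
    """Pair each sample with its estimated density, least probable first."""
    density = gaussian_kde(samples, bandwidth)
    return sorted((density(x), x) for x in samples)


def flag_outliers(samples, fraction=0.05, bandwidth=None):
    """Return the `fraction` of samples with the lowest density (at least one)."""
    ranked = density_ranking(samples, bandwidth)
    k = max(1, int(len(samples) * fraction))
    return [x for _, x in ranked[:k]]


def most_probable_range(samples, coverage=0.5, bandwidth=None):
    """Return (min, max) of the `coverage` share of samples with the highest density.

    The 50% default mirrors the Interquartile Range; it is an assumption here.
    """
    ranked = density_ranking(samples, bandwidth)
    k = math.ceil(len(samples) * coverage)
    top = [x for _, x in ranked[-k:]]
    return min(top), max(top)


if __name__ == "__main__":
    # Hypothetical right-skewed values with one point far out in the tail.
    data = [1.0, 1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.5, 1.6, 1.8, 2.0, 2.5, 9.0]
    print("flagged:", flag_outliers(data))
    print("most probable range:", most_probable_range(data))
```

Because the ranking depends on how densely a value is surrounded by other observations rather than on its distance from a mean or median, the same code applies unchanged to skewed or bounded data. The bandwidth choice strongly affects the result and would need tuning for real PDB quantities.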

journal_name

Sci Data

journal_title

Scientific data

authors

Shao C,Liu Z,Yang H,Wang S,Burley SK

doi

10.1038/sdata.2018.293

subject

Has Abstract

pub_date

2018-12-11 00:00:00

pages

180293

issn

2052-4463

pii

sdata2018293

journal_volume

5

pub_type

  • Comprehensive draft of the mouse embryonic fibroblast lysosomal proteome by mass spectrometry based proteomics.

    abstract::Lysosomes are the main degradative organelles of cells and involved in a variety of processes including the recycling of macromolecules, storage of compounds, and metabolic signaling. Despite an increasing interest in the proteomic analysis of lysosomes, no systematic study of sample preparation protocols for lysosome...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-0399-5

    authors: Ponnaiyan S,Akter F,Singh J,Winter D

    update_date:2020-02-26 00:00:00

  • The odonate phenotypic database, a new open data resource for comparative studies of an old insect order.

    abstract::We present The Odonate Phenotypic Database (OPD): an online data resource of dragonfly and damselfly phenotypes (Insecta: Odonata). Odonata is a relatively small insect order that currently consists of about 6400 species belonging to 32 families. The database consists of multiple morphological, life-history and behavi...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0318-9

    authors: Waller JT,Willink B,Tschol M,Svensson EI

    update_date:2019-12-12 00:00:00

  • Spatial data of Ixodes ricinus instar abundance and nymph pathogen prevalence, Scandinavia, 2016-2017.

    abstract::Ticks carry pathogens that can cause disease in both animals and humans, and there is a need to monitor the distribution and abundance of ticks and the pathogens they carry to pinpoint potential high risk areas for tick-borne disease transmission. In a joint Scandinavian study, we measured Ixodes ricinus instar abunda...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-00579-y

    authors: Kjær LJ,Klitgaard K,Soleng A,Edgar KS,Lindstedt HEH,Paulsen KM,Andreassen ÅK,Korslund L,Kjelland V,Slettan A,Stuen S,Kjellander P,Christensson M,Teräväinen M,Baum A,Jensen LM,Bødker R

    update_date:2020-07-16 00:00:00

  • An annotated fluorescence image dataset for training nuclear segmentation methods.

    abstract::Fully-automated nuclear image segmentation is the prerequisite to ensure statistically significant, quantitative analyses of tissue preparations, applied in digital pathology or quantitative microscopy. The design of segmentation methods that work independently of the tissue type or preparation is complex, due to varia...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-00608-w

    authors: Kromp F,Bozsaky E,Rifatbegovic F,Fischer L,Ambros M,Berneder M,Weiss T,Lazic D,Dörr W,Hanbury A,Beiske K,Ambros PF,Ambros IM,Taschner-Mandl S

    update_date:2020-08-11 00:00:00

  • The systematic identification of cytoskeletal genes required for Drosophila melanogaster muscle maintenance.

    abstract::Animal muscles must maintain their function and structure while bearing substantial mechanical loads. How muscles withstand persistent mechanical strain is presently not well understood. Understanding the mechanisms by which tissues maintain their complex architecture is a key goal of cell biology. This dataset repres...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2014.2

    authors: Perkins AD,Lee MJ,Tanentzapf G

    update_date:2014-03-11 00:00:00

  • Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control.

    abstract::Direct-infusion mass spectrometry (DIMS) metabolomics is an important approach for characterising molecular responses of organisms to disease, drugs and the environment. Increasingly large-scale metabolomics studies are being conducted, necessitating improvements in both bioanalytical and computational workflows to ma...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2014.12

    authors: Kirwan JA,Weber RJ,Broadhurst DI,Viant MR

    update_date:2014-06-10 00:00:00

  • A lake data set for the Tibetan Plateau from the 1960s, 2005, and 2014.

    abstract::Long-term datasets of number and size of lakes over the Tibetan Plateau (TP) are among the most critical components for better understanding the interactions among the cryosphere, hydrosphere, and atmosphere at regional and global scales. Due to the harsh environment and the scarcity of data over the TP, data accumula...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2016.39

    authors: Wan W,Long D,Hong Y,Ma Y,Yuan Y,Xiao P,Duan H,Han Z,Gu X

    update_date:2016-06-21 00:00:00

  • Age-related dataset on the mechanical properties and collagen fibril structure of tendons from a murine model.

    abstract::Connective tissues such as tendon, ligament and skin are biological fibre composites comprising collagen fibrils reinforcing the weak proteoglycan-rich ground substance in extracellular matrix (ECM). One of the hallmarks of ageing of connective tissues is the progressive and irreversible change in the tissue mechanica...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2018.140

    authors: Goh KL,Holmes DF,Lu YH,Kadler KE,Purslow PP

    update_date:2018-07-24 00:00:00

  • Data for training and testing radiation detection algorithms in an urban environment.

    abstract::The detection, identification, and localization of illicit nuclear materials in urban environments is of utmost importance for national security. Most often, the process of performing these operations consists of a team of trained individuals equipped with radiation detection devices that have built-in algorithms to a...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-00672-2

    authors: Ghawaly JM Jr,Nicholson AD,Peplow DE,Anderson-Cook CM,Myers KL,Archer DE,Willis MJ,Quiter BJ

    update_date:2020-10-05 00:00:00

  • Genome-wide identification of accessible chromatin regions in bumblebee by ATAC-seq.

    abstract::Bumblebees (Hymenoptera: Apidae) are important pollinating insects that play pivotal roles in crop production and natural ecosystem services. Although protein-coding genes in bumblebees have been extensively annotated, regulatory sequences of the genome, such as promoters and enhancers, have been poorly annotated. To ...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-00713-w

    authors: Zhao X,Su L,Xu W,Schaack S,Sun C

    update_date:2020-10-26 00:00:00

  • A data citation roadmap for scholarly data repositories.

    abstract::This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies. The roadmap was developed by the Repositories Expert Group, as p...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0031-8

    authors: Fenner M,Crosas M,Grethe JS,Kennedy D,Hermjakob H,Rocca-Serra P,Durand G,Berjon R,Karcher S,Martone M,Clark T

    update_date:2019-04-10 00:00:00

  • De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris.

    abstract::Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues r...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2016.83

    authors: Niu SC,Xu Q,Zhang GQ,Zhang YQ,Tsai WC,Hsu JL,Liang CK,Luo YB,Liu ZJ

    update_date:2016-09-27 00:00:00

  • High resolution annual average air pollution concentration maps for the Netherlands.

    abstract::Long-term exposure to air pollution is considered a major public health concern and has been related to overall mortality and various diseases such as respiratory and cardiovascular disease. Due to the spatial variability of air pollution concentrations, assessment of individual exposure to air pollution requires spat...

    journal_title:Scientific data

    pub_type:

    doi:10.1038/sdata.2019.35

    authors: Schmitz O,Beelen R,Strak M,Hoek G,Soenario I,Brunekreef B,Vaartjes I,Dijst MJ,Grobbee DE,Karssenberg D

    update_date:2019-03-12 00:00:00

  • A multi-species repository of social networks.

    abstract::Social network analysis is an invaluable tool to understand the patterns, evolution, and consequences of sociality. Comparative studies over a range of social systems across multiple taxonomic groups are particularly valuable. Such studies however require quantitative social association or interaction data across mult...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0056-z

    authors: Sah P,Méndez JD,Bansal S

    update_date:2019-04-29 00:00:00

  • Genotoype-by-sequencing of three geographically distinct populations of Olympia oysters, Ostrea lurida.

    abstract::Olympia oysters are found along the west coast of North America and as the only native oyster species in the region, receive considerable attention with regard to restoration and conservation. Knowledge of genetic structure of this species is essential for resource managers. Here we provide genetic data for three dist...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2017.130

    authors: White SJ,Vadopalas B,Silliman K,Roberts SB

    update_date:2017-09-12 00:00:00

  • Synthetic skull bone defects for automatic patient-specific craniofacial implant design.

    abstract::Patient-specific craniofacial implants are used to repair skull bone defects after trauma or surgery. Currently, cranial implants are designed and produced by third-party suppliers, which is usually time-consuming and expensive. Recent advances in additive manufacturing made the in-hospital or in-operation-room fabric...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-021-00806-0

    authors: Li J,Gsaxner C,Pepe A,Morais A,Alves V,von Campe G,Wallner J,Egger J

    update_date:2021-01-29 00:00:00

  • Optical motion capture dataset of selected techniques in beginner and advanced Kyokushin karate athletes.

    abstract::Human motion capture is commonly used in various fields, including sport, to analyze, understand, and synthesize kinematic and kinetic data. Specialized computer vision and marker-based optical motion capture techniques constitute the gold-standard for accurate and robust human motion capture. The dataset presented co...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-021-00801-5

    authors: Szczęsna A,Błaszczyszyn M,Pawlyta M

    update_date:2021-01-18 00:00:00

  • An agricultural survey for more than 9,500 African households.

    abstract::Surveys for more than 9,500 households were conducted in the growing seasons 2002/2003 or 2003/2004 in eleven African countries: Burkina Faso, Cameroon, Ghana, Niger and Senegal in western Africa; Egypt in northern Africa; Ethiopia and Kenya in eastern Africa; South Africa, Zambia and Zimbabwe in southern Africa. Hous...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2016.20

    authors: Waha K,Zipf B,Kurukulasuriya P,Hassan RM

    update_date:2016-05-24 00:00:00

  • Comprehensive analysis of the venom gland transcriptome of the spider Dolomedes fimbriatus.

    abstract::A comprehensive transcriptome analysis of an expressed sequence tag (EST) database of the spider Dolomedes fimbriatus venom glands using single-residue distribution analysis (SRDA) identified 7,169 unique sequences. Mature chains of 163 different toxin-like polypeptides were predicted on the basis of well-established ...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2014.23

    authors: Kozlov SA,Lazarev VN,Kostryukova ES,Selezneva OV,Ospanova EA,Alexeev DG,Govorun VM,Grishin EV

    update_date:2014-08-05 00:00:00

  • Very high resolution, altitude-corrected, TMPA-based monthly satellite precipitation product over the CONUS.

    abstract::The Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) product provided over 17 years of gridded precipitation datasets. However, the accuracy and spatial resolution of TMPA limits the applicability in hydrometeorological applications. We present a dataset that enhances the accurac...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-020-0411-0

    authors: Hashemi H,Fayne J,Lakshmi V,Huffman GJ

    update_date:2020-03-03 00:00:00

  • A statistical atlas of cerebral arteries generated using multi-center MRA datasets from healthy subjects.

    abstract::Magnetic resonance angiography (MRA) can capture the variation of cerebral arteries with high spatial resolution. These measurements include valuable information about the morphology, geometry, and density of brain arteries, which may be useful to identify risk factors for cerebrovascular and neurological diseases at ...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0034-5

    authors: Mouches P,Forkert ND

    update_date:2019-04-11 00:00:00

  • A geographically-diverse collection of 418 human gut microbiome pathway genome databases.

    abstract::Advances in high-throughput sequencing are reshaping how we perceive microbial communities inhabiting the human body, with implications for therapeutic interventions. Several large-scale datasets derived from hundreds of human microbiome samples sourced from multiple studies are now publicly available. However, idiosy...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2017.35

    authors: Hahn AS,Altman T,Konwar KM,Hanson NW,Kim D,Relman DA,Dill DL,Hallam SJ

    update_date:2017-04-11 00:00:00

  • Spatial and temporal dynamics of multidimensional well-being, livelihoods and ecosystem services in coastal Bangladesh.

    abstract::Populations in resource dependent economies gain well-being from the natural environment, in highly spatially and temporally variable patterns. To collect information on this, we designed and implemented a 1586-household quantitative survey in the southwest coastal zone of Bangladesh. Data were collected on material, ...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2016.94

    authors: Adams H,Adger WN,Ahmad S,Ahmed A,Begum D,Lázár AN,Matthews Z,Rahman MM,Streatfield PK

    update_date:2016-11-08 00:00:00

  • Multiple-data-based monthly geopotential model set LDCmgm90.

    abstract::While the GRACE (Gravity Recovery and Climate Experiment) satellite mission is of great significance in understanding various branches of Earth sciences, the quality of GRACE monthly products can be unsatisfactory due to strong longitudinal stripe-pattern errors and other flaws. Based on corrected GRACE Mascon (mass c...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0239-7

    authors: Chen W,Luo J,Ray J,Yu N,Li JC

    update_date:2019-10-23 00:00:00

  • I-BLEND, a campus-scale commercial and residential buildings electrical energy dataset.

    abstract::Efficient energy consumption at the building level is vital for sustainability. Providing energy efficient systems and solutions requires an understanding of how energy gets consumed. However, there is a general lack of large-scale open datasets about the energy consumption of buildings, which hinders the research. Th...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2019.15

    authors: Rashid H,Singh P,Singh A

    update_date:2019-02-19 00:00:00

  • Evaluating FAIR maturity through a scalable, automated, community-governed framework.

    abstract::Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guide...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0184-5

    authors: Wilkinson MD,Dumontier M,Sansone SA,Bonino da Silva Santos LO,Prieto M,Batista D,McQuilton P,Kuhn T,Rocca-Serra P,Crosas M,Schultes E

    update_date:2019-09-20 00:00:00

  • Human pluripotent stem cell derived HLC transcriptome data enables molecular dissection of hepatogenesis.

    abstract::Induced pluripotent stem cells (iPSCs) and human embryonic stem cells (hESCs) differentiated into hepatocyte-like cells (HLCs) provide a defined and renewable source of cells for drug screening, toxicology and regenerative medicine. We previously reprogrammed human fetal foreskin fibroblast cells (HFF1) into iPSCs emp...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2018.35

    authors: Wruck W,Adjaye J

    update_date:2018-03-13 00:00:00

  • Two-colour serial femtosecond crystallography dataset from gadoteridol-derivatized lysozyme for MAD phasing.

    abstract::We provide a detailed description of a gadoteridol-derivatized lysozyme (gadolinium lysozyme) two-colour serial femtosecond crystallography (SFX) dataset for multiple wavelength anomalous dispersion (MAD) structure determination. The data was collected at the Spring-8 Angstrom Compact free-electron LAser (SACLA) facil...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2017.188

    authors: Gorel A,Motomura K,Fukuzawa H,Doak RB,Grünbein ML,Hilpert M,Inoue I,Kloos M,Nass Kovács G,Nango E,Nass K,Roome CM,Shoeman RL,Tanaka R,Tono K,Foucar L,Joti Y,Yabashi M,Iwata S,Ueda K,Barends TRM,Schlichting I

    update_date:2017-12-12 00:00:00

  • The pediatric template of brain perfusion.

    abstract::Magnetic resonance imaging (MRI) captures the dynamics of brain development with multiple modalities that quantify both structure and function. These measurements may yield valuable insights into the neural patterns that mark healthy maturation or that identify early risk for psychiatric disorder. The Pediatric Templa...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/sdata.2015.3

    authors: Avants BB,Duda JT,Kilroy E,Krasileva K,Jann K,Kandel BT,Tustison NJ,Yan L,Jog M,Smith R,Wang Y,Dapretto M,Wang DJ

    update_date:2015-02-03 00:00:00

  • A suite of global accessibility indicators.

    abstract::Good access to resources and opportunities is essential for sustainable development. Improving access, especially in rural areas, requires useful measures of current access to the locations where these resources and opportunities are found. Recent work has developed a global map of travel times to cities with more tha...

    journal_title:Scientific data

    pub_type: Journal Article

    doi:10.1038/s41597-019-0265-5

    authors: Nelson A,Weiss DJ,van Etten J,Cattaneo A,McMenomy TS,Koo J

    update_date:2019-11-07 00:00:00