Improved Chemical Structure-Activity Modeling Through Data Augmentation.

Abstract:

Extending the original training data with simulated unobserved data points has proven a powerful way to increase both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in areas as diverse as the analysis of spectroscopic data and the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) on either (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently increased predictive power on the test set by 10-15%. Injecting noise on (i) the compound descriptors or (ii) both the compound descriptors and the pIC50 values led to the largest drop in RMSEtest values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data are replicated once. Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase in (i) model complexity due to the need to optimize σnoise and (ii) the number of training examples.
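
The replication-and-perturbation scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `augment` and `rmse`, the `noise_on` argument, the fixed random seed, and the use of a single σnoise for both descriptors and responses are all assumptions made for illustration.

```python
import math
import random


def augment(X, y, n_copies, sigma_noise, noise_on="both", seed=0):
    """Append n_copies replicas of (X, y), perturbed with Gaussian noise.

    noise_on selects what is perturbed: "response" (the pIC50 values),
    "descriptors", "both", or "none". All names are hypothetical.
    """
    if noise_on not in {"response", "descriptors", "both", "none"}:
        raise ValueError(f"unknown noise_on: {noise_on!r}")
    rng = random.Random(seed)
    # Start from an untouched copy of the original training data.
    X_aug = [list(row) for row in X]
    y_aug = list(y)
    for _ in range(n_copies):
        for row, target in zip(X, y):
            new_row = list(row)
            new_target = target
            if noise_on in ("descriptors", "both"):
                # Zero-mean Gaussian noise on every descriptor value.
                new_row = [v + rng.gauss(0.0, sigma_noise) for v in row]
            if noise_on in ("response", "both"):
                # Zero-mean Gaussian noise on the response value.
                new_target = target + rng.gauss(0.0, sigma_noise)
            X_aug.append(new_row)
            y_aug.append(new_target)
    return X_aug, y_aug


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

Under these assumptions, `augment(X, y, n_copies=1, sigma_noise=s, noise_on="both")` corresponds to the single perturbed repetition that the abstract identifies as a reasonable trade-off. The value of `s` would have to be tuned (for example by cross-validation), which is the added model complexity the abstract mentions.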

journal_name

J Chem Inf Model

authors

Cortes-Ciriano I,Bender A

doi

10.1021/acs.jcim.5b00570

subject

Has Abstract

pub_date

2015-12-28 00:00:00

pages

2682-92

issue

12

eissn

1549-960X

issn

1549-9596

journal_volume

55

pub_type

Journal Article
  • Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection.

    abstract::The evaluation of regression QSAR model performance, in fitting, robustness, and external prediction, is of pivotal importance. Over the past decade, different external validation parameters have been proposed: Q(F1)(2), Q(F2)(2), Q(F3)(2), r(m)(2), and the Golbraikh-Tropsha method. Recently, the concordance correlati...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci300084j

    authors: Chirico N,Gramatica P

    update_date:2012-08-27 00:00:00

  • Flux (1): a virtual synthesis scheme for fragment-based de novo design.

    abstract::It is demonstrated that the fragmentation of druglike molecules by applying simplistic pseudo-retrosynthesis results in a stock of chemically meaningful building blocks for de novo molecule generation. A stochastic search algorithm in conjunction with ligand-based similarity scoring (Flux: fragment-based ligand builde...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci0503560

    authors: Fechner U,Schneider G

    update_date:2006-03-01 00:00:00

  • Viscosity Prediction of Lubricants by a General Feed-Forward Neural Network.

    abstract::Modern industrial lubricants are often blended with an assortment of chemical additives to improve the performance of the base stock. Machine learning-based predictive models allow fast and veracious derivation of material properties and facilitate novel and innovative material designs. In this study, we outline the d...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b01068

    authors: Loh GC,Lee HC,Tee XY,Chow PS,Zheng JW

    update_date:2020-03-23 00:00:00

  • Exploring Tunable Hyperparameters for Deep Neural Networks with Industrial ADME Data Sets.

    abstract::Deep learning has drawn significant attention in different areas including drug discovery. It has been proposed that it could outperform other machine learning algorithms, especially with big data sets. In the field of pharmaceutical industry, machine learning models are built to understand quantitative structure-acti...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.8b00671

    authors: Zhou Y,Cahya S,Combs SA,Nicolaou CA,Wang J,Desai PV,Shen J

    update_date:2019-03-25 00:00:00

  • The normal-mode entropy in the MM/GBSA method: effect of system truncation, buffer region, and dielectric constant.

    abstract::We have performed a systematic study of the entropy term in the MM/GBSA (molecular mechanics combined with generalized Born and surface-area solvation) approach to calculate ligand-binding affinities. The entropies are calculated by a normal-mode analysis of harmonic frequencies from minimized snapshots of molecular d...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci3001919

    authors: Genheden S,Kuhn O,Mikulskis P,Hoffmann D,Ryde U

    update_date:2012-08-27 00:00:00

  • FAME 3: Predicting the Sites of Metabolism in Synthetic Compounds and Natural Products for Phase 1 and Phase 2 Metabolic Enzymes.

    abstract::In this work we present the third generation of FAst MEtabolizer (FAME 3), a collection of extra trees classifiers for the prediction of sites of metabolism (SoMs) in small molecules such as drugs, druglike compounds, natural products, agrochemicals, and cosmetics. FAME 3 was derived from the MetaQSAR database ( Pedre...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00376

    authors: Šícho M,Stork C,Mazzolari A,de Bruyn Kops C,Pedretti A,Testa B,Vistoli G,Svozil D,Kirchmair J

    update_date:2019-08-26 00:00:00

  • Getting Docking into Shape Using Negative Image-Based Rescoring.

    abstract::The failure of default scoring functions to ensure virtual screening enrichment is a persistent problem for the molecular docking algorithms used in structure-based drug discovery. To remedy this problem, elaborate rescoring and postprocessing schemes have been developed with a varying degree of success, specificity, ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00383

    authors: Kurkinen ST,Lätti S,Pentikäinen OT,Postila PA

    update_date:2019-08-26 00:00:00

  • LigQ: A Webserver to Select and Prepare Ligands for Virtual Screening.

    abstract::Virtual screening is a powerful methodology to search for new small molecule inhibitors against a desired molecular target. Usually, it involves evaluating thousands of compounds (derived from large databases) in order to select a set of potential binders that will be tested in the wet-lab. The number of tested compou...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00241

    authors: Radusky L,Ruiz-Carmona S,Modenutti C,Barril X,Turjanski AG,Martí MA

    update_date:2017-08-28 00:00:00

  • PyCGTOOL: Automated Generation of Coarse-Grained Molecular Dynamics Models from Atomistic Trajectories.

    abstract::Development of coarse-grained (CG) molecular dynamics models is often a laborious process which commonly relies upon approximations to similar models, rather than systematic parametrization. PyCGTOOL automates much of the construction of CG models via calculation of both equilibrium values and force constants of inter...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00096

    authors: Graham JA,Essex JW,Khalid S

    update_date:2017-04-24 00:00:00

  • Flexophore, a new versatile 3D pharmacophore descriptor that considers molecular flexibility.

    abstract::A novel pharmacophore descriptor Flexophore is presented, which considers molecular flexibility when comparing descriptor similarities. The descriptor is a complete reduced graph of the underlying molecule. Its nodes are represented by enhanced MM2 atom types, while the edge descriptions encode the molecular flexibili...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci700359j

    authors: von Korff M,Freyss J,Sander T

    update_date:2008-04-01 00:00:00

  • Searching for recursively defined generic chemical patterns in nonenumerated fragment spaces.

    abstract::Retrieving molecules with specific structural features is a fundamental requirement of today's molecular database technologies. Estimates claim the chemical space relevant for drug discovery to be around 10⁶⁰ molecules. This figure is many orders of magnitude larger than the amount of molecules conventional databases ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400107k

    authors: Ehrlich HC,Henzler AM,Rarey M

    update_date:2013-07-22 00:00:00

  • Computational Insight Into the Mechanism of SARS-CoV-2 Membrane Fusion.

    abstract::Membrane fusion, a key step in the early stages of virus propagation, allows the release of the viral genome in the host cell cytoplasm. The process is initiated by fusion peptides that are small, hydrophobic components of viral membrane-embedded glycoproteins and are typically conserved within virus families. Here, w...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c01231

    authors: Borkotoky S,Dey D,Banerjee M

    update_date:2021-01-25 00:00:00

  • Nonadditivity Analysis.

    abstract::We introduce the statistics behind a novel type of SAR analysis named "nonadditivity analysis". On the basis of all pairs of matched pairs within a given data set, the approach analyzes whether the same transformations between related molecules have the same effect, i.e., whether they are additive. Assuming that the e...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00631

    authors: Kramer C

    update_date:2019-09-23 00:00:00

  • How Well Does the Extended Linear Interaction Energy Method Perform in Accurate Binding Free Energy Calculations?

    abstract::With continually increased computer power, molecular mechanics force field-based approaches, such as the endpoint methods of molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) and molecular mechanics generalized Born surface area (MM-GBSA), have been routinely applied in both drug lead identification and opt...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00934

    authors: Hao D,He X,Ji B,Zhang S,Wang J

    update_date:2020-12-28 00:00:00

  • Structural protein-ligand interaction fingerprints (SPLIF) for structure-based virtual screening: method and benchmark study.

    abstract::Accurate and affordable assessment of ligand-protein affinity for structure-based virtual screening (SB-VS) is a standing challenge. Hence, empirical postdocking filters making use of various types of structure-activity information may prove useful. Here, we introduce one such filter based upon three-dimensional struc...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci500319f

    authors: Da C,Kireev D

    update_date:2014-09-22 00:00:00

  • Concept-based semi-automatic classification of drugs.

    abstract::The anatomical therapeutic chemical (ATC) classification system maintained by the World Health Organization provides a global standard for the classification of medical substances and serves as a source for drug repurposing research. Nevertheless, it lacks several drugs that are major players in the global drug market...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci9000844

    authors: Gurulingappa H,Kolárik C,Hofmann-Apitius M,Fluck J

    update_date:2009-08-01 00:00:00

  • Algorithm for reaction classification.

    abstract::Reaction classification has important applications, and many approaches to classification have been applied. Our own algorithm tests all maximum common substructures (MCS) between all reactant and product molecules in order to find an atom mapping containing the minimum chemical distance (MCD). Recent publications hav...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400442f

    authors: Kraut H,Eiblmaier J,Grethe G,Löw P,Matuszczyk H,Saller H

    update_date:2013-11-25 00:00:00

  • Comparative Dynamics and Functional Mechanisms of the CYP17A1 Tunnels Regulated by Ligand Binding.

    abstract::As an important member of cytochrome P450 (CYP) enzymes, CYP17A1 is a dual-function monooxygenase with a critical role in the synthesis of many human steroid hormones, making it an attractive therapeutic target. The emerging structural information about CYP17A1 and the growing number of inhibitors for these enzymes ca...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00447

    authors: Xiao F,Song X,Tian P,Gan M,Verkhivker GM,Hu G

    update_date:2020-07-27 00:00:00

  • COSMOsar3D: molecular field analysis based on local COSMO σ-profiles.

    abstract::The COSMO surface polarization charge density σ resulting from quantum chemical calculations combined with a virtual conductor embedding has been widely proven to be a very suitable descriptor for the quantification of interactions of molecules in liquids. In a preceding paper, grid-based local histograms of σ have be...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci300231t

    authors: Klamt A,Thormann M,Wichmann K,Tosco P

    update_date:2012-08-27 00:00:00

  • What Does the Machine Learn? Knowledge Representations of Chemical Reactivity.

    abstract::In a departure from conventional chemical approaches, data-driven models of chemical reactions have recently been shown to be statistically successful using machine learning. These models, however, are largely black box in character and have not provided the kind of chemical insights that historically advanced the fie...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00721

    authors: Kammeraad JA,Goetz J,Walker EA,Tewari A,Zimmerman PM

    update_date:2020-03-23 00:00:00

  • In Silico Study of Membrane Lipid Composition Regulating Conformation and Hydration of Influenza Virus B M2 Channel.

    abstract::The proton conduction of transmembrane influenza virus B M2 (BM2) proton channel is possibly mediated by the membrane environment, but the detailed molecular mechanism is challenging to determine. In this work, how membrane lipid composition regulates the conformation and hydration of BM2 channel is elucidated in sili...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00329

    authors: Zhang Y,Zhang HX,Zheng QC

    update_date:2020-07-27 00:00:00

  • Improved Prediction of Drug-Target Interactions Using Self-Paced Learning with Collaborative Matrix Factorization.

    abstract::Identifying drug-target interactions (DTIs) plays an important role in the field of drug discovery, drug side-effects, and drug repositioning. However, in vivo or biochemical experimental methods for identifying new DTIs are extremely expensive and time-consuming. Recently, in silico or various computational methods h...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00408

    authors: Xia LY,Yang ZY,Zhang H,Liang Y

    update_date:2019-07-22 00:00:00

  • The valence state combination model: a generic framework for handling tautomers and protonation states.

    abstract::The consistent handling of molecules is probably the most basic and important requirement in the field of cheminformatics. Reliable results can only be obtained if the underlying calculations are independent of the specific way molecules are represented in the input data. However, ensuring consistency is a complex tas...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400724v

    authors: Urbaczek S,Kolodzik A,Rarey M

    update_date:2014-03-24 00:00:00

  • Dynamics of noncovalent interactions in all-α and all-β class proteins: implications for the stability of amyloid aggregates.

    abstract::A fully folded functional protein is stabilized by several noncovalent interactions. When a protein undergoes conformational motions, the existing noncovalent interactions may be maintained. They may also break or new interactions may be formed. Knowledge of the dynamical nature of the different types of noncovalent i...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci200302q

    authors: Jain A,Sankararamakrishnan R

    update_date:2011-12-27 00:00:00

  • Evaluation of Generalized Born Models for Large Scale Affinity Prediction of Cyclodextrin Host-Guest Complexes.

    abstract::Binding affinity prediction with implicit solvent models remains a challenge in virtual screening for drug discovery. In order to assess the predictive power of implicit solvent models in docking techniques with Amber scoring, three generalized Born models (GBHCT, GBOBCI, and GBOBCII) available in Dock 6.7 were utiliz...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.6b00418

    authors: Zhang H,Yin C,Yan H,van der Spoel D

    update_date:2016-10-24 00:00:00

  • Role of water in ligand binding to maltose-binding protein: insight from a new docking protocol based on the 3D-RISM-KH molecular theory of solvation.

    abstract::Maltose-binding protein is a periplasmic binding protein responsible for transport of maltooligosaccharides through the periplasmic space of Gram-negative bacteria, as a part of the ABC transport system. The molecular mechanisms of the initial ligand binding and induced large scale motion of the protein's domains still...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci500520q

    authors: Huang W,Blinov N,Wishart DS,Kovalenko A

    update_date:2015-02-23 00:00:00

  • Residue preference mapping of ligand fragments in the Protein Data Bank.

    abstract::The interaction between small molecules and proteins is one of the major concerns for structure-based drug design because the principles of protein-ligand interactions and molecular recognition are not thoroughly understood. Fortunately, the analysis of protein-ligand complexes in the Protein Data Bank (PDB) enables u...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci100386y

    authors: Wang L,Xie Z,Wipf P,Xie XQ

    update_date:2011-04-25 00:00:00

  • Potent Human Telomerase Inhibitors: Molecular Dynamic Simulations, Multiple Pharmacophore-Based Virtual Screening, and Biochemical Assays.

    abstract::Telomere maintenance is a universal cancer hallmark, and small molecules that disrupt telomere maintenance generally have anticancer properties. Since the vast majority of cancer cells utilize telomerase activity for telomere maintenance, the enzyme has been considered as an anticancer drug target. Recently, rational ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.5b00336

    authors: Shirgahi Talari F,Bagherzadeh K,Golestanian S,Jarstfer M,Amanlou M

    update_date:2015-12-28 00:00:00

  • Comments on the article "Evaluation of pK(a) estimation methods on 211 druglike compounds".

    abstract::The recent article "Evaluation of pK(a) Estimation Methods on 211 Druglike Compounds" ( Manchester, J.; et al. J. Chem Inf. Model. 2010, 50, 565-571 ) reports poor results for the program Epik. Here, we highlight likely sources for the poor performance and describe work done to improve the performance. Running Epik in...

    journal_title:Journal of chemical information and modeling

    pub_type: Comment, Journal Article

    doi:10.1021/ci100332m

    authors: Shelley JC,Calkins D,Sullivan AP

    update_date:2011-01-24 00:00:00

  • De Novo Drug Design of Targeted Chemical Libraries Based on Artificial Intelligence and Pair-Based Multiobjective Optimization.

    abstract::Artificial intelligence and multiobjective optimization represent promising solutions to bridge chemical and biological landscapes by addressing the automated de novo design of compounds as a result of a humanlike creative process. In the present study, we conceived a novel pair-based multiobjective approach implement...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00517

    authors: Domenico A,Nicola G,Daniela T,Fulvio C,Nicola A,Orazio N

    update_date:2020-10-26 00:00:00