Abstract:
Extending the original training data with simulated unobserved data points has proven powerful for increasing both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in diverse areas such as the analysis of spectroscopic data or the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) on either (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently led to increased predictive power on the test set by 10-15%. Injecting noise on (i) compound descriptors or on (ii) both compound descriptors and pIC50 values led to the largest drop in RMSEtest values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data are replicated once. Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase in (i) model complexity due to the need to optimize σnoise and (ii) the number of training examples.
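The replicate-and-perturb scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `augment_training_set` and `rmse`, the `perturb` mode labels, and the use of a single σnoise for both descriptors and pIC50 values are assumptions made for illustration, and the descriptors are assumed to be real-valued and on a comparable scale.

```python
import math
import random


def augment_training_set(X, y, n_copies, sigma_noise, perturb="both", seed=0):
    """Return the original data followed by n_copies perturbed replicas.

    perturb (hypothetical labels): "response" adds noise to y only,
    "descriptors" to X only, "both" to X and y, "none" replicates unchanged.
    """
    modes = {"response", "descriptors", "both", "none"}
    if perturb not in modes:
        raise ValueError(f"perturb must be one of {sorted(modes)}")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    X_aug = [list(row) for row in X]  # the original data are always kept
    y_aug = list(y)
    for _ in range(n_copies):
        for row, target in zip(X, y):
            new_row = list(row)
            new_target = target
            if perturb in ("descriptors", "both"):
                # Gaussian noise with mean 0 and standard deviation sigma_noise
                new_row = [v + rng.gauss(0.0, sigma_noise) for v in row]
            if perturb in ("response", "both"):
                new_target = target + rng.gauss(0.0, sigma_noise)
            X_aug.append(new_row)
            y_aug.append(new_target)
    return X_aug, y_aug


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: three compounds, two descriptors each, with made-up pIC50 values.
X_train = [[0.1, 1.2], [0.4, 0.8], [0.9, 0.3]]
y_train = [6.5, 7.1, 5.8]
X_aug, y_aug = augment_training_set(X_train, y_train, n_copies=1, sigma_noise=0.1)
print(len(X_aug), len(y_aug))  # 6 6
```

With `n_copies=1` the training set doubles in size, which corresponds to the single perturbed repetition the abstract identifies as a reasonable trade-off. In practice `sigma_noise` would be tuned like any other hyperparameter (for example by cross-validation), which is the extra model-complexity cost the abstract mentions.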
journal_name: J Chem Inf Model
journal_title: Journal of chemical information and modeling
authors: Cortes-Ciriano I, Bender A
doi: 10.1021/acs.jcim.5b00570
subject: Has Abstract
pub_date: 2015-12-28 00:00:00
pages: 2682-92
issue: 12
eissn: 1549-9596
issn: 1549-960X
journal_volume: 55
pub_type: Journal Article
abstract::The evaluation of regression QSAR model performance, in fitting, robustness, and external prediction, is of pivotal importance. Over the past decade, different external validation parameters have been proposed: Q²_F1, Q²_F2, Q²_F3, r²_m, and the Golbraikh-Tropsha method. Recently, the concordance correlati...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci300084j
update_date:2012-08-27 00:00:00
abstract::It is demonstrated that the fragmentation of druglike molecules by applying simplistic pseudo-retrosynthesis results in a stock of chemically meaningful building blocks for de novo molecule generation. A stochastic search algorithm in conjunction with ligand-based similarity scoring (Flux: fragment-based ligand builde...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci0503560
update_date:2006-03-01 00:00:00
abstract::Modern industrial lubricants are often blended with an assortment of chemical additives to improve the performance of the base stock. Machine learning-based predictive models allow fast and veracious derivation of material properties and facilitate novel and innovative material designs. In this study, we outline the d...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b01068
update_date:2020-03-23 00:00:00
abstract::Deep learning has drawn significant attention in different areas including drug discovery. It has been proposed that it could outperform other machine learning algorithms, especially with big data sets. In the field of pharmaceutical industry, machine learning models are built to understand quantitative structure-acti...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.8b00671
update_date:2019-03-25 00:00:00
abstract::We have performed a systematic study of the entropy term in the MM/GBSA (molecular mechanics combined with generalized Born and surface-area solvation) approach to calculate ligand-binding affinities. The entropies are calculated by a normal-mode analysis of harmonic frequencies from minimized snapshots of molecular d...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci3001919
update_date:2012-08-27 00:00:00
abstract::In this work we present the third generation of FAst MEtabolizer (FAME 3), a collection of extra trees classifiers for the prediction of sites of metabolism (SoMs) in small molecules such as drugs, druglike compounds, natural products, agrochemicals, and cosmetics. FAME 3 was derived from the MetaQSAR database ( Pedre...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b00376
update_date:2019-08-26 00:00:00
abstract::The failure of default scoring functions to ensure virtual screening enrichment is a persistent problem for the molecular docking algorithms used in structure-based drug discovery. To remedy this problem, elaborate rescoring and postprocessing schemes have been developed with a varying degree of success, specificity, ...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b00383
update_date:2019-08-26 00:00:00
abstract::Virtual screening is a powerful methodology to search for new small molecule inhibitors against a desired molecular target. Usually, it involves evaluating thousands of compounds (derived from large databases) in order to select a set of potential binders that will be tested in the wet-lab. The number of tested compou...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.7b00241
update_date:2017-08-28 00:00:00
abstract::Development of coarse-grained (CG) molecular dynamics models is often a laborious process which commonly relies upon approximations to similar models, rather than systematic parametrization. PyCGTOOL automates much of the construction of CG models via calculation of both equilibrium values and force constants of inter...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.7b00096
update_date:2017-04-24 00:00:00
abstract::A novel pharmacophore descriptor Flexophore is presented, which considers molecular flexibility when comparing descriptor similarities. The descriptor is a complete reduced graph of the underlying molecule. Its nodes are represented by enhanced MM2 atom types, while the edge descriptions encode the molecular flexibili...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci700359j
update_date:2008-04-01 00:00:00
abstract::Retrieving molecules with specific structural features is a fundamental requirement of today's molecular database technologies. Estimates claim the chemical space relevant for drug discovery to be around 10⁶⁰ molecules. This figure is many orders of magnitude larger than the amount of molecules conventional databases ...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci400107k
update_date:2013-07-22 00:00:00
abstract::Membrane fusion, a key step in the early stages of virus propagation, allows the release of the viral genome in the host cell cytoplasm. The process is initiated by fusion peptides that are small, hydrophobic components of viral membrane-embedded glycoproteins and are typically conserved within virus families. Here, w...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.0c01231
update_date:2021-01-25 00:00:00
abstract::We introduce the statistics behind a novel type of SAR analysis named "nonadditivity analysis". On the basis of all pairs of matched pairs within a given data set, the approach analyzes whether the same transformations between related molecules have the same effect, i.e., whether they are additive. Assuming that the e...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b00631
update_date:2019-09-23 00:00:00
abstract::With continually increased computer power, molecular mechanics force field-based approaches, such as the endpoint methods of molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) and molecular mechanics generalized Born surface area (MM-GBSA), have been routinely applied in both drug lead identification and opt...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.0c00934
update_date:2020-12-28 00:00:00
abstract::Accurate and affordable assessment of ligand-protein affinity for structure-based virtual screening (SB-VS) is a standing challenge. Hence, empirical postdocking filters making use of various types of structure-activity information may prove useful. Here, we introduce one such filter based upon three-dimensional struc...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci500319f
update_date:2014-09-22 00:00:00
abstract::The anatomical therapeutic chemical (ATC) classification system maintained by the World Health Organization provides a global standard for the classification of medical substances and serves as a source for drug repurposing research. Nevertheless, it lacks several drugs that are major players in the global drug market...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci9000844
update_date:2009-08-01 00:00:00
abstract::Reaction classification has important applications, and many approaches to classification have been applied. Our own algorithm tests all maximum common substructures (MCS) between all reactant and product molecules in order to find an atom mapping containing the minimum chemical distance (MCD). Recent publications hav...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci400442f
update_date:2013-11-25 00:00:00
abstract::As an important member of cytochrome P450 (CYP) enzymes, CYP17A1 is a dual-function monooxygenase with a critical role in the synthesis of many human steroid hormones, making it an attractive therapeutic target. The emerging structural information about CYP17A1 and the growing number of inhibitors for these enzymes ca...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.0c00447
update_date:2020-07-27 00:00:00
abstract::The COSMO surface polarization charge density σ resulting from quantum chemical calculations combined with a virtual conductor embedding has been widely proven to be a very suitable descriptor for the quantification of interactions of molecules in liquids. In a preceding paper, grid-based local histograms of σ have be...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci300231t
update_date:2012-08-27 00:00:00
abstract::In a departure from conventional chemical approaches, data-driven models of chemical reactions have recently been shown to be statistically successful using machine learning. These models, however, are largely black box in character and have not provided the kind of chemical insights that historically advanced the fie...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b00721
update_date:2020-03-23 00:00:00
abstract::The proton conduction of transmembrane influenza virus B M2 (BM2) proton channel is possibly mediated by the membrane environment, but the detailed molecular mechanism is challenging to determine. In this work, how membrane lipid composition regulates the conformation and hydration of BM2 channel is elucidated in sili...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.0c00329
update_date:2020-07-27 00:00:00
abstract::Identifying drug-target interactions (DTIs) plays an important role in the field of drug discovery, drug side-effects, and drug repositioning. However, in vivo or biochemical experimental methods for identifying new DTIs are extremely expensive and time-consuming. Recently, in silico or various computational methods h...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.9b00408
update_date:2019-07-22 00:00:00
abstract::The consistent handling of molecules is probably the most basic and important requirement in the field of cheminformatics. Reliable results can only be obtained if the underlying calculations are independent of the specific way molecules are represented in the input data. However, ensuring consistency is a complex tas...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci400724v
update_date:2014-03-24 00:00:00
abstract::A fully folded functional protein is stabilized by several noncovalent interactions. When a protein undergoes conformational motions, the existing noncovalent interactions may be maintained. They may also break or new interactions may be formed. Knowledge of the dynamical nature of the different types of noncovalent i...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci200302q
update_date:2011-12-27 00:00:00
abstract::Binding affinity prediction with implicit solvent models remains a challenge in virtual screening for drug discovery. In order to assess the predictive power of implicit solvent models in docking techniques with Amber scoring, three generalized Born models (GBHCT, GBOBCI, and GBOBCII) available in Dock 6.7 were utiliz...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.6b00418
update_date:2016-10-24 00:00:00
abstract::Maltose-binding protein is a periplasmic binding protein responsible for transport of maltooligosaccharides through the periplasmic space of Gram-negative bacteria, as a part of the ABC transport system. The molecular mechanisms of the initial ligand binding and induced large scale motion of the protein's domains still...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci500520q
update_date:2015-02-23 00:00:00
abstract::The interaction between small molecules and proteins is one of the major concerns for structure-based drug design because the principles of protein-ligand interactions and molecular recognition are not thoroughly understood. Fortunately, the analysis of protein-ligand complexes in the Protein Data Bank (PDB) enables u...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/ci100386y
update_date:2011-04-25 00:00:00
abstract::Telomere maintenance is a universal cancer hallmark, and small molecules that disrupt telomere maintenance generally have anticancer properties. Since the vast majority of cancer cells utilize telomerase activity for telomere maintenance, the enzyme has been considered as an anticancer drug target. Recently, rational ...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.5b00336
update_date:2015-12-28 00:00:00
abstract::The recent article "Evaluation of pKa Estimation Methods on 211 Druglike Compounds" (Manchester, J.; et al. J. Chem. Inf. Model. 2010, 50, 565-571) reports poor results for the program Epik. Here, we highlight likely sources for the poor performance and describe work done to improve the performance. Running Epik in...
journal_title:Journal of chemical information and modeling
pub_type: Comment, Journal Article
doi:10.1021/ci100332m
update_date:2011-01-24 00:00:00
abstract::Artificial intelligence and multiobjective optimization represent promising solutions to bridge chemical and biological landscapes by addressing the automated de novo design of compounds as a result of a humanlike creative process. In the present study, we conceived a novel pair-based multiobjective approach implement...
journal_title:Journal of chemical information and modeling
pub_type: Journal Article
doi:10.1021/acs.jcim.0c00517
update_date:2020-10-26 00:00:00