Molecular Structure Extraction from Documents Using Deep Learning.

Abstract:

Chemical structure extraction from documents remains a hard problem because of both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting the performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We present end-to-end deep learning solutions for both segmenting molecular structures from documents and predicting chemical structures from the segmented images. This deep-learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein, we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.

journal_name

J Chem Inf Model

authors

Staker J,Marshall K,Abel R,McQuaw CM

doi

10.1021/acs.jcim.8b00669

subject

Has Abstract

pub_date

2019-03-25 00:00:00

pages

1017-1029

issue

3

eissn

1549-960X

issn

1549-9596

journal_volume

59

pub_type

Journal Article
  • ColBioS-FlavRC: a collection of bioselective flavonoids and related compounds filtered from high-throughput screening outcomes.

    abstract::Flavonoids, the vastest class of natural polyphenols, are extensively investigated for their multiple benefits on human health. Due to their physicochemical or biological properties, many representatives are considered to exhibit low selectivity among various protein targets or to plague high-throughput screening (HTS...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci5002668

    authors: Avram SI,Pacureanu LM,Bora A,Crisan L,Avram S,Kurunczi L

    update_date:2014-08-25 00:00:00

  • Determination of Structural Ensembles of Flexible Molecules in Solution from NMR Data Undergoing Spin Diffusion.

    abstract::Spin diffusion is a formidable problem when interpreting NMR data of chemical compounds. We developed a method to reconstruct the conformational ensemble of flexible molecules displaying spin diffusion, which minimizes the subjective bias in the interpretation of experimental data and which can be used routinely to ob...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00259

    authors: Vasile F,Tiana G

    update_date:2019-06-24 00:00:00

  • Determining the validity of a QSAR model--a classification approach.

    abstract::The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previo...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci0497511

    authors: Guha R,Jurs PC

    update_date:2005-01-01 00:00:00

  • ThermoData Engine (TDE): software implementation of the dynamic data evaluation concept. 9. Extensible thermodynamic constraints for pure compounds and new model developments.

    abstract::ThermoData Engine (TDE) is the first full-scale software implementation of the dynamic data evaluation concept, as reported in this journal. The present article describes the background and implementation for new additions in latest release of TDE. Advances are in the areas of program architecture and quality improvem...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci4005699

    authors: Diky V,Chirico RD,Muzny CD,Kazakov AF,Kroenlein K,Magee JW,Abdulagatov I,Frenkel M

    update_date:2013-12-23 00:00:00

  • FAME 3: Predicting the Sites of Metabolism in Synthetic Compounds and Natural Products for Phase 1 and Phase 2 Metabolic Enzymes.

    abstract::In this work we present the third generation of FAst MEtabolizer (FAME 3), a collection of extra trees classifiers for the prediction of sites of metabolism (SoMs) in small molecules such as drugs, druglike compounds, natural products, agrochemicals, and cosmetics. FAME 3 was derived from the MetaQSAR database ( Pedre...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00376

    authors: Šícho M,Stork C,Mazzolari A,de Bruyn Kops C,Pedretti A,Testa B,Vistoli G,Svozil D,Kirchmair J

    update_date:2019-08-26 00:00:00

  • Modeling Binding with Large Conformational Changes: Key Points in Ensemble-Docking Approaches.

    abstract::Protein dynamics play a critical role in ligand binding, and different models have been proposed to explain the relationships between protein motion and molecular recognition. Here, we present a study of ligand-binding processes associated with large conformational changes of a protein to elucidate the critical choice...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00125

    authors: Motta S,Bonati L

    update_date:2017-07-24 00:00:00

  • CoMFA, CoMSIA, and molecular hologram QSAR studies of novel neuronal nAChRs ligands-open ring analogues of 3-pyridyl ether.

    abstract::3-Pyridyl ethers are excellent nAChRs ligands, which show high subtype selectivity and binding affinity to alpha4beta2 nAChR. Although the quantitative structure-activity relationship (QSAR) of nAChRs ligands has been widely investigated using various classes of compounds, the open ring analogues of 3-pyridyl ethers h...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci0498113

    authors: Zhang H,Li H,Liu C

    update_date:2005-03-01 00:00:00

  • Training a scoring function for the alignment of small molecules.

    abstract::A comprehensive data set of aligned ligands with highly similar binding pockets from the Protein Data Bank has been built. Based on this data set, a scoring function for recognizing good alignment poses for small molecules has been developed. This function is based on atoms and hydrogen-bond projected features. The co...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci100227h

    authors: Chan SL,Labute P

    update_date:2010-09-27 00:00:00

  • Assessing different classification methods for virtual screening.

    abstract::How well do different classification methods perform in selecting the ligands of a protein target out of large compound collections not used to train the model? Support vector machines, random forest, artificial neural networks, k-nearest-neighbor classification with genetic-algorithm-optimized feature selection, tren...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci050519k

    authors: Plewczynski D,Spieser SA,Koch U

    update_date:2006-05-01 00:00:00

  • Benchmark data sets for structure-based computational target prediction.

    abstract::Structure-based computational target prediction methods identify potential targets for a bioactive compound. Methods based on protein-ligand docking so far face many challenges, where the greatest probably is the ranking of true targets in a large data set of protein structures. Currently, no standard data sets for ev...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci500131x

    authors: Schomburg KT,Rarey M

    update_date:2014-08-25 00:00:00

  • A critical assessment of combined ligand- and structure-based approaches to HERG channel blocker modeling.

    abstract::Blockade of human ether-à-go-go related gene (hERG) channel prolongs the duration of the cardiac action potential and is a common reason for drug failure in preclinical safety trials. Therefore, it is of great importance to develop robust in silico tools to predict potential hERG blockers in the early stages of drug d...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci200271d

    authors: Du-Cuny L,Chen L,Zhang S

    update_date:2011-11-28 00:00:00

  • Expanding the Range of Force Fields Available for ONIOM Calculations: The SICTWO Interface.

    abstract::The ONIOM scheme is one of the most popular QM/MM approaches, but its extended application has been so far hindered by the limited availability of force fields in most practical implementations. This paper describes a simple software code to overcome this limitation, and its application to three representative chemica...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.8b00332

    authors: Sameera WMC,Maseras F

    update_date:2018-09-24 00:00:00

  • Molecular Dynamics Simulation of the Conformational Preferences of Pseudouridine Derivatives: Improving the Distribution in the Glycosidic Torsion Space.

    abstract::There are only four derivatives of pseudouridine (Ψ) that are known to occur naturally in RNA as post-transcriptional modifications. We have studied the conformational consequences of pseudouridylation and further modifications using replica exchange molecular dynamics simulations at the nucleoside level, and the simu...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00369

    authors: Dutta N,Sarzynska J,Lahiri A

    update_date:2020-10-26 00:00:00

  • PyCGTOOL: Automated Generation of Coarse-Grained Molecular Dynamics Models from Atomistic Trajectories.

    abstract::Development of coarse-grained (CG) molecular dynamics models is often a laborious process which commonly relies upon approximations to similar models, rather than systematic parametrization. PyCGTOOL automates much of the construction of CG models via calculation of both equilibrium values and force constants of inter...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00096

    authors: Graham JA,Essex JW,Khalid S

    update_date:2017-04-24 00:00:00

  • Develop and test a solvent accessible surface area-based model in conformational entropy calculations.

    abstract::It is of great interest in modern drug design to accurately calculate the free energies of protein-ligand or nucleic acid-ligand binding. MM-PBSA (molecular mechanics Poisson-Boltzmann surface area) and MM-GBSA (molecular mechanics generalized Born surface area) have gained popularity in this field. For both methods, ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci300064d

    authors: Wang J,Hou T

    update_date:2012-05-25 00:00:00

  • Evaluating Free Energies of Binding and Conservation of Crystallographic Waters Using SZMAP.

    abstract::The SZMAP method computes binding free energies and the corresponding thermodynamic components for water molecules in the binding site of a protein structure [ SZMAP, 1.0.0 ; OpenEye Scientific Software Inc. : Santa Fe, NM, USA , 2011 ]. In this work, the ability of SZMAP to predict water structure and thermodynamic s...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci500746d

    authors: Bayden AS,Moustakas DT,Joseph-McCarthy D,Lamb ML

    update_date:2015-08-24 00:00:00

  • Development of novel statistical potentials describing cation-pi interactions in proteins and comparison with semiempirical and quantum chemistry approaches.

    abstract::Novel statistical potentials derived from known protein structures are presented. They are designed to describe cation-pi and amino-pi interactions between a positively charged amino acid or an amino acid carrying a partially charged amino group and an aromatic moiety. These potentials are based on the propensity of r...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci050395b

    authors: Gilis D,Biot C,Buisine E,Dehouck Y,Rooman M

    update_date:2006-03-01 00:00:00

  • Fragment-Based Computational Method for Designing GPCR Ligands.

    abstract::G protein-coupled receptors (GPCRs) are the largest family of cell surface receptors, which is arguably the most important family of drug target. With the technology breakthroughs in X-ray crystallography and cryo-electron microscopy, more than 300 GPCR-ligand complex structures have been publicly reported since 2007,...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00699

    authors: Li Y,Sun Y,Song Y,Dai D,Zhao Z,Zhang Q,Zhong W,Hu LA,Ma Y,Li X,Wang R

    update_date:2020-09-28 00:00:00

  • In silico deconstruction of ATP-competitive inhibitors of glycogen synthase kinase-3β.

    abstract::Fragment-based methods have emerged in the last two decades as alternatives to traditional high throughput screenings for the identification of chemical starting points in drug discovery. One arguable yet popular assumption about fragment-based design is that the fragment binding mode remains conserved upon chemical e...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci300355p

    authors: Bisignano P,Lambruschini C,Bicego M,Murino V,Favia AD,Cavalli A

    update_date:2012-12-21 00:00:00

  • Protein Preparation Automatic Protocol for High-Throughput Inverse Virtual Screening: Accelerating the Target Identification by Computational Methods.

    abstract::Structure-based virtual screening is highly used in the early stages of drug discovery to identify new putative lead compounds for a given target. However, when a small molecule elicits a biological effect, but its target is unknown, or the side effects it causes arise from its undesired interaction with unknown count...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00428

    authors: De Vita S,Lauro G,Ruggiero D,Terracciano S,Riccio R,Bifulco G

    update_date:2019-11-25 00:00:00

  • Partitioning of Benzoic Acid into 1,2-Dimyristoyl-sn-glycero-3-phosphocholine and Blood-Brain Barrier Mimetic Bilayers.

    abstract::Using an all-atom explicit water model and replica exchange umbrella sampling simulations, we investigated the molecular mechanisms of benzoic acid partitioning into two model lipid bilayers. The first was formed of 1,2-dimyristoyl-sn-glycero-3-phosphocholine (DMPC) lipids, whereas the second was composed of an equimo...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00590

    authors: Siwy CM,Delfing BM,Smith AK,Klimov DK

    update_date:2020-08-24 00:00:00

  • FlexAID: Revisiting Docking on Non-Native-Complex Structures.

    abstract::Small-molecule protein docking is an essential tool in drug design and to understand molecular recognition. In the present work we introduce FlexAID, a small-molecule docking algorithm that accounts for target side-chain flexibility and utilizes a soft scoring function, i.e. one that is not highly dependent on specifi...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.5b00078

    authors: Gaudreault F,Najmanovich RJ

    update_date:2015-07-27 00:00:00

  • Unraveling Energy and Dynamics Determinants to Interpret Protein Functional Plasticity: The Limonene-1,2-epoxide-hydrolase Case Study.

    abstract::The balance between structural stability and functional plasticity in proteins that share common three-dimensional folds is the key factor that drives protein evolvability. The ability to distinguish the parts of homologous proteins that underlie common structural organization patterns from the parts acting as regulat...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.6b00504

    authors: Rinaldi S,Gori A,Annovazzi C,Ferrandi EE,Monti D,Colombo G

    update_date:2017-04-24 00:00:00

  • Ligand binding determinants for angiotensin II type 1 receptor from computer simulations.

    abstract::The ligand binding determinants for the angiotensin II type 1 receptor (AT1R), a G protein-coupled receptor (GPCR), have been characterized by means of computer simulations. As a first step, a pharmacophore model of various known AT1R ligands exhibiting a wide range of binding affinities was generated. Second, a struc...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400400m

    authors: Matsoukas MT,Cordomí A,Ríos S,Pardo L,Tselios T

    update_date:2013-11-25 00:00:00

  • Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discove

    abstract::All molecules of up to 11 atoms of C, N, O, and F possible under consideration of simple valency, chemical stability, and synthetic feasibility rules were generated and collected in a database (GDB). GDB contains 26.4 million molecules (110.9 million stereoisomers), including three- and four-membered rings and triple ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci600423u

    authors: Fink T,Reymond JL

    update_date:2007-03-01 00:00:00

  • Benchmark data set for in silico prediction of Ames mutagenicity.

    abstract::Up to now, publicly available data sets to build and evaluate Ames mutagenicity prediction tools have been very limited in terms of size and chemical space covered. In this report we describe a new unique public Ames mutagenicity data set comprising about 6500 nonconfidential compounds (available as SMILES strings and...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci900161g

    authors: Hansen K,Mika S,Schroeter T,Sutter A,ter Laak A,Steger-Hartmann T,Heinrich N,Müller KR

    update_date:2009-09-01 00:00:00

  • Scaling predictive modeling in drug development with cloud computing.

    abstract::Growing data sets with increased time for analysis is hampering predictive modeling in drug discovery. Model building can be carried out on high-performance computer clusters, but these can be expensive to purchase and maintain. We have evaluated ligand-based modeling on cloud computing resources where computations ar...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci500580y

    authors: Moghadam BT,Alvarsson J,Holm M,Eklund M,Carlsson L,Spjuth O

    update_date:2015-01-26 00:00:00

  • Molecular Dynamics Simulations of Membrane-Bound STIM1 to Investigate Conformational Changes during STIM1 Activation upon Calcium Release.

    abstract::Calcium is involved in important intracellular processes, such as intracellular signaling from cell membrane receptors to the nucleus. Typically, calcium levels are kept at less than 100 nM in the nucleus and cytosol, but some calcium is stored in the endoplasmic reticulum (ER) lumen for rapid release to activate intr...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.6b00475

    authors: Mukherjee S,Karolak A,Debant M,Buscaglia P,Renaudineau Y,Mignen O,Guida WC,Brooks WH

    update_date:2017-02-27 00:00:00

  • Rapid evaluation of synthetic and molecular complexity for in silico chemistry.

    abstract::Methods that rapidly evaluate molecular complexity and synthetic feasibility are becoming increasingly important for in silico chemistry. We propose a new metric based on relative atomic electronegativities and bond parameters that evaluate both synthetic and molecular complexity (SMCM) starting from chemical structur...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci0501387

    authors: Allu TK,Oprea TI

    update_date:2005-09-01 00:00:00

  • Partitioning of pi-electrons in rings for Clar structures of benzenoid hydrocarbons.

    abstract::Resonance structures of polycyclic aromatic hydrocarbons can be associated with numerical formulas by assigning pi-electrons of C=C double bonds to individual benzenoid rings. Each C=C double bond in a resonance structure assigns two pi-electrons to a ring in a fused-benzenoid system if it is not shared by adjacent ri...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci050196s

    authors: Randić M,Balaban AT

    update_date:2006-01-01 00:00:00