Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach.

Abstract:

Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like "proteins", "DNA", or "steroids".
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
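
To make the idea of assigning molecules to "chemical topics" concrete, the following is a minimal sketch, not the CheTo implementation. It assumes that each molecule has already been reduced to a bag of substructure "words" and it uses latent Dirichlet allocation with collapsed Gibbs sampling as one common topic-modeling algorithm; the abstract does not state which algorithm or fragmentation scheme the authors use. The molecule names, fragment labels, the function `fit_lda`, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Toy "documents": each molecule is a bag of substructure "words".
# These fragment labels are hypothetical, not the output of a real fragmenter.
molecules = {
    "mol_A": ["steroid_core", "hydroxyl", "ketone", "steroid_core"],
    "mol_B": ["steroid_core", "ketone", "methyl"],
    "mol_C": ["purine", "ribose", "phosphate", "phosphate"],
    "mol_D": ["purine", "ribose", "phosphate"],
}


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-molecule topic probabilities."""
    rng = random.Random(seed)
    n_vocab = len({w for words in docs.values() for w in words})
    doc_topic = {d: [0] * n_topics for d in docs}      # counts n(d, k)
    topic_word = [Counter() for _ in range(n_topics)]  # counts n(k, w)
    topic_total = [0] * n_topics                       # counts n(k)
    assign = {}

    # Start from a random topic assignment for every fragment occurrence.
    for d, words in docs.items():
        assign[d] = []
        for w in words:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, words in docs.items():
            for i, w in enumerate(words):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_vocab)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic distribution for each molecule.
    return {
        d: [(c + alpha) / (len(docs[d]) + alpha * n_topics) for c in counts]
        for d, counts in doc_topic.items()
    }


topic_probs = fit_lda(molecules)
for name, probs in topic_probs.items():
    print(name, [round(p, 2) for p in probs])
```

Each molecule receives a probability for every topic rather than a single hard label, which is the property that distinguishes this kind of model from the clustering methods mentioned in the abstract. On a data set of realistic size, an optimized library implementation would be a more sensible choice than this pure-Python sampler.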

journal_name

J Chem Inf Model

authors

Schneider N,Fechner N,Landrum GA,Stiefl N

doi

10.1021/acs.jcim.7b00249

subject

Has Abstract

pub_date

2017-08-28 00:00:00

pages

1816-1831

issue

8

eissn

1549-960X

issn

1549-9596

journal_volume

57

pub_type

Journal Article
  • LigQ: A Webserver to Select and Prepare Ligands for Virtual Screening.

    abstract::Virtual screening is a powerful methodology to search for new small molecule inhibitors against a desired molecular target. Usually, it involves evaluating thousands of compounds (derived from large databases) in order to select a set of potential binders that will be tested in the wet-lab. The number of tested compou...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00241

    authors: Radusky L,Ruiz-Carmona S,Modenutti C,Barril X,Turjanski AG,Martí MA

    update_date:2017-08-28 00:00:00

  • Virtual Screening with Generative Topographic Maps: How Many Maps Are Required?

    abstract::Universal generative topographic maps (GTMs) provide two-dimensional representations of chemical space selected for their "polypharmacological competence", that is, the ability to simultaneously represent meaningful activity and property landscapes, associated with many distinct targets and properties. Several such GT...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.8b00650

    authors: Casciuc I,Zabolotna Y,Horvath D,Marcou G,Bajorath J,Varnek A

    update_date:2019-01-28 00:00:00

  • Benchmark Sets for Binding Hot Spot Identification in Fragment-Based Ligand Discovery.

    abstract::Binding hot spots are regions of proteins that, due to their potentially high contribution to the binding free energy, have high propensity to bind small molecules. We present benchmark sets for testing computational methods for the identification of binding hot spots with emphasis on fragment-based ligand discovery. ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00877

    authors: Wakefield AE,Yueh C,Beglov D,Castilho MS,Kozakov D,Keserű GM,Whitty A,Vajda S

    update_date:2020-12-28 00:00:00

  • On three-electron bonds and hydrogen bonds in the open-shell complexes [H2X2]+ for X = F, Cl, and Br.

    abstract::The [H2X2]+ (X = Cl, Br) formula could refer to two possible stable structures, namely, the hydrogen-bonded complex and the three-electron-bonded one. In contrary to the results published by other authors, we claim that for the F-type structures the hydrogen-bonded form is the only possible one and the [HFFH]+ complex...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci600355g

    authors: Bil A,Berski S,Latajka Z

    update_date:2007-05-01 00:00:00

  • Prediction of cytochrome P450 xenobiotic metabolism: tethered docking and reactivity derived from ligand molecular orbital analysis.

    abstract::Metabolism of xenobiotic and endogenous compounds is frequently complex, not completely elucidated, and therefore often ambiguous. The prediction of sites of metabolism (SoM) can be particularly helpful as a first step toward the identification of metabolites, a process especially relevant to drug discovery. This pape...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400058s

    authors: Tyzack JD,Williamson MJ,Torella R,Glen RC

    update_date:2013-06-24 00:00:00

  • Facile Solutions to the Problems Associated with Chemical Information and Mathematical Symbolism While Using Machine Translation Tools.

    abstract::Advances in computer-aided translation technology have made tremendous progress in accuracy in the past few years. Chemical Abstracts Service of the American Chemical Society summarizes scientific works from more than 50 languages and allows the users to search papers in nine selected languages. Currently, only the ab...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.0c00274

    authors: Wahab MF,Zulfiqar S,Sarwar MI,Lieberwirth I

    update_date:2020-07-27 00:00:00

  • Molecular Mechanism, Dynamics, and Energetics of Protein-Mediated Dinucleotide Flipping in a Mismatched DNA: A Computational Study of the RAD4-DNA Complex.

    abstract::DNA damage alters genetic information and adversely affects gene expression pathways leading to various complex genetic disorders and cancers. DNA repair proteins recognize and rectify DNA damage and mismatches with high fidelity. A critical molecular event that occurs during most protein-mediated DNA repair processes...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00636

    authors: Pitta K,Krishnan M

    update_date:2018-03-26 00:00:00

  • Ranking Reversible Covalent Drugs: From Free Energy Perturbation to Fragment Docking.

    abstract::Reversible covalent inhibitors have drawn increasing attention in drug design, as they are likely more potent than noncovalent inhibitors and less toxic than covalent inhibitors. Despite those advantages, the computational prediction of reversible covalent binding presents a formidable challenge because the binding pr...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.8b00959

    authors: Zhang H,Jiang W,Chatterjee P,Luo Y

    update_date:2019-05-28 00:00:00

  • FragPELE: Dynamic Ligand Growing within a Binding Site. A Novel Tool for Hit-To-Lead Drug Design.

    abstract::The early stages of drug discovery rely on hit-to-lead programs, where initial hits undergo partial optimization to improve binding affinities for their biological target. This is an expensive and time-consuming process, requiring multiple iterations of trial and error designs, an ideal scenario for applying computer ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00938

    authors: Perez C,Soler D,Soliva R,Guallar V

    update_date:2020-03-23 00:00:00

  • Computational evidence for the role of Arabidopsis thaliana UVR8 as UV-B photoreceptor and identification of its chromophore amino acids.

    abstract::A homology model of the Arabidopsis thaliana UV resistance locus 8 (UVR8) protein is presented herein, showing a seven-bladed β-propeller conformation similar to the globular structure of RCC1. The UVR8 amino acid sequence contains a very high amount of conserved tryptophans, and the homology model shows that seven of...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci200017f

    authors: Wu M,Grahn E,Eriksson LA,Strid A

    update_date:2011-06-27 00:00:00

  • Pharmacophore Model for Wnt/Porcupine Inhibitors and Its Use in Drug Design.

    abstract::Porcupine is a component of the Wnt pathway which regulates cell proliferation, migration, stem cell self-renewal, and differentiation. The Wnt pathway has been shown to be dysregulated in a variety of cancers. Porcupine is a membrane bound O-acyltransferase that palmitoylates Wnt. Inhibiting porcupine blocks the secr...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.5b00159

    authors: Poulsen A,Ho SY,Wang W,Alam J,Jeyaraj DA,Ang SH,Tan ES,Lin GR,Cheong VW,Ke Z,Lee MA,Keller TH

    update_date:2015-07-27 00:00:00

  • Comparative modeling and benchmarking data sets for human histone deacetylases and sirtuin families.

    abstract::Histone deacetylases (HDACs) are an important class of drug targets for the treatment of cancers, neurodegenerative diseases, and other types of diseases. Virtual screening (VS) has become fairly effective approaches for drug discovery of novel and highly selective histone deacetylase inhibitors (HDACIs). To facilitat...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci5005515

    authors: Xia J,Tilahun EL,Kebede EH,Reid TE,Zhang L,Wang XS

    update_date:2015-02-23 00:00:00

  • RED: a set of molecular descriptors based on Renyi entropy.

    abstract::New molecular descriptors, RED (Renyi entropy descriptors), based on the generalized entropies introduced by Renyi are presented. Topological descriptors based on molecular features have proven to be useful for describing molecular profiles. Renyi entropy is used as a variability measure to contract a feature-pair dis...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci900275w

    authors: Delgado-Soler L,Toral R,Tomás MS,Rubio-Martinez J

    update_date:2009-11-01 00:00:00

  • Insights on the facet specific adsorption of amino acids and peptides toward platinum.

    abstract::Engineering shape-controlled bionanomaterials requires comprehensive understanding of interactions between biomolecules and inorganic surfaces. We explore the origin of facet-selective binding of peptides adsorbed onto Pt(100) and Pt(111) crystallographic planes. Using molecular dynamics simulations, we show that upon...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400630d

    authors: Ramakrishnan SK,Martin M,Cloitre T,Firlej L,Cuisinier FJ,Gergely C

    update_date:2013-12-23 00:00:00

  • Random Forest Refinement of Pairwise Potentials for Protein-Ligand Decoy Detection.

    abstract::An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function's ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluat...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00356

    authors: Pei J,Zheng Z,Kim H,Song LF,Walworth S,Merz MR,Merz KM Jr

    update_date:2019-07-22 00:00:00

  • Coarse-Grained Prediction of Peptide Binding to G-Protein Coupled Receptors.

    abstract::In this study, we used the Martini Coarse-Grained model with no applied restraints to predict the binding mode of some peptides to G-Protein Coupled Receptors (GPCRs). Both the Neurotensin-1 and the chemokine CXCR4 receptors were used as test cases. Their ligands, NTS8-13 and CVX15 peptides, respectively, were initial...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.6b00503

    authors: Delort B,Renault P,Charlier L,Raussin F,Martinez J,Floquet N

    update_date:2017-03-27 00:00:00

  • Holistic Approach to Partial Covalent Interactions in Protein Structure Prediction and Design with Rosetta.

    abstract::Partial covalent interactions (PCIs) in proteins, which include hydrogen bonds, salt bridges, cation-π, and π-π interactions, contribute to thermodynamic stability and facilitate interactions with other biomolecules. Several score functions have been developed within the Rosetta protein modeling framework that identif...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00398

    authors: Combs SA,Mueller BK,Meiler J

    update_date:2018-05-29 00:00:00

  • Structure-based design and screen of novel inhibitors for class II 3-hydroxy-3-methylglutaryl coenzyme A reductase from Streptococcus pneumoniae.

    abstract::3-Hydroxy-3-methylglutaryl coenzyme A reductase (HMGR) is a primary target in the current clinical treatment of hypercholesterolemia with specific inhibitors of "statin" family. Statins are excellent inhibitors of the class I (human) enzyme but relatively poor inhibitors of the class II enzyme, which are well-known as...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci300163v

    authors: Li D,Gui J,Li Y,Feng L,Han X,Sun Y,Sun T,Chen Z,Cao Y,Zhang Y,Zhou L,Hu X,Ren Y,Wan J

    update_date:2012-07-23 00:00:00

  • Heterogeneous Dielectric Implicit Membrane Model for the Calculation of MMPBSA Binding Free Energies.

    abstract::Membrane-bound protein receptors are a primary biological drug target, but the computational analysis of membrane proteins has been limited. In order to improve molecular mechanics Poisson-Boltzmann surface area (MMPBSA) binding free energy calculations for membrane protein-ligand systems, we have optimized a new hete...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00363

    authors: Greene D,Qi R,Nguyen R,Qiu T,Luo R

    update_date:2019-06-24 00:00:00

  • Training a scoring function for the alignment of small molecules.

    abstract::A comprehensive data set of aligned ligands with highly similar binding pockets from the Protein Data Bank has been built. Based on this data set, a scoring function for recognizing good alignment poses for small molecules has been developed. This function is based on atoms and hydrogen-bond projected features. The co...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci100227h

    authors: Chan SL,Labute P

    update_date:2010-09-27 00:00:00

  • Kinetic Models of Cyclosporin A in Polar and Apolar Environments Reveal Multiple Congruent Conformational States.

    abstract::The membrane permeability of cyclic peptides and peptidomimetics, which are generally larger and more complex than typical drug molecules, is likely strongly influenced by the conformational behavior of these compounds in polar and apolar environments. The size and complexity of peptides often limit their bioavailabil...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.6b00251

    authors: Witek J,Keller BG,Blatter M,Meissner A,Wagner T,Riniker S

    update_date:2016-08-22 00:00:00

  • Template CoMFA: the 3D-QSAR Grail?

    abstract::Template CoMFA, a novel alignment methodology for training or test set structures in 3D-QSAR, is introduced. Its two most significant advantages are its complete automation and its ability to derive a single combined model from multiple structural series affecting a biological target. Its only two inputs are one or mo...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci400696v

    authors: Cramer RD,Wendt B

    update_date:2014-02-24 00:00:00

  • Catalytic Role of Gln202 in the Carboligation Reaction Mechanism of Yeast AHAS: A QM/MM Study.

    abstract::Acetohydroxyacid synthase (AHAS) is a thiamin diphosphate-dependent enzyme involved in the biosynthesis of valine, leucine, isoleucine, and lysine. Experimental evidence has shown that mutation of the Gln202 residue results in a decrease in the enzymatic activity, thus suggesting the main role of the carboligation cat...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00863

    authors: Mendoza F,Medina FE,Jiménez VA,Jaña GA

    update_date:2020-02-24 00:00:00

  • Turbocharging Matched Molecular Pair Analysis: Optimizing the Identification and Analysis of Pairs.

    abstract::We have applied the two most commonly used methods for automatic matched pair identification, obtained the optimum settings, and discovered that the two methods are synergistic. A turbocharging approach to matched pair analysis is advocated in which a first round (a conservative categorical approach that uses an analo...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00335

    authors: Lukac I,Zarnecka J,Griffen EJ,Dossetter AG,St-Gallay SA,Enoch SJ,Madden JC,Leach AG

    update_date:2017-10-23 00:00:00

  • Combined 3D-QSAR modeling and molecular docking study on indolinone derivatives as inhibitors of 3-phosphoinositide-dependent protein kinase-1.

    abstract::3-Phosphoinositide-dependent protein kinase-1 (PDK1) is a promising target for developing novel anticancer drugs. In order to understand the structure-activity correlation of indolinone-based PDK1 inhibitors, we have carried out a combined molecular docking and three-dimensional quantitative structure-activity relatio...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci800147v

    authors: AbdulHameed MD,Hamza A,Liu J,Zhan CG

    update_date:2008-09-01 00:00:00

  • Torsion Library Reloaded: A New Version of Expert-Derived SMARTS Rules for Assessing Conformations of Small Molecules.

    abstract::The Torsion Library contains hundreds of rules for small molecule conformations which have been derived from the Cambridge Structural Database (CSD) and are curated by molecular design experts. The torsion rules are encoded as SMARTS patterns and categorize rotatable bonds via a traffic light coloring scheme. We have ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.5b00522

    authors: Guba W,Meyder A,Rarey M,Hert J

    update_date:2016-01-25 00:00:00

  • Sharing Data from Molecular Simulations.

    abstract::Given the need for modern researchers to produce open, reproducible scientific output, the lack of standards and best practices for sharing data and workflows used to produce and analyze molecular dynamics (MD) simulations has become an important issue in the field. There are now multiple well-established packages to ...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00665

    authors: Abraham M,Apostolov R,Barnoud J,Bauer P,Blau C,Bonvin AMJJ,Chavent M,Chodera J,Čondić-Jurkić K,Delemotte L,Grubmüller H,Howard RJ,Jordan EJ,Lindahl E,Ollila OHS,Selent J,Smith DGA,Stansfeld PJ,Tiemann JKS,Trellet M

    update_date:2019-10-28 00:00:00

  • Two model system of the alpha1A-adrenoceptor docked with selected ligands.

    abstract::In this study, we have developed a two model system to mimic the active and inactive states of a G-protein coupled receptor specifically the alpha1A adrenergic receptor. We have docked two agonists, epinephrine (phenylamine type) and oxymetazoline (imidazoline type), as well as two antagonists, prazosin and 5-methylur...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/ci700026v

    authors: Asher WB,Hoskins SN,Slasor LA,Morris DH,Cook EM,Bautista DL

    update_date:2007-09-01 00:00:00

  • Isomerization and Decomposition of 2-Methylfuran with External Forces.

    abstract::The primary goal of this project was to evaluate the performance of the Standard and Enforced Geometry Optimization (SEGO) method which we have recently developed. The SEGO method has been designed for an automatic location of multiple minima on the molecular Potential Energy Surface (PES), and its usefulness has been...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.9b00352

    authors: Brzyska A,Woliński K

    update_date:2019-08-26 00:00:00

  • Probing the Binding Pathway of BRACO19 to a Parallel-Stranded Human Telomeric G-Quadruplex Using Molecular Dynamics Binding Simulation with AMBER DNA OL15 and Ligand GAFF2 Force Fields.

    abstract::Human telomeric DNA G-quadruplex has been identified as a good therapeutic target in cancer treatment. G-quadruplex-specific ligands that stabilize the G-quadruplex have great potential to be developed as anticancer agents. Two crystal structures (an apo form of parallel stranded human telomeric G-quadruplex and its h...

    journal_title:Journal of chemical information and modeling

    pub_type: Journal Article

    doi:10.1021/acs.jcim.7b00287

    authors: Machireddy B,Kalra G,Jonnalagadda S,Ramanujachary K,Wu C

    update_date:2017-11-27 00:00:00