PLoSWiki
wiki
http://topicpageswiki.plos.org/wiki/Main_Page
MediaWiki 1.32.1
Chemical graph generators
0
801
8117
2019-09-12T18:32:45Z
MehmetAzizYirik
145
Created page with "{{author |first1 = Mehmet Aziz |last1 = Yirik |department1 = Analytical Chemistry |institution1 = [[WP:University of Jena|University of Jena]] |address1 = L..."
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic in cheminformatics. Chemical graph generators are used in areas such as virtual library generation in drug design, organic synthesis design and computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. Here we describe the theoretical basis of chemical graph generators and provide a historical overview of their development.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were graph generators adapted for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref>J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data such as HMBC and COSY to generate all possible structures. LSD is still used as an open-source structure generator under the GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining the two <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>. It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain fundamental references for structure generators. SMOG was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who also co-developed MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A chemical graph is described as a multigraph <math>G = (V,E)</math>, where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represent the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can form. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
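The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the generators discussed here; the names <code>VALENCE</code>, <code>degrees</code>, <code>is_saturated</code> and <code>is_connected</code> are assumptions made for illustration, and the valence table covers only a few elements.

```python
from collections import defaultdict

# Illustrative valence table (assumption: standard valences, small subset).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting bond orders as edge multiplicities."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    adj = defaultdict(set)
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: a C=O double bond and two C-H single bonds.
formaldehyde_atoms = ["C", "O", "H", "H"]
formaldehyde_bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
h4_atoms = ["H", "H", "H", "H"]
h4_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation alone is not sufficient: a saturated graph may still fall apart into several molecules, so a generator has to test connectivity as well.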
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations <math>a</math> and <math>b</math> is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
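As a small worked example, the permutation from Table 1 can be stored as a Python dictionary and composed with itself. The helper names <code>compose</code> and <code>is_permutation</code> are illustrative assumptions.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its values are a rearrangement of its keys."""
    return sorted(p.values()) == sorted(p.keys())

ff = compose(f, f)
# f(1) = 4 and f(4) = 6, hence (ff)(1) = 6.
```

Because <code>ff</code> again maps the set onto itself without repetition, it is also a permutation, which is the closure property stated above.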
A [[wp:Group theory|group]] <math>G</math> is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> such that <math>g*g^{-1}</math> is equal to the identity element.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> be the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a [[wp:Permutation group|permutation group]] called the symmetric group. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set. The permutations preserving the set system are used to build the [[wp:Graph automorphism|automorphisms]] of a graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if <math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
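For very small graphs, the automorphism group can be found by testing every permutation of the vertex set directly. The sketch below does exactly that; it is a brute-force illustration of the definition and is not how NAUTY works, which avoids enumerating all <math>n!</math> permutations. The function name <code>automorphisms</code> and the requirement that element labels and bond orders be preserved are assumptions chosen to fit molecular graphs.

```python
from itertools import permutations

def automorphisms(atoms, bonds):
    """All vertex permutations that preserve element labels and bond orders.

    atoms: list of element symbols; bonds: dict mapping (u, v) -> bond order.
    Brute force over all n! permutations, so only usable for tiny graphs.
    """
    n = len(atoms)
    order = {frozenset(pair): k for pair, k in bonds.items()}
    found = []
    for perm in permutations(range(n)):
        # An atom may only be mapped onto an atom of the same element.
        if any(atoms[perm[v]] != atoms[v] for v in range(n)):
            continue
        # Every bond must be mapped onto a bond of the same order.
        if all(order.get(frozenset((perm[u], perm[v])), 0) == k
               for (u, v), k in bonds.items()):
            found.append(perm)
    return found

# Carbon skeletons only (hydrogens omitted for brevity).
propane = automorphisms(["C", "C", "C"], {(0, 1): 1, (1, 2): 1})
cyclopropane = automorphisms(["C", "C", "C"], {(0, 1): 1, (1, 2): 1, (0, 2): 1})
ethanol = automorphisms(["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
```

The propane skeleton has two automorphisms (the identity and the mirror image of the chain), the cyclopropane ring has all 3! = 6, and the C-C-O skeleton of ethanol has only the identity because the oxygen atom cannot be exchanged with a carbon atom.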
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
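The generation tree can be sketched as a toy breadth-first assembly in Python. This is not the algorithm of ASSEMBLE or OMG: hydrogens are treated as implicit (an assumption that keeps the graphs small), duplicates are removed with a brute-force canonical form instead of NAUTY, and all names (<code>VALENCE</code>, <code>canonical</code>, <code>connected</code>, <code>assemble</code>) are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # heavy atoms only; hydrogens are implicit

def canonical(atoms, bonds):
    """Smallest relabelled bond list over all element-preserving permutations."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        if any(atoms[perm[v]] != atoms[v] for v in range(n)):
            continue
        key = tuple(sorted((min(perm[u], perm[v]), max(perm[u], perm[v]), k)
                           for (u, v), k in bonds.items()))
        if best is None or key < best:
            best = key
    return best

def connected(n, bonds):
    """True if all n atoms are reachable from atom 0 along the bonds."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for a, b in bonds:
            w = b if a == v else a if b == v else None
            if w is not None and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def assemble(atoms, n_hydrogens):
    """Breadth-first extension: each tree level adds one unit of bond order."""
    n = len(atoms)
    total = sum(VALENCE[a] for a in atoms)
    if total < n_hydrogens or (total - n_hydrogens) % 2:
        raise ValueError("hydrogen count is incompatible with the valences")
    steps = (total - n_hydrogens) // 2   # bond-order units among heavy atoms
    level = {(): {}}                     # root node: atoms without bonds
    for _ in range(steps):
        nxt = {}
        for bonds in level.values():
            used = [0] * n
            for (u, v), k in bonds.items():
                used[u] += k
                used[v] += k
            for u, v in combinations(range(n), 2):
                if (used[u] < VALENCE[atoms[u]] and used[v] < VALENCE[atoms[v]]
                        and bonds.get((u, v), 0) < 3):
                    ext = dict(bonds)
                    ext[(u, v)] = ext.get((u, v), 0) + 1
                    # Keep one representative per canonical form (duplicate check).
                    nxt.setdefault(canonical(atoms, ext), ext)
        level = nxt
    return [b for b in level.values() if connected(n, b)]

c2h6o = assemble(["C", "C", "O"], 6)   # ethanol and dimethyl ether
c2h4o = assemble(["C", "C", "O"], 4)   # acetaldehyde, vinyl alcohol, oxirane
```

Every node of a level is an intermediate structure, and all of them are kept in memory until the next level is built, which is the storage cost of tree-based assembly.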
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when only one core is used; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
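Matrix generation can be sketched as a backtracking search that fills the upper triangle of a symmetric connectivity matrix so that every row sums to the valence of its atom. The sketch below is a simplified illustration with explicit hydrogens and hypothetical names (<code>connectivity_matrices</code>, <code>matrix_connected</code>). Unlike MOLGEN, it performs no symmetry-based pruning, so isomorphic matrices that differ only in atom numbering are all reported.

```python
def connectivity_matrices(valences, max_order=3):
    """All symmetric matrices with zero diagonal whose row sums equal the valences."""
    n = len(valences)
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    rem = list(valences)          # valence still unused at each atom
    out = []

    def fill(pos):
        if pos == len(cells):
            if all(r == 0 for r in rem):
                out.append([row[:] for row in matrix])
            return
        i, j = cells[pos]
        for k in range(min(max_order, rem[i], rem[j]) + 1):
            matrix[i][j] = matrix[j][i] = k
            rem[i] -= k
            rem[j] -= k
            # Row i is complete once column n-1 is set: prune if it is unsaturated.
            if j < n - 1 or rem[i] == 0:
                fill(pos + 1)
            rem[i] += k
            rem[j] += k
        matrix[i][j] = matrix[j][i] = 0

    fill(0)
    return out

def matrix_connected(matrix):
    """Depth-first search over the non-zero entries of the matrix."""
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(n):
            if matrix[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

formaldehyde = connectivity_matrices([4, 2, 1, 1])   # atoms C, O, H, H
acetylene = connectivity_matrices([4, 4, 1, 1])      # atoms C, C, H, H
four_h = connectivity_matrices([1, 1, 1, 1])         # atoms H, H, H, H
```

Only the current matrix is held in memory during the search, which reflects the low memory usage of constructive methods. For C<sub>2</sub>H<sub>2</sub> the sketch returns two matrices that describe the same molecule with the hydrogens swapped; removing such duplicates during generation is what orderly generation adds.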
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures are also considered, owing to the hypergraph. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
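The idea of starting from all possible bonds and deleting them can be illustrated with a toy bond-deletion search. It is only in the spirit of structure reduction and is not the COCOA algorithm: the sole constraint is an assumed target number of neighbours per atom, bond orders and substructures are ignored, and the name <code>reduce_hyperstructure</code> is hypothetical.

```python
from itertools import combinations

def reduce_hyperstructure(targets):
    """Delete bonds from the fully bonded start graph until every atom has
    exactly its target number of neighbours; returns a set of edge sets."""
    n = len(targets)
    start = frozenset(combinations(range(n), 2))   # every atom pair is bonded
    results, seen = set(), set()

    def degree(edges):
        deg = [0] * n
        for u, v in edges:
            deg[u] += 1
            deg[v] += 1
        return deg

    def recurse(edges):
        if edges in seen:          # the same reduced graph was reached before
            return
        seen.add(edges)
        deg = degree(edges)
        if any(d < t for d, t in zip(deg, targets)):
            return                 # an atom has too few neighbours: dead end
        if deg == list(targets):
            results.add(edges)
            return
        for u, v in edges:
            # Only bonds whose two atoms both have surplus neighbours may go.
            if deg[u] > targets[u] and deg[v] > targets[v]:
                recurse(edges - {(u, v)})

    recurse(start)
    return results

chain = reduce_hyperstructure([1, 2, 1])       # three atoms in a chain
ring = reduce_hyperstructure([2, 2, 2, 2])     # four-membered ring
```

With four atoms the start graph has 6 bonds, and in general it has <math>n(n-1)/2</math>, so the structure to be reduced grows quadratically with the number of atoms even before bond orders are considered.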
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Summary==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
9iatr2leac0z1vy7at620gstm1omcb0
8118
8117
2019-09-12T18:52:26Z
MehmetAzizYirik
145
wikitext
text/x-wiki
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator comes from ACD Labs, notably with the involvement of Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity; as a result, chemical graphs ([[wp:Molecular graph|molecular graphs]]) are generally multigraphs. A chemical graph is described as a multigraph <math>G = (V,E) </math>, where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate single, fully saturated molecules: a graph in which every atom is saturated can still consist of several disconnected fragments.
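The definitions above can be made concrete with a minimal Python sketch. It is only an illustration: the valence table, the bond dictionary layout and the function names are assumptions made for this example and are not taken from any of the generators discussed here.

```python
from collections import deque

# Illustrative valence table (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom in the multigraph: the sum of its bond orders.
    `bonds` maps an atom-index pair (i, j) to the bond order."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: one double bond (edge multiplicity 2) and two single bonds.
formaldehyde = (["C", "O", "H", "H"], {(0, 1): 2, (0, 2): 1, (0, 3): 1})
# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2 = (["H", "H", "H", "H"], {(0, 1): 1, (2, 3): 1})
```

The second example shows why saturation alone is not sufficient: all four hydrogen atoms reach their valence, yet the graph describes two molecules rather than one.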
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
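As a small worked example, the permutation of Table 1 can be written as a Python dictionary and composed with itself. The helper names <code>compose</code> and <code>is_permutation</code> are chosen for this sketch only.

```python
# The permutation f from Table 1, written as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

ff = compose(f, f)
print(ff[1])               # f(f(1)) = f(4) = 6
print(is_permutation(ff))  # True: the product is again a permutation
```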
A [[wp:Group theory|group]] <math>G</math> is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set <math>Sym(X)</math> of all permutations of <math>X</math> is a group, called the symmetric group (see [[wp:Permutation group|Permutation group]]). If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the automorphisms of the graph (see [[wp:Graph automorphism|Graph automorphism]]). An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG relies on NAUTY for these calculations.
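For very small graphs, the automorphism group can be found by testing every permutation of the vertices against the definition. The sketch below does exactly that; it is a brute-force illustration with <math>n!</math> candidates, not the refinement-based search that NAUTY performs, and the function name is an assumption of this example.

```python
from itertools import permutations

def automorphisms(n, edges):
    """Return all permutations of the vertices 0..n-1 that map every edge
    of the graph onto an edge of the graph (brute force over n! candidates)."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):  # perm[v] is the image of vertex v
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            result.append(perm)
    return result

# Carbon skeleton of cyclobutane: a ring of four vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Carbon skeleton of propane: a path of three vertices.
chain = [(0, 1), (1, 2)]

print(len(automorphisms(4, ring)))  # 8 of the 4! = 24 permutations
print(automorphisms(3, chain))      # [(0, 1, 2), (2, 1, 0)]
```

The ring has eight symmetries (four rotations and four reflections), whereas the chain has only the identity and the end-to-end flip.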
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a tree [[wp:Tree (graph theory)|Tree (graph theory)]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
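The generation tree described above can be sketched in Python as a breadth-first search. This is a deliberately simplified illustration and not the algorithm of ASSEMBLE or OMG: all hydrogens are explicit atoms, no substructures are given, the valence table and function name are assumptions, and duplicates are only detected when two branches reach exactly the same bond table (isomorphic structures with different atom numbering are not merged).

```python
from collections import deque

# Illustrative valence table (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def assemble(atoms, max_order=3):
    """Breadth-first structure assembly. The root node has no bonds; each
    child raises the order of one bond by one. Saturated, connected nodes
    are returned as bond tables mapping (i, j) to the bond order."""
    n = len(atoms)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    target = [VALENCE[a] for a in atoms]

    def degrees(state):
        deg = [0] * n
        for (i, j), order in zip(pairs, state):
            deg[i] += order
            deg[j] += order
        return deg

    def connected(state):
        adj = {i: set() for i in range(n)}
        for (i, j), order in zip(pairs, state):
            if order:
                adj[i].add(j)
                adj[j].add(i)
        seen, stack = {0}, [0]
        while stack:
            v = stack.pop()
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        return len(seen) == n

    root = (0,) * len(pairs)
    queue, visited, saturated = deque([root]), {root}, []
    while queue:
        state = queue.popleft()
        deg = degrees(state)
        if deg == target:  # leaf of the generation tree
            if connected(state):
                saturated.append({p: o for p, o in zip(pairs, state) if o})
            continue
        for k, (i, j) in enumerate(pairs):
            # Extend only where both atoms still have a free valence.
            if deg[i] < target[i] and deg[j] < target[j] and state[k] < max_order:
                child = state[:k] + (state[k] + 1,) + state[k + 1:]
                if child not in visited:
                    visited.add(child)
                    queue.append(child)
    return saturated

# CH4O: the only constitutional isomer is methanol, but it is reached in four
# atom numberings (any of the four hydrogens can be the one bonded to oxygen).
structures = assemble(["C", "O", "H", "H", "H", "H"])
print(len(structures))
```

Even for this six-atom formula the search visits many unsaturated intermediate nodes and returns the same molecule several times, which is the practical motivation for the canonicity checks used by real generators.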
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
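The role of canonical labelling in duplicate detection can be illustrated with a brute-force sketch. The function below tries every renumbering of the atoms and keeps the lexicographically smallest description; two structures are isomorphic exactly when these canonical forms are equal. This is an assumption-laden toy (the function name and data layout are invented here, and the cost grows as <math>n!</math>), whereas NAUTY reaches the same goal far more efficiently.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """Smallest (element sequence, upper-triangle bond orders) over all
    renumberings of the atoms. `bonds` maps (i, j) to the bond order."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old index] = new index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        matrix = [[0] * n for _ in range(n)]
        for (i, j), order in bonds.items():
            a, b = perm[i], perm[j]
            matrix[a][b] = matrix[b][a] = order
        key = (tuple(new_atoms),
               tuple(matrix[i][j] for i in range(n) for j in range(i + 1, n)))
        if best is None or key < best:
            best = key
    return best

atoms = ["C", "C", "O"]                # heavy atoms of C2H6O
ethanol_a = {(0, 1): 1, (1, 2): 1}     # C0-C1-O2
ethanol_b = {(0, 1): 1, (0, 2): 1}     # C1-C0-O2: same skeleton, other numbering
ether = {(0, 2): 1, (1, 2): 1}         # C0-O2-C1: dimethyl ether skeleton

print(canonical_form(atoms, ethanol_a) == canonical_form(atoms, ethanol_b))  # True
print(canonical_form(atoms, ethanol_a) == canonical_form(atoms, ether))      # False
```

Storing only canonical forms in a set removes isomorphic duplicates; orderly generation goes one step further and avoids constructing non-canonical intermediates in the first place.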
Constructive search algorithms are branch-and-bound methods [[wp:Branch and bound|Branch and bound]], which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorically possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
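The idea of treating generation as a matrix-filling problem can be sketched as follows. The code fills the upper triangle of a symmetric connectivity matrix entry by entry so that each row sum equals the valence of the corresponding atom, and only afterwards filters for connectivity. It is a plain backtracking illustration under assumed names and inputs; it does not reproduce the group-theoretical orderly generation of MOLGEN and therefore still returns isomorphic duplicates.

```python
def connectivity_matrices(valences, max_order=3):
    """Enumerate all symmetric matrices of bond orders whose row sums
    equal the given valences, by backtracking over the upper triangle."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    free = list(valences)  # remaining free valence of each atom
    results = []

    def fill(k):
        if k == len(pairs):
            if all(f == 0 for f in free):
                results.append([row[:] for row in matrix])
            return
        i, j = pairs[k]
        for order in range(min(max_order, free[i], free[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            free[i] -= order
            free[j] -= order
            # Row i is complete after its last column; prune if unsaturated.
            if j < n - 1 or free[i] == 0:
                fill(k + 1)
            free[i] += order
            free[j] += order
        matrix[i][j] = matrix[j][i] = 0

    fill(0)
    return results

def is_connected(matrix):
    """Depth-first search over the non-zero entries of the matrix."""
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(n):
            if matrix[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

# CH4O with atom order C, O, H, H, H, H.
all_matrices = connectivity_matrices([4, 2, 1, 1, 1, 1])
molecules = [m for m in all_matrices if is_connected(m)]
print(len(all_matrices), len(molecules))  # 10 4
```

Of the ten saturated matrices, six describe formaldehyde plus a separate H2 molecule and are rejected by the connectivity check; the remaining four are the four atom numberings of methanol. No intermediate structures are stored, only the matrix currently being filled.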
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
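The reduction strategy can be sketched in Python under strong simplifying assumptions: only single bonds are considered, no substructure constraints are used, and the target degree of every atom is given directly. The function name and data layout are hypothetical, and the code is not the COCOA implementation. It starts from the hyper structure that bonds every pair of atoms and recursively removes bonds from atoms that have too many.

```python
def reduce_hyperstructure(target_degree):
    """Start with all possible single bonds and recursively remove bonds
    until every atom has exactly its target degree."""
    n = len(target_degree)
    hyper = frozenset((i, j) for i in range(n) for j in range(i + 1, n))
    solutions, visited = set(), set()

    def degrees(edges):
        deg = [0] * n
        for i, j in edges:
            deg[i] += 1
            deg[j] += 1
        return deg

    def remove_bonds(edges):
        if edges in visited:  # this bond set was already reduced
            return
        visited.add(edges)
        deg = degrees(edges)
        if any(d < t for d, t in zip(deg, target_degree)):
            return  # an atom lost too many bonds: dead branch
        over = [v for v in range(n) if deg[v] > target_degree[v]]
        if not over:
            solutions.add(edges)  # every atom has its target degree
            return
        v = over[0]
        # Any valid structure lacks at least one bond of v: branch on each.
        for edge in sorted(e for e in edges if v in e):
            remove_bonds(edges - {edge})

    remove_bonds(hyper)
    return solutions

# Four atoms that each need two single bonds: the hyper structure has six
# bonds, and the reduction leaves the three possible four-membered rings.
rings = reduce_hyperstructure([2, 2, 2, 2])
print(len(rings))  # 3
```

In contrast to assembly, every node of this search tree is smaller than its parent. The number of bonds in the hyper structure grows quadratically with the number of atoms, which reflects the size problem noted above.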
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
3u1fm6vixw0rxeqb7yyqynq51k8gq66
8211
8118
2019-12-09T15:51:26Z
Daniel Mietchen
5
/* Chemical Graphs */ link fix
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of cheminformatics. Chemical Graph Generators are used in areas such as virtual library generation in drug design, for organic synthesis design or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. Here we describe the theoretical basis of chemical graph generators and provide a historical overview of their development.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator is from ACD Labs and, notably, Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, that is, the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule, and saturation alone does not guarantee connectivity. A molecule is saturated if all its atoms are saturated.
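The definitions above can be illustrated with a minimal Python sketch. The dictionary-based bond representation, the small valence table and the function names are assumptions made for this illustration and are not taken from any of the generators discussed here.

```python
from collections import defaultdict

# Illustrative subset of standard valences (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    adj = defaultdict(set)
    for i, j in bonds:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(atoms)

# Formaldehyde (CH2O): the C=O double bond is an edge of multiplicity 2.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

# Two separate H2 molecules: every atom is saturated, yet the graph is disconnected.
h4_atoms = ["H", "H", "H", "H"]
h4_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation and connectivity have to be checked separately: the four hydrogen atoms are all saturated, but they do not form one molecule.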
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
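As a small illustration, the permutation of Table 1 can be stored as a Python dictionary and composed with itself according to the rule <math>(ab)(x)=a(b(x))</math>. The helper names `compose` and `is_permutation` are chosen for this sketch only.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

# Composing f with itself yields another permutation of the same set.
ff = compose(f, f)
```

For instance, `ff[1]` is `f[f[1]] = f[4] = 6`, and `is_permutation(ff)` holds, in line with the closure property stated above.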
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, forms a group called the symmetric group; its subgroups are [[wp:Permutation group|permutation groups]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations of <math>X</math> that preserve the set system are its automorphisms; a graph is a set system whose blocks are its edges, which leads to the notion of a [[wp:Graph automorphism|graph automorphism]]. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math>, and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> if and only if <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for this purpose.
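For very small graphs, the automorphism group can be found by testing every permutation of the vertices against the definition. The following Python sketch does exactly that for simple, unlabelled graphs. It is a brute-force illustration with a running time of order <math>n!</math>, not the partition-refinement approach used by NAUTY, and the function name is an assumption of this example.

```python
from itertools import permutations

def automorphisms(n, edges):
    """Return all vertex permutations that map the edge set onto itself.

    Vertices are 0..n-1 and edges are unordered pairs. Bond orders and
    element labels are ignored in this simplified sketch.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # perm[u] is the image of vertex u under the permutation.
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result

# Carbon skeleton of propane: the path 0-1-2.
path_group = automorphisms(3, [(0, 1), (1, 2)])

# Carbon skeleton of cyclopropane: a triangle.
triangle_group = automorphisms(3, [(0, 1), (1, 2), (0, 2)])
```

The path has two automorphisms (the identity and the reflection that swaps the end atoms), whereas every one of the <math>3!=6</math> permutations is an automorphism of the triangle.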
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set, together with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. Using only 1 core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
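The idea of matrix generation can be sketched in Python for very small formulas. The sketch below is a naive brute-force enumeration and is not the algorithm of MASS, SMOG or MOLGEN: it tries every assignment of bond orders to pairs of heavy atoms, keeps the connected assignments whose free valences match the hydrogen count, and removes duplicates with an exhaustive canonical form instead of orderly generation. Hydrogen atoms are treated implicitly, and all names and the valence table are assumptions of this example.

```python
from itertools import combinations, permutations, product

# Illustrative subset of standard valences (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2}

def connected(n, bonds):
    """Depth-first search over pairs with a non-zero bond order."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for (i, j), order in bonds.items():
            if order and v in (i, j):
                w = j if v == i else i
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return len(seen) == n

def canonical(atoms, bonds):
    """Smallest bond list over all element-preserving relabellings."""
    n = len(atoms)
    best = None
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue  # only relabel atoms within the same element
        key = tuple(sorted(
            (min(p[i], p[j]), max(p[i], p[j]), o)
            for (i, j), o in bonds.items() if o
        ))
        if best is None or key < best:
            best = key
    return best

def generate(atoms, n_hydrogens):
    """Enumerate constitutional isomers for heavy atoms plus implicit hydrogens."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    isomers = set()
    # Upper triangle of the connectivity matrix: bond order 0..3 per atom pair.
    for orders in product(range(4), repeat=len(pairs)):
        bonds = dict(zip(pairs, orders))
        deg = [0] * n
        for (i, j), o in bonds.items():
            deg[i] += o
            deg[j] += o
        free = [VALENCE[a] - d for a, d in zip(atoms, deg)]
        # Free valences are filled by hydrogens, so they must match the formula.
        if min(free) < 0 or sum(free) != n_hydrogens:
            continue
        if not connected(n, bonds):
            continue
        isomers.add(canonical(atoms, bonds))
    return isomers

c2h6o = generate(["C", "C", "O"], 6)
c3h8o = generate(["C", "C", "C", "O"], 8)
```

For C<sub>2</sub>H<sub>6</sub>O the sketch finds two structures (ethanol and dimethyl ether), and for C<sub>3</sub>H<sub>8</sub>O it finds three (1-propanol, 2-propanol and ethyl methyl ether). The number of candidate matrices grows as <math>4^{n(n-1)/2}</math>, which is why practical generators rely on orderly generation and symmetry rather than on exhaustive enumeration followed by filtering.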
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the hypergraph contains all the bonds, overlaps in the substructures are also considered. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hypergraph becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
oqfdo7i6ntcu5k2yxon66ojbm5gaxyu
8212
8211
2019-12-09T15:54:23Z
Daniel Mietchen
5
/* Symmetry Groups for Molecular Graphs */ link fix
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of cheminformatics. Chemical Graph Generators are used in areas such as virtual library generation in drug design, for organic synthesis design or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. Here we describe the theoretical basis of chemical graph generators and provide a historical overview of their development.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>. It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator is from ACD Labs and, notably, Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule rather than a set of disconnected fragments. A molecule is saturated if all its atoms are saturated.
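The definitions above can be illustrated with a minimal Python sketch. The valence table, the dictionary-based bond representation and the function names are assumptions made for this illustration and are not taken from any of the generators discussed here.

```python
# Assumed standard valences; real generators use configurable tables.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom; bonds maps an index pair (i, j) to a bond order."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order  # edge multiplicity counts towards the degree
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in neighbours[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(atoms)

# Formaldehyde: C=O plus two C-H bonds.
formaldehyde = (["C", "O", "H", "H"], {(0, 1): 2, (0, 2): 1, (0, 3): 1})
# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2 = (["H", "H", "H", "H"], {(0, 1): 1, (2, 3): 1})

print(is_saturated(*formaldehyde), is_connected(*formaldehyde))
print(is_saturated(*two_h2), is_connected(*two_h2))
```

The second example shows why saturation alone is not sufficient: all four hydrogen atoms reach their valence, yet the result is two molecules rather than one.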
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is again a permutation.
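As a small worked example, the permutation of Table 1 can be stored as a Python dictionary and multiplied with itself or with another permutation. The helper names `compose` and `is_permutation` and the second permutation `g` are assumptions introduced only for this sketch.

```python
# Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, i.e. the mapping x -> a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are exactly its domain."""
    return sorted(p.values()) == sorted(p.keys())

# g swaps 1 and 2 and fixes everything else.
g = {x: x for x in f}
g[1], g[2] = 2, 1

ff = compose(f, f)
print(is_permutation(ff))      # the product is again a permutation
print(ff[1], ff[3])            # f(f(1)) = f(4) = 6 and f(f(3)) = f(11) = 3
print(compose(f, g)[1], compose(g, f)[1])  # f(g(1)) = 2, but g(f(1)) = 4
```

The last line shows that the order of the factors matters: multiplication of permutations is in general not commutative.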
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set, for example a set of numbers. Under function composition, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the symmetric group; its subgroups are [[wp:Permutation group|permutation groups]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations that preserve the set system are its automorphisms; for a graph, whose blocks are its edges, these are the [[wp:Graph automorphism|graph automorphisms]]. An automorphism permutes the vertices of a graph such that the graph is mapped onto itself; in other words, this action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for this purpose.
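For very small molecular graphs, the automorphism group can be listed by brute force, which makes the definition concrete. The sketch below is not how NAUTY works; it simply tests every vertex permutation, and it additionally assumes that an automorphism must preserve element types and bond orders. All names are illustrative.

```python
from itertools import permutations

def automorphisms(atoms, bonds):
    """List all vertex permutations that map the labelled multigraph onto itself.

    atoms: list of element symbols; bonds: dict mapping (i, j) to a bond order.
    Brute force over n! permutations, so only usable for small n.
    """
    n = len(atoms)
    edges = {frozenset(pair): order for pair, order in bonds.items()}
    found = []
    for perm in permutations(range(n)):
        # Assumption: an atom may only be mapped onto an atom of the same element.
        if any(atoms[perm[v]] != atoms[v] for v in range(n)):
            continue
        # Apply the permutation to every edge, keeping its multiplicity.
        image = {frozenset(perm[v] for v in edge): order
                 for edge, order in edges.items()}
        if image == edges:
            found.append(perm)
    return found

propane = (["C"] * 3, {(0, 1): 1, (1, 2): 1})              # C-C-C skeleton
propene = (["C"] * 3, {(0, 1): 2, (1, 2): 1})              # C=C-C skeleton
cyclobutane = (["C"] * 4, {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1})

print(automorphisms(*propane))           # identity and the end-to-end flip
print(len(automorphisms(*propene)))      # the double bond removes the flip
print(len(automorphisms(*cyclobutane)))  # symmetries of a square
```

The order of the group returned for the four-membered ring is 8, the number of symmetries of a square, whereas the propene skeleton has only the identity.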
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures, which provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set, together with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
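The generation tree described above can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not the algorithm of ASSEMBLE or OMG: only heavy atoms are represented, hydrogens are implicit and fill the remaining valences, no substructure constraints are supported, and duplicates are removed with a brute-force canonical form instead of a dedicated tool such as NAUTY. The names `generate` and `canonical` and the valence table are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed standard valences

def canonical(atoms, bonds):
    """Smallest bond list over all relabellings that keep element types."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        ))
        if best is None or key < best:
            best = key
    return best

def is_connected(n, key):
    adj = {v: set() for v in range(n)}
    for i, j, _ in key:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def generate(atoms, n_hydrogens):
    """Breadth-first assembly of heavy-atom skeletons for a molecular formula."""
    n = len(atoms)
    free = sum(VALENCE[a] for a in atoms) - n_hydrogens
    if free < 0 or free % 2:
        return []
    level = {()}  # root of the generation tree: no bonds yet
    for _ in range(free // 2):  # each tree level adds one unit of bond order
        next_level = set()
        for key in level:
            bonds = {(i, j): o for i, j, o in key}
            used = [0] * n
            for (i, j), o in bonds.items():
                used[i] += o
                used[j] += o
            for i, j in combinations(range(n), 2):
                if (used[i] < VALENCE[atoms[i]] and used[j] < VALENCE[atoms[j]]
                        and bonds.get((i, j), 0) < 3):
                    ext = dict(bonds)
                    ext[(i, j)] = ext.get((i, j), 0) + 1
                    next_level.add(canonical(atoms, ext))  # drop duplicates
        level = next_level
    return sorted(k for k in level if is_connected(n, k))

print(generate(["C", "C", "O"], 6))  # C2H6O: ethanol and dimethyl ether
```

Every level of the tree is kept in memory as a set of canonical intermediate structures, which mirrors the storage cost that breadth-first assembly methods have to pay.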
Another assembly method is GENOA. Unlike ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm that assembles different substructures while also considering their overlaps. CHEMICS is another well-known CASE system that provides a novel structure generation algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, only canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN; the problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. When PMG runs on a single core, MOLGEN outperforms it; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Applied group theory is used in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles.
The key feature of MOLGEN is that it generates structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
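The idea of matrix generation with a canonicity test can be sketched as follows. This is a simplified illustration and not the MOLGEN, MASS or SMOG algorithm: bond-order matrices are filled cell by cell with backtracking, a branch is abandoned as soon as a completed row cannot be saturated, and a complete matrix is accepted only if it is the lexicographically largest member of its orbit under element-preserving permutations. Real orderly generators also reject non-canonical partial matrices, which this sketch does not attempt. All names and the valence table are assumptions.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed standard valences

def matrices(atoms, max_order=3):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(atoms)
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m = [[0] * n for _ in range(n)]
    left = [VALENCE[a] for a in atoms]  # unused valence per atom

    def fill(k):
        if k == len(cells):
            if all(v == 0 for v in left):
                yield tuple(tuple(row) for row in m)
            return
        i, j = cells[k]
        for order in range(min(max_order, left[i], left[j]) + 1):
            m[i][j] = m[j][i] = order
            left[i] -= order
            left[j] -= order
            # Bound: once row i is complete, atom i must be saturated.
            if j < n - 1 or left[i] == 0:
                yield from fill(k + 1)
            left[i] += order
            left[j] += order
        m[i][j] = m[j][i] = 0

    yield from fill(0)

def is_connected(m):
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(len(m)):
            if m[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(m)

def is_canonical(atoms, m):
    """True if no element-preserving relabelling gives a larger matrix."""
    n = len(atoms)
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue
        image = tuple(tuple(m[p[i]][p[j]] for j in range(n)) for i in range(n))
        if image > m:
            return False
    return True

def isomers(atoms):
    """Connected, saturated, canonical matrices: one per constitutional isomer."""
    return [m for m in matrices(atoms)
            if is_connected(m) and is_canonical(atoms, m)]

print(len(isomers(["C", "C", "O", "H", "H", "H", "H"])))  # C2H4O
```

Because each matrix is tested on its own, no list of previously generated structures has to be stored, which is the memory advantage that the text attributes to constructive search.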
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures are also considered, owing to the hypergraph. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, resulting in a proportional increase in the run time.
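The reduction strategy can be sketched as a recursive bond-removal procedure. The example below is a strongly simplified illustration and not the COCOA algorithm: it assumes single bonds only, takes the required number of neighbours of each atom as the only constraint (a stand-in for atom-centred fragments), and optionally removes forbidden bonds from the initial hypergraph. The function name and its parameters are hypothetical.

```python
from itertools import combinations

def reduce_hypergraph(targets, forbidden=frozenset()):
    """Yield bond sets obtained by deleting bonds from the fully bonded graph.

    targets[i] is the required number of neighbours of atom i.
    forbidden is a set of index pairs (i, j), i < j, that may never be bonded.
    """
    n = len(targets)
    bonds = [b for b in combinations(range(n), 2) if b not in forbidden]
    degree = [0] * n
    for i, j in bonds:  # the hypergraph: every allowed bond is present
        degree[i] += 1
        degree[j] += 1

    def recurse(k, kept):
        # Bound: deletions only lower degrees, so a deficit cannot be repaired.
        if any(d < t for d, t in zip(degree, targets)):
            return
        if degree == list(targets):
            # Bonds kept so far plus all undecided bonds form a solution.
            yield tuple(kept + bonds[k:])
            return
        if k == len(bonds):
            return
        i, j = bonds[k]
        degree[i] -= 1  # branch 1: delete bond k
        degree[j] -= 1
        yield from recurse(k + 1, kept)
        degree[i] += 1
        degree[j] += 1
        yield from recurse(k + 1, kept + [bonds[k]])  # branch 2: keep bond k

    yield from recurse(0, [])

# Four atoms: two terminal atoms (one neighbour) and two chain atoms (two neighbours).
print(sorted(reduce_hypergraph([1, 2, 2, 1])))
print(sorted(reduce_hypergraph([1, 2, 2, 1], forbidden={(0, 2)})))
```

The structures shrink at every step of the recursion, and the number of bonds in the initial hypergraph grows quadratically with the number of atoms, which illustrates why large hypergraphs make reduction methods expensive.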
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
f7nhpm64tx97ntpb0k32xdm28v1s22l
8213
8212
2019-12-10T09:31:42Z
MehmetAzizYirik
145
/* Abstract */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of cheminformatics. Chemical graph generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained, efficient, generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule rather than a set of disconnected fragments. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition:
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is again a permutation.
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
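The group conditions listed above can be checked mechanically for small sets of permutations. The sketch below does this for the six permutations of a three-element set; permutations are stored as tuples of images, and the names `compose` and `check_group` are assumptions made for this illustration.

```python
from itertools import permutations, product

def compose(a, b):
    """Return the permutation x -> a(b(x)); a permutation is a tuple of images."""
    return tuple(a[b[x]] for x in range(len(b)))

def check_group(elements):
    """Test closure, associativity, identity and inverses under composition."""
    elements = set(elements)
    n = len(next(iter(elements)))
    identity = tuple(range(n))
    closed = all(compose(a, b) in elements
                 for a, b in product(elements, repeat=2))
    associative = all(compose(compose(a, b), c) == compose(a, compose(b, c))
                      for a, b, c in product(elements, repeat=3))
    has_identity = identity in elements
    has_inverses = all(any(compose(g, h) == identity for h in elements)
                       for g in elements)
    return closed and associative and has_identity and has_inverses

sym3 = list(permutations(range(3)))          # all permutations of {0, 1, 2}
print(len(sym3), check_group(sym3))          # 3! = 6 elements, forms a group
print(check_group([(0, 1, 2), (1, 0, 2)]))   # identity plus one swap: a subgroup
print(check_group([(0, 1, 2), (1, 2, 0)]))   # not closed, hence not a group
```

The last call fails because composing the rotation with itself gives a permutation that is missing from the set, so closure is violated.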
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set, for example a set of numbers. Under function composition, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the symmetric group; its subgroups are [[wp:Permutation group|permutation groups]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations that preserve the set system are its automorphisms; for a graph, whose blocks are its edges, these are the [[wp:Graph automorphism|graph automorphisms]]. An automorphism permutes the vertices of a graph such that the graph is mapped onto itself; in other words, this action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for this purpose.
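Canonical labelling can be illustrated with a brute-force sketch: among all renumberings of the atoms, the one that gives the smallest description of the molecule is chosen as the canonical form, so that two differently numbered copies of the same molecule become identical. This is only an illustration for very small graphs and is not the algorithm used by NAUTY; the function name and the data layout are assumptions.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """Smallest (atom labels, bond list) pair over all renumberings of the atoms.

    atoms: list of element symbols; bonds: dict mapping (i, j) to a bond order.
    Brute force over n! renumberings, so only usable for small n.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the new number of atom i.
        new_atoms = [None] * n
        for i in range(n):
            new_atoms[perm[i]] = atoms[i]
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        )
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best

chain = {(0, 1): 1, (1, 2): 1}  # a three-atom chain 0-1-2
ethanol_a = canonical_form(["C", "C", "O"], chain)  # C-C-O
ethanol_b = canonical_form(["O", "C", "C"], chain)  # O-C-C, same molecule
ether = canonical_form(["C", "O", "C"], chain)      # C-O-C, dimethyl ether

print(ethanol_a == ethanol_b)  # same molecule, same canonical form
print(ethanol_a == ether)      # different molecules, different forms
```

Comparing canonical forms therefore replaces pairwise isomorphism tests during duplicate detection.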
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set, together with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. On a single core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles.
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the generation works on the hypergraph, overlaps in the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hypergraph becomes extremely large, resulting in a proportional increase in the run time.
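The bond-removal idea can be sketched in a few lines. The following is only an illustration under simplifying assumptions, not COCOA's actual algorithm: all bonds are single, the constraint on each atom is a hypothetical target degree, and known substructure bonds are passed as a set of `required` bonds that may not be deleted. The function name `reduce_structure` is made up for this sketch.

```python
from itertools import combinations

def reduce_structure(target, required=frozenset()):
    """Delete bonds from the fully bonded graph until each atom has its target degree."""
    n = len(target)
    hyper = list(combinations(range(n), 2))  # starting point: every atom pair is bonded
    results = []

    def recurse(bonds, degree, start):
        if degree == target:
            results.append(frozenset(bonds))
            return
        # Only bonds at index >= start are candidates, so each set of
        # deleted bonds is visited exactly once.
        for idx in range(start, len(hyper)):
            i, j = hyper[idx]
            if hyper[idx] in required:
                continue  # bond is fixed by a known substructure
            if degree[i] <= target[i] or degree[j] <= target[j]:
                continue  # deleting would leave an atom under-bonded
            degree[i] -= 1
            degree[j] -= 1
            recurse(bonds - {hyper[idx]}, degree, idx + 1)
            degree[i] += 1
            degree[j] += 1

    recurse(frozenset(hyper), [n - 1] * n, 0)
    return results

# Four atoms that must each keep two neighbours: the labelled 4-membered rings.
rings = reduce_structure([2, 2, 2, 2])
# The same problem when the bond between atoms 0 and 1 is known in advance.
rings_with_known_bond = reduce_structure([2, 2, 2, 2], required={(0, 1)})
print(len(rings), len(rings_with_known_bond))
```

The structures shrink at every step of the recursion, and a known bond prunes the search. The results are labelled graphs, so a duplicate check would still be needed to keep one representative per isomorphism class.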
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
nq43dhlki1m6dushoqkjhlrxcdw6aqm
8214
8213
2019-12-11T11:33:28Z
MehmetAzizYirik
145
/* Symmetry Groups for Molecular Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of cheminformatics. Chemical Graph Generators are used in areas such as virtual library generation in drug design, in organic synthesis design and in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data structures require less memory. As no spectral data were interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref>J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open-source structure generator released under the GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator comes from ACD Labs, where the developers notably include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph to itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, that is, the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
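As a minimal sketch of these definitions (not taken from any particular generator), a chemical multigraph can be stored as a map from atom pairs to bond orders. The valence table `VALENCE` and the helper names below are assumptions made for illustration only.

```python
from collections import Counter, deque

# Illustrative valence table (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges."""
    deg = Counter()
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return [deg[i] for i in range(len(atoms))]

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: a C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2_atoms = ["H", "H", "H", "H"]
two_h2_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation alone is not sufficient: all four hydrogens reach their valence, yet the graph falls into two components, so a separate connectivity check is needed.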
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
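The composition rule can be tried out on the permutation of Table 1. This is a small illustrative sketch; the dictionary representation and the helper names `compose`, `inverse` and `is_permutation` are choices made here, not part of the cited text.

```python
# The permutation f from Table 1, stored as a map x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def inverse(a):
    """The permutation that undoes a."""
    return {image: x for x, image in a.items()}

def is_permutation(p):
    """A map is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

identity = {x: x for x in f}
ff = compose(f, f)            # f applied twice
print(ff[1])                  # f(f(1)) = f(4) = 6
print(is_permutation(ff))     # the product is again a permutation
print(compose(f, inverse(f)) == identity)
```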
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math>g^{-1}</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|permutation group]] known as the symmetric group. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism is a permutation of the vertices that maps a graph onto itself. This action preserves the edge-vertex connectivity.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG uses NAUTY for these calculations.
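For very small graphs, the automorphism group can be listed by brute force, which makes the definition concrete. The sketch below is an illustration only: it tests all <math>n!</math> vertex permutations, treats edges as unordered pairs, and ignores element labels and bond orders. Packages such as NAUTY use far more efficient techniques.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex permutations that map the edge set of the graph onto itself."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            found.append(perm)
    return found

# Carbon skeleton of a four-membered ring (a 4-cycle).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Carbon skeleton of a three-atom chain.
chain = [(0, 1), (1, 2)]

print(len(automorphisms(4, ring)))   # 4 rotations and 4 reflections
print(len(automorphisms(3, chain)))  # identity and end-to-end flip
print(len(automorphisms(4, [])))     # no edges: every permutation, 4! in total
```

The edgeless case recovers the full symmetric group, in line with the order <math>n!</math> given above.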
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set, together with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
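The generation tree described above can be illustrated with a toy breadth-first assembler. This is a sketch under stated assumptions, not the algorithm of ASSEMBLE or OMG: the valence table, the maximum bond order of 3, and the brute-force canonical form (the smallest bond tuple over all element-preserving relabellings) are choices made here for brevity and only scale to a handful of atoms.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # illustrative valence table
MAX_ORDER = 3                                # assumed maximum bond order

def canonical(atoms, state, pairs):
    """Smallest relabelled bond tuple over all element-preserving permutations."""
    n = len(atoms)
    order = dict(zip(pairs, state))
    best = None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        relabelled = tuple(order[tuple(sorted((perm[i], perm[j])))] for i, j in pairs)
        if best is None or relabelled < best:
            best = relabelled
    return best

def free_valences(atoms, state, pairs):
    free = [VALENCE[a] for a in atoms]
    for (i, j), k in zip(pairs, state):
        free[i] -= k
        free[j] -= k
    return free

def is_connected(n, state, pairs):
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for (i, j), k in zip(pairs, state):
            if k and v in (i, j):
                w = j if v == i else i
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return len(seen) == n

def assemble(atoms):
    """Breadth-first generation tree: each level adds one bond to every node."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    level = {tuple(0 for _ in pairs)}  # root node: the atom set without bonds
    finished = set()
    while level:
        next_level = set()
        for state in level:
            free = free_valences(atoms, state, pairs)
            if not any(free):  # saturated leaf of the tree
                if is_connected(n, state, pairs):
                    finished.add(state)
                continue
            for idx, (i, j) in enumerate(pairs):
                if free[i] and free[j] and state[idx] < MAX_ORDER:
                    child = state[:idx] + (state[idx] + 1,) + state[idx + 1:]
                    next_level.add(canonical(atoms, child, pairs))  # merge duplicates
        level = next_level
    return finished

isomers = assemble(["C", "C", "O", "H", "H"])  # molecular formula C2H2O
print(len(isomers))
```

For C2H2O the sketch ends with three connected, saturated graphs, corresponding to the bonding patterns of ketene, ethynol and oxirene. Every level of the tree is held in memory, which reflects the storage cost of intermediate structures.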
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. On a single core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which address the memory problems described above. These methods are matrix generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Owing to the hypergraph, overlaps in the substructures are also considered. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
k5j2zeswhbif9o8dvzr463xz8p87log
8215
8214
2019-12-11T11:35:32Z
MehmetAzizYirik
145
/* Structure Assembly */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of cheminformatics. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of computational biology. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although their reports are from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator is from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can form. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule rather than a set of disconnected fragments. A molecule is saturated if all its atoms are saturated.
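The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming that a chemical graph is stored as a symmetric matrix of bond orders; the names <code>VALENCE</code>, <code>is_saturated</code> and <code>is_connected</code> are hypothetical and are not taken from any of the generators discussed here.

```python
# Illustrative valence table (assumption: standard valences of neutral atoms).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degree(bonds, i):
    """Degree of atom i in the multigraph: the sum of the bond orders in row i."""
    return sum(bonds[i])

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom has reached its valence."""
    return all(degree(bonds, i) == VALENCE[a] for i, a in enumerate(atoms))

def is_connected(bonds):
    """Depth-first search: connected if every atom is reachable from atom 0."""
    n = len(bonds)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if bonds[u][v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = [
    [0, 2, 1, 1],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2 = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
```

The second example shows why saturation alone is not a sufficient test: four hydrogen atoms paired into two H2 molecules pass <code>is_saturated</code> but fail <code>is_connected</code>.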
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
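As a small worked example, the permutation of Table 1 can be multiplied with itself. The sketch below stores a permutation as a Python dictionary; the helper names <code>compose</code> and <code>is_permutation</code> are illustrative assumptions.

```python
# Permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are exactly the elements of its domain."""
    return sorted(p.values()) == sorted(p.keys())

# The product of f with itself is again a permutation of 1..11.
ff = compose(f, f)
```

For instance, <code>ff[1]</code> is <code>f(f(1)) = f(4) = 6</code>, and <code>is_permutation(ff)</code> holds, in line with the statement that the combination of two permutations is also a permutation.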
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|permutation group]] called the symmetric group on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations that preserve these blocks are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG builds on NAUTY.
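For very small graphs, the automorphism group can be listed by testing every vertex permutation against the definition. The sketch below is a brute-force illustration only: it is not the algorithm used by NAUTY, it treats the graph as a simple graph (bond multiplicities and element types are ignored), and the function name <code>automorphisms</code> is an assumption.

```python
from itertools import permutations

def automorphisms(n, edges):
    """Brute-force Aut(G): all vertex permutations that map the edge set onto itself.

    Vertices are 0..n-1 and edges are unordered pairs. The search visits all n!
    permutations, so it is only practical for very small graphs.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result

# Carbon skeleton of propane: the path 0 - 1 - 2.
propane = automorphisms(3, [(0, 1), (1, 2)])
# Carbon skeleton of cyclopropane: a triangle.
cyclopropane = automorphisms(3, [(0, 1), (1, 2), (0, 2)])
```

The path has two automorphisms (the identity and the reflection that swaps the two end atoms), whereas all six permutations of the triangle are automorphisms.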
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
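The generation tree described above can be sketched as a breadth-first search. The example is a simplified illustration rather than the implementation of ASSEMBLE or OMG: it works on heavy atoms only, treats hydrogens as implicit, takes no substructures as input, and does not remove isomorphic duplicates. The names <code>VALENCE</code> and <code>assemble</code> are hypothetical.

```python
from itertools import combinations

VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative heavy-atom valences

def assemble(atoms, n_hydrogens):
    """Breadth-first assembly over heavy atoms with implicit hydrogens.

    A node of the generation tree is a tuple of bond orders, one per atom pair.
    Returns the labelled leaves: connected structures whose free valences are
    exactly filled by the hydrogens. Isomorphic duplicates are NOT removed.
    """
    pairs = list(combinations(range(len(atoms)), 2))
    free = sum(VALENCE[a] for a in atoms) - n_hydrogens
    if free < 0 or free % 2:
        return []
    target = free // 2  # heavy-atom bonds, counted with multiplicity

    def used(state, i):
        """Valence of atom i already consumed by heavy-atom bonds."""
        return sum(o for o, p in zip(state, pairs) if i in p)

    def connected(state):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for o, (i, j) in zip(state, pairs):
                if o and u in (i, j):
                    v = j if u == i else i
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == len(atoms)

    level = {tuple(0 for _ in pairs)}  # root of the tree: atoms only, no bonds
    for _ in range(target):            # one tree level per added bond
        nxt = set()
        for state in level:
            for k, (i, j) in enumerate(pairs):
                if (state[k] < 3
                        and used(state, i) < VALENCE[atoms[i]]
                        and used(state, j) < VALENCE[atoms[j]]):
                    nxt.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = nxt
    return sorted(s for s in level if connected(s))

# C2H6O: heavy atoms C0, C1, O2; the atom pairs are (0,1), (0,2), (1,2).
leaves = assemble(["C", "C", "O"], 6)
```

For C2H6O the search returns three labelled leaves. Two of them are the same molecule (ethanol) with the carbon labels exchanged and the third is dimethyl ether, which illustrates why duplicate detection is needed in tree-based generators.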
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, only canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
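The role of canonical labelling in duplicate detection can be illustrated with a brute-force canonical form. This is only a sketch under strong assumptions: NAUTY computes canonical labellings with a far more efficient refinement-based search, whereas the hypothetical function <code>canonical_form</code> below simply tries every relabelling of the atoms and keeps the smallest result.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """Brute-force canonical form of a labelled structure.

    Every relabelling of the atoms is tried, and the smallest pair
    (element sequence, bond-order matrix) is returned. Two structures are
    isomorphic exactly when their canonical forms are equal.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        labels = tuple(atoms[p] for p in perm)
        matrix = tuple(
            tuple(bonds[perm[i]][perm[j]] for j in range(n)) for i in range(n)
        )
        key = (labels, matrix)
        if best is None or key < best:
            best = key
    return best

atoms = ["C", "C", "O"]
ethanol_a = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # C1 - C0 - O
ethanol_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # C0 - C1 - O
ether = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]      # C0 - O - C1

unique = {canonical_form(atoms, m) for m in (ethanol_a, ethanol_b, ether)}
```

The two ethanol skeletons differ only in atom numbering and therefore share one canonical form, so <code>unique</code> contains two entries. An orderly generator avoids storing such duplicates by accepting only extensions that keep the structure in canonical form.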
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which address the memory problems described above. These methods are matrix generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
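The matrix view of generation can be illustrated with a naive enumerator. The sketch below is an assumption-laden toy, not the algorithm of MASS, SMOG or MOLGEN: it lists every symmetric bond-order matrix for a given atom list and keeps those that describe a connected, saturated molecule, without any orderly generation or duplicate removal. The names <code>VALENCE</code> and <code>connectivity_matrices</code> are hypothetical.

```python
from itertools import combinations, product

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # illustrative valences

def connectivity_matrices(atoms, max_order=3):
    """Yield every symmetric bond-order matrix that describes a connected,
    saturated molecule. Candidates are tested one at a time, so no
    intermediate structures are stored."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    for orders in product(range(max_order + 1), repeat=len(pairs)):
        m = [[0] * n for _ in range(n)]
        for (i, j), o in zip(pairs, orders):
            m[i][j] = m[j][i] = o
        # Saturation: each row sum must equal the valence of the atom.
        if any(sum(m[i]) != VALENCE[atoms[i]] for i in range(n)):
            continue
        # Connectivity: depth-first search from atom 0.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in range(n):
                if m[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(seen) == n:
            yield m

# Formaldehyde, CH2O, with explicit hydrogens.
solutions = list(connectivity_matrices(["C", "O", "H", "H"]))
```

For CH2O exactly one matrix survives, the one with a C=O double bond and two C-H single bonds. The number of candidate matrices grows exponentially with the number of atom pairs, which is why practical matrix generators prune the search with orderly generation instead of enumerating everything.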
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Owing to the hypergraph, overlaps in the substructures are also considered. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
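The bond-deletion idea can be sketched as follows. This is a heavily simplified illustration and not the COCOA procedure: it ignores substructure constraints and bond multiplicity, starts from the complete graph over the heavy atoms, and deletes one bond per level until the required number of bonds remains. The function name <code>reduce_structures</code> and its parameters are assumptions.

```python
from itertools import combinations

def reduce_structures(max_degree, n_bonds):
    """Structure reduction over single bonds.

    max_degree[i] is the largest number of skeletal bonds atom i may keep.
    The search starts from the hyper structure (every atom pair bonded) and
    deletes one bond per level until n_bonds remain. Returns the connected
    results that respect max_degree, as sorted lists of atom pairs.
    """
    n = len(max_degree)
    hyper = frozenset(combinations(range(n), 2))  # every atom pair is bonded
    level = {hyper}
    for _ in range(len(hyper) - n_bonds):
        level = {state - {bond} for state in level for bond in state}

    def valid(state):
        deg = [0] * n
        for i, j in state:
            deg[i] += 1
            deg[j] += 1
        if any(d > m for d, m in zip(deg, max_degree)):
            return False
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for i, j in state:
                if u in (i, j):
                    v = j if u == i else i
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == n

    return sorted(sorted(s) for s in level if valid(s))

# Heavy atoms C0, C1, O2 of C2H6O: two single bonds must survive.
results = reduce_structures([4, 4, 2], 2)
```

For this three-atom case the hyper structure has only three bonds, and removing one of them yields the same three labelled skeletons as an assembly approach. The number of bonds in the hyper structure grows quadratically with the number of atoms, which reflects the size problem described above.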
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
5459pvwzwr1iuxrm2m1tti2gmok6glm
8216
8215
2019-12-11T14:17:33Z
MehmetAzizYirik
145
/* Abstract */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic in [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as valences, bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the DENDRAL project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of stereoisomers, symmetry group calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an adjacency matrix generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 
211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a hypergraph is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. 
Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as HMBC and COSY data to generate all possible structures. LSD is still used as an open source structure generator with GPL (General Public License). As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 
43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, HSQC and other NMR data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>. It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the HMBC signals are implemented in a connection matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on automorphism groups from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. 
The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained, efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of computational group theory. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on two methods: canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An automorphism of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules. A molecule is saturated if all its atoms are saturated.
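The definitions above can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only: atoms are a list of element symbols, bonds are a dictionary mapping an index pair to a bond order, and the names <code>VALENCE</code>, <code>degrees</code>, <code>is_saturated</code> and <code>is_connected</code> are hypothetical helpers, not part of any generator discussed here.

```python
from collections import deque

# Illustrative subset of standard valences (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom's degree equals its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(degrees(atoms, bonds), is_saturated(atoms, bonds), is_connected(atoms, bonds))
```

Storing the bond order as the dictionary value is one simple way to encode the edge multiplicity of a multigraph; an adjacency matrix with integer entries would serve equally well.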
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
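As a small illustration, the permutation of Table 1 and the composition rule can be written in Python. Representing a permutation as a dictionary, and the helper names <code>compose</code>, <code>inverse</code> and <code>is_permutation</code>, are assumptions of this sketch.

```python
# Table 1 as a dictionary: x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def inverse(p):
    """Return the permutation that undoes p."""
    return {y: x for x, y in p.items()}

def is_permutation(p):
    """A mapping is a permutation if its values rearrange its keys."""
    return sorted(p.keys()) == sorted(p.values())

identity = {x: x for x in f}
ff = compose(f, f)                                # f applied twice
print(is_permutation(ff))                         # the product is again a permutation
print(compose(f, inverse(f)) == identity)         # f times its inverse is the identity
```

Here <code>ff[1]</code> is <code>f(f(1)) = f(4) = 6</code>, which can be read directly from Table 1.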
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set of numbers. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|permutation group]], the symmetric group on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called blocks. A simple graph can be viewed as a set system whose blocks are its edges, and the permutations preserving the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
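The definition of an automorphism can be checked directly on small graphs. The following Python sketch tests every vertex permutation and keeps those that map the edge set onto itself. It is a brute-force illustration for simple graphs only, under the assumption that vertices are numbered from 0; its cost grows as <math>n!</math>, and it is not how NAUTY works internally. The function name <code>automorphisms</code> is hypothetical.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All permutations of the vertices 0..n-1 that preserve the edge set."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # Image of every edge {u, v} under the permutation: {perm[u], perm[v]}.
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result

# Carbon skeleton of cyclobutane: a cycle on four vertices.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Carbon skeleton of n-butane: a path on four vertices.
path4 = [(0, 1), (1, 2), (2, 3)]

print(len(automorphisms(4, cycle4)))  # order of Aut(G) for the cycle
print(len(automorphisms(4, path4)))   # order of Aut(G) for the path
```

The cycle has four rotations and four reflections, whereas the path only has the identity and the end-to-end reversal, so the more symmetric skeleton has the larger automorphism group.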
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
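The generation tree described above can be sketched in Python as a level-by-level (breadth-first) extension, where each step adds one unit of bond order. Several simplifications are assumptions of this sketch and are not taken from ASSEMBLE or OMG: hydrogens are treated implicitly, only heavy atoms are listed, substructure constraints are omitted, and duplicates are removed with a brute-force canonical form that tries every relabelling. All function names are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # heavy atoms only; hydrogens are implicit

def degree(n, bonds):
    deg = [0] * n
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def canonical(atoms, bonds):
    """Smallest relabelled bond list over all element-preserving permutations."""
    n, best = len(atoms), None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()))
        if best is None or key < best:
            best = key
    return best

def connected(n, bonds):
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for i, j in bonds:
            w = j if i == v else i if j == v else None
            if w is not None and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def assemble(atoms, n_hydrogens):
    """Heavy-atom skeletons (bond dictionaries) for the given formula."""
    n = len(atoms)
    free = sum(VALENCE[a] for a in atoms) - n_hydrogens
    if free < 0 or free % 2:
        return []
    level = {(): {}}                       # root of the tree: no bonds yet
    for _ in range(free // 2):             # each level adds one unit of bond order
        next_level = {}
        for bonds in level.values():
            deg = degree(n, bonds)
            for i, j in combinations(range(n), 2):
                if (deg[i] < VALENCE[atoms[i]] and deg[j] < VALENCE[atoms[j]]
                        and bonds.get((i, j), 0) < 3):
                    new = dict(bonds)
                    new[(i, j)] = new.get((i, j), 0) + 1
                    next_level.setdefault(canonical(atoms, new), new)
        level = next_level
    # The remaining valences are filled by the implicit hydrogens.
    return [b for b in level.values() if connected(n, b)]

print(assemble(["C", "C", "O"], 6))  # C2H6O: ethanol and dimethyl ether skeletons
```

Keeping every structure of a level in memory mirrors the storage problem of breadth-first generators; the brute-force canonical form is only practical for a handful of heavy atoms.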
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
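The idea of treating generation as a numerical matrix-filling problem can be sketched as follows. The Python function below enumerates every symmetric matrix with a zero diagonal whose row sums equal the given atom valences, filling the upper triangle entry by entry with backtracking. It is an illustration under stated assumptions, not the algorithm of MASS, SMOG or MOLGEN: all atoms (including hydrogens) are explicit, bond orders are capped at 3, and neither duplicate (isomorphic) matrices nor disconnected results are filtered out. The name <code>connectivity_matrices</code> is hypothetical.

```python
def connectivity_matrices(valences, max_order=3):
    """All symmetric bond-order matrices whose row sums equal the valences."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)          # valence still unused at each atom
    results = []

    def fill(k):
        if k == len(pairs):
            if all(r == 0 for r in remaining):   # every atom is saturated
                results.append([row[:] for row in matrix])
            return
        i, j = pairs[k]
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    fill(0)
    return results

# Formaldehyde, CH2O, with atom order C, O, H, H.
for row in connectivity_matrices([4, 2, 1, 1])[0]:
    print(row)
```

Only one matrix is kept in memory while the search runs, which reflects the low memory usage attributed to constructive search. An orderly generator would additionally accept only canonical matrices so that each isomer is produced once; that step is not implemented here.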
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph, and then reduce the size of this graph with respect to the constraints. The generation tree therefore starts with the hypergraph, and the structures decrease in size at each step. First, the existence of substructures in the hypergraph is checked. Bonds are then deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because generation works on the hypergraph, overlaps among substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hyper structure becomes extremely large, and the run time increases accordingly.
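The reduction strategy can be illustrated with a deliberately simplified Python sketch. It starts from a hyper structure containing every candidate bond and recursively deletes bonds until each atom has a prescribed number of neighbours. The simplifications are assumptions of this sketch and do not reproduce COCOA: only single bonds are considered, the sole constraint is a target degree per atom, and no atom-centred fragments are used. The name <code>reduce_structure</code> is hypothetical.

```python
from itertools import combinations

def reduce_structure(target_degree, candidate_bonds):
    """Every way of deleting bonds so that each atom reaches its target degree."""
    bonds = sorted(candidate_bonds)
    degree = [0] * len(target_degree)
    for i, j in bonds:                  # degrees in the full hyper structure
        degree[i] += 1
        degree[j] += 1
    results = []

    def recurse(k, kept):
        # Deletions can only lower degrees, so an atom below target is a dead end.
        if any(d < t for d, t in zip(degree, target_degree)):
            return
        if k == len(bonds):
            if degree == list(target_degree):
                results.append(kept)
            return
        i, j = bonds[k]
        degree[i] -= 1                  # branch 1: delete bond k
        degree[j] -= 1
        recurse(k + 1, kept)
        degree[i] += 1                  # branch 2: keep bond k
        degree[j] += 1
        recurse(k + 1, kept + [bonds[k]])

    recurse(0, [])
    return results

# Hyper structure on four atoms: all six possible bonds.
hyper = list(combinations(range(4), 2))
for structure in reduce_structure([2, 2, 2, 2], hyper):
    print(structure)
```

With four atoms that each require two neighbours, the three surviving bond sets are the three labelled four-membered rings. The number of candidate bonds grows quadratically with the number of atoms, which hints at why large hyper structures make reduction expensive.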
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN combines them to avoid the disadvantages of each in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix representing all bonds in the hyper structure; second, a substructure representation that lists atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
5uhiafoz9s3k9mwziltqpcgrekynj37
8217
8216
2019-12-11T15:49:54Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic in [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. 
Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining the two <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>. It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The process continues by transforming this structure into another structure until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. The [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a well-known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
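The definitions above can be made concrete with a small sketch. The class name <code>ChemicalGraph</code>, its methods and the valence table are assumptions made for illustration only; they do not come from any of the generators discussed in this article.

```python
from collections import Counter

# Standard valences for a few elements (an assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """Chemical multigraph: vertices are atoms, edge multiplicities are bond orders."""

    def __init__(self, atoms):
        self.atoms = list(atoms)   # element symbol of each vertex
        self.bonds = Counter()     # frozenset({u, v}) -> bond order

    def add_bond(self, u, v, order=1):
        if u == v:
            raise ValueError("an atom cannot be bonded to itself")
        self.bonds[frozenset((u, v))] += order

    def degree(self, v):
        # In a multigraph, each edge counts with its multiplicity.
        return sum(order for pair, order in self.bonds.items() if v in pair)

    def is_saturated(self):
        # A molecule is saturated if every atom has reached its valence.
        return all(self.degree(v) == VALENCE[a] for v, a in enumerate(self.atoms))

    def is_connected(self):
        # Depth-first search from vertex 0 over the bonds.
        if not self.atoms:
            return True
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for pair in self.bonds:
                if u in pair:
                    (w,) = pair - {u}
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: vertex 0 is C, vertex 1 is O, vertices 2 and 3 are H.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)   # C=O double bond
formaldehyde.add_bond(0, 2)            # C-H
formaldehyde.add_bond(0, 3)            # C-H
```

In this example the carbon atom has degree 4 because the double bond contributes 2 to its degree, so the graph is both saturated and connected.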
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The product of two permutations, <math>a</math> and <math>b</math>, is defined as their function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The product of two permutations is also a permutation.
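As a small worked example, the permutation of Table 1 can be composed with itself. The dictionary representation and the helper names <code>compose</code> and <code>is_permutation</code> are choices made for this sketch.

```python
# The permutation of Table 1, stored as a dictionary x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation of its domain if its images are exactly the domain."""
    return set(p.keys()) == set(p.values())


ff = compose(f, f)   # f applied twice
```

For instance, <code>ff[1]</code> is <code>f(f(1)) = f(4) = 6</code>, and <code>is_permutation(ff)</code> holds, in line with the statement that the product of two permutations is again a permutation.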
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group under function composition, the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations of <math>X</math> that preserve the set system are its automorphisms; a graph is a set system whose blocks are its edges, so this definition yields the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph in a way that maps the graph onto itself; that is, it preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for these tasks.
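For very small graphs, the automorphism group can be computed directly from the definition by testing every vertex permutation. The function below is a brute-force sketch written for this purpose; it is not the algorithm used by NAUTY, which avoids enumerating all <math>n!</math> permutations.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all vertex permutations of a simple graph that map edges onto edges.

    Vertices are 0..n-1 and each permutation is a tuple p with p[v] the image of v.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        # A bijection maps distinct edges to distinct edges, so the image
        # lies inside the edge set exactly when the two sets are equal.
        if image == edge_set:
            result.append(perm)
    return result


# Carbon skeleton of cyclobutane: a 4-cycle.
ring = automorphisms(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

# Carbon skeleton of propane: a path on three vertices.
chain = automorphisms(3, [(0, 1), (1, 2)])
```

The 4-cycle has 8 automorphisms (4 rotations and 4 reflections), whereas the path has 2 (the identity and the end-to-end flip). The sketch ignores element labels and bond orders, which a chemical application would also have to preserve.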
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
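The generation tree described above can be sketched as a breadth-first search in which every level adds one bond. This is a minimal illustration under stated assumptions, not the implementation of ASSEMBLE or OMG: the function names, the valence table and the limit of triple bonds are hypothetical, and duplicates are removed with a brute-force canonical form (the smallest relabelling over all element-preserving permutations) that is only practical for a handful of atoms.

```python
from itertools import combinations, permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}   # assumed standard valences
MAX_ORDER = 3                                 # single, double or triple bonds


def element_preserving_perms(atoms):
    """All vertex permutations that map every atom to an atom of the same element."""
    groups = {}
    for i, a in enumerate(atoms):
        groups.setdefault(a, []).append(i)
    perms = []
    for choice in product(*(permutations(g) for g in groups.values())):
        p = [0] * len(atoms)
        for g, image in zip(groups.values(), choice):
            for src, dst in zip(g, image):
                p[src] = dst
        perms.append(p)
    return perms


def generate(atoms):
    """Breadth-first assembly: return canonical forms of all saturated, connected structures."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    index = {pair: k for k, pair in enumerate(pairs)}
    # For each permutation, the position each atom pair is read from.
    perm_maps = [
        [index[tuple(sorted((p[i], p[j])))] for i, j in pairs]
        for p in element_preserving_perms(atoms)
    ]
    valence = [VALENCE[a] for a in atoms]

    def canonical(state):
        return min(tuple(state[k] for k in pm) for pm in perm_maps)

    def degrees(state):
        deg = [0] * n
        for (i, j), order in zip(pairs, state):
            deg[i] += order
            deg[j] += order
        return deg

    def connected(state):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for (i, j), order in zip(pairs, state):
                if order and u in (i, j):
                    w = j if u == i else i
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == n

    frontier = {tuple([0] * len(pairs))}   # root node: the atoms without any bond
    results = set()
    while frontier:
        next_frontier = set()
        for state in frontier:
            deg = degrees(state)
            if deg == valence:             # saturated: a leaf of the tree
                if connected(state):
                    results.add(state)
                continue
            for k, (i, j) in enumerate(pairs):
                if deg[i] < valence[i] and deg[j] < valence[j] and state[k] < MAX_ORDER:
                    child = state[:k] + (state[k] + 1,) + state[k + 1:]
                    next_frontier.add(canonical(child))
        frontier = next_frontier
    return results


isomers = generate(["C", "C", "O", "H", "H", "H", "H"])   # molecular formula C2H4O
```

For C<sub>2</sub>H<sub>4</sub>O the sketch returns three structures, corresponding to acetaldehyde, vinyl alcohol and oxirane. Each returned tuple lists the bond orders of the atom pairs in the order produced by <code>combinations</code>. Because every level of the tree is held in memory, the sketch also illustrates why the storage of intermediate structures becomes a bottleneck.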
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on McKay’s canonical augmentation method, using his NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates the canonical labelling and then extends structures by adding one bond. To keep the extension canonical, only canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for the canonical labelling of graphs, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when only one core is used; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
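The idea of generating connectivity matrices directly can be sketched with a simple backtracking routine that fills the upper triangle of a symmetric matrix so that every row sum equals the valence of the corresponding atom. This is an illustrative simplification, not the algorithm of MASS, SMOG or MOLGEN: the function name is hypothetical, connectivity is not checked, and no symmetry information is used, so isomorphic matrices are produced more than once.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(valences)
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)          # free valence left on each atom

    def fill(i, j):
        if i == n - 1:                  # last row is fixed by symmetry
            if remaining[i] == 0:
                yield [row[:] for row in matrix]
            return
        if j == n:                      # row i is complete and must be saturated
            if remaining[i] == 0:
                yield from fill(i + 1, i + 2)
            return
        # Bound: a bond order can never exceed the free valence of either atom.
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(i, j + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0, 1)


# Formaldehyde, CH2O, with atoms ordered C, O, H, H.
ch2o = list(connectivity_matrices([4, 2, 1, 1]))

# Acetylene, C2H2, with atoms ordered C, C, H, H.
c2h2 = list(connectivity_matrices([4, 4, 1, 1]))
```

Only one matrix is kept in memory at a time, which reflects the low memory usage attributed to constructive search. For C<sub>2</sub>H<sub>2</sub> the sketch yields two matrices that differ only in which hydrogen is attached to which carbon; these are duplicates of the same molecule, which is the redundancy that orderly generation is designed to avoid.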
===Structure Reduction===
----
Unlike assembly methods, reduction methods first make all possible bonds between atom pairs, generating a hypergraph. The size of this graph is then reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. The generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the hypergraph contains all bonds, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. To optimize storage, generated fragments are described as atom-centred fragments, comparable to circular fingerprints and atom signatures: rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, and the run time increases accordingly.
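The reduction strategy can be illustrated with a strongly simplified sketch. It assumes that the only constraint is the number of neighbours each atom must keep (a stand-in for the atom-centred fragments mentioned above), that all bonds are single, and that the starting hypergraph contains a bond between every pair of atoms. The function names are hypothetical, and the sketch does not reproduce COCOA itself.

```python
from itertools import combinations


def is_connected(n, edges):
    """Depth-first search over the bonds that were kept."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if u in (a, b):
                w = b if u == a else a
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return len(seen) == n


def reduce_hyperstructure(targets):
    """Recursively remove bonds until atom v has exactly targets[v] neighbours."""
    targets = list(targets)
    n = len(targets)
    all_bonds = list(combinations(range(n), 2))   # every atom bonded to every other
    degree = [n - 1] * n                          # bonds not yet removed, per atom
    results = []

    def recurse(k, kept):
        # Prune: an atom that already lost too many bonds can never be repaired.
        if any(d < t for d, t in zip(degree, targets)):
            return
        if k == len(all_bonds):
            if degree == targets and is_connected(n, kept):
                results.append(kept)
            return
        u, v = all_bonds[k]
        degree[u] -= 1                            # option 1: remove bond k
        degree[v] -= 1
        recurse(k + 1, kept)
        degree[u] += 1                            # option 2: keep bond k
        degree[v] += 1
        recurse(k + 1, kept + [(u, v)])

    recurse(0, [])
    return results


# Four carbon atoms: atoms 0 and 3 keep one neighbour, atoms 1 and 2 keep two.
skeletons = reduce_hyperstructure([1, 2, 2, 1])
```

For this input, the six bonds of the starting structure are reduced to two labelled chains, 0-1-2-3 and 0-2-1-3, both of which correspond to the carbon skeleton of butane. Since the number of bonds in the starting structure grows quadratically with the number of atoms, the sketch also shows why the size of the hypergraph is the main drawback of reduction methods.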
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections that create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction; of these, structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
2019-12-11T15:59:01Z
MehmetAzizYirik
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry <ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref>. CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator <ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref>. The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS <ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS <ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. 
Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref>. The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA <ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref>. The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator <ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref>. Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists <ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref>. The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE <ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks <ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref>. In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator <ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref>. The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC <ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref>. After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>. The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and other [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules <ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY <ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref>. To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN <ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref>. Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev <ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read <ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref>. Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies <ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. An [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator was developed at ACD Labs by a team that notably includes Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc <ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref>. In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG) <ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref>. The algorithm relies on canonical path augmentation and McKay’s NAUTY package <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, that is, the maximum number of bonds the chemical element can make. For example, the valence of carbon is 4. In a chemical graph, an atom is saturated if its degree, counted with bond multiplicity, equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules; a graph in which every atom is saturated may still consist of several disconnected fragments. A molecule is saturated if all its atoms are saturated.
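The definitions above can be illustrated with a minimal Python sketch. The data layout (a list of element symbols plus a dictionary mapping atom-index pairs to bond orders), the <code>VALENCE</code> table and all function names are assumptions made for this illustration; they are not taken from any of the generators discussed in this article.

```python
from collections import defaultdict

# Standard valences, assumed here for illustration only.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting every bond with its multiplicity."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    adjacency = defaultdict(set)
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: a C=O double bond and two C-H single bonds.
formaldehyde = (["C", "O", "H", "H"], {(0, 1): 2, (0, 2): 1, (0, 3): 1})
# Two separate H2 molecules: saturated, but not a single connected molecule.
two_h2 = (["H", "H", "H", "H"], {(0, 1): 1, (2, 3): 1})
```

In this sketch the second example passes the saturation test but fails the connectivity test, which is why a generator has to check both conditions.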
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
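As a small worked example, the permutation of Table 1 can be written as a Python dictionary and composed with itself. The function names are hypothetical and chosen only for this sketch.

```python
# The permutation f from Table 1, stored as x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Product of two permutations as function composition: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

def cycles(p):
    """Decompose a permutation into disjoint cycles, starting from the smallest elements."""
    seen, result = set(), []
    for start in sorted(p):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = p[x]
        result.append(tuple(cycle))
    return result

ff = compose(f, f)  # f applied twice
```

Here <code>cycles(f)</code> gives (1 4 6 5)(2)(3 11)(7 8 9)(10), and <code>ff</code> is again a permutation of the same eleven integers.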
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, forms a group under function composition, the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations of <math>X</math> that preserve the set system are used to build the [[wp:Graph automorphism|automorphisms]] of a graph, where the vertices play the role of <math>X</math> and the edges play the role of the blocks. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action preserves the edge-vertex connectivity.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> if and only if <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for these tasks.
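For very small graphs, the automorphism group can be found by testing every vertex permutation against the definition. The following sketch does exactly that; it is a brute-force illustration and not the partition-refinement search used by NAUTY. The edge representation (a dictionary from vertex pairs to bond multiplicities) and the function names are assumptions of this example.

```python
from itertools import permutations

def multiplicity(edges, u, v):
    """Multiplicity of the edge between u and v (0 if they are not bonded)."""
    return edges.get((min(u, v), max(u, v)), 0)

def automorphisms(n, edges):
    """All permutations of the n vertices that map every edge onto an edge
    of the same multiplicity. Because a permutation is a bijection and the
    edge set is finite, this is enough to preserve non-edges as well."""
    return [
        perm
        for perm in permutations(range(n))
        if all(multiplicity(edges, perm[u], perm[v]) == m
               for (u, v), m in edges.items())
    ]

# Carbon skeletons, with vertices numbered from 0.
propane = {(0, 1): 1, (1, 2): 1}                 # chain C-C-C
cyclopropane = {(0, 1): 1, (1, 2): 1, (0, 2): 1}  # three-membered ring
propene = {(0, 1): 2, (1, 2): 1}                  # chain C=C-C
```

The chain has two automorphisms (the identity and the end-to-end flip), the ring has six, and the double bond in the third skeleton removes the flip, leaving only the identity. The cost grows as <math>n!</math>, so this approach is only practical for a handful of atoms.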
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used to compare them.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator itself is purely mathematical and does not interpret spectral data; spectral data are used only for structure scoring and for deriving substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is the atom set, together with any substructures provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
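The generation tree described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: hydrogens are kept implicit, so only heavy atoms are vertices; duplicates are removed with a brute-force canonical form instead of NAUTY; and no substructure constraints are handled. All names are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed standard valences

def degrees(n, bonds):
    deg = [0] * n
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_connected(n, bonds):
    """Depth-first search from atom 0 (assumes at least one atom)."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for i, j in bonds:
            w = j if i == v else i if j == v else None
            if w is not None and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def canonical(atoms, bonds):
    """Smallest sorted bond list over all relabellings that keep each
    atom's element; isomorphic structures get the same value."""
    n, best = len(atoms), None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(sorted(((min(perm[i], perm[j]), max(perm[i], perm[j])), o)
                           for (i, j), o in bonds.items()))
        if best is None or key < best:
            best = key
    return best

def extensions(atoms, bonds):
    """Children of a node in the generation tree: raise one bond order by 1."""
    n = len(atoms)
    deg = degrees(n, bonds)
    for i, j in combinations(range(n), 2):
        if (deg[i] < VALENCE[atoms[i]] and deg[j] < VALENCE[atoms[j]]
                and bonds.get((i, j), 0) < 3):
            child = dict(bonds)
            child[(i, j)] = child.get((i, j), 0) + 1
            yield child

def assemble(atoms, n_hydrogens):
    """Breadth-first assembly of heavy-atom skeletons for a molecular formula."""
    free = sum(VALENCE[a] for a in atoms) - n_hydrogens
    if free < 0 or free % 2:
        return []
    level = {()}  # root of the tree: atoms only, no bonds
    for _ in range(free // 2):  # one tree level per unit of bond order
        level = {canonical(atoms, child)
                 for key in level
                 for child in extensions(atoms, dict(key))}
    return [dict(key) for key in level if is_connected(len(atoms), dict(key))]
```

For example, <code>assemble(["C", "C", "O"], 6)</code> returns the two skeletons of C<sub>2</sub>H<sub>6</sub>O (ethanol and dimethyl ether), and <code>assemble(["C"] * 4, 10)</code> returns the two isomers of butane. Every level of the tree is held in memory at once, which is the storage cost that the article attributes to breadth-first assembly.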
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN once the number of cores is increased to 10.
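The idea of extending only one atom per equivalence class can be sketched as follows. The class key used here (the element together with the sorted list of first neighbours and bond orders) is an assumption standing in for the "interaction type" mentioned above; it is coarser than true symmetry classes and is not claimed to be Faulon's definition. The function names are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(atoms, bonds):
    """Group atom indices by element and first-neighbour environment."""
    neighbours = defaultdict(list)
    for (i, j), order in bonds.items():
        neighbours[i].append((atoms[j], order))
        neighbours[j].append((atoms[i], order))
    classes = defaultdict(list)
    for index, element in enumerate(atoms):
        key = (element, tuple(sorted(neighbours[index])))
        classes[key].append(index)
    return list(classes.values())

def representatives(atoms, bonds):
    """One atom per class; only these would be extended in the next step."""
    return [members[0] for members in equivalence_classes(atoms, bonds)]

# Propane skeleton C-C-C: the two terminal carbons fall into one class.
atoms = ["C", "C", "C"]
bonds = {(0, 1): 1, (1, 2): 1}
```

For the propane skeleton this gives the classes [0, 2] and [1], so only atoms 0 and 1 need to be considered for extension instead of all three.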
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
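Matrix generation can be illustrated with a small backtracking sketch. It fills the upper triangle of a symmetric connectivity matrix entry by entry, so that only one partially filled matrix is held in memory at a time, and it prunes a branch as soon as a row can no longer reach its valence. This is an assumption-laden illustration and not the MASS, SMOG or MOLGEN algorithm: duplicates are removed afterwards with a brute-force canonical form rather than by an orderly canonicity test during generation, and connectivity is not checked. All names are hypothetical.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed standard valences

def fill_matrices(valences, max_order=3):
    """Yield every symmetric matrix with zero diagonal, entries 0..max_order
    and row sums equal to the given valences."""
    n = len(valences)
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)  # free valence left on each atom

    def backtrack(k):
        if k == len(cells):
            if all(r == 0 for r in remaining):
                yield [row[:] for row in matrix]
            return
        i, j = cells[k]
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            # Bound: the last entry of row i must use up what is left of atom i.
            if j == n - 1 and order != remaining[i]:
                continue
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from backtrack(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from backtrack(0)

def canonical_form(matrix, atoms):
    """Largest matrix over all relabellings that keep each atom's element."""
    n, best = len(atoms), None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        candidate = tuple(tuple(matrix[perm[i]][perm[j]] for j in range(n))
                          for i in range(n))
        if best is None or candidate > best:
            best = candidate
    return best

def generate(atoms):
    """Saturated connectivity matrices for explicit atoms, up to isomorphism."""
    valences = [VALENCE[a] for a in atoms]
    return {canonical_form(m, atoms) for m in fill_matrices(valences)}
```

For acetylene with explicit hydrogens, <code>fill_matrices([4, 4, 1, 1])</code> yields two labelled matrices that differ only in which hydrogen is attached to which carbon, and <code>generate(["C", "C", "H", "H"])</code> reduces them to a single structure.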
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI <ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref>. HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; it has also contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
jyblfn6tmegcemsn1cpy3zwwfv5rkn1
8219
8218
2019-12-12T11:30:38Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and other [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive two-step structure generation process: first, all combinations of interpretations of the [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator that fills in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. An [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator is part of a CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
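The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the generators discussed here; the class name <code>ChemicalGraph</code>, its methods and the small <code>VALENCE</code> table are hypothetical names chosen for illustration, and self-loops are not supported.

```python
from collections import Counter

# Assumed standard valences for a few common elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """Multigraph whose vertices are atoms and whose edge multiplicities are bond orders."""

    def __init__(self, atoms):
        self.atoms = list(atoms)   # element symbol of each vertex
        self.bonds = Counter()     # frozenset({u, v}) -> bond order

    def add_bond(self, u, v, order=1):
        self.bonds[frozenset((u, v))] += order

    def degree(self, u):
        # The degree counts edge multiplicity, so a double bond contributes 2.
        return sum(order for pair, order in self.bonds.items() if u in pair)

    def is_saturated(self):
        return all(self.degree(i) == VALENCE[a] for i, a in enumerate(self.atoms))

    def is_connected(self):
        if not self.atoms:
            return True
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for pair in self.bonds:
                if u in pair:
                    (v,) = pair - {u}   # the other end of the bond
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)
formaldehyde.add_bond(0, 2)
formaldehyde.add_bond(0, 3)

# Two separate H2 molecules: every atom is saturated, but the graph is disconnected.
two_h2 = ChemicalGraph(["H", "H", "H", "H"])
two_h2.add_bond(0, 1)
two_h2.add_bond(2, 3)
```

The second example shows why saturation alone is not sufficient: all atoms of <code>two_h2</code> are saturated, yet the graph does not describe a single molecule, so a separate connectivity check is needed.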
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements <ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref>. An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
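As a small worked example, the permutation of Table 1 can be composed with itself in Python. This is only a sketch: representing a permutation as a dictionary and the helper names <code>compose</code> and <code>is_permutation</code> are assumptions made for illustration.

```python
def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its values are a rearrangement of its keys."""
    return sorted(p.keys()) == sorted(p.values())


# The permutation f from Table 1, stored as x -> f(x).
f = dict(zip(range(1, 12), [4, 2, 11, 6, 1, 5, 8, 9, 7, 10, 3]))

# The product ff maps x to f(f(x)); for example, 1 -> 4 -> 6.
ff = compose(f, f)
```

Here <code>ff</code> is again a permutation of the integers 1 to 11, in agreement with the statement above.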
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under function composition, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph can be viewed as a set system whose blocks are its edges, and the permutations that preserve the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action preserves the edge-vertex connectivity.
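For a small set, the order of <math>Sym(X)</math> and the group properties can be checked by direct enumeration. The following Python sketch does this for a set of four elements; the helper names are hypothetical, and the brute-force approach is only practical for very small <math>n</math>.

```python
from itertools import permutations
from math import factorial


def symmetric_group(X):
    """Return all permutations of X as dictionaries mapping x -> f(x)."""
    X = list(X)
    return [dict(zip(X, image)) for image in permutations(X)]


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


X = [1, 2, 3, 4]
sym_x = symmetric_group(X)
order = len(sym_x)                  # expected to equal 4!
identity = {x: x for x in X}

# Closure: the product of any two elements is again in Sym(X).
closed = all(compose(a, b) in sym_x for a in sym_x for b in sym_x)

# Inverses: every element has a partner whose product with it is the identity.
has_inverses = all(
    any(compose(a, b) == identity for b in sym_x) for a in sym_x
)
```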
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v)) \in E</math> whenever <math>(u,v) \in E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>, which forms a group under composition. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
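The definition of an automorphism can be turned into a brute-force check, sketched below in Python. This is not how NAUTY works; it simply tests every permutation of the vertices, which is feasible only for very small graphs. The function name <code>automorphisms</code> is hypothetical, and the sketch assumes a simple graph, ignoring element labels and bond orders.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all vertex permutations of {0, ..., n-1} that map edges onto edges."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # perm[u] is the image of vertex u under the permutation.
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Carbon skeleton of cyclobutane: a cycle on four vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
ring_aut = automorphisms(4, ring)

# Carbon skeleton of propane: a path on three vertices.
chain = [(0, 1), (1, 2)]
chain_aut = automorphisms(3, chain)
```

The four-membered ring has eight automorphisms (four rotations and four reflections), whereas the three-atom chain has only two: the identity and the reversal of the chain.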
==Methods==
Generation methods are the core of CASE systems. Structure generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator <ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added <ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref>. Although NAUTY is an efficient tool for canonical graph labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator) <ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref>. MOLGEN outperforms PMG when PMG uses only one core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS <ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
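The idea of generating connectivity matrices directly from atom valences can be sketched in Python as follows. This is a naive backtracking enumeration written for illustration, not the MASS, SMOG or MOLGEN algorithm: it yields labelled matrices and performs no canonicity test, so isomorphic structures appear several times. The function names and the maximum bond order of 3 are assumptions.

```python
from itertools import combinations


def connectivity_matrices(valences, max_order=3):
    """Yield symmetric bond-order matrices whose row sums equal the given valences."""
    n = len(valences)
    pairs = list(combinations(range(n), 2))
    matrix = [[0] * n for _ in range(n)]
    free = list(valences)   # remaining free valence of each atom

    def fill(k):
        if k == len(pairs):
            if all(f == 0 for f in free):      # every atom is saturated
                yield [row[:] for row in matrix]
            return
        i, j = pairs[k]
        # Try every bond order that both atoms can still accommodate.
        for order in range(min(max_order, free[i], free[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            free[i] -= order
            free[j] -= order
            yield from fill(k + 1)
            free[i] += order
            free[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)


def is_connected(matrix):
    """Depth-first search over the non-zero entries of the matrix."""
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if matrix[u][v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


# C2H4: atoms 0 and 1 are carbon (valence 4), atoms 2 to 5 are hydrogen (valence 1).
valences = [4, 4, 1, 1, 1, 1]
connected = [m for m in connectivity_matrices(valences) if is_connected(m)]
```

For C<sub>2</sub>H<sub>4</sub>, every connected solution contains a carbon-carbon double bond, and the solutions differ only in which hydrogen atoms are attached to which carbon atom. These labelled duplicates are what the canonicity tests of orderly generation are designed to avoid.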
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN <ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
75v5hnkir7arux8z3g3nogo1lc4szsh
8220
8219
2019-12-12T11:55:06Z
MehmetAzizYirik
145
/* Symmetry Groups for Molecular Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, for organic synthesis design or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive two-step structure generation process where first all combinations of interpretations of [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. In principle, this platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. An [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator is part of a CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
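As a small illustration, the permutation of Table 1 can be stored as a mapping and multiplied with itself. This is only a sketch; the names <code>compose</code> and <code>is_permutation</code> are chosen here and do not come from any cited package.

```python
# Permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are exactly its domain."""
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)
print(is_permutation(ff), ff[1])
```

Here <code>ff[1]</code> is <code>f(f(1)) = f(4) = 6</code>, and the product is again a permutation of the same set.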
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> denote the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. For a graph, whose edges can be viewed as blocks over the vertex set, the permutations preserving the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> if and only if <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on the NAUTY package.
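For very small graphs, the automorphism group can be listed by brute force, which makes the definition concrete. The sketch below is an illustration only: it tries every vertex permutation, ignores bond orders and atom types, and is unrelated to the far more efficient partition-refinement approach used by NAUTY. The function name and the two example skeletons are assumptions made here.

```python
from itertools import permutations


def automorphisms(n, edges):
    """List all permutations of vertices 0..n-1 mapping the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Carbon skeleton of cyclobutane: a cycle on four vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Carbon skeleton of n-butane: a path on four vertices.
chain = [(0, 1), (1, 2), (2, 3)]

print(len(automorphisms(4, ring)), len(automorphisms(4, chain)))
```

The ring has eight automorphisms (four rotations and four reflections), whereas the chain has only two (the identity and the end-to-end flip).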
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE <ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator <ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref>. The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
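The generation tree described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: each level of the tree adds one unit of bond order between two unsaturated atoms, bond orders are capped at three, and duplicates are removed with a brute-force canonical form instead of NAUTY. The valence table and all function names are hypothetical.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed standard valences


def canonical(atoms, bonds):
    """Brute-force canonical form: the smallest relabelled bond list over
    all permutations that keep the element at every position unchanged."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(sorted((tuple(sorted((perm[u], perm[v]))), order)
                           for (u, v), order in bonds.items()))
        if best is None or key < best:
            best = key
    return best


def free_valence(atoms, bonds, i):
    """Valence of atom i minus the bond orders it already uses."""
    return VALENCE[atoms[i]] - sum(o for pair, o in bonds.items() if i in pair)


def is_connected(n, bonds):
    """Depth-first search over the bonds, starting from atom 0."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for a, b in bonds:
            if u in (a, b):
                v = b if a == u else a
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == n


def assemble(atoms):
    """Breadth-first generation; returns saturated, connected structures."""
    n = len(atoms)
    level = {(): {}}  # canonical key -> bond dict; the root has no bonds
    finished = {}
    while level:
        next_level = {}
        for bonds in level.values():
            open_atoms = [i for i in range(n) if free_valence(atoms, bonds, i) > 0]
            if not open_atoms:  # every atom is saturated
                if is_connected(n, bonds):
                    finished[canonical(atoms, bonds)] = bonds
                continue
            for a in range(len(open_atoms)):
                for b in range(a + 1, len(open_atoms)):
                    u, v = open_atoms[a], open_atoms[b]
                    if bonds.get((u, v), 0) >= 3:  # no bond order above three
                        continue
                    child = dict(bonds)
                    child[(u, v)] = child.get((u, v), 0) + 1
                    next_level[canonical(atoms, child)] = child
        level = next_level
    return list(finished.values())


print(len(assemble(["C", "C", "O", "H", "H"])))
```

For the atom set of C2H2O, this sketch yields three graphs, corresponding to ketene, ethynol and oxirene. Every intermediate structure of a level is kept in memory, which hints at why storage becomes the bottleneck for larger formulas.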
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The bottleneck is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> When PMG uses only one core, MOLGEN outperforms it; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which address the memory problem of storing intermediate structures. These methods are matrix generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
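The idea of treating generation as a numerical problem on matrices can be illustrated with a naive enumerator. The sketch below is not the method of MASS, SMOG or MOLGEN: it simply tries every symmetric bond-order matrix with a zero diagonal, keeps those whose row sums equal the given valences, and tests connectivity afterwards. The function names and the cap of three on bond orders are assumptions made here.

```python
from itertools import product


def connectivity_matrices(valences, max_order=3):
    """Yield symmetric bond-order matrices whose row sums equal the valences."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for orders in product(range(max_order + 1), repeat=len(pairs)):
        matrix = [[0] * n for _ in range(n)]
        for (i, j), order in zip(pairs, orders):
            matrix[i][j] = matrix[j][i] = order
        if all(sum(row) == v for row, v in zip(matrix, valences)):
            yield matrix


def is_connected(matrix):
    """Depth-first search over the non-zero entries of the matrix."""
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if matrix[u][v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


# Acetylene, C2H2: atoms ordered C, C, H, H with valences 4, 4, 1, 1.
solutions = [m for m in connectivity_matrices([4, 4, 1, 1]) if is_connected(m)]
print(len(solutions))
```

For C2H2, this enumeration finds two matrices that differ only in which hydrogen is attached to which carbon, so both describe the same molecule. Avoiding such labelled duplicates without comparing all results pairwise is the purpose of orderly generation.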
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because of the hypergraph representation, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hypergraph becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyperstructure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyperstructure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyperstructure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
p8hj0v3x5z72yo32t3i86etw4jqfxq3
8221
8220
2019-12-12T12:24:20Z
MehmetAzizYirik
145
/* Structure Assembly */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The process continues by transforming this structure into another structure until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. An [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers, Elyashberg, was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
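These conditions can be checked exhaustively for a small example. The sketch below takes all permutations of a three-element set, with function composition as the operation, and tests closure, associativity, the identity and the inverses. The variable names are chosen here for illustration, and permutations are stored as tuples so that <code>p[x]</code> is the image of <code>x</code>.

```python
from itertools import permutations

X = (0, 1, 2)
# Every permutation of X, stored as a tuple p with p[x] the image of x.
sym = list(permutations(X))


def compose(a, b):
    """Product ab with (ab)(x) = a(b(x))."""
    return tuple(a[b[x]] for x in X)


identity = tuple(X)

closed = all(compose(a, b) in sym for a in sym for b in sym)
associative = all(compose(compose(a, b), c) == compose(a, compose(b, c))
                  for a in sym for b in sym for c in sym)
has_identity = all(compose(g, identity) == g == compose(identity, g) for g in sym)
has_inverses = all(any(compose(g, h) == identity == compose(h, g) for h in sym)
                   for g in sym)

print(closed, associative, has_identity, has_inverses, len(sym))
```

All four checks succeed, and the set contains 3! = 6 permutations.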
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> denote the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. For a graph, whose edges can be viewed as blocks over the vertex set, the permutations preserving the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> if and only if <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on the NAUTY package.
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> With a single core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
rak4z1jzhig1o8pe5od7gnit7nq0n4v
8222
8221
2019-12-12T12:36:23Z
MehmetAzizYirik
145
/* Structure Reduction */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this [[wp:Hypergraph|hypergraph]] is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to [[wp:Adjacency matrix|adjacency matrices]], list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as [[wp:HMBC|HMBC]], [[wp:HSQC|HSQC]] and other [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the [[wp:HMBC|HMBC]] data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process: first, all combinations of interpretations of [[wp:HMBC|HMBC]] signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied [[wp:Automorphism group|automorphism groups]] in the generation of [[wp:Adjacency matrix|adjacency matrices]]. An [[wp:Automorphism group|automorphism group]] of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs and, notably, from one of the developers of MASS, Elyashberg. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing [[wp:Automorphism group|automorphism groups]] as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule rather than a set of disconnected fragments. A molecule is saturated if all its atoms are saturated.
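The definitions above can be illustrated with a minimal Python sketch. The valence table, the dictionary-based bond representation and the function names are assumptions made for this illustration only; they are not taken from any of the generators discussed in this article.

```python
from collections import deque

# Hypothetical valence table, for illustration only.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))


def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(atoms)


# Formaldehyde, CH2O: atom 0 is C, atom 1 is O, atoms 2 and 3 are H.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
```

Two separate hydrogen molecules, <code>["H", "H", "H", "H"]</code> with bonds <code>{(0, 1): 1, (2, 3): 1}</code>, pass the saturation test but fail the connectivity test, which is why both checks are needed.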
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
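As a small worked example, the permutation of Table 1 can be stored as a Python dictionary and composed with itself. The helper names <code>compose</code> and <code>is_permutation</code> are chosen for this sketch only.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its values are a rearrangement of its keys."""
    return sorted(p.values()) == sorted(p.keys())


# f composed with itself is again a permutation of the same set.
ff = compose(f, f)
```

Here <code>ff[1]</code> is <code>f(f(1)) = f(4) = 6</code>, and <code>ff[3]</code> is <code>f(f(3)) = f(11) = 3</code>.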
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set of numbers. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps a graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms on <math>V</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY.
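For very small graphs, the automorphism group can be listed by brute force, which makes the definition concrete. The sketch below assumes a simple graph without atom labels or bond orders and tries all <math>n!</math> vertex permutations; it illustrates the definition and is not how NAUTY works internally.

```python
from itertools import permutations


def automorphisms(n, edges):
    """All vertex permutations of {0, ..., n-1} that map every edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # perm[u] is the image of vertex u under the permutation.
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            result.append(perm)
    return result


# A four-membered ring and a four-atom chain, as unlabelled skeletons.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
chain = [(0, 1), (1, 2), (2, 3)]
```

The ring has eight automorphisms (four rotations and four reflections), while the chain has two (the identity and the end-to-end flip). A graph on three vertices with no edges is preserved by all <math>3! = 6</math> permutations.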
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
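The generation tree described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of ASSEMBLE or OMG: it assumes a hypothetical valence table, takes no substructures as input, and removes only identically labelled duplicates, so isomorphic structures with different atom numbering are all reported.

```python
from itertools import combinations

# Hypothetical valence table, for illustration only.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def assemble(atoms, max_order=3):
    """Breadth-first bond-by-bond extension; returns saturated, connected structures."""
    pairs = list(combinations(range(len(atoms)), 2))

    def free_valences(state):
        left = [VALENCE[a] for a in atoms]
        for (i, j), order in zip(pairs, state):
            left[i] -= order
            left[j] -= order
        return left

    def connected(state):
        seen, stack = {0}, [0]
        while stack:
            v = stack.pop()
            for (i, j), order in zip(pairs, state):
                if order and v in (i, j):
                    w = j if v == i else i
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == len(atoms)

    # Root of the tree: the atom set without any bond.
    level = {tuple([0] * len(pairs))}
    finished = set()
    while level:
        next_level = set()
        for state in level:
            left = free_valences(state)
            if not any(left):
                # Saturated leaf: keep it only if it is a single molecule.
                if connected(state):
                    finished.add(state)
                continue
            # Extend the intermediate structure by one bond order on one pair.
            for k, (i, j) in enumerate(pairs):
                if left[i] and left[j] and state[k] < max_order:
                    next_level.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = next_level
    return [{p: o for p, o in zip(pairs, s) if o} for s in sorted(finished)]
```

For water, <code>assemble(["O", "H", "H"])</code> returns the single structure with two O–H bonds. For ethene, <code>assemble(["C", "C", "H", "H", "H", "H"])</code> returns six labelled structures that differ only in which hydrogens sit on which carbon, which shows why duplicate detection is essential.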
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> With a single core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
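The role of canonical labelling in duplicate rejection can be shown with a brute-force stand-in. The function below is an assumption of this sketch, not NAUTY's algorithm: it tries every atom ordering and keeps the lexicographically smallest description, which is only feasible for a handful of atoms.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Smallest (element sequence, bond-order sequence) over all atom orderings."""
    n = len(atoms)
    order = {}
    for (i, j), o in bonds.items():
        order[(i, j)] = order[(j, i)] = o
    best = None
    for perm in permutations(range(n)):
        # perm[k] is the original index of the atom placed at position k.
        key = (
            tuple(atoms[p] for p in perm),
            tuple(order.get((perm[a], perm[b]), 0)
                  for a in range(n) for b in range(a + 1, n)),
        )
        if best is None or key < best:
            best = key
    return best


# Heavy-atom skeletons: two numberings of C-C-O, and the isomeric C-O-C.
ethanol_a = canonical_key(["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ethanol_b = canonical_key(["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
ether = canonical_key(["C", "O", "C"], {(0, 1): 1, (1, 2): 1})
```

Two structures are isomorphic exactly when their keys are equal, so a generator can store keys in a set and discard any candidate whose key has been seen before. Here <code>ethanol_a</code> equals <code>ethanol_b</code>, while <code>ether</code> differs.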
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix representing all bonds in the hyper structure, and second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
kpgw1z2paiw7mcnloe1k1wtllk1iko4
8223
8222
2019-12-12T15:12:38Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, organic synthesis design and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules.<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref> It involves an exhaustive two-step structure generation process: first, all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp.
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally multigraphs. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections; in a multigraph, each edge is counted with its multiplicity. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules. A molecule is saturated if all its atoms are saturated.
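The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the generators discussed here; the names <code>VALENCE</code>, <code>degrees</code>, <code>is_saturated</code> and <code>is_connected</code> are hypothetical, and the valence table assumes the standard valences of C, N, O and H. A molecule is stored as a list of element symbols plus a dictionary mapping atom-index pairs to bond orders.

```python
from collections import defaultdict

# Assumed standard valences; an illustrative table, not from the article.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degrees(atoms, bonds):
    """Degree of each atom, counting every bond with its order (multiplicity)."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom's degree equals its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))


def is_connected(atoms, bonds):
    """Depth-first search: connected if every atom is reachable from atom 0."""
    if not atoms:
        return True
    adjacency = defaultdict(set)
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacency[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
```

For the formaldehyde example, every atom reaches its valence and all atoms are reachable from the carbon, so both checks succeed. Removing one C–H bond would leave the carbon unsaturated and one hydrogen disconnected.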
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
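As a small illustration, the permutation of Table 1 can be stored as a Python dictionary and multiplied with itself. This is only a sketch; the helper names <code>compose</code>, <code>inverse</code> and <code>is_permutation</code> are hypothetical.

```python
def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def inverse(a):
    """Return the permutation that undoes a."""
    return {y: x for x, y in a.items()}


def is_permutation(a):
    """A mapping is a permutation if its values rearrange its keys."""
    return sorted(a.values()) == sorted(a.keys())


# The permutation f from Table 1.
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}
identity = {x: x for x in f}

ff = compose(f, f)  # apply f twice
```

Here <code>ff</code> is again a permutation of the integers 1 to 11, and composing <code>f</code> with <code>inverse(f)</code> gives the identity mapping, which anticipates the group axioms introduced next.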
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Symmetric group|symmetric group]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph can be regarded as a set system whose blocks are its edges, and the permutations that preserve this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself. This action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\}) \in E</math> for every edge <math>\{u,v\} \in E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
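For very small graphs, the automorphism group can be found by testing every vertex permutation against the edge-preservation condition. The following Python sketch does exactly that. It is a brute-force illustration for simple graphs only and is not how NAUTY works, since NAUTY uses far more efficient techniques; the function name <code>automorphisms</code> is hypothetical.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all vertex permutations of {0, ..., n-1} that map edges to edges.

    A permutation is a tuple p in which p[v] is the image of vertex v.
    Because p is a bijection on a finite edge set, mapping every edge
    to an edge is enough to preserve the whole edge set.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for p in permutations(range(n)):
        if all(frozenset((p[u], p[v])) in edge_set for u, v in edges):
            result.append(p)
    return result


# Carbon skeleton of propane: a path on three vertices, 0 - 1 - 2.
path = [(0, 1), (1, 2)]

# Carbon skeleton of cyclobutane: a cycle on four vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

The path has two automorphisms (the identity and the end-to-end flip), whereas the four-membered cycle has eight, corresponding to its rotations and reflections. The cost of this approach grows as <math>n!</math>, which is why dedicated packages are used in practice.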
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The bottleneck is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
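The idea of treating structure generation as a matrix-filling problem can be sketched in a few lines of Python. The sketch below is a naive illustration and does not reproduce MASS, SMOG or MOLGEN: it fills the upper triangle of a connectivity matrix over the heavy atoms by backtracking, prunes entries that would exceed a valence, treats hydrogens implicitly, and removes duplicates afterwards with a brute-force canonical form instead of the orderly generation used by real tools. All names (<code>VALENCE</code>, <code>generate</code>, <code>canonical</code>, <code>connected</code>) are hypothetical, and the valence table is an assumption.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed standard valences


def connected(m):
    """Depth-first search over the non-zero entries of matrix m."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w, order in enumerate(m[v]):
            if order and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(m)


def canonical(atoms, m):
    """Smallest matrix (as a flat tuple) over all element-preserving relabellings."""
    n = len(atoms)
    best = None
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue  # only swap atoms of the same element
        key = tuple(m[p[i]][p[j]] for i in range(n) for j in range(n))
        if best is None or key < best:
            best = key
    return best


def generate(atoms, h_count, max_order=3):
    """All non-isomorphic connected matrices whose free valences hold h_count H."""
    n = len(atoms)
    val = [VALENCE[a] for a in atoms]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m = [[0] * n for _ in range(n)]
    used = [0] * n  # valence already consumed by heavy-atom bonds
    found = {}

    def fill(k):
        if k == len(pairs):
            if sum(val) - sum(used) == h_count and connected(m):
                found.setdefault(canonical(atoms, m), [row[:] for row in m])
            return
        i, j = pairs[k]
        limit = min(max_order, val[i] - used[i], val[j] - used[j])
        for order in range(limit + 1):  # bound: never exceed a valence
            m[i][j] = m[j][i] = order
            used[i] += order
            used[j] += order
            fill(k + 1)
            used[i] -= order
            used[j] -= order
        m[i][j] = m[j][i] = 0

    fill(0)
    return list(found.values())
```

For example, <code>generate(["C", "C", "O"], 6)</code> yields two matrices for C<sub>2</sub>H<sub>6</sub>O, corresponding to ethanol and dimethyl ether, and <code>generate(["C"] * 4, 10)</code> yields two matrices for C<sub>4</sub>H<sub>10</sub>, corresponding to butane and isobutane. The factorial cost of the canonical form limits this sketch to a handful of heavy atoms.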
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix representing all bonds in the hyper structure, and second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
nv9wlb0xnqecytrgvq7t42hwaildak8
8224
8223
2019-12-12T15:16:39Z
MehmetAzizYirik
145
/* Chemical Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in drug design, for organic synthesis design or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational metabolomics, a current area of [[wp:computational biology|computational biology]]. The theoretical basis of chemical graph generators is described and a historical overview of their development is provided.
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator is from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules. A molecule is saturated if all its atoms are saturated.
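The definitions above can be illustrated with a minimal Python sketch. The class name <code>ChemicalGraph</code>, its methods and the <code>VALENCE</code> table are hypothetical names introduced only for this illustration; the valences are the usual textbook values and are an assumption, not part of any generator described here.

```python
from collections import defaultdict

# Assumed standard valences, used only for illustration.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """A multigraph: vertices are atoms, edge multiplicity is the bond order."""

    def __init__(self, atoms):
        self.atoms = list(atoms)          # element symbol of each vertex
        self.bonds = defaultdict(int)     # (i, j) with i < j -> bond order

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] += order

    def degree(self, i):
        # Degree counts bond orders, so a double bond contributes 2.
        return sum(o for (a, b), o in self.bonds.items() if i in (a, b))

    def is_saturated(self):
        # A molecule is saturated if every atom reaches its valence.
        return all(self.degree(i) == VALENCE[el]
                   for i, el in enumerate(self.atoms))

    def is_connected(self):
        # Depth-first search from vertex 0 over all bonded neighbours.
        if not self.atoms:
            return True
        seen, stack = {0}, [0]
        while stack:
            v = stack.pop()
            for (a, b), order in self.bonds.items():
                if order > 0 and v in (a, b):
                    w = b if v == a else a
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == len(self.atoms)


# Ethene, C2H4: a double bond between the carbons, four C-H single bonds.
ethene = ChemicalGraph(["C", "C", "H", "H", "H", "H"])
ethene.add_bond(0, 1, order=2)
for carbon, hydrogen in [(0, 2), (0, 3), (1, 4), (1, 5)]:
    ethene.add_bond(carbon, hydrogen)
```

Here <code>ethene.is_saturated()</code> and <code>ethene.is_connected()</code> both hold; removing any C-H bond would leave an unsaturated carbon and a disconnected hydrogen.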
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
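As a small illustration, the permutation of Table 1 can be written as a Python dictionary and multiplied with itself. The helper names <code>compose</code>, <code>inverse</code> and <code>is_permutation</code> are hypothetical and chosen only for this sketch.

```python
def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def inverse(f):
    """Return the permutation that undoes f."""
    return {y: x for x, y in f.items()}


def is_permutation(f):
    """A permutation maps the set onto itself without repetition."""
    return sorted(f.values()) == sorted(f.keys())


# The permutation f from Table 1.
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

ff = compose(f, f)                 # f applied twice, e.g. 1 -> 4 -> 6
identity = {x: x for x in f}
```

The product <code>ff</code> is again a permutation, and <code>compose(f, inverse(f))</code> gives the identity, which anticipates the group axioms.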
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
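For very small molecular graphs, the automorphism group can be found by testing every permutation of the vertices. The sketch below does exactly that; it is a brute-force illustration and not the partition-refinement approach used by NAUTY. The function name <code>automorphisms</code> and the input layout (a list of element symbols plus a dictionary of bond orders) are assumptions made for this example.

```python
from itertools import permutations


def automorphisms(atoms, bonds):
    """Return all vertex permutations that preserve elements and bonds.

    atoms: list of element symbols, one per vertex
    bonds: dict mapping (u, v) vertex pairs to bond orders
    """
    n = len(atoms)
    order_of = {frozenset(pair): order for pair, order in bonds.items()}
    result = []
    for p in permutations(range(n)):
        # An automorphism must map each atom to an atom of the same element.
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue
        # Every bond (u, v) must be mapped onto a bond of the same order.
        if all(order_of.get(frozenset((p[u], p[v]))) == order
               for (u, v), order in bonds.items()):
            result.append(p)
    return result


# Carbon skeleton of propane: C-C-C.
propane = automorphisms(["C", "C", "C"], {(0, 1): 1, (1, 2): 1})

# Carbon skeleton of cyclobutane: a four-membered ring.
cyclobutane = automorphisms(["C"] * 4,
                            {(0, 1): 1, (1, 2): 1, (2, 3): 1, (0, 3): 1})
```

The propane skeleton has two automorphisms (the identity and the swap of the terminal carbons), while the four-membered ring has eight. The search visits all <math>n!</math> permutations, so it is only practical for a handful of atoms.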
==Methods==
Generation methods are the core of CASE systems, and these generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
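The generation tree described above can be sketched as a level-by-level (breadth-first) search in which each step adds one bond order to an intermediate structure. The code below is a simplified illustration and does not reproduce ASSEMBLE or OMG: hydrogens are treated as implicit, duplicates are removed with a brute-force canonical form, and the names <code>assemble</code>, <code>canonical</code> and <code>VALENCE</code> are hypothetical.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2}   # assumed valences of the heavy atoms


def canonical(atoms, m):
    """Smallest matrix reading over all element-preserving relabellings."""
    n = len(atoms)
    keys = [tuple(m[p[i]][p[j]] for i in range(n) for j in range(n))
            for p in permutations(range(n))
            if all(atoms[p[i]] == atoms[i] for i in range(n))]
    return min(keys)


def is_connected(m):
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(len(m)):
            if m[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(m)


def assemble(atoms, hydrogens, max_order=3):
    """Breadth-first assembly of heavy-atom skeletons with implicit hydrogens."""
    n = len(atoms)
    free = sum(VALENCE[a] for a in atoms) - hydrogens
    if free < 0 or free % 2:
        raise ValueError("formula cannot be saturated")
    root = tuple(tuple(0 for _ in range(n)) for _ in range(n))
    level = {canonical(atoms, root): root}      # root node: atoms, no bonds
    for _ in range(free // 2):                  # one bond order per tree level
        next_level = {}
        for m in level.values():
            for i in range(n):
                for j in range(i + 1, n):
                    if (m[i][j] < max_order
                            and sum(m[i]) < VALENCE[atoms[i]]
                            and sum(m[j]) < VALENCE[atoms[j]]):
                        child = [list(row) for row in m]
                        child[i][j] += 1
                        child[j][i] += 1
                        child = tuple(map(tuple, child))
                        # Keep one representative per isomorphism class.
                        next_level.setdefault(canonical(atoms, child), child)
        level = next_level
    return [m for m in level.values() if is_connected(m)]


isomers = assemble(["C", "C", "C", "O"], hydrogens=8)   # C3H8O
```

For C3H8O the search returns three skeletons, corresponding to 1-propanol, 2-propanol and ethyl methyl ether. All intermediate structures of a level are held in memory, which is the storage cost of tree-based assembly noted in the text.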
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
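The matrix-generation idea can be illustrated with a naive Python sketch that fills the upper triangle of a connectivity matrix directly, without building intermediate structures, and then tests each matrix. This is a brute-force illustration only: it is not the orderly, group-theoretic generation used by MOLGEN, and it removes duplicates by an exhaustive canonical form. Hydrogens are implicit, and the names <code>generate_matrices</code>, <code>canonical_form</code> and <code>VALENCE</code> are hypothetical.

```python
from itertools import permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2}   # assumed valences of the heavy atoms


def is_connected(matrix):
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(len(matrix)):
            if matrix[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(matrix)


def canonical_form(atoms, matrix):
    """Smallest matrix reading over all element-preserving relabellings."""
    n = len(atoms)
    return min(tuple(matrix[p[i]][p[j]] for i in range(n) for j in range(n))
               for p in permutations(range(n))
               if all(atoms[p[i]] == atoms[i] for i in range(n)))


def generate_matrices(atoms, hydrogens, max_order=3):
    """Enumerate connectivity matrices of saturated, connected molecules."""
    n = len(atoms)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    found = set()
    for orders in product(range(max_order + 1), repeat=len(pairs)):
        matrix = [[0] * n for _ in range(n)]
        for (i, j), order in zip(pairs, orders):
            matrix[i][j] = matrix[j][i] = order
        # Valence left on each heavy atom is filled by implicit hydrogens.
        free = [VALENCE[a] - sum(row) for a, row in zip(atoms, matrix)]
        if min(free) < 0 or sum(free) != hydrogens:
            continue
        if not is_connected(matrix):
            continue
        found.add(canonical_form(atoms, matrix))
    return found


butanes = generate_matrices(["C", "C", "C", "C"], hydrogens=10)   # C4H10
```

For C4H10 the two surviving matrices correspond to butane and isobutane. Only the set of canonical forms is stored, which reflects the low memory usage attributed to matrix generators, although the number of candidate matrices grows exponentially with the number of atom pairs.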
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps among the substructures can also be handled because of the hypergraph representation. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
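A recursive bond-removal procedure of this kind can be sketched in a few lines of Python. The sketch is a strong simplification and not the COCOA algorithm: it assumes single bonds only, and it assumes that the constraint on each atom is simply the number of heavy-atom neighbours it must keep (for example, derived from the number of attached hydrogens). The function name <code>reduce_hyperstructure</code> is hypothetical.

```python
from itertools import combinations


def is_connected(n, edges):
    neighbours = {v: set() for v in range(n)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in neighbours[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n


def reduce_hyperstructure(target_degrees):
    """Delete bonds from the fully bonded hyper structure until each atom
    has exactly its target number of neighbours."""
    n = len(target_degrees)
    all_edges = list(combinations(range(n), 2))   # every atom pair is bonded
    results = []

    def recurse(edges, start):
        degree = [0] * n
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        # Removal only lowers degrees, so an atom below target is a dead end.
        if any(d < t for d, t in zip(degree, target_degrees)):
            return
        if degree == list(target_degrees):
            if is_connected(n, edges):
                results.append(sorted(edges))
            return
        # Delete bonds in increasing index order so that each bond subset
        # is reached along exactly one branch of the generation tree.
        for k in range(start, len(all_edges)):
            recurse(edges - {all_edges[k]}, k + 1)

    recurse(frozenset(all_edges), 0)
    return results


# Heavy atoms of ethanol: CH3 (1 neighbour), CH2 (2 neighbours), OH (1).
ethanol = reduce_hyperstructure([1, 2, 1])
```

For the ethanol constraints the only surviving bond set is the chain 0-1-2. The hyper structure has <math>n(n-1)/2</math> bonds, so the number of deletion branches grows quickly with the number of atoms, which matches the main disadvantage of reduction methods stated above.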
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; and second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
4rq35jtv4zlz9zvplh5eekb57b4qkdp
8225
8224
2019-12-12T15:29:10Z
MehmetAzizYirik
145
/* Abstract */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, for molecular formulas with more than 30 heavy atoms, it was too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs and, notably, from Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules, and a saturated but disconnected graph represents more than one molecule.
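The definitions above can be illustrated with a minimal Python sketch. The data layout (a list of element symbols plus a dictionary of bond orders), the small <code>VALENCE</code> table and the function names are assumptions made for this example only and are not taken from any of the generators discussed here.

```python
from collections import deque

# Illustrative valences only; real generators use fuller element tables.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degree(bonds, atom_index):
    """Sum of bond orders (edge multiplicities) incident to one atom."""
    return sum(order for (u, v), order in bonds.items() if atom_index in (u, v))

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(degree(bonds, i) == VALENCE[el] for i, el in enumerate(atoms))

def is_connected(atoms, bonds):
    """Breadth-first search: every atom must be reachable from atom 0."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for u, v in bonds:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(is_saturated(atoms, bonds), is_connected(atoms, bonds))
```

Two separate H<sub>2</sub> molecules written as one graph would pass the saturation check but fail the connectivity check, which is why both tests are needed.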
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
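As a small worked example, the permutation <math>f</math> of Table 1 can be composed with itself in Python. The dictionary representation and the helper names <code>compose</code> and <code>is_permutation</code> are choices made for this sketch only.

```python
def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

# The permutation f from Table 1.
f = dict(zip(range(1, 12), [4, 2, 11, 6, 1, 5, 8, 9, 7, 10, 3]))

ff = compose(f, f)   # f applied twice
print(ff[1])         # f(f(1)) = f(4) = 6
print(is_permutation(ff))
```

Written in cycles, <math>f</math> consists of a 4-cycle (1 4 6 5), a 3-cycle (7 8 9) and a 2-cycle (3 11), so composing <math>f</math> with itself 12 times returns every element to its starting position.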
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
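For very small graphs, the definition of <math>Aut(G)</math> can be checked directly by testing every permutation of the vertices. The following Python sketch does exactly that; it is a brute-force illustration and not the partition-refinement approach used by NAUTY. It treats the graph as a simple unlabelled graph, whereas a molecular graph would additionally require element labels and bond orders to be preserved. The function name <code>automorphisms</code> is hypothetical.

```python
from itertools import permutations

def automorphisms(vertices, edges):
    """Return every vertex permutation that maps each edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for image in permutations(vertices):
        a = dict(zip(vertices, image))
        # a is a bijection on V, so mapping all edges into E is sufficient.
        if all(frozenset((a[u], a[v])) in edge_set for u, v in edges):
            result.append(a)
    return result

# Carbon skeleton of cyclobutane: a cycle on four vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(len(automorphisms([0, 1, 2, 3], ring)))
```

The four-membered ring has eight automorphisms (four rotations and four reflections). Because the number of permutations grows as <math>n!</math>, this approach is only practical for a handful of vertices.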
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
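The idea of generating connectivity matrices and then filtering them can be sketched in Python for very small molecular formulas. This is a naive illustration, not the algorithm of MASS, SMOG or MOLGEN: it fills the upper triangle of a bond-order matrix by backtracking, keeps only saturated and connected matrices, and removes duplicates afterwards with a brute-force canonical form, whereas orderly generation avoids producing duplicates in the first place. The <code>VALENCE</code> table, the <code>MAX_ORDER</code> limit and all function names are assumptions of this sketch.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}   # illustrative valences
MAX_ORDER = 3                                # highest bond order considered

def connectivity_matrices(atoms):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(atoms)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    free = [VALENCE[a] for a in atoms]       # unused valence per atom
    m = [[0] * n for _ in range(n)]

    def fill(k):
        if k == len(pairs):
            if all(f == 0 for f in free):
                yield [row[:] for row in m]
            return
        i, j = pairs[k]
        for order in range(min(free[i], free[j], MAX_ORDER) + 1):
            m[i][j] = m[j][i] = order
            free[i] -= order
            free[j] -= order
            # Row i is complete once its last entry is set: prune if unsaturated.
            if j < n - 1 or free[i] == 0:
                yield from fill(k + 1)
            free[i] += order
            free[j] += order
        m[i][j] = m[j][i] = 0

    yield from fill(0)

def is_connected(m):
    """Depth-first search over non-zero matrix entries."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v, order in enumerate(m[u]):
            if order and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(m)

def canonical_form(atoms, m):
    """Smallest flattened matrix over all element-preserving relabellings."""
    n = len(atoms)
    best = None
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(m[p[i]][p[j]] for i in range(n) for j in range(n))
        if best is None or key < best:
            best = key
    return best

def generate_isomers(atoms):
    """Set of canonical forms of all connected, saturated structures."""
    return {canonical_form(atoms, m)
            for m in connectivity_matrices(atoms) if is_connected(m)}

# C2H4O: acetaldehyde, vinyl alcohol and oxirane.
print(len(generate_isomers(["C", "C", "O", "H", "H", "H", "H"])))
```

For C<sub>2</sub>H<sub>4</sub>O the sketch finds three constitutional isomers. The canonical form used here costs up to <math>n!</math> permutations per matrix, so the sketch does not scale beyond a few atoms; this cost is what canonical labelling tools and orderly generation are designed to avoid.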
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
13b9kp5x58glzgn700egog8vr719acy
8226
8225
2019-12-12T16:20:17Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. [[wp:Molecular graph|Molecular structures]] are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:open-source software|open-source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. 
This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, for molecular formulas with more than 30 heavy atoms, it was too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The structure is then repeatedly transformed into another structure until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:Isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator was developed at ACD Labs, notably with Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules that consist of a single connected component; saturation alone does not guarantee connectivity.
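The definitions above can be illustrated with a minimal Python sketch. The data layout (a list of element symbols plus a dictionary of bond orders), the function names and the small valence table are assumptions made for this illustration and do not come from any of the generators discussed here.

```python
from collections import deque

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degree(bonds, atom_index):
    """Degree of an atom in the multigraph: the sum of its bond orders."""
    return sum(order for (u, v), order in bonds.items() if atom_index in (u, v))


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(degree(bonds, i) == VALENCE[el] for i, el in enumerate(atoms))


def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for (u, v), order in bonds.items():
        if order > 0:
            neighbours[u].add(v)
            neighbours[v].add(u)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbours[u] - seen:
            seen.add(v)
            queue.append(v)
    return len(seen) == len(atoms)


# Ethene, C2H4: the double bond is an edge of multiplicity 2.
ethene_atoms = ["C", "C", "H", "H", "H", "H"]
ethene_bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1}

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2_atoms = ["H", "H", "H", "H"]
two_h2_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation and connectivity have to be checked separately: the four hydrogen atoms are all saturated, yet they do not form a single molecule.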
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
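As a small illustration, the permutation of Table 1 can be stored as a Python dictionary and composed with itself. The function names below are hypothetical and chosen for this sketch only.

```python
def compose(a, b):
    """Product of two permutations as function composition: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())


# The permutation f from Table 1.
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

# Composing f with itself gives another permutation of the same set.
ff = compose(f, f)

# Repeated composition eventually returns the identity permutation.
identity = {x: x for x in f}
power = identity
for _ in range(12):
    power = compose(f, power)
```

Here <code>ff</code> maps 1 to 6, because f(1) = 4 and f(4) = 6. The twelfth power of f is the identity, since f consists of cycles of lengths 4, 3, 2, 1 and 1.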
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|permutation group]] called the symmetric group on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations that preserve the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; that is, the action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for these computations.
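The definition of an automorphism can be checked directly on very small graphs by testing every permutation of the vertices. The following Python sketch does this by brute force; it is only meant to illustrate the definition and is not how NAUTY works, since the number of permutations grows as <math>n!</math>. The function name and the example graphs are assumptions of this sketch, and atom types and bond orders are ignored.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all vertex permutations of a simple graph that preserve its edge set.

    n     -- number of vertices, labelled 0 .. n-1
    edges -- iterable of (u, v) pairs
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for image in permutations(range(n)):
        # image[u] is the vertex that u is mapped to.
        mapped = {frozenset((image[u], image[v])) for u, v in edges}
        if mapped == edge_set:
            result.append(image)
    return result


# Carbon skeletons of three small hydrocarbons.
butane = [(0, 1), (1, 2), (2, 3)]               # path on four vertices
isobutane = [(0, 1), (0, 2), (0, 3)]            # star with centre 0
cyclobutane = [(0, 1), (1, 2), (2, 3), (3, 0)]  # four-membered ring

orders = {
    "butane": len(automorphisms(4, butane)),
    "isobutane": len(automorphisms(4, isobutane)),
    "cyclobutane": len(automorphisms(4, cyclobutane)),
}
```

The orders of the three automorphism groups are 2, 6 and 8, respectively: the path can only be reversed, the three outer atoms of the star can be permuted freely, and the ring has the eight symmetries of a square.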
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data; spectral data are used only for structure scoring and to supply substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the root of the tree is the atom set, together with the substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The bottleneck is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
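The generation tree described above can be sketched in a few lines of Python. This is a simplified illustration under several stated assumptions and not the algorithm of OMG or of any other published generator: only heavy atoms are stored, hydrogens are counted implicitly, bond orders are limited to 3, and duplicates are rejected with a brute-force canonical form (the smallest relabelled bond list over all element-preserving permutations) in place of NAUTY. All names are hypothetical.

```python
from itertools import combinations, permutations

# Assumed standard valences (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2}


def canonical(atoms, bonds):
    """Brute-force canonical form: smallest bond list over element-preserving relabellings."""
    n = len(atoms)
    forms = []
    for perm in permutations(range(n)):
        if all(atoms[perm[i]] == atoms[i] for i in range(n)):
            relabelled = sorted(
                (tuple(sorted((perm[u], perm[v]))), order)
                for (u, v), order in bonds.items()
            )
            forms.append(tuple(relabelled))
    return min(forms)


def free_valence(atoms, bonds, i):
    used = sum(order for (u, v), order in bonds.items() if i in (u, v))
    return VALENCE[atoms[i]] - used


def is_connected(n, bonds):
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for a, b in bonds:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return len(seen) == n


def assemble(atoms, hydrogens):
    """Breadth-first assembly: each tree level adds one unit of bond order."""
    total = sum(VALENCE[a] for a in atoms) - hydrogens
    if total < 0 or total % 2:
        return []
    level = {(): {}}  # canonical form -> one representative bond dictionary
    for _ in range(total // 2):
        next_level = {}
        for bonds in level.values():
            for u, v in combinations(range(len(atoms)), 2):
                if (bonds.get((u, v), 0) < 3
                        and free_valence(atoms, bonds, u) > 0
                        and free_valence(atoms, bonds, v) > 0):
                    extended = dict(bonds)
                    extended[(u, v)] = extended.get((u, v), 0) + 1
                    # Keep only one representative per isomorphism class.
                    next_level.setdefault(canonical(atoms, extended), extended)
        level = next_level
    # At the last level the remaining free valences equal the hydrogen count.
    return [b for b in level.values() if is_connected(len(atoms), b)]


c4h10 = assemble(["C", "C", "C", "C"], 10)   # butane and isobutane
c2h6o = assemble(["C", "C", "O"], 6)         # ethanol and dimethyl ether
```

Every level of the tree is kept in memory as a dictionary of intermediate structures, which mirrors the storage problem mentioned above for breadth-first generators.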
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
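The idea of generating matrices and accepting only canonical ones can be sketched as follows. This Python example is a deliberately naive illustration and not the algorithm of MASS, SMOG or MOLGEN: it enumerates every complete matrix and tests canonicity afterwards, whereas real orderly generators prune partial matrices early. Hydrogens are implicit, bond orders are limited to 3, and a matrix is called canonical here if its upper triangle is lexicographically largest among all element-preserving relabellings; these choices and all names are assumptions of the sketch.

```python
from itertools import combinations, permutations, product

# Assumed standard valences (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2}


def is_connected(matrix):
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if matrix[u][v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def is_canonical(matrix, pairs, perms):
    """True if no element-preserving relabelling gives a larger upper triangle."""
    own = tuple(matrix[u][v] for u, v in pairs)
    return all(own >= tuple(matrix[p[u]][p[v]] for u, v in pairs) for p in perms)


def generate(atoms, hydrogens, max_order=3):
    """Yield one connectivity matrix per isomer of the heavy-atom skeleton."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    perms = [p for p in permutations(range(n))
             if all(atoms[p[i]] == atoms[i] for i in range(n))]
    for orders in product(range(max_order + 1), repeat=len(pairs)):
        matrix = [[0] * n for _ in range(n)]
        for (u, v), order in zip(pairs, orders):
            matrix[u][v] = matrix[v][u] = order
        degrees = [sum(row) for row in matrix]
        # Valence constraint: no atom may exceed its valence.
        if any(d > VALENCE[a] for d, a in zip(degrees, atoms)):
            continue
        # Saturation: the remaining free valences must be filled exactly by the hydrogens.
        if sum(VALENCE[a] - d for d, a in zip(degrees, atoms)) != hydrogens:
            continue
        if is_connected(matrix) and is_canonical(matrix, pairs, perms):
            yield matrix


c4h10 = list(generate(["C", "C", "C", "C"], 10))
c2h6o = list(generate(["C", "C", "O"], 6))
```

Because the canonicity test looks only at the matrix itself, no list of previously generated structures has to be stored to reject duplicates, which is the source of the low memory usage mentioned above.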
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because all possible bonds are present in the hypergraph, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections that would create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
1k4oilz2r8r2im6p524fu1t7x0yje0u
8227
8226
2019-12-12T16:21:45Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The structure is then repeatedly transformed into another structure until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that an atom of the chemical element can make. For example, the valence of carbon is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
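The definitions above can be illustrated with a minimal Python sketch. The valence table, the bond dictionary layout and the function names are assumptions made for this example only; they are not taken from any of the generators discussed here.

```python
from collections import deque

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degree(bonds, atom):
    """Sum of bond orders (edge multiplicities) at `atom`."""
    return sum(order for (u, v), order in bonds.items() if atom in (u, v))

def is_saturated(atoms, bonds):
    """True if the degree of every atom equals its valence."""
    return all(degree(bonds, i) == VALENCE[el] for i, el in enumerate(atoms))

def is_connected(atoms, bonds):
    """Breadth-first search from atom 0; True if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for u, v in bonds:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, queue = {0}, deque([0])
    while queue:
        for w in neighbours[queue.popleft()] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
```

Here the double bond is stored as a single dictionary entry with multiplicity 2, which is one simple way to encode a multigraph.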
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
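As a small illustration, the permutation of Table 1 can be written as a Python dictionary and composed with itself. The helper names `compose` and `is_permutation` are hypothetical and chosen only for this sketch.

```python
# Table 1 as a dictionary: f maps x to f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """True if p maps its domain one-to-one onto itself."""
    return sorted(p.keys()) == sorted(p.values())

# The product of f with itself is again a permutation of 1..11.
ff = compose(f, f)
```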
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, forms a group under function composition, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. A graph can be viewed as a set system whose blocks are its edges, and the permutations preserving this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; that is, it preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph, <math>G=(V,E)</math>, and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math>, if <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for this purpose.
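The automorphism condition can be checked directly on very small graphs. The following Python sketch tests every permutation of the vertices, which is feasible only for a handful of atoms; it is a brute-force illustration and not the algorithm used by NAUTY. It also ignores element labels and bond orders, and the function name is an assumption of this example.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All permutations a of range(n) such that {a(u), a(v)} is an edge
    whenever {u, v} is an edge."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for a in permutations(range(n)):
        if all(frozenset((a[u], a[v])) in edge_set for u, v in edges):
            result.append(a)
    return result

# Carbon skeleton of a four-membered ring: a cycle on 4 vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Carbon skeleton of a chain of four atoms: a path on 4 vertices.
chain = [(0, 1), (1, 2), (2, 3)]
```

Each permutation is stored as a tuple `a` in which `a[x]` is the image of vertex `x`. The ring has eight automorphisms (four rotations and four reflections), whereas the chain has only two (the identity and the end-to-end flip).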
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
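The role of canonical labelling in avoiding duplicates can be sketched in Python as follows. This is a brute-force illustration that tries every relabelling of the vertices; it is not the algorithm implemented in NAUTY, and it filters duplicates after generation rather than pruning them during generation as an orderly method would. The function names and the choice of the lexicographically smallest edge list as the canonical form are assumptions of this example.

```python
from itertools import permutations

def canonical_form(n, edges):
    """Lexicographically smallest sorted edge list over all relabellings."""
    best = None
    for p in permutations(range(n)):
        relabelled = sorted(tuple(sorted((p[u], p[v]))) for u, v in edges)
        if best is None or relabelled < best:
            best = relabelled
    return tuple(best)

def remove_duplicates(n, graphs):
    """Keep one representative per isomorphism class."""
    seen, unique = set(), []
    for edges in graphs:
        key = canonical_form(n, edges)
        if key not in seen:
            seen.add(key)
            unique.append(edges)
    return unique

# Three differently labelled paths and one star on 4 vertices.
graphs = [
    [(0, 1), (1, 2), (2, 3)],
    [(1, 0), (0, 3), (3, 2)],
    [(2, 0), (0, 1), (1, 3)],
    [(0, 1), (0, 2), (0, 3)],
]
```

Two graphs are isomorphic exactly when their canonical forms are equal, so the three paths collapse to one representative and the star is kept separately.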
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
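The idea of generating connectivity matrices and then testing them can be sketched with a small backtracking routine in Python. This is only an illustration under simplifying assumptions (atoms are given by their valences, bond orders are limited to 3, and no isomorphism check is performed); it is not the algorithm of MASS, SMOG or MOLGEN, and the function names are hypothetical.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield symmetric bond-order matrices whose row sums equal `valences`."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)  # free valences left on each atom

    def fill(k):
        if k == len(pairs):
            if all(r == 0 for r in remaining):
                yield [row[:] for row in matrix]
            return
        i, j = pairs[k]
        # A bond order can never exceed the free valence of either atom.
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)

def is_connected(matrix):
    """Depth-first search over non-zero entries, starting from atom 0."""
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if matrix[u][v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

# C2H2: two carbons (valence 4) and two hydrogens (valence 1).
c2h2 = [m for m in connectivity_matrices([4, 4, 1, 1]) if is_connected(m)]
```

For C2H2 the sketch returns two labelled matrices that differ only in which hydrogen is attached to which carbon. Both describe the same molecule, which is why practical generators combine matrix generation with canonicity tests.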
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation listing atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
13b9kp5x58glzgn700egog8vr719acy
8228
8227
2019-12-12T16:34:57Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of graph generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining the two approaches.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive two-step structure generation process: first, all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that an atom of the chemical element can make. For example, the valence of carbon is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The product of two permutations, <math>a</math> and <math>b</math>, is defined as their function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is again a permutation.
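As a small worked example, the permutation of Table 1 can be stored as a Python dictionary and composed with itself. The helper names <code>compose</code>, <code>inverse</code> and <code>is_permutation</code> are illustrative choices, not part of any cited package.

```python
# The permutation of Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def inverse(p):
    """Return the permutation that undoes p."""
    return {y: x for x, y in p.items()}


def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)                 # apply f twice
identity = compose(f, inverse(f))  # f followed by its inverse fixes every x
print(ff[1])  # f(f(1)) = f(4) = 6
```

The product <code>ff</code> is again a permutation of the same eleven integers, and composing <code>f</code> with its inverse gives the identity, which anticipates the group axioms stated next.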
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under function composition, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations of <math>X</math> that preserve the set system are its automorphisms; for a graph, where the blocks are the edges, these are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; that is, it preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG builds on NAUTY for this purpose.
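For very small graphs, the automorphism group can be found by testing every vertex permutation against the definition. The following Python sketch does exactly that; it is a brute-force illustration and not the partition-refinement approach used by NAUTY. It assumes that all vertices carry the same element label, and it stores the multigraph as a mapping from unordered vertex pairs to bond orders (both are choices made for this example).

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return every permutation of range(n) that maps the edge set onto itself.

    edges maps frozenset({u, v}) -> bond order, so multiplicities must match too.
    """
    found = []
    for perm in permutations(range(n)):
        image = {frozenset(perm[v] for v in edge): order
                 for edge, order in edges.items()}
        if image == edges:
            found.append(perm)
    return found


# Carbon skeleton of a four-membered ring: 0-1-2-3-0, all single bonds.
ring = {frozenset(e): 1 for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}

# Carbon skeleton of buta-1,3-diene: 0=1-2=3.
diene = {frozenset((0, 1)): 2, frozenset((1, 2)): 1, frozenset((2, 3)): 2}

print(len(automorphisms(4, ring)))   # rotations and reflections of a square
print(len(automorphisms(4, diene)))  # identity and end-to-end reversal
```

The ring has eight automorphisms and the diene skeleton has two. Because the loop visits all <math>n!</math> permutations, the sketch is only practical for a handful of atoms, which is why dedicated packages are used in real generators.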
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN for large molecules. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> On a single core, MOLGEN outperforms PMG; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
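The bond-by-bond tree extension with duplicate removal can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of OMG or MOLGEN: a brute-force canonical form (the smallest relabelled bond-order matrix) stands in for NAUTY, the valence table and the cap of three on bond orders are assumptions, and all names are hypothetical.

```python
from itertools import permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed valences
MAX_ORDER = 3                               # assumed cap: triple bonds


def label_preserving_permutations(atoms):
    """All vertex permutations that map each atom onto one of the same element."""
    classes = {}
    for index, element in enumerate(atoms):
        classes.setdefault(element, []).append(index)
    groups = list(classes.values())
    perms = []
    for choice in product(*(permutations(g) for g in groups)):
        perm = list(range(len(atoms)))
        for group, image in zip(groups, choice):
            for source, target in zip(group, image):
                perm[source] = target
        perms.append(perm)
    return perms


def canonical(matrix, perms):
    """Smallest relabelled copy of the bond-order matrix (brute force)."""
    n = len(matrix)
    return min(tuple(tuple(matrix[p[i]][p[j]] for j in range(n))
                     for i in range(n)) for p in perms)


def is_connected(matrix):
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, order in enumerate(matrix[i]):
            if order and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(matrix)


def generate(atoms):
    """Breadth-first assembly: add one bond-order unit per level and keep one
    representative per isomorphism class; return saturated connected graphs."""
    n = len(atoms)
    perms = label_preserving_permutations(atoms)
    level = {tuple(tuple(0 for _ in range(n)) for _ in range(n))}
    results = set()
    while level:
        next_level = set()
        for matrix in level:
            free = [VALENCE[a] - sum(row) for a, row in zip(atoms, matrix)]
            if not any(free):          # every atom is saturated
                if is_connected(matrix):
                    results.add(matrix)
                continue
            for i in range(n):
                for j in range(i + 1, n):
                    if free[i] and free[j] and matrix[i][j] < MAX_ORDER:
                        grown = [list(row) for row in matrix]
                        grown[i][j] += 1
                        grown[j][i] += 1
                        next_level.add(canonical(grown, perms))
        level = next_level
    return results


isomers = generate(list("CCOHH"))  # molecular formula C2H2O
print(len(isomers))
```

For C<sub>2</sub>H<sub>2</sub>O the sketch returns three graphs, corresponding to ketene, ethynol and oxirene; like any purely graph-based generator, it does not judge chemical stability. Each level of the search is held in memory as a set, which mirrors the storage problem of breadth-first generators noted above.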
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
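The idea of filling connectivity matrices directly, without storing intermediate structures, can be sketched as a depth-first backtracking search over the upper-triangle entries. The Python generator below is a simplified illustration and not the algorithm of MASS, SMOG or MOLGEN: it enumerates labelled matrices, so isomorphic duplicates still appear and would have to be removed by a canonicity test. The function name, the <code>max_order</code> parameter and the example valences are assumptions.

```python
def fill_matrices(valences, max_order=3):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    free = list(valences)  # valence still unused at each atom

    def backtrack(k):
        if k == len(pairs):
            if not any(free):
                yield tuple(tuple(row) for row in matrix)
            return
        i, j = pairs[k]
        for order in range(min(free[i], free[j], max_order) + 1):
            matrix[i][j] = matrix[j][i] = order
            free[i] -= order
            free[j] -= order
            # Bound: row i is complete once j is the last column, so any
            # leftover valence at atom i can never be satisfied.
            if not (j == n - 1 and free[i]):
                yield from backtrack(k + 1)
            free[i] += order
            free[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from backtrack(0)


# C2H2 with atoms ordered C, C, H, H.
matrices = list(fill_matrices([4, 4, 1, 1]))
print(len(matrices))
```

For the valences of C<sub>2</sub>H<sub>2</sub>, the search yields two labelled matrices that differ only in which hydrogen is attached to which carbon; both describe acetylene. Only one matrix is held in memory at a time, which illustrates the low memory usage of this family of methods.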
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators, which avoid storing intermediate structures, have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
i7g7lpkshrcat2mi58mlwqwl93isvnk
8229
8228
2019-12-12T16:40:19Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical graph generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], in [[wp:organic synthesis|organic synthesis design]] and in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a special case of the graph generation problem. Molecular structures are graphs with chemical constraints such as [[wp:Valence (chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators adapted for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in organic chemistry.<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than atoms were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. 
Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. 
Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an open source structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an integer partitioning-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the molecular formula into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. 
Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 
1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 
31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, and exploiting these symmetries accelerates the construction process. To date, MOLGEN is the only maintained, efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections; in a multigraph, each edge is counted with its multiplicity. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: A permutation of a set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second line of Table 1 shows a permutation of the first line. The product of two permutations, <math>a</math> and <math>b</math>, is defined as their function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is again a permutation.
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under function composition, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The permutations of <math>X</math> that preserve the set system are its automorphisms; for a graph, where the blocks are the edges, these are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; that is, it preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG builds on NAUTY for this purpose.
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG running on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. The size of this graph is then reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. In contrast to assembly, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Owing to the hypergraph representation, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, and the run time increases accordingly.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
e0r0g8134t6gll9qejypwq5tyrt6vzg
8230
8229
2019-12-12T23:38:32Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph (discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence (chemistry)|valences]], bond multiplicity and fragments. The first structure generators were modified versions of graph generators, adapted for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: simulated annealing and genetic algorithms. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient branch-and-bound methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and isomorphism checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. 
Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who also co-developed MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced the first open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
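The definitions above can be illustrated with a minimal Python sketch. It assumes a molecule is stored as a symmetric matrix of bond orders (a multigraph) and uses a small, illustrative valence table; the names <code>VALENCE</code>, <code>is_saturated</code> and <code>is_connected</code> are hypothetical and do not come from any of the generators discussed here.

```python
from collections import deque

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degree(bonds, i):
    """Degree of atom i: the sum of its bond orders (edge multiplicities)."""
    return sum(bonds[i])

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom's degree equals its valence."""
    return all(degree(bonds, i) == VALENCE[a] for i, a in enumerate(atoms))

def is_connected(bonds):
    """Breadth-first search from atom 0 must reach every atom."""
    n = len(bonds)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if bonds[u][v] > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n

# Formaldehyde, CH2O: atoms indexed C=0, O=1, H=2, H=3.
atoms = ["C", "O", "H", "H"]
bonds = [
    [0, 2, 1, 1],  # C: double bond to O, single bonds to both H
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

# Two separate H2 molecules: every atom is saturated, but the graph
# is not connected, so it does not describe a single molecule.
h_atoms = ["H", "H", "H", "H"]
h_bonds = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
```

The second example shows why saturation alone is not sufficient: the four hydrogen atoms are all saturated, yet the connectivity check rejects the graph.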
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
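As a small illustration, the permutation of Table 1 can be written as a Python dictionary and composed with itself. This is only a sketch; the helper names <code>compose</code> and <code>is_permutation</code> are chosen here for illustration.

```python
# Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def is_permutation(p):
    """A permutation maps a set onto itself without repeats."""
    return sorted(p.values()) == sorted(p)

def compose(a, b):
    """Product of permutations as function composition: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

# Composing f with itself gives another permutation of the same set.
ff = compose(f, f)
```

For instance, <code>ff[1]</code> is obtained by following 1 → 4 → 6 in Table 1.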
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group, the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. Set systems consist of a finite set <math>X</math> and its subsets, called blocks of the set. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps a graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph, <math>G=(V,E)</math>, and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math>, if <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG uses NAUTY for these calculations.
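The definition can be made concrete with a brute-force Python sketch for small simple graphs. It tests every vertex permutation against the edge-preservation condition and derives a naive canonical form by taking the smallest relabelled edge list. This is an assumption-laden illustration only: it ignores bond orders and element labels, and it is not how NAUTY works, since NAUTY avoids enumerating all <math>n!</math> permutations.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex permutations that map every edge onto an edge.

    Vertices are 0..n-1; edges are unordered pairs of a simple graph.
    """
    edge_set = {frozenset(e) for e in edges}
    found = []
    for perm in permutations(range(n)):
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            found.append(perm)
    return found

def canonical_form(n, edges):
    """Smallest sorted edge list over all relabellings (brute force).

    Two graphs on n vertices are isomorphic exactly when their
    canonical forms are equal.
    """
    return min(
        tuple(sorted(tuple(sorted((p[u], p[v]))) for u, v in edges))
        for p in permutations(range(n))
    )

path = [(0, 1), (1, 2)]                   # three atoms in a chain
same_path = [(0, 1), (0, 2)]              # the same chain, labelled differently
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # four-membered ring
```

The chain has two automorphisms (the identity and the end-to-end flip), and the four-membered ring has eight. The two differently labelled chains share one canonical form, which is how a generator can recognise them as duplicates.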
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG running on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
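The generation tree and the duplicate problem can be sketched in Python. The following is a naive breadth-first assembly over heavy atoms with implicit hydrogens, followed by a brute-force canonical form that filters duplicates afterwards. It is an illustrative assumption, not the algorithm of ASSEMBLE, OMG or MOLGEN: an orderly generator would reject non-canonical extensions during the search instead of storing and filtering them. All function names and the valence table are hypothetical.

```python
from collections import deque
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed standard valences

def free_valences(atoms, pairs, state):
    """Valence left on each heavy atom after subtracting its bond orders."""
    free = [VALENCE[a] for a in atoms]
    for (i, j), order in zip(pairs, state):
        free[i] -= order
        free[j] -= order
    return free

def is_connected(n, pairs, state):
    """Depth-first search over bonded atom pairs, starting at atom 0."""
    adj = {i: set() for i in range(n)}
    for (i, j), order in zip(pairs, state):
        if order:
            adj[i].add(j)
            adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n

def assemble(atoms, n_hydrogens):
    """Breadth-first extension: each child adds one bond order to a pair."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    start = tuple(0 for _ in pairs)  # root node: atoms only, no bonds
    queue, seen, finished = deque([start]), {start}, []
    while queue:
        state = queue.popleft()
        free = free_valences(atoms, pairs, state)
        if sum(free) <= n_hydrogens:
            # Leaf: the hydrogens exactly fill the remaining valences.
            if sum(free) == n_hydrogens and is_connected(n, pairs, state):
                finished.append(state)
            continue
        for k, (i, j) in enumerate(pairs):
            if free[i] > 0 and free[j] > 0 and state[k] < 3:
                child = state[:k] + (state[k] + 1,) + state[k + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return pairs, finished

def canonical(atoms, pairs, state):
    """Smallest bond list over all relabellings that keep element symbols."""
    n = len(atoms)
    best = None
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue
        relabelled = tuple(sorted(
            (tuple(sorted((p[i], p[j]))), order)
            for (i, j), order in zip(pairs, state) if order
        ))
        if best is None or relabelled < best:
            best = relabelled
    return best

# C2H6O: heavy atoms C, C, O with six implicit hydrogens.
atoms = ["C", "C", "O"]
pairs, labelled = assemble(atoms, 6)
isomers = {canonical(atoms, pairs, s) for s in labelled}
```

For C2H6O the search reaches three labelled structures, but only two distinct isomers remain after canonical filtering (ethanol and dimethyl ether): ethanol is produced twice because the two carbon atoms are interchangeable. The <code>seen</code> set also shows the memory cost mentioned above, since every intermediate structure is stored.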
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
MOLGEN first generates all the combinatorically possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, GEN starts from a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN avoids these disadvantages in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
a5wu8t21qbg3z4tnevpqv1n7k1asvbi
8231
8230
2019-12-12T23:47:14Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], in [[wp:organic synthesis|organic synthesis design]] and in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], bond multiplicity and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for the mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs and, notably, from Elyashberg, one of the developers of MASS. The structure generator was part of a well-known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
OMG generates large molecules almost 2000 times more slowly than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the vertices and edges represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds an atom of that chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules. A molecule is saturated if all its atoms are saturated.
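The definitions above can be illustrated with a minimal Python sketch. The valence table, the function names and the choice of formaldehyde as the example are assumptions made for illustration only; bonds are stored as a mapping from an atom pair to its bond order, so that the degree of an atom is the sum of the orders of its bonds.

```python
from collections import defaultdict

# Illustrative valence table; only the elements used below are listed.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(n_atoms, bonds):
    """Sum of bond orders at each atom; bonds maps (i, j) -> bond order."""
    deg = [0] * n_atoms
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """True if the degree of every atom equals its valence."""
    return all(d == VALENCE[a]
               for a, d in zip(atoms, degrees(len(atoms), bonds)))

def is_connected(n_atoms, bonds):
    """Depth-first search from atom 0 over bonded neighbours."""
    neighbours = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in neighbours[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n_atoms

# Formaldehyde, CH2O: atoms are indexed 0..3 and C=O is a double bond.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(degrees(len(atoms), bonds), is_saturated(atoms, bonds),
      is_connected(len(atoms), bonds))
```

A generator would apply both checks to a candidate structure: a saturated but disconnected graph describes several molecules rather than one.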
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
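As a small worked example, the permutation of Table 1 can be stored as a mapping and multiplied with itself according to the composition rule above. This is only a sketch; the helper names <code>compose</code>, <code>inverse</code> and <code>order</code> are chosen here for illustration.

```python
# Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}
identity = {x: x for x in f}

def compose(a, b):
    """Product ab of two permutations, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def inverse(a):
    """Permutation that undoes a."""
    return {image: x for x, image in a.items()}

def order(a):
    """Smallest k >= 1 such that a composed with itself k times is the identity."""
    k, power = 1, dict(a)
    while power != {x: x for x in a}:
        power = compose(a, power)
        k += 1
    return k

ff = compose(f, f)
print(ff[1])                    # f(f(1)) = f(4) = 6
print(sorted(ff.values()) == sorted(f))   # the product is again a permutation
print(compose(f, inverse(f)) == identity)
```

The permutation of Table 1 consists of the cycles (1 4 6 5), (3 11) and (7 8 9) with 2 and 10 fixed, so repeated multiplication returns to the identity after 12 steps.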
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations of <math>X</math> that preserve the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; this action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY.
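For very small graphs, the automorphism group can be listed directly from the definition by testing every permutation of the vertices. The following Python sketch does exactly that; it is a brute-force illustration for simple graphs without bond orders, not the refinement-based method used by NAUTY, and the function name and example skeletons are assumptions made here.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex permutations of a simple graph that map its edge set onto itself.

    Vertices are 0..n-1; a permutation is a tuple p with p[v] the image of v.
    """
    edge_set = {frozenset(e) for e in edges}
    found = []
    for p in permutations(range(n)):
        image = {frozenset((p[u], p[v])) for u, v in edges}
        if image == edge_set:
            found.append(p)
    return found

# Carbon skeletons only (hydrogens omitted for brevity).
propane = [(0, 1), (1, 2)]                      # path on three atoms
isobutane = [(0, 1), (0, 2), (0, 3)]            # central atom 0
cyclobutane = [(0, 1), (1, 2), (2, 3), (3, 0)]  # four-membered ring

for name, n, edges in [("propane", 3, propane),
                       ("isobutane", 4, isobutane),
                       ("cyclobutane", 4, cyclobutane)]:
    print(name, len(automorphisms(n, edges)))
```

The cost grows as <math>n!</math>, which is why practical generators depend on dedicated tools for this step.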
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
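The generation tree described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: hydrogens are kept implicit, only heavy atoms are connected, each tree level adds one unit of bond order, and duplicates are removed by a brute-force canonical form that is feasible only for a handful of atoms. All names in the block are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative valence table

def canonical(atoms, bonds):
    """Smallest relabelled bond list over all element-preserving relabellings."""
    n = len(atoms)
    best = None
    for p in permutations(range(n)):
        if any(atoms[p[i]] != atoms[i] for i in range(n)):
            continue  # a relabelling must not exchange different elements
        key = tuple(sorted((min(p[i], p[j]), max(p[i], p[j]), order)
                           for (i, j), order in bonds.items()))
        if best is None or key < best:
            best = key
    return best

def degrees(n, bonds):
    deg = [0] * n
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_connected(n, bonds):
    neighbours = {v: set() for v in range(n)}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for w in neighbours[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def generate(atoms, hydrogens, max_order=3):
    """Breadth-first assembly of heavy-atom skeletons with implicit hydrogens."""
    n = len(atoms)
    spare = sum(VALENCE[a] for a in atoms) - hydrogens
    if spare < 0 or spare % 2:
        raise ValueError("formula cannot be saturated")
    level = {()}  # root node: the atom set without any bonds
    for _ in range(spare // 2):  # one unit of bond order per tree level
        next_level = set()
        for state in level:
            bonds = {(i, j): order for i, j, order in state}
            deg = degrees(n, bonds)
            for i, j in combinations(range(n), 2):
                if (deg[i] < VALENCE[atoms[i]] and deg[j] < VALENCE[atoms[j]]
                        and bonds.get((i, j), 0) < max_order):
                    child = dict(bonds)
                    child[(i, j)] = child.get((i, j), 0) + 1
                    next_level.add(canonical(atoms, child))  # set drops duplicates
        level = next_level
    return [s for s in level
            if is_connected(n, {(i, j): order for i, j, order in s})]

for skeleton in generate(["C", "C", "O"], hydrogens=6):  # C2H6O
    print(skeleton)
```

For C2H6O the search returns two skeletons, corresponding to ethanol and dimethyl ether. Every level of the tree is held in memory at once, which is the storage problem discussed for breadth-first generators.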
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to keep this exhaustive search tractable. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
MOLGEN first generates all the combinatorically possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
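The idea of treating generation as a numerical matrix problem can be illustrated with a short backtracking sketch in Python. It enumerates every symmetric connectivity matrix whose row sums equal the given valences and abandons a branch as soon as a completed row cannot reach its valence. This is a simplified illustration under stated assumptions, not the Faradjev algorithm, MASS, SMOG or MOLGEN: it works on labelled atoms, so isomorphic duplicates are not removed, and the function name and the bond order limit of 3 are choices made here.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield each symmetric, zero-diagonal matrix whose i-th row sums to valences[i]."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]  # row-major order
    matrix = [[0] * n for _ in range(n)]
    free = list(valences)  # valence still unused at each atom

    def fill(k):
        if k == len(pairs):
            if all(f == 0 for f in free):  # also covers the last row
                yield [row[:] for row in matrix]
            return
        i, j = pairs[k]
        for order in range(min(max_order, free[i], free[j]) + 1):
            # Bound: (i, n-1) is the last open entry of row i, so the row
            # must be exactly filled here or the branch is abandoned.
            if j == n - 1 and free[i] != order:
                continue
            matrix[i][j] = matrix[j][i] = order
            free[i] -= order
            free[j] -= order
            yield from fill(k + 1)
            free[i] += order
            free[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)

# Formaldehyde, CH2O, with atoms ordered C, O, H, H.
for m in connectivity_matrices([4, 2, 1, 1]):
    for row in m:
        print(row)
```

Only one matrix is held in memory at a time, which reflects the low memory usage attributed to constructive search. For the valences 4, 2, 1, 1 exactly one matrix exists, the one with a C=O double bond and two C–H bonds; for larger formulas a canonicity test would still be needed to discard relabelled copies and a connectivity check to discard matrices that describe several molecules.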
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, GEN starts from a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN avoids these disadvantages in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
90i8rt6k3fmtjzdta0bwn73dr0zt1bp
8232
8231
2019-12-13T00:11:30Z
MehmetAzizYirik
145
/* Chemical Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical graph generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], [[wp:organic synthesis|organic synthesis design]] and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph (discrete mathematics)|graph]] generation. Molecular structures are graphs with chemical constraints such as [[wp:Valence (chemistry)|valences]], bond multiplicity and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for the mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and those of Bangov<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator is part of a well-known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex (graph theory)|vertices]] and [[wp:Edge (graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A chemical graph is described as a multigraph <math>G = (V,E) </math>, where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, i.e., bonds.
In graph theory, the degree of a vertex is its number of connections; in a multigraph, each edge is counted with its multiplicity. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can form. For example, the valence of carbon is 4. An atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate single, fully saturated molecules rather than sets of disconnected fragments.
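The saturation and connectivity definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the generators discussed here: the <code>VALENCE</code> table, the bond dictionary format <code>{(i, j): bond order}</code> and the function names are all assumptions made for the example.

```python
from collections import deque

# Assumed valence table, for illustration only.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degrees(n_atoms, bonds):
    """Degree of each atom, counting every bond by its bond order."""
    deg = [0] * n_atoms
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg


def is_saturated(atoms, bonds):
    """True if every atom has reached its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(len(atoms), bonds)))


def is_connected(n_atoms, bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    if n_atoms == 0:
        return True
    neighbours = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, queue = {0}, deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == n_atoms


# Formaldehyde (CH2O): one C=O double bond and two C-H single bonds.
formaldehyde = (["C", "O", "H", "H"], {(0, 1): 2, (0, 2): 1, (0, 3): 1})
# Four hydrogens paired as two separate H2 molecules.
two_h2 = (["H", "H", "H", "H"], {(0, 1): 1, (2, 3): 1})
```

The second example is saturated but not connected, which is why saturation alone is not a sufficient acceptance test for a generated structure.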
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
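As a small worked example, the permutation of Table 1 can be stored as a Python dictionary and multiplied with itself using the composition rule <math>(ab)(x)=a(b(x))</math>. The helper names <code>compose</code>, <code>is_permutation</code> and <code>cycles</code> are chosen for this sketch only.

```python
# The permutation f from Table 1, stored as x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its keys."""
    return sorted(p.values()) == sorted(p.keys())


def cycles(p):
    """Decompose a permutation into disjoint cycles, smallest element first."""
    seen, result = set(), []
    for start in sorted(p):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = p[x]
        result.append(tuple(cycle))
    return result


ff = compose(f, f)  # f applied twice
```

Here <code>cycles(f)</code> shows that the example permutation consists of a 4-cycle, a 2-cycle, a 3-cycle and two fixed points, and <code>is_permutation(ff)</code> confirms that the product is again a permutation.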
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|permutation group]] known as the symmetric group on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called blocks. A graph can be viewed as a set system whose blocks are its edges, and the permutations preserving the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself while preserving the edge-vertex connectivity.
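For a small set, the group properties listed above and the order <math>n!</math> can be checked by brute force. The following sketch enumerates every permutation of a three-element set with the standard library; the names <code>sym</code>, <code>compose</code> and <code>inverse</code> are illustrative choices.

```python
from itertools import permutations
from math import factorial

X = (1, 2, 3)

# Every permutation of X, stored as a dictionary x -> image of x.
sym = [dict(zip(X, image)) for image in permutations(X)]
identity = {x: x for x in X}


def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def inverse(a):
    """Inverse permutation: swap arguments and images."""
    return {image: x for x, image in a.items()}


order = len(sym)  # number of elements in the group
closed = all(compose(a, b) in sym for a in sym for b in sym)
has_inverses = all(compose(a, inverse(a)) == identity for a in sym)
```

This exhaustive check is only feasible for very small sets because the number of permutations grows factorially.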
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. For molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY.
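The definition of an automorphism can be turned directly into a brute-force search over all vertex permutations. The sketch below handles simple graphs only and is not how NAUTY works (NAUTY avoids enumerating all <math>n!</math> permutations); the function name <code>automorphisms</code> and the example graphs are assumptions for illustration.

```python
from itertools import permutations


def automorphisms(n_vertices, edges):
    """All vertex permutations that map the edge set of a simple graph onto itself."""
    edge_set = {frozenset(edge) for edge in edges}
    result = []
    for perm in permutations(range(n_vertices)):
        # perm[u] is the image of vertex u under the candidate permutation.
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Four-membered ring, e.g. the carbon skeleton of cyclobutane.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Three-atom chain, e.g. the carbon skeleton of propane.
chain = [(0, 1), (1, 2)]

ring_group = automorphisms(4, ring)
chain_group = automorphisms(3, chain)
```

The ring has eight automorphisms (four rotations and four reflections), whereas the chain has only two: the identity and the end-to-end flip.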
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
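The generation tree described above can be sketched as a level-by-level (breadth-first) search. This is a deliberately naive illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: atoms are given only by their valences, a state is a tuple of bond orders (one per atom pair), the <code>max_order</code> limit of 3 is an assumption, and neither connectivity nor isomorphism is checked.

```python
from itertools import combinations


def assemble(valences, max_order=3):
    """Breadth-first assembly: extend structures one bond order at a time."""
    n = len(valences)
    pairs = list(combinations(range(n), 2))
    level = {tuple([0] * len(pairs))}  # root of the tree: no bonds yet
    saturated = set()
    while level:
        next_level = set()
        for state in level:
            deg = [0] * n
            for (i, j), order in zip(pairs, state):
                deg[i] += order
                deg[j] += order
            if deg == list(valences):
                saturated.add(state)  # leaf: every atom is saturated
                continue
            for k, (i, j) in enumerate(pairs):
                # Extend only where both atoms still have a free valence.
                if deg[i] < valences[i] and deg[j] < valences[j] and state[k] < max_order:
                    next_level.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = next_level
    return pairs, saturated


# Valences 4, 2, 2 correspond to one carbon and two oxygens (CO2).
pairs, structures = assemble((4, 2, 2))
```

Storing each level as a set merges identical labelled states reached along different paths, but every level must still be held in memory, which illustrates the storage cost of intermediate structures mentioned in this article.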
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond at a time. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
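The role of canonical labelling in duplicate detection can be illustrated with a brute-force canonical form: among all relabellings of a structure, the lexicographically smallest pair of element labels and connectivity matrix is chosen as its representative. This sketch is an assumption-laden toy and not the NAUTY algorithm, which avoids enumerating all permutations; the function name <code>canonical_form</code> and the example molecules (heavy atoms only) are illustrative.

```python
from itertools import permutations


def canonical_form(labels, matrix):
    """Smallest (labels, matrix) pair over all renumberings of the atoms."""
    n = len(labels)
    return min(
        (
            tuple(labels[p] for p in perm),
            tuple(tuple(matrix[perm[i]][perm[j]] for j in range(n)) for i in range(n)),
        )
        for perm in permutations(range(n))
    )


# Connectivity matrix of a three-atom chain: atom 1 is bonded to atoms 0 and 2.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]

ethanol_a = canonical_form(["C", "C", "O"], chain)  # C-C-O
ethanol_b = canonical_form(["O", "C", "C"], chain)  # O-C-C, same molecule renumbered
ether = canonical_form(["C", "O", "C"], chain)      # C-O-C, a different isomer
```

Two structures are isomorphic exactly when their canonical forms are equal, so a generator can discard a new structure whose canonical form has already been seen.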
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
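Treating generation as a numerical matrix problem can be sketched as follows: fill the upper triangle of a symmetric connectivity matrix entry by entry so that each row sum equals the valence of the corresponding atom. This is a simplified depth-first illustration and not the MASS, SMOG or MOLGEN algorithm; in particular, it performs no orderly generation, so isomorphic matrices are not removed, and the <code>max_order</code> limit is an assumption.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield every symmetric bond-order matrix whose row sums equal the valences."""
    n = len(valences)
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)  # free valences left on each atom
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

    def fill(k):
        if k == len(pairs):
            if all(r == 0 for r in remaining):
                yield tuple(tuple(row) for row in matrix)
            return
        i, j = pairs[k]
        # A bond order may not exceed the free valence of either atom.
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)


# Valences 4, 2, 2 correspond to one carbon and two oxygens (CO2).
co2 = list(connectivity_matrices((4, 2, 2)))
```

Only one matrix is held in memory at a time, which is the low-memory property attributed to constructive search methods above.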
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
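The reduction idea can be sketched as an exhaustive, recursive bond-removal procedure. This is a toy illustration under strong assumptions and not the COCOA implementation: only single bonds are considered, the starting structure simply contains every candidate bond, no substructure constraints are used, and the function name <code>reduce_structure</code> is hypothetical.

```python
from itertools import combinations


def reduce_structure(valences, candidate_bonds):
    """Remove bonds from the starting structure until every atom is saturated."""
    candidate_bonds = list(candidate_bonds)
    target = list(valences)
    results = []

    def recurse(kept, start):
        deg = [0] * len(target)
        for i, j in kept:
            deg[i] += 1
            deg[j] += 1
        if any(d < t for d, t in zip(deg, target)):
            return  # too many bonds removed: this branch is dead
        if deg == target:
            results.append(kept)
            return
        # Remove bonds in a fixed order so that each subset is visited once.
        for k in range(start, len(candidate_bonds)):
            recurse(kept - {candidate_bonds[k]}, k + 1)

    recurse(frozenset(candidate_bonds), 0)
    return results


# Start from all possible single bonds between four monovalent atoms.
all_bonds = list(combinations(range(4), 2))
pairings = reduce_structure((1, 1, 1, 1), all_bonds)
```

The number of candidate bonds in the starting structure grows quadratically with the number of atoms, and the number of removal sequences grows much faster, which reflects the main disadvantage of reduction methods noted above.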
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN aims to avoid the disadvantages of each in the generation step: structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. By combining the two methods, GEN copes with the constraints more efficiently. GEN removes the connections that would create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once a structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure, and second, a substructure representation that lists atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators, which avoid building intermediate structures, have been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
rz6otngac43mf2hq6y83fkebjb9h79x
8233
8232
2019-12-13T00:17:02Z
MehmetAzizYirik
145
/* History */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical graph generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], [[wp:organic synthesis|organic synthesis design]] and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph (discrete mathematics)|graph]] generation. Molecular structures are graphs with chemical constraints such as [[wp:Valence (chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for the mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the degree of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
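The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the generators discussed here; the valence table, the dictionary-based bond representation and the function names are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Assumed standard valences for a few elements (illustrative, not exhaustive).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges.

    `bonds` maps an atom index pair (i, j) to the bond order (edge multiplicity).
    """
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom's degree equals its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))


def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    neighbours = defaultdict(set)
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for n in neighbours[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return len(seen) == len(atoms)


# Formaldehyde (CH2O): C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

# Four hydrogens paired as two separate H2 molecules:
# every atom is saturated, but the graph is not connected.
h4 = ["H", "H", "H", "H"]
h4_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation alone is not sufficient: both checks are needed to accept a candidate as a single molecule.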
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of a set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
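As a small worked example, the permutation of Table 1 can be composed with itself in Python. The dictionary representation and the helper names are assumptions made for this sketch.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())


# f composed with itself, e.g. ff[1] = f(f(1)) = f(4) = 6.
ff = compose(f, f)
```

Here `is_permutation(ff)` holds, in line with the statement that the product of two permutations is again a permutation.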
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for canonical labelling.
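The definition of an automorphism can be checked directly on small graphs. The following Python sketch enumerates all <math>n!</math> vertex permutations and keeps those that map the edge set onto itself. It is a brute-force illustration only, assumes a simple graph with unlabelled vertices (no element types or bond orders), and is not how NAUTY computes automorphism groups.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all permutations of vertices 0..n-1 that preserve the edge set.

    Each permutation is a tuple `perm` with perm[v] the image of vertex v.
    Edges are treated as undirected, so they are compared as frozensets.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Three-atom chain (e.g. the carbon skeleton of propane).
path = [(0, 1), (1, 2)]

# Four-membered ring (e.g. the carbon skeleton of cyclobutane).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

For the chain, only the identity and the end-to-end reflection survive; for the ring, the eight symmetries of a square are found. Because the search space grows as <math>n!</math>, this approach is only practical for very small graphs.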
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the root of the tree is the atom set, together with any substructures provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
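The generation tree described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: atoms are given only by their valences, a state is a tuple of bond orders over all atom pairs, and identical labelled states are merged per level, but no isomorphism or connectivity check is performed. The function name `assemble` is hypothetical.

```python
def assemble(valences, max_order=3):
    """Breadth-first extension of bond assignments, one bond order at a time.

    Returns the set of saturated states; each state is a tuple holding the
    bond order for every atom pair (i, j) with i < j, in lexicographic order.
    """
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    level = {tuple([0] * len(pairs))}   # root of the tree: atoms without bonds
    saturated = set()
    while level:
        next_level = set()
        for state in level:
            # Free valence left on each atom in this intermediate structure.
            free = list(valences)
            for (i, j), order in zip(pairs, state):
                free[i] -= order
                free[j] -= order
            if not any(free):
                saturated.add(state)    # leaf: every atom is saturated
                continue
            # Extend by raising the order of one bond by one.
            for k, (i, j) in enumerate(pairs):
                if free[i] and free[j] and state[k] < max_order:
                    next_level.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = next_level              # dead ends simply produce no children
    return saturated
```

For one carbon and four hydrogens, `assemble([4, 1, 1, 1, 1])` returns the single methane state. For four hydrogens, `assemble([1, 1, 1, 1])` returns three labelled states that all describe two H2 molecules, which illustrates why real generators add connectivity and duplicate checks, and why storing every intermediate level becomes costly.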
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
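The idea of filling connectivity matrices directly, rather than storing intermediate structures, can be sketched as follows. This is a brute-force illustration and not MOLGEN's orderly generation: duplicates are removed afterwards by a naive canonical form that tries every element-preserving atom permutation, which is only feasible for a handful of atoms. All function names are assumptions made for this sketch.

```python
from itertools import permutations


def fill_matrices(valences, max_order=3):
    """Yield symmetric bond-order matrices whose row sums equal the valences."""
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m = [[0] * n for _ in range(n)]
    free = list(valences)               # remaining free valence per atom

    def rec(k):
        if k == len(pairs):
            if not any(free):           # every atom saturated
                yield [row[:] for row in m]
            return
        i, j = pairs[k]
        for order in range(min(max_order, free[i], free[j]) + 1):
            m[i][j] = m[j][i] = order
            free[i] -= order
            free[j] -= order
            yield from rec(k + 1)
            free[i] += order
            free[j] += order
        m[i][j] = m[j][i] = 0

    yield from rec(0)


def connected(m):
    """Depth-first search over non-zero matrix entries, starting at atom 0."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, order in enumerate(m[i]):
            if order and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(m)


def canonical(m, elements):
    """Smallest flattened matrix over all element-preserving relabellings."""
    n = len(m)
    best = None
    for p in permutations(range(n)):
        if any(elements[p[i]] != elements[i] for i in range(n)):
            continue
        key = tuple(m[p[i]][p[j]] for i in range(n) for j in range(n))
        if best is None or key < best:
            best = key
    return best


def generate(elements, valences):
    """Canonical forms of all connected, saturated structures."""
    return {canonical(m, elements)
            for m in fill_matrices(valences) if connected(m)}
```

For C2H4, `generate(["C", "C", "H", "H", "H", "H"], [4, 4, 1, 1, 1, 1])` yields a single structure (ethylene), although six labelled connected matrices are produced before duplicate removal. Orderly generation aims to avoid constructing such duplicates in the first place.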
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
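The reduction strategy can be sketched in Python under strong simplifying assumptions: the hyper structure is the complete graph over all atoms, only single bonds are considered, and the only constraint is the target degree of each atom. COCOA itself removes bonds based on substructure constraints from spectral data, which this sketch does not model; the function name is hypothetical.

```python
def reduce_hyperstructure(target_degrees):
    """Recursive bond removal from a fully bonded hyper structure.

    Returns every bond set (as a frozenset of atom index pairs) in which
    each atom has exactly its target degree.
    """
    n = len(target_degrees)
    bonds = [(i, j) for i in range(n) for j in range(i + 1, n)]
    results = []

    def rec(current, start):
        deg = [0] * n
        for i, j in current:
            deg[i] += 1
            deg[j] += 1
        if any(d < t for d, t in zip(deg, target_degrees)):
            return                      # too many bonds removed: dead branch
        if deg == list(target_degrees):
            results.append(current)     # all constraints satisfied
            return
        # Remove bonds in increasing index order so that each bond set
        # is reached along exactly one path of the reduction tree.
        for k in range(start, len(bonds)):
            i, j = bonds[k]
            if deg[i] > target_degrees[i] and deg[j] > target_degrees[j]:
                rec(current - {bonds[k]}, k + 1)

    rec(frozenset(bonds), 0)
    return results
```

For four atoms that each need two bonds, `reduce_hyperstructure([2, 2, 2, 2])` returns the three labelled four-membered rings. The initial bond set grows quadratically with the number of atoms, which reflects the size problem of hyper structures noted above.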
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure is constructed; second, a substructure representation is used to list atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
=== See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
=== Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
1u9g8pefvnssc5m7ly82o1c4x1wnnb3
8234
8233
2019-12-13T00:19:11Z
MehmetAzizYirik
145
/* Chemical Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s generator<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, so exploiting these symmetries accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. This structure generator is part of the CASE system StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A chemical graph is thus a multigraph <math>G = (V,E) </math>, where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, i.e., bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules; saturation alone does not guarantee that the atoms form a single molecule.
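The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name <code>ChemicalGraph</code>, its method names and the small <code>VALENCE</code> table are hypothetical and are not taken from any of the generators discussed here.

```python
from collections import defaultdict

# Usual valences of a few elements (assumed table, for illustration only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """A multigraph: vertices are atoms, edge multiplicity is the bond order."""

    def __init__(self, atoms):
        self.atoms = list(atoms)        # element symbol of each vertex
        self.bonds = defaultdict(int)   # (i, j) with i < j -> bond order

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] += order

    def degree(self, i):
        # Parallel edges are counted, so a double bond contributes 2.
        return sum(order for pair, order in self.bonds.items() if i in pair)

    def is_saturated(self):
        return all(self.degree(i) == VALENCE[el]
                   for i, el in enumerate(self.atoms))

    def is_connected(self):
        if not self.atoms:
            return True
        neighbours = defaultdict(set)
        for (a, b), order in self.bonds.items():
            if order > 0:
                neighbours[a].add(b)
                neighbours[b].add(a)
        # Depth-first traversal starting from atom 0.
        seen, stack = {0}, [0]
        while stack:
            v = stack.pop()
            for w in neighbours[v] - seen:
                seen.add(w)
                stack.append(w)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)
formaldehyde.add_bond(0, 2)
formaldehyde.add_bond(0, 3)

# Four hydrogens paired into two separate H2 molecules.
two_h2 = ChemicalGraph(["H", "H", "H", "H"])
two_h2.add_bond(0, 1)
two_h2.add_bond(2, 3)
```

In this sketch, <code>formaldehyde</code> is both saturated and connected, whereas <code>two_h2</code> is saturated but not connected, which is why a generator needs the connectivity check in addition to the valence check.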
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a permutation is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
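As a small worked example, the permutation of Table 1 can be multiplied with itself. The sketch below stores a permutation as a Python dictionary; the helper names <code>compose</code> and <code>is_permutation</code> are assumptions made for this illustration.

```python
# Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are exactly its domain."""
    return set(p.keys()) == set(p.values())


ff = compose(f, f)   # apply f twice
```

Here <code>ff[1]</code> is 6, because <math>f(1)=4</math> and <math>f(4)=6</math>, and <code>is_permutation(ff)</code> holds, in line with the statement that the product of two permutations is again a permutation.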
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The order of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A set system consists of a finite set <math>X</math> and a collection of its subsets, called the blocks of the set system. A graph can be viewed as a set system whose blocks are its edges, and the permutations preserving this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself. This action preserves the edge-vertex connectivity.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these computations.
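For very small graphs, the automorphism group can be computed directly from the definition by testing every permutation of the vertices. The following sketch does exactly that; it is a brute-force illustration with a hypothetical function name, it ignores bond orders and element labels, and it is not how NAUTY works internally.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all permutations of the vertices 0..n-1 that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # Image of every edge (u, v) under the permutation: (perm[u], perm[v]).
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Carbon skeleton of cyclobutane: a ring of four vertices.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
ring_group = automorphisms(4, ring)

# Carbon skeleton of propane: a path of three vertices.
chain = [(0, 1), (1, 2)]
chain_group = automorphisms(3, chain)
```

The four-membered ring has 8 automorphisms (four rotations and four reflections), while the three-atom chain has only 2 (the identity and the swap of its two end atoms). Because the loop visits all <math>n!</math> permutations, this approach is only usable for a handful of atoms.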
==Methods==
Structure generators are the core of CASE systems, and they rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
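The generation tree described above can be sketched as follows. This is a deliberately naive illustration and not the algorithm of ASSEMBLE or OMG: hydrogens are treated implicitly, duplicates are detected with a brute-force canonical form, all intermediate structures are stored, and the names <code>assemble</code>, <code>canonical</code> and <code>connected</code> as well as the <code>VALENCE</code> table are assumptions made for this example.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}   # assumed valences of the heavy atoms


def canonical(atoms, bonds):
    """Smallest bond list over all relabellings that keep the element of each position."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(bonds[perm[i]][perm[j]]
                    for i in range(n) for j in range(i + 1, n))
        if best is None or key < best:
            best = key
    return best


def connected(bonds):
    n = len(bonds)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(n):
            if bonds[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def assemble(atoms, hydrogens):
    """Enumerate non-isomorphic connected heavy-atom skeletons for a formula."""
    n = len(atoms)
    # Valences not used by hydrogens must be used by bonds between heavy atoms.
    free = sum(VALENCE[a] for a in atoms) - hydrogens
    if free < 0 or free % 2:
        return []
    target = free // 2                      # total bond order to place
    root = tuple(tuple(0 for _ in range(n)) for _ in range(n))
    seen, results = set(), []

    def extend(bonds, used):
        key = canonical(atoms, bonds)
        if key in seen:                     # isomorphic node already expanded
            return
        seen.add(key)
        if used == target:                  # leaf: all valences are accounted for
            if connected(bonds):
                results.append(bonds)
            return
        for i, j in combinations(range(n), 2):
            has_room = all(sum(bonds[k]) < VALENCE[atoms[k]] for k in (i, j))
            if bonds[i][j] < 3 and has_room:
                child = [list(row) for row in bonds]
                child[i][j] += 1
                child[j][i] += 1
                extend(tuple(map(tuple, child)), used + 1)

    extend(root, 0)
    return results


isomers_c2h6o = assemble(["C", "C", "O"], hydrogens=6)
```

For C<sub>2</sub>H<sub>6</sub>O, the sketch returns two skeletons, corresponding to ethanol and dimethyl ether. The set <code>seen</code> holds every intermediate structure, which makes the memory cost of this style of generation visible even in a toy example.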
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, only canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
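The idea of treating generation as a matrix-filling problem can be illustrated with a small backtracking sketch. It fills the upper triangle of a symmetric connectivity matrix entry by entry so that each row sum equals the number of bonds the atom must form, and it abandons a branch as soon as a row can no longer be completed. The function names and the cap of 3 on the bond order are assumptions made for this example; the canonicity test that MOLGEN-like generators use to avoid isomorphic duplicates is deliberately left out.

```python
def degree_matrices(degrees, max_order=3):
    """Yield symmetric matrices with zero diagonal whose row sums equal `degrees`."""
    n = len(degrees)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    matrix = [[0] * n for _ in range(n)]
    remaining = list(degrees)            # bonds each atom still has to form

    def fill(k):
        if k == len(pairs):
            if all(r == 0 for r in remaining):
                yield [row[:] for row in matrix]
            return
        i, j = pairs[k]
        last_in_row = (j == n - 1)
        upper = min(max_order, remaining[i], remaining[j])
        for order in range(upper, -1, -1):
            # Bound: the last entry of row i must use up what is left for atom i.
            if last_in_row and order != remaining[i]:
                continue
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(k + 1)
            remaining[i] += order
            remaining[j] += order
            matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)


def connected(matrix):
    n = len(matrix)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(n):
            if matrix[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


# Four CH2 groups (C4H8 with two hydrogens per carbon): each carbon forms two more bonds.
all_matrices = list(degree_matrices([2, 2, 2, 2]))
ring_matrices = [m for m in all_matrices if connected(m)]
```

For this input, six matrices satisfy the row sums: three describe two separate C=C fragments and are rejected by the connectivity check, and the remaining three are relabellings of the same four-membered ring. A real generator would reject two of these three during generation by a canonicity test rather than by comparing stored structures.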
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the possible bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike in assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the hypergraph contains all possible bonds, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
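The bond-removal idea can be sketched in a strongly simplified form. The example below starts from a hyper structure in which every pair of atoms is bonded and recursively deletes bonds until each atom has exactly the required number of neighbours. It is not the COCOA algorithm: only single bonds are considered, substructure constraints and atom-centred fragments are left out, and the function name <code>reduce_hyperstructure</code> is an assumption made for this illustration.

```python
from itertools import combinations


def reduce_hyperstructure(degrees):
    """Delete bonds from the fully bonded structure until every atom has its target degree."""
    n = len(degrees)
    all_bonds = list(combinations(range(n), 2))   # the hyper structure
    results = []

    def reduce(bonds, start):
        degs = [sum(v in b for b in bonds) for v in range(n)]
        if any(d < t for d, t in zip(degs, degrees)):
            return                                # too many bonds were removed
        if degs == list(degrees):
            results.append(sorted(bonds))         # every atom is saturated
            return
        # Bonds are removed in increasing index order so that each set of
        # deletions is reached exactly once.
        for k in range(start, len(all_bonds)):
            bond = all_bonds[k]
            if all(degs[v] > degrees[v] for v in bond):
                reduce(bonds - {bond}, k + 1)

    reduce(frozenset(all_bonds), 0)
    return results


# Three atoms that need 1, 2 and 1 bonds, e.g. the carbon chain of propane.
chain = reduce_hyperstructure([1, 2, 1])

# Four atoms that each need 2 bonds, e.g. the carbon ring of cyclobutane.
rings = reduce_hyperstructure([2, 2, 2, 2])
```

For the three-atom input, only the bond between the two end atoms has to be deleted, leaving the chain 0-1-2. For the four-atom input, the six bonds of the hyper structure are reduced to four in three different ways, one for each labelling of the ring. The hyper structure grows quadratically with the number of atoms, which reflects the size problem mentioned above.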
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN combines them to avoid the disadvantages in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix representing all bonds in a hyper structure, and second, a substructure representation listing the atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms and, more precisely, of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
18gtbomrerkp38a02qd755emnyrdk1q
8235
8234
2019-12-13T00:24:31Z
MehmetAzizYirik
145
/* Symmetry Groups for Molecular Graphs */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical graph generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], [[wp:organic synthesis|organic synthesis design]] and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs and was notably co-developed by Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules, each of which must form a single connected graph. A molecule is saturated if all its atoms are saturated.
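The definitions above can be illustrated with a minimal Python sketch. The class name <code>ChemicalGraph</code>, the <code>VALENCE</code> table and the formaldehyde example are assumptions made for illustration only; they are not taken from any of the generators discussed in this article.

```python
from collections import Counter

# Assumed standard valences for a few elements (illustrative, not exhaustive).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """Multigraph: atoms are vertices, bond orders are edge multiplicities."""

    def __init__(self, atoms):
        self.atoms = list(atoms)   # element symbol per vertex index
        self.bonds = Counter()     # frozenset({u, v}) -> bond order

    def add_bond(self, u, v, order=1):
        self.bonds[frozenset((u, v))] += order

    def degree(self, v):
        # The degree counts multiplicity, so a double bond contributes 2.
        return sum(order for pair, order in self.bonds.items() if v in pair)

    def is_saturated(self):
        # A molecule is saturated if every atom reaches its valence.
        return all(self.degree(v) == VALENCE[a] for v, a in enumerate(self.atoms))

    def is_connected(self):
        # Depth-first search from vertex 0; connected if every atom is reached.
        if not self.atoms:
            return True
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for pair in self.bonds:
                if u in pair:
                    (w,) = pair - {u}
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)
formaldehyde.add_bond(0, 2)
formaldehyde.add_bond(0, 3)
```

Two separate hydrogen molecules on four hydrogen atoms would pass the saturation test but fail the connectivity test, which is why both checks are needed.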
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
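As a small worked example, the permutation of Table 1 can be composed with itself in Python. The dictionary representation and the helper names <code>compose</code> and <code>is_permutation</code> are illustrative choices, not part of any cited software.

```python
# Permutation f from Table 1, stored as a dict x -> f(x).
f = dict(zip(range(1, 12), [4, 2, 11, 6, 1, 5, 8, 9, 7, 10, 3]))


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    # A permutation maps the set onto itself without repeats.
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)   # f applied twice, e.g. ff[1] = f(f(1)) = f(4) = 6
```

The result <code>ff</code> again contains every integer from 1 to 11 exactly once, in line with the statement that the product of two permutations is a permutation.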
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group under function composition, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. [[wp:set(mathematics)|Set]] systems consist of a finite set <math>X</math> and its [[wp:subset|subsets]], called blocks of the set. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps a graph onto itself. This action is edge-vertex preserving.
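The group axioms and the order <math>n!</math> can be checked by brute force for a small set. The following Python sketch assumes <math>X=\{0,1,2,3\}</math> and represents each permutation as a tuple; the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def symmetric_group(n):
    """All permutations of {0, ..., n-1}; p[x] is the image of x."""
    return set(permutations(range(n)))


def compose(a, b):
    # Product ab with (ab)(x) = a(b(x)).
    return tuple(a[b[x]] for x in range(len(b)))


def inverse(a):
    inv = [0] * len(a)
    for x, y in enumerate(a):
        inv[y] = x
    return tuple(inv)


n = 4
sym = symmetric_group(n)
identity = tuple(range(n))

# Closure, identity and inverses, verified exhaustively for n = 4.
closed = all(compose(a, b) in sym for a in sym for b in sym)
has_identity = all(compose(a, identity) == a for a in sym)
has_inverses = all(compose(a, inverse(a)) == identity for a in sym)
order = len(sym)     # equals factorial(n)
```

Exhaustive checks of this kind are only feasible for very small <math>n</math>, because the order grows factorially.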
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on it.
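For very small graphs, the automorphism group can be found by testing every permutation of the vertices against the definition. The Python sketch below is a brute-force illustration for simple graphs (bond orders are ignored); it is not how NAUTY works, and the function name and example graphs are assumptions.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Brute-force Aut(G) for a simple graph on vertices 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for a in permutations(range(n)):
        # Apply a to both endpoints of every edge.
        image = {frozenset((a[u], a[v])) for u, v in edges}
        if image == edge_set:   # a maps edges onto edges
            result.append(a)
    return result


# Carbon skeletons of cyclobutane (a 4-cycle) and butane (a path).
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]

aut_cycle = automorphisms(4, cycle)   # rotations and reflections of the ring
aut_path = automorphisms(4, path)     # identity and end-to-end reversal
```

The ring has eight automorphisms and the chain has two, which reflects the higher symmetry of the cyclic skeleton.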
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a user-friendly interface to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a breadth-first search method. This tree extension terminates when all the branches reach saturated structures.
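The generation tree described above can be sketched as a naive breadth-first assembly in Python. This is a simplified illustration, not the ASSEMBLE or OMG algorithm: the function <code>assemble</code>, the <code>VALENCE</code> table and the encoding of a structure as a tuple of bond orders are all assumptions, and no substructure constraints or isomorphism checks are applied.

```python
from itertools import combinations

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed standard valences


def assemble(atoms, max_order=3):
    """Breadth-first assembly: each tree level adds one bond order to one atom pair."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    target = [VALENCE[a] for a in atoms]

    def degrees(state):
        deg = [0] * n
        for (u, v), order in zip(pairs, state):
            deg[u] += order
            deg[v] += order
        return deg

    def connected(state):
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for (a, b), order in zip(pairs, state):
                if order and u in (a, b):
                    w = b if u == a else a
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == n

    level = {tuple([0] * len(pairs))}   # root node: atoms only, no bonds
    finished = set()
    while level:
        next_level = set()
        for state in level:
            deg = degrees(state)
            if deg == target:           # saturated leaf of the tree
                if connected(state):
                    finished.add(state)
                continue
            for i, (u, v) in enumerate(pairs):
                # Extend only where both atoms still have free valence.
                if state[i] < max_order and deg[u] < target[u] and deg[v] < target[v]:
                    child = list(state)
                    child[i] += 1
                    next_level.add(tuple(child))
        level = next_level
    return pairs, finished


pairs, ch2o = assemble(["C", "O", "H", "H"])
_, c2h2 = assemble(["C", "C", "H", "H"])
```

For CH<sub>2</sub>O the sketch returns a single labelled structure, formaldehyde. For C<sub>2</sub>H<sub>2</sub> it returns two labelled structures that differ only in which hydrogen is attached to which carbon, so both describe the same molecule; removing such duplicates is the task of the canonicity and isomorphism checks used by real generators.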
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their hybridization. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for the canonical labelling of graphs, OMG is approximately 2000 times slower than MOLGEN. The bottleneck is the storage of all the intermediate structures. OMG has since been parallelized, and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
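The role of canonical labelling in duplicate detection can be illustrated with a brute-force stand-in. The Python sketch below tries every relabelling of the atoms and keeps the smallest encoding; NAUTY reaches the same goal far more efficiently with a different algorithm. The function name <code>canonical_form</code> and the input format are assumptions.

```python
from itertools import permutations


def canonical_form(atoms, bonds):
    """Smallest (labels, matrix) encoding over all vertex relabellings.

    atoms: list of element symbols; bonds: dict {(u, v): bond order}.
    """
    n = len(atoms)
    order = {frozenset(pair): k for pair, k in bonds.items()}
    best = None
    for p in permutations(range(n)):
        # p[i] is the original vertex placed at new position i.
        labels = tuple(atoms[p[i]] for i in range(n))
        matrix = tuple(
            order.get(frozenset((p[i], p[j])), 0)
            for i in range(n)
            for j in range(i + 1, n)
        )
        key = (labels, matrix)
        if best is None or key < best:
            best = key
    return best


atoms = ["C", "C", "H", "H"]
# Two labelled versions of acetylene that differ only in atom numbering.
acetylene_a = {(0, 1): 3, (0, 2): 1, (1, 3): 1}
acetylene_b = {(0, 1): 3, (0, 3): 1, (1, 2): 1}
# A different (unsaturated) graph on the same atoms, for comparison.
other = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

same = canonical_form(atoms, acetylene_a) == canonical_form(atoms, acetylene_b)
different = canonical_form(atoms, acetylene_a) != canonical_form(atoms, other)
```

Two structures are isomorphic exactly when their canonical forms are equal, so a generator can keep one representative per canonical form. The factorial cost of this brute-force version is the reason dedicated tools are used in practice.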
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. 
It first generates all the combinatorically possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
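Matrix generation can be sketched as a depth-first backtracking search over the upper triangle of a symmetric matrix whose row sums must equal the atom valences. The following Python example is an illustration under simplifying assumptions (no symmetry reduction, no connectivity or substructure constraints); it is not the MASS, SMOG or MOLGEN algorithm, and the function name is hypothetical.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield symmetric matrices with zero diagonal whose row sums equal the valences."""
    n = len(valences)
    m = [[0] * n for _ in range(n)]
    remaining = list(valences)          # free valence left on each atom

    def fill(i, j):
        if i == n - 1:                  # all rows above the last one are filled
            if remaining[i] == 0:
                yield [row[:] for row in m]
            return
        if j == n:                      # row i is complete: it must be saturated
            if remaining[i] == 0:
                yield from fill(i + 1, i + 2)
            return
        # Bound: an entry can never exceed the free valence of either atom.
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            m[i][j] = m[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(i, j + 1)
            remaining[i] += order
            remaining[j] += order
        m[i][j] = m[j][i] = 0

    yield from fill(0, 1)


# Valences of C, O, H, H (formaldehyde) and of C, C, H, H (acetylene).
ch2o = list(connectivity_matrices([4, 2, 1, 1]))
c2h2 = list(connectivity_matrices([4, 4, 1, 1]))
```

Because the search works on a single matrix that is modified in place, only the current branch is held in memory, which reflects the low memory usage attributed to constructive search algorithms above.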
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
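The reduction idea can be sketched as a recursive bond-removal procedure in Python. This is a strongly simplified illustration, not the COCOA algorithm: it assumes single bonds only, uses atom valences as the sole constraint instead of substructures, and the function name <code>reduce_hyperstructure</code> is hypothetical.

```python
from itertools import combinations


def reduce_hyperstructure(valences):
    """Delete bonds from a fully bonded hyperstructure until every atom fits its valence."""
    n = len(valences)
    target = list(valences)
    hyper = frozenset(combinations(range(n), 2))  # every atom pair is bonded once
    results, seen = set(), set()

    def reduce(bonds):
        if bonds in seen:               # this bond set was already explored
            return
        seen.add(bonds)
        deg = [sum(1 for b in bonds if v in b) for v in range(n)]
        if any(d < t for d, t in zip(deg, target)):
            return                      # deleting more bonds cannot repair a deficit
        if deg == target:
            results.add(bonds)
            return
        for b in bonds:
            reduce(bonds - {b})

    reduce(hyper)
    return results


# Valences of O, O, H, H: the hyperstructure has six bonds, the solutions have three.
peroxide = reduce_hyperstructure([2, 2, 1, 1])
```

Both solutions keep the O–O bond and correspond to hydrogen peroxide with the two hydrogens interchanged. The number of bond sets visited grows quickly with the size of the hyperstructure, which mirrors the main disadvantage of reduction methods noted above.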
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure, and second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
fkt8ckdzzp70x1l463h3qz87kyv1phf
8236
8235
2019-12-13T00:45:30Z
MehmetAzizYirik
145
/* Structure Assembly */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The structure is then transformed into another structure, and the process continues until the optimum energy is reached. During generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their reports date from the 1970s, these studies remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs and was notably co-developed by Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. An atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules, each of which must form a single connected graph. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
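As a small illustration, the permutation of Table 1 can be stored as a mapping and multiplied with itself. The dictionary representation and the helper names below are assumptions chosen for this sketch.

```python
# The permutation of Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Product of two permutations: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def inverse(p):
    """Inverse permutation: swap each element with its image."""
    return {y: x for x, y in p.items()}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

ff = compose(f, f)                 # f applied twice
identity = compose(f, inverse(f))  # maps every element to itself
```

Here `ff[1]` is `f(f(1)) = f(4) = 6`, and composing `f` with its inverse maps every element to itself.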
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group under function composition, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A [[wp:set(mathematics)|set]] system consists of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations that preserve the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG uses NAUTY for these calculations.
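For very small graphs, the automorphism group can be listed by testing every vertex permutation against the definition. The sketch below is a brute-force illustration for simple graphs only; it is not how NAUTY works, and the function name and the optional element labels are assumptions of this example.

```python
from itertools import permutations

def automorphisms(n, edges, labels=None):
    """List all vertex permutations that map every edge onto an edge.

    n      -- number of vertices, numbered 0..n-1
    edges  -- list of (u, v) pairs of a simple undirected graph
    labels -- optional element symbols; if given, atoms may only be
              mapped onto atoms of the same element
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        if labels and any(labels[v] != labels[perm[v]] for v in range(n)):
            continue
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            result.append(perm)
    return result

# Carbon skeleton of cyclobutane: a 4-cycle.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# Heavy-atom skeleton of ethanol: C-C-O.
chain = [(0, 1), (1, 2)]

ring_group = automorphisms(4, ring)
chain_unlabelled = automorphisms(3, chain)
chain_labelled = automorphisms(3, chain, labels=["C", "C", "O"])
```

The 4-cycle has eight automorphisms (four rotations and four reflections). The unlabelled three-atom chain has two, and once the element labels are taken into account only the identity remains. Because the number of candidate permutations grows as <math>n!</math>, this approach is only practical for a handful of atoms.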
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set, together with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method. This tree extension terminates when all the branches reach saturated structures.
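The generation tree can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: hydrogens are treated as implicit, the valence table is assumed, and duplicates are removed with a brute-force canonical form rather than with NAUTY. All function names are hypothetical.

```python
from itertools import permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed valences of the heavy atoms

def canonical(atoms, state):
    """Smallest relabelling of the bond list over all element-preserving permutations."""
    n = len(atoms)
    forms = []
    for p in permutations(range(n)):
        if all(atoms[p[i]] == atoms[i] for i in range(n)):
            forms.append(tuple(sorted(
                (min(p[i], p[j]), max(p[i], p[j]), k) for i, j, k in state)))
    return min(forms)

def extensions(atoms, state):
    """Children of a node: raise the bond order of one atom pair by one."""
    n = len(atoms)
    order = {(i, j): k for i, j, k in state}
    used = [0] * n
    for (i, j), k in order.items():
        used[i] += k
        used[j] += k
    for i in range(n):
        for j in range(i + 1, n):
            if (used[i] < VALENCE[atoms[i]] and used[j] < VALENCE[atoms[j]]
                    and order.get((i, j), 0) < 3):
                child = dict(order)
                child[(i, j)] = child.get((i, j), 0) + 1
                yield tuple((a, b, k) for (a, b), k in child.items())

def is_connected(n, state):
    adjacent = {v: set() for v in range(n)}
    for i, j, _ in state:
        adjacent[i].add(j)
        adjacent[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacent[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def assemble(atoms, hydrogens):
    """Level-by-level (breadth-first) extension from the bare atom set."""
    free = sum(VALENCE[a] for a in atoms) - hydrogens
    if free < 0 or free % 2:
        return set()
    level = {()}  # root of the tree: the atoms without any bonds
    for _ in range(free // 2):  # each level adds one unit of bond order
        level = {canonical(atoms, child)
                 for state in level for child in extensions(atoms, state)}
    return {s for s in level if is_connected(len(atoms), s)}

c2h6o = assemble(["C", "C", "O"], 6)
```

For C<sub>2</sub>H<sub>6</sub>O the sketch returns two structures, the heavy-atom skeletons C-C-O (ethanol) and C-O-C (dimethyl ether). Every level of the tree is kept in memory, which illustrates on a small scale why the storage of intermediate structures becomes a bottleneck for larger formulas.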
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates the canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are matrix generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates.
It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
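The matrix view can be illustrated with a naive Python sketch that fills the upper triangle of a symmetric connectivity matrix with bond orders and keeps the matrices that satisfy the valences. It is a brute-force enumeration with a duplicate filter applied afterwards, not the orderly generation used by MASS, SMOG or MOLGEN. Implicit hydrogens, the valence table and all function names are assumptions of this example.

```python
from itertools import permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2}  # assumed valences of the heavy atoms

def connected(n, m):
    """Depth-first search over the non-zero entries of the matrix."""
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(n):
            if m[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def canonical(atoms, m):
    """Largest upper-triangle reading over all element-preserving relabellings."""
    n = len(atoms)
    keys = []
    for p in permutations(range(n)):
        if all(atoms[p[i]] == atoms[i] for i in range(n)):
            keys.append(tuple(m[p[i]][p[j]]
                              for i in range(n) for j in range(i + 1, n)))
    return max(keys)

def generate(atoms, hydrogens):
    """Enumerate complete connectivity matrices; no intermediate structures are stored."""
    n = len(atoms)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    valences = [VALENCE[a] for a in atoms]
    free = sum(valences) - hydrogens  # valences not used by hydrogens
    if free < 0 or free % 2:
        return set()
    found = set()
    for orders in product(range(4), repeat=len(pairs)):
        if sum(orders) != free // 2:
            continue
        m = [[0] * n for _ in range(n)]
        for (i, j), k in zip(pairs, orders):
            m[i][j] = m[j][i] = k
        if any(sum(m[i]) > valences[i] for i in range(n)):
            continue  # a row sum above the valence is chemically impossible
        if connected(n, m):
            found.add(canonical(atoms, m))
    return found

c3h8o = generate(["C", "C", "C", "O"], 8)
```

For C<sub>3</sub>H<sub>8</sub>O the sketch finds three matrices, corresponding to 1-propanol, 2-propanol and ethyl methyl ether. Only the canonical keys of the finished matrices are kept, so the memory use stays small, but all candidate matrices are still visited; avoiding that exhaustive visit is the role of orderly generation.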
===Structure Reduction===
----
Unlike these assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike in assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the generation starts from a hypergraph, overlaps in the substructures are also considered. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. In other words, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction; structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
2hxptgfhgfejxw6th7lhded34hj4vq8
8237
8236
2019-12-13T00:51:11Z
MehmetAzizYirik
145
/* Structure Assembly */
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I.
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, a list representation requires less memory. As no spectral data were interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref>J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open-source]] structure generator released under the [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning.
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining the two.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and other [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process: first, all combinations of interpretations of the HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in the missing bond information. This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C.
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The system comprises two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics.
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity.
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: A permutation of a set of integers.
|}
The second row of Table 1 shows a permutation of the first row. The multiplication of two permutations, <math>a</math> and <math>b</math>, is defined as function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The composition of two permutations is also a permutation.
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set. The set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a group under function composition, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A [[wp:set(mathematics)|set]] system consists of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations that preserve the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action preserves the edge-vertex connectivity.
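The statement about the order of <math>Sym(X)</math> and the group properties listed above can be checked numerically for a small set. The sketch below enumerates all permutations of a three-element set; the helper names are assumptions made for this illustration.

```python
from itertools import permutations
from math import factorial

def sym(xs):
    """All permutations of the finite set xs, each stored as a mapping."""
    xs = sorted(xs)
    return [dict(zip(xs, image)) for image in permutations(xs)]

def compose(a, b):
    """Group operation: (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

X = {1, 2, 3}
group = sym(X)
identity = {x: x for x in X}

order = len(group)
closed = all(compose(a, b) in group for a in group for b in group)
has_inverses = all(any(compose(a, b) == identity and compose(b, a) == identity
                       for b in group) for a in group)
```

For a set of three elements the order is <math>3! = 6</math>; the product of any two permutations is again in the list, and every permutation has an inverse.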
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG uses NAUTY for these calculations.
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method. This tree extension terminates when all the branches reach saturated structures.
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> When PMG runs on a single core, MOLGEN is faster; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
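The matrix-filling idea can be illustrated with a minimal sketch. It is not the MOLGEN, MASS or SMOG algorithm; it only assumes that the input is a list of atom valences and that bond orders are capped at 3 (the function name <code>fill_matrices</code> and the cap are illustrative choices). A single connectivity matrix is filled entry by entry, and a branch is abandoned as soon as a completed row cannot use up its valence, so no intermediate structures are stored.

```python
def fill_matrices(valences, max_order=3):
    """Enumerate symmetric connectivity matrices whose row sums equal the valences.

    Hypothetical helper for illustration: one matrix is filled in place by
    backtracking, so memory use stays at a single n x n matrix.
    """
    n = len(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m = [[0] * n for _ in range(n)]
    free = list(valences)          # remaining valence of every atom
    found = []

    def backtrack(k):
        if k == len(pairs):
            if not any(free):      # every atom is saturated
                found.append([row[:] for row in m])
            return
        i, j = pairs[k]
        for order in range(min(free[i], free[j], max_order) + 1):
            m[i][j] = m[j][i] = order
            free[i] -= order
            free[j] -= order
            # Bound: (i, n-1) is the last entry of row i, so atom i must be
            # saturated by now; otherwise the branch cannot succeed.
            if not (j == n - 1 and free[i] != 0):
                backtrack(k + 1)
            free[i] += order
            free[j] += order
        m[i][j] = m[j][i] = 0

    backtrack(0)
    return found


# CH2O with atom order C, O, H, H: only the formaldehyde matrix is saturated.
formaldehyde = fill_matrices([4, 2, 1, 1])

# O, H, H, H, H: six labelled matrices, all describing the same H2O + H2 pair.
water_and_hydrogen = fill_matrices([2, 1, 1, 1, 1])
```

The second example shows why matrix generators are combined with orderly generation: without a canonicity test, the same structure is produced once per labelling, and a separate connectivity check is still needed to discard disconnected results.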
===Structure Reduction===
----
Unlike assembly methods, reduction methods first make all the bonds between atom pairs, generating a hypergraph. The size of the graph is then reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. The generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the hypergraph contains all candidate bonds, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, and the run time increases accordingly.
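The reduction principle can be sketched in a few lines. The example below is not the COCOA procedure; it is a simplified illustration that assumes single bonds only, uses the number of neighbours required for each atom as the only constraint, and ignores substructure information. The function name <code>reduce_hyperstructure</code> is hypothetical. The search starts from the structure containing every possible bond and deletes bonds recursively until every atom has exactly its required number of neighbours.

```python
from itertools import combinations


def reduce_hyperstructure(targets):
    """Delete bonds from the all-bonds structure until degrees match targets.

    targets[i] is the assumed number of neighbours of atom i. Bonds are
    deleted in a fixed order so that each set of deletions is visited once.
    """
    targets = list(targets)
    n = len(targets)
    all_bonds = list(combinations(range(n), 2))   # the initial hyper structure
    solutions = []

    def degrees(kept):
        deg = [0] * n
        for u, v in kept:
            deg[u] += 1
            deg[v] += 1
        return deg

    def recurse(kept, start):
        deg = degrees(kept)
        if any(d < t for d, t in zip(deg, targets)):
            return                                 # too many bonds were removed
        if deg == targets:
            solutions.append(sorted(kept))
            return
        for k in range(start, len(all_bonds)):
            u, v = all_bonds[k]
            # Only delete a bond whose two atoms both still have surplus bonds.
            if deg[u] > targets[u] and deg[v] > targets[v]:
                recurse(kept - {all_bonds[k]}, k + 1)

    recurse(frozenset(all_bonds), 0)
    return solutions


# Four atoms: atoms 0 and 3 need one neighbour, atoms 1 and 2 need two.
chains = reduce_hyperstructure([1, 2, 2, 1])
```

For this input the search returns the two labelled chains 0-1-2-3 and 0-2-1-3. The starting structure already has six bonds for four atoms, which reflects the growth in hypergraph size described above.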
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in the hyper structure; second, a substructure representation that lists the atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has contributed to the development of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: each intermediate structure is constructed from previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of a large number of intermediate structures make these algorithms inefficient. Matrix generators, which do not build intermediate structures, have therefore been attracting increasing interest. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
===See also===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
===Wikipedia pages that should link here===
----
*[[wp:Graph theory|Graph theory]]
*[[wp:Cheminformatics|Cheminformatics]]
*[[wp:Chemical graph theory|Chemical graph theory]]
==References==
{{Reflist}}
cs0i3bum5gjqsq7dnmioa4y1opnatrj
8238
8237
2019-12-13T11:34:02Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical graph generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical graph generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], [[wp:organic synthesis|organic synthesis design]] and systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were modified versions of graph generators, adapted for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. 
Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. 
These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. 
Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. 
Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator was developed at ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. 
For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, that is, the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if it reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
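These definitions translate directly into code. The following minimal sketch stores a chemical multigraph as a symmetric matrix of bond orders; the valence table and the function names are illustrative assumptions rather than part of any generator described here.

```python
from collections import deque

# Assumed standard valences for a few elements (illustrative, not exhaustive).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degree(bonds, i):
    """Degree of atom i in the multigraph: the sum of its bond orders."""
    return sum(bonds[i])


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(degree(bonds, i) == VALENCE[a] for i, a in enumerate(atoms))


def is_connected(bonds):
    """Breadth-first search from atom 0; connected if every atom is reached."""
    n = len(bonds)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if bonds[u][v] > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


# Formaldehyde, CH2O, with atom order C, O, H, H; each entry is a bond order.
formaldehyde_atoms = ["C", "O", "H", "H"]
formaldehyde = [
    [0, 2, 1, 1],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2_atoms = ["H", "H", "H", "H"]
two_h2 = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
```

The second example is saturated but disconnected, which is why saturation alone is not a sufficient acceptance test for a generated structure.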
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: A permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
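As a short worked example, the permutation of Table 1 can be stored as a mapping and composed with itself. The helper names below are illustrative only.

```python
# Table 1 as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are exactly its domain."""
    return sorted(p.values()) == sorted(p.keys())


def inverse(p):
    """Return the permutation that undoes p."""
    return {y: x for x, y in p.items()}


ff = compose(f, f)                      # for example, ff[1] = f(f(1)) = f(4) = 6
identity = compose(f, inverse(f))       # maps every x to itself
```

Here <code>ff</code> is again a permutation of the integers 1 to 11, and composing <code>f</code> with its inverse gives the identity, which anticipates the group axioms.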
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set of numbers, and let <math>Sym(X)</math> denote the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a [[wp:Permutation group|permutation group]], known as the symmetric group on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. [[wp:set(mathematics)|Set]] systems consist of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations that preserve this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself; that is, the action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling. OMG uses NAUTY for these calculations.
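For very small graphs, the automorphism group can be listed directly from the definition by testing every vertex permutation. The sketch below does exactly that and is meant only as an illustration; it is not how NAUTY works, since a search over all <math>n!</math> permutations is impractical beyond a handful of vertices. The function name and the two example graphs are assumptions made for the illustration.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Brute-force Aut(G) for a simple graph on the vertices 0..n-1.

    A permutation a is kept if it maps the edge set onto itself, that is,
    {a(u), a(v)} is an edge for every edge {u, v}.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


# Carbon skeleton of propane: the path 0 - 1 - 2.
path_group = automorphisms(3, [(0, 1), (1, 2)])

# Carbon skeleton of cyclopropane: the triangle on 0, 1, 2.
triangle_group = automorphisms(3, [(0, 1), (1, 2), (0, 2)])
```

The path has two automorphisms, the identity and the reflection that swaps its end atoms, whereas every one of the <math>3! = 6</math> permutations is an automorphism of the triangle.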
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method. This tree extension terminates when all the branches reach saturated structures.
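The generation tree described above can be sketched as a level-by-level (breadth-first) search. The code below is a simplified illustration and not the ASSEMBLE or OMG implementation: it assumes a small valence table, caps bond orders at 3, takes no substructures as input, and replaces a real canonical labelling (such as the one NAUTY computes) by a brute-force minimum over all relabellings that keep each atom's element. All function names are hypothetical.

```python
from itertools import permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}   # assumed standard valences
MAX_BOND_ORDER = 3


def label_preserving_perms(atoms):
    """All vertex permutations that map each atom to an atom of the same element."""
    groups = {}
    for i, a in enumerate(atoms):
        groups.setdefault(a, []).append(i)
    perms = []
    for choice in product(*(permutations(g) for g in groups.values())):
        p = list(range(len(atoms)))
        for g, image in zip(groups.values(), choice):
            for src, dst in zip(g, image):
                p[src] = dst
        perms.append(p)
    return perms


def canonical(m, perms):
    """Smallest relabelled matrix: a brute-force stand-in for canonical labelling."""
    n = len(m)
    return min(tuple(tuple(m[p[i]][p[j]] for j in range(n)) for i in range(n))
               for p in perms)


def is_connected(m):
    n = len(m)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if m[u][v] and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def assemble(atoms):
    """Extend structures one bond at a time until every branch is saturated."""
    n = len(atoms)
    val = [VALENCE[a] for a in atoms]
    perms = label_preserving_perms(atoms)
    level = {tuple(tuple(0 for _ in range(n)) for _ in range(n))}  # root: no bonds
    results = set()
    while level:
        next_level = set()
        for m in level:
            free = [val[i] - sum(m[i]) for i in range(n)]
            if not any(free):                      # saturated leaf of the tree
                if is_connected(m):
                    results.add(m)
                continue
            for i in range(n):
                for j in range(i + 1, n):
                    if free[i] and free[j] and m[i][j] < MAX_BOND_ORDER:
                        child = [list(row) for row in m]
                        child[i][j] += 1
                        child[j][i] += 1
                        # Storing only canonical forms removes duplicates.
                        next_level.add(canonical(child, perms))
        level = next_level
    return results


isomers = assemble(["C", "C", "O", "H", "H"])      # molecular formula C2H2O
```

For C<sub>2</sub>H<sub>2</sub>O the search returns three connected, saturated structures, corresponding to ketene, ethynol and oxirene. Every level of the tree is held in memory at once, which illustrates the storage cost of intermediate structures in this kind of approach.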
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> When PMG runs on a single core, MOLGEN is faster; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which offer a solution to the memory problem. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by treating matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike the previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles.
The key feature of MOLGEN is that it generates structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
===Structure Reduction===
----
Unlike assembly methods, reduction methods first form all possible bonds between atom pairs, generating a hypergraph (also referred to as a hyperstructure). The size of this graph is then reduced with respect to the constraints. First, the existence of the substructures in the hypergraph is checked. The generation tree starts with the hypergraph, and the structures decrease in size at each step as bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because generation starts from the hypergraph, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. In COCOA, fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures: rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, and the run time increases accordingly.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyperstructure. Both assembly and reduction methods have advantages and disadvantages, and GEN aims to avoid the disadvantages in the generation step: structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections that would create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once a structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in the hyperstructure; and second, a substructure representation that lists atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyperstructure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
==References==
{{Reflist}}
9exfosi51r2uzl6xdqfmvwtse59z499
8246
8238
2019-12-24T13:53:38Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], in [[wp:organic synthesis|organic synthesis design]] and in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were versions of graph generators modified for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artificial Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. [[File:overstructures.tiff|thumb|left|alt=Overlapping substructure of caffeine.|Fig 1. Overlapping substructure of caffeine. Two substructures of caffeine molecule are given (A) and (B). The overlap of these substructures is highlighted by green in caffeine structure (C).]]. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. 
During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. 
Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. 
This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 
107–120.</ref> Although these reports date from the 1970s, they are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. This structure generator is part of a well-known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. 
Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
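The definitions above can be illustrated with a minimal Python sketch. The data layout (a list of element symbols plus a dictionary mapping atom pairs to bond orders), the small valence table and the function names are assumptions made for illustration only; they are not taken from any of the generators discussed here. The degree of an atom is computed as the sum of its bond orders, as is usual for a multigraph.

```python
# Illustrative valence table (assumption: only these three elements are needed).
VALENCE = {"C": 4, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom, counting bond order as edge multiplicity."""
    deg = [0] * len(atoms)
    for (u, v), order in bonds.items():
        deg[u] += order
        deg[v] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom has reached its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from atom 0; connected if every atom is reached."""
    seen, stack = {0}, [0]
    while stack:
        a = stack.pop()
        for u, v in bonds:
            b = v if u == a else u if v == a else None
            if b is not None and b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen) == len(atoms)

# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(degrees(atoms, bonds), is_saturated(atoms, bonds), is_connected(atoms, bonds))
```

Removing one C-H bond from this example leaves both the carbon and one hydrogen unsaturated and disconnects the graph, which is why generators check both properties.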
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of set of integers.
|-
| <math> x </math> || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11
|-
| <math> f(x) </math> || 4 || 2 || 11 || 6 || 1 || 5 || 8 || 9 || 7 || 10 || 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
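As a small illustration of this composition rule, the following Python sketch stores the permutation of Table 1 as a dictionary and multiplies it with itself. The dictionary representation and the helper names <code>compose</code> and <code>is_permutation</code> are assumptions chosen for this example.

```python
# The permutation of Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are exactly its domain."""
    return sorted(p.values()) == sorted(p.keys())

ff = compose(f, f)
print(ff[1])               # f(f(1)) = f(4) = 6
print(is_permutation(ff))  # the product is again a permutation
```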
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> of <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> denote the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. [[wp:set(mathematics)|Set]] systems consist of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. The set of permutations preserving the set system is used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps a graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for this purpose.
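For very small graphs, the automorphism group can be found by testing every permutation of the vertices against the edge-preservation condition above. The Python sketch below does exactly that; it is a brute-force illustration, not the partition-refinement approach used by NAUTY, and it assumes a simple undirected graph without element labels or bond orders. The function name is hypothetical.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All vertex permutations that map every edge onto an edge.

    Brute force over all n! permutations, so only usable for tiny graphs.
    A permutation is stored as a tuple p, where vertex u is mapped to p[u].
    """
    edge_set = {frozenset(e) for e in edges}
    return [p for p in permutations(range(n))
            if all(frozenset((p[u], p[v])) in edge_set for u, v in edges)]

# A four-membered ring (the carbon skeleton of cyclobutane).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# A three-atom chain (the carbon skeleton of propane).
chain = [(0, 1), (1, 2)]

print(len(automorphisms(4, ring)))   # rotations and reflections of the ring
print(len(automorphisms(3, chain)))  # identity and the end-to-end flip
```

The ring has eight automorphisms and the chain has two. A generator that knows these symmetries can avoid extending the same structure in equivalent ways.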
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method. This tree extension terminates when all the branches reach saturated structures.
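The generation tree described above can be sketched in Python as a level-by-level (breadth-first) search. This is a simplified illustration under several assumptions, not the algorithm of ASSEMBLE or OMG: only heavy atoms are stored and hydrogens are treated as implicit, no substructures are given as input, and duplicates are removed with a brute-force canonical form instead of a package such as NAUTY. All names and the valence table are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative valence table

def canonical(elements, bonds):
    """Smallest bond list over all relabellings that keep element labels."""
    n, best = len(elements), None
    for p in permutations(range(n)):
        if any(elements[p[i]] != elements[i] for i in range(n)):
            continue
        key = tuple(sorted(((min(p[i], p[j]), max(p[i], p[j])), order)
                           for (i, j), order in bonds.items()))
        if best is None or key < best:
            best = key
    return best

def degree(bonds, atom):
    return sum(order for pair, order in bonds.items() if atom in pair)

def is_connected(n, bonds):
    seen, stack = {0}, [0]
    while stack:
        a = stack.pop()
        for i, j in bonds:
            b = j if i == a else i if j == a else None
            if b is not None and b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen) == n

def assemble(elements, hydrogens):
    """Breadth-first assembly: each level adds one unit of bond order."""
    n = len(elements)
    free = sum(VALENCE[e] for e in elements) - hydrogens
    if free < 0 or free % 2:
        return []
    level = {()}  # root of the tree: atoms without bonds
    for _ in range(free // 2):
        next_level = set()
        for state in level:
            bonds = dict(state)
            for i, j in combinations(range(n), 2):
                if bonds.get((i, j), 0) >= 3:
                    continue  # no bond orders above triple
                if (degree(bonds, i) >= VALENCE[elements[i]]
                        or degree(bonds, j) >= VALENCE[elements[j]]):
                    continue  # an atom is already saturated
                child = dict(bonds)
                child[(i, j)] = child.get((i, j), 0) + 1
                next_level.add(canonical(elements, child))
        level = next_level
    return sorted(s for s in level if is_connected(n, dict(s)))

print(len(assemble(["C", "C", "O"], 6)))       # C2H6O
print(len(assemble(["C", "C", "C", "O"], 8)))  # C3H8O
```

For C2H6O the sketch returns two structures (ethanol and dimethyl ether), and for C3H8O it returns three. Every level of the tree is held in memory at once, which is the storage cost discussed for breadth-first generators.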
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In a generation tree, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms: atoms with the same interaction type and element are grouped in the same equivalence class, and rather than extending all atoms in a molecule, only one atom from each class is extended. OMG generates structures using the canonical augmentation method together with McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends structures by adding one bond at a time. To keep the extension canonical, only canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for canonical graph labelling, OMG is reported to be almost 2000 times slower than MOLGEN for large molecules. The bottleneck is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
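The idea of extending only one atom per equivalence class can be sketched as follows. The text does not define "interaction type" precisely, so this Python example assumes, purely for illustration, that two atoms are equivalent when they share the same element and the same multiset of first neighbours (neighbour element and bond order). This is a coarser criterion than a full automorphism calculation, and the function names are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(elements, bonds):
    """Group atoms by element and first-neighbour environment (assumption)."""
    classes = defaultdict(list)
    for atom, element in enumerate(elements):
        neighbours = sorted(
            (elements[j if i == atom else i], order)
            for (i, j), order in bonds.items() if atom in (i, j))
        classes[(element, tuple(neighbours))].append(atom)
    return list(classes.values())

def representatives(elements, bonds):
    """One atom per class; only these atoms would be extended."""
    return [members[0] for members in equivalence_classes(elements, bonds)]

# Carbon skeleton of propane: the two terminal carbons are equivalent.
elements = ["C", "C", "C"]
bonds = {(0, 1): 1, (1, 2): 1}
print(equivalence_classes(elements, bonds))  # [[0, 2], [1]]
print(representatives(elements, bonds))      # [0, 1]
```

Extending atom 0 and atom 2 would give the same structure, so restricting the extension to one representative per class prunes the generation tree without losing structures in this example.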
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which offer a solution to the memory problem. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to the previous methods, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by treating matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike the previous methods, MOLGEN is an algebraic combinatorics method that relies on group theory, which is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles.
The key feature of MOLGEN is that it generates structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
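The matrix-based approach can be illustrated with a short Python sketch. It fills the upper triangle of a connectivity matrix entry by entry with backtracking, prunes any branch that would exceed a valence, and keeps a completed matrix only if it is connected and canonical. Canonical is taken here to mean lexicographically largest among all relabellings that preserve the element labels, tested by brute force. This is a toy version under stated assumptions (implicit hydrogens, heavy atoms only, no use of group theory) and is not the algorithm of MASS, SMOG or MOLGEN; all names are hypothetical.

```python
from itertools import combinations, permutations

VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative valence table

def is_connected(matrix):
    seen, stack = {0}, [0]
    while stack:
        a = stack.pop()
        for b, order in enumerate(matrix[a]):
            if order and b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen) == len(matrix)

def is_canonical(elements, matrix):
    """True if no element-preserving relabelling gives a larger matrix."""
    n = len(elements)
    cells = list(combinations(range(n), 2))
    flat = [matrix[i][j] for i, j in cells]
    for p in permutations(range(n)):
        if any(elements[p[i]] != elements[i] for i in range(n)):
            continue
        if [matrix[p[i]][p[j]] for i, j in cells] > flat:
            return False
    return True

def generate_matrices(elements, hydrogens):
    n = len(elements)
    free = sum(VALENCE[e] for e in elements) - hydrogens
    if free < 0 or free % 2:
        return []
    cells = list(combinations(range(n), 2))
    matrix = [[0] * n for _ in range(n)]
    remaining = [VALENCE[e] for e in elements]  # unused valence per atom
    results = []

    def fill(k, left):
        # left = units of bond order that still have to be placed
        if k == len(cells):
            if left == 0 and is_connected(matrix) and is_canonical(elements, matrix):
                results.append([row[:] for row in matrix])
            return
        i, j = cells[k]
        # Bound: never exceed a valence, the bond budget or a triple bond.
        for order in range(min(3, remaining[i], remaining[j], left), -1, -1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            fill(k + 1, left - order)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    fill(0, free // 2)
    return results

print(len(generate_matrices(["C"] * 4, 10)))  # C4H10
print(len(generate_matrices(["C"] * 5, 12)))  # C5H12
```

The sketch finds two matrices for C4H10 and three for C5H12, matching the known numbers of constitutional isomers. Only one matrix is held in memory during the search, and duplicates are rejected by the canonicity test rather than by comparison with stored structures.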
===Structure Reduction===
----
Unlike assembly methods, reduction methods first form all possible bonds between atom pairs, generating a hypergraph (also referred to as a hyperstructure). The size of this graph is then reduced with respect to the constraints. First, the existence of the substructures in the hypergraph is checked. The generation tree starts with the hypergraph, and the structures decrease in size at each step as bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because generation starts from the hypergraph, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. In COCOA, fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures: rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, and the run time increases accordingly.
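A recursive bond-removal procedure of this kind can be sketched in Python as follows. The example is heavily simplified and is not COCOA itself: it assumes single bonds only, replaces the substructure constraints with a target degree for each atom, and does not check connectivity. The function name and the input format are hypothetical.

```python
from itertools import combinations

def reduce_structure(targets):
    """Delete bonds from the fully bonded graph until every atom
    has exactly its target degree; return all resulting bond sets."""
    n = len(targets)
    hyper = list(combinations(range(n), 2))  # every atom pair is bonded
    results = set()

    def recurse(bonds, start):
        degs = [sum(1 for b in bonds if a in b) for a in range(n)]
        if any(d < t for d, t in zip(degs, targets)):
            return  # a target degree can no longer be reached
        if degs == list(targets):
            results.add(frozenset(bonds))
            return
        # Bonds are deleted in increasing list order so that each set of
        # deletions is explored only once.
        for k in range(start, len(bonds)):
            i, j = bonds[k]
            if degs[i] > targets[i] and degs[j] > targets[j]:
                recurse(bonds[:k] + bonds[k + 1:], k)

    recurse(hyper, 0)
    return results

# Four atoms with target degrees 1, 2, 2, 1 (a chain such as butane).
structures = reduce_structure([1, 2, 2, 1])
for s in structures:
    print(sorted(s))
```

Starting from six bonds, the procedure deletes three and returns the two labelled chains 0-1-2-3 and 0-2-1-3. The number of candidate bonds grows quadratically with the number of atoms, which reflects the size problem of hypergraphs described above.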
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN avoids the disadvantages of each in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections that create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
==References==
{{Reflist}}
lj1f5pfkid8fse0oat74x5qtlaswrcv
8248
8246
2019-12-24T14:07:30Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages that generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], in [[wp:organic synthesis|organic synthesis design]] and in systems for computer-assisted structure elucidation (CASE). CASE systems have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures. The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. [[File:Overstructures.png|thumb|left|'''Fig 1. Overlapping substructure of caffeine.''' Two substructures of caffeine molecule are given '''(A)''' and '''(B)'''. The overlap of these substructures is highlighted by green in caffeine structure '''(C)'''.]]. For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. 
During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. 
Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. 
This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 
107–120.</ref> Although these reports date from the 1970s, they are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. 
Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively. The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate fully saturated molecules that form a single connected graph rather than a set of disconnected fragments.
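The definitions above can be illustrated with a minimal Python sketch. The class name <code>ChemicalGraph</code>, the <code>VALENCE</code> table and the method names are assumptions made for this illustration only; they are not taken from any of the generators discussed in this article.

```python
from collections import defaultdict

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """Multigraph whose vertices are atoms and whose edge multiplicities are bond orders."""

    def __init__(self, atoms):
        self.atoms = list(atoms)        # vertex i carries the element symbol atoms[i]
        self.bonds = defaultdict(int)   # (i, j) with i < j -> bond order

    def add_bond(self, i, j, order=1):
        self.bonds[(min(i, j), max(i, j))] += order

    def degree(self, i):
        # Degree counts bond orders, so a double bond contributes 2.
        return sum(order for (u, v), order in self.bonds.items() if i in (u, v))

    def is_saturated(self):
        # A molecule is saturated if every atom reaches its valence.
        return all(self.degree(i) == VALENCE[el] for i, el in enumerate(self.atoms))

    def is_connected(self):
        # Depth-first search from atom 0 over the bonded atom pairs.
        if not self.atoms:
            return True
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for (a, b) in self.bonds:
                if u in (a, b):
                    w = b if u == a else a
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)
formaldehyde.add_bond(0, 2)
formaldehyde.add_bond(0, 3)

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
two_h2 = ChemicalGraph(["H", "H", "H", "H"])
two_h2.add_bond(0, 1)
two_h2.add_bond(2, 3)
```

The second example shows why saturation alone is not sufficient: every atom reaches its valence, yet the graph describes two fragments instead of one molecule.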
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: Permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
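As a small worked example, the permutation <math>f</math> of Table 1 can be composed with itself. The sketch below stores a permutation as a Python dictionary; the helper names <code>compose</code> and <code>is_permutation</code> are chosen here for illustration.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)   # for example, ff[1] = f(f(1)) = f(4) = 6
```

Here <code>is_permutation(ff)</code> holds, in line with the statement that the combination of two permutations is again a permutation.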
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math>, called the identity element, satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> of <math>G</math> such that <math> g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> denote the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. [[wp:set(mathematics)|Set]] systems consist of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph can be regarded as a set system whose blocks are its edges, and the permutations preserving this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>(u,v)</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))=(a(u),a(v))</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a((u,v))</math> is an element of <math>E</math> whenever <math>(u,v)</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG builds on NAUTY.
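For very small graphs, the automorphism group can be listed directly from the definition by testing every permutation of the vertices. The following Python sketch does exactly that; it is a brute-force illustration with a hypothetical function name, not the partition-refinement approach used by NAUTY, and it is only practical for a handful of vertices because it inspects all <math>n!</math> permutations.

```python
from itertools import permutations


def automorphisms(n, edges):
    """Return all vertex permutations of a simple graph on vertices 0..n-1
    that map every edge onto an edge (brute force over all n! permutations)."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        # perm[u] is the image of vertex u; the permutation must preserve all edges.
        if all(frozenset((perm[u], perm[v])) in edge_set for u, v in edges):
            result.append(perm)
    return result


# Four-membered ring (e.g., the carbon skeleton of cyclobutane).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
ring_automorphisms = automorphisms(4, ring)

# Three-atom chain (e.g., the carbon skeleton of propane).
chain = [(0, 1), (1, 2)]
chain_automorphisms = automorphisms(3, chain)
```

The ring has eight automorphisms (four rotations and four reflections), whereas the chain has only two: the identity and the exchange of its two end atoms.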
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method. This tree extension terminates when all the branches reach saturated structures.
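The generation tree described above can be sketched in Python as a breadth-first search that starts from the bare atom set and adds one bond per level. This is a deliberately naive illustration and not the ASSEMBLE or OMG algorithm: the names <code>assemble</code>, <code>VALENCE</code> and <code>MAX_ORDER</code> are assumptions, no substructures or spectral constraints are used, connectivity is not checked, and structures that differ only in the numbering of their atoms are all kept.

```python
from itertools import combinations

VALENCE = {"C": 4, "O": 2, "H": 1}   # assumed valences, illustrative only
MAX_ORDER = 3                        # assumed maximum bond order


def assemble(atoms):
    """Breadth-first assembly: each tree level adds one bond to every open structure.
    A structure is a tuple of bond orders, one entry per atom pair."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    target = [VALENCE[a] for a in atoms]

    def degrees(state):
        deg = [0] * n
        for (i, j), order in zip(pairs, state):
            deg[i] += order
            deg[j] += order
        return deg

    level = {tuple([0] * len(pairs))}   # root node: the atom set without bonds
    saturated = set()
    while level:
        next_level = set()
        for state in level:
            deg = degrees(state)
            if deg == target:           # leaf: every atom is saturated
                saturated.add(state)
                continue
            for k, (i, j) in enumerate(pairs):
                # Extend only where both atoms still have free valences.
                if deg[i] < target[i] and deg[j] < target[j] and state[k] < MAX_ORDER:
                    next_level.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = next_level
    return saturated


methane = assemble(["C", "H", "H", "H", "H"])
```

For the atom set of methane, the search ends with a single saturated structure. For larger formulas, the same search returns many atom-numbering variants of each molecule, which is why practical generators combine the tree search with duplicate elimination.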
Another assembly method is GENOA. In contrast to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures while also considering their overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is almost 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
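The role of canonical labelling in duplicate detection can be illustrated with a brute-force Python sketch. The function <code>canonical_key</code> is a hypothetical helper that tries every relabelling of the atoms that keeps the element of each position unchanged and returns the smallest resulting bond list; two labelled structures on the same atom list are duplicates exactly when their keys are equal. NAUTY reaches the same goal far more efficiently, and this sketch does not implement canonical augmentation.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Smallest relabelled bond list over all element-preserving atom permutations.
    atoms: list of element symbols; bonds: dict {(i, j): bond order}."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # Skip relabellings that would move an atom onto a different element.
        if any(atoms[perm[i]] != atoms[i] for i in range(n)):
            continue
        key = tuple(sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        ))
        if best is None or key < best:
            best = key
    return best


# Heavy-atom skeletons on the atom list C, C, O.
atoms = ["C", "C", "O"]
ethanol_a = {(0, 1): 1, (1, 2): 1}        # C0-C1-O
ethanol_b = {(0, 1): 1, (0, 2): 1}        # C1-C0-O, the same skeleton renumbered
dimethyl_ether = {(0, 2): 1, (1, 2): 1}   # C0-O-C1

key_a = canonical_key(atoms, ethanol_a)
key_b = canonical_key(atoms, ethanol_b)
key_c = canonical_key(atoms, dimethyl_ether)
```

The two numberings of the ethanol skeleton receive the same key and would be stored only once, whereas the dimethyl ether skeleton receives a different key.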
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
A key feature of MOLGEN is that it generates structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
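The idea of treating structure generation as a numerical matrix-filling problem can be sketched as follows. The Python function <code>fill_matrices</code> is a hypothetical, simplified backtracking routine: it fills the upper triangle of a symmetric connectivity matrix entry by entry so that each row sums to the valence of its atom, and it abandons a branch as soon as a completed row misses its valence. It is not the MASS, SMOG or MOLGEN algorithm; in particular, it uses no group theory, so symmetry-equivalent matrices are not removed, and connectivity is not checked.

```python
def fill_matrices(valences, max_order=3):
    """Enumerate symmetric matrices with zero diagonal whose i-th row sums to valences[i]."""
    n = len(valences)
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]   # upper triangle, row by row
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)   # free valences left for each atom
    solutions = []

    def backtrack(k):
        if k == len(cells):
            if all(r == 0 for r in remaining):
                solutions.append([row[:] for row in matrix])
            return
        i, j = cells[k]
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            # Bound: once the last entry of row i is set, atom i must be saturated.
            if j < n - 1 or remaining[i] == 0:
                backtrack(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    backtrack(0)
    return solutions


# Atoms C, O, H, H with valences 4, 2, 1, 1 (formaldehyde).
formaldehyde_matrices = fill_matrices([4, 2, 1, 1])
```

For the valences 4, 2, 1 and 1, the only solution is the matrix of formaldehyde, with a double bond between the first two atoms and single bonds from the first atom to the last two.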
===Structure Reduction===
----
Unlike assembly methods, reduction methods first make all the possible bonds between atom pairs, generating a hypergraph (also called a hyper structure). The size of this graph is then reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. In contrast to assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Owing to the hypergraph representation, overlaps among the substructures are also taken into account. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the hypergraph becomes extremely large, resulting in a corresponding increase in the run time.
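The reduction principle can be illustrated with a strongly simplified Python sketch. It starts from a hyper structure in which every atom pair is joined by a single bond and removes bonds until each atom has exactly the required number of bonds. The function name <code>reduce_hyper_structure</code> is hypothetical, only single bonds are considered, and the substructure constraints that drive COCOA are replaced here by the valences alone, so this is an assumption-laden illustration rather than the COCOA algorithm.

```python
from itertools import combinations


def reduce_hyper_structure(valences):
    """Delete bonds from the fully bonded hyper structure until the number of
    bonds at each atom equals the requested value (single bonds only)."""
    n = len(valences)
    hyper = frozenset(combinations(range(n), 2))   # every atom pair is bonded
    results, seen, stack = set(), {hyper}, [hyper]
    while stack:
        bonds = stack.pop()
        deg = [0] * n
        for i, j in bonds:
            deg[i] += 1
            deg[j] += 1
        if deg == list(valences):
            results.add(bonds)
            continue
        for i, j in bonds:
            # A bond may only be removed if both of its atoms still carry too many
            # bonds; otherwise an atom would fall below its required number.
            if deg[i] > valences[i] and deg[j] > valences[j]:
                child = bonds - {(i, j)}
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
    return results


# Four atoms with one free valence each: the results are the three ways to pair them.
pairings = reduce_hyper_structure([1, 1, 1, 1])
```

Even in this small example, the search passes through many partially reduced graphs before it reaches the three final structures, which reflects the run-time cost of large hyper structures noted above.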
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN avoids the disadvantages of each in the generation step: structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining these methods. GEN removes the connections that create forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
==References==
{{Reflist}}
8ffdt4fml2bgzz577p4lzkecyuj4k65
8249
8248
2019-12-24T14:23:33Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures
('''Figure 1'''). The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. [[File:Overstructures.png|thumb|left|'''Fig 1. Overlapping substructure of caffeine.''' Two substructures of caffeine molecule are given '''(A)''' and '''(B)'''. The overlap of these substructures is highlighted by green in caffeine structure '''(C)'''.]]. For the input spectral data, the matching component sets are used as building blocks. 
These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. 
LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 
2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. 
The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 107–120.</ref> Although their report is from the 1970s, this study is still the fundamental reference for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. 
These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator was developed at ACD Labs by a team that notably includes Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively ('''Figure 2'''). The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds that the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate connected, fully saturated molecules. A molecule is saturated if all its atoms are saturated.
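The definitions above can be illustrated with a minimal Python sketch. The class name <code>ChemicalGraph</code>, its methods and the <code>VALENCE</code> table are hypothetical names introduced only for this illustration; they are not taken from any of the generators discussed here. The sketch stores a multigraph as a map from atom pairs to bond orders and checks saturation and connectivity for formaldehyde.

```python
from collections import defaultdict

# Standard valences, assumed for this sketch only.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


class ChemicalGraph:
    """Multigraph G = (V, E): atoms are vertices, bond orders are edge multiplicities."""

    def __init__(self, atoms):
        self.atoms = list(atoms)
        self.bonds = defaultdict(int)  # (u, v) with u < v -> bond order

    def add_bond(self, u, v, order=1):
        self.bonds[(min(u, v), max(u, v))] += order

    def degree(self, u):
        # Degree counts every bond with its multiplicity.
        return sum(o for (a, b), o in self.bonds.items() if u in (a, b))

    def is_saturated(self):
        # Every atom must reach exactly its valence.
        return all(self.degree(i) == VALENCE[el] for i, el in enumerate(self.atoms))

    def is_connected(self):
        # Depth-first search from atom 0 over all bonded neighbours.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for (a, b), order in self.bonds.items():
                if order > 0 and u in (a, b):
                    v = b if u == a else a
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
        return len(seen) == len(self.atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
formaldehyde = ChemicalGraph(["C", "O", "H", "H"])
formaldehyde.add_bond(0, 1, order=2)
formaldehyde.add_bond(0, 2)
formaldehyde.add_bond(0, 3)
print(formaldehyde.is_saturated(), formaldehyde.is_connected())
```

Storing bond orders as edge multiplicities keeps the degree of an atom directly comparable with its valence, which is the comparison the saturation test relies on.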
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
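As a small illustration, the permutation of Table 1 can be written as a Python dictionary and composed with itself. The helper names <code>compose</code>, <code>inverse</code> and <code>is_permutation</code> are assumptions made for this sketch.

```python
# The permutation f from Table 1, stored as x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def inverse(p):
    """Return the permutation that undoes p."""
    return {y: x for x, y in p.items()}


def is_permutation(p):
    """A map is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)
identity = {x: x for x in f}

print(ff[1])                              # f(f(1)) = f(4) = 6
print(is_permutation(ff))                 # the product is again a permutation
print(compose(f, inverse(f)) == identity)
```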
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an identity element <math>I</math> in <math>G</math> satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an inverse element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set. Under the function composition operation, the set of all permutations of <math>X</math>, denoted <math>Sym(X)</math>, is a [[wp:Permutation group|symmetry group]]. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A [[wp:set(mathematics)|set]] system consists of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. The permutations that preserve the set system are used to build the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action preserves the edge-vertex connectivity.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry ('''Figure 3''') detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG uses NAUTY for these tasks.
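For very small graphs, the automorphism group can be enumerated directly from the definition by testing every permutation of the vertices. The following Python sketch does exactly that; the function name <code>automorphisms</code> is an assumption of this example, and the brute-force search is only illustrative. It is not the refinement-based algorithm that NAUTY implements.

```python
from itertools import permutations


def automorphisms(vertices, edges):
    """Return every vertex permutation that maps each edge onto an edge."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for image in permutations(vertices):
        a = dict(zip(vertices, image))
        # a is an automorphism if {a(u), a(v)} is an edge for every edge {u, v}.
        if all(frozenset((a[u], a[v])) in edge_set for u, v in edges):
            found.append(a)
    return found


# A three-atom chain: only the identity and the end-to-end flip survive.
chain = automorphisms([0, 1, 2], [(0, 1), (1, 2)])

# A four-membered ring: four rotations and four reflections.
ring = automorphisms([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])

print(len(chain), len(ring))
```

Because the search visits all <math>n!</math> permutations, it is practical only for a handful of vertices, which is why dedicated packages are used in real generators.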
==Methods==
Generation methods are the core of CASE systems, and they rely on combinatorial methods. The molecular formula is the basic input of a generator. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method ('''Figure 4'''). This tree extension terminates when all the branches reach saturated structures.
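The generation tree described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of ASSEMBLE or OMG: only heavy atoms are modelled, hydrogens are implicit and fill the remaining valences, the <code>VALENCE</code> table and the function names <code>extend</code> and <code>assemble</code> are hypothetical, and neither the connectivity check nor duplicate (isomorphism) removal is performed.

```python
from itertools import combinations

# Standard valences, assumed for this sketch only.
VALENCE = {"C": 4, "N": 3, "O": 2}


def extend(atoms, pairs, state):
    """Yield every child of a node: one bond order raised by 1 within the valences."""
    degree = [0] * len(atoms)
    for (u, v), order in zip(pairs, state):
        degree[u] += order
        degree[v] += order
    for k, (u, v) in enumerate(pairs):
        if (state[k] < 3
                and degree[u] < VALENCE[atoms[u]]
                and degree[v] < VALENCE[atoms[v]]):
            yield state[:k] + (state[k] + 1,) + state[k + 1:]


def assemble(atoms, hydrogens):
    """Breadth-first extension: each tree level adds one unit of bond order."""
    pairs = list(combinations(range(len(atoms)), 2))
    # Bond units left for heavy-atom bonds once the hydrogens are accounted for.
    bond_units = (sum(VALENCE[a] for a in atoms) - hydrogens) // 2
    level = {(0,) * len(pairs)}  # root node: atoms without any bonds
    for _ in range(bond_units):
        level = {child for state in level for child in extend(atoms, pairs, state)}
    return pairs, sorted(level)


# Heavy atoms of C2H6O; each state lists the bond orders for the pairs in order.
pairs, leaves = assemble(["C", "C", "O"], hydrogens=6)
print(pairs)
print(leaves)
```

The leaves still contain disconnected bond patterns and relabelled copies of the same molecule, which is why real generators combine this extension step with connectivity checks and canonical forms.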
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates the canonical labelling and then extends structures by adding one bond. To keep the extension canonical, only canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
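The idea of treating structure generation as a matrix problem can be illustrated with a brute-force Python sketch. It is deliberately naive and is not the orderly generation used by MASS, SMOG or MOLGEN: every symmetric connectivity matrix over the heavy atoms is enumerated, matrices that violate the valences, the hydrogen count or connectivity are rejected, and duplicates are removed with a canonical form found by trying all element-preserving relabellings. The <code>VALENCE</code> table and all function names are assumptions of this example, and hydrogens are treated as implicit.

```python
from itertools import combinations, permutations, product

# Standard valences, assumed for this sketch only.
VALENCE = {"C": 4, "N": 3, "O": 2}


def is_connected(n, bonds):
    """Depth-first search over the nonzero entries of the connectivity matrix."""
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(n):
            if v not in seen and bonds.get((min(u, v), max(u, v)), 0) > 0:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def canonical_form(atoms, pairs, bonds):
    """Smallest upper triangle over all relabellings that keep the elements fixed."""
    n = len(atoms)
    forms = []
    for p in permutations(range(n)):
        if all(atoms[p[i]] == atoms[i] for i in range(n)):
            forms.append(tuple(bonds[(min(p[i], p[j]), max(p[i], p[j]))]
                               for i, j in pairs))
    return min(forms)


def generate(atoms, hydrogens):
    """Enumerate connectivity matrices (upper triangles) for a heavy-atom skeleton."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    found = set()
    for orders in product(range(4), repeat=len(pairs)):
        bonds = dict(zip(pairs, orders))
        degree = [0] * n
        for (u, v), order in bonds.items():
            degree[u] += order
            degree[v] += order
        free = [VALENCE[a] - d for a, d in zip(atoms, degree)]
        # Free valences must be non-negative and exactly filled by the hydrogens.
        if min(free) < 0 or sum(free) != hydrogens:
            continue
        if is_connected(n, bonds):
            found.add(canonical_form(atoms, pairs, bonds))
    return sorted(found)


# C2H6O: the two constitutional isomers are dimethyl ether and ethanol.
isomers = generate(["C", "C", "O"], hydrogens=6)
print(isomers)
```

The exhaustive enumeration grows exponentially with the number of atom pairs, so the sketch only conveys what orderly generators avoid: they construct a single canonical representative directly instead of filtering all labelled matrices afterwards.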
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. As in COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and GEN is designed to avoid these disadvantages in the generation step: structure reduction is efficient when structural constraints are provided, whereas structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints more efficiently by combining the two methods. GEN removes the connections that would create forbidden structures, and the connection matrices are then filled based on the substructure information. The method does not accept overlaps among substructures. Once a structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds that represents all bonds in a hyper structure; second, a substructure representation that lists the atom-centred fragments. During structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists, and it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, structure assembly and structure reduction, of which structure assembly has been the dominant approach. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation and storage of too many intermediate structures make these algorithms inefficient. Matrix generators have been attracting increasing interest in the field. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
==References==
{{Reflist}}
m9qwy4ccc580ibmyo2bch4r9ij8skz9
8253
8249
2019-12-24T14:57:47Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures ('''Figure 1'''). The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. [[File:Overstructures.png|thumb|right|400px|'''Fig 1. Overlapping substructure of caffeine.''' Two substructures of caffeine molecule are given '''(A)''' and '''(B)'''. The overlap of these substructures is highlighted by green in caffeine structure '''(C)'''.]] For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. 
During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. 
Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. 
This platform could generate structures with any arbitrary size of molecules; however, molecular formulas with more than 30 heavy atoms are too time consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA was developed as a stochastic method to find optimal solutions. The systems comprise two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. By transforming this structure into another structure, the process continues until the optimum energy is reached. In the generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 
107–120.</ref> Although their reports date from the 1970s, these studies are still the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and Bangov’s studies<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. An automorphism group of a graph consists of all its symmetries, and thus an awareness of symmetry types accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator comes from ACD Labs; notably, one of its developers is Elyashberg, who was also one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. 
Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. For large molecules, OMG is almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
[[File:Chemical Graph Representation.png|thumb|right|400px|'''Fig 2. Graph representation of [[wp:Serotonin|serotonin molecule]].''' '''(A)''' Molecular structure of serotonin. '''(B)''' Graph representation of the molecule.]]
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively ('''Figure 2'''). The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds a chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree reaches its valence.
A graph is connected if there is at least one path between each pair of vertices. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate single, fully saturated molecules rather than sets of disconnected fragments. A molecule is saturated if all its atoms are saturated.
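The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the generators discussed here; the valence table, the function names and the dictionary-based bond representation are assumptions made for illustration only.

```python
from collections import defaultdict

# Assumed standard valences for a few elements (illustrative subset only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def degrees(atoms, bonds):
    """Degree of each atom: the sum of the orders of its bonds.

    bonds maps a vertex pair (i, j) to its bond order (edge multiplicity).
    """
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg

def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))

def is_connected(atoms, bonds):
    """Depth-first search from vertex 0; connected if every vertex is reached."""
    if not atoms:
        return True
    adjacent = defaultdict(set)
    for i, j in bonds:
        adjacent[i].add(j)
        adjacent[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacent[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(atoms)

# Formaldehyde (CH2O): one C=O double bond and two C-H single bonds.
formaldehyde_atoms = ["C", "O", "H", "H"]
formaldehyde_bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}

# Two separate H2 molecules: every atom is saturated, but the graph is disconnected.
h4_atoms = ["H", "H", "H", "H"]
h4_bonds = {(0, 1): 1, (2, 3): 1}
```

The second example shows why saturation alone is not sufficient: all four hydrogen atoms reach their valence, yet the graph does not describe a single molecule.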
===Symmetry Groups for Molecular Graphs===
----
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|+ Table 1: Permutation of set of integers.
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
[[File:Permutation Action on Atom Sets.png|thumb|right|400px|'''Fig 3. Molecular Symmetry.''' '''(A)''' The initial enumeration of 2,4-Dimethyl-2-pentene. '''(B)''' and '''(C)''' are symmetries of the same molecule with different enumerations.]]
The combination of two permutations is also a permutation.
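As a small worked example, the permutation of Table 1 can be composed with itself. The Python sketch below is only illustrative; the dictionary representation and the helper names are assumptions, not part of any cited software.

```python
# The permutation f of Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}

def compose(a, b):
    """Product ab of two permutations, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}

def is_permutation(p):
    """A mapping is a permutation if its images are a rearrangement of its domain."""
    return sorted(p.values()) == sorted(p.keys())

def inverse(p):
    """Inverse permutation: swaps each element with its image."""
    return {y: x for x, y in p.items()}

ff = compose(f, f)                 # f applied twice
identity = compose(f, inverse(f))  # maps every element to itself
```

Here `ff[1]` equals `f[f[1]] = f[4] = 6`, and `ff` is again a permutation of the same set, in line with the statement above.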
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math> satisfying <math>g*I=g</math>, for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math> g^{-1}</math> in <math>G</math> such that <math> g*g^{-1}</math> is equal to the identity element <math>I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> be the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A [[wp:set(mathematics)|set]] system consists of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph is a set system whose blocks are its edges, and the permutations preserving this set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph; in other words, it maps the graph onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry ('''Figure 3''') detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
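For very small graphs, the automorphism group can be listed directly from the definition by testing every vertex permutation. The Python sketch below does exactly this; it is a brute-force illustration under the assumption of a simple graph without atom labels or bond orders, and it is not how NAUTY works, since testing all <math>n!</math> permutations quickly becomes infeasible.

```python
from itertools import permutations

def automorphisms(n, edges):
    """All permutations of the vertices 0..n-1 that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    found = []
    for perm in permutations(range(n)):
        # Image of every edge {u, v} under the permutation: {perm[u], perm[v]}.
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            found.append(perm)
    return found

# Path 0-1-2, e.g. the carbon skeleton of propane.
path_group = automorphisms(3, [(0, 1), (1, 2)])

# Four-membered ring 0-1-2-3, e.g. the carbon skeleton of cyclobutane.
ring_group = automorphisms(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

The path has two automorphisms (the identity and the end-to-end flip), while the ring has eight, corresponding to its rotations and reflections.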
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. In a generator, the molecular formula is the basic input. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. The algorithmic complexity and the run time are the criteria used for comparison.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorically connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data. Spectral data are used for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the first node of the tree is an atom set with substructures if any are provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method ('''Figure 4'''). This tree extension terminates when all the branches reach saturated structures. [[File:Breadth-First Search Structure Generation.png|thumb|right|500px|'''Fig 4. Breadth-first search generation.''' Molecular structure generation is explained step by step. 
Starting from a set of atoms, bonds are added between atom pairs until reaching saturated structures.]]
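The breadth-first assembly idea can be sketched in a few lines of Python. This is a toy illustration only, not the OMG implementation: duplicates are removed by a brute-force canonical form (the smallest bond tuple over all relabellings that keep element symbols fixed) instead of canonical augmentation with NAUTY, and the valence table, the maximum bond order of 3 and all function names are assumptions.

```python
from itertools import combinations, permutations, product

VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # assumed standard valences

def generate(atoms, max_order=3):
    """Breadth-first generation of saturated, connected structures.

    A state is a tuple of bond orders, one entry per atom pair (i, j), i < j.
    Returns the set of canonical states of all complete structures.
    """
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    index = {pair: k for k, pair in enumerate(pairs)}

    # Precompute how every element-preserving relabelling permutes the pairs.
    groups = {}
    for i, symbol in enumerate(atoms):
        groups.setdefault(symbol, []).append(i)
    pair_maps = []
    for choice in product(*(permutations(g) for g in groups.values())):
        p = list(range(n))
        for g, image in zip(groups.values(), choice):
            for i, j in zip(g, image):
                p[i] = j
        pair_maps.append([index[tuple(sorted((p[i], p[j])))] for i, j in pairs])

    def canonical(state):
        # Brute force: smallest relabelled state represents the whole class.
        return min(tuple(state[k] for k in m) for m in pair_maps)

    def free_valences(state):
        free = [VALENCE[a] for a in atoms]
        for (i, j), order in zip(pairs, state):
            free[i] -= order
            free[j] -= order
        return free

    def connected(state):
        adjacent = {v: set() for v in range(n)}
        for (i, j), order in zip(pairs, state):
            if order:
                adjacent[i].add(j)
                adjacent[j].add(i)
        seen, stack = {0}, [0]
        while stack:
            for w in adjacent[stack.pop()] - seen:
                seen.add(w)
                stack.append(w)
        return len(seen) == n

    level = {canonical((0,) * len(pairs))}  # root node: atoms without bonds
    results = set()
    while level:
        next_level = set()
        for state in level:
            free = free_valences(state)
            if not any(free):  # leaf: every atom is saturated
                if connected(state):
                    results.add(state)
                continue
            for k, (i, j) in enumerate(pairs):
                if free[i] and free[j] and state[k] < max_order:
                    child = state[:k] + (state[k] + 1,) + state[k + 1:]
                    next_level.add(canonical(child))
        level = next_level
    return results

isomers = generate(["C", "C", "O", "H", "H", "H", "H"])  # C2H4O
```

For the formula C2H4O, the sketch returns three structures, corresponding to acetaldehyde, vinyl alcohol and ethylene oxide. Every level of the tree is kept in memory, which mirrors the storage problem of breadth-first generators discussed in this section.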
Another assembly method is GENOA. Compared to ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm, and it assembles different substructures by also considering the overlaps. CHEMICS is also a well-known CASE system that provides a novel structure generator algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. It generates different types of component sets ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is performed to cope with this exhaustivity. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates canonical labelling and then extends structures by adding one bond. To keep the extension canonical, canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for graph canonical labelling, OMG is approximately 2000 times slower than MOLGEN. The problem is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG uses only 1 core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to previous methods, these methods build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike previous methods, MOLGEN is an algebraic combinatorics method that relies on group theorems. Applied group theory is performed in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is generating structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorically possible connectivity matrices and determines if a matrix represents a saturated molecule that satisfies the constraints.
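The matrix view of generation can be illustrated with a small backtracking sketch in Python. It fills the upper triangle of a symmetric connectivity matrix entry by entry so that each row sum equals the valence of the corresponding atom, and it abandons a branch as soon as a row can no longer be completed. This is a hedged illustration of matrix generation in general, not the algorithm of MASS, SMOG or MOLGEN: in particular, it performs no orderly generation, so isomorphic (relabelled) matrices are all reported, and no connectivity check is made. The function name and the maximum bond order of 3 are assumptions.

```python
def connectivity_matrices(valences, max_order=3):
    """Yield all symmetric matrices with zero diagonal whose row sums are the valences."""
    n = len(valences)
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)  # free valence left for each atom
    cells = [(i, j) for i in range(n) for j in range(i + 1, n)]

    def fill(k):
        if k == len(cells):
            if not any(remaining):  # every atom is saturated
                yield tuple(tuple(row) for row in matrix)
            return
        i, j = cells[k]
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            # Bound: the last entry of row i must use up all of atom i's valence.
            if j == n - 1 and order != remaining[i]:
                continue
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            yield from fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    yield from fill(0)

# Acetylene, C2H2: atoms 0 and 1 are carbon, atoms 2 and 3 are hydrogen.
acetylene = list(connectivity_matrices([4, 4, 1, 1]))
```

For C2H2, two matrices are produced. Both contain the carbon-carbon triple bond and differ only in which hydrogen is attached to which carbon, so they describe the same molecule; removing such duplicates without comparing all results is the purpose of orderly generation. Only the matrix currently being filled is held in memory, which reflects the low memory usage of constructive search.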
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Overlaps in the substructures were also considered due to the hypergraphs. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hyper structure becomes extremely large, resulting in a proportional increase in the run time.
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disadvantages in the generation step. Specifically, structure reduction is efficient when structural constraints are provided, and structure assembly is faster without constraints. First, the useless connections are eliminated, and then the substructures are assembled to build structures. Thus, GEN copes with the constraints in a more efficient way by combining these methods. GEN removes the connections creating the forbidden structures, and then the connection matrices are filled based on substructure information. The method does not accept overlaps among substructures. Once the structure is built in the matrix representation, the saturated molecule is stored in the output list. Munk and his team improved the COCOA method and built a new generator, HOUDINI.<ref>A. Korytko, K.-P. Schulz, M. S. Madison, and M. E. Munk, ‘HOUDINI: A New Approach to Computer-Based Structure Generation’, J. Chem. Inf. Comput. Sci., vol. 43, no. 5, pp. 1434–1446, Sep. 2003.</ref> HOUDINI relies on two data structures: first, a square matrix of compounds representing all bonds in a hyper structure; second, a substructure representation that lists atom-centred fragments. In the structure generation, HOUDINI maps all the atom-centred fragments onto the hyper structure.
==Conclusion==
The structural identification of unknown molecules is an interdisciplinary field involving mathematicians, chemists and computer scientists; moreover, it has led to the creation of the fields of mathematical chemistry and cheminformatics. The state-of-the-art methods comprise a variety of algorithms that can be classified into two groups, of which structure assembly has been the dominant approach in the field. Both assembly and reduction methods are incremental processes: all the intermediate structures are constructed based on previously generated structures, and duplicates are then excluded. The algorithms are generally breadth-first searches and terminate once all the structures are saturated. The generation of too many intermediate structures and their storage make these algorithms inefficient. In the field, matrix generators have been attracting increasing interest from many scientists. According to the literature, there is still a lack of mathematical algorithms; more precisely, there is a lack of efficient open-source structure generators.
==References==
{{Reflist}}
3xxko8bupaj2zpssg05psovw229a1du
8254
8253
2019-12-24T15:00:00Z
MehmetAzizYirik
145
wikitext
text/x-wiki
{{author
|first1 = Mehmet Aziz
|last1 = Yirik
|department1 = Analytical Chemistry
|institution1 = [[WP:University of Jena|University of Jena]]
|address1 = Lessingstrasse 8, 07743, Jena, Germany
|username1 = User:MehmetAzizYirik
|orcid1 = https://orcid.org/0000-0001-7520-7215
|first2 = Christoph
|last2 = Steinbeck
|department2 = Analytical Chemistry
|institution2 = [[WP:University of Jena|University of Jena]]
|address2 = Lessingstrasse 8, 07743, Jena, Germany
|username2 = User:csteinbeck
|orcid2 = https://orcid.org/0000-0001-6966-0814
}}
==Abstract==
Chemical Graph Generators are software packages to generate computer representations of chemical structures adhering to certain boundary conditions. Their development is a research topic of [[wp:Cheminformatics|cheminformatics]]. Chemical Graph Generators are used in areas such as virtual library generation in [[wp:drug design|drug design]], for [[wp:organic synthesis|organic synthesis design]] or in systems for computer-assisted structure elucidation (CASE). CASE systems again have regained interest for the structure elucidation of unknowns in computational [[wp:metabolomics|metabolomics]], a current area of [[wp:computational biology|computational biology]].
==History==
Molecular structure generation is a branch of [[wp:Graph(discrete mathematics)|graph]] generation problems. Molecular structures are graphs with chemical constraints such as [[wp:Valence(chemistry)|valences]], [[wp:Bond order|bond multiplicity]] and fragments. The first structure generators were graph generators modified versions for chemical purposes. CONGEN was the first structure generator developed for the [[wp:DENDRAL|DENDRAL]] project, the first artificial intelligence project in [[wp:organic chemistry|organic chemistry]].<ref>G. Sutherland, ‘DENDRAL - A computer program for generating and filtering chemical structures’, Stanf. Artifical Intell., vol. 49, p. 34.</ref> CONGEN dealt well with overlaps in substructures ('''Figure 1'''). The overlaps among substructures other than [[wp:Atom|atoms]] were used as the building blocks. For the case of [[wp:stereoisomerism|stereoisomers]], [[wp:Symmetry group|symmetry group]] calculations were performed for duplicate detection. Another early attempt was made by Abe in 1975 using a pattern recognition-based structure generator.<ref>H. Abe and P. C. Jurs, ‘Automated chemical structure analysis of organic molecules with a molecular structure generator and pattern recognition techniques’, Anal. Chem., vol. 47, no. 11, pp. 1829–1835, 1975.</ref> The algorithm had two steps: first, the prediction of the substructure from low-resolution spectral data; second, the assembly of these substructures based on a set of construction rules. A year later, a mathematical method, MASS<ref>V. V. Serov, M. E. Elyashberg, and L. A. Gribov, ‘Mathematical synthesis and analysis of molecular structures’, J. Mol. Struct., vol. 31, no. 2, pp. 381–397, 1976.</ref>, a tool for mathematical synthesis and analysis of molecular structures, was reported. Mathematically speaking, the algorithm worked as an [[wp:Adjacency matrix|adjacency matrix]] generator. Following MASS, Abe and his collaborators published the first paper on CHEMICS<ref>S. I. 
Sasaki et al., ‘CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds’, J. Chem. Inf. Comput. Sci., vol. 18, no. 4, pp. 211–222, 1978</ref>, which is a computer-assisted structure elucidation (CASE) tool comprising structure generation methods. The program relies on a predefined non-overlapping fragment library. [[File:Overstructures.png|thumb|right|400px|'''Fig 1. Overlapping substructure of caffeine.''' Two substructures of caffeine molecule are given '''(A)''' and '''(B)'''. The overlap of these substructures is highlighted by green in caffeine structure '''(C)'''.]] For the input spectral data, the matching component sets are used as building blocks. These component sets were ranked from primary to tertiary substructures. Substantial contributions were made by Shelley and Munk, who published a large number of CASE papers in this field. The first paper reported a structure generator, ASSEMBLE.<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> The algorithm is considered one of the earliest assembly methods in the field. As the name indicates, the algorithm assembles substructures with overlaps to construct structures. ASSEMBLE overcomes overlapping by including a “neighbouring atom tag”. Later, the algorithm became part of a CASE system called CASE. The second version of ASSEMBLE was released in 2000. Between the releases of these two versions, the same team also reported a different approach, the first structure reduction method, COCOA.<ref>B. D. Christie and M. E. Munk, ‘Structure Generation by Reduction: A New Strategy for Computer-Assisted Structure Elucidation’, J. Chem. Inf. Comput. Sci., vol. 28, no. 2, pp. 87–93, 1988.</ref> The method is an exhaustive, recursive bond-removal procedure. Unlike the assembly approaches, a [[wp:Hypergraph|hypergraph]] is constructed with all the spectral information. 
During generation, the size of this hypergraph is decreased by removing irrelevant bonds from the graph. The efficiency and exhaustivity of generators are also related to the data structures. Unlike previous methods, AEGIS was a list-processing generator.<ref>H. J. Luinge and J. H. Van Der Maas, ‘AEGIS, an algorithm for the exhaustive generation of irredundant structures’, Chemom. Intell. Lab. Syst., vol. 8, no. 2, pp. 157–165, Jun. 1990.</ref> Compared to adjacency matrices, list data requires less memory. As no spectral data was interpreted in this system, the user needed to provide substructures as inputs. LSD (Logic for Structure Determination) is an important contribution from French scientists.<ref> J.-M. Nuzillard and M. Georges, ‘Logic for structure determination’, Tetrahedron, vol. 47, no. 22, pp. 3655–3664, 1991.</ref> The tool uses spectral data information such as [[wp:HMBC|HMBC]] and [[wp:COSY|COSY]] data to generate all possible structures. LSD is an [[wp:Open-source software|open source]] structure generator with [[wp:GNU General Public License|General Public License (GPL)]]. As successors of these generators, a series of stochastic generators were reported by Faulon. His software, SIGNATURE<ref>J.-L. Faulon, D. P. Visco, and R. S. Pophale, ‘The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies’, J. Chem. Inf. Comput. Sci., vol. 43, no. 3, pp. 707–720, 2003.</ref>, was integrated into this stochastic generator for canonical labelling and duplicate checks.<ref>J.-L. Faulon, ‘Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules’, J. Chem. Inf. Model., vol. 34, no. 5, pp. 1204–1218, Sep. 1994.</ref> In 1994, the same year that Faulon released the stochastic structure generator, Chinese scientists reported an [[wp:Partition(number theory)|integer partitioning]]-based structure generator.<ref>C.-Y. Hu and L. 
Xu, ‘Principles for structure generation of organic isomers from molecular formula’, Anal. Chim. Acta, vol. 298, no. 1, pp. 75–85, Nov. 1994.</ref> The decomposition of the [[wp:Chemical formula|molecular formula]] into fragments, components and segments was performed as an application of integer partitioning. These fragments were then used as building blocks in the structure generator. This structure generator was part of a CASE system, ESESOC.<ref>J. Hao, L. Xu, and C. Hu, ‘Expert system for elucidation of structures of organic compounds (ESESOC): —Algorithm on stereoisomer generation’, Sci. China Ser. B Chem., vol. 43, no. 5, pp. 503–515, Oct. 2000.</ref> After Munk’s assembly and reduction methods, Bohanec published a method combining these two methods.<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref> The aim of this assembly and reduction process was to combine the benefits of the two methods to develop an efficient structure generator. First, the useless connections were eliminated, and then, the substructures were assembled. Eliminating these connections at the beginning accelerated the assembly approach relative to previous methods. Structure generators can also vary based on the type of data used, such as HMBC, [[wp:HSQC|HSQC]] and [[wp:NMR|NMR]] data. LUCY is an open-source structure elucidation method based on the HMBC data of unknown molecules<ref>C. Steinbeck, ‘LUCY - A program for structure elucidation from NMR correlation experiments’, Angew. Chem. Int. Ed. Engl., vol. 35, no. 17, pp. 1984–1986, 1996.</ref>, and involves an exhaustive 2-step structure generation process where first all combinations of interpretations of HMBC signals are implemented in a connectivity matrix, which is then completed by a deterministic generator filling in missing bond information. 
This platform could generate structures for molecules of arbitrary size; however, molecular formulas with more than 30 heavy atoms were too time-consuming for practical applications. This limitation highlighted the need for a new CASE system. SENECA was developed to eliminate the shortcomings of LUCY.<ref>C. Steinbeck, ‘SENECA: A Platform-Independent, Distributed, and Parallel System for Computer-Assisted Structure Elucidation in Organic Chemistry’, J. Chem. Inf. Comput. Sci., vol. 41, no. 6, pp. 1500–1507, 2001.</ref> To overcome the limitations of the exhaustive method, SENECA uses a stochastic search to find optimal solutions. The system comprises two stochastic methods: [[wp:simulated annealing|simulated annealing]] and [[wp:genetic algorithm|genetic algorithms]]. First, a random structure is generated; then, its energy is calculated to evaluate the structure and its spectral properties. The structure is then transformed into another structure, and the process continues until the optimum energy is reached. During generation, this transformation relies on equations based on Faulon’s rules. Approximately 30 years after the first DENDRAL paper, Molchanova published a mathematical structure generator, SMOG, as a descendant of CONGEN.<ref>M. S. Molchanova, V. V. Shcherbukhin, and N. S. Zefirov, ‘Computer generation of molecular structures by the SMOG program’, J. Chem. Inf. Comput. Sci., vol. 36, no. 4, pp. 888–899, 1996.</ref> Many mathematical generators are descendants of efficient [[wp:branch and bound|branch-and-bound]] methods from Faradjev<ref>I. Faradzev, ‘Constructive enumeration of combinatorial objects’, in Colloq. Internat. CNRS, 1978, vol. 260, pp. 131–135.</ref> and Read.<ref>R. C. Read, ‘Every one a winner or how to avoid isomorphism search when cataloguing combinatorial configurations’, in Annals of Discrete Mathematics, vol. 2, Elsevier, 1978, pp. 
107–120.</ref> Although these reports date from the 1970s, they remain the fundamental references for structure generators. One of the earliest structure generators, SMOG, was a modification of the Faradjev method. In this algorithm, canonicity criteria and [[wp:isomorphism|isomorphism]] checks are based on [[wp:Automorphism group|automorphism groups]] from mathematics. Many other algorithms, such as MASS, MOLGEN and those of Bangov<ref>I. Bangov and K. Kanev, ‘Computer-assisted structure generation from a gross formula: II. Multiple bond unsaturated and cyclic compounds. Employment of fragments’, J. Math. Chem., vol. 2, no. 1, pp. 31–48, 1988.</ref>, were developed as descendants of this method. These generators were purely mathematical and applied automorphism groups in the generation of adjacency matrices. The automorphism group of a graph consists of all its symmetries, and thus an awareness of these symmetries accelerates the construction process. To date, MOLGEN is the only maintained efficient generic structure generator. The tool was developed as a closed-source platform by a group of mathematicians as an application of [[wp:Computational group theory|computational group theory]]. Another well-known commercial structure generator is from ACD Labs; notably, its developers include Elyashberg, one of the developers of MASS. The structure generator was part of a known CASE system called StrucEluc.<ref>K. Blinov, M. Elyashberg, S. Molodtsov, A. Williams, and E. Martirosian, ‘An expert system for automated structure elucidation utilizing 1H-1H, 13C-1H and 15N-1H 2D NMR correlations’, Fresenius J. Anal. Chem., vol. 369, no. 7–8, pp. 709–714, 2001.</ref> In 2012, Peironcely introduced an open-source structure generator called Open Molecule Generator (OMG).<ref>J. E. Peironcely et al., ‘OMG: Open molecule generator’, J. Cheminformatics, vol. 4, no. 9, pp. 1–13, 2012.</ref> The algorithm relies on canonical path augmentation and McKay’s NAUTY package.<ref>B. D. McKay and A. 
Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> NAUTY is a program for computing automorphism groups as well as the canonical labelling of graphs. An [[wp:Graph automorphism|automorphism]] of a graph is a mapping of the graph onto itself that preserves the edge-vertex connectivity. OMG generates large molecules almost 2000 times slower than MOLGEN.
==Mathematical Basis==
===Chemical Graphs===
----
[[File:Chemical Graph Representation.png|thumb|right|400px|'''Fig 2. Graph representation of [[wp:Serotonin|serotonin molecule]].''' '''(A)''' Molecular structure of serotonin. '''(B)''' Graph representation of the molecule.]]
In a graph representing a chemical structure, the [[wp:Vertex(graph theory)|vertices]] and [[wp:Edge(graph theory)|edges]] represent atoms and bonds, respectively ('''Figure 2'''). The bond order corresponds to the edge multiplicity, and as a result, [[wp:Molecular graph|chemical graphs]] are generally [[wp:Multigraph|multigraphs]]. A multigraph <math>G = (V,E) </math> is described as a chemical graph where <math>V</math> is the set of vertices, i.e., atoms, and <math>E</math> is the set of edges, which represents the bonds.
In graph theory, the [[wp:Degree(graph theory)|degree]] of a vertex is its number of connections. In a chemical graph, the maximum degree of an atom is its valence, i.e., the maximum number of bonds the chemical element can make. For example, carbon’s valence is 4. In a chemical graph, an atom is saturated if its degree equals its valence.
A graph is connected if there is at least one path between each pair of vertices. A molecule is saturated if all its atoms are saturated. A connectivity check is one of the mandatory intermediate steps in structure generation because the aim is to generate a single, fully saturated molecule, and a set of saturated atoms can still form several disconnected fragments.
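The two checks described above can be sketched in a few lines of Python. This is only a minimal illustration and not taken from any of the generators discussed here: the <code>VALENCE</code> table, the function names and the representation of bonds as a dictionary from atom-index pairs to bond orders are assumptions made for the example.

```python
from collections import deque

# Assumed standard valences for a few elements (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def degrees(atoms, bonds):
    """Degree of each atom, counting a bond of order k as k edges."""
    deg = [0] * len(atoms)
    for (i, j), order in bonds.items():
        deg[i] += order
        deg[j] += order
    return deg


def is_saturated(atoms, bonds):
    """A molecule is saturated if every atom reaches its valence."""
    return all(d == VALENCE[a] for a, d in zip(atoms, degrees(atoms, bonds)))


def is_connected(atoms, bonds):
    """Breadth-first traversal from atom 0; connected if every atom is reached."""
    if not atoms:
        return True
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen = {0}
    queue = deque([0])
    while queue:
        v = queue.popleft()
        for w in neighbours[v] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(atoms)


# Formaldehyde, CH2O: one C=O double bond and two C-H single bonds.
atoms = ["C", "O", "H", "H"]
bonds = {(0, 1): 2, (0, 2): 1, (0, 3): 1}
print(is_saturated(atoms, bonds), is_connected(atoms, bonds))

# Two separate H2 molecules: every atom is saturated, but the graph is not connected.
h_atoms = ["H", "H", "H", "H"]
h_bonds = {(0, 1): 1, (2, 3): 1}
print(is_saturated(h_atoms, h_bonds), is_connected(h_atoms, h_bonds))
```

The second example shows why saturation alone is not sufficient: all four hydrogen atoms are saturated, yet the graph describes two molecules rather than one.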
===Symmetry Groups for Molecular Graphs===
----
[[File:Permutation Action on Atom Sets.png|thumb|right|400px|'''Fig 3. Molecular Symmetry.''' '''(A)''' The initial enumeration of 2,4-Dimethyl-2-pentene. '''(B)''' and '''(C)''' are symmetries of the same molecule with different enumerations.]]
For a set of elements, a [[wp:permutation|permutation]] is a rearrangement of these elements.<ref>D. L. Kreher and D. R. Stinson, Combinatorial Algorithms: Generation, Enumeration, and Search. CRC Press, 1998.</ref> An example is given below:
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none; text-align: center;"
|+ Table 1: A permutation of a set of integers.
|-
| <math> x </math>
| 1
| 2
| 3
| 4
| 5
| 6
| 7
| 8
| 9
| 10
| 11
|-
| <math> f(x) </math>
| 4
| 2
| 11
| 6
| 1
| 5
| 8
| 9
| 7
| 10
| 3
|}
The second line of Table 1 shows a permutation of the first line. The multiplication of permutations, <math>a</math> and <math>b</math>, is defined as a function composition, as shown below.
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>(ab)(x)=a(b(x))</math></div>
The combination of two permutations is also a permutation.
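As a small worked example, the permutation <math>f</math> of Table 1 can be written as a Python dictionary and composed with itself. The function names <code>compose</code> and <code>is_permutation</code> are chosen for this sketch only.

```python
# The permutation f from Table 1, stored as a mapping x -> f(x).
f = {1: 4, 2: 2, 3: 11, 4: 6, 5: 1, 6: 5, 7: 8, 8: 9, 9: 7, 10: 10, 11: 3}


def compose(a, b):
    """Return the product ab, defined by (ab)(x) = a(b(x))."""
    return {x: a[b[x]] for x in b}


def is_permutation(p):
    """A mapping is a permutation if its values are a rearrangement of its keys."""
    return sorted(p.values()) == sorted(p.keys())


ff = compose(f, f)
print(ff[1])               # f(f(1)) = f(4) = 6
print(is_permutation(ff))  # the product of two permutations is again a permutation
```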
A [[wp:Group theory|group]], <math>G</math>, is a set of elements together with an associative binary operation <math>*</math> defined on <math>G</math> such that the following are true:
*There is an element <math>I</math> in <math>G</math>, called the identity element, satisfying <math>g*I=I*g=g</math> for all elements <math>g</math> of <math>G</math>.
*For each element <math>g</math> of <math>G</math>, there is an element <math>g^{-1}</math> in <math>G</math> such that <math>g*g^{-1}=g^{-1}*g=I</math>.
The [[wp:Order(group theory)|order]] of a group is the number of elements in the group. Let <math>X</math> be a finite set, and let <math>Sym(X)</math> be the set of all permutations of <math>X</math>. Under the function composition operation, <math>Sym(X)</math> is a group, called the [[wp:Permutation group|symmetric group]] on <math>X</math>. If the size of <math>X</math> is <math>n</math>, then the order of <math>Sym(X)</math> is <math>n!</math>. A [[wp:set(mathematics)|set]] system consists of a finite set <math>X</math> and a collection of its [[wp:subset|subsets]], called the blocks of the set system. A graph can be viewed as a set system whose blocks are its edges, and the permutations preserving the set system are the [[wp:Graph automorphism|automorphisms]] of the graph. An automorphism permutes the vertices of a graph so that the graph is mapped onto itself. This action is edge-vertex preserving.
If <math>\{u,v\}</math> is an edge of the graph <math>G=(V,E)</math> and <math>a</math> is a permutation of <math>V</math>, then
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})=\{a(u),a(v)\}</math></div>
A permutation <math>a</math> of <math>V</math> is an automorphism of the graph <math>G=(V,E)</math> if
<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><math>a(\{u,v\})</math> is an element of <math>E</math> whenever <math>\{u,v\}</math> is an element of <math>E</math>.</div>
The automorphism group of a graph <math>G</math>, denoted <math>Aut(G)</math>, is the set of all automorphisms of <math>G</math>. In molecular graphs, canonical labelling and molecular symmetry ('''Figure 3''') detection are applications of automorphism groups. NAUTY is an efficient software package for automorphism group calculations and canonical labelling, and OMG relies on NAUTY for these tasks.
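For very small graphs, the automorphism group can be found by simply testing every permutation of the vertices against the definition above. The sketch below does this for the carbon skeleton of 2,4-dimethyl-2-pentene. It is a brute-force illustration only and not how NAUTY works; the function name and the atom numbering are assumptions of this example, and the numbering is not necessarily the one used in Figure 3.

```python
from itertools import permutations


def automorphisms(n, bonds):
    """All permutations of the vertices 0..n-1 that map every bond onto a bond
    of the same order. Brute force over n! permutations; small graphs only."""
    target = {frozenset(edge): order for edge, order in bonds.items()}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])): order
                 for (u, v), order in bonds.items()}
        if image == target:
            result.append(perm)
    return result


# Carbon skeleton of 2,4-dimethyl-2-pentene, hydrogens omitted.
# Hypothetical numbering: chain 0-1=2-3-4, methyl groups 5 (on atom 1) and 6 (on atom 3).
bonds = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 1, (1, 5): 1, (3, 6): 1}

auts = automorphisms(7, bonds)
print(len(auts))  # 4: swapping atoms 0 and 5, swapping atoms 4 and 6, both, or neither

# If bond orders are ignored, the skeleton gains a mirror symmetry.
skeleton = {edge: 1 for edge in bonds}
print(len(automorphisms(7, skeleton)))  # 8
```

The double bond removes the mirror symmetry of the skeleton, which is why edge multiplicities must be preserved when the symmetries of a chemical multigraph are computed.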
==Methods==
Generation methods are the core of CASE systems. These generators rely on combinatorial methods. The basic input of a generator is the molecular formula. If fragments are obtained from the experimental data, they can also be used as inputs to accelerate generation. The literature classifies generators into two major types: structure assembly and structure reduction. Algorithmic complexity and run time are the criteria used to compare them.
===Structure Assembly===
----
The generation process starts with a set of atoms from the molecular formula. In structure assembly, atoms are combinatorially connected to consider all possible extensions. If substructures are obtained from the experimental data, the generation starts with these substructures. These substructures provide known bonds in the molecule. One of the earliest assembly methods was Shelley and Munk’s CASE<ref>C. A. Shelley and M. E. Munk, ‘Case, a computer model of the structure elucidation process’, Anal. Chim. Acta, vol. 133, no. 4, pp. 507–516, 1981.</ref> system, which included the ASSEMBLE generator.<ref>M. Badertscher et al., ‘Assemble 2.0: A structure generator’, Chemom. Intell. Lab. Syst., vol. 51, no. 1, pp. 73–79, 2000.</ref> The generator is purely mathematical and does not involve the interpretation of any spectral data; spectral data are used only for structure scoring and substructure information. Based on the molecular formula, the generator forms bonds between pairs of atoms, and all the extensions are checked against the given constraints. If the process is considered as a [[wp:Tree (graph theory)|tree]], the root of the tree is the atom set, together with any substructures provided by the spectral data. By extending the molecule with a bond, an intermediate structure is built. Each intermediate structure can be represented by a node in the generation tree. ASSEMBLE was developed with a [[wp:User interface|user-friendly interface]] to facilitate use. The tree approach is the skeleton of many generators. For example, Peironcely’s structure generator, OMG, takes atoms and substructures as inputs and extends the structures using a [[wp:breadth-first search|breadth-first search]] method ('''Figure 4'''). This tree extension terminates when all the branches reach saturated structures. [[File:Breadth-First Search Structure Generation.png|thumb|right|500px|'''Fig 4. Breadth-first search generation.''' Molecular structure generation is explained step by step. 
Starting from a set of atoms, bonds are added between atom pairs until reaching saturated structures.]]
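The tree extension can be illustrated with a deliberately naive breadth-first generator. The sketch below is not the OMG algorithm: it assumes a small <code>VALENCE</code> table, stores each intermediate structure as a tuple of bond orders over all atom pairs, and performs no isomorphism check, so the same molecule is produced once for every distinct labelling of its atoms.

```python
from itertools import combinations

# Assumed standard valences (illustrative only).
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}


def free_valences(atoms, state, pairs):
    """Remaining valence of each atom for a given tuple of bond orders."""
    free = [VALENCE[a] for a in atoms]
    for (i, j), order in zip(pairs, state):
        free[i] -= order
        free[j] -= order
    return free


def is_connected(n, state, pairs):
    """Grow the set of atoms reachable from atom 0 until nothing changes."""
    reached = {0}
    changed = True
    while changed:
        changed = False
        for (i, j), order in zip(pairs, state):
            if order and (i in reached) != (j in reached):
                reached |= {i, j}
                changed = True
    return len(reached) == n


def generate(atoms, max_order=3):
    """Breadth-first assembly: each level adds one unit of bond order."""
    n = len(atoms)
    pairs = list(combinations(range(n), 2))
    level = {tuple([0] * len(pairs))}  # root node: the atoms without any bonds
    saturated = set()
    while level:
        next_level = set()
        for state in level:
            free = free_valences(atoms, state, pairs)
            if not any(free):
                # Leaf node: keep it only if it is a single molecule.
                if is_connected(n, state, pairs):
                    saturated.add(state)
                continue
            for k, (i, j) in enumerate(pairs):
                if free[i] and free[j] and state[k] < max_order:
                    next_level.add(state[:k] + (state[k] + 1,) + state[k + 1:])
        level = next_level
    return saturated, pairs


# C2H4 with explicit hydrogens: atoms 0-1 are carbon, atoms 2-5 are hydrogen.
structures, pairs = generate(["C", "C", "H", "H", "H", "H"])
print(len(structures))  # 6 labelled copies of ethene
```

All six results describe ethene; they differ only in which two of the four labelled hydrogen atoms are attached to the first carbon atom. Removing such duplicates is the task of the canonicity tests discussed in the following paragraph.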
Another assembly method is GENOA. Unlike ASSEMBLE and many other generators, GENOA is a constructive substructure search-based algorithm that assembles different substructures while also considering their overlaps. CHEMICS is another well-known CASE system with its own structure generation algorithm. The earliest CHEMICS paper, based on the vector representation of components, was published in 1977. CHEMICS generates different types of component sets, ranked from primary to tertiary based on component complexity. The primary set contains atoms, i.e., C, N, O and S, with their [[wp:Orbital hybridisation|hybridization]]. The secondary and tertiary component sets are built layer-by-layer starting with these primary components. These component sets are represented as vectors and are used as inputs in the process.
In the generation trees, considering all possible extensions leads to a combinatorial explosion. Orderly generation is used to cope with this explosion. Many assembly algorithms, such as OMG, MOLGEN and Faulon’s structure generator<ref>J. L. Faulon, ‘On Using Graph-Equivalent Classes for the Structure Elucidation of Large Molecules’, J. Chem. Inf. Comput. Sci., vol. 32, no. 4, pp. 338–348, 1992.</ref>, are orderly generation methods. Faulon’s structure generator relies on equivalence classes over atoms. Atoms with the same interaction type and element are grouped in the same equivalence class. Rather than extending all atoms in a molecule, one atom from each class is extended. OMG generates structures based on the canonical augmentation method from McKay’s NAUTY package. This method is an early attempt at orderly graph generation. The algorithm calculates a canonical labelling and then extends the structure by adding one bond. To keep the extension canonical, only canonical bonds are added.<ref>B. D. McKay and A. Piperno, ‘Practical graph isomorphism, II’, J. Symb. Comput., vol. 60, pp. 94–112, 2014.</ref> Although NAUTY is an efficient tool for canonical graph labelling, OMG is almost 2000 times slower than MOLGEN. The bottleneck is the storage of all the intermediate structures. OMG has since been [[wp:Parallel computing|parallelized]], and the developers released PMG (Parallel Molecule Generator).<ref>M. M. Jaghoori et al., ‘PMG: Multi-core metabolite identification’, Electron. Notes Theor. Comput. Sci., vol. 299, pp. 53–60, 2013.</ref> MOLGEN outperforms PMG when PMG runs on a single core; however, PMG outperforms MOLGEN when the number of cores is increased to 10.
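The role of a canonical labelling in duplicate detection can be shown with a brute-force canonical form. The sketch below relabels a structure with every permutation that maps atoms only onto atoms of the same element and keeps the smallest resulting bond list. This is an assumption-laden toy, not the refinement-based method used by NAUTY, and the function name is chosen for this example only.

```python
from itertools import permutations, product


def canonical_form(atoms, bonds):
    """Smallest relabelled bond list over all element-preserving permutations.
    Two structures on the same atom list are isomorphic exactly when their
    canonical forms are equal. Brute force; suitable for tiny molecules only."""
    groups = {}
    for index, element in enumerate(atoms):
        groups.setdefault(element, []).append(index)
    best = None
    for choice in product(*(permutations(g) for g in groups.values())):
        perm = {}
        for group, image in zip(groups.values(), choice):
            perm.update(zip(group, image))
        key = tuple(sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for (i, j), order in bonds.items()
        ))
        if best is None or key < best:
            best = key
    return best


atoms = ["C", "C", "H", "H", "H", "H"]
# Two labelled drawings of ethene that differ only in the hydrogen numbering.
ethene_a = {(0, 1): 2, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1}
ethene_b = {(0, 1): 2, (0, 2): 1, (0, 4): 1, (1, 3): 1, (1, 5): 1}
# The same connectivity with a C-C single bond: a different (unsaturated) graph.
single_bond = {(0, 1): 1, (0, 2): 1, (0, 3): 1, (1, 4): 1, (1, 5): 1}

print(canonical_form(atoms, ethene_a) == canonical_form(atoms, ethene_b))
print(canonical_form(atoms, ethene_a) == canonical_form(atoms, single_bond))
```

A generator can store only canonical forms, or accept an extension only if it is already in canonical form, so that each molecule is kept once regardless of how its atoms were numbered during construction.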
Constructive search algorithms are [[wp:Branch and bound|branch-and-bound]] methods, which are a solution to memory problems. These methods are [[wp:Matrix(mathematics)|matrix]] generation algorithms. In contrast to the tree-based methods above, they build all the connectivity matrices without building intermediate structures. The generation process is simplified by solving matrix generation as a numerical problem. MASS, SMOG and MOLGEN are good examples of matrix generators used in the literature. These are all descendants of the Faradjev algorithm, which was the first graph generator. Many structure generators refer to this study. MASS is a method of mathematical synthesis. First, it builds all incidence matrices for a given molecular formula. The atom valences are used as the input for matrix generation. The matrices are generated by considering all the possible interactions among atoms with respect to the constraints and valences. The benefit of constructive search algorithms is their low memory usage. SMOG is a successor of MASS and relies on a similar approach. This algorithm can be considered the chemical version of the Faradjev algorithm. Unlike these methods, MOLGEN is an algebraic combinatorics method that relies on group theory. Group theory is applied in the orderly generation of the matrices. Many different versions of MOLGEN have been developed, and they provide various functions. Based on the users’ needs, different types of inputs can be used. For example, MOLGEN-MS<ref>A. Kerber and R. Laue, ‘MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation’, Adv. Mass Spectrom., vol. 15, no. 2, pp. 939–940, 2001.</ref> allows users to input MS data of an unknown molecule. Compared to many other generators, MOLGEN approaches the problem from different angles. 
The key feature of MOLGEN is that it generates structures without building all the intermediate structures and without generating duplicates. It first generates all the combinatorially possible connectivity matrices and then determines whether each matrix represents a saturated molecule that satisfies the constraints.
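Matrix generation as a numerical problem can be sketched as follows: given the atom valences, fill the upper triangle of a symmetric matrix with bond orders so that each row sums to the valence of its atom. The code is a plain backtracking search with one pruning rule and is only meant to illustrate the idea; it is not the algorithm of MASS, SMOG or MOLGEN, it applies no canonicity test, and the function name is an assumption of this example.

```python
def connectivity_matrices(valences, max_order=3):
    """Enumerate symmetric matrices with zero diagonal, entries 0..max_order
    and row sums equal to the given valences (backtracking with pruning)."""
    n = len(valences)
    matrix = [[0] * n for _ in range(n)]
    remaining = list(valences)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    results = []

    def fill(k):
        if k == len(pairs):
            if not any(remaining):
                results.append([row[:] for row in matrix])
            return
        i, j = pairs[k]
        for order in range(min(max_order, remaining[i], remaining[j]) + 1):
            matrix[i][j] = matrix[j][i] = order
            remaining[i] -= order
            remaining[j] -= order
            # Prune: once the last entry of row i is set, row i must be saturated.
            if j < n - 1 or remaining[i] == 0:
                fill(k + 1)
            remaining[i] += order
            remaining[j] += order
        matrix[i][j] = matrix[j][i] = 0

    fill(0)
    return results


# C2H4: two carbon atoms (valence 4) followed by four hydrogen atoms (valence 1).
valences = [4, 4, 1, 1, 1, 1]
matrices = connectivity_matrices(valences)
print(len(matrices))
```

For this input the search returns 18 labelled matrices: 6 describe ethene, and 12 describe acetylene together with a separate hydrogen molecule. A real generator would therefore still need the connectivity check and a canonicity test to report ethene exactly once.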
===Structure Reduction===
----
Unlike these assembly methods, reduction methods make all the bonds between atom pairs, generating a hypergraph. Then, the size of the graph is reduced with respect to the constraints. First, the existence of substructures in the hypergraph is checked. Unlike in assembly methods, the generation tree starts with the hypergraph, and the structures decrease in size at each step. Bonds are deleted based on the substructures. If a substructure is no longer in the hypergraph, the substructure is removed from the constraints. Because the hypergraph contains all bonds, overlaps in the substructures are also considered. The earliest reduction-based structure generator is COCOA. Generated fragments are described as atom-centred fragments to optimize storage, comparable to circular fingerprints and atom signatures. Rather than storing structures, only the list of first neighbours of each atom is stored. The main disadvantage of reduction methods is the massive size of the hypergraphs. Indeed, for molecules with unknown structures, the size of the hypergraph becomes extremely large, resulting in a proportional increase in the run time.
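The idea of storing only first neighbours can be sketched as follows. The representation below, an element symbol paired with the sorted element symbols of its neighbours, is an assumption made for illustration and is not COCOA’s actual data format.

```python
def atom_centred_fragments(atoms, bonds):
    """For each atom, return its element and the sorted elements of its first neighbours."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(atoms[j])
        neighbours[j].append(atoms[i])
    return [(atoms[i], tuple(sorted(neighbours[i]))) for i in range(len(atoms))]


# Heavy-atom skeleton of ethanol: C-C-O.
fragments = atom_centred_fragments(["C", "C", "O"], [(0, 1), (1, 2)])
print(fragments)
```

Each fragment is a short, fixed-format record, which is what makes this kind of description cheaper to store and compare than complete structures.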
Bohanec’s structure generator, GEN<ref>S. Bohanec, ‘Structure Generation by the Combination of Structure Reduction and Structure Assembly’, J. Chem. Inf. Comput. Sci., vol. 35, no. 3, pp. 494–503, 1995.</ref>, combines two tasks: structure assembly and structure reduction. Like COCOA, the initial state of the problem is a hyper structure. Both assembly and reduction methods have advantages and disadvantages, and the GEN tool avoids these disad