CHAPTER 4: SMILES

What is SMILES?

- SMILES stands for Simplified Molecular Input Line Entry System.

- Function: to translate a chemical's three-dimensional structure into a string of symbols that computer software can easily process.

- Software: many programs can be used to draw chemical structures.
Examples are:

1. ChemSketch
2. ChemDraw
3. ChemDoodle
4. Marvin

DESCRIPTION

1- ATOM

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which:
  1. are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, and
  2. have no formal charge, and
  3. have the number of hydrogens attached implied by their normal valence, and
  4. are the normal isotopes, and
  5. are not chiral centers.
All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2]. Hydrogen may also be written as a separate atom; water may also be written as [H]O[H].
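
The "normal valence" rule for the organic subset can be sketched in a few lines of Python. This is only an illustration: the valence table below is taken from the OpenSMILES specification, not from the text above, and the names NORMAL_VALENCES and implied_hydrogens are made up for this example.

```python
# Normal valences of the organic-subset elements (assumed from the
# OpenSMILES specification; the notes above do not list them).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implied_hydrogens(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given bond-order sum."""
    for valence in NORMAL_VALENCES[symbol]:
        # Use the smallest normal valence that can hold the existing bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bonds already exceed every normal valence


# Water written as "O": no bonds to heavy atoms, so two hydrogens are implied.
print(implied_hydrogens("O", 0))  # -> 2
```

With this rule the three atoms of ethanol, CCO, receive 3, 2 and 1 hydrogens, giving C2H6O.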
When brackets are used, the symbol H is added if the atom in brackets is bonded to one or more hydrogens, followed by the number of hydrogen atoms if greater than 1, then by the sign '+' for a positive charge or '-' for a negative charge. For example, ammonium is written [NH4+]. If there is more than one charge, it is normally written as a digit after the sign; however, it is also possible to write the sign as many times as the ion has charges: instead of "[Ti+4]", one may also write "[Ti++++]" (titanium(IV), Ti4+). Thus, the hydroxide anion is represented by [OH-], the hydronium cation by [OH3+], and the cobalt(III) cation (Co3+) by either [Co+3] or [Co+++].
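
As a rough sketch of how a program might read a bracket atom, the Python below pulls out the symbol, hydrogen count and charge with a regular expression. It is an assumption-laden illustration, not a full parser: chirality marks and atom classes are ignored, and the names BRACKET_ATOM and parse_bracket_atom are invented here.

```python
import re

# Optional isotope, element symbol, optional H count, optional charge.
# Chirality and atom classes are deliberately left out of this sketch.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[bcnops])"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>\++\d*|-+\d*)?\]"
)


def parse_bracket_atom(token):
    """Split one bracket atom such as '[NH4+]' into its parts."""
    m = BRACKET_ATOM.fullmatch(token)
    if m is None:
        raise ValueError(f"bracket atom not understood by this sketch: {token!r}")
    h = m.group("hcount")
    hydrogens = 0 if h is None else int(h[1:] or 1)  # bare 'H' means one hydrogen
    c = m.group("charge")
    if c is None:
        charge = 0
    else:
        sign = 1 if c[0] == "+" else -1
        digits = c.lstrip("+-")
        # '+4' carries a digit; '++++' is counted sign by sign.
        charge = sign * (int(digits) if digits else len(c))
    isotope = m.group("isotope")
    return {
        "symbol": m.group("symbol"),
        "isotope": int(isotope) if isotope else None,
        "hydrogens": hydrogens,
        "charge": charge,
    }


print(parse_bracket_atom("[NH4+]"))
# -> {'symbol': 'N', 'isotope': None, 'hydrogens': 4, 'charge': 1}
```

Both "[Ti+4]" and "[Ti++++]" come out with a charge of 4, which matches the two equivalent spellings described above.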

2- BOND

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as "-", this is usually omitted. For example, the SMILES for ethanol may be written as C-C-O, CC-O or C-CO, but is usually written CCO.
Double, triple, and quadruple bonds are represented by the symbols '=', '#', and '$' respectively as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide) and [Ga-]$[As+] (gallium arsenide).
An additional type of bond is a "non-bond", indicated with ".", to indicate that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation.
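
The bond rules above can be tried out with a small Python sketch. It only handles strings without rings, branches or aromatic atoms, and the names TOKEN, BOND_ORDER and linear_bonds are hypothetical helpers written for this example.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3, "$": 4}
# Bracket atoms first, then two-letter halogens, then one-letter atoms,
# then the bond symbols and the "." non-bond.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[-=#$.]")


def linear_bonds(smiles):
    """Return (atoms, bonds) for a SMILES string without rings or branches."""
    atoms, bonds = [], []
    pending = 1  # adjacency alone implies a single bond
    pos = 0
    for m in TOKEN.finditer(smiles):
        if m.start() != pos:
            raise ValueError(f"unsupported character at position {pos}")
        pos = m.end()
        tok = m.group()
        if tok == ".":
            pending = None  # the next atom is not bonded to the previous one
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        else:
            if atoms and pending is not None:
                bonds.append((len(atoms) - 1, len(atoms), pending))
            atoms.append(tok)
            pending = 1
    if pos != len(smiles):
        raise ValueError(f"unsupported character at position {pos}")
    return atoms, bonds


print(linear_bonds("O=C=O"))        # -> (['O', 'C', 'O'], [(0, 1, 2), (1, 2, 2)])
print(linear_bonds("[Na+].[Cl-]"))  # -> (['[Na+]', '[Cl-]'], [])
```

Under these assumptions CCO, C-C-O, CC-O and C-CO all give the same atoms and bonds.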

3- RINGS

Ring structures are written by breaking each ring at an arbitrary point (although some choices will lead to a more legible SMILES than others) to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin (decahydronaphthalene) may be written as C1CCCC2C1CCCC2.
SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this is rarely used. Also, it is permitted to re-use ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2, but it may also be written as C0CCCCC0C0CCCCC0.
Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin is C1CCCC2CCCCC12, where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by %, so "C%12" is a single ring-closing bond, of ring 12.
Ring-closing digits may be preceded by a bond type. For example, cyclopropene is usually written C1=CC1, but if the double bond is chosen as the ring-closing bond, it may be written as C=1CC1, C1CC=1, or C=1CC=1. (The first form is preferred.) C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond.
Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene. However, they may be used with non-bonds; C1.C2.C12 is a peculiar but legal alternative way to write propane, more commonly written CCC.
Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O; choosing a different ring-break location produces a branched structure that requires parentheses to write.
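
To see how ring-closure labels pair up, here is a minimal Python sketch that lists the ring-closing bonds of a SMILES string. It assumes well-formed input, does not check for duplicated bonds such as C1C1, and uses an invented function name, ring_closures.

```python
import re

TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]|%\d\d|\d|[-=#$:]|[().]")
BOND_SYMBOLS = set("-=#$:")


def ring_closures(smiles):
    """Return a list of (atom_i, atom_j, bond symbol or None) ring-closing bonds."""
    open_rings = {}   # label -> (atom index, bond symbol given at the opening)
    closures = []
    atom_index = -1   # index of the most recent atom
    bond = None       # explicit bond symbol seen just before this token
    pos = 0
    for m in TOKEN.finditer(smiles):
        if m.start() != pos:
            raise ValueError(f"unsupported character at position {pos}")
        pos = m.end()
        tok = m.group()
        if tok in BOND_SYMBOLS:
            bond = tok
        elif tok[0] == "%" or tok.isdigit():
            if atom_index < 0:
                raise ValueError("ring label before any atom")
            label = int(tok.lstrip("%"))
            if label in open_rings:
                # Second occurrence closes the ring and frees the label for re-use.
                start, start_bond = open_rings.pop(label)
                if start_bond and bond and start_bond != bond:
                    raise ValueError(f"conflicting bond types for ring {label}")
                closures.append((start, atom_index, start_bond or bond))
            else:
                open_rings[label] = (atom_index, bond)
            bond = None
        elif tok in "().":
            bond = None
        else:
            atom_index += 1
            bond = None
    if pos != len(smiles):
        raise ValueError(f"unsupported character at position {pos}")
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return closures


print(ring_closures("C1CCCC2C1CCCC2"))  # -> [(0, 5, None), (4, 9, None)]
print(ring_closures("C1CCCC2CCCCC12"))  # -> [(0, 9, None), (4, 9, None)]
```

In this sketch C=1CC1, C1CC=1 and C=1CC=1 all report one double ring-closing bond between atoms 0 and 2, while C=1CC-1 is rejected because the two ends disagree.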

4- AROMATICITY

Aromatic rings such as benzene may be written in one of three forms:
  1. In Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1,
  2. Using the aromatic bond symbol ":", e.g. C:1:C:C:C:C:C1, or
  3. Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms 'b', 'c', 'n', 'o', 'p' and 's', respectively.
In the last case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1.
Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1.
When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol "-" is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the form "c1ccccc1c2ccccc2".)
The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.
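
The default-bond rule for aromatic atoms can be illustrated with a short Python sketch. It applies the strict reading of the rule (two adjacent lower-case atoms get an aromatic bond unless a bond symbol is written), looks only at consecutive atoms in a branch-free string, and silently skips ring-closure bonds; chain_bonds and is_aromatic are names made up for this example.

```python
import re

TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFIbcnops]|%\d\d|\d|[-=#$:]")
BOND_SYMBOLS = set("-=#$:")


def is_aromatic(atom):
    """A lower-case symbol, bare or bracketed (e.g. '[nH]'), marks an aromatic atom."""
    return atom.lstrip("[0123456789")[:1].islower()


def chain_bonds(smiles):
    """Bond symbol between each pair of consecutive atoms (branch-free input)."""
    bonds = []
    prev = None       # previous atom token
    explicit = None   # bond symbol written just before the next atom, if any
    for tok in TOKEN.findall(smiles):
        if tok in BOND_SYMBOLS:
            explicit = tok
        elif tok[0] == "%" or tok.isdigit():
            explicit = None  # a symbol before a ring label belongs to the ring bond
        else:
            if prev is not None:
                default = ":" if is_aromatic(prev) and is_aromatic(tok) else "-"
                bonds.append(explicit or default)
            prev, explicit = tok, None
    return bonds


print(chain_bonds("c1ccccc1"))     # -> [':', ':', ':', ':', ':']
print(chain_bonds("C1=CC=CC=C1"))  # -> ['=', '-', '=', '-', '=']
```

For biphenyl, the sixth entry is "-" when the string is written c1ccccc1-c2ccccc2 but ":" for c1ccccc1c2ccccc2, which is why the strict rule asks for the explicit single bond between the rings.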

5- BRANCHING

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
Branches may be written in any order. For example, bromochlorofluoromethane may be written as C(Br)(Cl)F, C(F)(Cl)Br, C(Cl)(F)Br, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex.
The one form of branch which does not require parentheses is the ring-closing bond. Choosing ring-closing bonds appropriately can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses that would be required if it were written as c1cc(C)ccc1 or c1cc(ccc1)C.
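
Parentheses map naturally onto a stack, as the Python sketch below tries to show. It is a simplified illustration: bracket atoms are not supported, bond orders are ignored, and the "signature" is only a cheap comparison of each atom with its neighbours, not a real graph-isomorphism test. The function names parse_branches and neighbour_signature are invented for this example.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFIbcnops]|\d|[-=#$:]|[()]")
BOND_SYMBOLS = set("-=#$:")


def parse_branches(smiles):
    """Return (atoms, bonds) for a SMILES string with branches and ring digits."""
    atoms, bonds = [], []
    stack = []        # atoms to return to when a branch closes
    prev = None       # atom that the next atom will be bonded to
    open_rings = {}   # ring digit -> atom index that opened it
    pos = 0
    for m in TOKEN.finditer(smiles):
        if m.start() != pos:
            raise ValueError(f"unsupported character at position {pos}")
        pos = m.end()
        tok = m.group()
        if tok == "(":
            stack.append(prev)       # remember the branching atom
        elif tok == ")":
            prev = stack.pop()       # go back to the branching atom
        elif tok.isdigit():
            if tok in open_rings:
                bonds.append((open_rings.pop(tok), prev))
            else:
                open_rings[tok] = prev
        elif tok in BOND_SYMBOLS:
            continue                 # bond orders are ignored in this sketch
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
    if pos != len(smiles):
        raise ValueError(f"unsupported character at position {pos}")
    return atoms, bonds


def neighbour_signature(smiles):
    """Sorted list of (atom, sorted neighbour symbols); equal for equivalent forms."""
    atoms, bonds = parse_branches(smiles)
    neighbours = [[] for _ in atoms]
    for a, b in bonds:
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    return sorted((atoms[i], tuple(sorted(neighbours[i]))) for i in range(len(atoms)))


print(parse_branches("CCC(=O)O"))
# -> (['C', 'C', 'C', 'O', 'O'], [(0, 1), (1, 2), (2, 3), (2, 4)])
print(neighbour_signature("C(Br)(Cl)F") == neighbour_signature("C(F)(Cl)Br"))  # -> True
```

The same comparison gives equal signatures for Cc1ccccc1 and c1ccccc1C, the two parenthesis-free ways of writing toluene.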



