8. Data Formats
This page gives an overview of the external data formats that MØD works with. For the formats that describe graphs and rules, it also details how the external format is interpreted when creating the corresponding in-memory data structure. The in-memory model is described in Graph, Rule, and Molecule Model.
8.1. GML
MØD uses the Graph Modelling Language (GML) for general specification of graphs and rules. The parser recognises most of the published specification, with regard to syntax. The specific grammar is as follows.
GML        ::= (key value)*
key        ::= identifier
value      ::= int | float | quoteEscapedString | list
list       ::= '[' (key value)* ']'
identifier ::= a word matching the regex "[a-zA-Z][a-zA-Z0-9]*"
A quoteEscapedString is zero or more characters surrounded by double quotation marks. To include a " character it must be escaped, i.e., written as \". Tabs, newlines, and backslashes can be written as \t, \n, and \\.
GML code may have line comments, starting with #. They are ignored during parsing.
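The grammar above is small enough to sketch as a recursive-descent parser. The following Python is an illustration only, not MØD's parser: the function names, the accepted number syntax, and the choice to return nested lists of (key, value) pairs are all assumptions made for this example.

```python
import re

# Token kinds follow the grammar above; the exact int/float syntax is an assumption.
TOKEN = re.compile(r'''
    (?P<comment>\#[^\n]*)                         # line comment, skipped
  | (?P<float>[-+]?\d+\.\d*(?:[eE][-+]?\d+)?)
  | (?P<int>[-+]?\d+)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<open>\[)
  | (?P<close>\])
  | (?P<key>[a-zA-Z][a-zA-Z0-9]*)
''', re.VERBOSE)

ESCAPES = {"t": "\t", "n": "\n"}  # other escaped characters stand for themselves


def tokenize(text):
    """Yield (kind, raw_text) pairs, dropping whitespace and comments."""
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup != "comment":
            yield m.lastgroup, m.group()


def parse_gml(text):
    """Parse GML text into a list of (key, value) pairs; lists nest."""
    tokens = list(tokenize(text))
    pos = 0

    def pairs(in_list):
        nonlocal pos
        result = []
        while pos < len(tokens):
            kind, raw = tokens[pos]
            if kind == "close" and in_list:
                pos += 1
                return result
            if kind != "key":
                raise ValueError(f"expected a key, got {raw!r}")
            pos += 1
            result.append((raw, value()))
        if in_list:
            raise ValueError("missing ']'")
        return result

    def value():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("missing value")
        kind, raw = tokens[pos]
        pos += 1
        if kind == "int":
            return int(raw)
        if kind == "float":
            return float(raw)
        if kind == "string":
            body = raw[1:-1]  # strip the surrounding quotation marks
            return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), body)
        if kind == "open":
            return pairs(True)
        raise ValueError(f"expected a value, got {raw!r}")

    return pairs(False)
```

Keeping the result as ordered pairs rather than a dictionary preserves repeated keys such as node and edge, which a graph description relies on.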
8.1.1. Graph
A graph can be specified as GML by giving a list of vertices and edges with the key graph.
The following grammar exemplifies the required key-value structure.
graphGML ::= 'graph [' (node | edge)* ']'
node     ::= 'node [ id' int 'label' quoteEscapedString ']'
edge     ::= 'edge [ source' int 'target' int 'label' quoteEscapedString ']'
Note though that list elements can appear in any order.
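As a concrete instance of this structure, the sketch below writes a small labelled graph as GML text. It is plain Python and does not use the MØD API; the helper names and the input representation (a list of vertex labels plus (source, target, label) triples) are assumptions made for illustration.

```python
def escape(label):
    """Escape a label so it can be written as a quoteEscapedString."""
    return (label.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))


def graph_to_gml(vertex_labels, edges):
    """Return GML text for a graph; vertex ids are the list indices."""
    lines = ["graph ["]
    for vid, label in enumerate(vertex_labels):
        lines.append(f'  node [ id {vid} label "{escape(label)}" ]')
    for src, tgt, label in edges:
        lines.append(f'  edge [ source {src} target {tgt} label "{escape(label)}" ]')
    lines.append("]")
    return "\n".join(lines)


# A two-vertex graph with one edge labelled "=".
print(graph_to_gml(["C", "O"], [(0, 1, "=")]))
```

The nodes are written before the edges here only for readability; as noted above, the order of list elements does not matter.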
8.1.2. Rule
A rule \((L\leftarrow K\rightarrow R)\) in GML format is specified as three graph fragments: left, context, and right. From those, \(L\) is constructed as left \(\cup\) context, \(R\) as right \(\cup\) context, and \(K\) as context \(\cup\) (left \(\cap\) right).
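The construction of \(L\), \(K\), and \(R\) can be sketched with plain sets. This is a simplified illustration, not MØD's implementation: it assumes each fragment is represented only by its set of vertex ids, so that a vertex listed in both left and right (for example with different labels) counts as part of left \(\cap\) right.

```python
def rule_sides(left, context, right):
    """Compute the vertex sets of L, K, and R from the three GML fragments.

    Each argument is a set of vertex ids (a simplifying assumption; labels
    and edges are ignored in this sketch).
    """
    L = left | context
    R = right | context
    K = context | (left & right)
    return L, K, R


# Vertex 0 is listed in both left and right, vertex 2 only in context.
L, K, R = rule_sides(left={0, 1}, context={2}, right={0, 3})
print(sorted(L), sorted(K), sorted(R))
```

In this example vertex 1 is only in \(L\), so the rule removes it, and vertex 3 is only in \(R\), so the rule creates it.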
Each graph fragment is specified as a list of vertices and edges, similar to a
graph in GML format.
The key-value structure is exemplified by the following grammar.
ruleGML         ::= 'rule ['
                       [ 'ruleID' quoteEscapedString ]
                       [ 'labelType "' labelType '"' ]
                       [ leftSide ]
                       [ context ]
                       [ rightSide ]
                       matchConstraint*
                    ']'
labelType       ::= 'string' | 'term'
leftSide        ::= 'left [' (node | edge)* ']'
context         ::= 'context [' (node | edge)* ']'
rightSide       ::= 'right [' (node | edge)* ']'
matchConstraint ::= adjacency | labelAny
adjacency       ::= 'constrainAdj ['
                       'id' int
                       [ 'nodeLabels [' labelList ']' ]
                       [ 'edgeLabels [' labelList ']' ]
                       'op "' op '"'
                       'count' unsignedInt
                    ']'
labelAny        ::= 'constrainLabelAny ['
                       'label' quoteEscapedString
                       'labels [' labelList ']'
                    ']'
labelList       ::= ('label' quoteEscapedString)*
op              ::= '<' | '<=' | '=' | '>=' | '>'
Note though that list elements can appear in any order.
For details on the constraints, see Application Constraints.
8.1.2.1. A Note on Term Labels
As described in Switching Category — First-Order Terms as Labels it is possible to interpret the ordinary vertex and edge labels as first-order terms. When the label * is used, it is interpreted as an unnamed term variable.
Consider the rule:
rule [
left [ node [ id 0 label "*" ] ]
right [ node [ id 0 label "*" ] ]
]
In string mode this is simply an identity rule, but in term mode each * is interpreted as an unnamed variable. Be careful that in this case the two labels are interpreted as the same variable. That is, it is equivalent to:
rule [
left [ node [ id 0 label "_A" ] ]
right [ node [ id 0 label "_A" ] ]
]
If you wish to replace any vertex label with an explicit new variable, you can write it as:
rule [
left [ node [ id 0 label "_A" ] ]
right [ node [ id 0 label "_B" ] ]
]
8.2. SMILES
The Simplified molecular-input line-entry system is a line notation for molecules. MØD can load most SMILES strings, and converts them internally to labelled graphs according to a specific molecule encoding. For graphs that are sufficiently molecule-like, a SMILES string can be generated. The generated strings are canonical in the sense that the same version of MØD will print the same SMILES string for isomorphic molecules.
The reading of SMILES strings is based on the OpenSMILES specification, but with the following notes/changes.
Only single SMILES strings are accepted, i.e., not multiple strings separated by white-space.
Up and down bonds are regarded as implicit bonds, i.e., they might represent either a single bond or an aromatic bond. The stereo information is ignored.
Atom classes are (mostly) ignored. They can be used to specify unique IDs to atoms.
Wildcard atoms (specified with *) are converted to vertices with label *. When inside brackets, only the hydrogen count and atom class are then permitted.
Abstract vertex labels can be specified inside brackets. The bracket must in that case only contain the label and an optional class label. The label must be a non-empty string without : and with balanced square brackets.
Charges of magnitude 2 and 3 may be specified with repeated - and +.
The bond type $ is currently not allowed.
Aromaticity can only be specified using the bond type : or using the special lower-case atoms. I.e., c1ccccc1 and C1:C:C:C:C:C:1 represent the same molecule, but C1=CC=CC=C1 is a different molecule. The lower-case atoms are converted to normal case when used as a label.
Ring-bonds and branches may appear in mixed order. The normal order is to have all ring-bonds first and then all branches, e.g., C123(O)(N). The parser also accepts them in mixed order, e.g., C1(O)2(N)3.
Implicit hydrogens are added following a more complicated procedure (see below).
A bracketed atom can have a radical by writing a dot (.) between the position of the charge and the position of the class.
The written SMILES strings are intended to be canonical and may not conform to any “prettiness” standards.
8.2.1. Implicit Hydrogen Atoms
When SMILES strings are written they will use implicit hydrogens whenever they can be inferred when reading the string back in. For the purposes of implicit hydrogens we use the following definition of valence for an atom. The valence of an atom is the weighted sum of its incident edges, where single (-) and aromatic (:) bonds have weight 1, double bonds (=) have weight 2, and triple bonds (#) have weight 3. If an atom has an incident aromatic bond, its valence is increased by 1.
The atoms that can have implicit hydrogens are B, C, N, O, P, S, F, Cl, Br, and I. Each has a set of so-called “normal” valences as shown in the following table. The atoms N and S additionally have certain sets of incident edges that are also considered “normal”, which are also listed in the table.
Atom | Normal Valences and Neighbourhoods |
---|---|
B | 3 |
C | 4 |
N | 3, 5, \(\{-, :, :\}\), \(\{-, -, =\}\), \(\{:, :, :\}\) |
O | 2 |
P | 3, 5 |
S | 2, 4, 6, \(\{:, :\}\) |
F, Cl, Br, I | 1 |
If the set of incident edges is listed in the table, then no hydrogens are added. If the valence is higher than the highest normal valence, then no hydrogens are added. Otherwise, hydrogens are added until the valence is at the next higher normal valence.
When writing SMILES strings the inverse procedure is used.
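The procedure for reading can be sketched in a few lines of Python. This is an illustration of the rules stated above, not MØD's code. The function name and the input representation (the atom label plus a list of incident bond labels) are assumptions, and the sketch reads "the next higher normal valence" as the smallest normal valence that is at least the current valence, so an atom already at a normal valence receives no hydrogens.

```python
BOND_WEIGHT = {"-": 1, ":": 1, "=": 2, "#": 3}

NORMAL_VALENCES = {
    "B": [3], "C": [4], "N": [3, 5], "O": [2], "P": [3, 5], "S": [2, 4, 6],
    "F": [1], "Cl": [1], "Br": [1], "I": [1],
}

# Multisets of incident bonds that are "normal" regardless of valence.
NORMAL_NEIGHBOURHOODS = {
    "N": [("-", ":", ":"), ("-", "-", "="), (":", ":", ":")],
    "S": [(":", ":")],
}


def implicit_hydrogens(atom, bonds):
    """Return the number of hydrogens to add to an atom with the given bonds.

    Only the bond labels -, :, =, and # are handled in this sketch.
    """
    if atom not in NORMAL_VALENCES:
        return 0
    key = tuple(sorted(bonds))
    if any(key == tuple(sorted(n)) for n in NORMAL_NEIGHBOURHOODS.get(atom, [])):
        return 0
    valence = sum(BOND_WEIGHT[b] for b in bonds)
    if ":" in bonds:
        valence += 1  # one extra for having an incident aromatic bond
    for normal in NORMAL_VALENCES[atom]:  # listed in ascending order
        if valence <= normal:
            return normal - valence
    return 0  # above the highest normal valence


print(implicit_hydrogens("C", [":", ":"]))       # an aromatic carbon, as in c1ccccc1
print(implicit_hydrogens("N", ["-", ":", ":"]))  # a listed neighbourhood
```

The second call shows why the neighbourhoods are listed separately: without the table entry, the valence of that nitrogen would be 4 and one hydrogen would be added to reach 5.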
8.3. DFS Line Notation
The DFS formats are intended to provide a convenient line notation for general undirected labelled graphs and rules. Thus they are in many aspects similar to SMILES strings and reaction SMILES strings, but a string being both a valid (reaction) SMILES string and a valid GraphDFS/RuleDFS string does not mean that the two interpretations represent the same object. In particular, the semantics of ring-closures/back-edges are not the same.
8.3.1. GraphDFS
graphDFS                     ::= chain
chain                        ::= vertex evPair*
vertex                       ::= (labelVertex | ringClosure) branch*
evPair                       ::= edge vertex
labelVertex                  ::= '[' bracketEscapedString ']' [ defRingId ]
                                 | implicitHydrogenVertexLabels [ defRingId ]
implicitHydrogenVertexLabels ::= 'B' | 'C' | 'N' | 'O' | 'P' | 'S' | 'F' | 'Cl' | 'Br' | 'I'
defRingId                    ::= unsignedInt
ringClosure                  ::= unsignedInt
edge                         ::= '{' braceEscapedString '}' | shorthandEdgeLabel
shorthandEdgeLabel           ::= '-' | ':' | '=' | '#' | '.' | ''
branch                       ::= '(' evPair+ ')'
A bracketEscapedString and a braceEscapedString are zero or more characters except ] and }, respectively. To include these characters in their respective strings they must be escaped, i.e., \] and \} respectively.
Whitespace is ignored, except inside bracketEscapedString and braceEscapedString.
The parser additionally enforces that a defRingId may not be a number which has previously been used. Similarly, a ringClosure may only be a number which has previously occurred in a defRingId.
A vertex specified via the implicitHydrogenVertexLabels rule will potentially have extra neighbours added after parsing. The rules are exactly the same as for implicit hydrogen atoms in SMILES.
8.3.1.1. Semantics
A GraphDFS string is, like a SMILES string, an encoding of a depth-first traversal of the graph it encodes. Vertex labels are enclosed in square brackets and edge labels are enclosed in curly brackets. However, a special set of labels can be specified without the enclosing brackets. An edge label may additionally be completely omitted as a shorthand for a dash (-).
A vertex can have a numeric identifier, defined by the defRingId non-terminal. At a later stage this identifier can be used as a vertex specification to specify a back-edge in the depth-first traversal. Example: [v1]1-[v2]-[v3]-[v4]-1 specifies a labelled \(C_4\) (which can equivalently be specified more briefly as [v1]1[v2][v3][v4]1).
A vertex being a ringClosure can never be the first vertex in a string, and is thus preceded by an edge. As in a depth-first traversal, such a back-edge is a kind of degenerate branch. Example: [v1]1[v2][v3][v4]1[v5][v6]1 specifies a graph which is two fused \(C_4\), \(v_1, v_2, v_3, v_4\) and \(v_4, v_5, v_6, v_1\), with a common edge, \((v_1, v_4)\).
Warning
The semantics of back-edges/ring closures are not the same as in SMILES strings. In SMILES, a pair of matching numeric identifiers denotes an individual back-edge.
A branch in the depth-first traversal is enclosed in parentheses.
The shorthandEdgeLabel . indicates a non-edge, i.e., a jump to a new vertex without creating an edge. For example, [v1].[v2] encodes a graph with two vertices and no edges, while [v1]{.}[v2] encodes a graph with two vertices connected by an edge whose label is a single dot (.).
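The semantics described in this section can be sketched as a small parser. The Python below is an illustration, not MØD's GraphDFS parser: the function name and the returned (vertex labels, edge triples) representation are assumptions, it performs little validation, and it never adds implicit hydrogens. A number directly following a vertex label is read as a defRingId if it has not been used before, and otherwise as a ringClosure; a ring closure adds a back-edge but leaves the current vertex unchanged.

```python
import re

TOKEN = re.compile(r"""
    \[(?P<vlabel>(?:\\.|[^\]\\])*)\]   # bracketed vertex label
  | (?P<atom>Cl|Br|[BCNOPSFI])         # shorthand vertex label
  | \{(?P<elabel>(?:\\.|[^}\\])*)\}    # braced edge label
  | (?P<short>[-:=\#.])                # shorthand edge label
  | (?P<num>\d+)                       # defRingId or ringClosure
  | (?P<open>\()                       # branch start
  | (?P<close>\))                      # branch end
""", re.VERBOSE)

NO_EDGE = object()  # marker for the shorthand '.', i.e., a jump without an edge


def unescape(s):
    return re.sub(r"\\(.)", r"\1", s)


def parse_graph_dfs(text):
    """Return (vertex_labels, edges) with edges as (source, target, label)."""
    vertices, edges, ring_ids, branch_stack = [], [], {}, []
    prev = None         # vertex that the next edge starts from
    pending = None      # edge label seen since the last vertex, if any
    can_define = False  # True directly after a labelled vertex
    pos = 0

    def connect(target):
        nonlocal pending
        if prev is not None and pending is not NO_EDGE:
            edges.append((prev, target, "-" if pending is None else pending))
        pending = None

    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group(m.lastgroup)
        if kind in ("vlabel", "atom"):
            connect(len(vertices))
            vertices.append(unescape(value))
            prev = len(vertices) - 1
        elif kind == "num":
            n = int(value)
            if can_define and n not in ring_ids:
                ring_ids[n] = prev          # defRingId for the vertex just read
            elif n in ring_ids and prev is not None:
                connect(ring_ids[n])        # ringClosure: back-edge, prev unchanged
            else:
                raise ValueError(f"ring id {n} is not defined")
        elif kind == "elabel":
            pending = unescape(value)
        elif kind == "short":
            pending = NO_EDGE if value == "." else value
        elif kind == "open":
            branch_stack.append(prev)
        else:  # close: return to the vertex where the branch started
            prev = branch_stack.pop()
        can_define = kind in ("vlabel", "atom")
    return vertices, edges


labels, edges = parse_graph_dfs("[v1]1[v2][v3][v4]1[v5][v6]1")
print(labels)
print(edges)
```

For the fused-cycle example, the edge list contains the common edge between v4 and v1 once, followed by the path from v4 through v5 and v6 back to v1.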
8.3.1.2. Abstracted Molecules
The short-hand labels for vertices and edges make it easier to specify partial molecules than using GML files. As an example, consider modelling Acetyl-CoA in which we wish to abstract most of the CoA part. The GraphDFS string CC(=O)S[CoA] can be used, and we let the library add missing hydrogen atoms to the vertices which encode atoms. A plain CoA molecule would in this modelling be [CoA]S, or a bit more verbosely [CoA]S[H].
The format can also be used to create completely abstract structures (it can encode any undirected labelled graph), e.g., RNA strings. Note that in this case it may not be appropriate to add “missing” hydrogen atoms. This can be controlled by an optional parameter to the loading function.
8.3.2. RuleDFS
The rule format builds on the graph format by using two GraphDFS strings to encode a rule:
ruleDFS ::= [graphDFS
] '>>' [graphDFS
]
The two (possibly empty) GraphDFS strings encode the left-hand and right-hand side of a rule, with the vertex IDs being used to relate them. That is, a pair of vertices in the left side and right side with the same ID will be identified and the vertex put in the context graph of the rule as well. A similar pair of edges where both end-points are in the context graph will be put in the context graph as well.
Examples:
>>: the empty rule.
[A]>>: a rule with a single vertex in \(L\), and empty \(K\) and \(R\).
[A]>>[B]: a rule with empty \(K\) but with a vertex in \(L\) which is removed by the rule, and a vertex in \(R\) being created by the rule.
[A]1>>[B]1: a rule with a vertex changing label from “A” to “B”.
Note
Currently it is not possible to use vertices with implicit hydrogens in RuleDFS.
8.4. MOL and SD
MØD can load graphs stored in the CT File formats MOL (single structure) and SD (multiple structures). See the loading functions in graph::Graph/Graph for the API. The loaded structures are converted to labelled graphs according to a specific molecule encoding.
The reading of structures is based on the published specification, but with the following notes/changes.
The MDLOptions/MDLOptions can be used to customize the loading procedure.
Radical value 2 is converted to . in the vertex labels.
The atom symbols “LP” and “L” are used as is, as an atom with an abstract label.
The atom symbols “A”, “Q”, and “*” are all considered wildcard atoms and are converted to vertices with label *. See the use of first-order terms as labels.
The bond orders 5, 6, and 7 for constrained wildcard bonds are converted to edges with labels starting with _Q, i.e., term variables. See the use of first-order terms as labels.
The bond order 8 for unconstrained bonds is converted to edges with label *. See the use of first-order terms as labels.
8.5. Abstract Derivation Graphs
Sometimes it is really convenient to quickly write down a few equations to describe a “derivation graph”, without associating actual graphs and rules to it. That is, only specifying the underlying network. The network description is a string adhering to the following grammar:
description ::= derivation { derivation }
derivation  ::= side ("->" | "<=>") side
side        ::= term { "+" term }
term        ::= [ unsignedInt ] identifier
identifier  ::= any character sequence without spaces
Note that the identifier definition in particular means that whitespace is important between coefficients and identifiers. E.g., 2 A -> B is different from 2A -> B.
See also DG.Builder.addAbstract()/dg::Builder::addAbstract().
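The role of whitespace can be made concrete with a small parser for this grammar. The Python below is a sketch, not the parser behind addAbstract(): the function name and the result representation are assumptions, and so is the way it resolves ambiguity (a token consisting only of digits is read as a coefficient when another term token follows it).

```python
ARROWS = {"->", "<=>"}


def parse_abstract(description):
    """Parse a network description into (left, arrow, right) derivations.

    Each side is a list of (coefficient, identifier) pairs.
    """
    tokens = description.split()
    pos = 0
    derivations = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of description")
        pos += 1
        return tokens[pos - 1]

    def term():
        tok = take()
        if tok in ARROWS or tok == "+":
            raise ValueError(f"expected a term, got {tok!r}")
        nxt = peek()
        is_number = tok.isascii() and tok.isdigit()
        if is_number and nxt is not None and nxt not in ARROWS and nxt != "+":
            return int(tok), take()  # coefficient followed by identifier
        return 1, tok                # identifier with implicit coefficient 1

    def side():
        terms = [term()]
        while peek() == "+":
            take()
            terms.append(term())
        return terms

    while pos < len(tokens):
        left = side()
        arrow = take()
        if arrow not in ARROWS:
            raise ValueError(f"expected '->' or '<=>', got {arrow!r}")
        derivations.append((left, arrow, side()))
    if not derivations:
        raise ValueError("a description needs at least one derivation")
    return derivations


print(parse_abstract("2 A -> B"))  # coefficient 2, identifier "A"
print(parse_abstract("2A -> B"))   # coefficient 1, identifier "2A"
```

Splitting on whitespace first is what makes 2 A and 2A differ: the former yields two tokens, the latter a single identifier.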
8.6. Tikz
Both graphs and rules are visualized through PostMØD by the library generating Tikz code and compiling it with LaTeX. The visualisation style is controlled by passing instances of mod::graph::Printer/mod.GraphPrinter to the printing functions. The drawing style is inspired by ChemFig and Open Babel. The coordinates for the layout are generated using Open Babel when the graphs are chemical enough; otherwise Graphviz is invoked to generate coordinates.
For visualizing a rule or DPO diagram, the position of vertices is used to indicate how morphisms map vertices to each other.
8.7. DOT (Graphviz)
The DOT format (from Graphviz) is used for generating vertex coordinates for the Tikz format when Open Babel cannot be used.