With James's permission, I would like to introduce an initiative of his that he and I have been working on: the CIF data model. This is a [i]logical[/i] model with which the structure and data content of any CIF can be described, and I present it to the group with the intention that it should serve at least two purposes:
1) most importantly, to provide a common vocabulary with which to discuss the form and parts of CIFs, but also
2) as a reference against which whatever physical data models we later consider can be validated and compared.
The need for (1) particularly struck me as I was compiling the [url=http://forums.iucr.org/viewtopic.php?f=27&t=73#p217]requirements list[/url].
I emphasize that this logical model is not proposed to be directly translated into a physical model. On the other hand, any physical model that adequately covers the CIF API's problem space by necessity has a subset that maps to the logical model presented here:
(1) A CIF FILE is a set of zero or more (BLOCKNAME, DATABLOCK) pairs, wherein all the BLOCKNAMEs are distinct
(2) A DATABLOCK is a CONTAINER plus zero or more (FRAMENAME, SAVEFRAME) pairs, wherein all the FRAMENAMEs are distinct
(3) BLOCKNAMEs and FRAMENAMEs are sequences of one or more Unicode code points from the CIF character set, excluding CIF whitespace
(4) A SAVEFRAME is a CONTAINER
(5) A CONTAINER is a set of zero or more LOOPs satisfying the constraints described in (9) and (10)
(6) A LOOP is a collection of one or more PACKETs for a DOMAIN
(7) A DOMAIN is a set of DATANAMEs (see note ii)
(8) A PACKET for a given DOMAIN is a mapping from each DATANAME in that DOMAIN to a DATAVALUE
(9) Each CONTAINER contains at most one single-PACKET LOOP (see note iii)
(10) All DOMAINs of all LOOPs in the same CONTAINER must be disjoint
(11) A DATAVALUE is any one of TABLE, LIST, MAYBENUMB, CHAR, NUMB, NULL, or UNKNOWN
(12) A TABLE is a set of zero or more (KEY, DATAVALUE) pairs, wherein all the KEYs are distinct
(13) A KEY is a CHAR
(14) A LIST is an ordered sequence of zero or more DATAVALUEs
(15) MAYBENUMB is a (NUMB, CHAR) pair, constituting separate numeric and character interpretations of one value (see note v)
(16) A CHAR value is a sequence of zero or more Unicode code points
(17) A NUMB value is a (NUMBER, NUMBER) pair, constituting a numeric value and its standard uncertainty
(18) NUMBER is a numeric value
(19) NULL and UNKNOWN are primitive values
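To make the definitions above a little more concrete, here is a rough Python sketch of the value types and of the CONTAINER constraints in (6), (8), (9) and (10). It is only an illustration of the logical model, not a proposed physical model: the class names, the use of Python floats for NUMBER, plain strings for CHAR and DATANAME, a list for "collection", and the helper check_container are all my own assumptions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Union


class Null:
    """Primitive NULL value, definition (19)."""


class Unknown:
    """Primitive UNKNOWN value, definition (19)."""


@dataclass(frozen=True)
class Numb:
    """(17): a numeric value paired with its standard uncertainty."""
    value: float  # assumption: NUMBER is represented as a Python float
    su: float


@dataclass(frozen=True)
class MaybeNumb:
    """(15): numeric and character interpretations of one value."""
    numb: Numb
    char: str


# (11)-(14): TABLE is a dict keyed by CHAR, LIST is an ordered list.
DataValue = Union[Dict[str, "DataValue"], List["DataValue"],
                  MaybeNumb, str, Numb, Null, Unknown]


@dataclass
class Loop:
    """(6)-(8): one or more PACKETs, each mapping the whole DOMAIN."""
    domain: FrozenSet[str]
    packets: List[Dict[str, DataValue]]


def check_container(loops: List[Loop]) -> List[str]:
    """Return the constraint violations found in a CONTAINER (hypothetical helper)."""
    problems: List[str] = []
    seen: set = set()
    single_packet_loops = 0
    for loop in loops:
        if not loop.packets:
            problems.append("(6) a LOOP needs at least one PACKET")
        for packet in loop.packets:
            if set(packet) != loop.domain:
                problems.append("(8) a PACKET must map exactly its DOMAIN")
        if len(loop.packets) == 1:
            single_packet_loops += 1
        if seen & loop.domain:
            problems.append("(10) DOMAINs in one CONTAINER must be disjoint")
        seen |= loop.domain
    if single_packet_loops > 1:
        problems.append("(9) at most one single-PACKET LOOP per CONTAINER")
    return problems
```

An empty list from check_container means the CONTAINER satisfies the constraints that this sketch checks; it says nothing about name or value validity.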
Note that:
(i) The word "set" is used to imply order-independence.
(ii) DATANAME is the same as "data name" as described in the CIF 2.0 syntax specification.
(iii) The one allowed single-packet loop per CONTAINER comprises all 'top-level' key-value pairs as well as the joined contents of all syntactic one-packet loops in that CONTAINER.
(iv) As a consequence of (iii), distinct physical CIFs have the same representation in this data model when they differ only in whether data are presented as one-packet loops or in flattened form.
(v) MAYBENUMB is the result of parsing a data value having numeric form, prior to any validation to establish its correct data type. NUMB, on the other hand, is a value that is known to be numeric, whether through validation or some other (e.g. programmatic) means.
(vi) This formulation of the model is not intended to define how code point sequences are judged to be distinct or equal.
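Notes (iii) and (iv) may be easier to follow with a small example. The sketch below folds the top-level items and every syntactic one-packet loop of a CONTAINER into a single one-packet loop, and leaves multi-packet loops alone. The function normalize and its input shape are hypothetical, data values are simplified to plain strings, and the input is assumed to have disjoint domains already, as (10) requires.

```python
from typing import Dict, List

Packet = Dict[str, str]  # assumption: data values simplified to strings


def normalize(top_level: Packet, loops: List[List[Packet]]) -> List[List[Packet]]:
    """Fold top-level items and one-packet loops into one single-packet loop."""
    merged: Packet = dict(top_level)
    multi_packet: List[List[Packet]] = []
    for packets in loops:
        if len(packets) == 1:
            # A syntactic one-packet loop joins the merged packet (note iii).
            merged.update(packets[0])
        else:
            multi_packet.append(packets)
    # Emit the single-packet loop only if there is anything to put in it.
    result: List[List[Packet]] = [[merged]] if merged else []
    return result + multi_packet


# Two physical presentations of the same data (note iv):
flat = normalize({"_cell_length_a": "5.4", "_cell_length_b": "6.1"}, [])
looped = normalize({}, [[{"_cell_length_a": "5.4"}], [{"_cell_length_b": "6.1"}]])
same = flat == looped  # True: both reduce to one single-packet loop
```

The comparison works here because Python dict equality ignores insertion order, which matches the order-independence in note (i).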