The namespace problem and a possible solution
Posted: Thu Sep 06, 2012 8:14 am
We wish to build some sort of namespace mechanism into CIF so that other communities can use CIF with minimal, if any, coordination with COMCIFS. The key requirement is that datanames and the corresponding dictionary definitions must be unambiguously matchable. Currently, COMCIFS guarantees that datanames are unique and immutable, so there is no need for any disambiguation mechanism. If CIF is to be usable outside COMCIFS, there must be a mechanism by which the readers and writers of CIF data files from a given community can agree on the correct definition for a given dataname.
Two partial solutions already exist:
(1) People and organisations register an opaque 'prefix' for a dataname with the IUCr. This allows users to populate their own namespaces safely and devolves management of dataname collisions to the relevant community. From the point of view of the outside discipline, there remains the annoyance that the datanames and dictionaries are cluttered with a redundant prefix.
(2) The _audit tags in a datablock can specify which dictionary the datanames come from. The problem then becomes one of encouraging programs to read and write these _audit items, and it is not clear what to do if a datablock lacks _audit items. Note that the dictionary merging protocol describes how to reconcile identical datanames provided they come from explicitly specified dictionaries.
It is fair to say that only (1) reliably insulates various domains from collisions with other domains, because _audit items are not widely used.
Search for solutions
=================
In a fundamental sense we can implement solutions at the syntactic level (e.g. a new keyword or a dataname prefix), at the semantic level (by defining a new dataname), or as a combination of both (e.g. a new dataname provides a default prefix which would otherwise have to be included explicitly in each dataname). I favour a purely semantic solution so that current CIF parsers can operate with no changes. Therefore I propose the following simple solution, which essentially makes partial solution (2) above as easy as possible to deal with:
Solution A:
=========
A small audit dictionary is created that is stipulated to be common to *all* CIF-using domains (a part of the base framework, if you like). A dataname, let's call it _audit_domain, is defined within this dictionary. This dataname can take values from a simple enumerated list, each value corresponding to a CIF-using scientific field. Each value maps to a URI (perhaps defined in the audit dictionary) pointing to a resource that can be used to locate the dictionaries relevant to that scientific domain, from which the datanames in that CIF datablock are drawn. It is therefore sufficient to include a single dataname in a CIF datablock to identify the preferred source of definitions for the datanames that follow, e.g. '_audit_domain atomic_physics'. An optional _audit_domain_URI dataname could explicitly provide the URI for the dictionary lists for maximum robustness, but this is unlikely to be used much in general practice.
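To make the lookup concrete, here is a minimal Python sketch of how a reader might go from _audit_domain to the location of the dictionary list. It is an illustration only: the registry contents, the example.org URIs, the function name, and the rule that an explicit _audit_domain_URI takes precedence are all my assumptions, and the datablock is modelled as a plain dict rather than read with a real CIF parser.

```python
# Hypothetical registry mapping enumerated _audit_domain values to the URI
# of a resource that lists that domain's dictionaries. In the proposal this
# mapping might live in the common audit dictionary; the URIs below are
# placeholders, not real locations.
DOMAIN_REGISTRY = {
    "atomic_physics": "https://example.org/cif/atomic_physics/dictionaries",
    "crystallography": "https://example.org/cif/crystallography/dictionaries",
}


def dictionary_list_uri(datablock):
    """Return the URI of the dictionary list for a datablock, or None.

    `datablock` is a dict of dataname -> value (an assumed representation).
    """
    # Assumption: an explicit _audit_domain_URI overrides the registry.
    explicit = datablock.get("_audit_domain_URI")
    if explicit:
        return explicit

    domain = datablock.get("_audit_domain")
    if domain is None:
        # No domain declared: the caller must fall back to some default.
        return None
    if domain not in DOMAIN_REGISTRY:
        raise ValueError("unknown _audit_domain value: " + repr(domain))
    return DOMAIN_REGISTRY[domain]


block = {"_audit_domain": "atomic_physics", "_some_dataname": "1.23"}
uri = dictionary_list_uri(block)
```

What to do when the dataname is absent is left to the caller here, since the proposal itself does not settle that case.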
The advantage of such a scheme is that programmers would only have to check a simple text string (rather than a long and variable dictionary URI) in order to determine whether or not the datablock has datanames from the field for which they have written their application. Note also that _audit_domain could be looped in the unusual case that dictionaries from multiple scientific fields will be combined, in which case dataname collisions are dealt with using the standard dictionary merging protocol in 'replace' mode.
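The check an application would make can be sketched in a few lines of Python. Again this is only an illustration under assumptions: the datablock is a plain dict, a looped _audit_domain is represented as a list of strings, and the helper names are hypothetical.

```python
def declared_domains(datablock):
    """Return the list of domains declared by _audit_domain (possibly empty).

    Assumed representation: a single value is a str, a looped
    _audit_domain is a list of str.
    """
    value = datablock.get("_audit_domain")
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def block_is_for(datablock, my_domain):
    """True if the datablock declares the domain this application handles."""
    return my_domain in declared_domains(datablock)


single = {"_audit_domain": "atomic_physics"}
looped = {"_audit_domain": ["crystallography", "atomic_physics"]}
unlabelled = {"_some_dataname": "42"}

results = [
    block_is_for(single, "atomic_physics"),
    block_is_for(looped, "atomic_physics"),
    block_is_for(unlabelled, "atomic_physics"),
]
```

The comparison is a plain string match against a short enumerated value, which is the point of the scheme: no dictionary URI has to be parsed or normalised before the application knows whether the datablock is meant for it.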