The namespace problem and a possible solution

Discussion of namespace conventions for community-developed CIF dictionaries.

Moderators: Brian McMahon, jcbollinger

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

The namespace problem and a possible solution

Post by jamesrhester » Thu Sep 06, 2012 8:14 am

We wish to build some sort of namespace mechanism into CIF so that other communities can use CIF with minimal, if any, coordination with COMCIFS. The key requirement is that datanames and the corresponding dictionary definitions must be unambiguously matchable. Currently, COMCIFS guarantees the uniqueness and immutable nature of datanames, so there is no need for any disambiguation mechanism. If CIF is to be usable outside COMCIFS, there must be a mechanism so that the readers and writers of CIF data files from a given community can agree on the correct definition for a given dataname.

Two partial solutions already exist:
(1) People and organisations register an opaque 'prefix' for datanames with the IUCr. This allows users to populate their own namespaces safely and devolves management of dataname collisions to the relevant community. From the point of view of the outside discipline, there remains the annoyance that the datanames and dictionaries are cluttered with a redundant prefix.
(2) The _audit tags in a datablock can specify which dictionary the datanames come from. The problem then becomes one of encouraging programs to read and write these _audit items, and it is not clear what to do if a datablock lacks _audit items. Note that the dictionary merging protocol describes how to reconcile identical datanames provided they come from explicitly specified dictionaries.

It is fair to say that only (1) reliably insulates various domains from collisions with other domains, because _audit items are not widely used.

Search for solutions
====================
In a fundamental sense we can implement solutions at the syntactic level (e.g. a new keyword or a dataname prefix), at the semantic level (by defining a new dataname), or as a combination of both (e.g. a new dataname is used to provide a default prefix which otherwise should be explicitly included in a dataname). I favour a purely semantic solution so that current CIF parsers are able to operate with no changes. Therefore I propose the following simple solution, which essentially makes solution (2) above as easy as possible to deal with:

Solution A:
===========

A small audit dictionary is created that is stipulated to be common to *all* CIF-using domains (a part of the base framework, if you like). A dataname, let's call it _audit_domain, is defined within this dictionary. This dataname can take values from a simple enumerated list. Each of these values would correspond to a CIF-using scientific field. These values index into a URI (perhaps defined in the audit dictionary) which points to a resource that can be used to locate the dictionaries relevant to that scientific domain and from which the datanames in that CIF datablock are drawn. Therefore, it is sufficient to include a single dataname in a CIF data block to identify the preferred source of dataname definitions that follow, e.g. '_audit_domain atomic_physics'. An optional _audit_domain_URI dataname could explicitly provide the URI for the dictionary lists for maximum robustness, but this is unlikely to be used much in general practice.

The advantage of such a scheme is that programmers would only have to check a simple text string (rather than a long and variable dictionary URI) in order to determine whether or not the datablock has datanames from the field for which they have written their application. Note also that _audit_domain could be looped in the unusual case that dictionaries from multiple scientific fields will be combined, in which case dataname collisions are dealt with using the standard dictionary merging protocol in 'replace' mode.
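The runtime check described above could look roughly like the following Python sketch. It assumes the datablock has already been parsed into a plain dict of dataname to value; the registry contents, the example.org URIs, the dataname '_level_energy' and all function names are hypothetical and only serve the illustration.

```python
# Hypothetical registry: enumerated _audit_domain value -> URI listing that
# domain's dictionaries. Entries and URIs are invented for illustration.
DOMAIN_REGISTRY = {
    "IUCr": "https://example.org/cif-domains/iucr",
    "atomic_physics": "https://example.org/cif-domains/atomic_physics",
}


def declared_domains(datablock):
    """Return the _audit_domain values of a datablock as a list (empty if absent)."""
    value = datablock.get("_audit_domain")
    if value is None:
        return []
    # A looped item is represented here as a list, a single item as a string.
    return list(value) if isinstance(value, (list, tuple)) else [value]


def dictionary_list_uri(domain):
    """Look up where the dictionaries for a domain can be found (None if unknown)."""
    return DOMAIN_REGISTRY.get(domain)


def block_is_for(datablock, expected_domain):
    """True if the datablock declares the domain this application was written for."""
    return expected_domain in declared_domains(datablock)


block = {"_audit_domain": "atomic_physics", "_level_energy": "13.6"}
print(block_is_for(block, "atomic_physics"))  # True
print(block_is_for(block, "IUCr"))            # False
```

The application only ever compares a short string; the URI lookup is needed only by tools that actually want to fetch the dictionaries.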

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: The namespace problem and a possible solution

Post by jcbollinger » Fri Sep 07, 2012 2:33 pm

jamesrhester wrote:It is fair to say that only (1) reliably insulates various domains from collisions with other domains, because _audit items are not widely used.


Option (1) also allows for any CIF to mix items from various namespaces, including items from different namespaces with the same simple name. No mechanism that relies on (meta)data that is physically separate from the names themselves can do that. Is that a desirable or required property of the solution?


John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: The namespace problem and a possible solution

Post by jamesrhester » Mon Dec 10, 2012 11:53 pm

jcbollinger wrote:
jamesrhester wrote:It is fair to say that only (1) reliably insulates various domains from collisions with other domains, because _audit items are not widely used.


Option (1) also allows for any CIF to mix items from various namespaces, including items from different namespaces with the same simple name. No mechanism that relies on (meta)data that is physically separate from the names themselves can do that. Is that a desirable or required property of the solution?


John


What do you mean by "same simple name"? Option (1) results in datanames that are never identical, so even different namespaces will have different names. Option (2) provides a standard resolution mechanism for identical datanames and explicitly allows for mixing of datanames from various namespaces (i.e. dictionaries).

My suggested enhancement attempts to hide as much as possible of the underlying mechanics of option (2), and takes into account the fact that programmers must choose a dictionary when writing a program, not when reading the datafile, so all they want to do at runtime is to check that they have datanames from the right dictionary. Within a given domain, datanames are guaranteed to be unique, so the act of successfully reading a dataname should be sufficient to guarantee that it comes from the dictionary you think it comes from.

To make all this concrete, a physical properties dictionary has been developed completely independently of COMCIFS. Suppose that, instead of the Option (1) approach of prefixing everything with '_prop_', these workers had wished to operate completely independently of IUCr dictionaries, so that some of their datanames collided with or anticipated other COMCIFS datanames (e.g. journal_, phase_). Using existing option (2), datafiles containing the new datanames should include audit_ items to resolve which dataname definitions (theirs or the IUCr's) should take precedence in a given datafile. Now, the very possibility that files might exist with conflicting definitions means that all programs must now check the audit_ items when ingesting a datafile, leading to annoying boilerplate for the programmer, especially if they are drawing in datanames from a variety of sources (e.g. pdCIF + coreCIF + symCIF). This complicates how CIF programmers have operated up until now - programmers have been able to assume COMCIFS guarantees of uniqueness and backwards-compatibility.

Under my suggestion, the boilerplate is reduced to an absolute minimum, so that the datafile producer would loop the _audit.domain dataname as follows:

Code:

loop_
    _audit.domain
    'materials'
    'IUCr'


and the programmer would check that the priority of their domain in the loop corresponded to the priority that she/he has assumed in the program. Where the priority is different, the programmer can choose at runtime to (i) abort, (ii) risk a clash, or (iii) download the potentially contradictory dictionary and check for any datanames that might clash, then decide what to do (or alert the user, etc.).
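A minimal Python sketch of that priority check, assuming the looped _audit.domain values have already been read into a list in file order. The function name and the three return labels are hypothetical; what a program does on a mismatch is left to the three choices listed above.

```python
def check_priority(declared, assumed):
    """Compare the domain order declared in a file with the order a program assumes.

    declared: _audit.domain values in the order they appear in the loop.
    assumed:  domains in the priority order the program was written for.

    Returns "ok" if every assumed domain is declared and their relative order
    matches, "missing" if an assumed domain is not declared at all, and
    "reordered" if all are present but in a different relative order.
    """
    positions = []
    for domain in assumed:
        if domain not in declared:
            return "missing"
        positions.append(declared.index(domain))
    return "ok" if positions == sorted(positions) else "reordered"


print(check_priority(["materials", "IUCr"], ["materials", "IUCr"]))  # ok
print(check_priority(["IUCr", "materials"], ["materials", "IUCr"]))  # reordered
print(check_priority(["materials"], ["materials", "IUCr"]))          # missing
```

On "reordered" or "missing" the program would then abort, continue at the risk of a clash, or fetch the other domain's dictionary and look for clashing datanames.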

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: The namespace problem and a possible solution

Post by jcbollinger » Tue Dec 11, 2012 3:55 pm

jamesrhester wrote:What do you mean by "same simple name"? Option (1) results in datanames that are never identical, so even different namespaces will have different names. Option (2) provides a standard resolution mechanism for identical datanames and explicitly allows for mixing of datanames from various namespaces (i.e. dictionaries).


By "simple name" I mean a data name as defined in its dictionary. That can be different from the name as used in a CIF instance document under option (1). Under option (2), CIF instance documents only ever contain simple names, though in a logical sense those names may carry namespace information declared elsewhere.

jamesrhester wrote:My suggested enhancement attempts to hide as much as possible of the underlying mechanics of option (2), and takes into account the fact that programmers must choose a dictionary when writing a program, not when reading the datafile, so all they want to do at runtime is to check that they have datanames from the right dictionary. Within a given domain, datanames are guaranteed to be unique, so the act of successfully reading a dataname should be sufficient to guarantee that it comes from the dictionary you think it comes from.


Yes, but the main purpose of namespacing in the first place is not to attach names to dictionaries, but rather to distinguish between the same name appearing in multiple (relevant) dictionaries. Is that what the proposal is intended to address? If not, then it may be a great idea, but a "namespace" problem is not what it solves. If the proposal is intended to address that case, however, then it fails in these ways:
  • it does not accommodate cases where the dictionary priority needs to be different for different data names. The proposal could probably be amended with some kind of special-casing mechanism for that, but also
  • it cannot accommodate multiple dictionaries' definitions of the same name being used (for different items) in the same CIF.
The only way to address the latter, especially, is with a scheme that physically de-duplicates the colliding data names.

To rephrase my earlier question, then, are those limitations fatal to the proposal?


John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: The namespace problem and a possible solution

Post by jamesrhester » Mon Dec 24, 2012 2:07 am

(I am perhaps going to repeat myself in the following, but the discussion is helpful in allowing me to clarify the key issues).

In light of John's points let me clarify the aim as I see it: we wish to provide a mechanism such that other domains may use CIF with minimal or no coordination with COMCIFS. A 'domain' may be formally described, therefore, as a group of dictionaries which define datanames that are guaranteed to always have a constant, unambiguous meaning. This guarantee would presumably be provided by some organisation using policies chosen by that organisation.

So, the key issue that needs solving here is how to distinguish CIF datafiles that arise from differing domains. If all datafiles contained a complete list of the dictionaries that the programmer used in writing the program that produces the datafiles together with a list of any dictionaries that subsequent hand editing referred to, and all CIF file reading programs assiduously checked this dictionary list before assuming they knew the correct dataname meanings, then there would be no need for any further action on our part. However, very little dictionary checking is performed by current software simply because there is only one domain (courtesy of COMCIFS diligence) so if a dataname matches one that is expected by the program, the value of that dataname is guaranteed to be that which the programmer expects. This will be the case within any given domain.

My proposal of creating an _audit.domain dataname is to minimise the checking required - within a domain there should be no need for any per-dictionary checking.

Implicit in my definition of a 'domain' is the possibility that non-conflicting dictionaries from different domains could be arbitrarily mixed to create ad-hoc domains. This is not desirable, because dictionaries that were non-conflicting when a CIF file was written may have come into conflict subsequently, as separate domains are by definition uncoordinated. Considerable detailed checking of dictionary versions would therefore need to be performed by software in order to disambiguate such files, defeating the purpose of the domain idea. The correct way of mixing dictionaries from multiple domains would be for each domain to incorporate a dictionary from some other domain into its own domain, adopting whatever policies it wishes to ensure that clashes within its own domain do not ensue (e.g. prepending a prefix referring to the originating domain). Note that we are therefore actively discouraging mixing of datanames from different domains, and my comment above about some mixing being possible using looped '_audit.domain' datanames was a mistake.
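One way a domain might incorporate a dictionary from another domain, using the prefixing policy mentioned above, is sketched here in Python. Dictionaries are modelled as plain dicts of dataname to definition text; the function name, the prefix format and the example definitions are assumptions made for illustration, not an agreed convention.

```python
def incorporate(own_defs, foreign_defs, origin_prefix):
    """Copy foreign definitions into a domain, renaming each with an origin prefix.

    Returns a new dict and leaves both inputs untouched. Raises ValueError if a
    prefixed name would still clash with a name the adopting domain already has.
    """
    merged = dict(own_defs)
    for name, definition in foreign_defs.items():
        # e.g. "_journal_name" adopted with prefix "iucr" -> "_iucr_journal_name"
        new_name = "_" + origin_prefix + "_" + name.lstrip("_")
        if new_name in merged:
            raise ValueError(f"{new_name} is already defined in the adopting domain")
        merged[new_name] = definition
    return merged


materials = {"_journal_name": "materials-domain definition"}
iucr = {"_journal_name": "IUCr definition"}
print(sorted(incorporate(materials, iucr, "iucr")))
# ['_iucr_journal_name', '_journal_name']
```

After incorporation every dataname again has exactly one meaning within the adopting domain, so no per-dictionary checking is needed at read time.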

If there are show-stoppers that I've missed, or there is a better alternative, please bring them forward. I will walk through some scenarios in a separate post.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: The namespace problem and a possible solution

Post by jcbollinger » Thu Jan 03, 2013 3:55 pm

Thanks, James, for the clarifications.
jamesrhester wrote:If there are show-stoppers that I've missed, or there is a better alternative, please bring them forward. I will walk through some scenarios in a separate post.

I don't see any show-stoppers, but I do see some issues that should be considered:
  • For the _audit.domain item to be of much use, its values need to be drawn from a list of standard alternatives. Domains could self-select, but if CIF were some day to spread to a large number of domains then the risk of domain collisions would be non-negligible. Thus, a complete solution requires a central registry of CIF domains, yet requiring domain registration limits (slightly) how independent domains can really be.
  • We should consider how the _audit.domain item interacts with data blocks and save frames. As I presently see it, CIF holds that only an _audit.domain item appearing in the same frame could apply to items in a save frame, and only one appearing in the same block could apply to items appearing directly in a data block. Is that what is wanted?
  • Do we really want to have cross-domain dictionary merging as originally suggested? The clarifications present a persuasive argument against it.
  • We should consider whether _audit.domain is itself defined in its own dictionary(-ies), whether it is explicitly or implicitly added to one or more domain dictionaries, or whether its definition is a special case, defined by the standard instead of by any dictionary.
  • As a related, minor issue, we should consider with what name formalism(s) the item should comply.
  • We should make recommendations for how to handle CIFs that do not carry the _audit.domain item.

John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: The namespace problem and a possible solution

Post by jamesrhester » Tue Jan 08, 2013 1:36 am

jcbollinger wrote:Thanks, James, for the clarifications.
I don't see any show-stoppers, but I do see some issues that should be considered:
  • For the _audit.domain item to be of much use, its values need to be drawn from a list of standard alternatives. Domains could self-select, but if CIF were some day to spread to a large number of domains then the risk of domain collisions would be non-negligible. Thus, a complete solution requires a central registry of CIF domains, yet requiring domain registration limits (slightly) how independent domains can really be.

I think that there should be a simple central registry, run by the IUCr. This should be viewed as a service which allows different domains to avoid collisions, rather than appearing as a requirement on all CIF users. COMCIFS would develop a set of criteria for granting domain status, something along the lines of an application having to come from a credible central body that would undertake to coordinate that domain (such as a learned society). The single requirement would be that no domain is allowed to redefine the '_audit.domain' dataname(s). Those who wish to use CIF with no reference at all to the IUCr would then simply have to avoid a single dataname.
  • We should consider how the _audit.domain item interacts with data blocks and save frames. As I presently see it, CIF holds that only an _audit.domain item appearing in the same frame could apply to items in a save frame, and only one appearing in the same block could apply to items appearing directly in a data block. Is that what is wanted?

The relationship of save frames to the datablock has been little discussed, presumably because up until now they have only been used in dictionaries as a convenient encapsulation device. I do discuss some of the issues in the PyCIFRW paper, section 2.3.2. For the present discussion it must be true that the scope of an _audit.domain dataname appearing in the datablock proper would include the contents of save frames, as otherwise the datafile could contain datanames from different domains, which we are trying to avoid. This behaviour is broadly in line with dataname use in DDL2 dictionaries.

If _audit.domain appears in a save frame, and that save frame is then incorporated into a datablock, as the DDLm paper proposes, then the datanames appearing in the save frame must come from the same domain as the datablock datanames if we are to avoid mixing domains. Logically, therefore, all save frames must have the same _audit.domain unless they are simply used as an encapsulation device. To avoid too much program logic tracing use of save frames through the datafile, I suggest that we specify that the scope of _audit.domain is the entire datablock in which it appears, and recommend that it only appears in the datablock proper. If there were a way in DDLm to limit _audit.domain use to the datablock proper, I would embrace that immediately.

  • Do we really want to have cross-domain dictionary merging as originally suggested? The clarifications present a persuasive argument against it.

No, we definitely do not want cross-domain dictionary merging.
  • We should consider whether _audit.domain is itself defined in its own dictionary(-ies), whether it is explicitly or implicitly added to one or more domain dictionaries, or whether its definition is a special case, defined by the standard instead of by any dictionary.

I think that only the latter can work - a separate 'audit' dictionary that is considered common to all domains, and is always implied in any CIF file. This is consistent with the first point above.
  • As a related, minor issue, we should consider with what name formalism(s) the item should comply

I continue to assert that semantically there is no such thing as a DDL1 or DDL2 dataname, as the dataname structure carries no information. Given that DDLm has adopted the DDL2 conventions, I think that a DDL2-style name is adequate. DDL1 programs are perfectly capable of checking and handling datanames that contain a period.
  • We should make recommendations as to how to handle CIFs that do not carry the _audit.domain item.

I believe that this is really up to the programmer and particular problem space. For example, the more datanames that are successfully read in, the more likely you are reading a file belonging to your expected domain. I think that all we can provide is a discussion of the issues.
Some choices are:
(i) prompt the user to confirm choice of domain (for interactive programs)
(ii) check for datanames that really should be unique to crystallography
(iii) include a warning in output
(iv) terminate
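As a rough Python sketch of how choices (ii) to (iv) might combine in a non-interactive program: the datablock is assumed to be a plain dict, and the threshold value, return labels and function names are all hypothetical.

```python
def looks_like_my_domain(datablock, known_names, threshold=0.8):
    """Heuristic: does a large enough fraction of datanames belong to my domain?

    The 0.8 default is an arbitrary illustrative value, not a recommendation.
    """
    names = [n for n in datablock if n != "_audit.domain"]
    if not names:
        return False
    recognised = sum(1 for n in names if n in known_names)
    return recognised / len(names) >= threshold


def handle_block(datablock, my_domain, known_names):
    """Decide what to do with a datablock that may lack _audit.domain."""
    declared = datablock.get("_audit.domain")
    if declared == my_domain:
        return "accept"
    if declared is not None:
        return "reject"                      # explicitly from another domain
    if looks_like_my_domain(datablock, known_names):
        return "accept_with_warning"         # choice (iii): warn in output
    return "reject"                          # choice (iv): terminate


known = {"_cell.length_a", "_cell.length_b"}
print(handle_block({"_cell.length_a": "5.4", "_cell.length_b": "5.4"}, "IUCr", known))
# accept_with_warning
```

An interactive program could replace the "accept_with_warning" branch with a prompt, as in choice (i).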

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: The namespace problem and a possible solution

Post by jcbollinger » Fri Jan 11, 2013 9:03 pm

jamesrhester wrote:
jcbollinger wrote:
  • For the _audit.domain item to be of much use, its values need to be drawn from a list of standard alternatives. Domains could self-select, but if CIF were some day to spread to a large number of domains then the risk of domain collisions would be non-negligible. Thus, a complete solution requires a central registry of CIF domains, yet requiring domain registration limits (slightly) how independent domains can really be.

I think that there should be a simple central registry, run by the IUCr. [...]

That sounds fine, as long as the IUCr is willing to undertake the effort. If not, then I suppose it won't happen, as I can't imagine any other organization doing it.
  • We should consider how the _audit.domain item interacts with data blocks and save frames. As I presently see it, CIF holds that only an _audit.domain item appearing in the same frame could apply to items in a save frame, and only one appearing in the same block could apply to items appearing directly in a data block. Is that what is wanted?

The relationship of save frames to the datablock has been little discussed, presumably because up until now they have only been used in dictionaries as a convenient encapsulation device. I do discuss some of the issues in the PyCIFRW paper, section 2.3.2. For the present discussion it must be true that the scope of an _audit.domain dataname appearing in the datablock proper would include the contents of save frames, as otherwise the datafile could contain datanames from different domains, which we are trying to avoid. This behaviour is broadly in line with dataname use in DDL2 dictionaries.

The relevant point that I draw from the PyCIFRW paper is that existing CIF [dictionary] usage demands that a data block and the save frames within be interpreted and validated as a unified whole, at least to some extent. I'm not altogether happy with the vagueness in that area, but I'll accept that one appearance of _audit.domain can suffice for a block and all its save frames.
jamesrhester wrote:If _audit.domain appears in a save frame, and that save frame is then incorporated into a datablock, as the DDLm paper proposes, then the datanames appearing in the save frame must come from the same domain as the datablock datanames if we are to avoid mixing domains.

One could imagine using the _audit.domain item to support embedding CIF from one domain inside CIF from a different domain (inside a save frame). In other words, it is not self-evident that completely avoiding mixing domains should be the objective. On the other hand, it is not at all clear whether providing for that usage would be desirable or wise.

I think, though, that I would be most comfortable to say that the scope of an _audit.domain appearing directly in a data block extends to all save frames within that block, whereas the scope of such an item appearing in a save frame is only that frame. In that case it remains to be settled whether a save frame is permitted to override an _audit.domain declared by its host block (as distinguished from declaring one when the host block does not). I am not comfortable with the concept that an _audit.domain declared in a save frame should apply to the host data block or to sibling save frames.
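The scoping rule I describe can be sketched in Python as below. Whether a frame may override its host block is exactly the unsettled point, so it appears as a parameter; the function name and that parameter are hypothetical.

```python
def effective_domain(block_domain, frame_domain=None, allow_override=False):
    """Resolve the domain that applies to the items inside one save frame.

    block_domain: _audit.domain declared directly in the host data block, or None.
    frame_domain: _audit.domain declared inside the save frame itself, or None.
    A frame-level declaration never affects the host block or sibling frames,
    so only these two values are needed.
    """
    if frame_domain is None:
        return block_domain              # inherited from the host block
    if block_domain is None or frame_domain == block_domain:
        return frame_domain              # frame declares its own, no conflict
    if allow_override:
        return frame_domain              # frame overrides the host block
    raise ValueError(
        f"save frame declares {frame_domain!r} but host block declares {block_domain!r}"
    )


print(effective_domain("IUCr"))                                       # IUCr
print(effective_domain(None, "materials"))                            # materials
print(effective_domain("IUCr", "materials", allow_override=True))     # materials
```

With allow_override left at False, a conflicting frame-level declaration is treated as an error rather than silently mixing domains.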
jamesrhester wrote:I think that only [...] a separate 'audit' dictionary that is considered common to all domains, and is always implied in any CIF file [can work].

Fair enough.
  • As a related, minor issue, we should consider with what name formalism(s) the item should comply

I continue to assert that semantically there is no such thing as a DDL1 or DDL2 dataname, as the dataname structure carries no information.

My apologies, I did not mean to imply otherwise.
jamesrhester wrote:Given that DDLm has adopted the DDL2 conventions, I think that a DDL2-style name is adequate. DDL1 programs are perfectly capable of checking and handling datanames that contain a period.

Indeed so. I mainly wanted to establish that we're talking about having just one universal data name for this thing, not different names for the same item in different contexts.
  • We should make recommendations as to how to handle CIFs that do not carry the _audit.domain item.

I believe that this is really up to the programmer and particular problem space.

Yes, of course it is. I'm just suggesting that exactly the kind of discussion you described should accompany the official documentation, preferably one leading to some recommended practices. I don't think the content of such a discussion needs to be settled here and now.


John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: The namespace problem and a possible solution

Post by jamesrhester » Tue Jan 29, 2013 5:35 am

I have just addressed the issue of the domain of embedded save frames in a separate post, found here.
