Status update

Post by **jcbollinger** » Wed Sep 11, 2013 6:05 pm

I have completed initial implementation of the CIF formatter (function cif_write() in the API docs), and I am now working on the parser. A preliminary version of the scanner is in place, and the rest of the parser is roughed in, but there is still a lot of work to do.

The decision to define CIF 2.0 as UTF-8-only both simplifies and complicates these bits. On one hand, it is a great simplification to avoid encoding detection and adaptation issues. On the other hand, CIF 2.0 is now fundamentally a binary format, even though it can do a pretty good imitation of a text format on many systems. As a result, parsing code that is meant to be portable to systems whose default character encoding is not ASCII-compatible (e.g. EBCDIC variants) must identify every character by its Unicode code point value. C character literals such as '_' are not portable for this purpose, because they are interpreted according to the compiler's default encoding. I think it is still possible to use standard tools such as lex/flex and yacc/bison, but care is necessary in defining syntax rules and perhaps grammar rules.
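To illustrate the portability point, here is a minimal sketch in C, not the actual scanner code. It decodes UTF-8 bytes to code points and compares them against numeric constants instead of character literals. The names `CP_UNDERSCORE`, `utf8_decode()` and `starts_data_name()` are hypothetical, chosen only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Code points written numerically, so their values do not depend on the
 * compiler's character encoding (hypothetical names for illustration). */
#define CP_UNDERSCORE 0x5FU   /* U+005F LOW LINE */

/*
 * Decode one UTF-8 sequence from s (len bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * truncated, or not well-formed UTF-8.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need, i;
    uint32_t c;

    if (len == 0) return 0;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len < need) return 0;       /* truncated sequence */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return need;
}

/* Nonzero if the buffer begins with U+005F, the first character of a
 * CIF data name. The comparison uses the numeric code point, not '_'. */
int starts_data_name(const unsigned char *s, size_t len)
{
    uint32_t cp;
    return utf8_decode(s, len, &cp) > 0 && cp == CP_UNDERSCORE;
}
```

For this to work as intended, the input would have to be read as raw bytes (for example via `fopen()` with mode `"rb"`), so that the C library performs no text-mode translation before the decoder sees the data.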

Note also that this binary vs. text distinction can constitute an incompatibility between CIF 1.1 and CIF 2.0, depending on how you interpret the CIF 1.1 specs. If you interpret CIF 1.1 as specifying an encoding-agnostic text format, as some have consistently and persuasively done, then there are systems on which no well-formed, non-trivial CIF 1.1 file is well-formed CIF 2.0 (relative to the current specs as I understand them), and vice versa. I can and will address this issue in the API implementation, so that it is as transparent as possible to users, but if there are any objections or points for discussion, I would prefer to hear them now, before I do the work.