Survey of existing software

Forum for CIF developers to define an application programming interface for CIF software.

Moderators: Brian McMahon, jcbollinger

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Tue Dec 11, 2012 5:42 pm

jamesrhester wrote: I have finally returned to thinking about the CIFAPI, and would like to discuss the possibility that Richard Gildea's iotbx.cif work could form the basis of an API that we can go forward with. To summarise: cctbx contains an iotbx module which reads in CIF files. The parsing is done using C++ code generated by ANTLR from a grammar file that looks quite similar to the CIF BNF. The nature of the data structure constructed by this generated parser is left unspecified by using C++ virtual functions that need to be defined in any particular compilation.


I am singularly uninterested in any approach that fundamentally relies on C++. However, inasmuch as one of the points raised in favor of iotbx.cif is that the data structure it yields is defined by the application, I could accept moving forward on the basis of defining the (standard C) data structure to be produced, and/or a set of (standard C) functions for accessing and manipulating it, per your (1) and (2) below. A working library with iotbx.cif as the back end could serve as a reference implementation.

jamesrhester wrote: If we go down this path, the following work appears necessary.

(1) Choosing a standard data structure
(2) Developing the higher-level functions to manipulate and write the data structure, as outlined in our requirements.
(3) Adding CIF 2.0 support, which should be relatively easy given the simplicity of the ANTLR grammar files


Items (1) and (2) are surely the priority. Item (3) is less important, because the API doesn't depend on a specific parser implementation, and also because a lot of development and testing can be performed based on v1.1 CIFs. Item (2), of course, is most of what this group is tasked with devising.

jamesrhester wrote: As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does, however, create an extra library dependency (670K for libsqlite3 on Linux).

A useful standard for judging footprint would be a flex+bison parser with a naive C data structure.


I am intrigued by the idea of an SQLite-based approach. It could be very powerful, provided that we can be confident of representing all valid CIF instance documents in a useful relational form. To be able to accommodate CIFs written against an arbitrary dictionary or no dictionary at all, the schemas would need to be very simple -- SQL realizations of the CIF Data Model, in fact. I think those would qualify as useful forms. This idea definitely bears investigation.
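
As a rough illustration of what "SQL realizations of the CIF Data Model" might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names (data_block, cif_loop, item_value) and the helper functions are assumptions made up for this example, not an agreed schema or API. The sketch treats a non-looped item as a loop with a single packet, stores every value as text, and ignores save frames and CIF 2.0 list/table values.

```python
import sqlite3

# Hypothetical schema: one row per data block, per loop, and per value.
# It does not enforce that a data name is unique within a block.
SCHEMA = """
CREATE TABLE data_block (
    block_id INTEGER PRIMARY KEY,
    code     TEXT NOT NULL UNIQUE
);
CREATE TABLE cif_loop (
    loop_id  INTEGER PRIMARY KEY,
    block_id INTEGER NOT NULL REFERENCES data_block(block_id)
);
CREATE TABLE item_value (
    loop_id  INTEGER NOT NULL REFERENCES cif_loop(loop_id),
    name     TEXT NOT NULL,
    packet   INTEGER NOT NULL,
    value    TEXT,
    PRIMARY KEY (loop_id, name, packet)
);
"""


def open_store():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_block(conn, code):
    """Insert a data block and return its id."""
    cur = conn.execute("INSERT INTO data_block (code) VALUES (?)", (code,))
    return cur.lastrowid


def add_loop(conn, block_id, names, packets):
    """Insert a loop: 'names' are the data names, 'packets' the rows."""
    cur = conn.execute("INSERT INTO cif_loop (block_id) VALUES (?)", (block_id,))
    loop_id = cur.lastrowid
    rows = [
        (loop_id, name.lower(), index, value)  # data names are case-insensitive
        for index, packet in enumerate(packets)
        for name, value in zip(names, packet)
    ]
    conn.executemany("INSERT INTO item_value VALUES (?, ?, ?, ?)", rows)
    return loop_id


def add_item(conn, block_id, name, value):
    """Insert a non-looped item as a one-name, one-packet loop."""
    return add_loop(conn, block_id, [name], [[value]])


def get_values(conn, code, name):
    """Return all values of a data name in a block, in packet order."""
    rows = conn.execute(
        """SELECT v.value
           FROM item_value v
           JOIN cif_loop l ON l.loop_id = v.loop_id
           JOIN data_block b ON b.block_id = l.block_id
           WHERE b.code = ? AND v.name = ?
           ORDER BY v.packet""",
        (code, name.lower()),
    )
    return [row[0] for row in rows]


conn = open_store()
block = add_block(conn, "example")
add_item(conn, block, "_cell_length_a", "5.959")
add_loop(conn, block,
         ["_atom_site_label", "_atom_site_type_symbol"],
         [["C1", "C"], ["O1", "O"]])
print(get_values(conn, "example", "_atom_site_label"))  # prints ['C1', 'O1']
```

Because nothing in this layout depends on a dictionary, it can hold items from any CIF; the price is that a query for "all packets of this loop as rows" needs a pivot or self-join rather than a plain SELECT.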


John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: Survey of existing software

Post by jamesrhester » Tue Dec 11, 2012 11:12 pm

jcbollinger wrote:
I am singularly uninterested in any approach that fundamentally relies on C++. However, inasmuch as one of the points raised in favor of iotbx.cif is that the data structure it yields is defined by the application, I could accept moving forward on the basis of defining the (standard C) data structure to be produced, and/or a set of (standard C) functions for accessing and manipulating it, per your (1) and (2) below. A working library with iotbx.cif as the back end could serve as a reference implementation.


My understanding is that ANTLR produces C code, and the iotbx project does some simple trickery to turn the files into C++ files. So we can avoid the C++ linkage altogether by simply defining our data structure and slotting the manipulation functions (6 or so) into the places where the C++ virtual functions currently sit in the grammar specification. So no issue there.
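
The general shape of that arrangement, a parser that only reports events and a handler that decides what data structure to build, can be sketched in a language-neutral way. The following is written in Python purely for brevity; in C the handler would more likely be a struct of function pointers. The method names (start_block, add_item, add_loop) and the replay function are hypothetical and are not taken from iotbx.cif or from the ANTLR grammar.

```python
class DictBuilder:
    """Hypothetical handler that builds nested dicts from parse events."""

    def __init__(self):
        self.blocks = {}      # block code -> {data name -> list of values}
        self._current = None  # dict for the block being filled

    def start_block(self, code):
        self._current = self.blocks.setdefault(code, {})

    def add_item(self, name, value):
        # A non-looped item is stored as a one-element list.
        self._current[name.lower()] = [value]

    def add_loop(self, names, packets):
        # Store each looped data name as a column of values.
        for column, name in enumerate(names):
            self._current[name.lower()] = [packet[column] for packet in packets]


def replay(events, handler):
    """Stand-in for a generated parser: dispatch each event to the handler."""
    for kind, *args in events:
        getattr(handler, kind)(*args)


events = [
    ("start_block", "example"),
    ("add_item", "_cell_length_a", "5.959"),
    ("add_loop", ["_atom_site_label"], [["C1"], ["O1"]]),
]
builder = DictBuilder()
replay(events, builder)
print(builder.blocks["example"]["_atom_site_label"])  # prints ['C1', 'O1']
```

Swapping DictBuilder for a different handler with the same three methods changes the resulting data structure without touching the code that produces the events.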

jcbollinger wrote:
jamesrhester wrote: As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does, however, create an extra library dependency (670K for libsqlite3 on Linux).

A useful standard for judging footprint would be a flex+bison parser with a naive C data structure.


I am intrigued by the idea of an SQLite-based approach. It could be very powerful, provided that we can be confident of representing all valid CIF instance documents in a useful relational form. To be able to accommodate CIFs written against an arbitrary dictionary or no dictionary at all, the schemas would need to be very simple -- SQL realizations of the CIF Data Model, in fact. I think those would qualify as useful forms. This idea definitely bears investigation.


The PDB have been interpreting CIF files as relational databases for well over 10 years, so I can't take credit for the idea. If this is worth pursuing, then we should start another thread, I think.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Wed Dec 12, 2012 2:08 pm

jamesrhester wrote: The PDB have been interpreting CIF files as relational databases for well over 10 years, so I can't take credit for the idea. If this is worth pursuing, then we should start another thread, I think.

Yes and no. As I understand it, PDB have been interpreting mmCIF files as relational databases, an activity that is supported by design by DDL2 and mmCIF. It is a somewhat different endeavor to interpret arbitrary CIF as a DB, and especially to do so in a way that provides capabilities similar to those that PDB enjoys in its narrower problem space. But I think it can be done, and that it would be useful to do it.
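
To illustrate why the mmCIF case is the easier one: mmCIF data names have the form _category.item, so each category maps naturally onto a table and each item onto a column. The sketch below, again using Python's sqlite3 module, derives the table directly from the data names. The function names split_name and load_category are hypothetical, every column is stored as text, and no key or type constraints from the dictionary are applied.

```python
import sqlite3


def split_name(data_name):
    """Split '_category.item' into (category, item).

    A name without a '.' yields an empty item, which is exactly the
    situation this table-per-category mapping cannot handle.
    """
    category, _, item = data_name.lstrip("_").partition(".")
    return category, item


def load_category(conn, names, packets):
    """Create one table for a looped mmCIF category and fill it."""
    category = split_name(names[0])[0]
    columns = [split_name(name)[1] for name in names]
    column_sql = ", ".join(f'"{column}" TEXT' for column in columns)
    conn.execute(f'CREATE TABLE "{category}" ({column_sql})')
    marks = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{category}" VALUES ({marks})', packets)


conn = sqlite3.connect(":memory:")
load_category(
    conn,
    ["_atom_site.id", "_atom_site.type_symbol"],
    [("1", "C"), ("2", "O")],
)
rows = conn.execute(
    "SELECT id FROM atom_site WHERE type_symbol = ?", ("O",)
).fetchall()
print(rows)  # prints [('2',)]
```

For a name such as _atom_site_label, split_name returns an empty item, so grouping items into tables for arbitrary CIF has to come from somewhere else, for example from loop membership or from a dictionary when one is available.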


John
