Survey of existing software

Re: Survey of existing software

by jcbollinger » Wed Dec 12, 2012 2:08 pm

jamesrhester wrote:The PDB have been interpreting CIF files as relational databases for well over 10 years so I can't take credit for the idea. If this is worth pursuing then we should start another thread, I think.

Yes and no. As I understand it, PDB have been interpreting mmCIF files as relational databases, an activity that is supported by design by DDL2 and mmCIF. It is a somewhat different endeavor to interpret arbitrary CIF as a DB, and especially to do so in a way that provides capabilities similar to those that PDB enjoys in its narrower problem space. But I think it can be done, and that it would be useful to do it.


John

Re: Survey of existing software

by jamesrhester » Tue Dec 11, 2012 11:12 pm

jcbollinger wrote:
I am singularly uninterested in any approach that fundamentally relies on C++. However, inasmuch as one of the points raised in favor of iotbx.cif is that the data structure it yields is defined by the application, I could accept moving forward on the basis of defining the (standard C) data structure to be produced, and / or a set of (standard C) functions for accessing and manipulating it, per your (1) and (2) below. A working library with iotbx.cif as the back end could serve as a reference implementation.


My understanding is that ANTLR produces C code, and the iotbx project does some simple trickery to turn the files into C++ files. So we can avoid the C++ linkage altogether by simply defining our datastructure and slotting the manipulation functions (6 or so) into the places where the C++ virtual functions currently sit in the grammar specification. So no issue there.

jcbollinger wrote:
jamesrhester wrote:As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does however create an extra library dependency (670K for libsqlite3 on Linux).

A useful standard for judging footprint would be a flex+bison parser with a naive C datastructure.


I am intrigued by the idea of an SQLite-based approach. It could be very powerful, provided that we can be confident of representing all valid CIF instance documents in a useful relational form. To be able to accommodate CIFs written against an arbitrary dictionary or no dictionary at all, the schemas would need to be very simple -- SQL realizations of the CIF Data Model, in fact. I think those would qualify as useful forms. This idea definitely bears investigation.


The PDB have been interpreting CIF files as relational databases for well over 10 years so I can't take credit for the idea. If this is worth pursuing then we should start another thread, I think.

Re: Survey of existing software

by jcbollinger » Tue Dec 11, 2012 5:42 pm

jamesrhester wrote:I have finally returned to thinking about the CIFAPI, and would like to discuss the possibility that Richard Gildea's iotbx.cif work could form the basis of an API that we can go forward with. To summarise: cctbx contains an iotbx module which reads in CIF files. The parsing is done using C++ code generated by ANTLR from a grammar file that looks quite similar to the CIF BNF. The nature of the datastructure constructed by this generated parser is left unspecified by using C++ virtual functions that need to be defined in any particular compilation.


I am singularly uninterested in any approach that fundamentally relies on C++. However, inasmuch as one of the points raised in favor of iotbx.cif is that the data structure it yields is defined by the application, I could accept moving forward on the basis of defining the (standard C) data structure to be produced, and / or a set of (standard C) functions for accessing and manipulating it, per your (1) and (2) below. A working library with iotbx.cif as the back end could serve as a reference implementation.

jamesrhester wrote:If we go down this path, the following work appears necessary.

(1) Choosing a standard datastructure
(2) Development of the higher-level functions to manipulate and write the datastructure, as outlined in our requirements.
(3) Adding CIF2.0 support - should be relatively easy given the simplicity of the ANTLR grammar files


Items (1) and (2) are surely the priority. Item (3) is less important, because the API doesn't depend on a specific parser implementation, and also because a lot of development and testing can be performed based on v1.1 CIFs. Item (2), of course, is most of what this group is tasked with devising.

jamesrhester wrote:As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does however create an extra library dependency (670K for libsqlite3 on Linux).

A useful standard for judging footprint would be a flex+bison parser with a naive C datastructure.


I am intrigued by the idea of an SQLite-based approach. It could be very powerful, provided that we can be confident of representing all valid CIF instance documents in a useful relational form. To be able to accommodate CIFs written against an arbitrary dictionary or no dictionary at all, the schemas would need to be very simple -- SQL realizations of the CIF Data Model, in fact. I think those would qualify as useful forms. This idea definitely bears investigation.
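To make the "very simple schema" idea concrete, here is a minimal sketch using Python's standard sqlite3 module. It is only an illustration under stated assumptions, not a proposed design: the table names (cif_block, cif_loop, cif_item, cif_value) and the helper functions are hypothetical, save frames are left out, a scalar item is treated as a loop with a single packet, and the case-insensitivity of CIF data names is ignored.

```python
import sqlite3

def make_cif_db():
    """Create an in-memory database with a hypothetical, dictionary-free schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cif_block (id INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
        CREATE TABLE cif_loop  (id INTEGER PRIMARY KEY,
                                block_id INTEGER NOT NULL REFERENCES cif_block(id));
        CREATE TABLE cif_item  (id INTEGER PRIMARY KEY,
                                loop_id INTEGER NOT NULL REFERENCES cif_loop(id),
                                name TEXT NOT NULL);
        CREATE TABLE cif_value (item_id INTEGER NOT NULL REFERENCES cif_item(id),
                                packet INTEGER NOT NULL,
                                val TEXT,
                                PRIMARY KEY (item_id, packet));
    """)
    return db

def add_block(db, code):
    """Add a data block and return its row id."""
    return db.execute("INSERT INTO cif_block (code) VALUES (?)", (code,)).lastrowid

def add_loop(db, block_id, names, packets):
    """Add a loop with the given data names and packets (lists of values)."""
    loop_id = db.execute(
        "INSERT INTO cif_loop (block_id) VALUES (?)", (block_id,)).lastrowid
    item_ids = [
        db.execute("INSERT INTO cif_item (loop_id, name) VALUES (?, ?)",
                   (loop_id, name)).lastrowid
        for name in names
    ]
    for row, packet in enumerate(packets):
        db.executemany(
            "INSERT INTO cif_value (item_id, packet, val) VALUES (?, ?, ?)",
            [(item_id, row, val) for item_id, val in zip(item_ids, packet)])
    return loop_id

def values_of(db, code, name):
    """Return all values of one data name in one block, in packet order."""
    rows = db.execute("""
        SELECT v.val FROM cif_value v
        JOIN cif_item  i ON i.id = v.item_id
        JOIN cif_loop  l ON l.id = i.loop_id
        JOIN cif_block b ON b.id = l.block_id
        WHERE b.code = ? AND i.name = ?
        ORDER BY v.packet""", (code, name))
    return [row[0] for row in rows]

def remove_packet(db, loop_id, packet):
    """Remove one packet (one row of related values) from a loop."""
    db.execute("""
        DELETE FROM cif_value
        WHERE packet = ? AND item_id IN
              (SELECT id FROM cif_item WHERE loop_id = ?)""", (packet, loop_id))
```

With a schema of this kind, several of the required manipulations (querying the values of a data name, removing a packet from a loop) reduce to a single SQL statement, which is the attraction of the approach.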


John

Re: Survey of existing software

by jamesrhester » Tue Dec 11, 2012 4:52 am

I have finally returned to thinking about the CIFAPI, and would like to discuss the possibility that Richard Gildea's iotbx.cif work could form the basis of an API that we can go forward with. To summarise: cctbx contains an iotbx module which reads in CIF files. The parsing is done using C++ code generated by ANTLR from a grammar file that looks quite similar to the CIF BNF. The nature of the datastructure constructed by this generated parser is left unspecified by using C++ virtual functions that need to be defined in any particular compilation. In the case of iotbx, these datastructures are defined by Python, so we can't leverage much more of iotbx beyond the parser.

Richard has unbundled the CIF parser into a standalone parser, with instructions. When compiling, about 20 ANTLR runtime files need to be compiled and linked into the final executable/library.

If we go down this path, the following work appears necessary.

(1) Choosing a standard datastructure
(2) Development of the higher-level functions to manipulate and write the datastructure, as outlined in our requirements.
(3) Adding CIF2.0 support - should be relatively easy given the simplicity of the ANTLR grammar files

As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does however create an extra library dependency (670K for libsqlite3 on Linux).

A useful standard for judging footprint would be a flex+bison parser with a naive C datastructure.

Re: Survey of existing software

by jamesrhester » Tue Apr 17, 2012 2:22 am

jcbollinger wrote:Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFLib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?


John


I would be in favour of dropping the CIF2.0 support and the full save frame support, on the basis that the API can be designed such that both of these items can be added at a later date. For CIF2.0, the API specification can be CIF grammar agnostic (e.g. a grammar='xxx' keyword in a file read function). I am in favour of any implementation either directly supporting CIF2.0 datastructures immediately (Unicode, tables and lists) or else allowing them to be added in a modular fashion at a later date. Save frames are essentially the same data structure as a CIF datablock, so they can also be added fairly quickly at a later date.
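As a rough illustration of a grammar-agnostic read function, here is a minimal Python sketch. The names read_cif, detect_cif_version and the parsers registry are hypothetical and are not taken from any of the packages surveyed. The only outside fact relied on is the version comment (#\#CIF_1.1 or #\#CIF_2.0) that the CIF specifications define for the first line of a file; files without it are assumed here to be CIF 1.1.

```python
MAGIC = "#\\#CIF_"  # the literal characters  #\#CIF_  that open a version comment

def detect_cif_version(text, default="1.1"):
    """Guess the CIF version from the leading comment, if there is one."""
    if text.startswith("\ufeff"):  # tolerate a leading byte-order mark
        text = text[1:]
    first_line = text.split("\n", 1)[0].rstrip()
    if first_line.startswith(MAGIC):
        return first_line[len(MAGIC):]
    return default

def read_cif(text, grammar=None, parsers=None):
    """Parse CIF text with the parser registered for the chosen grammar.

    grammar: explicit version string such as "1.1" or "2.0"; when None,
             the version is taken from the leading comment.
    parsers: hypothetical registry mapping version strings to callables
             that accept the text and return the parsed structure.
    """
    parsers = parsers or {}
    version = grammar if grammar is not None else detect_cif_version(text)
    try:
        parse = parsers[version]
    except KeyError:
        raise ValueError(f"no parser registered for CIF grammar {version!r}")
    return parse(text)
```

Under this arrangement a CIF 2.0 parser could be registered later without changing the signature that callers already use.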

I am strongly in favour of sticking to C/C++ for the implementation language, and am agnostic on the parse error recovery and small-footprint operation requirements.

Here's a suggestion: how about we look at CBFLib, CIFPARSE-OBJ and iotbx.cif (being the only C/C++ candidates) and perform a quick "gap analysis" to see how much work it would take in each case to bring the software into conformance with our requirements? For example, all that may be required to produce an error-tolerant parser is to replace a hand-written parser with a yacc or ANTLR-based parser. On the other hand, changing to small-footprint operation may require a complete rewrite of anything handling the CIF data structure and therefore be a lot of work.

Note that the above paragraph is predicated on this software having a suitably liberal license that will allow us to edit and redistribute, potentially with a different name.

Re: Survey of existing software

by jcbollinger » Wed Mar 21, 2012 9:04 pm

jcbollinger wrote:Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFLib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?

It was my intention to allow others to go first, but since that hasn't happened yet, I'll go ahead and offer my two cents:

1) As far as I am concerned, full save frame support is a must-have.

2) If we don't include CIF 2.0 support now, then we'll need to come back soon to extend the API to include it. It will be far easier and better to account for CIF 2.0 in the initial design, so I don't see myself agreeing to relax that requirement.

3) I would be willing to flex somewhat on error detection and recovery, in that I would be willing to allow API parse functions to stop when an error is detected, provided that they yielded a machine-interpretable description of the nature and location of the (first) error. I nevertheless consider such a solution inferior to automated error recovery (still requiring the parse results to describe the nature and location of all errors detected).

4) As for language target, I could imagine writing language-neutral API specifications that would admit independent implementations in various languages, instead of a concrete specification in C terms. That supposes concrete implementations in different languages might be independent, instead of simply being bindings to a common, underlying C interface. It is unclear to me whether such an approach would adequately satisfy this group's mandate, however.

5) Finally, fast, small-footprint operation. It looks like none among the surveyed software except JCIF fully provides this feature. As I interpret Herbert, even CBFlib transiently requires a large footprint before it can shift to lightweight operation. Does this mean that fast, small-footprint operation is not an important requirement? I am inclined to think otherwise, but if small-footprint mode is important then why is it not widely implemented?


John

Re: Survey of existing software

by jcbollinger » Fri Mar 16, 2012 5:39 pm

Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFLib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?


John

Re: Survey of existing software

by jcbollinger » Mon Mar 05, 2012 6:31 pm

Having seen some apparent trends in the detailed responses so far, I would like to provide a review of my own current CIF parser, which I have tentatively named "JCIF". This software is not currently available publicly, and the IP policy I am working under could mean that it never is publicly released. This is therefore not a candidate for adoption as the standard API. I present it mainly to provide a contrast on some of the requirements.

  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) JCIF supports both CIF 1.1 and CIF 2.0, including both explicit version specification and automatic version divining via the leading comment.
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. JCIF supports both. It offers an event-based CIF parser using callbacks, analogous to the XML parser expat and Java's built-in SAX API. It provides a DOM-like tree-building module on top of that by which users can conveniently parse CIFs to data objects.
  • The API will support inputting and parsing CIF text from external sources. JCIF supports this.
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. JCIF supports this.
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks JCIF supports this
    • adding save frames to and removing them from data blocks JCIF supports this
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame JCIF supports this.
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame JCIF supports this.
    • removing data names and their associated data values from a block or frame JCIF supports this
    • querying the data value(s) associated with a specified data name within a block or frame JCIF supports this
    • replacing one or more data values associated with a specified data name within a block or frame JCIF supports this
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop) JCIF supports this
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop) JCIF supports this
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop) JCIF supports this
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop) JCIF supports this
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. JCIF is amenable to both intra-parse and post-parse validation, but it offers no built-in validation at this time
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) JCIF is written in Java
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. JCIF supports only Java
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. JCIF supports this.
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. JCIF provides extensive, pervasive error reporting and recovery. The event-driven interface allows users to register a callback to be invoked in the event of syntax or parse errors. The nature and location of each error is provided to the error callback (if any), and the callback opts either to continue or terminate parsing, in addition to whatever other action it may perform. The DOM-based parser implements an error-recovery fallback for every error it can detect. The default behavior is to parse the whole input, recovering automatically from any errors.
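
The callback arrangement described above (events delivered to user-supplied handlers, with an error callback that decides whether parsing continues) can be sketched roughly as follows. This is a toy in Python, not JCIF's actual interface: the function name and callback signatures are hypothetical, and the tokenizer is deliberately naive (whitespace-separated tokens only, no loops, quoted strings, text fields or save frames).

```python
def parse_events(text, on_block, on_item, on_error):
    """Toy event-driven parser for a tiny subset of CIF.

    on_block(code)          called for each data_ block header
    on_item(name, value)    called for each data name / value pair
    on_error(line, message) called for each error; return True to continue
                            parsing, False to terminate
    Returns True if parsing reached the end of the input, False if the
    error callback asked to stop.
    """
    pending = None  # data name that is still waiting for its value
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            if token.startswith("#"):
                break  # the rest of the line is a comment
            is_name = token.startswith("_")
            is_block = token.lower().startswith("data_")
            if pending is not None and (is_name or is_block):
                # Recover by dropping the name that never received a value.
                if not on_error(lineno, f"no value for {pending}"):
                    return False
                pending = None
            if is_block:
                on_block(token[5:])
            elif is_name:
                pending = token
            elif pending is not None:
                on_item(pending, token)
                pending = None
            elif not on_error(lineno, f"unexpected value {token!r}"):
                return False
    if pending is not None and not on_error(lineno, f"no value for {pending}"):
        return False
    return True
```

A tree-building layer of the kind mentioned above could be written as a set of callbacks that append to an in-memory structure, with an on_error that records each problem and returns True so that the whole input is parsed.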

Re: Survey of existing software

by jcbollinger » Mon Mar 05, 2012 5:30 pm

jcbollinger wrote:The first I received are from John Westbrook, for several of the RCSB's offerings:
CIFLIB is a historic C++ parsing library that we developed at the very beginning of the
mmCIF project. This is replaced by CIFPARSE-OBJ

We are currently using and supporting CIFPARSE-OBJ which is C++/STL library providing:
[...]

CIFPARSE is an older C-library that provides essential parsing and accessors for mmCIF/PDBx data files.
We provide this but do not use or extend this library

I will take it upon myself to respond more specifically about CIFPARSE's and CIFPARSE-OBJ's features relative to the requirements list, based on their documentation.

CIFPARSE-OBJ is the RCSB's preferred tool in this space, so I address that first:
  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) CIFPARSE-OBJ appears not to support CIF 2.0 as of v7.1.0 (Sept, 2011).
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. CIFPARSE-OBJ appears to provide only (2).
  • The API will support inputting and parsing CIF text from external sources. CIFPARSE-OBJ supports this.
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. CIFPARSE-OBJ supports this.
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks CIFPARSE-OBJ supports adding blocks, but appears not to support removing them.
    • adding save frames to and removing them from data blocks CIFPARSE-OBJ does not appear to support this.
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame CIFPARSE-OBJ provides limited support for this, and only for data blocks.
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame CIFPARSE-OBJ supports this for data blocks only.
    • removing data names and their associated data values from a block or frame CIFPARSE-OBJ supports this for data blocks only.
    • querying the data value(s) associated with a specified data name within a block or frame CIFPARSE-OBJ supports this for data blocks only.
    • replacing one or more data values associated with a specified data name within a block or frame CIFPARSE-OBJ supports this for data blocks only.
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop) CIFPARSE-OBJ supports this for data blocks only.
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop) CIFPARSE-OBJ supports this for data blocks only.
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop) CIFPARSE-OBJ supports this for data blocks only.
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop) CIFPARSE-OBJ supports this for data blocks only.
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. CIFPARSE-OBJ supports post-parse validation
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) CIFPARSE-OBJ is written in C++, and its public API relies heavily on C++ features.
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. CIFPARSE-OBJ supports C++ natively. Bindings to other languages, including C, do not presently appear to be available.
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. CIFPARSE-OBJ supports this.
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. CIFPARSE-OBJ's support of this requirement is unclear from its documentation.
My overall impression of CIFPARSE-OBJ is that it provides a slightly higher level, more application-specific API than our requirements are aimed at. It is also focused specifically on mmCIF, and some aspects of the API (data structure, names) reflect that focus.

As for the older, unsupported CIFPARSE:
  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) CIFPARSE supports only CIF 1.1
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. CIFPARSE supports only (2)
  • The API will support inputting and parsing CIF text from external sources. CIFPARSE supports this
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. CIFPARSE supports this
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks CIFPARSE supports this
    • adding save frames to and removing them from data blocks CIFPARSE does not support this
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame CIFPARSE provides limited support for this
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame CIFPARSE supports this for data blocks only
    • removing data names and their associated data values from a block or frame CIFPARSE supports this for data blocks only
    • querying the data value(s) associated with a specified data name within a block or frame CIFPARSE supports this for data blocks only
    • replacing one or more data values associated with a specified data name within a block or frame CIFPARSE supports this (one value at a time) for data blocks only
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop) CIFPARSE supports this, requiring one function call for each value retrieved, for data blocks only
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop) CIFPARSE supports this, requiring one function call for each value retrieved, for data blocks only
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop) CIFPARSE supports this, requiring multiple function calls, for data blocks only
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop) CIFPARSE supports this for data blocks only
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. CIFPARSE supports this
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) CIFPARSE is written in C. Specific standards-compliance is unclear.
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. CIFPARSE documentation does not mention bindings to other languages.
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. CIFPARSE appears to support this
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. CIFPARSE's support for this is unclear from its documentation.
Like CIFPARSE-OBJ, CIFPARSE is focused specifically on mmCIF. That appears to manifest mainly in naming choices and (presumed) specificity to DDL2 for dictionaries.

Re: Survey of existing software

by yayahjb » Sun Jan 29, 2012 3:46 am

CBFlib is an ANSI C API similar to the PDB's CIFlib with support for imgCIF, DDL1 CIF, DDL2 CIF, 2009 DDLm CIF, and with extensive optional validation. The DDLm CIF support uses PyCIFRW for method evaluation.
CBFlib is available at
http://www.bernstein-plus-sons.com/software/CBF/
and
http://sf.net/projects/cbflib/
CBFlib is a Debian package, a Gentoo package and an Ubuntu package. It is under review as a Fedora package. Unfortunately that required us to decouple it from PyCIFRW because of PyCIFRW license issues, which complicates CIF2 support.

CBFlib
    CIF 1.1 support: Yes
    CIF 2 support: No (in progress)
    Fast small footprint operation: Yes, but starts with full tree read
    Rich expressive and convenient public interface: Yes, especially via the Python wrapper
    Determining the presence of data names and their contexts: Yes
    Add data name: Yes
    Remove data names and value: Yes
    Query data values: Yes
    Replace specific data values: Yes
    Query sets of related data values: Yes
    Replace sets of related data values: Yes (one at a time)
    Add sets of related data values: Yes (one at a time)
    Remove sets of related data values: Yes (one at a time)
    Optional validation: Yes
    Target C99, avoiding incompatibility with C89: Yes
    Accessible from C++: Yes
    Accessible from Fortran 77: Yes, via wrappers.
    Accessible from Fortran 95: Yes, via wrappers
    Accessible from Python: Yes, via SWIG
    Accessible from Java: Yes, via SWIG
    Simultaneous multi-read: Yes
    Simultaneous multi-write: Yes
    Error recovery/logging: Yes

CIFtbx is a Fortran 77 API. It supports DDL1 CIF, DDL2 CIF and, to a limited extent, 2009 DDLm CIF. It supports optional validation. It does not support method evaluation.
CIFtbx is available at
http://www.bernstein-plus-sons.com/software/ciftbx/
CIFtbx
    CIF 1.1 support: Yes
    CIF 2 support: No (partial support in progress, but limited by Fortran)
    Fast small footprint operation: Yes
    Rich expressive and convenient public interface: No
    Determining the presence of data names and their contexts: Yes
    Add data name: No
    Remove data names and value: No
    Query data values: Yes
    Replace specific data values: No
    Query sets of related data values: Yes
    Replace sets of related data values: No
    Add sets of related data values: No
    Remove sets of related data values: No
    Optional validation: Yes
    Target C99, avoiding incompatibility with C89: No (this is an f77 package)
    Accessible from C++: No (except via f2c, which is clumsy)
    Accessible from Fortran 77: Yes
    Accessible from Fortran 95: No (except for some compilers)
    Accessible from Python: No
    Accessible from Java: No
    Simultaneous multi-read: No
    Simultaneous multi-write: No
    Error recovery/logging: Yes
