Survey of existing software

Forum for CIF developers to define an application programming interface for CIF software.

Moderators: Brian McMahon, jcbollinger

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Comparing requirements list to current software

Post by jamesrhester » Thu Jan 19, 2012 3:14 am

I think it would be good for CIF software authors to compare this requirement list to their own software. My comparison with PyCIFRW follows:


jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Survey of existing software

Post by jcbollinger » Thu Jan 19, 2012 3:06 pm

In this topic please comment on the capabilities of existing CIF libraries. Discussion of how well various libraries comply with our requirements is especially sought, but other commentary is welcome, such as particular strengths, weaknesses, or special features of specific CIF libraries. This will serve multiple purposes:

1) to evaluate existing software with respect to our requirements, in hope that an existing library or a combination of them can provide a basis for our API specifications, and perhaps even its implementation

2) to provide a basis for a future discussion of how our requirements might need to be refined

3) to identify additional features we should consider including in our specifications, beyond those needed to meet our requirements

4) to identify specific problems our specifications should avoid

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Fri Jan 27, 2012 7:12 pm

I have invited authors of and contributors to the CIF software libraries on the IUCr's list to provide self-reviews of those libraries, either here or in e-mail to me. I have likewise invited Richard Gildea to review iotbx.cif. I will post the e-mail responses here.

The first responses I received are from John Westbrook, for several of the RCSB's offerings:
CIFLIB is a historic C++ parsing library that we developed at the very beginning of the
mmCIF project. This is replaced by CIFPARSE-OBJ.

We are currently using and supporting CIFPARSE-OBJ, which is a C++/STL library providing:

+ parsing tools for mmCIF/PDBx data files and dictionaries (flex/bison)

+ accessors to mmCIF data via tabular in-memory data model.

+ Efficient search using a variety of indexing options.

+ serialization/persistence in a platform independent binary format.

+ Accessors for dictionaries and DDLs attributes.

+ Methods provided for detailed dictionary based checking (e.g. data file versus domain dictionary, domain dictionary versus DDL dictionary).

+ Boost/Python wrapper is available

URL -- http://sw-tools.rcsb.org/apps/CIFPARSE-OBJ/index.html

CIFPARSE is an older C library that provides essential parsing and accessors for mmCIF/PDBx data files.
We provide this but do not use or extend this library.

I will take it upon myself to respond more specifically about CIFPARSE's and CIFPARSE-OBJ's features relative to the requirements list, based on their documentation.

yayahjb
Posts: 18
Joined: Sun Sep 11, 2011 9:54 pm

Re: Survey of existing software

Post by yayahjb » Sun Jan 29, 2012 3:46 am

CBFlib is an ANSI C API similar to the PDB's CIFlib with support for imgCIF, DDL1 CIF, DDL2 CIF,
2009 DDLm CIF, and with extensive optional validation. The DDLm CIF support uses
PyCIFRW for method evaluation.
CBFlib is available at
http://www.bernstein-plus-sons.com/software/CBF/
and
http://sf.net/projects/cbflib/
CBFlib is a Debian package, a Gentoo package and an Ubuntu package.
It is under review as a Fedora package. Unfortunately, that required us to
decouple it from PyCIFRW because of PyCIFRW license issues, which complicates
CIF2 support.

CBFlib
    CIF 1.1 support: Yes
    CIF 2 support: No (in progress)
    Fast small footprint operation: Yes, but starts with full tree read
    Rich expressive and convenient public interface: Yes, especially via the Python wrapper
    Determining the presence of data names and their contexts: Yes
    Add data name: Yes
    Remove data names and value: Yes
    Query data values: Yes
    Replace specific data values: Yes
    Query sets of related data values: Yes
    Replace sets of related data values: Yes (one at a time)
    Add sets of related data values: Yes (one at a time)
    Remove sets of related data values: Yes (one at a time)
    Optional validation: Yes
    Target C99, avoiding incompatibility with C89: Yes
    Accessible from C++: Yes
    Accessible from Fortran 77: Yes, via wrappers
    Accessible from Fortran 95: Yes, via wrappers
    Accessible from Python: Yes, via SWIG
    Accessible from Java: Yes, via SWIG
    Simultaneous multi-read: Yes
    Simultaneous multi-write: Yes
    Error recovery/logging: Yes

CIFtbx is a Fortran 77 API. It supports DDL1 CIF, DDL2 CIF and, to a limited extent,
2009 DDLm CIF. It supports optional validation. It does not support method evaluation.
CIFtbx is available at
http://www.bernstein-plus-sons.com/software/ciftbx/
CIFtbx
    CIF 1.1 support: Yes
    CIF 2 support: No (partial support in progress, but limited by Fortran)
    Fast small footprint operation: Yes
    Rich expressive and convenient public interface: No
    Determining the presence of data names and their contexts: Yes
    Add data name: No
    Remove data names and value: No
    Query data values: Yes
    Replace specific data values: No
    Query sets of related data values: Yes
    Replace sets of related data values: No
    Add sets of related data values: No
    Remove sets of related data values: No
    Optional validation: Yes
    Target C99, avoiding incompatibility with C89: No (this is an f77 package)
    Accessible from C++: No (except via f2c, which is clumsy)
    Accessible from Fortran 77: Yes
    Accessible from Fortran 95: No (except for some compilers)
    Accessible from Python: No
    Accessible from Java: No
    Simultaneous multi-read: No
    Simultaneous multi-write: No
    Error recovery/logging: Yes

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Mon Mar 05, 2012 5:30 pm

jcbollinger wrote:The first I received are from John Westbrook, for several of the RCSB's offerings:
CIFLIB is a historic C++ parsing library that we developed at the very beginning of the
mmCIF project. This is replaced by CIFPARSE-OBJ

We are currently using and supporting CIFPARSE-OBJ, which is a C++/STL library providing:
[...]

CIFPARSE is an older C-library that provides essential parsing and accessors for mmCIF/PDBx data files.
We provide this but do not use or extend this library

I will take it upon myself to respond more specifically about CIFPARSE's and CIFPARSE-OBJ's features relative to the requirements list, based on their documentation.

CIFPARSE-OBJ is the RCSB's preferred tool in this space, so I address that first:
  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) CIFPARSE-OBJ appears not to support CIF 2.0 as of v7.1.0 (Sept, 2011).
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. CIFPARSE-OBJ appears to provide only (2).
  • The API will support inputting and parsing CIF text from external sources. CIFPARSE-OBJ supports this.
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. CIFPARSE-OBJ supports this.
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks: CIFPARSE-OBJ supports adding blocks, but appears not to support removing them.
    • adding save frames to and removing them from data blocks: CIFPARSE-OBJ does not appear to support this.
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame: CIFPARSE-OBJ provides limited support for this, and only for data blocks.
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame: CIFPARSE-OBJ supports this for data blocks only.
    • removing data names and their associated data values from a block or frame: CIFPARSE-OBJ supports this for data blocks only.
    • querying the data value(s) associated with a specified data name within a block or frame: CIFPARSE-OBJ supports this for data blocks only.
    • replacing one or more data values associated with a specified data name within a block or frame: CIFPARSE-OBJ supports this for data blocks only.
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop): CIFPARSE-OBJ supports this for data blocks only.
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop): CIFPARSE-OBJ supports this for data blocks only.
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop): CIFPARSE-OBJ supports this for data blocks only.
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop): CIFPARSE-OBJ supports this for data blocks only.
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. CIFPARSE-OBJ supports post-parse validation.
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) CIFPARSE-OBJ is written in C++, and its public API relies heavily on C++ features.
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. CIFPARSE-OBJ supports C++ natively. Bindings to other languages, including C, do not presently appear to be available.
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. CIFPARSE-OBJ supports this.
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. CIFPARSE-OBJ's support of this requirement is unclear from its documentation.
My overall impression of CIFPARSE-OBJ is that it provides a slightly higher level, more application-specific API than our requirements are aimed at. It is also focused specifically on mmCIF, and some aspects of the API (data structure, names) reflect that focus.

As for the older, unsupported CIFPARSE:
  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) CIFPARSE supports only CIF 1.1
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. CIFPARSE supports only (2)
  • The API will support inputting and parsing CIF text from external sources. CIFPARSE supports this
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. CIFPARSE supports this
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks: CIFPARSE supports this
    • adding save frames to and removing them from data blocks: CIFPARSE does not support this
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame: CIFPARSE provides limited support for this
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame: CIFPARSE supports this for data blocks only
    • removing data names and their associated data values from a block or frame: CIFPARSE supports this for data blocks only
    • querying the data value(s) associated with a specified data name within a block or frame: CIFPARSE supports this for data blocks only
    • replacing one or more data values associated with a specified data name within a block or frame: CIFPARSE supports this (one value at a time) for data blocks only
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop): CIFPARSE supports this, requiring one function call for each value retrieved, for data blocks only
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop): CIFPARSE supports this, requiring one function call for each value retrieved, for data blocks only
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop): CIFPARSE supports this, requiring multiple function calls, for data blocks only
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop): CIFPARSE supports this for data blocks only
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. CIFPARSE supports this
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) CIFPARSE is written in C. Specific standards-compliance is unclear.
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. CIFPARSE documentation does not mention bindings to other languages.
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. CIFPARSE appears to support this
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. CIFPARSE's support for this is unclear from its documentation.
Like CIFPARSE-OBJ, CIFPARSE is focused specifically on mmCIF. That appears to manifest mainly in naming choices and (presumed) specificity to DDL2 for dictionaries.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Mon Mar 05, 2012 6:31 pm

Having seen some apparent trends in the detailed responses so far, I would like to provide a review of my own current CIF parser, which I have tentatively named "JCIF". This software is not currently available publicly, and the IP policy I am working under could mean that it never is publicly released. This is therefore not a candidate for adoption as the standard API. I present it mainly to provide a contrast on some of the requirements.

  • The API will support both CIF 1.1 and CIF 2.0. (Degree and form of support for non-standard features and other syntax variants is yet to be determined.) JCIF supports both CIF 1.1 and CIF 2.0, including both explicit version specification and automatic version divining via the leading comment.
  • The API will support both (1) fast, small-footprint operation, and (2) a rich, expressive, and convenient public interface; these may be provided by separate branches of the API. JCIF supports both. It offers an event-based CIF parser using callbacks, analogous to the XML parser expat and Java's built-in SAX API. It provides a DOM-like tree-building module on top of that by which users can conveniently parse CIFs to data objects.
  • The API will support inputting and parsing CIF text from external sources. JCIF supports this.
  • The API will support outputting logical CIF structure and content to external sinks as well-formed CIF text. JCIF supports this.
  • Between source, if any, and sink, if any, and in memory where applicable, the API will support all CIF-compatible inquiries and modifications of logical CIF structure and data, including, but not necessarily limited to
    • adding and removing data blocks: JCIF supports this
    • adding save frames to and removing them from data blocks: JCIF supports this
    • determining the presence of data names and their contexts (whether looped; other names in the same loop) within a block or frame: JCIF supports this.
    • adding data names to a chosen context (for example, to a particular loop) within a block or frame: JCIF supports this.
    • removing data names and their associated data values from a block or frame: JCIF supports this
    • querying the data value(s) associated with a specified data name within a block or frame: JCIF supports this
    • replacing one or more data values associated with a specified data name within a block or frame: JCIF supports this
    • querying the set(s) of related data values for a chosen context within a block or frame (for example, retrieving all the values belonging to a chosen packet of a chosen loop): JCIF supports this
    • replacing one or more of the data values belonging to one or more of the sets of related data values for a chosen context within a block or frame (for example, replacing selected values in a chosen packet of a chosen loop): JCIF supports this
    • adding one or more sets of related data values to a chosen context within a block or frame (for example, adding a packet to a chosen loop): JCIF supports this
    • removing one or more sets of related data values for a chosen context within a block or frame (for example, removing a selected packet from a chosen loop): JCIF supports this
  • The API will support optional validation. (More detailed requirements to be determined.) That validation is optional implies that to the greatest extent feasible, all the API's features will be available when validation is disabled. JCIF is amenable to both intra-parse and post-parse validation, but it offers no built-in validation at this time
  • The core API will be targeted at C99, and it will avoid any incompatibility with C89. (This is more restrictive than targeting C89 or C99 alone would be.) JCIF is written in Java
  • The API will be accessible from other languages, including, at minimum, C++, Fortran 77, Fortran 95, Python, and Java. JCIF supports only Java
  • The API will everywhere support multiple CIF files simultaneously open for reading and / or writing (including updating), and where applicable, it will support multiple independent logical CIFs in-memory simultaneously. JCIF supports this.
  • The API will provide for error recovery and informative error reporting, especially for parsing and validation operations. JCIF provides extensive, pervasive error reporting and recovery. The event-driven interface allows users to register a callback to be invoked in the event of syntax or parse errors. The nature and location of each error is provided to the error callback (if any), and the callback opts either to continue or terminate parsing, in addition to whatever other action it may perform. The DOM-based parser implements an error-recovery fallback for every error it can detect. The default behavior is to parse the whole input, recovering automatically from any errors.
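
Since JCIF itself is not public, here is a rough Python sketch of the general pattern described above: an event-driven parser that reports blocks, items and errors through callbacks, with a tree-building layer on top. It is not JCIF's API. The names `parse_events` and `parse_to_tree` are invented for illustration, and the "grammar" is a toy subset (comments, `data_` headers and single-line `_name value` pairs only).

```python
from typing import Callable, Optional


def parse_events(
    text: str,
    on_block: Callable[[str], None],
    on_item: Callable[[str, str], None],
    on_error: Optional[Callable[[int, str], bool]] = None,
) -> bool:
    """Toy event-driven parser for a tiny CIF-like subset (hypothetical API).

    Returns True if the whole input was consumed, False if the error
    callback asked to terminate parsing.
    """
    in_block = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.lower().startswith("data_"):
            on_block(line[5:])
            in_block = True
            continue
        parts = line.split(None, 1)
        if in_block and line.startswith("_") and len(parts) == 2:
            on_item(parts[0], parts[1])
            continue
        # Report the problem; with no callback, recover by skipping the line.
        if on_error is not None and not on_error(lineno, "unrecognised line: " + line):
            return False
    return True


def parse_to_tree(text: str, errors: list) -> dict:
    """DOM-like layer built on the events: {block name: {data name: value}}."""
    tree: dict = {}
    current: dict = {}

    def on_block(name: str) -> None:
        nonlocal current
        current = tree.setdefault(name, {})

    def on_item(name: str, value: str) -> None:
        current[name] = value

    def on_error(lineno: int, message: str) -> bool:
        errors.append((lineno, message))
        return True  # keep parsing after recording the error

    parse_events(text, on_block, on_item, on_error)
    return tree


SAMPLE = "# demo\ndata_demo\n_cell_length_a 5.43\nthis line is malformed\n_cell_length_b 5.43\n"

errors: list = []
tree = parse_to_tree(SAMPLE, errors)
# A callback that returns False terminates the parse at the first error.
stopped_early = not parse_events(SAMPLE, lambda n: None, lambda n, v: None, lambda ln, msg: False)
```

The event layer stores nothing itself, so a caller interested in only a few items never pays for a full tree; the tree builder is just one possible consumer of the same events.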

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Fri Mar 16, 2012 5:39 pm

Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFlib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far-fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?


John

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Survey of existing software

Post by jcbollinger » Wed Mar 21, 2012 9:04 pm

jcbollinger wrote:Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFlib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far-fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?

It was my intention to allow others to go first, but since that hasn't happened yet, I'll go ahead and offer my two cents:

1) As far as I am concerned, full save frame support is a must-have.

2) If we don't include CIF 2.0 support now, then we'll need to come back soon to extend the API to include it. It will be far easier and better to account for CIF 2.0 in the initial design, so I don't see myself agreeing to relax that requirement.

3) I would be willing to flex somewhat on error detection and recovery, in that I would be willing to allow API parse functions to stop when an error is detected, provided that they yielded a machine-interpretable description of the nature and location of the (first) error. I nevertheless consider such a solution inferior to automated error recovery (still requiring the parse results to describe the nature and location of all errors detected).

4) As for language target, I could imagine writing language-neutral API specifications that would admit independent implementations in various languages, instead of a concrete specification in C terms. That supposes concrete implementations in different languages might be independent, instead of simply being bindings to a common, underlying C interface. It is unclear to me whether such an approach would adequately satisfy this group's mandate, however.

5) Finally, fast, small-footprint operation. It looks like none among the surveyed software except JCIF fully provides this feature. As I interpret Herbert, even CBFlib transiently requires a large footprint before it can shift to lightweight operation. Does this mean that fast, small-footprint operation is not an important requirement? I am inclined to think otherwise, but if small-footprint mode is important then why is it not widely implemented?


John

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: Survey of existing software

Post by jamesrhester » Tue Apr 17, 2012 2:22 am

jcbollinger wrote:Though I had hoped to hear from a few more authors, I think we have enough to work with for now, and the discussion has languished for too long.

None of the packages reviewed so far meet all of the requirements, though CBFlib seems to come pretty close. At the same time, each requirement is met by at least one of the packages reviewed, so no requirement stands out as too far-fetched or unreasonable.

The requirements that seem least met are
  • Fast, small-footprint operation
  • Syntax / parse error recovery
  • CIF 2.0 support
  • Full save frame support
  • Language target and other languages support
The next question to take up, then, is whether any of these requirements should be dropped or modified.

Opinions?


John


I would be in favour of dropping the CIF2.0 support and the full save frame support, on the basis that the API can be designed such that both of these items can be added at a later date. For CIF2.0, the API specification can be CIF grammar agnostic (e.g. a grammar='xxx' keyword in a file read function). I am in favour of any implementation either directly supporting CIF2.0 datastructures immediately (Unicode, tables and lists) or else allowing them to be added in a modular fashion at a later date. Save frames are essentially the same data structure as a CIF datablock, so they can also be added fairly quickly at a later date.
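
To make the grammar-keyword idea concrete, here is a minimal Python sketch. Apart from the leading `#\#CIF_x.y` magic comment, which the CIF specifications define, everything in it is an assumption for illustration: `read_cif`, `detect_grammar`, the `PARSERS` registry and the stub parser are hypothetical names, not an existing API.

```python
import re

# CIF 2.0 files must begin with the magic comment "#\#CIF_2.0"; CIF 1.1 files
# may begin with "#\#CIF_1.1". Everything else here is a hypothetical sketch.
_MAGIC = re.compile(r"#\\#CIF_(\d+\.\d+)")


def detect_grammar(text: str, default: str = "1.1") -> str:
    """Return the version named by a leading magic comment, else the default."""
    match = _MAGIC.match(text.lstrip("\ufeff"))
    return match.group(1) if match else default


def _parse_cif11(text: str) -> dict:
    """Stub standing in for a real CIF 1.1 parser."""
    return {"grammar": "1.1", "source_length": len(text)}


# Registry of available grammars; later versions are added here without
# changing the signature of read_cif().
PARSERS = {"1.1": _parse_cif11}


def read_cif(text: str, grammar: str = "auto") -> dict:
    """Parse CIF text with an explicitly named grammar, or detect it."""
    version = detect_grammar(text) if grammar == "auto" else grammar
    if version not in PARSERS:
        raise ValueError("unsupported CIF grammar: " + version)
    return PARSERS[version](text)
```

With this shape, adding CIF 2.0 later amounts to registering one more entry, for example `PARSERS["2.0"] = some_cif20_parser`, while callers of `read_cif` are unaffected. Whether the returned data structure can hold CIF 2.0 lists and tables is a separate question that the keyword alone does not settle.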

I am strongly in favour of sticking to C/C++ for the implementation language, and am agnostic on the parse error recovery and small-footprint operation requirements.

Here's a suggestion: how about we look at CBFlib, CIFPARSE-OBJ and iotbx.cif (being the only C/C++ candidates) and perform a quick "gap analysis" to see how much work it would take in each case to bring the software into conformance with our requirements? For example, all that may be required to produce an error-tolerant parser is to replace a hand-written parser with a yacc or ANTLR-based parser. On the other hand, changing to small-footprint operation may require a complete rewrite of anything handling the CIF data structure and therefore be a lot of work.

Note that the above paragraph is predicated on this software having a suitably liberal license that will allow us to edit and redistribute, potentially with a different name.

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: Survey of existing software

Post by jamesrhester » Tue Dec 11, 2012 4:52 am

I have finally returned to thinking about the CIFAPI, and would like to discuss the possibility that Richard Gildea's iotbx.cif work could form the basis of an API that we can go forward with. To summarise: cctbx contains an iotbx module which reads in CIF files. The parsing is done using C++ code generated by ANTLR from a grammar file that looks quite similar to the CIF BNF. The nature of the datastructure constructed by this generated parser is left unspecified by using C++ virtual functions that need to be defined in any particular compilation. In the case of iotbx, these datastructures are defined by Python, so we can't leverage much more of iotbx beyond the parser.
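
As an illustration of a parser that leaves the datastructure unspecified, here is a small Python analogue of the virtual-function arrangement. It is a sketch only and is not the iotbx code: the interface `CifBuilder`, its method names and the toy `parse_into` function are all made up for this example, and the parser handles only `data_` headers and single-line `_name value` pairs.

```python
from abc import ABC, abstractmethod


class CifBuilder(ABC):
    """Hypothetical callback interface; the parser knows nothing else."""

    @abstractmethod
    def begin_block(self, name: str) -> None: ...

    @abstractmethod
    def set_item(self, name: str, value: str) -> None: ...


def parse_into(text: str, builder: CifBuilder) -> None:
    """Toy parser: 'data_' headers and single-line '_name value' pairs only."""
    for raw in text.splitlines():
        line = raw.strip()
        if line.lower().startswith("data_"):
            builder.begin_block(line[5:])
        elif line.startswith("_"):
            name, _, value = line.partition(" ")
            builder.set_item(name, value.strip())


class DictBuilder(CifBuilder):
    """Keeps everything in memory as nested dictionaries."""

    def __init__(self) -> None:
        self.blocks: dict = {}
        self._current: dict = {}

    def begin_block(self, name: str) -> None:
        self._current = self.blocks.setdefault(name, {})

    def set_item(self, name: str, value: str) -> None:
        self._current[name] = value


class CountingBuilder(CifBuilder):
    """Stores no values at all; only counts what the parser reports."""

    def __init__(self) -> None:
        self.n_blocks = 0
        self.n_items = 0

    def begin_block(self, name: str) -> None:
        self.n_blocks += 1

    def set_item(self, name: str, value: str) -> None:
        self.n_items += 1


TEXT = "data_a\n_x 1\n_y 2\ndata_b\n_x 3\n"
full, counts = DictBuilder(), CountingBuilder()
parse_into(TEXT, full)
parse_into(TEXT, counts)
```

The same parser feeds both builders, which is why choosing a standard datastructure can be treated as a separate decision from choosing the parser.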

Richard has unbundled the CIF parser into a standalone parser, with instructions. When compiling, about 20 ANTLR runtime files need to be compiled and linked into the final executable/library.

If we go down this path, the following work appears necessary.

(1) Choosing a standard datastructure
(2) Development of the higher-level functions to manipulate and write the datastructure, as outlined in our requirements.
(3) Adding CIF2.0 support - should be relatively easy given the simplicity of the ANTLR grammar files

As a second suggestion, a lot of the work in (1) and (2) may potentially be avoided by using SQLite to handle the CIF file as an in-memory database. Many manipulation functions then have equivalent SQL expressions. This does however create an extra library dependency (670K for libsqlite3 on Linux).
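
A quick way to see the correspondence is Python's standard `sqlite3` module, which wraps the same library. The mapping below (one table per loop, one row per packet, one column per data name) is only one possible layout and is an assumption of this sketch, as are the sample values. Values are kept as TEXT so that forms such as `5.43(2)` would survive unchanged.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Assumed mapping: one table per loop, one row per packet, one column per
# data name. All values are stored as TEXT to preserve the CIF spelling.
con.execute(
    "CREATE TABLE atom_site (label TEXT PRIMARY KEY, type_symbol TEXT, fract_x TEXT)"
)

# Adding sets of related data values (packets) to a loop.
con.executemany(
    "INSERT INTO atom_site VALUES (?, ?, ?)",
    [("C1", "C", "0.10"), ("O1", "O", "0.25"), ("N1", "N", "0.40")],
)

# Querying one set of related data values (a chosen packet).
packet = con.execute(
    "SELECT label, type_symbol, fract_x FROM atom_site WHERE label = ?", ("O1",)
).fetchone()

# Replacing a selected value in a chosen packet.
con.execute("UPDATE atom_site SET fract_x = ? WHERE label = ?", ("0.30", "O1"))
updated = con.execute(
    "SELECT fract_x FROM atom_site WHERE label = ?", ("O1",)
).fetchone()[0]

# Removing a selected packet from the loop.
con.execute("DELETE FROM atom_site WHERE label = ?", ("N1",))
remaining = con.execute("SELECT COUNT(*) FROM atom_site").fetchone()[0]

# Adding a data name to the loop, then listing the names now present.
con.execute("ALTER TABLE atom_site ADD COLUMN fract_y TEXT")
names = [row[1] for row in con.execute("PRAGMA table_info(atom_site)")]
```

Unlooped items, save frames and CIF 2.0 list or table values do not map onto rows and columns as directly, so they would need their own conventions under this approach.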

A useful standard for judging footprint would be a flex+bison parser with a naive C datastructure.
