Draft API specification

Forum for CIF developers to define an application programming interface for CIF software.

Moderators: Brian McMahon, jcbollinger

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Draft API specification

Post by jamesrhester » Mon Feb 04, 2013 7:35 am

Below I go through the exercise of writing out explicit specifications for CIF API functions. Consider the string 'cif_' to be prepended for namespace purity. The general scheme is to model the ciffile as a set of datablocks, each with a nominated parent block. Thus arbitrarily nested save frames can be represented if necessary. In a situation with multiple CIF files, a file handle, datablock id and dataname are sufficient to uniquely specify a datum. There is no attempt to incorporate DDL dictionaries in the function list below, but if anybody can see a reason why dictionary validation could not be implemented through calls to the functions below they should certainly point it out.

Any and all comments whether nitpicking or general are welcome and indeed necessary. I will edit this post as the comments come in and try to flag that edit in this top comment. If it all gets too confused I'll start a GitHub wiki page and we can all pile in.

I expect that all of the below can be implemented easily on top of SQLite; indeed, many of the functions below are trivial SQL calls. I already have a complete (in terms of the requirements) CIF implementation in Python using SQLite for the data structure, which I'll release to the community as soon as I get licensing sorted out with my workplace.

File-level operations

Code: Select all

ciftype * open(FILE * stream)

Create a ciffile structure based on the contents of the given stream. The returned pointer is to be used for all interactions with the library. Note that we are hereby taking responsibility for memory management of the ciffile structure. The ciftype structure is opaque.

Code: Select all

ciftype * create()

Create a new, empty ciffile structure.

Code: Select all

void write(ciftype * ciffile, FILE * stream)

Write the contents of ciffile to the given stream. Note that I have taken the road of allowing the OS to provide the output stream, to allow maximum flexibility.
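
By way of illustration only, a minimal read-and-rewrite sequence might look like the following sketch (with the cif_ prefix applied, a hypothetical header "cif.h", and error checking omitted):

Code: Select all

#include <stdio.h>
#include "cif.h"   /* hypothetical header declaring the draft API */

int main(void) {
    FILE *in = fopen("input.cif", "r");
    ciftype *cif = cif_open(in);     /* the library manages this structure */
    fclose(in);

    /* ... query and modify the CIF via the functions below ... */

    FILE *out = fopen("output.cif", "w");
    cif_write(cif, out);
    fclose(out);
    return 0;
}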

Block-level operations

Code: Select all

int create_block(ciftype * ciffile, char * blockname, int parent)

Create a new datablock with name blockname as a child of the block identified by parent. If parent is 0, the block is a top-level datablock; otherwise it is a save frame. The returned integer is a unique identifier for this datablock.

Code: Select all

int delete_block(ciftype * ciffile, int blockid)

Remove the block identified by blockid from the file identified by ciffile.

Code: Select all

int get_block_id(ciftype * ciffile, char * blockname, int parent)

Return the unique block id for the given blockname with given parent. Note that blocknames need only be unique within their enclosing block.

Code: Select all

char * get_block_name(ciftype * ciffile, int blockid)

Get the blockname for blockid.

Code: Select all

int count_blocks(ciftype * ciffile, int blockid)

Return the number of child blocks in the block identified by blockid. If blockid is 0, the number of top-level datablocks in the ciffile is returned.

Code: Select all

int * get_blocks(ciftype * ciffile,int blockid)

Return an array of all blockids that are direct children of blockid. If blockid is 0, the array contains the top-level datablocks; otherwise it contains the save frames nested in the given block.

Data item query operations

Code: Select all

int find_name(ciftype * ciffile, int blockid, char * dataname)

Return KVPAIR if the dataname occurs as a key-value pair, INLOOP if it occurs in a loop, and 0 if it is absent.

Code: Select all

bool has_name(ciftype *ciffile, int blockid, char  * dataname)

Return true if the dataname occurs anywhere within the given block. Note that this and the previous call do not search the contents of nested save frames.

Code: Select all

char * get_item_as_string(ciftype * ciffile, int blockid, char * dataname)

Return the string representation of an item's value. We undertake not to destroy memory for this string while ciffile remains open and this dataname is defined.

Code: Select all

double * get_item_as_float(ciftype * ciffile, int blockid, char * dataname)

Return the representation of the item as a pair of real numbers, if possible. Position zero is the number itself, and position one is the esd. If the esd is missing, position one will be negative. If the value has no numerical representation, NaN is returned in position zero.

Code: Select all

datatype * get_item(ciftype * ciffile, int blockid, char *dataname)

Return all information about the item in the datatype structure. This is notionally a 4-entry structure containing the string representation, two numbers for the numerical representation, and an int to tag the type as UNKNOWN, NULL, or NUMB/CHAR. See the end of the post for methods of accessing this structure.

Loop operations

Code: Select all

int count_loops(ciftype * ciffile, int blockid, bool include_kvpairs)

Return the number of loops in the block identified by blockid. If include_kvpairs is true, the notional one-row loop containing all key-value pairs counts as a separate loop; otherwise it is ignored.

Code: Select all

ciflooptype * get_loops(ciftype * ciffile, int blockid)

Return an array of pointers to the loops in the block, with length as given by the previous function.

Code: Select all

ciflooptype * get_loop_by_dataname(ciftype * ciffile, int blockid, char * dataname)

Return a pointer to the loop containing dataname.

Code: Select all

int get_loop_length(ciflooptype * loop)

Return the number of packets in the loop. For use in conjunction with the following functions.

Code: Select all

datatype** get_loop_item(ciftype * ciffile, int blockid, char *dataname)
double** get_loop_item_as_float(ciftype * ciffile, int blockid, char * dataname)
char** get_loop_item_as_char(ciftype * ciffile, int blockid, char * dataname)

Return the contents of a looped dataname as an array of values.

Code: Select all

cifpacket * start_loop_iteration(ciflooptype * cifloop)

Get a pointer to a loop packet, where the loop is identified by the opaque loop pointer; the packet pointer can be used to iterate over packets (see below). The contents of the packet will be destroyed once it is used in one of the calls below, so contents should be copied before getting the next packet.

Code: Select all

cifpacket * get_next_packet(cifpacket * packetptr)

Get the next packet from the nominated loop. If no more packets are available, a null pointer is returned. See below for handling routines.

Code: Select all

cifpacket * get_matching_packets(ciftype * ciffile, int blockid, cifpacket * conditions)

Get all packets where the values match those in the conditions packet. get_next_packet() will return further packets where more than one exists. If no packets match, a null pointer is returned.
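
Putting the loop and packet calls together, iterating over a loop might look like this sketch (cif_ prefix applied; it assumes an open handle cif and a block id blockid, and the dataname is illustrative):

Code: Select all

/* Walk every packet of the loop containing _atom_site_label. */
ciflooptype *loop = cif_get_loop_by_dataname(cif, blockid, "_atom_site_label");
cifpacket *p = cif_start_loop_iteration(loop);
while (p != NULL) {
    char *label = cif_get_char_value(p, "_atom_site_label");
    /* copy the value here if needed: the packet contents are
       destroyed when the next packet is fetched (see above) */
    p = cif_get_next_packet(p);      /* NULL when no packets remain */
}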

Data value construction routines

Code: Select all

void set_string_item(ciftype * ciffile, int blockid, char * dataname, char * value)
void set_numb_item(ciftype * ciffile, int blockid, char * dataname, double value, double esd)
void set_data_item(ciftype * ciffile, int blockid, char * dataname, datatype * value)

Set dataname to a value.

Code: Select all

ciflooptype * create_loop(ciftype * ciffile, int blockid, char ** datanames)

Create an empty loop containing datanames and return a handle for the loop. It is not
an error to pass a null pointer for datanames, in which case an empty loop is created
suitable for adding columns.

Code: Select all

void add_column(ciflooptype * cifloop, char * dataname)

Add an empty column to the loop. If other datanames are already present, the column
will be initialised with values of UNKNOWN.

Code: Select all

void add_packet(ciflooptype * cifloop, cifpacket * packet)

Add a packet to the loop. The contents of packet are copied.

Code: Select all

cifpacket * get_packet_template(ciflooptype * cifloop)

Get a template for adding packets to this loop. See below for packet handling routines.

Code: Select all

void add_numb_column(ciflooptype * cifloop, char * dataname, float * values, int len)
void add_char_column(ciflooptype * cifloop, char * dataname, char ** values, int len)
void add_data_column(ciflooptype * cifloop, char * dataname, datatype ** values, int len)

Add a column of data. If the loop is not empty, the length of the data must match the
length of the data already in the loop; otherwise an error is raised.

Code: Select all

void delete_column(ciflooptype * cifloop, char * dataname)

Remove the column from the loop. It is not an error to remove the last column or a non-existent column.

Code: Select all

void delete_dataname(ciftype * ciffile, int blockid, char * dataname)

Remove a dataname from the datablock. It is not an error to remove a non-existent dataname.

Code: Select all

void delete_packets(cifpacket * packet)

Delete all packets matching packet.

Packet handling routines

The cifpacket structure is attached to a specific loop and created using the loop as an argument (see above).

Code: Select all

char * get_char_value(cifpacket * cp, char * dataname)
double * get_numb_value(cifpacket * cp, char * dataname)
datatype * get_data_value(cifpacket * cp, char * dataname)
void set_char_value(cifpacket * cp, char * dataname, char * value)
void set_numb_value(cifpacket * cp, char * dataname, double value, double esd)
void set_data_value(cifpacket * cp, char * dataname, datatype * value)

Get or set values in a packet. As with get_item_as_float, get_numb_value returns a pointer to a two-element array holding the value and its esd.

Code: Select all

int count_columns(cifpacket * cp)

Return the number of columns in the packet, for use with the following routine.

Code: Select all

char ** get_column_names(cifpacket *cp)

Return the datanames in this packet (that is, in the loop to which the packet belongs).

Datatype handling routines

The datatype structure is the most general way of representing CIF values, as it can represent all CIF types, in particular NULL and UNKNOWN. Applications would generally use direct setting of columns as CHAR or NUMB for efficiency unless there are some UNKNOWN or NULL values in the columns.

Code: Select all

datatype * create_datavalue()
void set_numb_value(datatype * data, double value, double esd)
void set_char_value(datatype * data, char * value)
void set_unknown(datatype * data)
void set_null(datatype * data)
double * get_numb_value(datatype * data)
char * get_char_value(datatype * data)
bool is_unknown(datatype * data)
bool is_null(datatype * data)


Get and set values in the datatype. This is clearly extensible to CIF2.0 compound structures as well.
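
As an illustration, constructing a measured number and reading it back might look like this (cif_ prefix applied; it assumes an open handle cif and a block id blockid, and the dataname is arbitrary):

Code: Select all

/* Build a number with an esd, attach it to a dataname, read it back. */
datatype *v = cif_create_datavalue();
cif_set_numb_value(v, 1.5432, 0.0003);              /* value, esd */
cif_set_data_item(cif, blockid, "_cell_length_a", v);

datatype *back = cif_get_item(cif, blockid, "_cell_length_a");
if (!cif_is_unknown(back) && !cif_is_null(back)) {
    double *pair = cif_get_numb_value(back);        /* pair[0] value, pair[1] esd */
}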

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Draft API specification

Post by jcbollinger » Mon Feb 04, 2013 11:34 pm

Though it may seem that I have been ignoring this area, I have in fact been working on a similar API definition as my time has permitted. To validate and flush out problems with the API, I have been working as well on an SQLite implementation. In retrospect, it would have been wiser to keep this group better apprised, and I offer my apologies for not doing so.

Nevertheless, I think it may prove useful to start with two independent designs that we can compare and contrast. Therefore, I present below a working version of the main public API as I conceived it, as it could be expressed in a C header. For brevity I omit several utility and maintenance functions such as functions to parse strings into value objects and functions for resource management. I also omit definitions of the list of defined error codes that the API functions may emit. Finally, I omit most details of the complex data types used by the API.

Whole-CIF functions

Creates a new, empty, in-memory CIF data structure, and on success, records a handle on it where the 'cif' pointer points:

Code: Select all

int cif_create(cif_t **cif);


Removes the specified in-memory CIF representation, releasing all resources it holds. The original external source (if any) of the CIF data is not affected:

Code: Select all

int cif_destroy(cif_t *cif);


Parses a CIF from the specified stream, file descriptor, or path. If the 'cif' argument is non-NULL then the parsed data are added to the CIF it specifies; otherwise, the parsed result is discarded, but the return code still indicates whether parsing was successful. cif_parse() and cif_parse_fd() start parsing at the current file pointer, whereas cif_parse_file() opens the specified file for reading and attempts to parse its entire contents. The 'path' argument is interpreted according to the current locale, as if by open() or fopen():

Code: Select all

int cif_parse(FILE *stream, cif_t *cif);
int cif_parse_fd(int fd, cif_t *cif);
int cif_parse_file(const char *path, cif_t *cif);


Formats the CIF data represented by the 'cif' handle to the specified output. cif_write() and cif_write_fd() start writing at the current file pointer, whereas cif_write_file() opens the specified file for writing and overwrites any previous contents. The 'path' argument is interpreted according to the current locale, as if by open() or fopen():

Code: Select all

int cif_write(FILE *stream, cif_t *cif);
int cif_write_fd(int fd, cif_t *cif);
int cif_write_file(const char *path, cif_t *cif);
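
By way of illustration, a minimal parse-and-rewrite sequence using these functions might run as follows (a sketch only; "cif.h" is a hypothetical header, and only CIF_OK is checked):

Code: Select all

#include "cif.h"   /* hypothetical header declaring the functions above */

int copy_cif(const char *in_path, const char *out_path) {
    cif_t *cif = NULL;
    int result = cif_create(&cif);
    if (result != CIF_OK) return result;

    result = cif_parse_file(in_path, cif);
    if (result == CIF_OK) {
        /* ... inspect or modify the managed CIF here ... */
        result = cif_write_file(out_path, cif);
    }
    cif_destroy(cif);   /* release all resources */
    return result;
}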


Data block functions

Creates a new data block bearing the specified name in the specified CIF. If 'block' is not NULL then a handle on the new block is recorded where it points:

Code: Select all

int cif_create_block(cif_t *cif, const UChar *name, cif_block_t **block);


To accommodate CIF2, block, frame, and data names need to support Unicode. I did not at first appreciate the ramifications, but after a while I realized that that is a deal-breaker for every C-language CIF API currently available. Even if the API were to be defined in terms of UTF-8-encoded byte strings, code that is not written to expect such is certain to trip over it. It is furthermore not sufficient to use the wchar_t of C89, as that type and the library functions based on it are in no way guaranteed to support Unicode -- that's completely up to the implementation.

My proposal in this area is to rely for Unicode data types and library support on the International Components for Unicode (ICU) project, a mature and widely-used set of libraries supported by industry heavyweights including IBM, Apple, and Google. The "UChar" in the above definition and some subsequent ones is the ICU native Unicode character data type, an unsigned 16-bit integer. ICU Unicode strings are arrays of UChar, thus encoding text in UTF-16 with native byte-order.
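
For readers unfamiliar with ICU, here is one portable way to build a UChar string in C and pass it to the function above (a sketch only; the block name is arbitrary):

Code: Select all

#include <unicode/ustring.h>

/* U_STRING_DECL/U_STRING_INIT build a UChar array from a literal
   consisting of invariant characters. */
U_STRING_DECL(blockName, "my_block", 8);

void example(cif_t *cif) {
    cif_block_t *block = NULL;
    U_STRING_INIT(blockName, "my_block", 8);
    cif_create_block(cif, blockName, &block);
}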

Looks up the data block bearing the specified name in the specified CIF. If 'block' is not NULL then a handle on the block is written where it points. Returns CIF_OK if the block is present in the given CIF, even if 'block' is NULL:

Code: Select all

int cif_get_block(cif_t *cif, const UChar *name, cif_block_t **block);


Creates a new save frame bearing the specified name in the specified data block. If 'frame' is not NULL then a handle on the new frame is recorded where it points:

Code: Select all

int cif_block_create_frame(cif_block_t *block, const UChar *name, cif_frame_t **frame);


Looks up the save frame bearing the specified name in the specified data block. If 'frame' is not NULL then a handle on the frame is written where it points. Returns CIF_OK if the frame is present in the given block, even if 'frame' is NULL:

Code: Select all

int cif_block_get_frame(cif_block_t *block, const UChar *name, cif_frame_t **frame);


General data block and save frame functions

Creates a new loop in the specified container, with the specified item names, expressed as a null-terminated array of pointers to null-terminated Unicode strings. If the 'loop' argument is not NULL then a handle on the new loop is written where it points:

Code: Select all

int cif_container_create_loop(cif_container_t *container, const UChar *category, cif_loop_t **loop, UChar *names[]);

Note 1 (EDITED): cif_container_t is compatible with both cif_block_t and cif_frame_t.

Note 2: Loop (or data name) categories are not part of the CIF data model, and their use is optional (with one important exception) in this API. They are conceived to serve at least two purposes:
  • tagging the loop in each block or frame that contains the scalars, and
  • serving as reference-able table names, especially for relationally-oriented dictionaries such as mmCIF.
It is not intended that the API treat them as anything other than opaque tags, but I anticipate that some higher-level protocols would find them useful.

Looks up the loop, if any, in the specified container that is assigned to the specified category. If 'loop' is not NULL then a handle on the discovered loop is written where it points. Returns CIF_OK if there is exactly one such loop in the container, even if 'loop' is NULL:

Code: Select all

int cif_container_get_category_loop(cif_container_t *container, const UChar *category, cif_loop_t **loop);


Looks up the loop, if any, in the specified container that contains the specified item. If 'loop' is not NULL and the item is found in the container then a handle on the discovered loop is written where 'loop' points. Returns CIF_OK if there is such a loop in the container, even if 'loop' is NULL:

Code: Select all

int cif_container_get_item_loop(cif_container_t *container, const UChar *item_name, cif_loop_t **loop);


Sets the value of the specified item in the specified container, or adds it as a scalar if it's not already present in the container. Care is required with this function: if the named item is in a multi-packet loop then the given value is set for the item in every loop packet:

Code: Select all

int cif_container_set_value(cif_container_t *container, const UChar *item_name, cif_value_t *val);


Removes the specified item and all associated values from the specified container:

Code: Select all

int cif_container_remove_item(cif_container_t *container, const UChar *item_name);


Loop manipulation functions

Retrieves the item names belonging to the specified loop, writing an array of the names where 'item_names' points:

Code: Select all

int cif_loop_get_names(cif_loop_t *loop, UChar ***item_names);


Adds the CIF data item identified by the specified name to the specified loop. Its initial value (in every existing loop packet) is given by 'val':

Code: Select all

int cif_loop_add_item(cif_loop_t *loop, const UChar *item_name, cif_value_t *val);


Adds a packet to the specified loop:

Code: Select all

int cif_loop_add_packet(cif_loop_t *loop, cif_packet_t *packet);


Creates an iterator over the packets in the specified loop:

Code: Select all

int cif_loop_get_packets(cif_loop_t *loop, cif_packet_it_t **iterator);

Note 3: cif_packet_it_t holds the state of an iteration through the packets of a given loop. The library is careful to avoid maintaining such state details internally, as that presents concurrency problems.

Loop packet manipulation functions

Advances to the next packet available from the specified iterator. If 'packet' is not NULL then the packet data is written where it points. Returns CIF_OK on success, or CIF_FINISHED if no more packets are available via this iterator:

Code: Select all

int cif_pktitr_next_packet(cif_packet_it_t *iterator, cif_packet_t **packet);


Updates the last packet iterated by the specified iterator with the values from the specified packet:

Code: Select all

int cif_pktitr_update_packet(cif_packet_it_t *iterator, cif_packet_t *packet);


Removes the last packet iterated by the specified iterator from its loop:

Code: Select all

int cif_pktitr_remove_packet(cif_packet_it_t *iterator);
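
Usage might look like the following sketch (names as defined above, assuming a cif_loop_t *loop already in hand; per the preceding descriptions, CIF_FINISHED marks the normal end of iteration):

Code: Select all

cif_packet_it_t *iterator = NULL;
cif_packet_t *packet = NULL;
int result;

if (cif_loop_get_packets(loop, &iterator) == CIF_OK) {
    while ((result = cif_pktitr_next_packet(iterator, &packet)) == CIF_OK) {
        /* examine 'packet'; cif_pktitr_update_packet() or
           cif_pktitr_remove_packet() may be applied to it here */
    }
    /* result == CIF_FINISHED here on normal completion */
}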


There are additional functions for initialization and manipulation of individual values and for resource management, but the above are the functions addressing the API requirements we set forth. I'll have some comments soon (maybe tomorrow) about the similarities (there are many) and differences between the API proposals, and also about my experience so far with prototyping an implementation.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Draft API specification

Post by jcbollinger » Tue Feb 05, 2013 3:26 pm

It encourages me that the two API drafts are in fact very similar in general organization and approach, as well as in some of the design principles. Both feature opaque data structures representing the different kinds of objects in a CIF, with associated functions to interrogate and manipulate them. Both find ways to exploit the common aspects of data blocks and save frames so as to avoid redundant API functions. Both even choose a packet iterator approach to processing loops. To some extent these seem natural design choices for the problem area (or so I thought), but I'm nevertheless glad to find we're on the same page.

Some of the more notable ways in which the drafts differ are
  • All functions in my version of the API (call it alternative 2) have an integer status code as a return value, whereas functions in James's version (alternative 1) tend to return application data directly
  • Alternative 2 represents data blocks and save frames via their own structures (which can be used interchangeably in API functions where it makes sense to do so), whereas alternative 1 models both as possibly-nested blocks and provides integer block IDs as handles on them
  • Alternative 2 uses a polymorphic data type to represent values in the main API functions, and thus provides only one version of each API function that manipulates CIF values, whereas alternative 1 provides data-type-specific functions for CIF value manipulation
  • Alternative 2 handles 'scalar' data via the same API that is used for multi-packet looped data (and indeed, does little to distinguish those cases), whereas alternative 1 provides separate functions for manipulating looped and un-looped data
  • Alternative 2 relies explicitly on ICU for its character data type for CIF data, whereas alternative 1 assumes an encoded byte-string representation (or maybe just defers a decision on Unicode data representation)

Before I offer my own opinions on those differences in design (which I hope are not very evident from the presentation above), I'd like to invite James and anyone else interested to weigh in.

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: Draft API specification

Post by jamesrhester » Wed Feb 06, 2013 6:22 am

I'm quite pleased that we've both had an independent go at thinking up an API, as we are likely to quickly triangulate any design issues. We may also drag some lurkers into the discussion. In general I think alternative 2 can be characterised as an admirably economical set of functions that fulfill the design requirements, on top of which some simple convenience functions should be overlaid, i.e. I'd be happy to proceed with alternative 2.
jcbollinger wrote:Some of the more notable ways in which the drafts differ are
  • All functions in my version of the API (call it alternative 2) have an integer status code as a return value, whereas functions in James's version (alternative 1) tend to return application data directly

The most important practical ramification of this difference seems to be consistent error reporting for alternative 2, which is a Good Thing of course. As far as I can tell the API still takes most of the responsibility for memory management of the various data structures. I suspect alternative 2 is the better course in this case for precisely that reason.
  • Alternative 2 represents data blocks and save frames via their own structures (which can be used interchangeably in API functions where it makes sense to do so), whereas alternative 1 models both as possibly-nested blocks and provides integer block IDs as handles on them

I would assert that a data block is just a save frame with no parent container, and therefore the two can be handled identically. Certainly the API can provide some convenience functions to hide this equality. As a side note, my Python implementation uses an SQLite table defined thus:

Code: Select all

self.db.execute('''create table datablocks (blockid int primary key, name text collate nocase, parent int)''')

in order to store the datablock relations. The alternative 1 integer ID could be hidden inside the alternative 2 container type with no loss of function.
  • Alternative 2 uses a polymorphic data type to represent values in the main API functions, and thus provides only one version of each API function that manipulates CIF values, whereas alternative 1 provides data-type-specific functions for CIF value manipulation

The datatype-specific routines were a recognition that a programmer will almost always want a double or a string with minimal messing around. A base-level function to extract/insert opaque values together with some simple additional functions to extract/insert values of the appropriate type from these opaque types would suit me just as well. There is an annoying implementation issue here in that we cannot assume a uniform type for a column and so we need to store the type for every datavalue.
  • Alternative 2 handles 'scalar' data via the same API that is used for multi-packet looped data (and indeed, does little to distinguish those cases), whereas alternative 1 provides separate functions for manipulating looped and un-looped data

Alternative 2 leverages the point that the tag-value pairs are just like a one-row loop. Again, a few convenience functions over the top of them would probably help those who aren't used to thinking this way. The context in which a tag-value pair would be accessed is one in which a single value is expected (not an array), and given how often this happens I think a straightforward function to return a single value is useful.
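
Such a convenience function might be declared along these lines (purely hypothetical, following alternative 2's conventions):

Code: Select all

/* Fetch the single value of an item, failing with a suitable error
   code if the item belongs to a loop with more than one packet. */
int cif_container_get_value(cif_container_t *container,
                            const UChar *item_name, cif_value_t **val);
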
  • Alternative 2 relies explicitly on ICU for its character data type for CIF data, whereas alternative 1 assumes an encoded byte-string representation (or maybe just defers a decision on Unicode data representation)

I just didn't want to think about it at this stage. Obviously alternative 2 is the better way to go on this.

I'd be interested to hear your experiences in implementing this API.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Draft API specification

Post by jcbollinger » Thu Feb 07, 2013 5:36 pm

I estimate that the storage and manipulation parts of my implementation are about 75% done, but I have done nothing with parsing or writing, and it's all essentially untested. Nevertheless, here's what I can say so far about my experiences in implementing it:

As conventional wisdom regarding C would dictate, a great deal of the executable code is devoted to resource management and error handling. Some of that is tricky to get right. After a long time programming mostly in Java, the return to C has greatly heightened my appreciation of garbage collection and exceptions.

The C preprocessor goes a long way toward making the task tolerable. There is a lot of boilerplate code that cannot reasonably be implemented as functions, but which I am overjoyed to neither rewrite nor cut & paste & edit, and which, therefore, I can fix in one place when necessary.

Providing for case-insensitive comparison of Unicode strings used as item, data block, and save frame names is non-trivial, and the CIF2 spec changes don't really address the question. One must consider that case conversion is not unambiguously defined outside the ASCII set, and one must also consider combining characters, pre-composed characters, etc. I have chosen to base name comparisons on a normalized form obtained by first case-folding the original name according to the Unicode case-folding algorithm, then transforming the folded version to Unicode normalization form NFC. The latter step is perhaps not justified by the letter of the CIF2 specs, but it's the right thing to do. (So call me a maverick.) Note that performing those steps in the reverse order would also yield a usable normalized form, but equality comparisons based on the two forms could differ in some cases.
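
A minimal sketch of that normalization using ICU, for concreteness (fixed-size buffers and error handling abbreviated; the function name is illustrative):

Code: Select all

#include <unicode/ustring.h>
#include <unicode/unorm2.h>

/* Case-fold first, then transform to NFC, as described above. */
int32_t normalize_name(const UChar *name, int32_t len,
                       UChar *dest, int32_t capacity) {
    UChar folded[256];
    UErrorCode code = U_ZERO_ERROR;
    int32_t flen = u_strFoldCase(folded, 256, name, len,
                                 U_FOLD_CASE_DEFAULT, &code);
    if (U_FAILURE(code)) return -1;
    const UNormalizer2 *nfc = unorm2_getNFCInstance(&code);
    int32_t nlen = unorm2_normalize(nfc, folded, flen, dest, capacity, &code);
    return U_FAILURE(code) ? -1 : nlen;
}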

Although I am implementing an SQLite-based back end, the API does not depend on that, and I'm trying to ensure that it stays that way. I anticipate that that may prove a bit challenging in the packet iteration / loop manipulation code that I'm working on right now, because I think I'll need to hold open a database transaction across multiple API calls. The user doesn't see that directly, but it does require limitations on API use.

Most operations are implemented, at their core, via a single execution of one, or occasionally two, SQL statements.

Values are by far the most complicated thing in the API. CIF2 lists and tables are particularly irksome when it comes to storage in the database -- I have not found an alternative to storing them in some kind of serialized form. Although I could use CIF as the serialization format, my current implementation instead uses a custom internal format that is quick and easy to deserialize. Even numeric values are less straightforward than they might seem, since one has to account for the standard uncertainty that some of them carry. Also, one must deal with numbers that can be expressed in CIF but cannot be represented in any of the native numeric formats supported by the host system (or else knowingly refuse to do so).

Whether or how to distinguish data blocks from save frames at the implementation level is a distinct question from whether or how to do so at the API level, particularly if one intends to rely on the DB engine to enforce uniqueness of block / frame names. Here's a slightly-modified version of my table definition for save frames:

Code: Select all

create table save_frame (
  id integer primary key,
  data_block_id integer not null,
  name varchar(80) not null,
  foreign key (data_block_id)
    references data_block(id),
  unique (data_block_id, name)
)

That's not a complete picture, but it's enough to demonstrate. It looks fine to me for save frames (given the complementary data_block table, which I've not shown), but what if we want to store data blocks in that same table? Data blocks have no parent block, so one would have to record either a NULL parent block ID or a special, never-used one. The latter is inconsistent with the foreign key clause, but the former breaks the uniqueness constraint for data block names! Something has to give: to put it all in one table, one must give up either parent/child referential integrity enforcement or database-enforcement of name uniqueness.

Actually, a third alternative now occurs to me: one could record a dummy top-level container representing the global context of which all data blocks are direct children, but which should not directly contain any data of its own. It would itself suffer from the uniqueness problem, but that would be irrelevant because it wouldn't (shouldn't) have any siblings. That would carry its own costs, however, in a need for special-case implementation code and a resulting model that incorporates constraints that cannot be expressed in DDL (that the top-level container have no siblings and no data of its own).
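
To make that concrete, a single-table layout with such a dummy container might look like this sketch (names illustrative; a pre-inserted row with id 0 stands for the global context):

Code: Select all

create table container (
  id integer primary key,
  parent_id integer not null,
  name varchar(80) not null,
  foreign key (parent_id) references container(id),
  unique (parent_id, name)
)
-- insert the dummy row (0, 0, '') once; data blocks are then rows
-- with parent_id = 0, and save frames reference their enclosing block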

There is also a fairly major implementation question with respect to the approach to representing loops in the database. One alternative is to create a table in the database for each loop. That models the CIF structure rather directly, and provides good support for relational operations such as joins. On the other hand, it means that the database schema changes routinely in the course of API use, and schema changes are non-transactional in SQLite. Also, the per-loop tables need names, but there are no natural names available for use, so this approach requires programmatic generation of table names. That somewhat dulls the luster of direct support for loop joins. (This differs from the case of an API aimed at, say, valid mmCIF, wherein items could be organized into DB tables based on their dictionary-defined categories.) Moreover, the complexity of values presents an especial problem when there are many columns that must represent values.

The main alternative I see for representing loops is to use what I'll call "virtual tables". Virtual tables use a small number of physical tables to model an unlimited number of logical tables. In this case, the database schema never needs to change as loop definitions are added, modified, or removed. Also, because all the values for all the loops are recorded in the same physical table, it is not especially onerous to deal with the type ambiguity problem. Among the drawbacks, however, is that loop joins are not as straightforward. This is the approach I am attempting, and here is a simplified version of the relevant parts of my working schema:

An index of all the loops appearing anywhere in the CIF. Metadata general to whole loops can be attached here.

Code: Select all

create table loop (
  container_id integer not null,
  loop_num integer not null,
  primary key (container_id, loop_num)
)


Specifies which data names are tabulated in each loop -- that is, the virtual table definitions

Code: Select all

create table loop_item (
  container_id integer not null,
  name varchar(80) not null,
  loop_num integer not null,
  primary key (container_id, name),
  foreign key (container_id, loop_num)
    references loop(container_id, loop_num)
)


Gives the value for each item in each packet of each loop

Code: Select all

create table item_value (
  container_id integer not null,
  name varchar(80) not null,
  row_num integer not null,
  value numeric,
  primary key (container_id, name, row_num),
  foreign key (container_id, name)
    references loop_item(container_id, name)
)


(Don't pay too much attention to data types in the foregoing, since SQLite doesn't either.)
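
As a concrete example of reading the virtual tables back, the packets of a single loop can be reassembled with a join along these lines (an untested sketch against the simplified schema above; ?1 and ?2 are SQLite parameter placeholders):

Code: Select all

select v.row_num, v.name, v.value
  from item_value v
  join loop_item i
    on i.container_id = v.container_id and i.name = v.name
 where v.container_id = ?1 and i.loop_num = ?2
 order by v.row_num, v.name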

One more thing: I am seriously considering writing a second implementation of the back end that uses hash tables, linked lists, and similar data structures for in-memory CIF storage. I anticipate that it would offer both advantages and disadvantages relative to the SQLite version, but I don't have a good feel for where the balance would lie. If I actually have time to do that, however, or if someone else does, then at minimum it will help flush out quirks and implementation assumptions in the API definition.

jamesrhester
Posts: 39
Joined: Mon Sep 19, 2011 8:21 am

Re: Draft API specification

Post by jamesrhester » Fri Feb 08, 2013 6:47 am

jcbollinger wrote:Providing for case-insensitive comparison of Unicode strings used as item, data block, and save frame names is non-trivial, and the CIF2 spec changes don't really address the question. One must consider that case conversion is not unambiguously defined outside the ASCII set, and one must also consider combining characters, pre-composed characters, etc. I have chosen to base name comparisons on a normalized form obtained by first case-folding the original name according to the Unicode case-folding algorithm, then transforming the folded version to Unicode normalization form NFC. The latter step is perhaps not justified by the letter of the CIF2 specs, but it's the right thing to do. (So call me a maverick.) Note that performing those steps in the reverse order would also yield a usable normalized form, but equality comparisons based on the two forms could differ in some cases.


It may not be too late to tweak CIF2 as regards the minutiae of case-insensitivity, if you can put forward a specific suggestion with justification. Herbert has always insisted on trying it out before setting it in stone.

Although I am implementing an SQLite-based back end, the API does not depend on that, and I'm trying to ensure that it stays that way. I anticipate that that may prove a bit challenging in the packet iteration / loop manipulation code that I'm working on right now, because I think I'll need to hold open a database transaction across multiple API calls. The user doesn't see that directly, but it does require limitations on API use.

That would seem to mirror exactly what the database has to do anyway when executing multiple fetches from a single 'SELECT'. How about keeping the statement pointer provided by SQLite in the iterator structure, and perhaps adding some boilerplate to pick up when the database throws an exception because that statement is no longer valid?

Values are by far the most complicated thing in the API. CIF2 lists and tables are particularly irksome when it comes to storage in the database -- I have not found an alternative to storing them in some kind of serialized form. Although I could use CIF as the serialization format, my current implementation instead uses a custom internal format that is quick and easy to deserialize. Even numeric values are less straightforward than they might seem, since one has to account for the standard uncertainty that some of them carry. Also, one must deal with numbers that can be expressed in CIF but cannot be represented in any of the native numeric formats supported by the host system (or else knowingly refuse to do so).

I can see two practical solutions: just preserve the input text as-is, perhaps with a prepended character to signal type. It would be a hit on performance to have to constantly parse long lists every time a user requests them, but given typical CIF2 usage (approximately none) the lists are likely to be short vectors. The other practical solution would be to use the SQLite Blob type to contain the serialisation (perhaps you are doing this). Table types could be parsed into real database tables!!

Whether or how to distinguish data blocks from save frames at the implementation level is a distinct question from whether or how to do so at the API level, particularly if one intends to rely on the DB engine to enforce uniqueness of block / frame names. Here's a slightly-modified version of my table definition for save frames:

Code: Select all

create table save_frame (
  id integer primary key,
  data_block_id integer not null,
  name varchar(80) not null,
  foreign key (data_block_id)
    references data_block(id),
  unique (data_block_id, name)
)

That's not a complete picture, but it's enough to demonstrate. It looks fine to me for save frames (given the complementary data_block table, which I've not shown), but what if we want to store data blocks in that same table? Data blocks have no parent block, so one would have to record either a NULL parent block ID or a special, never-used one. The latter is inconsistent with the foreign key clause, but the former breaks the uniqueness constraint for data block names! Something has to give: to put it all in one table, one must give up either parent/child referential integrity enforcement or database-enforcement of name uniqueness.

Actually, a third alternative now occurs to me: one could record a dummy top-level container representing the global context of which all data blocks are direct children, but which should not directly contain any data of its own. It would itself suffer from the uniqueness problem, but that would be irrelevant because it wouldn't (shouldn't) have any siblings. That would carry its own costs, however, in a need for special-case implementation code and a resulting model that incorporates constraints that cannot be expressed in DDL (that the top-level container have no siblings and no data of its own).


This latter is essentially what I have done. The special case implementation is not particularly onerous for writing: you recursively write all datablocks having parent = 0 and not having id=0, and adding a datablock is simply adding a save frame with parent = 0.

There is also a fairly major implementation question with respect to the approach to representing loops in the database. One alternative is to create a table in the database for each loop. That models the CIF structure rather directly, and provides good support for relational operations such as joins. On the other hand, it means that the database schema changes routinely in the course of API use, and schema changes are non-transactional in SQLite. Also, the per-loop tables need names, but there are no natural names available for use, so this approach requires programmatic generation of table names. That somewhat dulls the luster of direct support for loop joins. (This differs from the case of an API aimed at, say, valid mmCIF, wherein items could be organized into DB tables based on their dictionary-defined categories.) Moreover, the complexity of values presents an especial problem when there are many columns that must represent values.

If you were to abandon the database table = CIF loop equation then you might as well abandon using a relational database. I'm not so worried about the schema changing, as my approach has always (as a relational database outsider) been to create the datastructures, then compare that with the DDL schema. In terms of the API, aren't you just adding or removing columns? Why is not being transactional an issue here?

The main alternative I see for representing loops is to use what I'll call "virtual tables". Virtual tables use a small number of physical tables to model an unlimited number of logical tables. In this case, the database schema never needs to change as loop definitions are added, modified, or removed. Also, because all the values for all the loops are recorded in the same physical table, it is not especially onerous to deal with the type ambiguity problem. Among the drawbacks, however, is that loop joins are not as straightforward. This is the approach I am attempting, and here is a simplified version of the relevant parts of my working schema:

An index of all the loops appearing anywhere in the CIF. Metadata general to whole loops can be attached here.

Code: Select all

create table loop (
  container_id integer not null,
  loop_num integer not null,
  primary key (container_id, loop_num)
)


Specifies which data names are tabulated in each loop -- that is, the virtual table definitions

Code: Select all

create table loop_item (
  container_id integer not null,
  name varchar(80) not null,
  loop_num integer not null,
  primary key (container_id, name),
  foreign key (container_id, loop_num)
    references loop(container_id, loop_num)
)

Gives the value for each item in each packet of each loop

Code: Select all

create table item_value (
  container_id integer not null,
  name varchar(80) not null,
  row_num integer not null,
  value numeric,
  primary key (container_id, name, row_num),
  foreign key (container_id, name)
    references loop_item(container_id, name)
)


(Don't pay too much attention to data types in the foregoing, since SQLite doesn't either.)

I think that if you have to abandon the CIF loop = database table equivalence, you might as well abandon trying to use a relational database as the datastructure manager. How much is the database contributing in this virtual table implementation towards reducing the amount of programming work, compared to, say, your alternative below?
One more thing: I am seriously considering writing a second implementation of the back end that uses hash tables, linked lists, and similar data structures for in-memory CIF storage. I anticipate that it would offer both advantages and disadvantages relative to the SQLite version, but I don't have a good feel for where the balance would lie. If I actually have time to do that, however, or if someone else does, then at minimum it will help flush out quirks and implementation assumptions in the API definition.

This latter would appear to be more work, especially if you eschew the C++ stdlib container types. One of the attractions of the relational database route is that user-defined aggregate and row functions in SQLite find a natural mapping to dREL, key relations can be enforced by the database, and category merging/splitting as proposed in DDLm is trivial.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Draft API specification

Post by jcbollinger » Fri Feb 08, 2013 4:40 pm

jamesrhester wrote:It may not be too late to tweak CIF2 as regards the minutiae of case-insensitivity, if you can put forward a specific suggestion with justification. Herbert has always insisted on trying it out before setting it in stone.

I'll work on that. Would it go to the DDLm working group or directly to COMCIFS?

Although I am implementing an SQLite-based back end, the API does not depend on that, and I'm trying to ensure that it stays that way. I anticipate that that may prove a bit challenging in the packet iteration / loop manipulation code that I'm working on right now, because I think I'll need to hold open a database transaction across multiple API calls. The user doesn't see that directly, but it does require limitations on API use.

That would seem to mirror exactly what the database has to do anyway when executing multiple fetches from a single 'SELECT'. How about keeping the statement pointer provided by SQLite in the iterator structure, and perhaps adding some boilerplate to pick up when the database throws an exception because that statement is no longer valid?

Yes, the iterator structure holds a pointer to the corresponding statement object, and the back end will hold open a transaction at least until all result rows have been retrieved or the statement is reset. The problem, such as it is, is that I think I need to be able to execute additional statements in the same transaction (for loop packet deletion, in particular). That requires using explicit transactions, and the trick is (1) all API functions using the same DB connection will operate in the context of that transaction while it is open, and (2) the transaction commit has to be triggered by some kind of explicit "I'm done" signal from the user. This can all work, but it imposes limitations on how the API may safely be used. I'm documenting those limitations, of course.
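
In outline, this is the usual explicit-transaction bracketing of the SQLite C API, something like the following sketch (the function names are illustrative, and error handling is abbreviated):

Code: Select all

#include <sqlite3.h>

/* Opened when iteration begins; all API calls on the same connection
   then operate inside this transaction. */
int begin_iteration(sqlite3 *db) {
    return sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
}

/* Triggered by the user's explicit "I'm done" signal. */
int finish_iteration(sqlite3 *db, int success) {
    return sqlite3_exec(db, success ? "COMMIT" : "ROLLBACK",
                        NULL, NULL, NULL);
}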

Values are by far the most complicated thing in the API. CIF2 lists and tables are particularly irksome when it comes to storage in the database -- I have not found an alternative to storing them in some kind of serialized form. [...] Even numeric values are less straightforward than they might seem [...].

I can see two practical solutions: just preserve the input text as-is, perhaps with a prepended character to signal type. It would be a hit on performance to have to constantly parse long lists every time a user requests them, but given typical CIF2 usage (approximately none) the lists are likely to be short vectors. The other practical solution would be to use the SQLite Blob type to contain the serialisation (perhaps you are doing this). Table types could be parsed into real database tables!!

Yes, I'm currently taking the Blob approach for both lists and tables. In principle, tables could indeed be stored as real DB tables, but that could lead to a great profusion of tables (every table value needs its own). I'd be more comfortable with another use of "virtual tables" to control that, but even so we would be looking at executing potentially a lot more SQL statements. I anticipate that would be substantially more costly than the serialization / deserialization I'm currently doing, and not necessarily any simpler on the C side. List values would be simpler to lay out in purely relational form, but similar considerations still apply.

Whether or how to distinguish data blocks from save frames at the implementation level is a distinct question from whether or how to do so at the API level, particularly if one intends to rely on the DB engine to enforce uniqueness of block / frame names.
[...]
Actually, a third alternative now occurs to me: one could record a dummy top-level container representing the global context of which all data blocks are direct children, but which should not directly contain any data of its own. It would itself suffer from the uniqueness problem, but that would be irrelevant because it wouldn't (shouldn't) have any siblings. That would carry its own costs, however, in a need for special-case implementation code and a resulting model that incorporates constraints that cannot be expressed in DDL (that the top-level container have no siblings and no data of its own).

This latter is essentially what I have done. The special case implementation is not particularly onerous for writing: you recursively write all datablocks having parent = 0 and not having id=0, and adding a datablock is simply adding a save frame with parent = 0.

My concern is not so much that the special case would be hard to recognize or implement, but rather that it would be needed at all. It's inelegant, and in being inelegant it presents a cranny in which bugs can lodge. I'm somewhat more concerned, however, with the database side. There is always a degree of risk in using a database that is subject to validity constraints that the DBMS cannot recognize or enforce. That's what database normalization is all about, after all. Nevertheless, in this case that may be a trade-off I am willing to make.

There is also a fairly major implementation question with respect to the approach to representing loops in the database. One alternative is to create a table in the database for each loop. [...]

If you were to abandon the database table = CIF loop equation then you might as well abandon using a relational database. I'm not so worried about the schema changing, as my approach has always (as a relational database outsider) been to create the datastructures, then compare that with the DDL schema. In terms of the API, aren't you just adding or removing columns? Why is not being transactional an issue here?

Not being transactional is problematic for a couple of reasons:
  • It can be a lot more complicated to clean up after errors. For example, if you successfully add a column, but encounter an error trying to set initial values, then you have to manually remove the column again. And if that column removal fails then there's pretty much no recovery. If you stick to transactional operations, on the other hand, then you just roll back the transaction on error, and you're done. (The rollback can also fail, but if that happens then the DB was already hosed.)
  • Non-transactional operations are poison for concurrent access to the database from multiple threads. That might not be the only concurrency problem the API implementation faces, but I am loath to implement a concurrency-blocker if I don't have to.
Even without loops being stored as separate tables in the DB, the relational DB still provides a lot of advantages. Transactionality is a big one. There's also automatic enforcement of uniqueness constraints. In general, SQL allows for simple expressions of a lot of the operations we want to perform. And I think loop joins can still be performed, only with somewhat more complication. I'll report more on that later.

Also, we're not talking about only adding and removing columns in the table-per-loop approach, we're talking about adding and removing whole tables, plus keeping track of which tables represent which loops, etc. As far as supporting joins goes, users would still need to go to some effort to determine which table to join to which. Moreover, loop-per-table requires dynamic construction of SQL statements for accessing loop contents. That's certainly doable, but it adds a lot of complexity on the C side, and it limits the performance benefits available from reusing previously-compiled prepared statements.

The main alternative I see for representing loops is to use what I'll call "virtual tables".[...]

I think that if you have to abandon the CIF loop = database table equivalence, you might as well abandon trying to use a relational database as the datastructure manager. How much is the database contributing in this virtual table implementation towards reducing the amount of programming work, compared to, say, your alternative below?

That's a fair question. I can make only predictions and educated guesses at this point, but I think the RDB implementation using virtual tables for loops will provide advantages for resource management, constraint enforcement, error recovery, and code size and clarity (at least in most places). It may facilitate concurrent multi-threaded usage, and at minimum it will be no worse in that area than the alternative. It remains to be seen whether it provides any advantage for CIF-level relational operations such as loop joins, but I am hopeful that it will.

One more thing: I am seriously considering writing a second implementation of the back end that uses hash tables, linked lists, and similar data structures for in-memory CIF storage. I anticipate that it would offer both advantages and disadvantages relative to the SQLite version, but I don't have a good feel for where the balance would lie. If I actually have time to do that, however, or if someone else does, then at minimum it will help flush out quirks and implementation assumptions in the API definition.

This latter would appear to be more work, especially if you eschew the C++ stdlib container types.

You may be right, but I'm not so confident. Certainly, one doesn't need to turn to STL containers to find decent, useful implementations of hash table and list management routines. Using native data structures directly for storage also would obviate all the code currently needed for marshaling data into and out of the DB, and the attendant error handling that must be implemented. I suspect the code using only native structures would actually be shorter, simpler, and faster. The reasons for a DB back end lie instead in many of the other things we are discussing.

One of the attractions of the relational database route is that user-defined aggregate and row functions in SQLite find a natural mapping to dREL, key relations can be enforced by the database, and category merging/splitting as proposed in DDLm is trivial.

I confess that at this point I am focusing on generic, universal CIF2 support rather than on implementation features catering specially to dREL or any particular DDL. I think it is more important at the moment to produce and validate an API that serves that purpose, regardless of the details of any prototype implementation(s). I am entirely prepared to find that a different implementation is more suited to support validation against dictionaries in general, and the needs of DDLm / dREL in particular, but I don't think we're in a position yet to have a well-informed discussion on that topic.

jcbollinger
Posts: 57
Joined: Tue Dec 20, 2011 2:41 pm

Re: Draft API specification

Post by jcbollinger » Wed Mar 27, 2013 9:58 pm

I now have a full API spec and a largely untested implementation for what I am calling the CIF API "core", which is basically everything related to manipulating an in-memory (or, as I now prefer to say, "managed") CIF. That excludes parsing, post-parse validation, and writing to external files, but includes everything to do with manipulating values, including table and list values.

I would like to make the HTML documentation available for comment and discussion, but I don't appear able to add attachments to messages in this forum. I also don't have a local capability to publish world-accessible data. Does anyone have a suggestion on how to share this? Or the code, for that matter?

John
