Retrieving content from a dataset identified by a DOI

This is a public forum that invites community input on strategies and desirable practices in providing open and long-term access to diffraction data sets.

Moderator: Brian McMahon

Brian McMahon

Retrieving content from a dataset identified by a DOI

Post by Brian McMahon » Thu Sep 29, 2016 5:29 pm

Retrieving content from a dataset identified by a DOI - designing an extensible protocol

John Helliwell recently posted a message to the dddwg mailing list that raises an interesting point about the deposition of datasets. To secure a persistent long-term home for a set of diffraction images, John uploaded over 1000 image files to Zenodo as part of a single dataset deposition, which was assigned the DOI 10.5281/zenodo.154704.

Performing the upload was straightforward: the 1190 files were selected using the (SHIFT + SELECT) function on the originating PC, and all were uploaded in a single commit. However, the uploaded files are presented on the DOI landing page (http://doi.org/10.5281/zenodo.154704) as 1190 individual hyperlinks, and there is no easy way through the standard Zenodo interface to download all of them. One might encourage Zenodo to provide a feature in the download interface to select and package multiple files. Or one could use an external tool such as wget to download all the subsidiary linked files, or to scrape some subset of linked files and retrieve them in one go. However, it would be useful to develop a formal and extensible approach that can be applied to different sets of requirements, that can select different subsets of the data according to content-driven needs, and that can be driven programmatically.
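
As an interim workaround, such a bulk retrieval can of course be scripted today. The following is a minimal Python sketch, assuming that Zenodo's public REST endpoint for a record returns a JSON "files" list with a filename ("key") and a direct download link ("links"/"self") for each file; those field names are an assumption to be checked against the live API, not part of any agreed protocol.

Code: Select all

# Bulk-download every file attached to a Zenodo record.
# ASSUMPTION: the record endpoint returns JSON with a "files" list whose
# entries carry a filename ("key") and a download URL ("links"/"self");
# verify against the live Zenodo API before relying on this.
import requests

record = requests.get("https://zenodo.org/api/records/154704").json()
for f in record["files"]:
    name = f["key"]                       # original filename (assumed field)
    url = f["links"]["self"]              # direct download link (assumed field)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(name, "wb") as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
    print("fetched", name)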

Other contributors to the list discussion have said that their normal practice in such a case would be to bundle the files together in a single zip archive. This certainly facilitates 'one-click' download of the entire series. On the other hand, it makes it cumbersome to retrieve one specific image, if that were wanted: one would need to download the entire archive (probably several gigabytes) to extract a single 6 MB image, incurring high overheads both in network bandwidth and in temporary storage on the local filesystem. One can imagine this becoming a greater problem as the file sizes of individual images grow, perhaps dramatically, in the next few years.

Indeed, the problem is more general. There is no particular rule determining what a DOI should be applied to. In principle, John could have assigned a separate DOI to each individual image. In practice, that seems unreasonable - there is a real cost involved in each DOI assignment, both in the handling fee charged by the DOI registration agency and in the construction of the appropriate metadata to characterise each individual digital object. Probably most users would agree that one DOI "per experiment" is appropriate. But what is an experiment? A single beamline session? A collection of scans at a single wavelength? A full multi-wavelength collection? All diffraction datasets collected at a single resolution? At multiple resolutions? One derivative versus many derivatives? Multiple crystals? It may well be that different studies suggest different granularities as being appropriate for separate DOI assignments. In that case, one might find oneself faced with the prospect of retrieving all images associated with one crystal from a much larger collection that is described by a single DOI. So the general problem is: given a DOI, is there some mechanism that will allow a user to retrieve some specific subset of the content associated with that DOI?

The short answer is "no". The DOI is simply a persistent identifier that allows you to be certain that a digital resource is the one specified by another user. It doesn't even guarantee that you can retrieve the content: a DOI resolution may tell you that the content you were looking for via that identifier no longer exists (the point is that it should, forever and unequivocally, inform you about the disposition of the digital object thus identified). The DOI in itself does not contain detailed information about the internal structure of a digital object.

The longer answer is "no - but...". One of the attractions of the DOI over other persistent identifier schemes is that there is a significant infrastructure underpinning its implementation. Federated registration agencies and resolver systems allow the possibility of layering additional functionality on top of the identifier itself. This is something that CrossRef has done very effectively for many publishing requirements, and that DataCite is in a position to develop for research data management. In the rest of this post, I explore some ideas about what might be needed to develop an interface for retrieving specific content included within an umbrella DOI. Given the general nature of the problem, this should be developed as a methodology independent of specific disciplinary requirements; nevertheless examples ('use cases') from crystallography might be very helpful in suggesting the sort of detail that different disciplines might require.

Example 1: sections, tables, figures from a reference book

When the IUCr was publishing International Tables for Crystallography online, we decided that the appropriate level of granularity for assignment of DOIs was one per chapter. Hence if you resolve the DOI 10.1107/97809553602060000512 via the CrossRef site (i.e. visit http://dx.doi.org/10.1107/97809553602060000512) you will come to the "landing page" for Chapter 6.1 ("The 17 plane groups") of International Tables for Crystallography Volume A.

However, we wanted to be able to serve smaller components of this chapter (namely the individual plane-group tables) using a protocol that identified the parent container (i.e. the chapter) by its DOI but carried a query payload that would select a specific sub-component of that chapter. CrossRef offers a mechanism to do this, based on an existing protocol known as OpenURL. The actual implementation of this is somewhat involved - the specific query to retrieve the table for plane group no. 1, for example, is:

Code: Select all

http://dx.doi.org/openurl?url_ver=Z39.88-2003&rfr_id=ori:rid:wiley.com&rft_id=doi:10.1107/97809553602060000512&rfr_dat=cr%5FsetVer%3D01%26cr%5Fpub%3D10%2E1107%26cr%5Fwork%3DThe%2017%20plane%20groups%20%28two%2Ddimensional%20space%20groups%29%26cr%5Fsrc%3D10%2E1002%26cr%5FsrvTyp%3Dpdf%26cr_rfr_dat%3Dsgtable6o1o001


In essence, however, it has three main components: the address of the resolver service, in this case http://dx.doi.org/openurl; the DOI that identifies the top-level digital object, in this case 10.1107/97809553602060000512 identifying Chapter 6.1; and a query in a known language that retrieves the required sub-component (in this case the significant part of the query is that final "sgtable6o1o001" that identifies the specific table).
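
For concreteness, the request can be rebuilt from exactly those three components programmatically. A Python sketch (urlencode produces an equivalent, though not byte-identical, encoding to the hand-built example above):

Code: Select all

# Rebuild the OpenURL request from its three components: the resolver
# address, the container DOI, and the publisher-specific query whose
# final element ("sgtable6o1o001") selects the table for plane group no. 1.
from urllib.parse import urlencode

resolver = "http://dx.doi.org/openurl"
params = {
    "url_ver": "Z39.88-2003",
    "rfr_id": "ori:rid:wiley.com",
    "rft_id": "doi:10.1107/97809553602060000512",   # the container DOI
    "rfr_dat": ("cr_setVer=01&cr_pub=10.1107"
                "&cr_work=The 17 plane groups (two-dimensional space groups)"
                "&cr_src=10.1002&cr_srvTyp=pdf&cr_rfr_dat=sgtable6o1o001"),
}
print(resolver + "?" + urlencode(params))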

Given that this specific implementation exists, is it possible that CrossRef could modify its service to allow individual files from John's Zenodo deposition to be retrieved, each with its own URL but based on the container DOI? Possibly - though, as can be seen from the example, there is a significant amount of additional overhead in the existing OpenURL approach that is designed to meet the requirements of librarians and publishers. Perhaps a better approach is to develop a similar but more streamlined service, maybe hosted at DataCite, that handles scientific data requirements more efficiently. The next sections consider what those requirements might be in the case of crystallographic diffraction experiments.

Example 2. John's Zenodo deposition with multiple individual files

To make things easier to discuss, I invent a hypothetical service to do what we want, which I call datacite.org/gimme and which always takes the parent DOI as its first argument. If there are no other arguments, we might suppose that the result is equivalent to the landing page you would get with a straightforward DOI lookup, i.e.

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.154704
(hypothetical)

would be identical to

http://data.datacite.org/10.5281/zenodo.154704 (actual example)

Then one could request an individual file within this data collection perhaps by specifying a filename and (optionally?) a MIME type so that the user's browser or other software knows what to do with such a file:

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.154704&file=MPTPB-259-13_2_0001.cbf&mime=application/octet-stream


[A couple of technical notes. (1) The real URL would encode special characters in the normal way, i.e. "/" would be represented by "%2F" etc. Here we retain the raw symbols to make the hypothetical URL constructions easier to interpret. (2) If this really caught on, one could look at reviving the mid-1990s proposal of Rzepa and Murray-Rust for chemical MIME types, and extend it e.g. to "chemical/x-cbf".]

A couple of obvious extensions then suggest themselves.

(1) Allow wildcard characters in the filename specification. "file=*" would download all files, "file=*.cbf" would download all CBF files etc.

(2) A processing directive could allow multiple files to be packaged into a convenient single file for download: something like "archive=zip", "archive=tgz", "archive=rar" etc.

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.154704&file=MPTPB*.cbf&archive=zip
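
On the server side, these two directives amount to little more than a glob match followed by on-the-fly packaging. A minimal Python sketch, assuming (hypothetically) that the files deposited under the DOI sit together in a local directory:

Code: Select all

# Hypothetical server-side handling of "file=<pattern>&archive=zip":
# glob-match the deposited files, then return them packaged as one zip.
import fnmatch, io, os, zipfile

def package(deposit_dir, pattern, archive="zip"):
    """Return zip bytes containing every deposited file matching pattern."""
    if archive != "zip":
        raise NotImplementedError("only zip is sketched here")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(deposit_dir)):
            if fnmatch.fnmatch(name, pattern):      # e.g. "MPTPB*.cbf"
                zf.write(os.path.join(deposit_dir, name), arcname=name)
    return buf.getvalue()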


Example 3. A dataset uploaded as a zip archive

In this case, the author has uploaded a large batch of images in (one or more) zip archives:

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.012345&file=expt1.zip&extract=img01.cbf


The idea is that the file parameter again identifies the file on the server that needs to be accessed (in this case it is a zip archive, called expt1.zip), but the "extract" directive tells the server to extract the named component from the zip archive and send only that to the user.
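
Serving such a request should be cheap, because the zip format carries a central directory that lets one member be read without unpacking the whole archive. A minimal sketch of the server-side operation, with hypothetical names:

Code: Select all

# Hypothetical handling of "file=expt1.zip&extract=img01.cbf":
# read one named member out of a stored zip archive.
import zipfile

def extract_member(archive_path, member):
    """Return the raw bytes of a single member of a deposited zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(member)   # e.g. extract_member("expt1.zip", "img01.cbf")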

In a case like this, it is assumed that the end-user has some prior knowledge of the files stored inside the zip, in order to be able to ask for one by name. This knowledge could be imparted by, for example, uploading a manifest file that can be retrieved from the DOI landing page. Alternatively, it might be reasonable to add other directives to the protocol that would allow the user to inspect the contents of the zip archive on the server - something like

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.012345&file=expt1.zip&list_contents


Such a list of contents doesn't fully explain the roles or relationships of individual files, but something of this might be guessed from their names.
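
The corresponding server-side operation is again trivial, since only the zip central directory needs to be read (hypothetical names as before):

Code: Select all

# Hypothetical handling of "file=expt1.zip&list_contents":
# report member names and sizes without extracting anything.
import zipfile

def list_contents(archive_path):
    with zipfile.ZipFile(archive_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]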

Example 4. A dataset uploaded with relational metadata

The trouble with zip (and tar, rar and other generic archive formats) is that such containers allow many files to be bundled together without recording their functional relationships to each other. The next step would be to make those relationships explicit.

There are a number of proposals (perhaps even standards) for packaging the components of a complex document in a well-defined container that includes metadata describing explicitly the relationships between the component files. One I am (slightly) aware of is METS, the Metadata Encoding and Transmission Standard. This was developed by the Library of Congress to specify the relationships between components in a digital document or set of documents. Extending this to scientific datasets would probably require some enlargement of the recognised set of relationships, but such an extended METS schema could provide for an orderly cataloguing of the components of a complex object (i.e. an experimental dataset) upon deposition, and perhaps an obvious retrieval mechanism for individual components based on that machine-readable catalogue.

(If this seems vague, it is because I don't have detailed knowledge of METS or similar standards, but record it as a possible tool in any attempt to characterise and retrieve components of a complex dataset.)

Example 5. Extracting a subset of data from within a CIF file

Now we become a little more ambitious (perhaps a little more fanciful). Suppose the author has deposited a single file (in this case I suppose an imgCIF/CBF file, since I am a little familiar with that format) that contains multiple images, each one in its own data block. It would be nice to imagine that we could request the server to download just one data block, e.g.

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.99999&file=expt1.cbf&mime=chemical/x-cbf&extract=data_img0023


In this (imaginary) example, I am using the same directive "extract" as in the zipfile example, and assuming that the server (a) is capable of reading the MIME type and mapping that to an appropriate application that knows how to handle files with that declared MIME type; and (b) can actually perform the extraction of a named data block and deliver just that subset of the data to the requesting user.
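
For a purely ASCII CIF, the extraction itself could be as simple as scanning for the named data block, as in the Python sketch below; a real imgCIF/CBF file interleaves binary sections, so a production service would need a format-aware library such as CBFlib instead.

Code: Select all

# Naive extraction of one data block from a plain-text CIF file.
# ASSUMPTION: ASCII CIF only - the binary sections of a true CBF file
# would defeat this line-by-line scan.
def extract_data_block(path, block_header):
    """Return the text of the block beginning with e.g. 'data_img0023'."""
    lines, keep = [], False
    with open(path, encoding="ascii", errors="replace") as fh:
        for line in fh:
            if line.lower().startswith("data_"):    # a new data block starts
                keep = (line.strip() == block_header)
            if keep:
                lines.append(line)
    return "".join(lines)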

This raises a number of questions. Is "MIME type" the best parameter for instructing the server how to interpret and perform subsequent directives on a particular file? Usually the MIME type is supplied so that the browser or other client application knows how to handle the content that is received. Maybe one could/should separately define "server_mime" and "client_mime": the first tells the server what sort of file is in the repository, the latter specifies the filetype of the data to be supplied to the user (allowing, in principle, format conversions on the fly).

Second: is a general-purpose "extract" directive going to work across all the possible different data filetypes that might be encountered? Or would it make more sense to specify some query language, and then provide a directive in that query language? E.g. a request such as

Code: Select all

http://datacite.org/gimme?doi=10.5281/zenodo.99999&file=expt1.cbf&query_lang=cif&procedure="sb -r data_img0023"


Here the query (properly URL encoded, of course) "sb -r data_img0023" is understood to be a procedural query in the specified language ("cif") that the server can execute provided it has access to the appropriate processing engine. [Technical note: "sb" in this example refers to the database-like application "StarBase" - see http://www.iucr.org/resources/cif/software/starbase]

The implications of this are (a) a user could make a series of such queries to interrogate the imgCIF itself for information about the data it contains and then retrieve whatever subset of the data is required, based on that information; (b) other languages could be supported. CIF is really rather specialised, but XML/XSLT could retrieve specific data from XML files; SQL could retrieve data from database images; etc. The OpenURL mechanism discussed in Example 1 is another candidate query language.

Example 6. Extracting a subset of data from within an HDF5 file

An exercise for the reader. Imagine the same scenario as Example 5, but for an HDF5/NeXus dataset. Can anyone with experience of these files suggest a notional syntax for retrieving a single image? This is really no different from the CIF example; it is just a request to see whether HDF5 might be handled in the same sort of way, supposing HDF5 to be a file format intermediate in popularity between XML and CIF.
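
As a straw man to start that discussion: a Python sketch using the h5py library, assuming a NeXus-style layout in which the frames form a three-dimensional stack at /entry/data/data (that path, and the frame index, are assumptions about one particular deposition, not a fixed rule).

Code: Select all

# Hypothetical handling of "...&file=expt1.h5&extract=/entry/data/data[23]":
# read a single frame out of an HDF5/NeXus image stack.
# ASSUMPTION: frames are stored as a 3D dataset at /entry/data/data.
import h5py

def get_frame(path, dataset="/entry/data/data", index=23):
    with h5py.File(path, "r") as f:
        return f[dataset][index]    # h5py reads only the requested slice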

Where to go from this?

This post has sketched a very broad outline of how one might start to design repository services so that subsets of deposited data sets identified by a single DOI can be retrieved from sufficiently capable repositories through some sort of generic, extensible protocol. Clearly, to turn it into a working system would require much thought and effort:
  • (i) refine the crystallography use cases by deciding how much it would be desirable to "drill down" into the deposited content;
  • (ii) extend this approach to a broader community to see what features of such a protocol might be necessary or sufficient for different discipline communities;
  • (iii) explore the feasibility of implementing this protocol through the most significant DOI registration agencies/service providers (CrossRef, DataCite?);
  • (iv) design and build a pilot implementation - this could be either in association with an organisation like DataCite, or a specifically crystallographic implementation at one of the existing repositories (Store.Synchrotron, ESRF, NCS Southampton, proteindiffraction.org), depending on the level of interest/enthusiasm.
It is possible that at least some of these steps (given the potential for interdisciplinary application) could be carried out by Working Groups in the framework of the Research Data Alliance.

Brian

jamesrhester

Re: Retrieving content from a dataset identified by a DOI

Post by jamesrhester » Tue Oct 11, 2016 8:31 am

Building on Brian's examples 4 and 6, assume we have the following tools:


  • 'ManifestMaker': A tool that examines the individual files in a Zenodo deposit to produce an overall 'manifest'. This manifest could be as simple as an imgCIF file in which the individual images are replaced by web links, or it could be an SQLite database file.
  • 'FrameGetter': A tool that, given the manifest and a set of values for arbitrary datanames, returns only those 2D frames from the DOI deposit that match those values (if any).
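
To make the manifest idea concrete: it could be nothing more than a relational table with one row per frame and one column per generic dataname. A Python/SQLite sketch of what ManifestMaker might emit (the column names are illustrative, not an agreed vocabulary):

Code: Select all

# Toy manifest: one row per frame, one column per generic dataname.
# The dataname vocabulary is illustrative; agreeing the real terms is a
# job for the DDDWG (see comment 4 below).
import sqlite3

con = sqlite3.connect("manifest.sqlite")
con.execute("""CREATE TABLE IF NOT EXISTS frames (
                   url TEXT,              -- web link standing in for the image
                   crystal_rotation REAL, -- rotation angle (degrees)
                   crystal_chi REAL)""")  # chi setting (degrees)
con.execute("INSERT INTO frames VALUES (?, ?, ?)",
            ("https://zenodo.org/record/012345/files/img0001.cbf", 90.0, 0.0))
con.commit()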

Now consider the following use cases for the Zenodo data and how these tools might be used. Assume that ManifestMaker is run either once for each DOI by our 'gimme' service, or on demand.

Use case 1: Download a subset of frames (e.g. from one chi orientation/to rerun indexing/to check data quality/to check a specific reflection)

The user is presented with a web form that constructs a URL of the type that Brian presented, of the form:

Code: Select all

https://datacite.org/gimme?doi=10.5281/zenodo.012345&<dataname1>=<value1>&<dataname2>=<value2>&...


where <dataname1> etc. are datanames that might appear in our manifest and <value1> etc. are values that may be ranges or lists. These dataname values are passed to FrameGetter, which bundles the 2D frames matching all criteria and initiates the download. So to obtain all frames with a rotation angle between 90 and 120 degrees at a chi angle of 0 we might do:

Code: Select all

https://datacite.org/gimme?doi=10.5281/zenodo.012345&crystal_rotation=90-120&crystal_chi=0


A user who didn't yet know the appropriate ranges could send a specific request for the manifest in text form to the gimme service and have it displayed as either text or html. Brian's suggestions on adding options for preferred format are obviously also possible, and, given a DOI, the web form could be constructed intelligently to offer choices only for datanames that take more than one value over the complete dataset.
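
FrameGetter then reduces to a range query over such a manifest table. A sketch of how the request above might be translated (names hypothetical, as before):

Code: Select all

# Translate "crystal_rotation=90-120&crystal_chi=0" into a range query
# over the manifest table sketched earlier, returning matching frame URLs.
import sqlite3

def matching_urls(db_path, rotation=(90.0, 120.0), chi=0.0):
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT url FROM frames "
        "WHERE crystal_rotation BETWEEN ? AND ? AND crystal_chi = ?",
        (rotation[0], rotation[1], chi))
    return [r[0] for r in rows]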

Use Case 2: Completely re-run the data reduction and analysis process

This is the situation the person who collected the data finds themselves in. FrameGetter just gets everything in the manifest.

Comments

  1. This scheme is completely format agnostic from the point of view of the request. The <datanames> in the query can be generic names - after all, a rotation angle is a rotation angle whether it is stored in HDF5 or CBF. Indeed, the returned data could even be delivered in a different format from the one in which they were deposited.
  2. Following on from the above comment, the query format for HDF5 (Brian's example 6) would be identical. A subset of images can be provided in a single file, or as a master file linked to a series of other HDF5 files - I'm a bit hazy on the latter.
  3. If the manifest is not provided by Zenodo, then we have to use infrastructure that will need to download (but not store) every relevant data deposition.
  4. A key job for the DDDWG is to choose an open-ended set of generic terms to use in such retrievals.
  5. This proposal is really just a straightforward database filtering operation.
  6. Neither ManifestMaker nor FrameGetter requires very much in the way of software effort for CBF or other 'directory of single images' layouts.

To answer Brian's questions at the end of his post:

(i) refine the crystallography use cases by deciding how much it would be desirable to "drill down" into the deposited content;


The manifest can include any items that have been defined by, e.g., the DDDWG as custodians of the standard, so we can build incrementally.

(ii) extend this approach to a broader community to see what features of such a protocol might be necessary or sufficient for different discipline communities;


This approach is generic - any dataset can be viewed in relational database terms (as per my Data Journal paper) and the above operations applied, without the user needing to know anything about relational databases (or hierarchies for that matter).

(iii) explore the feasibility of implementing this protocol through the most significant DOI registration agencies/service providers (CrossRef, DataCite?);

We could provide a manifest-creation service (a la CheckCIF) if the IUCr has sufficient bandwidth and the service providers are prepared to associate or link a manifest with each dataset.

(iv) design and build a pilot implementation - this could be either in association with an organisation like DataCite, or a specifically crystallographic implementation at one of the existing repositories (Store.Synchrotron, ESRF, NCS Southampton, proteindiffraction.org), depending on the level of interest/enthusiasm.


If we could get a tame webserver to put up a demonstration, that might go a long way towards convincing these other organisations. I think that the design is not all that hard, if we can agree on a set of generic terms and a web interface as described above. I can't resist pointing to my own Zenodo deposit (https://zenodo.org/record/154459), in which Python software, guided by an ontology (in that case a DDLm dictionary), reads, writes and derives generic datanames in a format-agnostic manner. Interfaces to a particular format are provided by modules with a uniform, simple API that work exclusively in terms of the generic datanames (imgCIF and HDF5 modules are provided). Similar modules could be added for everything in Wladek's long image list, meaning that ManifestMaker could handle any image deposition format - although there may be issues in actually deciding which format an image is expressed in.
