John Helliwell recently posted a message to the dddwg mailing list that raises an interesting point about the deposition of datasets. To secure a persistent long-term home for one such dataset, John uploaded over 1000 image files to Zenodo, in a deposition that was assigned the DOI 10.5281/zenodo.154704.
Performing the upload was straightforward: the 1190 files were selected together (SHIFT + select) on the originating PC and uploaded in a single commit. However, the uploaded files are presented on the DOI landing page (http://doi.org/10.5281/zenodo.154704) as 1190 individual hyperlinks, and there is no easy way through the standard Zenodo interface to download all of them. One might encourage Zenodo to provide a feature in the download interface to select and package multiple files. Or one could use an external tool such as wget to download all the subsidiary linked files, or to scrape some subset of them and retrieve it in one go. However, it would be useful to develop a formal and extensible approach: one that can be applied to different sets of requirements, can select different subsets of the data according to content-driven needs, and can be driven programmatically.
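As a sketch of the scripted approach, Zenodo exposes each record's metadata as JSON through its REST API, from which a short script could collect the download links. The field names below ("files", "key", "links") are my assumption about the current API schema and should be checked against a live record:

```python
import json
import urllib.request

# The record number is the final component of the DOI 10.5281/zenodo.154704.
RECORD_URL = "https://zenodo.org/api/records/154704"

def file_links(record):
    """Given a Zenodo record parsed from JSON, return (filename, url) pairs.
    The 'files'/'key'/'links' key names are assumptions about the API schema."""
    return [(f["key"], f["links"]["self"]) for f in record.get("files", [])]

def download_matching(suffix=".cbf"):
    """Download every file in the record whose name ends with `suffix`."""
    with urllib.request.urlopen(RECORD_URL) as resp:
        record = json.load(resp)
    for name, url in file_links(record):
        if name.endswith(suffix):
            urllib.request.urlretrieve(url, name)
```

This already allows content-driven selection of a sort (by filename pattern), but only because the client does all the work; the rest of this post asks what the repository side could offer instead.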
Other contributors to the list discussion have said that their normal practice in such a case would be to bundle the files together in a single zip archive. This certainly facilitates 'one-click' download of the entire series. On the other hand, it makes it cumbersome to retrieve one specific image, if that were wanted. One would need to download the entire archive (probably several gigabytes) to extract a single 6 MB image, incurring high overheads both in network bandwidth and in temporary storage on the local filesystem. One can imagine this becoming a greater problem as the file sizes of individual images grow, perhaps dramatically, in the next few years.
Indeed, the problem is more general. There is no particular rule determining what a DOI should be applied to. In principle, John could have assigned a separate DOI to each individual image. In practice, that seems unreasonable - there is a real cost involved in each DOI assignment (both in the handling fee charged by the DOI registration agency and in the construction of the appropriate metadata to characterise each individual digital object). Probably most users would agree that one DOI "per experiment" is appropriate. But what is an experiment? A single beamline session? A collection of scans at a single wavelength? A full multi-wavelength collection? All diffraction datasets collected at a single resolution? At multiple resolutions? One derivative versus many derivatives? Multiple crystals? It may well be that different studies suggest different granularities as being appropriate for separate DOI assignments. In that case, one might find oneself faced with the prospect of retrieving all images associated with one crystal from a much larger collection that is described by a single DOI. So the general problem is: given a DOI, is there some mechanism that will allow a user to retrieve some specific subset of the content associated with that DOI?
The short answer is "no". The DOI is simply a persistent identifier that allows you to be certain that the digital resource you reach is the one specified by another user. It doesn't even guarantee that you can retrieve the content - it is possible that a DOI resolution tells you that the specific content you were looking for no longer exists (but the point is that it should, forever and unequivocally, inform you about the disposition of the digital object thus identified). The DOI in itself does not contain detailed information about the internal structure of a digital object.
The longer answer is "no - but...". One of the attractions of the DOI over other persistent identifier schemes is that there is a significant infrastructure underpinning its implementation. Federated registration agencies and resolver systems allow the possibility of layering additional functionality on top of the identifier itself. This is something that CrossRef has done very effectively for many publishing requirements, and that DataCite is in a position to develop for research data management. In the rest of this post, I explore some ideas about what might be needed to develop an interface for retrieving specific content included within an umbrella DOI. Given the general nature of the problem, this should be developed as a methodology independent of specific disciplinary requirements; nevertheless examples ('use cases') from crystallography might be very helpful in suggesting the sort of detail that different disciplines might require.
Example 1. Sections, tables, figures from a reference book
When the IUCr was publishing International Tables for Crystallography online, we decided that the appropriate level of granularity for assignment of DOIs was one per chapter. Hence if you resolve the DOI 10.1107/97809553602060000512 via the CrossRef site (i.e. visit http://dx.doi.org/10.1107/97809553602060000512) you will come to the "landing page" for Chapter 6.1 ("The 17 plane groups") of International Tables for Crystallography Volume A.
However, we wanted to be able to serve smaller components of this chapter (namely the individual plane-group tables) using a protocol that identified the parent container (i.e. the chapter) by its DOI but carried a query payload that would select a specific sub-component of that chapter. CrossRef offers a mechanism to do this, based on an existing protocol known as OpenURL. The actual implementation is somewhat involved: the query to retrieve the table for plane group no. 1, for example, wraps the DOI and a table identifier in several pieces of OpenURL bookkeeping.
In essence, however, it has three main components: the address of the resolver service, in this case http://dx.doi.org/openurl; the DOI that identifies the top-level digital object, in this case 10.1107/97809553602060000512 identifying Chapter 6.1; and a query in a known language that retrieves the required sub-component (in this case the significant part of the query is that final "sgtable6o1o001" that identifies the specific table).
Given that this specific implementation exists, is it possible that CrossRef could modify its service to allow individual files from John's Zenodo deposition to be retrieved, each with its own URL but based on the container DOI? Possibly - though, as can be seen from the example, there is a significant amount of additional overhead in the existing OpenURL approach that is designed to meet the requirements of librarians and publishers. Perhaps a better approach is to develop a similar but more streamlined service, maybe hosted at DataCite, that handles scientific data requirements more efficiently. The next sections consider what those requirements might be in the case of crystallographic diffraction experiments.
Example 2. John's Zenodo deposition with multiple individual files
To make things easier to discuss, I invent a hypothetical service to do what we want, which I call datacite.org/gimme and which always takes the parent DOI as its first argument. If there are no other arguments, we might suppose that the result is equivalent to the landing page you would get with a straightforward DOI lookup, i.e.
http://datacite.org/gimme?doi=10.5281/zenodo.154704
would be identical to
http://data.datacite.org/10.5281/zenodo.154704 (actual example)
Then one could request an individual file within this data collection, perhaps by specifying a filename and (optionally?) a MIME type so that the user's browser or other software knows what to do with such a file (the filename here is purely illustrative):

http://datacite.org/gimme?doi=10.5281/zenodo.154704&file=img0001.cbf&mime=chemical/x-cbf
[A couple of technical notes. (1) The real URL would encode special characters in the normal way, i.e. "/" would be represented by "%2F" etc. Here we retain the raw symbols to make the hypothetical URL constructions easier to interpret. (2) If this really caught on, one could look at reviving the mid-1990s proposal of Rzepa and Murray-Rust for chemical MIME types, and extend it e.g. to "chemical/x-cbf".]
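To illustrate note (1), a client-side helper might assemble a correctly percent-encoded request to the hypothetical gimme service (the service, filename and parameter names are all inventions of this post):

```python
from urllib.parse import urlencode

def gimme_url(doi, **params):
    """Build a request URL for the hypothetical datacite.org/gimme service.
    urlencode percent-encodes special characters, so '/' becomes %2F etc."""
    query = urlencode({"doi": doi, **params})
    return "http://datacite.org/gimme?" + query

# e.g. requesting one CBF image with a (notional) chemical MIME type:
url = gimme_url("10.5281/zenodo.154704",
                file="img0001.cbf", mime="chemical/x-cbf")
```

The resulting URL carries "doi=10.5281%2Fzenodo.154704", which the server would decode back to the raw DOI before resolution.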
A couple of obvious extensions then suggest themselves.
(1) Allow wildcard characters in the filename specification. "file=*" would download all files, "file=*.cbf" would download all CBF files etc.
(2) A processing directive could allow multiple files to be packaged into a convenient single file for download: something like "archive=zip", "archive=tgz", "archive=rar" etc. For example:

http://datacite.org/gimme?doi=10.5281/zenodo.154704&file=*.cbf&archive=zip
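On the server side, these two extensions need amount to little more than a wildcard match over the deposited filenames plus an archiver. A minimal sketch (function names invented for illustration):

```python
import fnmatch
import io
import zipfile

def select_files(names, pattern):
    """Server-side expansion of a wildcard 'file=' parameter
    (e.g. pattern '*.cbf' against the deposited filenames)."""
    return [n for n in names if fnmatch.fnmatch(n, pattern)]

def package_zip(files):
    """Honour 'archive=zip': bundle (name, bytes) pairs into a single
    zip archive and return it as bytes ready to stream to the client."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files:
            zf.writestr(name, data)
    return buf.getvalue()
```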
Example 3. A dataset uploaded as a zip archive
In this case, the author has uploaded a large batch of images in (one or more) zip archives:
http://datacite.org/gimme?doi=10.5281/zenodo.9999&file=expt1.zip&extract=img0001.cbf
The idea is that the file parameter again identifies the file on the server that needs to be accessed (in this case it is a zip archive, called expt1.zip), but the "extract" directive tells the server to extract the named component from the zip archive and send only that to the user.
In a case like this, it is assumed that the end-user has some prior knowledge of the files stored inside the zip, in order to be able to ask for one by name. This knowledge could be imparted, e.g. by uploading a manifest file that can be retrieved from the DOI landing page. Alternatively, it might be reasonable to add other directives to the protocol that would allow the user to inspect the contents of the zip archive on the server - something like
http://datacite.org/gimme?doi=10.5281/zenodo.9999&file=expt1.zip&list=contents
Such a list of contents doesn't fully explain the roles or relationships of the individual files, but something of their roles might be inferred from their names.
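Both directives are cheap to implement server-side with standard tooling, since zip archives carry a central directory that can be read without decompressing the members. A sketch, assuming the archive lives on the repository's filesystem (or any seekable file object):

```python
import zipfile

def zip_contents(archive):
    """Honour a 'list=contents' directive: report (name, size) for each
    member by reading only the zip's central directory."""
    with zipfile.ZipFile(archive) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

def zip_extract(archive, member):
    """Honour 'extract=<member>': return just that member's bytes,
    so the client never downloads the whole multi-gigabyte archive."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)
```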
Example 4. A dataset uploaded with relational metadata
The trouble with zip (or tar, rar and other generic archive formats) is that they allow the bundling together of many files without recording their functional relationship to each other. The next step would be to make those relationships explicit.
There are a number of proposals (perhaps even standards) to package components of a complex document in a well-defined container that includes metadata displaying explicitly the relationships between the component files. One I am aware of (slightly) is METS, the Metadata Encoding and Transmission Standard. This was developed by the Library of Congress to specify the relationship between components in a digital document or set of documents. Extending this to scientific datasets would probably require enlarging the set of recognised relationships, but such an extended METS schema could provide for an orderly cataloguing of the components of a complex object (i.e. an experimental dataset) upon deposition, and perhaps an obvious retrieval mechanism for individual components based on that machine-readable catalogue.
(If this seems vague, it is because I don't have detailed knowledge of METS or similar standards, but record it as a possible tool in any attempt to characterise and retrieve components of a complex dataset.)
Example 5. Extracting a subset of data from within a CIF file
Now we become a little more ambitious (perhaps a little more fanciful). Suppose the author has deposited a single file (in this case I suppose an imgCIF/CBF file, since I am a little familiar with that format) that contains multiple images, each one in its own data block. It would be nice to imagine that we could request the server to download just one data block, e.g.
http://datacite.org/gimme?doi=10.5281/zenodo.9999&file=expt1.cbf&mime=chemical/x-cbf&extract=data_img0023
In this (imaginary) example, I am using the same directive "extract" as in the zipfile example, and assuming that the server (a) is capable of reading the MIME type and mapping that to an appropriate application that knows how to handle files with that declared MIME type; and (b) can actually perform the extraction of a named data block and deliver just that subset of the data to the requesting user.
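For capability (b), at least for plain-text CIFs, a naive prototype could simply split the file on data-block headers. (A production imgCIF/CBF server would of course need a real parser that understands the binary sections; this sketch is only to show that block-level extraction is mechanically simple.)

```python
import re

def extract_data_block(cif_text, block_name):
    """Return the named data block (e.g. 'data_img0023') from a multi-block
    CIF, running from its header to the next 'data_' header. Naive: a line
    beginning 'data_' inside a value would be misread as a block boundary."""
    header = re.compile(r"^data_\S+", re.MULTILINE)
    starts = [(m.start(), m.group()) for m in header.finditer(cif_text)]
    for i, (pos, name) in enumerate(starts):
        if name == block_name:
            end = starts[i + 1][0] if i + 1 < len(starts) else len(cif_text)
            return cif_text[pos:end]
    raise KeyError(block_name)
```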
This raises a number of questions. Is "MIME type" the best parameter for instructing the server how to interpret and perform subsequent directives on a particular file? Usually the MIME type is supplied so that the browser or other client application knows how to handle the content that is received. Maybe one could/should separately define "server_mime" and "client_mime" - the first tells the server what sort of file is in the repository, the latter specifies the filetype of the data to be supplied to the user (this allows, in principle, format conversions on the fly).
Second: is a general-purpose "extract" directive going to work across all the possible different data filetypes that might be encountered? Or would it make more sense to specify some query language, and then provide a directive in that query language? E.g. a request such as
http://datacite.org/gimme?doi=10.5281/zenodo.9999&file=expt1.cbf&query_lang=cif&procedure="sb -r data_img0023"
Here the query (properly URL encoded, of course) "sb -r data_img0023" is understood to be a procedural query in the specified language ("cif") that the server can execute provided it has access to the appropriate processing engine. [Technical note: "sb" in this example refers to the database-like application "StarBase" - see http://www.iucr.org/resources/cif/software/starbase]
The implications of this are (a) a user could make a series of such queries to interrogate the imgCIF itself for information about the data it contains and then retrieve whatever subset of the data is required, based on that information; (b) other languages could be supported. CIF is really rather specialised, but XML/XSLT could retrieve specific data from XML files; SQL could retrieve data from database images; etc. The OpenURL mechanism discussed in Example 1 is another candidate query language.
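Implication (b) amounts to a simple server-side dispatch from the query_lang parameter to a processing engine. A sketch, with the handler names invented as placeholders for whatever engines (StarBase, an XSLT processor, ...) the repository actually has installed:

```python
# Hypothetical dispatch table: each supported query language maps to a
# handler that runs the (already URL-decoded) procedure against a file.

def run_starbase(path, procedure):
    """Placeholder for invoking a StarBase-like CIF query engine."""
    raise NotImplementedError

def run_xslt(path, procedure):
    """Placeholder for invoking an XSLT processor on an XML file."""
    raise NotImplementedError

QUERY_ENGINES = {"cif": run_starbase, "xml": run_xslt}

def handle_query(path, query_lang, procedure):
    """Route a 'query_lang'/'procedure' pair to the matching engine."""
    try:
        engine = QUERY_ENGINES[query_lang]
    except KeyError:
        raise ValueError("unsupported query_lang: %s" % query_lang)
    return engine(path, procedure)
```

Adding support for a new language is then just a matter of registering another handler in the table.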
Example 6. Extracting a subset of data from within an HDF5 file
An exercise for the reader. Imagine the same scenario as Example 5, but for an HDF5/NeXus dataset. Can anyone with experience of these files suggest a notional syntax for retrieving a single image? This is really no different from the CIF example; it is just a request to see whether HDF5 might be handled in the same sort of way, supposing HDF5 to be a file format intermediate in popularity between XML and CIF.
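By way of a purely notional answer, one might imagine something like extract=/entry/data/data[23] selecting frame 23 of the image stack at an NXmx-style dataset path. The bracket syntax is invented here; the server's first job would simply be to parse such a directive before handing the path and index to an HDF5 library such as h5py:

```python
import re

def parse_hdf5_extract(directive):
    """Parse a notional extract directive such as '/entry/data/data[23]'
    into (dataset_path, frame_index). The '[n]' suffix is optional and
    the whole syntax is an invention of this post."""
    m = re.fullmatch(r"(?P<path>[^\[\]]+)(?:\[(?P<idx>\d+)\])?", directive)
    if m is None:
        raise ValueError("bad extract directive: %s" % directive)
    idx = m.group("idx")
    return m.group("path"), (int(idx) if idx is not None else None)
```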
Where to go from this?
This post has sketched a very broad outline of how one might start to design repository services so that subsets of deposited data sets identified by a single DOI can be retrieved from sufficiently capable repositories through some sort of generic, extensible protocol. Clearly, to turn it into a working system would require much thought and effort:
- (i) refine the crystallography use cases by deciding how much it would be desirable to "drill down" into the deposited content;
- (ii) extend this approach to a broader community to see what features of such a protocol might be necessary or sufficient for different discipline communities;
- (iii) explore the feasibility of implementing this protocol through the most significant DOI registration agencies/service providers (CrossRef, DataCite?);
- (iv) design and build a pilot implementation - this could be either in association with an organisation like DataCite, or a specifically crystallographic implementation at one of the existing repositories (Store.Synchrotron, ESRF, NCS Southampton, proteindiffraction.org), depending on the level of interest/enthusiasm.