John R Helliwell, Brian McMahon and Tom Terwilliger
The Workshop was held on August 6th 2012 at Bergen University and attracted participants from Europe and the USA, from universities and from synchrotron radiation and neutron facilities. The Workshop programme is listed in detail in Appendix 1; all talks went ahead as planned. The IUCr and the DDD Working Group are very grateful to all speakers and participants for their contributions. The abstracts and talks can be found on the public page of the DDD forum (see http://forums.iucr.org); these were posted on August 15th 2012 and by midday of August 20th had already received 14 views. Ten days before the Bergen Workshop, at the ACA meeting in Boston, Dr Tom Terwilliger chaired a meeting on DDD; his notes for that meeting are reproduced as Appendix 2. Appendix 3 carries IUCr Forum postings made after the Workshop which are directly linked to it. It is anticipated that a DDD discussion will be held at the AsCA meeting in Adelaide in December 2012, with the objective of reviewing the details below and achieving a global consultation on these matters, thereby enabling the crystallographic community to move forward as expeditiously as possible with its views and plans for archiving its raw data.
The need to have clarity on these DDD issues has two main aspects. First, crystallographers have obligations to securely and properly retain the raw data that they measure (‘loss of data is viewed as research malpractice’). Second, the reader of a published article involving crystallography can and should have access to the raw data on which the article is based (‘don’t take my word for it; try the data yourself and see directly the research results’).
The actions and recommendations arising from the Bergen Workshop can be summarised as follows:
• The Workshop notes that in many areas of science besides our own there is enthusiasm for, and encouragement of, archiving more than derived or processed data.
• The crystallographic community prides itself on making its processed data accompany its publications; indeed, this has been obligatory in IUCr journals for over 10 years.
• We, the crystallographic community, basically have three practical options in the near future for extending these principles to our raw data:
– via a local Data Archive
– via synchrotron or neutron or X-ray laser (or other large-scale experimental facility) data storage
– or via the corresponding author setting up a personal link to datasets underpinning publications on their personal websites. [At the Workshop the Protein Data Bank (John Westbrook) offered that the PDB would help to coordinate DOI registration in cases where the raw data could be hosted on a reliable public site.]
• So we suggest that we encourage all three practical options and recommend to the IUCr Executive Committee that:
– Authors should provide a permanent and prominent link from an article to the raw data sets underpinning a journal publication
with a view to making this a formal requirement on authors at such time as the community has adopted raw data deposition as a routine procedure.
That said, there is an urgent need to be clear about the metadata required by the various IUCr Commissions for their experimental raw data. John Westbrook of the RCSB stressed the importance of this: “if the metadata details required are not standardised then there will be datasets, but ones which are nothing more than a mess and which would not be effectively usable by someone retrieving such raw datasets”.
At the time of writing we are aware of two attempts to describe in detail the technical metadata required:
Article submission AJ5204 to J. Appl. Cryst. for protein crystallography, entitled “Experience with exchange and archiving of raw data: comparison of data from two diffractometers and four software packages on a series of lysozyme crystals” by Simon W. M. Tanley, Antoine M. M. Schreurs, John R. Helliwell and Loes M. J. Kroon-Batenburg.
Post-workshop note: we are now aware of a similar type of article submission to JSR, defining data formats in X-ray absorption spectroscopy, now in press.
Whilst the IUCr Commissions need to undertake the specification of "technical" metadata, i.e. those specific to their experimental raw data, there is also a need to review "generic" metadata, e.g. who "owns" a data set, as well as details of research grants, embargo periods imposed by researchers, their institutions or their funding agencies, etc. Perhaps a higher-level classification of the domain of study is needed. For instance, a synchrotron facility might need to define different data storage policies for, say, X-ray diffraction images versus X-ray tomography images, and those policies can be implemented automatically if data sets carry characteristics identifying what sort of scientific study they represent. We feel it would therefore be beneficial to form a specialist group to analyse these requirements. Its remit could be defined to include comparisons with the existing content of the imgCIF dictionary, and exploration of current operational practice in services such as the UK National Crystallography Service, the UK's STFC RAL, and European, American and Australian synchrotrons. The members of this sub-group would be specialists able to represent the areas and bodies mentioned, and the group would likely be a sub-group of the IUCr DDD WG. It cannot be convened informally, as it will have some budget needs, e.g. to allow a meeting, probably held partly by Skype and hosted at IUCr, Chester, UK.
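To make the idea of domain-driven storage policies concrete, the following is a minimal sketch in Python. All field names, domains and retention periods here are hypothetical illustrations, not a proposed standard; the point is only that a facility could route a data set to a policy automatically once its "generic" metadata declares a domain of study.

```python
# A hypothetical "generic" metadata record of the kind discussed above.
# Every field name and value is illustrative only, not a proposed standard.
generic_metadata = {
    "owner": "A. N. Author, University of Somewhere",
    "grant": "Funding Agency grant XY/123456",
    "embargo_until": "2015-08-06",           # ISO 8601 date
    "domain": "X-ray diffraction",           # vs. e.g. "X-ray tomography"
}

# Illustrative mapping from domain of study to a storage policy;
# a real facility would define its own domains and retention rules.
POLICY_BY_DOMAIN = {
    "X-ray diffraction": "retain 10 years",
    "X-ray tomography": "retain 3 years",
}

def storage_policy(record):
    """Return the storage policy implied by a record's declared domain."""
    return POLICY_BY_DOMAIN.get(record.get("domain"), "default policy")
```

With such a classification in place, `storage_policy(generic_metadata)` would select the diffraction-image policy without any human intervention, which is the kind of automation the paragraph above envisages.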
Thus a way forward to emphasise, and hopefully guarantee, that metadata technical definitions and standards are clarified satisfactorily is for the IUCr Executive Committee to require its Commissions to produce their metadata recommendations as soon as possible.
An exemplar of good practice demonstrating access to raw data is at the ISIS UK neutron source. In her workshop presentation Dr Erica Yang showed an example STFC DOI landing page for a data set (RB 920486) at https://data.isis.stfc.ac.uk/doi/INVEST ... /24079627/ .
The ISIS data management policy can be found here:
http://www.isis.stfc.ac.uk/user-office/ ... 11204.html accessed August 30th 2012.
We highlight a couple of points from this policy:
5.4 PIs and researchers who carry out analyses of raw data and metadata are encouraged to link the results of these analyses with the raw data / metadata using the facilities provided by the on-line catalogue. Furthermore, they are encouraged to make such results publicly accessible.
3.3.3 Access to raw data and the associated metadata obtained from an experiment is restricted to the experimental team for a period of three years after the end of the experiment. Thereafter, it will become publicly accessible. Any PI that wishes their data to remain ‘restricted access’ for a longer period will be required to make a special case to the Director of ISIS.
In the Workshop, discussion of the concept of “The Living Publication” led to the suggestion that journals need a new category name for articles stemming from a ‘starter article and data set’, equivalent to Erratum or Corrigendum; those terms, however, would not be appropriate, since no error or correction as such is implied by these elaborations of the original study. Thus ‘Ad extensum’ was offered as a suggested article type name. Subsequent discussions with Peter Strickland, Managing Editor of IUCr Journals, revealed that a mechanism can already be provided to let the reader know that an article is related to other articles and data, and this can be accompanied by metadata explaining the relationships. Thus, in a digital world there need not be an 'Ad extensum', i.e. an extension to the title of a parent article; for derivative articles there could instead be an agreed set of metadata across publishers. Specifically, CrossRef is an organisation defining such standards across publishers, and CrossMark is an example of its first attempt to get publishers to collaborate in this way.
Appendix 1: Workshop Programme
This full-day Workshop was organised by the DDD Working Group (WG), appointed by the IUCr Executive Committee to define the need for and practicalities of routine deposition of primary experimental data in X-ray diffraction and related experiments. It reviewed progress during the Working Group's first year of activity.
Objective: To help frame a policy to be drafted by the IUCr DDD WG on Raw diffraction data deposition for final approval by the IUCr Executive Committee
Monday August 6th
09:00 Introduction and welcome. John Helliwell (U. Manchester) and Brian McMahon (IUCr)
09.05 The IUCr Diffraction Data Deposition Working Group Activities since IUCr Madrid. John R. Helliwell and Brian McMahon
09.30 Motivations, challenges, horror stories and opportunities: Experiences of diffraction data management, archival and publication at the UK National Crystallography Service. Simon Coles (UK National Crystallography Service, U. Southampton)
10.05 Report on several important EU projects: CRISP, PaNdata, NMI3, BioStruct-X, HDRI and CALIPSO. Heinz J. Weyer (SwissFEL, Paul Scherrer Institut)
10.40 Linking raw experimental data with scientific workflow and software repository: some early experience in the PanData-ODI project. Erica Yang and Brian Matthews (STFC Rutherford Appleton Laboratory)
11.35 Ten years and change: the MX data archive at ALS 8.3.1. James Holton (UCSF, Lawrence Berkeley National Laboratory)
12.10 Continuous improvement of macromolecular crystal structures. Thomas C. Terwilliger (Los Alamos National Laboratory)
12.45 Open Discussion
14.00 Towards policy for archiving raw data for macromolecular crystallography: recent experience. Loes M. J. Kroon-Batenburg and Antoine M. M. Schreurs (U. Utrecht); Simon W. M. Tanley and John R. Helliwell (U. Manchester)
14.35 Some Economic Considerations for Managing a Centralized Archive of Raw Diffraction Data. John Westbrook (RCSB, U. Rutgers)
15.10 A vision involving raw data archiving via local archives as a supplement to the existing processed data archives (PDB, CSD, ICDD etc). John R. Helliwell, Brian McMahon and Thomas C. Terwilliger
15.45 Remote participants and open discussion
16.20 Summing up. John R. Helliwell and Brian McMahon
Appendix 2: IUCr DDD WG Input Meeting
ACA 2012 Boston, MA July 29, 2012
Present: Frances Bernstein, Herbert Bernstein, Patrick Mercier, Paul Langan, John Westbrook, Marian Szebenyi, James Holton. Chair: Tom Terwilliger
Tom Terwilliger presented slides from John Helliwell and Brian McMahon (see http://www.iucr.org/resources/data/dddw ... n-workshop for these presentations and others that were presented at the DDD WG meeting in Bergen, Aug 6, 2012).
John Westbrook described DOIs. The PDB has access, with a subscription fee. Registration is done automatically for structure entries, and will be soon for experimental data files. A DOI is relocatable, which has many advantages. To generate DOIs you have to be a publisher of data (DOE has this capacity already, for example). Note that DOI data must be available at the time of generation.
Herbert Bernstein pointed out that OSTI/DOE will issue DOIs.
Marian Szebenyi points out that you could have a generic tag, like person-date-location, which could be used to track datasets (without worrying about someone accessing data that is private).
Herbert Bernstein notes that it would be useful to keep a digital hash of each data item.
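The suggestion of keeping a digital hash of each data item can be sketched as follows. This is a generic illustration, not any facility's actual implementation: it uses Python's standard `hashlib` to fingerprint a (possibly very large) raw data file in chunks, so that a later retrieval can be verified against the recorded digest.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a hex digest of a data file, reading it in 1 MB chunks
    so that multi-gigabyte image files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

An archive (or a DOI record, as discussed above) could store this digest alongside the link to the data set; anyone downloading the data later recomputes it to confirm the file is intact and unaltered.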
John Westbrook notes that things we need include
• metadata required for data processing
• data itself
If "archival" level, one would want to be able reasonably to expect that a person could reproduce and validate the original results.
If only a "data dump" (a motivated user would have a chance of doing something with the data), then it is maybe not so difficult to save, but difficult for a non-expert to use, and the chance of getting data in usable form is low.
James Holton noted that we might not even have sequence information at the beamline.
Herbert Bernstein suggests the DOI be used as a tracking mechanism (like a database, or part of a database). A CIF is a living document that gets things added to it as you go along.
John Westbrook notes that significant effort will be required even to figure out how to do this, and if we do it in an incremental way it may require a lot of remediation later. As an example, PDB-REDO has been made possible by standardization of the entire archive; it fails, however, with things that have complicated chemistry that is not present in the metadata. So thinking through what data might be important in the long run will be highly beneficial. Perhaps a piloted approach would be a good idea.
James Holton notes that synchrotron radiation (SR) facilities will provide resistance: they don't mind 'not deleting' data, but do not wish to be responsible for it.
Herbert Bernstein points out that NSF is interested in funding development of core technology for big data. NIH is interested in funding application. Also DOE may fund the storage of data long-term. NSLS-II is asking for funding from NIH for long-term storage of data, and are planning to create the necessary infrastructure to make some of this possible.
Patrick Mercier asks if journals could require saving the data that led to a publication. There was discussion that this could happen (as it did for structure factors).
James Holton asks what the motivation was early on to put all data in a centralized location. Frances Bernstein noted that there was community discussion that led to the archive. Required deposition took another 30 years.
John Westbrook asks about the data that is available now (e.g., from PSI and from TARDIS). Answer: 250 datasets are available at TARDIS, mostly from the developers of TARDIS. The highest download rate for a dataset was 35 in a month. (How many structure factor downloads from the PDB?)
James Holton notes that it would be great to be able to put datasets (in general, uploaded) on the web and make images available. John Westbrook pointed out that TARDIS can do this now (not sure), and that TARDIS usage has been low so far.
Frances noted that a DOI is not present in the PDB entry. John Westbrook noted that it is in the CIF file.
Marian Szebenyi notes that if a user had a tag associated with the data, the organization of data is such that it could be packaged up later (if stored).
John Westbrook notes that if an "archive" maintains images, then it is expected that it be done right.
Herbert Bernstein suggests that SR facilities do what they can, and that a system be put in place to keep track of the facts (what is actually there), with the PDB recording the link to the data. John Westbrook noted that the PDB will be happy to do this.
Herbert Bernstein plans at NSLS-II to save some data as is, but most as compressed images (with some small loss of information).
John Westbrook notes that pilot projects will help identify practicalities of transfer of data (and managing it).
Patrick Mercier notes that there is a market for databases keeping track of data (ICDB).
Paul Langan notes that for neutrons the issues are rather different. The concern is the benefit of spending a lot of time and resources on this: what is the cost-benefit ratio for this problem?
For synchrotrons, local DOIs seem useful.
Herbert Bernstein notes that he has presented a plan for data management at NSLS-II at http://cci.lbl.gov/dials/HJB_27Jul12.pdf.
Summary discussion: it will be highly useful to have demonstration projects that identify both the extent of use and the bottlenecks in storing raw images.
Appendix 3: IUCr Forum postings made as a result of these two meetings
1. August 13th 2012 by Dr Ian Clifton, University of Oxford, UK:-
One issue that came up informally during the DDD meeting at Bergen concerns the ever-vexed issue of image file formats, and whether the imgCIF/CBF standard (I'll just call it imgCIF from now on) is being used in the best way. The issue seems to be, not with what is NECESSARY for an image to be a valid imgCIF, but what is SUFFICIENT for it to be fully useful in the future.
What we would like to ensure, I suppose, is that all the information needed for processing the image in the future is in fact correctly encoded in the image header using the standard imgCIF "data names". Maybe we could come up with a compliance test program, perhaps derived from already-existing automated processing software, with all knowledge of specific detector types stripped out, so that it depends strictly and solely on the CIF header.
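The core of such a compliance test could be very simple. The sketch below is hypothetical: the list of "required" data names is illustrative only (a real checker would take its list from the imgCIF dictionary and a community-agreed sufficiency profile), and the string matching is deliberately crude, but it shows the shape of a tool that reports which data names a header still lacks.

```python
# Illustrative list of data names a "sufficient" header might need;
# a real tool would derive this from the imgCIF dictionary, not hard-code it.
REQUIRED_DATA_NAMES = [
    "_diffrn_radiation_wavelength.wavelength",
    "_diffrn_detector.detector",
    "_array_data.data",
]

def missing_data_names(header_text, required=REQUIRED_DATA_NAMES):
    """Return the required data names absent from a CIF-style header text.
    This is a crude substring check, not a full CIF parse."""
    return [name for name in required if name not in header_text]

def is_sufficient(header_text):
    """True when the header contains every required data name."""
    return not missing_data_names(header_text)
```

Reporting the missing names, rather than a bare pass/fail, would let beamline software authors see exactly which items their headers need to add.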