Notes and actions from the meeting held on 27th August 2011 in Madrid
Steve Androulakis (TARDIS representative)
John R. Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the IUCr Journals Commission 1996-2005)
Loes Kroon-Batenburg (Data processing software)
Brian McMahon (IUCr CODATA Representative)
John Westbrook (wwPDB representative and COMCIFS)
Sol Gruner (Diffuse scattering specialist and SR Facility Director)
Heinz-Josef Weyer (SR and Neutron Facility user)
By invitation, Chairs and delegates of IUCr Commissions:
Isabella Ascone (CXAFS)
Kamil Dziubek (CHP)
Francesca Fabbiani (CHP)
Maria Teresa Fernandez-Diaz (CNS)
Ralf Grosse-Kunstleve (CCC and COMCIFS)
James Kaduk (CPD)
Katherine Kantardjieff (CCT)
Ute Kolb (CEC)
Patrick Mercier (CIMS and IUCr Database Users Committee)
Soichi Wakatsuki (CSynR)
No Commission representation from:
CAC, CBM, CCGCM, CCACH, CMTC, CSAS, CSC.
Alun Ashton (Diamond Light Source (DLS); Data Archive leader there)
Herbert Bernstein (Head of the imgCIF Dictionary Maintenance Group and member of COMCIFS)
Frances Bernstein (Observer on data deposition policies)
Gerard Bricogne (Active software and methods developer)
Bernhard Rupp (Macromolecular crystallographer)
The Chairman welcomed everyone and read out the Terms of Reference defined by the IUCr Executive Committee:
Terms of Reference
It is becoming increasingly important to deposit the raw data from scattering experiments; a lot of valuable information gets lost when only structure factors are deposited. A number of research centres, e.g. synchrotron and neutron facilities, are fully aware of the need and have established detector working groups addressing this issue.
The IUCr is the natural organization to lead the development of standards for the representation of data and associated metadata that can lead to the routine deposition of raw data. A Working Group on these matters has thereby been launched by the IUCr Executive Committee, to which the Working Group will report, to be Chaired by Professor John R. Helliwell. Its provisional title is 'Diffraction Data Deposition Working Group of the IUCr'.
The suggested agenda was agreed as per the section headings below.
An IUCr Forum for communication was outlined by Brian McMahon:
The IUCr website would host a family of discussion forums devoted to the matter in hand, with different levels of privilege allowing structured discussion within and between the WG Members, Commission Representatives and Consultants, and at the appropriate time, the wider community.
[Note added after meeting: The forum “Consultation on diffraction data deposition” has now been set up on the IUCr website at the URL viewforum.php?f=16, which includes much constructive discussion flowing on from the meeting. Initially, access is limited to individuals formally invited or expressing a close interest in the consultation process. Readers who wish to join the forum should register themselves at http://forums.iucr.org and send a request for access privileges by email to firstname.lastname@example.org.]
Each attendee was invited to address the meeting. The Commission Representatives were invited, where possible, to state their view of the short and longer term needs for the archiving of relevant data and metadata with publications and/or formally deposited datasets, i.e. cases where there was not (yet) a formal publication.
A range of current practice emerged:
CPD stated that the Powder Diffraction Data File already kept full datasets, although it was aware of missing datasets.
CNS, CHP emphasised the need, and perceived challenge, of creating appropriately complete metadata as a first key step.
CEC emphasised that the diversity of their data would likely slow progress both in defining metadata and in achieving data archiving where there were very large amounts of raw data.
CXAFS. Since the Commission meeting in July 2009, members have been invited to define relevant information on data deposition (format, file size and total number of files). Technical aspects have been examined and different solutions proposed. The International X-ray Absorption Society (IXAS) has been involved and H. Oyanagi (IXAS's Chair) created a working group on this item. In order to progress, a roadmap has been determined in agreement with IXAS. CXAFS will discuss suggestions from IXAS that will be formulated during the Q2XAFS2011 workshop (December 2011) and during the XAFS15 Conference (July 2012). The XAFS Commission will be able to give recommendations in September 2012, taking into account all comments received. At the least, deposition of the information about the data was needed as soon as possible, and an objective is to see a real database of raw data established. IUCr journals should plan to implement the policy of deposition of XAFS data from an appropriately early date (e.g. September 2012).
CCC stated that it is extremely important to have access to raw data.
CIMS wished to see metadata and all raw data archived but said this was not an easy matter and asked "What role in this might IUCr Journals itself take?"
CSynR reported that it has immediately appointed a Member of its Commission to represent it to the DDD WG (Colin Nave, DLS).
The Consultants reported as follows:
Alun Ashton: DLS is imminently launching a tape-based archive of all DLS raw data measured so far, comprising over 200 terabytes (TB) of raw and processed data; accompanying this is a financial commitment from Diamond, who receive the majority of their funding from the UK Science and Technology Facilities Council (STFC), to extend this continuously as required. [Subsequent discussion identified the fact that such complete retention of all experimental data included much relatively low-quality material generated in pre-screening runs. Although such data are valuable in other contexts, they did not contribute directly to the findings in a final publication, so their value elsewhere is not in itself reason enough to merit indefinite archival.] Access to datasets would be password protected and restricted to the PIs and their research groups. However, Diamond and STFC are committed to public access to all raw data after a grace period, likely to be 5 years. The STFC neutron source facility, ISIS, was doing likewise. DLS was also investing much effort in settling on standardised formats (such as imgCIF and NeXus).
Frances Bernstein: Access to present data via an archive 100 years from now could require a policy to deal with ownership of data, i.e. akin to the clarity involved in making a will. Datasets without such a clear assignment could make public access problematic; this could include, for example, datasets that achieve a significance for publication after a person's death.
Herbert Bernstein and Bernhard Rupp: the reasons for archiving raw data were compelling as the right and proper record of the science undertaken. Instances of scientific fabrication would disappear if raw data deposition policies were adopted. The gathering of appropriate metadata is hard, though, and its difficulty should not be underestimated; it is essential to get clear agreement of the relevant communities (e.g. within each IUCr Commission) on formats and ontologies (the controlled vocabularies and structured representation needed to formalise concepts within standard digital information exchange processes). It is particularly important to separate the conceptual content of ontologies from underlying formats, in order to promote interoperability between different disciplines.
Gerard Bricogne: (1) It is an invaluable asset if people keep their raw diffraction images, as it provides developers of methods with the foodstuff of progress; already we can see real progress from the deposition of processed structure factors and model coordinates in the biological field for model refinement programs. (2) We lose a lot of raw data if we do not archive raw images; this matters not only for diffuse scattering but also because increasingly difficult biological crystal samples challenge current methods, and the prospect of distinct improvements in software makes it possible to revisit data processing. In summary, publications must be accompanied by raw diffraction images.
The Working Group Members were then invited to make their inputs:
Steve Androulakis: The Australian TARDIS initiative seeks to expand the raw data archive of diffraction data images as an agreed priority within the Australian Research Council, who fund their work and archive. Particular challenges are ontologies, rather than image formats. Strategies for data access need to be more than a tape-based archive in their view. In the short to medium term, as the volumes of data generated grow rapidly, an important need might be the implementation of orderly policies for data reduction.
Loes Kroon-Batenburg: Raw data is needed by software developers. This includes a good definition of the metadata. Home X-ray source diffraction data are an important fraction of the data that is produced. Perhaps, if the SR and neutron facilities are taking responsibility for archiving their own raw data, the IUCr could take responsibility for the home source raw data?
John Westbrook: The PDB Validation Task Force and Advisers recommend unscaled and unmerged diffraction data be deposited along with models; but not raw diffraction data images at this stage. In the longer term the archived metadata should be extended via the mmCIF mechanism including semantic integration. The wwPDB is providing anonymous access to its archive using synchronized servers hosted by the wwPDB partner sites. The model that is evolving for serving raw diffraction data appears to be a distributed network of servers hosted at large data centres affiliated with data generation facilities or at the academic or commercial data centres supporting locally collected data sets. It would be useful to develop a registration system (e.g. a DOI-like system) for this distributed network of diffraction data so that references to this data could be collected with other metadata when structural models are deposited with the wwPDB.
John Helliwell: the role of the IUCr Working Group is to bring to a focus information (for example in this meeting from the various disciplines of crystallography as represented by its Commissions), and to identify steps forward. A suitable aim (albeit the term was not used at the meeting) is to create a 'Roadmap' for the short, medium and longer term with appropriate milestones to measure progress towards archiving and making available all relevant scientific data associated with a literature publication (or a completed structure deposition in a validated database such as the PDB). The Roadmap would be discussed on an annual basis via a workshop or workshops at the IUCr Congress and/or the regional associate conferences (ACA, ASCA, ECM). As an immediate action each Commission is requested to select one or more exemplar publications and to arrange for the full metadata and raw data to be provided with it/them. These should be provided to Brian McMahon at IUCr Chester (email@example.com) for appropriately secure collation.
General discussion ensued with the points emerging below:
Rough estimates of volume, and therefore costs, were initially being made on the basis of the 200TB size of the Diamond archive, until it was pointed out that only a small fraction of that would contain datasets which had given rise to the data against which actual PDB entries had been refined. Part of the scope of the working group could be to identify criteria for “relevance” which would affect decisions on what proportion of generated data to retain in the longer term. An intermediate goal was identified, namely letting synchrotrons and archives such as TARDIS hold those datasets, rather than immediately putting that burden on a central archive such as the PDB; but ensuring that unique standard identifiers (e.g. DOIs) be assigned to all datasets collected in the future to provide a means of identifying the subset of "relevant" datasets, both in the publication describing a structure and in its PDB entry. It was argued that synchrotron archives should make those datasets publicly available at the time when the PDB entry is released (instead of waiting for the 5-year period currently contemplated by Diamond for all datasets).
The idea of a pilot project thus emerged, in which journals would encourage raw image deposition or disclosure on a voluntary basis. With a certain amount of such material then available in the public domain, the PDB could begin to investigate how to use unique DOIs to link raw image files to and from the corresponding PDB entries.
This would have the advantage of testing the various bits of technology involved, and of starting to give an idea of the uptake of such information by users and developers, and hence of building the case for moving towards compulsory deposition in the future.
Similar pilots could also be undertaken outside the macromolecular crystallography world, for example by IUCr journals, to assess the requirements of other disciplines; for some of these, the volume of raw experimental data might be sufficiently low to allow the journals to act as depositories.
How can the funding agencies be influenced to provide funds for data archiving?
Might it be appropriate to engage partnerships within the commercial world? For example, large operations such as Google could make available large storage volumes and robust backup procedures at relatively modest cost.
What does it really cost to host the output of the world's crystallographic data? Indeed, what is the output now and projected for the world's crystallographic data? [Herbert Bernstein subsequently drew attention to the NSF RDLM workshop website at http://rcs.columbia.edu/rdlm that suggests current realistic estimates for long-term data storage with adequate redundancy and media migration strategies are of order $1000-3000 per TB per year.]
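[Illustrative note: as a rough worked example of the figures quoted above, the sketch below applies the $1000-3000 per TB per year range to the 200 TB size mentioned for the Diamond archive. The function name and the assumption of a flat, linear cost per TB are hypothetical and for illustration only; real costs would depend on data growth, the level of redundancy and changes in storage pricing.]

```python
def annual_storage_cost(volume_tb, cost_per_tb_low=1000, cost_per_tb_high=3000):
    """Return (low, high) yearly cost in USD for storing volume_tb terabytes.

    Assumes a flat cost per TB per year (an illustrative simplification).
    """
    return volume_tb * cost_per_tb_low, volume_tb * cost_per_tb_high


# 200 TB is the archive size quoted for Diamond in these notes.
low, high = annual_storage_cost(200)
print(f"200 TB: ${low:,} to ${high:,} per year")
# prints "200 TB: $200,000 to $600,000 per year"
```

On these assumptions an archive of that size would cost of order $0.2-0.6 million per year, which is why the fraction of data judged "relevant" has such a strong bearing on the estimates.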
Could an immediate first step be to add to IUCr Journals' copyright transfer statement that the corresponding author guarantees to host the raw data for an extended period, no less than 5 years, and to make it available to any reader of a publication who desires access to that raw data?
What practical scenarios are needed to guarantee the veracity of archived data? E.g. a ‘Datacheck’ method analogous to checkCIF?
Pilot experiments for each of the Commissions in archiving raw data and metadata should also include some data check/validation step afterwards. It should be possible to reproduce the interpretations given by the authors. This would make it a true case study.
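[Illustrative note: one very basic ingredient of such a check could be confirming that archived files are bit-for-bit unchanged since deposition. The sketch below does only that, using SHA-256 checksums from the Python standard library; it is not a 'Datacheck' in the checkCIF sense and says nothing about scientific validity. The function names and the idea of a per-directory manifest are assumptions made for illustration, not something agreed at the meeting.]

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(directory, manifest):
    """List files that were added, removed or altered since the manifest was made."""
    current = build_manifest(directory)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A manifest recorded at deposition time could later be passed to changed_files; an empty list would indicate that the stored files are intact, after which reproduction of the authors' interpretation could be attempted.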
What fraction of a SR or neutron facility's archived data is likely to lead to publication and/or formal dataset deposition?
What level of archived dataset redundancy is appropriate? Duplicate, triplicate, quadruplicate etc?
Another useful first step might be to encourage journals to require and actively to support retention of datasets (by researchers or their delegates) rather than insisting from the outset on universal deposition.
For proper interpretation later, well-characterised metadata is probably even more important for powder data than it is for single-crystal images. The powder CIF dictionary (pdCIF) was designed to capture as much metadata as possible, but is still incomplete for the requirements of long-term deposition and subsequent information retrieval.
While the problem of raw data archiving at the synchrotron and neutron sources was widely discussed during the meeting, there are still no clear plans for the files from home diffractometers, particularly given the need to upload massive amounts of data through the Internet, which can be cumbersome, especially where no broadband Internet connection is available. In this respect, providing technical assistance to developing countries may also be required. A way forward could be for those IUCr Journals hosting the highest fractions of home X-ray diffractometers' output, namely Acta Cryst B, C and E, to draw up a business plan in which they acted as the depository for diffraction data images. This approach was very actively promoted by the Commission on Inorganic and Mineral Structures at the Meeting and subsequently.
Notwithstanding the suggested timetable of activities by the Working Group involving annual workshops, most participants at the meeting voiced a sense of urgency that experimental data supporting a literature publication should be deposited and archived at publication time, and made publicly available to ensure that the results can be re-derived by other investigators. It was also emphasised that the broad availability of such data would give added stimulus to the development of improved methods and software that may in turn enable the subsequent derivation of better results from those same data.
Post meeting notes:
Following a detailed case put forward by the Commission on Biological Macromolecules to the IUCr President and Immediate Past President, Dr Tom Terwilliger of Los Alamos National Laboratory, USA, an accomplished macromolecular crystallographer, has been formally appointed to the DDD WG as a Member.
John Helliwell and Brian McMahon
16 September 2011