IUCr Committee on Data (CommDat)
Meeting 1 held in the IUCr Congress Centre Hyderabad 27th August 2017
Present: John R. Helliwell* (Chair), CommDat Members (*) or Consultants (+): Herbert Bernstein* (remotely), Simon Coles*, Kamil Dziubek*, James Hester*, Loes Kroon-Batenburg*, Brian McMahon*, Luc Van Meervelt*, Wladek Minor*, Amy Sarjeant*, Edgar Weckert+ (remotely), John Westbrook*. IUCr Chester: Gillian Holmes, Louise Jones. Guests: Stephen Burley, Marek Grabowski. Apologies: Peter Strickland, Soorya Kabekkodu*, Andrew Goetz*.
1. Introduction and Welcome
John Helliwell (JRH) introduced the new body as a Standing Committee reporting to the IUCr Executive Committee; as such it was envisaged to have a long-term existence, and could adapt its focus to changing needs.
2. Tour de Table of those present
As above. Consultants not present at the meeting were S. Androulakis, M. P. Blakeley, G. Bricogne, S. Grazulis, B. Matthews, D. Szebenyi, J. Trewhella. The membership and consultants encompass a broad range of geography and scientific speciality as well as facility representation.
3. Links with ICSU, CODATA, Research Data Alliance and ICSTI, and the need for an integrated approach for IUCr with respect to data
JRH and Brian McMahon (BM) had attended the joint ICSU CODATA and Research Data Alliance (RDA) meeting in Denver, September 2016, as part of International Data Week, at which they had chaired a session on crystallographic databases. This had led to an open access article in Data Science Journal (https://datascience.codata.org/articles ... -2017-038/ ). JRH had also attended the Barcelona RDA Plenary meeting earlier in 2017, but there would be no participation at the next RDA meeting in Montreal. ICSTI was also a player in the data arena, increasingly organising useful workshops, but this was further adding to demands on time and travel costs. JRH emphasized the benefit of alternate IUCr delegates, e.g. from within CommDat, thus "sharing the load" of attending this growing diversity of data-related meetings. CODATA had also organized a Workshop in Paris on Interoperability of Data in June 2017, with a focus on the role of the international union members within ICSU, which JRH had attended as IUCr’s Representative to CODATA.
John Westbrook (JW) saw a lot of overlap between these groups and was uncertain if all this was productive, since there was relatively little coordination between their activities. JRH asked people intending to visit RDA or other meetings to let him and BM know, in an effort to optimize IUCr participation. Simon Coles (SC) had attended several, and acknowledged the multiplicity of activities. He pointed out that the focus of the RDA was on frequent regular meetings (every 6 months) getting specific work done in ‘removing obstacles to open data’, and its short-term project approach was a useful working model. Wladek Minor (WM) commented that "data science" was very wide, and attracted many computer science experts; but often subject expertise was lacking or had low priority.
Luc Van Meervelt (LVM) reported that the current ICSTI Representative (Mike Glazer) was unenthusiastic about ICSTI meetings, and that the IUCr Executive Committee was considering withdrawing from ICSTI membership. JRH spoke up for ICSTI, at least during his term as IUCr Representative to ICSTI, but had noted from the outset of his participation a need to promote a merger of ICSTI with CODATA. Crystallographers had always seen a benefit in linking literature with the underpinning data, rather than in the formal separation of the two which ICSTI and CODATA seemed to emphasize. The growing number of joint meetings involving ICSTI with CODATA or RDA was testimony to these organisations’ belief that they had overlapping interests, and argued for IUCr participation where these considerations and events were of direct interest to the IUCr.
4. Consideration of the IUCr Diffraction Data Deposition Working Group Final Report (viewtopic.php?f=21&t=396)
At its meeting with JRH as DDDWG Chair on 23 August, the IUCr Executive Committee had formally accepted the Working Group's Final Report, and this had been posted on the public forum a few days before this CommDat meeting (viewtopic.php?f=21&t=396). The Report and subsequent progress would be discussed at the next CommDat Meeting. JRH described a manuscript recently submitted to IUCrJ, arising from his invited Hyderabad Congress Keynote Lecture, that assembled a number of case studies emphasising the scientific value of data preservation and sharing as well as reuse, including raw as well as processed and derived data. The DDDWG Final Report encouraged the compilation of further such case studies of raw diffraction data reuse. All the published deliverables from the DDDWG's six years of activity including talks (many videos and all presentations) had been collected by BM at the IUCr DDDWG website (http://www.iucr.org/resources/data/dddwg ).
5. Update on archival of raw diffraction data worldwide
Loes Kroon-Batenburg (LKB) mentioned the Microsymposium on the Scientific Importance of Raw Data which she and BM had chaired as part of the Congress Special Activities sessions. Her own contribution in this session had been to emphasize the need for a set of core metadata for understanding and reusing diffraction images that should be adopted as standard by everyone active in the field. WM emphasized the immediate difficulty of persuading vendors not to keep changing formats, and pointed out that handling existing formats is tractable if they are sufficiently well defined, and there is clear information about which format has been used in every data set. LKB admitted that much could be done with existing practices, but this often involved significant effort (her own DDDWG presentations had demonstrated how much labour can be involved to recover and/or validate basic information about the diffraction image beam centre and instrument axes). Pressure was still needed on suppliers to ensure that a standard set of experimental metadata were collected, and end-users needed to be persuaded that additional effort to do this was ultimately for their own benefit.
JRH applauded the ESRF policy that could assign DOIs for every retained dataset. Edgar Weckert (EW) said that DESY worked closely with ESRF, noting their emerging practices and inclining to use them as a model of good practice. DESY had implemented HDF5 as a standard data container format, and believed that crystallography was in relatively good shape with regard to its data collection practices; this was not true of other application areas served by the synchrotrons. He agreed with WM that beamlines should take on the responsibility for standardising metadata.
There was some discussion about how data centres decided which data sets should be given a DOI, and what were the appropriate retention strategies. Germany requires a 10-year retention period, though the issues of long-term ownership of data sets have not been fully resolved. JRH again referenced the ESRF policy, arising out of consultation with funding agencies, to make data sets openly available after three years without an associated publication, unless users appealed for a longer moratorium. EW said that facilities did not want to build data graveyards, and were likely to welcome guidance on better data management strategies, which, for example, a CommDat-approved set of mandatory minimum metadata and guidelines for ‘data triage’ (see viewtopic.php?f=21&t=57 ) could nicely provide. WM stressed that it was very difficult to build systems with the necessary sophistication to keep track of everything that is created by large-scale data generators, and that a useful touchstone of the success of an effective strategy would be for requests for a specific raw data set actually to succeed in retrieving what was wanted.
6. Metadata for raw diffraction data: update on a checkCIF for raw data
LKB reiterated that such a validation service would require an initial set of metadata requirements, and that the work described under item (5) would provide a useful starting point for such a list. JRH invited her to download some of the (relatively few) diffraction image data sets currently on Zenodo to evaluate how reusable they were.
Herbert Bernstein (HB) introduced a discussion on the novel problems of characterization of the experimental setup being thrown up by the latest high-data-rate macromolecular crystallography experiments at very bright beamlines. Some of the new technical challenges were being explored by the HDRMX group (http://hdrmx.medsbio.org) and challenged the basic assumption that you even know what the primary data are (for example an assumed diffraction image beam centre could be modified by subsequent refinement of the aggregated data). JRH considered these effects to be due to the early development of cutting-edge practices, and thought that they would not require fundamental re-thinking of the basic metadata model.
Kamil Dziubek (KD) described how a working group in high-pressure crystallography was developing candidate descriptors for its raw data, and mentioned that they were worried about assessing the correctness of derived data if basic input assumptions (e.g. temperature) were actually incorrect.
SC and Amy Sarjeant (AS) pointed out that the small-molecule community was less likely to want to archive everything, and that there was more likely to be support from this community for a "checkCIF for raw data" that was only invoked where analysis was needed of features other than clean Bragg spots. [However, a subsequent presentation by Ton Spek in the Microsymposium MS-099 "Crystallographic data and structure validation" made a strong case for being able to access diffraction images to validate a wide range of structure interpretation issues using structure-factor amplitudes.]
James Hester (JH), invited by the Chair to comment on the checkCIF for raw data idea, said that COMCIFS' role would be to help define computational strategies for identifying and retrieving specific required metadata items. With existing infrastructure, simply providing a DOI did not guarantee that you could extract relevant metadata (or perhaps even full data sets).
JW suggested that the role for COMCIFS in three areas was reasonably clear. (1) The imgCIF dictionaries provided relatively detailed descriptors suitable for current practice; HB's experiences showed that there might need to be some extensions to these dictionaries to accommodate the new high data-rate techniques. (2) Once you have defined metadata, how do you extract it? Ongoing work, e.g. with the NeXus International Advisory Committee, provided mechanisms for identifying specific metadata in different file formats. (3) If a specific set of metadata is encoded in some standard form, can you provide a community-agreed validation summary? JW also raised the issue of how the PDB (as a structures database) could reliably link to specific diffraction images associated with various experiments or stages in the structure determination. This was being explored with the help of PDB advisory groups, who were charged with requirements such as improving the characterization of multiple crystals as might be used in an FEL experiment. Marek Grabowski (MG) pointed out that even with the existing mmCIF provision for identifying and relating multiple diffraction experiments, people did not always get things right.
JRH and LKB agreed to interact to draw up an initial specification for checkCIF for Raw Data based on the imgCIF tags identified in activities arising from the Rovinj and subsequent workshops. Both noted the excellent CIF Dictionary Writing Workshop organized by COMCIFS on day 0 of the Hyderabad Congress which they had both attended. HB offered to test against Dectris detectors’ diffraction data images.
7. Report from COMCIFS Chair
JH reminded the Committee that COMCIFS was charged with maintaining and extending CIF dictionaries, and was happy to work with other groups to incorporate such dictionary content into different working file formats. He anticipated that COMCIFS would work well with CommDat: COMCIFS required clear policies and data definitions, and would pick up any data characterization shortcomings identified by CommDat. Any input from across the community was of course warmly welcomed.
8. Development of IUCrData
LVM described how IUCrData arose as a separate data publishing platform following the decision by Thomson Reuters to drop Acta Cryst. E from the Science Citation Index. In its initial implementation it publishes the short structure reports that were predominant in Acta Cryst. E; other types of data-led reports were also envisaged, most immediately crystallization papers [formerly published in Acta Cryst. F and, as stated by Louise Jones (LJ), planned to be launched in 2018], and structures based on powder diffraction. JRH urged that crystallization papers should support their assertions about crystal properties by making preliminary diffraction data available, at least to the referees, as had been the case with Acta Cryst. F. LJ also stated that images of gels, wells etc. might be supplied as data content for crystallization papers.
JRH noted the potential for data publishing to operate as a new revenue stream for the IUCr, and suggested the development of services that could provide expert support to chemists or other researchers without full crystallographic training in their processing of raw diffraction image data.
9. Data integration and data standards across the Commissions
JRH reviewed the outputs from various Commissions during the DDDWG’s life, identifying publications relating to XAFS and small-angle scattering (sasCIF), as well as the work of the Commission on High Pressure mentioned by KD, which might be ready towards the end of 2017. JH said that the XAFS initiative had dealt with reduced, not raw, data, and that more work was probably needed in that area. While encouragement from the IUCr might help, the real need was for all journals to make more stringent demands. JRH asked the Chester Editorial Office through LJ to see whether IUCr biological Journals’ Notes for Authors might be revised to provide the necessary firm encouragement for authors to submit articles with their underpinning data for full and detailed referees’ scrutiny (as Acta Cryst. C does).
JH commended the work of the Commission on Magnetic Structures on their recently accepted magCIF dictionary, although this was not directly relevant to raw data.
10. Links and interactions with databases that are not members of CommDat, and the sustainability of smaller databases (Metals Database, Pauling File etc.)
BM suggested that CommDat should maintain at least a watching brief on smaller databases; the websites of Toth Information Systems (maintainer of the Metals Database) and the Biological Macromolecule Crystallization Database (BMCD) were no longer active. LJ said that she had some information about a French contact for BMCD that she would pass on.
Stephen Burley (SB) raised concerns about some US funding agencies' apparent shift away from a commitment to long-term preservation of collections of data, and asked for appropriate expressions of concern or other pressure to be brought to bear by appropriate Scientific Unions or higher-level policy organisations. JRH was tasked with liaising with SB and the IUCr Executive Secretary to explore what might be usefully pursued, e.g. a letter of support from the IUCr.
11. Other possible projects
AS and SC wanted to reiterate concerns about long-term archiving of raw data sets in the small-molecule community, and reminded CommDat that it needed to accommodate both the small-molecule and macromolecular communities. SC noted that the question of when it was appropriate for a small-molecule crystallographer to archive data was still unresolved. These CommDat meeting minutes highlight the DDDWG triage discussion (viewtopic.php?f=21&t=57&p=161) and members are encouraged to share that weblink with their respective IUCr Commissions as a catalyst for their detailed discussion and priorities. WM explained that the Commission on Biological Macromolecules was already preparing a white paper "Proposed mechanisms for making diffraction experiments available" to guide the macromolecular community on this issue.
12. Summary of actions
• Set up website, forums and mailing lists for CommDat discussions (BM)
• Draw up an initial specification for checkCIF for Raw Data following the discussions in Rovinj and subsequent DDDWG workshops (JRH/LKB)
• Test the first such specification using standard CIF validation tools and the imgCIF dictionary against Dectris files (HB)
• Review Notes for Authors to see if more pressure can be brought to bear on authors of biological papers to report relevant experimental metadata (LJ/Chester Editorial Office)
• Discover status of Metals and BMCD Databases (BM/LJ)
• Explore ways to lobby USA (and other) funding agencies to emphasize the scientific need for raw data retention (JRH+IUCr Exec Sec/SB)
• Members to promote discussions within their communities and commissions on which raw data sets should be archived (as a start to discussion see viewtopic.php?f=21&t=57 ).
CommDat meeting started at 1:30 pm and ended at 2:50 pm.
Minutes and actions from 1st CommDat Meeting at IUCr2017
Brian McMahon