Post by Brian McMahon » Mon Sep 06, 2021 4:37 pm

IUCr Committee on Data

Meeting Online 25 August 2021 12:00 UTC (09:00 EST, 13:00 BST, 14:00 CET, 22:00 AEST)

Present: John R. Helliwell (Chair), Herbert Bernstein, Ian Bruno, Simon Coles, Kamil Dziubek, Andy Goetz, James Hester, Soorya Kabekkodu, Brian McMahon, Wladek Minor, Marian Szebenyi, John Westbrook.
Apologies: Amy Sarjeant, Loes Kroon-Batenburg.

1. Introduction and Welcome

John Helliwell (JRH) welcomed everyone to the meeting, which had been timetabled for a few days after the end of the XXV IUCr Congress to allow reflection on the data content of that meeting.

2. Report on CommDat Workshop on raw data for chemical crystallography

Simon Coles (SC) summarised the programme of the very successful two-day workshop 'When should small molecule crystallographers publish raw diffraction data?' organised by him and Amy Sarjeant that had taken place just before the Congress (on 11-12 August): ... orkshop-cx

The workshop had posed a number of questions, to which some tentative answers were forthcoming:

"What are we trying to support/achieve by archiving diffraction images in chemical crystallography?"
  • Several strong use cases from numerous subdomains of chemical crystallography throughout the workshop emphasised the desirability of retaining all raw data, so far as possible. A particularly compelling presentation by Jim Britten showed a gallery of complex and interesting features visualised with MAX3D software (see ... a56948.pdf).
  • NB: given that most chemical crystallography was done in the home lab, data retention might be achieved by a distributed infrastructure (i.e. labs being responsible for their own archiving).
  • This community did not, however, consider it necessary to share everything.
"How to go about [usefully] archiving raw data?"
  • Orderly curation was necessary. Metadata and format standards needed to be agreed, but they largely exist and future development (e.g. within the CIF project) can be done in an orderly way.
  • A hybrid approach is probably necessary, where tractable repositories or databases of metadata link to but are physically distinct from the (large) raw data files.
  • But there is still no clear answer on how to provide sustainable funding of systems to support this model.
"Advocacy - how to get the community on board?"
  • Widespread community uptake still faces challenges, both cultural and financial.
  • Publishers could - and should - play a clear role.
There is an impetus to foster within the community a culture of managing data that is "fit for purpose". This would allow, say, a data set collected for material characterization to be subsequently re-used if more interesting science could be mined from it. The question of "ownership" of the data (especially where structures are determined as a service for paying customers) may be a complicating factor in trying to honour FAIR data principles.

3. Report on CommDat Workshop on raw data for macromolecular crystallography

Herbert Bernstein (HJB) reported on the success of the one-day workshop on "MX raw image data formats, metadata and validation" held on 14 August: ... x-raw-data

The primary purpose of the workshop was to present the Gold Standard paper [Bernstein, Förster et al. (2020), IUCrJ, 7, 784-792] to the community, but the delay in hosting the IUCr Congress meant that the workshop was also able to demonstrate the standard's successful deployment in a number of facilities. A detailed report of the proceedings is available at the websites listed above, and HJB concluded that the standard was achieving its objective of becoming the natural deployment mechanism. HJB paid particular tribute to the contributions of the late Andreas Förster in developing and promoting the Gold Standard.

2/3. General discussion

There was vigorous discussion of the matters arising from both workshops.

JRH pointed out that the majority of chemical crystallography was performed in the home lab, and that, especially with service crystallography, ownership of data might be a constraining factor.

Wladek Minor (WM) took up a point that had been made in the chemcryst workshop about the potential of publishers to change patterns of behaviour by mandating data deposition as a condition of publication. IUCr journals were in a position to exercise leadership, as in the recent editorial sponsored by the Commission on Biological Macromolecules and CommDat that encouraged such deposition [Helliwell, Minor et al. (2019), Acta Cryst. D75, 455-457].

SC identified the need for a searchable registry (or database) of metadata to allow the finding of raw data sets, especially if they were not collected in a single central archive. James Hester (JH) agreed that it would be useful to have a two-layer access mechanism to image data: all the useful metadata could be collected and organised in a searchable database, while the images could reside in a more remote data store. He suggested that CommDat could usefully draw up a specification of how metadata held separately from the image data should be managed.

Brian McMahon (BM) noted that there might be possibilities in identifying different granularities of metadata. As a minimum, one would like to be able to find, say, all diffraction images stored in Zenodo. At a more detailed level, one might want to search across facilities for experiments at a particular wavelength or flux, or on a specific target. It was possible that different providers would emerge to service these different layers of requirement. [Explanatory note: Although current efforts focus on defining the metadata needed to interpret raw data, which will necessarily accompany (or be embedded in) the raw data, these metadata items may be in a range of formats and might not be exposed by a server such as Zenodo in a way that makes them searchable or easily findable. This will be a particular problem if there is a wide variety of repositories. The function of the ‘metadata databases’ would be to collect heterogeneous metadata across repositories and present a single searchable interface.]
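
[Illustrative sketch, not part of the minuted discussion: the two-layer idea described above (a searchable metadata layer that points to raw images held elsewhere) might look roughly like the following. Every name here (RawDataRecord, MetadataRegistry, the field names and the example.org locations) is a hypothetical placeholder, not an agreed CommDat specification or any repository's real API.]

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class RawDataRecord:
    """One metadata entry; the images themselves stay in the remote store."""
    repository: str                 # coarse layer, e.g. "Zenodo" or a facility archive
    data_location: str              # pointer to the raw images (placeholder URLs below)
    technique: str                  # e.g. "single-crystal" or "powder"
    wavelength_angstrom: Optional[float] = None  # finer layer; may be missing


class MetadataRegistry:
    """Searchable layer aggregating records harvested from many repositories."""

    def __init__(self, records: Iterable[RawDataRecord] = ()):
        self._records: List[RawDataRecord] = list(records)

    def add(self, record: RawDataRecord) -> None:
        self._records.append(record)

    def find(
        self,
        repository: Optional[str] = None,
        wavelength_range: Optional[Tuple[float, float]] = None,
    ) -> List[RawDataRecord]:
        """Return records matching every criterion that was supplied."""
        hits = []
        for rec in self._records:
            if repository is not None and rec.repository != repository:
                continue
            if wavelength_range is not None:
                low, high = wavelength_range
                # Records with no recorded wavelength cannot satisfy a fine search.
                if rec.wavelength_angstrom is None:
                    continue
                if not (low <= rec.wavelength_angstrom <= high):
                    continue
            hits.append(rec)
        return hits


# Hypothetical example entries; locations are placeholders, not real deposits.
registry = MetadataRegistry([
    RawDataRecord("Zenodo", "https://example.org/raw/1", "single-crystal", 0.7107),
    RawDataRecord("Zenodo", "https://example.org/raw/2", "powder"),
    RawDataRecord("Facility-A", "https://example.org/raw/3", "single-crystal", 1.5418),
])

coarse = registry.find(repository="Zenodo")        # "everything in Zenodo"
fine = registry.find(wavelength_range=(0.7, 0.8))  # search across repositories
```

[In this sketch the coarse query needs only the repository name, while the finer query depends on each repository having exposed the wavelength; the second Zenodo entry shows how a record with incomplete metadata remains findable at the coarse level but drops out of the detailed search.]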

Soorya Kabekkodu (SK) identified problems with lack of standards in describing phase purity and variable step size that prevented manufacturers of new instruments from writing this information straight to a CIF. JRH also pointed out that the move to recording 2-D rather than 1-D data sets greatly increased the size of powder data sets. SK emphasised that the metadata supplied by facility beamlines were now often reasonably complete for interpreting the raw data collected, but this was not always the case for home labs, where more rigorous standards needed to be defined and implemented. JRH identified a role for the Powder Diffraction Commission to explore this.

HJB suggested that effective search would benefit from collaboration with large-scale commercial partners such as Google, Amazon and Intel. However, Andy Goetz (AG) argued that modern facilities were capable of storing and managing research data sets without relying on commercial infrastructure in spite of the significant growth in volume in recent years. He made the point that the facilities would always make a best effort to store as much as possible, but ultimately some data might need to be stored in a reduced form.

Kamil Dziubek (KD) noted that already there was a diversity of locations for archived data - home labs and universities, facilities, Zenodo, and the prospect of storage locations in China and other countries - that meant there was already a de facto federation outside of the Google/Amazon commercial infrastructure model.

John Westbrook (JW) emphasised the need to persuade funding agencies to cover costs associated with data deposition, but the agencies needed to have a good handle on what data needed to be preserved and what the value of that data might be. Data management plans were an important tool in trying to make this explicit. CommDat had an essential role to play in articulating best practice within its community and demonstrating a coherent and committed culture of data management. But the multiplicity of journals in the chemical crystallography areas of interest made it harder to achieve consistent practice across that broader community. JRH saw the data management plan as a useful tool whereby service crystallographers could help the PI to explain what will happen to the data - and this could find favour in grant reviews.

Ian Bruno (IB) emphasised that the chemical publishing sector was very diverse, making it difficult to implement agreed standards. However, the handling of derived crystallographic data (structural CIFs) had succeeded because of the leadership of the IUCr, and because they had made it sufficiently easy for people to comply.

SC was tasked with establishing an action plan and recruiting contributors to implement recommendations arising from the chemical crystallography workshop, and IB volunteered to help.

4. Other comments on IUCr25 with respect to data matters

JRH reported that Martin Lutz of the Commission on Crystallographic Computing had offered the support of the Commission in helping to establish and implement the standards for data deposition.

JRH gave a reminder that the Raw Data Letters section of IUCrData would offer an opportunity to demonstrate the usefulness of, and interest in, individual raw data sets (and give some indication of what proportion of data sets routinely collected have the potential to be of greater value). IB pointed to chemical standards efforts (e.g. in spectra) that might usefully be coupled to metadata requirements for raw data.

The meeting concluded at 13:20 UTC.

