Meeting Montreal 11 August 2014 7:30p.m.
Present: John R. Helliwell (Chair), Brian McMahon, Marian Szebenyi, Frances C. Bernstein, Herbert J. Bernstein, Loes Kroon-Batenburg, Patrick Mercier, Matt Zimmerman, Kamil F. Dziubek, Marcin Kowiel, Saulius Grazulis, George Phillips, Jim Kaduk, Andrew Allen
A. Summary of DDDWG recommendations
John Helliwell (JRH) reviewed the recommendations of the DDDWG in the triennial report presented to the IUCr Executive Committee. The main issues that were identified as relevant for the IUCr were:
- With advancing technology and the consequent surge in the volume of experimental data generated, it might be necessary to consider retaining only subsets of data for deposition, or retaining data only for limited time periods.
- There is a need to address the question of rights of access to publicly funded but unpublished research data after some appropriate time lapse.
- In addition to the volume of data, there is a possible need for active triage of data at source owing to the rate of generation from the latest instruments or experiments (e.g. X-ray lasers, Eiger detectors).
- The IUCr Executive should charge the IUCr Commissions with concluding their work projects to define their experimental raw data metadata.
- Journal of Applied Crystallography should consider introducing a ‘difficult raw data’ category of article.
- A centralised crystallographic repository of raw dataset metadata should be scoped and piloted.
- With such a repository in place, we should revisit the proposal that authors shall provide a permanent and prominent link from an article to the associated raw datasets.
B. Articles in press on raw data archiving and use in Acta Crystallographica Section D
A special set of four articles and an introduction by Tom Terwilliger were in proof for Acta Cryst. D, and publication in the October issue was anticipated.
C. ‘Difficult data articles’ category in JAC
Following a suggestion of Loes Kroon-Batenburg (LKB), one of the Main Editors of JAC (Anke Pyzalla) had been approached and was receptive to the idea of a new category of research articles in which authors would describe the nature of problematic or otherwise challenging data sets. These articles would invite the community to work on such data sets where there were potentially interesting results, e.g. relating to multiple lattices, diffuse scattering or incommensurate structures. Authors would have to explain in detail what analysis had already been done; the article would be peer-reviewed; and there would be a link to a repository where the data would remain available for a reasonable timescale under a DOI (or other robust persistent identifier).
Andrew Allen (AA), another Main Editor of JAC, said that the idea was interesting, but required discussion and consultation among the full set of Main Editors, two of whom were relatively recent appointees. There might be concerns that such articles could be seen as a dumping ground for low-quality work, that they would be poorly (or not) cited, or that they would adversely affect the journal’s reputation or its standing under bibliometric performance benchmarks. In general discussion, it was emphasised that the actual number of such articles might be small, and that their acceptance criteria should be demanding, so that only articles of genuine scientific interest would be published. They might also be very well cited, both as exemplars of difficult problems and, it was hoped, of successful community research action (‘crowd research’).
There was a general discussion of some of the technical issues that could be problematic if such a category were instituted. LKB anticipated that the most suitable contributions in such a category would be studies where the total accumulated datasets were of the order of GBytes in size, because of the time/bandwidth constraints in network transfer. [This did not rule out studies with large data generation rates, e.g. in XFEL experiments, where subsetting of the collected data was already routinely performed.] Herbert Bernstein (HJB) suggested that lossy compression (e.g. using techniques described by James Holton at the ECM Bergen Workshop) could be used for data transfer – this could make it easier for fellow scientists to sample a difficult data set before deciding whether to invest time and effort in subsequent solution attempts.
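Holton's actual compression scheme is not reproduced here; as an illustration only, the sketch below shows one very simple lossy reduction – 2×2 pixel binning of a detector frame – which cuts the stored volume by a factor of four while preserving coarse peak positions, the kind of trade-off that would let a reader sample a data set before committing to a full download.

```python
def bin2x2(frame):
    """Sum each 2x2 block of a 2D list of pixel counts.

    Lossy: fine spatial detail within each block is discarded,
    but total counts and coarse peak positions are preserved.
    """
    rows, cols = len(frame), len(frame[0])
    return [
        [frame[r][c] + frame[r][c + 1] + frame[r + 1][c] + frame[r + 1][c + 1]
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


# Toy 4x4 "frame" of pixel counts; binning yields a 2x2 frame.
frame = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
binned = bin2x2(frame)  # → [[14, 22], [46, 54]]
```

Total counts are conserved (each binned pixel is the sum of its block), which is why such a reduction can still support a quick first-pass inspection.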
AA commented that the proposed new category fell outside the traditional domain of interest of JAC. If it were introduced, it would be necessary to develop suitable guidelines for both Editors and authors. George Phillips suggested that the process would help to educate the community not only in technical matters but in terms of the policy and ethics associated with this new mode of working; he drew parallels with the ethics of structure-factor deposition in the macromolecular community.
AA confirmed the approach described by LKB: the data should remain with the original authors, and the article and its network of DOI links to associated data sets would form an extension of the original research effort. Doing otherwise could complicate the assignment and tracking of intellectual property rights if other groups made use of the original data. JRH indicated that the emergence of data-centric licensing protocols such as CC0 was intended to help address these concerns, but agreed that further work might be needed to define the possible IP issues more clearly.
* Action: the DDDWG to send a formal request to the JAC Editors to consider this proposal. AA would act as point of contact. [LKB]
D. Review of DDDWG interactions with ResearchGate etc.
JRH reported on the current state of efforts at Manchester University to provide satisfactory archiving of some of his data sets. The data were now safely retained in the University data store, but DOIs had not yet been assigned. [Post-meeting note by JRH: The Manchester University ‘Data Librarian’ has confirmed to JRH that weblink identifiers have been assigned for his datasets and a licence from DataCite sought, with a view to commencing their DOI attributions in early 2015.] There seemed to be a general sense that Universities (and some research facilities) were happy to provide ‘safe retention’ policies, but were still reluctant to take on a fully fledged archiving role. Tom Terwilliger had approached ResearchGate, a growing social network provider for academic researchers, which had expressed interest in retaining raw data.
LKB gave a comprehensive review of possible repositories for experimental data sets. Among possible solutions that the DDDWG had already given some thought to were arXiv.org (which currently restricts supplementary data to a few Mbytes); Dryad; Figshare; ResearchGate; the PaNData project covering large facilities; TARDIS and the Store.Synchrotron initiative. Additional possibilities included Zenodo and the Dataverse Network – the latter is implemented in the Netherlands as EASY.
Saulius Grazulis (SG) outlined a possible approach to robust distributed data repositories built on a ‘least-authority filesystem’. This allowed the configuration of multiple repositories sharing encrypted data, set up in such a way that any 3 from a pool of up to 255 nodes could retrieve any data set. Individual nodes could be of the order of 30 TB (probably affordable for a University), allowing an aggregate storage of several petabytes. Because the filesystem is encrypted, authors would need to provide authorised access to their holdings (but the security keys could be held in escrow during any embargo period). The Crystallography Open Database (COD) had plans to start a pilot project along these lines and would be interested in working with the DDDWG to explore this approach. Nature Publishing Group (publishers of Nature Scientific Data) already listed COD as an approved repository for supplementary data. It was noted that data would only be recoverable from a least-authority filesystem if the encryption keys were not lost – their maintenance would need to be an important aspect of the maintenance metadata required of such a repository.
* Action: SG to define the proposed collaboration with the DDDWG in detail.
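The ‘any 3 of up to 255 nodes’ property described above is characteristic of threshold schemes. As a minimal sketch – not the actual COD/DDDWG design, and with all names and parameters chosen for illustration – Shamir's secret sharing over a prime field shows how a key can be split into n shares such that any k of them reconstruct it:

```python
import random

PRIME = 2**127 - 1  # Mersenne prime; all arithmetic is modulo PRIME


def split_secret(secret, k, n):
    """Split `secret` into n shares; any k of them reconstruct it.

    A random polynomial of degree k-1 with constant term `secret`
    is evaluated at x = 1..n; each (x, y) pair is one share.
    """
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 recovers the secret from k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Split a key among 255 nodes; any 3 shares suffice to recover it.
shares = split_secret(1234567, k=3, n=255)
assert recover_secret(shares[:3]) == 1234567
```

In a real least-authority store the shares would protect an encryption key (or erasure-coded fragments of the data itself), so fewer than k colluding nodes learn nothing – which is also why loss of too many shares, like loss of the keys, makes the data unrecoverable.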
HJB reminded the meeting that Google still offered cost-effective large-scale storage, in the region of $120/TB/year. SG remarked that storage space rented from commercial suppliers such as Google could indeed be utilised within a least-authority filesystem solution. George Phillips favoured the idea of an early triage that identified high-value data sets requiring particular effort to archive and maintain robustly. In his view, a trusted resource such as the Protein Data Bank was still the preferred option for such data (assuming there was sufficient community support to lead to any necessary increase in funding); but the possibility of also retaining the lower-value residue, in a lower-cost distributed repository, was appealing.
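A back-of-envelope sketch using the figures quoted in the discussion (the aggregate is raw capacity, before any replication overhead the threshold scheme would add):

```python
# Figures from the discussion: commercial storage at ~$120/TB/year,
# 30 TB per university node, up to 255 nodes in the pool.
COST_PER_TB_YEAR = 120   # USD/TB/year (Google-class storage)
NODE_TB = 30             # TB per single-university node
MAX_NODES = 255          # pool ceiling in the least-authority scheme

node_cost = COST_PER_TB_YEAR * NODE_TB   # annual cost per node: $3,600
pool_pb = NODE_TB * MAX_NODES / 1000     # raw pool capacity: 7.65 PB
```

At roughly $3,600 per node per year, a 30 TB node is plausibly within a single university's reach, and a full pool reaches the ‘several petabytes’ of aggregate storage mentioned above.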
E. Other issues in archiving large data sets with journals, in University repositories or in funded public archives
There was some general discussion about the appropriate ‘horizon’ after which unpublished data should be released into the public domain, with Marian Szebenyi suggesting that 3 or even 5 years could be considered too short for data of high potential value. Frances Bernstein remarked that PDB release embargo periods used to be longer than is now the case, and had been shortened in accordance with community wishes. JRH emphasised that the DDDWG was not seeking to mandate a specific horizon, but that active discussion within the community (or communities) should take place to establish a timescale that represented a reasonable consensus. It was also pointed out that (unlike in space science, where a rocket launch marks a specific ‘time zero’) there might be difficulties in establishing the starting point from which an embargo period should properly be reckoned.
HJB revisited the idea of data ‘triage’ to reduce the volume of accumulated data. For some detectors there was natural triage, in that data frames were already discarded at source or stored using lossy compression. While he approved of the general principle of retaining as much as possible of the data associated with published structures, he argued for a strategy that also retained some lossy version of other data that would otherwise be discarded completely.
F. Commission Reports
While the DDDWG was keen to encourage progress by Commissions in defining experimental raw data metadata, there was as yet no coordinated way to achieve this. Patrick Mercier (PM) noted that the Commission on Inorganic and Mineral Structures was aware of the requirement and wished to proceed, but needed help in starting. JRH suggested that the XAFS and SAS articles published in recent years in Acta, together with the forthcoming Acta D special articles, would form a useful reference and starting point. SG emphasised that the Crystallography Open Database would be happy to help where possible.
* Action: PM to consult with his Commission colleagues to confirm if they now have enough details to proceed. [Post meeting note: Dr Simon Coles of Southampton University, UK could be approached by PM as a Commission consultant on such matters.]
G. Next triennium
A complete work plan for the next triennium would depend on feedback from the Executive Committee on the Working Group report presented at this Congress. In anticipation of a renewed mandate, an application for IUCr funding had been submitted to support a one-day workshop at the ECM meeting in Rovinj in August 2015. The DDDWG itself had presented the IUCr Executive Committee with a set of summary points on why it should continue (see Appendix below).
John R. Helliwell
Appendix

To the IUCr Executive Secretary, 10 August 2014
The DDDWG wishes to recommend to the Executive Committee that the activities of the DDDWG should continue in the next triennium, and we unanimously offer our reasons below.
Prof John R Helliwell DSc
> We offer the following good reasons to continue:
> 1. Raw data archiving and correct metadata logging may take longer than we initially imagined to become ingrained as community best practice, and the DDDWG would remain on hand to push further where and when needed.
> 2. The DDDWG is increasingly recognised as an established and experienced group that crystallographers, central X-ray and neutron facilities, and the IUCr Executive Committee and its Commissions can consult on raw data archiving policy as well as on technical matters.
> 3. Technical opportunities for raw data archiving are still evolving, and we are better placed to follow these developments as a group than as individuals.
> That said, we may need to add new specialists to the DDDWG. The most obvious need is for someone on board who represents those scientists who take many years to prepare experiments, notably on difficult samples, and who would be properly alert to mandates such as "privileged access to measured data can be for a limited period before open access to those data by others would be required".
> There may be other specialists that the EC would like to suggest.