Progress report of the Working Group

This forum allows IUCr Commissions, subject experts and invited consultants to provide input to the IUCr Working Group on Diffraction Data Deposition.


Post by Brian McMahon » Tue Feb 28, 2012 10:46 am

Below is one of our occasional reports on the activities of the Working Group and its interactions with the wider community. It will form the basis for an interim report to the IUCr Finance Committee at their meeting at the end of March and subsequently, updated as required, for a report to the IUCr Executive Committee in the summer. This draft has benefited from the input of Tom Terwilliger and Steve Androulakis, and from the general involvement of all the Working Group members. We'd be very happy to hear additional feedback from the Commission representatives and other contributors to this particular forum.


Towards a policy for archiving raw data for macromolecular crystallography
Progress report by The Diffraction Data Deposition Working Group of the IUCr

28 February 2012

WG Members: S Androulakis (TARDIS representative); Sol Gruner (diffuse scattering specialist and SR Facility Director); John R Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the IUCr Journals Commission 1996-2005); Loes Kroon-Batenburg (data processing software); Brian McMahon (IUCr CODATA Representative); Tom Terwilliger (representative of the Commission on Biological Macromolecules); John Westbrook (wwPDB representative and COMCIFS); Heinz-Josef Weyer (SR and Neutron Facility user).
Consultants: Alun Ashton (Diamond Light Source and Data Archive leader there); Herb Bernstein (head of the imgCIF Dictionary Maintenance Group and member of COMCIFS); Frances Bernstein (observer on data deposition policies); Gerard Bricogne (active software and methods developer); Bernhard Rupp (macromolecular crystallographer).

A. The Group was formally approved by the IUCr Executive Committee and launched at the IUCr Congress and General Assembly held in Madrid, August 2011.

B. A series of discussion forums for the topic has been established on the IUCr forum server:
http://forums.iucr.org/
A closed forum is available for the working group to organise its efforts, and a broader closed forum is used by an advisory panel of Commission representatives and other relevant observers. At an early stage a third, public consultation layer was added to interface with the community as a whole. This has proved particularly popular, with a number of postings and many page views. It includes a summary, written by T. Terwilliger, of a significant debate held within the CCP4bb. Weekly or fortnightly snapshots of these pages have been recorded by J. R. Helliwell (JRH) to demonstrate the level of interest in these discussions (a full set can be supplied upon request).

C. Practical next steps have been undertaken.
(i) Commissions were invited to submit exemplar publications with associated raw data and metadata; this allows us to gauge, within a given area of crystallographic science, what is required to describe experiments in that sub-field. [An example of a sub-field where there is lively discussion as to the nature of the raw data is found in the Commission on XAFS: i.e. is the 'raw data' the average of the raw scans, or the up to thousands of individual raw scans?] We can also get some measure of the file sizes involved for these sub-fields from the submitted raw data sets.

(ii) As emphasised by our consultant, Dr G. Bricogne, a pilot programme of archiving synchrotron macromolecular crystallography (MX) data sets would establish some of the practical parameters and scope the remaining challenges. The Diamond Light Source (Rutherford Laboratory, UK) has been especially active in this area, and has been in contact with us to explain current practice (to date, all data measured at this relatively new synchrotron facility have been preserved), and to describe the experience of exploring Digital Object Identifier (DOI) registration for a test group of 100 MX diffraction image data sets. It is notable that the Joint Center for Structural Genomics already has a publicly accessible repository of crystallographic data sets including diffraction images (http://www.jcsg.org/datasets-info.shtml); they have advertised this fact, for example within the CCP4bb thread on these matters.

(iii) An important related activity is the European Photon and Neutron PaN-data initiative (http://www.pan-data.eu/), which is actively coordinating the archiving plans of European synchrotron and neutron facilities; H.-J. Weyer is our link to this initiative. It is hoped and expected that similar efforts will get under way in Asia and the Americas.

In Australia, the MyTardis data repository was deployed at the Australian Synchrotron in 2010, allowing users of the X-ray beamlines to have their data automatically captured and stored. The system also shifts raw diffraction data to a user's home institution, where supported, and provides download, sharing and publishing capability from its web interface. MyTardis is currently expanding its links to all instruments at the Australian Synchrotron, and has thus far provided its users with a mechanism for depositing and citing data in publications in high-impact journals such as Nature.

(iv) The bulk of all diffraction data is in chemical crystallography, and in turn the largest part of this is measured at home X-ray sources. This has led us into active discussions with the University of Manchester concerning its plans to establish a local repository of raw data to satisfy the legal and good practice requirements of its research staff. (These include the need to maintain and archive data collections in accordance with funding body policy mandates, a requirement to demonstrate good research conduct,* and best laboratory practice as dictated by the relevant research community.) This repository will be partially launched in late 2012 and is expected to be fully operational in 2014. The University of Manchester data archive staff have undertaken to assist JRH by registering DOIs for data sets associated with an imminent publication; these data sets will be archived within the emerging University repository.

[* The University of Manchester's "Code of Practice for Investigating Concerns about the Conduct of Research" defines examples of 'Research misconduct' and includes an item:
(o) Mismanagement or inadequate preservation of data and/or primary materials.]


(v) A draft business case for hosting diffraction image data sets as supplementary material in a suitable IUCr journal (such as Acta Crystallographica Section C) was written by JRH and discussed with the Managing Editor of IUCr Journals. The assumption was that each structure would be accompanied on average by about 1 GByte of image data. The major practical hurdle for such a proposal was seen as being the network traffic generated by the very large number of publications each year. For an international publication, this relates not only to bandwidth at the hosting site (currently 4 Mbit/s aggregate at Chester, and an indeterminate value at the co-location centre hosting the journals themselves), but also to the bandwidth available to submitting authors and to readers accessing the supplementary materials. This again emphasised the need to archive the data close to where they were measured.
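
To give a feel for the scale of the hurdle described in (v), the short Python sketch below works through the arithmetic using only the two figures quoted above (about 1 GByte per structure and 4 Mbit/s aggregate at Chester). It is an idealised estimate, not a measurement: it assumes decimal units (1 GByte = 10^9 bytes), a link carrying nothing but this traffic, and no protocol overhead. The function and variable names are illustrative only.

```python
def transfer_time_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time: payload bits divided by link rate.

    Ignores protocol overhead, retransmissions and competing traffic,
    so real transfers would be slower than this.
    """
    return size_bytes * 8 / link_bits_per_second


# Figures quoted in the report; decimal units are an assumption.
DATASET_BYTES = 10**9        # about 1 GByte of image data per structure
LINK_BPS = 4 * 10**6         # 4 Mbit/s aggregate at the hosting site

seconds_per_dataset = transfer_time_seconds(DATASET_BYTES, LINK_BPS)
minutes_per_dataset = seconds_per_dataset / 60

# Upper bound on whole data sets moved per day if the link were
# saturated around the clock with this traffic alone.
datasets_per_day = 86_400 / seconds_per_dataset

print(f"One data set: {seconds_per_dataset:.0f} s (about {minutes_per_dataset:.0f} min)")
print(f"Ceiling: about {datasets_per_day:.0f} data-set transfers per day")
```

Under these assumptions a single 1 GByte data set occupies the whole link for roughly half an hour, and every upload by an author or download by a reader counts against the same daily ceiling.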

(vi) The practicalities of network transfer of large quantities of diffraction images have also been explored within the existing topology of European academic networks. More than 10 data sets, each 1 GByte in size, were transferred between Manchester and Utrecht during collaborative work between JRH and L. Kroon-Batenburg. An abstract on these experiences, with a positive outcome, has been submitted to the Bergen DDD Workshop to be held at ECM27.

(vii) To facilitate progress reporting and discussion a year on from the Madrid launch of the Working Group, a Workshop at Bergen ECM27 has been announced:
http://ecm27.ecanews.org/satellite-meetings.html
Owing to good interest from US colleagues, but concerns over the difficulty of attending Bergen so soon after the Boston ACA meeting, we are exploring the possibility of a committee-sized event at the Boston ACA meeting, to be chaired by T. Terwilliger. A full Workshop at the ACA meeting in 2013, akin to Bergen in 2012, is being actively discussed; a formal proposal will need to be made.


D. Current status and future directions
Currently the DDD WG are optimistic about the prospects for a clear policy for enhanced diffraction data deposition: namely, that deposition of the actual diffraction images should be encouraged in localised institutional repositories, or in synchrotron or neutron facilities' data archives. In all such cases, DOIs should be registered. Work is needed to define appropriate metadata to allow consistent and useful DOI registration through different agencies (e.g. the existing CrossRef and DataCite organisations that oversee academic publication and data collections respectively), and to permit useful search/discovery, linking and retrieval mechanisms across distributed collections. There is currently a problem category where local institutional repositories or SR and neutron facility data archives are not available, whether through lack of funding or absence of data archiving policies. The IUCr may wish to undertake some lobbying to improve data retention policies. In the shorter term, some other solution would be needed for this category before any 'encouragement' of data deposition could become a mandate. Areas to explore might include an upgrade of bandwidth and storage infrastructure at the IUCr journals headquarters; or the rental and central management of space within existing infrastructures (e.g. research resources such as particle physics Tier 1 storage facilities or commercial resources such as Google or Amazon cloud storage).

John R. Helliwell and Brian McMahon
