A business plan for archiving diffraction data

Post by Brian McMahon » Wed Nov 09, 2011 11:14 am

I'm posting a slightly edited first draft of John's business plan for archiving diffraction data in the IUCr publishing operation, for consideration by the IUCr editorial office and Finance Committee. This should not at this stage be advertised to the wider world; it's a tentative proposal, and one that will probably not be followed through in this form. Some following posts will explain why. But by showing this to the consultation group and inviting your comments on it and subsequent discussion, we hope to identify the real-world business decisions that will need to be taken to move forward.

A Business Plan for Archiving of Diffraction Data for consideration by the IUCr Finance Committee
Part I: Acta Cryst E.
Prepared by John R Helliwell and Brian McMahon on behalf of the IUCr Diffraction Data Deposition Working Group (DDD WG)
Sept 12th 2011

1. Introduction
The IUCr DDD WG Terms of Reference are:-
It is becoming increasingly important to deposit the raw data from scattering experiments; a lot of valuable information gets lost when only structure factors are deposited. A number of research centres, e.g. synchrotron and neutron facilities, are fully aware of the need and have established detector working groups addressing this issue.

The IUCr is the natural organization to lead the development of standards for the representation of data and associated metadata that can lead to the routine deposition of raw data. A Working Group on these matters has thereby been launched by the IUCr Executive Committee, to which the Working Group will report, to be Chaired by Professor John R. Helliwell. Its provisional title is 'Diffraction data deposition Working Group of the IUCr'.

The IUCr DDD WG kick-off meeting was held at the IUCr Congress in Madrid. It was well attended by a wide range of participants, who were enthusiastic about the task in hand.

An ideal vision has been put forward: "All relevant scientific data should be archived with its associated published literature or, where appropriate, assigned a DOI so as to be permanently, readily and freely available." A convenient metaphor towards realising this general vision is a Roadmap.

We can envisage the following practical steps towards establishing a Road Map to inform the ultimate vision of archiving data with literature.

  • Practical step 1: The gathering of "exemplar publications" by the different Commissions. These will need to be truly representative of all the range of activities of each IUCr Commission. Furthermore there will need to be a thorough and careful survey including a proper description of the raw data format(s) and associated metadata.
  • Practical step 2: The response of the Journals Commission will be solicited as to how they will handle the likely needs for the archiving of diffraction data from the submitted articles in each such category identified by each Commission, probably more than one category of article per Commission. Associated with this will be serving the whole crystallographic community and other published Journals with standards and definitions e.g. of metadata and all relevant raw data to be archived.
  • Practical step 3: The IUCr is likely to want to undertake a pilot project for one of the IUCr Journals before implementation across all titles of the IUCr Journals.

In setting out such a general vision and associated road map it has become immediately clear, in the first discussions with the Commissions, that some Commissions are at a more advanced stage of community view than others. Thus the Commission on Biological Macromolecules (CBM) have proposed the addition of a new member to the DDDWG, Dr Tom Terwilliger, which has been approved by the IUCr President. In a different approach the Commission on Inorganic and Mineralogical Crystallography is already pressing for an immediate pilot study, e.g. for Acta Cryst B submissions to be instigated by the IUCr, driven by the IUCr DDDWG.

In my experience of being a Member of the IUCr Finance Committee (FC) from 1996 to 2005 I am sure that specific proposals will be welcomed at the earliest opportunity for the FC's detailed consideration. Also, the FC will be keen to see possible measures that will tackle existing practical challenges faced by the IUCr Journals. One such challenge involves the instances of scientific fraud perpetrated within Acta C and Acta E. Since it is widely agreed that scientific fraud in crystallography will be much more difficult, even impossible, if the raw diffraction data images have to accompany a crystal structure article submission then this looks a very good place to start considering a diffraction data images archive held by the IUCr Journals. To this end I have prepared this business case for full diffraction data archiving for Acta E submissions. I have chosen Acta E as it has the simplest business model for how to raise the revenues to pay for the archived data sets, namely the author-pays model. The technical details, analyses and the type of costings detailed below are readily extendable to make appropriate business cases for other IUCr Journals and other types of data sets seen by IUCr Journals, and which are being prepared by the Commissions of the IUCr at the request of the IUCr DDDWG.

Perhaps the biggest question of all is: Why should IUCr Journals have to take this step of setting up a Data Archive to serve its publications, and with it the associated costs? Another prime candidate besides IUCr Journals could be Government agencies. However, if left to these agencies it would be impossible to serve equitably all the many countries of authors and readers that are linked with the IUCr Journals. The coordination required would be impossible. It would also not lead to clear standards of data set and metadata definition.

Another prime candidate besides IUCr Journals to steward such raw data archives would be the synchrotron radiation and neutron beam facilities. For macromolecular crystallography, X-ray diffraction data measured at synchrotron radiation facilities now account for approximately 90% of Protein Data Bank depositions (see http://biosync.sbkb.org/ for up-to-date statistics). SR facilities are now coordinating their efforts, with neutron facilities, to establish experimental data archives for all techniques exploited at these facilities; these are akin to current practice in particle physics and space science. Thus the SR facilities could be the provider for macromolecular crystallography diffraction archiving, although even here the lines of responsibility for Journals and their published article content, e.g. with respect to speed and ease of access by readers to the raw data, need careful evaluation.

Chemical crystallography, the subject of this business case, is predominantly conducted at home X-ray sources; there is no website equivalent to Biosync to be sure about the respective percentages, but it is safe to say that the majority of chemical crystallography X-ray data is measured at home sources, and probably true to say >90%, even 95%. Neutron chemical crystallography is a very small fraction, albeit a very important one, of all such data sets; but, again, such statistics are not readily available. In summary, Acta Cryst. E, covering chemical crystallography done mostly with home X-ray sources and with authors from essentially all countries of the world, is again the best place for the IUCr to start considering the issues and costs of a diffraction data images archive.

The Business Case, following this Introduction, is set out in the following contents sections:
  • Quantification of the technical challenge for Acta E
  • Costings and financing of the introduction of the need for authors to submit raw diffraction data with their articles; i.e. the refereeing stage
  • Costings and financing of the introduction of archiving of raw diffraction data with articles; i.e. after acceptance
  • Network costs for data transfer associated with article submissions and subsequent reader access to the data archive
  • Promotion strategy for this initiative to authors
  • Promotion strategy for this initiative to readers
  • Any obstacles to implementation?
  • Conclusions and recommendations


2. Quantification of the technical challenge for Acta E

Acta Cryst. E in 2010 comprised crystal structures in the following categories: 87 inorganic, 1705 metal-organic and 3370 organic; i.e. 5162 in total. The numbers for 2009 were similar: 92, 1709 and 3284 respectively; i.e. 5085 in total. We need to know the total number of diffraction data images for all these 5085 crystal structures (one per article in Acta E). We don't readily have access to the number of diffraction data images needed per crystal structure. This number, though, is linked to the crystal system of each crystal structure. We do not have an easy way of working out the percentages of different crystal systems across an Acta E typical year (i.e. without doing an article by article count), but data on this are available.

The table below (from http://pd.chem.ucl.ac.uk/pdnn/symm3/sgpfreq.htm) shows the distribution by crystal system of the 177,291 entries examined in the CSD (this analysis was performed in 1998 but should still be typical):

Crystal system    Frequency
Triclinic         20.91%
Monoclinic        53.16%
Orthorhombic      20.98%
Tetragonal         2.33%
Trigonal           1.62%
Hexagonal          0.53%
Cubic              0.47%

The predominance of low-symmetry crystal systems (triclinic and monoclinic) is evident at 74%. Thus the data collection will most likely involve data sets comprising the larger number of diffraction images. E.g. for triclinic a full data set may well be derived from 1800 0.3 degree images (as typically used in our Structural Chemistry Lab in Manchester on a Bruker Apex CCD diffractometer with 512 x 512 pixels of 16 bits each). Each image is 0.5 Mbytes, making a data set 0.9 Gbytes.
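The per-image and per-data-set sizes quoted above follow directly from the detector geometry. A minimal Python sketch, assuming uncompressed images with no file header, 1 Mbyte = 1024 x 1024 bytes and 1 Gbyte = 1000 Mbytes (conventions chosen here only because they reproduce the figures in the text); the function names are illustrative:

```python
def image_bytes(pixels_x: int, pixels_y: int, bits_per_pixel: int) -> int:
    """Uncompressed size of one diffraction image in bytes (header ignored)."""
    return pixels_x * pixels_y * bits_per_pixel // 8


def dataset_gbytes(n_images: int, bytes_per_image: int) -> float:
    """Size of a full data set in Gbytes.

    Assumes 1 Mbyte = 1024 * 1024 bytes and 1 Gbyte = 1000 Mbytes,
    a mixed convention adopted only to match the figures in the text.
    """
    mbytes = n_images * bytes_per_image / (1024 * 1024)
    return mbytes / 1000


apex = image_bytes(512, 512, 16)         # 524288 bytes, i.e. 0.5 Mbytes
triclinic = dataset_gbytes(1800, apex)   # 1800 images of 0.3 degrees each
print(apex, triclinic)
```

Higher-symmetry crystals would normally need fewer images, so 0.9 Gbytes is best read as an upper estimate per structure.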

The total archive of unique diffraction data images to accompany such a set of published crystal structures, in the extreme case suggested by the IUCr DDDWG, would then be 4.6 Tbytes for each year, such as 2010 and 2009, based on the contents of Acta Cryst. E.
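The annual figure can be reproduced from the article counts given earlier. The sketch below assumes, as the extreme case does, that every structure needs a full 0.9 Gbyte data set, and takes 1 Tbyte = 1000 Gbytes; the names are illustrative only:

```python
GBYTES_PER_DATASET = 0.9  # worst-case (triclinic) data set size from the text

# Structures published in Acta Cryst. E by category (figures from the text).
published = {
    2009: {"inorganic": 92, "metal-organic": 1709, "organic": 3284},
    2010: {"inorganic": 87, "metal-organic": 1705, "organic": 3370},
}


def annual_archive_tbytes(counts: dict) -> float:
    """Archive volume in Tbytes if every structure needs a full data set."""
    return sum(counts.values()) * GBYTES_PER_DATASET / 1000


for year, counts in published.items():
    total = sum(counts.values())
    print(year, total, round(annual_archive_tbytes(counts), 1))
```

Both years come out at about 4.6 Tbytes, so the estimate is not sensitive to the small year-to-year change in article numbers.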

The rejection rate for Acta E was 19% for 2010 and 18% for 2009. Since the submitting authors would be expected to include the associated diffraction data images at the time of submission, the data volumes to be handled at the IUCr each year would increase accordingly, i.e. to 5.4 Tbytes. However, rejected articles and their associated diffraction data images would not need to be preserved in a long-term archive.
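How the rejection rate scales the handled volume depends on what the percentage is measured against. The following sketch compares two readings; which one applies to the Acta E statistics is not stated in the text, so both are shown as assumptions:

```python
def handled_tbytes_markup(published_tb: float, rejection_rate: float) -> float:
    """Reading 1: add the rejection percentage on top of the published volume."""
    return published_tb * (1 + rejection_rate)


def handled_tbytes_fraction(published_tb: float, rejection_rate: float) -> float:
    """Reading 2: the rate is the fraction of all submissions that are rejected."""
    return published_tb / (1 - rejection_rate)


for rate in (0.18, 0.19):
    print(
        rate,
        round(handled_tbytes_markup(4.6, rate), 2),
        round(handled_tbytes_fraction(4.6, rate), 2),
    )
```

The first reading gives about 5.4 to 5.5 Tbytes, in line with the figure quoted above; the second gives about 5.6 to 5.7 Tbytes. Either way the extra volume is temporary, since rejected submissions would not enter the long-term archive.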

Upgraded CCD detectors have very recently become available, such as http://www.bruker-axs.com/apex_ii_ccd_detector.html, which offers a 4k by 4k = 16 Megapixel CCD with 16 bits (i.e. 2 bytes) per pixel, commercially available for chemical crystallography without an optical taper. Other manufacturers have similar upgraded CCD detectors for chemical crystallography, although the Bruker detector has the largest number of pixels per diffraction image at 16 Megapixels. A volume of Acta E in a coming year could then comprise 5000 crystal structures x 1800 images x 16 Megapixels x 2 bytes = 288 Tbytes per annum.
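The 288 figure is a straight product of the four quantities. A small sketch, assuming decimal units throughout (1 Megapixel = 10^6 pixels, 1 Tbyte = 10^6 Mbytes) and uncompressed images; the function name is illustrative:

```python
def annual_tbytes(
    structures: int,
    images_per_structure: int,
    megapixels: float,
    bytes_per_pixel: int,
) -> float:
    """Annual raw-image volume in Tbytes, using decimal units throughout."""
    mbytes = structures * images_per_structure * megapixels * bytes_per_pixel
    return mbytes / 1_000_000


future = annual_tbytes(5000, 1800, 16, 2)      # 16 Megapixel detector
current = annual_tbytes(5000, 1800, 0.25, 2)   # 512 x 512 detector, about 0.25 Megapixels
print(future, current)
```

With the same number of structures and images, the 0.25 Megapixel detector gives 4.5 Tbytes, so the growth comes entirely from the larger image size.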

What factors might increase the number of crystal structures in Acta Cryst. E? Automation would increase Lab throughput as would higher performance home lab X-ray source intensities. These may add further factors of two each. But would Acta E really be able to publish 20,000 articles per annum? This seems unlikely and so is dismissed here, at least for the moment.

3. Costings

On the costs of long-term data storage (Herbert Bernstein personal commun. Sept 11th 2011):
Dear Colleagues,

The current best estimate for long term data storage is "between $1K and $3K+ per year" for a terabyte, see the NSF RDLM workshop website at http://rcs.columbia.edu/rdlm and in particular slide 15 from http://rcs.columbia.edu/files/researchc ... n_rdlm.pdf

This may seem exaggerated because we all know you can buy a terabyte of disk for about $100 and adding tapes to a modern tape library costs only a few hundred dollars per terabyte. The difference between these estimates is the cost of ensuring that instead of just writing the data, you also can be reasonably certain of being able to read the data years after it is written. That requires significant labor and materials costs in checking and regenerating the data and distributing the data to multiple locations. That may seem paranoid, but experience has shown that it really is necessary, especially when using tape storage. Tape deteriorates surprisingly rapidly and requires regeneration no less frequently than every 2 years. There are more durable media, and costs are dropping, but right now it would be prudent to budget at least $1000 per terabyte per year, or to link the IUCr effort to one of the larger efforts that lower their costs by economies of scale, such as Google or Amazon. In such a relationship the real costs (not the charges -- the costs) for reliable storage may drop to less than $300 per terabyte per year, and it might be possible to make an arrangement in which some or all of those costs might be absorbed by Google or Amazon in exchange for the value of bringing eyeballs to the site.


3.1 Costings and financing of the introduction of the need for authors to submit raw diffraction data with their articles (i.e. through the refereeing stage)
Raw data for submitted articles that are subsequently rejected will not become a cost burden with respect to the archive. They will, however, cause some loading on the IUCr Acta E article upload system and temporary storage capacities. Since article rejection times are fairly short, typically of order 30 days, again it should not be a large cost burden on IUCr but there will be some costs in having this temporary disk store.

3.2 Costings and financing of the introduction of archiving of raw diffraction data with articles (i.e. after acceptance)
On the costing basis suggested by Herbert Bernstein, the cost of a complete archive for the articles published in Acta E in 2009 (or 2010) would be 4.6 Tbytes x 1k$ to 3k$ per Tbyte per annum, i.e. 4.6k$ to 13.8k$ per annum, or between 1$ and just over 2.5$ per published article per annum. Authors, however, may wish to pay a one-off fee on top of their current open-access fee. This could perhaps be set to cover 10 years, at which time the IUCr would need to review its archive and costings in any case. A 10-year once-only payment could then be set at between 10$ and 25$ per published article, to be paid by the corresponding author.
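The per-article figures can be checked with a few lines of arithmetic. This sketch assumes a flat storage rate over the ten years (no discounting and no fall in storage prices) and uses the 2009 article count of 5085; the variable names are illustrative:

```python
ARCHIVE_TBYTES = 4.6      # annual archive volume from the text
ARTICLES_PER_YEAR = 5085  # Acta E articles published in 2009
YEARS_COVERED = 10        # assumed length of the one-off payment period

costs = {}
for rate_per_tbyte in (1000, 3000):  # $ per Tbyte per year (Bernstein's range)
    annual_total = ARCHIVE_TBYTES * rate_per_tbyte
    per_article_year = annual_total / ARTICLES_PER_YEAR
    costs[rate_per_tbyte] = (
        annual_total,
        per_article_year,
        per_article_year * YEARS_COVERED,
    )
    print(rate_per_tbyte, [round(value, 2) for value in costs[rate_per_tbyte]])
```

Unrounded, this gives about 0.9$ to 2.7$ per article per year and about 9$ to 27$ per article over ten years, close to the rounded range quoted above.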

A future-year costing, if based on the higher-performance CCD image sizes described above, would make the 10-year cost per Acta E article prohibitive (a fee 64 times bigger) unless each diffraction image were compressed from 16 Megapixels to 0.25 Megapixels for the purposes of the archive.
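The factor of 64 is simply the ratio of pixel counts, and it carries through to the fee if storage is charged in proportion to volume. A short sketch under that assumption, taking the 10$ to 25$ ten-year fee range as the baseline; the function name is illustrative:

```python
import math


def area_factor(new_megapixels: float, old_megapixels: float) -> float:
    """Ratio of pixel counts between two detector formats."""
    return new_megapixels / old_megapixels


factor = area_factor(16, 0.25)     # 64.0
linear = math.isqrt(int(factor))   # 8 pixels per side if reduced by binning

# Ten-year fee per article if charged in proportion to stored volume.
scaled_fees = {fee: fee * factor for fee in (10, 25)}
print(factor, linear, scaled_fees)
```

Without compression the ten-year fee would therefore rise to somewhere between 640$ and 1600$ per article on this basis.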

3.3 Network costs for data transfer associated with article submissions and subsequent reader access to the data archive
To be assessed

4. Promotion strategy for this initiative to authors

Firstly, authors would get the benefit of having a place, IUCr Chester, where they could go to retrieve the diffraction data images associated with their published Acta Cryst. E articles at any time, should they wish to. Secondly, funding agencies are these days requiring archiving of the raw data arising from publicly funded projects; although in theory this relates to both unpublished and published data, in practice only data related to publications could come under scrutiny, and IUCr Chester would be holding this archive.

5. Promotion strategy for this initiative to readers

A primary benefit of the whole scheme of archiving of raw diffraction data images associated with a published article is for readers to be able to fully connect with the measurements underpinning the publication. There are two constituencies of readers. Firstly, there are those readers interested in a particular crystal structure and compound. Secondly, there are those readers interested in methods developments e.g. new software for diffraction data processing.

6. Any obstacles to implementation?

The new generation of manufacturers' detectors now being sold, illustrated with the example above, can potentially escalate the scale and costs of the diffraction data images archiving challenge considerably. A possible technical solution has been offered involving diffraction image compression. This is at the expense of the spatial resolution of the measured pattern, which could dilute some of the benefits of the archive, such as software methods development. It should not affect the basic checking of the diffraction data images against the communicated scientific results, although there will likely be some loss of precision.
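The trade-off can be illustrated with a toy example. The text does not say which compression method is intended; the sketch below assumes simple pixel binning (summing square blocks of pixels), which is only one possible way to reduce a 16 Megapixel image to 0.25 Megapixels, and uses a made-up 16 x 16 frame:

```python
def bin_image(image: list, factor: int) -> list:
    """Sum factor x factor blocks of pixels into single output pixels."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be multiples of the binning factor")
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


# Hypothetical 16 x 16 frame: two reflections 3 pixels apart, plus a distant one.
frame = [[0] * 16 for _ in range(16)]
frame[2][1] = 500
frame[2][4] = 300
frame[10][12] = 200

binned = bin_image(frame, 8)  # 8 x 8 binning, a 64-fold reduction in pixels
print(binned)
```

The total counts are preserved, but the two neighbouring reflections end up in the same output pixel, which is the loss of spatial resolution referred to above. Summed counts may also exceed a 16-bit range, so a real scheme would need to choose its output data type with care.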

7. Conclusions and recommendations

A pilot study should be instigated to trial an IUCr diffraction data images archive for Acta E by testing procedures with the help of a subset of willing, regular Acta E authors, i.e. with their new submissions. This will allow unexpected aspects of the process to be exposed and detailed feedback from authors to be obtained. During the period of the publication of these articles, a previously appointed panel of readers could test the system with respect to the downloading of information from the diffraction data images archive.
