Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 6:19 pm
by Brian McMahon
Following the formal establishment at the IUCr Congress of a Working Group on Diffraction Data Deposition, the Chair, John Helliwell, convened a consultation meeting to which were invited IUCr Commission Chairs, members of the IUCr Database Users' Committee, and a number of other interested parties. The purpose of the meeting was to scope the tasks facing the Working Group and to invite initial input from the Commissions. Draft Minutes of the meeting are currently out for review, and will be posted here when approved. In subsequent replies to this topic I shall record some of the comments posted in discussion during the Minutes review period. They have led to some changes in the wording of the Minutes, but are valuable as background and introduction to specific issues that may be elaborated in later posts to this forum.

By and large, I have tried to post a separate digest for each distinct topic raised, but I keep them all in this discussion thread to provide an overview of the contents of the original meeting. If you wish to develop any of these topics in greater depth (which I strongly encourage you to do), please start a fresh topic in order to keep the discussions reasonably well structured.

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 6:26 pm
by Brian McMahon
Gerard Bricogne commented on the initial somewhat condensed text of the draft Minutes as follows:

What was to me the most interesting part of the meeting was what happened when we began getting beyond the usual, ritualistic clashes of positions regarding raw data archiving in macromolecular crystallography, from the "it should be made compulsory right now" to the "it is impossible to do at the moment". Initial estimates of volume, and therefore costs, had begun to be made on the basis of the 200TB size of the Diamond archive, until it was pointed out that only a small fraction of that would contain the relevant datasets, i.e. those which had given rise to the data against which actual PDB entries had been refined. An intermediate goal was also identified, namely letting the synchrotrons and the TARDIS archives hold those datasets, rather than immediately putting that burden on the PDB, but ensure that unique identifiers (e.g. DOIs) be assigned to all datasets collected in the future so that these can then provide the means of pointing to the subset of "relevant" datasets, both in the publication describing a structure and in the PDB entry for the latter. Synchrotron archives would then make those datasets publicly available at the time when the PDB entry is released, instead of waiting for the 5-year period currently contemplated by Diamond for all datasets. The idea of a pilot project thus emerged, in which raw image deposition (or disclosure, if they are stored at the synchrotron where they were collected) would be on a voluntary basis, but the PDB would immediately investigate how to accommodate the unique DOI information through which the raw image files would be accessed within the corresponding PDB entries. This would have the advantage of testing the various bits of technology involved, and to also start giving an idea of the uptake of such information by users and developers and hence build the case for moving towards compulsory deposition in the future.

Clearly the minutes do not need to contain all this detail, but a judicious selection of it would go a long way towards dissipating the sense that a reader of the current draft would get (namely that we will be talking about a Roadmap that will organise workshops that will ... - i.e. infinite regression!), and perhaps give instead the sense that some real-world action is about to be taken to set up a prototype of a real-world solution to a real-world problem. This may be the case in only one IUCr Commission, but so be it: if we insist on being egalitarian, we will never achieve anything of any substance in time for it to have an impact.

To conclude: I would strongly argue that some digest of what I have described - and which did happen: or have I dreamed it? - should be included in the minutes. Then, readers will get a sense that our assembly is meant to be a Working Group, not a Contemplation Group. I will of course be happy to contribute to the task, but some moderating influence will probably be needed.

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 6:33 pm
by Brian McMahon
By way of response to Gerard's distaste for the woolly-sounding 'vision statement' and 'roadmap' referred to in the draft Minutes, I elaborated as follows:

Precisely so. That is, we're looking for a metaphor that even President Bush could grasp. Let me say a little more about this "vision/roadmap" idea that came up after the meeting. One of the deliverables of this Working Group should, I believe, be a detailed report that includes analysis, both technical and in terms of cost/benefit, of the most important issues. Many of the specifics raised at the Madrid meeting were very relevant to this: how much data does one *need* to keep, how to manage distributed archives, some in home labs, some at synchrotrons, etc.

In presenting such a report to the widest audience, we need a framing device and a narrative. The "vision statement" provides the framing device. It's certainly an idealistic vision, one that's almost impossible to attain; one, perhaps, that it's not even desirable quite to attain - perhaps we will decide that archiving all "relevant" or "necessary" data is sufficient. But the point of the "vision" is simply to establish the direction in which we set off.

The "roadmap" provides the narrative element. Different parts of the community are travelling different roads towards that ideal(istic) vision - some (perhaps protein crystallography, which is well resourced) highways, others dual carriageways (small-molecule single crystallography?), others winding country lanes. And the actual routes differ, each with its particular obstacles to surmount and terrain to navigate. It's a device which permits different communities of interest (represented, perhaps, by the IUCr's different Commissions) to track their individual routes. Of course the road map will be best filled in by communities that engage most enthusiastically with the Working Group (and perhaps by those who know most clearly where in the journey they have already reached). But it allows the others to at least sketch their positions and gain some sense of where they stand relative to other communities, and perhaps roughly how they proceed towards the same destination.

It's also a device that can be presented to other scientific disciplines altogether. Their part of the map (astronomers, oceanographers) may look very different from ours, but they can develop the same narrative, plotting their route towards the same "vision" at the end of the road.

And, in principle, it allows us to identify specific "milestones" and suggest to lost wayfarers where they should head to next. And perhaps, from time to time, we can take it to an external funding agency and say "here's a major obstacle that requires a substantial bridge in order to be overcome" - and, if the metaphor works, persuade them to help with the bridge-building because they understand the purpose of the enterprise in a more holistic way.

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 7:02 pm
by Brian McMahon
During the course of the meeting, John Helliwell put forward the suggestion that each Commission of the IUCr could contribute to the scope of the Working Group by providing one or more exemplar publications from their own area of structural science that would deposit a 'complete' set of supporting data, as they saw it. Patrick Mercier (representing the Commission on Inorganic and Mineral Structures) suggested that an extended pilot involving one of the IUCr journals could be a useful exercise. To this, John replied:

That process, of gathering "exemplar publications" by the different Commissions, mentioned in the Minutes, [could be a] practical step 1.

The response of the Journals Commission as to how they will handle submitted articles in each such category identified by each Commission, probably more than one category of article per Commission, will then have to occur. This will be the practical step 2. Associated with this will be serving the whole crystallographic community and other published Journals with standards and definitions, e.g. of metadata and all relevant raw data to be archived.

Indeed the Journals Commission may wish to undertake a pilot project for one of the IUCr Journals before implementation across all titles of the IUCr Journals. This might be called practical step 3.


Patrick responded by suggesting Acta Cryst. B as a suitable test bed:

I will send [the final version of the Minutes] to members of the Commission on Inorganic and Mineral Structures (CIMS) to get their input as to what would be the 'minimal requirements' for diffraction data deposition of experimental results from inorganic and mineral structures. We can anticipate a huge disparity of digital storage requirements between powder diffraction (traditionally ASCII data sets for both measurements and metadata, but now moving to 2D detectors data sets for measurement with associated metadata probably ill-defined at the moment - perhaps ICDD can help out here?) vs. single-crystal studies (images from CCD cameras). Indeed, the Acta B subscriber model may turn out not to constitute a sustainable business model for IUCr to carry the costs of digital data deposition even for just those from inorganic and mineral structures. The advantage in starting with this journal is that there are many fewer published structures in this one compared to Acta E. A pilot study for a 1-year period sounds like a logical starting point. Let's get the ball rolling.


The Acta B/E dichotomy had arisen from correspondence in which John Helliwell had argued that Acta E might be a more useful initial test platform because of its scalable financing model (open access to content, funded by an 'author pays' revenue stream). John offered to take up the fall for an Acta E trial:

Acta B has the advantages you clearly and correctly describe. The subscriber model of Acta B is not impossible to incorporate DDD archiving costs but Libraries may argue that authors should pay this portion. Thus collecting revenues for an Acta B including a DDD archive, becomes a hybrid financial model, which is messier. Acta E on the other hand already has this financial model of authors pay. Thus I have been mulling over the business plan for DDD for Acta E as follows:-

Acta E suffered particularly from the instance of scientific fraud of invented structures, and as Bernhard Rupp eloquently put it at our meeting, fraud is much more difficult with diffraction data images. Thus the incentive for E to implement archiving of all relevant data has this driver, as well as the overall good sense of our vision for it to happen (improved reader experience, redoing analyses as methods improve, capturing diffuse scattering etc).

Thus, I think we need to estimate a year's worth of the current Acta E articles' digital storage requirements, and this tells us the total cost for that year. We need then to know the likely distribution between crystal systems i.e. triclinic having more raw diffraction data images than monoclinic, than orthorhombic, than tetragonal, than hexagonal than cubic. We could perhaps most simply apply the published probability of occurrences of space groups to guide this calculation. We can probably assume that the various apparatus in the field is homogeneously CCDs these days; is that really the case? It would be good for you to consult with your Commission Members and the Commission on Structural Chemistry on these aspects please. We also need to estimate the costs of hosting an archive in a separate location than IUCr Journals Chester (to properly take account of the risk of fire or other mishap). Also experts often talk of the longevity of paper, than magnetic tapes, floppy disks etc i.e. the panoply of evolving digital media, so do we need to cost a transfer to a new form of digital media? [This last is probably impossible to cost and in fact such a development of a new medium of digital storage is likely, by Moore's Law, to be relatively cheap anyway.] The perhaps thorniest challenge to estimate is the possible growth of Acta E, although this may have plateaued with the authors pay model implementation of a few years ago; the IUCr FC itself can request this projection if needs be.


John has subsequently asked the Chester Editorial Office to begin to estimate the costs associated with archiving the diffraction data for articles published in Acta E. These are usually single-crystal diffraction experiments, mostly of organic or metal-organic compounds (the split is of order 50/50, though we can look up more exact statistics), with rather few inorganic structures. It would greatly help Chester in this exercise if someone would start a new thread in this forum which gives an indication of the typical volumes of data that we might expect an author to generate and wish to deposit for the various types of experiments, crystal systems, etc., as John suggests above.
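As a starting point for such a thread, the calculation John outlines can be sketched in a few lines of Python: weight a typical raw-data volume per structure by the fraction of structures in each crystal system, then multiply by the number of structures published in a year. This is only an illustrative sketch. The function name is my own, and every number in the example is a made-up placeholder, not a measurement or a journal statistic; real values would have to come from the community.

```python
import math


def expected_annual_volume_gb(n_structures, fraction_by_system, gb_per_structure):
    """Expected raw-data volume (GB) for one year's published structures.

    fraction_by_system: share of structures in each crystal system (sums to 1).
    gb_per_structure:   typical raw-data volume per structure, by crystal system.
    """
    if set(fraction_by_system) != set(gb_per_structure):
        raise ValueError("both tables must list the same crystal systems")
    if not math.isclose(sum(fraction_by_system.values()), 1.0, abs_tol=1e-9):
        raise ValueError("fractions must sum to 1")
    # Weighted mean volume per structure, then scale by the yearly count.
    mean_gb = sum(fraction_by_system[s] * gb_per_structure[s]
                  for s in fraction_by_system)
    return n_structures * mean_gb


if __name__ == "__main__":
    # PLACEHOLDER VALUES ONLY: invented for illustration, not real statistics.
    fractions = {"triclinic": 0.25, "monoclinic": 0.50, "orthorhombic": 0.20,
                 "tetragonal": 0.02, "hexagonal": 0.02, "cubic": 0.01}
    volumes_gb = {"triclinic": 4.0, "monoclinic": 3.0, "orthorhombic": 2.0,
                  "tetragonal": 1.5, "hexagonal": 1.5, "cubic": 1.0}
    total = expected_annual_volume_gb(1000, fractions, volumes_gb)
    print(f"Placeholder estimate: {total:,.0f} GB per year")
```

Replacing the placeholder tables with published space-group frequencies and measured data volumes would turn this into the estimate that Chester needs.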

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 7:14 pm
by Brian McMahon
Herbert Bernstein responded to John Helliwell's suggestion that a one-year pilot with Acta Cryst. E would be a useful way to begin to assess real-world costs, offering the following information:

The current best estimate for long term data storage is "between $1K and $3K+ per year" for a terabyte; see the NSF RDLM workshop website at http://rcs.columbia.edu/rdlm, and in particular slide 15 from http://rcs.columbia.edu/files/researchc ... n_rdlm.pdf

This may seem exaggerated because we all know you can buy a terabyte of disk for about $100 and adding tapes to a modern tape library costs only a few hundred dollars per terabyte. The difference between these estimates is the cost of ensuring that instead of just writing the data, you also can be reasonably certain of being able to read the data years after it is written. That requires significant labor and materials costs in checking and regenerating the data and distributing the data to multiple locations. That may seem paranoid, but experience has shown that it really is necessary, especially when using tape storage. Tape deteriorates surprisingly rapidly and requires regeneration no less frequently than every 2 years. There are more durable media, and costs are dropping, but right now it would be prudent to budget at least $1000 per terabyte per year, or to link the IUCr effort to one of the larger efforts that lower their costs by economies of scale, such as Google or Amazon. In such a relationship the real costs (not the charges - the costs) for reliable storage may drop to less than $300 per terabyte per year, and it might be possible to make an arrangement in which some or all of those costs might be absorbed by Google or Amazon in exchange for the value of bringing eyeballs to the site.
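To put these figures in context, here is a minimal Python sketch (my own addition, not part of Herbert's message) that multiplies an archive volume by the per-terabyte yearly rates quoted above. The 200 TB volume is the Diamond archive size mentioned in Gerard Bricogne's comments earlier in this thread, which the meeting noted is far larger than the subset of relevant datasets, so it serves only as an upper-end illustration. The function name and the rate labels are assumptions made for the example.

```python
# Per-terabyte yearly rates quoted in the message above (US dollars).
RATES_USD_PER_TB_YEAR = {
    "low estimate": 1000,
    "high estimate": 3000,
    "possible cost at scale": 300,
}


def annual_cost_usd(volume_tb, rate_usd_per_tb_year):
    """Yearly cost of keeping volume_tb terabytes reliably readable."""
    if volume_tb < 0 or rate_usd_per_tb_year < 0:
        raise ValueError("volume and rate must be non-negative")
    return volume_tb * rate_usd_per_tb_year


if __name__ == "__main__":
    volume_tb = 200  # illustration only: size of the whole Diamond archive
    for label, rate in RATES_USD_PER_TB_YEAR.items():
        cost = annual_cost_usd(volume_tb, rate)
        print(f"{label}: ${cost:,.0f} per year")
```

For 200 TB the quoted range corresponds to $200,000 to $600,000 per year, falling to $60,000 if the lower at-scale cost could be achieved.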

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 7:31 pm
by Brian McMahon
In correspondence amending some statements regarding the distributed operation of the Worldwide Protein Data Bank (wwPDB), John Westbrook of the PDB and John Helliwell discussed the need to manage both data generated by large centralised facilities (e.g. synchrotron laboratories) and data created at smaller-scale ('home') laboratories. The relative proportions of these originating sites certainly differ between the macromolecular and small-molecule worlds. John Westbrook suggested the following description:

The wwPDB is providing anonymous access to its archive using synchronized servers hosted by the wwPDB partner sites. The model that is evolving for serving raw diffraction data appears to be a distributed network of servers hosted at large data centers affiliated with data generation facilities or at the academic or commercial data centers supporting locally collected data sets. It would be useful to develop a registration system (e.g. DOI-like system) for this distributed network of diffraction data so that references to this data could be collected with other metadata when structural models are deposited with the wwPDB.


John Helliwell replied with a further amplification:

In University of Manchester for example we as academics have to regularly lodge our research outputs with our internal eScholar system. At present for my PDB depositions I 'only' give the PDB code per entry. As soon as I can have a doi for all the relevant raw data then I will log that detail per research output as well. That however, I suppose, pushes the matter down the track to the data set doi provider i.e. when I register the data set do they take an uploaded copy?


John Westbrook's reply:

We will be adding the metadata to record the raw data URI details once we have studied the situation a bit. We will try to work with some of the existing data stores to figure out how best to deal with this. I do not think that the data sets, while on-line, are necessarily open for public access at this point so we will need to sort out these details.

As far as the DOI registration is concerned I believe that it is the responsibility of the registering entity to provide a reliable URI to the data. When we register PDB data sets with CrossRef we provide them with URLs on the wwPDB ftp site that point to the data sets. CrossRef does not host the data.


It is likely that this Working Group will take a significant interest in the benefits and challenges of assigning DOIs for data sets. By way of background, I remind you that Crossref (http://www.crossref.org) is a publisher-funded consortium that has been assigning identifiers to published articles for many years. It does also provide DOI registration for data sets (e.g. PDB entries, but also entries in the eCrystals data repository of the UK National Crystallography Service/University of Southampton), but its underlying metadata model is influenced by its predominant experience in literature identification. DataCite (http://www.datacite.org) is a more recent consortium largely supported by national libraries that provides DOI registration for data sets that are not necessarily associated with a publication. Its initial clients have mostly been earth and oceanic scientists, and my suspicion is that they may benefit from input from physical scientists to refine a metadata model that is well tuned to, say, crystallographic data sets.
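To make the division of responsibility concrete, the following Python sketch shows one way a structure entry might carry a pointer to its raw data: the entry stores only the DOI, and the public DOI resolver redirects to whatever URI the registering host has supplied. The class and field names are hypothetical and do not describe any actual wwPDB, Crossref or DataCite schema; both identifiers in the example are invented placeholders.

```python
from dataclasses import dataclass

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


@dataclass(frozen=True)
class RawDataLink:
    """Hypothetical record tying a deposited structure to its raw data."""
    entry_id: str      # identifier of the structure entry (e.g. a PDB code)
    raw_data_doi: str  # DOI registered by whoever hosts the diffraction images

    def __post_init__(self):
        # DOI names start with the "10." directory indicator and contain
        # a "/" separating the registrant prefix from the suffix.
        if not self.raw_data_doi.startswith("10.") or "/" not in self.raw_data_doi:
            raise ValueError(f"not a DOI name: {self.raw_data_doi!r}")

    def resolver_url(self):
        """URL that redirects to wherever the host says the data live."""
        return DOI_RESOLVER + self.raw_data_doi


if __name__ == "__main__":
    # Placeholder identifiers only: neither refers to a real record.
    link = RawDataLink(entry_id="XXXX", raw_data_doi="10.99999/example-dataset")
    print(link.entry_id, "->", link.resolver_url())
```

The point of the design is that the entry never stores the location of the images themselves, so the host can move the data without the structure record changing.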

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 7:39 pm
by Brian McMahon
Kamil Dziubek has also raised the following point:

While the problem of raw data archiving at the synchrotron and neutron sources was widely discussed during the meeting, there are still no clear plans for the files from home diffractometers, particularly in light of the necessity to upload the massive amounts of data through the Internet, which can be cumbersome especially in lack of a broadband Internet connection. In this respect, providing technical assistance to developing countries may be required.


This is a very useful point, for even 'broadband' connectivity might not be sufficient for a service provider that handles large volumes of archived data. The IUCr office itself currently has just 4 Mb leased-line connectivity. Expanding this significantly to handle large volumes of raw data traffic, or putting in place systems to manage ingest and output of data through physical media (memory sticks, portable hard drives), could each incur significantly higher costs than the actual storage of terabytes of data.
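A quick calculation shows the scale of the problem. The Python sketch below estimates how long a transfer would occupy such a line, assuming that "4 Mb" means 4 megabits per second, that a terabyte is 10^12 bytes, and that the line is fully and exclusively used for the transfer (all three are simplifying assumptions of mine).

```python
SECONDS_PER_DAY = 86400
TERABYTE = 10**12        # decimal terabyte, in bytes
LINK_BPS = 4 * 10**6     # assumption: "4 Mb" means 4 megabits per second


def transfer_days(volume_bytes, link_bits_per_second):
    """Days needed to move volume_bytes over a fully utilised link."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    seconds = volume_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = transfer_days(TERABYTE, LINK_BPS)
    print(f"1 TB over a 4 Mbit/s line: about {days:.0f} days")
```

Under these assumptions a single terabyte would saturate the line for roughly 23 days, which is why shipping physical media may need to be considered.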

Re: Inaugural Consultation Meeting, Madrid, 27 August 2011

Posted: Sun Sep 11, 2011 7:41 pm
by Brian McMahon
Jim Kaduk commented on the draft Minutes:

A good summary. The only thing I'd add is that (for proper interpretation later) metadata is probably even more important for powder data than it is for single crystal images. The powderCIF dictionary was designed to capture as much metadata as possible, but I'm not sure we're completely there yet.