
Prioritizing raw images for deposition

Posted: Tue Oct 18, 2011 10:27 am
by Brian McMahon
In trying to determine data storage policies and link them to real-world resource availability, it is important to establish priorities. I would like to introduce the idea of data "triage". Triage is the process whereby casualties in battle or patients attending an emergency room are assessed for the severity of their injuries, and hence for the priority of the resources needed to treat them. In a similar way, data "triage" would be a process of labelling data sets (ideally as close as possible to the time of collection) in a way that characterises their importance for particular purposes, and consequently the amount of resources that should be diverted to their archiving. Such labelling should be embedded in the data sets, or closely associated with them, in a way that allows for automated handling. Then the most important data could be extracted and curated or archived, while less important data could be left in offline stores or, when pressure on resources becomes too great, deleted.
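
As a purely illustrative sketch of what such machine-readable labelling might look like (the JSON side-car layout, the field names and the file naming here are my own assumptions, not any agreed standard), an archiving script could write, and later read back, a small label file stored alongside each data set:

    import json
    import datetime
    import pathlib

    def write_triage_label(dataset_dir, level, note=""):
        # Record the triage level (1-7, following the hierarchy suggested
        # below) in a small side-car file, so that automated archiving tools
        # can act on it without opening the images themselves.
        label = {
            "triage_level": level,                              # 1 = highest retention priority
            "note": note,                                       # free-text justification
            "labelled_on": datetime.date.today().isoformat(),   # ideally close to collection time
        }
        path = pathlib.Path(dataset_dir) / "triage_label.json"
        path.write_text(json.dumps(label, indent=2))
        return path

    # Hypothetical usage: mark a complete data set belonging to a published structure
    # write_triage_label("/data/2011-10-18/xtal7", level=2, note="associated with published structure")

An automated curation pass could then sort data sets by this label and archive them, leave them offline or delete them according to the resources available.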

Here is one suggested hierarchy of "importance" - assembled by someone with no practical experience in the field. I'd be interested to see whether the community could reshape it to better reflect real scientific benefits.

1. A representative selection of images associated with a published structure
2. Complete data sets associated with a published structure
3. Complete data sets collected for structures that are solved but not published or deposited in the PDB
4. Complete data sets for structures that are not solved (essentially for practical reasons - funds run out, crystal is twinned, etc.)
5. Complete data sets for structures that cannot be solved
6. Incomplete data sets that might contribute something towards the solution of a structure or some interesting crystal physics (incomplete e.g. because of instrumental breakdown or crystal decay)
7. Incomplete or unreliable data sets with no likely contribution towards scientific discovery (e.g. incorrect or faulty instrument operation, operator error)

One might reasonably characterise level (7) as absolutely useless; retaining such data sets could actually be harmful by needlessly filling up scarce storage capacity. However, if resources were truly infinite, even such data sets might be worth retaining for analysing, say, psychological patterns of poor experimentation (or is that too much like Big Brother?).

On the other hand, one might reasonably consider that level (1) is most valuable because it always provides some assurance about the integrity of the derived structure, and could be relatively compact and hence easily stored, duplicated and disseminated. But even within level (1), what is acceptably "representative"? JPEG/TIFF images of the frames, which should at least show the symmetry of the diffraction pattern? One image to prove the experiment was carried out? Every tenth or hundredth frame?
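
To make the last of those options concrete, here is a minimal sketch of selecting every n-th frame as the "representative selection" (assuming the frames are simply numbered files in one directory; the .img extension and the default sampling interval are arbitrary assumptions):

    import pathlib
    import shutil

    def select_representative_frames(frame_dir, dest_dir, every_n=10):
        # Copy every n-th frame (sorted by filename) into a separate directory
        # to serve as the representative selection for deposition.
        frames = sorted(pathlib.Path(frame_dir).glob("*.img"))   # file extension is an assumption
        dest = pathlib.Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
        kept = frames[::every_n]
        for frame in kept:
            shutil.copy2(frame, dest / frame.name)
        return kept

Whether n should be 10, 100 or something chosen per data set is, of course, exactly the question being asked here.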

Re: Prioritizing raw images for deposition

Posted: Wed Oct 19, 2011 8:50 am
by SimonColes
I would comment that this is a reasonable list in a sensible order and have a few thoughts (in no particular order).

I would add to (6) complete data sets that exhibit some interesting feature that can't currently be [easily] tackled with existing software (diffuse scattering, for example?).

(1) is vitally important! We do this routinely for all data that we collect (much of which is handed over to other crystallographers to work up) - most software can produce simulated precession 'photos' from the raw image data. We find that, as a matter of course, 0-layer photos are really useful for an 'at a glance' assessment of the data quality, but these days we include the 1-layer photos too, as they are very helpful for quickly getting some idea of the nature of twinning, systematic absences, incommensuration etc. I would say the 1-layer photos are a nice-to-have, but the 0-layer ones are essential - this means between 3 and 6 JPG files should be the minimum level of raw data supplied (they can take a little while to generate, and therefore we script this part of our processing workflow).
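
As a rough illustration of how that scripting step might look (the command name make_precession_image is only a placeholder for whatever your processing software actually provides, and the zone list and output naming are my assumptions):

    import pathlib
    import subprocess

    # Three 0-layer zones plus three 1-layer zones, i.e. the 3-6 JPG files
    # mentioned above.
    ZONES = ["hk0", "h0l", "0kl", "hk1", "h1l", "1kl"]

    def generate_zone_photos(dataset_dir, out_dir):
        out = pathlib.Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for zone in ZONES:
            # 'make_precession_image' stands in for the local software's
            # simulated-precession-photo generator.
            subprocess.run(
                ["make_precession_image", "--zone", zone,
                 "--output", str(out / f"{zone}.jpg"), str(dataset_dir)],
                check=True,
            )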

Software developers will produce better products if they can test more broadly, and certain data sets would be very useful for training - it would be helpful to include 'labels' in this scheme that would indicate suitability for this type of usage.

Triage is vitally important - especially when scaling up from a single lab to a global scale! However, this is often very difficult to do at the time of creation, as the worth of a data set might increase or decrease with time. Ideally one would include processes of redaction and periodic appraisal; however, this is very time consuming...

Given these comments, I would argue that outside of an individual laboratory there is no sense in keeping data under point (7).