Prioritizing raw images for deposition
Posted: Tue Oct 18, 2011 10:27 am
In trying to determine data storage policies and link them to real-world resource availability, it is important to establish priorities. I would like to introduce the idea of data "triage". Triage is the process whereby casualties in battle or patients attending an emergency room are assessed for the severity of their injuries, and hence for the priority of the resources needed to handle them. In a similar way, data "triage" would be a process of labelling data sets (ideally as close as possible to the time of collection) to characterise their importance for particular purposes, and consequently the amount of resources that should be diverted to archiving them. Such labelling should be embedded into the data sets, or closely associated with them, in a way that allows for automated handling. Then the most important data could be extracted and curated or archived, while less important data could be left in offline stores or, when pressure on resources becomes too great, deleted.
Here is one suggested hierarchy of "importance" - assembled by someone with no practical experience in the field. I'd be interested in whether the community could reshape this with more relevance to real scientific benefits.
1. A representative selection of images associated with a published structure
2. Complete data sets associated with a published structure
3. Complete data sets collected for structures that are solved but not published or deposited in the PDB
4. Complete data sets for structures that are not solved (essentially for practical reasons - funds run out, crystal is twinned, etc.)
5. Complete data sets for structures that cannot be solved
6. Incomplete data sets that might contribute something towards the solution of a structure or some interesting crystal physics (incomplete e.g. because of instrumental breakdown or crystal decay)
7. Incomplete or unreliable data sets with no likely contribution towards scientific discovery (e.g. incorrect or faulty instrument operation, operator error)
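To make the "automated handling" idea a little more concrete, here is a rough Python sketch of how a label from the list above might drive a storage decision. The names (TriageLevel, storage_action) and the cut-off values are purely illustrative assumptions of mine, not a proposed policy; the only thing taken from the text is the three outcomes (archive, leave offline, delete under pressure).

```python
from enum import IntEnum


class TriageLevel(IntEnum):
    """Levels 1-7 from the list above; a lower number means higher priority."""
    REPRESENTATIVE_IMAGES_PUBLISHED = 1
    COMPLETE_PUBLISHED = 2
    COMPLETE_SOLVED_UNPUBLISHED = 3
    COMPLETE_UNSOLVED_PRACTICAL = 4
    COMPLETE_UNSOLVABLE = 5
    INCOMPLETE_POSSIBLY_USEFUL = 6
    INCOMPLETE_UNRELIABLE = 7


def storage_action(level, archive_cutoff=2, delete_from=7, under_pressure=False):
    """Map a triage label to 'archive', 'offline' or 'delete'.

    archive_cutoff and delete_from are hypothetical policy knobs chosen
    only for illustration; a real archive would set its own thresholds.
    """
    level = TriageLevel(level)  # raises ValueError for labels outside 1-7
    if level <= archive_cutoff:
        return "archive"
    # Deletion is only considered when storage is under pressure.
    if under_pressure and level >= delete_from:
        return "delete"
    return "offline"
```

With these assumed defaults, storage_action(1) gives "archive", storage_action(7) gives "offline", and storage_action(7, under_pressure=True) gives "delete". Where the cut-offs should really sit is exactly the question I am asking the community.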
One might reasonably characterise level (7) as absolutely useless; retaining such data sets could actually be harmful by needlessly filling up scarce storage capacity. However, if resources were truly infinite, even retaining such data sets might be useful in analysing, say, psychological patterns of poor experimentation (or is that too much like Big Brother?).
On the other hand, one might reasonably consider level (1) the most valuable, because it always provides some assurance about the integrity of the derived structure and could be relatively compact, and hence easily stored, duplicated and disseminated. But even within level (1), what is acceptably "representative"? JPEG/TIFF images of the frames, which should at least show the symmetry of the diffraction pattern? One image to prove the experiment was carried out? Every tenth or hundredth frame?
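As an illustration of the "every tenth or hundredth frame" option only, here is a minimal Python sketch. The function name and the file names in the usage note are made up, and it says nothing about whether such a subset is scientifically adequate.

```python
def representative_frames(frame_names, step=10):
    """Return every `step`-th frame, always starting with the first one."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(frame_names)[::step]
```

For a hypothetical 360-frame data set named frame_0001.img to frame_0360.img, step=10 would keep 36 frames and step=100 would keep 4, which gives some feel for how compact a level (1) deposit could be.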