Discussion of small-angle scattering data deposition

This is a public forum that invites community input on strategies and desirable practices in providing open and long-term access to diffraction data sets.

Moderator: Brian McMahon

Andrew Allen
Posts: 1
Joined: Tue Oct 18, 2011 8:19 pm

Discussion of small-angle scattering data deposition

Post by Andrew Allen » Mon Feb 27, 2012 12:55 am

The following post is a summary of responses solicited from members of the Commission on Small-Angle Scattering regarding issues for the deposition of small-angle scattering data:

The range and diversity of small-angle scattering based measurements is now so broad as to make it difficult to specify a set of SAS data deposition requirements that would be common to all types of measurement: e.g., conventional SAXS and SANS using a 2D detector, USAXS and USANS using Bonse-Hart crystal optics, GI-SAXS and NS-SANS, SAXS-based XPCS measurements, contrast variation SANS, angular- and energy-dispersive SANS at reactors and pulsed sources, respectively, anomalous SAXS, SAS-based imaging methods, and even some aspects of neutron and X-ray reflectivity. While, in many cases, it is possible to extract 1D curves of the SAS intensity cross-section versus Q, it is usually not possible to assess the true information content of these data without significant "metadata" information, especially in regard to how the Q-resolution relates to the spacing in Q between successive data points, and how this relationship varies through the measured Q-range. Also, 1D curves do not, themselves, provide any information on what may be an anisotropic microstructure, unless many such datasets are extracted for multiple azimuthal sample orientations. This contrasts with the situation in (say) X-ray powder diffraction, where the information content and resolution limitations are frequently self-evident within the diffraction dataset itself (i.e., how broad are the peaks, inasmuch as this may be determined by the instrument resolution?).
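As a rough illustration of the point about Q-resolution and point spacing, the following Python sketch compares the spacing between successive Q points with a per-point resolution width. The function name, the choice of averaging the two end-point widths, and the numbers are all assumptions made for illustration; they are not taken from any particular instrument or data format.

```python
def spacing_to_resolution(q, dq):
    """Return, for each interval between successive points, the ratio of
    the Q spacing to the mean resolution width of the two end points.

    q  : Q values in ascending order (e.g. in 1/Angstrom)
    dq : resolution width (e.g. a standard deviation) at each Q value

    Ratios well below 1 suggest the curve is sampled more finely than the
    resolution supports; ratios well above 1 suggest the sampling is coarse
    compared with the resolution.
    """
    if len(q) != len(dq):
        raise ValueError("q and dq must have the same length")
    ratios = []
    for i in range(len(q) - 1):
        spacing = q[i + 1] - q[i]
        mean_width = 0.5 * (dq[i] + dq[i + 1])
        ratios.append(spacing / mean_width)
    return ratios


# Hypothetical values, for illustration only.
q = [0.010, 0.012, 0.016, 0.024]
dq = [0.004, 0.004, 0.004, 0.004]
ratios = spacing_to_resolution(q, dq)
```

Without the dq column, which is exactly the kind of "metadata" at issue, the same list of Q values gives no indication of which of these regimes applies.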

Despite this situation, significant efforts are underway to develop common data formats, as discussed below. Furthermore, there exists at least one sub-set of SAS studies where the need for well-formulated data deposition standards is both more defined and more urgent: biological solution small-angle scattering. Because of this, we have divided our comments into two sections: the general case (i.e. broadly trying to address the above), and biological solution SAS (more specific).

GENERAL CASE:

Two potential aims of SAS data deposition are (i) to ensure justification of published structures (“experimental data supporting a literature publication should be deposited”); and (ii) to build a database so that other researchers can use the data. Currently, most journals might request authors to present the scattering curve, which is the raw data. So if somebody wants to try a different interpretation of the data, they can simply scan and digitize the plot. Unfortunately, for the reasons given above, this is not necessarily all that useful, and the concept of “raw data” in SAS will certainly remain different from that in most areas of crystallography. Thus, archiving of all the scattering curves in published papers, for example, will not be meaningful without a significant inclusion of "metadata" to define the SAS measurement conditions, as well as the state of the sample. Provision of scattering data from a standard reference sample might help define the experimental measurement conditions, as well as provide better absolute intensity calibration. However, in many cases, providing sufficient information to define the experimental conditions may be more important than providing the SAS dataset itself.

Data deposition for SAS in the general case may not be as important as it is for other areas of crystallography, given the intrinsically low level of information directly available from the data, and the much greater importance in knowing about the precise sample, its preparation and history. Furthermore, the extremely wide variety of measurements and systems that SAS covers probably requires a somewhat different approach.

Nevertheless, the diversity of application discussed above does suggest the importance of standardizing SAS data formats where this is possible. Standard data formats would maximise the portability of data between different modelling approaches, comparison between different instruments, and long-term support of data. It would of course also facilitate SAS data deposition. There is currently a working group, canSAS (the Collective Action for Nomadic Small Angle Scatterers), consisting of representatives from major X-ray and neutron facilities, which has been working towards this end for several years; see http://www.smallangles.net/wgwiki/index.php/canSAS-VI
This group plans to meet next at the SAS 2012 conference in Sydney, this coming November, and a potential role for the IUCr SAS Commission has been proposed in facilitating uptake of the canSAS approach both by facilities and by commercial manufacturers of SAS equipment.

BIOLOGICAL SOLUTION SAS:

It is becoming increasingly clear that deposition of SAS data and associated sample information is advisable for experiments reporting data from biological macromolecules, where the interpretation is in terms of the structure of the individual scattering particles. This becomes especially critical when presenting a 3D structural representation of the scattering particle, determined either by ab initio methods to provide shape information, or by rigid-body modeling to produce optimized models from published high-resolution structures of domains or sub-units.

This requirement has become more important with the development in recent years of powerful analysis algorithms for this kind of modeling and optimization using solution scattering data. There is interest in the structural biology community in having SAS-derived models deposited in the Protein Data Bank (PDB), which would require such data deposition. It is sensible to consider PDB data deposition and reporting requirements for associated journal publication concurrently. Thus, for biological solution scattering, it is recommended to work towards a journal policy that includes depositing the models together with the processed data used to generate the models and the essential sample information required to ensure the suitability of the data for structural interpretation. The SAXS data are experimental and as such they may contain uncertainties, both random and systematic. Depositing the data would enable others to validate the models that have been used to interpret the data. Depositing model structures without the data is far less useful, and simple digitization of SAS curve figures will not solve this problem. Currently, papers are published that do not show the data at all, or in which only characteristic functions are displayed.

In connection with these goals, development and acceptance of a common data format would be very useful. However, previous attempts have not always been successful. For example, at the Brookhaven SAS conference in 1999, it was decided to start with a simple 1D data format, which became known as sasCIF. Although the sasCIF effort carefully followed IUCr requirements, it is still not widely used. Obviously, any format developed for solution scattering data deposition needs to be carefully correlated with IUCr journal requirements and with the PDB.

To address the above issues, structural biology small-angle scatterers have also been working to establish reporting requirements for publishing 3D models and (reliable) structural parameters of biological macromolecules in solution. A paper has been submitted to Acta Cryst. D that discusses these issues and makes a number of recommendations regarding what data are required. There will also be a session at the SAS 2012 conference on this topic to continue to develop the criteria/requirements and broaden the consensus. Meanwhile, Helen Berman has asked Jill Trewhella to chair a task force to evaluate and make recommendations for the possible submission of SAS-based models to the PDB. This task force will consider the question of whether such submissions are appropriate and, if so, what accompanying information/data would be required. This task force will meet during the SAS 2012 conference, where its initial recommendations will be presented for discussion and feedback.

Unfortunately, since it is uncommon to obtain a perfectly monodisperse sample for every protein solution scattering study, one tends to obtain slightly different SAS curves every time a measurement is repeated. This is not serious as long as the interpretation of the data is not sensitive to such subtle differences (a point that should be scrutinized by referees). In this situation, “routine deposition of scattering data” may be difficult to use widely. For such cases, it may be worth considering a data reporting system to which raw data are registered freely, but data quality control is handled through comparison of the newly reported data with an archive of selected data having a guaranteed quality.
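The comparison with archived reference data suggested above could take many forms. One minimal sketch, assuming that the new and reference curves are on the same intensity scale, are sampled at the same Q points, and both carry per-point uncertainties, is a reduced chi-square between the two curves. The function name and the numbers are hypothetical, and this is not a proposal for an agreed acceptance criterion.

```python
def reduced_chi_square(i_new, sig_new, i_ref, sig_ref):
    """Mean squared difference between two curves, with each point weighted
    by the combined variance of the new and reference measurements.

    Values near 1 indicate that the curves differ by about as much as their
    stated uncertainties allow; much larger values indicate a real difference.
    """
    n = len(i_new)
    if n == 0 or not (len(sig_new) == len(i_ref) == len(sig_ref) == n):
        raise ValueError("all inputs must be non-empty and of equal length")
    total = 0.0
    for a, sa, b, sb in zip(i_new, sig_new, i_ref, sig_ref):
        variance = sa * sa + sb * sb  # uncertainties assumed independent
        total += (a - b) ** 2 / variance
    return total / n


# Hypothetical intensities and uncertainties, for illustration only.
chi2_same = reduced_chi_square([10.0, 5.0], [0.6, 0.6], [10.0, 5.0], [0.8, 0.8])
chi2_diff = reduced_chi_square([10.0, 5.0], [0.6, 0.6], [11.0, 4.0], [0.8, 0.8])
```

A check of this kind only has meaning if the deposited curves include uncertainties and a stated intensity scale, which again points back to the metadata requirements.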

Finally, the importance of self-validation tools and criteria to control the consistency of any data and models deposited should be emphasized. Such tools and criteria will need a more prominent place in any rigorous discussion of SAS data than in other parts of crystallography, where CIF-based software tools already exist for automatically checking the validity of results from, e.g., powder diffraction.

Brian McMahon
Site Admin
Posts: 116
Joined: Fri May 13, 2011 12:34 pm

Re: Discussion of small-angle scattering data deposition

Post by Brian McMahon » Tue Feb 28, 2012 11:31 am

Thanks, Andrew: that's a very helpful survey of the issues faced within that community, and I guess it's a model for many other fields where there are emerging or varied experimental techniques.

One thing that seems clear to me is that 'metadata' is a many-layered concept. One needs metadata that describe the associated publication or other research outcomes. One needs metadata that characterise the sample material; and metadata that characterise the particular class of instrument (or specific instrument) involved. And one needs metadata that relate to the model to which one is trying to fit the interpretation of the recorded data values.
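To make the layering concrete, here is a minimal Python sketch of a deposition record that keeps the four layers separate and reports which ones are empty. The class name, layer names and example entries are hypothetical; they are not data names from the core CIF dictionary, sasCIF, NeXus or canSAS.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DepositionRecord:
    """Hypothetical container; layer and field names are illustrative only."""
    publication: Dict[str, str] = field(default_factory=dict)
    sample: Dict[str, str] = field(default_factory=dict)
    instrument: Dict[str, str] = field(default_factory=dict)
    model: Dict[str, str] = field(default_factory=dict)

    def missing_layers(self) -> List[str]:
        """List the metadata layers that have no entries at all."""
        layers = {
            "publication": self.publication,
            "sample": self.sample,
            "instrument": self.instrument,
            "model": self.model,
        }
        return [name for name, content in layers.items() if not content]


# A record with only sample and instrument information filled in.
record = DepositionRecord(
    sample={"description": "protein in buffer"},
    instrument={"type": "pinhole SAXS"},
)
missing = record.missing_layers()
```

Keeping the layers separate in this way would, in principle, let each one be filled from a different existing dictionary.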

Would it be useful for this group to identify these different layers of metadata and to provide pointers to existing standards? For example, the core CIF dictionary provides data names that are appropriate for describing bibliographic metadata. NeXus (as far as I understand) provides descriptions of a variety of imaging and recording instruments. How easy (or otherwise) is it for developers of data standards (e.g. sasCIF that you mention) to concentrate on characterising the description of the 'raw' data and just 'pull in' the higher-level metadata concepts from existing dictionaries? Of course there are practical differences associated with the fact that some standards use one syntax (e.g. XML), some another (e.g. CIF); but these are relatively straightforward problems to solve compared with the intellectual effort required for the precise definition of terms. I'd be interested if contributors to this group (from any discipline) wish to comment on that aspect of standards interoperability.
