4 A Note on the Importance of Understanding the Provenance of Geomagnetic Data

Before continuing, an important point needs to be made about correcting and homogenising historic data. Corrections based on inadequate understanding (or, worse still, on postulated theories) can do much more harm than good if they are allowed to modify a dataset without clear metadata, and the means to reverse the changes at any stage in the future, being retained and made readily available. The full provenance of any one dataset is easily lost, and without it such a change could be a massively retrograde step. I therefore strongly recommend that historic datasets that are re-processed be re-named, so that they can be recognised for what they are, and that the original dataset always be retained. Hence, although at present it is reasonable, for example, to regard aaC as a corrected form of aa, scientists can readily return to the original aa data should something in the revised inter-calibrations prove to be invalid or inadequate in future. For this reason, Lockwood et al. (2006b) treated aaC as a different index to aa and gave it a new name.

A good example of the sort of problems that can arise is provided by the hourly mean H data from the Eskdalemuir station. This observatory has operated continuously since 1911, when it was established by Kew observatory on a rural and exceptionally clean magnetic site after the Kew site was rendered too noisy by the introduction of trams into west London (Harrison, 2004). There was a discontinuity in 1932 in the commonly-used set of hourly mean data from this station, which remained unnoticed until 2004, when Mursula et al. (2004) and Clilverd et al. (2005) analysed the inter-hour variability of Eskdalemuir data and found very small values in the early part of the 20th century. Detective work by Leif Svalgaard established that, prior to 1932, the data stored in the World Data Centre (WDC) system were 2-hour running means of the data recorded in the observatory yearbook. Such smoothing greatly influences inter-hour indices. MacMillan and Clarke (2011) have confirmed that this was indeed the case and have digitised the data from the yearbook, so that all data from Eskdalemuir now available from WDC-C1 are hourly means with no running-mean smoothing applied. (Users should check which dataset they are using, because one problem with data that have been corrupted or massaged is that they are very hard to expunge from all datasets, and bad data tend to resurface.) It is not known how, when, where, or why this post-processing was carried out, because the available metadata did not give the full provenance of the data. Presumably somebody, somewhere, believed that the noise suppression obtained by implementing a running mean was a good thing. If one used daily means of the (supposed) hourly data there would have been some effect (as an hour of data from both the day before and the day after would be averaged in with half weight), but it would be small, and the effect on annual means would be negligible.
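The contrast between the effect on inter-hour indices and the effect on daily means can be illustrated with a minimal sketch on synthetic data. It assumes that the 2-hour running mean was a centred window, which gives weights of 1/4, 1/2 and 1/4 to three consecutive hourly means; the actual kernel used is not documented. The function names, the synthetic series and the use of the mean absolute hour-to-hour difference as a stand-in for an inter-hour variability index are all illustrative assumptions, not the published index definitions.

```python
import math
import random


def smooth_2h(hourly):
    """Centred 2-hour running mean (assumed weights 1/4, 1/2, 1/4).

    The first and last values are padded by repeating the end points.
    """
    n = len(hourly)
    smoothed = []
    for i in range(n):
        before = hourly[max(i - 1, 0)]
        after = hourly[min(i + 1, n - 1)]
        smoothed.append(0.25 * before + 0.5 * hourly[i] + 0.25 * after)
    return smoothed


def inter_hour_variability(hourly):
    """Mean absolute difference between consecutive hourly values."""
    steps = [abs(b - a) for a, b in zip(hourly, hourly[1:])]
    return sum(steps) / len(steps)


def daily_means(hourly):
    """Means of successive 24-hour blocks (length must be a multiple of 24)."""
    return [sum(hourly[i:i + 24]) / 24 for i in range(0, len(hourly), 24)]


# Synthetic "hourly mean H" series in nT: baseline + diurnal wave + noise.
# All numbers here are arbitrary illustrative choices.
random.seed(0)
raw = [
    17000.0
    + 10.0 * math.sin(2 * math.pi * (h % 24) / 24)
    + random.gauss(0.0, 3.0)
    for h in range(24 * 30)
]
smoothed = smooth_2h(raw)

ihv_raw = inter_hour_variability(raw)
ihv_smoothed = inter_hour_variability(smoothed)
max_daily_shift = max(
    abs(a - b) for a, b in zip(daily_means(raw), daily_means(smoothed))
)

print(f"inter-hour variability, raw:      {ihv_raw:.2f} nT")
print(f"inter-hour variability, smoothed: {ihv_smoothed:.2f} nT")
print(f"largest change in a daily mean:   {max_daily_shift:.3f} nT")
```

In this synthetic case the smoothing should reduce the hour-to-hour variability substantially while shifting each daily mean only slightly, because only the hours on either side of the day boundary enter the daily mean, each with a small weight.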
It is fair to assume that whoever implemented the smoothing never envisaged the use of the data to generate an inter-hour variability index. This example illustrates very graphically the great importance of knowing, as far as is possible, the true provenance of historic data and of all the corrections and changes that may subsequently have been applied to them. Lockwood et al. (2013a) have revealed a similar issue with data from Ekaterinburg by implementing an inter-correlation of hourly mean H data from a given station at different UTs as a check of data consistency: they found very high correlations around 1900, revealing that interpolation to hourly values from sparser data had taken place.
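The idea behind such a consistency check can be sketched as follows. This is not the procedure of Lockwood et al. (2013a), whose details are not given here, but a simplified illustration under stated assumptions: the series of values at one UT, taken day by day, is correlated with the series at the next UT, and hourly values that were interpolated from a few readings per day yield correlations close to unity. The function names, the sampling hours (0, 8, 16 and 23 UT), the linear interpolation and the synthetic noise levels are all hypothetical choices.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def adjacent_ut_correlations(days):
    """Correlate the day-by-day series at UT h with that at UT h + 1.

    `days` is a list of 24-element lists; 23 coefficients are returned.
    """
    by_ut = list(zip(*days))  # one tuple of daily values per UT hour
    return [pearson(by_ut[h], by_ut[h + 1]) for h in range(23)]


def interpolate_from_sparse(day, sample_hours=(0, 8, 16, 23)):
    """Rebuild 24 hourly values by linear interpolation between a few
    sampled hours (hypothetical stand-in for sparse original readings)."""
    rebuilt = []
    for h in range(24):
        for lo, hi in zip(sample_hours, sample_hours[1:]):
            if lo <= h <= hi:
                w = (h - lo) / (hi - lo)
                rebuilt.append((1 - w) * day[lo] + w * day[hi])
                break
    return rebuilt


# Synthetic data: a level shared by all hours of a day plus
# independent hourly noise (arbitrary units and amplitudes).
random.seed(1)
genuine = []
for _ in range(300):
    level = random.gauss(0.0, 10.0)
    genuine.append([level + random.gauss(0.0, 5.0) for _ in range(24)])
interpolated = [interpolate_from_sparse(day) for day in genuine]

mean_r_genuine = sum(adjacent_ut_correlations(genuine)) / 23
mean_r_interpolated = sum(adjacent_ut_correlations(interpolated)) / 23

print(f"mean adjacent-UT correlation, genuine hourly data: {mean_r_genuine:.3f}")
print(f"mean adjacent-UT correlation, interpolated data:   {mean_r_interpolated:.3f}")
```

Genuine hourly means retain independent hour-to-hour fluctuations, so their adjacent-UT correlations stay clearly below unity, whereas interpolated values are linear combinations of the same few readings and are almost perfectly correlated. An interval of anomalously high correlation in a real record is therefore a prompt to investigate its provenance, not proof of interpolation by itself.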

This is a vitally important concern for reconstruction work: being overly ready to accept an adjustment is highly irresponsible, as it could deny future generations of scientists the opportunity to properly exploit the data or, in a worst-case scenario, seriously mislead them (Council of AGU, 2009; Vogel, 1998).

