## 8 Regression Techniques

Because the reconstructions all rely on finding relationships between modern geomagnetic activity data and simultaneous measurements of near-Earth space, regression techniques are needed to enable extrapolation to before the space age. The discussion in the literature between Svalgaard and Cliver (2005), Lockwood et al. (2006a), and Svalgaard and Cliver (2006) highlights the many pitfalls in this area. This discussion concerned the use of a geomagnetic index to reconstruct the IMF. Svalgaard and Cliver (2005) (SC05) used the index to conclude that the IMF increased by only 25% between the 1900s and the 1950s, and that this was in contrast to the more than doubling of the IMF which they argued was inherent in the results of Lockwood et al. (1999a). Lockwood et al. (2006a) (LEA06) pointed out that some of this difference arose because Lockwood et al. (1999a) actually reported a doubling in the open solar flux, not in the IMF (as shown later by Figure 29, the two are not proportional). However, there were several other factors, all of which worked in the same direction and so combined to make the estimate of the drift by SC05 exceptionally low.

SC05 employed a simple ordinary linear least squares (OLS) regression. Its residuals showed heteroscedasticity, some nonlinearity, and a systematic bias, and did not have a Gaussian distribution, thereby violating central assumptions of least-squares regression and showing that the derived fit is unreliable. The regression results of SC05 were also strongly influenced by outliers, which applied great leverage to the fit. More reliable regressions were obtained by LEA06 using least median squares (LMS) regression and, better still, using Bayesian statistics (the BLS procedure employed by Rouillard et al., 2007, REA07). In addition, SC05 attempted to fill the gaps in the IMF data using a 27-day recurrence technique, despite the relatively low autocorrelation of the IMF at 27-day lags, and LEA06 showed that this also caused a slight underestimation of the long-term drift (piecewise removal of the data during IMF data gaps is much more reliable). Lastly, SC05 underestimated the long-term change in their own results.

The initial response by Svalgaard and Cliver (2006) did not accept these arguments but, as shown in the following sections, a subsequent reconstruction by Svalgaard and Cliver (2010) is in very good agreement with the Lockwood et al. (1999a) reconstruction. In fact, the change between the Svalgaard and Cliver (2010) and SC05 reconstructions of the IMF was almost exactly what was called for by the residuals analysis of LEA06. This change was caused by the availability of just four additional annual mean datapoints near the long and low minimum between cycles 23 and 24, for which the IMF was low. The fact that such a change was needed in response to the addition of just a few more datapoints confirms that the original SC05 fit was not robust.

Because the potential pitfalls in regression techniques can have such a major effect on the reconstructions, it is worth exploring the relative merits of the various linear regression procedures used in this context. Figure 22 stresses how much they can differ, showing the scatter plot and the various regressions between annual means of the index and the IMF. SC05 used OLS, but the slope they derived is slightly lower than that from LEA06’s implementation of OLS because of their different treatment of data gaps. OLS gives the lowest slope, whereas BLS gives the largest.

The details of the regression procedures (with appropriate references for the statistical techniques) and discussion of their relative merits and pitfalls are given in the paper by LEA06. The advantage of the LMS procedure is that it is less influenced by outliers, which can change the slope of the fit dramatically if they have a high value of the Cook-D leverage factor. The MAA (Major Axis Analysis) procedure is inappropriate in the context of these reconstructions, and the BLS procedure is that employed by Rouillard et al. (2007) (REA07). The tests described below show that BLS performed best.
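The sensitivity of OLS to a single high-leverage outlier, and the relative robustness of LMS, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic (not geomagnetic observations), the function names `ols_fit` and `lms_fit` are hypothetical, and the LMS search over lines through pairs of points is a common approximation rather than the implementation used by LEA06.

```python
from itertools import combinations
from statistics import mean, median


def ols_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ym - slope * xm


def lms_fit(x, y):
    """Least median of squares, searched over lines through pairs of points.

    Restricting candidates to lines through two data points is an
    approximation, not an exact minimisation.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, no finite slope
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = median([(yi - slope * xi - intercept) ** 2 for xi, yi in zip(x, y)])
        if best is None or med < best[0]:
            best = (med, slope, intercept)
    return best[1], best[2]


# Synthetic data (illustrative only): y = 2x + 1 with small scatter,
# plus one high-leverage outlier at the end of the x range.
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0]
x = [float(i) for i in range(1, 11)] + [15.0]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)] + [10.0]

ols_slope, _ = ols_fit(x, y)
lms_slope, _ = lms_fit(x, y)
print(f"OLS slope: {ols_slope:.2f}, LMS slope: {lms_slope:.2f}")
```

In this synthetic case the single outlier drags the OLS slope well below the underlying value of 2, whereas the LMS slope stays close to it, because the median of the squared residuals ignores the largest deviations.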

Notably, the OLS procedure used by SC05 gives the lowest slope in Figure 22, and so would give the lowest long-term drift in the reconstruction of the IMF. There are a number of ways of evaluating the quality of a regression fit. One of the most important is to check that the fit residuals are randomly and normally distributed: the fit of the IMF against the index used by SC05 is analysed in Figures 23, 24, and 25.

Fits should be homoscedastic, i.e., the residuals should not show a trend in their spread. In their reply, Svalgaard and Cliver (2006) quite rightly state that this should be tested by plotting fit residuals against the fitted values. Figure 23 shows this residual plot, which they claim shows the fit is homoscedastic because the mean of the residuals does not change with the fitted value. However, homoscedasticity requires that the spread of the fit residuals (not their mean value) does not change with the fitted value, and Figure 23 shows that the spread does increase with increasing fitted values. Hence the fit is heteroscedastic rather than homoscedastic. In addition, the plot shows a marked tendency towards an inverted-U form, which is characteristic of some nonlinearity.
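The distinction between the mean and the spread of the residuals can be made concrete with a short sketch. It assumes synthetic data whose scatter grows with the fitted value, and the helper name `residual_stats_by_fitted` is hypothetical; it is not the test procedure of LEA06, only an illustration of the definition.

```python
from statistics import mean, pstdev


def ols_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ym - slope * xm


def residual_stats_by_fitted(x, y):
    """Return [(mean, spread)] of residuals for the lower and upper halves
    of the data, ordered by fitted value."""
    slope, intercept = ols_fit(x, y)
    fitted = [slope * xi + intercept for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    ordered = [r for _, r in sorted(zip(fitted, resid))]
    half = len(ordered) // 2
    halves = (ordered[:half], ordered[half:])
    return [(mean(h), pstdev(h)) for h in halves]


# Synthetic data (illustrative only): scatter alternates in sign and its
# amplitude is proportional to x, so the spread grows with the fitted value.
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + 1.0 + (-1) ** i * 0.1 * xi for i, xi in enumerate(x)]

(mean_lo, spread_lo), (mean_hi, spread_hi) = residual_stats_by_fitted(x, y)
print(f"low fitted values:  mean {mean_lo:+.3f}, spread {spread_lo:.3f}")
print(f"high fitted values: mean {mean_hi:+.3f}, spread {spread_hi:.3f}")
```

In this example the mean residual stays close to zero in both halves while the spread more than doubles, so a test based on the mean alone would wrongly pass a heteroscedastic fit.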

A second test used by LEA06 was quantile-quantile (QQ) plots to test if the residuals were normally distributed, as is assumed by all least squares regressions. The standardized residuals are placed in order by size and plotted against the quantiles for a standard normal distribution. The deviations from the straight line of slope 1 reveal departures from a normal distribution. Figure 24 shows that the OLS fit gave considerably larger deviations from a Gaussian distribution of residuals than did the BLS fit and so the BLS method is giving the more valid least-squares regression.
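The construction of a QQ plot described above can be sketched as follows. This is a minimal illustration under stated assumptions: the plotting positions (i + 0.5)/n are one common convention among several, the samples are synthetic, and the function names `qq_points` and `max_qq_deviation` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Pair each ordered, standardized residual with the standard normal
    quantile at plotting position (i + 0.5) / n."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    standardized = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))


def max_qq_deviation(residuals):
    """Largest vertical distance of the QQ points from the line of slope 1."""
    return max(abs(s - t) for t, s in qq_points(residuals))


# Synthetic residuals (illustrative only).
n = 50
# Close to Gaussian: scaled normal quantiles.
gaussian_like = [3.0 * NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
# Strongly skewed: most values near zero with a long upper tail.
skewed = [(i / (n - 1)) ** 4 for i in range(n)]

dev_gaussian = max_qq_deviation(gaussian_like)
dev_skewed = max_qq_deviation(skewed)
print(f"max deviation, Gaussian-like: {dev_gaussian:.3f}")
print(f"max deviation, skewed:        {dev_skewed:.3f}")
```

The Gaussian-like sample lies close to the line of slope 1, whereas the skewed sample departs from it strongly in the tails, which is the kind of departure the QQ test is designed to reveal.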

Because they found the SC05 fit was heteroscedastic, potentially nonlinear and failed the QQ test for normally-distributed residuals, LEA06 tested for a trend in the residuals by plotting the fit residuals as a function of the observed values. The results are shown in Figure 25 for three regression procedures. The BLS regression meets the requirement that there is no trend (and LEA06 show the LMS regression does as well), but the MAA and OLS regressions fail this bias test. They underestimate the trend in the data because the fitted value is consistently an overestimate of the observed value when the latter is small and consistently an underestimate when it is large. Thus, reconstructions based on this OLS fit will self-evidently underestimate the true range of variation in the IMF.
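The bias test can be illustrated for the OLS case with a short sketch. For an OLS fit, a standard algebraic property is that the slope of the residuals (observed minus fitted) regressed against the observed values equals 1 − r², where r is the correlation coefficient, so the trend vanishes only for a perfect correlation. The data below are synthetic and the function names are hypothetical; this is not the analysis of LEA06.

```python
from statistics import mean


def ols_fit(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, ym - slope * xm


def r_squared(x, y):
    """Square of the linear correlation coefficient between x and y."""
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    syy = sum((yi - ym) ** 2 for yi in y)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def residual_trend_vs_observed(x, y):
    """Slope of the OLS residuals (observed minus fitted) against observed y."""
    slope, intercept = ols_fit(x, y)
    resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    trend, _ = ols_fit(y, resid)
    return trend


# Synthetic data (illustrative only): y = x plus fixed scatter.
scatter = [0.8, -0.6, 0.3, -0.9, 0.5, 0.7, -0.4, -0.8, 0.9, -0.3, 0.6, -0.5]
x = [float(i) for i in range(1, 13)]
y = [xi + e for xi, e in zip(x, scatter)]

trend = residual_trend_vs_observed(x, y)
r2 = r_squared(x, y)
print(f"residual trend: {trend:.4f}, 1 - r^2: {1.0 - r2:.4f}")
```

A positive trend means the fit overestimates small observed values and underestimates large ones, so the fitted values span a smaller range than the observations.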

The better the correlation between two parameters, the more similar will be the results of the various regression procedures and, hence, the less important these tests become. However, given that small changes in the fitted slope will make very large differences to the maximum and minimum values seen in a reconstruction, it is very important that these tests are carried out to ensure that the optimum regression procedure has been used and that any one regression fit is valid.