So far, the “Mike’s Nature trick” email aside, I’ve focused on the alleged resistance to releasing data and the loss of raw data by the CRU, and on how both undermine the ability of other scientists to scrutinise and/or reproduce the CRU’s work. However, many allege that the leaked/stolen code is itself evidence of scientific fraud.
Does the code really demonstrate fraud? In considering this question, one must distinguish between legitimate transformations of the data, done, for example, to account for the varying reliability of different data sources, and manipulations designed to produce a predetermined outcome.
Before I delve into the code and comments (in forthcoming posts), it is worth considering how one might go about deriving a best guess for how the global average temperature has varied over the millennial timescales that climate scientists often discuss in their work. The nature of the task makes it inevitable that complex transformations of the raw data will be required to produce the graph, possibly even including the odd fudge factor. An honest attempt at this task is therefore liable to involve a lot of complex processing before the raw data is turned into reliable estimates of the temperatures at different points in the timescale concerned.
To illustrate why, consider the apparently simple question of what the global average temperature was in 2008. Where and when exactly were the measurements made? On the ground? 5 ft up? 1,000 km up? Next to a power plant? In the middle of a forest? At midday? In summer? Measurements may be made in all sorts of places in all sorts of ways, and they tend not to be distributed uniformly but to cluster in easily accessible locations. Combining them to produce a reliable global average is thus not straightforward. And that’s for a 21st-century date, where we can (and do) make reliable temperature measurements in large numbers of places across the globe. Go back in time and the available measurements become fewer and less reliable, until eventually there are no direct measurements left at all.
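To make the clustering problem concrete, here is a minimal sketch of one common remedy: bin stations into latitude/longitude grid cells, average within each cell, then combine cells with area (cosine-of-latitude) weights so a dense cluster of stations counts as one cell rather than many votes. The function name, grid size and weighting scheme here are my own illustrative assumptions, not the CRU’s actual method.

```python
import math

def gridded_mean(stations, cell_deg=5.0):
    """Average station readings into lat/lon grid cells, then combine
    the cell averages with cosine-latitude weights (cells shrink towards
    the poles). Illustrative only, not any real product's algorithm."""
    cells = {}  # (row, col) -> list of temperatures in that cell
    for lat, lon, temp in stations:
        key = (int((lat + 90) // cell_deg), int((lon + 180) // cell_deg))
        cells.setdefault(key, []).append(temp)

    num = den = 0.0
    for (row, _col), temps in cells.items():
        cell_lat = (row + 0.5) * cell_deg - 90       # cell-centre latitude
        w = math.cos(math.radians(cell_lat))         # area weight
        num += w * (sum(temps) / len(temps))         # one vote per cell
        den += w
    return num / den

# Ten stations crowded into one cell near London, one lone station elsewhere:
readings = [(51.5, 0.1 + 0.1 * i, 15.0) for i in range(10)] + [(-30.0, 140.0, 25.0)]
# The naive mean of all readings is ~15.9; the gridded mean is ~20.9,
# because the cluster no longer dominates.
```

Even this toy version shows why the processing can’t just be “add up the numbers and divide”.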
This is why climatologists rely on proxy records, such as tree-ring data and ice cores, that indirectly indicate what the temperature might have been. These proxies are, of course, less reliable and very patchy. Tree-ring growth, for example, may correlate with temperature, but it is also affected by pests, rainfall, sun exposure, soil quality and many other factors that can retard or boost growth independently of temperature. And tree-ring records exist only where trees grow, to boot!
However, except for recent centuries, such indirect records are all we have to go on.
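As a toy illustration of how an indirect record gets turned into temperature estimates, a proxy can be calibrated against the years where it overlaps the instrumental record, and the fitted relationship then applied to earlier, proxy-only years. The simple linear model and all the names below are my assumptions; real reconstructions are far more sophisticated.

```python
def calibrate(proxy_vals, measured_temps):
    """Fit temperature ~ a * proxy + b by ordinary least squares over
    the period where the proxy overlaps the thermometer record, and
    return a function converting proxy values to temperature estimates.
    A deliberately naive sketch, not any published method."""
    n = len(proxy_vals)
    mx = sum(proxy_vals) / n
    my = sum(measured_temps) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(proxy_vals, measured_temps))
    var = sum((x - mx) ** 2 for x in proxy_vals)
    a = cov / var
    b = my - a * mx
    return lambda x: a * x + b

# Calibrate on (made-up) years that have both ring widths and thermometers...
to_temp = calibrate([1.0, 1.2, 1.4, 1.6], [14.0, 14.5, 15.0, 15.5])
# ...then apply the fit to a pre-instrumental ring width:
estimate = to_temp(0.8)  # an extrapolation, with all the risks that implies
```

Note that this already bakes in an assumption, namely that the proxy–temperature relationship in the calibration period also held in the past, which is exactly the kind of judgement call the rest of this post is about.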
Somehow these various, at times contradictory strands of evidence have to be woven together to get a best guess of what the temperatures were. This may mean favouring certain records for certain periods, where we have reason to believe they’re reliable, whilst favouring them less at other times or places, where we have reason to believe they may have been overly influenced by non-temperature variables.
It may also mean that some very complex, hard-to-follow processing of the data is required. It certainly means that one must be careful about how one selects, weights and combines data from the different sources, and it will almost inevitably lead to fudge factors being used.
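The kind of time-varying weighting described above might look something like the following sketch, where each source’s weight is a function of the year, so a record counts for less in periods where it is believed less reliable. The data layout, names and cut-off year are invented purely for illustration.

```python
def blended_estimate(records, weights):
    """Combine several temperature series year by year using per-source
    weights that can vary over time. `records` maps a source name to a
    {year: temperature} dict; `weights` maps a source name to a function
    of the year. Purely illustrative."""
    years = sorted({y for series in records.values() for y in series})
    blended = {}
    for y in years:
        num = den = 0.0
        for name, series in records.items():
            if y in series:
                w = weights[name](y)
                num += w * series[y]
                den += w
        if den > 0:
            blended[y] = num / den
    return blended

records = {
    "tree_rings": {1400: 10.0, 1600: 10.0},
    "ice_cores":  {1400: 12.0, 1600: 12.0},
}
weights = {
    # e.g. distrust tree rings before 1500 (an invented cut-off)
    "tree_rings": lambda y: 0.0 if y < 1500 else 1.0,
    "ice_cores":  lambda y: 1.0,
}
estimate = blended_estimate(records, weights)
```

The point is that a down-weighting like the `tree_rings` one here is neither inherently honest nor inherently dishonest; everything turns on why the cut-off was chosen, which is precisely what the code alone may not tell you.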
Thus, when considering the manipulations the programs perform on the data, one must consider why they’ve been put there as well as what their effects are. There might be a valid reason for them. If, for example, you find the contribution of some tree-ring data being reduced for a particular period, it may be that the author knew those records were less reliable for that period because non-temperature variables played a stronger role then than at other times. Or it may be that the author was trying to obtain a particular temperature for that period. The latter would be unscientific; the former need not be at all.
The problem is that distinguishing between the two may require access to documentation that’s not present in the leaked code or even to the author him/herself. Because the code and emails have been leaked/stolen, we may be missing important context that would make it clear precisely why a particular manipulation has been used. Thus it seems to me that only the most blatant attempts to achieve a particular result can be taken at face value.