Thursday, May 16, 2013
Meeting announcement: Characterising surface temperatures in data-sparse and extreme regions (with a focus on high-latitude domains)
Monday, May 13, 2013
Call for regional inhomogeneity info
Ideally we'd like to know:
WHEN - specific date or month or year or even decade etc.
WHERE - a region, a country, an international GTS/WMO change etc.
WHAT - a change in shelter, thermometer type, automation, observing time/practice etc.
HOW - are there any estimates of the size/direction/nature of the effect of this change?
Please post here and encourage others to do so. We then hope to reward you with some realistic error-worlds to play with.
Kate (and the Benchmarking working group)
Thursday, April 18, 2013
Initiative posters at 2013 EGU
One thing that was requested by some was help in getting funding. Sadly, we don't have funding for anything directly, but we are more than happy to write letters of support for any work that furthers the aims of the initiative to funding bodies.
Friday, March 22, 2013
Initiative progress report published
Monday, March 18, 2013
Databank Release: Beta #3
The beta3 release can be found here: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. Within that directory one can find all the data and code used, along with some graphics depicting the results of all the merge variants.
In addition, the previous betas are still available to look at, if anyone wishes to run comparisons
- beta 1: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta1/
- beta 2: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta2/
- A blacklist of candidate stations was generated to either fix known errors with its metadata/data, or withhold the station completely. This is a required input file for the code to run and is provided with this beta release
- Some minor code changes were applied, including withholding stations when the metadata probability was near perfect, but the data comparisons were so poor the station became unique (when it should have merged). In addition, odd characters were removed from the station name before the Jaccard Index was run.
- The format of stage 3 data was changed so that it was consistent with all stage 2 data. In addition, all data provenance flags have been ported over in order to be open and transparent
- Algorithm output is included with each variant result, in order to provide information about each candidate station and how the algorithm made its decision to merge it, add it as unique, or withhold it. A future post will go into great detail about each output file.
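The name cleaning and Jaccard Index comparison mentioned in the list above could look roughly like the sketch below. This is not the databank code: the cleaning rule (keep only letters, digits and spaces) and the use of character bigrams as the compared sets are assumptions, and the function names are made up for illustration.

```python
import re


def clean_name(name: str) -> str:
    """Upper-case a station name and drop anything but A-Z, 0-9 and spaces (assumed rule)."""
    kept = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return " ".join(kept.split())  # collapse repeated whitespace


def bigrams(text: str) -> set:
    """Set of adjacent character pairs (an assumed tokenisation)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_index(name_a: str, name_b: str) -> float:
    """Size of the intersection divided by size of the union of the two bigram sets."""
    a, b = bigrams(clean_name(name_a)), bigrams(clean_name(name_b))
    if not a and not b:
        return 1.0  # two empty names are treated as identical
    return len(a & b) / len(a | b)
```

With this sketch, "DUNEDIN*" and "Dunedin" compare as identical once the odd character is stripped, while "DUNEDIN AERO" and "DUNEDIN" score somewhere between 0 and 1.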
If you wish to provide comments, please feel free to send an e-mail to general.enquiries@surfacetemperatures.org
Thursday, February 7, 2013
A database with daily climate data for more reliable studies of changes in extreme weather
In summary:
- We want to build a global database of parallel measurements: observations of the same climatic parameter made independently at the same site
- This will help research in many fields
- Studies of how inhomogeneities affect the behaviour of daily data (variability and extreme weather)
- Improvement of daily homogenisation algorithms
- Improvement of robust daily climate data for analysis
- Please help us to develop such a dataset
Introduction
We intend to build a database with parallel measurements to study non-climatic changes in the climate record. This is especially important for studies on weather extremes where the distribution of the daily data employed must not be affected by non-climatic changes.
There are many parallel measurements from numerous previous studies analysing the influence of different measurement set-ups on average quantities, especially average annual and monthly temperature. Increasingly, changes in the distribution of daily and sub-daily values are also being investigated (Auchmann and Brönnimann, 2012; Brandsma and Van der Meulen, 2008; Böhm et al., 2010; Brunet et al., 2010; Perry et al., 2006; Trewin, 2012; Van der Meulen and Brandsma, 2008). However, the number of such studies is still limited, while the number of questions that can and need to be answered is much larger for daily data.
Unfortunately, the current common practice is not to share parallel measurements and the analyses have thus been limited to smaller national or regional datasets, in most cases simply to a single station with multiple measurement set-ups. Consequently there is a pressing need for a large global database of parallel measurements on a daily or sub-daily scale.
Also datasets from pairs of nearby stations, while officially not parallel measurements, are interesting to study the influence of relocations. Especially, typical types of relocations, such as the relocation of weather stations from urban areas to airports, could be studied this way. In addition, the influence of urbanization can be studied on pairs of nearby stations.
Daily data
Daily datasets are essential for studying the variability of and extremes in weather and climate. Looking at the physical causes of inhomogeneities, one would expect that many of the effects are amplified on days with special weather conditions and thus especially affect the tails of the distribution of the daily data. Now that the interest in extreme weather and thus in daily data has increased, more and more people are also working on the homogenization of daily data. Increasingly, developers of national and regional temperature datasets have homogenised the temperature distribution (see e.g., Nemec et al., 2012; Auer et al., 2010; Brown et al., 2010; Kuglitsch et al., 2009, 2010). Further improvements in the quantity and quality of such datasets, and a deeper understanding of remaining deficiencies, are important for climatology.
Application possibilities of parallel measurements
The most straightforward application of such a dataset would be a comparison of the magnitude of the non-climatic changes to the magnitude of the changes found in the climate record. We need to know whether the non-climatic changes are large enough to artificially hide or strengthen any trends or perturb decadal variability. In addition, such a dataset would help us to better understand the physical causes of inhomogeneities. A large and quasi-global dataset would enable us to analyse how the magnitude and nature of inhomogeneities differ depending on the geographical region and the microclimate.
The dataset would also benefit homogenisation science in multiple ways. It may reveal typical statistical characteristics of inhomogeneities that would allow for a more accurate detection and correction of breaks. The dataset would facilitate the development of physical homogenisation methods for specific types of breaks that are able to take the weather conditions into account, similar to the method developed for the transition of Wild screens to Stevenson screens for Switzerland by Auchmann and Brönnimann (2012). It would also allow for the development of generalised physical correction methods suitable for multiple climatic regions. Finally, the dataset would improve the ability to create realistic validation datasets, thus improving our estimates of the remaining uncertainties. This in turn again benefits the development of better homogenisation methods.
Organisational matters
As an incentive to contribute to the dataset, initially only contributors will be able to access the data. After joint publications, the dataset will be opened for academic research as a common resource for the climate sciences. These two stages will also enable us to find errors in the dataset before the dataset is published.
The International Surface Temperature Initiative (ISTI) and the European Climate Assessment & Dataset (ECA&D) are willing to host the dataset. This is great, because it makes the dataset more visible for contributors and users alike. We are still looking for an organisational platform that could facilitate the building of such a dataset. Any ideas for this are appreciated.
A preliminary list with parallel measurements can be found in our Wiki.
If you have any ideas or suggestions for such an initiative, if you know of further parallel datasets, or if you just want to be kept informed, please update our Wiki, comment at Variable Variability or send an email to Victor.Venema@uni-bonn.de. Furthermore, if you know someone who might be interested, please inform him or her about this initiative. Thank you.
Scientists involved in this initiative are:
- Enric Aguilar (University of Tarragona, Spain)
- Renate Auchmann (University of Bern, Switzerland)
- Ingeborg Auer (Zentralanstalt für Meteorologie und Geodynamik, Austria)
- Andreas Becker (Global Precipitation Climatology Centre, Deutscher Wetterdienst, Germany)
- Stefan Brönnimann (Institute of Geography, University of Bern, Switzerland)
- Michele Brunetti (Institute of Atmospheric Sciences and Climate of the National Research Council, Italy)
- Sorin Cheval (National Meteorological Administration, Romania)
- Peter Domonkos (University of Tarragona, Spain)
- Aryan van Engelen (Royal Netherlands Weather Service, The Netherlands)
- José Guijarro (Agencia Estatal de Meteorología, Spain)
- Franz Gunther Kuglitsch (GFZ German Research Centre for Geosciences, Germany)
- Monika Lakatos (Hungarian Meteorological Service, Hungary)
- Øyvind Nordli (Meteorologisk institutt, Norway)
- David Parker (UK MetOffice, United Kingdom)
- Mário Gonzalez Pereira (Universidade de Trás-os-Montes e Alto Douro, Portugal)
- Tamas Szentimrey (Hungarian Meteorological Service, Hungary)
- Peter Thorne (National Climatic Data Center, USA; International Surface Temperature Initiative)
- Victor Venema (University of Bonn, Germany)
- Kate Willett (UK MetOffice, United Kingdom)
Related posts
- Future research in homogenisation of climate data – EMS 2012 in Poland
- A discussion on homogenisation at a Side Meeting at EMS2012
- What is a change in extreme weather?
- Two possible definitions, one for impact studies, one for understanding.
- HUME: Homogenisation, Uncertainty Measures and Extreme weather
- Proposal for future research in homogenisation of climate network data.
- Homogenization of monthly and annual data from surface stations
- A short description of the causes of inhomogeneities in climate data (non-climatic variability) and how to remove it using the relative homogenization approach.
- New article: Benchmarking homogenization algorithms for monthly data
- Raw climate records contain changes due to non-climatic factors, such as relocations of stations or changes in instrumentation. This post introduces an article that tested how well such non-climatic factors can be removed.
References
Auchmann, R., and S. Brönnimann. A physics-based correction model for homogenizing sub-daily temperature series, J. Geophys. Res., 117, art. no. D17119, doi: 10.1029/2012JD018067, 2012.
Auer I., Nemec J., Gruber C., Chimani B., Türk K. HOM-START. Homogenisation of climate series on a daily basis, an application to the StartClim dataset. Wien: Klima- und Energiefonds, Projektbericht, 34 p., 2010.
Brandsma, T. and J.P. van der Meulen, Thermometer Screen Intercomparison in De Bilt (the Netherlands), Part II: Description and modeling of mean temperature differences and extremes. Int. J. Climatology, 28, pp. 389-400, 2008.
Brown, P. J., R. S. Bradley, and F. T. Keimig. Changes in extreme climate indices for the northeastern United states, 1870–2005, J. Clim., 23, 6555–6572, doi: 10.1175/2010JCLI3363.1, 2010.
Böhm, R., P.D. Jones, J. Hiebl, D. Frank, M. Brunetti, M. Maugeri. The early instrumental warm-bias: a solution for long central European temperature series 1760–2007. Climatic Change, 101, pp. 41–67, doi: 10.1007/s10584-009-9649-4, 2010.
Brunet, M., J. Asin, J. Sigró, M. Banón, F. García, E. Aguilar, J. Esteban Palenzuela, T.C. Peterson and P. Jones. The minimization of the screen bias from ancient Western Mediterranean air temperature records: an exploratory statistical analysis. Int. J. Climatol., doi: 10.1002/joc.2192, 2010.
Klein Tank, A.M.G., Wijngaard, J.B., Können, G.P., Böhm, R., Demarée, G., Gocheva, A., Mileta, M., Pashiardis, S., Hejkrlik, L., Kern-Hansen, C., Heino, R., Bessemoulin, P., Müller-Westermeier, G., Tzanakou, M., Szalai, S., Pálsdóttir, T., Fitzgerald, D., Rubin, S., Capaldo, M., Maugeri, M., Leitass, A., Bukantis, A., Aberfeld, R., van Engelen, A. F.V., Forland, E., Mietus, M., Coelho, F., Mares, C., Razuvaev, V., Nieplova, E., Cegnar, T., Antonio López, J., Dahlström, B., Moberg, A., Kirchhofer, W., Ceylan, A., Pachaliuk, O., Alexander, L.V. and Petrovic, P. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. Climatol., 22, pp. 1441–1453. doi: 10.1002/joc.773, 2002. Data and metadata available at http://www.ecad.eu.
Kuglitsch F.G., Toreti A., Xoplaki E., Della-Marta P.M., Luterbacher J., Wanner H. Homogenisation of daily maximum temperature series in the Mediterranean. Journal of Geophysical Research, 114, art. no. D15108, doi: 10.1029/2008JD011606, 2009.
Kuglitsch F.G., Toreti A., Xoplaki E., Della-Marta P.M., Zerefos C.S., Türkes M., Luterbacher J. Heat wave changes in the eastern Mediterranean since 1960. Geophysical Research Letters, 37, art.no. L04802, doi: 10.1029/2009GL041841, 2010.
Meulen, van der, JP, T Brandsma. Thermometer screen intercomparison in De Bilt (The Netherlands), part I: Understanding the weather-dependent temperature differences. Int. J. Climatol., 28, 371-387, 2008.
Nemec, J., Ch. Gruber, B. Chimani, I. Auer. Trends in extreme temperature indices in Austria based on a new homogenised dataset. Int. J. Climatol., doi: 10.1002/joc.3532, 2012.
Perry, M., Prior, J. and Parker, D.E., 2006: An assessment of the suitability of a plastic thermometer screen for climatic data collection. Int. J. Climatol., 27, 267-276.
Trewin, B. A daily homogenized temperature data set for Australia. Int. J. Climatol., doi: 10.1002/joc.3530, 2012.
Thorne, Peter W., Kate M. Willett, Rob J. Allan, Stephan Bojinski, John R. Christy, Nigel Fox, Simon Gilbert, Ian Jolliffe, John J. Kennedy, Elizabeth Kent, Albert Klein Tank, Jay Lawrimore, David E. Parker, Nick Rayner, Adrian Simmons, Lianchun Song, Peter A. Stott, and Blair Trewin 2011: Guiding the Creation of A Comprehensive Surface Temperature Resource for Twenty-First-Century Climate Science. Bull. Amer. Meteor. Soc., 92, ES40–ES47. doi: 10.1175/2011BAMS3124.1. More information at: http://www.surfacetemperatures.org
Tuesday, February 5, 2013
Databank highlighted in EOS issue 5th Feb
This seems an apposite time to update on where we stand vis-a-vis a full first version release. We have done a first blacklisting sweep and are going back for a second try based upon what we learned to see whether we can catch any more issues.
The in-house development version that is a modification of beta 2 now stands at just over 32,500 stations. We have removed 'Atlantis' stations and resolved a large number of issues over wrong geolocation. We almost certainly won't have caught them all at whatever point we release - that's inevitable. But we are increasingly confident we'll have resolved the truly low-hanging fruit issues.
At the same time we have been revising the longer methods paper that Jared is leading based upon author feedback and to reflect the blacklisting. We plan to submit that to the journal soon.
Bottom line is that we are currently shooting for a release of version 1 of the databank mid-to-late March after necessary approval procedures have been followed in mid-March. Of course, that schedule is subject to change if we find any additional issues in the interim.
* The actual paper, for now, is behind a paywall. We will investigate whether we can post a copy and if so will provide a link to such an unrestricted copy in an update to this post.
Monday, January 28, 2013
More on efforts at data rescue and digitization - reposted press release
Friday, January 25, 2013
Blacklisting
There be gremlins in the data decks constituting some of the input data to the databank algorithm - both dubious data and dubious geolocation metadata. We knew this from the start but held off on blacklisting until we got the algorithm doing sort of what we thought it should and everyone was happy with it. Now we have attacked the problem for several weeks. Here are the four strands of attack:
1. Placing a running F-test through the merged series to find jumps in variance. This found a handful of intra-source cases of craziness. We will delete these stations through blacklisting.
2. Running through NCDC's pairwise homogenization algorithm to see whether any really gigantic breaks in the series are apparent. This found no such breaks (but rest assured there are breaks and the databank is a raw data holding and not a data product per se).
3. First difference series correlations with proximal neighbors. We looked for cases where correlation was high and distance was high, correlation was low and distance was low and correlation was perfect and distance low. These were then looked at manually. Many are longitude / latitude assignation errors. For example we know Dunedin on the South Island of New Zealand is in the Eastern Hemisphere:
| This is Dunedin. Beautiful place ... |
And not the Western Hemisphere:
| This is not the Dunedin you were looking for ... Dunedin is not the new Atlantis |
But sadly two sources have the sign switched. The algorithm does not know where Dunedin is so is doing what it is supposed to. So, we need to tell it to ignore / correct the metadata for these sources so we don't end up with a phantom station.
These checks also picked up issues other than simple sign errors in lat / lon. One of the data decks has many of its French stations' longitudes inflated by a factor of 10, so a station at 1.45 degrees East is wrongly placed at 14.5 degrees East. Pacific island stations appear to have been recorded under multiple names and ids, which confounds the merging in many cases.
4. As should be obvious from the above, we also needed to look at stations proverbially 'in the drink', so we have pulled a high resolution land-sea mask and run all stations against it. All cases that are demonstrably wet (more than 10 km offshore, which is roughly 0.1 degree at the equator; many sources give coordinates only to 0.1 degree accuracy) are being investigated.
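The neighbour check in strand 3 can be sketched along the following lines. This is an illustration only, not the code we ran: the distance and correlation cut-offs are invented placeholders, the function names are hypothetical, and the series are assumed to be aligned monthly lists with None for missing values.

```python
import math


def paired_first_differences(a, b):
    """First differences of two aligned series, kept only where both are defined."""
    da, db = [], []
    for i in range(1, min(len(a), len(b))):
        if None not in (a[i - 1], a[i], b[i - 1], b[i]):
            da.append(a[i] - a[i - 1])
            db.append(b[i] - b[i - 1])
    return da, db


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    h = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def flag_pair(corr, dist_km, near_km=50.0, far_km=1000.0, high=0.9, low=0.3):
    """Return a reason for manual inspection, or None (all cut-offs are placeholders)."""
    if dist_km >= far_km and corr >= high:
        return "high correlation, far apart"
    if dist_km <= near_km and corr >= 0.999:
        return "perfect correlation, close together"
    if dist_km <= near_km and corr <= low:
        return "low correlation, close together"
    return None
```

A Dunedin record with the longitude sign switched would sit well over 1000 km from the correctly placed record while correlating almost perfectly with it, so a pair like that would land in the first category.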
Investigations have used the trusty googlemaps and wikipedia route in general with other approaches where helpful. It's time consuming and thankless. The good news is 'we' (Jared) are (is) nearly there.
The whole blacklist file will be one small text file the algorithm reads and one very large pdf that justifies each line in that text file. As people find other issues (and there undoubtedly will be - we will only catch worst / most obvious offenders even after several weeks on this) we can update and rerun.
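For readers wondering what the running F-test in strand 1 amounts to, here is a minimal sketch. It only computes the variance ratio either side of each point and compares it to a fixed number. The window length and the threshold are invented for illustration; a proper test would compare the ratio against an F distribution critical value for the chosen window sizes, and the input is assumed to be a gap-free series of anomalies.

```python
import statistics


def running_variance_ratio(series, window=60):
    """For each split point, ratio of the larger to the smaller sample variance
    of the `window` values before and after it (an F statistic, always >= 1)."""
    ratios = []
    for i in range(window, len(series) - window + 1):
        before = statistics.variance(series[i - window:i])
        after = statistics.variance(series[i:i + window])
        if before == after:
            ratio = 1.0  # also covers the case where both variances are zero
        elif min(before, after) == 0:
            ratio = float("inf")
        else:
            ratio = max(before, after) / min(before, after)
        ratios.append((i, ratio))
    return ratios


def variance_jumps(series, window=60, f_threshold=4.0):
    """Split points where the variance ratio reaches the (placeholder) threshold."""
    return [i for i, f in running_variance_ratio(series, window) if f >= f_threshold]
```

A series whose scatter suddenly grows fifty-fold, as can happen when units change mid-record within a source, would be flagged around the point of change.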
Tuesday, January 15, 2013
First public talk on Databank Merge Results: AMS Annual Meeting
In order to continue our aims to be open and transparent, the abstract from the conference can be found here, and the presentation used can be located here. The presentation was also recorded, and once AMS puts the audio online, we will also try and link to it.
Wednesday, January 9, 2013
How should one update global and regional estimates and maintain long-term homogeneity?
The fundamental issue of how to curate and update a global, regional or national product whilst maintaining homogeneity is a vexed one. Non-climatic artifacts are not the sole preserve of the historical portion of the station records. Still today stations move, instruments change, times of observation change etc. etc. often for very good and understandable reason (and often not ...). There is no obvious best way to deal with this issue. If ignored for long enough, station, local and even regional series can become highly unrealistic if large very recent biases are not dealt with.
The problem is also intrinsically inter-linked with the question as to which period of the record we should adjust for non-climatic effects. Here, at least there is general agreement that adjustment should be made to match the most recent apparently homogeneous segment so that today's readings can be easily and readily compared to our estimates of past variability and change without performing mental gymnastics.
At one extreme of the set of approaches is the CRUTEM method. Here, real-time data updates are only made to a recent period (I think still just post-2000) and no explicit assessment of homogeneity is made at the monthly update granularity (there is QC applied). Rather adjustments and new pre-2000 data effectively are caught up with major releases or network updates (e.g. with entirely new station record additions / replacements / assessments normally associated with a version increment and manuscript). This ensures values prior to a recent decade or so remain static for most month to month updates but at a potential cost if a station inhomogeneity occurs in the recent past which is de facto unaccounted for. This can only then be caught up with through a substantive update.
At the other extreme is the approach undertaken in GHCN / USHCN. Here the entire network is reassessed based upon new data receipts every night using the automated homogenization algorithm. New modern periods of records can change the identification of recent breaks in stations that contribute to the network. Because the adjustments are time-invariant deltas applied to all points prior to an identified break the impact is to change values in the deep past to better match modern data. So, the addition of station data for Jan 2013 may change values estimated for Jan 1913 (or July 1913) because the algorithm now has enough data to find a break that occurred in 2009. This then may affect the nth significant figure of the national / global calculation in 1913 on a day to day basis. This is why with GHCNv3 a system of version control of v3.x.y.z.ddmmyyyy was introduced and each version archived. If you want bit replication to be possible of your analysis then explicitly reference the version you used.
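To make the mechanism concrete, here is a toy sketch of how a time-invariant delta applied before a break reaches all the way back to the start of a record. It is not the NCDC algorithm, which also has to find the breaks and estimate their size; the function name and the numbers are purely illustrative.

```python
def apply_break_adjustments(values, breaks):
    """Shift every point before each break by that break's delta.

    values: temperatures in time order.
    breaks: (index, delta) pairs; delta is added to all values at positions
            before index, so the most recent segment is left untouched.
    """
    adjusted = list(values)
    for index, delta in breaks:
        for t in range(min(index, len(adjusted))):
            adjusted[t] += delta
    return adjusted


# A record that steps up by 0.5 near its end. Before the break is detected,
# the earliest value is served as 10.0; once it is detected, it becomes 10.5.
record = [10.0, 10.0, 10.0, 10.5, 10.5]
before_detection = apply_break_adjustments(record, [])
after_detection = apply_break_adjustments(record, [(3, 0.5)])
```

The earliest value changes between the two runs even though no new information about that time has arrived, which is why citing the exact version matters for bit replication.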
What is the optimal solution? Perhaps this is a 'How long is a piece of string?' class of question. There are very obvious benefits to either approach or any number of others. In part it depends upon the intended uses of the product. If interested in serving homogeneous station series as well as aggregated area averaged series using your best knowledge as of today, perhaps something closer to NCDC's approach is preferable. If interested mainly in large scale average determination, and under a reasonable null that at least on a few years timescale the inevitable new systematic artifacts average out as Gaussian over broad enough space scales, the CRUTEM approach makes more sense. And that, perhaps, is fundamentally why they chose these different routes ...
Saturday, January 5, 2013
High School Students Engage in Climate Research
Wednesday, January 2, 2013
Databank update - nearby 'duplicates' issue raised by Nick Stokes
One of the issues arising from the historically fragmented way data has been held and managed is that many versions of the same station may exist across multiple holdings. Often the holding will itself be a consolidation of multiple other sources and, like a Russian doll - well, you get the picture - it's a mess. So, in many cases we have little idea what has been done to the data between the original measurement and our receipt. These decks are given low priority in the merge process but ignoring them entirely would be akin to cutting one's nose off to spite one's face - they may well contain unique information.
To investigate this further and more globally we ran a variant of the code with only one line change (Jared will attest that my estimates of line changes are always underestimates but in this case it really was one line). If the metadata and data disagreed strongly then we withheld the station. We then ran a diff on the output with and without. The check found solely stations that were likely bona fide duplicates (641 in all). This additional check will be enacted in the version 1 release (and hence there will be 641 fewer stations).
Are we there now? Not quite. We have still to do the blacklisting. This is labor-intensive stuff. We will have a post on this - what we are doing and how we are planning to document the decisions in a transparent manner - early next week time permitting.
We currently expect to release version 1 no sooner than February. But it will be better for the feedback we have received and the extra effort is worth it for a more robust set of holdings.
Tuesday, December 18, 2012
So what changed between beta1 and beta2?
Monday, December 10, 2012
Where do differences between the databank and GHCNv3 arise?
The early period record is slightly cooler than the estimates from GHCNv3 while the last decade is warmer than GHCNv3. The net impact is to increase the apparent trend. This pattern is present in all the merge variants to a greater or lesser degree. This raises the logical question as to why this difference is arising. Is it because the databank's improved number of stations are sampling areas of the globe previously unsampled in GHCNv3 which behaved in a different manner to the restricted GHCNv3 sample from this larger whole or is it down to additional station sampling in areas already sampled by GHCNv3? And if so why? The two graphs below do the obvious thing and split it out simply by averaging over grids present in both and those only in the databank (there is a much smaller population of gridboxes present in v3 but not in the databank which would be grossly too small to have a significant material impact on global estimates being considered here).
With GHCNv3 gridbox sampling (concentrate on (spot the?) difference between red and blue)
New gridboxes.
So, most of the difference appears to reflect better sampling of regions already sampled. The question of why and what impact it has on homogenization efforts is 'future work' ... and is why we now need multiple groups to take up the challenge of creating new data products from the databank.
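The split used for the two graphs can be sketched as below. This is a simplified illustration, not the plotting code: it assumes each product has already been gridded into a dictionary keyed by gridbox centre, and it uses a plain cos(latitude) area weight. The function names are hypothetical.

```python
import math


def weighted_mean(boxes):
    """Area-weighted mean of {(lat_centre, lon_centre): value}, weights = cos(latitude)."""
    total = sum(math.cos(math.radians(lat)) for (lat, _lon) in boxes)
    if total == 0:
        return None  # no gridboxes to average
    weighted = sum(v * math.cos(math.radians(lat)) for (lat, _lon), v in boxes.items())
    return weighted / total


def split_by_sampling(databank, ghcn):
    """Average the databank over gridboxes shared with GHCNv3 and over those new to the databank."""
    shared = {k: v for k, v in databank.items() if k in ghcn}
    new_only = {k: v for k, v in databank.items() if k not in ghcn}
    return weighted_mean(shared), weighted_mean(new_only)
```

Running this for each time step gives one series for the gridboxes common to both products and one for the gridboxes only the databank samples.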
Thursday, December 6, 2012
Databank Release: Beta #2
The beta2 release can be found here: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. Within that directory one can find all the data and code used, along with some graphics depicting the results of all the merge variants. A technical description of the merge program (similar to beta1) is also provided, along with a new file documenting changes from beta1 to beta2.
Beta1 is not forgotten and lost forever. All the data and code from beta1 is located in our archive if anyone still wishes to access it: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta1/
Some of the major changes include the following:
- Added a metadata comparison check of when the data record began
- Added source data from Sweden, Uruguay, Norway, Canada, and the MetOffice's new HadISD dataset
- Updated lookup table to determine whether a candidate station is merged, unique, or withheld after a data comparison is made
The deadline has passed for new data to be added for an official version 1 release. However there is still plenty of time to provide feedback on all the methodologies used in constructing the databank. Your comments have helped us so far, and we welcome any more that may arise.
Tuesday, November 6, 2012
Taking the temperature of the Earth: Temperature Variability and Change across all Domains of Earth's Surface
The overarching motivation for this session is the need for better understanding of in-situ measurements and satellite observations to quantify surface temperature (ST). The term "surface temperature" encompasses several distinct temperatures that differently characterize even a single place and time on Earth’s surface, as well as encompassing different domains of Earth’s surface (surface air, sea, land, lakes and ice). Different surface temperatures play inter-connected yet distinct roles in the Earth’s surface system, and are observed with different complementary techniques.
There is a clear need and appetite to improve the interaction of scientists across the in-situ/satellite 'divide' and across all domains of Earth's surface. This will accelerate progress in improving the quality of individual observations and the mutual exploitation of different observing systems over a range of applications.
This session invites oral and poster contributions that emphasize sharing knowledge and make connections across different domains and sub-disciplines. They can include, but are not limited to, topics like:
* How to improve remote sensing of ST in different environments
* Challenges from changes of in-situ observing networks over time
* Current understanding of how different types of ST inter-relate
* Nature of errors and uncertainties in ST observations
* Mutual/integrated quality control between satellite and in-situ observing systems.
If you are interested in attending, abstracts need to be submitted by Jan 9th 2013.
More info can be found at http://meetingorganizer.copernicus.org/EGU2013/session/12115
We will run a guest post by the Earthtemp organizers in the coming weeks outlining what their effort involves and how it is synergistic with the International Surface Temperature Initiative. Watch this space ...
Friday, October 26, 2012
Databank poster at Global Framework for Climate Services User Conference
Thursday, October 18, 2012
How do you decide if a station is to be merged, added as unique or withheld?
The above flowchart is a simple visualization of how the merge program works. As you can see there are a number of different options the candidate station can go through. I'm confident enough to say that each and every situation above happens at least once in the recommended merge!
Let's break down this flowchart, starting with the metadata check. The candidate station runs through all the target stations and calculates three metrics:
- distance probability
- height probability
- station name similarity using Jaccard Index
Using a threshold of 0.50 we then see what the next step is. If no metadata probability values exceed this threshold, then we check the validity of the individual metadata metric. If it turns out that 2 metrics are really good (> 0.90) and the third one is terrible, we then determine that there is bad or incomplete metadata, and the station is withheld. Otherwise we are confident that the station is unique in its own right and we add it to the target dataset.
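As a rough sketch of this first branch of the flowchart, consider the function below. Several details are assumptions made only for illustration: the three metrics are combined into one metadata probability by multiplying them, "terrible" is taken to mean below 0.10, and the function and parameter names are hypothetical. The 0.50 and 0.90 values are the ones quoted above.

```python
import math


def metadata_check(targets, combine=math.prod, threshold=0.50, good=0.90, terrible=0.10):
    """Decide the next step for a candidate station from metadata alone.

    targets: {target_id: (distance_prob, height_prob, name_similarity)}.
    Returns ("compare", [target ids]), ("withheld", None) or ("unique", None).
    """
    passing = [tid for tid, metrics in targets.items() if combine(metrics) > threshold]
    if passing:
        return "compare", passing  # move on to the data comparison
    for metrics in targets.values():
        ranked = sorted(metrics)
        if ranked[0] < terrible and ranked[1] > good:
            # two strong metrics and one very poor one: suspect metadata
            return "withheld", None
    return "unique", None
```

For example, a candidate that matches a target almost perfectly in distance and height but not at all in name would be withheld under these assumptions, while one that resembles nothing in the target set would be added as unique.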
If any stations exceed the threshold of 0.50, then we move down the left side of the chart. The next step is data comparisons. Using an overlap of no less than 5 years, we calculate the Index of Agreement, which is a "goodness-of-fit" measure similar to the coefficient of determination, however not as sensitive to outliers. Similar to the metadata probability, this is calculated between the candidate station and all target stations with metadata probability values greater than 0.50.
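A minimal sketch of such a comparison follows, assuming the Index of Agreement takes the commonly used Willmott form, d = 1 - sum((c - t)^2) / sum((|c - mean(t)| + |t - mean(t)|)^2), with the target station playing the role of the observations. The alignment of the two series by position, the use of None for missing months, and reading the 5 year minimum as 60 overlapping months are all assumptions here.

```python
def index_of_agreement(candidate, target, min_overlap=60):
    """Willmott-style index of agreement over months where both series have data.

    Returns None when fewer than min_overlap months overlap
    (60 months stands in for the 5 year minimum).
    """
    pairs = [(c, t) for c, t in zip(candidate, target) if c is not None and t is not None]
    if len(pairs) < min_overlap:
        return None  # no data comparison possible
    t_mean = sum(t for _, t in pairs) / len(pairs)
    num = sum((c - t) ** 2 for c, t in pairs)
    den = sum((abs(c - t_mean) + abs(t - t_mean)) ** 2 for c, t in pairs)
    return 1.0 if den == 0 else 1.0 - num / den
```

Identical series give a value of 1, and the value falls towards 0 as the two series diverge.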
We then check to see if any comparisons were made. If not, then that means the two stations did not have any overlap period, or it had some overlap, but it didn't exceed the 5 year threshold. At this point one of two different things can happen. We look at the target station with the highest metadata probability. First, if the best probability is greater than 0.85, then the station merges. If not, then it is withheld.
If a data comparison was made via the Index of Agreement, then a lookup table takes into account both IA and the overlap period and creates a probability of station match, as well as station uniqueness. These are then recombined with the metadata probability to form a posterior probability of station "sameness" and station "uniqueness". If any one of the same probabilities passes the same threshold, then the candidate station merges with that target station. If no same probability passes the same threshold, but a unique probability passes the unique threshold, then the candidate station is unique. Otherwise, the station is withheld.
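The last two paragraphs can be condensed into the sketch below. It takes the posterior probabilities as given, since the lookup table and the recombination step are not reproduced here, and the same and unique thresholds are left as required arguments because their values are not stated in this post. Merging with the target that has the highest same probability, and treating any unique probability above the threshold as sufficient, are assumptions; only the 0.85 value comes from the text.

```python
def decide(targets, same_threshold, unique_threshold, no_overlap_merge=0.85):
    """Merge / unique / withheld decision for one candidate station.

    targets: non-empty list of (target_id, metadata_prob, post_same, post_unique)
             for the targets that passed the metadata check; post_same and
             post_unique are None when no data comparison was possible.
    """
    compared = [t for t in targets if t[2] is not None]
    if not compared:
        # no usable overlap: fall back on the best metadata match
        best = max(targets, key=lambda t: t[1])
        return ("merge", best[0]) if best[1] > no_overlap_merge else ("withheld", None)
    best_same = max(compared, key=lambda t: t[2])
    if best_same[2] > same_threshold:
        return "merge", best_same[0]
    if any(t[3] > unique_threshold for t in compared):
        return "unique", None
    return "withheld", None
```

Called with a single target that has no overlap and a metadata probability of 0.9, this returns a merge; with a metadata probability of 0.6 it withholds the station.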
A more detailed description of the above flowchart can be found here.
Tuesday, October 16, 2012
How do I work out where a station series in the merged product originates?
Using the above image, it can be found that the sources belong to GHCN-Daily (source #01, black), russsource (source # 35, red), and ghcnmv2 (source #39, blue). Now that the sources are known, one can find the Stage 2 data for this station. A user can also look further back, and find the original digitized copy (Stage 1), and sometimes even the original paper copy (Stage 0).









