Thursday, October 18, 2012

How do you decide if a station is to be merged, added as unique or withheld?


The flowchart above is a simple visualization of how the merge program works. As you can see, there are a number of different paths the candidate station can take. I'm confident enough to say that each and every situation above happens at least once in the recommended merge!

Let's break down this flowchart, starting with the metadata check. The candidate station runs through all the target stations and calculates three metrics:
  • distance probability
  • height probability
  • station name similarity using the Jaccard Index
These metrics range from 0 to 1, where 0 means no station match and 1 means a perfect station match. Using a quasi-Bayesian approach, the three metrics are combined to form a posterior probability of station match (again, between 0 and 1). This is known as the metadata probability. The metadata probability is calculated between the candidate station and all target stations.
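To make the name-similarity metric concrete, here is a minimal sketch of a Jaccard Index between two station names. The Jaccard Index itself is just the size of the intersection of two sets divided by the size of their union; how the merge program turns a station name into a set is not described in this post, so the character-bigram split and the function name `jaccard_index` are assumptions made for illustration only.

```python
def jaccard_index(name_a: str, name_b: str) -> float:
    """Jaccard Index |A & B| / |A | B| between two station names.

    Assumption: each name is reduced to its set of character bigrams.
    The real merge program may tokenize names differently.
    """
    def bigrams(name: str) -> set:
        cleaned = name.upper().strip()
        return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}

    set_a, set_b = bigrams(name_a), bigrams(name_b)
    union = set_a | set_b
    if not union:
        # Assumption: names too short to form a bigram give no evidence.
        return 0.0
    return len(set_a & set_b) / len(union)
```

Identical names score 1, names with nothing in common score 0, and near-matches land in between, which matches the 0 to 1 range described above.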

Using a threshold of 0.50, we then decide what the next step is. If no metadata probability exceeds this threshold, then we check the validity of the individual metadata metrics. If it turns out that two metrics are really good (> 0.90) and the third one is terrible, we conclude that the metadata is bad or incomplete, and the station is withheld. Otherwise we are confident that the station is unique in its own right and we add it to the target dataset.
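The right-hand side of the chart can be sketched as a small function. The 0.50 and 0.90 values come from the description above; what counts as a "terrible" metric is not given, so `bad_cutoff=0.10` is a placeholder, and the function name, the input layout, and the choice to withhold if any single target shows the two-good-one-bad pattern are all assumptions.

```python
def metadata_branch(targets, bad_cutoff=0.10):
    """Decide the next step from the metadata check.

    targets: list of (metadata_prob, (distance_p, height_p, name_sim)),
             one entry per target station.
    bad_cutoff: hypothetical definition of a "terrible" metric.
    """
    # Any target above 0.50 sends the candidate on to the data comparison.
    if any(prob > 0.50 for prob, _ in targets):
        return "compare_data"

    # Two metrics above 0.90 with a terrible third suggests bad metadata.
    for _, metrics in targets:
        lowest, middle, _highest = sorted(metrics)
        if middle > 0.90 and lowest < bad_cutoff:
            return "withheld"

    # Nothing looks like a match, and nothing looks suspicious.
    return "unique"
```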

If any stations exceed the threshold of 0.50, then we move down the left side of the chart. The next step is data comparisons. Using an overlap of no less than 5 years, we calculate the Index of Agreement, which is a "goodness-of-fit" measure similar to the coefficient of determination, but less sensitive to outliers. Like the metadata probability, it is calculated between the candidate station and every target station with a metadata probability greater than 0.50.
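For readers who have not met the Index of Agreement before, here is a rough sketch. It assumes the commonly used Willmott form, d = 1 - Σ(c - t)² / Σ(|c - t̄| + |t - t̄|)², with the target series playing the role of the reference. It also assumes the two series are monthly, already aligned in time, and use `None` for missing values, so that 5 years of overlap means 60 paired values. The function name and these conventions are illustrative, not taken from the merge program.

```python
def index_of_agreement(candidate, target, min_overlap=60):
    """Index of Agreement between two aligned series.

    Returns None when fewer than min_overlap paired values exist
    (assumption: monthly data, so 60 values = 5 years).
    """
    # Keep only time steps where both stations report a value.
    pairs = [(c, t) for c, t in zip(candidate, target)
             if c is not None and t is not None]
    if len(pairs) < min_overlap:
        return None

    target_mean = sum(t for _, t in pairs) / len(pairs)
    squared_error = sum((c - t) ** 2 for c, t in pairs)
    potential_error = sum(
        (abs(c - target_mean) + abs(t - target_mean)) ** 2 for c, t in pairs
    )
    if potential_error == 0:
        # Both series are constant and equal to the same value.
        return 1.0
    return 1.0 - squared_error / potential_error
```

A return value of `None` corresponds to the "no comparison was made" case, which keeps the overlap check and the calculation in one place.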

We then check to see if any comparisons were made. If not, then the candidate and target stations either had no overlap period at all, or had some overlap that fell short of the 5-year minimum. At this point one of two things can happen. We look at the target station with the highest metadata probability. If that probability is greater than 0.85, then the station merges. If not, then it is withheld.
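In code, the no-overlap case is a one-line decision on the best metadata probability. The 0.85 cutoff is from the text above; the function name and the return format (a decision plus the index of the chosen target) are made up for this sketch.

```python
def decide_without_overlap(metadata_probs):
    """Fallback when no data comparison could be made.

    metadata_probs: metadata probabilities of the target stations that
    passed the 0.50 threshold (assumed non-empty).
    """
    # Pick the target station with the highest metadata probability.
    best_index = max(range(len(metadata_probs)), key=lambda i: metadata_probs[i])
    if metadata_probs[best_index] > 0.85:
        return ("merge", best_index)
    return ("withheld", None)
```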

If a data comparison was made via the Index of Agreement, then a lookup table takes both the IA and the length of the overlap period into account and produces a probability of station match, as well as a probability of station uniqueness. These are then combined with the metadata probability to form posterior probabilities of station "sameness" and station "uniqueness". If any of the sameness probabilities passes the "same" threshold, then the candidate station merges with that target station. If no sameness probability passes the "same" threshold, but a uniqueness probability passes the "unique" threshold, then the candidate station is unique. Otherwise, the station is withheld.
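The final decision can be sketched as follows. The lookup table and the two threshold values are not given in this post, so the function simply takes the posterior probabilities and both thresholds as inputs. Merging with the highest-scoring target when several pass, and treating "passes" as strictly greater than, are assumptions, as is the function name.

```python
def decide_with_comparison(same_probs, unique_probs,
                           same_threshold, unique_threshold):
    """Final merge decision after the data comparison.

    same_probs / unique_probs: posterior "sameness" and "uniqueness"
    probabilities, one per compared target station.
    Thresholds are supplied by the caller; none are assumed here.
    """
    # Assumption: if several targets pass, merge with the best one.
    best_index = max(range(len(same_probs)), key=lambda i: same_probs[i])
    if same_probs[best_index] > same_threshold:
        return ("merge", best_index)

    # No target is the same station; is the candidate clearly unique?
    if any(p > unique_threshold for p in unique_probs):
        return ("unique", None)

    # Neither clearly the same nor clearly unique.
    return ("withheld", None)
```

Keeping the thresholds as arguments makes it easy to see how different choices move stations between the merged, unique, and withheld outcomes.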

A more detailed description of the above flowchart can be found here.
