Matching accuracy metrics: Indicator measures
4 part series
The focus of this series so far has been on hard metrics to measure the error rates of an identity matching system. However, not all measures of accuracy can be easily quantified into hard numbers. This fourth installment of our blog series focuses on different validation methods for real-world systems that can provide reassurance that the system is behaving as expected. We provide methods for placing reasonable bounds on the error rates and for qualitatively identifying problems at common trouble spots.
Matching individuals after they have moved to a new area is particularly challenging. First, a large move often corresponds to a change of care, with all of the opportunities for errors and variations. Even more impactful is the change of scope, from a limited region of tens or hundreds of thousands of people to suddenly 340 million potential targets. The same techniques that worked for matching within a single city or zip code region suddenly have a huge risk of false positives. For instance, there are two people born with the name James Smith every day in the US. Correctly linking these types of cases can be extremely challenging.
By tracking individuals in the matching system and aggregating the results, we can evaluate the migration patterns seen within the data against outside sources. Some excellent sources include the US Census, USPS change-of-address orders, and even FEMA requests. Academic papers may also have good relocation data for specific regions or demographics.
The best way to evaluate performance is to focus on a specific subcase. For instance, the patterns of relocation from San Juan, Puerto Rico to the mainland US are well documented and easy to quantify. First, create a normalized histogram of the documented movement from the origin region to each destination region. Next, create a similar histogram from the matching system data. The second histogram should be further weighted by the estimated coverage in each region, and then renormalized. For instance, if the matching data has 60% coverage of San Juan and 75% coverage of Manhattan, only about 0.60*0.75=0.45 of the true moves between them are expected to be visible, so the entry for people relocating from San Juan to Manhattan should be divided by 0.45 to address the expected amount of undercounting.
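The coverage weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the move counts, and the Orlando coverage figure are hypothetical, and it assumes coverage at the origin and destination are independent, so that their product gives the share of moves that are observed.

```python
def normalize(counts):
    """Scale a {region: count} mapping so that its values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram has no mass")
    return {region: value / total for region, value in counts.items()}


def coverage_weighted_histogram(matched_moves, origin_coverage, dest_coverage):
    """Correct matched relocation counts for undercounting, then renormalize.

    Assumption: a move is only observed when the person is covered at both
    ends, so each count is divided by origin_coverage * dest_coverage[dest].
    """
    corrected = {
        dest: count / (origin_coverage * dest_coverage[dest])
        for dest, count in matched_moves.items()
    }
    return normalize(corrected)


# Hypothetical counts of matched moves out of San Juan, for illustration only.
matched_moves = {"Manhattan": 450, "Orlando": 300}
dest_coverage = {"Manhattan": 0.75, "Orlando": 0.50}

matched_hist = coverage_weighted_histogram(matched_moves, 0.60, dest_coverage)
```

With these made-up numbers, the raw counts suggest more moves to Manhattan, but after the coverage correction both destinations carry about the same weight, because Orlando is assumed to be more heavily undercounted.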
The two histograms can be compared both quantitatively and qualitatively. A simple sum of squared differences (SSD) can give a good measure of overall accuracy. However, this does not give a specific breakdown of FPR and FNR. For that, keep in mind that false positives during relocation are extraneous matches and are likely to land in any region. The false negatives, on the other hand, will happen in proportion to the true number of relocations between each pair of regions. Therefore, a large number of false positives will cause the histogram from the matching data to look flatter or more uniform. This can be measured by looking at the difference in Shannon entropy between the histograms, or similar tools. A large number of false negatives would look like an overall decrease in the number of relocations, relative to the number of people who did not relocate in the same time period. With a little bit of simulation and a hill-climbing search, you can estimate the FPR and FNR that best explain the differences in the data, with reasonable precision.
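As a rough sketch of the two comparisons mentioned here, the following Python computes the SSD and the Shannon entropy gap between a reference histogram and a matched-data histogram. The histograms are invented for illustration, the function names are hypothetical, and both inputs are assumed to be already normalized.

```python
import math


def sum_squared_differences(reference, observed):
    """SSD over the union of regions; a missing region counts as 0."""
    regions = set(reference) | set(observed)
    return sum(
        (reference.get(r, 0.0) - observed.get(r, 0.0)) ** 2 for r in regions
    )


def shannon_entropy(hist):
    """Entropy in bits of a normalized histogram; zero entries are skipped."""
    return -sum(p * math.log2(p) for p in hist.values() if p > 0)


# Hypothetical normalized histograms of destinations for one origin region.
reference = {"Orlando": 0.5, "Manhattan": 0.3, "Chicago": 0.2}
observed = {"Orlando": 0.4, "Manhattan": 0.3, "Chicago": 0.3}

ssd = sum_squared_differences(reference, observed)
entropy_gap = shannon_entropy(observed) - shannon_entropy(reference)
```

A positive `entropy_gap` means the matched data is flatter than the reference, which is the signature the text associates with false positives. How large a gap is worth investigating depends on the data and is not something this sketch can decide.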
This method can provide very pointed insight into the behavior of matching individuals who are relocating. Since this is such a common weak point for matching systems, it is good to understand the accuracy of this specific matching behavior. As always, there are a few things to keep in mind with these tests:
Identify Matching System Biases. Some matching systems may have regional migration rules built in. These may bias the system toward certain types of links across regions, such as minimizing distance or staying within the same state. The migration patterns test is meant to be a blind test of performance, to see if the matching system can correctly reproduce the patterns we see in external sources. If a system already has regional preferences, this is no longer a fair test. If possible, disable the regional rules for the matching system before testing.
Align the Evaluation Data. When choosing a source of relocation patterns, keep in mind the parameters of the source data. Your gold standard likely has some constraints on age, gender, or other demographics. Make sure that you know whether the data includes minors and whether it reports on individuals or households. Pay careful attention to aligning the date range as well. For instance, migration patterns out of Puerto Rico changed significantly after Hurricane Maria. Then, you can restrict your comparison of the matching system to the same time period and the same types of individuals.
Focus. The amount of data to consider for migration is large (nearly a million different transition pairs for three-digit zip codes alone), and it can be difficult to make sense of it all. Also, a lot of it is not very interesting or well studied. Given that the evaluation is qualitative as well as quantitative, it makes sense to focus on some select bands, both geographic and demographic, for which there is good information and distinct patterns.
Identity Count Growth
One quick and easy test for the FNR is to watch the number of identities grow over time. In a stable system with steady coverage, the number of people should change slowly over time. New births and immigration will add people at a predictable rate. Even though mortality and emigration will decrease the population, their IDs will persist in the matching system, resulting in consistent growth. This rate of growth is typically well documented or can easily be extrapolated from existing data. However, false negatives also add to the total number of identities, because each missed match creates an additional fragmented identity. Comparing the expected growth rate to the observed growth can help quantify the FNR.
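The comparison can be sketched as a simple calculation. The counts and the expected rate below are hypothetical, and the sketch assumes the expected rate reflects only additions (births and immigration), since identities for people who die or emigrate persist in the system.

```python
def excess_growth_rate(count_start, count_end, expected_growth_rate):
    """Observed identity growth minus the growth expected from new people.

    All rates are fractions of the starting identity count over one period.
    """
    if count_start <= 0:
        raise ValueError("count_start must be positive")
    observed_growth_rate = (count_end - count_start) / count_start
    return observed_growth_rate - expected_growth_rate


# Hypothetical identity counts one year apart; 1.2% is an assumed rate of
# births plus immigration for the same population.
excess = excess_growth_rate(1_000_000, 1_030_000, 0.012)
```

Here the identity count grew by 3% against an expected 1.2%, leaving roughly 1.8 percentage points of growth unexplained. That excess is only an indicator: it may come from fragmented identities, but also from changes in coverage.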
This approach gives a good estimate of the growth of false negatives over time, assuming the coverage is good and the natural growth rate is known. However, the precision will always be lower than that of the other tests described earlier. Without a true measure of the coverage and the local growth rates, this is best considered a bound on the FNR rather than an actual metric. There are a few things to keep in mind when using this sort of test:
Localized estimates. Getting good coverage over the entire region in a data set is very difficult, so it is helpful to break up this test into a bunch of smaller tests. Separate the population by region, and possibly by demographics, to isolate subgroups that have reliably good coverage and well-understood growth rates. This separation can also be useful to eliminate outliers, as unmodeled events could otherwise add significant noise to an average.
Estimating coverage. Good coverage (>90%) is important to separate natural growth from false negatives. However, keep in mind that a high FNR makes estimating coverage particularly difficult, as the population counts for a region can be inflated by the extra identities that false negatives create. Where possible, establish the coverage rate from the data supplier using independent methods.
Upper bounding. Getting a firm measurement of FNR out of the number of identities can be difficult. Instead, this method is best used to place a reasonable bound on the FNR. A good upper bound is the raw growth rate, assuming any new identities could be false negatives. Restricting the assessment to segments with low natural growth rates, such as a specific range of birth years, can help give a tighter upper bound.
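A minimal sketch of the upper-bounding idea, combined with the localized estimates suggested above, might look like the following. The segment names and counts are hypothetical, and the bound is expressed as new identities per starting identity over the period, which is one possible reading of "raw growth rate".

```python
def fnr_upper_bound(count_start, count_end):
    """Raw growth rate, treating every new identity as a possible false negative."""
    if count_start <= 0:
        raise ValueError("count_start must be positive")
    return max(0.0, (count_end - count_start) / count_start)


# Hypothetical (start, end) identity counts per birth-year segment.
segments = {
    "born 1950-1959": (200_000, 201_000),
    "born 1990-1999": (150_000, 156_000),
}

bounds = {name: fnr_upper_bound(*counts) for name, counts in segments.items()}

# The segment with the least natural growth gives the tightest bound.
tightest_segment = min(bounds, key=bounds.get)
```

In this made-up example the older cohort grows by 0.5% and the younger one by 4%, so the older cohort provides the tighter bound, in line with the advice to restrict the assessment to segments with low natural growth.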