Fuzzy matching theory

The basic idea behind fuzzy matching is to compute a numerical ‘distance’ between every potential string comparison, and then for each string in data set 1, pick the ‘closest’ string in data set 2. One can also specify a threshold such that every match is of a certain quality. The concept of ‘distance’ can be defined in several methods, explained below. Each method will define a similarity \(s\), \(0 \leq s \leq 1\) , such that higher values will mean the strings are more similar, and a corresponding distance \(d = 1 - s\).

Jaro-Winkler

The most common method of matching strings with fedmatch uses the Jaro-Winkler string distance. The Jaro similarity is defined as

\[s_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3} \left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) & \text{otherwise} \end{cases}\]

where \(|s_i|\) is the length of the \(i\)th string, \(m\) is the number of characters that match, and \(t\) is the number of transpositions (the number of letters to be interchanged to place them in the proper order.) For example, if \(s_1\) is “abcd” and \(s_2\) is “acbd”, then the number of transpositions is 1, the matching characters are 4, and the similarity is \(\frac{1}{3}\left( 1 + 1 + \frac{3}{4} \right) \approx .917\).

Winkler’s modification to this similarity is to weight the start of the strings more heavily. Thus, if the strings share a common prefix, the similarity will be higher. The Jaro-Winkler similarity is

\[s_w = s_j + l p (1-s_j),\]

where \(l\) is the shared prefix length (up to four characters, so \(0 \leq l \leq 4\) ), and \(p\) is just user-defined constant, \(0 < p \leq .25\), that defines how heavily to weight the shared prefix. \(p=.1\) is the default used by fedmatch.

Weighted Jaccard Similarity

Another string distance metric is called the Jaccard Similarity. If \(A\) is the set of letters in one string, and \(B\) is the set of letters in another string, then the Jaccard similarity \(J\) is defined as

\[J = \frac{A \bigcap B}{A \bigcup B}.\]

Thus, if the two strings share all of the same letters, then the distance is 0, or the similarity is 1.

We take one step further and innovate to create a new measure of string distance, which we call the Weighted Jaccard distance. Rather than looking at the sets of letters in each string, we look at the set of words in each string. Then, we construct a corpus of all of the words in the two columns to be matched and compute the log inverse frequency for each word:

\[w_i = \log \left( \frac{N}{n_i} \right), \] where \(N\) is the total number of unique words in the sample, and \(n_i\) is the number of occurrences of word \(i\) in the sample.

For example, the word ‘Corporation’ appears frequently in the names of companies in common data sets, and so its log inverse frequency would be low. But, the word ‘Apple’ probably only appears in a few company names out of the many thousands in the data, so its log inverse frequency score would be high.

If these logged inverse frequencies are called \(w\), then the Weighted Jaccard similarity is computed as

\[ J_i = \frac{ \sum_j w_j, j \in A \bigcap B}{ \sum_k w_k, k \in A \bigcup B} \]

Thus, if the two strings share many words that are uniquely identifying in the corpus, the Weighted Jaccard similarity will be higher. We have found this method of matching is able to pick up many more matches when used in conjunction with the Jaro-Winkler distance. To our knowledge, such a Weighted Jaccard method has never been used before, and is one of the key innovations that makes fedmatch unique.

Other similarity metrics

Because fedmatch uses stringdist::amatch for all other options, one can use any of the similarity metrics listed in the documentation for that function in fedmatch.

Using fuzzy matching in fedmatch

Basic Syntax

As shown in the ‘Intro to fedmatch’ vignette, the basic syntax of using fuzzy match is just the same as exact matching, but changing the ‘match_type’ argument in merge_plus:

fuzzy_result <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(fuzzy_result$matches)
#> Key: <unique_key_2>
#> Index: <tier>
#>    unique_key_2 unique_key_1 Country  State   SIC Revenue            Company
#>           <int>        <int>  <char> <char> <num>   <num>             <char>
#> 1:            1            1     USA     OH  3300     485            Walmart
#> 2:            2            2     USA         2222     223   Bershire Hataway
#> 3:            6            6     USA     MA    NA     184 UnitedHealth Group
#>                  Name country state_code SIC_code earnings   tier
#>                <char>  <char>     <char>    <num>   <char> <char>
#> 1:            Walmart     USA         OH     3380  490,000    all
#> 2:  Bershire Hathaway     USA         NE     2220  220,000    all
#> 3: UnitedHealth Group     USA         MA     1130  180,000    all

Note that ‘Bershire Hathaway’ matched to ‘Bershire Hataway,’ which makes sense - the names are only off by one character. All of the settings of fuzzy matching are controlled in the ‘fuzzy_settings’ argument of merge_plus, and these settings can be easily modified with the function build_fuzzy_settings. build_fuzzy_settings returns a list that is properly formatted for merge_plus.The options and their defaults are:

method , default “jw”, determines the string distance metric to use.
p , default 0.1, is a numeric vector of length 1, that determines the factor \(p\) in the Jaro-Winkler string distance. Only used if method == "jw".
maxDist, default 0.05, is a numeric vector of length 1 that determines the maximum allowable string distance that qualifies as a match.
matchNA, default FALSE, is a boolean to determine whether to count NAs as a match (see stringdist::amatch for details.)
nthread, number of threads to use in the underlying C code. If using any method besides “wgt_jaccard”, uses `stringdist::amatch’ to parallelize. If using “wgt_jaccard”, instead uses built in C++ code to parallelize.

The order of which is data1 and which is data2 matters!

The behavior of fuzzy_match is such that you can get different results from swapping data1 and data2. This is because fuzzy_match answers the question: “for each observation in data1, which observation is closest in data2 (while being closest than the maxDist)?” This can result in duplicate observations from data2, as shown in the simple example below:

dummy_data1 <- data.table(id1 = 1:2,
                            name = "abd")
dummy_data2 <- data.table(id2 = 1,
                            name = "abc")
result1 <- fedmatch::merge_plus(
    data1 = dummy_data1,
    match_type = "fuzzy",
    data2 = dummy_data2, by.x = "name", by.y = "name",
    unique_key_1 = "id1", unique_key_2 = "id2",
    suffixes = c("_1", "_2"), fuzzy_settings = build_fuzzy_settings(maxDist = .5))
print(result1$matches)
#> Key: <id2>
#> Index: <tier>
#>      id2   id1 name_1 name_2   tier
#>    <num> <int> <char> <char> <char>
#> 1:     1     1    abd    abc    all
#> 2:     1     2    abd    abc    all

Note how we have two observations of the id2 ‘1’, because each of these observations is the closest in the data to the observations in data1 (and they are closer together than the maxDist specified). Now, if we flip the datasets:

result1 <- fedmatch::merge_plus(
    data1 = dummy_data2,
    match_type = "fuzzy",
    data2 = dummy_data1, by.x = "name", by.y = "name",
    unique_key_1 = "id2", unique_key_2 = "id1",
    suffixes = c("_1", "_2"), fuzzy_settings = build_fuzzy_settings(maxDist = .5))
print(result1$matches)
#> Key: <id1>
#> Index: <tier>
#>      id1   id2 name_1 name_2   tier
#>    <int> <num> <char> <char> <char>
#> 1:     1     1    abc    abd    all

We only get 1 observation, because in the case of ties, fuzzy_match will simply pick the first observation.

An example - weighted Jaccard match

Weighted Jaccard Match

wgt_jaccard_match <- merge_plus(data1 = corp_data1, 
                          data2 = corp_data2,
                          by.x = "Company",
                          by.y = "Name", match_type = "fuzzy", 
                          fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2,
                                                                maxDist = .5),
                          unique_key_1 = "unique_key_1",
                          unique_key_2 = "unique_key_2")
print(wgt_jaccard_match)
#> $matches
#> Key: <unique_key_2>
#> Index: <tier>
#>    unique_key_2 unique_key_1 Country  State   SIC Revenue            Company
#>           <int>        <int>  <char> <char> <num>   <num>             <char>
#> 1:            1            1     USA     OH  3300     485            Walmart
#> 2:            4            4     USA     TX  2222     205       Exxon Mobile
#> 3:            6            6     USA     MA    NA     184 UnitedHealth Group
#> 4:           10           10     USA     MI    NA     151 Ford Motor Company
#>                  Name country state_code SIC_code earnings   tier
#>                <char>  <char>     <char>    <num>   <char> <char>
#> 1:            Walmart     USA         OH     3380  490,000    all
#> 2: Exxon Mobile Inc.      USA         TX     2222  210,000    all
#> 3: UnitedHealth Group     USA         MA     1130  180,000    all
#> 4:         Ford Motor     USA         MI     2222  150,000    all
#> 
#> $matches_filter
#> Null data.table (0 rows and 0 cols)
#> 
#> $data1_nomatch
#>             Company Country  State   SIC Revenue unique_key_1
#>              <char>  <char> <char> <num>   <num>        <int>
#> 1: Bershire Hataway     USA         2222     223            2
#> 2:            Apple     USA     CA  3384     215            3
#> 3:        McKesson  Germany     MA   222     192            5
#> 4:       CVS Health     USA     RI  1112     177            7
#> 5:   General Motors     USA     MI  2222     166            8
#> 6:             AT&T     USA     TN  4000     163            9
#> 
#> $data2_nomatch
#>                 Name country state_code SIC_code earnings unique_key_2
#>               <char>  <char>     <char>    <num>   <char>        <int>
#> 1: Bershire Hathaway     USA         NE     2220  220,000            2
#> 2:    Apple Computer     USA         CA       NA  220,000            3
#> 3:    McKesson Corp.                 MA     2222  190,000            5
#> 4:               CVS                 RI     1122  180,000            7
#> 5:                GM                 MI     2222  170,000            8
#> 6:            AT & T     USA         TN     4000  160,000            9
#> 
#> $match_evaluation
#> Index: <tier>
#>      tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#>    <char>   <int>            <int>            <int>         <num>         <num>
#> 1:    all       4                4                4           0.4           0.4
#>    new_unique_1 new_unique_2
#>           <int>        <int>
#> 1:            4            4

Fuzzy Matching

Intro