-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmecRecordLinkage.qmd
251 lines (173 loc) · 12.8 KB
/
mecRecordLinkage.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "mecRecordLinkage"
author: "Łukasz Chrostowski"
output:
html_document: default
format:
html:
self-contained: true
table-of-contents: true
number-sections: true
editor: visual
execute:
eval: true
warning: false
message: false
library:
- dplyr
- reclin2
---
# Introduction
The goal of the main functions available in the package is to link data sets using maximum entropy classifier. Combining different sources of data is widely using step in a data analysis. Many various techniques have been considering during the years so as to link the records basing on a comparison vectors and m- and u- probabilities estimation. The most useful is probabilistic record linkage derived by `Fellegi and Sunter`, where parameters are estimated using `EM` algorithm (`Winkler`) (see `reclin2` package). One can use also machine learning methods, where process of matching pairs of records into match and non-match sets is considering as a classification problem. You can use `glm` with `binomial` or `xgboost` functions to classify the pairs using machine learning. In our setting, pairs matching is a classification problem as well, but matching process is more automated in comparison to the probabilistic method. MEC technique was proposed by [D. Lee, L. C. Zhang and J. K. Kim. Maximum entropy classification for record linkage (2022)](https://arxiv.org/abs/2009.14797). Authors derive supervised and unsupervised methods with full set parameters estimation techniques. `mecRecordLinkage` package contains direct implementation of the proposed algorithms. First we make an introduction to technical details, then show how to use package properly and finally compare the results with the `reclin2` package.
## Technical details
With record linkage we are trying to link two or more data sets using many common entities. The goal is to find the true matches among all possible pairs of the two data files. For this purposes we create so called `comparison space` $\Omega=A \times B=M \cup U$ consist of matches M and non-matches U between the record in given files. First of all we want to estimate probability ratio
$$
r_{a b}=\frac{m\left(\gamma\right)}{u\left(\gamma\right)}
$$ where $m\left(\gamma\right)$ and $u\left(\gamma\right)$ are the probability mass functions for M and U set directly. In maximum entropy classification we classify each pair of records to these sets depending of the $r$ value. We consider several ways to estimate probability ratio. Let $\quad g_{a b}=1$ if $\quad(a, b) \in M \quad$ and $0$ if $\quad(a, b) \in U$. In the first setting we can create a model of $m\left(\gamma\right)$ by $m(\gamma ; \boldsymbol{\theta})=\prod_{k=1}^K \theta_k^{\gamma_k}\left(1-\theta_k\right)^{1-\gamma_k}$. where $\theta_k=\operatorname{Pr}\left(\gamma_{a b, k}=1 \mid g_{a b}=1\right)$, and $\gamma_{a b, k}$ is the $k^{\text {th }}$ component of $\gamma_{a b}$. $u(\gamma ; \xi)$ can be modeled as above with parameters $\xi_k$ instead of $\theta_k$, where $\xi_k=\operatorname{Pr}\left(\gamma_{a b, k}=1 \mid g_{a b}=0\right)$. The goal in the MEC method is to find the set of pairs that maximize the entropy function. There is different possibilities for estimating required parameters. One the one hand we can use number of agreements (non-agreements) on the key variables, on the other hand maximum likelihood method is considering as well. In the default setting we have
$$\theta_k^{(t)}=\frac{1}{\left|M^{(t)}\right|} \sum_{(a, b) \in \Omega} g_{a b}^{(t)} \mathbb{I}\left(\gamma_{a b, k}=1\right)$$
Another propose is to derive $\theta_k$ from all the pairs in $\Omega$ whereas $\theta_k$ given as above uses only the pairs from the MEC set $M$, then we have
$$\theta_k^{(t)}=\frac{1}{n_M^{(t)}} \sum_{(a, b) \in \Omega} \hat{g}_{a b}^{(t)} \gamma_{a b, k}$$
For $\xi_k$ estimation we consider rate agreements from all the pairs in $\Omega$ in which case there is no updating of $u\left(\gamma ; \xi^{(t)}\right)$ in the iterative algorithm for the unsupervised setting.
$$\hat{\xi}_k=\frac{1}{n} \sum_{(a, b) \in \Omega} \mathbb{I}\left(\gamma_{(a b, k)}=1\right)$$
## MEC for unsupervised learning - iterative algorithm
One can divide maximum entropy classification into two machine learning techniques - supervised and unsupervised learning. The second method provides iterative algorithm to obtain terminal result. D. Lee, L. C. Zhang and J. K. Kim propose following procedure.
1. set $\boldsymbol{\theta}^{(0)}=\left(\theta_1^{(0)}, \ldots, \theta_K^{(0)}\right)$ and $n_M^{(0)}=\left|M_1\right|$ , where $M_1$ is the maximal MEC set (only consists of the record pairs with perfect agreement of all the key variables).
2. For the t-th iteration let $g_{a b}^{(t)}=1$ if $(a, b) \in M^{(t)}$, and 0 otherwise.
<!-- -->
i) update $u\left(\gamma ; \xi^{(t)}\right)$ and $m(\gamma ; \theta^{(t)})$, given $\mathbf{g}^{(t)}=\left\{g_{a b}^{(t)}:(a, b) \in \Omega\right\}$, using one of the estimation methods described above. Once $\boldsymbol{\theta}^{(t)}$ and $\xi^{(t)}$ are obtained, we can update $n_M^{(t)}=\sum_\gamma n(\gamma) \hat{g}^{(t)}(\gamma)$, where $$
\hat{g}^{(t)}(\gamma) \equiv \hat{g}\left(\gamma ; \boldsymbol{\theta}^{(t)}, \xi^{(t)}\right)=\min \left\{\frac{\left|M^{(t)}\right| r^{(t)}(\gamma)}{\left|M^{(t)}\right|\left(r^{(t)}(\gamma)-1\right)+n}, 1\right\}
$$ and $r^{(t)}(\gamma)$ is a probability ratio for current setting.
ii) For given $n_{M}^{(t)}$ we find the maximal entropy classification set $M^{(t+1)} = =\left\{(a, b) \in \Omega: g_{a b}^{(t+1)}=1\right\}$ such that $\left|M^{(t+1)}\right|=n_M^{(t)}$ by deduplication in the descending order of probability ratio over $\Omega$. It maximizes the entropy function.
<!-- -->
3. Iterate until $n_M^{(t)}=n_M^{(t+1)}$ or $\left\|\boldsymbol{\theta}^{(t)}-\boldsymbol{\theta}^{(t+1)}\right\|<\epsilon$.
## Bisection procedure for MEC guided by the error rates
MEC for record linkage should generally be guided by the error rates - false link rate (FLR) and missing match rate (MMR), for whose bisection procedure by a threshold value is derived. Errors are defining as follows
$$\psi=\frac{1}{|\hat{M}|} \sum_{(a, b) \in \hat{M}}\left(1-g_{a b}\right)$$
$$\tau=1-\frac{1}{n_M} \sum_{(a, b) \in \hat{M}} g_{a b}$$for FLR and MMR respectively. With the guiding algorithm be the error rates we can add following points to the performed procedure.
2. Let $\psi$ be the target FLR
<!-- -->
i) Choose a threshold value $c$ and form the corresponding MEC set $\hat{M_c}$, where $\hat{r}_{ab} \ge c$ for any $(a,b) \in \hat{M_c}$.
ii) Calculate the estimated FLR and MMR of the resulting set $\hat{M_c}$. if $\hat{\psi} \ge \psi$ then increase $c$, otherwise reduce $c$.
Iteration between these two stops would eventually lead to the value $c$ that makes $\hat{\psi}$ as close as possible to the target $\psi$.
## Supervised learning
For the supervised method we suppose $\operatorname{M}$ observed for given $\operatorname{\Omega}$ and trained classifier is to be applied for the pairs outside of the $\operatorname{\Omega}$. Algorithm:
1. obtain parameters from observed pairs and compute probability ratio $r(\gamma)$ for pairs outside of $\Omega$.
2. Solve following fixed equation so as to find optimal size of objective matching set. $$
n_M=\sum_{(a, b) \in \Omega} \hat{g}\left(\gamma_{a b}\right)=\sum_{\gamma \in S} n(\gamma) \hat{g}(\gamma)
$$ where $$
\hat{g}(\gamma):=\operatorname{Pr}\left(g_{a b}=1 \mid \gamma_{a b}=\gamma\right)=\frac{\pi r(\gamma)}{\pi(r(\gamma)-1)+1}=\frac{n_M r(\gamma)}{n_M(r(\gamma)-1)+n}
$$
3. Obtain $\operatorname{M}$ set by deduplication in the descending order of $r(\gamma)$ such that $|\operatorname{M}| = n_M$.
# Package installation and Usage
No other libraries need to be installed to use the `mecRecordLinkage` package. All functionalities are build in. For preparing data sets and solving fixed equations, `reclin2` and `FixedPoints` packages are used inside of the `mecRecordLinkage`.
```{r setup, echo=FALSE, message=FALSE}
remotes::install_github("ncn-foreigners/mecRecordLinkage")
library(mecRecordLinkage)
```
To perform example of using `mec` function we can use following data sets. Data are obtained from the [reclin tutorial](https://github.com/djvanderlaan/tutorial-reclin-uros2021) and contain columns with the most important attributes such as first name, last name or sex related to a given person (for more see documentation).
```{r}
library(dplyr)
data1 <- read.csv("data/example1.csv", stringsAsFactors = FALSE)
data2 <- read.csv("data/example2.csv", stringsAsFactors = FALSE)
data1 <- data1 %>% select(-id)
data2 <- data2 %>% select(-id)
head(data1)
head(data2)
```
## `mec` function for unsupervised learning
`mec` Function enables the application of an unsupervised learning. In the first case we link data sets on given variables assuming that year of birthday given person is a strata.
```{r}
vars <- c("firstname", "lastname", "sex", "dobday", "dobmon")
model <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear")
```
Let's look at the model parameters.
```{r}
model$params
```
Set of matching pairs.
```{r}
model$M
```
Depending on the estimation method, results can be different. You can define the method with the `control` parameters.
```{r}
model2 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear", control = control_mec(theta_est = "2"))
model2$M
```
## Guiding the algorithm by the error rates
```{r}
model_ER1 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear", error_rate = TRUE)
model_ER1$M
```
```{r}
model_ER2 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear",
control = control_mec(theta_est = "2", increase_rate = .6), error_rate = TRUE)
model_ER2$M
```
## Other packages for record linkage
Let's compare it to the results for the `reclin2` package, where `EM` algorithm is implemented.
```{r}
pairs <- reclin2::pair_blocking(data1, data2, on = "dobyear")
reclin2::compare_pairs(pairs, on = vars, inplace = TRUE)
m <- reclin2::problink_em(~lastname + firstname +
sex + dobday + dobmon, data = pairs)
pairs <- predict(m, pairs = pairs, add = TRUE)
pairs <- reclin2::select_threshold(pairs, "threshold", score = "weights",
threshold = 8)
pairs <- reclin2::select_n_to_m(pairs, "weights", variable = "ntom",
threshold = 0)
linked_data <- reclin2::link(pairs, selection = "ntom", all = FALSE)
subset(pairs[pairs$threshold == TRUE, ], select = -c(weights, threshold, ntom))
subset(pairs[pairs$ntom == TRUE, ], select = -c(weights, threshold, ntom))
```
Notice that this method requires the user to select a threshold, while with the maximal MEC there is no need to define that, what makes this method more automated.
## `mecSup` function for supervised learning
```{r}
set.seed(123)
idx1 <- sample.int(n = nrow(data1)/10)
data1_test <- data1[idx1, ]
data1_train <- data1[-idx1, ]
idx2 <- sample.int(n = nrow(data2)/10)
data2_test <- data2[idx2, ]
data2_train <- data2[-idx2, ]
vars <- c("firstname", "lastname", "sex", "dobday", "dobmon")
model <- mec(A = data1_train, B = data2_train, vars = vars, blockvars = "dobyear")
M <- model$M
U <- model$U
Omega <- subset(rbind(M, U), select = c(-.x, -.y, -r))
g <- Omega %>% select(selected)
Omega <- Omega %>% select(-selected)
############ MecRecordLinkage ##############
test1 <- mecSup(A = data1_test, B = data2_test, Omega = Omega,
g = g, vars = vars, blockvars = "dobyear", prob_ratio = "1")
test2 <- mecSup(A = data1_test, B = data2_test, Omega = Omega,
g = g, vars = vars, blockvars = "dobyear", prob_ratio = "2")
mec1 <- nrow(test1$linked_data)
mec2 <- nrow(test2$linked_data)
```
```{r}
############ reclin2 ##############
pairs <- reclin2::pair_blocking(data1_test, data2_test, on ="dobyear") |>
reclin2::compare_pairs(on = vars, inplace = TRUE)
m <- reclin2::problink_em(~firstname + lastname + sex + dobday + dobmon, data = pairs)
pairs <- predict(m, pairs = pairs, add = TRUE)
model_recc <- pairs |>
reclin2::select_threshold("threshold", score = "weights", threshold = 8) |>
reclin2::select_n_to_m("weights", variable = "ntom", threshold = 0) %>% filter(ntom == TRUE)
model_rec <- model_recc %>% reclin2::link(selection = "ntom", all = FALSE) |>
as.data.frame()
rec <- nrow(model_rec)
```
```{r}
############ glm ##############
pairs <- reclin2::pair_blocking(data1_test, data2_test, on ="dobyear") |>
reclin2::compare_pairs(on = vars, inplace = TRUE)
pairs[is.na(pairs)] <- FALSE
train_data <- cbind(Omega, g = g$selected)
test_data <- pairs %>% select(c(-.x, -.y))
model_glm <- glm(g ~ ., data = train_data, family = binomial())
prob <- predict(model_glm, test_data, type = "response")
glm <- length(na.omit(prob[prob > 1/2]))
# Size of the final set by each method
data.frame(Mec1 = mec1, Mec2 = mec2, reclin2 = rec, glm = glm)
```
Work on this package is supported by the the National Science Center, OPUS 22 grant no. 2020/39/B/HS4/00941.