mecRecordLinkage.qmd

---
title: "mecRecordLinkage"
author: "Łukasz Chrostowski"
output:
  html_document: default
format: 
  html:
    self-contained: true
    table-of-contents: true
    number-sections: true
editor: visual
execute: 
  eval: true
  warning: false
  message: false
library:
  - dplyr
  - reclin2
---

# Introduction

The goal of the main functions available in the package is to link data sets using maximum entropy classifier. Combining different sources of data is widely using step in a data analysis. Many various techniques have been considering during the years so as to link the records basing on a comparison vectors and m- and u- probabilities estimation. The most useful is probabilistic record linkage derived by `Fellegi and Sunter`, where parameters are estimated using `EM` algorithm (`Winkler`) (see `reclin2` package). One can use also machine learning methods, where process of matching pairs of records into match and non-match sets is considering as a classification problem. You can use `glm` with `binomial` or `xgboost` functions to classify the pairs using machine learning. In our setting, pairs matching is a classification problem as well, but matching process is more automated in comparison to the probabilistic method. MEC technique was proposed by [D. Lee, L. C. Zhang and J. K. Kim. Maximum entropy classification for record linkage (2022)](https://arxiv.org/abs/2009.14797). Authors derive supervised and unsupervised methods with full set parameters estimation techniques. `mecRecordLinkage` package contains direct implementation of the proposed algorithms. First we make an introduction to technical details, then show how to use package properly and finally compare the results with the `reclin2` package.

## Technical details

With record linkage we are trying to link two or more data sets using many common entities. The goal is to find the true matches among all possible pairs of the two data files. For this purposes we create so called `comparison space` $\Omega=A \times B=M \cup U$ consist of matches M and non-matches U between the record in given files. First of all we want to estimate probability ratio

$$
r_{a b}=\frac{m\left(\gamma\right)}{u\left(\gamma\right)}
$$ where $m\left(\gamma\right)$ and $u\left(\gamma\right)$ are the probability mass functions for M and U set directly. In maximum entropy classification we classify each pair of records to these sets depending of the $r$ value. We consider several ways to estimate probability ratio. Let $\quad g_{a b}=1$ if $\quad(a, b) \in M \quad$ and $0$ if $\quad(a, b) \in U$. In the first setting we can create a model of $m\left(\gamma\right)$ by $m(\gamma ; \boldsymbol{\theta})=\prod_{k=1}^K \theta_k^{\gamma_k}\left(1-\theta_k\right)^{1-\gamma_k}$. where $\theta_k=\operatorname{Pr}\left(\gamma_{a b, k}=1 \mid g_{a b}=1\right)$, and $\gamma_{a b, k}$ is the $k^{\text {th }}$ component of $\gamma_{a b}$. $u(\gamma ; \xi)$ can be modeled as above with parameters $\xi_k$ instead of $\theta_k$, where $\xi_k=\operatorname{Pr}\left(\gamma_{a b, k}=1 \mid g_{a b}=0\right)$. The goal in the MEC method is to find the set of pairs that maximize the entropy function. There is different possibilities for estimating required parameters. One the one hand we can use number of agreements (non-agreements) on the key variables, on the other hand maximum likelihood method is considering as well. In the default setting we have

$$\theta_k^{(t)}=\frac{1}{\left|M^{(t)}\right|} \sum_{(a, b) \in \Omega} g_{a b}^{(t)} \mathbb{I}\left(\gamma_{a b, k}=1\right)$$

Another propose is to derive $\theta_k$ from all the pairs in $\Omega$ whereas $\theta_k$ given as above uses only the pairs from the MEC set $M$, then we have

$$\theta_k^{(t)}=\frac{1}{n_M^{(t)}} \sum_{(a, b) \in \Omega} \hat{g}_{a b}^{(t)} \gamma_{a b, k}$$

For $\xi_k$ estimation we consider rate agreements from all the pairs in $\Omega$ in which case there is no updating of $u\left(\gamma ; \xi^{(t)}\right)$ in the iterative algorithm for the unsupervised setting.

$$\hat{\xi}_k=\frac{1}{n} \sum_{(a, b) \in \Omega} \mathbb{I}\left(\gamma_{(a b, k)}=1\right)$$

## MEC for unsupervised learning - iterative algorithm

One can divide maximum entropy classification into two machine learning techniques - supervised and unsupervised learning. The second method provides iterative algorithm to obtain terminal result. D. Lee, L. C. Zhang and J. K. Kim propose following procedure.

1.  set $\boldsymbol{\theta}^{(0)}=\left(\theta_1^{(0)}, \ldots, \theta_K^{(0)}\right)$ and $n_M^{(0)}=\left|M_1\right|$ , where $M_1$ is the maximal MEC set (only consists of the record pairs with perfect agreement of all the key variables).
2.  For the t-th iteration let $g_{a b}^{(t)}=1$ if $(a, b) \in M^{(t)}$, and 0 otherwise.

<!-- -->

i)  update $u\left(\gamma ; \xi^{(t)}\right)$ and $m(\gamma ; \theta^{(t)})$, given $\mathbf{g}^{(t)}=\left\{g_{a b}^{(t)}:(a, b) \in \Omega\right\}$, using one of the estimation methods described above. Once $\boldsymbol{\theta}^{(t)}$ and $\xi^{(t)}$ are obtained, we can update $n_M^{(t)}=\sum_\gamma n(\gamma) \hat{g}^{(t)}(\gamma)$, where $$
    \hat{g}^{(t)}(\gamma) \equiv \hat{g}\left(\gamma ; \boldsymbol{\theta}^{(t)}, \xi^{(t)}\right)=\min \left\{\frac{\left|M^{(t)}\right| r^{(t)}(\gamma)}{\left|M^{(t)}\right|\left(r^{(t)}(\gamma)-1\right)+n}, 1\right\}
    $$ and $r^{(t)}(\gamma)$ is a probability ratio for current setting.
ii) For given $n_{M}^{(t)}$ we find the maximal entropy classification set $M^{(t+1)} = =\left\{(a, b) \in \Omega: g_{a b}^{(t+1)}=1\right\}$ such that $\left|M^{(t+1)}\right|=n_M^{(t)}$ by deduplication in the descending order of probability ratio over $\Omega$. It maximizes the entropy function.

<!-- -->

3.  Iterate until $n_M^{(t)}=n_M^{(t+1)}$ or $\left\|\boldsymbol{\theta}^{(t)}-\boldsymbol{\theta}^{(t+1)}\right\|<\epsilon$.

## Bisection procedure for MEC guided by the error rates

MEC for record linkage should generally be guided by the error rates - false link rate (FLR) and missing match rate (MMR), for whose bisection procedure by a threshold value is derived. Errors are defining as follows

$$\psi=\frac{1}{|\hat{M}|} \sum_{(a, b) \in \hat{M}}\left(1-g_{a b}\right)$$

$$\tau=1-\frac{1}{n_M} \sum_{(a, b) \in \hat{M}} g_{a b}$$for FLR and MMR respectively. With the guiding algorithm be the error rates we can add following points to the performed procedure.

2.  Let $\psi$ be the target FLR

<!-- -->

i)  Choose a threshold value $c$ and form the corresponding MEC set $\hat{M_c}$, where $\hat{r}_{ab} \ge c$ for any $(a,b) \in \hat{M_c}$.
ii) Calculate the estimated FLR and MMR of the resulting set $\hat{M_c}$. if $\hat{\psi} \ge \psi$ then increase $c$, otherwise reduce $c$.

Iteration between these two stops would eventually lead to the value $c$ that makes $\hat{\psi}$ as close as possible to the target $\psi$.

## Supervised learning

For the supervised method we suppose $\operatorname{M}$ observed for given $\operatorname{\Omega}$ and trained classifier is to be applied for the pairs outside of the $\operatorname{\Omega}$. Algorithm:

1.  obtain parameters from observed pairs and compute probability ratio $r(\gamma)$ for pairs outside of $\Omega$.
2.  Solve following fixed equation so as to find optimal size of objective matching set. $$
    n_M=\sum_{(a, b) \in \Omega} \hat{g}\left(\gamma_{a b}\right)=\sum_{\gamma \in S} n(\gamma) \hat{g}(\gamma)
    $$ where $$
    \hat{g}(\gamma):=\operatorname{Pr}\left(g_{a b}=1 \mid \gamma_{a b}=\gamma\right)=\frac{\pi r(\gamma)}{\pi(r(\gamma)-1)+1}=\frac{n_M r(\gamma)}{n_M(r(\gamma)-1)+n}
    $$
3.  Obtain $\operatorname{M}$ set by deduplication in the descending order of $r(\gamma)$ such that $|\operatorname{M}| = n_M$.

# Package installation and Usage

No other libraries need to be installed to use the `mecRecordLinkage` package. All functionalities are build in. For preparing data sets and solving fixed equations, `reclin2` and `FixedPoints` packages are used inside of the `mecRecordLinkage`.

```{r setup, echo=FALSE, message=FALSE}
remotes::install_github("ncn-foreigners/mecRecordLinkage")
library(mecRecordLinkage)
```

To perform example of using `mec` function we can use following data sets. Data are obtained from the [reclin tutorial](https://github.com/djvanderlaan/tutorial-reclin-uros2021) and contain columns with the most important attributes such as first name, last name or sex related to a given person (for more see documentation).

```{r}
library(dplyr)
data1 <- read.csv("data/example1.csv", stringsAsFactors = FALSE)
data2 <- read.csv("data/example2.csv", stringsAsFactors = FALSE)
data1 <- data1 %>% select(-id)
data2 <- data2 %>% select(-id)
head(data1)
head(data2)
```

## `mec` function for unsupervised learning

`mec` Function enables the application of an unsupervised learning. In the first case we link data sets on given variables assuming that year of birthday given person is a strata.

```{r}
vars <- c("firstname", "lastname", "sex", "dobday", "dobmon")
model <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear")
```

Let's look at the model parameters.

```{r}
model$params
```

Set of matching pairs.

```{r}
model$M
```

Depending on the estimation method, results can be different. You can define the method with the `control` parameters.

```{r}
model2 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear", control = control_mec(theta_est = "2"))
model2$M
```

## Guiding the algorithm by the error rates

```{r}
model_ER1 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear", error_rate = TRUE)
model_ER1$M
```

```{r}
model_ER2 <- mec(A = data1, B = data2, vars = vars, blockvars = "dobyear",
                 control = control_mec(theta_est = "2", increase_rate = .6), error_rate = TRUE)
model_ER2$M
```

## Other packages for record linkage

Let's compare it to the results for the `reclin2` package, where `EM` algorithm is implemented.

```{r}
pairs <- reclin2::pair_blocking(data1, data2, on = "dobyear")
reclin2::compare_pairs(pairs, on = vars, inplace = TRUE)
m <- reclin2::problink_em(~lastname + firstname + 
                            sex + dobday + dobmon, data = pairs)

pairs <- predict(m, pairs = pairs, add = TRUE)

pairs <- reclin2::select_threshold(pairs, "threshold", score = "weights", 
                                   threshold = 8)
pairs <- reclin2::select_n_to_m(pairs, "weights", variable = "ntom", 
                                threshold = 0)

linked_data <- reclin2::link(pairs, selection = "ntom", all = FALSE)
subset(pairs[pairs$threshold == TRUE, ], select = -c(weights, threshold, ntom))
subset(pairs[pairs$ntom == TRUE, ],  select = -c(weights, threshold, ntom))
  
```

Notice that this method requires the user to select a threshold, while with the maximal MEC there is no need to define that, what makes this method more automated.

## `mecSup` function for supervised learning

```{r}

set.seed(123)
idx1 <- sample.int(n = nrow(data1)/10)
data1_test <- data1[idx1, ]
data1_train <- data1[-idx1, ]

idx2 <- sample.int(n = nrow(data2)/10)
data2_test <- data2[idx2, ]
data2_train <- data2[-idx2, ]

vars <- c("firstname", "lastname", "sex", "dobday", "dobmon")

model <- mec(A = data1_train, B = data2_train, vars = vars, blockvars = "dobyear")
M <- model$M
U <- model$U

Omega <- subset(rbind(M, U), select = c(-.x, -.y, -r))
g <- Omega %>% select(selected)
Omega <- Omega %>% select(-selected)

############ MecRecordLinkage ##############

test1 <- mecSup(A = data1_test, B = data2_test, Omega = Omega,
                g = g, vars = vars, blockvars = "dobyear", prob_ratio = "1")
test2 <- mecSup(A = data1_test, B = data2_test, Omega = Omega,
                g = g, vars = vars, blockvars = "dobyear", prob_ratio = "2")

mec1 <- nrow(test1$linked_data)
mec2 <- nrow(test2$linked_data)
```

```{r}
############ reclin2 ##############

pairs <- reclin2::pair_blocking(data1_test, data2_test, on ="dobyear") |>
  reclin2::compare_pairs(on = vars, inplace = TRUE)

m <- reclin2::problink_em(~firstname + lastname + sex + dobday + dobmon, data = pairs)

pairs <- predict(m, pairs = pairs, add = TRUE)

model_recc <- pairs |>
  reclin2::select_threshold("threshold", score = "weights", threshold = 8) |>
  reclin2::select_n_to_m("weights", variable = "ntom",  threshold = 0) %>% filter(ntom == TRUE)

model_rec <- model_recc %>% reclin2::link(selection = "ntom", all = FALSE)  |>
  as.data.frame()

rec <- nrow(model_rec)
```

```{r}
############ glm ##############

pairs <- reclin2::pair_blocking(data1_test, data2_test, on ="dobyear") |>
  reclin2::compare_pairs(on = vars, inplace = TRUE)

pairs[is.na(pairs)] <- FALSE

train_data <- cbind(Omega, g = g$selected)
test_data <- pairs %>% select(c(-.x, -.y))

model_glm <- glm(g ~ ., data = train_data, family = binomial())

prob <- predict(model_glm, test_data, type = "response")
glm <- length(na.omit(prob[prob > 1/2]))

# Size of the final set by each method
data.frame(Mec1 = mec1, Mec2 = mec2, reclin2 = rec, glm = glm)
```

Work on this package is supported by the the National Science Center, OPUS 22 grant no. 2020/39/B/HS4/00941.