Use case: assessing goodness of fit between two PyKEEN models #11

Open
cthoyt opened this issue Aug 5, 2021 · 1 comment

cthoyt (Contributor) commented Aug 5, 2021

If I have two different embedding spaces describing the same entities, for example from training two models on the same dataset in PyKEEN, how can I use Kiez to assess how well they correspond? Or is there a notion of how "good" the Kiez fit is?

A naive idea: iterate through each entity, calculate the overlap coefficient of its nearest neighbors in the two embedding spaces, and report the average overlap coefficient. I'm sure I could come up with a few things like this, but I bet you know better! Any ideas appreciated.
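
The averaging idea above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names `knn_indices` and `mean_overlap_coefficient` are hypothetical, neighbors are found by brute-force Euclidean distance rather than through Kiez, and the random matrices stand in for real entity embeddings whose rows are in the same entity order.

```python
import numpy as np


def knn_indices(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest rows (Euclidean) for every row, excluding the row itself."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # an entity is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def mean_overlap_coefficient(left: np.ndarray, right: np.ndarray, k: int = 5) -> float:
    """Average over entities of |A ∩ B| / min(|A|, |B|) for the two neighbor sets."""
    scores = []
    for a, b in zip(knn_indices(left, k), knn_indices(right, k)):
        a_set, b_set = set(a.tolist()), set(b.tolist())
        scores.append(len(a_set & b_set) / min(len(a_set), len(b_set)))
    return float(np.mean(scores))


# Toy stand-ins for two entity embedding matrices (same entities, same row order).
rng = np.random.default_rng(0)
space_a = rng.normal(size=(14, 8))
space_b = rng.normal(size=(14, 16))  # the two models may use different dimensions

print(mean_overlap_coefficient(space_a, space_a))  # identical spaces give 1.0
print(mean_overlap_coefficient(space_a, space_b))  # unrelated spaces
```

Because both neighbor sets have exactly k elements here, the overlap coefficient reduces to the shared fraction of the k neighbors. The comparison only uses within-space neighborhoods, so the two spaces do not need the same dimension.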

I would start with code like this:

from pykeen.pipeline import pipeline
from pykeen.datasets import Nations

dataset = Nations()

# Train the same dataset with two different models
r1 = pipeline(
    model='TransE',
    dataset=dataset,
    epochs=1,  # change this to ~25 for real usage on Nations
)

r2 = pipeline(
    model='PairRE',
    dataset=dataset,
    epochs=1,  # change this to ~25 for real usage on Nations
)

from kiez import Kiez

k_inst = Kiez()
k_inst.fit(
    # .cpu() keeps the conversion working if the pipeline trained on a GPU
    r1.model.entity_representations[0]().detach().cpu().numpy(),
    r2.model.entity_representations[0]().detach().cpu().numpy(),
)

# How do I assess how well these spaces correspond? Is there a metric for how "good" the fit is?

dobraczka (Owner) commented Aug 6, 2021

I'm afraid this would not work properly, because Kiez assumes that the two embeddings are in the same space. Based on that assumption, it performs hubness reduction and returns the k nearest neighbors of the source entities in the target entity space.
The embedding approaches for these types of problems rely on training data in the form of known entity duplicates between data sources, and during training they try to embed these known duplicates close together in the space.
For an overview of the entity alignment setting Kiez was built for, see e.g. this paper: OpenEA Benchmark
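
To illustrate why the shared-space assumption matters, here is a small NumPy sketch. It does not use Kiez or PyKEEN. It assumes a toy case in which the second space is just a randomly rotated copy of the first, which is far milder than the difference between two independently trained models; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_entities, dim, k = 50, 16, 5

# Space A: toy embeddings. Space B: the same points under a random orthogonal transform.
source = rng.normal(size=(n_entities, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
target = source @ rotation


def pairwise_distances(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Euclidean distance between every row of left and every row of right."""
    return np.linalg.norm(left[:, None, :] - right[None, :, :], axis=-1)


def within_space_neighbors(embeddings: np.ndarray, k: int) -> np.ndarray:
    """k nearest neighbors of each row inside its own space, excluding itself."""
    dists = pairwise_distances(embeddings, embeddings)
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :k]


# Inside each space the neighbor sets agree, since the transform preserves distances.
same_neighbor_sets = all(
    set(a.tolist()) == set(b.tolist())
    for a, b in zip(within_space_neighbors(source, k), within_space_neighbors(target, k))
)

# Across spaces: how often is the closest target row to source row i the same entity i?
cross = pairwise_distances(source, target)
hits_at_1 = float(np.mean(np.argmin(cross, axis=1) == np.arange(n_entities)))

print(same_neighbor_sets)
print(hits_at_1)
```

Even though both matrices encode exactly the same neighborhood structure, searching for a source entity's neighbors directly among the target rows compares coordinates that are not aligned, so the cross-space lookup is not meaningful without some alignment step.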

However, what you are proposing is a really interesting investigation into how similar the results of different embedding approaches are.
The simplest implementation might be something like this:

from kiez import Kiez
from pykeen.datasets import Nations
from pykeen.pipeline import pipeline

dataset = Nations()
r1 = pipeline(model="TransE", dataset=dataset)
r2 = pipeline(model="PairRE", dataset=dataset)


# .cpu() keeps the conversion working if the pipeline trained on a GPU
transe_embeddings = r1.model.entity_representations[0]().detach().cpu().numpy()
pairre_embeddings = r2.model.entity_representations[0]().detach().cpu().numpy()

# old single-source api usage since I haven't released the patch yet
k_inst_transe = Kiez()
k_inst_transe.fit(transe_embeddings, transe_embeddings)
k_inst_pairre = Kiez()
k_inst_pairre.fit(pairre_embeddings, pairre_embeddings)

transe_k_neighbors = k_inst_transe.kneighbors(return_distance=False)
pairre_k_neighbors = k_inst_pairre.kneighbors(return_distance=False)


def overlap(left_neighbors, right_neighbors):
    """Average fraction of rank positions at which both spaces list the same neighbor."""
    rolling_fraction_sum = 0
    for l_ent_neighbors, r_ent_neighbors in zip(left_neighbors, right_neighbors):
        # count positions where the neighbor at a given rank is identical in both spaces
        matching = sum(
            left == right for left, right in zip(l_ent_neighbors, r_ent_neighbors)
        )
        rolling_fraction_sum += matching / len(l_ent_neighbors)
    return rolling_fraction_sum / len(left_neighbors)


print(overlap(transe_k_neighbors, pairre_k_neighbors))

But I think there should be a cleverer metric that takes into account how far apart the respective neighbors are ranked. I'd have to think about that one...
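
One possible rank-aware measure, offered only as a sketch and not as something proposed in this thread or provided by Kiez, is the mean per-entity Spearman correlation between distance rankings in the two spaces. The function names are hypothetical, distances are brute-force Euclidean, and the Spearman correlation is computed as the Pearson correlation of ranks, assuming no tied distances.

```python
import numpy as np


def rank_rows(values: np.ndarray) -> np.ndarray:
    """Rank the entries of each row (0 = smallest); ties are broken by position."""
    return np.argsort(np.argsort(values, axis=1), axis=1)


def distances_without_self(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances with the diagonal removed, shape (n, n - 1)."""
    n = embeddings.shape[0]
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    return dists[~np.eye(n, dtype=bool)].reshape(n, n - 1)


def mean_rank_correlation(left: np.ndarray, right: np.ndarray) -> float:
    """Mean over entities of the Spearman correlation between their neighbor rankings."""
    left_ranks = rank_rows(distances_without_self(left)).astype(float)
    right_ranks = rank_rows(distances_without_self(right)).astype(float)
    correlations = [
        np.corrcoef(l_row, r_row)[0, 1] for l_row, r_row in zip(left_ranks, right_ranks)
    ]
    return float(np.mean(correlations))


# Toy stand-ins for two entity embedding matrices (same entities, same row order).
rng = np.random.default_rng(1)
space_a = rng.normal(size=(14, 8))
space_b = rng.normal(size=(14, 8))

print(mean_rank_correlation(space_a, 2.0 * space_a))  # same rankings, different scale
print(mean_rank_correlation(space_a, space_b))  # unrelated spaces
```

This variant weighs every rank equally, so a swap among distant entities costs as much as a swap among the closest ones. A top-weighted measure would be a natural refinement if the nearest neighbors matter most.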
