This repository has been archived by the owner on Oct 26, 2023. It is now read-only.

Designing a disambiguation model #8

Open
pudo opened this issue Oct 17, 2022 · 2 comments

Comments

@pudo

pudo commented Oct 17, 2022

Progress

StoryWeb can now load articles, run them through spaCy NER, and store the extracted entity tags (e.g. John Doe) in a database. In that database, each tag is identified per article, i.e. (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags (A and B are the same, unrelated, or have some semantic link - e.g. family). Once two tags share a same-as link between them they are considered a cluster, i.e. they essentially become the same node in the graph.
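The model above can be sketched in a few lines. This is a minimal illustration, not StoryWeb's actual schema - all class and field names here are assumptions:

```python
from dataclasses import dataclass

# Hypothetical sketch of the tag/link model described above;
# names and fields are illustrative assumptions, not StoryWeb's code.
@dataclass(frozen=True)
class Tag:
    article_id: str   # tags are scoped to a single article
    label: str        # surface form, e.g. "John Doe"
    mentions: int     # count of mentions within that article

@dataclass(frozen=True)
class Link:
    source: Tag
    target: Tag
    judgement: str    # "same", "unrelated", or a semantic type like "family"

# Two tags joined by a "same" link collapse into one cluster,
# i.e. one node in the graph.
a = Tag("article-1", "John Doe", 3)
b = Tag("article-2", "John Doe", 1)
link = Link(a, b, "same")
```

Keeping `article_id` on the tag is what makes disambiguation possible: the same surface form in two articles stays two distinct records until a link merges them.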

There is also a small UI that lets users make those links manually - both between different tags in the same article, and between tags with the same surface form across different articles.

(Screenshot 2022-10-17 at 12:58:42: the manual linking UI)

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing this whole thing manually is intensely annoying and not even good for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, then show the rest to the user and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this would leave a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that's explainable, where we can even re-compute clustering proposals while the user is providing input (active learning).
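One simple way to turn the co-occurrence idea into a number is set overlap, e.g. Jaccard similarity between the two tags' co-occurrence sets. A minimal sketch, with invented tag sets:

```python
# Hedged sketch: compare two candidate tags by the overlap of their
# co-occurrence sets. The sets below are illustrative, not a real corpus.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two co-occurrence sets, in [0.0, 1.0]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

john_a = {"Jane Doe", "Italy"}
john_b = {"MegaCorp Ltd.", "State Prosecutor"}
john_c = {"Jane Doe", "Italy", "Rome"}

print(jaccard(john_a, john_b))  # disjoint contexts -> 0.0
print(jaccard(john_a, john_c))  # shared context -> 2/3
```

A raw overlap score like this could serve as one feature among several for a classifier, rather than a decision rule on its own.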

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input for some very simple machine learning model, and get back both a set of judgements and a confidence score for each? That way I could a) auto-decide the ones the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and then c) re-train the model with these additional judgements.
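The decide/ask/re-train loop described above can be sketched independently of the model choice: any scorer that emits a same-entity probability supports auto-deciding confident pairs and surfacing the most uncertain ones (those closest to 0.5) for review. All thresholds, the toy scorer, and the pair data below are illustrative assumptions:

```python
# Hedged sketch of steps a) - c): triage candidate pairs by model confidence.
def triage(pairs, score, hi=0.9, lo=0.1):
    """Split pairs into auto-same, auto-different, and a review queue
    ordered by uncertainty (closest to 0.5 first)."""
    auto_same, auto_diff, ask_user = [], [], []
    for pair in pairs:
        p = score(pair)
        if p >= hi:
            auto_same.append(pair)
        elif p <= lo:
            auto_diff.append(pair)
        else:
            # uncertainty sampling: most informative pairs first
            ask_user.append((abs(p - 0.5), pair))
    ask_user.sort(key=lambda t: t[0])
    return auto_same, auto_diff, [pair for _, pair in ask_user]

# Toy scorer: co-occurrence overlap stands in for a trained model's
# predicted probability.
def score(pair):
    a, b = pair
    return len(a & b) / max(len(a | b), 1)

pairs = [
    ({"Jane Doe", "Italy"}, {"Jane Doe", "Italy"}),             # clearly same
    ({"Jane Doe", "Italy"}, {"MegaCorp Ltd."}),                 # clearly different
    ({"Jane Doe", "Italy"}, {"Jane Doe", "State Prosecutor"}),  # uncertain
]
same, diff, ask = triage(pairs, score)
```

After the user judges the queued pairs, their labels feed back into re-training the scorer, and the triage is re-run - that's the active-learning loop in miniature.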

Some things I've pondered:

  • Using tf/idf on the tags to score down the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing that, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin") the co-occurrence ends up generic as well.
  • Maybe this could be a simple Bayes classifier (given co-occurring tags A, B, C, what's the likelihood of this being entity X)?
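The TF-IDF idea from the first bullet can be sketched with plain document frequencies: tags that appear in most articles (like Russia in this corpus) get a low weight. The toy corpus below is invented for illustration:

```python
import math

# Hedged sketch of IDF-style down-weighting of super-common tags;
# the "corpus" here is invented.
articles = {
    "a1": {"John Doe", "Russia", "MegaCorp Ltd."},
    "a2": {"Jane Doe", "Russia"},
    "a3": {"Russia", "State Prosecutor"},
    "a4": {"Jane Doe", "Italy"},
}

def idf(tag: str) -> float:
    """log(N / document frequency): near zero for near-ubiquitous tags."""
    df = sum(1 for tags in articles.values() if tag in tags)
    return math.log(len(articles) / df)

print(idf("Russia"))    # in 3 of 4 articles -> low weight
print(idf("Jane Doe"))  # in 2 of 4 articles -> higher weight
```

As noted above, this helps but doesn't solve the problem: the co-occurrence neighbourhoods of very common names stay generic even after down-weighting.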

Stuff I want to avoid

@slavaGanzin

@pudo Friedrich, very interesting thoughts. I am working on something similar, and I think that using spaCy relation extraction, co-occurrences can be factorized using inverted information entropy (or, in other words, transferred information quantity), which in an oversimplified way can be represented as 1/(term frequency). It's just a starting point, not an ideal solution.

Jane Doe established Fancy Corp in Italy
established - would be a strong link (tf is small)
in - would be a light link (tf is high)

So for co-occurrences it may be:

Jane Doe - Fancy Corp: +1 link * high coefficient, because 'established' is a really rare link type
Jane Doe - Italy: +0.5 link (it's a link from a link, so we discount it) * negligible coefficient, because 'in' is a really popular link type

Of course it should also use synonyms and all the other stuff - I'm oversimplifying the idea.
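If I understand the proposal right, it could be sketched roughly like this - the relation-frequency table, the 0.5 indirect-link discount, and the raw 1/tf scale are all illustrative assumptions:

```python
# Hedged sketch of the proposed 1/(term frequency) link weighting;
# the frequency counts below are invented.
relation_counts = {"established": 3, "in": 400}
total = sum(relation_counts.values())

def link_weight(relation: str, discount: float = 1.0) -> float:
    """Rarer relation words carry more information: weight ~ discount / tf.
    A log-scaled variant would tame the raw magnitudes."""
    tf = relation_counts[relation] / total
    return discount / tf

direct = link_weight("established")          # Jane Doe - Fancy Corp
indirect = link_weight("in", discount=0.5)   # Jane Doe - Italy, link from a link
```

So the rare 'established' edge ends up far heavier than the ubiquitous 'in' edge, matching the worked example above.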

P.S. I also think a Bayes classifier is a great idea. I did this for human-in-the-loop theme classification and it works like magic. I didn't scale it up to a production system, but the prototypes worked well.

@pudo

pudo commented Oct 18, 2022

@slavaGanzin I agree with what you're proposing, but my sense is that we need to address disambiguation/same-as before we can do other edge types. For tags that are not linked between articles, the co-occurrence count is technically always at most 1 ("in the article in which John Doe is mentioned, there's also Fancy Corp") - unless we consider all mentions of the same tag to refer to the same entity, which is sort of where this misery started :)
