This repository has been archived by the owner on Oct 26, 2023. It is now read-only.

Designing a disambiguation model #8

Open
pudo opened this issue Oct 17, 2022 · 2 comments

Comments

@pudo

pudo commented Oct 17, 2022

Progress

StoryWeb can now load articles, run them through spaCy NER, and store the extracted entity tags (e.g. John Doe) in a database. In that database, each tag is identified per article, i.e. (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags (A and B are the same, unrelated, or have some semantic link - e.g. family). Once two tags share a same-as link between them they are considered a cluster, i.e. they essentially become the same node in the graph.
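The model above can be sketched in a few lines. This is a minimal illustration, not StoryWeb's actual schema - all class and field names here are assumptions:

```python
from dataclasses import dataclass

# Hypothetical sketch of the tag/link model described above;
# names and fields are illustrative assumptions, not StoryWeb's code.
@dataclass(frozen=True)
class Tag:
    article_id: str   # tags are scoped to a single article
    label: str        # surface form, e.g. "John Doe"
    mentions: int     # count of mentions within that article

@dataclass(frozen=True)
class Link:
    source: Tag
    target: Tag
    judgement: str    # "same", "unrelated", or a semantic type like "family"

# Two tags joined by a "same" link collapse into one cluster,
# i.e. one node in the graph.
a = Tag("article-1", "John Doe", 3)
b = Tag("article-2", "John Doe", 1)
link = Link(a, b, "same")
```

Keeping `article_id` on the tag is what makes disambiguation possible: the same surface form in two articles stays two distinct records until a link merges them.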

There is also a small UI that lets users make those links manually - both between different tags in the same article, and between tags with the same surface form across different articles.

(Screenshot 2022-10-17 at 12:58:42: the manual linking UI)

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing this whole thing manually is intensely annoying and not even good for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, then show the rest to the user and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this would leave a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that's explainable, where we can even re-compute clustering proposals while the user is providing input (active learning).
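One simple way to turn the co-occurrence idea into a number is set overlap, e.g. Jaccard similarity between the two tags' co-occurrence sets. A minimal sketch, with invented tag sets:

```python
# Hedged sketch: compare two candidate tags by the overlap of their
# co-occurrence sets. The sets below are illustrative, not a real corpus.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two co-occurrence sets, in [0.0, 1.0]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

john_a = {"Jane Doe", "Italy"}
john_b = {"MegaCorp Ltd.", "State Prosecutor"}
john_c = {"Jane Doe", "Italy", "Rome"}

print(jaccard(john_a, john_b))  # disjoint contexts -> 0.0
print(jaccard(john_a, john_c))  # shared context -> 2/3
```

A raw overlap score like this could serve as one feature among several for a classifier, rather than a decision rule on its own.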

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input for some very simple machine learning model, and get back both a set of judgements and a confidence score for each? That way I could a) auto-decide the ones the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and then c) re-train the model with these additional judgements.
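The decide/ask/re-train loop described above can be sketched independently of the model choice: any scorer that emits a same-entity probability supports auto-deciding confident pairs and surfacing the most uncertain ones (those closest to 0.5) for review. All thresholds, the toy scorer, and the pair data below are illustrative assumptions:

```python
# Hedged sketch of steps a) - c): triage candidate pairs by model confidence.
def triage(pairs, score, hi=0.9, lo=0.1):
    """Split pairs into auto-same, auto-different, and a review queue
    ordered by uncertainty (closest to 0.5 first)."""
    auto_same, auto_diff, ask_user = [], [], []
    for pair in pairs:
        p = score(pair)
        if p >= hi:
            auto_same.append(pair)
        elif p <= lo:
            auto_diff.append(pair)
        else:
            # uncertainty sampling: most informative pairs first
            ask_user.append((abs(p - 0.5), pair))
    ask_user.sort(key=lambda t: t[0])
    return auto_same, auto_diff, [pair for _, pair in ask_user]

# Toy scorer: co-occurrence overlap stands in for a trained model's
# predicted probability.
def score(pair):
    a, b = pair
    return len(a & b) / max(len(a | b), 1)

pairs = [
    ({"Jane Doe", "Italy"}, {"Jane Doe", "Italy"}),             # clearly same
    ({"Jane Doe", "Italy"}, {"MegaCorp Ltd."}),                 # clearly different
    ({"Jane Doe", "Italy"}, {"Jane Doe", "State Prosecutor"}),  # uncertain
]
same, diff, ask = triage(pairs, score)
```

After the user judges the queued pairs, their labels feed back into re-training the scorer, and the triage is re-run - that's the active-learning loop in miniature.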

Some things I've pondered:

  • Using tf/idf on the tags to score down the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing that, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin") the co-occurrence ends up generic as well.
  • Maybe this could be a simple Bayes classifier (given co-occurring tags A, B, C, what's the likelihood of this being entity X)?
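The TF-IDF idea from the first bullet can be sketched with plain document frequencies: tags that appear in most articles (like Russia in this corpus) get a low weight. The toy corpus below is invented for illustration:

```python
import math

# Hedged sketch of IDF-style down-weighting of super-common tags;
# the "corpus" here is invented.
articles = {
    "a1": {"John Doe", "Russia", "MegaCorp Ltd."},
    "a2": {"Jane Doe", "Russia"},
    "a3": {"Russia", "State Prosecutor"},
    "a4": {"Jane Doe", "Italy"},
}

def idf(tag: str) -> float:
    """log(N / document frequency): near zero for near-ubiquitous tags."""
    df = sum(1 for tags in articles.values() if tag in tags)
    return math.log(len(articles) / df)

print(idf("Russia"))    # in 3 of 4 articles -> low weight
print(idf("Jane Doe"))  # in 2 of 4 articles -> higher weight
```

As noted above, this helps but doesn't solve the problem: the co-occurrence neighbourhoods of very common names stay generic even after down-weighting.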

Stuff I want to avoid

@slavaGanzin

@pudo Friedrich, very interesting thoughts. I am working on something similar, and I think that using spaCy relation extraction, co-occurrences can be factorized using inverted information entropy (or, in other words, transferred information quantity), which in an oversimplified way can be represented as 1/(term frequency). It's just a starting point, not an ideal solution.

Jane Doe established Fancy Corp in Italy
established - would be a strong link (tf is small)
in - would be a light link (tf is high)

So for co-occurrences it may be:

Jane Doe - Fancy Corp: +1 link * high coefficient, because 'established' is a really rare link type
Jane Doe - Italy: +0.5 link (it's a link from a link, so we discount it) * negligible coefficient, because 'in' is a really popular link type

Of course it should also use synonyms and all the other stuff - I'm oversimplifying the idea.
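If I understand the proposal right, it could be sketched roughly like this - the relation-frequency table, the 0.5 indirect-link discount, and the raw 1/tf scale are all illustrative assumptions:

```python
# Hedged sketch of the proposed 1/(term frequency) link weighting;
# the frequency counts below are invented.
relation_counts = {"established": 3, "in": 400}
total = sum(relation_counts.values())

def link_weight(relation: str, discount: float = 1.0) -> float:
    """Rarer relation words carry more information: weight ~ discount / tf.
    A log-scaled variant would tame the raw magnitudes."""
    tf = relation_counts[relation] / total
    return discount / tf

direct = link_weight("established")          # Jane Doe - Fancy Corp
indirect = link_weight("in", discount=0.5)   # Jane Doe - Italy, link from a link
```

So the rare 'established' edge ends up far heavier than the ubiquitous 'in' edge, matching the worked example above.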

P.S. I also think a Bayes classifier is a great idea. I did this for human-in-the-loop theme classification and it works like magic. I didn't scale it up to a production system, but the prototypes worked well.

@pudo

pudo commented Oct 18, 2022

@slavaGanzin I agree with what you're proposing, but my sense is that we need to address disambiguation/same-as before we can do other edge types. For tags that are not linked between articles, the co-occurrence count is technically always at most 1 ("in the article in which John Doe is mentioned, there's also Fancy Corp") - unless we consider all mentions of the same tag to refer to the same entity, which is sort of where this misery started :)
