We probably want to retire regression-v2 relatively quickly: it's never been used much and doesn't produce amazing results.
Addresses are an area where we are really not doing well. I don't even know what shape a plan would take there: serious address parsing/normalisation would require adopting something like libpostal, but that's a prohibitively large dependency. Perhaps we can make some marginal gains by using partial matching based on token alignment (see below).
Back-porting things from logic-v1
We've done a lot of work on logic-v1 that isn't yet being used by the regression matchers, in particular around identifiers and name matching. But there's a caveat: some of the logic-v1 matchers are "asymmetric" now, meaning they handle the query and the match candidate arguments differently, so using them for dedupe doesn't make sense.
We've got a lot of "name alignment" in the logic-v1 name matchers. This means that in a comparison between "smith, john" and "john smith", the tokens get re-ordered by doing pairwise string distance on them, and then an overall string distance is computed on the aligned names (cf. nomenklatura.matching.compare.names:_align_name_parts).
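To make the alignment idea concrete, here is a minimal sketch. It is not the code in `_align_name_parts`; the helper names (`tokenize`, `align_tokens`, `aligned_name_score`) are made up for illustration, the pairing is a simple greedy pass, and `difflib.SequenceMatcher` stands in for whatever string distance the real matcher uses.

```python
import re
from difflib import SequenceMatcher


def tokenize(name):
    """Lower-case a name and split it on anything that is not a word character."""
    return [tok for tok in re.split(r"\W+", name.lower()) if tok]


def similarity(left, right):
    """String similarity in [0, 1]; a stand-in for a real edit-distance measure."""
    return SequenceMatcher(None, left, right).ratio()


def align_tokens(left, right):
    """Greedily pair up the most similar tokens, then append any leftovers."""
    candidates = sorted(
        (
            (similarity(l_tok, r_tok), i, j)
            for i, l_tok in enumerate(left)
            for j, r_tok in enumerate(right)
        ),
        reverse=True,
    )
    used_left, used_right = set(), set()
    aligned_left, aligned_right = [], []
    for _score, i, j in candidates:
        if i in used_left or j in used_right:
            continue
        used_left.add(i)
        used_right.add(j)
        aligned_left.append(left[i])
        aligned_right.append(right[j])
    # Tokens without a partner go to the end, so they still count as a penalty.
    aligned_left.extend(t for i, t in enumerate(left) if i not in used_left)
    aligned_right.extend(t for j, t in enumerate(right) if j not in used_right)
    return aligned_left, aligned_right


def aligned_name_score(left_name, right_name):
    """Overall similarity of two names after their tokens have been aligned."""
    left, right = align_tokens(tokenize(left_name), tokenize(right_name))
    if not left or not right:
        return 0.0
    return similarity(" ".join(left), " ".join(right))
```

With this sketch, `aligned_name_score("smith, john", "john smith")` treats the two names as identical, because both sides are re-ordered into the same token sequence before the final comparison. The same kind of alignment could in principle be tried on address tokens for partial matching.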
We have a bunch of matchers for specific, strongly-typed identifiers (like INNs, OGRNs, SWIFT BICs, ISINs etc.) in nomenklatura.matching.compare.identifiers. It would be fun to see if these want to be regression features.
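As a rough idea of what a typed identifier could look like as a regression feature, the sketch below validates ISINs (two letters, nine alphanumerics, one check digit, checked with the Luhn algorithm after mapping letters to numbers) and emits 1.0 only when both sides share a valid one. The function names and the list-of-strings inputs are assumptions for illustration, not the API of nomenklatura.matching.compare.identifiers.

```python
import re

ISIN_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def is_valid_isin(value):
    """Check the ISIN shape and its Luhn check digit."""
    text = value.strip().upper()
    if ISIN_RE.fullmatch(text) is None:
        return False
    # Letters expand to two-digit numbers: A=10 ... Z=35; digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in text)
    total = 0
    for pos, ch in enumerate(reversed(digits)):
        digit = int(ch)
        if pos % 2 == 1:
            # Double every second digit counted from the right.
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def valid_isins(values):
    """Normalise a list of raw identifier strings to the set of valid ISINs."""
    return {v.strip().upper() for v in values if is_valid_isin(v)}


def isin_match_feature(left_ids, right_ids):
    """Hypothetical feature: 1.0 if both sides share a valid ISIN, else 0.0."""
    return 1.0 if valid_isins(left_ids) & valid_isins(right_ids) else 0.0
```

A feature like this only fires on values that pass validation, so it responds to a narrower signal than a generic "any identifier matches" feature; whether the two can coexist in one model would need testing.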
Specific stuff for de-dupe
Regression features seem to work best when they're fully independent. So we have to be careful with things like having features both for INNs and for identifiers in general, which react to the same signal in the data on some level. I also think that overlap in "signal" is what makes our DOB matching so bad.
We should test whether having a countries_overlap feature as well as a countries_disjoint feature is a good idea.
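One way the pair could be defined, as a sketch: both features stay at 0.0 when either side has no country at all, so together they distinguish "agree", "disagree" and "unknown". The signatures and the use of plain country-code lists are assumptions, not the existing feature definitions.

```python
def _codes(values):
    """Normalise country codes to a lower-case set, dropping empty values."""
    return {v.strip().lower() for v in values if v and v.strip()}


def countries_overlap(left, right):
    """1.0 if both sides name countries and share at least one."""
    return 1.0 if _codes(left) & _codes(right) else 0.0


def countries_disjoint(left, right):
    """1.0 if both sides name countries and share none of them."""
    left_codes, right_codes = _codes(left), _codes(right)
    if not left_codes or not right_codes:
        return 0.0  # missing data is not evidence of a mismatch
    return 1.0 if not (left_codes & right_codes) else 0.0
```

The two are mutually exclusive by construction, but they are still strongly related, which may matter given how sensitive the regression seems to be to overlapping signals.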
Birthdates/start dates matching in reg-v1 is deeply broken. We probably want to have both negative and positive features here, too, and both of them for year-only and day-precision.
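A sketch of what the four date features might look like, assuming dates arrive as ISO-8601 prefix strings such as "1970", "1970-03" or "1970-03-12". The feature names are hypothetical and this is not how reg-v1 currently computes anything.

```python
def _years(dates):
    """Year component of every date that has at least a year."""
    return {d[:4] for d in dates if len(d) >= 4}


def _days(dates):
    """Full YYYY-MM-DD value of every date given at day precision."""
    return {d[:10] for d in dates if len(d) >= 10}


def _match(left, right):
    """Positive signal: the two sets share at least one value."""
    return 1.0 if left & right else 0.0


def _mismatch(left, right):
    """Negative signal: both sets have values, but none in common."""
    return 1.0 if left and right and not (left & right) else 0.0


def dob_year_match(left, right):
    return _match(_years(left), _years(right))


def dob_year_mismatch(left, right):
    return _mismatch(_years(left), _years(right))


def dob_day_match(left, right):
    return _match(_days(left), _days(right))


def dob_day_mismatch(left, right):
    return _mismatch(_days(left), _days(right))
```

For example, comparing `["1970-03-12"]` with `["1970"]` gives a year match and leaves both day features at 0.0, since one side has no day-precision value. A day match always implies a year match, so these features are not independent of each other; whether that hurts the model is something to check empirically.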
Final thoughts
This all needs to be fast. lol.