Awesome Entity Resolution Resources


Open-Source Software

End-to-End Entity Resolution

  • Splink (Python, SQL, Spark) - Scalable Fellegi-Sunter and rule-based entity resolution using your choice of SQL or Spark backend.
  • Zingg (Python, Java) - Scalable, active learning model for entity resolution.
  • dedupe (Python) - Active learning and flexible Python tooling for entity resolution.
  • PyJedAI (Python, Java) - State-of-the-art entity resolution clustering algorithms.
  • DeepMatcher (Python) - Deep learning-based entity resolution.
  • FastLink (R) - Easy, scalable Fellegi-Sunter entity resolution on your laptop.
  • RecordLinkage (Python) - Toolkit for prototyping entity resolution systems.
  • dblink (R, Spark) - Scalable Bayesian graphical entity resolution.
  • exchanger (R, C++) - More flexible Bayesian graphical entity resolution on your laptop.
  • RELAIS (R, SQL, Java) - Record linkage software used at the Italian National Statistics Institute.

Evaluation

  • ER-Evaluation (Python) - End-to-end evaluation, including summary statistics for monitoring, principled performance metric estimators, and error analysis.
  • clevr (R) - Performance metrics and error tables.

String Comparison

  • jellyfish (Python, C) - Fast string distance and phonetic matching.
  • py_stringmatching (Python, C) - Large set of string comparison functions and tokenization methods.
  • textdistance (Python) - Very large collection of sequence comparison functions, including token-based distances.
  • SecondString (Java) - Java implementation of string comparison functions.
  • StringCompare (Python, C++) - Time- and space-efficient implementation of common string distance functions. Designed for maintainability and extensibility.
  • Comparator (R, C++) - Efficient string comparison functions in R.

Embeddings (for pairwise comparison)

  • Entity Embed (Python, PyTorch) - PyTorch text embedding model for blocking.
  • FaceNet-PyTorch (Python, PyTorch) - Embeddings for facial identity resolution.

Data Cleaning and Parsing

  • cleanco (Python) - Company name cleaning.
  • libpostal (C, and bindings for Python, Java, Go, Ruby, PHP, and NodeJS) - Multinational address parsing.
  • ftfy (Python) - Fixes text (Unicode artifacts) for you.
  • PyJanitor (Python) - Clean code for clean data.
  • ProbablePeople - Western name parser.
  • python-nameparser (Python) - Separate names into individual components.
  • Nominally - Name parser for record linkage.

Data Quality Control

Blocking, Candidate Selection, and Search

  • blocking (R) - Blocking based on approximate nearest neighbours.
  • Elasticsearch - Full-text search engine.
  • DeezyMatch (Python) - Deep embedding and approximate nearest-neighbor blocking for entity resolution.
  • StarSpace (C++, Python) - Embedding model suitable for similarity learning.

Commercial Solutions

Books

Contributors