Skip to content

EastridgeAnalytics/Entity_Resolution

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Entity Resolution & Network Visualization Demo

This repository contains tools for entity resolution and graph visualization. It provides:

  • A Jupyter Notebook (`run-ER.ipynb) for end-to-end entity resolution using Neo4j.
  • A Streamlit application (visualize_network.py) for exploring relationships in your resolved entity data.

Table of Contents

  1. Entity Resolution Overview
  2. Network Visualization
  3. Setup & Requirements
  4. Running the Applications
  5. Data Sources
  6. Customization
  7. License & Contribution

Entity Resolution Overview

The Jupyter Notebook (entity_resolution_notebook.ipynb) demonstrates an end-to-end entity resolution pipeline* in Neo4j

Pipeline Breakdown

  1. Configuration & Setup

    • Imports required libraries and sets up Neo4j credentials.
    • Configures logging and a Neo4j driver instance.
  2. Data Generation & Simulation

    • Generates synthetic entity data (using Faker) with intentional duplicates (e.g., name variations, typos, phone/email format changes).
    • Inserts controlled test clusters to validate resolution.
  3. Data Normalization

    • Standardizes names, emails, addresses, and phone numbers.
    • Updates Neo4j candidate nodes with normalized values.
  4. Similarity Calculation & Blocking

    • Computes Jaro-Winkler, Levenshtein, and exact-match similarities for key fields.
    • Uses Neo4j indexing & blocking strategies to optimize comparisons.
  5. Duplicate Resolution Strategies

    • Merge High Confidence: Automatically merges highly similar nodes.
    • Link High Confidence: Establishes SAME_AS relationships instead of merging.
  6. Master Entity Resolution

    • Clusters similar records and creates master nodes representing deduplicated entities.
    • Assigns canonical attributes based on supporting candidate nodes.

Network Visualization

This repository also includes a Streamlit application interactive entity relationship visualization The Graph Visualization Tool (visualize_network.py) displays Neo4j & SQL-based networks dynamically in a Streamlit interface.


Setup & Requirements

Pre-requisites

Before running the applications, ensure you have:

  • Python 3.8+ installed.
  • Neo4j (for entity resolution and visualization).
  • Jupyter Notebook or JupyterLab (for the resolution pipeline).
  • Streamlit (for graph visualization).

Install Python Dependencies

Install all necessary packages with:

pip install -r requirements.yaml
conda env create -f requirements.yaml

Or install manually:

pip install streamlit neo4j faker duckdb sqlalchemy python-dotenv st-link-analysis splink

Neo4j Database

Environment Variables

The Streamlit apps (app.py, visualize_network.py) use a .env file for Neo4j credentials:

NEO4J_URI=yourURI
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=yourpassword
NEO4J_DB_NAME=neo4j

Running the Applications

1. Entity Resolution in Jupyter Notebook

Run the entity resolution pipeline inside Jupyter:

jupyter notebook

Open entity_resolution_notebook.ipynb and follow the step-by-step entity resolution process.


2. Network Graph Visualization

Launch the Streamlit-based graph visualization:

streamlit run visualize_network.py
  • Select "Neo4j" as the data source if using it for this pipeline.
  • Configure the Cypher query.
  • Click "Load Data" to visualize the network.

Contributions are welcome!
Feel free to:

  • Open issues for bugs or feature requests.
  • Submit pull requests with improvements.

Visualization Examples

Below are sample screenshots from the network visualization tool:

Cluster View of Entities

Clustered Entity View

Graph View of Relationships

Graph View


About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •