This repository contains tools for entity resolution and graph visualization. It provides:
- A Jupyter Notebook (`run-ER.ipynb`) for end-to-end entity resolution using Neo4j.
- A Streamlit application (`visualize_network.py`) for exploring relationships in your resolved entity data.
- Entity Resolution Overview
- Network Visualization
- Setup & Requirements
- Running the Applications
- Data Sources
- Customization
- License & Contribution
The Jupyter Notebook (`entity_resolution_notebook.ipynb`) demonstrates an end-to-end entity resolution pipeline in Neo4j:
- Configuration & Setup
  - Imports required libraries and sets up Neo4j credentials.
  - Configures logging and a Neo4j driver instance.
- Data Generation & Simulation
  - Generates synthetic entity data (using Faker) with intentional duplicates (e.g., name variations, typos, phone/email format changes).
  - Inserts controlled test clusters to validate resolution.
- Data Normalization
  - Standardizes names, emails, addresses, and phone numbers.
  - Updates Neo4j candidate nodes with normalized values.
- Similarity Calculation & Blocking
  - Computes Jaro-Winkler, Levenshtein, and exact-match similarities for key fields.
  - Uses Neo4j indexing & blocking strategies to optimize comparisons.
- Duplicate Resolution Strategies
  - Merge High Confidence: Automatically merges highly similar nodes.
  - Link High Confidence: Establishes `SAME_AS` relationships instead of merging.
- Master Entity Resolution
  - Clusters similar records and creates master nodes representing deduplicated entities.
  - Assigns canonical attributes based on supporting candidate nodes.
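To make the normalization, blocking, similarity, and clustering steps more concrete, here is a minimal plain-Python sketch. It is not the notebook's code: the notebook performs these steps against Neo4j, while this example works on in-memory dictionaries. The field names, the first-letter blocking key, the match rule, and the `0.8` threshold are all assumptions chosen for illustration, and only Levenshtein and exact-match comparisons are shown (Jaro-Winkler is left out for brevity).

```python
import re
from collections import defaultdict
from itertools import combinations


def normalize(record):
    """Lowercase and strip formatting noise from the fields being compared."""
    return {
        "id": record["id"],
        "name": " ".join(record["name"].lower().split()),
        "email": record["email"].strip().lower(),
        "phone": re.sub(r"\D", "", record["phone"]),  # keep digits only
    }


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_match(r1, r2, threshold=0.8):
    """Hypothetical rule: exact email or phone match, or a sufficiently close name."""
    if r1["email"] == r2["email"] or r1["phone"] == r2["phone"]:
        return True
    return name_similarity(r1["name"], r2["name"]) >= threshold


def cluster(records, threshold=0.8):
    """Block on the first letter of the name, compare within blocks, union the matches."""
    normalized = [normalize(r) for r in records]
    parent = {r["id"]: r["id"] for r in normalized}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for r in normalized:
        blocks[r["name"][:1]].append(r)  # toy blocking key (assumption)

    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if is_match(r1, r2, threshold):
                parent[find(r1["id"])] = find(r2["id"])

    groups = defaultdict(list)
    for r in normalized:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(ids) for ids in groups.values())


records = [
    {"id": 1, "name": "Jon Smith", "email": "jon.smith@example.com", "phone": "(555) 010-1234"},
    {"id": 2, "name": "John  Smith", "email": "JON.SMITH@example.com ", "phone": "555-010-1234"},
    {"id": 3, "name": "Jane Doe", "email": "jane@example.com", "phone": "555 010 9999"},
]
print(cluster(records))  # [[1, 2], [3]]
```

Each resulting group corresponds roughly to one master entity. Blocking keeps the comparison count down because only records sharing a block key are compared pairwise; a real blocking key would usually be more selective than a single letter.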
This repository also includes a Streamlit application for interactive entity relationship visualization. The Graph Visualization Tool (`visualize_network.py`) displays Neo4j- and SQL-based networks dynamically in a Streamlit interface.
Before running the applications, ensure you have:
- Python 3.8+ installed.
- Neo4j (for entity resolution and visualization).
- Jupyter Notebook or JupyterLab (for the resolution pipeline).
- Streamlit (for graph visualization).
Install all necessary packages with:
pip install -r requirements.yaml
conda env create -f requirements.yaml
Or install manually:
pip install streamlit neo4j faker duckdb sqlalchemy python-dotenv st-link-analysis splink
- Ensure Neo4j Desktop or Neo4j Server is running.
- Enable APOC and Graph Data Science (GDS) if using similarity calculations.
The Streamlit apps (app.py, visualize_network.py) use a .env file for Neo4j credentials:
NEO4J_URI=yourURI
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=yourpassword
NEO4J_DB_NAME=neo4j
Run the entity resolution pipeline inside Jupyter:
jupyter notebook
Open `entity_resolution_notebook.ipynb` and follow the step-by-step entity resolution process.
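As a rough illustration of how the `.env` values listed above might be consumed, the sketch below reads the four variables and opens a driver with `python-dotenv` and the official `neo4j` package. The helper names `read_neo4j_settings` and `connect` are hypothetical, and falling back to `neo4j` when `NEO4J_DB_NAME` is unset is an assumption; the repository's own code may do this differently.

```python
import os


def read_neo4j_settings(env=None):
    """Collect the connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {
        "uri": env["NEO4J_URI"],
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
        "database": env.get("NEO4J_DB_NAME", "neo4j"),  # default is an assumption
    }


def connect():
    """Load .env and open a driver. Imports are local so the helper above has no dependencies."""
    from dotenv import load_dotenv   # provided by python-dotenv
    from neo4j import GraphDatabase  # official Neo4j Python driver

    load_dotenv()  # reads a .env file, if one is found, into os.environ
    settings = read_neo4j_settings()
    driver = GraphDatabase.driver(
        settings["uri"], auth=(settings["username"], settings["password"])
    )
    driver.verify_connectivity()  # fails fast if the URI or credentials are wrong
    return driver, settings["database"]
```

A caller would typically use the result as `driver, database = connect()` followed by `with driver.session(database=database) as session: ...`, and call `driver.close()` when finished.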
Launch the Streamlit-based graph visualization:
streamlit run visualize_network.py
- Select "Neo4j" as the data source if using it for this pipeline.
- Configure the Cypher query.
- Click "Load Data" to visualize the network.
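The steps above end with query results being drawn as a network. The app's actual data handling is not shown here, so the following is only a sketch of one plausible intermediate step: turning rows of `(source, relationship, target)` into a Cytoscape-style dictionary of nodes and edges, a shape that many graph components accept. The function name `rows_to_elements`, the `"Entity"` label, and the sample rows are hypothetical; such rows might come from a query along the lines of `MATCH (a)-[r]->(b) RETURN a.id, type(r), b.id`, where the `id` property is also an assumption.

```python
def rows_to_elements(rows):
    """Turn (source, relationship, target) rows into Cytoscape-style nodes and edges."""
    nodes = {}
    edges = []
    for source, relationship, target in rows:
        # Register each endpoint once, keyed by its identifier.
        for name in (source, target):
            nodes.setdefault(name, {"data": {"id": name, "label": "Entity", "name": name}})
        edges.append({"data": {
            "id": f"e{len(edges)}",
            "label": relationship,
            "source": source,
            "target": target,
        }})
    return {"nodes": list(nodes.values()), "edges": edges}


rows = [
    ("candidate-1", "SAME_AS", "candidate-2"),
    ("candidate-2", "SAME_AS", "candidate-3"),
]
elements = rows_to_elements(rows)
print(len(elements["nodes"]), len(elements["edges"]))  # 3 2
```

Keying nodes by identifier means a node that appears in several relationships is emitted only once, which is what a graph renderer generally expects.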
Contributions are welcome!
Feel free to:
- Open issues for bugs or feature requests.
- Submit pull requests with improvements.
Below are sample screenshots from the network visualization tool:

