This repository contains a lightweight prototype of a larger industrial project deployed on Ardian's Azure server. It serves as a local testing environment for an interactive search tool that recommends similar companies based on textual business descriptions and financial metadata.
The core objective is to match companies using semantic embeddings, dimensionality reduction, and user-specified filters (e.g., country, sector, FTE bounds). This small-scale version allows for rapid experimentation, and its modular layout lets individual steps be replaced or extended independently.
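As a rough illustration of the user-specified filters, here is a minimal sketch in plain Python. The `Company` record, its field names, the `filter_companies` helper and the sample rows are all hypothetical; the actual dataset columns and the logic in filter.py may differ.

```python
from dataclasses import dataclass


@dataclass
class Company:
    # Hypothetical record layout; the real dataset's columns may differ.
    name: str
    country: str
    sector: str
    fte: int
    description: str


def filter_companies(companies, country=None, sector=None, fte_min=None, fte_max=None):
    """Keep companies matching every filter that is set (None means no constraint)."""
    kept = []
    for c in companies:
        if country is not None and c.country.lower() != country.lower():
            continue
        if sector is not None and c.sector.lower() != sector.lower():
            continue
        if fte_min is not None and c.fte < fte_min:
            continue
        if fte_max is not None and c.fte > fte_max:
            continue
        kept.append(c)
    return kept


# Made-up sample rows, for illustration only.
SAMPLE = [
    Company("Alpha", "France", "Software", 120, "Cloud accounting software"),
    Company("Beta", "France", "Retail", 900, "Grocery store chain"),
    Company("Gamma", "Germany", "Software", 45, "Payroll software for SMEs"),
]

matches = filter_companies(SAMPLE, country="france", fte_min=100, fte_max=1000)
print([c.name for c in matches])
```

Treating `None` as "no constraint" keeps every filter optional, so a user can skip any prompt without special-casing it downstream.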
vincentg1234-ardian_capstone.git/
├── README.md                      # This file
├── NB_main.ipynb                  # Interactive notebook for local testing
├── requirements.txt               # List of dependencies
├── data/                          # Sample dataset for local usage
└── Scripts/                       # Core logic (modular Python scripts)
    ├── main.py                    # Entry point to run full pipeline
    ├── filter_user/
    │   ├── ask_user.py            # User prompts and CLI interactions
    │   └── filter.py              # Data filtering and description preprocessing
    └── language_model_folder/
        ├── language_model.py      # Embedding generation and similarity ranking
        └── PCA_functions.py       # Dimensionality reduction using PCA
Python version: make sure you're using Python 3.11.
Install dependencies:
pip install -r requirements.txt
It's strongly recommended to install them inside a virtual environment, created and activated beforehand:
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
Run the notebook: Open NB_main.ipynb to run the search pipeline step-by-step with a small dataset.
Filters companies based on user input (country, sector, size)
Uses Sentence-BERT to embed business descriptions
Enriches descriptions with user-provided or inferred keywords
Applies PCA to reduce vector dimensions while preserving semantic structure
Ranks and displays the top-N most similar companies
You'll be prompted interactively in the terminal or notebook to guide each step.
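The enrichment and ranking steps above can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for the Sentence-BERT encoder (not the model the project uses), the PCA step is omitted for brevity, and the function names and sample descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def enrich(description, keywords):
    """Append keywords to a description so they influence the embedding."""
    if not keywords:
        return description
    return description + " " + " ".join(keywords)


def toy_embed(text):
    """Stand-in for a Sentence-BERT encoder: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_similar(query, candidates, keywords=(), top_n=2):
    """Return the top_n (name, score) pairs, most similar first."""
    q = toy_embed(enrich(query, list(keywords)))
    scored = [(name, cosine(q, toy_embed(desc))) for name, desc in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up descriptions, for illustration only.
CANDIDATES = {
    "Alpha": "cloud accounting software for small businesses",
    "Beta": "grocery store chain with regional warehouses",
    "Gamma": "payroll and accounting software",
}

for name, score in rank_similar("accounting software vendor", CANDIDATES, keywords=["payroll"]):
    print(f"{name}: {score:.3f}")
```

With a real sentence encoder, only `toy_embed` and the vector type would change; the enrichment and top-N ranking logic would keep the same shape.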
The dataset in the data/ folder is a minimal subset for local prototyping. The full version runs on Ardian's secure Azure infrastructure.
Please create a new branch before making any changes.
Be careful with git add and commits: avoid pushing unwanted cache or system files.
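One way to act on this advice is to check the list of staged paths before committing. The sketch below is a hypothetical helper, not part of the repository; the patterns in `UNWANTED_PATTERNS` are assumptions about what counts as cache or system files and should be adjusted to the project.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Assumed patterns for typical cache and system files; adjust as needed.
UNWANTED_PATTERNS = [
    "__pycache__",
    "*.pyc",
    ".ipynb_checkpoints",
    ".DS_Store",
    "Thumbs.db",
    ".venv",
]


def find_unwanted(paths):
    """Return the paths in which any component matches an unwanted pattern."""
    flagged = []
    for path in paths:
        parts = PurePosixPath(path).parts
        if any(fnmatch(part, pattern) for part in parts for pattern in UNWANTED_PATTERNS):
            flagged.append(path)
    return flagged


# Example list of staged paths (illustrative only).
staged = [
    "README.md",
    "__pycache__/main.cpython-311.pyc",
    ".ipynb_checkpoints/NB_main-checkpoint.ipynb",
    "data/.DS_Store",
]
print(find_unwanted(staged))
```

The input list could come from `git diff --cached --name-only`, which prints the paths currently staged for commit.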
This repo is maintained as part of a capstone project at Ardian. For any questions or suggestions, feel free to reach out to the team.