VincentG1234/ARDIAN_CAPSTONE

🔍 ARDIAN Capstone – Similar Company Search Prototype

This repository contains a lightweight prototype of a larger industrial project that runs on Ardian's Azure server. It serves as a local testing environment for an interactive search tool that recommends similar companies based on textual business descriptions and financial metadata.

The core objective is to match companies using semantic embeddings, dimensionality reduction, and user-specified filters (e.g., country, sector, FTE bounds). This small-scale version allows for rapid experimentation, and its modular layout makes it easy to extend.
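As a rough illustration of the user-specified filters mentioned above, the sketch below keeps only the companies that satisfy every filter the user actually set. The `Company` record, its field names, and `filter_companies` are hypothetical and are not taken from `Scripts/filter_user/filter.py`; the real dataset columns may differ.

```python
from dataclasses import dataclass


@dataclass
class Company:
    # Hypothetical record layout; the real dataset's columns may differ.
    name: str
    country: str
    sector: str
    fte: int


def filter_companies(companies, country=None, sector=None, fte_min=None, fte_max=None):
    """Keep companies that match every filter that is not None."""
    kept = []
    for c in companies:
        if country is not None and c.country.lower() != country.lower():
            continue
        if sector is not None and c.sector.lower() != sector.lower():
            continue
        if fte_min is not None and c.fte < fte_min:
            continue
        if fte_max is not None and c.fte > fte_max:
            continue
        kept.append(c)
    return kept


# Made-up sample rows, for illustration only.
sample = [
    Company("Alpha", "France", "Software", 120),
    Company("Beta", "France", "Retail", 40),
    Company("Gamma", "Germany", "Software", 900),
]
matches = filter_companies(sample, country="france", fte_min=50, fte_max=500)
print([c.name for c in matches])
```

Leaving a filter as `None` disables it, so a user who skips a prompt simply does not narrow the candidate set on that dimension.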


📁 Repository Structure

vincentg1234-ardian_capstone.git/
├── README.md               ← This file
├── NB_main.ipynb           ← Interactive notebook for local testing
├── requirements.txt        ← List of dependencies
├── data/                   ← Sample dataset for local usage
└── Scripts/                ← Core logic (modular Python scripts)
    ├── main.py                         ← Entry point to run full pipeline
    ├── filter_user/
    │   ├── ask_user.py                 ← User prompts and CLI interactions
    │   └── filter.py                   ← Data filtering and description preprocessing
    └── language_model_folder/
        ├── language_model.py           ← Embedding generation and similarity ranking
        └── PCA_functions.py            ← Dimensionality reduction using PCA

⚙️ Setup Instructions

Python version: Make sure you’re using Python 3.11

Install dependencies:

```bash
pip install -r requirements.txt
```

It’s strongly recommended to use a virtual environment. Create and activate it before installing the dependencies:

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
```

Run the notebook: Open NB_main.ipynb to run the search pipeline step-by-step with a small dataset.

🧠 What the Pipeline Does

Filters companies based on user input (country, sector, size)

Uses Sentence-BERT to embed business descriptions

Enriches descriptions with user-provided or inferred keywords

Applies PCA to reduce vector dimensions while preserving semantic structure

Ranks and displays the top-N most similar companies

You’ll be prompted interactively in the terminal or notebook to guide each step.
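The reduction and ranking steps listed above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code in `PCA_functions.py` or `language_model.py`: `pca_reduce` and `top_n_similar` are hypothetical names, PCA is done with a plain SVD, similarity is assumed to be cosine similarity, and random vectors stand in for the Sentence-BERT embeddings.

```python
import numpy as np


def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components (via SVD)."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def top_n_similar(query_vec, candidate_vecs, n):
    """Return indices and scores of the n candidates with highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:n]
    return order, scores[order]


# Stand-in embeddings: in the pipeline these would come from Sentence-BERT.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 16))
reduced = pca_reduce(embeddings, n_components=3)

# Treat row 0 as the query company and rank the remaining rows against it.
idx, scores = top_n_similar(reduced[0], reduced[1:], n=2)
print(idx + 1, scores)  # +1 maps back to row numbers in `embeddings`
```

Ranking in the reduced space is cheaper than ranking on the full embeddings, at the cost of discarding the variance carried by the dropped components, so the number of components is a parameter worth tuning on real data.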

📌 Notes

The dataset in the data/ folder is a minimal subset for local prototyping. The full version runs on Ardian's secure Azure infrastructure.

Please create a new branch before making any changes.

Be careful with git add and commits: review what is staged and avoid pushing unwanted cache or system files (for example, __pycache__/ directories or .DS_Store).

📧 Contact

This repo is maintained as part of a capstone project at ARDIAN. For any questions or suggestions, feel free to reach out to the team.
