Document Classification with BERTopic and Streamlit is a project designed to automatically label documents based on the textual content present near key areas of interest. This repository demonstrates the development of a document classification system leveraging self-supervised learning with BERTopic and an interactive UI built using Streamlit.
- Self-Supervised Learning with BERTopic: Utilize BERTopic, a topic modeling algorithm, to group and classify documents based on semantic similarity.
- Streamlit for Interactive UI: Provide a user-friendly interface to upload documents, view classifications, and interact with the model in real-time.
- Flexibility and Adaptability: Easily extend the system to handle new, unlabeled datasets without requiring extensive retraining.
- Efficient Document Labeling: Automatically classify documents into relevant categories based on context, reducing manual effort.
- Text content is extracted from documents, with a focus on regions near key information areas.
- Preprocessed text serves as input for further analysis.
- BERTopic is trained on document text to generate embeddings and cluster documents based on semantic similarity.
- Fine-tuning ensures that the model captures domain-specific nuances for classification.
- Trained BERTopic model assigns topic labels to documents.
- Labels can be mapped to predefined categories for better interpretability.
- Streamlit application allows users to:
- Upload documents in PDF format.
- View extracted text and predicted labels.
- Explore clustering insights through interactive visualizations.
- Python: 3.x (Tested with Python 3.10.12 on Ubuntu 22.04)
- Libraries:
bertopic
,streamlit
,pandas
,scikit-learn
,nltk
,PyPDF2
- Tools:
- Streamlit for the UI
- Dependencies listed in
requirements.txt
-
Clone the Repository:
git clone https://github.com/PavanSugreev04/Document-Classification.git cd document-classification-bertopic
-
Install Dependencies:
python -m venv venv source venv/bin/activate pip install -r requirements.txt
-
Prepare Data:
- Place your training documents in the
data/
directory. - Preprocess text using the scripts provided.
- Place your training documents in the
-
Train BERTopic Model:
- Use
train_model.py
to train the BERTopic model on your dataset.
- Use
-
Launch Streamlit App:
streamlit run app.py
-
Interact with the App:
- Upload PDF documents.
- View extracted text, clusters, and topic labels.
- Challenge: Training BERTopic on large datasets requires significant resources.
- Remedy: Utilize cloud-based GPUs or pre-train on smaller datasets and fine-tune as needed.
- Challenge: Preprocessing large document sets can be time-intensive.
- Remedy: Batch process documents and save intermediate results for reuse.
- Challenge: Inconsistent or sparse text data affects model performance.
- Remedy: Apply data augmentation and preprocessing to enhance data quality.
- Challenge: Optimizing BERTopic parameters can be complex.
- Remedy: Use grid search or automated tools to explore parameter configurations.
Kaggal News DataSet