
HR Application (RAG)

A modular Retrieval Augmented Generation (RAG) application built with LangChain 0.3 and Python 3.13. It lets you index PDF documents and query them in natural language through a configurable LLM backend (see Configuration below for the supported providers).

Features

  • Modular Architecture: Factory pattern implementation for easy extensibility
  • Configurable: YAML-based configuration for all components
  • PDF Support: Load and process PDF documents
  • Persistent Vector Store: ChromaDB for efficient document storage and retrieval
  • CLI Interface: Simple command-line interface for indexing and querying
  • Comprehensive Testing: Unit tests for all major components

Architecture

rag-application/
├── config/
│   └── config.yaml              # Main configuration file
├── src/
│   ├── factories/               # Factory pattern implementations
│   │   ├── llm_factory.py
│   │   ├── embedding_factory.py
│   │   └── vectorstore_factory.py
│   ├── components/              # Core components
│   │   ├── document_loader.py
│   │   ├── text_splitter.py
│   │   └── retriever.py
│   ├── rag/
│   │   └── rag_pipeline.py      # Main RAG pipeline
│   └── utils/
│       └── config_loader.py     # Configuration loader
├── tests/                       # Unit tests
└── main.py                      # CLI entry point

Installation

Prerequisites

  • Python 3.13+
  • OpenAI API key

Setup

  1. Clone the repository:
git clone <repository-url>
cd rag-application
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
cp .env.example .env
# Edit .env and configure all application settings including API keys
  5. Configure the application:
# Edit config/config.yaml with your preferred settings

Configuration

The application is configured via environment variables in the .env file; configuration has been moved out of code and into the environment for better security and flexibility. The key options are grouped below.
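As a rough illustration of how such variables are typically consumed (a sketch assuming python-dotenv; the application's actual loading logic lives in src/utils/config_loader.py and may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

llm_config = {
    "type": os.getenv("LLM_TYPE", "anthropic"),
    "model_name": os.getenv("LLM_MODEL_NAME"),
    "temperature": float(os.getenv("LLM_TEMPERATURE", "0.7")),
    "max_tokens": int(os.getenv("LLM_MAX_TOKENS", "500")),
}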

LLM Configuration

# In .env file
LLM_TYPE=anthropic
LLM_MODEL_NAME=claude-haiku-4-5-20251001
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=500

Embedding Configuration

# In .env file
EMBEDDING_TYPE=huggingface
EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2
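These two variables typically map onto a LangChain embedding object as sketched below (assumes the langchain-huggingface package; the real wiring lives in src/factories/embedding_factory.py):

from langchain_huggingface import HuggingFaceEmbeddings

# EMBEDDING_TYPE selects the provider; EMBEDDING_MODEL_NAME picks the model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)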

Vector Store Configuration

# In .env file
VECTORSTORE_TYPE=chroma
VECTORSTORE_PERSIST_DIRECTORY=./indexes/chroma_db
VECTORSTORE_COLLECTION_NAME=rag_documents
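A persistent Chroma store with these settings can be constructed as sketched here (assumes the langchain-chroma package; embeddings is the object from the embedding factory above):

from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="rag_documents",           # VECTORSTORE_COLLECTION_NAME
    embedding_function=embeddings,             # from the embedding factory
    persist_directory="./indexes/chroma_db",   # VECTORSTORE_PERSIST_DIRECTORY
)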

Document Processing

# In .env file
DOCUMENT_CHUNK_SIZE=1000
DOCUMENT_CHUNK_OVERLAP=200
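These values feed the text splitter. A sketch using LangChain's RecursiveCharacterTextSplitter, a common default (src/components/text_splitter.py may use a different splitter):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # DOCUMENT_CHUNK_SIZE
    chunk_overlap=200,  # DOCUMENT_CHUNK_OVERLAP
)
chunks = splitter.split_documents(documents)  # documents from the PDF loader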

Retrieval Configuration

# In .env file
RETRIEVAL_TOP_K=4
RETRIEVAL_SEARCH_TYPE=similarity
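In LangChain terms these correspond to the retriever settings, roughly as follows (a sketch; vectorstore is the Chroma instance above):

retriever = vectorstore.as_retriever(
    search_type="similarity",  # RETRIEVAL_SEARCH_TYPE
    search_kwargs={"k": 4},    # RETRIEVAL_TOP_K
)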

System Prompt Configuration

# In .env file
# Customize how the AI assistant responds to queries
SYSTEM_PROMPT=You are an AI HR Assistant for TechnoSphere India Private Limited...

See SYSTEM_PROMPT_GUIDE.md for detailed customization options and examples.
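Internally, a system prompt like this is usually combined with the retrieved context and the user's question. A hedged sketch of that wiring (the exact template used by the pipeline is an assumption):

import os
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", os.getenv("SYSTEM_PROMPT", "You are a helpful assistant.")),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])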

Usage

Indexing Documents

Index a single PDF file:

python main.py index /path/to/document.pdf

Index all PDFs in a directory:

python main.py index /path/to/documents/

Querying Documents

Interactive mode (recommended):

python main.py query --interactive

Single query:

python main.py query --question "What is this document about?"

Show source documents:

python main.py query --interactive --show-sources

Custom Configuration

Use a different configuration file:

python main.py --config custom_config.yaml index document.pdf
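For orientation, the command surface above could be built with argparse along these lines (a sketch only; main.py's actual implementation may differ):

import argparse

parser = argparse.ArgumentParser(description="RAG application CLI")
parser.add_argument("--config", default="config/config.yaml")
subparsers = parser.add_subparsers(dest="command", required=True)

index_parser = subparsers.add_parser("index")
index_parser.add_argument("path")  # a PDF file or a directory of PDFs

query_parser = subparsers.add_parser("query")
query_parser.add_argument("--question")
query_parser.add_argument("--interactive", action="store_true")
query_parser.add_argument("--show-sources", action="store_true")

args = parser.parse_args()  # e.g. args.command == "index", args.path == "document.pdf"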

Testing

Run all tests:

pytest

Run tests with coverage:

pytest --cov=src tests/

Run specific test file:

pytest tests/test_factories.py
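A representative test might check that the LLM factory rejects unknown providers, mirroring the create() method shown under Factory Pattern below (a sketch; the class name LLMFactory and import path are assumptions):

import pytest
from src.factories.llm_factory import LLMFactory  # class name is hypothetical

def test_unsupported_llm_type_raises():
    factory = LLMFactory()
    with pytest.raises(ValueError, match="Unsupported LLM type"):
        factory.create({"type": "not-a-real-provider"})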

Factory Pattern

The application uses the Factory Pattern for creating instances of:

  1. LLM Factory: Creates language model instances (OpenAI)
  2. Embedding Factory: Creates embedding model instances (OpenAI)
  3. Vector Store Factory: Creates vector store instances (Chroma)

To add support for new providers, simply:

  1. Extend the appropriate factory class
  2. Add configuration in config.yaml
  3. Implement the provider-specific creation method

Example:

# In src/factories/llm_factory.py (requires the langchain-anthropic package)
from typing import Any, Dict
from langchain_anthropic import ChatAnthropic

def _create_anthropic_llm(self, config: Dict[str, Any]) -> Any:
    return ChatAnthropic(
        model=config.get('model_name', 'claude-sonnet-4-20250514'),
        temperature=config.get('temperature', 0.7)
    )

Components

Document Loader

Handles loading PDF documents from files or directories.

Text Splitter

Splits documents into chunks for efficient processing and retrieval.

Retriever

Retrieves relevant document chunks based on similarity search.

RAG Pipeline

Orchestrates the entire RAG workflow (a condensed sketch follows the list):

  1. Document loading
  2. Text splitting
  3. Vector store creation/loading
  4. Query processing
  5. Answer generation
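Put together, these steps map onto LangChain roughly as follows. This is a condensed sketch under the configuration defaults shown above, not the actual contents of rag_pipeline.py; the function names are illustrative:

from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def index_pdf(path: str) -> Chroma:
    documents = PyPDFLoader(path).load()                      # 1. document loading
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(documents)              # 2. text splitting
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    return Chroma.from_documents(                             # 3. vector store creation
        chunks, embeddings, persist_directory="./indexes/chroma_db"
    )

def answer(vectorstore: Chroma, llm, question: str) -> str:
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    docs = retriever.invoke(question)                         # 4. query processing
    context = "\n\n".join(doc.page_content for doc in docs)
    reply = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
    return reply.content                                      # 5. answer generation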

Project Structure Details

src/
├── factories/           # Factory pattern implementations
│   ├── base_factory.py  # Abstract base class for factories
│   ├── llm_factory.py   # LLM instance creation
│   ├── embedding_factory.py  # Embedding model creation
│   └── vectorstore_factory.py  # Vector store creation
│
├── components/          # Modular components
│   ├── document_loader.py  # PDF loading
│   ├── text_splitter.py    # Text chunking
│   └── retriever.py        # Document retrieval
│
├── rag/                 # Main RAG logic
│   └── rag_pipeline.py  # Pipeline orchestration
│
└── utils/               # Utility functions
    └── config_loader.py # YAML configuration loading

Extending the Application

Adding a New LLM Provider

  1. Update src/factories/llm_factory.py:
# In src/factories/llm_factory.py
from typing import Any, Dict

def create(self, config: Dict[str, Any]) -> Any:
    llm_type = config.get('type', '').lower()

    if llm_type == 'openai':
        return self._create_openai_llm(config)
    elif llm_type == 'anthropic':  # New provider
        return self._create_anthropic_llm(config)
    else:
        raise ValueError(f"Unsupported LLM type: {llm_type}")
  2. Update config/config.yaml:
llm:
  type: "anthropic"
  model_name: "claude-sonnet-4-20250514"

Adding a New Vector Store

Follow the same pattern in src/factories/vectorstore_factory.py.
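For instance, a FAISS backend might be added with a provider-specific method like this (a sketch; assumes the faiss-cpu and langchain-community packages, and the method signature is illustrative):

from typing import Any, Dict, List
from langchain_community.vectorstores import FAISS

def _create_faiss_vectorstore(self, config: Dict[str, Any],
                              embeddings: Any, documents: List[Any]) -> Any:
    # FAISS keeps the index in memory; persist it with vectorstore.save_local(path)
    return FAISS.from_documents(documents, embeddings)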

Troubleshooting

Common Issues

  1. API Key Error: Ensure the API key for your configured provider (e.g., OpenAI or Anthropic) is set in .env
  2. File Not Found: Check that PDF paths are correct
  3. Memory Issues: Reduce chunk_size or top_k in config
  4. Empty Results: Ensure documents are indexed before querying

Performance Tips

  1. Chunk Size: Larger chunks (1000-2000) for comprehensive context, smaller (500-1000) for precise retrieval
  2. Overlap: 10-20% of chunk size for better context continuity
  3. Top K: 3-5 documents for most queries, increase for complex questions
  4. Temperature: Lower (0.3-0.5) for factual answers, higher (0.7-0.9) for creative responses

License

MIT License

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

Support

For issues and questions, please open an issue on GitHub.
