Merged
Changes from all commits
17 commits
19e9a38
Refactor code structure and remove redundant sections for improved re…
brylie Feb 2, 2026
820d779
Refactor with dependency injection
brylie Feb 2, 2026
736702f
Enhance documentation in CONTRIBUTING.md and README.md to improve cla…
brylie Feb 2, 2026
32a2743
Remove VectorizerFactory from the tapio package initialization and it…
brylie Feb 2, 2026
a4e6232
Refactor ChromaStore to use specific Embeddings type for improved typ…
brylie Feb 2, 2026
59f6226
Remove redundant import of ConfigManager in test_relative_links.py fo…
brylie Feb 2, 2026
e6bafbd
Refactor RAGConfig to use default settings from tapio.config.settings…
brylie Feb 2, 2026
f85aa65
Refactor test_rag_pipeline.py to improve mocking and assertion clarit…
brylie Feb 2, 2026
e4fb224
Remove redundant import of Chroma in test_vectorization_pipeline.py f…
brylie Feb 2, 2026
667270a
Remove redundant import of ConfigManager in test_parser.py for improv…
brylie Feb 2, 2026
8d51c8e
Add missing patch decorator to test_enhance_document_with_citation fo…
brylie Feb 2, 2026
d6b2a3c
Refactor mock_embeddings to avoid mutation in returned embeddings for…
brylie Feb 2, 2026
f3f2d46
Fix typo in fallback_to_body description for improved clarity
brylie Feb 2, 2026
1aab554
Fix import statement for HuggingFaceEmbeddings to use the correct module
brylie Feb 2, 2026
8f18d6d
Remove redundant import of patch in test_rag_pipeline_end_to_end for …
brylie Feb 2, 2026
f4d6621
Refactor test_parser to use class-level DEFAULT_DIRS for improved mai…
brylie Feb 2, 2026
ef2239d
Refactor create_chroma_store method to accept Embeddings type for imp…
brylie Feb 2, 2026
198 changes: 196 additions & 2 deletions CONTRIBUTING.md
@@ -18,8 +18,19 @@ Thank you for considering contributing to Tapio Assistant! This document provide
- [Testing Guidelines](#testing-guidelines)
- [Running Tests](#running-tests)
- [Code Coverage](#code-coverage)
- [Test Categories](#test-categories)
- [Test Fixtures](#test-fixtures)
- [Project Structure](#project-structure)
- [Programmatic API](#programmatic-api)
- [Using Factory Pattern (Recommended)](#using-factory-pattern-recommended)
- [Manual Dependency Injection (Advanced)](#manual-dependency-injection-advanced)
- [Key Components](#key-components)
- [Configuration System](#configuration-system)
- [Default Settings](#default-settings)
- [Site Configurations](#site-configurations)
- [Configuration Structure](#configuration-structure)
- [Required vs Optional Fields](#required-vs-optional-fields)
- [Adding New Sites](#adding-new-sites)
- [Ollama for LLM Inference](#ollama-for-llm-inference)
- [Pull Request Process](#pull-request-process)

@@ -241,7 +252,7 @@ uv run pytest

### Code Coverage

-We aim for high test coverage. When submitting code:
+We aim for high test coverage (minimum 80%). When submitting code:

1. Check your coverage with:

@@ -263,18 +274,112 @@ uv run pytest --cov=tapio.utils tests/utils/

Aim for at least 80% coverage for new code. The HTML coverage report can be found in the `htmlcov` directory. Open `htmlcov/index.html` in your browser to view it.

### Test Categories

We maintain different types of tests:

**Unit Tests** - Fast, isolated tests with mocked dependencies:
```bash
uv run pytest -m "not integration"
```

**Integration Tests** - Tests using real components (marked with `@pytest.mark.integration`):
```bash
uv run pytest -m integration
```

**All Tests**:
```bash
uv run pytest
```
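
Integration tests are selected via the `integration` marker registered in `pytest.ini`. A minimal sketch of how a test opts into that category (the test name and body here are illustrative):

```python
import pytest


@pytest.mark.integration
def test_vector_store_with_real_model():
    # Illustrative body: integration tests exercise real components
    # (and may download models), so they run with `-m integration`
    # and are skipped by `-m "not integration"`.
    ...
```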

### Test Fixtures

Common mock fixtures are available in `tests/conftest.py`:
- `mock_embeddings` - Mocked HuggingFace embeddings
- `mock_chroma_store` - Mocked ChromaDB vector store
- `mock_llm_service` - Mocked LLM service
- `mock_doc_retrieval_service` - Mocked document retrieval service
- `mock_rag_orchestrator` - Mocked RAG orchestrator

Use these fixtures in your tests for consistent mocking:
```python
def test_my_feature(mock_rag_orchestrator):
    # Test uses mocked orchestrator
    pass
```
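
If you need a new fixture, add it to `tests/conftest.py`. A minimal sketch of how a fixture like `mock_embeddings` might be defined with `unittest.mock` (the actual implementation may differ):

```python
from unittest.mock import MagicMock

import pytest


@pytest.fixture
def mock_embeddings():
    # Illustrative stand-in for HuggingFace embeddings; embed_query and
    # embed_documents mirror the LangChain Embeddings interface. Fresh
    # lists are returned per call so callers cannot mutate shared state.
    embeddings = MagicMock()
    embeddings.embed_query.side_effect = lambda text: [0.1, 0.2, 0.3]
    embeddings.embed_documents.side_effect = lambda texts: [
        [0.1, 0.2, 0.3] for _ in texts
    ]
    return embeddings
```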

## Project Structure

The project has been designed with a clear separation of concerns:

- `crawler/`: Module responsible for crawling websites and saving HTML content
- `parsers/`: Module responsible for parsing HTML content into structured formats
- `vectorstore/`: Module responsible for vectorizing content and storing in ChromaDB
- `services/`: RAG orchestration and LLM services
- `config/`: Configuration settings for the project
-- `gradio_app.py`: Gradio interface for the RAG chatbot
+- `app.py`: Gradio interface for the RAG chatbot
- `cli.py`: Command-line interface
- `factories.py`: Factory classes for dependency injection
- `utils/`: Utility modules for embedding generation, markdown processing, etc.
- `tests/`: Test suite for all modules

## Programmatic API

For developers who want to use Tapio as a library or extend its functionality:

### Using Factory Pattern (Recommended)

```python
from tapio import RAGConfig, RAGOrchestratorFactory

# Create configuration
config = RAGConfig(
    collection_name="my_docs",
    persist_directory="./db",
    llm_model_name="llama3.2",
    max_tokens=1024,
    num_results=5
)

# Create orchestrator using factory
factory = RAGOrchestratorFactory(config)
orchestrator = factory.create_orchestrator()

# Query the system
response, documents = orchestrator.query("What are the visa requirements?")
print(response)
```

### Manual Dependency Injection (Advanced)

For full control over component creation:

```python
from langchain_huggingface import HuggingFaceEmbeddings
from tapio.vectorstore.chroma_store import ChromaStore
from tapio.services.document_retrieval_service import DocumentRetrievalService
from tapio.services.llm_service import LLMService
from tapio.services.rag_orchestrator import RAGOrchestrator

# Create dependencies
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
chroma_store = ChromaStore("my_docs", embeddings, "./db")
doc_service = DocumentRetrievalService(chroma_store, num_results=5)
llm_service = LLMService(model_name="llama3.2", max_tokens=1024)

# Create orchestrator
orchestrator = RAGOrchestrator(doc_service, llm_service)
```
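
Both approaches yield the same `RAGOrchestrator`; the factory simply performs this wiring for you, filling in sensible defaults from `RAGConfig`.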

### Key Components

- **RAGOrchestrator**: Main orchestrator that coordinates document retrieval and LLM generation
- **DocumentRetrievalService**: Handles vector-based document retrieval
- **LLMService**: Manages LLM interactions via Ollama
- **ChromaStore**: Vector database abstraction layer
- **Factories**: Simplify dependency wiring with sensible defaults
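
At query time, the orchestrator retrieves relevant documents through `DocumentRetrievalService` and passes them to `LLMService` as context for the generated answer.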

## Configuration System

The application uses a centralized configuration system:
@@ -291,6 +396,95 @@ When adding new features that require configuration values:
3. Avoid hardcoding values that might need to change in the future
4. Use descriptive keys for configuration values

### Default Settings

Centralized configuration in `tapio/config/settings.py`:

```python
DEFAULT_DIRS = {
    "CRAWLED_DIR": "content/crawled",  # HTML storage
    "PARSED_DIR": "content/parsed",  # Markdown storage
    "CHROMA_DIR": "chroma_db",  # Vector database
}

DEFAULT_CHROMA_COLLECTION = "tapio" # ChromaDB collection name
DEFAULT_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
DEFAULT_LLM_MODEL = "llama3.2"
DEFAULT_MAX_TOKENS = 1024
DEFAULT_NUM_RESULTS = 5
```
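
New code should import these defaults rather than restating them. A small sketch using the names above (the helper function is hypothetical):

```python
from tapio.config.settings import DEFAULT_DIRS, DEFAULT_LLM_MODEL


def default_chroma_path() -> str:
    # Hypothetical helper: reuse the centralized default instead of
    # hardcoding "chroma_db".
    return DEFAULT_DIRS["CHROMA_DIR"]


print(default_chroma_path())  # chroma_db
print(DEFAULT_LLM_MODEL)  # llama3.2
```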

## Site Configurations

Site configurations define how to crawl and parse specific websites. They're stored in `tapio/config/site_configs.yaml` and used by both crawl and parse commands.

### Configuration Structure

```yaml
sites:
  migri:
    base_url: "https://migri.fi"  # Used for crawling and converting relative links
    description: "Finnish Immigration Service website"
    crawler_config:  # Crawling behavior
      delay_between_requests: 1.0  # Seconds between requests
      max_concurrent: 3  # Concurrent request limit
    parser_config:  # Parser-specific configuration
      title_selector: "//title"  # XPath for page titles
      content_selectors:  # Priority-ordered content extraction
        - '//div[@id="main-content"]'
        - "//main"
        - "//article"
        - '//div[@class="content"]'
      fallback_to_body: true  # Use <body> if selectors fail
    markdown_config:  # HTML-to-Markdown options
      ignore_links: false
      body_width: 0  # No text wrapping
      protect_links: true
      unicode_snob: true
      ignore_images: false
      ignore_tables: false
```

### Required vs Optional Fields

**Required:**
- `base_url` - Base URL for the site (used for crawling and link resolution)

**Optional (with defaults):**
- `description` - Human-readable description
- `parser_config` - Parser-specific settings (uses defaults if omitted)
  - `title_selector` - Page title XPath (default: "//title")
  - `content_selectors` - XPath selectors for content extraction (default: ["//main", "//article", "//body"])
  - `fallback_to_body` - Use full-body content if selectors fail (default: true)
- `markdown_config` - HTML conversion settings (uses defaults if omitted)
- `crawler_config` - Crawling behavior settings (uses defaults if omitted)
  - `delay_between_requests` - Delay between requests in seconds (default: 1.0)
  - `max_concurrent` - Maximum concurrent requests (default: 5)
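
Because everything except `base_url` has a default, the smallest valid entry looks like this (illustrative):

```yaml
sites:
  minimal_site:  # illustrative site key
    base_url: "https://example.org"
```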

### Adding New Sites

1. Analyze the target website's structure
2. Identify XPath selectors for content extraction
3. Add configuration to `site_configs.yaml`:

```yaml
sites:
  my_site:
    base_url: "https://example.com"
    description: "Example site configuration"
    parser_config:
      content_selectors:
        - '//div[@class="main-content"]'
```

4. Use with commands:
```bash
uv run -m tapio.cli crawl my_site
uv run -m tapio.cli parse my_site
uv run -m tapio.cli vectorize
uv run -m tapio.cli tapio-app
```

## Ollama for LLM Inference

We use Ollama for local LLM inference.
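
Assuming a standard Ollama installation, you would typically pull the default model (`llama3.2`, matching `DEFAULT_LLM_MODEL`) before starting the app:

```bash
ollama pull llama3.2
```
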
87 changes: 1 addition & 86 deletions README.md
@@ -90,92 +90,7 @@ To view detailed site configurations:
uv run -m tapio.cli list-sites --verbose
```

## Site Configurations

Site configurations define how to crawl and parse specific websites. They're stored in `tapio/config/site_configs.yaml` and used by both crawl and parse commands.

### Configuration Structure

```yaml
sites:
  migri:
    base_url: "https://migri.fi"  # Used for crawling and converting relative links
    description: "Finnish Immigration Service website"
    crawler_config:  # Crawling behavior
      delay_between_requests: 1.0  # Seconds between requests
      max_concurrent: 3  # Concurrent request limit
    parser_config:  # Parser-specific configuration
      title_selector: "//title"  # XPath for page titles
      content_selectors:  # Priority-ordered content extraction
        - '//div[@id="main-content"]'
        - "//main"
        - "//article"
        - '//div[@class="content"]'
      fallback_to_body: true  # Use <body> if selectors fail
    markdown_config:  # HTML-to-Markdown options
      ignore_links: false
      body_width: 0  # No text wrapping
      protect_links: true
      unicode_snob: true
      ignore_images: false
      ignore_tables: false
```

### Required vs Optional Fields

**Required:**
- `base_url` - Base URL for the site (used for crawling and link resolution)

**Optional (with defaults):**
- `description` - Human-readable description
- `parser_config` - Parser-specific settings (uses defaults if omitted)
  - `title_selector` - Page title XPath (default: "//title")
  - `content_selectors` - XPath selectors for content extraction (default: ["//main", "//article", "//body"])
  - `fallback_to_body` - Use full body content if selectors fail (default: true)
- `markdown_config` - HTML conversion settings (uses defaults if omitted)
- `crawler_config` - Crawling behavior settings (uses defaults if omitted)
  - `delay_between_requests` - Delay between requests in seconds (default: 1.0)
  - `max_concurrent` - Maximum concurrent requests (default: 5)

### Adding New Sites

1. Analyze the target website's structure
2. Identify XPath selectors for content extraction
3. Add configuration to `site_configs.yaml`:

```yaml
sites:
  my_site:
    base_url: "https://example.com"
    description: "Example site configuration"
    parser_config:
      content_selectors:
        - '//div[@class="main-content"]'
```

4. Use with commands:
```bash
uv run -m tapio.cli crawl my_site
uv run -m tapio.cli parse my_site
uv run -m tapio.cli vectorize
uv run -m tapio.cli tapio-app
```

## Configuration

Tapio uses centralized configuration in `tapio/config/settings.py`:

```python
DEFAULT_DIRS = {
    "CRAWLED_DIR": "content/crawled",  # HTML storage
    "PARSED_DIR": "content/parsed",  # Markdown storage
    "CHROMA_DIR": "chroma_db",  # Vector database
}

DEFAULT_CHROMA_COLLECTION = "tapio" # ChromaDB collection name
```

-Site-specific configurations are in `tapio/config/site_configs.yaml` and automatically handle content extraction and directory organization based on the site's domain.
+For technical details on site configurations, programmatic API usage, and adding new sites, see [CONTRIBUTING.md](CONTRIBUTING.md).

## Contributing

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "tapio"
version = "0.1.0"
version = "2.0.0"
description = "An assistant for Finnish immigrants"
readme = "README.md"
requires-python = ">=3.10"
4 changes: 4 additions & 0 deletions pytest.ini
@@ -5,3 +5,7 @@ filterwarnings =
    ignore::DeprecationWarning:websockets.legacy:
    ignore::DeprecationWarning:.*:
    ignore::DeprecationWarning:gradio.utils:

markers =
    integration: Integration tests that use real components (slower, may download models)

8 changes: 8 additions & 0 deletions tapio/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1,9 @@
"""This file initializes the tapio package."""

from tapio.config.config_models import RAGConfig
from tapio.factories import RAGOrchestratorFactory

__all__ = [
"RAGConfig",
"RAGOrchestratorFactory",
]
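
These re-exports are what make the `from tapio import RAGConfig, RAGOrchestratorFactory` import in the CONTRIBUTING.md factory example work.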