
Refactor code structure and enhance documentation #1

Merged
brylie merged 17 commits into main from improve-maintainability on Feb 2, 2026

Conversation

@brylie (Owner) commented Feb 2, 2026

User description

Refactor the code for improved readability and maintainability, implement dependency injection for better testing, and enhance documentation in CONTRIBUTING.md and README.md to clarify testing guidelines and configuration structure.


PR Type

Enhancement, Tests


Description

  • Implement dependency injection pattern across RAG system components

  • Create factory classes for simplified orchestrator instantiation

  • Add comprehensive test fixtures and integration test suite

  • Refactor Parser, ChromaStore, and Vectorizer with injected dependencies

  • Enhance documentation with programmatic API and configuration guides
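
As a rough illustration of the new wiring (a minimal sketch: the RAGConfig and RAGOrchestratorFactory names follow this PR's summary, but the bodies and default values below are simplified stand-ins, not the actual tapio code):

```python
# Simplified sketch of the DI + factory pattern this PR introduces.
# All constructor bodies and default values are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class RAGConfig:
    # Defaults shown here mirror values mentioned elsewhere in this PR;
    # num_results is an assumed example value.
    embedding_model_name: str = "all-MiniLM-L6-v2"
    llm_model_name: str = "llama3.2:latest"
    max_tokens: int = 1024
    num_results: int = 3


class RAGOrchestrator:
    """Coordinates the injected retrieval and LLM services."""

    def __init__(self, doc_retrieval_service, llm_service):
        self.doc_retrieval_service = doc_retrieval_service
        self.llm_service = llm_service


class RAGOrchestratorFactory:
    """Wires up dependencies so callers only supply a RAGConfig."""

    def __init__(self, config: RAGConfig):
        self.config = config

    def create_orchestrator(self) -> RAGOrchestrator:
        # In the real factory these would be HuggingFaceEmbeddings,
        # ChromaStore, DocumentRetrievalService, and LLMService instances.
        doc_retrieval_service = object()
        llm_service = object()
        return RAGOrchestrator(doc_retrieval_service, llm_service)


orchestrator = RAGOrchestratorFactory(RAGConfig()).create_orchestrator()
```

The app then receives the fully wired orchestrator instead of constructing its own dependencies.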


Diagram Walkthrough

flowchart LR
  Config["RAGConfig<br/>Configuration"]
  Factory["RAGOrchestratorFactory<br/>Dependency Wiring"]
  Embeddings["HuggingFaceEmbeddings<br/>Injected"]
  ChromaStore["ChromaStore<br/>Injected"]
  DocService["DocumentRetrievalService<br/>Injected"]
  LLMService["LLMService<br/>Injected"]
  Orchestrator["RAGOrchestrator<br/>Coordinated"]
  App["TapioAssistantApp<br/>Simplified"]
  
  Config -- "creates" --> Factory
  Factory -- "creates" --> Embeddings
  Factory -- "creates" --> ChromaStore
  Factory -- "creates" --> DocService
  Factory -- "creates" --> LLMService
  Embeddings -- "injected into" --> ChromaStore
  ChromaStore -- "injected into" --> DocService
  DocService -- "injected into" --> Orchestrator
  LLMService -- "injected into" --> Orchestrator
  Orchestrator -- "injected into" --> App

File Walkthrough

Relevant files

Enhancement (11 files)
  • __init__.py: Export public API with factory classes (+9/-0)
  • app.py: Refactor with dependency injection for RAG orchestrator (+51/-88)
  • cli.py: Update CLI to use factory pattern and dependency injection (+63/-11)
  • config_models.py: Add RAGConfig dataclass for centralized configuration (+32/-0)
  • settings.py: Add embedding and RAG configuration defaults (+8/-0)
  • factories.py: Create factory classes for dependency injection (+183/-0)
  • parser.py: Refactor Parser with injected configuration and directories (+24/-16)
  • document_retrieval_service.py: Implement dependency injection for vector store (+19/-13)
  • rag_orchestrator.py: Refactor with injected document and LLM services (+27/-25)
  • chroma_store.py: Inject embeddings instance for flexibility (+23/-7)
  • vectorizer.py: Inject vector database and text splitter dependencies (+30/-39)

Tests (13 files)
  • conftest.py: Add comprehensive mock fixtures for testing (+129/-0)
  • __init__.py: Create integration test package (+1/-0)
  • test_parser_pipeline.py: Add integration tests for parser pipeline (+136/-0)
  • test_rag_pipeline.py: Add integration tests for RAG system end-to-end (+169/-0)
  • test_vectorization_pipeline.py: Add integration tests for vectorization pipeline (+129/-0)
  • test_parser.py: Update tests to use dependency injection pattern (+29/-25)
  • test_relative_links.py: Update tests to use dependency injection pattern (+19/-4)
  • test_document_retrieval_service.py: Simplify tests with injected mock dependencies (+7/-11)
  • test_rag_orchestrator.py: Simplify tests with injected mock dependencies (+11/-19)
  • test_cli.py: Update CLI tests for dependency injection pattern (+22/-136)
  • test_gradio_app.py: Refactor tests to use injected orchestrator (+20/-70)
  • test_chroma_store.py: Update tests to inject embeddings dependency (+59/-47)
  • test_vectorizer.py: Simplify tests with injected dependencies (+38/-132)

Documentation (2 files)
  • CONTRIBUTING.md: Enhance documentation with API and testing guidelines (+196/-2)
  • README.md: Move configuration details to CONTRIBUTING.md (+1/-86)

Configuration changes (2 files)
  • pyproject.toml: Bump version to 2.0.0 (+1/-1)
  • pytest.ini: Add integration test marker configuration (+4/-0)
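
For reference, an integration-marker registration of the kind described usually looks like this in pytest.ini (illustrative; the actual marker description used in this PR may differ):

```ini
[pytest]
markers =
    integration: marks tests as integration tests (deselect with '-m "not integration"')
```

Tests decorated with @pytest.mark.integration can then be selected with -m integration or skipped with -m "not integration".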

Summary by CodeRabbit

  • New Features

    • Public API exports: RAGConfig and RAGOrchestratorFactory for programmatic orchestration.
  • Documentation

    • Expanded CONTRIBUTING.md with programmatic API examples, site configuration guidance, default settings, and testing guidelines; README now redirects to CONTRIBUTING.md.
  • Tests

    • Added integration test suite and new pytest integration marker; many test fixtures and end-to-end tests added.
  • Refactor / Chores

    • Project refactored to dependency-injection patterns; bumped version to 2.0.0.

@coderabbitai bot commented Feb 2, 2026

Warning

Rate limit exceeded

@brylie has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 11 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

Refactors the codebase to adopt dependency injection and a factory for RAG orchestration, introduces RAGConfig and factory wiring (RAGOrchestratorFactory), updates constructors to accept injected services, adds defaults and pytest integration marker, bumps project version to 2.0.0, and expands integration tests and contributing docs.

Changes

Cohort / File(s): Summary

  • Documentation & Project (CONTRIBUTING.md, README.md, pyproject.toml, pytest.ini): Expanded CONTRIBUTING.md with testing/programmatic API/site-config docs; removed site-config block from README (points to CONTRIBUTING.md); bumped version to 2.0.0; added pytest integration marker.
  • Package exports & config defaults (tapio/__init__.py, tapio/config/config_models.py, tapio/config/settings.py): Exported RAGConfig and RAGOrchestratorFactory from package; added RAGConfig dataclass and new default constants (DEFAULT_EMBEDDING_MODEL, DEFAULT_LLM_MODEL, DEFAULT_MAX_TOKENS, DEFAULT_NUM_RESULTS).
  • Factories / Wiring (tapio/factories.py): Added RAGOrchestratorFactory to build embeddings, chroma store, document retrieval service, LLM service, and assemble a RAGOrchestrator.
  • App & CLI changes (tapio/app.py, tapio/cli.py): TapioAssistantApp now accepts an injected RAGOrchestrator; added check_model_availability; main() and CLI now create/configure a RAGOrchestrator via factory and pass it into the app.
  • Core services (DI) (tapio/services/document_retrieval_service.py, tapio/services/rag_orchestrator.py): Switched services to accept injected dependencies: DocumentRetrievalService takes a vector_store; RAGOrchestrator takes doc_retrieval_service and llm_service.
  • Vector store & vectorizer (DI) (tapio/vectorstore/chroma_store.py, tapio/vectorstore/vectorizer.py): ChromaStore now accepts injected embeddings; MarkdownVectorizer now accepts vector_db and text_splitter (no internal construction).
  • Parser (DI) (tapio/parser/parser.py): Parser constructor changed to accept site_config, input_dir, and output_dir (no longer loads config via config_path internally).
  • Tests & fixtures (tests/conftest.py, tests/integration/*, tests/*, tests/vectorstore/*, tests/services/*, tests/parser/*): Added many fixtures (mocks for embeddings, chroma, LLM, doc retrieval, rag orchestrator), registered integration pytest marker, added integration tests for parser/RAG/vectorization, and updated unit tests to the DI-based constructors and factory wiring.
  • Other test scaffolding (tests/integration/__init__.py): Added package initializer for integration tests.
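
The constructor changes summarized above amount to the following shape (a sketch: parameter names follow the change summary, while the real classes carry more configuration and behavior):

```python
# Illustrative shape of the DI constructor change described above.
# Parameter names come from the change summary; bodies are placeholders.

class DocumentRetrievalService:
    def __init__(self, vector_store) -> None:
        # Previously the service would construct its own vector store;
        # now the store is injected by the factory (or a test mock).
        self.vector_store = vector_store


class RAGOrchestrator:
    def __init__(self, doc_retrieval_service, llm_service) -> None:
        # Both collaborators are injected, so tests can pass mocks directly.
        self.doc_retrieval_service = doc_retrieval_service
        self.llm_service = llm_service
```

Because nothing is constructed internally, unit tests can instantiate either class with plain mocks and no patching.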

Sequence Diagram

sequenceDiagram
    participant CLI as CLI/main()
    participant Factory as RAGOrchestratorFactory
    participant Emb as Embeddings
    participant Store as ChromaStore
    participant DRS as DocumentRetrievalService
    participant LLM as LLMService
    participant Orch as RAGOrchestrator
    participant App as TapioAssistantApp

    CLI->>Factory: create_orchestrator()
    Factory->>Emb: create_embeddings()
    Emb-->>Factory: embeddings instance
    Factory->>Store: create_chroma_store(embeddings)
    Store-->>Factory: chroma store
    Factory->>DRS: create_document_retrieval_service(chroma_store)
    DRS-->>Factory: document retrieval service
    Factory->>LLM: create_llm_service()
    LLM-->>Factory: LLM service
    Factory->>Orch: create_orchestrator(drs, llm)
    Orch-->>Factory: orchestrator
    Factory-->>CLI: orchestrator
    CLI->>App: TapioAssistantApp(rag_orchestrator)
    App-->>CLI: app ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped through factories, one by one,
Injected services, now neatly spun.
Version two springs into the glade,
Tests and docs in order made.
A rabbit cheers: dependencies, well-played! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive. The title 'Refactor code structure and enhance documentation' is vague and generic, using non-descriptive language that fails to convey the main technical changes (dependency injection refactor, factory pattern implementation). Resolution: consider a more specific title that highlights the primary change, such as 'Implement dependency injection pattern across RAG components' or 'Refactor to factory-based orchestrator initialization and dependency injection'.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 98.72%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch improve-maintainability

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review bot commented Feb 2, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance: 🟢
No security concerns identified. No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: 🏷️
Unstructured logging: New log entries are plain-text messages rather than structured logs (e.g., JSON), which
reduces auditability and makes monitoring and parsing harder as required by the checklist.

Referred Code
logger.info(
    "Initialized RAG orchestrator",
)

Learn more about managing compliance generic rules or creating your own custom rules
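
A structured alternative to the flagged plain-text message could look like this (a sketch using only the standard library; the event name and field names are examples, not tapio conventions):

```python
# Emit one JSON object per log line so entries stay machine-parseable.
import json
import logging

logger = logging.getLogger("tapio")


def log_event(event: str, **fields) -> str:
    """Serialize an event as JSON, log it, and return the payload."""
    payload = json.dumps({"event": event, **fields})
    logger.info(payload)
    return payload


log_event("rag_orchestrator_initialized", llm_model="llama3.2:latest")
```

Downstream log processors can then filter and aggregate on the "event" key instead of parsing free text.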

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: 🏷️
Missing audit context: The new app flow logs operational messages (e.g., model availability) but does not show
any audit-trail fields like user identity or action outcome for potentially critical
actions, and it is unclear from the diff whether such audit logging exists elsewhere.

Referred Code
def check_model_availability(self) -> None:
    """Check if the LLM model is available.

    Raises:
        SystemExit: If the model is not available
    """
    if not self.rag_orchestrator.check_model_availability():
        logger.error("Required LLM model is not available")
        raise SystemExit(1)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: 🏷️
Error exposure unclear: The diff introduces a hard exit on model unavailability and logging, but it is not fully
visible whether user-facing error paths (e.g., UI/CLI exception messages) avoid leaking
internal details across all execution paths.

Referred Code
def check_model_availability(self) -> None:
    """Check if the LLM model is available.

    Raises:
        SystemExit: If the model is not available
    """
    if not self.rag_orchestrator.check_model_availability():
        logger.error("Required LLM model is not available")
        raise SystemExit(1)

def generate_rag_response(
    self,
    query: str,
    history: list[dict[str, Any]] | None = None,
) -> tuple[str, str]:
    """Generate a response using RAG and return both the response and retrieved documents.

    Args:
        query: The user's query
        history: Chat history



 ... (clipped 15 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: 🏷️
Path handling review: The CLI now constructs input_dir/output_dir paths from the site argument and default
directories, and while it appears constrained by available_sites, full
validation/sanitization of external inputs and filesystem safety cannot be confirmed from
the diff alone.

Referred Code
# Parse a specific site
if site in available_sites:
    typer.echo(f"🔧 Using configuration for site: {site}")

    # Get site config from config manager
    site_config = config_manager.get_site_config(site)

    # Determine input and output directories
    input_dir = os.path.join(DEFAULT_CONTENT_DIR, site, DEFAULT_DIRS["CRAWLED_DIR"])
    output_dir = os.path.join(DEFAULT_CONTENT_DIR, site, DEFAULT_DIRS["PARSED_DIR"])

    # Create parser with dependency injection
    parser = Parser(
        site_name=site,
        site_config=site_config,
        input_dir=input_dir,
        output_dir=output_dir,
    )
    results = parser.parse_all()

    # Output information


 ... (clipped 59 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review bot commented Feb 2, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Refactor the new factory pattern implementation
Suggestion Impact: Instead of completing and using VectorizerFactory, the commit removed the incomplete/unused VectorizerFactory class (and its MarkdownTextSplitter import), eliminating the factory pattern implementation rather than refactoring/finishing it.

code diff:

# File: tapio/factories.py
@@ -6,7 +6,6 @@
 """
 
 from langchain_community.embeddings import HuggingFaceEmbeddings
-from langchain_text_splitters import MarkdownTextSplitter
 
 from tapio.config.config_models import RAGConfig
 from tapio.services.document_retrieval_service import DocumentRetrievalService
@@ -125,60 +124,3 @@
             llm_service=llm_service,
         )
 
-
-class VectorizerFactory:
-    """Factory for creating MarkdownVectorizer instances.
-
-    Handles creation of text splitters and vector database instances for
-    the vectorization pipeline.
-
-    Args:
-        collection_name: Name of the ChromaDB collection
-        persist_directory: Directory path for ChromaDB persistence
-        embedding_model_name: Name of the HuggingFace embedding model
-        chunk_size: Size of text chunks for splitting
-        chunk_overlap: Overlap between consecutive chunks
-    """
-
-    def __init__(
-        self,
-        collection_name: str,
-        persist_directory: str = "chroma_db",
-        embedding_model_name: str = "all-MiniLM-L6-v2",
-        chunk_size: int = 1000,
-        chunk_overlap: int = 200,
-    ) -> None:
-        """Initialize the vectorizer factory.
-
-        Args:
-            collection_name: Name of the ChromaDB collection
-            persist_directory: Directory for ChromaDB persistence
-            embedding_model_name: HuggingFace embedding model name
-            chunk_size: Text chunk size
-            chunk_overlap: Overlap between chunks
-        """
-        self.collection_name = collection_name
-        self.persist_directory = persist_directory
-        self.embedding_model_name = embedding_model_name
-        self.chunk_size = chunk_size
-        self.chunk_overlap = chunk_overlap
-
-    def create_embeddings(self) -> HuggingFaceEmbeddings:
-        """Create embeddings instance.
-
-        Returns:
-            Configured HuggingFaceEmbeddings instance
-        """
-        return HuggingFaceEmbeddings(model_name=self.embedding_model_name)
-
-    def create_text_splitter(self) -> MarkdownTextSplitter:
-        """Create text splitter for markdown.
-
-        Returns:
-            Configured MarkdownTextSplitter instance
-        """
-        return MarkdownTextSplitter(
-            chunk_size=self.chunk_size,
-            chunk_overlap=self.chunk_overlap,
-        )
-

The VectorizerFactory is incomplete and unused. It should be completed to create
the MarkdownVectorizer and its dependencies, and then used in the vectorize CLI
command to replace manual dependency creation and ensure consistency with the
factory pattern.

Examples:

tapio/factories.py [129-183]
class VectorizerFactory:
    """Factory for creating MarkdownVectorizer instances.

    Handles creation of text splitters and vector database instances for
    the vectorization pipeline.

    Args:
        collection_name: Name of the ChromaDB collection
        persist_directory: Directory path for ChromaDB persistence
        embedding_model_name: Name of the HuggingFace embedding model

 ... (clipped 45 lines)
tapio/cli.py [362-378]
        # Create dependencies
        embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
        text_splitter = MarkdownTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
        )
        vector_db = Chroma(
            collection_name=collection_name,
            embedding_function=embeddings,
            persist_directory=db_dir,

 ... (clipped 7 lines)

Solution Walkthrough:

Before:

# In tapio/factories.py
class VectorizerFactory:
    # ... methods to create dependencies...
    def create_embeddings(self): ...
    def create_text_splitter(self): ...
    # Missing a method to create the actual vectorizer

# In tapio/cli.py
@app.command()
def vectorize(...):
    # Dependencies are created manually
    embeddings = HuggingFaceEmbeddings(...)
    text_splitter = MarkdownTextSplitter(...)
    vector_db = Chroma(...)

    # Vectorizer is instantiated with manually created dependencies
    vectorizer = MarkdownVectorizer(
        vector_db=vector_db,
        text_splitter=text_splitter,
    )
    vectorizer.process_directory(...)

After:

# In tapio/factories.py
class VectorizerFactory:
    # ... methods to create dependencies...
    def create_vector_db(self): ...
    def create_text_splitter(self): ...

    def create_vectorizer(self) -> MarkdownVectorizer:
        # Wires all dependencies together
        vector_db = self.create_vector_db()
        text_splitter = self.create_text_splitter()
        return MarkdownVectorizer(vector_db, text_splitter)

# In tapio/cli.py
@app.command()
def vectorize(...):
    # Use the factory to create the vectorizer
    factory = VectorizerFactory(...)
    vectorizer = factory.create_vectorizer()
    vectorizer.process_directory(...)
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a significant architectural inconsistency where the new VectorizerFactory is incomplete and unused, while the vectorize CLI command manually creates dependencies, contradicting the PR's goal of using factories for DI.

Impact: Medium
General
Raise custom exception instead of SystemExit

In check_model_availability, replace the SystemExit(1) call with a custom
exception (e.g., ModelUnavailableError) to allow for more graceful error
handling by callers.

tapio/app.py [44-52]

+class ModelUnavailableError(Exception):
+    """Custom exception for when the LLM model is not available."""
+    pass
+
+...
+
 def check_model_availability(self) -> None:
     """Check if the LLM model is available.
 
     Raises:
-        SystemExit: If the model is not available
+        ModelUnavailableError: If the model is not available
     """
     if not self.rag_orchestrator.check_model_availability():
         logger.error("Required LLM model is not available")
-        raise SystemExit(1)
+        raise ModelUnavailableError("Required LLM model is not available")

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies that using SystemExit is poor practice in a class, as it hinders reusability and testability. Replacing it with a custom exception is a significant improvement to the code's design and robustness.

Impact: Medium
Expose chunking parameters in CLI

Expose chunk_size and chunk_overlap as command-line options in the vectorize
command to allow users to configure the text splitting process.

tapio/cli.py [301-325]

 @app.command()
 def vectorize(
     site: str | None = typer.Argument(
         None,
         help="Site to vectorize (e.g. 'migri'). If not provided, all sites are processed.",
     ),
     embedding_model: str = typer.Option(
         "all-MiniLM-L6-v2",
         "--model",
         "-m",
         help="Name of the sentence-transformers model to use",
+    ),
+    chunk_size: int = typer.Option(
+        1000,
+        "--chunk-size",
+        help="Size of text chunks for splitting",
+    ),
+    chunk_overlap: int = typer.Option(
+        200,
+        "--chunk-overlap",
+        help="Overlap between consecutive text chunks",
     ),
     batch_size: int = typer.Option(
         20,
         "--batch-size",
         "-b",
         help="Number of documents to process in each batch",
     ),
     verbose: bool = typer.Option(
         False,
         "--verbose",
         "-v",
         help="Enable verbose output",
     ),
 ) -> None:

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion correctly points out that chunk_size and chunk_overlap are hardcoded within the vectorize command. Exposing them as CLI options is a valuable enhancement for user flexibility and tuning the vectorization process.

Impact: Low
Expose RAG parameters in CLI

Add command-line options for embedding_model and num_results to the tapio_app
command to allow for greater configuration of the RAG pipeline.

tapio/cli.py [453-472]

 @app.command()
 def tapio_app(
     model_name: str = typer.Option(
         "llama3.2:latest",
         "--model-name",
         "-m",
         help="Ollama model to use for LLM inference",
+    ),
+    embedding_model: str = typer.Option(
+        DEFAULT_EMBEDDING_MODEL,
+        "--embedding-model",
+        help="HuggingFace embedding model to use",
+    ),
+    num_results: int = typer.Option(
+        DEFAULT_NUM_RESULTS,
+        "--num-results",
+        "-n",
+        help="Number of documents to retrieve for context",
     ),
     max_tokens: int = typer.Option(
         1024,
         "--max-tokens",
         "-t",
         help="Maximum number of tokens to generate",
     ),
     share: bool = typer.Option(
         False,
         "--share",
         help="Create a shareable link for the app",
     ),
 ) -> None:

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that key RAG parameters like embedding_model_name and num_results are hardcoded in the tapio_app command. Adding them as CLI options significantly improves the command's flexibility and usability for tuning the RAG pipeline.

Impact: Low

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
CONTRIBUTING.md (1)

60-66: ⚠️ Potential issue | 🟡 Minor

Align the architecture diagram file name with the actual module name.

The diagram references gradio_app.py, but the Project Structure section lists app.py. Please update one of them so contributors don’t chase a non-existent file.

Also applies to: 321-323

🤖 Fix all issues with AI agents
In `@CONTRIBUTING.md`:
- Around line 456-459: Update the CONTRIBUTING.md entry for the fallback_to_body
option to hyphenate the compound adjective by changing the phrase "full body
content" to "full-body content" so the list item reads "... Use full-body
content if selectors fail (default: true)"; locate the line describing
`fallback_to_body` and make that single-word hyphenation change.
🧹 Nitpick comments (10)
tapio/vectorstore/chroma_store.py (1)

20-42: Good DI refactor; consider a more specific type hint for embeddings.

The dependency injection pattern is correctly implemented. Using Any for the embeddings parameter is pragmatic given LangChain's various embedding interfaces, but you could consider using langchain_core.embeddings.Embeddings base class for better type safety and IDE support.

💡 Optional type improvement
+from langchain_core.embeddings import Embeddings
+
 class ChromaStore:
     ...
     def __init__(
         self,
         collection_name: str,
-        embeddings: Any,
+        embeddings: Embeddings,
         persist_directory: str = "chroma_db",
     ) -> None:
tests/parser/test_relative_links.py (1)

205-216: Consider extracting ConfigManager import to module level.

The ConfigManager import is repeated inside test_domain_specific_url_handling. Since it's also used in setUp, consider moving the import to the top of the file with other imports for consistency.

♻️ Suggested change

Add to imports at the top of the file:

from tapio.config.config_manager import ConfigManager

Then remove the local imports on lines 90-91 and 206-207.

tapio/config/config_models.py (1)

146-151: Consider using constants from settings.py for defaults.

The AI summary mentions tapio/config/settings.py defines DEFAULT_EMBEDDING_MODEL, DEFAULT_LLM_MODEL, etc. Using those constants here would centralize default values and follow the DRY principle.

#!/bin/bash
# Check if settings.py has these constants
rg -n "DEFAULT_EMBEDDING_MODEL|DEFAULT_LLM_MODEL|DEFAULT_MAX_TOKENS|DEFAULT_NUM_RESULTS" tapio/config/settings.py
tests/integration/test_rag_pipeline.py (2)

114-119: Consider using unittest.mock.patch instead of monkey-patching.

Directly assigning a lambda to factory.create_embeddings works but is less explicit than using proper mocking, which provides better cleanup and is more idiomatic for tests.

♻️ Alternative using patch
+    from unittest.mock import patch
+
     # Create factory
     factory = RAGOrchestratorFactory(config)
 
-    # Mock the create_embeddings to return our mock
-    factory.create_embeddings = lambda: mock_embeddings
+    # Mock the create_embeddings to return our mock
+    with patch.object(factory, 'create_embeddings', return_value=mock_embeddings):
+        # Create orchestrator
+        orchestrator = factory.create_orchestrator()
 
-    # Create orchestrator
-    orchestrator = factory.create_orchestrator()

163-164: Assertion could be more precise.

assert len(results) <= 2 is a weak assertion. With 3 documents added and num_results=2, we'd typically expect exactly 2 results. Consider using assert len(results) == 2 unless there's a reason fewer results might be returned (e.g., similarity threshold filtering).

♻️ Suggested change
-    assert len(results) <= 2  # Should respect num_results limit
+    assert len(results) == 2  # Should return exactly num_results documents
tests/integration/test_vectorization_pipeline.py (1)

59-65: Consider hoisting the Chroma import to module level.

The from langchain_chroma import Chroma import is repeated inside each test function. Moving it to the top of the file with other imports would improve readability, unless there's a specific reason to delay the import (e.g., avoiding import errors when running non-integration tests).

Suggested refactor
 from langchain_text_splitters import MarkdownTextSplitter  # type: ignore[import-not-found]
+from langchain_chroma import Chroma  # type: ignore[import-not-found]
 
 from tapio.vectorstore.vectorizer import MarkdownVectorizer

Then remove the inline imports at lines 59 and 102.

Also applies to: 102-108

tests/parser/test_parser.py (2)

143-148: Remove duplicate import of ConfigManager.

ConfigManager is already imported at line 11. The local import here is redundant.

Suggested fix
-        # Create config manager and get site config
-        from tapio.config.config_manager import ConfigManager
-
-        config_manager = ConfigManager(self.config_path)
+        # Create config manager and get site config
+        config_manager = ConfigManager(self.config_path)

189-195: Remove duplicate import of ConfigManager.

ConfigManager is already imported at line 11. This local import at line 191 is redundant.

Suggested fix
     def test_init_with_invalid_site(self):
         """Test initialization with invalid site."""
-        from tapio.config.config_manager import ConfigManager
-
         config_manager = ConfigManager(self.config_path)
         with self.assertRaises(ValueError):
             config_manager.get_site_config("nonexistent")
tests/vectorstore/test_chroma_store.py (1)

242-271: Add @patch decorator to avoid real ChromaDB initialization.

This test doesn't patch Chroma, so it will attempt to create a real ChromaDB instance which may be slow or fail in some environments. Consider adding the patch for consistency with other tests.

Suggested fix
+    @patch("tapio.vectorstore.chroma_store.Chroma")
-    def test_enhance_document_with_citation(self, mock_embeddings):
+    def test_enhance_document_with_citation(self, mock_chroma, mock_embeddings):
         """Test enhancing a document with citation information."""
         # Initialize ChromaStore with injected embeddings
         store = ChromaStore(
             collection_name="test_collection",
             embeddings=mock_embeddings,
         )
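
Note the parameter order in the suggested fix: stacked @patch decorators apply bottom-up, so the decorator closest to the function supplies the first mock argument. A standard-library sketch of the same pattern (the patched os targets are illustrative stand-ins for the Chroma and embeddings patches):

```python
import os
import unittest
from unittest.mock import patch


class PatchOrderTests(unittest.TestCase):
    # Decorators apply bottom-up: @patch("os.getcwd") is innermost,
    # so mock_getcwd arrives first in the parameter list, mock_exists second.
    @patch("os.path.exists")
    @patch("os.getcwd")
    def test_argument_order(self, mock_getcwd, mock_exists):
        mock_getcwd.return_value = "/fake/dir"
        mock_exists.return_value = True
        self.assertEqual(os.getcwd(), "/fake/dir")
        self.assertTrue(os.path.exists("/anywhere"))
```

Mixing up the order is a common source of confusing failures, since each mock silently absorbs whatever configuration is applied to it.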
tests/conftest.py (1)

19-39: Avoid shared mutable embedding lists in the mock.

embed_query/embed_documents currently return the same list instance, which can cause cross-test coupling if any downstream code mutates it. Returning copies better mirrors real embedding behavior.

♻️ Proposed adjustment
-    embeddings.embed_query.return_value = dummy_embedding
+    embeddings.embed_query.side_effect = lambda _: dummy_embedding.copy()
@@
-    def embed_documents_mock(texts):
-        return [dummy_embedding for _ in texts]
+    def embed_documents_mock(texts):
+        return [dummy_embedding.copy() for _ in texts]
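The hazard is easy to reproduce with a plain MagicMock; a minimal standalone demonstration (variable names are illustrative):

```python
from unittest.mock import MagicMock

# Shared-instance mock: every call returns the SAME list object.
dummy = [0.1, 0.2, 0.3]
leaky = MagicMock()
leaky.embed_query.return_value = dummy

result = leaky.embed_query("first query")
result.append(999.0)                     # downstream code mutates the returned list...
second = leaky.embed_query("second query")
print(second)                            # [0.1, 0.2, 0.3, 999.0]: the mutation leaked

# Copy-per-call mock: each caller gets an independent list.
base = [0.1, 0.2, 0.3]
safe = MagicMock()
safe.embed_query.side_effect = lambda _: base.copy()

safe.embed_query("first query").append(999.0)
isolated = safe.embed_query("second query")
print(isolated)                          # [0.1, 0.2, 0.3]: unaffected
```

The side_effect version also better mirrors a real embedding model, which constructs a fresh vector on every call.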


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tapio/factories.py`:
- Line 8: Replace the deprecated import of HuggingFaceEmbeddings in
tapio/factories.py: change the import from langchain_community.embeddings to the
supported langchain_huggingface package, so that HuggingFaceEmbeddings is
imported from langchain_huggingface instead. This eliminates the deprecation
warning and matches usage elsewhere in the codebase (e.g., cli.py).

In `@tests/integration/test_rag_pipeline.py`:
- Line 78: Remove the duplicate import of patch by deleting the redundant "from
unittest.mock import patch" statement inside the test function, since patch is
already imported at the top of the file. Confirm no other references are
affected and run the tests to verify imports remain valid.
🧹 Nitpick comments (2)
tests/parser/test_parser.py (1)

231-247: Consider removing redundant import.

The DEFAULT_DIRS import at line 235 is redundant since it's already imported in setUp() at line 39. While this works correctly, you could reference the already-imported DEFAULT_DIRS or store it as an instance variable in setUp() for reuse.

♻️ Proposed refactor to remove redundant import

Store DEFAULT_DIRS as an instance attribute in setUp():

         from tapio.config.settings import DEFAULT_DIRS
+        self.DEFAULT_DIRS = DEFAULT_DIRS

         self.input_dir = os.path.join(self.temp_dir, self.site_name, DEFAULT_DIRS["CRAWLED_DIR"])

Then use self.DEFAULT_DIRS in test methods:

-        from tapio.config.settings import DEFAULT_DIRS
-
-        no_fallback_input_dir = os.path.join(self.temp_dir, self.no_fallback_site_name, DEFAULT_DIRS["CRAWLED_DIR"])
-        no_fallback_output_dir = os.path.join(self.temp_dir, self.no_fallback_site_name, DEFAULT_DIRS["PARSED_DIR"])
+        no_fallback_input_dir = os.path.join(self.temp_dir, self.no_fallback_site_name, self.DEFAULT_DIRS["CRAWLED_DIR"])
+        no_fallback_output_dir = os.path.join(self.temp_dir, self.no_fallback_site_name, self.DEFAULT_DIRS["PARSED_DIR"])
tapio/factories.py (1)

51-67: Consider using the broader Embeddings type for the parameter.

The type hint HuggingFaceEmbeddings | None is more restrictive than necessary. Since ChromaStore.__init__ accepts Embeddings (the base type), using Embeddings | None here would allow injecting other embedding implementations (e.g., for testing with mocks).

♻️ Suggested type broadening
+from langchain_core.embeddings import Embeddings
+
 ...
 
-    def create_chroma_store(self, embeddings: HuggingFaceEmbeddings | None = None) -> ChromaStore:
+    def create_chroma_store(self, embeddings: Embeddings | None = None) -> ChromaStore:

@brylie brylie merged commit 78dfcee into main Feb 2, 2026
4 checks passed
@brylie brylie deleted the improve-maintainability branch February 2, 2026 10:52