Feature: Selective includes and LLM-powered extraction for token-efficient prompts #190

@gltanaka

Problem

From docs/prompting_guide.md:

Tradeoff: Large includes consume context tokens. If only a small portion of a file is relevant, consider extracting that portion into a dedicated include file (e.g., docs/output_conventions.md rather than the full README.md).

Currently, the workaround is manual extraction: creating smaller include files by hand. This is labor-intensive and creates a maintenance burden whenever the source files change.

Proposal: Layered Selective Includes

A layered system that starts fully deterministic (Phases 1-5) and offers an explicit escape hatch for semantic extraction (Phase 6).


Phases 1-5: Deterministic Selectors

Extend <include> with a select attribute for structural extraction, plus lines, mode, and max_tokens/overflow attributes:

Syntax

<!-- Line range -->
<include path="src/config.py" lines="10-50"/>

<!-- Python: function/class extraction -->
<include path="src/utils.py" select="def:parse_user_id"/>
<include path="src/models.py" select="class:User"/>
<include path="src/service.py" select="class:UserService.validate"/>

<!-- Multiple selectors -->
<include path="src/api.py" select="def:get_user, def:create_user, class:UserSchema"/>

<!-- Markdown: section under heading -->
<include path="docs/config.md" select="section:Environment Variables"/>

<!-- Regex pattern -->
<include path="src/constants.py" select="pattern:/^API_.*=/"/>

<!-- Interface mode: signatures + docstrings only -->
<include path="src/billing.py" mode="interface"/>
<include path="src/billing.py" select="class:BillingService" mode="interface"/>

<!-- Token budget -->
<include path="docs/reference.md" select="section:API" max_tokens="1000" overflow="error"/>
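The token budget attributes could be enforced with a check along these lines. This is a minimal sketch, not the proposed implementation: the four-characters-per-token estimate, the names estimate_tokens, apply_budget, and TokenBudgetExceeded, and the "truncate" policy are all assumptions made for illustration. Only overflow="error" appears in the proposal.

```python
class TokenBudgetExceeded(ValueError):
    """Raised when included content does not fit within max_tokens."""


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token. A real implementation
    # would use the tokenizer of the target model instead.
    return (len(text) + 3) // 4


def apply_budget(text: str, max_tokens: int, overflow: str = "error") -> str:
    """Return text if it fits the budget, otherwise apply the overflow policy."""
    used = estimate_tokens(text)
    if used <= max_tokens:
        return text
    if overflow == "error":
        # The only policy named in the proposal.
        raise TokenBudgetExceeded(f"{used} tokens exceeds budget of {max_tokens}")
    if overflow == "truncate":
        # Hypothetical second policy, shown only to illustrate branching.
        return text[: max_tokens * 4]
    raise ValueError(f"unknown overflow policy: {overflow!r}")
```

Failing loudly by default keeps the behavior deterministic: a source file that grows past its budget breaks the build instead of silently changing the prompt.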

Selector Types by File Format

| File Type | Selectors |
|-----------|-----------|
| Python | def:name, class:Name, class:Name.method, lines:N-M |
| Markdown | section:Heading, heading:Title |
| JSON/YAML | path:config.database.host (JSONPath-like) |
| Any | lines:N-M, pattern:/regex/ |
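One way to turn a select attribute value into individual selectors is sketched below. The names Selector and parse_select are hypothetical, and the handling of commas inside /regex/ is an assumption: the proposal does not say how a comma-separated list interacts with pattern selectors.

```python
import re
from typing import List, NamedTuple


class Selector(NamedTuple):
    kind: str   # "def", "class", "section", "pattern", ...
    value: str  # "parse_user_id", "UserService.validate", "^API_.*=", ...


# The pattern:/.../ alternative is tried first so that commas inside the
# regex are not mistaken for selector separators.
_TOKEN = re.compile(r"pattern:/((?:\\.|[^/\\])*)/|(\w+):([^,]+)")


def parse_select(select: str) -> List[Selector]:
    """Split a select attribute value into (kind, value) pairs."""
    selectors = []
    for match in _TOKEN.finditer(select):
        regex, kind, value = match.groups()
        if regex is not None:
            selectors.append(Selector("pattern", regex))
        else:
            selectors.append(Selector(kind, value.strip()))
    if not selectors:
        raise ValueError(f"no selectors found in {select!r}")
    return selectors
```

For example, parse_select("def:get_user, class:UserSchema") yields one Selector per entry, which a dispatcher could then route to the Python, Markdown, or generic extractor based on kind and file type.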

Implementation

  • Python: Use ast module for robust parsing
  • Markdown: Regex-based heading hierarchy parser
  • Generic: Line ranges and regex patterns

mode="interface"

Extracts public API only (huge token savings for large modules):

  • Class/function signatures (not bodies)
  • Docstrings
  • Type hints
  • Skip _private members

<!-- Before: 500 tokens for full class -->
<include>src/billing/service.py</include>

<!-- After: ~80 tokens for interface only -->
<include path="src/billing/service.py" select="class:BillingService" mode="interface"/>

Phase 6: LLM-Powered Extraction with Caching

For semantic/fuzzy queries that can't be expressed as structural selectors.

Syntax

<extract path="docs/large_api.md">
  Authentication flow, JWT token structure, and refresh token handling
</extract>

Behavior

First run:

$ pdd generate my_module

⚠️  New extraction: docs/large_api.md → .pdd/extracts/a1b2c3d4.md
    Query: "Authentication flow, JWT tokens..."
    Extracting... done (842 tokens)
    
    Commit this file to ensure reproducible builds:
    git add .pdd/extracts/a1b2c3d4.md

✓ Generated: src/my_module.py

Subsequent runs (deterministic):

$ pdd generate my_module

✓ Using extraction: .pdd/extracts/a1b2c3d4.md (842 tokens)
✓ Generated: src/my_module.py

Source file changes (warn, don't auto-refresh):

$ pdd generate my_module

⚠️  Source changed: docs/large_api.md (modified since extraction)
    To refresh: pdd extract --refresh my_module.prompt
    Continuing with existing extraction...

✓ Generated: src/my_module.py

Key Design Decisions

  1. No auto-refresh — Source changes trigger a warning, not re-extraction. This prevents silent changes to builds from upstream typo fixes.

  2. Lockfile pattern — Cache is expected to be committed (like package-lock.json). This ensures reproducible builds across team members.

  3. Explicit refresh: run pdd extract --refresh when you intentionally want to pull in source changes.

Cache Structure

.pdd/
└── extracts/
    ├── a1b2c3d4.md          # Extracted content (human-readable/editable)
    └── a1b2c3d4.meta.json   # Provenance: source path, query, timestamp
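The cache layout above could be driven by code along these lines. This is a sketch under stated assumptions: the proposal does not say how the eight-character file name is derived, so hashing the source path together with the whitespace-normalized query is a guess, and extract_key, write_extract, and is_stale are hypothetical names. The meta fields mirror the ones listed (source path, query, timestamp).

```python
import hashlib
import json
from pathlib import Path


def _normalize(query: str) -> str:
    # Collapse whitespace so reflowing the query text does not change the key.
    return " ".join(query.split())


def extract_key(source_path: str, query: str) -> str:
    """Derive a short, stable cache key (assumed scheme: sha256 prefix)."""
    payload = f"{source_path}\n{_normalize(query)}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def write_extract(cache_dir: Path, source_path: str, query: str,
                  content: str, extracted_at: float) -> Path:
    """Write <key>.md and <key>.meta.json, returning the content path."""
    key = extract_key(source_path, query)
    cache_dir.mkdir(parents=True, exist_ok=True)
    body = cache_dir / f"{key}.md"
    body.write_text(content, encoding="utf-8")
    meta = {"source": source_path, "query": _normalize(query),
            "timestamp": extracted_at}
    (cache_dir / f"{key}.meta.json").write_text(
        json.dumps(meta, indent=2), encoding="utf-8")
    return body


def is_stale(cache_dir: Path, key: str, source_mtime: float) -> bool:
    """True if the source was modified after the recorded extraction time."""
    meta_path = cache_dir / f"{key}.meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return source_mtime > meta["timestamp"]
```

One caveat with a timestamp comparison: git does not preserve file modification times, so a fresh clone may report every extraction as stale. Storing a hash of the source content in the meta file would be a sturdier staleness check for a committed cache.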

CLI Commands

# Refresh extractions for a prompt
pdd extract --refresh prompts/my_module.prompt

# Preview extraction without caching
pdd extract --preview docs/large_api.md --query "auth flow"

# List all cached extractions
pdd extract --list

# Show staleness status
pdd extract --status

Comparison: When to Use What

| Scenario | Approach |
|----------|----------|
| "Get the UserService class" | select="class:UserService" |
| "Get lines 50-100" | lines="50-100" |
| "Get the Configuration section" | select="section:Configuration" |
| "Get just the public API" | mode="interface" |
| "Get everything about retry policies scattered across this 50-page doc" | <extract> |

Guidance: Use deterministic selectors (Phases 1-5) whenever possible. Use <extract> only for semantic queries that structural selectors can't handle.


Implementation Phases

| Phase | Feature | Complexity |
|-------|---------|------------|
| 1 | lines="N-M" selector | Low |
| 2 | Python AST selectors (def:, class:) | Medium |
| 3 | Markdown section: selector | Low |
| 4 | mode="interface" | Medium |
| 5 | max_tokens + overflow | Low |
| 6 | <extract> with LLM + caching | High |
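To give a sense of why Phase 3 is rated low complexity, a regex-based section: extractor can be quite small. The sketch below uses the hypothetical name extract_section and makes these assumptions: only ATX headings (# through ######) are recognized, heading text must match exactly, and a section runs until the next heading of the same or a higher level.

```python
import re
from typing import Optional

_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def extract_section(markdown: str, heading: str) -> Optional[str]:
    """Return the section under the given heading, including subsections."""
    lines = markdown.splitlines()
    start = None
    level = 0
    in_fence = False
    for index, line in enumerate(lines):
        # Ignore '#' lines inside fenced code blocks (e.g. shell comments).
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = _HEADING.match(line)
        if not match:
            continue
        if start is None:
            if match.group(2) == heading:
                start, level = index, len(match.group(1))
        elif len(match.group(1)) <= level:
            # A sibling or parent heading closes the section.
            return "\n".join(lines[start:index]).rstrip()
    if start is None:
        return None
    return "\n".join(lines[start:]).rstrip()
```

Returning None for a missing heading leaves it to the caller to decide whether an unmatched selector is an error or a warning.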

Example: Before and After

Before:

<!-- 800 lines, ~4000 tokens -->
<billing_service>
  <include>src/billing/service.py</include>
</billing_service>

<!-- 2000 tokens for full README -->
<project_context>
  <include>README.md</include>
</project_context>

After:

<!-- ~80 tokens for interface only -->
<billing_service>
  <include path="src/billing/service.py" 
           select="class:BillingService" 
           mode="interface"/>
</billing_service>

<!-- ~200 tokens for relevant section -->
<project_context>
  <include path="README.md" 
           select="section:Environment Variables, section:Configuration"/>
</project_context>

<!-- Semantic extraction for large unstructured docs -->
<api_reference>
  <extract path="docs/external_api_v3.md">
    Rate limiting, pagination, and error response formats
  </extract>
</api_reference>
