Feature: Selective includes and LLM-powered extraction for token-efficient prompts #190

## Problem

From `docs/prompting_guide.md`:

> *Tradeoff:* Large includes consume context tokens. If only a small portion of a file is relevant, consider extracting that portion into a dedicated include file (e.g., `docs/output_conventions.md` rather than the full `README.md`).

Currently, the workaround is **manual extraction**: creating smaller include files by hand. This is labor-intensive and creates a maintenance burden, because each hand-made excerpt must be updated whenever its source file changes.

## Proposal: Layered Selective Includes

A layered system that starts fully deterministic (Phases 1-5) and offers an explicit escape hatch for semantic extraction (Phase 6).

---

## Phases 1-5: Deterministic Selectors

Extend `<include>` with a `select` attribute for structural extraction, alongside `lines`, `mode`, `max_tokens`, and `overflow` attributes:

### Syntax

```xml
<!-- Line range -->
<include path="src/config.py" lines="10-50"/>

<!-- Python: a function, a class, or a single method -->
<include path="src/utils.py" select="def:parse_user_id"/>
<include path="src/models.py" select="class:User"/>
<include path="src/service.py" select="class:UserService.validate"/>

<!-- Multiple selectors, comma-separated -->
<include path="src/api.py" select="def:get_user, def:create_user, class:UserSchema"/>

<!-- Markdown section -->
<include path="docs/config.md" select="section:Environment Variables"/>

<!-- Lines matching a regex -->
<include path="src/constants.py" select="pattern:/^API_.*=/"/>

<!-- Interface only: whole file, or a single class -->
<include path="src/billing.py" mode="interface"/>
<include path="src/billing.py" select="class:BillingService" mode="interface"/>

<!-- Token budget: error if the selection exceeds max_tokens -->
<include path="docs/reference.md" select="section:API" max_tokens="1000" overflow="error"/>
```
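To make the comma-separated `select` syntax concrete, here is a minimal sketch of how the attribute value might be split into individual selectors. The `Selector` type, the `parse_select` name, and the rule that a comma only separates selectors when it is followed by a known `kind:` prefix are all assumptions for illustration, not part of the proposal.

```python
import re
from typing import NamedTuple


class Selector(NamedTuple):
    kind: str   # e.g. "def", "class", "section", "pattern"
    value: str  # e.g. "parse_user_id", "Environment Variables"


# Assumed rule: split only on commas that are followed by a known "kind:"
# prefix, so commas inside a heading or a /regex/ are left alone.
_SPLIT = re.compile(r",\s*(?=(?:def|class|section|heading|path|lines|pattern):)")


def parse_select(select: str) -> list[Selector]:
    """Parse a select attribute such as "def:a, class:B" into selectors."""
    selectors = []
    for part in _SPLIT.split(select.strip()):
        kind, sep, value = part.partition(":")
        if not sep or not value.strip():
            raise ValueError(f"malformed selector: {part!r}")
        selectors.append(Selector(kind.strip(), value.strip()))
    return selectors
```

Splitting on the first colon only keeps values such as `UserService.validate` or `/^API_.*=/` intact.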

### Selector Types by File Format

| File Type | Selectors |
|-----------|-----------|
| Python | `def:name`, `class:Name`, `class:Name.method`, `lines:N-M` |
| Markdown | `section:Heading`, `heading:Title` |
| JSON/YAML | `path:config.database.host` (JSONPath-like) |
| Any | `lines:N-M`, `pattern:/regex/` |
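As an illustration of the `section:Heading` selector, the sketch below returns a heading and everything under it, up to the next heading of the same or a higher level. It assumes ATX (`#`) headings only and skips fenced code blocks so a `# comment` inside code is not mistaken for a heading; `select_section` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # a fenced code block delimiter
_HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def select_section(markdown: str, title: str) -> str:
    """Return the section headed `title`, including its nested subsections."""
    lines = markdown.splitlines()
    start = level = None
    in_fence = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = _HEADING.match(line)
        if not match:
            continue
        depth = len(match.group(1))
        if start is None:
            if match.group(2) == title:
                start, level = i, depth
        elif depth <= level:
            # A sibling or parent heading closes the section.
            return "\n".join(lines[start:i]).rstrip()
    if start is None:
        raise KeyError(f"section not found: {title!r}")
    return "\n".join(lines[start:]).rstrip()
```

Raising on a missing heading, rather than returning an empty string, keeps a renamed section from silently dropping context out of a prompt.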

### Implementation

- **Python**: Use `ast` module for robust parsing
- **Markdown**: Regex-based heading hierarchy parser
- **Generic**: Line ranges and regex patterns
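The `ast`-based approach for Python selectors could look roughly like the following. This is a sketch under assumptions: `select_python` is a hypothetical helper, only top-level definitions and one level of class nesting are handled, and decorators are treated as part of the selected definition.

```python
import ast
import textwrap

FUNC = (ast.FunctionDef, ast.AsyncFunctionDef)


def _find(body, name, types):
    """Return the first node in `body` of the given types with that name."""
    for node in body:
        if isinstance(node, types) and node.name == name:
            return node
    raise KeyError(f"not found: {name!r}")


def select_python(source: str, selector: str) -> str:
    """Resolve def:name, class:Name, or class:Name.method to source text."""
    kind, _, target = selector.partition(":")
    tree = ast.parse(source)
    if kind == "def":
        node = _find(tree.body, target, FUNC)
    elif kind == "class":
        class_name, _, method = target.partition(".")
        node = _find(tree.body, class_name, (ast.ClassDef,))
        if method:
            node = _find(node.body, method, FUNC)
    else:
        raise ValueError(f"unsupported selector kind: {kind!r}")
    # Start at the first decorator, if any, so the definition stays complete.
    first = min([node.lineno] + [d.lineno for d in node.decorator_list])
    lines = source.splitlines()
    # Slicing the original text keeps comments and formatting; dedent makes
    # an extracted method readable on its own.
    return textwrap.dedent("\n".join(lines[first - 1 : node.end_lineno]))
```

Slicing the original lines by `lineno` and `end_lineno` (available since Python 3.8) preserves comments, which a round trip through `ast.unparse` would discard.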

### `mode="interface"`

Extracts only the public API, which can cut token usage substantially for large modules:
- Class and function **signatures** (not bodies)
- **Docstrings**
- **Type hints**
- `_private` members are skipped

```xml
<!-- Before: full file -->
<include>src/billing/service.py</include>

<!-- After: public interface of one class -->
<include path="src/billing/service.py" select="class:BillingService" mode="interface"/>
```
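One possible way to produce the interface view is to rewrite the syntax tree so each body is reduced to its docstring plus `...`, then regenerate source with `ast.unparse` (Python 3.9+). The sketch below makes several assumptions: the `interface` function name is hypothetical, dunder methods such as `__init__` are treated as public, and class-level attributes and the module docstring are dropped. Because the output is regenerated, comments and original formatting are not preserved.

```python
import ast

FUNC = (ast.FunctionDef, ast.AsyncFunctionDef)
DEFS = FUNC + (ast.ClassDef,)
_ELLIPSIS = ast.parse("...").body[0]  # an `...` statement used as a stub body


def _is_private(name: str) -> bool:
    # Assumption: dunder names (e.g. __init__) count as public.
    return name.startswith("_") and not (name.startswith("__") and name.endswith("__"))


def _stub(node):
    """Reduce a def/class to signature, docstring, and stubbed public members."""
    body = []
    if ast.get_docstring(node) is not None:
        body.append(node.body[0])  # keep the original docstring statement
    members = []
    if isinstance(node, ast.ClassDef):
        members = [
            _stub(child)
            for child in node.body
            if isinstance(child, DEFS) and not _is_private(child.name)
        ]
    node.body = body + (members or [_ELLIPSIS])
    return node


def interface(source: str) -> str:
    """Return signatures, type hints, and docstrings of public definitions."""
    tree = ast.parse(source)
    keep = [
        _stub(node)
        for node in tree.body
        if isinstance(node, DEFS) and not _is_private(node.name)
    ]
    return ast.unparse(ast.Module(body=keep, type_ignores=[]))
```

Combining `select="class:BillingService"` with `mode="interface"` would then amount to running the selector first and passing the selected source to `interface`.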

---

## Phase 6: LLM-Powered Extraction with Caching

For semantic/fuzzy queries that can't be expressed as structural selectors.

### Syntax

```xml
<extract path="docs/large_api.md">
  Authentication flow, JWT token structure, and refresh token handling
</extract>
```

### Behavior

**First run:**
```
$ pdd generate my_module

⚠️  New extraction: docs/large_api.md → .pdd/extracts/a1b2c3d4.md
    Query: "Authentication flow, JWT tokens..."
    Extracting... done (842 tokens)
    
    Commit this file to ensure reproducible builds:
    git add .pdd/extracts/a1b2c3d4.md

✓ Generated: src/my_module.py
```

**Subsequent runs** (deterministic):
```
$ pdd generate my_module

✓ Using extraction: .pdd/extracts/a1b2c3d4.md (842 tokens)
✓ Generated: src/my_module.py
```

**Source file changes** (warn, don't auto-refresh):
```
$ pdd generate my_module

⚠️  Source changed: docs/large_api.md (modified since extraction)
    To refresh: pdd extract --refresh prompts/my_module.prompt
    Continuing with existing extraction...

✓ Generated: src/my_module.py
```

### Key Design Decisions

1. **No auto-refresh**: Source changes trigger a warning, not re-extraction, so a minor upstream edit (such as a typo fix) cannot silently change build output.

2. **Lockfile pattern**: The cache is meant to be committed, like `package-lock.json`, so builds are reproducible across team members.

3. **Explicit refresh**: Run `pdd extract --refresh` when you intentionally want to pull in source changes.
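The three decisions above reduce to a small piece of control flow, sketched below. The sketch assumes staleness is detected by comparing a SHA-256 hash of the source with one recorded at extraction time; the proposal only says the metadata holds a timestamp, so the hash, the `Action` enum, and the `plan` function are illustrative assumptions.

```python
import hashlib
from enum import Enum
from typing import Optional


class Action(Enum):
    EXTRACT = "extract"        # no cache yet, or an explicit --refresh
    REUSE = "reuse"            # cache present and source unchanged
    WARN_STALE = "warn_stale"  # cache present, source changed: warn and reuse


def content_hash(text: str) -> str:
    """Hash source text so a later run can tell whether it changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan(recorded_hash: Optional[str], source_text: str, refresh: bool = False) -> Action:
    """Decide what to do with one <extract> tag.

    recorded_hash is the source hash stored at extraction time,
    or None when no cached extraction exists.
    """
    if recorded_hash is None or refresh:
        return Action.EXTRACT
    if recorded_hash != content_hash(source_text):
        # Never re-extract implicitly: the caller prints a warning and
        # continues with the committed extraction.
        return Action.WARN_STALE
    return Action.REUSE
```

Only the `EXTRACT` branch involves an LLM call, so a normal build with a committed cache stays deterministic.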

### Cache Structure

```
.pdd/
└── extracts/
    ├── a1b2c3d4.md          # Extracted content (human-readable/editable)
    └── a1b2c3d4.meta.json   # Provenance: source path, query, timestamp
```
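A sketch of how one cache entry might be written follows. The key derivation (first 8 hex characters of a SHA-256 over the source path and whitespace-normalized query) and the JSON field names are assumptions chosen to match the layout above; the proposal does not specify either.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def extract_key(source_path: str, query: str) -> str:
    """Derive a short, stable file stem from the source path and query."""
    normalized = " ".join(query.split())  # ignore indentation and line breaks
    digest = hashlib.sha256(f"{source_path}\n{normalized}".encode("utf-8"))
    return digest.hexdigest()[:8]


def write_extract(cache_dir: Path, source_path: str, query: str, content: str) -> Path:
    """Write <key>.md and <key>.meta.json; return the path of the .md file."""
    key = extract_key(source_path, query)
    cache_dir.mkdir(parents=True, exist_ok=True)
    extract_file = cache_dir / f"{key}.md"
    extract_file.write_text(content, encoding="utf-8")
    meta = {
        # Hypothetical field names covering source path, query, and timestamp.
        "source": source_path,
        "query": " ".join(query.split()),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_file = cache_dir / f"{key}.meta.json"
    meta_file.write_text(json.dumps(meta, indent=2) + "\n", encoding="utf-8")
    return extract_file
```

Normalizing whitespace before hashing means re-indenting the query inside an `<extract>` tag would not orphan the committed extraction.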

### CLI Commands

```bash
# Refresh extractions for a prompt
pdd extract --refresh prompts/my_module.prompt

# Preview extraction without caching
pdd extract --preview docs/large_api.md --query "auth flow"

# List all cached extractions
pdd extract --list

# Show staleness status
pdd extract --status
```

---

## Comparison: When to Use What

| Scenario | Approach |
|----------|----------|
| "Get the `UserService` class" | `select="class:UserService"` |
| "Get lines 50-100" | `lines="50-100"` |
| "Get the Configuration section" | `select="section:Configuration"` |
| "Get just the public API" | `mode="interface"` |
| "Get everything about retry policies scattered across this 50-page doc" | `<extract>` |

**Guidance:** Use deterministic selectors (Phases 1-5) whenever possible. Use `<extract>` only for semantic queries that structural selectors can't handle.

---

## Implementation Phases

| Phase | Feature | Complexity |
|-------|---------|------------|
| 1 | `lines="N-M"` selector | Low |
| 2 | Python AST selectors (`def:`, `class:`) | Medium |
| 3 | Markdown `section:` selector | Low |
| 4 | `mode="interface"` | Medium |
| 5 | `max_tokens` + `overflow` | Low |
| 6 | `<extract>` with LLM + caching | High |

---

## Example: Before and After

**Before:**
```xml
<billing_service>
  <include>src/billing/service.py</include>
</billing_service>

<project_context>
  <include>README.md</include>
</project_context>
```

**After:**
```xml
<billing_service>
  <include path="src/billing/service.py"
           select="class:BillingService"
           mode="interface"/>
</billing_service>

<project_context>
  <include path="README.md"
           select="section:Environment Variables, section:Configuration"/>
</project_context>

<api_reference>
  <extract path="docs/external_api_v3.md">
    Rate limiting, pagination, and error response formats
  </extract>
</api_reference>
```

---

## References

- `docs/prompting_guide.md` lines 156-161 (tradeoff discussion)
- `docs/prompting_guide.md` lines 272-278 (token-efficient examples)
- Related: #150 (Toon format for token efficiency)
- Related: #107 (Configurable context assembly)
