BeckettFrey/doc-weaver-agent

Doc Weaver fills <batch, min_chars, max_chars> (and optionally <batch, min_chars, max_chars, context_id>) placeholders in markdown templates with LLM-generated content that respects character-length constraints. Batches run sequentially so later placeholders see earlier results, while items within a batch run concurrently.
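
For example, a minimal template might look like this (bounds and section names are illustrative):

# <1, 10, 60>
> <1, 20, 90>

## Experience
- <2, 50, 200>

Batch 1 fills the title and tagline first; the batch-2 item is generated afterward, with those results already in view.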

How It Works

Flowchart

Pipeline Details

  1. HydrateQueue (hydrate_queue.py) parses all placeholders, replaces them with unique <<TASK_N>> markers, and groups tasks by batch number.
  2. Batches process sequentially (lower batch first). Within a batch, all items run concurrently via asyncio.gather; see the sketch after this list.
  3. For each task, the queue builds a Document where the current task's marker becomes <TODO> and all other unresolved markers become (will be filled later). This gives the LLM full document context.
  4. hydrate_item (hydrate_batch.py) calls the LLM via structured output (Pydantic Response model), then runs the text morpher if the result falls outside the character bounds.
  5. Results feed back into the queue; subsequent batches see prior results in the document.
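
The orchestration pattern, as a rough Python sketch (illustrative names and data shapes, not the actual hydrate_queue.py code):

import asyncio

async def hydrate(batches, hydrate_item):
    # Sketch: batches maps batch number -> list of (marker, task) pairs.
    results = {}
    for batch_num in sorted(batches):          # lower batches run first
        pairs = batches[batch_num]
        outputs = await asyncio.gather(        # items in a batch run concurrently
            *(hydrate_item(task, results) for _, task in pairs)
        )
        for (marker, _), text in zip(pairs, outputs):
            results[marker] = text             # later batches see these results
    return results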

Text Morpher

text_morpher/ is a LangGraph state machine that iteratively summarizes or expands text to fit [min_chars, max_chars]. The graph loops through summarizer/expander nodes with a configurable retry budget (default 3).
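
Conceptually, the loop behaves like this plain-Python equivalent (a sketch only; the real implementation is a LangGraph graph, and the helper names here are made up):

def morph_sketch(text, min_chars, max_chars, summarize, expand, retries=3):
    # Mirrors the validate -> summarize/expand cycle with a retry budget.
    for _ in range(retries):
        if len(text) > max_chars:
            text = summarize(text, min_chars, max_chars)  # summarizer node
        elif len(text) < min_chars:
            text = expand(text, min_chars, max_chars)     # expander node
        else:
            break  # within [min_chars, max_chars]: done
    return text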

Flowchart

What Problems Does This Solve?

Format and length constraints. Free-form generation gives you little control over how much text ends up in each section. Doc Weaver anchors every placeholder to explicit character bounds (min_chars, max_chars), and the text morpher enforces those bounds automatically. The template itself locks down heading hierarchy, section order, and overall document shape, so the output matches a predefined format every time.

Sequentially improved context. Because batches run in order, each generation step sees the resolved output of every earlier batch. Later sections build on top of what was already written rather than guessing at it. This gives the model an increasingly complete picture of the document as it fills in remaining placeholders, producing a more coherent and internally consistent result.

Grounded output, not hallucination. LLMs generating long-form documents from a single prompt tend to drift, fabricate details, or lose coherence. Doc Weaver breaks the problem into one placeholder at a time: each LLM call receives a full document preview with a single <TODO> marking exactly where to write. Combined with the user-supplied prompt, this gives the model a narrow, well-defined task grounded in concrete surrounding content rather than open-ended generation.

Placeholder Syntax

Placeholders follow the format <batch, min_chars, max_chars> or <batch, min_chars, max_chars, context_id>:

| Field | Description |
| --- | --- |
| batch | Processing order. Batch 1 runs first, then batch 2, etc. |
| min_chars | Minimum character count for the generated text. |
| max_chars | Maximum character count for the generated text. |
| context_id (optional) | Name of a stored context to include for this placeholder. Must be a valid identifier. |

Items sharing the same batch number run concurrently. Lower batch numbers run first, so their results are visible to later batches.

When a placeholder includes a context_id, the corresponding stored context text is prepended to the document preview for that task (in addition to the global --prompt context). See Context Management for how to store contexts.
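
Putting it together, a subsection might mix both forms (bounds are illustrative):

### Experience
- <1, 50, 200>
- <2, 100, 400, dam_engineering>

The batch-2 item runs after the batch-1 item has been filled, and its prompt additionally includes the stored dam_engineering context.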

Required Markdown Structure

Templates must follow this hierarchy pattern:

# Title
> Tagline

## Section
### Subsection
- Content item
- Content item

The parser enforces:

  • Exactly one # Title heading
  • Zero or one > Tagline line immediately after the title
  • Zero or more ## Section headings
  • Zero or more ### Subsection headings per section
  • Zero or more - Content items per subsection

Placeholders can appear at any level.

load_markdown produces a Document object (from doc_weaver.document) with header, tagline, and sections attributes. Each section maps to a list of SubSection objects containing Content items. Call document.preview() to render the document back to markdown. Document can be used for programmatic tasks like custom rendering or post-processing of your generated documents.
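
A minimal sketch of programmatic use, assuming load_markdown is importable from the parser module (the exact import path is not spelled out above; header, tagline, sections, and preview() are the documented attributes):

from doc_weaver.parser import load_markdown  # assumed module path

doc = load_markdown("./my-template.md")
print(doc.header)              # the "# Title" line
print(doc.tagline)             # the "> Tagline" line, if present
for section in doc.sections:   # sections -> SubSections -> Content items
    ...
markdown_text = doc.preview()  # render the Document back to markdown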

CLI Usage

Doc Weaver installs a doc-weaver command. It wraps the generation pipeline described above and lets you manage configuration, saved templates, and saved contexts.

Configure

Store configuration values (such as your OpenAI API key) in ~/.doc_weaver/.env:

# Set a config value
doc-weaver config set OPENAI_API_KEY sk-...

# View stored config (values are masked)
doc-weaver config show

The config file is created with 600 permissions. Values set here are loaded automatically on every doc-weaver invocation.

Validate a Template

Check that a markdown file has valid structure and well-formed placeholders before using it:

doc-weaver validate ./my-template.md

On success, prints a summary of placeholders and batches. On failure, lists all errors found and exits with code 1.

Template Management

# Save a template
doc-weaver template add resume-template ./my-template.md

# List saved templates
doc-weaver template list

# View a template
doc-weaver template show resume-template

# Delete a template
doc-weaver template remove resume-template

Templates are stored in ~/.doc_weaver/templates/.

Context Management

Store per-task context files that placeholders can reference via context_id:

# Save a context file
doc-weaver context add dam_engineering ./dam-context.txt

# List saved contexts
doc-weaver context list

# View a context
doc-weaver context show dam_engineering

# Delete a context
doc-weaver context remove dam_engineering

Contexts are stored in ~/.doc_weaver/contexts/. During generation, any placeholder with a context_id (e.g. <1, 50, 200, dam_engineering>) will have the matching context text prepended to its LLM prompt. If a placeholder references a context that hasn't been added, generation fails with an error listing the missing context(s).

Generate a Document

doc-weaver generate resume-template \
    --output-dir ./output \
    --prompt "Here are the jobs I've held... and here is the job I'm optimizing for..." \
    --model gpt-4o \
    --timeout 30

| Option | Description |
| --- | --- |
| --output-dir | Directory for output.md and metadata.json (required). |
| --prompt | Context string passed to the LLM for all placeholders. |
| --prompt-file | Read context from a file instead (mutually exclusive with --prompt). |
| --model | OpenAI model to use (default: gpt-4o). |
| --timeout | Seconds to wait per batch (default: 30). |

Each generation run writes a metadata.json alongside output.md. This file records per-task diagnostics so you can inspect exactly what happened during generation:

{
  "tasks": [
    {
      "task_number": 0,
      "marker": "<<TASK_0>>",
      "batch_num": 1,
      "char_range": [1, 50],
      "total_chars": 50,
      "elapsed_ms": 11541.36,
      "model": "gpt-4o",
      "context_id": null
    }
  ],
  "total_elapsed_ms": 14326.47,
  "model": "gpt-4o",
  "marker_document": "# <<TASK_0>>\n> <<TASK_1>>\n## Section\n"
}

| Field | Description |
| --- | --- |
| task_number | Index of the placeholder in parse order. |
| batch_num | Which batch the task belonged to. |
| char_range | The [min_chars, max_chars] constraint from the placeholder. |
| total_chars | Actual character count of the generated text. |
| elapsed_ms | Wall-clock time for that task (including any morph retries). |
| context_id | The context ID referenced by the placeholder, or null if none. |
| total_elapsed_ms | Wall-clock time for the entire generation run. |
| marker_document | The template with placeholders replaced by their <<TASK_N>> markers, useful for mapping tasks back to document positions. |

This is helpful for debugging length violations, spotting slow tasks, comparing models, and understanding how the template was decomposed into tasks.
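
For example, a small script using only the fields documented above can surface both kinds of problems (the 10-second threshold is arbitrary):

import json

with open("./output/metadata.json") as f:
    meta = json.load(f)

for task in meta["tasks"]:
    lo, hi = task["char_range"]
    if not lo <= task["total_chars"] <= hi:
        print(f"length violation: {task['marker']} "
              f"has {task['total_chars']} chars, wanted [{lo}, {hi}]")
    if task["elapsed_ms"] > 10_000:
        print(f"slow task: {task['marker']} took {task['elapsed_ms']:.0f} ms")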

Project Structure

src/doc_weaver/
  cli.py              # Click CLI (template management + generate)
  document.py         # Data models: Document, SubSection, Content
  parser.py           # Markdown parser with structural validation
  hydrate_queue.py    # Orchestration: batch ordering, marker injection, hydrate()
  hydrate_batch.py    # Per-item hydration: LLM call + morph
  responder.py        # LangChain structured output (Task agent)
  text_morpher/
    __init__.py       # LangGraph definition, simple_morph() entry point
    state.py          # AgentState TypedDict
    nodes.py          # Graph nodes: validate, summarize, expand, track_progress
  test/
    test_parser.py
    test_hydrate_queue.py
    test_hydrate_batch.py
    ...