High-quality tool for adapting EPUB content using LLMs - cleanup, style preservation, annotations, and more

Firindil/booksmith

Booksmith

A comprehensive toolkit for adapting EPUB content using Large Language Models. From simple cleanup to style-preserved deep editing to adding scholarly annotations, Booksmith provides a complete workflow for book transformation while maintaining quality, consistency, and author voice.

Alpha Software: This project is in early development. Core features work but expect rough edges, incomplete documentation, and breaking changes. Bug reports and feedback welcome via GitHub Issues.

Project Overview

Booksmith empowers users to adapt books however they see fit while maintaining the highest possible quality. Unlike simple find-and-replace tools, it uses LLM analysis to understand context, preserve author voice, and ensure narrative consistency across changes.

Design Philosophy

  • Quality by Default: Style preservation and consistency checking are built-in, not afterthoughts
  • Full Spectrum Support: From typo fixes to genre transformations
  • Transparency: Detailed change reports, drift scores, and dry-run modes
  • User Control: Configurable thresholds, approval workflows, and granular prompts

Use Case Spectrum

MINIMAL CHANGES                                              EXTENSIVE CHANGES
      |                                                              |
      v                                                              v
+----------+  +----------+  +----------+  +----------+  +------------------+
|  Cleanup |  | Filtering|  |  Style   |  |  Plot    |  |  Genre/Setting   |
|          |  |          |  | Adapt    |  |  Changes |  |  Transformation  |
+----------+  +----------+  +----------+  +----------+  +------------------+
| OCR fix  |  | Content  |  | Modernize|  | Character|  | Steampunk LOTR   |
| Typos    |  | removal  |  | language |  | arcs     |  | Sci-fi to Fantasy|
| Format   |  | Age-gate |  | Simplify |  | Endings  |  | Period changes   |
+----------+  +----------+  +----------+  +----------+  +------------------+

Features

Core Editing (epub_cleaner.py)

  • Two-Stage Analysis: Quick classification pass (FILTER/PASS) followed by selective rewriting
  • Full Context Preservation: Sends entire chapters for context, only rewrites flagged paragraphs
  • HTML Preservation: Maintains attributes, classes, and formatting on modified elements
  • Paragraph Removal: Can remove paragraphs entirely when appropriate
  • Audit Trail: Detailed change reports with before/after comparisons
  • Dry-Run Mode: Preview changes without modifying files
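The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the code in epub_cleaner.py: the function name two_stage_edit and the classify/rewrite callables are hypothetical stand-ins for the LLM calls, and treating an empty replacement as "remove this paragraph" is an assumption.

```python
from typing import Callable, Dict, List


def two_stage_edit(
    paragraphs: List[str],
    classify: Callable[[str], str],
    rewrite: Callable[[List[str]], Dict[int, str]],
) -> List[str]:
    """Run a cheap FILTER/PASS pass, then rewrite only the flagged paragraphs."""
    chapter_text = "\n\n".join(paragraphs)

    # Stage 1: one short call decides whether the chapter needs any work.
    if classify(chapter_text).strip().upper() != "FILTER":
        return list(paragraphs)

    # Stage 2: the whole chapter goes in for context; only the changed
    # paragraphs come back, keyed by their 1-based position.
    changes = rewrite(paragraphs)

    result = []
    for index, original in enumerate(paragraphs, start=1):
        replacement = changes.get(index, original)
        # Assumption: an empty replacement means the paragraph is removed.
        if replacement != "":
            result.append(replacement)
    return result
```

Passing the LLM calls in as plain callables keeps the control flow testable without an API key.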

Style Preservation (style_profiler.py, style_validator.py)

  • LLM-Based Style Analysis: Analyzes EPUBs to create comprehensive author style profiles
  • Profile Components: Prose style, vocabulary patterns, rhythm, character voices, themes
  • Do/Avoid Guidelines: Actionable rules for maintaining author voice
  • Drift Measurement: Scores how well modifications preserve original style (0-100)
  • Multi-Dimension Analysis: Sentence structure, tone, vocabulary, rhythm, voice, literary devices

Deep Editing (book_analyzer.py, change_planner.py, consistency_checker.py)

  • Book Model Extraction: Characters, locations, timeline, relationships, plot structure
  • Change Planning: Map ripple effects before making changes
  • Consistency Validation: Cross-chapter checks for contradictions
  • Structured Output: JSON models for programmatic access

Annotations (annotator.py, footnote_inserter.py)

  • Multiple Commentary Styles: Scholarly, historical, educational, devil's advocate, thematic, personal lens, fun facts, funny, cross-reference
  • Smart Passage Selection: LLM identifies annotation-worthy passages
  • EPUB3 Compliance: Proper footnote/endnote markup with bidirectional links
  • Configurable Frequency: Control annotations per chapter
  • Focus Areas: Target specific topics or themes
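For readers unfamiliar with EPUB3 footnote markup, the sketch below shows what a bidirectionally linked note can look like. It is not taken from footnote_inserter.py: the function name make_footnote and the fnref/fn id scheme are hypothetical. The epub:type values noteref and footnote come from the EPUB 3 structural semantics vocabulary, and the containing XHTML file must declare xmlns:epub="http://www.idpf.org/2007/ops".

```python
from html import escape
from typing import Tuple


def make_footnote(number: int, note_text: str) -> Tuple[str, str]:
    """Return (inline reference, note body) linked to each other by id."""
    ref_id = f"fnref{number}"   # hypothetical id scheme
    note_id = f"fn{number}"

    # The inline marker points forward to the note.
    reference = (
        f'<a epub:type="noteref" id="{ref_id}" href="#{note_id}">{number}</a>'
    )
    # The note points back to the marker, giving the two-way link.
    note = (
        f'<aside epub:type="footnote" id="{note_id}">'
        f'<p>{escape(note_text)} <a href="#{ref_id}">Back</a></p>'
        f"</aside>"
    )
    return reference, note
```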

Unified CLI (cli.py)

  • Subcommand Architecture: Clean separation of functionality
  • Workflow Support: Load preset configurations from YAML files
  • Consistent Interface: Common flags across all commands
  • Verbose Mode: Detailed progress information

Installation

Prerequisites

  • Python 3.8 or higher
  • An Anthropic API key (Claude)
  • DRM-free EPUB files - This tool requires standard, unprotected EPUB files.
    • Many stores sell DRM-free: Tor.com, Smashwords, Kobo (some titles), Google Play (some)
    • Public domain: Project Gutenberg, Standard Ebooks, Faded Page
    • If your legally purchased EPUB has DRM, you'll need to remove it first (search "Calibre DRM" - we can't link directly for legal reasons)

Setup

  1. Clone or download this repository:
git clone https://github.com/Firindil/booksmith.git
cd booksmith
  2. Install Python dependencies:
pip install -r requirements.txt

Packages:

  • anthropic - Claude API client
  • beautifulsoup4 - HTML parsing
  • pyyaml - Configuration files
  • lxml (optional) - Faster HTML parsing
  3. Set your API key:
# Linux/macOS
export ANTHROPIC_API_KEY="your-api-key-here"

# Windows (Command Prompt)
set ANTHROPIC_API_KEY=your-api-key-here

# Windows (PowerShell)
$env:ANTHROPIC_API_KEY="your-api-key-here"
  4. Create configuration files:
cp config.example.yaml config.yaml
cp prompts.example.yaml prompts.yaml

Quick Start Guides

Editing (with Style Preservation)

All editing in Booksmith preserves author voice by default. For best results, create a style profile first:

# Step 1: Estimate costs before processing
python cli.py estimate --input book.epub --workflow cleanup

# Step 2: Build an author profile (recommended)
python cli.py profile --input book.epub --output author_profile.json
# Or use multiple works for a richer profile:
python cli.py profile --input "author_works/*.epub" --output author_profile.json

# Step 3: Edit with style guidance
python cli.py edit --input book.epub --output cleaned.epub \
    --profile author_profile.json --max-drift 30

# Optional: preview the changes with a dry run before running Step 3
python cli.py edit --input book.epub --dry-run

The --max-drift flag warns you when style deviation exceeds your threshold (0-100 scale).

Key files to configure:

  • config.yaml - Model settings, paragraph selectors
  • prompts.yaml - Analysis and cleaning prompts (define what to look for and how to fix it)

Deep Editing (Plot/Character Changes)

For significant changes requiring narrative consistency (note that the plan and check subcommands are marked Not Yet Implemented in the CLI Reference):

# Step 1: Analyze the book structure
python cli.py analyze --input book.epub --output book_model.json

# Step 2: Plan your changes (maps ripple effects)
python cli.py plan --model book_model.json \
    --goal "Make the villain more sympathetic" --output change_plan.json

# Step 3: Review the plan, then apply changes via edit command

# Step 4: Check for consistency issues
python cli.py check --input modified.epub --model book_model.json \
    --output consistency_report.json

Annotations

Add scholarly commentary, historical context, or fun facts without modifying the original text:

# Step 1: List chapters to find what you want to annotate
python cli.py annotate --input book.epub --list-chapters

# Step 2: Test on a single chapter first
python cli.py annotate --input book.epub --chapters 4 \
    --style scholarly,historical,fun_facts --output test_annotated.epub

# Step 3: Annotate the full book (or selected chapters)
python cli.py annotate --input book.epub --output annotated.epub \
    --style scholarly,funny --format footnotes

Example commentary.yaml for more control:

styles: [scholarly, historical, fun_facts]
frequency: "2-4 per chapter"
focus_areas: [character motivation, historical context, literary devices]
avoid: [spoilers, plot revelations]
model: claude-sonnet-4-5-20250929

CLI Reference

The unified CLI (cli.py) provides these subcommands:

Global Options

--version, -v    Show version and exit
--verbose, -v    Enable verbose output
--workflow, -w   Load preset configuration from workflows/ directory

edit - Core EPUB Editing

python cli.py edit [options]

Options:
  --input, -i FILE      Input EPUB file (required)
  --output, -o FILE     Output EPUB file (default: input_cleaned.epub)
  --config, -c FILE     Path to config.yaml
  --prompts, -p FILE    Path to prompts.yaml
  --dry-run, -n         Analyze without modifying files

Examples:
  python cli.py edit --input book.epub --output clean.epub
  python cli.py edit --input book.epub --dry-run --verbose

profile - Build Author Style Profile

python cli.py profile [options]

Options:
  --input, -i FILES     Input EPUB file(s), supports glob patterns (required)
  --output, -o FILE     Output JSON profile path (default: author_profile.json)
  --model, -m MODEL     Claude model to use (default: claude-sonnet-4-20250514)

Examples:
  python cli.py profile --input book.epub --output profile.json
  python cli.py profile --input "author_works/*.epub" --output tolkien.json

validate - Check Style Drift

python cli.py validate [options]

Options:
  --original, -o FILE   Path to original text file (required)
  --modified, -m FILE   Path to modified text file (required)
  --profile, -p FILE    Optional author profile for comparison
  --output FILE         Save JSON report to file
  --model MODEL         Claude model (default: claude-sonnet-4-5-20250929)
  --json                Output only JSON (no formatted output)

Score Interpretation:
  90-100: Excellent - Style nearly perfectly preserved
  75-89:  Good - Minor stylistic differences
  50-74:  Moderate - Noticeable drift
  25-49:  Poor - Significant style changes
  0-24:   Severe - Almost entirely different style

Examples:
  python cli.py validate --original ch1.txt --modified ch1_edited.txt
  python cli.py validate -o orig.txt -m mod.txt --profile author.json
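
If you consume the JSON report in your own scripts, the score bands above translate directly into a lookup. A minimal sketch, assuming you already have the numeric score; the function name interpret_score is hypothetical and not part of Booksmith:

```python
def interpret_score(score: float) -> str:
    """Map a 0-100 style preservation score to the documented band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 50:
        return "Moderate"
    if score >= 25:
        return "Poor"
    return "Severe"
```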

analyze - Extract Book Model

python cli.py analyze [options]

Options:
  --input, -i FILE          Input EPUB file (required)
  --output, -o FILE         Output JSON file (default: book_model.json)
  --model, -m MODEL         Claude model (default: claude-sonnet-4-5-20250929)
  --rate-limit-delay SECS   Delay between API calls (default: 0.5)

Output Contains:
  - Characters (names, descriptions, relationships)
  - Locations (names, descriptions, significance)
  - Timeline (chapter-by-chapter events)
  - Plot structure (setup, rising action, climax, resolution)
  - Themes and narrative notes

Examples:
  python cli.py analyze --input book.epub --output model.json
  python cli.py analyze --input book.epub --verbose

plan - Plan Changes (Not Yet Implemented)

python cli.py plan [options]

Options:
  --model, -m FILE      Path to book_model.json (required)
  --goal, -g TEXT       Description of desired changes (required)
  --output, -o FILE     Output change plan JSON (default: change_plan.json)

Examples:
  python cli.py plan --model book_model.json \
      --goal "Convert to steampunk setting" --output plan.json

check - Verify Consistency (Not Yet Implemented)

python cli.py check [options]

Options:
  --input, -i FILE      Input EPUB to check (required)
  --model, -m FILE      Path to book_model.json for reference
  --original, -o FILE   Path to original EPUB for comparison
  --output FILE         Output report JSON (default: consistency_report.json)

Examples:
  python cli.py check --input modified.epub --model book_model.json

estimate - Estimate API Costs

python cli.py estimate [options]

Options:
  --input, -i FILE      Input EPUB file (required)
  --model, -m MODEL     Model to estimate for: haiku, sonnet, opus (default: sonnet)
  --workflow, -w TYPE   Workflow type: cleanup, filter, modernize, transform, annotate
  --with-profile        Include style profiling cost
  --with-analysis       Include book analysis cost
  --all-features        Include all optional features in estimate
  --json                Output as JSON

Examples:
  python cli.py estimate --input book.epub
  python cli.py estimate --input book.epub --model opus --workflow transform
  python cli.py estimate --input book.epub --all-features

annotate - Add Commentary

python cli.py annotate [options]

Options:
  --input, -i FILE      Input EPUB file (required)
  --output, -o FILE     Output EPUB file (default: input_annotated.epub)
  --config, -c FILE     Path to commentary config YAML
  --style, -s STYLES    Comma-separated styles (e.g., scholarly,funny)
  --format, -f FORMAT   Note placement: footnotes or endnotes (default: footnotes)
  --frequency TEXT      Annotation frequency (default: "2-4 per chapter")
  --list-chapters       List available chapters and exit
  --chapters SELECTION  Process only specific chapters (e.g., "1-3", "4,6,8")
  --dry-run, -n         Analyze without generating commentary
  --annotations-only    Generate JSON only, don't insert into EPUB

Commentary Styles:
  scholarly       - Literary analysis, sources, references
  historical      - Period context, author biography, events
  educational     - Vocabulary, concepts, explanations
  devils_advocate - Challenge assumptions, alternative views
  thematic        - Connections to other works, parallels
  personal_lens   - User-specified perspective
  fun_facts       - Trivia, behind-the-scenes, inspirations
  funny           - Humorous observations, witty asides
  cross_reference - Links to other texts, author's other works

Examples:
  python cli.py annotate --input book.epub --list-chapters
  python cli.py annotate --input book.epub --chapters 4-6 --style scholarly,funny
  python cli.py annotate --input book.epub --config commentary.yaml
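
The --chapters selection syntax ("1-3", "4,6,8") can be read as a comma-separated list of numbers and inclusive ranges. The sketch below shows one way to expand such a string; it illustrates the syntax only, and parse_chapter_selection is a hypothetical name, not the parser used by cli.py.

```python
from typing import List


def parse_chapter_selection(selection: str) -> List[int]:
    """Expand a selection such as "1-3" or "4,6,8" into sorted chapter numbers."""
    chapters = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part}")
            # Ranges are treated as inclusive on both ends.
            chapters.update(range(start, end + 1))
        else:
            chapters.add(int(part))
    return sorted(chapters)
```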

Workflow Reference

Workflows are preset configurations stored in the workflows/ directory. They combine settings for common tasks.

Using Workflows

# Load a workflow by name
python cli.py --workflow cleanup --input book.epub

# Workflow can specify the command
python cli.py --workflow annotate_scholarly --input book.epub

Creating Workflows

Create a YAML file in workflows/ (e.g., workflows/cleanup.yaml):

# workflows/cleanup.yaml
command: edit

# Settings for the edit command
edit:
  dry_run: false

# Can override config settings
model: claude-sonnet-4-5-20250929
rate_limit_delay: 0.2
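
One way to picture how a workflow file combines with config.yaml is a shallow merge in which the workflow's top-level keys win. The sketch below works on already-parsed dictionaries and is an assumption about the behaviour, not Booksmith's actual loader; resolve_workflow is a hypothetical name.

```python
from typing import Any, Dict, Tuple


def resolve_workflow(
    config: Dict[str, Any], workflow: Dict[str, Any]
) -> Tuple[str, Dict[str, Any], Dict[str, Any]]:
    """Return (command, command options, effective config)."""
    command = workflow["command"]
    # The section named after the command holds that command's options.
    command_options = dict(workflow.get(command, {}))

    # Assumption: every other top-level key overrides the base config.
    effective = dict(config)
    for key, value in workflow.items():
        if key not in ("command", command):
            effective[key] = value
    return command, command_options, effective
```

With the cleanup.yaml shown above and a base config that sets rate_limit_delay to 0.1, this would yield the edit command, {"dry_run": False} as its options, and an effective delay of 0.2.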

Example Workflows

cleanup.yaml - OCR and formatting fixes:

command: edit
edit:
  config: configs/cleanup_config.yaml
  prompts: prompts/cleanup_prompts.yaml

annotate_scholarly.yaml - Academic commentary:

command: annotate
annotate:
  style: scholarly,thematic,cross_reference
  format: endnotes
  frequency: "3-5 per chapter"

Configuration Formats

config.yaml

Main configuration for the editing engine:

# LLM Settings
model: claude-sonnet-4-5-20250929
provider: anthropic
rate_limit_delay: 0.1

# Token Limits
max_tokens_analysis: 10      # For FILTER/PASS decision
max_tokens_cleaning: 8000    # For paragraph rewrites

# HTML Parsing
paragraph_selectors:
  - "p.body-text"
  - "p.content"
  - "p"

chapter_selectors:
  - "h1"
  - "h2"
  - ".chapter-title"
  - "[class*='chapter']"

prompts.yaml

Defines how the LLM analyzes and cleans content:

# Analysis phase - quick classification
analysis:
  system: |
    You are a content analyzer. Respond with only FILTER or PASS.
  user: |
    Review this chapter for [YOUR CRITERIA].
    Respond FILTER if the criteria are met, PASS if clean.

    {chapter_text}

# Cleaning phase - selective rewrites
cleaning:
  system: |
    You are an editor. Only output paragraphs that need changes.
  user: |
    Review these numbered paragraphs. For each that needs changes:
    PARAGRAPH N: [rewritten text]

    If none need changes: NONE

    {numbered_chapter}

Placeholders:

  • {chapter_text} - Full chapter text (analysis phase)
  • {numbered_chapter} - Paragraphs with [PARAGRAPH N] markers (cleaning phase)
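
To make the cleaning-phase contract concrete, the sketch below builds a {numbered_chapter} string and parses a reply in the PARAGRAPH N: format, including the NONE case. It is a minimal illustration under assumptions: the function names are hypothetical, the blank-line separator between paragraphs is a guess, and the real parser in epub_cleaner.py may differ.

```python
import re
from typing import Dict, List

_PARAGRAPH_LINE = re.compile(r"^PARAGRAPH\s+(\d+):\s*(.*)$")


def number_paragraphs(paragraphs: List[str]) -> str:
    """Prefix each paragraph with a [PARAGRAPH N] marker (1-based)."""
    return "\n\n".join(
        f"[PARAGRAPH {i}] {text}" for i, text in enumerate(paragraphs, start=1)
    )


def parse_cleaning_response(response: str) -> Dict[int, str]:
    """Return {paragraph number: rewritten text}; empty when the reply is NONE."""
    text = response.strip()
    if text.upper() == "NONE":
        return {}

    changes: Dict[int, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        match = _PARAGRAPH_LINE.match(line)
        if match:
            current = int(match.group(1))
            changes[current] = match.group(2)
        elif current is not None and line:
            # A wrapped line continues the most recent paragraph.
            changes[current] += " " + line
    return changes
```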

commentary.yaml (for annotate command)

# Commentary styles to apply
styles:
  - scholarly
  - historical
  - funny

# How many annotations per chapter
frequency: "2-4 per chapter"

# Topics to focus on
focus_areas:
  - character motivation
  - historical context
  - literary devices

# Topics to avoid
avoid:
  - spoilers
  - plot revelations

# For personal_lens style
personal_lens: "economic analysis"

# LLM settings
model: claude-sonnet-4-5-20250929
provider: anthropic
rate_limit_delay: 0.5

# Passage selection
min_passage_length: 50
max_passage_length: 1000

constraints.yaml (Advanced - for voice composition)

Explicit style rules for precise control:

# Hard rules that must be followed
constraints:
  vocabulary:
    forbidden_words:
      - "utilize"  # use "use" instead
      - "commence" # use "begin" or "start"
    required_patterns:
      - "dialogue uses em-dashes for interruptions"

  sentence_structure:
    max_length: 35  # words
    vary_length: true

  tone:
    avoid:
      - "modern slang"
      - "casual contractions in narration"
    maintain:
      - "formal but warm narrator voice"

  character_voices:
    protagonist:
      speech_pattern: "short, direct sentences"
      vocabulary: "simple, action-oriented"
    mentor:
      speech_pattern: "longer, thoughtful sentences"
      vocabulary: "archaic, philosophical"
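
Two of these rules, forbidden_words and max_length, are mechanical enough to check without an LLM. The sketch below shows such a check; it is an illustration only, check_constraints is a hypothetical helper, and the sentence splitting is deliberately naive.

```python
import re
from typing import List


def check_constraints(text: str, forbidden_words: List[str], max_length: int) -> List[str]:
    """Report forbidden words and sentences longer than max_length words."""
    problems = []

    # Compare whole lowercase words so "utilize" does not match "utilized".
    words = set(re.findall(r"[a-z']+", text.lower()))
    for word in forbidden_words:
        if word.lower() in words:
            problems.append(f"forbidden word: {word}")

    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        count = len(sentence.split())
        if count > max_length:
            problems.append(f"sentence has {count} words (max {max_length})")
    return problems
```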

Examples

Example 1: Clean OCR Errors

prompts.yaml:

analysis:
  user: |
    Check if this text has OCR errors (rn->m, l->1, broken words).
    Reply FILTER if errors found, PASS if clean.
    {chapter_text}

cleaning:
  user: |
    Fix OCR errors in these paragraphs. Only output fixed paragraphs:
    PARAGRAPH N: [corrected text]

    If none need fixing: NONE

    {numbered_chapter}

Run:

python cli.py edit --input scanned_book.epub --output fixed.epub

Example 2: Content Filtering with Style Preservation

# Build profile from author's other works
python cli.py profile --input "author/*.epub" --output author.json

# Edit with style awareness and drift monitoring
python cli.py edit --input book.epub --output filtered.epub \
    --profile author.json --max-drift 25

Example 3: Academic Annotation

commentary.yaml:

styles: [scholarly, historical, thematic]
frequency: "4-6 per chapter"
focus_areas:
  - literary techniques
  - historical context
  - intertextual references
avoid: [plot spoilers]

Run:

python cli.py annotate --input classic.epub --config commentary.yaml \
    --output classic_annotated.epub --format endnotes

Example 4: Fun Commentary for Book Clubs

python cli.py annotate --input book.epub --output book_fun.epub \
    --style funny,fun_facts,devils_advocate \
    --frequency "3-4 per chapter" \
    --format footnotes

Example 5: Deep Edit Workflow

# Step 1: Understand the book
python cli.py analyze --input fantasy.epub --output fantasy_model.json

# Step 2: Review the model (characters, plot, timeline)
jq '.characters' fantasy_model.json

# Step 3: Plan changes
python cli.py plan --model fantasy_model.json \
    --goal "Change the mentor character from wise wizard to gruff warrior" \
    --output change_plan.json

# Step 4: Apply changes (manually or via edit command with modified prompts)

# Step 5: Validate consistency
python cli.py check --input fantasy_modified.epub --model fantasy_model.json
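
If jq is not available, the model can be inspected from Python instead. The sketch below assumes the JSON has a top-level "characters" list whose entries are objects with a "name" field; that schema is an assumption based on the jq example above, and character_names is a hypothetical helper.

```python
import json
from typing import List


def character_names(model_path: str) -> List[str]:
    """List character names from a book model JSON file."""
    with open(model_path, encoding="utf-8") as handle:
        model = json.load(handle)

    names = []
    # Assumed schema: {"characters": [{"name": ...}, ...]}
    for character in model.get("characters", []):
        if isinstance(character, dict) and "name" in character:
            names.append(character["name"])
    return names
```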

Troubleshooting

Common Issues

"No paragraphs found" for every chapter

  • Your CSS selectors don't match the EPUB's HTML structure
  • Solution: Unzip the EPUB, inspect the HTML to find correct class names
  • Update paragraph_selectors in config.yaml

Rate limit errors

  • Increase rate_limit_delay in config.yaml (try 0.5 or 1.0)
  • Consider using a model with higher rate limits
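
For intuition, rate_limit_delay can be thought of as a fixed pause between consecutive API calls. A minimal sketch of that idea follows; run_with_delay is a hypothetical name, and the sleep parameter exists only so the pause can be replaced in tests.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def run_with_delay(
    calls: Sequence[Callable[[], T]],
    rate_limit_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Run calls in order, pausing rate_limit_delay seconds between them."""
    results = []
    for index, call in enumerate(calls):
        if index > 0:
            # No pause before the first call, one pause before each later call.
            sleep(rate_limit_delay)
        results.append(call())
    return results
```

Raising the delay from 0.1 to 1.0 adds roughly 0.9 seconds per request, so a book that needs 60 calls would take about a minute longer.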

API key not found

  • Ensure ANTHROPIC_API_KEY environment variable is set
  • Check for typos in the key
  • Verify the key is valid at console.anthropic.com

Style drift too high

  • Your prompts may be too aggressive
  • Add more specific style guidance to prompts
  • Use --profile to provide author style reference
  • Lower the --max-drift threshold for stricter enforcement

Changes not applied correctly

  • Check the change report (*_changes_report.txt)
  • Verify your cleaning prompt specifies the PARAGRAPH N: text format
  • Ensure prompts handle the NONE case for clean chapters

Annotations not appearing in EPUB

  • Verify the annotations JSON was generated (check for *_annotations.json)
  • Ensure paragraph indices match the EPUB structure
  • Check footnote CSS is properly linked

Debug Mode

Enable verbose output to diagnose issues:

python cli.py edit --input book.epub --dry-run --verbose

This shows:

  • Which chapters are analyzed
  • FILTER/PASS decisions
  • Paragraphs selected for rewriting
  • LLM responses
  • Style drift scores (if profile provided)

Getting the Right Selectors

  1. Rename your EPUB to .zip and extract it
  2. Open an HTML file from the content folder
  3. Find the paragraph elements and note their classes
  4. Update paragraph_selectors in config.yaml

Example HTML inspection:

<p class="x04-body-text">This is body text...</p>
<p class="x05-chapter-opener">Chapter opening text...</p>

Config update:

paragraph_selectors:
  - ".x04-body-text"
  - ".x05-chapter-opener"
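
The manual steps above can also be automated. The sketch below opens an EPUB as a zip archive and counts the class names used on p elements, using only the standard library; count_paragraph_classes is a hypothetical helper, not part of Booksmith. The most frequent classes are usually the ones to put in paragraph_selectors.

```python
import zipfile
from collections import Counter
from html.parser import HTMLParser


class _ParagraphClassCounter(HTMLParser):
    """Collects class names seen on <p> start tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        class_attr = dict(attrs).get("class")
        if not class_attr:
            self.counts["(no class)"] += 1
            return
        for name in class_attr.split():
            self.counts[name] += 1


def count_paragraph_classes(epub_path) -> Counter:
    """Count paragraph classes across all HTML/XHTML files in an EPUB."""
    totals = Counter()
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _ParagraphClassCounter()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                totals.update(parser.counts)
    return totals
```

Calling count_paragraph_classes("book.epub").most_common(5) lists the five most common paragraph classes.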

Exit Codes

The CLI uses these exit codes for scripting:

validate command:

  • 0: Style preserved well (score >= 50)
  • 1: Poor preservation (score 25-49)
  • 2: Severe drift (score < 25)

check command:

  • 0: No issues found
  • 1: Some issues found
  • 2: High severity issues
  • 3: Critical issues
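
In a script, the validate exit codes listed above can be turned into readable messages. The sketch below runs a command with subprocess and looks up its return code; run_and_explain and the message table are hypothetical, and the meanings are copied from the validate list in this section.

```python
import subprocess
import sys
from typing import List

# Meanings taken from the validate exit codes documented above.
VALIDATE_MEANINGS = {
    0: "style preserved well (score >= 50)",
    1: "poor preservation (score 25-49)",
    2: "severe drift (score < 25)",
}


def run_and_explain(cmd: List[str]) -> str:
    """Run a command and describe its exit code using the validate table."""
    completed = subprocess.run(cmd)
    return VALIDATE_MEANINGS.get(
        completed.returncode, f"unexpected exit code {completed.returncode}"
    )
```

A typical call would be run_and_explain([sys.executable, "cli.py", "validate", "--original", "ch1.txt", "--modified", "ch1_edited.txt"]).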

License

MIT License

Copyright (c) 2024

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
