A comprehensive toolkit for adapting EPUB content using Large Language Models. From simple cleanup to style-preserved deep editing to adding scholarly annotations, Booksmith provides a complete workflow for book transformation while maintaining quality, consistency, and author voice.
Alpha Software: This project is in early development. Core features work but expect rough edges, incomplete documentation, and breaking changes. Bug reports and feedback welcome via GitHub Issues.
- Project Overview
- Features
- Installation
- Quick Start Guides
- CLI Reference
- Workflow Reference
- Configuration Formats
- Examples
- Troubleshooting
- License
Booksmith empowers users to adapt books however they see fit while maintaining the highest possible quality. Unlike simple find-and-replace tools, it uses LLM analysis to understand context, preserve author voice, and ensure narrative consistency across changes.
- Quality by Default: Style preservation and consistency checking are built-in, not afterthoughts
- Full Spectrum Support: From typo fixes to genre transformations
- Transparency: Detailed change reports, drift scores, and dry-run modes
- User Control: Configurable thresholds, approval workflows, and granular prompts
MINIMAL CHANGES                                        EXTENSIVE CHANGES
     |                                                        |
     v                                                        v
+----------+ +----------+ +----------+ +----------+ +------------------+
| Cleanup | | Filtering| | Style | | Plot | | Genre/Setting |
| | | | | Adapt | | Changes | | Transformation |
+----------+ +----------+ +----------+ +----------+ +------------------+
| OCR fix | | Content | | Modernize| | Character| | Steampunk LOTR |
| Typos | | removal | | language | | arcs | | Sci-fi to Fantasy|
| Format | | Age-gate | | Simplify | | Endings | | Period changes |
+----------+ +----------+ +----------+ +----------+ +------------------+
- Two-Stage Analysis: Quick classification pass (FILTER/PASS) followed by selective rewriting
- Full Context Preservation: Sends entire chapters for context, only rewrites flagged paragraphs
- HTML Preservation: Maintains attributes, classes, and formatting on modified elements
- Paragraph Removal: Can remove paragraphs entirely when appropriate
- Audit Trail: Detailed change reports with before/after comparisons
- Dry-Run Mode: Preview changes without modifying files
- LLM-Based Style Analysis: Analyzes EPUBs to create comprehensive author style profiles
- Profile Components: Prose style, vocabulary patterns, rhythm, character voices, themes
- Do/Avoid Guidelines: Actionable rules for maintaining author voice
- Drift Measurement: Scores how well modifications preserve original style (0-100)
- Multi-Dimension Analysis: Sentence structure, tone, vocabulary, rhythm, voice, literary devices
- Book Model Extraction: Characters, locations, timeline, relationships, plot structure
- Change Planning: Map ripple effects before making changes
- Consistency Validation: Cross-chapter checks for contradictions
- Structured Output: JSON models for programmatic access
- Multiple Commentary Styles: Scholarly, historical, educational, devil's advocate, thematic, fun facts, funny, cross-reference
- Smart Passage Selection: LLM identifies annotation-worthy passages
- EPUB3 Compliance: Proper footnote/endnote markup with bidirectional links
- Configurable Frequency: Control annotations per chapter
- Focus Areas: Target specific topics or themes
- Subcommand Architecture: Clean separation of functionality
- Workflow Support: Load preset configurations from YAML files
- Consistent Interface: Common flags across all commands
- Verbose Mode: Detailed progress information
- Python 3.8 or higher
- An Anthropic API key (Claude)
- DRM-free EPUB files - This tool requires standard, unprotected EPUB files.
- Many stores sell DRM-free: Tor.com, Smashwords, Kobo (some titles), Google Play (some)
- Public domain: Project Gutenberg, Standard Ebooks, Faded Page
- If your legally purchased EPUB has DRM, you'll need to remove it first (search "Calibre DRM" - we can't link directly for legal reasons)
- Clone or download this repository:
git clone https://github.com/Firindil/booksmith.git
cd booksmith
- Install Python dependencies:
pip install -r requirements.txt
Required packages:
- anthropic - Claude API client
- beautifulsoup4 - HTML parsing
- pyyaml - Configuration files
- lxml (optional) - Faster HTML parsing
- Set your API key:
# Linux/macOS
export ANTHROPIC_API_KEY="your-api-key-here"
# Windows (Command Prompt)
set ANTHROPIC_API_KEY=your-api-key-here
# Windows (PowerShell)
$env:ANTHROPIC_API_KEY="your-api-key-here"
- Create configuration files:
cp config.example.yaml config.yaml
cp prompts.example.yaml prompts.yaml
All editing in Booksmith preserves author voice by default. For best results, create a style profile first:
# Step 1: Estimate costs before processing
python cli.py estimate --input book.epub --workflow cleanup
# Step 2: Build an author profile (recommended)
python cli.py profile --input book.epub --output author_profile.json
# Or use multiple works for a richer profile:
python cli.py profile --input "author_works/*.epub" --output author_profile.json
# Step 3: Edit with style guidance
python cli.py edit --input book.epub --output cleaned.epub \
--profile author_profile.json --max-drift 30
# Step 4: Preview changes first (optional)
python cli.py edit --input book.epub --dry-run
The --max-drift flag warns you when style deviation exceeds your threshold (0-100 scale).
Key files to configure:
- config.yaml - Model settings, paragraph selectors
- prompts.yaml - Analysis and cleaning prompts (define what to look for and how to fix it)
For significant changes requiring narrative consistency:
# Step 1: Analyze the book structure
python cli.py analyze --input book.epub --output book_model.json
# Step 2: Plan your changes (maps ripple effects)
python cli.py plan --model book_model.json \
--goal "Make the villain more sympathetic" --output change_plan.json
# Step 3: Review the plan, then apply changes via edit command
# Step 4: Check for consistency issues
python cli.py check --input modified.epub --model book_model.json \
--output consistency_report.json
Add scholarly commentary, historical context, or fun facts without modifying the original text:
# Step 1: List chapters to find what you want to annotate
python cli.py annotate --input book.epub --list-chapters
# Step 2: Test on a single chapter first
python cli.py annotate --input book.epub --chapters 4 \
--style scholarly,historical,fun_facts --output test_annotated.epub
# Step 3: Annotate the full book (or selected chapters)
python cli.py annotate --input book.epub --output annotated.epub \
--style scholarly,funny --format footnotes
Example commentary.yaml for more control:
styles: [scholarly, historical, fun_facts]
frequency: "2-4 per chapter"
focus_areas: [character motivation, historical context, literary devices]
avoid: [spoilers, plot revelations]
model: claude-sonnet-4-5-20250929
The unified CLI (cli.py) provides these subcommands:
--version, -v Show version and exit
--verbose, -v Enable verbose output
--workflow, -w Load preset configuration from workflows/ directory
python cli.py edit [options]
Options:
--input, -i FILE Input EPUB file (required)
--output, -o FILE Output EPUB file (default: input_cleaned.epub)
--config, -c FILE Path to config.yaml
--prompts, -p FILE Path to prompts.yaml
--dry-run, -n Analyze without modifying files
Examples:
python cli.py edit --input book.epub --output clean.epub
python cli.py edit --input book.epub --dry-run --verbose
python cli.py profile [options]
Options:
--input, -i FILES Input EPUB file(s), supports glob patterns (required)
--output, -o FILE Output JSON profile path (default: author_profile.json)
--model, -m MODEL Claude model to use (default: claude-sonnet-4-20250514)
Examples:
python cli.py profile --input book.epub --output profile.json
python cli.py profile --input "author_works/*.epub" --output tolkien.json
python cli.py validate [options]
Options:
--original, -o FILE Path to original text file (required)
--modified, -m FILE Path to modified text file (required)
--profile, -p FILE Optional author profile for comparison
--output FILE Save JSON report to file
--model MODEL Claude model (default: claude-sonnet-4-5-20250929)
--json Output only JSON (no formatted output)
Score Interpretation:
90-100: Excellent - Style nearly perfectly preserved
75-89: Good - Minor stylistic differences
50-74: Moderate - Noticeable drift
25-49: Poor - Significant style changes
0-24: Severe - Almost entirely different style
Examples:
python cli.py validate --original ch1.txt --modified ch1_edited.txt
python cli.py validate -o orig.txt -m mod.txt --profile author.json
python cli.py analyze [options]
Options:
--input, -i FILE Input EPUB file (required)
--output, -o FILE Output JSON file (default: book_model.json)
--model, -m MODEL Claude model (default: claude-sonnet-4-5-20250929)
--rate-limit-delay SECS Delay between API calls (default: 0.5)
Output Contains:
- Characters (names, descriptions, relationships)
- Locations (names, descriptions, significance)
- Timeline (chapter-by-chapter events)
- Plot structure (setup, rising action, climax, resolution)
- Themes and narrative notes
Examples:
python cli.py analyze --input book.epub --output model.json
python cli.py analyze --input book.epub --verbose
python cli.py plan [options]
Options:
--model, -m FILE Path to book_model.json (required)
--goal, -g TEXT Description of desired changes (required)
--output, -o FILE Output change plan JSON (default: change_plan.json)
Examples:
python cli.py plan --model book_model.json \
--goal "Convert to steampunk setting" --output plan.json
python cli.py check [options]
Options:
--input, -i FILE Input EPUB to check (required)
--model, -m FILE Path to book_model.json for reference
--original, -o FILE Path to original EPUB for comparison
--output FILE Output report JSON (default: consistency_report.json)
Examples:
python cli.py check --input modified.epub --model book_model.json
python cli.py estimate [options]
Options:
--input, -i FILE Input EPUB file (required)
--model, -m MODEL Model to estimate for: haiku, sonnet, opus (default: sonnet)
--workflow, -w TYPE Workflow type: cleanup, filter, modernize, transform, annotate
--with-profile Include style profiling cost
--with-analysis Include book analysis cost
--all-features Include all optional features in estimate
--json Output as JSON
Examples:
python cli.py estimate --input book.epub
python cli.py estimate --input book.epub --model opus --workflow transform
python cli.py estimate --input book.epub --all-features
python cli.py annotate [options]
Options:
--input, -i FILE Input EPUB file (required)
--output, -o FILE Output EPUB file (default: input_annotated.epub)
--config, -c FILE Path to commentary config YAML
--style, -s STYLES Comma-separated styles (e.g., scholarly,funny)
--format, -f FORMAT Note placement: footnotes or endnotes (default: footnotes)
--frequency TEXT Annotation frequency (default: "2-4 per chapter")
--list-chapters List available chapters and exit
--chapters SELECTION Process only specific chapters (e.g., "1-3", "4,6,8")
--dry-run, -n Analyze without generating commentary
--annotations-only Generate JSON only, don't insert into EPUB
Commentary Styles:
scholarly - Literary analysis, sources, references
historical - Period context, author biography, events
educational - Vocabulary, concepts, explanations
devils_advocate - Challenge assumptions, alternative views
thematic - Connections to other works, parallels
personal_lens - User-specified perspective
fun_facts - Trivia, behind-the-scenes, inspirations
funny - Humorous observations, witty asides
cross_reference - Links to other texts, author's other works
Examples:
python cli.py annotate --input book.epub --list-chapters
python cli.py annotate --input book.epub --chapters 4-6 --style scholarly,funny
python cli.py annotate --input book.epub --config commentary.yaml
Workflows are preset configurations stored in the workflows/ directory. They combine settings for common tasks.
# Load a workflow by name
python cli.py --workflow cleanup --input book.epub
# Workflow can specify the command
python cli.py --workflow annotate_scholarly --input book.epub
Create a YAML file in workflows/ (e.g., workflows/cleanup.yaml):
# workflows/cleanup.yaml
command: edit
# Settings for the edit command
edit:
  dry_run: false
# Can override config settings
model: claude-sonnet-4-5-20250929
rate_limit_delay: 0.2
cleanup.yaml - OCR and formatting fixes:
command: edit
edit:
  config: configs/cleanup_config.yaml
  prompts: prompts/cleanup_prompts.yaml
annotate_scholarly.yaml - Academic commentary:
command: annotate
annotate:
  style: scholarly,thematic,cross_reference
  format: endnotes
  frequency: "3-5 per chapter"
Main configuration for the editing engine:
# LLM Settings
model: claude-sonnet-4-5-20250929
provider: anthropic
rate_limit_delay: 0.1
# Token Limits
max_tokens_analysis: 10 # For FILTER/PASS decision
max_tokens_cleaning: 8000 # For paragraph rewrites
# HTML Parsing
paragraph_selectors:
- "p.body-text"
- "p.content"
- "p"
chapter_selectors:
- "h1"
- "h2"
- ".chapter-title"
- "[class*='chapter']"
Defines how the LLM analyzes and cleans content:
# Analysis phase - quick classification
analysis:
  system: |
    You are a content analyzer. Respond with only FILTER or PASS.
  user: |
    Review this chapter for [YOUR CRITERIA].
    Respond FILTER if criteria is met, PASS if clean.
    {chapter_text}
# Cleaning phase - selective rewrites
cleaning:
  system: |
    You are an editor. Only output paragraphs that need changes.
  user: |
    Review these numbered paragraphs. For each that needs changes:
    PARAGRAPH N: [rewritten text]
    If none need changes: NONE
    {numbered_chapter}
Placeholders:
- {chapter_text} - Full chapter text (analysis phase)
- {numbered_chapter} - Paragraphs with [PARAGRAPH N] markers (cleaning phase)
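The cleaning-phase contract described above (numbered paragraphs in, `PARAGRAPH N: text` or `NONE` out) can be sketched in a few lines of Python. This is an illustration only, not Booksmith's actual code: the helper names `number_paragraphs` and `parse_cleaning_response` are hypothetical, and the exact layout of the `[PARAGRAPH N]` markers (marker, one space, then the text) is an assumption.

```python
import re


def number_paragraphs(paragraphs):
    """Build a {numbered_chapter}-style string with 1-based [PARAGRAPH N] markers.

    The marker layout is an assumption; the README only says that paragraphs
    carry [PARAGRAPH N] markers.
    """
    return "\n\n".join(
        f"[PARAGRAPH {i}] {text}" for i, text in enumerate(paragraphs, start=1)
    )


# Matches the start of each rewritten paragraph in the model's reply.
_PARAGRAPH_RE = re.compile(r"^PARAGRAPH\s+(\d+):\s*", re.MULTILINE)


def parse_cleaning_response(response):
    """Return {paragraph_number: rewritten_text}; an empty dict means NONE."""
    text = response.strip()
    if text.upper() == "NONE":
        return {}
    matches = list(_PARAGRAPH_RE.finditer(text))
    rewrites = {}
    for idx, match in enumerate(matches):
        # Each rewrite runs until the next PARAGRAPH marker or the end of the reply.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        rewrites[int(match.group(1))] = text[match.end():end].strip()
    return rewrites
```

A reply that follows neither format parses to an empty dict here, which is one reason to make the `PARAGRAPH N:` and `NONE` conventions explicit in your own prompts.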
# Commentary styles to apply
styles:
- scholarly
- historical
- funny
# How many annotations per chapter
frequency: "2-4 per chapter"
# Topics to focus on
focus_areas:
- character motivation
- historical context
- literary devices
# Topics to avoid
avoid:
- spoilers
- plot revelations
# For personal_lens style
personal_lens: "economic analysis"
# LLM settings
model: claude-sonnet-4-5-20250929
provider: anthropic
rate_limit_delay: 0.5
# Passage selection
min_passage_length: 50
max_passage_length: 1000
Explicit style rules for precise control:
# Hard rules that must be followed
constraints:
  vocabulary:
    forbidden_words:
      - "utilize" # use "use" instead
      - "commence" # use "begin" or "start"
    required_patterns:
      - "dialogue uses em-dashes for interruptions"
  sentence_structure:
    max_length: 35 # words
    vary_length: true
  tone:
    avoid:
      - "modern slang"
      - "casual contractions in narration"
    maintain:
      - "formal but warm narrator voice"
character_voices:
  protagonist:
    speech_pattern: "short, direct sentences"
    vocabulary: "simple, action-oriented"
  mentor:
    speech_pattern: "longer, thoughtful sentences"
    vocabulary: "archaic, philosophical"
prompts.yaml:
analysis:
  user: |
    Check if this text has OCR errors (rn->m, l->1, broken words).
    Reply FILTER if errors found, PASS if clean.
    {chapter_text}
cleaning:
  user: |
    Fix OCR errors in these paragraphs. Only output fixed paragraphs:
    PARAGRAPH N: [corrected text]
    If none need fixing: NONE
    {numbered_chapter}
Run:
python cli.py edit --input scanned_book.epub --output fixed.epub
# Build profile from author's other works
python cli.py profile --input "author/*.epub" --output author.json
# Edit with style awareness and drift monitoring
python cli.py edit --input book.epub --output filtered.epub \
--profile author.json --max-drift 25
commentary.yaml:
styles: [scholarly, historical, thematic]
frequency: "4-6 per chapter"
focus_areas:
- literary techniques
- historical context
- intertextual references
avoid: [plot spoilers]
Run:
python cli.py annotate --input classic.epub --config commentary.yaml \
--output classic_annotated.epub --format endnotes
python cli.py annotate --input book.epub --output book_fun.epub \
--style funny,fun_facts,devils_advocate \
--frequency "3-4 per chapter" \
--format footnotes
# Step 1: Understand the book
python cli.py analyze --input fantasy.epub --output fantasy_model.json
# Step 2: Review the model (characters, plot, timeline)
cat fantasy_model.json | jq '.characters'
# Step 3: Plan changes
python cli.py plan --model fantasy_model.json \
--goal "Change the mentor character from wise wizard to gruff warrior" \
--output change_plan.json
# Step 4: Apply changes (manually or via edit command with modified prompts)
# Step 5: Validate consistency
python cli.py check --input fantasy_modified.epub --model fantasy_model.json
"No paragraphs found" for every chapter
- Your CSS selectors don't match the EPUB's HTML structure
- Solution: Unzip the EPUB, inspect the HTML to find correct class names
- Update paragraph_selectors in config.yaml
Rate limit errors
- Increase rate_limit_delay in config.yaml (try 0.5 or 1.0)
- Consider using a model with higher rate limits
API key not found
- Ensure the ANTHROPIC_API_KEY environment variable is set
- Check for typos in the key
- Verify the key is valid at console.anthropic.com
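The first two checks can be scripted before a long run. The sketch below is a hypothetical helper, not part of Booksmith. It only inspects the environment variable named above and cannot tell whether the key is actually valid.

```python
import os


def check_api_key(env=None):
    """Return a list of human-readable problems with ANTHROPIC_API_KEY.

    An empty list means the variable is set and looks plausible. This does
    not contact the API, so it cannot confirm that the key is valid.
    """
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY")
    if key is None:
        return ["ANTHROPIC_API_KEY is not set in this environment"]
    problems = []
    if not key.strip():
        problems.append("ANTHROPIC_API_KEY is set but empty")
    elif key != key.strip():
        # Stray spaces or newlines often come from copy-pasting the key.
        problems.append("ANTHROPIC_API_KEY has leading or trailing whitespace")
    if key.strip() == "your-api-key-here":
        problems.append("ANTHROPIC_API_KEY still holds the placeholder value")
    return problems
```

Calling `check_api_key()` with no arguments reads the current process environment. Passing a dict is mainly useful for testing.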
Style drift too high
- Your prompts may be too aggressive
- Add more specific style guidance to prompts
- Use --profile to provide author style reference
- Lower the --max-drift threshold for stricter enforcement
Changes not applied correctly
- Check the change report (*_changes_report.txt)
- Verify your cleaning prompt specifies the PARAGRAPH N: text format
- Ensure prompts handle the NONE case for clean chapters
Annotations not appearing in EPUB
- Verify the annotations JSON was generated (check for *_annotations.json)
- Ensure paragraph indices match the EPUB structure
- Check footnote CSS is properly linked
Enable verbose output to diagnose issues:
python cli.py edit --input book.epub --dry-run --verbose
This shows:
- Which chapters are analyzed
- FILTER/PASS decisions
- Paragraphs selected for rewriting
- LLM responses
- Style drift scores (if profile provided)
- Rename your EPUB to .zip and extract it
- Open an HTML file from the content folder
- Find the paragraph elements and note their classes
- Update paragraph_selectors in config.yaml
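The manual inspection steps above can also be scripted. The following is a minimal sketch using only the Python standard library. It relies on an EPUB being a ZIP archive, as the rename-to-.zip step implies, and the function name `count_paragraph_classes` is hypothetical rather than part of Booksmith.

```python
import zipfile
from collections import Counter
from html.parser import HTMLParser


class _ParagraphClassCounter(HTMLParser):
    """Counts the class names found on <p> elements in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        class_attr = dict(attrs).get("class")
        # A <p> may carry several space-separated classes, or none at all.
        names = class_attr.split() if class_attr else ["(no class)"]
        self.counts.update(names)


def count_paragraph_classes(epub_path):
    """Return a Counter of <p> class names across all HTML files in an EPUB."""
    totals = Counter()
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                parser = _ParagraphClassCounter()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                totals.update(parser.counts)
    return totals
```

For example, `count_paragraph_classes("book.epub").most_common(5)` lists the most frequent paragraph classes. These are the likely candidates for paragraph_selectors.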
Example HTML inspection:
<p class="x04-body-text">This is body text...</p>
<p class="x05-chapter-opener">Chapter opening text...</p>
Config update:
paragraph_selectors:
- ".x04-body-text"
- ".x05-chapter-opener"
The CLI uses these exit codes for scripting:
validate command:
- 0: Style preserved well (score >= 50)
- 1: Poor preservation (score 25-49)
- 2: Severe drift (score < 25)
check command:
- 0: No issues found
- 1: Some issues found
- 2: High severity issues
- 3: Critical issues
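As a sketch of how these exit codes might be consumed from a script, the wrapper below runs a command and maps its return code to the meanings listed above. The helper name `run_and_describe` is hypothetical. The two tables simply restate the codes documented here, and any other code is reported as unexpected.

```python
import subprocess
import sys

# Meanings copied from the exit-code lists above.
VALIDATE_EXIT_MEANINGS = {
    0: "style preserved well (score >= 50)",
    1: "poor preservation (score 25-49)",
    2: "severe drift (score < 25)",
}
CHECK_EXIT_MEANINGS = {
    0: "no issues found",
    1: "some issues found",
    2: "high severity issues",
    3: "critical issues",
}


def run_and_describe(command, meanings):
    """Run a command (list of arguments) and return (exit_code, description)."""
    result = subprocess.run(command)
    description = meanings.get(result.returncode, "unexpected exit code")
    return result.returncode, description
```

For example, `run_and_describe([sys.executable, "cli.py", "check", "--input", "modified.epub"], CHECK_EXIT_MEANINGS)` returns the code together with its description, so a build script can decide whether to stop.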
MIT License
Copyright (c) 2024
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.