@ethanolivertroy

Summary

Adds a new skill for universal document conversion using the Docling library.

Capabilities:

  • Converts PDF, DOCX, PPTX, XLSX, HTML, images, audio to Markdown/HTML/JSON
  • OCR support (Tesseract, EasyOCR, RapidOCR, OcrMac)
  • Vision Language Model pipeline (GraniteDocling) for complex layouts
  • Table extraction to CSV/Excel
  • RAG framework integrations (LangChain, LlamaIndex, Haystack)
  • Batch processing with parallel execution

Structure:

  • SKILL.md - Main skill file (~380 lines)
  • references/ - 4 detailed guides (OCR, VLM, RAG, advanced options)
  • scripts/ - 5 Python helper scripts

Test plan

  • Verify SKILL.md frontmatter triggers correctly
  • Test scripts with sample documents
  • Validate reference file links work

🤖 Generated with Claude Code

Universal document conversion skill using the Docling library. Converts PDF, DOCX,
PPTX, XLSX, HTML, images, and audio to Markdown/HTML/JSON.

Features:
- Document conversion with layout-aware parsing
- OCR support (Tesseract, EasyOCR, RapidOCR, OcrMac)
- Vision Language Model pipeline (GraniteDocling)
- Table extraction to CSV/Excel
- RAG framework integrations (LangChain, LlamaIndex, Haystack)
- Batch processing with parallel execution

Includes 5 helper scripts and 4 reference guides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copilot AI review requested due to automatic review settings January 3, 2026 20:03

Copilot AI left a comment

Pull request overview

This PR adds a comprehensive skill for universal document conversion using the Docling library, enabling processing of PDFs, Office documents, images, and audio files into structured formats like Markdown, HTML, and JSON.

Key changes:

  • Adds core documentation with OCR, VLM, and RAG integration guides
  • Provides 5 Python helper scripts for common document processing tasks
  • Includes comprehensive examples for batch processing, table extraction, and RAG preparation

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| skills/docling/SKILL.md | Main skill documentation with quick start guide, workflows, and CLI reference |
| skills/docling/scripts/prepare_rag_chunks.py | Script to chunk documents for RAG ingestion with configurable overlap |
| skills/docling/scripts/extract_tables.py | Script to extract tables from documents to CSV or Excel format |
| skills/docling/scripts/convert_document.py | General-purpose document conversion script with OCR and VLM support |
| skills/docling/scripts/check_ocr_engines.py | Utility to check available OCR engine installations |
| skills/docling/scripts/batch_convert.py | Parallel batch document conversion with progress tracking |
| skills/docling/references/vlm-pipelines.md | Guide for Vision Language Model pipeline configuration and usage |
| skills/docling/references/rag-integrations.md | Integration examples for LangChain, LlamaIndex, and Haystack frameworks |
| skills/docling/references/ocr-configuration.md | OCR engine setup and configuration guide |
| skills/docling/references/advanced-options.md | Advanced features including document model, export options, and chunking strategies |
| skills/docling/LICENSE.txt | MIT license for the skill |

Comment on lines +321 to +327
from docling.document_converter import DocumentConverter, ConversionError

converter = DocumentConverter()

try:
    result = converter.convert("document.pdf")
except ConversionError as e:

Copilot AI Jan 3, 2026

Missing import for ConversionError. Line 327 catches ConversionError but this exception class is not imported on line 321. The import statement should include ConversionError from the appropriate module.
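
For reference, a minimal sketch of the error-handling pattern this comment is about. It assumes `ConversionError` is importable from `docling.exceptions` in the installed Docling version, which is worth checking against your environment; `convert_or_none` is a hypothetical helper name, not something from this PR.

```python
from typing import Optional


def convert_or_none(path: str) -> Optional[str]:
    """Return the document as Markdown, or None if Docling reports a failure."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.document_converter import DocumentConverter

    # Assumption: ConversionError is defined in docling.exceptions.
    from docling.exceptions import ConversionError

    converter = DocumentConverter()
    try:
        result = converter.convert(path)
    except ConversionError as exc:
        # Report the failure and let the caller decide what to do next.
        print(f"Conversion failed for {path}: {exc}")
        return None
    return result.document.export_to_markdown()
```

Importing the exception next to the converter keeps the `except` clause resolvable at the point where it is used.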

Copilot uses AI. Check for mistakes.
Comment on lines 171 to 175
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: {"pipeline_options": pipeline_options}
    }
)

Copilot AI Jan 3, 2026

Missing import for PdfFormatOption. The code uses PdfFormatOption on line 172 but only imports it conditionally inside the function. The format_options parameter in line 172 expects a PdfFormatOption instance but the import is not available at that scope.

Comment on lines 172 to 175
    format_options={
        InputFormat.PDF: {"pipeline_options": pipeline_options}
    }
)

Copilot AI Jan 3, 2026

The format_options syntax is inconsistent with the rest of the codebase. On line 173, format_options is set to a dictionary with a string value for pipeline_options, but in other examples throughout the file (e.g., lines 210-213), it correctly uses PdfFormatOption objects. This will likely cause a runtime error.
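
A sketch of the `PdfFormatOption` form the comment refers to, assuming a recent Docling release where `PdfPipelineOptions` exposes a `do_ocr` attribute. `build_ocr_converter` is a hypothetical wrapper used only for illustration.

```python
def build_ocr_converter():
    """Build a DocumentConverter whose PDF pipeline has OCR enabled."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True

    # Wrap the pipeline options in PdfFormatOption instead of a plain dict.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```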

vlm_options = VlmPipelineOptions(
    model_name="granite_docling",
    batch_size=1,  # Process one page at a time
    offload_to_cpu=True,  # Offload model weights to CPU when idle

Copilot AI Jan 3, 2026

Incorrect comment on line 260. The comment says "Offload model weights to CPU when idle" but the parameter name is offload_to_cpu, which in most ML frameworks typically means offloading weights from GPU to CPU to save VRAM, not when idle. The comment is misleading about when the offloading occurs.

Suggested change
-    offload_to_cpu=True,  # Offload model weights to CPU when idle
+    offload_to_cpu=True,  # Offload model weights from GPU to CPU to save VRAM

Comment on lines +222 to +225
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.docling import DoclingConverter

Copilot AI Jan 3, 2026

The DocumentWriter import is missing on line 223. The code uses DocumentWriter on line 233 but it's not included in the imports at the top of the example.

Comment on lines 49 to 50
if start <= chunks[-1]["start_char"] if chunks else 0:
    start = end  # Prevent infinite loop

Copilot AI Jan 3, 2026

The condition on line 49 will raise an IndexError when the chunks list is empty. When chunks is empty, chunks[-1] will fail. This should check if chunks is not empty before accessing chunks[-1].
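
One detail worth checking before relying on this diagnosis: Python gives the conditional expression lower precedence than comparisons, so the quoted condition parses as `(start <= chunks[-1]["start_char"]) if chunks else 0` and evaluates to `0` for an empty list without indexing it. The precedence is easy to misread either way, which is a fair argument for an explicit guard. Below is a minimal sketch of overlap chunking that avoids the ambiguous condition altogether; `chunk_text` and its parameters are hypothetical and are not the code in `prepare_rag_chunks.py`.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[Dict]:
    """Split text into overlapping character chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: List[Dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start_char": start, "end_char": end})
        if end == len(text):
            break
        # overlap < chunk_size guarantees the next start is strictly larger,
        # so the loop always makes progress without a separate guard.
        start = end - overlap
    return chunks


# The condition quoted in the review, evaluated against an empty list:
# the `if empty else 0` branch is taken, so empty[-1] is never evaluated.
empty: List[Dict] = []
quoted_condition = 5 <= empty[-1]["start_char"] if empty else 0
```

Validating `overlap` up front removes the need for the infinite-loop check inside the loop.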

Comment on lines +260 to +261
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder())
indexing.add_component("writer", DocumentWriter(document_store=document_store))

Copilot AI Jan 3, 2026

Missing import for DocumentWriter. Line 233 uses DocumentWriter but it's not imported. The import statement should be added alongside the other imports from haystack.components.writers.

Comment on lines 314 to 315
success = sum(1 for _, ok in results if ok)
print(f"Converted {success}/{len(files)} files")

Copilot AI Jan 3, 2026

Unclear variable name. The variable name 'ok' on line 314 is too terse and doesn't clearly convey that it represents conversion success status. Consider renaming to 'success' for better readability and consistency with line 118.

Suggested change
- success = sum(1 for _, ok in results if ok)
- print(f"Converted {success}/{len(files)} files")
+ success_count = sum(1 for _, success in results if success)
+ print(f"Converted {success_count}/{len(files)} files")
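
For context, a self-contained sketch of how such a results list might be produced and summarized. It uses only the standard library; `convert_batch`, `summarize`, and the `convert_one` callback are hypothetical names for illustration, and a real script would pass a function that calls Docling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def convert_batch(
    files: Iterable[str],
    convert_one: Callable[[str], object],
    max_workers: int = 4,
) -> List[Tuple[str, bool]]:
    """Run convert_one over files in parallel; return (path, converted) pairs."""

    def attempt(path: str) -> Tuple[str, bool]:
        try:
            convert_one(path)
            return path, True
        except Exception:
            # Any failure is recorded rather than aborting the whole batch.
            return path, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input files.
        return list(pool.map(attempt, files))


def summarize(results: List[Tuple[str, bool]]) -> str:
    """Format the success count using a descriptive loop variable."""
    success_count = sum(1 for _, converted in results if converted)
    return f"Converted {success_count}/{len(results)} files"
```

Naming the flag `converted` (or `success`) in the tuple unpacking keeps the summary line readable without a comment.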

import json
import sys
from pathlib import Path
from typing import Optional

Copilot AI Jan 3, 2026

Import of 'Optional' is not used.

Suggested change
- from typing import Optional

- Fix potential IndexError in prepare_rag_chunks.py when chunks list is empty
- Remove unused Optional import
- Fix SKILL.md OCR example to use PdfFormatOption instead of dict
- Rename unclear variable 'ok' to 'converted' in advanced-options.md
- Fix misleading offload_to_cpu comment in vlm-pipelines.md
- Add missing DocumentWriter import in rag-integrations.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>