Add Docling document processing skill #203
Universal document conversion skill using Docling library. Converts PDF, DOCX, PPTX, XLSX, HTML, images, and audio to Markdown/HTML/JSON.

Features:
- Document conversion with layout-aware parsing
- OCR support (Tesseract, EasyOCR, RapidOCR, OcrMac)
- Vision Language Model pipeline (GraniteDocling)
- Table extraction to CSV/Excel
- RAG framework integrations (LangChain, LlamaIndex, Haystack)
- Batch processing with parallel execution

Includes 5 helper scripts and 4 reference guides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Pull request overview
This PR adds a comprehensive skill for universal document conversion using the Docling library, enabling processing of PDFs, Office documents, images, and audio files into structured formats like Markdown, HTML, and JSON.
Key changes:
- Adds core documentation with OCR, VLM, and RAG integration guides
- Provides 5 Python helper scripts for common document processing tasks
- Includes comprehensive examples for batch processing, table extraction, and RAG preparation
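For orientation, the basic conversion flow the overview describes can be sketched as follows. This is a minimal sketch rather than code from the PR: the helper name `convert_to_markdown` and the file paths are hypothetical, and the Docling import is deferred into the function so the module loads even where Docling is not installed.

```python
from pathlib import Path


def convert_to_markdown(source: str, output_path: str) -> Path:
    """Convert one document with Docling and write the result as Markdown.

    Hypothetical helper for illustration; not one of the PR's scripts.
    """
    # Deferred import: Docling is an optional, fairly heavy dependency.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)

    # The converted document can be exported to Markdown text.
    markdown = result.document.export_to_markdown()

    out = Path(output_path)
    out.write_text(markdown, encoding="utf-8")
    return out
```

A call such as `convert_to_markdown("document.pdf", "document.md")` would then write the Markdown export next to the source, assuming Docling and its models are available locally.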
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| skills/docling/SKILL.md | Main skill documentation with quick start guide, workflows, and CLI reference |
| skills/docling/scripts/prepare_rag_chunks.py | Script to chunk documents for RAG ingestion with configurable overlap |
| skills/docling/scripts/extract_tables.py | Script to extract tables from documents to CSV or Excel format |
| skills/docling/scripts/convert_document.py | General-purpose document conversion script with OCR and VLM support |
| skills/docling/scripts/check_ocr_engines.py | Utility to check available OCR engine installations |
| skills/docling/scripts/batch_convert.py | Parallel batch document conversion with progress tracking |
| skills/docling/references/vlm-pipelines.md | Guide for Vision Language Model pipeline configuration and usage |
| skills/docling/references/rag-integrations.md | Integration examples for LangChain, LlamaIndex, and Haystack frameworks |
| skills/docling/references/ocr-configuration.md | OCR engine setup and configuration guide |
| skills/docling/references/advanced-options.md | Advanced features including document model, export options, and chunking strategies |
| skills/docling/LICENSE.txt | MIT license for the skill |
    from docling.document_converter import DocumentConverter, ConversionError

    converter = DocumentConverter()

    try:
        result = converter.convert("document.pdf")
    except ConversionError as e:
Copilot
AI
Jan 3, 2026
Missing import for ConversionError. Line 327 catches ConversionError but this exception class is not imported on line 321. The import statement should include ConversionError from the appropriate module.
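A minimal sketch of the pattern this comment asks for, with the exception class imported explicitly before it is caught. It assumes `ConversionError` is importable from `docling.exceptions`; that location should be checked against the installed Docling version. The helper name `convert_or_none` is hypothetical.

```python
def convert_or_none(source: str):
    """Return the conversion result, or None if Docling reports a failure."""
    # Deferred imports so this module loads without Docling installed.
    from docling.document_converter import DocumentConverter
    # Assumption: ConversionError lives in docling.exceptions.
    from docling.exceptions import ConversionError

    converter = DocumentConverter()
    try:
        return converter.convert(source)
    except ConversionError as exc:
        # The name is bound by the import above, so this handler
        # cannot itself fail with a NameError.
        print(f"Conversion failed for {source}: {exc}")
        return None
```

Without the import, a failed conversion would surface as a `NameError` raised while Python evaluates the `except` clause, which hides the original error.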
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: {"pipeline_options": pipeline_options}
        }
    )
Copilot
AI
Jan 3, 2026
Missing import for PdfFormatOption. The code uses PdfFormatOption on line 172 but only imports it conditionally inside the function. The format_options parameter in line 172 expects a PdfFormatOption instance but the import is not available at that scope.
        format_options={
            InputFormat.PDF: {"pipeline_options": pipeline_options}
        }
    )
Copilot
AI
Jan 3, 2026
The format_options syntax is inconsistent with the rest of the codebase. On line 173, format_options is set to a dictionary with a string value for pipeline_options, but in other examples throughout the file (e.g., lines 210-213), it correctly uses PdfFormatOption objects. This will likely cause a runtime error.
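The form the reviewer refers to wraps the pipeline options in a `PdfFormatOption` object instead of a plain dict. The sketch below shows that shape under the assumption that OCR is being enabled through `PdfPipelineOptions`; the helper name `build_ocr_converter` is hypothetical, and the imports are deferred so the snippet loads without Docling installed.

```python
def build_ocr_converter():
    """Build a DocumentConverter whose PDF pipeline has OCR enabled."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # illustrative setting, not from the PR

    # Map the input format to a PdfFormatOption instance, not a raw dict.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```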
    vlm_options = VlmPipelineOptions(
        model_name="granite_docling",
        batch_size=1, # Process one page at a time
        offload_to_cpu=True, # Offload model weights to CPU when idle
Copilot
AI
Jan 3, 2026
Incorrect comment on line 260. The comment says "Offload model weights to CPU when idle" but the parameter name is offload_to_cpu, which in most ML frameworks typically means offloading weights from GPU to CPU to save VRAM, not when idle. The comment is misleading about when the offloading occurs.
Suggested change:

    - offload_to_cpu=True, # Offload model weights to CPU when idle
    + offload_to_cpu=True, # Offload model weights from GPU to CPU to save VRAM
    from haystack import Pipeline
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack_integrations.components.converters.docling import DoclingConverter
Copilot
AI
Jan 3, 2026
The DocumentWriter import is missing on line 223. The code uses DocumentWriter on line 233 but it's not included in the imports at the top of the example.
    if start <= chunks[-1]["start_char"] if chunks else 0:
        start = end # Prevent infinite loop
Copilot
AI
Jan 3, 2026
The condition on line 49 will raise an IndexError when the chunks list is empty. When chunks is empty, chunks[-1] will fail. This should check if chunks is not empty before accessing chunks[-1].
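One note on how Python reads the flagged line: a conditional expression binds more loosely than a comparison, so the condition parses as `(start <= chunks[-1]["start_char"]) if chunks else 0` and evaluates to `0` for an empty list instead of indexing into it. The intent is still easy to misread, so an explicit emptiness check is clearer. The following is a minimal sketch of overlap chunking with that guard; the function name `chunk_text` and its parameters are hypothetical and not taken from `prepare_rag_chunks.py`.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into fixed-size character chunks that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start_char": start, "end_char": end})
        if end == len(text):
            break  # last chunk reached

        next_start = end - overlap
        # Explicit guard: check that chunks is non-empty before indexing,
        # and force forward progress if the overlap would move us backwards.
        if chunks and next_start <= chunks[-1]["start_char"]:
            next_start = end
        start = next_start
    return chunks


# Precedence check on the original form: no IndexError for an empty list.
empty_chunks: list[dict] = []
flag = 0 <= empty_chunks[-1]["start_char"] if empty_chunks else 0

pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print([p["text"] for p in pieces])  # prints ['abcd', 'defg', 'ghij']
```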
    indexing.add_component("embedder", SentenceTransformersDocumentEmbedder())
    indexing.add_component("writer", DocumentWriter(document_store=document_store))
Copilot
AI
Jan 3, 2026
Missing import for DocumentWriter. Line 233 uses DocumentWriter but it's not imported. The import statement should be added alongside the other imports from haystack.components.writers.
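Putting the Haystack imports and the two `add_component` calls together, a complete indexing pipeline might look like the sketch below. It is assembled from the names quoted in the review plus assumptions: the component name `"converter"`, the default-constructed `DoclingConverter` and embedder, and the `connect` calls are not shown in the quoted diff, and the helper name `build_indexing_pipeline` is hypothetical. Imports are deferred so the snippet loads without the optional packages.

```python
def build_indexing_pipeline():
    """Assemble a converter -> embedder -> writer indexing pipeline."""
    from haystack import Pipeline
    from haystack.components.embedders import SentenceTransformersDocumentEmbedder
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from haystack_integrations.components.converters.docling import DoclingConverter

    document_store = InMemoryDocumentStore()

    indexing = Pipeline()
    indexing.add_component("converter", DoclingConverter())
    indexing.add_component("embedder", SentenceTransformersDocumentEmbedder())
    indexing.add_component("writer", DocumentWriter(document_store=document_store))

    # Each stage passes its documents on to the next one.
    indexing.connect("converter", "embedder")
    indexing.connect("embedder", "writer")
    return indexing, document_store
```

Running it would presumably take the form `indexing.run({"converter": {"paths": ["document.pdf"]}})`, assuming the converter accepts a `paths` input as in the docling-haystack integration.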
    success = sum(1 for _, ok in results if ok)
    print(f"Converted {success}/{len(files)} files")
Copilot
AI
Jan 3, 2026
Unclear variable name. The variable name 'ok' on line 314 is too terse and doesn't clearly convey that it represents conversion success status. Consider renaming to 'success' for better readability and consistency with line 118.
Suggested change:

    - success = sum(1 for _, ok in results if ok)
    - print(f"Converted {success}/{len(files)} files")
    + success_count = sum(1 for _, success in results if success)
    + print(f"Converted {success_count}/{len(files)} files")
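To show where the `(file, success)` pairs in `results` could come from, here is a small self-contained sketch of parallel batch conversion with the clearer names. It uses only the standard library; `fake_convert` is a hypothetical stand-in for a real Docling call, and `convert_all` is not a function from the PR.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    files: Iterable[Path],
    convert_one: Callable[[Path], None],
    max_workers: int = 4,
) -> list[tuple[Path, bool]]:
    """Run convert_one on every file in parallel; return (path, success) pairs."""

    def attempt(path: Path) -> tuple[Path, bool]:
        try:
            convert_one(path)
            return path, True
        except Exception:
            # One failing file should not abort the whole batch.
            return path, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(attempt, files))


def fake_convert(path: Path) -> None:
    """Hypothetical converter: accepts PDFs, rejects everything else."""
    if path.suffix != ".pdf":
        raise ValueError(f"unsupported format: {path.suffix}")


files = [Path("a.pdf"), Path("b.txt"), Path("c.pdf")]
results = convert_all(files, fake_convert)
success_count = sum(1 for _, success in results if success)
print(f"Converted {success_count}/{len(files)} files")  # prints "Converted 2/3 files"
```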
    import json
    import sys
    from pathlib import Path
    from typing import Optional
Copilot
AI
Jan 3, 2026
Import of 'Optional' is not used.
Suggested change:

    - from typing import Optional
- Fix potential IndexError in prepare_rag_chunks.py when chunks list is empty
- Remove unused Optional import
- Fix SKILL.md OCR example to use PdfFormatOption instead of dict
- Rename unclear variable 'ok' to 'converted' in advanced-options.md
- Fix misleading offload_to_cpu comment in vlm-pipelines.md
- Add missing DocumentWriter import in rag-integrations.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
Adds a new skill for universal document conversion using the Docling library.
Capabilities:
Structure:
- SKILL.md - Main skill file (~380 lines)
- references/ - 4 detailed guides (OCR, VLM, RAG, advanced options)
- scripts/ - 5 Python helper scripts

Test plan
🤖 Generated with Claude Code