Add Docling document processing skill #203

ethanolivertroy · 2026-01-03T20:03:01Z

Summary

Adds a new skill for universal document conversion using the Docling library.

Capabilities:

Converts PDF, DOCX, PPTX, XLSX, HTML, images, audio to Markdown/HTML/JSON
OCR support (Tesseract, EasyOCR, RapidOCR, OcrMac)
Vision Language Model pipeline (GraniteDocling) for complex layouts
Table extraction to CSV/Excel
RAG framework integrations (LangChain, LlamaIndex, Haystack)
Batch processing with parallel execution
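For orientation, the basic conversion path behind these capabilities can be sketched as follows. This is a minimal sketch, not code from the PR: it assumes a recent Docling release that exposes `DocumentConverter` in `docling.document_converter`, and the helper name `convert_to_markdown` is hypothetical. The import sits inside the function only so the snippet loads on machines without Docling installed.

```python
def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) to Markdown using Docling defaults.

    convert_to_markdown is a hypothetical helper name, not part of the PR.
    """
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    # The converted document can be serialized to Markdown.
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("document.pdf")` would return the Markdown text, provided Docling and its models are installed.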

Structure:

SKILL.md - Main skill file (~380 lines)
references/ - 4 detailed guides (OCR, VLM, RAG, advanced options)
scripts/ - 5 Python helper scripts

Test plan

Verify SKILL.md frontmatter triggers correctly
Test scripts with sample documents
Validate reference file links work

🤖 Generated with Claude Code

Universal document conversion skill using Docling library. Converts PDF, DOCX, PPTX, XLSX, HTML, images, and audio to Markdown/HTML/JSON.

Features:
- Document conversion with layout-aware parsing
- OCR support (Tesseract, EasyOCR, RapidOCR, OcrMac)
- Vision Language Model pipeline (GraniteDocling)
- Table extraction to CSV/Excel
- RAG framework integrations (LangChain, LlamaIndex, Haystack)
- Batch processing with parallel execution

Includes 5 helper scripts and 4 reference guides.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Copilot

Pull request overview

This PR adds a comprehensive skill for universal document conversion using the Docling library, enabling processing of PDFs, Office documents, images, and audio files into structured formats like Markdown, HTML, and JSON.

Key changes:

Adds core documentation with OCR, VLM, and RAG integration guides
Provides 5 Python helper scripts for common document processing tasks
Includes comprehensive examples for batch processing, table extraction, and RAG preparation

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

File	Description
skills/docling/SKILL.md	Main skill documentation with quick start guide, workflows, and CLI reference
skills/docling/scripts/prepare_rag_chunks.py	Script to chunk documents for RAG ingestion with configurable overlap
skills/docling/scripts/extract_tables.py	Script to extract tables from documents to CSV or Excel format
skills/docling/scripts/convert_document.py	General-purpose document conversion script with OCR and VLM support
skills/docling/scripts/check_ocr_engines.py	Utility to check available OCR engine installations
skills/docling/scripts/batch_convert.py	Parallel batch document conversion with progress tracking
skills/docling/references/vlm-pipelines.md	Guide for Vision Language Model pipeline configuration and usage
skills/docling/references/rag-integrations.md	Integration examples for LangChain, LlamaIndex, and Haystack frameworks
skills/docling/references/ocr-configuration.md	OCR engine setup and configuration guide
skills/docling/references/advanced-options.md	Advanced features including document model, export options, and chunking strategies
skills/docling/LICENSE.txt	MIT license for the skill

Copilot · 2026-01-03T20:06:09Z

skills/docling/references/advanced-options.md

+from docling.document_converter import DocumentConverter, ConversionError
+
+converter = DocumentConverter()
+
+try:
+    result = converter.convert("document.pdf")
+except ConversionError as e:


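Check the import for ConversionError. Line 327 catches ConversionError, so the import on line 321 must bring that class into scope. The hunk above imports it from docling.document_converter; confirm that this module exposes the class in the installed Docling version, or import it from the module that defines it.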

Copilot · 2026-01-03T20:06:10Z

skills/docling/SKILL.md

+converter = DocumentConverter(
+    format_options={
+        InputFormat.PDF: {"pipeline_options": pipeline_options}
+    }
+)


Missing import for PdfFormatOption. The code uses PdfFormatOption on line 172 but only imports it conditionally inside the function. The format_options parameter in line 172 expects a PdfFormatOption instance but the import is not available at that scope.

Copilot · 2026-01-03T20:06:10Z

skills/docling/SKILL.md

+    format_options={
+        InputFormat.PDF: {"pipeline_options": pipeline_options}
+    }
+)


The format_options syntax is inconsistent with the rest of the file. On line 173, the value for InputFormat.PDF is a plain dictionary keyed by the string "pipeline_options", whereas other examples in the file (e.g., lines 210-213) wrap the options in a PdfFormatOption object. This will likely cause a runtime error.
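A minimal sketch of the PdfFormatOption form that this comment refers to. It assumes a recent Docling release in which `PdfFormatOption` is exported from `docling.document_converter` and `PdfPipelineOptions` from `docling.datamodel.pipeline_options`; the helper name `build_ocr_converter` is hypothetical, and the imports sit inside the function only so the snippet loads without Docling installed.

```python
def build_ocr_converter(do_ocr: bool = True):
    """Return a DocumentConverter whose PDF pipeline is set via PdfFormatOption.

    build_ocr_converter is a hypothetical helper name, not part of the PR.
    """
    # Imported lazily so this module can be loaded without Docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr

    # Wrap the options in PdfFormatOption instead of passing a plain dict.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

The only difference from the flagged hunk is the value stored under `InputFormat.PDF`: an option object rather than a dictionary.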

Copilot · 2026-01-03T20:06:10Z

skills/docling/references/vlm-pipelines.md

+vlm_options = VlmPipelineOptions(
+    model_name="granite_docling",
+    batch_size=1,           # Process one page at a time
+    offload_to_cpu=True,    # Offload model weights to CPU when idle


Misleading comment on line 260. It says "Offload model weights to CPU when idle", but in most ML frameworks a parameter named offload_to_cpu means moving weights from GPU to CPU to save VRAM, which says nothing about idleness. The comment misstates when the offloading occurs.

Suggested change

-    offload_to_cpu=True,    # Offload model weights to CPU when idle
+    offload_to_cpu=True,    # Offload model weights from GPU to CPU to save VRAM

Copilot · 2026-01-03T20:06:11Z

skills/docling/references/rag-integrations.md

+from haystack import Pipeline
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_integrations.components.converters.docling import DoclingConverter


The DocumentWriter import is missing on line 223. The code uses DocumentWriter on line 233 but it's not included in the imports at the top of the example.

Copilot · 2026-01-03T20:06:11Z

skills/docling/scripts/prepare_rag_chunks.py

+        if start <= chunks[-1]["start_char"] if chunks else 0:
+            start = end  # Prevent infinite loop


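The condition on line 49 is easy to misread. Python parses it as (start <= chunks[-1]["start_char"]) if chunks else 0, so when chunks is empty the comparison is skipped and the expression evaluates to 0; no IndexError is raised, but the intent is far from obvious. An explicit check such as `if chunks and start <= chunks[-1]["start_char"]:` states the emptiness test first and is clearer.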
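A minimal sketch of overlap chunking with an explicit guard, assuming character-based windows and the `start_char` key seen in the hunk above. The function name `chunk_text`, its parameters, and the `end_char` key are hypothetical and do not reproduce the PR's script. The last lines show how Python parses the original one-line condition: as a conditional expression, which yields 0 for an empty list instead of indexing into it.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")

    chunks: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"text": text[start:end], "start_char": start, "end_char": end})
        if end == len(text):
            break
        next_start = end - overlap
        # Emptiness is tested first, then progress: if the overlap is as large
        # as the chunk, stepping back would not advance, so jump to `end`.
        if chunks and next_start <= chunks[-1]["start_char"]:
            next_start = end  # Prevent infinite loop
        start = next_start
    return chunks


# The original condition is a conditional expression:
#   (start <= chunks[-1]["start_char"]) if chunks else 0
# With an empty list the comparison is never evaluated.
empty: list[dict] = []
original_form = (5 <= empty[-1]["start_char"]) if empty else 0
```

With `chunk_size=4` and `overlap=1`, each window starts three characters after the previous one; with `overlap=4` the guard takes over and the windows become contiguous.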

Copilot · 2026-01-03T20:06:11Z

skills/docling/references/rag-integrations.md

+indexing.add_component("embedder", SentenceTransformersDocumentEmbedder())
+indexing.add_component("writer", DocumentWriter(document_store=document_store))


Missing import for DocumentWriter. Line 233 uses DocumentWriter but it's not imported. The import statement should be added alongside the other imports from haystack.components.writers.

Copilot · 2026-01-03T20:06:12Z

skills/docling/references/advanced-options.md

+success = sum(1 for _, ok in results if ok)
+print(f"Converted {success}/{len(files)} files")


Unclear variable name. The variable name 'ok' on line 314 is too terse and doesn't clearly convey that it represents conversion success status. Consider renaming to 'success' for better readability and consistency with line 118.

Suggested change

-success = sum(1 for _, ok in results if ok)
-print(f"Converted {success}/{len(files)} files")
+success_count = sum(1 for _, success in results if success)
+print(f"Converted {success_count}/{len(files)} files")
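To show the counting line in context, here is a minimal sketch of a parallel batch run that produces the `(path, status)` pairs being summed. It uses only the standard library; `convert_one` is a hypothetical stand-in for a real conversion call, and `batch_convert` is not the PR's `batch_convert.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_one(path: Path) -> tuple[Path, bool]:
    """Hypothetical stand-in for a real conversion; reports success for .pdf only."""
    return path, path.suffix.lower() == ".pdf"


def batch_convert(files: list[Path], max_workers: int = 4) -> list[tuple[Path, bool]]:
    """Convert files in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_one, files))


files = [Path("a.pdf"), Path("b.txt"), Path("c.PDF")]
results = batch_convert(files)

# Descriptive names make it clear that the second tuple element is a status flag.
success_count = sum(1 for _, converted in results if converted)
summary = f"Converted {success_count}/{len(files)} files"
print(summary)
```

`pool.map` preserves input order, so each result can be matched back to its file without extra bookkeeping.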

Copilot · 2026-01-03T20:06:12Z

skills/docling/scripts/prepare_rag_chunks.py

+import json
+import sys
+from pathlib import Path
+from typing import Optional


Import of 'Optional' is not used.

Suggested change

-from typing import Optional

- Fix potential IndexError in prepare_rag_chunks.py when chunks list is empty
- Remove unused Optional import
- Fix SKILL.md OCR example to use PdfFormatOption instead of dict
- Rename unclear variable 'ok' to 'converted' in advanced-options.md
- Fix misleading offload_to_cpu comment in vlm-pipelines.md
- Add missing DocumentWriter import in rag-integrations.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Copilot AI review requested due to automatic review settings January 3, 2026 20:03

Copilot started reviewing on behalf of ethanolivertroy January 3, 2026 20:03
