## Overview
3DCF/doc2dataset is a Rust-based document-to-dataset pipeline for LLMs and RAG systems.
It ingests PDFs, Markdown, HTML, CSV, and other text-like formats into a normalized index (`documents.jsonl`, `pages.jsonl`, `cells.jsonl`), builds token-efficient macro-cells, applies numeric integrity checks (NumGuard), and exports QA/Summary/RAG datasets for HuggingFace, LLaMA-Factory, Axolotl, OpenAI, and custom RAG stacks.
This release (v0.1.0) contains the first public version of the full pipeline, the Rust core workspace, and a matching evaluation bundle.
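To make the three-file index concrete, here is a minimal sketch of how its pieces relate. Only `kind`, `bbox`, and `importance` are field names taken from this document; every other field name and value below is illustrative, not the actual schema:

```python
import json

# Illustrative records for the three-file JSONL index.
# NOTE: only kind/bbox/importance come from the docs; all other
# field names (doc_id, page_id, ...) are hypothetical.
documents = [{"doc_id": "d1", "source": "report.pdf", "pages": 1}]
pages = [{"doc_id": "d1", "page_id": "d1-p1", "number": 1}]
cells = [
    {"page_id": "d1-p1", "kind": "paragraph", "bbox": [72, 90, 540, 140],
     "importance": 0.8, "text": "Revenue grew 12% year over year."},
    {"page_id": "d1-p1", "kind": "table", "bbox": [72, 160, 540, 320],
     "importance": 0.9, "text": "Q1 | 1.2M\nQ2 | 1.4M"},
]

# Each file is JSON Lines: one JSON object per line.
def to_jsonl(records):
    return "\n".join(json.dumps(r) for r in records)

index = {
    "documents.jsonl": to_jsonl(documents),
    "pages.jsonl": to_jsonl(pages),
    "cells.jsonl": to_jsonl(cells),
}

# Cells reference pages, and pages reference documents.
cells_on_page = [c for c in cells if c["page_id"] == "d1-p1"]
print(len(cells_on_page))  # 2
```

The point of the layered layout is that exporters and RAG stores can stream `cells.jsonl` line by line while still being able to join back to page and document metadata.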
## What's in v0.1.0 (Rust workspace)

### Crates
- `three_dcf_core` – core library for encoding documents into macro-cells and a JSONL index (documents/pages/cells) with NumGuard numeric hashes.
- `three_dcf_cli` – CLI for encoding, stats, and benchmarks (`3dcf encode`, `3dcf stats`, `3dcf bench`, `3dcf report`).
- `doc2dataset` – YAML-driven doc→dataset pipeline (ingest, QA/Summary/RAG sample generation, multi-framework exports).
- `three_dcf_service` – Axum-based HTTP service with `/encode` and `/rag/query` endpoints and a bundled UI.
- `three_dcf_rag` – RAG store, embedding, and query-execution helpers.
- `three_dcf_llm` – LLM client abstraction (OpenAI, Anthropic, Gemini, DeepSeek).
- `three_dcf_index` – JSONL index helpers.
- `three_dcf_py` / `three_dcf_node` – Python and Node.js bindings for the Rust core.
### Features

- Encoder presets for different document types (`reports`, `news`, `slides`, …).
- Macro-cell index in three JSONL files: `documents.jsonl`, `pages.jsonl`, and `cells.jsonl`, with `kind`/`bbox`/`importance` and NumGuard metadata.
- NumGuard numeric integrity: per-cell hashes of numeric content, used to detect drift across the pipeline.
- Exports for:
  - HuggingFace (text/chat),
  - LLaMA-Factory (Alpaca/ShareGPT),
  - Axolotl (text/chat),
  - OpenAI `messages` JSONL,
  - a generic RAG JSONL format.
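The idea behind NumGuard's per-cell numeric hashes can be sketched as follows. This is a hedged illustration of the technique, not the crate's actual algorithm; the function name and canonicalization rules are assumptions:

```python
import hashlib
import re

def numeric_hash(text: str) -> str:
    """Hash the numeric content of a cell (illustrative NumGuard-style check).

    Numbers are extracted in order and canonicalized, so formatting-only
    rewording keeps the hash stable, while any change to a number
    (numeric drift) changes the hash.
    """
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    canonical = "|".join(str(float(n)) for n in nums)
    return hashlib.sha256(canonical.encode()).hexdigest()

before = "Revenue: 1,234.50 USD (up 12%)"
after_ok = "Revenue was 1234.5 USD, up 12 percent"  # same numbers, reworded
after_bad = "Revenue: 1,234.50 USD (up 21%)"        # 12 -> 21: drift

assert numeric_hash(before) == numeric_hash(after_ok)
assert numeric_hash(before) != numeric_hash(after_bad)
```

Comparing the stored hash against a recomputed one at each pipeline stage is what makes drift detectable without re-reading the source document.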
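As one concrete export target, a QA sample can be serialized into OpenAI `messages` JSONL. The sample fields and prompt wording below are hypothetical, but the `messages` list of role/content turns is the standard OpenAI chat fine-tuning record shape:

```python
import json

# Hypothetical QA sample, as the pipeline might generate it.
sample = {
    "question": "What was Q2 revenue?",
    "answer": "Q2 revenue was 1.4M.",
    "context": "Q1 | 1.2M\nQ2 | 1.4M",
}

# One JSON object per line, each with a `messages` list of chat turns.
record = {
    "messages": [
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": f"{sample['context']}\n\n{sample['question']}"},
        {"role": "assistant", "content": sample["answer"]},
    ]
}
line = json.dumps(record, ensure_ascii=False)
print(line)
```

The other export formats (Alpaca/ShareGPT, Axolotl text/chat, generic RAG JSONL) differ only in how the same sample fields are mapped into each framework's record layout.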
## License
- All code is released under Apache-2.0.
## Evaluation bundle (3DCF Eval Data)
This release also ships the evaluation bundle matching the code at tag `v0.1.0`.
Evaluation bundle summary:
- Multiple corpora (policies, financial reports, technical docs, synthetic fixtures).
- Saved macro-cell indexes, outputs, and metrics (`eval/out/`, `eval/results/`).
- Reproducible scripts described in `eval/README.md`.
To reproduce the evaluation:
1. Clone the repo at tag `v0.1.0`:

   ```shell
   git clone https://github.com/3DCF-Labs/doc2dataset.git
   cd doc2dataset
   git checkout v0.1.0
   ```

2. Download `3dcf-eval-v0.1.tar.gz` from the assets below.

3. Unpack it at the repo root:

   ```shell
   tar -xzf 3dcf-eval-v0.1.tar.gz
   ```

   This restores the entire `eval/` tree (corpora, `out/`, `results/`).

4. See `eval/README.md` for reproduction steps and detailed metrics.