Skip to content

3DCF/doc2dataset v0.1.0

Latest

Choose a tag to compare

@yevh yevh released this 27 Nov 20:28
· 10 commits to main since this release
0ab4036

Overview

3DCF/doc2dataset is a Rust-based document-to-dataset pipeline for LLMs and RAG systems.

It ingests PDFs, Markdown, HTML, CSV, and other text-like formats into a normalized index (documents.jsonl, pages.jsonl, cells.jsonl), builds token-efficient macro-cells, applies numeric integrity checks (NumGuard), and exports QA/Summary/RAG datasets for HuggingFace, LLaMA-Factory, Axolotl, OpenAI, and custom RAG stacks.

This release (v0.1.0) contains the first public version of the full pipeline, the Rust core workspace, and a matching evaluation bundle.

Research paper (.pdf)


What's in v0.1.0 (Rust workspace)

Crates

  • three_dcf_core – core library for encoding documents into macro-cells and a JSONL index (documents/pages/cells) with NumGuard numeric hashes.
  • three_dcf_cli – CLI for encoding, stats, and benchmarks (3dcf encode, 3dcf stats, 3dcf bench, 3dcf report).
  • doc2dataset – YAML-driven doc→dataset pipeline (ingest, QA/Summary/RAG sample generation, multi-framework exports).
  • three_dcf_service – Axum-based HTTP service with /encode and /rag/query endpoints and a bundled UI.
  • three_dcf_rag – RAG store, embedding, and query execution helpers.
  • three_dcf_llm – LLM client abstraction (OpenAI, Anthropic, Gemini, Deepseek).
  • three_dcf_index – JSONL index helpers.
  • three_dcf_py / three_dcf_node – Python and Node.js bindings for the Rust core.

Features

  • Encoder presets for different document types (reports, news, slides, …).
  • Macro-cell index in three JSONL files: documents.jsonl, pages.jsonl, cells.jsonl with kind / bbox / importance and NumGuard metadata.
  • NumGuard numeric integrity: per-cell hashes for numeric content, used to detect drift across the pipeline.
  • Exports for:
    • HuggingFace (text/chat),
    • LLaMA-Factory (Alpaca/ShareGPT),
    • Axolotl (text/chat),
    • OpenAI messages JSONL,
    • a generic RAG JSONL format.

License

  • All code is released under Apache-2.0.

Evaluation bundle (3DCF Eval Data)

This release also ships the evaluation bundle that matches code tag v0.1.0.

Evaluation bundle summary:

  • Multiple corpora (policies, financial reports, technical docs, synthetic fixtures).
  • Saved macro-cell indexes, outputs, and metrics (eval/out/, eval/results/).
  • Reproducible scripts described in eval/README.md.

To reproduce the evaluation:

  1. Clone the repo at tag v0.1.0:

    git clone https://github.com/3DCF-Labs/doc2dataset.git
    cd doc2dataset
    git checkout v0.1.0
  2. Download 3dcf-eval-v0.1.tar.gz from the assets below.

  3. Unpack it at the repo root:

    tar -xzf 3dcf-eval-v0.1.tar.gz

    This restores the entire eval/ tree (corpora, out/, results/).

  4. See eval/README.md for reproduction steps and detailed metrics.