Experiment with PDF generator #6

@hieunguyen-cyber

PDF Generator — Task Description

This issue describes only the LaTeX → PDF Generator component. The experiment pipeline uses it to turn modified/induced documents into polished PDFs for ingestion into RAG systems.


Objective

Provide a secure, reproducible pipeline that converts instrumented document text (including injected inducing paragraphs or code comments) into well-formed PDF files. PDFs simulate real-world artifacts (tutorials, READMEs, papers) that are then ingested by retrieval systems and LLM pipelines.


Core Concept

  • Render document content (markdown / LaTeX / code + comments) into PDF using a sandboxed LaTeX toolchain.
  • Produce deterministic, metadata-rich PDFs that include provenance markers (hidden or visible) to trace injection origin and version.

Input

  • induced_document or modified_document (raw text, may include injection markers or docstrings)
  • render_template (LaTeX template or markdown-to-tex mapping)
  • metadata (doc_id, injection_id, template_id, author/date, provenance tags)
  • compile_config (latex engine, resource limits, sandbox settings)

Output

  • document.pdf — compiled PDF file (stored in controlled experiment storage)
  • render_log — stdout/stderr of compilation, warnings, and timestamps
  • pdf_metadata.json: record with fields { doc_id, pdf_id, template_id, injected_markers, compile_time, checksum }
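
As an illustration of the output record, the following is a minimal sketch of how pdf_metadata.json could be assembled. The field names come from the list above; the helper name `build_pdf_metadata`, the sample ids, and the choice of an ISO 8601 string for compile_time are assumptions, not part of the issue.

```python
import hashlib
import json


def build_pdf_metadata(pdf_bytes, doc_id, pdf_id, template_id,
                       injected_markers, compile_time):
    """Assemble one pdf_metadata.json record (hypothetical helper)."""
    return {
        "doc_id": doc_id,
        "pdf_id": pdf_id,
        "template_id": template_id,
        "injected_markers": list(injected_markers),
        # Assumption: compile_time is stored as an ISO 8601 timestamp string.
        "compile_time": compile_time,
        # SHA256 over the exact bytes of the compiled PDF.
        "checksum": hashlib.sha256(pdf_bytes).hexdigest(),
    }


# Usage with placeholder values (all ids here are made up for illustration).
record = build_pdf_metadata(
    b"%PDF-1.5 placeholder bytes", "doc-001", "pdf-001",
    "tutorial-v1", ["inj-42"], "2024-01-01T00:00:00Z",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Serializing with sorted keys keeps the JSON file itself stable across runs, which helps when diffing experiment outputs.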

Rendering Pipeline

  1. Template selection: choose appropriate LaTeX template based on document type (tutorial, README, code example).
  2. Conversion: map source text to LaTeX (markdown → tex or direct LaTeX). Preserve code blocks and comment formatting.
  3. Embed provenance: insert invisible or visible markers (e.g., comment % INJECTED_BY:..., small footer line) into the LaTeX source.
  4. Sanitization: remove or neutralize dangerous constructs (shell-escape, \write18, external resource includes).
  5. Compile (sandboxed): run LaTeX engine (pdfLaTeX / XeLaTeX) inside restricted environment with CPU/memory/time limits and no network access.
  6. Post-process: linearize/optimize PDF if needed, compute checksum, and store with metadata.
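
The compile step (step 5) could look roughly like the sketch below. It only covers the engine invocation and a wall-clock time limit; network isolation and CPU/memory caps are assumed to come from the surrounding container and are not shown. The function names and the default timeout are hypothetical. The command-line flags are standard options of the TeX Live pdflatex/xelatex binaries.

```python
import subprocess

ALLOWED_ENGINES = {"pdflatex", "xelatex"}


def build_compile_command(tex_path, out_dir, engine="pdflatex"):
    """Return the argv list for one non-interactive compile with shell escape off."""
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    return [
        engine,
        "-no-shell-escape",            # disable \write18 / shell escape
        "-interaction=nonstopmode",    # never wait for user input
        "-halt-on-error",              # stop at the first error
        f"-output-directory={out_dir}",
        tex_path,
    ]


def compile_tex(tex_path, out_dir, timeout_s=30.0, engine="pdflatex"):
    """Run the engine once. Returns (ok, log_text, timed_out)."""
    cmd = build_compile_command(tex_path, out_dir, engine)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "compile exceeded the time limit", True
    return proc.returncode == 0, proc.stdout + proc.stderr, False
```

Passing the command as a list (rather than a shell string) avoids shell interpretation of file names that originate from experiment data.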

Security & Sandbox Controls

  • No network access during compilation.
  • Disable/exclude shell-escape and external package installation.
  • Run each compile in isolated container or restricted process with resource caps (CPU, RAM, time).
  • Limit file system access to input/output directories only.
  • Validate LaTeX source for suspicious commands before compile; reject or sanitize problematic constructs.
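
The validation bullet above can be sketched as a simple denylist scan. The list of control sequences is illustrative and deliberately incomplete, and the function name `find_suspicious` is hypothetical. A scan like this is a pre-filter only: TeX can build control sequences dynamically, so the sandbox remains the actual security boundary. It will also flag such commands when they appear inside code listings, so a real pipeline would need a policy for that case.

```python
import re

# Illustrative denylist (assumption, not exhaustive): shell escape, file I/O,
# Lua execution, and primitives commonly used to obfuscate other commands.
SUSPICIOUS = re.compile(
    r"\\(write\s*18|input|include|openin|openout|directlua|ShellEscape"
    r"|csname|catcode)(?![A-Za-z])"
)


def find_suspicious(tex: str) -> list[str]:
    """Return the suspicious control sequences found in tex, in order of appearance."""
    return ["\\" + re.sub(r"\s+", "", m.group(1))
            for m in SUSPICIOUS.finditer(tex)]


def validate_or_reject(tex: str) -> str:
    """Return tex unchanged if the scan is clean, otherwise raise ValueError."""
    hits = find_suspicious(tex)
    if hits:
        raise ValueError(f"rejected LaTeX source, found: {sorted(set(hits))}")
    return tex
```

The negative lookahead keeps harmless commands such as `\includegraphics` from matching the `\include` entry.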

Provenance & Traceability

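  • Inject provenance tags into LaTeX source: a comment such as %INJECTED_BY:pdf_generator; id:... (which stays in the .tex file only, because comments are discarded at compile time) or a small visible footer with experiment id (which reaches the compiled PDF).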
  • Store pdf_metadata.json including original doc_id, injection_id, template used, compile timestamp, and SHA256 checksum.
  • Retain render_log to aid debugging and to verify deterministic behavior.
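
A minimal sketch of marker insertion, under two assumptions: the source contains a `\begin{document}` line, and the engine is pdfLaTeX, whose `\pdfinfo` primitive writes to the PDF information dictionary (XeLaTeX would need a different mechanism). Because a LaTeX comment never reaches the compiled PDF, the sketch adds both a source-level comment and a PDF-level keyword. The function name and the id format check are hypothetical.

```python
import re

# Assumption: ids are restricted to a conservative character set so that they
# cannot break out of the comment line or the \pdfinfo string.
_ID_RE = re.compile(r"[A-Za-z0-9_.-]+")
_BEGIN = "\\begin{document}"


def add_provenance(tex: str, injection_id: str) -> str:
    """Insert a source comment and a PDF /Keywords entry carrying the id."""
    if not _ID_RE.fullmatch(injection_id):
        raise ValueError("injection_id contains unsupported characters")
    if _BEGIN not in tex:
        raise ValueError("no \\begin{document} found")
    tag = f"INJECTED_BY:pdf_generator; id:{injection_id}"
    comment = f"% {tag}\n"                      # visible in the .tex only
    info = f"\\pdfinfo{{/Keywords ({tag})}}\n"  # survives into the PDF
    head, tail = tex.split(_BEGIN, 1)
    return comment + head + info + _BEGIN + tail
```

If the template already sets PDF metadata through another package, the two mechanisms may interact, so the resulting keywords should be checked on a compiled sample.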

Template & Styling Guidelines

  • Use clean, documentation-friendly templates (title, short abstract/intro, code blocks with monospace fonts, small footer).
  • Keep PDF length and token density moderate (optimal for retrieval embedding): prefer concise pages (1–5 pages) when simulating tutorials/READMEs.
  • Ensure code blocks are verbatim (no syntax execution).
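
For the verbatim requirement, one possible approach is to wrap code in LaTeX's standard `verbatim` environment. The sketch below assumes that environment is what the template uses; the function name is hypothetical. The guard matters because the literal string `\end{verbatim}` inside the code would close the environment early and let the rest of the code be typeset as live LaTeX.

```python
_END = "\\end{verbatim}"


def code_to_verbatim(code: str) -> str:
    """Wrap code in a verbatim environment, refusing input that would escape it."""
    if _END in code:
        raise ValueError("code would terminate the verbatim environment early")
    return "\\begin{verbatim}\n" + code.rstrip("\n") + "\n" + _END + "\n"


print(code_to_verbatim("print('hi')  # \\write18 is inert here\n"))
```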

Error Handling & Retries

  • Capture compilation errors and classify (missing package, syntax error, resource limit).
  • For transient issues (e.g., time limit), allow 1 retry with adjusted resource parameters; otherwise record failure and skip ingestion.
  • Log full stderr/stdout in render_log.
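
The classification and single-retry policy above might be sketched as follows. The three failure classes are the ones named in the list; the extra "unknown" class, the log patterns, the doubling of the time limit, and the `compile_fn(timeout_s) -> (ok, log, timed_out)` interface are assumptions made for illustration.

```python
import re

_MISSING = re.compile(r"! LaTeX Error: File `[^']+\.(?:sty|cls)' not found")


def classify_failure(log: str, timed_out: bool = False) -> str:
    """Map a failed compile to one failure class based on its log."""
    if timed_out or "TeX capacity exceeded" in log:
        return "resource_limit"
    if _MISSING.search(log):
        return "missing_package"
    if any(line.startswith("!") for line in log.splitlines()):
        return "syntax_error"
    return "unknown"  # assumption: catch-all class not named in the issue


def compile_with_retry(compile_fn, timeout_s, retry_factor=2.0):
    """Compile once, retrying a single time only for resource-limit failures.

    Returns (ok, failure_class_or_None, attempts).
    """
    ok, log, timed_out = compile_fn(timeout_s)
    if ok:
        return True, None, 1
    kind = classify_failure(log, timed_out)
    if kind != "resource_limit":
        return False, kind, 1  # not transient: record failure, skip ingestion
    ok, log, timed_out = compile_fn(timeout_s * retry_factor)
    return ok, (None if ok else classify_failure(log, timed_out)), 2
```

Taking the compile function as a parameter keeps the retry logic testable without a LaTeX installation.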

Storage & Ingestion

  • Store PDFs in controlled bucket/path with strict access policies.
  • Provide manifest for RAG ingestion: mapping pdf_id → doc_id → metadata.
  • Optionally precompute text extraction (OCR of the PDF, or extraction from the TeX source) to produce the document text used for embedding.
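
A minimal sketch of the ingestion manifest, assuming each entry is built from one pdf_metadata.json record. The nested layout (pdf_id as key, doc_id and the full record underneath) is one possible reading of "pdf_id → doc_id → metadata", and the function name is hypothetical.

```python
import json


def build_manifest(records):
    """Map pdf_id -> {"doc_id": ..., "metadata": ...} for every metadata record."""
    manifest = {}
    for rec in records:
        pdf_id = rec["pdf_id"]
        if pdf_id in manifest:
            raise ValueError(f"duplicate pdf_id: {pdf_id}")
        manifest[pdf_id] = {"doc_id": rec["doc_id"], "metadata": rec}
    return manifest


# Usage with placeholder records (ids are made up for illustration).
records = [
    {"pdf_id": "pdf-001", "doc_id": "doc-001", "template_id": "tutorial-v1"},
    {"pdf_id": "pdf-002", "doc_id": "doc-001", "template_id": "readme-v1"},
]
print(json.dumps(build_manifest(records), indent=2, sort_keys=True))
```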

Evaluation Signals

  • Compile success rate: % of documents successfully rendered.
  • Determinism check: identical input + template → identical checksum across runs.
  • Provenance presence: verify marker exists in compiled PDF (e.g., metadata or footer).
  • Size/length stats: distribution of pages/tokens to ensure retrieval suitability.
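
The first two signals can be computed with a few lines, sketched below with hypothetical function names. One caveat for the determinism check: LaTeX engines normally embed creation timestamps in the PDF, so byte-identical output across runs may require pinning them first (for example through the SOURCE_DATE_EPOCH environment variable, which recent pdfTeX releases honor). Whether that is sufficient for a given template is something to verify empirically.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """SHA256 checksum of a compiled PDF's bytes."""
    return hashlib.sha256(data).hexdigest()


def compile_success_rate(outcomes: list[bool]) -> float:
    """Percentage of documents that rendered successfully (0.0 for no input)."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


def is_deterministic(pdf_runs: list[bytes]) -> bool:
    """True if every run of the same input + template produced identical bytes."""
    return len({sha256_hex(pdf) for pdf in pdf_runs}) <= 1
```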

Minimal Procedure

  1. Select induced_document and template.
  2. Convert/sanitize source → LaTeX.
  3. Insert provenance marker.
  4. Compile in sandbox with resource limits.
  5. Post-process, compute checksum, save document.pdf, render_log, and pdf_metadata.json.
  6. Provide manifest entry for RAG ingestion.

Quick Checklist

  • Template chosen
  • LaTeX source sanitized (no shell-escape)
  • Provenance marker inserted
  • Compile in sandbox (no network, resource limits)
  • Save document.pdf, render_log, pdf_metadata.json
  • Manifest updated for ingestion

Metadata

Labels

enhancement: New feature or request
