Description
PDF Generator — Task Description
Description of the LaTeX → PDF Generator component only, as used in the experiment pipeline to produce polished documents (PDFs) from modified/induced documents for ingestion into RAG systems.
Objective
Provide a secure, reproducible pipeline that converts instrumented document text (including injected inducing paragraphs or code comments) into well-formed PDF files. PDFs simulate real-world artifacts (tutorials, READMEs, papers) that are then ingested by retrieval systems and LLM pipelines.
Core Concept
- Render document content (markdown / LaTeX / code + comments) into PDF using a sandboxed LaTeX toolchain.
- Produce deterministic, metadata-rich PDFs that include provenance markers (hidden or visible) to trace injection origin and version.
Input
- `induced_document` or `modified_document` (raw text, may include injection markers or docstrings)
- `render_template` (LaTeX template or markdown-to-tex mapping)
- `metadata` (doc_id, injection_id, template_id, author/date, provenance tags)
- `compile_config` (LaTeX engine, resource limits, sandbox settings)
Output
- `document.pdf`: compiled PDF file (stored in controlled experiment storage)
- `render_log`: stdout/stderr of compilation, warnings, and timestamps
- `pdf_metadata.json`: `{ doc_id, pdf_id, template_id, injected_markers, compile_time, checksum }`
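The inputs and outputs listed above could be modeled as small typed records. This is only a sketch: the field names come from the lists above, but the class names, the field types, and every default value (engine, time and memory limits) are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CompileConfig:
    """Hypothetical shape for compile_config; all defaults are assumed values."""
    engine: str = "pdflatex"      # or "xelatex"
    timeout_s: int = 60           # wall-clock limit per compile
    memory_mb: int = 512          # memory cap for the sandboxed process
    allow_network: bool = False   # compilation should never need network access


@dataclass
class PdfMetadata:
    """Mirrors the keys listed for pdf_metadata.json."""
    doc_id: str
    pdf_id: str
    template_id: str
    injected_markers: list
    compile_time: str
    checksum: str

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Keeping the metadata in one record type makes it harder for a pipeline stage to silently drop a field before `pdf_metadata.json` is written.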
Rendering Pipeline
- Template selection: choose appropriate LaTeX template based on document type (tutorial, README, code example).
- Conversion: map source text to LaTeX (markdown → tex or direct LaTeX). Preserve code blocks and comment formatting.
- Embed provenance: insert invisible or visible markers (e.g., comment `% INJECTED_BY:...`, small footer line) into the LaTeX source.
- Sanitization: remove or neutralize dangerous constructs (shell-escape, `\write18`, external resource includes).
- Compile (sandboxed): run LaTeX engine (pdfLaTeX / XeLaTeX) inside restricted environment with CPU/memory/time limits and no network access.
- Post-process: linearize/optimize PDF if needed, compute checksum, and store with metadata.
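The conversion step might look roughly like the sketch below, which handles only the markdown path. It escapes LaTeX special characters in prose and wraps fenced code blocks in a `verbatim` environment so that code and comments are preserved literally. The function names and the choice of `verbatim` (rather than a package such as `listings`) are assumptions, and real markdown features such as headings and lists are not handled here.

```python
import re

# Characters with special meaning in LaTeX and their literal replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# A fenced block: three backticks, optional info string, body, three backticks.
FENCE = re.compile(r"`{3}[^\n]*\n(.*?)`{3}", re.DOTALL)


def escape_latex(text: str) -> str:
    """Escape prose character by character, so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def markdown_to_latex(source: str) -> str:
    """Escape prose and wrap fenced code blocks in verbatim environments."""
    parts = []
    pos = 0
    for match in FENCE.finditer(source):
        parts.append(escape_latex(source[pos:match.start()]))
        code = match.group(1)
        if r"\end{verbatim}" in code:
            # Such a block could break out of the environment; refuse it.
            raise ValueError("code block would terminate the verbatim environment")
        if not code.endswith("\n"):
            code += "\n"
        parts.append("\\begin{verbatim}\n" + code + "\\end{verbatim}\n")
        pos = match.end()
    parts.append(escape_latex(source[pos:]))
    return "".join(parts)
```

Rejecting a code block that contains `\end{verbatim}` matters for this pipeline in particular, because injected content that escapes the environment would be interpreted as LaTeX instead of being printed.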
Security & Sandbox Controls
- No network access during compilation.
- Disable/exclude `shell-escape` and external package installation.
- Run each compile in isolated container or restricted process with resource caps (CPU, RAM, time).
- Limit file system access to input/output directories only.
- Validate LaTeX source for suspicious commands before compile; reject or sanitize problematic constructs.
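One possible shape for the validation and restricted-compile steps is sketched below, assuming a POSIX host with a TeX Live engine on the path. The pattern list is illustrative and not exhaustive, and the function names and limit values are assumptions. The process limits set here cap only CPU time and address space: they do not remove network or file system access, which still has to come from the container or sandbox around the process.

```python
import re
import resource
import subprocess
from pathlib import Path

# Illustrative deny-list; a real deployment would likely need more patterns.
FORBIDDEN_PATTERNS = {
    "shell_escape": re.compile(r"\\write\s*18"),
    "piped_input": re.compile(r"\\input\s*\{?\s*\|"),
    "file_write": re.compile(r"\\openout"),
    "lua_exec": re.compile(r"\\directlua"),
    "remote_include": re.compile(
        r"\\(?:input|include|includegraphics)[^{]*\{\s*(?:https?|ftp):", re.I
    ),
}


def find_violations(tex: str) -> list:
    """Return the names of all forbidden constructs found in the source."""
    return [name for name, pat in FORBIDDEN_PATTERNS.items() if pat.search(tex)]


def build_compile_command(tex_path: Path, out_dir: Path, engine: str = "pdflatex") -> list:
    """Engine invocation with shell escape disabled and no interactive prompts."""
    return [
        engine,
        "-no-shell-escape",
        "-interaction=nonstopmode",
        "-halt-on-error",
        f"-output-directory={out_dir}",
        str(tex_path),
    ]


def _limiter(cpu_s: int, mem_bytes: int):
    def apply_limits():
        # Runs in the child process just before the engine starts (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
    return apply_limits


def compile_sandboxed(tex_path: Path, out_dir: Path, timeout_s: int = 60,
                      cpu_s: int = 30, mem_mb: int = 1024) -> subprocess.CompletedProcess:
    """Validate, then compile; raises subprocess.TimeoutExpired on wall-clock timeout."""
    violations = find_violations(tex_path.read_text(encoding="utf-8"))
    if violations:
        raise ValueError(f"rejected LaTeX source: {violations}")
    return subprocess.run(
        build_compile_command(tex_path, out_dir),
        capture_output=True, text=True, timeout=timeout_s,
        preexec_fn=_limiter(cpu_s, mem_mb * 1024 * 1024),
    )
```

A deny-list like this is best treated as a second line of defense. Passing `-no-shell-escape` and isolating the process are what actually prevent command execution if a pattern is missed.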
Provenance & Traceability
- Inject provenance tags into LaTeX source: `%INJECTED_BY:pdf_generator; id:...` or small visible footer with experiment id.
- Store `pdf_metadata.json` including original doc_id, injection_id, template used, compile timestamp, and SHA256 checksum.
- Retain `render_log` to aid debugging and to verify deterministic behavior.
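As a rough illustration of the traceability steps, the sketch below prepends the comment-style marker to the LaTeX source, hashes the compiled file with SHA-256, and writes a metadata record. The marker format follows the example above; the function names and the JSON key names (`compile_timestamp`, `sha256`) are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def add_provenance(tex: str, injection_id: str) -> str:
    """Prepend the provenance comment as the first line of the LaTeX source."""
    if "\n" in injection_id or "\r" in injection_id:
        # A line break would end the comment and let the rest run as LaTeX.
        raise ValueError("injection_id must not contain line breaks")
    return f"%INJECTED_BY:pdf_generator; id:{injection_id}\n{tex}"


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata(pdf_path: Path, out_path: Path, doc_id: str,
                   injection_id: str, template_id: str) -> dict:
    """Write the metadata record next to the PDF and return it."""
    meta = {
        "doc_id": doc_id,
        "injection_id": injection_id,
        "template_id": template_id,
        "compile_timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256_of(pdf_path),
    }
    out_path.write_text(json.dumps(meta, indent=2, sort_keys=True), encoding="utf-8")
    return meta
```

The timestamp lives in the metadata file, not in the hashed PDF, so recording it does not by itself disturb checksum comparisons between runs.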
Template & Styling Guidelines
- Use clean, documentation-friendly templates (title, short abstract/intro, code blocks with monospace fonts, small footer).
- Keep PDF length and token density moderate (optimal for retrieval embedding): prefer concise pages (1–5 pages) when simulating tutorials/READMEs.
- Ensure code blocks are verbatim (no syntax execution).
Error Handling & Retries
- Capture compilation errors and classify (missing package, syntax error, resource limit).
- For transient issues (e.g., time limit), allow 1 retry with adjusted resource parameters; otherwise record failure and skip ingestion.
- Log full stderr/stdout in `render_log`.
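The classification and single-retry policy could be sketched as follows. The category names, the `CompileResult` record, and the doubling of the time limit on retry are assumptions. The log patterns match common LaTeX messages, but a production classifier would likely need more cases.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CompileResult:
    """Hypothetical outcome record for one compile attempt."""
    ok: bool
    log: str
    timed_out: bool = False


def classify_failure(result: CompileResult) -> str:
    """Map a failed compile to one of the categories named above."""
    if result.timed_out or "TeX capacity exceeded" in result.log:
        return "resource_limit"
    if re.search(r"LaTeX Error: File `[^']+' not found", result.log):
        return "missing_package"
    if re.search(r"^! ", result.log, re.MULTILINE):
        # Any other "! ..." line is a TeX error raised by the source itself.
        return "syntax_error"
    return "unknown"


def compile_with_retry(compile_fn: Callable[[int], CompileResult],
                       timeout_s: int = 30) -> Tuple[CompileResult, Optional[str]]:
    """Run compile_fn(timeout); retry once with a doubled limit on resource failures."""
    result = compile_fn(timeout_s)
    if result.ok:
        return result, None
    category = classify_failure(result)
    if category == "resource_limit":
        result = compile_fn(timeout_s * 2)  # the single permitted retry
        if result.ok:
            return result, None
        category = classify_failure(result)
    return result, category
```

A non-`None` category in the returned pair is the signal to record the failure and skip ingestion for that document.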
Storage & Ingestion
- Store PDFs in controlled bucket/path with strict access policies.
- Provide manifest for RAG ingestion: mapping `pdf_id → doc_id → metadata`.
- Optionally precompute text-extraction (OCR/tex extraction) to produce the document text used for embedding.
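A manifest of this kind could be kept as a JSON Lines file with one entry per PDF. The file format, function names, and entry layout below are assumptions chosen for illustration; the source only specifies the `pdf_id → doc_id → metadata` mapping.

```python
import json
from pathlib import Path


def manifest_entry(pdf_id: str, doc_id: str, metadata: dict) -> dict:
    """One manifest record linking a compiled PDF back to its source document."""
    return {"pdf_id": pdf_id, "doc_id": doc_id, "metadata": metadata}


def append_to_manifest(manifest_path: Path, entry: dict) -> None:
    """Append one JSON object per line, so entries can be added incrementally."""
    with manifest_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")


def load_manifest(manifest_path: Path) -> dict:
    """Read the manifest into a dict keyed by pdf_id; later entries win."""
    entries = {}
    with manifest_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                entry = json.loads(line)
                entries[entry["pdf_id"]] = entry
    return entries
```

An append-only file suits a batch pipeline in which each successful compile adds exactly one record and the ingestion job reads the whole manifest afterwards.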
Evaluation Signals
- Compile success rate: % of documents successfully rendered.
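- Determinism check: identical input + template → identical checksum across runs. LaTeX engines embed creation timestamps in the PDF by default, so these must be pinned (e.g., via the `SOURCE_DATE_EPOCH` environment variable supported by recent TeX Live engines) before checksums can be expected to match.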
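- Provenance presence: verify marker exists in compiled PDF (e.g., metadata or footer). A LaTeX `%` comment is discarded at compile time, so only a visible footer or a PDF metadata field can be verified in the output.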
- Size/length stats: distribution of pages/tokens to ensure retrieval suitability.
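The first, second, and fourth signals reduce to simple aggregations over per-run records. The sketch below assumes the caller has already collected success flags, `(input_key, checksum)` pairs, and page counts; these input shapes and the function names are hypothetical.

```python
import statistics
from collections import defaultdict


def compile_success_rate(outcomes: list) -> float:
    """Percentage of compile attempts that succeeded (0.0 for an empty list)."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


def nondeterministic_inputs(runs: list) -> list:
    """Given (input_key, checksum) pairs, return keys seen with more than one checksum."""
    seen = defaultdict(set)
    for key, checksum in runs:
        seen[key].add(checksum)
    return sorted(key for key, sums in seen.items() if len(sums) > 1)


def page_stats(page_counts: list) -> dict:
    """Summarize the page-count distribution of the rendered PDFs."""
    return {
        "min": min(page_counts),
        "median": statistics.median(page_counts),
        "max": max(page_counts),
    }
```

For example, `nondeterministic_inputs([("a", "x"), ("a", "x"), ("b", "y"), ("b", "z")])` flags only `"b"`, since `"a"` produced the same checksum on both runs.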
Minimal Procedure
- Select `induced_document` and template.
- Convert/sanitize source → LaTeX.
- Insert provenance marker.
- Compile in sandbox with resource limits.
- Post-process, compute checksum, save `document.pdf`, `render_log`, and `pdf_metadata.json`.
- Provide manifest entry for RAG ingestion.
Quick Checklist
- Template chosen
- LaTeX source sanitized (no shell-escape)
- Provenance marker inserted
- Compile in sandbox (no network, resource limits)
- Save `document.pdf`, `render_log`, `pdf_metadata.json`
- Manifest updated for ingestion