examples: add RAG failure diagnostics flow example (#20795) #20818

Open
onestardao wants to merge 10 commits into PrefectHQ:main from onestardao:main

Conversation

@onestardao

Add a new example in examples/rag_failure_diagnostics.py that shows how to:

  • instrument each stage of a simple RAG pipeline with Prefect tasks and logs
  • surface signals that map incidents to common failure patterns (retrieval hallucination, retriever coverage issues, recall gaps, etc.)

This is a docs-only change that addresses the request in #20795.
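
As an illustration of the second bullet, here is a minimal sketch of how per-query signals could be mapped to candidate failure patterns. It is not the code from the PR: the `Signals` dataclass, the `infer_failure_patterns` function, the 0.5 coverage threshold, and the pairing of each signal with a pattern name are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Signals:
    """Per-query observations. Field names are illustrative, not from the PR."""

    retrieval_coverage: float  # fraction of expected chunks retrieved (0.0 to 1.0)
    missing_keywords: list[str] = field(default_factory=list)
    answer_contains_forbidden: bool = False


def infer_failure_patterns(signals: Signals, min_coverage: float = 0.5) -> list[str]:
    """Map raw signals to candidate failure patterns (assumed heuristics)."""
    patterns: list[str] = []
    if signals.retrieval_coverage < min_coverage:
        # Too few of the expected chunks came back from the retriever.
        patterns.append("retriever coverage issue")
    if signals.missing_keywords:
        # The answer omits terms that a correct answer should mention.
        patterns.append("recall gap")
    if signals.answer_contains_forbidden:
        # The answer contains a term that a grounded answer should not.
        patterns.append("retrieval hallucination")
    return patterns
```

The output is a list of things to investigate, not a verdict; a real checklist would likely need more signals than these three.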

Overview

This PR introduces a self-contained example for users running RAG or LLM pipelines with Prefect.

The example constructs a tiny FAQ knowledge base, applies naive chunking, uses a toy keyword-based retriever, and simulates an incorrect model answer. The flow logs diagnostics such as retrieval coverage, retrieved chunk IDs, missing or forbidden keywords, and prints possible failure patterns to investigate.
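
The stages described above could look roughly like the following plain-Python sketch. The FAQ contents, the sentence-level chunking rule, and the function names (`chunk_documents`, `retrieve`, `retrieval_coverage`) are hypothetical stand-ins chosen for illustration and may differ from what the example file actually does.

```python
import re

# Hypothetical FAQ knowledge base: document ID -> text.
FAQ = {
    "refunds": "Refunds are available within 30 days of purchase. Contact support to start a refund.",
    "shipping": "Standard shipping takes 5 business days. Express shipping takes 2 business days.",
}


def chunk_documents(docs: dict[str, str]) -> dict[str, str]:
    """Naive chunking: one chunk per sentence, with IDs such as 'refunds-0'."""
    chunks: dict[str, str] = {}
    for doc_id, text in docs.items():
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        for i, sentence in enumerate(sentences):
            chunks[f"{doc_id}-{i}"] = sentence
    return chunks


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    """Toy keyword retriever: rank chunks by the number of shared tokens."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(text)), cid) for cid, text in chunks.items()]
    # Drop chunks with no overlap; break score ties by chunk ID for determinism.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [cid for _, cid in ranked[:top_k]]


def retrieval_coverage(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of the expected chunk IDs that were actually retrieved."""
    if not expected_ids:
        return 1.0
    hits = sum(1 for cid in expected_ids if cid in retrieved_ids)
    return hits / len(expected_ids)
```

In a Prefect flow, each of these functions would be a natural candidate for a `@task`, so that every stage gets its own run state and logs.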

The goal is to give users a minimal, inspectable template for debugging RAG flows using Prefect’s task boundaries and logging.

Checklist

  • This pull request references the related issue by including: closes Proposal: RAG flow failure analysis tutorial using WFGY 16-problem ProblemMap #20795
  • This pull request adds no new functionality and is a docs-only change, so no unit tests are required
  • No docs files are removed and no redirect settings are needed in mint.json
  • This pull request adds an example script; docstrings are included within the file where appropriate

@onestardao
Author

Thanks for the earlier guidance.
I’ve aligned the example with the existing Prefect examples and all checks are passing now.
Happy to adjust anything further if needed.

Member

@desertaxle left a comment


I left some comments, but overall, this example doesn't seem very interesting. I'm not sure how users would apply this to real-world use cases, and it doesn't showcase any Prefect features that help solve common RAG pipeline challenges. It's possible that RAG failure diagnostics and workflow orchestration are orthogonal, and an example for this isn't useful, but let us know if you have ideas to make this example more interesting.

from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List, Tuple
Member


We prefer to use the built-in dict, list, and tuple for typing.
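
For reference, a small sketch of what that preference looks like in practice. Built-in generics such as `list[str]` and `dict[str, str]` work in annotations on Python 3.9 and later without importing anything from `typing`. The `keyword_scores` helper below is hypothetical and only serves to show the annotation style.

```python
def keyword_scores(query_terms: list[str], chunks: dict[str, str]) -> list[tuple[str, float]]:
    """Return (chunk_id, score) pairs, where score is the share of query terms found."""
    results: list[tuple[str, float]] = []
    for chunk_id, text in chunks.items():
        lowered = text.lower()
        # Substring match per term; deliberately simplistic for the example.
        hits = sum(1 for term in query_terms if term.lower() in lowered)
        score = hits / len(query_terms) if query_terms else 0.0
        results.append((chunk_id, score))
    return results
```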

Comment on lines 262 to 272
logger.info("Diagnostics summary:")
logger.info(" retrieved_ids = %s", retrieved_ids)
logger.info(
    " retrieval_coverage = %.2f",
    retrieval_metrics.get("coverage", 0.0),
)
logger.info(" missing_keywords_in_answer = %s", missing_keywords)
logger.info(" answer_contains_forbidden = %s", answer_contains_forbidden)

# Map observations to higher level failure patterns.
logger.info("Possible failure patterns to investigate:")
Member


Could this use a Prefect artifact instead of a wall of log messages?

- Update the RAG failure diagnostics example to emit a `rag-failure-diagnostics`
  markdown artifact instead of only relying on logs
- Keep logs focused on step-level events while the artifact summarizes a single
  query (coverage, retrieved_ids, missing keywords, and probable failure patterns)
- Switch typing in the example to use built-in generics (list, dict, tuple)
- Still a docs-only change that adds a self-contained RAG diagnostics flow
  addressing the request in PrefectHQ#20795
@onestardao
Author

Thanks a lot for the review and suggestions.

I’ve updated the example to:

  • switch the typing to the built-in generics (list[str], dict[str, float], etc.), and
  • emit a small rag-failure-diagnostics markdown artifact that summarizes a single run
    (query, retrieved_ids, retrieval coverage, missing keywords, and the inferred failure patterns).

The artifact shows up in the Prefect UI next to the flow run, so users can inspect the
RAG failure signals at a glance instead of scrolling through a long wall of log lines.
The example is still self-contained and docs-only, and the pattern is meant to be a
minimal template that teams can adapt to their own internal failure mode checklist.
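
A rough sketch of how such a summary could be assembled is shown below. The `build_diagnostics_markdown` function and the table layout are assumptions for illustration, not the code in the PR; only the artifact key `rag-failure-diagnostics` and the listed fields come from the description above.

```python
def build_diagnostics_markdown(
    query: str,
    retrieved_ids: list[str],
    coverage: float,
    missing_keywords: list[str],
    failure_patterns: list[str],
) -> str:
    """Render a single-query diagnostics summary as markdown (hypothetical layout)."""
    rows = [
        ("query", query),
        ("retrieved_ids", ", ".join(retrieved_ids) or "none"),
        ("retrieval_coverage", f"{coverage:.2f}"),
        ("missing_keywords", ", ".join(missing_keywords) or "none"),
    ]
    lines = ["## RAG failure diagnostics", "", "| Signal | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    lines += ["", "### Possible failure patterns", ""]
    # Fall back to a placeholder bullet when no pattern was inferred.
    lines += [f"- {p}" for p in failure_patterns] or ["- none detected"]
    return "\n".join(lines)
```

Inside a flow or task, the resulting string could then be passed to `create_markdown_artifact(key="rag-failure-diagnostics", markdown=report)` from `prefect.artifacts`, which requires Prefect to be installed and the call to run within a flow or task run context.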

Happy to tweak the artifact content or naming if you prefer something else.


Development

Successfully merging this pull request may close these issues.

Proposal: RAG flow failure analysis tutorial using WFGY 16-problem ProblemMap

2 participants