A document intelligence solution accelerator built on Azure AI. Extracts structured answers from document collections using AI agents and proves those answers are grounded in actual source material.
This template is built to showcase Azure AI services. We strongly advise against using this code in production without implementing additional security features. See productionizing guide.
| Challenge | Prism's Solution |
|---|---|
| Expensive Vision API calls | Hybrid extraction: PyMuPDF4LLM extracts text locally (free), Vision AI only validates pages with images/diagrams. 70%+ cost reduction. |
| Poor table extraction | pymupdf4llm preserves table structure as markdown. openpyxl extracts Excel with formulas and formatting. |
| Lost document structure | Structure-aware chunking respects markdown hierarchy (##, ###). Extracts section titles as metadata. |
| Hallucinated answers | Agentic retrieval with strict grounding instructions. Always cites sources. Distinguishes "not found" vs "explicitly excluded." |
| Manual Q&A workflows | Define question templates per project. Run workflows against your knowledge base. Export results to CSV. |
Documents go through hybrid extraction using Microsoft Agent Framework. Reliable local libraries handle the parsing, AI agents handle validation and enhancement.
PDF Processing
- PyMuPDF4LLM: Fast, local text/table extraction - free, structure-preserving
- Vision_Validator agent: Validates pages containing images, diagrams, or schematics using GPT-4.1 Vision
- Smart optimization: Text-only pages skip Vision entirely. Repeated images (logos, headers) auto-filtered.
- Custom instructions: Project-specific extraction prompts via config.json
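A project's extraction instructions might look like the following. This is an illustrative sketch only; the key names (`extraction_instructions`, `vision_validation`) are assumptions, not the actual schema of config.json:

```json
{
  "extraction_instructions": "Transcribe part labels on wiring diagrams verbatim. Preserve units exactly as printed.",
  "vision_validation": true
}
```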
Excel Processing
- openpyxl: Extracts all worksheets (including hidden), formulas, merged cells
- Excel_Enhancement agent: Restructures raw data into search-optimized markdown, preserving item numbers, part codes, specifications
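The kind of restructuring the Excel_Enhancement agent performs can be illustrated deterministically: raw worksheet rows become a markdown table whose item numbers and part codes stay searchable. This is a hand-written stand-in for illustration, not the agent's actual prompt-driven logic:

```python
def rows_to_markdown(headers: list[str], rows: list[list[str]]) -> str:
    """Render extracted worksheet rows as a markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

table = rows_to_markdown(
    ["Item No.", "Part Code", "Spec"],
    [["1", "PC-100", "24 V DC"], ["2", "PC-200", "IP67"]],
)
print(table)
```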
Email Processing
- extract-msg: Reliable .msg parsing with attachment extraction
- Email_Enhancement agent: Classifies email purpose and urgency, extracts requirements and action items, identifies deadlines, generates summaries
Upload → Extract → Deduplicate → Chunk → Embed → Index → Query
| Stage | What It Does |
|---|---|
| Extract | Hybrid local + AI agent extraction to structured markdown |
| Deduplicate | SHA256 hashing removes duplicate content |
| Chunk | Document-aware recursive chunking (1000 tokens, 200 overlap) |
| Embed | text-embedding-3-large (1024 dimensions, batch processing) |
| Index | Azure AI Search with hybrid search + semantic ranking |
| Query | Agentic retrieval with Knowledge Source + Knowledge Base |
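The deduplication stage needs only the standard library: hash each extracted document's content and keep first occurrences. A minimal sketch of the SHA256 approach, not the pipeline's exact code:

```python
import hashlib

def deduplicate(docs: list[dict]) -> list[dict]:
    """Keep the first document for each unique content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    {"name": "a.pdf", "content": "spec sheet"},
    {"name": "a_copy.pdf", "content": "spec sheet"},  # duplicate content
    {"name": "b.pdf", "content": "wiring diagram"},
]
print([d["name"] for d in deduplicate(docs)])  # → ['a.pdf', 'b.pdf']
```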
Before embedding, documents go through document-aware recursive chunking:
- PDFs split on page boundaries, Excel on sheet markers, emails on metadata/body/attachment sections
- Chunks target 1000 tokens with 200-token overlap, using tiktoken for accurate counting
- Preserves markdown header hierarchy (H1-H4) as metadata, merges small sections with neighbors
- Table-aware regex avoids breaking markdown tables mid-row
- Each chunk enriched with context prefix (document name, section hierarchy, location) to improve embedding quality
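The sliding-window and context-prefix parts of this can be sketched in a few lines. Whitespace splitting stands in for tiktoken counts here, and the prefix format is illustrative, not the pipeline's exact one:

```python
def chunk_text(text: str, doc_name: str, section: str,
               max_tokens: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows, each prefixed with context.

    Whitespace tokens approximate tiktoken counts in this sketch.
    """
    tokens = text.split()
    prefix = f"[{doc_name} > {section}]\n"
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(prefix + " ".join(window))
        if start + max_tokens >= len(tokens):
            break
        start += max_tokens - overlap  # step back to create the overlap
    return chunks

chunks = chunk_text("word " * 1500, "manual.pdf", "Installation")
print(len(chunks))  # → 2 (tokens 0-999, then 800-1499)
```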
PrismRAG uses Azure AI Search Agentic Retrieval for intelligent document retrieval.
The search index uses hybrid search: HNSW vectors with cosine distance, full-text search, and semantic ranking (required for agentic retrieval). On top of the index sits a two-layer architecture:
- Knowledge Source - wraps the search index with properties for agentic retrieval
- Knowledge Base - orchestrates the multi-query pipeline, connects to the LLM
When you submit a query with conversation history, agentic retrieval:
- Uses the LLM (gpt-4o, gpt-4.1, or gpt-5) to analyze context and break the query into focused subqueries
- Executes all subqueries in parallel against the knowledge source
- Applies semantic reranking to filter results
- Returns grounding data, source references, and execution details
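The fan-out step can be sketched with a thread pool. `search_knowledge_source` is a hypothetical stand-in for the actual Azure AI Search call:

```python
from concurrent.futures import ThreadPoolExecutor

def search_knowledge_source(subquery: str) -> list[str]:
    # Hypothetical stand-in: the real call queries the Azure AI Search
    # knowledge source and returns semantically reranked passages.
    return [f"result for: {subquery}"]

def fan_out(subqueries: list[str]) -> list[str]:
    """Execute all subqueries in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        result_lists = pool.map(search_knowledge_source, subqueries)
    return [r for results in result_lists for r in results]

hits = fan_out(["rated voltage motor X", "motor X electrical specs"])
print(hits)
```

`ThreadPoolExecutor.map` preserves input order, so results line up with their subqueries.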
Your application then uses this grounding data to generate the final answer. PrismRAG adds custom retry logic: if the original query returns nothing, it tries a simplified version (removing acronyms), then an expanded version (adding synonyms).
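A minimal sketch of that fallback chain, where the `retrieve`, `simplify`, and `expand` callables are hypothetical placeholders for Prism's retrieval call, acronym stripping, and synonym expansion:

```python
def retrieve_with_retry(query, retrieve, simplify, expand):
    """Try the original query, then a simplified, then an expanded variant."""
    for variant in (query, simplify(query), expand(query)):
        results = retrieve(variant)
        if results:
            return results
    return []

# Toy demonstration with stand-in helpers:
kb = {"rated voltage": ["24 V DC (datasheet p. 3)"]}
results = retrieve_with_retry(
    "rated voltage (RV)",
    retrieve=lambda q: kb.get(q, []),
    simplify=lambda q: q.split(" (")[0],      # drop the trailing acronym
    expand=lambda q: q + " nominal voltage",  # add a synonym
)
print(results)  # → ['24 V DC (datasheet p. 3)']
```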
Define structured Q&A templates for systematic document analysis:
```json
{
  "sections": [
    {
      "name": "Technical Specifications",
      "template": "Answer based on technical documents. Provide specific values with units.",
      "questions": [
        { "question": "What is the rated voltage?", "instructions": "Check electrical specs" },
        { "question": "Operating temperature range?", "instructions": "Check environmental specs" }
      ]
    }
  ]
}
```
- Run workflows against your knowledge base
- Track completion percentage per section
- Export results to CSV
- Edit and comment on answers
- Evaluation: Assess answer quality with Azure AI Evaluation SDK (relevance, coherence, fluency, groundedness)
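CSV export needs only the standard library. A sketch assuming a flat results structure (the field names here are illustrative, not the exact shape of Prism's results):

```python
import csv
import io

results = [
    {"section": "Technical Specifications",
     "question": "What is the rated voltage?",
     "answer": "24 V DC",
     "sources": "datasheet.pdf p. 3"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["section", "question", "answer", "sources"])
writer.writeheader()          # column header row
writer.writerows(results)     # one row per answered question
print(buf.getvalue())
```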
See Architecture Documentation for detailed system design.
| Service | Purpose |
|---|---|
| Azure AI Foundry | GPT-4.1 (chat, evaluation), GPT-5-chat (extraction agents, workflows), text-embedding-3-large (1024 dimensions) |
| Azure AI Search Agentic Retrieval | Knowledge Source + Knowledge Base for multi-query retrieval pipeline |
| Azure AI Evaluation SDK | Answer quality scoring (relevance, coherence, fluency, groundedness) |
| Azure Blob Storage | Document and project data storage |
| Container Apps | Serverless hosting for backend/frontend |
| Framework | Purpose |
|---|---|
| Microsoft Agent Framework | Orchestrates extraction agents (Vision_Validator, Excel_Enhancement, Email_Enhancement) and workflow agents |
| Library | Purpose |
|---|---|
| PyMuPDF4LLM | PDF text/table extraction with layout detection |
| openpyxl | Excel extraction with formula support |
| extract-msg | Outlook .msg email parsing |
| tiktoken | Token counting for accurate chunk sizing |
| LangChain text splitters | Structure-aware recursive chunking |
| Component | Technology |
|---|---|
| Backend | FastAPI (Python 3.11) |
| Frontend | Vue 3 + Vite + TailwindCSS + Pinia |
| Infrastructure | Bicep + Azure Developer CLI |
- Azure subscription with permissions to create resources
- Azure Developer CLI
- Docker
```shell
# Clone and deploy
git clone https://github.com/Azure-Samples/Prism---Transform-Data-into-Queryable-Knowledge.git
cd Prism---Transform-Data-into-Queryable-Knowledge
azd auth login
azd up
```
What gets deployed:
- AI Foundry with GPT-4.1, gpt-5-chat (workflows), text-embedding-3-large
- Azure AI Search with semantic ranking enabled
- Azure Blob Storage for project data
- Container Apps (backend + frontend)
- Container Registry, Log Analytics, Application Insights
Get the auth password:
```shell
az containerapp secret show --name prism-backend --resource-group <your-rg> --secret-name auth-password --query value -o tsv
```
After running azd up, generate a local .env file from your deployed Container App:
```shell
# Set your resource group
RG=<your-rg>

# Get environment variables and secrets (split on tabs so values containing spaces survive)
az containerapp show --name prism-backend --resource-group $RG \
  --query "properties.template.containers[0].env[?value!=null].{name:name, value:value}" \
  -o tsv | awk -F'\t' '{print $1"="$2}' > .env

# Append secrets
echo "AZURE_OPENAI_API_KEY=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name ai-services-key --query value -o tsv)" >> .env
echo "AZURE_SEARCH_ADMIN_KEY=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name search-admin-key --query value -o tsv)" >> .env
echo "AUTH_PASSWORD=$(az containerapp secret show --name prism-backend --resource-group $RG --secret-name auth-password --query value -o tsv)" >> .env
```
Then run locally:
```shell
docker-compose -f infra/docker/docker-compose.yml --env-file .env up -d
```
Access at http://localhost:3000
```
prism/
├── apps/
│   ├── api/                  # FastAPI backend
│   │   └── app/
│   │       ├── api/          # REST endpoints
│   │       └── services/     # Pipeline, workflow, storage services
│   └── web/                  # Vue 3 frontend
│       └── src/views/        # Dashboard, Query, Workflows, Results
├── scripts/
│   ├── extraction/           # Document extractors
│   │   ├── pdf_extraction_hybrid.py    # PyMuPDF4LLM + Vision
│   │   ├── excel_extraction_agents.py  # openpyxl + AI
│   │   └── email_extraction_agents.py  # extract-msg + AI
│   ├── rag/                  # RAG pipeline
│   │   ├── deduplicate_documents.py
│   │   ├── chunk_documents.py          # Structure-aware chunking
│   │   └── generate_embeddings.py
│   ├── search_index/         # Azure AI Search
│   │   ├── create_search_index.py
│   │   ├── create_knowledge_source.py
│   │   └── create_knowledge_agent.py
│   └── evaluation/           # Answer quality evaluation
│       └── evaluate_results.py
├── workflows/
│   └── workflow_agent.py     # Q&A workflow execution
└── infra/
    ├── bicep/                # Azure infrastructure
    └── docker/               # Local development (includes Azurite)
```
All project data is stored in Azure Blob Storage:
- Production: Azure Blob Storage with managed identity authentication
- Local Development: Azurite (Azure Storage emulator, included in docker-compose)
```
Container: prism-projects
└── {project-name}/
    ├── documents/                 # Uploaded files
    ├── output/                    # Processed results
    │   ├── extraction_results/*.md
    │   ├── chunked_documents/*.json
    │   ├── embedded_documents/*.json
    │   └── results.json           # Workflow answers + evaluations
    ├── config.json                # Extraction instructions
    └── workflow_config.json       # Q&A templates
```
Browse local storage with Azure Storage Explorer connected to http://localhost:10000.
| Service | SKU | Pricing |
|---|---|---|
| Azure Container Apps | Consumption | Pricing |
| Azure OpenAI | Standard | Pricing |
| Azure AI Search | Basic | Pricing |
Cost optimization: Hybrid PDF extraction reduces Vision API calls by 70%+ compared to full-vision approaches.
Clean up all deployed resources:
```shell
azd down
```
- Quick Start - Get running in 5 minutes
- User Guide - Complete usage instructions
- Architecture - System design details
- Data Ingestion - Supported formats and pipeline
- Troubleshooting - Common issues
- Productionizing - Production readiness
- Local Development - Development setup
- Azure AI Foundry
- Azure AI Search Agentic Retrieval
- Microsoft Agent Framework
- Azure AI Evaluation SDK
- PyMuPDF4LLM
MIT License - see LICENSE
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines.