Add media description feature using Azure Content Understanding (#2195)
* First pass

* CU kinda working

* CU integration

* Better splitting

* Add Bicep

* Rm unneeded figures

* Remove en-us from URLs

* Fix URLs

* Remote figures output JSON

* Update matrix comments

* Make mypy happy

* Add same errors to file strategy

* Add pymupdf to skip modules for mypy

* Output the endpoint from Bicep

* 100 percent coverage for mediadescriber.py

* Tests added for PDFParser

* Fix that tuple type

* Add pricing link

* Fix content read issue
pamelafox authored Dec 9, 2024
1 parent e90920f commit 0bb3f95
Showing 36 changed files with 962 additions and 65 deletions.
1 change: 1 addition & 0 deletions .azdo/pipelines/azure-dev.yml
@@ -120,6 +120,7 @@ steps:
DEPLOYMENT_TARGET: $(DEPLOYMENT_TARGET)
AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: $(AZURE_CONTAINER_APPS_WORKLOAD_PROFILE)
USE_CHAT_HISTORY_BROWSER: $(USE_CHAT_HISTORY_BROWSER)
USE_MEDIA_DESCRIBER_AZURE_CU: $(USE_MEDIA_DESCRIBER_AZURE_CU)
- task: AzureCLI@2
displayName: Deploy Application
inputs:
3 changes: 2 additions & 1 deletion .github/workflows/azure-dev.yml
@@ -13,7 +13,7 @@ on:
# To configure required secrets for connecting to Azure, simply run `azd pipeline config`

# Set up permissions for deploying with secretless Azure federated credentials
- # https://learn.microsoft.com/en-us/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
+ # https://learn.microsoft.com/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
permissions:
id-token: write
contents: read
@@ -103,6 +103,7 @@ jobs:
DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}
USE_CHAT_HISTORY_BROWSER: ${{ vars.USE_CHAT_HISTORY_BROWSER }}
USE_MEDIA_DESCRIBER_AZURE_CU: ${{ vars.USE_MEDIA_DESCRIBER_AZURE_CU }}
steps:
- name: Checkout
uses: actions/checkout@v4
2 changes: 2 additions & 0 deletions CONTRIBUTING.md
@@ -122,6 +122,8 @@ If you followed the steps above to install the pre-commit hooks, then you can ju

When adding new azd environment variables, please remember to update:

1. [main.parameters.json](./infra/main.parameters.json)
1. [appEnvVariables in main.bicep](./infra/main.bicep)
1. App Service's [azure.yaml](./azure.yaml)
1. [ADO pipeline](.azdo/pipelines/azure-dev.yml).
1. [Github workflows](.github/workflows/azure-dev.yml)
4 changes: 3 additions & 1 deletion README.md
@@ -91,7 +91,9 @@ However, you can try the [Azure pricing calculator](https://azure.com/e/e3490de2
- Azure AI Document Intelligence: SO (Standard) tier using pre-built layout. Pricing per document page, sample documents have 261 pages total. [Pricing](https://azure.microsoft.com/pricing/details/form-recognizer/)
- Azure AI Search: Basic tier, 1 replica, free level of semantic search. Pricing per hour. [Pricing](https://azure.microsoft.com/pricing/details/search/)
- Azure Blob Storage: Standard tier with ZRS (Zone-redundant storage). Pricing per storage and read operations. [Pricing](https://azure.microsoft.com/pricing/details/storage/blobs/)
- - Azure Cosmos DB: Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
+ - Azure Cosmos DB: Only provisioned if you enabled [chat history with Cosmos DB](docs/deploy_features.md#enabling-persistent-chat-history-with-azure-cosmos-db). Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
- Azure AI Vision: Only provisioned if you enabled [GPT-4 with vision](docs/gpt4v.md). Pricing per 1K transactions. [Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/computer-vision/)
- Azure AI Content Understanding: Only provisioned if you enabled [media description](docs/deploy_features.md#enabling-media-description-with-azure-content-understanding). Pricing per 1K images. [Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)
- Azure Monitor: Pay-as-you-go tier. Costs based on data ingested. [Pricing](https://azure.microsoft.com/pricing/details/monitor/)

To reduce costs, you can switch to free SKUs for various services, but those SKUs have limitations.
2 changes: 1 addition & 1 deletion app/backend/gunicorn.conf.py
@@ -7,7 +7,7 @@
bind = "0.0.0.0"

timeout = 230
- # https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
+ # https://learn.microsoft.com/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds

num_cpus = multiprocessing.cpu_count()
if os.getenv("WEBSITE_SKU") == "LinuxFree":
16 changes: 13 additions & 3 deletions app/backend/prepdocs.py
@@ -7,6 +7,7 @@
from azure.core.credentials import AzureKeyCredential
from azure.core.credentials_async import AsyncTokenCredential
from azure.identity.aio import AzureDeveloperCliCredential, get_bearer_token_provider
from rich.logging import RichHandler

from load_azd_env import load_azd_env
from prepdocslib.blobmanager import BlobManager
@@ -158,8 +159,10 @@ def setup_file_processors(
local_pdf_parser: bool = False,
local_html_parser: bool = False,
search_images: bool = False,
use_content_understanding: bool = False,
content_understanding_endpoint: Union[str, None] = None,
):
- sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=search_images)
+ sentence_text_splitter = SentenceTextSplitter()

doc_int_parser: Optional[DocumentAnalysisParser] = None
# check if Azure Document Intelligence credentials are provided
@@ -170,6 +173,8 @@ def setup_file_processors(
doc_int_parser = DocumentAnalysisParser(
endpoint=f"https://{document_intelligence_service}.cognitiveservices.azure.com/",
credential=documentintelligence_creds,
use_content_understanding=use_content_understanding,
content_understanding_endpoint=content_understanding_endpoint,
)

pdf_parser: Optional[Parser] = None
@@ -294,10 +299,10 @@ async def main(strategy: Strategy, setup_index: bool = True):
args = parser.parse_args()

if args.verbose:
- logging.basicConfig(format="%(message)s")
+ logging.basicConfig(format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)])
# We only set the level to INFO for our logger,
# to avoid seeing the noisy INFO level logs from the Azure SDKs
- logger.setLevel(logging.INFO)
+ logger.setLevel(logging.DEBUG)

load_azd_env()

@@ -309,6 +314,7 @@ async def main(strategy: Strategy, setup_index: bool = True):
use_gptvision = os.getenv("USE_GPT4V", "").lower() == "true"
use_acls = os.getenv("AZURE_ADLS_GEN2_STORAGE_ACCOUNT") is not None
dont_use_vectors = os.getenv("USE_VECTORS", "").lower() == "false"
use_content_understanding = os.getenv("USE_MEDIA_DESCRIBER_AZURE_CU", "").lower() == "true"

# Use the current user identity to connect to Azure services. See infra/main.bicep for role assignments.
if tenant_id := os.getenv("AZURE_TENANT_ID"):
@@ -406,6 +412,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER") == "true",
local_html_parser=os.getenv("USE_LOCAL_HTML_PARSER") == "true",
search_images=use_gptvision,
use_content_understanding=use_content_understanding,
content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
)
image_embeddings_service = setup_image_embeddings_service(
azure_credential=azd_credential,
@@ -424,6 +432,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
search_analyzer_name=os.getenv("AZURE_SEARCH_ANALYZER_NAME"),
use_acls=use_acls,
category=args.category,
use_content_understanding=use_content_understanding,
content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
)

loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
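The new `USE_MEDIA_DESCRIBER_AZURE_CU` setting follows the same convention as the other feature flags in this file: the feature is on only when the variable holds the string `true` in any casing. A minimal sketch of that convention, where `env_flag` is a hypothetical helper name that does not appear in the commit:

```python
import os


def env_flag(name: str) -> bool:
    """Return True only when the environment variable is set to 'true' (any casing).

    Hypothetical helper for illustration; the commit inlines this expression.
    """
    return os.getenv(name, "").lower() == "true"


# Values such as "1" or "yes" do not enable the feature under this convention.
os.environ["USE_MEDIA_DESCRIBER_AZURE_CU"] = "True"
use_content_understanding = env_flag("USE_MEDIA_DESCRIBER_AZURE_CU")
```

An unset variable falls back to the empty string, so the feature stays off by default.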
2 changes: 1 addition & 1 deletion app/backend/prepdocslib/blobmanager.py
@@ -171,7 +171,7 @@ def sourcepage_from_file_page(cls, filename, page=0) -> str:

@classmethod
def blob_image_name_from_file_page(cls, filename, page=0) -> str:
- return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".png"
+ return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"

@classmethod
def blob_name_from_file_name(cls, filename) -> str:
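The change to `blob_image_name_from_file_page` shifts the page suffix from 0-based to 1-based, so the first page of a document now produces a name ending in `-1.png`. The sketch below lifts the new expression into a standalone function for illustration; the sample path is made up.

```python
import os


def blob_image_name_from_file_page(filename: str, page: int = 0) -> str:
    # `page` is 0-indexed internally; the blob name uses a 1-based page number.
    return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"


# Hypothetical input path, used only to show the naming.
first_page_name = blob_image_name_from_file_page("data/report.pdf", page=0)
third_page_name = blob_image_name_from_file_page("data/report.pdf", page=2)
```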
17 changes: 17 additions & 0 deletions app/backend/prepdocslib/filestrategy.py
@@ -1,10 +1,13 @@
import logging
from typing import List, Optional

from azure.core.credentials import AzureKeyCredential

from .blobmanager import BlobManager
from .embeddings import ImageEmbeddings, OpenAIEmbeddings
from .fileprocessor import FileProcessor
from .listfilestrategy import File, ListFileStrategy
from .mediadescriber import ContentUnderstandingDescriber
from .searchmanager import SearchManager, Section
from .strategy import DocumentAction, SearchInfo, Strategy

@@ -50,6 +53,8 @@ def __init__(
search_analyzer_name: Optional[str] = None,
use_acls: bool = False,
category: Optional[str] = None,
use_content_understanding: bool = False,
content_understanding_endpoint: Optional[str] = None,
):
self.list_file_strategy = list_file_strategy
self.blob_manager = blob_manager
@@ -61,6 +66,8 @@ def __init__(
self.search_info = search_info
self.use_acls = use_acls
self.category = category
self.use_content_understanding = use_content_understanding
self.content_understanding_endpoint = content_understanding_endpoint

async def setup(self):
search_manager = SearchManager(
@@ -73,6 +80,16 @@ async def setup(self):
)
await search_manager.create_index()

if self.use_content_understanding:
if self.content_understanding_endpoint is None:
raise ValueError("Content Understanding is enabled but no endpoint was provided")
if isinstance(self.search_info.credential, AzureKeyCredential):
raise ValueError(
"AzureKeyCredential is not supported for Content Understanding, use keyless auth instead"
)
cu_manager = ContentUnderstandingDescriber(self.content_understanding_endpoint, self.search_info.credential)
await cu_manager.create_analyzer()

async def run(self):
search_manager = SearchManager(
self.search_info, self.search_analyzer_name, self.use_acls, False, self.embeddings
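The checks added to `setup()` fail fast on two misconfigurations before any analyzer is created: a missing endpoint, and key-based credentials. A minimal sketch of the same guard logic in isolation, assuming a hypothetical function name and a boolean `uses_key_credential` standing in for the `isinstance(..., AzureKeyCredential)` test:

```python
from typing import Optional


def validate_content_understanding_config(
    enabled: bool, endpoint: Optional[str], uses_key_credential: bool
) -> None:
    """Raise ValueError for the two misconfigurations the commit guards against.

    Hypothetical standalone version; the commit performs these checks inline.
    """
    if not enabled:
        return  # Nothing to validate when the feature is off.
    if endpoint is None:
        raise ValueError("Content Understanding is enabled but no endpoint was provided")
    if uses_key_credential:
        raise ValueError(
            "AzureKeyCredential is not supported for Content Understanding, use keyless auth instead"
        )


# A disabled feature passes regardless of the other settings.
validate_content_understanding_config(enabled=False, endpoint=None, uses_key_credential=True)
```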
107 changes: 107 additions & 0 deletions app/backend/prepdocslib/mediadescriber.py
@@ -0,0 +1,107 @@
import logging
from abc import ABC

import aiohttp
from azure.core.credentials_async import AsyncTokenCredential
from azure.identity.aio import get_bearer_token_provider
from rich.progress import Progress
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

logger = logging.getLogger("scripts")


class MediaDescriber(ABC):

async def describe_image(self, image_bytes) -> str:
raise NotImplementedError # pragma: no cover


class ContentUnderstandingDescriber:
CU_API_VERSION = "2024-12-01-preview"

analyzer_schema = {
"analyzerId": "image_analyzer",
"name": "Image understanding",
"description": "Extract detailed structured information from images extracted from documents.",
"baseAnalyzerId": "prebuilt-image",
"scenario": "image",
"config": {"returnDetails": False},
"fieldSchema": {
"name": "ImageInformation",
"descriptions": "Description of image.",
"fields": {
"Description": {
"type": "string",
"description": "Description of the image. If the image has a title, start with the title. Include a 2-sentence summary. If the image is a chart, diagram, or table, include the underlying data in an HTML table tag, with accurate numbers. If the image is a chart, describe any axis or legends. The only allowed HTML tags are the table/thead/tr/td/tbody tags.",
},
},
},
}

def __init__(self, endpoint: str, credential: AsyncTokenCredential):
self.endpoint = endpoint
self.credential = credential

async def poll_api(self, session, poll_url, headers):

@retry(stop=stop_after_attempt(60), wait=wait_fixed(2), retry=retry_if_exception_type(ValueError))
async def poll():
async with session.get(poll_url, headers=headers) as response:
response.raise_for_status()
response_json = await response.json()
if response_json["status"] == "Failed":
raise Exception("Failed")
if response_json["status"] == "Running":
raise ValueError("Running")
return response_json

return await poll()

async def create_analyzer(self):
logger.info("Creating analyzer '%s'...", self.analyzer_schema["analyzerId"])

token_provider = get_bearer_token_provider(self.credential, "https://cognitiveservices.azure.com/.default")
token = await token_provider()
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
params = {"api-version": self.CU_API_VERSION}
analyzer_id = self.analyzer_schema["analyzerId"]
cu_endpoint = f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_id}"
async with aiohttp.ClientSession() as session:
async with session.put(
url=cu_endpoint, params=params, headers=headers, json=self.analyzer_schema
) as response:
if response.status == 409:
logger.info("Analyzer '%s' already exists.", analyzer_id)
return
elif response.status != 201:
data = await response.text()
raise Exception("Error creating analyzer", data)
else:
poll_url = response.headers.get("Operation-Location")

with Progress() as progress:
progress.add_task("Creating analyzer...", total=None, start=False)
await self.poll_api(session, poll_url, headers)

async def describe_image(self, image_bytes: bytes) -> str:
logger.info("Sending image to Azure Content Understanding service...")
async with aiohttp.ClientSession() as session:
token = await self.credential.get_token("https://cognitiveservices.azure.com/.default")
headers = {"Authorization": "Bearer " + token.token}
params = {"api-version": self.CU_API_VERSION}
analyzer_name = self.analyzer_schema["analyzerId"]
async with session.post(
url=f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_name}:analyze",
params=params,
headers=headers,
data=image_bytes,
) as response:
response.raise_for_status()
poll_url = response.headers["Operation-Location"]

with Progress() as progress:
progress.add_task("Processing...", total=None, start=False)
results = await self.poll_api(session, poll_url, headers)

fields = results["result"]["contents"][0]["fields"]
return fields["Description"]["valueString"]
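`poll_api` uses tenacity to re-issue the GET every 2 seconds, up to 60 attempts, for as long as the operation reports `Running`. The sketch below shows the same polling idea with only the standard library, so it is not the commit's implementation. The names `poll_until_done` and `fake_fetch` are hypothetical, and the `Succeeded` status string is an assumption, since the commit only checks for `Failed` and `Running`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    max_attempts: int = 60,
    delay_seconds: float = 2.0,
) -> Dict[str, Any]:
    """Call fetch_status until the operation is no longer 'Running'."""
    for _ in range(max_attempts):
        body = await fetch_status()
        if body["status"] == "Failed":
            raise RuntimeError("Operation failed")
        if body["status"] != "Running":
            return body  # Any other status is treated as finished.
        await asyncio.sleep(delay_seconds)
    raise TimeoutError(f"Operation still running after {max_attempts} attempts")


async def demo() -> str:
    # Canned responses standing in for the HTTP GETs against Operation-Location.
    responses = iter(
        [
            {"status": "Running"},
            {"status": "Running"},
            {
                "status": "Succeeded",  # assumed terminal status name
                "result": {"contents": [{"fields": {"Description": {"valueString": "A bar chart."}}}]},
            },
        ]
    )

    async def fake_fetch() -> Dict[str, Any]:
        return next(responses)

    result = await poll_until_done(fake_fetch, delay_seconds=0)
    # Same lookup path that describe_image uses on the final response.
    return result["result"]["contents"][0]["fields"]["Description"]["valueString"]


description = asyncio.run(demo())
```

Using a distinct exception for the still-running case, as the commit does with `ValueError`, lets the retry policy tell a pending operation apart from a hard failure.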
6 changes: 5 additions & 1 deletion app/backend/prepdocslib/page.py
@@ -3,7 +3,7 @@ class Page:
A single page from a document
Attributes:
- page_num (int): Page number
+ page_num (int): Page number (0-indexed)
offset (int): If the text of the entire Document was concatenated into a single string, the index of the first character on the page. For example, if page 1 had the text "hello" and page 2 had the text "world", the offset of page 2 is 5 ("hellow")
text (str): The text of the page
"""
@@ -17,6 +17,10 @@ def __init__(self, page_num: int, offset: int, text: str):
class SplitPage:
"""
A section of a page that has been split into a smaller chunk.
Attributes:
page_num (int): Page number (0-indexed)
text (str): The text of the section
"""

def __init__(self, page_num: int, text: str):