Support more doc formats with new documentintelligence SDK (#1224)
* Support more doc formats with new documentintelligence SDK

* Location picker for Document Intelligence

* Move comment up

* Add other data types

* Add section on reusing Doc Intelligence

* Rename to Doc Intel everywhere
pamelafox authored Feb 15, 2024
1 parent f90c660 commit 2680bd6
Showing 11 changed files with 81 additions and 57 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -209,6 +209,16 @@ You can also customize the search service (new or existing) for non-English sear
1. To turn off the spell checker, run `azd env set AZURE_SEARCH_QUERY_SPELLER none`. Consult [this table](https://learn.microsoft.com/rest/api/searchservice/preview-api/search-documents#queryLanguage) to determine if spell checker is supported for your query language.
1. To configure the name of the analyzer to use for a searchable text field to a value other than "en.microsoft", run `azd env set AZURE_SEARCH_ANALYZER_NAME {Name of analyzer name}`. ([See other possible values](https://learn.microsoft.com/dotnet/api/microsoft.azure.search.models.field.analyzer?view=azure-dotnet-legacy&viewFallbackFrom=azure-dotnet))

#### Existing Azure Document Intelligence resource

In order to support analysis of many document formats, this repository uses a preview version of Azure Document Intelligence (formerly Form Recognizer) that is only available in [limited regions](https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-layout).
If your existing resource is in one of those regions, then you can re-use it by setting the following environment variables:

1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_SERVICE {Name of existing Azure AI Document Intelligence service}`
1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_LOCATION {Location of existing service}`
1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP {Name of resource group with existing service, defaults to main resource group}`
1. Run `azd env set AZURE_DOCUMENTINTELLIGENCE_SKU {SKU of existing service, defaults to S0}`
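
Before setting these variables, it may help to check that the existing resource sits in a supported region. The following is a minimal sketch, assuming the supported regions are exactly the three values in the `@allowed` list this commit adds to `infra/main.bicep`; the helper name `can_reuse_doc_intel` is hypothetical and not part of the repository, and the list may change as the preview expands.

```python
# Regions copied from the @allowed list in infra/main.bicep in this commit.
# Assumption: this list mirrors the "limited regions" mentioned above.
SUPPORTED_LOCATIONS = frozenset({"eastus", "westus2", "westeurope"})


def can_reuse_doc_intel(location: str) -> bool:
    """Return True if a resource in `location` could be reused.

    Hypothetical helper: normalizes display-style names such as "East US"
    to the lowercase, space-free form used in the Bicep file.
    """
    normalized = location.strip().lower().replace(" ", "")
    return normalized in SUPPORTED_LOCATIONS
```

If the check fails, the simplest route is probably to leave `AZURE_DOCUMENTINTELLIGENCE_SERVICE` unset and let the deployment create a new resource.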

#### Other existing Azure resources

You can also use existing Azure AI Document Intelligence and Storage Accounts. See `./infra/main.parameters.json` for list of environment variables to pass to `azd env set` to configure those existing resources.
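
The entries in `infra/main.parameters.json` use placeholders such as `${AZURE_DOCUMENTINTELLIGENCE_SKU=S0}`, which is where defaults like `S0` come from. The sketch below illustrates one plausible reading of that syntax, assuming an unset variable falls back to the text after `=` (or to an empty string when there is none). `resolve_placeholders` is a hypothetical helper for illustration only, not azd's implementation.

```python
import re

# Matches ${NAME} and ${NAME=default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?:=([^}]*))?\}")


def resolve_placeholders(value: str, env: dict[str, str]) -> str:
    """Substitute placeholders in `value` using the mapping `env`.

    Assumption: a variable missing from `env` resolves to its default,
    or to "" when the placeholder has no default.
    """

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        return default if default is not None else ""

    return _PLACEHOLDER.sub(substitute, value)
```

Under this reading, running `azd env set AZURE_DOCUMENTINTELLIGENCE_SKU F0` would override the `S0` default, while a placeholder without a default stays empty until the variable is set.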
2 changes: 1 addition & 1 deletion docs/deploy_lowcost.md
@@ -41,7 +41,7 @@ However, if your goal is to minimize costs while prototyping your application, f
4. Use the free tier of Azure Document Intelligence (used in analyzing PDFs):

```shell
-azd env set AZURE_FORMRECOGNIZER_SKU F0
+azd env set AZURE_DOCUMENTINTELLIGENCE_SKU F0
```

Limitation: The free tier will only scan the first two pages of each PDF.
1 change: 1 addition & 0 deletions infra/abbreviations.json
@@ -12,6 +12,7 @@
"cdnProfiles": "cdnp-",
"cdnProfilesEndpoints": "cdne-",
"cognitiveServicesAccounts": "cog-",
+"cognitiveServicesDocumentIntelligence": "cog-di-",
"cognitiveServicesFormRecognizer": "cog-fr-",
"cognitiveServicesComputerVision": "cog-cv-",
"cognitiveServicesTextAnalytics": "cog-ta-",
45 changes: 28 additions & 17 deletions infra/main.bicep
@@ -64,10 +64,20 @@ param openAiSkuName string = 'S0'
param openAiApiKey string = ''
param openAiApiOrganization string = ''

-param formRecognizerServiceName string = ''
-param formRecognizerResourceGroupName string = ''
-param formRecognizerResourceGroupLocation string = location
-param formRecognizerSkuName string = 'S0'
+param documentIntelligenceServiceName string = ''
+param documentIntelligenceResourceGroupName string = ''
+// Limited regions for new version:
+// https://learn.microsoft.com/azure/ai-services/document-intelligence/concept-layout
+@description('Location for the Document Intelligence resource group')
+@allowed(['eastus', 'westus2', 'westeurope'])
+@metadata({
+azd: {
+type: 'location'
+}
+})
+param documentIntelligenceResourceGroupLocation string
+
+param documentIntelligenceSkuName string = 'S0'

param computerVisionServiceName string = ''
param computerVisionResourceGroupName string = ''
@@ -139,8 +149,8 @@ resource openAiResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' exi
name: !empty(openAiResourceGroupName) ? openAiResourceGroupName : resourceGroup.name
}

-resource formRecognizerResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(formRecognizerResourceGroupName)) {
-name: !empty(formRecognizerResourceGroupName) ? formRecognizerResourceGroupName : resourceGroup.name
+resource documentIntelligenceResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(documentIntelligenceResourceGroupName)) {
+name: !empty(documentIntelligenceResourceGroupName) ? documentIntelligenceResourceGroupName : resourceGroup.name
}

resource computerVisionResourceGroup 'Microsoft.Resources/resourceGroups@2021-04-01' existing = if (!empty(computerVisionResourceGroupName)) {
@@ -320,16 +330,17 @@ module openAi 'core/ai/cognitiveservices.bicep' = if (openAiHost == 'azure') {
}
}

-module formRecognizer 'core/ai/cognitiveservices.bicep' = {
-name: 'formrecognizer'
-scope: formRecognizerResourceGroup
+// Formerly known as Form Recognizer
+module documentIntelligence 'core/ai/cognitiveservices.bicep' = {
+name: 'documentintelligence'
+scope: documentIntelligenceResourceGroup
params: {
-name: !empty(formRecognizerServiceName) ? formRecognizerServiceName : '${abbrs.cognitiveServicesFormRecognizer}${resourceToken}'
+name: !empty(documentIntelligenceServiceName) ? documentIntelligenceServiceName : '${abbrs.cognitiveServicesDocumentIntelligence}${resourceToken}'
kind: 'FormRecognizer'
-location: formRecognizerResourceGroupLocation
+location: documentIntelligenceResourceGroupLocation
tags: tags
sku: {
-name: formRecognizerSkuName
+name: documentIntelligenceSkuName
}
}
}
@@ -442,9 +453,9 @@ module openAiRoleUser 'core/security/role.bicep' = if (openAiHost == 'azure') {
}
}

-module formRecognizerRoleUser 'core/security/role.bicep' = {
-scope: formRecognizerResourceGroup
-name: 'formrecognizer-role-user'
+module documentIntelligenceRoleUser 'core/security/role.bicep' = {
+scope: documentIntelligenceResourceGroup
+name: 'documentintelligence-role-user'
params: {
principalId: principalId
roleDefinitionId: 'a97b65f3-24c7-4388-baec-2e87135dc908'
@@ -595,8 +606,8 @@ output AZURE_VISION_ENDPOINT string = useGPT4V ? computerVision.outputs.endpoint
output VISION_SECRET_NAME string = useGPT4V ? computerVisionSecretName : ''
output AZURE_KEY_VAULT_NAME string = useKeyVault ? keyVault.outputs.name : ''

-output AZURE_FORMRECOGNIZER_SERVICE string = formRecognizer.outputs.name
-output AZURE_FORMRECOGNIZER_RESOURCE_GROUP string = formRecognizerResourceGroup.name
+output AZURE_DOCUMENTINTELLIGENCE_SERVICE string = documentIntelligence.outputs.name
+output AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP string = documentIntelligenceResourceGroup.name

output AZURE_SEARCH_INDEX string = searchIndexName
output AZURE_SEARCH_SERVICE string = searchService.outputs.name
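
The naming rule used for the Document Intelligence module (`!empty(name) ? name : '${abbr}${resourceToken}'`) can be restated in Python. This is only an illustrative sketch: `resolve_name` is a hypothetical function, and the token value in the usage note is made up.

```python
def resolve_name(explicit_name: str, abbreviation: str, resource_token: str) -> str:
    """Mirror the Bicep ternary: prefer the explicit name, else prefix + token."""
    return explicit_name if explicit_name else f"{abbreviation}{resource_token}"
```

For example, `resolve_name("", "cog-di-", "abc123")` yields a generated name with the new `cog-di-` prefix, whereas passing a non-empty service name returns that name unchanged. Under this rule, a deployment that leaves the service name unset would get a `cog-di-` name rather than the earlier `cog-fr-` one.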
15 changes: 9 additions & 6 deletions infra/main.parameters.json
@@ -23,14 +23,17 @@
"openAiSkuName": {
"value": "S0"
},
-"formRecognizerServiceName": {
-"value": "${AZURE_FORMRECOGNIZER_SERVICE}"
+"documentIntelligenceServiceName": {
+"value": "${AZURE_DOCUMENTINTELLIGENCE_SERVICE}"
},
-"formRecognizerResourceGroupName": {
-"value": "${AZURE_FORMRECOGNIZER_RESOURCE_GROUP}"
+"documentIntelligenceResourceGroupName": {
+"value": "${AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP}"
},
-"formRecognizerSkuName": {
-"value": "${AZURE_FORMRECOGNIZER_SKU=S0}"
+"documentIntelligenceSkuName": {
+"value": "${AZURE_DOCUMENTINTELLIGENCE_SKU=S0}"
},
+"documentIntelligenceResourceGroupLocation": {
+"value": "${AZURE_DOCUMENTINTELLIGENCE_LOCATION}"
+},
"searchIndexName": {
"value": "${AZURE_SEARCH_INDEX=gptkbindex}"
2 changes: 1 addition & 1 deletion scripts/prepdocs.ps1
@@ -76,7 +76,7 @@ $argumentList = "./scripts/prepdocs.py $dataArg --verbose " + `
"--openaihost `"$env:OPENAI_HOST`" --openaimodelname `"$env:AZURE_OPENAI_EMB_MODEL_NAME`" " + `
"--openaiservice `"$env:AZURE_OPENAI_SERVICE`" --openaideployment `"$env:AZURE_OPENAI_EMB_DEPLOYMENT`" " + `
"--openaikey `"$env:OPENAI_API_KEY`" --openaiorg `"$env:OPENAI_ORGANIZATION`" " + `
-"--formrecognizerservice $env:AZURE_FORMRECOGNIZER_SERVICE " + `
+"--documentintelligenceservice $env:AZURE_DOCUMENTINTELLIGENCE_SERVICE " + `
"$searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg " + `
"$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg " + `
"$tenantArg $aclArg " + `
26 changes: 18 additions & 8 deletions scripts/prepdocs.py
@@ -65,16 +65,18 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
doc_int_parser: DocumentAnalysisParser

# check if Azure Document Intelligence credentials are provided
-if args.formrecognizerservice is not None:
-formrecognizer_creds: Union[AsyncTokenCredential, AzureKeyCredential] = (
-credential if is_key_empty(args.formrecognizerkey) else AzureKeyCredential(args.formrecognizerkey)
+if args.documentintelligenceservice is not None:
+documentintelligence_creds: Union[AsyncTokenCredential, AzureKeyCredential] = (
+credential
+if is_key_empty(args.documentintelligencekey)
+else AzureKeyCredential(args.documentintelligencekey)
)
doc_int_parser = DocumentAnalysisParser(
-endpoint=f"https://{args.formrecognizerservice}.cognitiveservices.azure.com/",
-credential=formrecognizer_creds,
+endpoint=f"https://{args.documentintelligenceservice}.cognitiveservices.azure.com/",
+credential=documentintelligence_creds,
verbose=args.verbose,
)
-if args.localpdfparser or args.formrecognizerservice is None:
+if args.localpdfparser or args.documentintelligenceservice is None:
pdf_parser = LocalPdfParser()
else:
pdf_parser = doc_int_parser
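
The credential and PDF parser selection in this hunk can be sketched in isolation. This is a minimal illustration, not the repository's code: `doc_intel_endpoint`, `choose_auth`, and `choose_pdf_parser` are hypothetical names, the returned strings stand in for the real credential and parser objects, and `is_key_empty` is assumed to treat `None`, empty, and whitespace-only keys as empty.

```python
from typing import Optional


def is_key_empty(key: Optional[str]) -> bool:
    # Assumed behavior: None, "", and whitespace-only keys count as empty.
    return key is None or len(key.strip()) == 0


def doc_intel_endpoint(service_name: str) -> str:
    # Same URL pattern as the f-string in setup_file_strategy.
    return f"https://{service_name}.cognitiveservices.azure.com/"


def choose_auth(key: Optional[str]) -> str:
    # "identity" stands in for the AsyncTokenCredential,
    # "key" for AzureKeyCredential(key).
    return "identity" if is_key_empty(key) else "key"


def choose_pdf_parser(local_pdf_parser: bool, service_name: Optional[str]) -> str:
    # PDFs go to the local parser when requested or when no service is set.
    if local_pdf_parser or service_name is None:
        return "local_pdf"
    return "doc_intel"
```

Key-based authentication is therefore opt-in: without `--documentintelligencekey`, the current user identity is used.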
@@ -83,6 +85,14 @@ async def setup_file_strategy(credential: AsyncTokenCredential, args: Any) -> St
".pdf": FileProcessor(pdf_parser, sentence_text_splitter),
".json": FileProcessor(JsonParser(), SimpleTextSplitter()),
".docx": FileProcessor(doc_int_parser, sentence_text_splitter),
+".pptx": FileProcessor(doc_int_parser, sentence_text_splitter),
+".xlsx": FileProcessor(doc_int_parser, sentence_text_splitter),
+".png": FileProcessor(doc_int_parser, sentence_text_splitter),
+".jpg": FileProcessor(doc_int_parser, sentence_text_splitter),
+".jpeg": FileProcessor(doc_int_parser, sentence_text_splitter),
+".tiff": FileProcessor(doc_int_parser, sentence_text_splitter),
+".bmp": FileProcessor(doc_int_parser, sentence_text_splitter),
+".heic": FileProcessor(doc_int_parser, sentence_text_splitter),
}
use_vectors = not args.novectors
embeddings: Optional[OpenAIEmbeddings] = None
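
The extension table extended in this hunk amounts to a lookup from file suffix to parser. The sketch below illustrates that lookup with string labels in place of `FileProcessor` objects; `build_parser_table` and `parser_for` are hypothetical names, and lowercasing the extension is an assumption about how matching might work. As far as these hunks show, `doc_int_parser` is assigned only when `--documentintelligenceservice` is given while the table references it unconditionally, so the sketch registers those formats only when the parser exists. That guard is an addition for illustration, not something this commit does.

```python
import os
from typing import Dict, Optional

# Extensions routed to Document Intelligence after this commit (besides .pdf).
DOC_INTEL_EXTENSIONS = (
    ".docx", ".pptx", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic",
)


def build_parser_table(pdf_parser: str, doc_intel_available: bool) -> Dict[str, str]:
    """Map extensions to parser labels (labels stand in for parser objects)."""
    table = {".pdf": pdf_parser, ".json": "json"}
    if doc_intel_available:
        # Guard added for this sketch only; see the note above.
        for ext in DOC_INTEL_EXTENSIONS:
            table[ext] = "doc_intel"
    return table


def parser_for(path: str, table: Dict[str, str]) -> Optional[str]:
    """Look up a parser by lowercased extension; None means unsupported."""
    _, ext = os.path.splitext(path)
    return table.get(ext.lower())
```

A file such as `Deck.PPTX` would then resolve to the Document Intelligence parser, while an unlisted type like `.txt` resolves to `None`.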
@@ -355,12 +365,12 @@ async def main(strategy: Strategy, credential: AsyncTokenCredential, args: Any):
help="Use PyPdf local PDF parser (supports only digital PDFs) instead of Azure Document Intelligence service to extract text, tables and layout from the documents",
)
parser.add_argument(
-"--formrecognizerservice",
+"--documentintelligenceservice",
required=False,
help="Optional. Name of the Azure Document Intelligence service which will be used to extract text, tables and layout from the documents (must exist already)",
)
parser.add_argument(
-"--formrecognizerkey",
+"--documentintelligencekey",
required=False,
help="Optional. Use this Azure Document Intelligence account key instead of the current user identity to login (use az login to set current user for Azure)",
)
2 changes: 1 addition & 1 deletion scripts/prepdocs.sh
@@ -78,7 +78,7 @@ $searchAnalyzerNameArg $searchSecretNameArg \
--openaihost "$OPENAI_HOST" --openaimodelname "$AZURE_OPENAI_EMB_MODEL_NAME" \
--openaiservice "$AZURE_OPENAI_SERVICE" --openaideployment "$AZURE_OPENAI_EMB_DEPLOYMENT" \
--openaikey "$OPENAI_API_KEY" --openaiorg "$OPENAI_ORGANIZATION" \
---formrecognizerservice "$AZURE_FORMRECOGNIZER_SERVICE" \
+--documentintelligenceservice "$AZURE_DOCUMENTINTELLIGENCE_SERVICE" \
$searchImagesArg $visionEndpointArg $visionKeyArg $visionSecretNameArg \
$adlsGen2StorageAccountArg $adlsGen2FilesystemArg $adlsGen2FilesystemPathArg \
$tenantArg $aclArg \
15 changes: 8 additions & 7 deletions scripts/prepdocslib/pdfparser.py
@@ -1,15 +1,14 @@
import html
from typing import IO, AsyncGenerator, Union

-from azure.ai.formrecognizer import DocumentTable
-from azure.ai.formrecognizer.aio import DocumentAnalysisClient
+from azure.ai.documentintelligence.aio import DocumentIntelligenceClient
+from azure.ai.documentintelligence.models import DocumentTable
from azure.core.credentials import AzureKeyCredential
from azure.core.credentials_async import AsyncTokenCredential
from pypdf import PdfReader

from .page import Page
from .parser import Parser
-from .strategy import USER_AGENT


class LocalPdfParser(Parser):
@@ -50,10 +49,12 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
if self.verbose:
print(f"Extracting text from '{content.name}' using Azure Document Intelligence")

-async with DocumentAnalysisClient(
-endpoint=self.endpoint, credential=self.credential, headers={"x-ms-useragent": USER_AGENT}
-) as form_recognizer_client:
-poller = await form_recognizer_client.begin_analyze_document(model_id=self.model_id, document=content)
+async with DocumentIntelligenceClient(
+endpoint=self.endpoint, credential=self.credential
+) as document_intelligence_client:
+poller = await document_intelligence_client.begin_analyze_document(
+model_id=self.model_id, analyze_request=content, content_type="application/octet-stream"
+)
form_recognizer_results = await poller.result()

offset = 0
2 changes: 1 addition & 1 deletion scripts/requirements.in
@@ -2,7 +2,7 @@ pypdf
aiohttp
azure-identity
azure-search-documents==11.6.0b1
-azure-ai-formrecognizer
+azure-ai-documentintelligence
azure-storage-blob
azure-storage-file-datalake
openai[datalib]>=1.3.5
18 changes: 3 additions & 15 deletions scripts/requirements.txt
@@ -16,22 +16,20 @@ anyio==4.2.0
# openai
attrs==23.2.0
# via aiohttp
-azure-ai-formrecognizer==3.3.2
+azure-ai-documentintelligence==1.0.0b1
# via -r requirements.in
azure-common==1.1.28
# via
-# azure-ai-formrecognizer
# azure-keyvault-secrets
# azure-search-documents
azure-core==1.30.0
# via
-# azure-ai-formrecognizer
+# azure-ai-documentintelligence
# azure-identity
# azure-keyvault-secrets
# azure-search-documents
# azure-storage-blob
# azure-storage-file-datalake
-# msrest
azure-identity==1.15.0
# via -r requirements.in
azure-keyvault-secrets==4.7.0
@@ -48,7 +46,6 @@ certifi==2024.2.2
# via
# httpcore
# httpx
-# msrest
# requests
cffi==1.16.0
# via cryptography
@@ -84,19 +81,17 @@ idna==3.6
# yarl
isodate==0.6.1
# via
+# azure-ai-documentintelligence
# azure-keyvault-secrets
# azure-search-documents
# azure-storage-blob
# azure-storage-file-datalake
-# msrest
msal==1.26.0
# via
# azure-identity
# msal-extensions
msal-extensions==1.1.0
# via azure-identity
-msrest==0.7.1
-# via azure-ai-formrecognizer
multidict==6.0.5
# via
# aiohttp
@@ -106,8 +101,6 @@ numpy==1.26.4
# openai
# pandas
# pandas-stubs
-oauthlib==3.2.2
-# via requests-oauthlib
openai[datalib]==1.12.0
# via -r requirements.in
packaging==23.2
@@ -150,11 +143,7 @@ requests==2.31.0
# via
# azure-core
# msal
-# msrest
-# requests-oauthlib
# tiktoken
-requests-oauthlib==1.3.1
-# via msrest
rsa==4.9
# via python-jose
six==1.16.0
@@ -180,7 +169,6 @@ types-pytz==2024.1.0.20240203
# via pandas-stubs
typing-extensions==4.9.0
# via
-# azure-ai-formrecognizer
# azure-core
# azure-keyvault-secrets
# azure-storage-blob