This code is NOT intended for production use but instead as a starting point/reference implenentation of the Azure OpenAI (AOAI) Batch API. The code here is provided AS IS, you assume all responsibility (e.g., charges) from running this code. Testing in your environment should be done before running large use cases. Lastly, this is a work in progress and will be updated frequently. Please check back regularly for updates. This accelerator is designed to help users to quickly start using the Azure OpenAI Batch API. An overview of how the accelerator works is shown below:
Key features of the accelerator are:
- Automated Batch Job Submission and Creation
- Multi-threaded Async Processing to Reduce Overall Processing Time
- Automated Error Tracking
- Multi-directory Hierarchy Support
- Configurable Micro-batch support
- Automated Post-job Cleanup
For more details, including a detailed data flow diagram, please see this overview. Environment:
- Python 3.11 (or higher)
- Pip
- An Azure Data Lake Storage (v2) account
- An Azure OpenAI deployment
The following pip packages are required:
- azure-storage-file-datalake
- openai
- tiktoken
- requests
- token-count
- asyncio
- aiohttp
In addition to this, it is recommended to install these dependencies in a virtual environment to avoid conflicts (e.g., .venv)
The `Storage Blob Data Contributer` role must be given to the AOAI service's Managed Identity to allow AOAI to access the data in the Azure Storage Account. There are three configuration files required to use this accelerator:AOAI_config.json
- This file contains the settings for AOAI.storage_config.json
- This file contains the settings for the Azure Data Lake Storage Account which will hold the input/output of the job.app_config.json
- This file contains the application configuration settings.APP_CONFIG
inrunBatch.py
- This variable should be set to point to theapp_config.json
file which defines the app settings. Alternatively, this value can be set as an environment variable in the underlying OS. This will support command line parameter-based input in the future.
Reference templates of these files have been provided in the templates
directory where <> denote settings that must be filled in.
Other important settings are:
- aoai_api_version - This must be set to
2024-07-01-preview
as that's the only API version which supports the Batch API at this time. In the future, different versions can be set here. - batch_job_endpoint - This must be set to
/chat/completions
. - batch_size - This controls the 'micro batch' size which is the number of files that will be sent to the batch service in paralle. It is set to a recommended value of
10
but can be changed based on the requirements/file sizes being sent to the batch service. - download_to_local - This controls if the files should be downloaded to local to count the number of tokens in a file. Currently this should be set to the default value of
false
but may be used in future versions. - input_directory/filesystem - This is the directory and filesystem the code will check for input files, respectively. The default directory setting of
/
assumes no directories in the input filesystem. The current implementation is not recursive; if input files are stored in a directory in the input filesystem/container then it should be specified here. - output_directory/filesystem - This is the directory and filesystem the code will write output files, respectively. The default directory setting of
/
assumes no directories in the ouput filesystem. - error_directory/filesystem - This is the directory and filesystem the code will write error files, respectively. The default directory setting of
/
assumes no directories in the error filesystem. - continuous_mode - This setting controls how the code is run. If set to
true
, it will continuously check the input directory for files every 60 seconds, taking a snapshot of the files and kicking off a series of batch jobs to process until all files are processed. To stop, pressctrl+c
. If set tofalse
it will only run when executed.
- Input: Upload formatted batch files to the input location specified in the
storage_config.json
configuration file. Once all files are uploaded, start therunBatch.py
in the code directoy. When run, the code will run continuously or once, depending on thecontinuous_mode
setting described above. - Output: The code will create a directory in the
processed_filesystem_system_name
location instorage_config.json
configuration file for each file processed along with a timestamp of when the file was processed. The raw input file will also be moved to theprocessed
directory. In addition, if there are any errors, they will be put in theerror_filesystem_system_name
location, with a timestamp. - Metadata: The output creates a metadata file for each input file which contains mapping information which may be useful for automated processing of results.
- Cleanup: After processing is complete, the code will automatically process and clean up all files in the input directory, locally downloaded files, and all uploaded files to the AOAI Batch Service.