feat: Add input folder support for batch api #74
base: main
Conversation
Copilot (AI) left a comment:
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
examples/parse_batch_fetch.py:14
- The variable name MAX_WORKER should be renamed to MAX_WORKERS for consistency.
MAX_WORKER = 10
README.md:87
- The example code is missing the step to read the request ID from the response. It should be updated for clarity.
markdown = ap.batches.retrieve(request_id)
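Copilot's second point (the README example jumps straight to `retrieve` without showing where `request_id` comes from) can be illustrated with a self-contained sketch. The client class and field names below are stand-ins modeled on this PR, not the real SDK:

```python
from dataclasses import dataclass


@dataclass
class UploadResponse:
    requestId: str


class FakeBatches:
    """Stand-in for the AnyParser batches client (hypothetical)."""

    def create(self, file_path):
        # Upload returns a response object carrying the request ID.
        return UploadResponse(requestId="req-123")

    def retrieve(self, request_id):
        # Later, the processed markdown is fetched by that ID.
        return f"# markdown for {request_id}"


batches = FakeBatches()
response = batches.create("doc.pdf")
request_id = response.requestId  # the step the README example omits
markdown = batches.retrieve(request_id)
```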
    response["requestStatus"] = "COMPLETED"
    response["completionTime"] = markdown.completionTime
except Exception as e:
    print(f"Error processing {request_id}: {str(e)}")
Replace print statement with a proper logging mechanism for error handling.
-    print(f"Error processing {request_id}: {str(e)}")
+    logging.error(f"Error processing {request_id}: {str(e)}")
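The suggested change above can be sketched end to end; the `fetch` function and its simulated failure are illustrative stand-ins for the real retrieve call:

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger("batch_fetch")


def fetch(request_id):
    try:
        # Stand-in for the real ap.batches.retrieve(request_id) call.
        raise TimeoutError("simulated failure")
    except Exception as e:
        # Unlike print(), the logger records severity and can be routed
        # to files or external handlers via configuration.
        logger.error("Error processing %s: %s", request_id, e)


fetch("req-123")
```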
@@ -8,6 +13,8 @@
from any_parser.base_parser import BaseParser

TIMEOUT = 60
MAX_FILES = 1000
There is no need to restrict this. For the batch API, the logic is:
- a cron job runs every 2 hours
- a scan runs every 5 minutes and checks whether the batch queue size exceeds 1000
If either condition is met, a batch run is triggered.
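The trigger logic the reviewer describes can be sketched as a simple predicate; the constant names and the function are illustrative, not the actual backend code:

```python
CRON_INTERVAL_S = 2 * 60 * 60  # cron job fires every 2 hours
SCAN_INTERVAL_S = 5 * 60       # queue scan runs every 5 minutes
QUEUE_THRESHOLD = 1000         # scan triggers a run past this queue size


def should_trigger_batch(queue_size: int, seconds_since_last_run: int) -> bool:
    """Either condition on its own is enough to trigger a batch run."""
    cron_due = seconds_since_last_run >= CRON_INTERVAL_S
    queue_full = queue_size > QUEUE_THRESHOLD
    return cron_due or queue_full
```

This is why a hard `MAX_FILES` cap on the client is unnecessary: oversized queues simply trigger a run sooner.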
if len(files) > MAX_FILES:
    raise ValueError(
        f"Found {len(files)} files. Maximum allowed is {MAX_FILES}"
    )
No need for this.
output_path = folder_path.parent / output_filename

with open(output_path, "w") as f:
    for response in responses:
        f.write(json.dumps(response) + "\n")

return str(output_path)
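As submitted, the method returns a JSONL file path, so a caller has to read the file back line by line. A self-contained sketch of that round trip (the response fields are assumed from the diff, and the temp path here is only for illustration):

```python
import json
import os
import tempfile

# Mimic the PR's output format: one JSON response object per line (JSONL).
responses = [
    {"requestId": "a1", "requestStatus": "UPLOADED"},
    {"requestId": "b2", "requestStatus": "UPLOADED"},
]

output_path = os.path.join(tempfile.mkdtemp(), "responses.jsonl")
with open(output_path, "w") as f:
    for response in responses:
        f.write(json.dumps(response) + "\n")

# Caller side: each line parses back into one response dict.
with open(output_path) as f:
    loaded = [json.loads(line) for line in f]
```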
Rule of thumb: it is better to return a list of UploadResponse objects than an awkward file path that callers have to know how to interpret and read back in.
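A sketch of the reviewer's suggestion, returning structured objects instead of a file path; the `UploadResponse` fields and the per-file upload are assumptions for illustration:

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class UploadResponse:
    requestId: str
    requestStatus: str


def batch_upload_folder(folder_path: str) -> list[UploadResponse]:
    """Return structured responses instead of a JSONL file path."""
    responses = []
    for file in sorted(Path(folder_path).iterdir()):
        # Stand-in for the real per-file upload call.
        responses.append(
            UploadResponse(requestId=file.stem, requestStatus="UPLOADED")
        )
    return responses


folder = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf"):
    (Path(folder) / name).touch()
results = batch_upload_folder(folder)
```

Callers can then filter or inspect results directly, e.g. `[r.requestId for r in results]`, without knowing the on-disk format.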
Description
The previous batch API accepted only a single file as input, which made it effectively just a very long-running async call.
Following OpenAI's batch API implementation (https://platform.openai.com/docs/guides/batch/getting-started?lang=node), this PR adds support for passing an input folder.
The max worker count is currently set to 10 and the max input folder size to 1000 files.
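The worker cap maps onto a standard thread pool; a minimal sketch where `upload` is a placeholder for the real per-file request:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # cap on concurrent uploads, as chosen in this PR


def upload(path: str) -> dict:
    # Placeholder for the real per-file upload request.
    return {"file": path, "requestStatus": "UPLOADED"}


files = [f"doc_{i}.pdf" for i in range(25)]
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    # map() preserves input order while at most MAX_WORKERS run at once.
    results = list(pool.map(upload, files))
```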
Related Issue
Type of Change
How Has This Been Tested?
Test scripts being added to examples folder
parse_batch_upload.py and parse_batch_fetch.py
Screenshots (if applicable)
Generated Json after upload input folder
Fetch (Added markdown contents after processing is finished)
Checklist
Additional Notes