
feat: Add input folder support for batch api #74

Open · wants to merge 2 commits into base: main

Conversation

@boqiny (Member) commented Dec 19, 2024:

Description

The previous batch API only accepts a single file as input, which makes it little more than a very long async call. Following OpenAI's batch API design (https://platform.openai.com/docs/guides/batch/getting-started?lang=node), this PR adds folder support:

  1. First upload the contents of the folder and save each filename and requestId as JSONL (shown in examples/parse_batch_upload.py).
  2. Use the generated JSONL to fetch the markdown for each element once processing has finished (shown in examples/parse_batch_fetch.py).

For now the limits are set to max workers = 10 and max input folder size = 1000 files.
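The upload step described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `upload_one` is a hypothetical stand-in for the real per-file batch upload call, and the `fileName`/`requestId` JSONL keys are assumptions based on the description.

```python
import json
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_WORKERS = 10   # parallel upload workers
MAX_FILES = 1000   # cap on input folder size

def upload_one(path: Path) -> dict:
    """Hypothetical stand-in for the real batch upload call.

    The real API would return the requestId; here we fabricate one
    so the sketch is self-contained.
    """
    return {"fileName": path.name, "requestId": str(uuid.uuid4())}

def upload_folder(folder: str, output_jsonl: str) -> str:
    """Upload every file in `folder` and record (fileName, requestId) as JSONL."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    if len(files) > MAX_FILES:
        raise ValueError(f"Found {len(files)} files. Maximum allowed is {MAX_FILES}")
    # Upload files concurrently, bounded by MAX_WORKERS
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        responses = list(pool.map(upload_one, files))
    # One JSON object per line, as expected by the fetch step
    with open(output_jsonl, "w") as f:
        for resp in responses:
            f.write(json.dumps(resp) + "\n")
    return output_jsonl
```

The fetch step would then read this JSONL back and poll each `requestId` for its markdown result.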

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

Test scripts have been added to the examples folder: parse_batch_upload.py and parse_batch_fetch.py.

Screenshots (if applicable)

Generated JSON after uploading the input folder
(screenshot)

Fetch (markdown contents added after processing is finished)
(screenshot)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

@lingjiekong requested a review from Copilot on December 23, 2024.

Copilot AI left a comment:
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

examples/parse_batch_fetch.py:14

  • The variable name MAX_WORKER should be renamed to MAX_WORKERS for consistency.
MAX_WORKER = 10

README.md:87

  • The example code is missing the step to read the request ID from the response. It should be updated for clarity.
    try:
        markdown = ap.batches.retrieve(request_id)
        response["requestStatus"] = "COMPLETED"
        response["completionTime"] = markdown.completionTime
    except Exception as e:
        print(f"Error processing {request_id}: {str(e)}")
Copilot AI commented Dec 23, 2024:

Replace the print statement with a proper logging mechanism for error handling.

Suggested change:

    - print(f"Error processing {request_id}: {str(e)}")
    + logging.error(f"Error processing {request_id}: {str(e)}")

@@ -8,6 +13,8 @@
from any_parser.base_parser import BaseParser

TIMEOUT = 60
MAX_FILES = 1000
Member:

There is no need to restrict this. For the batch API, the logic is:

  • a cron job runs every 2 hours
  • a scan runs every 5 minutes and checks whether the batch queue size exceeds 1000

If either condition is met, a batch run is triggered.
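The server-side trigger logic described in that comment (a 2-hour cron plus a 5-minute queue-size scan) reduces to a simple either/or check. A minimal sketch, assuming the thresholds stated above (function name and parameters are illustrative, not from the codebase):

```python
def should_trigger_batch(minutes_since_last_run: int, queue_size: int) -> bool:
    """Return True when either server-side condition would start a batch run."""
    cron_due = minutes_since_last_run >= 120   # the 2-hour cron job fires
    queue_full = queue_size > 1000             # the 5-minute scan sees a full queue
    return cron_due or queue_full
```

Since the server already enforces its own batching thresholds, a client-side MAX_FILES cap is redundant, which is the point of the comment.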

Comment on lines +106 to +109

    if len(files) > MAX_FILES:
        raise ValueError(
            f"Found {len(files)} files. Maximum allowed is {MAX_FILES}"
        )
Member:

No need for this.

Comment on lines +132 to +138

    output_path = folder_path.parent / output_filename

    with open(output_path, "w") as f:
        for response in responses:
            f.write(json.dumps(response) + "\n")

    return str(output_path)
Member:
Rule of thumb: it is better to return a list of UploadResponse objects than an awkward file path that callers have to know the meaning of and read back in.
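The shape that comment suggests could look something like this. The `UploadResponse` dataclass and `create_batch` function here are hypothetical stand-ins for whatever types the library actually defines, and the request ids are fabricated placeholders:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class UploadResponse:
    file_name: str
    request_id: str

def create_batch(folder_path: str) -> list[UploadResponse]:
    """Return structured responses directly instead of a JSONL file path.

    Callers who want a file on disk can serialize the list themselves.
    """
    responses = []
    for p in sorted(Path(folder_path).iterdir()):
        if p.is_file():
            # Placeholder id; the real API would return one per upload.
            responses.append(UploadResponse(file_name=p.name, request_id=f"req-{p.stem}"))
    return responses
```

Returning typed objects keeps the serialization format an internal detail rather than part of the public contract.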

Labels: None yet
Projects: None yet
2 participants