
feat: Add input folder support for batch api #74

Open · wants to merge 2 commits into base: main

Conversation

@boqiny (Member) commented Dec 19, 2024:

Description

The previous batch API only accepts a single file as input, which makes it little more than a very long async call. Following OpenAI's batch API design (https://platform.openai.com/docs/guides/batch/getting-started?lang=node), this PR adds folder support:

  1. First upload the contents of the folder and save each filename and requestId as JSONL (shown in examples/parse_batch_upload.py).
  2. Use the generated JSONL to fetch the markdown for each element once processing has finished (shown in examples/parse_batch_fetch.py).

For now the limits are set to max workers = 10 and max input folder size = 1000 files.
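The upload step described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: `upload_one` is a hypothetical stand-in for the real per-file batch upload call, and the `fileName`/`requestId` JSONL keys are assumptions based on the description.

```python
import json
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MAX_WORKERS = 10   # parallel upload workers
MAX_FILES = 1000   # cap on input folder size

def upload_one(path: Path) -> dict:
    """Hypothetical stand-in for the real batch upload call.

    The real API would return the requestId; here we fabricate one
    so the sketch is self-contained.
    """
    return {"fileName": path.name, "requestId": str(uuid.uuid4())}

def upload_folder(folder: str, output_jsonl: str) -> str:
    """Upload every file in `folder` and record (fileName, requestId) as JSONL."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    if len(files) > MAX_FILES:
        raise ValueError(f"Found {len(files)} files. Maximum allowed is {MAX_FILES}")
    # Upload files concurrently, bounded by MAX_WORKERS
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        responses = list(pool.map(upload_one, files))
    # One JSON object per line, as expected by the fetch step
    with open(output_jsonl, "w") as f:
        for resp in responses:
            f.write(json.dumps(resp) + "\n")
    return output_jsonl
```

The fetch step would then read this JSONL back and poll each `requestId` for its markdown result.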

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

Test scripts have been added to the examples folder: parse_batch_upload.py and parse_batch_fetch.py.

Screenshots (if applicable)

Generated JSON after uploading the input folder
(screenshot)

Fetch (markdown contents added after processing is finished)
(screenshot)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

@lingjiekong requested a review from Copilot on December 23, 2024.

Copilot AI left a comment:
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

examples/parse_batch_fetch.py:14

  • The variable name MAX_WORKER should be renamed to MAX_WORKERS for consistency.
MAX_WORKER = 10

README.md:87

  • The example code is missing the step to read the request ID from the response. It should be updated for clarity.
    try:
        markdown = ap.batches.retrieve(request_id)
        response["requestStatus"] = "COMPLETED"
        response["completionTime"] = markdown.completionTime
    except Exception as e:
        print(f"Error processing {request_id}: {str(e)}")
Copilot AI commented Dec 23, 2024:

Replace the print statement with a proper logging mechanism for error handling.

Suggested change:

    - print(f"Error processing {request_id}: {str(e)}")
    + logging.error(f"Error processing {request_id}: {str(e)}")

@@ -8,6 +13,8 @@
from any_parser.base_parser import BaseParser

TIMEOUT = 60
MAX_FILES = 1000
Member:

There is no need to restrict this. For the batch API, the logic is:

  • a cron job runs every 2 hours
  • a scan runs every 5 minutes and checks whether the batch queue size exceeds 1000

If either condition is met, a batch run is triggered.
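The server-side trigger logic described in that comment (a 2-hour cron plus a 5-minute queue-size scan) reduces to a simple either/or check. A minimal sketch, assuming the thresholds stated above (function name and parameters are illustrative, not from the codebase):

```python
def should_trigger_batch(minutes_since_last_run: int, queue_size: int) -> bool:
    """Return True when either server-side condition would start a batch run."""
    cron_due = minutes_since_last_run >= 120   # the 2-hour cron job fires
    queue_full = queue_size > 1000             # the 5-minute scan sees a full queue
    return cron_due or queue_full
```

Since the server already enforces its own batching thresholds, a client-side MAX_FILES cap is redundant, which is the point of the comment.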

Comment on lines +106 to +109

    if len(files) > MAX_FILES:
        raise ValueError(
            f"Found {len(files)} files. Maximum allowed is {MAX_FILES}"
        )
Member:

No need for this.

Comment on lines +132 to +138

    output_path = folder_path.parent / output_filename

    with open(output_path, "w") as f:
        for response in responses:
            f.write(json.dumps(response) + "\n")

    return str(output_path)
Member:
Rule of thumb: it is better to return a list of UploadResponse objects than an awkward file path that callers have to know the meaning of and read back in.
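The shape that comment suggests could look something like this. The `UploadResponse` dataclass and `create_batch` function here are hypothetical stand-ins for whatever types the library actually defines, and the request ids are fabricated placeholders:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class UploadResponse:
    file_name: str
    request_id: str

def create_batch(folder_path: str) -> list[UploadResponse]:
    """Return structured responses directly instead of a JSONL file path.

    Callers who want a file on disk can serialize the list themselves.
    """
    responses = []
    for p in sorted(Path(folder_path).iterdir()):
        if p.is_file():
            # Placeholder id; the real API would return one per upload.
            responses.append(UploadResponse(file_name=p.name, request_id=f"req-{p.stem}"))
    return responses
```

Returning typed objects keeps the serialization format an internal detail rather than part of the public contract.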

Labels: None yet
Projects: None yet
2 participants