Feature: Deep Search PDF to MD file conversion #33

Draft: wants to merge 1 commit into base: main

@nerdalert nerdalert commented Jun 26, 2024

WIP/POC for PDF -> MD file conversion using Deep Search

Issue #2

While the endpoint is being established, this PR takes the client-side TypeScript library and adds a Go Gin server that mocks the file conversion API responses.

All of this code infers the API from the client TypeScript library in src/lib/api/deepsearch/index.ts, so it will certainly need some tuning once the endpoint is stood up. I had to make one adjustment to the getDocumentHashes DS4SD library code: removing the / before api/xxx in path: api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId} to prevent a double slash and the 404 it caused. Possibly a typo, TBD.
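The double slash is easy to reproduce when a base URL ending in / is concatenated with a path starting with /. The sketch below is a hypothetical helper (joinUrl is not part of the DS4SD library, and this is not the fix applied in the PR) that normalizes the join so either form of the path is safe. The base URL is the mock server address used in this PR.

```typescript
// Hypothetical helper, not part of the DS4SD client library.
// Joins a base URL and a path with exactly one slash between them.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes
  return `${trimmedBase}/${trimmedPath}`;
}

const mockBaseWithSlash = "http://localhost:8080/";

// Naive concatenation keeps both slashes, the suspected source of the 404.
const naiveUrl = mockBaseWithSlash + "/api/cps/public/v2/project/mockProject";

// The normalized join yields a single slash however the path is written.
const joinedUrl = joinUrl(mockBaseWithSlash, "/api/cps/public/v2/project/mockProject");

console.log(naiveUrl);
console.log(joinedUrl);
```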

The UI components call the mock Deep Search API for PDF to Markdown conversion through the Next.js server side, with the following operations:

  • Send the uploaded file from the client-side upload component to the server side.
  • Implement a POST endpoint to handle PDF to Markdown conversion.
  • Authenticate the user with the username and API key.
  • Launch the conversion task and wait for its completion with the status/wait API call. The mock API server Go binary adds a 15-second delay, long enough for the client side to send a second GET v2/project/mockProject/celery_tasks/mock-task-id.
  • Retrieve the transaction ID from the completed task.
  • Fetch document hashes using the transaction ID.
  • Obtain document artifacts using the document hash.
  • Return the document artifacts as the API response to the Next.js client side for rendering the results.
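The wait step in the list above amounts to polling the task endpoint until it reports completion. Here is a minimal sketch of that loop, assuming the status call is injected as a function and that completion is signalled by a boolean done field. The real response shape is not confirmed anywhere in this PR, so TaskStatus, waitForTask, and their fields are hypothetical names.

```typescript
// Hypothetical shape: the real Deep Search task response may differ.
interface TaskStatus {
  done: boolean;
  transactionId?: string;
}

// Polls `fetchStatus` until the task is done or `maxAttempts` is exhausted.
// `fetchStatus` stands in for the GET .../celery_tasks/<task-id>?wait=<seconds> call.
async function waitForTask(
  fetchStatus: () => Promise<TaskStatus>,
  maxAttempts: number,
  delayMs: number
): Promise<{ status: TaskStatus; attempts: number }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status.done) {
      return { status, attempts: attempt };
    }
    // Pause between attempts so the server is not hammered.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`task did not complete after ${maxAttempts} attempts`);
}

// Fake status source that completes on the second call, similar to the mock's
// delay forcing a second GET.
let statusCalls = 0;
const fakeStatus = async (): Promise<TaskStatus> => {
  statusCalls += 1;
  return statusCalls < 2
    ? { done: false }
    : { done: true, transactionId: "mock-transaction-id" };
};

const pendingTask = waitForTask(fakeStatus, 5, 0);
pendingTask.then((result) => console.log(result.attempts, result.status.transactionId));
```

Injecting the status function keeps the loop independent of the HTTP client, so it can be exercised against a fake before the real endpoint exists.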

To start the Go mock Deep Search API server, run the following:

cd mock/go-deepseed/
go run deepseed-mock.go

Here are the curl commands for functional testing and for validating the API calls. They are also listed to clarify the operations in the src/app/api/conversion/route.ts code in this PR.

  1. Authenticate
curl -X POST http://localhost:8080/api/cps/user/v1/user/token \
    -H "Content-Type: application/json" \
    -H "Authorization: Basic $(echo -n 'your-username:your-api-key' | base64)"
  2. POST a PDF to convert
curl -X POST http://localhost:8080/api/cps/public/v1/project/mockProject/data_indices/mockIndex/actions/ccs_convert_upload \
    -H "Authorization: mock-token" \
    -H "Content-Type: application/json" \
    -d '{
          "file_url": ["data:application/pdf;base64,<your-base64-pdf>"]
        }'
  3. Wait for the task to complete
curl -X GET "http://localhost:8080/api/cps/public/v2/project/mockProject/celery_tasks/mock-task-id?wait=10" \
    -H "Authorization: mock-token"
  4. Get Document Hashes
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id \
    -H "Authorization: mock-token"
  5. Get Document Artifacts
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/mock-document-hash/artifacts \
    -H "Authorization: mock-token"
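
For reading alongside src/app/api/conversion/route.ts, the five calls above can also be expressed as URL builders. This is a sketch only: MOCK_BASE and the endpoints object are hypothetical names, not code from this PR, and the paths are copied from the curl commands against the mock server, so they may change once the real endpoint exists.

```typescript
// Hypothetical names; the paths mirror the curl commands for the mock server.
const MOCK_BASE = "http://localhost:8080";

const endpoints = {
  // Step 1: exchange username and API key for a token.
  token: (): string => `${MOCK_BASE}/api/cps/user/v1/user/token`,
  // Step 2: upload a PDF for conversion.
  convertUpload: (projKey: string, indexKey: string): string =>
    `${MOCK_BASE}/api/cps/public/v1/project/${projKey}/data_indices/${indexKey}/actions/ccs_convert_upload`,
  // Step 3: wait on the conversion task.
  taskStatus: (projKey: string, taskId: string, waitSeconds: number): string =>
    `${MOCK_BASE}/api/cps/public/v2/project/${projKey}/celery_tasks/${taskId}?wait=${waitSeconds}`,
  // Step 4: list document hashes for a transaction.
  documentHashes: (projKey: string, indexKey: string, transactionId: string): string =>
    `${MOCK_BASE}/api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId}`,
  // Step 5: fetch the artifacts for one document hash.
  documentArtifacts: (projKey: string, indexKey: string, documentHash: string): string =>
    `${MOCK_BASE}/api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/${documentHash}/artifacts`,
};

console.log(endpoints.taskStatus("mockProject", "mock-task-id", 10));
console.log(endpoints.documentHashes("mockProject", "mockIndex", "mock-transaction-id"));
```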

Here is a screencap of a basic upload. Once we get a feel for the conversion times, we can consider integrating this directly into the knowledge submission form: the form would render the PDF doc to MD on upload, then kick off the process to post it to a repo and supply the location and SHA of the docs.

conversion-mockup.mp4

@nerdalert nerdalert marked this pull request as draft June 26, 2024 05:55
@nerdalert nerdalert force-pushed the deepsearch-poc branch 2 times, most recently from 9f4de57 to b808390 Compare July 1, 2024 03:22
@nerdalert nerdalert force-pushed the deepsearch-poc branch 2 times, most recently from 6ee74c7 to 3ede608 Compare July 11, 2024 02:25
@nerdalert nerdalert changed the title from "POC: Deep Search PDF to MD file conversion" to "Feature: Deep Search PDF to MD file conversion" Jul 11, 2024
@vishnoianil vishnoianil added the demo label (PR that contains Demo related changes) Aug 22, 2024