Extract information #10

Closed
Viet1004 opened this issue May 18, 2024 · 4 comments · Fixed by #21 · May be fixed by #14
@Viet1004
Collaborator

Extract texts, images, tables

@thinhngo-x
Collaborator

@haiyenvu96 @Viet1004 any progress?

@Viet1004
Collaborator Author

Not yet, but I will try something tonight.

@thinhngo-x
Collaborator

thinhngo-x commented Jun 2, 2024

Some issues:

  • Mapping tables to their captions.
  • Removing section headings, footers, headers, ...
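
A rough sketch of how both issues might be approached, assuming the extractor already gives us each page as a list of text lines and the document body as an ordered list of `("text", ...)` / `("table", ...)` blocks. The function names, the `Table N:` caption pattern, and the thresholds are all hypothetical, not something the repo defines:

```python
import re
from collections import Counter

# Assumption: captions look like "Table 1: ..." or "Table 1. ..."
CAPTION_RE = re.compile(r"^Table\s+(\d+)[.:]", re.IGNORECASE)


def _key(line):
    # Mask digits so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_ratio=0.6, edge=2):
    """Drop lines that recur in the first/last `edge` lines of most pages."""
    counts = Counter()
    for lines in pages:
        # Count each candidate at most once per page.
        counts.update({_key(l) for l in lines[:edge] + lines[-edge:]})
    threshold = max(2, int(min_ratio * len(pages)))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            if at_edge and _key(line) in repeated:
                continue  # looks like a running header or footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned


def match_captions(blocks):
    """Pair each table with a caption found directly before or after it."""
    pairs = []
    for i, (kind, payload) in enumerate(blocks):
        if kind != "table":
            continue
        caption = None
        for j in (i - 1, i + 1):  # prefer the block before the table
            if 0 <= j < len(blocks) and blocks[j][0] == "text":
                if CAPTION_RE.match(blocks[j][1]):
                    caption = blocks[j][1]
                    break
        pairs.append((caption, payload))
    return pairs
```

This only handles headers and footers that repeat across pages, and a caption sitting between two tables would be matched to both, so it is a starting point rather than a fix. Section headings would need a separate rule (font size or numbering pattern, for example).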

@thinhngo-x
Copy link
Collaborator

LangChain has some good APIs for this. Take a look: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/
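
For reference, a minimal sketch of the simplest loader from that page, `PyPDFLoader`. It assumes the `langchain-community` and `pypdf` packages are installed; the helper name `load_pdf_pages` is made up for illustration. It covers text only, not images or tables:

```python
def load_pdf_pages(path):
    """Return (page, text) pairs for a PDF, one pair per page."""
    # Imported lazily so this module can be imported without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader

    docs = PyPDFLoader(path).load()  # one Document per page
    # Each Document carries the page text plus metadata such as the page index.
    return [(doc.metadata.get("page"), doc.page_content) for doc in docs]
```

Something like `load_pdf_pages("paper.pdf")` (hypothetical file name) would then give per-page text that could be fed into the header and footer cleanup discussed above.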

This was linked to pull requests Jun 12, 2024
3 participants