Searchable PDFs #39

Closed
ajRiverav opened this issue Feb 24, 2019 · 1 comment
ajRiverav commented Feb 24, 2019

Is your feature request related to a problem? Please describe.
If you really want to index the whole thing, PDFs have to be searchable.

Describe the solution you'd like
I've got code to OCR PDFs, even ones that are difficult to OCR (e.g. rotated).

Describe alternatives you've considered
The workaround would be to read the PDF (i.e. the contract) manually. Another competitive advantage would be that your search also includes the contract text. This opens up the possibility of interesting statistics and document tagging.

Additional context
My intent is to understand what you have in terms of infrastructure and processes to download PDFs. OCR can be added to that process.

Contributor

jpadilla commented Mar 1, 2019

@ajRiverav Right now we're actually extracting text from these documents in two ways: we first try FilePreviews (disclaimer: I run it), and if that doesn't return great results, we fall back to Google Cloud Vision. I was also working on a small tweak to this flow (on this branch): trying to extract text with poppler's pdftotext first. I think we'd be pretty set on that front. Once we've finished working on the search page (Code4PuertoRico/contratospr#3), we'll need to revisit #17.
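For anyone following along, the fallback flow described above could be sketched roughly like this. This is not the project's actual code: `extract_text`, `looks_usable`, and the `min_chars` threshold are hypothetical names made up for illustration, and the "good enough" check is only an assumption about what "doesn't return great results" might mean. Only the `pdftotext` step calls a real tool (poppler's CLI, where `-` as the output file writes to stdout).

```python
import shutil
import subprocess
from typing import Callable, Iterable, Optional

# An extractor takes a PDF path and returns whatever text it could find.
Extractor = Callable[[str], str]


def pdftotext_extractor(pdf_path: str) -> str:
    """Extract embedded text with poppler's pdftotext CLI."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed")
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def looks_usable(text: str, min_chars: int = 50) -> bool:
    """Hypothetical quality check: enough non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_text(
    pdf_path: str,
    extractors: Iterable[Extractor],
    min_chars: int = 50,
) -> Optional[str]:
    """Try each extractor in order; return the first usable result."""
    for extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception:
            # A failing extractor just means we move on to the next one.
            continue
        if looks_usable(text, min_chars):
            return text
    return None
```

The FilePreviews and Google Cloud Vision steps would be wrapped as additional callables and passed after `pdftotext_extractor`, e.g. `extract_text(path, [pdftotext_extractor, filepreviews_extractor, vision_extractor])`. Those two wrappers are left out on purpose, since their client calls are not shown in this thread.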

The other thing I'd love to do is start requesting those documents that have not yet been uploaded to https://consultacontratos.ocpr.gov.pr. Right now you can request a document, but I'm not sure whether they actually end up uploading it...

What kind of "interesting statistics and document tagging" come to mind?
