Is your feature request related to a problem? Please describe.
If you really want to index the whole thing, PDFs have to be searchable.
Describe the solution you'd like
I've got code to OCR PDFs, even ones that are difficult to OCR (e.g. rotated).
Describe alternatives you've considered
A workaround would be to read the PDF (i.e. the contract) yourself. Another competitive advantage would be that your search also covers the contract text. This opens up the possibility of interesting statistics and document tagging.
Additional context
My intent is to understand what infrastructure and processes you have for downloading PDFs. OCR can be added to that process.
@ajRiverav So right now we're actually extracting text from these documents two ways: we first try FilePreviews (disclaimer: I run it), and if that doesn't return great results, we fall back to Google Cloud Vision. I was actually working on a small tweak to this flow (on this branch) that first tries to extract text using poppler's pdftotext. I think we'd be pretty set regarding that. Once we've finished working on the search page (Code4PuertoRico/contratospr#3) we'll need to revisit #17.
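For illustration, here is a rough sketch of how a fallback chain like that could look in Python. It is not the project's actual code: `extract_text`, the `min_chars` threshold used as a stand-in for "great results", and the extractor list are all assumptions. Only the poppler step is written out, using the `pdftotext <file> -` form that prints text to stdout; the FilePreviews and Google Cloud Vision steps would be passed in as additional callables.

```python
import subprocess
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes a path to a PDF and returns whatever text it found.
Extractor = Callable[[str], str]


def pdftotext_extractor(path: str) -> str:
    """Read the embedded text layer with poppler's pdftotext CLI.

    Returns an empty string if the tool is missing, fails, or times out,
    so the caller can move on to the next extractor.
    """
    try:
        result = subprocess.run(
            ["pdftotext", path, "-"],  # "-" sends the text to stdout
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            check=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        return ""
    return result.stdout


def extract_text(
    path: str,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 200,  # assumed quality threshold, purely illustrative
) -> Tuple[Optional[str], str]:
    """Try each extractor in order; return (name, text) for the first one
    that yields at least min_chars characters, or (None, "") if none do."""
    for name, extractor in extractors:
        try:
            text = extractor(path).strip()
        except Exception:
            # A failing extractor should not stop the chain.
            continue
        if len(text) >= min_chars:
            return name, text
    return None, ""
```

A call might look like `extract_text("contract.pdf", [("pdftotext", pdftotext_extractor), ("filepreviews", filepreviews_extractor), ("vision", vision_extractor)])`, where the last two callables are hypothetical wrappers around those services. Putting pdftotext first means documents that already carry a text layer never reach the slower OCR steps.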
The other thing I'd love to do is start requesting those documents that have not yet been uploaded to https://consultacontratos.ocpr.gov.pr. Right now you can request a document, but I'm not too sure if they actually end up uploading it...
What kind of "interesting statistics and document tagging" come to mind?