Searchable PDFs #39

ajRiverav · 2019-02-24T22:50:52Z

Is your feature request related to a problem? Please describe.
To index the full contents of each contract, the PDFs have to be searchable.

Describe the solution you'd like
I have code to OCR PDFs, even ones that are difficult to OCR (e.g. rotated).

Describe alternatives you've considered
The workaround is to read the PDF (i.e. the contract) itself. Searchable PDFs would also be a competitive advantage, since your search would include the contract text. That opens up the possibility of interesting statistics and document tagging.

Additional context
I'd like to understand what infrastructure and processes you have for downloading PDFs. OCR could be added to that process.

jpadilla · 2019-03-01T23:32:15Z

@ajRiverav Right now we extract text from these documents in two ways: we first try FilePreviews (disclaimer: I run it), and if that doesn't return great results, we fall back to Google Cloud Vision. I was working on a small tweak (on this branch) to this flow that first tries to extract text using poppler's pdftotext. I think we'd be pretty set regarding that. Once we've finished working on the search page (Code4PuertoRico/contratospr#3) we'll need to revisit #17.
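
A minimal sketch of that kind of fallback flow, not the project's actual code. It assumes each extraction step can be wrapped as a callable that takes a file path and returns text. `extract_text`, `looks_usable`, and the 200-character threshold are hypothetical names and values chosen for illustration. Only the `pdftotext` step is implemented; FilePreviews and Google Cloud Vision would be passed in as additional callables.

```python
import shutil
import subprocess
from typing import Callable, Iterable, Optional

# An extractor takes a PDF path and returns whatever text it could recover.
Extractor = Callable[[str], str]


def pdftotext_extractor(path: str) -> str:
    """Read embedded text with poppler's pdftotext CLI ('-' sends output to stdout)."""
    if shutil.which("pdftotext") is None:
        return ""  # poppler is not installed; let the next extractor try
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else ""


def looks_usable(text: str, min_chars: int = 200) -> bool:
    """Hypothetical quality check: enough non-whitespace characters were recovered."""
    return len("".join(text.split())) >= min_chars


def extract_text(
    path: str,
    extractors: Iterable[Extractor],
    is_good: Callable[[str], bool] = looks_usable,
) -> Optional[str]:
    """Try extractors in order and return the first result that passes is_good.

    If none passes, return the longest text seen, or None if nothing was recovered.
    """
    best = ""
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # a failing step should not stop the chain
        if is_good(text):
            return text
        if len(text) > len(best):
            best = text
    return best or None
```

Cheap local extraction goes first, so scanned or rotated documents are the only ones that reach the slower OCR services. A call might look like `extract_text("contract.pdf", [pdftotext_extractor, filepreviews_extractor, cloud_vision_extractor])`, where the last two are placeholders for wrappers around those services.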

The other thing I'd love to do is start requesting the documents that have not yet been uploaded to https://consultacontratos.ocpr.gov.pr. Right now you can request a document, but I'm not sure they actually end up uploading it...

What kind of "interesting statistics and document tagging" come to mind?

jpadilla closed this as completed May 11, 2019
