This repository has been archived by the owner on Sep 8, 2023. It is now read-only.

[Discussion] Tika 1.14 update and OCR of PDF #36

Open
hkkenneth opened this issue Feb 2, 2017 · 4 comments

Comments

@hkkenneth
Collaborator

I was looking at the Tika 1.14 release notes (https://tika.apache.org/1.14/index.html) and saw some interesting discussion around OCR of PDF files (https://issues.apache.org/jira/browse/TIKA-1994). Any thoughts on how this may affect us? Should we upgrade? @RichJackson

@RichJackson
Owner

RichJackson commented Feb 2, 2017 via email

@hkkenneth
Collaborator Author

Maybe I can find some real documents from the Timeline project to compare our performance vs. theirs. Do you have any real documents from KCH or other institutes to test with?

@RichJackson
Owner

Actually, thinking about it, is there any reason not to upgrade to 1.14 and override their OCR class with ours for now? We can always switch to theirs if/when it becomes better. I think incorporating Timeline docs will be difficult, as they might be actual patient data and therefore not public. We might think about putting together a small corpus of 4-5 documents and devising a series of tests to compare the two approaches?
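
A rough sketch of how such a comparison could be set up: score each extractor's output against a hand-checked reference transcript for every document in the small corpus, then average per extractor. Everything named here is hypothetical (`normalise`, `similarity`, `compare_extractors`, and the choice of Python's `difflib` as the similarity measure); the extractors are plain callables standing in for our OCR class and Tika's, not real bindings to either.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so layout differences are not penalised."""
    return " ".join(text.lower().split())


def similarity(extracted: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between extracted and reference text."""
    # autojunk=False avoids difflib's heuristic of ignoring very frequent characters.
    matcher = SequenceMatcher(None, normalise(extracted), normalise(reference), autojunk=False)
    return matcher.ratio()


def compare_extractors(
    corpus: Dict[str, str],
    extractors: Dict[str, Callable[[str], str]],
) -> Dict[str, float]:
    """Return the mean similarity per extractor.

    corpus maps a document name to its hand-checked reference text (assumed to exist).
    extractors maps a label to a callable that takes a document name and returns text.
    """
    scores = {}
    for label, extract in extractors.items():
        per_doc = [similarity(extract(doc), ref) for doc, ref in corpus.items()]
        scores[label] = sum(per_doc) / len(per_doc)
    return scores
```

In use, each callable would wrap one pipeline (for example, one invoking our OCR override and one invoking the stock Tika 1.14 parser), and the corpus would be the 4-5 public documents with manually verified text. A character-level ratio is only one possible metric; word error rate or field-level checks may suit clinical documents better.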

@hkkenneth
Collaborator Author

I also submitted a Stack Overflow question; let's see... http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content

But before upgrading, we may need to test with some semi-production data (KCH / SLaM)?

RichJackson pushed a commit that referenced this issue Mar 18, 2018
* add fig to README

* ed readme fig

* ed readme fig

* ed readme fig rm flaw

* add figure to readme and cite RJ's paper

* contributors and funders logos

* fix logo img size