[Discussion] Tika 1.14 update and OCR of PDF #36
Comments
Saw that too!
I think we should stick with our custom OCR class for now. Not convinced their approach is right at the moment. Let's see where the discussion goes next. Perhaps we should comment ourselves?
On Thu, 2 Feb 2017 at 17:20, Kenneth Lui wrote:
Was looking at the Tika 1.14 release notes (https://tika.apache.org/1.14/index.html) and saw some interesting discussion around OCR of PDF files (https://issues.apache.org/jira/browse/TIKA-1994). Any thoughts on how this may affect us? Should we upgrade? @RichJackson
Maybe I can find some real documents from the Timeline project to compare our performance vs. theirs. Do you have any real documents from KCH or other institutes to test with?
Actually, thinking about it, there's no reason not to upgrade to 1.14 and override their OCR class with ours for now, is there? We can always switch in theirs if/when it becomes better. I think incorporating Timeline docs will be difficult, as they might be actual patient data and therefore not public. We might think about putting together a small corpus of 4-5 documents and devising a series of tests to compare the two approaches?
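A rough sketch of how such a comparison could be scored, assuming each test document has a hand-checked reference transcription and that the text from both OCR paths has already been extracted to strings. The function names, the corpus layout, and the choice of a word-level `difflib` ratio as the metric are all assumptions for illustration, not anything from our code or from Tika:

```python
from difflib import SequenceMatcher


def word_similarity(reference: str, extracted: str) -> float:
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    ref_words = reference.lower().split()
    out_words = extracted.lower().split()
    # autojunk=False keeps the ratio stable on longer documents.
    return SequenceMatcher(None, ref_words, out_words, autojunk=False).ratio()


def compare_extractors(corpus):
    """Score both extractors against the reference text of each document.

    `corpus` (hypothetical layout) maps a document name to a tuple of
    (reference_text, custom_ocr_output, tika_ocr_output).
    """
    scores = {}
    for name, (reference, custom_output, tika_output) in corpus.items():
        scores[name] = {
            "custom": word_similarity(reference, custom_output),
            "tika": word_similarity(reference, tika_output),
        }
    return scores
```

With 4-5 documents this gives a small per-document table of scores for each approach. A character-level or edit-distance metric might suit noisy OCR output better, so treat the metric as a placeholder.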
I also submitted a Stack Overflow question, let's see: http://stackoverflow.com/questions/42217327/apache-tika-extract-only-field-names-from-pdf-xfa-forms-but-not-the-text-content. But before upgrading, we may need to test with some semi-production data (KCH / SLaM)?
* add fig to README
* ed readme fig
* ed readme fig
* ed readme fig rm flaw
* add figure to readme and cite RJ's paper
* contributors and funders logos
* fix logo img size