tabula-java

tabula-java is a library for extracting tables from PDF files. It is the table extraction engine that powers Tabula. You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.

DrugBank maintains a forked copy of tabula-java. We forked tabula-java quite a while ago to add features that would help when extracting EMA product information from PDFs. We made a few changes, some of which were contributed back upstream, and some of which are very specific to our use case and have been kept private.

Maintenance @ DrugBank

This version includes multiple features that we have developed but not upstreamed. To keep merges simple, we maintain one branch per feature and update each branch individually. Each feature branch should contain the smallest possible set of changes, so that any merge conflict is as easy as possible to resolve.

See #1 for more info about how this strategy was developed and chosen.

Update Procedure

If you haven't yet, add the upstream remote.

git remote add upstream git@github.com:tabulapdf/tabula-java.git

Because some of our feature branches have not been pushed upstream, we maintain those branches over time. This makes conflicts easier to resolve, since each one can be considered in isolation as part of a single feature.

Repeat the following procedure for each feature branch, starting with that branch checked out:

git pull upstream master
fix any merge conflicts
run mvn test and fix any failures that appear
git checkout master
git merge the feature branch you just updated

Release Procedure

mvn test && mvn compile
publish release to tabula-java repo
download the new jar to vendor/tabula-java/ in the main DrugBank repo and include its version number in the filename
edit Pipeline::Tools::Ema::PdfToJson to use the new version
test the new version of tabula-java in QA with the ema_products:check_all_package_pdfs rake task

Original Readme Content

Usage Examples

tabula-java provides a command line application:

$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
       <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           -a/--area = Portion of the page to analyze.
                            Example: --area 269.875,12.75,790.5,561.
                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                            where all values are in points relative to the
                            top left corner. If all values are between
                            0-100 (inclusive) and preceded by '%', input
                            will be taken as % of actual height or width
                            of the page. Example: --area %0,0,100,50. To
                            specify multiple areas, -a option should be
                            repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3. If all values are
                            between 0-100 (inclusive) and preceded by '%',
                            input will be taken as % of actual width of
                            the page. Example: --columns %25,50,80.6
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -ha,--detect-horizontal-alignment   Detect horizontal alignment of text to
                                    improve column detection.
 -i,--silent                Suppress all stderr output.
 -pn,--rm-page-numbers      Attempt to remove page numbers
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.
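
The --area description above packs two conventions into one value: the order top,left,bottom,right, and an optional leading '%' that switches the numbers from points to percentages of the page. The following is a minimal sketch of how such a value could be resolved to points. It is not tabula-java's own parsing code; the AreaSpec class, the resolve method, and the choice to reject out-of-range percentages are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of tabula-java.
class AreaSpec {
    /**
     * Resolves an --area style value to {top, left, bottom, right} in points.
     * pageWidth and pageHeight are given in points.
     */
    static double[] resolve(String spec, double pageWidth, double pageHeight) {
        boolean percent = spec.startsWith("%");
        String[] parts = (percent ? spec.substring(1) : spec).split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right: " + spec);
        }
        double[] v = new double[4];
        for (int i = 0; i < 4; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
            // Assumption: a '%' value outside 0-100 is treated as an error.
            if (percent && (v[i] < 0 || v[i] > 100)) {
                throw new IllegalArgumentException("percentages must be within 0-100: " + spec);
            }
        }
        if (percent) {
            // top and bottom scale with the page height, left and right with the width
            v[0] = v[0] / 100 * pageHeight;
            v[1] = v[1] / 100 * pageWidth;
            v[2] = v[2] / 100 * pageHeight;
            v[3] = v[3] / 100 * pageWidth;
        }
        return v;
    }

    public static void main(String[] args) {
        // A US Letter page is 612 x 792 points.
        System.out.println(Arrays.toString(resolve("%0,0,100,50", 612, 792)));
        // prints [0.0, 0.0, 792.0, 306.0]
        System.out.println(Arrays.toString(resolve("269.875,12.75,790.5,561", 612, 792)));
        // prints [269.875, 12.75, 790.5, 561.0]
    }
}
```

On a US Letter page, the help text's example %0,0,100,50 therefore covers the full height and the left half of the width.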

tabula-java also includes a debugging tool. Run java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h to see the available options.
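
The --pages syntax in the help text (comma-separated ranges, a single page, or all) can be illustrated with a short expansion routine. This is only a sketch of the format as described there, not the parser tabula-java uses; the PageRanges class and its expand method are hypothetical names, and returning an empty list for "all" is a choice made here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tabula-java.
class PageRanges {
    /**
     * Expands a --pages style value such as "1-3,5-7" or "3" into page numbers.
     * An empty list stands for "all" (assumption made for this sketch).
     */
    static List<Integer> expand(String spec) {
        if (spec.trim().equalsIgnoreCase("all")) {
            return List.of();
        }
        List<Integer> pages = new ArrayList<>();
        for (String range : spec.split(",")) {
            String[] ends = range.trim().split("-");
            if (ends.length > 2) {
                throw new IllegalArgumentException("bad range: " + range);
            }
            int first = Integer.parseInt(ends[0].trim());
            // A single page such as "3" is a range of length one.
            int last = ends.length == 2 ? Integer.parseInt(ends[1].trim()) : first;
            if (first < 1 || last < first) {
                throw new IllegalArgumentException("bad range: " + range);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(expand("1-3,5-7")); // prints [1, 2, 3, 5, 6, 7]
        System.out.println(expand("3"));       // prints [3]
    }
}
```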

You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.

JVM start-up time accounts for much of the cost of running the tabula command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:

the -b option, which allows you to convert all PDFs in a given directory
the drip utility
the Ruby, Python, R, and Node.js bindings
writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
waiting for us to implement an API/server-style system (it's on the roadmap)
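
As a rough illustration of the first option, the sketch below starts the JVM once and passes a whole directory through --batch, rather than launching java once per file. It uses only the flags listed in the help text; the BatchExtract class, its command method, and the directory name pdfs are placeholders invented for this example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper, not part of tabula-java.
class BatchExtract {
    /** Builds a single tabula command line that covers every PDF in pdfDir. */
    static List<String> command(Path jar, Path pdfDir, String format) {
        return List.of("java", "-jar", jar.toString(),
                "--batch", pdfDir.toString(),
                "--format", format,
                "--pages", "all");
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 2) {
            // No paths given: only show the shape of the command (placeholder paths).
            System.out.println(String.join(" ", command(
                    Path.of("target/tabula-1.0.5-jar-with-dependencies.jar"),
                    Path.of("pdfs"), "CSV")));
            return;
        }
        // One JVM start-up for the whole directory.
        Process process = new ProcessBuilder(
                command(Path.of(args[0]), Path.of(args[1]), "CSV"))
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}
```

Run it as java BatchExtract.java <tabula-jar> <pdf-directory> (Java 11 or later); without arguments it only prints the command it would run.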

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request.
Spreading the word about tabula-java to people who might be able to benefit from using it.

Backers

You can also support our continued work on tabula-java with a one-time or monthly donation on OpenCollective. Organizations who use tabula-java can also sponsor the project for acknowledgement on our official site and this README.

Special thanks to the following users and organizations for generously supporting Tabula with donations and grants:
