tabula-java is a library for extracting tables from PDF files. It is the table extraction engine that used to power Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.
(This is the new version of the extraction engine; the previous code can be found at tabula-extractor.)
Download a version of the tabula-java jar, with all dependencies included, that works on Mac, Windows and Linux from our releases page.
Clone this repo and run:
mvn clean compile assembly:single
tabula-java provides a command line application:
$ java -jar ./target/tabula-0.8.0-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-c <COLUMNS>] [-d] [-f <FORMAT>] [-g] [-h] [-i]
       [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s <PASSWORD>] [-u] [-v]
Tabula helps you extract tables from PDFs
 -a,--area <AREA>            Portion of the page to analyze
                             (top,left,bottom,right). Example: --area
                             269.875,12.75,790.5,561. Default is entire
                             page
 -c,--columns <COLUMNS>      X coordinates of column boundaries. Example
                             --columns 10.1,20.2,30.3
 -d,--debug                  Print detected table areas instead of
                             processing.
 -f,--format <FORMAT>        Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                  Guess the portion of the page to analyze per
                             page.
 -h,--help                   Print this help text.
 -i,--silent                 Suppress all stderr output.
 -n,--no-spreadsheet         Force PDF not to be extracted using
                             spreadsheet-style extraction (if there are
                             ruling lines separating each cell, as in a PDF
                             of an Excel spreadsheet)
 -o,--outfile <OUTFILE>      Write output to <file> instead of STDOUT.
                             Default: -
 -p,--pages <PAGES>          Comma separated list of ranges, or all.
                             Examples: --pages 1-3,5-7, --pages 3 or
                             --pages all. Default is --pages 1
 -r,--spreadsheet            Force PDF to be extracted using
                             spreadsheet-style extraction (if there are
                             ruling lines separating each cell, as in a PDF
                             of an Excel spreadsheet)
 -s,--password <PASSWORD>    Password to decrypt document. Default is empty
 -u,--use-line-returns       Use embedded line returns in cells. (Only in
                             spreadsheet mode.)
 -v,--version                Print version and exit.
It also includes a debugging tool. Run java -cp ./target/tabula-0.8.0-jar-with-dependencies.jar technology.tabula.debug.Debug -h for the available options.
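The --pages and --area values described in the help text are plain comma-separated strings. As a rough illustration of how such values could be interpreted, here is a minimal Java sketch. The class CliValueSketch and its methods are hypothetical and not part of tabula-java; the tool's own parsing may differ, for example in how it validates ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of tabula-java: shows one way to interpret
// the --pages and --area value formats listed in the help text.
final class CliValueSketch {

    // Expands a value such as "1-3,5-7", "3" or "all" into 1-based page numbers.
    // pageCount is only used to expand "all"; ranges are not checked against it.
    static List<Integer> parsePages(String spec, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (spec.trim().equalsIgnoreCase("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return pages;
        }
        for (String part : spec.split(",", -1)) {
            // A part is either a single page ("3") or an inclusive range ("1-3").
            String[] bounds = part.trim().split("-", -1);
            if (bounds.length > 2) {
                throw new IllegalArgumentException("Invalid page range: " + part);
            }
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : first;
            if (first < 1 || last < first) {
                throw new IllegalArgumentException("Invalid page range: " + part);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }

    // Splits "top,left,bottom,right" into four coordinates, in that order.
    static double[] parseArea(String spec) {
        String[] parts = spec.split(",", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected top,left,bottom,right: " + spec);
        }
        double[] area = new double[4];
        for (int i = 0; i < 4; i++) {
            area[i] = Double.parseDouble(parts[i].trim());
        }
        return area;
    }

    public static void main(String[] args) {
        System.out.println(parsePages("1-3,5-7", 10));
        // prints [1, 2, 3, 5, 6, 7]
        System.out.println(Arrays.toString(parseArea("269.875,12.75,790.5,561")));
        // prints [269.875, 12.75, 790.5, 561.0]
    }
}
```

Malformed numbers surface as NumberFormatException, which is a subclass of IllegalArgumentException, so callers of this sketch can catch a single exception type.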
You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.
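If you would rather not depend on the library API, one alternative is to launch the command-line jar as a subprocess from JVM code. The sketch below does that with the standard ProcessBuilder class and only the options documented in the help text. The class TabulaCommandSketch and the file names are hypothetical, and it assumes the PDF path is passed as a trailing positional argument, which the usage line above does not show.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper, NOT part of tabula-java: runs the CLI jar as a subprocess.
final class TabulaCommandSketch {

    // Builds the argument list using the documented --pages, --format and --outfile options.
    // Assumption: the PDF path goes last, as a positional argument.
    static List<String> buildCommand(String jarPath, String pdfPath,
                                     String pages, String format, String outFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("--pages");
        command.add(pages);
        command.add("--format");
        command.add(format);
        command.add("--outfile");
        command.add(outFile);
        command.add(pdfPath);
        return command;
    }

    // Starts the process, forwards its output to this JVM's console,
    // and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
                "./target/tabula-0.8.0-jar-with-dependencies.jar",
                "example.pdf", "1-3", "CSV", "out.csv");
        // Only prints the command; call run(command) once the jar and PDF exist.
        System.out.println(String.join(" ", command));
    }
}
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.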
© 2014 Manuel Aristarán. Available under MIT License. See LICENSE.