This extension makes it possible to use Apache Spark as an execution backend for OpenRefine. It is an experimental feature designed to work with OpenRefine 4.0, which is currently in development.
Run `mvn package`. This will build the extension, run the tests and create a file at `target/spark-extension-0.1-SNAPSHOT.zip`.
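For reference, the full build step from the extension's source directory (a plain Maven build, no extra flags assumed):

```shell
# Compile, run the test suite, and package the extension
mvn package

# The distributable archive is created under target/
ls target/spark-extension-0.1-SNAPSHOT.zip
```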
Install the extension in OpenRefine.
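A minimal sketch of a manual installation on Linux, assuming the standard OpenRefine extension layout (the workspace path varies by platform; `webapp/extensions` inside the installation directory is an alternative location):

```shell
# Unpack the built archive into OpenRefine's extensions directory
# (~/.local/share/openrefine/extensions is the usual Linux workspace path;
# adjust for your platform)
mkdir -p ~/.local/share/openrefine/extensions
unzip target/spark-extension-0.1-SNAPSHOT.zip -d ~/.local/share/openrefine/extensions
```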
Then, you can run OpenRefine with the Spark runner: `./refine -r org.openrefine.runners.spark.SparkRunner`. This will spin up a local Spark cluster and use it to execute all operations run within this OpenRefine instance.
Using this runner for interactive data cleaning will generally be less efficient than the default runner. We expect this integration to become more interesting when running OpenRefine workflows from the command line, which is currently only possible via third-party tools and not officially supported by OpenRefine.
Some interactive features (such as progress reporting) are not available with this runner.
The following configuration parameters are supported:
- `refine.runner.sparkMasterURI`: the URI of the existing Spark instance to connect to. If not provided, a local Spark instance will be used;
- `refine.runner.defaultParallelism`: the default parallelism of the local Spark instance that is spun up (if any).

Those options can be set in `refine.ini` or on the command line (`-x refine.runner.sparkMasterURI=mysparkhost.com`).
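For example, here are sketches of the command-line form for both parameters (the host and parallelism values are placeholders):

```shell
# Connect to an existing Spark cluster instead of spinning up a local one
./refine -r org.openrefine.runners.spark.SparkRunner \
  -x refine.runner.sparkMasterURI=mysparkhost.com

# Tune the parallelism of the local Spark instance (placeholder value)
./refine -r org.openrefine.runners.spark.SparkRunner \
  -x refine.runner.defaultParallelism=4
```

In `refine.ini`, the same parameters would presumably be passed as `-D` JVM system properties (e.g. via `JAVA_OPTIONS`); check the comments in your `refine.ini` for the exact syntax.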