See original announcements on:
- Spark mailing list
- GeneMania Google group
- BioStars
For more information, see the gallia-core documentation.
This is the Spark RDD-powered counterpart to the genemania parent repo (which used Gallia's "poor man's scaling" instead of Spark).
You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli
and don't mind the cost (see below).
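Since the script drives everything through aws-cli, it is worth checking first that the CLI is installed and that credentials resolve to an account. For example (standard aws-cli commands, not part of the script itself):

```sh
# Check that aws-cli is installed and that credentials are configured
aws --version
aws sts get-caller-identity
```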
The script does the following:
- Creates an S3 bucket for the code and data
- Retrieves code and uploads it to the bucket (source+binaries)
- Retrieves the data (or a subset thereof) and uploads it to the bucket
- Creates an EMR Spark cluster and runs the program as a single step
- Awaits termination and logs the results
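For orientation, the sequence above boils down to a handful of aws-cli calls along the lines of the sketch below. This is only an illustration, not the script's actual contents: the bucket name, jar name, main class, EMR release, and instance type are all placeholders.

```sh
BUCKET=genemania-spark-test                       # placeholder bucket name

# Create the bucket, then upload the code (assembled jar) and the input data
aws s3 mb "s3://$BUCKET"
aws s3 cp app-assembly.jar "s3://$BUCKET/code/"   # placeholder jar name
aws s3 cp ./input/ "s3://$BUCKET/input/" --recursive

# Create an EMR Spark cluster that runs the program as a single step
# and terminates itself when done
CLUSTER_ID=$(aws emr create-cluster \
  --name genemania-spark-testrun \
  --release-label emr-6.3.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 4 \
  --use-default-roles \
  --log-uri "s3://$BUCKET/logs/" \
  --auto-terminate \
  --steps "Type=Spark,Name=genemania,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,Main,s3://$BUCKET/code/app-assembly.jar,s3://$BUCKET/input/,s3://$BUCKET/output/]" \
  --query ClusterId --output text)

# Wait for termination, then list the step logs on S3
aws emr wait cluster-terminated --cluster-id "$CLUSTER_ID"
aws s3 ls "s3://$BUCKET/logs/" --recursive
```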
To run it on a small subset (expect ~$3[2] in AWS charges), use:
./testrun.sh 10 4 # process first 10 files, using 4 workers
To run it in full (expect ~$18[2] in AWS charges), use:
./testrun.sh ALL <number-of-workers> # eg 60 workers
The full EMR run will take about 120 minutes with 60 workers[1]. As one would expect, it follows the distribution below:
Same input as the parent repo, except it is first uploaded to an S3 bucket: s3://<bucket>/input/
Same output as the parent repo, except it is made available on the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files
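Once a run has finished, the compressed output partitions can be pulled down and inspected with standard tooling, for example (substitute the actual bucket name):

```sh
# Download all output partitions and peek at the first few records
aws s3 cp "s3://<bucket>/output/" ./output/ --recursive
gunzip -c ./output/part-*.gz | head
```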
Notable limitations are:
- Only available for Scala 2.12 because:
- sbt-assembly does not seem to be available for 2.13
- Spark support for 2.13 is still immature
- The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment:
  - gallia-core's `io.in` mechanisms (fluency, actions, and atoms) vs gallia-spark's
  - gallia-core's `io.out` mechanisms (fluency, actions, and atoms) vs gallia-spark's
See the list of Spark-related tasks for more limitations.
- [1] Add roughly one hour to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
- [2] The cost estimates provided are not guaranteed in any way; run it at your own risk (but please let me know if yours differ significantly)
You may contact the author at [email protected]