See original announcements on:
- Spark mailing list
- GeneMania Google group
- BioStars
For more information, see the gallia-core documentation.
This is the Spark RDD-powered counterpart to the genemania parent repo (which used Gallia's "poor man's scaling" instead of Spark).
You can test it by running the ./testrun.sh script at the root of the repo, provided you are set up with aws-cli
and don't mind the cost (see below).
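Since the script drives everything through aws-cli, it is worth checking first that the CLI is installed and that credentials resolve to an account. For example (standard aws-cli commands, not part of the script itself):

```sh
# Check that aws-cli is installed and that credentials are configured
aws --version
aws sts get-caller-identity
```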
The script does the following:
- Creates an S3 bucket for the code and data
- Retrieves code and uploads it to the bucket (source+binaries)
- Retrieves the data (or a subset thereof) and uploads it to the bucket
- Creates an EMR Spark cluster and runs the program as a single step
- Awaits termination and logs the results
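For orientation, the sequence above boils down to a handful of aws-cli calls along the lines of the sketch below. This is only an illustration, not the script's actual contents: the bucket name, jar name, main class, EMR release, and instance type are all placeholders.

```sh
BUCKET=genemania-spark-test                       # placeholder bucket name

# Create the bucket, then upload the code (assembled jar) and the input data
aws s3 mb "s3://$BUCKET"
aws s3 cp app-assembly.jar "s3://$BUCKET/code/"   # placeholder jar name
aws s3 cp ./input/ "s3://$BUCKET/input/" --recursive

# Create an EMR Spark cluster that runs the program as a single step
# and terminates itself when done
CLUSTER_ID=$(aws emr create-cluster \
  --name genemania-spark-testrun \
  --release-label emr-6.3.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 4 \
  --use-default-roles \
  --log-uri "s3://$BUCKET/logs/" \
  --auto-terminate \
  --steps "Type=Spark,Name=genemania,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,Main,s3://$BUCKET/code/app-assembly.jar,s3://$BUCKET/input/,s3://$BUCKET/output/]" \
  --query ClusterId --output text)

# Wait for termination, then list the step logs on S3
aws emr wait cluster-terminated --cluster-id "$CLUSTER_ID"
aws s3 ls "s3://$BUCKET/logs/" --recursive
```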
To run it on a small subset (expect ~$3[2] in AWS charges), use:
./testrun.sh 10 4 # process first 10 files, using 4 workers
To run it in full (expect ~$18[2] in AWS charges), use:
./testrun.sh ALL <number-of-workers> # eg 60 workers
The full EMR run will take about 120 minutes with 60 workers[1]. As one would expect, it follows the distribution below:
Same input as the parent repo, except it is first uploaded to an S3 bucket: s3://<bucket>/input/
Same output as the parent repo, except it is made available on the S3 bucket as s3://<bucket>/output/part-NNNNN.gz files
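Once a run has finished, the compressed output partitions can be pulled down and inspected with standard tooling, for example (substitute the actual bucket name):

```sh
# Download all output partitions and peek at the first few records
aws s3 cp "s3://<bucket>/output/" ./output/ --recursive
gunzip -c ./output/part-*.gz | head
```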
Notable limitations are:
- Only available for Scala 2.12 because:
- sbt-assembly does not seem to be available for 2.13
- Spark support for 2.13 is still immature
- The I/O abstractions need to be aligned with the core's; they are somewhat hacky at the moment:
  - gallia-core's `io.in` mechanisms (fluency, actions, and atoms) vs gallia-spark's
  - gallia-core's `io.out` mechanisms (fluency, actions, and atoms) vs gallia-spark's
See the list of Spark-related tasks for more limitations.
- [1] Add roughly one hour to accumulate the input data and upload it to the S3 bucket (using a 5-second courtesy delay between requests)
- [2] The cost estimates provided are not guaranteed in any way; run it at your own risk (but please let me know if yours differ significantly)
You may contact the author at [email protected]