Spark interface
According to my measurements, running the same code in Spark is 3-4 times faster than running it on Hadoop.
To run it on Spark, you should install Hadoop, Scala, and finally Spark. The JSON files are stored in the Hadoop Distributed File System (HDFS) in the /europeana directory. In my case Hadoop's core-site.xml sets the fs.default.name property to hdfs://localhost:54310, so Spark accesses the files as hdfs://localhost:54310/europeana/*.json. The result goes to HDFS's /result directory. If that directory already exists, Spark (like Hadoop) stops, so you should remove it first. A run on all Europeana records takes roughly two hours, so it is worth running it in the background with nohup:
nohup ./run-full.sh v2020-07 > logs/run-full-v2020-07.log &
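Before such a run it is worth checking that the input files are in place and removing the previous result directory, otherwise Spark stops as noted above. A minimal sketch with standard HDFS shell commands, using the paths mentioned earlier:
# check that the JSON input files are available
hdfs dfs -ls /europeana | head
# remove the previous result directory
hdfs dfs -rm -r /result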
To run the proxy-based completeness measurement on all records:
./run-all-proxy-based-completeness [output CSV] [--skipEnrichments] [--extendedFieldExtraction]
e.g.
nohup ./run-all-proxy-based-completeness v2018-08-completeness.csv "" --extendedFieldExtraction \
> run-all-proxy-based-completeness.log &
Convert the resulting CSV to a parquet file:
./proxy-based-completeness-to-parquet.sh [csv file]
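e.g. (the CSV name here simply follows the earlier example; use your own output file):
./proxy-based-completeness-to-parquet.sh v2018-08-completeness.csv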
Run the analysis on the parquet file:
./proxy-based-completeness-all.sh [parquet file] --keep-dirs
e.g.
nohup ./proxy-based-completeness-all.sh v2018-08-completeness2.parquet --keep-dirs \
> proxy-based-completeness-all.log &
It will produce three files:
- [project]/output/completeness.csv
- [project]/output/completeness-histogram.csv
- [project]/output/completeness-histogram-raw.csv
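A quick way to check the results, assuming the commands are run from the project root (the paths are the ones listed above):
head -2 output/completeness.csv
wc -l output/completeness-histogram.csv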
Then split the resulting files:
cd ../scripts
./split-completeness.sh $VERSION
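Here $VERSION is the version label of the dataset; a minimal example, where the value is only an assumption based on the file names used above:
VERSION=v2018-08
./split-completeness.sh $VERSION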
To run the multilingual saturation measurement on all records:
./run-all-multilingual-saturation [output CSV] "" --extendedFieldExtraction
e.g.
nohup ./run-all-multilingual-saturation v2018-08-multilingual-saturation.csv "" --extendedFieldExtraction \
> multilingual-saturation.log &
Convert the resulting CSV to a parquet file:
./multilinguality-to-parquet.sh [csv file]
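e.g. (again, the CSV name mirrors the earlier example):
./multilinguality-to-parquet.sh v2018-08-multilingual-saturation.csv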
Run the analysis on the parquet file:
./multilinguality-all.sh [parquet file] --keep-dirs
e.g.
nohup ./multilinguality-all.sh ../v2018-08-multilingual-saturation.parquet --keep-dirs \
> multilinguality-all.log &
Finally, split the results:
cd ../scripts
./split-multilinguality.sh