---
title: Execution
weight: 3
---
The multimno software is a Python application that launches a single component with a given configuration. This atomic design allows the application to be integrated with multiple orchestration tools. At the moment, a Python script called `orchestrator_multimno.py` is provided which executes a pipeline of components sequentially using `spark-submit` commands.
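To illustrate the design, here is a minimal sketch (not the actual implementation) of what such an orchestrator does: it reads the pipeline JSON described below and launches each component sequentially through `spark-submit`.

```python
# Minimal sketch of an orchestrator: read the pipeline JSON and launch
# each component sequentially through spark-submit.
import json
import subprocess
import sys


def run_pipeline(pipeline_json_path: str) -> None:
    with open(pipeline_json_path) as f:
        pipeline_def = json.load(f)

    general_config = pipeline_def["general_config_path"]
    spark_args = pipeline_def["spark_submit_args"]

    # Components run one after another, in the order they are listed.
    for step in pipeline_def["pipeline"]:
        cmd = [
            "spark-submit",
            *spark_args,
            "multimno/main_multimno.py",
            step["component_id"],
            general_config,
            step["component_config_path"],
        ]
        subprocess.run(cmd, check=True)  # stop the pipeline if a step fails


if __name__ == "__main__":
    run_pipeline(sys.argv[1])
```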
The execution process can be divided into four steps:

1. Preparing the input data.
2. Defining the pipeline.
3. Configuring each component.
4. Executing the pipeline.
## Input data

In addition to the MNO data, the following input data is required to execute the multimno software:

National holiday data is required to execute the software. This data must be in parquet format and contain the following schema:
Column | Format | Description |
---|---|---|
iso2 | str | Country code in ISO-2 format |
date | date | Date of the holiday in yyyy-mm-dd format |
name | str | Holiday description |
Example:
iso2 | date | name |
---|---|---|
ES | 2022-01-01 | Año Nuevo |
ES | 2022-01-06 | Epifanía del Señor |
ES | 2022-04-15 | Viernes Santo |
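As a sketch of how such a calendar could be produced, the following PySpark snippet writes a compliant parquet file; the output path is a placeholder and the `SparkSession` setup is a minimal assumption:

```python
# Minimal sketch: build a holiday calendar with the required schema
# and write it in parquet format. The output path is a placeholder.
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, StringType, StructField, StructType

spark = SparkSession.builder.appName("holiday_calendar").getOrCreate()

schema = StructType([
    StructField("iso2", StringType(), nullable=False),
    StructField("date", DateType(), nullable=False),
    StructField("name", StringType(), nullable=False),
])

rows = [
    ("ES", datetime.date(2022, 1, 1), "Año Nuevo"),
    ("ES", datetime.date(2022, 1, 6), "Epifanía del Señor"),
    ("ES", datetime.date(2022, 4, 15), "Viernes Santo"),
]

spark.createDataFrame(rows, schema).write.mode("overwrite").parquet(
    "/path/to/holiday_calendar"
)
```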
The path to this data must be specified in the `holiday_calendar_data_bronze` variable under the `[Paths.Bronze]` section in the `general_configuration.ini` file of the application.
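For reference, the corresponding configuration entry would look like the following (the path value is a placeholder):

```ini
[Paths.Bronze]
holiday_calendar_data_bronze = /opt/data/bronze/holiday_calendar
```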
## Pipeline definition

The pipeline is defined as a JSON file that ties together all the configuration files and defines the sequential execution order of the components. Its structure is as follows:
- `general_config_path`: Path to the general configuration file.
- `spark_submit_args`: List of arguments that will be passed to the `spark-submit` command. It can be empty.
- `pipeline`: List defining the order in which the components will be executed. Each item is composed of the following values:
  - `component_id`: ID of the component to be executed.
  - `component_config_path`: Path to the component configuration file.
Example:
```json
{
    "general_config_path": "pipe_configs/configurations/general_config.ini",
    "spark_submit_args": [
        "--master=spark://spark-master:7077",
        "--packages=org.apache.sedona:sedona-spark-3.5_2.12:1.6.0,org.datasyslab:geotools-wrapper:1.6.0-28.2"
    ],
    "pipeline": [
        {
            "component_id": "InspireGridGeneration",
            "component_config_path": "pipe_configs/configurations/grid/grid_generation.ini"
        },
        {
            "component_id": "EventCleaning",
            "component_config_path": "pipe_configs/configurations/event/event_cleaning.ini"
        }
    ]
}
```
Configuration for executing a demo pipeline is given in the file `pipe_configs/pipelines/pipeline.json`. This file contains the execution order of the pipeline components and references to their demo configuration files, which are provided in the repository as well.
## Component configuration

Each component of the pipeline must be configured with the desired settings. It is recommended to take the configurations defined in `pipe_configs/configurations` as a base and refine them using the configuration guide.
### spark-submit args
The entrypoint for pipeline execution, `orchestrator_multimno.py`, performs `spark-submit` commands to execute each component of the pipeline as a Spark job. To define `spark-submit` arguments, edit the `spark_submit_args` variable in `pipeline.json`. This variable follows the same syntax as `spark-submit` arguments.
- Spark submit documentation: https://spark.apache.org/docs/latest/submitting-applications.html
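For instance, resource-related flags can be added alongside the master URL. The flags below are standard `spark-submit` options; the values are purely illustrative:

```json
"spark_submit_args": [
    "--master=spark://spark-master:7077",
    "--driver-memory=4g",
    "--executor-memory=8g"
]
```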
### Spark Configuration
To define Spark session specific configurations, edit the `[Spark]` section in the general configuration file. If you want to change the Spark configuration for only one component in the pipeline, you can edit the `[Spark]` section in that component's configuration file, which will override the values defined in the general configuration file.
- Spark configuration documentation: https://spark.apache.org/docs/latest/configuration.html
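For example, a component configuration file could raise the executor memory for that component only. The property name is a standard Spark setting; the value is illustrative:

```ini
[Spark]
spark.executor.memory = 8g
```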
## Pipeline execution

For executing a pipeline, the `orchestrator_multimno.py` entrypoint shall be used. It takes as input the path to a JSON file containing the pipeline definition, as described in the Pipeline definition section above.
Example:
```bash
./orchestrator_multimno.py pipe_configs/pipelines/pipeline.json
```
!!! warning
    The `orchestrator_multimno.py` script must be located in the same directory as the `main_multimno.py` file.
If you want to launch only a single component, you can manually run the `spark-submit` command from a terminal using the `main_multimno.py` entrypoint. It receives the following positional parameters:
- `component_id`: ID of the component that will be launched.
- `general_config_path`: Path to the general configuration file of the application.
- `component_config_path`: Path to the component configuration file.
```bash
spark-submit multimno/main_multimno.py <component_id> <general_config_path> <component_config_path>
```
Example:
```bash
spark-submit multimno/main_multimno.py InspireGridGeneration pipe_configs/configurations/general_config.ini pipe_configs/configurations/grid/grid_generation.ini
```