This repository contains code that processes MNO Data to generate population and mobility insights using the Spark framework.
- 📄 Description
- 🗃️ Repository Structure
- 📜 Code Documentation
- 🅿️ Pipeline
- 🛠️ Mandatory Requirements
- 📦 Synthetic data
- 🏁 Quickstart
- 📓 User Manual
- 🤝 Contribute
- 🖥️ Development Guidelines
This repository contains a Python application that uses the PySpark library to run Big Data pipelines over MNO Data and generate multiple statistical products related to mobility and sociodemographic analysis.
The code stored in this repository is intended to run on a PySpark-compatible cluster and to be deployed in cloud environments such as AWS, GCP or Azure. Nevertheless, the code can be launched in local environments using a single-node Spark configuration once all the required libraries have been correctly installed.
For easy deployment in local environments, this repository provides the configuration needed to build a Docker container with the whole setup already done.
The repository contains the following directories:
Directory | Description |
---|---|
.devcontainer | Directory with config files for setting up a dev-environment using Dev-Containers. |
.vscode | Directory containing config files for developers using VS Code. |
docs | Documentation source files that will be used for the documentation site. Mainly markdown files. |
multimno | Main directory of the repository. It contains the Python source code of the application. |
pipe_configs | Directory containing examples of configuration files for the execution of the pipeline. |
sample_data | Directory containing Synthetic MNO-Data to be used to test the software. |
resources | Directory containing requirements files and development-related configuration and script files. |
tests | Directory containing test code and test files for the testing execution. |
Please refer to the following website: https://eurostat.github.io/multimno/latest/
The Big Data processing pipeline performed by the software is described in the following document: MultiMNO Pipeline
Please verify that your system fulfils the System Requirements to ensure that it can execute the code.
Synthetic MNO data is provided in the repository under the sample_data/lakehouse/bronze directory. This data has the following specifications:
- 🌍 Spatial scope: All data has been generated in a bounding box that covers the metropolitan area of Madrid. The bounding box parameters are as follows:
  - latitude_min = 40.352
  - latitude_max = 40.486
  - longitude_min = -3.751
  - longitude_max = -3.579
- 📆 Temporal scope: Data has been generated for 9 days, from 2023-01-01 to 2023-01-09, both included.
- 🚶‍♂️ Users: 100 different users.
- 📡 Network: 500 different cells.
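The spatial and temporal scope above can be captured in a few lines of Python, for example to sanity-check that a coordinate or a date falls inside the synthetic dataset. This is a minimal sketch: the names `BoundingBox`, `MADRID_BBOX` and `covered_dates` are hypothetical helpers introduced for illustration and are not part of the multimno package; only the numeric values and dates come from the specs listed above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical helper: a latitude/longitude rectangle."""

    latitude_min: float
    latitude_max: float
    longitude_min: float
    longitude_max: float

    def contains(self, latitude: float, longitude: float) -> bool:
        """Return True when the point lies inside the box (edges included)."""
        return (
            self.latitude_min <= latitude <= self.latitude_max
            and self.longitude_min <= longitude <= self.longitude_max
        )


# Values copied from the spatial scope of the synthetic data.
MADRID_BBOX = BoundingBox(
    latitude_min=40.352,
    latitude_max=40.486,
    longitude_min=-3.751,
    longitude_max=-3.579,
)

# Temporal scope: 2023-01-01 to 2023-01-09, both included.
START_DATE = date(2023, 1, 1)
END_DATE = date(2023, 1, 9)


def covered_dates(start: date = START_DATE, end: date = END_DATE) -> List[date]:
    """List every date in the inclusive range [start, end]."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

A check such as `MADRID_BBOX.contains(latitude, longitude)` can then be used to filter test points before comparing them with the sample data.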
Use the following commands for a fast setup of an execution environment using docker.
Please check the Setup Guide for a more in-depth description of the system setup required to execute the code.
Build the Docker image:
docker build -t multimno:1.0-prod --target=multimno-prod .
Run an example pipeline within a container:
docker run --rm --name=multimno-container -v "${PWD}/sample_data:/opt/data" -v "${PWD}/pipe_configs:/opt/app/pipe_configs" multimno:1.0-prod pipe_configs/pipelines/pipeline.json
This command will:
- Create a Docker container.
- Mount the sample_data directory in /opt/data within the container.
- Mount the pipe_configs directory in /opt/app/pipe_configs within the container.
- Execute a pipeline stored in /opt/app/pipe_configs/pipelines/pipeline.json within the container. This is the same file as the one in the repository.
- Delete the container once the execution finishes.
NOTE: The paths in pipeline.json and in the general_configuration.ini file must be adjusted if the destination paths are altered.
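For users who prefer to launch the example from a Python script, the `docker run` invocation above can be assembled as an argument list. The following is a sketch only: `build_docker_run_command` is a hypothetical helper, not something shipped with the repository, and it merely reproduces the flags, mount points, image tag and pipeline path of the command shown above without executing anything.

```python
import shlex
from pathlib import Path
from typing import List


def build_docker_run_command(
    repo_root: Path,
    image: str = "multimno:1.0-prod",
    pipeline: str = "pipe_configs/pipelines/pipeline.json",
) -> List[str]:
    """Assemble the argv of the quickstart `docker run` example (nothing is executed)."""
    return [
        "docker", "run", "--rm", "--name=multimno-container",
        # Mount the sample data and the pipeline configs at the paths used in the README.
        "-v", f"{repo_root / 'sample_data'}:/opt/data",
        "-v", f"{repo_root / 'pipe_configs'}:/opt/app/pipe_configs",
        image,
        pipeline,
    ]


# Print the command for the current working directory (assumed to be the repository root).
command = build_docker_run_command(Path.cwd())
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the container; building the list first keeps the mount paths easy to adjust in one place if the data is stored elsewhere.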
As the multimno software is a Python application designed to be executed in a Spark cluster, a lightweight Dockerfile called Dockerfile-lite is provided for running the software in existing Spark clusters.
Please refer to the Lite section of the Setup Guide for details on how to use this deployment.
A user manual is provided, composed of three sections:
- Configuration: Section containing the explanation of all the configuration files used by the software.
- Setup Guide: How to prepare the system for the software execution.
- Execution Guide: How to execute the software.
Please follow the contribute guide to see the rules and guidelines on how to contribute to the multimno repository.
Please follow the development guidelines to set up a dev-environment and see the recommended best practices for development, testing and documentation.