This project evaluates the driving distances of vehicles per day using the road network from OSM. The distances are stored in a DB and served by a REST service using Docker.
To configure, test and execute the project, just do:
git clone [email protected]:diegopso/driving-distances.git
cd driving-distances/
make
make test
make run-job
make run-service
make call-service
make clean # to stop and reset everything
You can also run integration tests:
make integration-test
This repository was tested on Ubuntu 18 and on Windows 10 using WSL2 with Ubuntu 20.
To use this project, you need to install Docker and Docker Compose v2. The repository includes a Makefile with recipes that make it easier to use. Additionally, the `call-service` target in the Makefile uses curl. Finally, this project uses Poetry to create virtual environments and run tests.
To install the solution, clone this repository and run the `build` target in the Makefile. After that, the solution is ready to run the tests, the main job, and the web service. The recommended order for executing these targets is shown in the TLDR section.
The Makefile provides several recipes that make this repository easier to use, among them:
- `build`: builds the job and web service
- `test`: runs the tests
- `run-job`: launches the job locally
- `run-service`: launches the service locally
- `call-service`: calls the service with an example request
After running the `run-service` target, an end-point is available on the host machine to consume the data generated by the job. This end-point is a `GET` method available by default at http://localhost:8000/api/driven-distance. The following query params can be used:
- `vehicle_id`: filters the data for a specific vehicle.
- `start_date`: filters the data after a certain date.
- `end_date`: filters the data before a certain date.
The end-point returns JSON, and all of these request parameters can be combined.
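As an illustration of how the query params combine, here is a minimal client sketch using only the Python standard library. The helper names `build_url` and `fetch_driven_distance` are hypothetical, and the `YYYY-MM-DD` date format and the vehicle id used in the example are assumptions, since the expected formats are not specified here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default address of the end-point after `make run-service`.
BASE_URL = "http://localhost:8000/api/driven-distance"


def build_url(vehicle_id=None, start_date=None, end_date=None, base_url=BASE_URL):
    """Build the request URL, including only the filters that were provided."""
    params = {"vehicle_id": vehicle_id, "start_date": start_date, "end_date": end_date}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{query}" if query else base_url


def fetch_driven_distance(**filters):
    """Call the end-point and decode its JSON body (the service must be running)."""
    with urlopen(build_url(**filters)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical vehicle id and assumed date format, for illustration only.
    print(build_url(vehicle_id="abc", start_date="2019-06-01"))
```

Omitted filters are simply left out of the query string, so the same helper covers the unfiltered call and any combination of the three parameters.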
The solution implemented in this repository consists of a pipeline written in Python and hosted in a container named `job`, and a web service that provides a REST end-point to consume the processed data. The complete structure of the solution can be seen in the figure below.
The pipeline has the following steps:
- Extract CSV: the pandas library is used to read the CSV file and perform preliminary formatting, such as changing the timezone of the dates.
- Extract OSM data: once the CSV data is loaded, the bounding box of the trips can be identified. This bounding box is used to export data from OpenStreetMap through its API using the OSMnx library. OSMnx is a library for downloading and working with OpenStreetMap data, built on top of NetworkX, a well-known Python library for working with graphs.
- Coordinates to nodes: in the CSV file, the positions of the vehicles are stored as latitudes and longitudes. These coordinates must be transformed into nodes of the OSM graph, again using OSMnx.
- Identify trip segments: since each observation is a point of presence of a vehicle, sequential observations can be combined using pandas to identify the trip segments of each vehicle.
- Estimate distances: the trip segments are used to evaluate driving distances using the OSM data. On some occasions, the distance traversed between two sequential observations is small, resulting in a trip segment that starts and finishes at the same graph node. For these cases, a fallback evaluates the straight-line distance between the origin and destination coordinates of the trip segment.
- Group and save: finally, the trip segments are grouped by the day they took place and saved to the DB that backs the REST end-point.
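The last three steps can be sketched as follows. This is a simplified illustration, not the code in the `ddmc` package: it uses only the standard library instead of pandas and OSMnx, the road-network distance is passed in as a hypothetical callable `network_distance_m`, the haversine formula stands in for the straight-line fallback, and attributing each segment to the day of its origin observation is an assumption.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle (straight-line) distance in metres between two coordinates."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def trip_segments(observations):
    """Pair consecutive observations of each vehicle into (origin, destination)."""
    by_vehicle = defaultdict(list)
    for obs in observations:
        by_vehicle[obs["vehicle_id"]].append(obs)
    for points in by_vehicle.values():
        points.sort(key=lambda o: o["timestamp"])
        yield from zip(points, points[1:])


def segment_distance_m(origin, dest, network_distance_m):
    """Road-network distance, or the straight line when both ends share a node."""
    if origin["node"] == dest["node"]:
        return haversine_m(origin["lat"], origin["lon"], dest["lat"], dest["lon"])
    return network_distance_m(origin["node"], dest["node"])


def daily_distances(observations, network_distance_m):
    """Sum segment distances per (vehicle_id, day of the origin observation)."""
    totals = defaultdict(float)
    for origin, dest in trip_segments(observations):
        key = (origin["vehicle_id"], origin["timestamp"].date())
        totals[key] += segment_distance_m(origin, dest, network_distance_m)
    return dict(totals)
```

Each observation is assumed to be a dict with the keys `vehicle_id`, `timestamp` (a `datetime`), `lat`, `lon`, and `node` (the graph node it was snapped to). In the real pipeline, `network_distance_m` would be a shortest-path length computed on the OSM graph.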
A MySQL database stores the processed data. This database is consumed by a container where the REST end-point is implemented using Python and Flask. These technologies were chosen because they offer a simple way to implement a web service quickly. The end-point is served using Nginx, a well-known web server that generally performs well.
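To illustrate how an end-point like this can translate its optional filters into a database query, here is a minimal sketch. It is not the code of the `api` package: the table and column names (`driven_distance`, `vehicle_id`, `day`, `distance_m`) are hypothetical, the sample rows are made up, the date bounds are assumed to be inclusive, and the standard library's `sqlite3` stands in for MySQL so the example is self-contained.

```python
import sqlite3


def build_query(vehicle_id=None, start_date=None, end_date=None):
    """Return (sql, params) with one WHERE condition per provided filter."""
    conditions, params = [], []
    if vehicle_id is not None:
        conditions.append("vehicle_id = ?")
        params.append(vehicle_id)
    if start_date is not None:
        conditions.append("day >= ?")  # assumption: inclusive lower bound
        params.append(start_date)
    if end_date is not None:
        conditions.append("day <= ?")  # assumption: inclusive upper bound
        params.append(end_date)
    sql = "SELECT vehicle_id, day, distance_m FROM driven_distance"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql + " ORDER BY vehicle_id, day", params


def demo_connection():
    """In-memory database with made-up rows, standing in for the MySQL container."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE driven_distance (vehicle_id TEXT, day TEXT, distance_m REAL)"
    )
    conn.executemany(
        "INSERT INTO driven_distance VALUES (?, ?, ?)",
        [("a", "2019-06-01", 1200.0), ("a", "2019-06-02", 800.0), ("b", "2019-06-01", 500.0)],
    )
    return conn


if __name__ == "__main__":
    sql, params = build_query(vehicle_id="a", start_date="2019-06-02")
    print(demo_connection().execute(sql, params).fetchall())
```

The values are passed as bound parameters rather than interpolated into the SQL string, which keeps the query safe regardless of which filters the caller combines.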
The pipeline, database, Python/Flask end-point, and web server are managed using Docker and Docker Compose to ease the deployment of the solution.
├── .docker # stores Dockerfiles and other configuration files for the containers.
| ├── db # the image files for the db (MySQL) container.
| ├── job # the image files for the job container.
| ├── nginx # the image files for the reverse proxy (Nginx) container
| └── service # the image files for the API container.
├── .k8s # stores k8s supporting files.
| ├── definitions # the yaml definitions.
| └── templates # templates used to generate definitions via Makefile.
├── input # stores `csv` files to be extracted. Columns: "vehicle_id","location_raw_lat","location_raw_lon","created_timestamp".
├── ddmc # stores the python files responsible for extracting the `csv` files and storing the data in the database.
| ├── ddmc # the job package.
| └── tests # the job tests using Pytest and TestContainers.
├── output # used as working directory when extracting the data from the `csv` files. Also stores the DB files.
└── api # stores the python files for the web service.
  ├── api # the api package.
  └── tests # the api tests using Pytest and TestContainers.
You can spin up everything using K8s.
make
kubectl apply -f .k8s/definitions/
sleep 30 # give it some time to spin up everything and run the job
make call-service
kubectl delete -f .k8s/definitions/