lcardno10/data-engineering-test-python-pyspark

 
 

Data Test - Starter Project

Prerequisites

Python (3.8.* or later)

You can install Python either from source or with pyenv.

Check that Python is installed:

python --version
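If you would rather check the version from inside Python, the sketch below compares the running interpreter against the 3.8 minimum stated above. The function name python_is_supported and the constant MIN_VERSION are illustrative and not part of this project.

```python
import sys

# Minimum version taken from the prerequisites above (3.8 or later).
MIN_VERSION = (3, 8)


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the major.minor version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for this project"
    print(f"Python {current}: {status}")
```

Comparing tuples avoids the string-comparison trap where "3.10" would sort before "3.8".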

Preferably an IDE such as Visual Studio Code

https://code.visualstudio.com/

Dependencies and data

Creating a virtual environment

Ensure your pip (package manager) is up to date:

pip install --upgrade pip

To check your pip version run:

pip --version

Create the virtual environment in the root of the cloned project:

python -m venv .venv

Activating the newly created virtual environment

Keep the virtual environment active whenever you work on this project. On Linux or macOS:

source ./.venv/bin/activate

On Windows, run .venv\Scripts\activate instead.
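To confirm that activation worked, one option is to ask the interpreter itself. This is a minimal sketch, assuming the environment was created with python -m venv as shown above; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=None):
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    if base_prefix is None:
        base_prefix = sys.base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; activate .venv first.")
```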

Installing Python requirements

This installs the packages listed in requirements.txt, some of which you may find useful:

pip install -r requirements.txt

Running tests to ensure everything is working correctly

pytest

Generating the data

A data generator is included as part of the project in ./input_data_generator/main_data_generator.py. It lets you generate a configurable number of months of data.

To run the data generator use:

python ./input_data_generator/main_data_generator.py

This should produce customers, products and transaction data under ./input_data/starter.
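A quick way to see what the generator wrote is to count the files under the output directory. The sketch below makes no assumption about file names or formats inside ./input_data/starter; it only walks whatever is there. The function name summarise_generated_data is hypothetical.

```python
from pathlib import Path


def summarise_generated_data(root="./input_data/starter"):
    """Map each top-level entry under root to the number of files it holds.

    Returns an empty dict when root does not exist, for example when the
    generator has not been run yet.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    summary = {}
    for entry in sorted(root_path.iterdir()):
        if entry.is_dir():
            # Count files at any depth below this entry.
            summary[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            summary[entry.name] = 1
    return summary


if __name__ == "__main__":
    for name, file_count in summarise_generated_data().items():
        print(f"{name}: {file_count} file(s)")
```

An empty result usually means the generator has not run or was started from a directory other than the project root.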

Getting started

The skeleton of a possible solution is provided in ./solution/solution_start.py. You do not have to use this code if you want to approach the problem in a different way.

About

Standard tech test for Data Engineers
