Here we address a simple use case of applying a single transform to a set of parquet files. We'll use the `noop` transform as an example, but in general this process will work for any of the transforms contained in the repository. Additionally, what follows uses the python runtime (i.e., the `noop/python` directory), but the examples below should also work for the ray (`noop/ray` directory) or spark (`noop/spark` directory) runtimes.
Each transform project contains a Makefile that will assist in building the virtual environment in a directory named `venv`.
To create the virtual environment for the `noop` transform:

```shell
cd transforms/universal/noop/python
make venv
```
Note: if needed, you can override the default `python` command used in `make venv` above, for example:

```shell
make PYTHON=python3.10 venv
```
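For reference, `make venv` roughly amounts to the following (a sketch only; the actual Makefile may pin versions and install transform-specific requirements):

```shell
# Create the virtual environment using the standard venv module
python3 -m venv venv
# Confirm the interpreter inside the environment works
venv/bin/python --version
# venv/bin/pip install -r requirements.txt   # install the transform's dependencies
```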
To process data in the `/home/me/input` directory and write it to the `/home/me/output` directory, activate the virtual environment and then invoke the transform, referencing these directories. For example, using the `noop` transform to read parquet files from `/home/me/input`:

```shell
cd transforms/universal/noop/python
source venv/bin/activate
python src/noop_transform_python.py \
    --data_local_config "{ \
        'input_folder' : '/home/me/input', \
        'output_folder' : '/home/me/output' \
        }"
deactivate
```
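Quoting these nested dict strings in the shell is error-prone. One alternative (a sketch, not part of the toolkit) is to build the argument as a real dict in Python and launch the transform with `subprocess`:

```python
import subprocess

# Build the config as a dict, then serialize it; str() yields the
# single-quoted literal form used in the CLI examples above.
config = {
    "input_folder": "/home/me/input",
    "output_folder": "/home/me/output",
}
cmd = [
    "python", "src/noop_transform_python.py",
    "--data_local_config", str(config),
]
# Run inside the activated virtual environment:
# subprocess.run(cmd, check=True)
print(cmd[-1])
```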
When processing data located in S3 buckets, one can use the same approach and specify the `--data_s3_*` configuration instead, as follows:

```shell
cd transforms/universal/noop/python
source venv/bin/activate
python src/noop_transform_python.py \
    --data_s3_cred "{ \
        'access_key' : '...', \
        'secret_key' : '...', \
        'url' : '...' \
        }" \
    --data_s3_config "{ \
        'input_folder' : '...', \
        'output_folder' : '...' \
        }"
```
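Rather than typing secrets on the command line (where they can land in shell history), the credential dict can be assembled from environment variables. A minimal sketch, assuming illustrative variable names `S3_ACCESS_KEY`, `S3_SECRET_KEY`, and `S3_ENDPOINT_URL` (these are not defined by the toolkit):

```python
import os

# Pull credentials from the environment; empty strings signal a missing value.
s3_cred = {
    "access_key": os.environ.get("S3_ACCESS_KEY", ""),
    "secret_key": os.environ.get("S3_SECRET_KEY", ""),
    "url": os.environ.get("S3_ENDPOINT_URL", ""),
}
# str(s3_cred) produces the dict-literal string to pass as --data_s3_cred
print(str(s3_cred))
```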