Other libraries
petl allows data to be imported into a database and supports standard transformations such as sorting, joining and aggregation. It is a very simple library and has no graphical interface.
pandas is similar to petl. It is useful for a proof of concept and for individual tasks inside a pipeline.
Single-machine pipeline execution using Python's multiprocessing. Has a script-style interface and is very object-oriented.
pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])
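To make the dependency handling in the example above concrete, here is a minimal sketch using only the standard library. It assumes that the list passed as the second argument to add names the upstream items that must finish first; the dependencies mapping and the execution_order helper are hypothetical and are not part of the library shown above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stand-in for the top level of the pipeline above:
# each id maps to the set of ids that must finish before it starts.
dependencies = {
    'ping_localhost': set(),
    'sub_pipeline': {'ping_localhost'},
    'sleep': {'sub_pipeline'},
}


def execution_order(deps):
    """Return one valid run order in which every item follows its upstreams."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the three top-level items form a single chain, there is only one valid order here. A real runner would additionally start independent tasks in parallel.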
Widely used ETL tool.
Create a DAG, add nodes to the graph, then specify the links between them.
Tasks are called "Operators". Many different operators can be defined, each with a different action, for example SQL, GCS to S3, Bash and Python.
Has >> and << operators for forming chains.
Can chain dependencies, a >> b >> c, or have multiple dependencies, a >> [b, c].
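The >> and << chaining style can be reproduced with ordinary Python operator overloading. The sketch below is an illustration of the idea only: the Node class and its downstream attribute are hypothetical and are not the tool's own operator classes.

```python
class Node:
    """Hypothetical task node that records which nodes run after it."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def _link(self, other):
        # Accept either a single node or a list of nodes.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            if target not in self.downstream:
                self.downstream.append(target)

    def __rshift__(self, other):
        # self >> other: other runs after self. Returning other allows a >> b >> c.
        self._link(other)
        return other

    def __lshift__(self, other):
        # self << other: self runs after other.
        sources = other if isinstance(other, list) else [other]
        for source in sources:
            source._link(self)
        return other

    def __rrshift__(self, other):
        # [x, y] >> self: Python falls back to this when the left operand is a list.
        for source in other:
            source._link(self)
        return self


a, b, c, d = (Node(n) for n in 'abcd')
a >> b >> c   # chain: a before b, b before c
a >> [c, d]   # fan-out: a before both c and d
names = {n.name: [m.name for m in n.downstream] for n in (a, b, c, d)}
print(names)
```

Returning the right-hand operand from __rshift__ is what makes chains work: a >> b evaluates to b, so the next >> links b to c.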
Uses an @task decorator.
Can define sets of chains to build a graph. Uses a similar style to Airflow, with the >> operator.
http://docs.bonobo-project.org/en/master/guide/graphs.html
Has an OOP implementation: define a class for each task, then specify its output location and any dependencies.
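The class-per-task style can be sketched as follows. This is a simplified illustration, not the library's API: the Task base class, the method names requires, output and run, and the build helper are all assumptions chosen for the example.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical base class: subclasses declare dependencies and an output path."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []  # no dependencies by default

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class Extract(Task):
    def output(self):
        return self.workdir / 'raw.txt'

    def run(self):
        self.output().write_text('3\n1\n2\n')


class Sort(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return self.workdir / 'sorted.txt'

    def run(self):
        lines = self.requires()[0].output().read_text().split()
        self.output().write_text('\n'.join(sorted(lines, key=int)))


def build(task):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dep in task.requires():
        build(dep)
    task.run()


with tempfile.TemporaryDirectory() as tmp:
    build(Sort(tmp))
    result = (Path(tmp) / 'sorted.txt').read_text()
print(result)
```

Using the existence of the output file as the "done" marker means a rerun only executes the tasks whose outputs are missing.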
Similar to the ideas I had when I first came up with this project. However, it looks like it only supports linear pipelines.
Abstract data sources and sinks - another idea I had :/
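A rough sketch of what abstract sources and sinks around a linear pipeline could look like is given below. Every name here (Source, Sink, ListSource, ListSink, run_linear) is hypothetical and is not taken from the library discussed above.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    @abstractmethod
    def read(self):
        """Yield rows one at a time."""


class Sink(ABC):
    @abstractmethod
    def write(self, rows):
        """Consume an iterable of rows."""


class ListSource(Source):
    def __init__(self, rows):
        self.rows = rows

    def read(self):
        yield from self.rows


class ListSink(Sink):
    def __init__(self):
        self.rows = []

    def write(self, rows):
        self.rows.extend(rows)


def run_linear(source, steps, sink):
    """Pass every row from the source through each step in order, then into the sink."""
    rows = source.read()
    for step in steps:
        rows = map(step, rows)  # lazy: nothing runs until the sink consumes the rows
    sink.write(rows)


sink = ListSink()
run_linear(ListSource([1, 2, 3]), [lambda x: x * 2, lambda x: x + 1], sink)
print(sink.rows)
```

Because the steps only see an iterable of rows, a file, database or API source could be swapped in without changing the pipeline itself.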