
Other libraries

Riley Evans edited this page Feb 1, 2021 · 2 revisions

Other Libraries for ETLs

Petl

Petl allows data to be imported into a database and supports standard transformations such as sorting, joining and aggregation. A very simple library; it does not use a graph-based approach.

Pandas

Pandas is similar to petl; it is useful for a proof of concept and for tasks inside a pipeline.
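The same join-then-aggregate step in pandas, again with hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data; in a real pipeline these would be read
# with pd.read_csv(), pd.read_sql(), etc.
orders = pd.DataFrame({'customer_id': [1, 2, 1],
                       'amount': [10.0, 5.0, 7.5]})
customers = pd.DataFrame({'customer_id': [1, 2],
                          'name': ['Alice', 'Bob']})

merged = orders.merge(customers, on='customer_id')  # join
totals = merged.groupby('name')['amount'].sum()     # aggregate
print(totals)
```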

Mara

Single-machine pipeline execution using Python's multiprocessing module. Has a script-style interface and is very object-oriented.

# Imports assumed from the mara-pipelines package:
from mara_pipelines.commands.bash import RunBash
from mara_pipelines.pipelines import Pipeline, Task

pipeline = Pipeline(
    id='demo',
    description='A small pipeline that demonstrates the interplay between pipelines, tasks and commands')

pipeline.add(Task(id='ping_localhost', description='Pings localhost',
                  commands=[RunBash('ping -c 3 localhost')]))

sub_pipeline = Pipeline(id='sub_pipeline', description='Pings a number of hosts')

for host in ['google', 'amazon', 'facebook']:
    sub_pipeline.add(Task(id=f'ping_{host}', description=f'Pings {host}',
                          commands=[RunBash(f'ping -c 3 {host}.com')]))

sub_pipeline.add_dependency('ping_amazon', 'ping_facebook')
sub_pipeline.add(Task(id='ping_foo', description='Pings foo',
                      commands=[RunBash('ping foo')]), ['ping_amazon'])

pipeline.add(sub_pipeline, ['ping_localhost'])

pipeline.add(Task(id='sleep', description='Sleeps for 2 seconds',
                  commands=[RunBash('sleep 2')]), ['sub_pipeline'])

Apache Airflow

Widely used ETL tool.

Create a DAG, add nodes to the graph, then specify links.

Tasks are called "operators". Many different operators with different actions can be defined: for example SQL, GCS-to-S3 transfer, Bash and Python operators.

Has >> and << operators for forming chains.

Dependencies can be chained, a >> b >> c, or fanned out to multiple tasks, a >> [b, c].

Also provides a @task decorator for defining Python tasks as plain functions.

Bonobo

Can define sets of chains to build a graph. Uses a similar style to Airflow, with the >> operator. http://docs.bonobo-project.org/en/master/guide/graphs.html

Luigi

Has an OOP interface: define a class for each task, then specify its output location and any dependencies.

Porcupine

Similar ideas to what I had when I first thought of this project. However, it looks like it only supports linear pipelines.

Abstract data sources and sinks - another idea I had :/

https://www.tweag.io/blog/2019-10-30-porcupine/
