Standardising hooks / connections and adding TTLs #12128

baolsen · 2020-11-06T06:56:24Z

baolsen
Nov 6, 2020

I'd like to discuss standardising connections / hooks within Airflow, and also possibly adding caching / TTL to them.

As a contributor who's been using Airflow for about 2 years now, it is still quite confusing how the hooks / connections work when perusing the source code. There are several methods involved and they do not seem to be consistently used. I think perhaps this is because of poor documentation or standardisation of the same. Understandably a project with many contributions is going to have some degree of sprawl, and implementation may require it. But it seems to me that more can be done to improve it.

Specifically, my use case is to consider how we could add caching with a TTL to the connection across all hooks.
So I started digging a bit to understand better how the hooks / conns all fit together and where this sort of caching could be implemented.

What I've found so far led me to create this discussion.

BaseHook :: get_conn , get_hook is there to assist with some sort of abstraction over the connection but it seems the subclasses don't consistently implement and use these. There are several methods with similar names and there does not seem to be clear documentation around the purpose of these methods. An abstract / base class should be clearer about the interface it is trying to create. Lack of this means that subclasses might have inconsistent implementation and as a result, unintended side effects.

As a more concrete example, EmrStepSensor.get_emr_response()
When the poke() method is called (Every 30 sec) this method creates a new EMR Hook. This means a new connection to AWS each time. It could be changed to use a self.get_hook() to prevent reconnecting every 30 sec. However the next problem is that connections have a limited lifespan, eg lets say 1 hour, before the token expires. This lifespan would be site-specific, so it would be good to have a way to configure it.

Maybe an optional attribute on the Airflow Connection model for TTL ?

We should also look at how do we propose to standardise the approach across the other hooks etc.

There is the get_hook() method and get_conn to consider. Looking at some examples:

EmrHook:
get_conn retrieves AWS boto client and stores it on self.conn.
No TTL currently. For long running jobs this could be a problem, but it isn't being used like this (see below).

EmrStepSensor, EmrJobFlowSensor:
Creates a new EmrHook each time the poke is run. Avoids the TTL problem but creates many connections.

SSHHook (see issue #10874 ):
get_conn() creates a new SSH connection / client
In SSHOperator, get_conn() is called in the execute() and runs 1 command over SSH.

If we want to run multiple commands over the connection, we need to either:

Create multiple SSHOperator tasks (slow to execute if there are a lot of commands)
Subclass SSHOperator and call super.execute() which means many connections / no re-use of connection
Add another operator for running multiple commands (which doesn't make a lot of sense if we want to do things in between commands)

If we look at AzureDataLakeHook for a different example -
IT uses self.get_conn to create the connection
Other methods refer to self.connection, which I can't find the definition of in that hook or in BaseHook. It must be there but I can't see where and how / why it is being used like that.
(There is possibly another separate issues because get_connection is called in the hook init)

To me it is quite confusing all the implementations in BaseHook around get_connection, get_connections, get_hook etc.
Some clearer / consistent documentation might help a lot here.

It's also hard for a new user or a developer to make reasonable assumptions about how each hook / conn works while there are different implementations / approaches. If it was clearer / more consistent then we can learn once how the hooks work and implement more easily. Having clear documentation on base_hook would help a lot as a single reference point while doing development. Then we don't need to inspect the source code of the specific hook we are using each time we want to use it (and sometimes there is multiple inheritance going on which can add to confusion as a new user)

Thoughts, suggestions, critical feedback all welcome :)

potiuk · 2020-11-11T18:16:08Z

potiuk
Nov 11, 2020
Collaborator

I think your observations are very correct Bjorn. I think the use of connection is very different in different cases. For example most of the Google operators have a very common way of using the connection and hook - but this is something that took us quite some time to standardize and build proper abstraction layers.
I honestly think it will be difficult to standardize the usage across the providers. One thing is that it would require backwards-incompatible changes in many places, and two - the Hooks/connection interfaces are rather low-level, and they often need some kind of "special" treatment for different providers, because they might require different optimization techniques, different authorization etc. I might be wrong, but there are some reasons why they are differently implemented for different providers.
However, detecting and "standardizing" using the connection in the scope of a provider is rather doable.
What I personally think might make sense instead of "standardizing across all providers" is to document (with reasoning and context) those different usage cases, clearly state when which usage method is better and when it is worse, what leads to using those connections in this way. And while doing it, we could pin point the cases that are clear mistakes, standardize "in-provider" usage. Maybe after we do that exercise we will come up with 3-4 usage patterns and that will be enough for the developers to be far less confused.

We already have an issue opened about documentation of "What is needed to add a new provider" https://github.com/apache/airflow/issues/10386 - I've added a point in the issue and linked to that discussion.

Maybe someone would like to pick that task :)

0 replies

mik-laj · 2020-11-18T03:59:30Z

mik-laj
Nov 18, 2020
Collaborator

However the next problem is that connections have a limited lifespan, eg lets say 1 hour, before the token expires.

This is most often handled by client libraries such as botocore/boto3 or google-auth.
For boto3, you should use DeferredRefreshableCredentials or RefreshableCredentials, to handle this case correctly.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardising hooks / connections and adding TTLs #12128

{{title}}

Replies: 2 comments

{{title}}

{{title}}

Select a reply

Standardising hooks / connections and adding TTLs #12128

baolsen Nov 6, 2020

Replies: 2 comments

potiuk Nov 11, 2020 Collaborator

mik-laj Nov 18, 2020 Collaborator

baolsen
Nov 6, 2020

potiuk
Nov 11, 2020
Collaborator

mik-laj
Nov 18, 2020
Collaborator