Replies: 2 comments
-
I think your observations are very correct Bjorn. I think the use of connection is very different in different cases. For example most of the Google operators have a very common way of using the connection and hook - but this is something that took us quite some time to standardize and build proper abstraction layers. We already have an issue opened about documentation of "What is needed to add a new provider" https://github.com/apache/airflow/issues/10386 - I've added a point in the issue and linked to that discussion. Maybe someone would like to pick that task :) |
Beta Was this translation helpful? Give feedback.
-
This is most often handled by client libraries such as botocore/boto3 or google-auth. |
Beta Was this translation helpful? Give feedback.
-
I'd like to discuss standardising connections / hooks within Airflow, and also possibly adding caching / TTL to them.
As a contributor who's been using Airflow for about 2 years now, it is still quite confusing how the hooks / connections work when perusing the source code. There are several methods involved and they do not seem to be consistently used. I think perhaps this is because of poor documentation or standardisation of the same. Understandably a project with many contributions is going to have some degree of sprawl, and implementation may require it. But it seems to me that more can be done to improve it.
Specifically, my use case is to consider how we could add caching with a TTL to the connection across all hooks.
So I started digging a bit to understand better how the hooks / conns all fit together and where this sort of caching could be implemented.
What I've found so far led me to create this discussion.
BaseHook :: get_conn , get_hook is there to assist with some sort of abstraction over the connection but it seems the subclasses don't consistently implement and use these. There are several methods with similar names and there does not seem to be clear documentation around the purpose of these methods. An abstract / base class should be clearer about the interface it is trying to create. Lack of this means that subclasses might have inconsistent implementation and as a result, unintended side effects.
As a more concrete example, EmrStepSensor.get_emr_response()
When the poke() method is called (Every 30 sec) this method creates a new EMR Hook. This means a new connection to AWS each time. It could be changed to use a self.get_hook() to prevent reconnecting every 30 sec. However the next problem is that connections have a limited lifespan, eg lets say 1 hour, before the token expires. This lifespan would be site-specific, so it would be good to have a way to configure it.
Maybe an optional attribute on the Airflow Connection model for TTL ?
We should also look at how do we propose to standardise the approach across the other hooks etc.
There is the get_hook() method and get_conn to consider. Looking at some examples:
EmrHook:
get_conn retrieves AWS boto client and stores it on self.conn.
No TTL currently. For long running jobs this could be a problem, but it isn't being used like this (see below).
EmrStepSensor, EmrJobFlowSensor:
Creates a new EmrHook each time the poke is run. Avoids the TTL problem but creates many connections.
SSHHook (see issue #10874 ):
get_conn() creates a new SSH connection / client
In SSHOperator, get_conn() is called in the execute() and runs 1 command over SSH.
If we want to run multiple commands over the connection, we need to either:
If we look at AzureDataLakeHook for a different example -
IT uses self.get_conn to create the connection
Other methods refer to self.connection, which I can't find the definition of in that hook or in BaseHook. It must be there but I can't see where and how / why it is being used like that.
(There is possibly another separate issues because get_connection is called in the hook init)
To me it is quite confusing all the implementations in BaseHook around get_connection, get_connections, get_hook etc.
Some clearer / consistent documentation might help a lot here.
It's also hard for a new user or a developer to make reasonable assumptions about how each hook / conn works while there are different implementations / approaches. If it was clearer / more consistent then we can learn once how the hooks work and implement more easily. Having clear documentation on base_hook would help a lot as a single reference point while doing development. Then we don't need to inspect the source code of the specific hook we are using each time we want to use it (and sometimes there is multiple inheritance going on which can add to confusion as a new user)
Thoughts, suggestions, critical feedback all welcome :)
Beta Was this translation helpful? Give feedback.
All reactions