-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve getting the query count in Airflow API endpoints #32630
Conversation
airflow/api_connexion/parameters.py
Outdated
@@ -125,3 +126,9 @@ def apply_sorting( | |||
else: | |||
order_by = f"{lstriped_orderby} asc" | |||
return query.order_by(text(order_by)) | |||
|
|||
|
|||
def get_query_count(query_stmt: sqlalchemy.sql.selectable.Select, session: sqlalchemy.orm.Session) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be a good idea to put this in airflow.utils.db
instead? Importing this in airflow.www.views
feels wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the order_by
reset necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be a good idea to put this in airflow.utils.db instead? Importing this in airflow.www.views feels wrong.
I agree, I just moved it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the order_by reset necessary?
TBH, I am not sure how sqlalchemy processes this and generates the query, but according to the documentation, it just select from the subquery: <query> FROM (<subquery>) AS subquery
, so if we compare this format with and without order by:
airflow=# EXPLAIN SELECT COUNT(1) FROM (SELECT * FROM dag_run WHERE state='failed' ORDER BY data_interval_start) AS subquery;
QUERY PLAN
----------------------------------------------------------------------
Aggregate (cost=1.04..1.05 rows=1 width=8)
-> Sort (cost=1.02..1.03 rows=1 width=1459)
Sort Key: dag_run.data_interval_start
-> Seq Scan on dag_run (cost=0.00..1.01 rows=1 width=1459)
Filter: ((state)::text = 'failed'::text)
(5 rows)
airflow=# EXPLAIN SELECT COUNT(1) FROM (SELECT * FROM dag_run WHERE state='failed') AS subquery;
QUERY PLAN
-------------------------------------------------------------
Aggregate (cost=1.01..1.02 rows=1 width=8)
-> Seq Scan on dag_run (cost=0.00..1.01 rows=1 width=0)
Filter: ((state)::text = 'failed'::text)
(3 rows)
There is no many rows in my DB (fresh breeze DB), so this slight difference could be much bigger in the big DB.
Personally I always reset the order_by when I don't need it to be safe and to ensure a better performance. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So seems like a db optimisation issue. Since this is in a function, adding the reset seems to have no drawbacks. Perhaps a comment explaining the query difference would prevent someone coming in and mistakenly assuming the order_by is superfulous.
d5f615c
to
6dc4017
Compare
Co-authored-by: Tzu-ping Chung <[email protected]>
related: #28723
When I run the Airflow webserver, I'm getting:
This PR creates a new method with the new style to get the count of a query, and utilizes it instead of duplicating the count query:
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.