Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve getting the query count in Airflow API endpoints #32630

Merged
merged 4 commits into from
Jul 22, 2023

Conversation

hussein-awala
Copy link
Member

related: #28723

When I run the Airflow webserver, I'm getting:

SADeprecationWarning: Implicit coercion of SELECT and textual SELECT constructs into FROM clauses is deprecated; please call .subquery() on any Core select or ORM Query object in order to produce a subquery object.

This PR creates a new method with the new style to get the count of a query, and utilizes it instead of duplicating the count query:

def get_query_count(query_stmt: sqlalchemy.sql.selectable.Select, session: sqlalchemy.orm.Session) -> int:
    """Get count of query."""
    count_stmt = select(func.count()).select_from(query_stmt.order_by(None).subquery())
    return session.scalar(count_stmt)

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:webserver Webserver related Issues labels Jul 15, 2023
@hussein-awala hussein-awala added the type:misc/internal Changelog: Misc changes that should appear in change log label Jul 15, 2023
@@ -125,3 +126,9 @@ def apply_sorting(
else:
order_by = f"{lstriped_orderby} asc"
return query.order_by(text(order_by))


def get_query_count(query_stmt: sqlalchemy.sql.selectable.Select, session: sqlalchemy.orm.Session) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a good idea to put this in airflow.utils.db instead? Importing this in airflow.www.views feels wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is the order_by reset necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a good idea to put this in airflow.utils.db instead? Importing this in airflow.www.views feels wrong.

I agree, I just moved it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is the order_by reset necessary?

TBH, I am not sure how sqlalchemy processes this and generates the query, but according to the documentation, it just select from the subquery: <query> FROM (<subquery>) AS subquery, so if we compare this format with and without order by:

airflow=# EXPLAIN SELECT COUNT(1) FROM (SELECT * FROM dag_run WHERE state='failed' ORDER BY data_interval_start) AS subquery;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Aggregate  (cost=1.04..1.05 rows=1 width=8)
   ->  Sort  (cost=1.02..1.03 rows=1 width=1459)
         Sort Key: dag_run.data_interval_start
         ->  Seq Scan on dag_run  (cost=0.00..1.01 rows=1 width=1459)
               Filter: ((state)::text = 'failed'::text)
(5 rows)

airflow=# EXPLAIN SELECT COUNT(1) FROM (SELECT * FROM dag_run WHERE state='failed') AS subquery;
                         QUERY PLAN                          
-------------------------------------------------------------
 Aggregate  (cost=1.01..1.02 rows=1 width=8)
   ->  Seq Scan on dag_run  (cost=0.00..1.01 rows=1 width=0)
         Filter: ((state)::text = 'failed'::text)
(3 rows)

There is no many rows in my DB (fresh breeze DB), so this slight difference could be much bigger in the big DB.

Personally I always reset the order_by when I don't need it to be safe and to ensure a better performance. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So seems like a db optimisation issue. Since this is in a function, adding the reset seems to have no drawbacks. Perhaps a comment explaining the query difference would prevent someone coming in and mistakenly assuming the order_by is superfulous.

@hussein-awala hussein-awala requested a review from uranusjr July 17, 2023 20:15
airflow/utils/db.py Outdated Show resolved Hide resolved
Co-authored-by: Tzu-ping Chung <[email protected]>
@hussein-awala hussein-awala added this to the Airflow 2.6.4 milestone Jul 18, 2023
@hussein-awala hussein-awala merged commit 53c6305 into apache:main Jul 22, 2023
@eladkal eladkal modified the milestones: Airflow 2.6.4, Airflow 2.7.0 Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API area:webserver Webserver related Issues type:misc/internal Changelog: Misc changes that should appear in change log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants