Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partition Filter Pushdown on Azure Data Lake Storage Gen2 does not work properly (or can be massively improved) #81

Open
keen85 opened this issue Sep 20, 2024 · 0 comments

Comments

@keen85
Copy link

keen85 commented Sep 20, 2024

I have a folder structure in ADLSGen2 with hive-style partition folders:

orders
├── year=2021
│    ├── month=1
│    │   ├── file1.json
│    │   └── file2.json
│    └── month=2
│        └── file3.json
└── year=2022
     ├── month=11
     │   ├── file4.json
     │   └── file5.json
     └── month=12
         └── file6.json

According to documentation, when filtering on partition columns, files that are not necessary to answer a query are skipped.

If this is true I would expect the following two queries to be equal:

SELECT
    COUNT(*) AS number_of_orders, year
FROM
    read_json(
        'abfss://<storagacount>.dfs.core.windows.net/<container>/orders/year=2021/**',
        , hive_partitioning = true
    )
GROUP BY ALL
SELECT
    COUNT(*) AS number_of_orders, year
FROM
    read_json(
        'abfss://<storagacount>.dfs.core.windows.net/<container>/orders/**',
        , hive_partitioning = true
    )
WHERE year = 2021
GROUP BY ALL

However, I noticed that the execution time differs very much; the first query is much faster than the second one.
I have the impression that the second query does not make use of the partition filter entirely. I think that it does file listing for the full folder structure (even when only considering folder year=2021 whould suffice). This is an expensive operation and therefore the performance degrades.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant