Partition Filter Pushdown on Azure Data Lake Storage Gen2 does not work properly (or can be massively improved) #81

keen85 · 2024-09-20T12:25:57Z

I have a folder structure in ADLS Gen2 with Hive-style partition folders:

orders
├── year=2021
│    ├── month=1
│    │   ├── file1.json
│    │   └── file2.json
│    └── month=2
│        └── file3.json
└── year=2022
     ├── month=11
     │   ├── file4.json
     │   └── file5.json
     └── month=12
         └── file6.json

According to the documentation, when a query filters on partition columns, files that are not needed to answer it are skipped.

If this is true, I would expect the following two queries to perform about equally:

SELECT
    COUNT(*) AS number_of_orders, year
FROM
    read_json(
        'abfss://<storageaccount>.dfs.core.windows.net/<container>/orders/year=2021/**',
        hive_partitioning = true
    )
GROUP BY ALL;

SELECT
    COUNT(*) AS number_of_orders, year
FROM
    read_json(
        'abfss://<storageaccount>.dfs.core.windows.net/<container>/orders/**',
        hive_partitioning = true
    )
WHERE year = 2021
GROUP BY ALL;

However, the execution times differ considerably: the first query is much faster than the second one.
My impression is that the second query does not take full advantage of the partition filter. I think it lists files for the entire folder structure, even though listing only the folder year=2021 would suffice. File listing is an expensive operation, so performance degrades.
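
To make the suspected difference concrete, here is a minimal Python sketch. It is a toy model of the folder tree above, not DuckDB's or the Azure extension's actual implementation, and the names TREE, list_files and year_2021_only are made up for illustration. It assumes one storage request per listed directory; with a recursive listing API the cost would instead scale with the number of returned entries, but the idea is the same. It compares "list everything, then filter" with "skip non-matching year=... folders while listing":

```python
# Toy model of the orders/ tree from this issue: directory -> children.
# Names ending in ".json" are files; everything else is a directory.
TREE = {
    "orders": ["year=2021", "year=2022"],
    "orders/year=2021": ["month=1", "month=2"],
    "orders/year=2021/month=1": ["file1.json", "file2.json"],
    "orders/year=2021/month=2": ["file3.json"],
    "orders/year=2022": ["month=11", "month=12"],
    "orders/year=2022/month=11": ["file4.json", "file5.json"],
    "orders/year=2022/month=12": ["file6.json"],
}


def list_files(root, keep_dir=lambda name: True):
    """Walk the tree from root; return (sorted files, number of list calls)."""
    files, calls, stack = [], 0, [root]
    while stack:
        path = stack.pop()
        calls += 1  # one simulated "list directory" request (an assumption)
        for name in TREE[path]:
            child = f"{path}/{name}"
            if name.endswith(".json"):
                files.append(child)
            elif keep_dir(name):
                stack.append(child)
    return sorted(files), calls


# Strategy A: list the whole tree, then filter file paths on year=2021.
all_files, calls_a = list_files("orders")
files_a = [f for f in all_files if "/year=2021/" in f]


# Strategy B: skip non-matching year=... directories while listing.
def year_2021_only(name):
    return not name.startswith("year=") or name == "year=2021"


files_b, calls_b = list_files("orders", keep_dir=year_2021_only)

print(files_a == files_b, calls_a, calls_b)  # prints: True 7 4
```

Both strategies return the same three files, but the pruned walk issues 4 simulated list calls instead of 7. On a real dataset with many years and months, that gap would grow with the number of partitions that the filter excludes.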
