According to the documentation, when filtering on partition columns, files that are not needed to answer a query are skipped.
If this is true, I would expect the following two queries to be equivalent, in performance as well as in result:
SELECT COUNT(*) AS number_of_orders, year
FROM
read_json(
'abfss://<storagacount>.dfs.core.windows.net/<container>/orders/year=2021/**'
, hive_partitioning = true
)
GROUP BY ALL;

SELECT COUNT(*) AS number_of_orders, year
FROM
read_json(
'abfss://<storagacount>.dfs.core.windows.net/<container>/orders/**'
, hive_partitioning = true
)
WHERE year = 2021
GROUP BY ALL;
However, I noticed that the execution times differ considerably: the first query is much faster than the second.
My impression is that the second query does not take full advantage of the partition filter. I think it lists files across the entire folder structure, even though looking only at the folder year=2021 would suffice. Listing is an expensive operation, so performance degrades.
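To make the suspected difference concrete, here is a minimal sketch in Python (standard library only) run against a toy local folder tree. It is not the engine's actual implementation, and the function names `build_tree`, `list_then_filter`, and `prune_then_list` are hypothetical, introduced only for illustration. It assumes the slow path lists every file and filters afterwards, while the fast path descends only into the matching partition folder.

```python
import os
import tempfile


def build_tree(root, years, files_per_year):
    """Create a toy hive-partitioned layout: root/year=YYYY/part-N.json."""
    for year in years:
        part_dir = os.path.join(root, f"year={year}")
        os.makedirs(part_dir)
        for i in range(files_per_year):
            with open(os.path.join(part_dir, f"part-{i}.json"), "w") as f:
                f.write("{}\n")


def list_then_filter(root, year):
    """Assumed slow path: list every file, then keep the wanted partition."""
    listed = 0
    kept = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            listed += 1  # every file in every partition is touched
            if os.path.basename(dirpath) == f"year={year}":
                kept.append(os.path.join(dirpath, name))
    return kept, listed


def prune_then_list(root, year):
    """Assumed fast path: list only the folder that matches the filter."""
    part_dir = os.path.join(root, f"year={year}")
    kept = [os.path.join(part_dir, n) for n in sorted(os.listdir(part_dir))]
    return kept, len(kept)


with tempfile.TemporaryDirectory() as root:
    # Ten partitions (2015..2024) with five files each: 50 files in total.
    build_tree(root, years=range(2015, 2025), files_per_year=5)
    full_files, full_listed = list_then_filter(root, 2021)
    pruned_files, pruned_listed = prune_then_list(root, 2021)

# Both strategies select the same files, but list very different amounts.
print(sorted(full_files) == sorted(pruned_files))
print(full_listed, pruned_listed)
```

Both strategies return the same five files, but the first one lists all 50 entries to find them. On remote object storage each listing call is a network round trip, so if my assumption holds, the gap would grow with the number of partitions.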
For context: the data is stored in ADLS Gen2, in a folder structure with Hive-style partition folders (such as year=2021).