Filter expressions not being applied #3032

Open
jpereiranexar opened this issue Nov 25, 2024 · 3 comments
Labels
bug (Something isn't working), mre-needed (Whether an MRE needs to be provided)

Comments

@jpereiranexar
Environment

  • Delta-rs version: 0.22 (tested multiple versions up to the latest)
  • PyArrow version: 18.0.0
  • OS: Mac 14.4 (23E214)
  • Delta table on S3

Bug

What happened:

When applying a filter expression (lighting == "day") using pyarrow.dataset, no results are returned. However, if I skip the filter at this stage and instead filter the resulting pandas DataFrame (results[results["lighting"] == "day"]), matching rows are returned, confirming that data satisfying the condition exists in the dataset.

What you expected to happen:

The filter method should correctly return rows where lighting == "day" when applied directly on the pyarrow.dataset.

How to reproduce it:

Given a Delta table defined as follows:

CREATE TABLE hive_metastore.dwh.table_name (
  key STRING,
  ...
  lighting STRING,
  ...
  h3_id_res9 BIGINT)
USING delta
PARTITIONED BY (h3_id_res9)
LOCATION 'dbfs:s3_path'
TBLPROPERTIES (
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2')
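
Python code:

import pyarrow.compute as pc
import pyarrow.dataset as ds

# get_delta_table is not a deltalake function; presumably a local helper that returns a DeltaTable
delta_table = get_delta_table(table_path, dynamo_table_name)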
partitions = [("h3_id_res9", "in", str(608716487191953407))]
condition = pc.equal(ds.field("lighting"), "day")

# Apply filter directly on pyarrow dataset
results = (
    delta_table.to_pyarrow_dataset(partitions=partitions)
    .filter(expression=condition)
    .to_table()
    .to_pandas()
)

# Results are empty even though matching rows exist (this is the bug)
assert results.empty, "Expected the filtered results to be empty (the reported bug)."

# Remove filter and filter using pandas
results = (
    delta_table.to_pyarrow_dataset(partitions=partitions)
    .to_table()
    .to_pandas()
)
results_filtered = results[results["lighting"] == "day"].reset_index(drop=True)

# Results are non-empty and rows were filtered as expected
assert not results_filtered.empty, "Expected non-empty results, but got none after pandas filtering."

More Details:

  • In other tables, I am able to filter the data, so I don't think it's tied to data type
jpereiranexar added the bug label Nov 25, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Nov 25, 2024

Can you provide an MRE that I can run as-is with deltalake only?

ion-elgreco added the mre-needed label Nov 25, 2024
@jpereiranexar
Author

jpereiranexar commented Nov 28, 2024

I haven't been able to replicate the error using mock data. However, I discovered that applying the filter with match_substring does return the expected values.

For example:
This condition fails:

condition = (ds.field("lighting") == np.str_("day"))

This returns the expected values:

condition = pc.match_substring(ds.field("lighting"), "day")

I've already verified that the data does not contain leading or trailing spaces or unexpected characters in the lighting column.
Any ideas on what else I can check?
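
Since match_substring succeeds where equality fails, one thing worth ruling out is values that contain "day" without being exactly equal to it (for example because of an invisible character). The following is only a sketch using the standard library; the function name describe_near_matches and the sample values are hypothetical, and in practice the input could be the distinct values of the lighting column.

```python
import unicodedata


def describe_near_matches(values, target):
    """List distinct values that contain `target` but are not exactly equal to it."""
    report = []
    for value in sorted({v for v in values if v is not None}):
        if target in value and value != target:
            # Name every character so invisible ones become visible.
            names = [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in value]
            report.append((repr(value), names))
    return report


# Hypothetical sample: the second value ends in a zero-width space.
sample = ["day", "day\u200b", "night", None]
result = describe_near_matches(sample, "day")
for shown, names in result:
    print(shown, names)
```

An empty report would suggest the stored strings really are exact matches, which points away from the data and toward how the filter is evaluated.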

@ion-elgreco
Collaborator

Can you check the statistics for that column for all your add actions? Maybe there is something happening there.

Without an MRE it's difficult to help here
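
A rough sketch of what checking the statistics could look like: an equality filter can be answered from per-file min/max statistics, while match_substring cannot, so files whose recorded range excludes "day" are the interesting ones. The helper below is plain Python and not part of deltalake. It assumes rows shaped like the flattened add actions (keys such as path, min.lighting, max.lighting), and the sample rows are made up.

```python
from typing import Any


def files_excluding_value(
    add_actions: list[dict[str, Any]], column: str, value: str
) -> list[str]:
    """Return paths of files whose min/max stats say `value` cannot be present."""
    excluded = []
    for action in add_actions:
        lo = action.get(f"min.{column}")
        hi = action.get(f"max.{column}")
        # Files without statistics cannot be excluded on this basis.
        if lo is None or hi is None:
            continue
        if not (lo <= value <= hi):
            excluded.append(action["path"])
    return excluded


# Synthetic rows shaped like flattened add actions (hypothetical data).
sample = [
    {"path": "part-0.parquet", "min.lighting": "day", "max.lighting": "night"},
    {"path": "part-1.parquet", "min.lighting": "dusk", "max.lighting": "night"},
    {"path": "part-2.parquet", "min.lighting": None, "max.lighting": None},
]
suspect = files_excluding_value(sample, "lighting", "day")
print(suspect)
```

With a real table, the rows could probably be obtained with something like delta_table.get_add_actions(flatten=True).to_pylist(); the exact return type differs between deltalake versions, so treat that call as an assumption. If a file listed here is known to contain "day" rows, its statistics are the likely culprit.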
