Fix unionByName to properly handle missing columns from both DataFrames #243
When allowMissingColumns=True, the method now correctly handles missing columns from both the left and right DataFrames by:
- Adding missing columns from the right DataFrame to the left as NULL
- Ensuring all columns from the left DataFrame are present in the right
- Properly aligning column order to match Spark's behavior

This ensures the union result contains all columns from both DataFrames, with NULL values where columns are missing, matching PySpark behavior.
Pull request overview
This PR fixes the unionByName method to properly handle missing columns from both DataFrames when allowMissingColumns=True. Previously, the method only handled missing columns from the right DataFrame, but not from the left one.
Key Changes:
- Updated the logic to add NULL columns for missing columns from both DataFrames
- Column order now matches Spark's behavior by prioritizing the left DataFrame's schema
- Added a test case to verify the reversed scenario works correctly
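The alignment described above can be illustrated with a minimal pure-Python sketch. It is not the code from `dataframe.py`: the helper names `aligned_columns` and `union_by_name` are hypothetical, rows are modelled as plain dicts, `None` stands in for NULL, and the ordering (left columns first, right-only columns appended) is an assumption based on the "prioritizing the left DataFrame's schema" note.

```python
from typing import Any, Dict, List, Tuple


def aligned_columns(left_cols: List[str], right_cols: List[str]) -> List[str]:
    """Left columns first, then right-only columns in their original order (assumed ordering)."""
    seen = set(left_cols)
    return list(left_cols) + [c for c in right_cols if c not in seen]


def union_by_name(
    left_cols: List[str],
    left_rows: List[Dict[str, Any]],
    right_cols: List[str],
    right_rows: List[Dict[str, Any]],
    allow_missing_columns: bool = False,
) -> Tuple[List[str], List[Tuple[Any, ...]]]:
    """Hypothetical stand-in for unionByName on lists of dict rows."""
    if not allow_missing_columns and set(left_cols) != set(right_cols):
        raise ValueError("column sets differ and allow_missing_columns is False")
    cols = aligned_columns(left_cols, right_cols)
    # dict.get returns None for a missing key, which plays the role of NULL here.
    rows = [tuple(row.get(c) for c in cols) for row in left_rows + right_rows]
    return cols, rows


# Reversed scenario: the side with fewer columns is on the left.
cols, rows = union_by_name(
    ["name", "age"],
    [{"name": "Alice", "age": 30}],
    ["age", "name", "city"],
    [{"age": 25, "name": "Bob", "city": "Rome"}],
    allow_missing_columns=True,
)
print(cols)  # ['name', 'age', 'city']
print(rows)  # [('Alice', 30, None), ('Bob', 25, 'Rome')]
```

The left row gains a `None` for `city`, and the right row is reordered by name rather than by position, which is the case the new reversed test is meant to cover.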
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| duckdb/experimental/spark/sql/dataframe.py | Rewrote the unionByName implementation to handle missing columns bidirectionally and align columns properly before performing the union |
| tests/fast/spark/test_spark_union_by_name.py | Added test case test_union_by_name_allow_missing_cols_rev to verify the fix works when the DataFrame with fewer columns is on the left side |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Can you fix the linting and formatting errors please? See https://duckdb.org/docs/stable/dev/building/python#3-enable-pre-commit-hooks for guidance.

Hey @evertlammerts, I believe the failing test is not related to this change, since it is in the pyarrow tests and not in the PySpark API.

Thanks @mariotaddeucci!