Adding a note here for future reference: there is currently a small difference between the pandas and Spark report sample rows when some rows exist in only one dataframe. In the Spark report:
- There is an additional _merge_right column which is not in the original dataframes, which could cause a bit of confusion for users.
- We're displaying the column names as their aliases, which could also be a bit confusing. It would be best to translate them back to their original names.
Not a blocker for this, but we should open a follow-up issue to keep track of this.
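A rough sketch of what the two clean-ups above might look like, working only on the column names. The helper name `original_column_names`, the `_df2` suffix and the `_merge` prefix are assumptions taken from the Spark sample output in this issue, not DataComPy's actual internals.

```python
def original_column_names(columns, suffix="_df2", helper_prefix="_merge"):
    """Map aliased sample-row columns back to their original names.

    Hypothetical helper: columns starting with ``helper_prefix`` are treated
    as internal join indicators and left out of the mapping so they can be
    dropped before the sample rows are displayed.
    """
    mapping = {}
    for col in columns:
        if col.startswith(helper_prefix):
            continue  # internal indicator column, not part of the user's data
        if suffix and col.endswith(suffix):
            mapping[col] = col[: -len(suffix)]  # strip the alias suffix
        else:
            mapping[col] = col  # already an original name
    return mapping


# Column names as they appear in the Spark "Sample Rows Only in df2" output
spark_columns = ["id_df2", "a_df2", "b_df2", "_merge_right"]
mapping = original_column_names(spark_columns)
print(mapping)  # {'id_df2': 'id', 'a_df2': 'a', 'b_df2': 'b'}
```

The mapping could then be applied to the sample frame with something like `sample[list(mapping)].rename(columns=mapping)`. Note that a prefix check like this would also drop a genuine user column whose name starts with `_merge`, so a real fix would probably want to track the helper column explicitly.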
import numpy as np
import pandas as pd
import pyspark.pandas as ps

# df2 has one extra row (id=6) that does not exist in df1
pdf1 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5], "a": [2, 3, 2, 3, 2], "b": ["a", "b", "c", "d", ""]})
pdf2 = pd.DataFrame.from_dict({"id": [1, 2, 3, 4, 5, 6], "a": [2, 3, 2, 3, 2, np.nan], "b": ["a", "b", "c", "d", "", pd.NA]})
df1 = ps.DataFrame(pdf1)
df2 = ps.DataFrame(pdf2)

Spark
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
   id_df2  a_df2 b_df2  _merge_right
5       6    NaN  None          True
Pandas
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1        3     5
1       df2        3     6
Column Summary
--------------
Number of columns in common: 3
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 1
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 5
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 3
Total number of values which compare unequal: 0
Columns with Unequal Values or Types
------------------------------------
  Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0      a     int64   float64          0       0.0            0
Sample Rows with Unequal Values
-------------------------------
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
   id   a     b
0   6 NaN  <NA>
Originally posted by @jdawang in #275 (review)