Current Behaviour
As suggested in the issue, I wrote the following code:
import pandas as pd
import numpy as np
import ydata_profiling
pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(0, 10) for x in range(10)],
        "col2": [np.random.randint(0, 100) for x in range(10)],
    }
)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)
df.show()
report = ydata_profiling.ProfileReport(df, minimal=True)
report.to_file("report.html")
report.dump("report")when submit with spark, dump throw an error
Generate report structure: 100%|██████████| 1/1 [00:00<00:00, 2.45it/s]
Render HTML: 100%|██████████| 1/1 [00:00<00:00, 6.73it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 1079.34it/s]
Traceback (most recent call last):
File "/data/westfly/sparkjob/sparkjoin/x.py", line 21, in <module>
report.dump("report")
File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/serialize_report.py", line 128, in dump
output_file.write_bytes(self.dumps())
File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/serialize_report.py", line 39, in dumps
self.df_hash,
File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/profile_report.py", line 282, in df_hash
self._df_hash = hash_dataframe(self.df)
File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/utils/dataframe.py", line 201, in hash_dataframe
hash_values = "\n".join(hash_pandas_object(df).values.astype(str))
File "/data/westfly/.local/lib/python3.10/site-packages/pandas/core/util/hashing.py", line 178, in hash_pandas_object
raise TypeError(f"Unexpected type for hashing {type(obj)}")
TypeError: Unexpected type for hashing <class 'pyspark.sql.dataframe.DataFrame'>

Expected Behaviour
ProfileReport.dump should succeed.
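For context, the traceback points at hash_dataframe in ydata_profiling/utils/dataframe.py, which passes the profiled DataFrame straight to pandas' hash_pandas_object, so a pyspark.sql.DataFrame cannot be hashed. Below is a rough sketch of what a Spark-aware hash helper could look like; this is only an illustration under my own assumptions (the function name and the 1000-row sample cap are made up), not the library's actual code.

# Hypothetical sketch only: a hash helper that accepts both pandas and Spark
# DataFrames. Not ydata-profiling's actual implementation.
import hashlib

import pandas as pd


def hash_dataframe_spark_aware(df) -> str:
    """Return a stable digest for a pandas or pyspark.sql DataFrame."""
    try:
        from pyspark.sql import DataFrame as SparkDataFrame
    except ImportError:
        SparkDataFrame = None

    if SparkDataFrame is not None and isinstance(df, SparkDataFrame):
        # Hash the schema plus a bounded sample converted to pandas, so a
        # large Spark DataFrame is not fully collected to the driver.
        sample = df.limit(1000).toPandas()
        payload = df.schema.json() + "\n".join(
            pd.util.hash_pandas_object(sample).values.astype(str)
        )
    else:
        # Same idea as the existing pandas path shown in the traceback.
        payload = "\n".join(pd.util.hash_pandas_object(df).values.astype(str))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

As a stopgap on my side, passing df.toPandas() to ProfileReport instead of the Spark DataFrame lets dump() succeed, but that collects all data to the driver and defeats the point of the Spark backend.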
Data Description
none
Code that reproduces the bug
import pandas as pd
import numpy as np
import ydata_profiling
pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(0, 10) for x in range(10)],
        "col2": [np.random.randint(0, 100) for x in range(10)],
    }
)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)
df.show()
report = ydata_profiling.ProfileReport(df, minimal=True)
report.to_file("report.html")
report.dump("report")pandas-profiling version
v4.18.0
Dependencies
pandas==2.3.3
numpy==1.26.1
OS
No response
Checklist
- There is not yet another bug report for this issue in the issue tracker
- The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- The issue has not been resolved by the entries listed under Common Issues.