Bug Report: Unexpected type for hashing <class 'pyspark.sql.dataframe.DataFrame'> at ProfileReport.dump #1801

@westfly

Current Behaviour

As the issue title suggests, I wrote code like this:

import pandas as pd
import numpy as np
import ydata_profiling


pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(0, 10) for x in range(10)],
        "col2": [np.random.randint(0, 100) for x in range(10)],
    }
)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)

df.show()

report = ydata_profiling.ProfileReport(df, minimal=True)
report.to_file("report.html")
report.dump("report")

When the script is submitted to Spark, to_file completes but dump throws an error:

Generate report structure: 100%|██████████| 1/1 [00:00<00:00,  2.45it/s]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  6.73it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 1079.34it/s]
Traceback (most recent call last):
  File "/data/westfly/sparkjob/sparkjoin/x.py", line 21, in <module>
    report.dump("report")
  File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/serialize_report.py", line 128, in dump
    output_file.write_bytes(self.dumps())
  File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/serialize_report.py", line 39, in dumps
    self.df_hash,
  File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/profile_report.py", line 282, in df_hash
    self._df_hash = hash_dataframe(self.df)
  File "/data/westfly/sparkjob/sparkjoin/ydata_profiling/utils/dataframe.py", line 201, in hash_dataframe
    hash_values = "\n".join(hash_pandas_object(df).values.astype(str))
  File "/data/westfly/.local/lib/python3.10/site-packages/pandas/core/util/hashing.py", line 178, in hash_pandas_object
    raise TypeError(f"Unexpected type for hashing {type(obj)}")
TypeError: Unexpected type for hashing <class 'pyspark.sql.dataframe.DataFrame'>
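
The traceback shows hash_dataframe handing the Spark DataFrame directly to pandas' hash_pandas_object, which accepts only pandas objects. One possible direction is a fallback hash that does not depend on pandas at all. The following is a minimal sketch under stated assumptions, not the library's implementation: hash_rows is a hypothetical helper, and plain tuples stand in for Spark rows so the example runs without a Spark session.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def hash_rows(
    schema: Sequence[Tuple[str, str]],
    rows: Iterable[Sequence[object]],
) -> str:
    """Return a SHA-256 hex digest over a schema and a stream of rows.

    Hypothetical helper for illustration; it is not part of ydata_profiling.
    """
    digest = hashlib.sha256()
    # Hash column names and types first, so frames with equal values
    # but different schemas produce different digests.
    for name, dtype in schema:
        digest.update(f"{name}:{dtype}\n".encode("utf-8"))
    # Feed rows one at a time so the whole dataset is never held in memory.
    for row in rows:
        line = "\x1f".join(repr(value) for value in row)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Plain tuples stand in for Spark rows in this sketch.
schema = [("col1", "bigint"), ("col2", "bigint")]
rows = [(1, 10), (2, 20), (3, 30)]
frame_hash = hash_rows(schema, rows)
print(frame_hash)

# Assumption: for a Spark DataFrame `df`, the inputs could be taken from
# `df.dtypes` (a list of (name, type) pairs) and `df.toLocalIterator()`:
#     frame_hash = hash_rows(df.dtypes, df.toLocalIterator())
```

The digest is order-sensitive, and Spark does not guarantee a stable row order without an explicit sort, so a real fix would probably need to sort the frame or use an order-independent combination of row hashes.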

Expected Behaviour

ProfileReport.dump succeeds for a Spark DataFrame.

Data Description

none

Code that reproduces the bug

import pandas as pd
import numpy as np
import ydata_profiling


pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(0, 10) for x in range(10)],
        "col2": [np.random.randint(0, 100) for x in range(10)],
    }
)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pdf)

df.show()

report = ydata_profiling.ProfileReport(df, minimal=True)
report.to_file("report.html")
report.dump("report")

pandas-profiling version

v4.18.0

Dependencies

pandas==2.3.3
numpy==1.26.1

OS

No response

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.
