Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: cov buggy when having NaT in column #53115

Open
3 tasks done
phofl opened this issue May 5, 2023 · 4 comments
Open
3 tasks done

BUG: cov buggy when having NaT in column #53115

phofl opened this issue May 5, 2023 · 4 comments
Labels
Bug cov/corr Error Reporting Incorrect or improved errors from pandas

Comments

@phofl
Copy link
Member

phofl commented May 5, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df = DataFrame({
    "a": [1, 2, 3],
    # "b": [pd.Timestamp("1970-01-01 00:00:00.000000001"), pd.Timestamp("1970-01-01 00:00:00.000000002"), pd.NaT],
    "b": [1, 2, np.nan],
    "c": [1, 2, 3],
})
df.cov()

Issue Description

If I understand this correctly, both calculations should be equivalent. The np.nan case returns

     a    b    c
a  1.0  0.5  1.0
b  0.5  0.5  0.5
c  1.0  0.5  1.0

while the NaT case returns

              a             b             c
a  1.000000e+00 -4.611686e+18  1.000000e+00
b -4.611686e+18  2.835686e+37 -4.611686e+18
c  1.000000e+00 -4.611686e+18  1.000000e+00

Both are equivalent without any missing values

Expected Behavior

Same result for both

Installed Versions

main

@phofl phofl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 5, 2023
@jbrockmendel
Copy link
Member

  1. I think in .cov we do a self.to_numpy(dtype=np.float64).
    a) i think that should raise on a dt64 column
    b) the NaT gets converted to -9.22337204e+18 where it should be np.nan
  2. .cov definitely shouldn't work with a dt64 column (as df["b"].var() correctly raises); .corr should be OK

@phofl
Copy link
Member Author

phofl commented May 6, 2023

Raising sounds fine.

Timedelta has the same problem.

corr has the same problem for Timestamp and Timedelta, so this needs fixing.

@lithomas1 lithomas1 added Error Reporting Incorrect or improved errors from pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 8, 2023
@fbourgey
Copy link
Contributor

@phofl @jbrockmendel
I am thinking of working on this. To sum up, we need to

  • raise an error with df.cov() when there is a dt64 column? I don't get why df.corr() should be OK.

@jbrockmendel
Copy link
Member

I don't get why df.corr() should be OK

bc .corr() has units of td64 while .cov() has units of td64^2 which isn't a thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug cov/corr Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

No branches or pull requests

4 participants