GH-49168: [Python] map date32/date64 to datetime64[s] in to_pandas_dtype#49210

Open
PratyushD21 wants to merge 2 commits into apache:main from PratyushD21:fix-date-to-pandas-dtype-seconds

Conversation

@PratyushD21 PratyushD21 commented Feb 10, 2026

Rationale for this change

Fixes #49168

What changes are included in this PR?

As raised in the issue "[Python] Consider pa.date32/64.to_pandas_dtype() returning datetime64[s] instead of datetime64[ms]", it was observed that no ms timestamp was supported in parquet. To address that, pa.date32().to_pandas_dtype() and pa.date64().to_pandas_dtype() will return datetime64[s]. The following files are changed:

python/pyarrow/types.pxi
python/pyarrow/tests/test_schema.py

Are these changes tested?

Yes, the schema test file python/pyarrow/tests/test_schema.py has been updated to include this change.

Are there any user-facing changes?

Yes, pa.date32().to_pandas_dtype() and pa.date64().to_pandas_dtype() now return datetime64[s] instead of datetime64[ms].

This PR includes breaking changes to public APIs. (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.)

@github-actions

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@alippai
Contributor

alippai commented Feb 10, 2026

What about datetime64[D], wouldn’t that be more appropriate?

@PratyushD21
Author

Hi @alippai, it seems that the approach of including a time component when mapping date32/64 was settled in the #35656 discussion, which has already been merged.

@alippai
Contributor

alippai commented Feb 10, 2026

@PratyushD21 good point, thanks. @mroeschke will know best; I remember pandas has also changed how it handles days since 2023.

@mroeschke
Contributor

> What about datetime64[D], wouldn’t that be more appropriate?

Yeah, pandas does not natively support storing datetime64[D]; [s] is the coarsest resolution supported.

The changes look good, but I don't have permissions to run the CI in Arrow. You may want to build Arrow/PyArrow from source and run the test suite to see if the other tests pass, especially the parquet tests, given the history in #35656.
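
As an illustration of the point about day resolution, the sketch below hands pandas a NumPy array with dtype datetime64[D] and prints the dtype pandas ends up storing. It assumes NumPy and pandas are installed. Which unit pandas picks depends on the pandas version, so the result is printed rather than stated here.

```python
import numpy as np
import pandas as pd

# NumPy itself can hold day-resolution datetimes.
days = np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[D]")

# pandas has no day-resolution datetime dtype, so it casts to a unit it
# does support. The chosen unit is version dependent, so it is printed
# rather than assumed.
series = pd.Series(days)
print(days.dtype, "->", series.dtype)
```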

@PratyushD21
Author

Hi @mroeschke, I built from source and ran the test suite, including the parquet tests, with python -m pytest pyarrow and python -m pytest -q pyarrow/tests/parquet. The tests passed (exit status 0). Should I push an empty commit to trigger CI in Arrow?

@rok
Member

rok commented Feb 10, 2026

I kicked off the CI, ping me if you need to start it again.

@PratyushD21
Author

@rok would it be possible to re-trigger the CI? I have added an additional check for pandas<=2, so the CI should pass now.

AlenkaF changed the title from "PYARROW: map date32/date64 to datetime64[s] in to_pandas_dtype" to "GH-49168: [Python] map date32/date64 to datetime64[s] in to_pandas_dtype" on Feb 11, 2026
@github-actions

⚠️ GitHub issue #49168 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche
Member

I don't think we should change this for to_pandas_dtype() if not changing it for to_pandas() itself? (i.e. both should ideally match?)
(or, if we want to change this, we should probably change to_pandas() as well? Or what is the exact rationale or use case for wanting to change to_pandas_dtype())

Member

@raulcd raulcd left a comment

I am sorry but I don't understand this. Can someone explain to me why returning s is better than ms in this specific case? I don't use this specific API and I might be missing something obvious (again, I'm no expert in this area), but why is this change preferred, rather than being just a matter of preference?

github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Feb 11, 2026
@mroeschke
Contributor

> I don't think we should change this for to_pandas_dtype() if not changing it for to_pandas() itself? (i.e. both should ideally match?)

Ah sure, I think there should be consistency in both APIs.

For pyarrow.DataType.to_pandas_dtype() I would interpret "Return the equivalent NumPy / Pandas dtype", as mentioned in the docs, as the "closest representable" type for types with no exact equivalent, like date types; therefore, datetime64[s] makes more sense to me compared to datetime64[ms] because of second resolution's capacity to represent a larger amount of dates that a date32/64 might represent. This was at least our expectation in cuDF.

Conversely, it appears ms was chosen in #35656 due to parquet compatibility, IIUC? And if so, does that dependence between pandas ops and parquet still exist today?
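
For a rough sense of the ranges involved, the back-of-the-envelope sketch below computes how many years either side of 1970-01-01 each representation can span. It assumes a mean Gregorian year of 365.2425 days and ignores sentinel values such as NaT, so the figures are approximate. The helper name span_in_years is made up for this illustration.

```python
# Approximate representable span, in years either side of 1970-01-01,
# for the integer widths and units discussed above.
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.2425  # mean Gregorian year (an approximation)

INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def span_in_years(max_value: int, unit_in_seconds: float) -> float:
    """Convert the largest storable count of a unit into years."""
    return max_value * unit_in_seconds / SECONDS_PER_DAY / DAYS_PER_YEAR


spans = {
    "date32 (int32 days)": span_in_years(INT32_MAX, SECONDS_PER_DAY),
    "date64 / datetime64[ms] (int64 ms)": span_in_years(INT64_MAX, 1e-3),
    "datetime64[s] (int64 s)": span_in_years(INT64_MAX, 1),
}

for label, years in spans.items():
    print(f"{label}: about +/-{years:.3g} years")
```

Under these assumptions the second-resolution span is 1000 times the millisecond span, and the int32 day count of date32 is the narrowest of the three.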

@jorisvandenbossche
Member

> datetime64[s] makes more sense to me compared to datetime64[ms] because of second resolution's capacity to represent a larger amount of dates that a date32/64 might represent.

For date32[day] that indeed makes sense, but for date64[ms], using datetime64[ms] ensures that the conversion is more likely to be zero-copy
(and I think that at that point we maybe chose to have date32 use ms as well for consistency, since a conversion for that type is needed anyhow)
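
The zero-copy argument can be illustrated with NumPy alone. The sketch below is not pyarrow's conversion code; it uses a plain int64 array as a stand-in for a date64 buffer (milliseconds since the epoch) and shows that reinterpreting it as datetime64[ms] shares memory, while converting to datetime64[s] has to allocate a new buffer.

```python
import numpy as np

# Stand-in for a date64 buffer: int64 milliseconds since the epoch.
millis = np.array([0, 86_400_000, 172_800_000], dtype=np.int64)

# Same unit: the int64 buffer can be reinterpreted in place.
as_ms = millis.view("datetime64[ms]")

# Different unit: every value must be rescaled into a new buffer.
as_s = as_ms.astype("datetime64[s]")

print(np.shares_memory(millis, as_ms))  # True: no copy was made
print(np.shares_memory(millis, as_s))   # False: a new buffer was allocated
```

A date32 buffer holds int32 day counts, so it cannot be reinterpreted as any 64-bit datetime64 unit without a conversion, which matches the remark that a conversion is needed for that type anyhow.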

@mroeschke
Contributor

> ensures that the conversion is more likely to be zero-copy

OK that's reasonable. I understand if ultimately the change is not made; we just found it unexpected in cuDF

@PratyushD21
Author

Hi @mroeschke @jorisvandenbossche, if I understand correctly, should we make the same change to the to_pandas() method as well, or do we plan to keep the current datetime64[ms] logic? Thank you for your input.

Development

Successfully merging this pull request may close these issues.

[Python] Consider pa.date32/64.to_pandas_dtype() returning datetime64[s] instead of datetime64[ms]

6 participants