
Relax nanosecond datetime restriction in CF time decoding #9618

Open · wants to merge 61 commits into base: main
Conversation

@kmuehlbauer (Contributor) commented Oct 13, 2024

This is another attempt to resolve #7493. This goes a step further than #9580.

The idea of this PR is to automatically infer the needed resolution for decoding/encoding and only keep the constraints pandas imposes ("s" as the lowest resolution, "ns" as the highest). There is still the idea of a default resolution, but it should only take precedence if it doesn't clash with the automatic inference; this can be discussed, though. Update: as a first try, I've implemented a time-unit kwarg to set a default resolution on decode, which overrides the inferred resolution only towards a higher resolution (e.g. 's' -> 'ns').
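
To make this concrete, here is a rough sketch of the intended rule (hypothetical helper, not the actual code in this PR):

# hypothetical sketch of the rule described above -- not the implementation in this PR
_UNITS = ["s", "ms", "us", "ns"]  # resolutions supported by pandas, coarse to fine

def resolve_time_unit(inferred: str, default: str = "ns") -> str:
    """Return the finer of the inferred and default resolutions."""
    # the default only wins when it is finer than what was inferred,
    # e.g. inferred "s" + default "ns" -> "ns", but inferred "us" + default "s" -> "us"
    return _UNITS[max(_UNITS.index(inferred), _UNITS.index(default))]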

For sanity checking, and also for my own good, I've created a documentation page on time-coding in the internal dev section. Any suggestions (especially grammar) or ideas for enhancements are much appreciated.

There still might be room for consolidation of functions/methods (mostly in coding/times.py), but I have to leave it alone for some days. I went down that rabbit hole and need to relax, too 😬.

Looking forward to getting your insights here, @spencerkclark, @ChrisBarker-NOAA, @pydata/xarray.

Todo:

  • Floating point handling
  • Handling in the Variable constructor
  • Update decoding tests to iterate over time_units (where appropriate)
  • ...

@kmuehlbauer (Contributor, Author)

Nice, mypy 1.12 is out and breaks our typing, 😭.

@TomNicholas (Member)

Nice, mypy 1.12 is out and breaks our typing, 😭

Can we pin it in the CI temporarily?

@TomNicholas mentioned this pull request on Oct 14, 2024
@kmuehlbauer (Contributor, Author)

Can we pin it in the CI temporarily?

Yes, 1.11.2 was the last version.

@kmuehlbauer marked this pull request as ready for review on October 14, 2024, 18:05
@kmuehlbauer (Contributor, Author)

This is now ready for a first round of review. I think it is already in quite a usable state.

But no rush, this should be thoroughly tested.

@spencerkclark (Member)

Sounds good @kmuehlbauer! I’ll try and take an initial look this weekend.

@ChrisBarker-NOAA

Not to throw too much of a wrench in the works here -- so feel free to disregard, but there's an issue I've faced with (single precision) float time encoding:

Folks (carelessly :-( ) sometimes encode times as "days since ..." using a single precision float. The problem here is not unnecessary precision, as you get with double, but too little -- if you go more than a few years out, you lose second precision. (The key problem is that the precision of a float time is a function of its magnitude -- not good for this use case.)

The end result is that I get things like model timesteps that are supposed to be hourly reporting as, e.g., 12:00:18 rather than 12:00:00.

One way I've dealt with this is rounding to the minute, or even to hours (if I know the output is hourly), or perhaps to 10 minutes.

Could/should xarray provide a facility for doing this? Maybe?

I guess what I'm proposing is that there be some way to tell xarray to store / save a time variable with e.g. second precision, but to round it to something more coarse when decoding.

Maybe this could even be automatic/inferred:

If a time is in float "days since ..." it almost certainly is NOT millisecond precision, or even second precision -- and you could even look at the values (the first one?) and see what the minimum precision is for that timespan.

If I've done my math right, a float can only store second precision for a little over three years. So if the values are greater than three years, you don't have second precision.

Anyway, maybe way too much magic, but it would be nice for my use cases :-)

Example:

# 15 min timestep
In [57]: dates
Out[57]: 
[datetime.datetime(2024, 1, 1, 0, 0),
 datetime.datetime(2024, 1, 1, 0, 15),
 datetime.datetime(2024, 1, 1, 0, 30),
 datetime.datetime(2024, 1, 1, 0, 45)]

# common choice of units (though a bad one :-( )
In [58]: units
Out[58]: 'days since 1970-01-01T00:00:00'

# convert to numbers; float64 is used by default
In [59]: nums_double = nc4.date2num(dates, units)

# truncate to float32
In [60]: nums_float = nums_double.astype(np.float32)

# convert back to datetimes:
In [61]: dates_float = nc4.num2date(nums_float, units)

In [62]: dates_float
Out[62]: 
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 14, 3, 750000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 56, 250000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
In [67]: [str(dt) for dt in dates_float]
Out[67]: 
['2024-01-01 00:00:00',
 '2024-01-01 00:14:03.750000',
 '2024-01-01 00:30:56.250000',
 '2024-01-01 00:45:00']

Ouch! So what were 15-minute timesteps are now off by about one minute -- and what's too bad is that rounding to the minute wouldn't be right either -- you'd need to round to maybe 5 minutes?
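
For reference, the spacing of float32 values at this offset can be checked directly with numpy (just an illustrative check, nothing this PR needs to add):

import numpy as np

# offset of 2024-01-01 in "days since 1970-01-01", truncated to float32
offset_days = np.float32(
    (np.datetime64("2024-01-01") - np.datetime64("1970-01-01")) / np.timedelta64(1, "D")
)
# distance between adjacent representable float32 values, converted to seconds
ulp_seconds = np.spacing(offset_days) * 86400
print(ulp_seconds)  # ~169 s, so values can be off by up to ~84 s -- consistent with the ~56 s errors above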

Anyway, maybe this simply isn't xarray's problem to solve -- data providers shouldn't make such mistakes :-(

@spencerkclark (Member)

@ChrisBarker-NOAA yeah, I agree this kind of situation is annoying, but my feeling is that trying to fix this automatically would be too much magic. Xarray has convenient functionality for rounding times, which can be used to correct this explicitly—that would be my preference. E.g. for your example it would look like:

>>> decoded
<xarray.DataArray 'time' (time: 4)> Size: 32B
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 14, 3, 750000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 56, 250000, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
Dimensions without coordinates: time
>>> decoded.dt.round("5min")
<xarray.DataArray 'round' (time: 4)> Size: 32B
array([cftime.DatetimeGregorian(2024, 1, 1, 0, 0, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 15, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 30, 0, 0, has_year_zero=False),
       cftime.DatetimeGregorian(2024, 1, 1, 0, 45, 0, 0, has_year_zero=False)],
      dtype=object)
Dimensions without coordinates: time

@ChrisBarker-NOAA

Xarray has convenient functionality for rounding times

Oh, nice: I had missed that! -- you're probably right, too much magic to do for people.

@kmuehlbauer (Contributor, Author)

@spencerkclark @ChrisBarker-NOAA I've implemented automated decoding of floating point data to the needed resolution, even when the wanted resolution does not suffice.

Unfortunately the behaviour outlined above is too involved to be put into the decoder. Nevertheless, maybe we can distill some best practices from your vast experience with data, @ChrisBarker-NOAA, and create a nice example of how to handle these difficulties?
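
Roughly, the decoder steps to a finer unit whenever the floating point offsets still carry fractional parts at the requested one. A hedged sketch of that idea (illustrative only, not the actual code in this PR):

import numpy as np

def needed_resolution(offsets_in_seconds, wanted="s"):
    """Return the coarsest unit (no coarser than `wanted`) that holds the offsets without loss."""
    factors = {"s": 1, "ms": 1e3, "us": 1e6, "ns": 1e9}
    candidates = list(factors)[list(factors).index(wanted):]
    for unit in candidates:
        scaled = np.asarray(offsets_in_seconds, dtype="float64") * factors[unit]
        if np.all(scaled == np.round(scaled)):
            return unit
    return "ns"

needed_resolution([0.0, 900.0, 1800.0])      # -> "s"
needed_resolution([0.0, 900.25], wanted="s") # -> "ms", since "s" cannot represent 0.25 s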

@ChrisBarker-NOAA

create a nice example how to handle these difficulties?

Sure -- where would be a good home for that?

@kmuehlbauer (Contributor, Author)

Not sure, but https://docs.xarray.dev/en/stable/user-guide/time-series.html could have a dedicated floating point date section.

@kmuehlbauer changed the title from "Relax nanosecond datetime restriction in CF time coding" to "Relax nanosecond datetime restriction in CF time decoding" on Nov 21, 2024
@kmuehlbauer (Contributor, Author)

I've added a time_unit kwarg to decode_cf and the subsequent functionality.

But instead of adding that kwarg, we could slightly overload decode_times to take one of "s", "ms", "us", "ns", with "ns" as the default.

This would have the positive effect that we wouldn't need the additional kwarg and wouldn't have to distribute it through the backends.

  • decode_times=None - directs to decode_times=True
  • decode_times=False - no decoding
  • decode_times=True - decode times with default value ("ns")
  • decode_times="s" - decode times to at least "s"
  • decode_times="ms" - decode times to at least "ms"
  • decode_times="us" - decode times to at least "us"
  • decode_times="ns" - decode times to "ns"

We could guard decode_times=None and decode_times=True with a DeprecationWarning and announce our new default (e.g. "us") in the warning message.

This approach would be fully backwards compatible. It advertises the change via a DeprecationWarning in normal operation, and also if issues appear in the decoding steps.
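
A minimal sketch of how the overloaded kwarg could be normalized internally (illustrative names only, not actual xarray code):

import warnings

def _normalize_decode_times(decode_times):
    """Map the overloaded decode_times value to (decode: bool, time_unit)."""
    if decode_times is None or decode_times is True:
        warnings.warn(
            "decode_times=True currently decodes to 'ns'; the default resolution "
            "may change in a future version (e.g. to 'us')",
            DeprecationWarning,
        )
        return True, "ns"
    if decode_times is False:
        return False, None
    if decode_times in ("s", "ms", "us", "ns"):
        return True, decode_times  # decode to at least this resolution
    raise ValueError(f"invalid decode_times value: {decode_times!r}")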

If this is something which makes sense @shoyer, @dcherian, @spencerkclark, I'd add the needed changes to this PR.

@dcherian
Copy link
Contributor

Alternatively, we could make small progress on #4490 and have

from xarray.coding import DatetimeCoder

ds = xr.open_mfdataset(..., decode_times=DatetimeCoder(units="ms"))

In the long term, it seems nice to have the default use the "natural" units, i.e. "h" for units="hours since ..." and apparently even "M" for units="months since ..." (!!)

https://numpy.org/doc/stable/reference/arrays.datetime.html#basic-datetimes
The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’), ...
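
A rough sketch of what such a "natural units" mapping could look like (hypothetical helper, not an existing xarray API; note that pandas itself only supports "s" through "ns"):

_CF_TO_NUMPY_UNIT = {
    "days": "D", "hours": "h", "minutes": "m", "seconds": "s",
    "milliseconds": "ms", "microseconds": "us", "months": "M",
}

def natural_unit(cf_units: str) -> str:
    """E.g. 'hours since 1970-01-01' -> 'h'."""
    return _CF_TO_NUMPY_UNIT[cf_units.split(" since ")[0].strip().lower()]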
