-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] Update ASCII storage for dataframes #273
Conversation
Status note: the CI failure is puzzling me as unit tests are passing for me locally 👀 |
3edca86
to
ad10f3e
Compare
ad10f3e
to
25927cc
Compare
1b487b5
to
f3a0d62
Compare
Sorry for the delay. @Shelnutt2 @nguyenv this is ready for review. Failure of Windows |
a75f33e
to
4da886a
Compare
@nguyenv @ihnorton @Shelnutt2 ping 🙏 |
|
||
# TileDB string dims are ASCII not UTF-8. Decode them so they readback not like | ||
# `b"AKR1C3"` but rather like `"AKR1C3"`. Update as of | ||
# https://github.com/TileDB-Inc/TileDB-Py/pull/1304 these dims will read back OK. | ||
retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name].tolist() | ||
return [e.decode() for e in retval] | ||
|
||
retval = [e.decode() for e in retval] | ||
|
||
if len(retval) > 0 and isinstance(retval[0], bytes): | ||
return [e.decode() for e in retval] | ||
else: | ||
# list(...) is there to appease the linter which thinks we're returning `Any` | ||
return list(retval) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could something like this work?
retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name]
return np.frombuffer(retval, dtype="U").tolist()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no sorry @nguyenv !
return np.frombuffer(retval, dtype="U").tolist()
ValueError: itemsize cannot be zero in type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But regardless all I'm doing is putting an if
around known-good code, invoking it when needed for older arrays which predate this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nguyenv what works remains to accept or reject this PR?
4da886a
to
eda0866
Compare
Thanks @nguyenv ! :) |
On collaboration with @nguyenv on TileDB-Inc/TileDB-Py#1304 -- new/upated feedback (thanks @nguyenv !) is that all along we can obtain reduce type-conversion at readback by a one-line change in this package.
Note this means that while before this PR we can store (encode/decode) user-supplied Unicode data in dataframes, as of this PR we no longer can. That ability will be restored in a future core release -- perhaps early 2023. Meanwhile unit tests on this PR that used to validate the ability to encode/decode user-supplied Unicode data in dataframes have been commented out, to accommodate the feature loss.
Summary:
"ascii"
per se, notbytes
and notstr
"ascii"
, conditionally retain the decode-on-readback logic which is essential since otherwiseb"foo" != "foo"