Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Update ASCII storage for dataframes #273

Merged
merged 7 commits into from
Sep 23, 2022
Merged

[python] Update ASCII storage for dataframes #273

merged 7 commits into from
Sep 23, 2022

Conversation

johnkerl
Copy link
Member

@johnkerl johnkerl commented Aug 31, 2022

On collaboration with @nguyenv on TileDB-Inc/TileDB-Py#1304 -- new/upated feedback (thanks @nguyenv !) is that all along we can obtain reduce type-conversion at readback by a one-line change in this package.

Note this means that while before this PR we can store (encode/decode) user-supplied Unicode data in dataframes, as of this PR we no longer can. That ability will be restored in a future core release -- perhaps early 2023. Meanwhile unit tests on this PR that used to validate the ability to encode/decode user-supplied Unicode data in dataframes have been commented out, to accommodate the feature loss.

Summary:

  • Write string dims/attrs as "ascii" per se, not bytes and not str
  • Since old SOMAs written before this PR won't have all "ascii", conditionally retain the decode-on-readback logic which is essential since otherwise b"foo" != "foo"
  • Please see also the clear presentation at Queryability of dataframe attribute columns #99

@johnkerl
Copy link
Member Author

Status note: the CI failure is puzzling me as unit tests are passing for me locally 👀

@johnkerl johnkerl force-pushed the kerl/ascii branch 2 times, most recently from 3edca86 to ad10f3e Compare September 2, 2022 21:18
@johnkerl johnkerl changed the base branch from main to main-old September 16, 2022 00:36
@johnkerl johnkerl force-pushed the kerl/ascii branch 3 times, most recently from 1b487b5 to f3a0d62 Compare September 16, 2022 01:00
@johnkerl
Copy link
Member Author

johnkerl commented Sep 16, 2022

Sorry for the delay.

@Shelnutt2 @nguyenv this is ready for review.

Failure of Windows R-CMD-check.yaml is an unrelated KP.

@johnkerl johnkerl force-pushed the kerl/ascii branch 9 times, most recently from a75f33e to 4da886a Compare September 16, 2022 20:58
@johnkerl
Copy link
Member Author

@nguyenv @ihnorton @Shelnutt2 ping 🙏

Comment on lines +78 to +90

# TileDB string dims are ASCII not UTF-8. Decode them so they readback not like
# `b"AKR1C3"` but rather like `"AKR1C3"`. Update as of
# https://github.com/TileDB-Inc/TileDB-Py/pull/1304 these dims will read back OK.
retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name].tolist()
return [e.decode() for e in retval]

retval = [e.decode() for e in retval]

if len(retval) > 0 and isinstance(retval[0], bytes):
return [e.decode() for e in retval]
else:
# list(...) is there to appease the linter which thinks we're returning `Any`
return list(retval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could something like this work?

retval = A.query(attrs=[], dims=[self.dim_name])[:][self.dim_name]
return np.frombuffer(retval, dtype="U").tolist()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no sorry @nguyenv !

    return np.frombuffer(retval, dtype="U").tolist()
ValueError: itemsize cannot be zero in type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But regardless all I'm doing is putting an if around known-good code, invoking it when needed for older arrays which predate this PR!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nguyenv what works remains to accept or reject this PR?

@johnkerl
Copy link
Member Author

Thanks @nguyenv ! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants