GCS support for FdReader used in (downstream) TFDS #31

Open
kentslaney opened this issue Jan 9, 2025 · 2 comments

@kentslaney

Following up on google/array_record/issues/120

The following code fails when it shouldn't:

import tensorflow_datasets as tfds
ds = tfds.data_source("ref_coco", data_dir="gs://ref_coco", try_gcs=True)
next(iter(ds['train']))

Colab link

The path gets handed over to a Riegeli FdReader (source) and the array_record maintainers have pointed me upstream. I can't find GCS support mentioned in the docs, but given TFDS provides a try_gcs argument, it seems like GCS buckets should be supported somewhere along the line.
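
In the meantime, one possible workaround (a minimal, untested sketch under the assumption that only the read path is the problem, not a supported fix) is to mirror the prepared dataset onto local disk and point data_dir at the copy, so FdReader only ever sees local paths. The /tmp destination below is a placeholder, and the copy requires the gsutil CLI plus enough local disk space:

import subprocess
import tensorflow_datasets as tfds

# Copy the bucket contents to local disk (creates /tmp/ref_coco).
subprocess.run(["gsutil", "-m", "cp", "-r", "gs://ref_coco", "/tmp"], check=True)

# Read from the local mirror instead of the GCS path.
ds = tfds.data_source("ref_coco", data_dir="/tmp/ref_coco")
next(iter(ds["train"]))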

@QrczakMK (Member) commented Jan 9, 2025

Internally at Google, Riegeli supports GCS (via riegeli::GcsReader), but this is currently not open-sourced.

I will see how feasible that would be.

@carlthome

What's the reasoning for keeping it closed source? As a heavy user of Google Cloud Dataflow for processing large datasets, I was surprised to get stuck on TensorFlow Datasets code that stopped working when writing TFRecord files to Google Cloud Storage (GCS), and I was hoping that migrating to array_record would alleviate those pain points. I'm also surprised to read that it's not clear whether TFDS will get GCS support with array_record.
