
BeamWriter hits "Record exceeds maximum record size" in Dataflow with autosharding #10995

Open
carlthome opened this issue Jan 29, 2025 · 0 comments
Labels
bug Something isn't working

carlthome commented Jan 29, 2025

For TFDS 4.9.7 on Dataflow 2.60.0, I have a company-internal Dataflow job that fails. The input collection to train_write/GroupShards reports:

Elements added: 332,090
Estimated size: 1.74 TB

while its output collection reports:

Elements added: 2
Estimated size: 1.8 GB

before the job fails on the next element with:

"E0123 207 recordwriter.cc:401] Record exceeds maximum record size (1096571470 > 1073741823)."

Workaround

By installing a TFDS prerelease that includes 3700745 and setting --num_shards=4096 (auto-detection chose 2048), the DatasetBuilder runs to completion on Dataflow. I'm curious why the auto-detection didn't choose more file shards, however, since all training examples in this DatasetBuilder should be roughly the same size.
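Back-of-the-envelope arithmetic with the numbers above (sizes as reported in the Dataflow UI; even distribution of bytes across shards is an assumption) shows how little slack 2048 shards leaves against the ~1 GiB record limit:

```python
# Numbers from this issue; assumes bytes are spread evenly across shards.
total_size = 1.74e12        # estimated size of the input collection, bytes
max_record = 1073741823     # limit from the recordwriter.cc error (~1 GiB)

for num_shards in (2048, 4096):
    avg_shard = total_size / num_shards
    print(f"{num_shards} shards: avg {avg_shard / 1e9:.2f} GB "
          f"({avg_shard / max_record:.0%} of the record limit)")
```

At 2048 shards the average shard already sits near 79% of the record limit, so any shard about 1.26x the average breaches it; 4096 shards leaves roughly a 2.5x margin, matching the observed behavior.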

Suggested fix

Maybe this

max_shard_size = 0.9 * cls.max_shard_size

is too little headroom for the training examples. The FeatureDict in this particular DatasetBuilder is large, and perhaps the per-key overhead is unusually high. Should that factor be 0.8 instead? Or should the headroom grow when the FeatureDict contains many keys?

Side remark

Surprisingly, the Dataflow limits mention

Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine). 2 GB

which doesn't seem to hold in practice, since the GroupBy fails at ~1 GiB per the logged error.
