SortedBucketIO.read() failed with memory issue for large dataset #5674

@vwhoho

Version 0.14.14

I have the following code to merge 2 sorted bucket datasets:

AvroSortedBucketIO.Read right = AvroSortedBucketIO.read(new TupleTag<>("right"), x.class).from("directoryX");
AvroSortedBucketIO.Read left = AvroSortedBucketIO.read(new TupleTag<>("left"), y.class).from("directoryY");
PCollection<KV<String, CoGbkResult>> merged = pipeline.apply("Merge and Transform", SortedBucketIO.read(String.class).of(left, right));
merged.apply(ParDo.of(new MergeTransform()));

One of the datasets is approximately 2 TB.

The Dataflow job failed with an out-of-memory error.

Please advise. Thanks
