Hi! Currently, when writing a parquet dataset with mode `overwrite` / `overwrite_partitions`, aws-wrangler creates a race condition between the writer and any reader (aws-wrangler / Spark / Athena, for example): it first removes the files in each partition and then creates objects with new random UUID-based names.
This behaviour is quite unsafe, as any reader that lists the objects at the moment of the overwrite and then tries to read them will fail with one of these errors (or worse, it will fail silently because it listed the path after aws-wrangler removed all the files and sees an empty dataset):
- `botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.`
- Athena: `HIVE_CANNOT_OPEN_SPLIT` errors
- etc.
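For reference, a minimal sketch of the kind of write that opens this window (the bucket/prefix and the toy DataFrame are made up for illustration):

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"year": [2023, 2023], "value": [1, 2]})

# Writer side: "overwrite_partitions" first deletes every object under the
# affected partitions, then uploads new objects with random UUID-based names.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-dataset/",  # hypothetical path
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["year"],
)

# Reader side (possibly another process): a listing taken during the write
# can reference keys that no longer exist (NoSuchKey), or see an empty
# partition and silently return no rows.
df_read = wr.s3.read_parquet("s3://my-bucket/my-dataset/", dataset=True)
```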
We would like a new option to ensure that in `overwrite` & `overwrite_partitions` modes aws-wrangler does a safe, deterministic & atomic replacement of the destination objects. This could be done using the following method (a sketch follows below):

1. Use deterministic output names (for example `part-0.parquet`, `part-1.parquet`).
2. Atomically replace any existing files in the output path.
3. Finally, clean up any extra files that are not expected in the output path (for example, if this new upload has fewer part files than the previous one).

This would avoid the vast majority of race conditions, as in a typical overwrite the number of parts stays the same or increases.

// cc. @jack-dell
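A rough sketch of what that method could look like with plain boto3 (the bucket, prefix, and local part files are hypothetical, and `safe_overwrite` is not an existing API, just an illustration of the three steps):

```python
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "my-dataset/year=2023/"  # hypothetical

def safe_overwrite(local_parts: list[str]) -> None:
    # 1. List what is currently in the partition.
    existing = {
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix
        )
        for obj in page.get("Contents", [])
    }

    # 2. Upload with deterministic names; a PUT to a single key is atomic,
    #    so readers see either the old bytes or the new bytes, never a gap.
    new_keys = set()
    for i, local_path in enumerate(local_parts):
        key = f"{prefix}part-{i}.parquet"
        s3.upload_file(local_path, bucket, key)
        new_keys.add(key)

    # 3. Only now delete leftover parts from a previous, larger write.
    #    (delete_objects accepts up to 1,000 keys per call; a real
    #    implementation would batch.)
    leftovers = existing - new_keys
    if leftovers:
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in leftovers]},
        )
```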
Hi @pvieito, updates to a single key in S3 are atomic, and you are correct: if the object names were deterministic, we could overwrite the objects in place. That would still leave the partition in an inconsistent state during the overwrite, but it would prevent most `NoSuchKey` errors (most... not all, as some objects might still be deleted if the partition shrinks). Alternatively, you could consider retrying on `NoSuchKey`.
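For the retry route, a minimal reader-side sketch, assuming the `NoSuchKey` surfaces as a `botocore` `ClientError` as in the traceback above (`read_with_retry` and its parameters are hypothetical):

```python
import time
import awswrangler as wr
from botocore.exceptions import ClientError

def read_with_retry(path: str, attempts: int = 5, backoff: float = 1.0):
    for attempt in range(attempts):
        try:
            # Each attempt re-lists the dataset, so a key deleted by a
            # concurrent overwrite is dropped from the next listing.
            return wr.s3.read_parquet(path, dataset=True)
        except ClientError as exc:
            # Only retry the race discussed here; re-raise anything else.
            if exc.response["Error"]["Code"] != "NoSuchKey":
                raise
            time.sleep(backoff * 2 ** attempt)
    raise TimeoutError(f"dataset at {path} kept changing during reads")
```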