Describe the enhancement requested
In pyarrow.dataset.write_dataset(), there are three options for the existing_data_behavior argument: ‘error’ | ‘overwrite_or_ignore’ | ‘delete_matching’.
I'd like to have a new one: ‘ignore’.
From the description of ‘overwrite_or_ignore’: ... This behavior [...] will allow for an append workflow.
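For reference, a minimal sketch of the append workflow that ‘overwrite_or_ignore’ already enables today (the table contents and dataset path are just illustrative):

```python
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# A unique basename_template per write means no existing file is overwritten,
# so each call adds new files next to the ones already there.
ds.write_dataset(
    table,
    "my_dataset",
    format="parquet",
    basename_template=f"part-{uuid.uuid4()}-{{i}}.parquet",
    existing_data_behavior="overwrite_or_ignore",
)
```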
I really like this concept, and I would find it perfect if it were possible to avoid downloading files that have already been downloaded, extending this "append" philosophy more broadly.
When keeping a dataset up to date from another source, this would make it possible to download only the new data instead of every data point, and therefore avoid wasting time/bandwidth on data that is already present.
I don't know if this could work, or if this is the right place for such an option/use case.
I would imagine the new documentation reading something like:
(same doc) ‘overwrite_or_ignore’ will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
(added doc for ‘ignore’) ‘ignore’ allows the same append workflows, without overwriting files that already exist at the destination. While this can significantly reduce bandwidth usage, it may lead to diverging states between the source and the destination, as files sharing the same name will not be checked for consistency.
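To make the proposal concrete, a call with the new option might look like this. This is purely hypothetical, since ‘ignore’ does not exist in pyarrow today:

```python
import pyarrow as pa
import pyarrow.dataset as ds

new_table = pa.table({"id": [4, 5], "value": ["d", "e"]})

# Proposed behavior (not in pyarrow today): output files whose names already
# exist at the destination would be skipped rather than rewritten.
ds.write_dataset(
    new_table,
    "my_dataset",
    format="parquet",
    existing_data_behavior="ignore",  # the proposed new option
)
```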
Appendix:
As an illustration of what I mean, here is what I did to compare two datasets and download only the data that is not already present:
```python
import os

import fsspec
import pyarrow
import pyarrow.dataset
import tqdm


def update_local_dataset(
    remote_dataset: pyarrow.dataset.Dataset,
    local_dataset: pyarrow.dataset.Dataset,
):
    # It would also be great if we could reuse the datasets' filesystems,
    # something like dataset.filesystem
    # (s3_tool and logger come from my own codebase)
    s3_fs: fsspec.AbstractFileSystem = s3_tool.get_s3_fs_from_config()
    local_fs: fsspec.AbstractFileSystem = fsspec.filesystem("local")

    logger.info("Updating local clicklog dataset...")

    # Gather the base directory and file names on each side
    remote_base_dir = os.path.dirname(next(remote_dataset.get_fragments()).path)
    remote_filenames = {os.path.basename(fragment.path) for fragment in remote_dataset.get_fragments()}
    local_base_dir = os.path.dirname(next(local_dataset.get_fragments()).path)
    local_filenames = {os.path.basename(fragment.path) for fragment in local_dataset.get_fragments()}

    # Download only the files missing locally
    to_dl_filenames = remote_filenames.difference(local_filenames)
    for filename in tqdm.tqdm(to_dl_filenames):
        remote_filepath = os.path.join(remote_base_dir, filename)
        local_filepath = os.path.join(local_base_dir, filename)
        s3_fs.get_file(remote_filepath, local_filepath)
    logger.info("Updated local dataset!")
```
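A hypothetical invocation, assuming a Parquet dataset on S3 mirrored into a local directory (the bucket name and local path are placeholders):

```python
# Hypothetical usage; the bucket name and local path are placeholders.
remote_ds = pyarrow.dataset.dataset("s3://my-bucket/clicklogs/", format="parquet")
local_ds = pyarrow.dataset.dataset("/data/clicklogs/", format="parquet")
update_local_dataset(remote_ds, local_ds)
```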
Component(s)
Python