[Python][dataset] Write dataset : 'ignore' already downloaded files #45321

Open
Plenitude-ai opened this issue Jan 21, 2025 · 0 comments

Describe the enhancement requested

In pyarrow.dataset.write_dataset(), there are three options for the existing_data_behavior argument:
'error' | 'overwrite_or_ignore' | 'delete_matching'

I'd like a new one: 'ignore'

From the description of 'overwrite_or_ignore':
... This behavior [...] will allow for an append workflow.
I really like this concept, and it would be perfect if it were possible to avoid downloading files that have already been downloaded, extending this "append" philosophy more broadly.
When keeping a dataset up to date from another source, this would allow downloading only the new data instead of every data point, and therefore avoid wasting time/bandwidth on data that is already present.
I don't know if this could work, or whether this is the right place for such an option/use case.
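For context, here is a minimal sketch of the append workflow that already works today with 'overwrite_or_ignore' (the table and destination path are just placeholders):

import pyarrow as pa
import pyarrow.dataset

# placeholder data and destination, only to illustrate the call
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

pyarrow.dataset.write_dataset(
    table,
    "my_dataset_dir",  # hypothetical destination directory
    format="parquet",
    basename_template="batch-0001-part-{i}.parquet",  # unique per write, so repeated writes append new files
    existing_data_behavior="overwrite_or_ignore",
)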

I would imagine the new documentation to look like this:
(existing doc)
'overwrite_or_ignore' will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
(added doc for 'ignore')
'ignore' allows the same append workflows, without overwriting files that already exist in the destination. While this can significantly reduce bandwidth usage, it may lead to different states between the source and destination, as files sharing the same name will not be checked for consistency.
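With the proposed option, the same write call would simply skip any output file whose name already exists in the destination. This is hypothetical, since 'ignore' does not exist today:

pyarrow.dataset.write_dataset(
    table,
    "my_dataset_dir",
    format="parquet",
    basename_template="batch-0001-part-{i}.parquet",
    existing_data_behavior="ignore",  # proposed: never overwrite, just skip files already present
)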

Appendix:

As an illustration of what I mean, this is what I did to compare two datasets and download only the data that is not already present:

import logging
import os

import fsspec
import pyarrow
import pyarrow.dataset
import tqdm

logger = logging.getLogger(__name__)

def update_local_dataset(
    remote_dataset: pyarrow.dataset.Dataset,
    local_dataset: pyarrow.dataset.Dataset,
):
    # It would also be great if we could reuse the datasets' filesystems,
    # something like dataset.filesystem
    # s3_tool is a project-specific helper returning an fsspec S3 filesystem
    s3_fs: fsspec.AbstractFileSystem = s3_tool.get_s3_fs_from_config()
    local_fs: fsspec.AbstractFileSystem = fsspec.filesystem("local")
    logger.info("Updating local clicklog dataset...")

    # Base directories and file names of the fragments on each side
    remote_base_dir = os.path.dirname(next(remote_dataset.get_fragments()).path)
    remote_filenames = {os.path.basename(fragment.path) for fragment in remote_dataset.get_fragments()}

    local_base_dir = os.path.dirname(next(local_dataset.get_fragments()).path)
    local_filenames = {os.path.basename(fragment.path) for fragment in local_dataset.get_fragments()}

    # Download only the fragments that are not already present locally
    to_dl_filenames = remote_filenames.difference(local_filenames)
    for filename in tqdm.tqdm(to_dl_filenames):
        remote_filepath = os.path.join(remote_base_dir, filename)
        local_filepath = os.path.join(local_base_dir, filename)
        s3_fs.get_file(remote_filepath, local_filepath)
    logger.info("Updated local dataset!")
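
A rough usage sketch, assuming an s3fs-backed filesystem and hypothetical bucket/local paths (in my actual code the S3 filesystem comes from s3_tool.get_s3_fs_from_config()):

import fsspec
import pyarrow.dataset

# hypothetical filesystems and paths, only to show how the helper is called
s3_fs = fsspec.filesystem("s3")  # requires the s3fs package
remote_dataset = pyarrow.dataset.dataset("my-bucket/clicklogs/", format="parquet", filesystem=s3_fs)
local_dataset = pyarrow.dataset.dataset("/data/clicklogs/", format="parquet")

update_local_dataset(remote_dataset, local_dataset)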

Component(s)

Python
