[Python][dataset] Write dataset : 'ignore' already downloaded files #45321

Open
Plenitude-ai opened this issue Jan 21, 2025 · 0 comments

Describe the enhancement requested

In pyarrow.dataset.write_dataset(), there are three options for the existing_data_behavior argument:
'error' | 'overwrite_or_ignore' | 'delete_matching'

I'd like a new one: 'ignore'

From the description of 'overwrite_or_ignore':
... This behavior [...] will allow for an append workflow.
I really like this concept, and it would be perfect if it were possible to avoid downloading files that have already been downloaded, extending this "append" philosophy more broadly.
When keeping a dataset up to date from another source, this would allow downloading only the new data instead of every data point, and therefore avoid wasting time/bandwidth on data that is already present.
I don't know if this could work, or whether this is the right place for such an option/use case.
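For context, here is a minimal sketch of the append workflow that already works today with 'overwrite_or_ignore' (the table and destination path are just placeholders):

import pyarrow as pa
import pyarrow.dataset

# placeholder data and destination, only to illustrate the call
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

pyarrow.dataset.write_dataset(
    table,
    "my_dataset_dir",  # hypothetical destination directory
    format="parquet",
    basename_template="batch-0001-part-{i}.parquet",  # unique per write, so repeated writes append new files
    existing_data_behavior="overwrite_or_ignore",
)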

I would imagine the new documentation to look like this:
(existing doc)
'overwrite_or_ignore' will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.
(added doc for 'ignore')
'ignore' allows the same append workflows, without overwriting files that already exist in the destination. While this can significantly reduce bandwidth usage, it may lead to different states between the source and destination, as files sharing the same name will not be checked for consistency.
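With the proposed option, the same write call would simply skip any output file whose name already exists in the destination. This is hypothetical, since 'ignore' does not exist today:

pyarrow.dataset.write_dataset(
    table,
    "my_dataset_dir",
    format="parquet",
    basename_template="batch-0001-part-{i}.parquet",
    existing_data_behavior="ignore",  # proposed: never overwrite, just skip files already present
)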

Appendix:

As an illustration of what I mean, this is what I did to compare two datasets and download only the data that is not already present:

import logging
import os

import fsspec
import pyarrow
import pyarrow.dataset
import tqdm

logger = logging.getLogger(__name__)

def update_local_dataset(
    remote_dataset: pyarrow.dataset.Dataset,
    local_dataset: pyarrow.dataset.Dataset,
):
    # It would also be great if we could reuse the datasets' filesystems,
    # something like dataset.filesystem
    # s3_tool is a project-specific helper returning an fsspec S3 filesystem
    s3_fs: fsspec.AbstractFileSystem = s3_tool.get_s3_fs_from_config()
    local_fs: fsspec.AbstractFileSystem = fsspec.filesystem("local")
    logger.info("Updating local clicklog dataset...")

    # Base directories and file names of the fragments on each side
    remote_base_dir = os.path.dirname(next(remote_dataset.get_fragments()).path)
    remote_filenames = {os.path.basename(fragment.path) for fragment in remote_dataset.get_fragments()}

    local_base_dir = os.path.dirname(next(local_dataset.get_fragments()).path)
    local_filenames = {os.path.basename(fragment.path) for fragment in local_dataset.get_fragments()}

    # Download only the fragments that are not already present locally
    to_dl_filenames = remote_filenames.difference(local_filenames)
    for filename in tqdm.tqdm(to_dl_filenames):
        remote_filepath = os.path.join(remote_base_dir, filename)
        local_filepath = os.path.join(local_base_dir, filename)
        s3_fs.get_file(remote_filepath, local_filepath)
    logger.info("Updated local dataset!")
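
A rough usage sketch, assuming an s3fs-backed filesystem and hypothetical bucket/local paths (in my actual code the S3 filesystem comes from s3_tool.get_s3_fs_from_config()):

import fsspec
import pyarrow.dataset

# hypothetical filesystems and paths, only to show how the helper is called
s3_fs = fsspec.filesystem("s3")  # requires the s3fs package
remote_dataset = pyarrow.dataset.dataset("my-bucket/clicklogs/", format="parquet", filesystem=s3_fs)
local_dataset = pyarrow.dataset.dataset("/data/clicklogs/", format="parquet")

update_local_dataset(remote_dataset, local_dataset)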

Component(s)

Python
