load_dataset() of images from a single directory where train.png image exists #7201

Open
SagiPolaczek opened this issue Oct 7, 2024 · 0 comments


Describe the bug

Hey!
First, thanks for maintaining such a framework!

I ran into a small issue while loading a custom image+text captioning dataset. All of my images were in a single directory, and one of them was named train.png. The loaded dataset then contained only that image.

I guess it's related to "train" being a split name, but it's definitely unexpected behavior :)
Unfortunately I don't have time to submit a proper PR. I'm attaching a toy example to reproduce the issue.

Thanks,
Sagi

Steps to reproduce the bug

All of the steps I'm attaching are in a fresh env :)

(base) sagipolaczek@Sagis-MacBook-Pro ~ % conda activate hf_issue_env
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python --version
Python 3.10.15
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % pip list | grep datasets
datasets           3.0.1
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % ls -la Documents/hf_datasets_issue          
total 352
drwxr-xr-x   6 sagipolaczek  staff     192 Oct  7 11:59 .
drwx------@ 23 sagipolaczek  staff     736 Oct  7 11:46 ..
-rw-r--r--@  1 sagipolaczek  staff      72 Oct  7 11:59 metadata.csv
-rw-r--r--@  1 sagipolaczek  staff  160154 Oct  6 18:00 pika.png
-rw-r--r--@  1 sagipolaczek  staff    5495 Oct  6 12:02 pika_pika.png
-rw-r--r--@  1 sagipolaczek  staff    1753 Oct  6 11:50 train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % cat Documents/hf_datasets_issue/metadata.csv
file_name,text
train.png,A train
pika.png,Pika
pika_pika.png,Pika Pika!


(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python                                      
Python 3.10.15 (main, Oct  3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 1
    })
})
>>> dataset["train"][0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=354x84 at 0x10B50FD90>, 'text': 'A train'}

### DELETING `train.png` sample ###
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % vim Documents/hf_datasets_issue/metadata.csv
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % rm Documents/hf_datasets_issue/train.png 
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python                                      
Python 3.10.15 (main, Oct  3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
Generating train split: 2 examples [00:00, 65.99 examples/s]
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 2
    })
})
>>> dataset["train"]
Dataset({
    features: ['image', 'text'],
    num_rows: 2
})
>>> dataset["train"][0],dataset["train"][1]
({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=2356x1054 at 0x10DD11E70>, 'text': 'Pika'}, {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=343x154 at 0x10E258C70>, 'text': 'Pika Pika!'})
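
The two runs above are consistent with split names being inferred from file names. Below is a minimal sketch of such a rule, only to illustrate the suspected mechanism. The keyword list, the separator rule, and the function name `infer_splits` are assumptions made for this example and are not taken from the `datasets` source.

```python
import re
from pathlib import PurePath

# Hypothetical split keywords, for illustration only.
SPLIT_KEYWORDS = ("train", "test", "validation")


def infer_splits(file_names):
    """Assign files to splits based on their base name (illustrative rule)."""
    splits = {}
    for name in file_names:
        base = PurePath(name).name
        for keyword in SPLIT_KEYWORDS:
            # Assumed rule: the keyword starts the name and is followed
            # by a non-alphanumeric separator such as "." or "_".
            if re.match(rf"{keyword}[^a-zA-Z0-9]", base):
                splits.setdefault(keyword, []).append(name)
    # Fallback: no name looks like a split, so every file goes to "train".
    if not splits:
        splits = {"train": list(file_names)}
    return splits


with_train = infer_splits(["metadata.csv", "pika.png", "pika_pika.png", "train.png"])
without_train = infer_splits(["metadata.csv", "pika.png", "pika_pika.png"])
print(with_train)
print(without_train)
```

Under this assumed rule, the presence of train.png makes the keyword branch fire, so the other files are never assigned to any split. Once train.png is removed, the fallback puts every remaining file in "train", which matches the 1-row and 2-row results in the transcript.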

Expected behavior

I would expect to get a dataset that includes the train.png sample along with the other data points.
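
A possible workaround, sketched here and not verified against this exact setup, is to skip file-name-based inference by listing the files explicitly and naming the split yourself. The helper name `explicit_data_files` is hypothetical. It only builds the mapping and assumes the images and metadata.csv sit directly in one directory.

```python
from pathlib import Path


def explicit_data_files(data_dir, split="train"):
    """Map one split name to every regular, non-hidden file in data_dir."""
    files = sorted(
        str(path)
        for path in Path(data_dir).iterdir()
        if path.is_file() and not path.name.startswith(".")
    )
    return {split: files}
```

The resulting dict could then be passed as `load_dataset("imagefolder", data_files=explicit_data_files("Documents/hf_datasets_issue/"))`. `data_files` is an existing `load_dataset` parameter, but whether this fully avoids the behavior reported here in datasets 3.0.1 is an assumption.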

Environment info

I've attached it in the example:

Python 3.10.15
datasets 3.0.1
