Describe the bug
Hey!
Firstly, thanks for maintaining such a framework!
I ran into a small issue while loading a custom image+text captioning dataset. All of my images were in a single directory, and one of them was named train.png. The loaded dataset then contained only that image.
I guess it's related to "train" being a split name, but it's definitely unexpected behavior :)
Unfortunately I don't have time to submit a proper PR. I'm attaching a toy example to reproduce the issue.
Thanks,
Sagi
Steps to reproduce the bug
All of the steps I'm attaching are in a fresh env :)
(base) sagipolaczek@Sagis-MacBook-Pro ~ % conda activate hf_issue_env
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python --version
Python 3.10.15
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % pip list | grep datasets
datasets 3.0.1
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % ls -la Documents/hf_datasets_issue
total 352
drwxr-xr-x 6 sagipolaczek staff 192 Oct 7 11:59 .
drwx------@ 23 sagipolaczek staff 736 Oct 7 11:46 ..
-rw-r--r--@ 1 sagipolaczek staff 72 Oct 7 11:59 metadata.csv
-rw-r--r--@ 1 sagipolaczek staff 160154 Oct 6 18:00 pika.png
-rw-r--r--@ 1 sagipolaczek staff 5495 Oct 6 12:02 pika_pika.png
-rw-r--r--@ 1 sagipolaczek staff 1753 Oct 6 11:50 train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % cat Documents/hf_datasets_issue/metadata.csv
file_name,text
train.png,A train
pika.png,Pika
pika_pika.png,Pika Pika!
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 1
    })
})
>>> dataset["train"][0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=354x84 at 0x10B50FD90>, 'text': 'A train'}
### DELETING `train.png` sample ###
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % vim Documents/hf_datasets_issue/metadata.csv
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % rm Documents/hf_datasets_issue/train.png
(hf_issue_env) sagipolaczek@Sagis-MacBook-Pro ~ % python
Python 3.10.15 (main, Oct 3 2024, 02:33:33) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="Documents/hf_datasets_issue/")
Generating train split: 2 examples [00:00, 65.99 examples/s]
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 2
    })
})
>>> dataset["train"]
Dataset({
    features: ['image', 'text'],
    num_rows: 2
})
>>> dataset["train"][0],dataset["train"][1]
({'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=2356x1054 at 0x10DD11E70>, 'text': 'Pika'}, {'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=343x154 at 0x10E258C70>, 'text': 'Pika Pika!'})
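If the cause really is a file name that looks like a split name, a quick way to spot such files before loading could be a small standard-library check like the sketch below. The keyword list, the separator set, and the helper name `find_split_like_files` are all assumptions made for illustration; they are not taken from the datasets source code.

```python
import re
from pathlib import Path

# Assumed list of split-like keywords (hypothetical, not copied from datasets).
SPLIT_KEYWORDS = (
    "train", "training", "test", "testing",
    "eval", "evaluation", "valid", "validation", "val", "dev",
)

# A name is flagged when it is exactly a keyword, or a keyword followed by
# one of the assumed separator characters (dash, dot, underscore, space, digit).
_SPLIT_LIKE = re.compile(r"^(?:%s)(?:[-._ 0-9].*)?$" % "|".join(SPLIT_KEYWORDS))


def find_split_like_files(directory):
    """Return the sorted names of files in `directory` that look like split names."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and _SPLIT_LIKE.match(p.name)
    )
```

For the directory in this report, `find_split_like_files("Documents/hf_datasets_issue")` would flag train.png and leave pika.png, pika_pika.png and metadata.csv alone.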
Expected behavior
I expected to get a dataset that includes the train.png sample (along with the other data points).
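A possible workaround, not verified against datasets 3.0.1, is to skip the automatic split detection by passing an explicit `data_files` mapping that assigns every file in the folder to a single split. The helper names `build_data_files` and `load_folder_as_single_split` below are hypothetical; `load_dataset` and its `data_files` parameter are part of the library, but whether this avoids the behavior reported here is an assumption.

```python
def build_data_files(data_dir, split="train"):
    """Map one split name to a glob covering everything under `data_dir`."""
    # Forward slashes are used on purpose so the glob pattern looks the same
    # on every platform.
    return {split: f"{str(data_dir).rstrip('/')}/**"}


def load_folder_as_single_split(data_dir, split="train"):
    """Load all files under `data_dir` into a single split (untested sketch)."""
    # Imported lazily so this snippet can be imported without datasets installed.
    from datasets import load_dataset

    # Assumption: an explicit split mapping means file names such as
    # train.png are no longer used to infer splits.
    return load_dataset("imagefolder", data_files=build_data_files(data_dir, split))
```

With the folder from this report, the call would be `load_folder_as_single_split("Documents/hf_datasets_issue/")`, which, if the assumption holds, should return a `train` split with all three samples.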
Environment info
I've attached it in the example:
Python 3.10.15
datasets 3.0.1