Replies: 4 comments
-
Hi Lukas, https://github.com/mlcommons/croissant/blob/main/datasets/1.0/pass-mini/metadata.json is a similar dataset: JPEGs from two tar files are joined with a CSV containing other information. There is no split in that dataset, but a split is just a semantic type, as defined in https://github.com/mlcommons/croissant/blob/main/datasets/1.0/recipes/simple-split.json. The order in which the data examples are yielded is implementation-dependent, as the Croissant spec doesn't specify anything at the moment. It's probably either the order of the examples in the JPEG folder (or archive) or the order of the CSV. Defining the order may be a feature of the format. Do you want to open a feature request and give a few examples of when that would be needed?
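Until the spec defines an order, one workaround is to re-sort the yielded examples against the original CSV. The snippet below is only a sketch under assumptions: the column names (`path`, `identity`, `split`) and the sample records are made up for illustration, the records are assumed to be plain dicts with a unique `path` value, and it uses only the Python standard library rather than any Croissant API.

```python
import csv
import io

# Hypothetical metadata CSV; the column names are assumptions for illustration.
CSV_TEXT = """path,identity,split
images/a.jpg,turtle_1,train
images/b.jpg,turtle_2,test
images/c.jpg,turtle_1,train
"""


def csv_order(csv_text, key="path"):
    """Map each value of the key column to its row position in the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: position for position, row in enumerate(reader)}


def restore_order(records, order, key="path"):
    """Return the records sorted so that they follow the CSV row order."""
    return sorted(records, key=lambda record: order[record[key]])


# Records as a loader might yield them, in an order different from the CSV.
records = [
    {"path": "images/c.jpg", "identity": "turtle_1"},
    {"path": "images/a.jpg", "identity": "turtle_1"},
    {"path": "images/b.jpg", "identity": "turtle_2"},
]

ordered = restore_order(records, csv_order(CSV_TEXT))
print([record["path"] for record in ordered])
```

This only works if the key column is unique per row; with duplicate paths, an explicit row-index column in the CSV would be a safer key.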
-
Hi Pierre, thanks a lot for the link to the pass-mini dataset. Is there some code that created the metadata.json file? Also, the dataset seems to be a bit different from what I have, as the files are stored in two tar files and not in folders with some hierarchical structure. To be honest, Croissant is now required for the NeurIPS 2024 Datasets and Benchmarks Track and I struggle to understand it :( Best, Lukas
-
Hi Lukas, you can use the Croissant editor to create your JSON file: https://huggingface.co/spaces/MLCommons/croissant-editor. If you upload your files on the "Resources" tab, the editor will try to infer the corresponding Croissant definitions, but you may need to correct them. Please let us know how it goes. Best,
-
Hi Omar, thanks a lot for the reply, it was very helpful. After checking the editor, I realized that if I upload the dataset to Kaggle, I can download the metadata directly from there :) When I went back to your GitHub readme, I realized that you hint at this in the "Integrations" section. I would suggest stressing this much more, so that people who are not so skilled with technology, like me, can realize that the Croissant format is automatically generated by Kaggle and HuggingFace (and possibly others) and users do not need to handle it manually :) Thanks a lot once again and keep up the good work :) Lukas
-
Hello, I wanted to ask whether there is a simple tutorial on how to convert tabular metadata (CSV format) into the Croissant format. Each row in the metadata corresponds to one image depicting exactly one animal. The information provided for each image is its path, the identity of the depicted animal (the class in the context of ML), the split (train/test) and additional information such as date, species or animal orientation (left, top, ...). I would also need the order of the Croissant metadata to stay the same as in the original CSV file.
Thanks a lot,
Lukas
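For readers with the same question, here is a rough sketch of what a hand-written conversion could look like: it reads the CSV header and emits one field per column, in CSV column order. Everything here is an assumption to check against the Croissant 1.0 spec and the example files linked in the replies: the property names (`cr:FileObject`, `cr:RecordSet`, `cr:Field`, `source`, `extract`) follow my reading of those examples, the `@context` block and checksum are left out, every column is typed as `sc:Text` as a placeholder, the image files themselves are not described, and the file and dataset names are hypothetical. The output has not been run through a Croissant validator.

```python
import csv
import io
import json

# Hypothetical metadata CSV matching the columns described in the question.
CSV_TEXT = """path,identity,split,date,species,orientation
images/a.jpg,turtle_1,train,2020-01-01,loggerhead,left
"""


def croissant_sketch(csv_text, csv_name="metadata.csv", dataset_name="animals"):
    """Build a Croissant-style dict with one field per CSV column.

    The "@context" block and file checksum are deliberately omitted, so this
    is a starting point to edit, not a complete Croissant document.
    """
    columns = next(csv.reader(io.StringIO(csv_text)))  # header row only
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"default/{column}",
            "name": f"default/{column}",
            "dataType": "sc:Text",  # placeholder type for every column
            "source": {
                "fileObject": {"@id": csv_name},
                "extract": {"column": column},
            },
        }
        for column in columns
    ]
    return {
        "@type": "sc:Dataset",
        "name": dataset_name,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": csv_name,
                "name": csv_name,
                "contentUrl": csv_name,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "default",
                "name": "default",
                "field": fields,
            }
        ],
    }


doc = croissant_sketch(CSV_TEXT)
print(json.dumps(doc, indent=2))
```

In practice the Croissant editor or the Kaggle and HuggingFace exports mentioned in the replies are likely the easier route; a sketch like this mainly helps to see how CSV columns map to fields.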