Added parameter trust_remote_code to hf dataset call. #2013
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2013      +/-   ##
==========================================
- Coverage   84.40%   84.38%   -0.03%
==========================================
  Files         842      843       +1
  Lines      105179   105204      +25
==========================================
- Hits        88781    88779       -2
- Misses      16398    16425      +27

☔ View full report in Codecov by Sentry.
Looks good overall. I only have a few small change requests.
Set the default trust_remote_code to false. Added an example that highlights the use case.
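For reference, here is a minimal sketch of what the use case looks like from the Rust side; the import path, the with_trust_remote_code builder method name, and the item fields are assumptions for illustration rather than an excerpt from this PR.

// Sketch only: names below (import path, builder method, item fields) are assumed.
use burn_dataset::source::huggingface::HuggingfaceDatasetLoader;
use burn_dataset::SqliteDataset;

// Row type matching the columns the importer exports for "mnist".
#[derive(Clone, Debug, serde::Deserialize)]
struct MnistItem {
    image_bytes: Vec<u8>,
    label: usize,
}

fn main() {
    // "mnist" is backed by a loading script on the Hugging Face Hub, so it only
    // downloads when remote code is explicitly trusted; the default stays false,
    // so plain parquet/arrow datasets keep working without the flag.
    let train: SqliteDataset<MnistItem> = HuggingfaceDatasetLoader::new("mnist")
        .with_trust_remote_code(true)
        .dataset("train")
        .expect("failed to download and export the dataset");

    let _ = train; // the split is now backed by the exported sqlite database
}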
I tried running your example, but I am getting this error even though I have the latest datasets package installed.
I tried to run it on my machine:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 4.02MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 9.91M/9.91M [00:00<00:00, 31.3MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 28.9k/28.9k [00:00<00:00, 1.84MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.65M/1.65M [00:00<00:00, 17.8MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.54k/4.54k [00:00<?, ?B/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████| 60000/60000 [00:06<00:00, 8994.59 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9294.53 examples/s]
Dataset: DatasetDict({
train: Dataset({
features: ['image', 'label'],
num_rows: 60000
})
test: Dataset({
features: ['image', 'label'],
num_rows: 10000
})
})
Saving dataset: mnist - train
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 117.66ba/s]
Saving dataset: mnist - test
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 140.53ba/s]
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\mnist.db))
Table: test
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER
Table: train
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER
Starting huggingface dataset download and export
Dataset Name: Anthropic/hh-rlhf
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db
Trust remote code: None
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 5.77k/5.77k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 13.2M/13.2M [00:00<00:00, 26.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 16.2M/16.2M [00:00<00:00, 46.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 20.1M/20.1M [00:00<00:00, 52.6MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 25.7M/25.7M [00:00<00:00, 37.0MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 743k/743k [00:00<00:00, 3.73MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 875k/875k [00:00<00:00, 3.25MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.05M/1.05M [00:00<00:00, 6.13MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 7.67MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████| 160800/160800 [00:02<00:00, 58801.55 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████| 8552/8552 [00:00<00:00, 50027.32 examples/s]
Dataset: DatasetDict({
train: Dataset({
features: ['chosen', 'rejected'],
num_rows: 160800
})
test: Dataset({
features: ['chosen', 'rejected'],
num_rows: 8552
})
})
Saving dataset: Anthropic/hh-rlhf - train
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████| 161/161 [00:03<00:00, 46.63ba/s]
Saving dataset: Anthropic/hh-rlhf - test
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 57.28ba/s]
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db))
Table: test
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER
Table: train
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER

This should be the correct behavior, as stated in the docs and the latest release code. What I think is happening is that you have a previously cached Python environment, and subsequent install calls don't update the old dependencies. To understand the issue better, I intentionally downgraded the datasets package to an older version:

(venv) PS C:\Users\josed\.cache\burn-dataset\venv\Scripts> pip freeze
[...]
datasets==2.14.7
[...]

And got exactly the same behavior as yours:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 3.66MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Traceback (most recent call last):
File "C:\Users\josed\.cache\burn-dataset\importer.py", line 201, in <module>
run()
File "C:\Users\josed\.cache\burn-dataset\importer.py", line 190, in run
download_and_export(
File "C:\Users\josed\.cache\burn-dataset\importer.py", line 36, in download_and_export
dataset_all = load_dataset(
^^^^^^^^^^^^^
File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 2129, in load_dataset
builder_instance = load_dataset_builder(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 1852, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
^^^^^^^^^^^^
File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 373, in __init__
self.config, self.config_id = self._create_builder_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 552, in _create_builder_config
builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BuilderConfig.__init__() got an unexpected keyword argument 'trust_remote_code'

First, I don't know whether I have made the correct guess about what is going on or whether I'm on the wrong track. If the guess is correct, the issue lies in the Python dependencies, so to solve it I tried adding --upgrade to the pip install call in command.args:

  command.args([
      "-m",
      "pip",
      "--quiet",
      "install",
+     "--upgrade",
      "pyarrow",
      "sqlalchemy",
      "Pillow",
      "soundfile",
      "datasets",
  ]);

This causes all dependencies to be upgraded, but then other problems arise.
OK. It seems this was the issue. I forgot we install all dependencies into a venv under the burn-dataset cache directory. I guess we will have to tell others the workaround of removing the env:

rm -rf ~/.cache/burn-dataset/venv
LGTM
CC @laggui, @nathanielsimard, @louisfd, @syl20bnr so you're aware of the potential upgrade issue. Basically, the Hugging Face datasets package is now at a much higher version, but the packages under ~/.cache/burn-dataset/venv remain at whatever versions were installed when the venv was first created.
Pull Request Template
Checklist
The run-checks all script has been executed. Some tests failed on Windows, but they are unrelated to this fix; the failing tests are tanh_should_not_have_numerical_bugs_on_macos and nn::rope_encoding::tests::test_rotary_encoding_forward.
Related Issues/PRs
Fixes #2012
Changes
The underlying Python script that fetches Hugging Face datasets requires an additional parameter; if it is not provided, the library asks for interactive input instead. Since this parameter was not passed, the input prompt kicked in, but there was no way to answer it through the spawned command. What I did was to add said parameter both to the HuggingfaceDatasetLoader and to the Python script; a rough sketch of how such a flag could be forwarded is shown below.
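For illustration, here is a minimal sketch of how such a flag could travel from the Rust loader to the Python importer script; the helper name, CLI flag, and script arguments below are hypothetical stand-ins, not the PR's actual wiring.

use std::process::Command;

// Hypothetical helper: shows one way a trust_remote_code flag could reach the
// Python importer. The "--name" / "--trust-remote-code" arguments are illustrative.
fn run_importer(dataset_name: &str, trust_remote_code: bool) -> std::io::Result<()> {
    let mut command = Command::new("python3");
    command.args(["importer.py", "--name", dataset_name]);

    // Only forward the flag when the caller opted in, so the default stays false
    // and datasets without a loading script never hit the interactive prompt.
    if trust_remote_code {
        command.arg("--trust-remote-code");
    }

    let status = command.status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "dataset import failed",
        ));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Mirrors the logs above: "mnist" ships a loading script, so it needs the flag.
    run_importer("mnist", true)
}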
Testing
A test module has been added.