
Added parameter trust_remote_code to hf dataset call. #2013

Merged: 2 commits, Jul 17, 2024

Conversation

@Haislich (Contributor) commented Jul 13, 2024

Pull Request Template

Checklist

  • Confirmed that the run-checks all script has been executed. Some tests failed on Windows, but they are unrelated to this fix: tanh_should_not_have_numerical_bugs_on_macos and nn::rope_encoding::tests::test_rotary_encoding_forward.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

Fixes #2012

Changes

The underlying Python script that fetches HuggingFace datasets requires an additional parameter, trust_remote_code; when it is not provided, the datasets library falls back to an interactive prompt.
Since the parameter was never passed, that prompt kicked in, but there was no way to answer it through the loader.
This PR adds the parameter both to HuggingfaceDatasetLoader and to the Python script.
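For illustration, here is a minimal sketch of how the new parameter is meant to be used from the Rust side. The builder method name with_trust_remote_code and the item fields are assumed for this example rather than quoted from the PR's diff:

use burn_dataset::{Dataset, HuggingfaceDatasetLoader, SqliteDataset};

// Illustrative item type; a real one must match the dataset's columns.
#[derive(Clone, Debug, serde::Deserialize)]
struct MnistItem {
    image_bytes: Vec<u8>,
    label: usize,
}

fn main() {
    // `mnist` ships a dataset builder script, so HuggingFace requires an
    // explicit opt-in; without this flag the Python side blocks on an
    // interactive prompt that the loader cannot answer.
    let train: SqliteDataset<MnistItem> = HuggingfaceDatasetLoader::new("mnist")
        .with_trust_remote_code(true) // method name assumed from this PR
        .dataset("train")
        .expect("failed to download and export the dataset");
    println!("loaded {} training items", train.len());
}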

Testing

A test module has been added.

codecov bot commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 25 lines in your changes missing coverage. Please review.

Project coverage is 84.38%. Comparing base (0a33aa3) to head (ccc5af3).
Report is 7 commits behind head on main.

Files                                                  Patch %   Lines
crates/burn-dataset/examples/hf_dataset.rs             0.00%     14 Missing ⚠️
.../burn-dataset/src/source/huggingface/downloader.rs  0.00%     11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2013      +/-   ##
==========================================
- Coverage   84.40%   84.38%   -0.03%     
==========================================
  Files         842      843       +1     
  Lines      105179   105204      +25     
==========================================
- Hits        88781    88779       -2     
- Misses      16398    16425      +27     


@antimora (Collaborator) left a comment

Looks good overall. I only have a few small change requests.

Two review threads on crates/burn-dataset/src/source/huggingface/downloader.rs (outdated, resolved)
Set default trust_remote_code to false.
Added an example that highlights the use case.
@Haislich requested a review from antimora, July 15, 2024 10:56
@antimora (Collaborator) commented

I tried running your example, but I am getting this error even though I have the latest datasets package installed.

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: Anthropic/hh-rlhf
Subset Name: None
Sqlite database file: /Users/dilshod/.cache/burn-dataset/Anthropichh-rlhf.db
Trust remote code: None
Custom cache dir: None
********************************************************************************
/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py:2072: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 201, in <module>
    run()
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 190, in run
    download_and_export(
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 36, in download_and_export
    dataset_all = load_dataset(
                  ^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py", line 2112, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py", line 1835, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/builder.py", line 552, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: JsonConfig.__init__() got an unexpected keyword argument 'trust_remote_code'
thread 'main' panicked at crates/burn-dataset/examples/hf_dataset.rs:21:10:
called `Result::unwrap()` on an `Err` value: SqliteDataset(ConnectionPool(Error(Some("unable to open database file: /Users/dilshod/.cache/burn-dataset/Anthropichh-rlhf.db"))))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@Haislich (Contributor, Author) commented

I tried to run it on my machine:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 4.02MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 9.91M/9.91M [00:00<00:00, 31.3MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 28.9k/28.9k [00:00<00:00, 1.84MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.65M/1.65M [00:00<00:00, 17.8MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.54k/4.54k [00:00<?, ?B/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████| 60000/60000 [00:06<00:00, 8994.59 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9294.53 examples/s] 
Dataset: DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 10000
    })
})
Saving dataset: mnist - train
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 117.66ba/s] 
Saving dataset: mnist - test
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 140.53ba/s] 
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\mnist.db))
Table: test
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER

Table: train
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER

Starting huggingface dataset download and export
Dataset Name: Anthropic/hh-rlhf
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db
Trust remote code: None
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 5.77k/5.77k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 13.2M/13.2M [00:00<00:00, 26.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 16.2M/16.2M [00:00<00:00, 46.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 20.1M/20.1M [00:00<00:00, 52.6MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 25.7M/25.7M [00:00<00:00, 37.0MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 743k/743k [00:00<00:00, 3.73MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 875k/875k [00:00<00:00, 3.25MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.05M/1.05M [00:00<00:00, 6.13MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 7.67MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████| 160800/160800 [00:02<00:00, 58801.55 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████| 8552/8552 [00:00<00:00, 50027.32 examples/s] 
Dataset: DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 160800
    })
    test: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 8552
    })
})
Saving dataset: Anthropic/hh-rlhf - train
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████| 161/161 [00:03<00:00, 46.63ba/s]
Saving dataset: Anthropic/hh-rlhf - test
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 57.28ba/s] 
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db))
Table: test
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER

Table: train
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER

This should be the correct behavior, as stated in the docs and the latest release code. What I think is happening is that you have a previously cached Python venv, and subsequent install calls don't upgrade its old dependencies.
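In other words, here is a hedged sketch of the suspected mechanism, with illustrative names rather than the actual downloader.rs code: the dependencies are only pip-installed when the venv is first created, so nothing ever upgrades them afterwards.

use std::path::{Path, PathBuf};

// Illustrative only: the real logic lives in downloader.rs.
fn needs_dependency_install(venv_dir: &Path) -> bool {
    // First run: the venv is missing, so it is created and the packages are
    // pip-installed. Every later run sees the directory and skips the
    // install, leaving an old `datasets` package in place.
    !venv_dir.exists()
}

fn main() {
    let home = std::env::var("HOME").unwrap_or_default();
    let venv = PathBuf::from(home).join(".cache/burn-dataset/venv");
    println!("install dependencies: {}", needs_dependency_install(&venv));
}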

To understand the issue better, I intentionally downgraded the package to version 2.14.7. Based on the release notes, versions like 2.19.2 should not have the problem, and indeed if you look at the source code, those versions do include the fix. I chose version 2.14.7 because I know it does not have the fix, which was pushed about a month after that release.

(venv) PS C:\Users\josed\.cache\burn-dataset\venv\Scripts> pip freeze
[...]
datasets==2.14.7
[...]

And I got exactly the same behavior as yours:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 3.66MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 201, in <module>
    run()
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 190, in run
    download_and_export(
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 36, in download_and_export
    dataset_all = load_dataset(
                  ^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 552, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BuilderConfig.__init__() got an unexpected keyword argument 'trust_remote_code'

First, I don't know whether I have guessed correctly about what is going on, or if I'm on the wrong track.

If that's correct, the issue lies in the stale Python dependencies; to solve it, I tried adding the --upgrade flag to the pip invocation in downloader.rs:

command.args([
    "-m",
    "pip",
    "--quiet",
    "install",
    "--upgrade", // newly added flag
    "pyarrow",
    "sqlalchemy",
    "Pillow",
    "soundfile",
    "datasets",
]);

This causes all dependencies to be upgraded, but other problems arise.
Deleting the ~/.cache/burn-dataset folder that contains the Python venv seems to work fine (i.e. a clean install on subsequent runs), but I honestly don't think that's a good solution.
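One alternative, sketched here as an idea rather than something this PR ships, would be to pin a minimum version for the one package that must be recent, so pip upgrades a stale cached copy without reinstalling everything else. The exact lower bound below is an assumption, not verified against this PR:

command.args([
    "-m",
    "pip",
    "--quiet",
    "install",
    "pyarrow",
    "sqlalchemy",
    "Pillow",
    "soundfile",
    "datasets>=2.16.0", // assumed minimum that accepts trust_remote_code
]);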

@antimora (Collaborator) commented

OK. It seems this was the issue. I forgot we install all dependencies into a venv under the burn-dataset dir.

I guess we will have to tell others the workaround to remove the env:

rm -rf ~/.cache/burn-dataset/venv

@antimora (Collaborator) left a comment

LGTM

@antimora (Collaborator) commented

CC @laggui, @nathanielsimard, @louisfd, @syl20bnr so you're aware of a potential upgrade issue. Basically, the Hugging Face datasets package is now at a much higher version, but the packages under ~/.cache/burn-dataset/venv remain stale; in particular this is a problem with the new flag we are now passing when it is set to true.

@antimora merged commit befe6c1 into tracel-ai:main on Jul 17, 2024
14 checks passed
Closes #2012: Unable to open HuggingFace datasets