
Added parameter trust_remote_code to hf dataset call. #2013

Merged: 2 commits, Jul 17, 2024

Conversation

@Haislich (Contributor) commented Jul 13, 2024

Pull Request Template

Checklist

  • Confirmed that the run-checks all script has been executed. Some tests failed on Windows, but they are unrelated to this fix: tanh_should_not_have_numerical_bugs_on_macos and nn::rope_encoding::tests::test_rotary_encoding_forward.
  • Made sure the book is up to date with changes in this PR.

Related Issues/PRs

Fixes #2012

Changes

The underlying Python script that fetches HuggingFace datasets requires an additional parameter, trust_remote_code; when it is not provided, the datasets library falls back to an interactive prompt.
Since the parameter was never passed, that prompt kicked in, but there was no way to answer it through the loader.
This PR adds the parameter both to HuggingfaceDatasetLoader and to the Python script.
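For illustration, here is a minimal sketch of how the new parameter is meant to be used from the Rust side. The builder method name with_trust_remote_code and the item fields are assumed for this example rather than quoted from the PR's diff:

use burn_dataset::{Dataset, HuggingfaceDatasetLoader, SqliteDataset};

// Illustrative item type; a real one must match the dataset's columns.
#[derive(Clone, Debug, serde::Deserialize)]
struct MnistItem {
    image_bytes: Vec<u8>,
    label: usize,
}

fn main() {
    // `mnist` ships a dataset builder script, so HuggingFace requires an
    // explicit opt-in; without this flag the Python side blocks on an
    // interactive prompt that the loader cannot answer.
    let train: SqliteDataset<MnistItem> = HuggingfaceDatasetLoader::new("mnist")
        .with_trust_remote_code(true) // method name assumed from this PR
        .dataset("train")
        .expect("failed to download and export the dataset");
    println!("loaded {} training items", train.len());
}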

Testing

A test module has been added.

codecov bot commented Jul 13, 2024

Codecov Report

Attention: Patch coverage is 0% with 25 lines in your changes missing coverage. Please review.

Project coverage is 84.38%. Comparing base (0a33aa3) to head (ccc5af3).
Report is 7 commits behind head on main.

Files                                                  Patch %   Lines
crates/burn-dataset/examples/hf_dataset.rs             0.00%     14 Missing ⚠️
.../burn-dataset/src/source/huggingface/downloader.rs  0.00%     11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2013      +/-   ##
==========================================
- Coverage   84.40%   84.38%   -0.03%     
==========================================
  Files         842      843       +1     
  Lines      105179   105204      +25     
==========================================
- Hits        88781    88779       -2     
- Misses      16398    16425      +27     


@antimora (Collaborator) left a comment

Looks good overall. I only have a few small change requests.

Two review threads on crates/burn-dataset/src/source/huggingface/downloader.rs (outdated, resolved)
Set default trust_remote_code to false.
Added an example that highlights the use case.
@Haislich requested a review from antimora, July 15, 2024 10:56
@antimora (Collaborator) commented

I tried running your example, but I am getting this error even though I have the latest datasets package installed.

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: Anthropic/hh-rlhf
Subset Name: None
Sqlite database file: /Users/dilshod/.cache/burn-dataset/Anthropichh-rlhf.db
Trust remote code: None
Custom cache dir: None
********************************************************************************
/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py:2072: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 201, in <module>
    run()
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 190, in run
    download_and_export(
  File "/Users/dilshod/.cache/burn-dataset/importer.py", line 36, in download_and_export
    dataset_all = load_dataset(
                  ^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py", line 2112, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/load.py", line 1835, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dilshod/.cache/burn-dataset/venv/lib/python3.11/site-packages/datasets/builder.py", line 552, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: JsonConfig.__init__() got an unexpected keyword argument 'trust_remote_code'
thread 'main' panicked at crates/burn-dataset/examples/hf_dataset.rs:21:10:
called `Result::unwrap()` on an `Err` value: SqliteDataset(ConnectionPool(Error(Some("unable to open database file: /Users/dilshod/.cache/burn-dataset/Anthropichh-rlhf.db"))))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@Haislich (Contributor, Author) commented

I tried to run it on my machine:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 4.02MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 9.91M/9.91M [00:00<00:00, 31.3MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 28.9k/28.9k [00:00<00:00, 1.84MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.65M/1.65M [00:00<00:00, 17.8MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4.54k/4.54k [00:00<?, ?B/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████| 60000/60000 [00:06<00:00, 8994.59 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9294.53 examples/s] 
Dataset: DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 60000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 10000
    })
})
Saving dataset: mnist - train
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 117.66ba/s] 
Saving dataset: mnist - test
Dataset features: {'image_bytes': Value(dtype='binary', id=None), 'image_path': Value(dtype='string', id=None), 'label': ClassLabel(names=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], id=None)}
Creating SQL from Arrow format: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 140.53ba/s] 
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\mnist.db))
Table: test
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER

Table: train
Column: image_bytes - BLOB
Column: image_path - TEXT
Column: label - BIGINT
Column: row_id - INTEGER

Starting huggingface dataset download and export
Dataset Name: Anthropic/hh-rlhf
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db
Trust remote code: None
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2554: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 5.77k/5.77k [00:00<?, ?B/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 13.2M/13.2M [00:00<00:00, 26.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 16.2M/16.2M [00:00<00:00, 46.7MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 20.1M/20.1M [00:00<00:00, 52.6MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 25.7M/25.7M [00:00<00:00, 37.0MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 743k/743k [00:00<00:00, 3.73MB/s]
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████| 875k/875k [00:00<00:00, 3.25MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.05M/1.05M [00:00<00:00, 6.13MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 7.67MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████| 160800/160800 [00:02<00:00, 58801.55 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████| 8552/8552 [00:00<00:00, 50027.32 examples/s] 
Dataset: DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 160800
    })
    test: Dataset({
        features: ['chosen', 'rejected'],
        num_rows: 8552
    })
})
Saving dataset: Anthropic/hh-rlhf - train
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████| 161/161 [00:03<00:00, 46.63ba/s]
Saving dataset: Anthropic/hh-rlhf - test
Dataset features: {'chosen': Value(dtype='string', id=None), 'rejected': Value(dtype='string', id=None)}
Creating SQL from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 57.28ba/s] 
Printing table schema for sqlite3 db (Engine(sqlite:///C:\Users\josed\.cache\burn-dataset\Anthropichh-rlhf.db))
Table: test
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER

Table: train
Column: chosen - TEXT
Column: rejected - TEXT
Column: row_id - INTEGER

This should be the correct behavior, as stated in the docs and the latest release code. What I think is happening is that you have a previously cached Python venv, and subsequent install calls don't upgrade its old dependencies.
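In other words, here is a hedged sketch of the suspected mechanism, with illustrative names rather than the actual downloader.rs code: the dependencies are only pip-installed when the venv is first created, so nothing ever upgrades them afterwards.

use std::path::{Path, PathBuf};

// Illustrative only: the real logic lives in downloader.rs.
fn needs_dependency_install(venv_dir: &Path) -> bool {
    // First run: the venv is missing, so it is created and the packages are
    // pip-installed. Every later run sees the directory and skips the
    // install, leaving an old `datasets` package in place.
    !venv_dir.exists()
}

fn main() {
    let home = std::env::var("HOME").unwrap_or_default();
    let venv = PathBuf::from(home).join(".cache/burn-dataset/venv");
    println!("install dependencies: {}", needs_dependency_install(&venv));
}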

To understand the issue better, I intentionally downgraded the package to version 2.14.7. Based on the release notes, versions like 2.19.2 should not have the problem, and indeed if you look at the source code, those versions do include the fix. I chose version 2.14.7 because I know it does not have the fix, which was pushed about a month after that release.

(venv) PS C:\Users\josed\.cache\burn-dataset\venv\Scripts> pip freeze
[...]
datasets==2.14.7
[...]

And I got exactly the same behavior as yours:

********************************************************************************
Starting huggingface dataset download and export
Dataset Name: mnist
Subset Name: None
Sqlite database file: C:\Users\josed\.cache\burn-dataset\mnist.db
Trust remote code: True
Custom cache dir: None
********************************************************************************
C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
  warnings.warn(
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████| 3.98k/3.98k [00:00<00:00, 3.66MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████| 6.83k/6.83k [00:00<?, ?B/s]
Traceback (most recent call last):
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 201, in <module>
    run()
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 190, in run
    download_and_export(
  File "C:\Users\josed\.cache\burn-dataset\importer.py", line 36, in download_and_export
    dataset_all = load_dataset(
                  ^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
                                       ^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\josed\.cache\burn-dataset\venv\Lib\site-packages\datasets\builder.py", line 552, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: BuilderConfig.__init__() got an unexpected keyword argument 'trust_remote_code'

First, I don't know whether I have guessed correctly about what is going on, or if I'm on the wrong track.

If that's correct, the issue lies in the stale Python dependencies; to solve it, I tried adding the --upgrade flag to the pip invocation in downloader.rs:

command.args([
    "-m",
    "pip",
    "--quiet",
    "install",
    "--upgrade", // newly added flag
    "pyarrow",
    "sqlalchemy",
    "Pillow",
    "soundfile",
    "datasets",
]);

This causes all dependencies to be upgraded, but other problems arise.
Deleting the ~/.cache/burn-dataset folder that contains the Python venv seems to work fine (i.e. a clean install on subsequent runs), but I honestly don't think that's a good solution.
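One alternative, sketched here as an idea rather than something this PR ships, would be to pin a minimum version for the one package that must be recent, so pip upgrades a stale cached copy without reinstalling everything else. The exact lower bound below is an assumption, not verified against this PR:

command.args([
    "-m",
    "pip",
    "--quiet",
    "install",
    "pyarrow",
    "sqlalchemy",
    "Pillow",
    "soundfile",
    "datasets>=2.16.0", // assumed minimum that accepts trust_remote_code
]);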

@antimora (Collaborator) commented

OK. It seems this was the issue. I forgot we install all dependencies into a venv under the burn-dataset dir.

I guess we will have to tell others the workaround to remove the env:

rm -rf ~/.cache/burn-dataset/venv

@antimora (Collaborator) left a comment

LGTM

@antimora (Collaborator) commented

CC @laggui, @nathanielsimard, @louisfd, @syl20bnr so you're aware of a potential upgrade issue. Basically, the Hugging Face datasets package is now at a much higher version, but the packages under ~/.cache/burn-dataset/venv remain stale; in particular this is a problem with the new flag we are now passing when it is set to true.

@antimora merged commit befe6c1 into tracel-ai:main on Jul 17, 2024
14 checks passed
Closes #2012: Unable to open HuggingFace datasets