cache_dir option in download_config in load_dataset is not respected #8029

@TsXor

Describe the bug

Downloaded files still go to ~/.cache/huggingface/hub/ even though I set the cache_dir option of the download_config passed to load_dataset.
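One way to confirm where the files actually land is to count the files under each candidate directory before and after a run. The sketch below uses only the standard library; the helper names `count_files` and `report_cache_usage` are hypothetical, and both paths in the usage section are assumptions to be replaced with the directories from your own run.

```python
from pathlib import Path


def count_files(root: Path) -> int:
    """Return the number of regular files under root (0 if root is not a directory)."""
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file())


def report_cache_usage(candidates: dict[str, Path]) -> dict[str, int]:
    """Map each label to the number of files found under its directory."""
    return {label: count_files(path) for label, path in candidates.items()}


if __name__ == "__main__":
    # Assumed locations: the default hub cache named in this report, and a
    # hypothetical directory passed as DownloadConfig(cache_dir=...).
    usage = report_cache_usage({
        "default hub cache": Path.home() / ".cache" / "huggingface" / "hub",
        "configured download cache": Path("output") / "download_cache",
    })
    for label, n_files in usage.items():
        print(f"{label}: {n_files} files")
```

If the bug is present, the count for the default hub cache should grow during the download while the configured directory stays empty or is never created.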

Steps to reproduce the bug

Run the script below. The downloaded files do not end up in the directory I specified.

'''
Download the OpenWebText dataset, optionally through a proxy.
'''

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Download the OpenWebText dataset')
    parser.add_argument('--output-path', required=True, metavar='PATH', help='Output directory')
    parser.add_argument('--mirror', required=False, metavar='URL', help='HF mirror URL, e.g. https://hf-mirror.com')
    parser.add_argument('--proxy', required=False, metavar='URL', help='Proxy URL')
    args = parser.parse_args()
else: args = None


import os
import shutil
from pathlib import Path
from typing import cast


if __name__ == '__main__':
    assert args is not None
    output_path = Path(args.output_path).resolve()
    proxy_url = None if args.proxy is None else str(args.proxy)
    mirror_url = None if args.mirror is None else str(args.mirror)

    output_path.mkdir(parents=True, exist_ok=True)
    download_cache_dir = output_path / 'download_cache'
    read_cache_dir = output_path / 'read_cache'
    save_dir = output_path / 'saved'
    complete_mark = output_path / 'completed'

    def clear_cache():
        shutil.rmtree(download_cache_dir)
        shutil.rmtree(read_cache_dir)

    def download_and_save():
        if mirror_url is not None:
            os.environ["HF_ENDPOINT"] = mirror_url

        from datasets import DownloadConfig, load_dataset

        if proxy_url is not None: proxy_dict = { "http": proxy_url, "https": proxy_url }
        else: proxy_dict = None

        dataset = load_dataset(
            'openwebtext',
            cache_dir=str(read_cache_dir),
            download_config=DownloadConfig(cache_dir=download_cache_dir, proxies=proxy_dict)
        )
        dataset.save_to_disk(save_dir)

    if complete_mark.is_file():
        print('OpenWebText is already downloaded')
        clear_cache()
    else:
        download_and_save()
        complete_mark.touch(exist_ok=True)
        clear_cache()

Expected behavior

Downloaded files should go to the directory specified by cache_dir in download_config.
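Until that is the case, a possible workaround is to redirect the hub cache through the `HF_HUB_CACHE` environment variable, which `huggingface_hub` documents as the location of its repository cache. The sketch below is untested against `datasets` 4.6.0; the helper name `redirect_hub_cache` is hypothetical, and it assumes the variable is read when `huggingface_hub` is first imported, so it has to be called before importing `datasets`.

```python
import os
from pathlib import Path


def redirect_hub_cache(target: Path) -> Path:
    """Create target if needed and point HF_HUB_CACHE at it.

    Assumption: huggingface_hub reads HF_HUB_CACHE at import time, so call
    this before `datasets` or `huggingface_hub` is imported.
    """
    target = target.resolve()
    target.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HUB_CACHE"] = str(target)
    return target
```

In the script above, this would mean calling `redirect_hub_cache(download_cache_dir)` at the start of `download_and_save()`, next to the existing `HF_ENDPOINT` assignment and before the `from datasets import ...` line.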

Environment info

> uv run datasets-cli env

Copy-and-paste the text below in your GitHub issue.

- `datasets` version: 4.6.0
- Platform: Windows-11-10.0.26200-SP0
- Python version: 3.14.3
- `huggingface_hub` version: 1.5.0
- PyArrow version: 23.0.1
- Pandas version: 3.0.1
- `fsspec` version: 2026.2.0
