Open
Description
Describe the bug
Downloaded files still go to ~/.cache/huggingface/hub/ even though I passed a DownloadConfig with cache_dir set as the download_config argument of load_dataset.
Steps to reproduce the bug
Run the script below with --output-path set. The downloaded files do not end up in the download cache directory I specified.
'''
Download the OpenWebText dataset, with optional proxy support
'''
if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Download TikToken Files')
    parser.add_argument('--output-path', required=True, metavar='PATH', help='Output directory')
    parser.add_argument('--mirror', required=False, metavar='URL', help='HF mirror URL, e.g. https://hf-mirror.com')
    parser.add_argument('--proxy', required=False, metavar='URL', help='Proxy URL')
    args = parser.parse_args()
else: args = None
import os
import shutil
from pathlib import Path
from typing import cast
if __name__ == '__main__':
    assert args is not None
    output_path = Path(args.output_path).resolve()
    proxy_url = None if args.proxy is None else str(args.proxy)
    mirror_url = None if args.mirror is None else str(args.mirror)
    output_path.mkdir(parents=True, exist_ok=True)
    download_cache_dir = output_path / 'download_cache'
    read_cache_dir = output_path / 'read_cache'
    save_dir = output_path / 'saved'
    complete_mark = output_path / 'completed'
    def clear_cache():
        # ignore_errors avoids a crash when a cache directory was never created
        shutil.rmtree(download_cache_dir, ignore_errors=True)
        shutil.rmtree(read_cache_dir, ignore_errors=True)
    def download_and_save():
        # Set the mirror before importing datasets so the endpoint is picked up
        if mirror_url is not None:
            os.environ["HF_ENDPOINT"] = mirror_url
        from datasets import DownloadConfig, load_dataset
        if proxy_url is not None: proxy_dict = { "http": proxy_url, "https": proxy_url }
        else: proxy_dict = None
        dataset = load_dataset(
            'openwebtext',
            cache_dir=str(read_cache_dir),
            download_config=DownloadConfig(cache_dir=download_cache_dir, proxies=proxy_dict)
        )
        dataset.save_to_disk(save_dir)
    if complete_mark.is_file():
        print('OpenWebText is already downloaded')
        clear_cache()
    else:
        download_and_save()
        complete_mark.touch(exist_ok=True)
        clear_cache()
Expected behavior
Downloaded files should go to the directory I specified via cache_dir in download_config.
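To check where the files actually land, a small stdlib-only sketch like the one below could be run after the download step (and before the caches are cleared). The helper names `summarize_dir` and `report` are hypothetical and not part of `datasets`; the paths passed in are whatever directories you want to compare.

```python
from pathlib import Path


def summarize_dir(path):
    """Return (file_count, total_bytes) for regular files under path.

    Returns (0, 0) if the directory does not exist. Symlinks are skipped so
    that a cache layout with linked snapshot files is not counted twice.
    """
    root = Path(path).expanduser()
    if not root.is_dir():
        return 0, 0
    files = [p for p in root.rglob("*") if p.is_file() and not p.is_symlink()]
    return len(files), sum(p.stat().st_size for p in files)


def report(paths):
    """Print one summary line per candidate cache directory."""
    for p in paths:
        count, size = summarize_dir(p)
        print(f"{p}: {count} files, {size / 1e6:.1f} MB")
```

For example, `report(["~/.cache/huggingface/hub", "<output-path>/download_cache"])` would show which of the two directories received the data, with `<output-path>` replaced by the value passed to the script.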
Environment info
> uv run datasets-cli env
Copy-and-paste the text below in your GitHub issue.
- `datasets` version: 4.6.0
- Platform: Windows-11-10.0.26200-SP0
- Python version: 3.14.3
- `huggingface_hub` version: 1.5.0
- PyArrow version: 23.0.1
- Pandas version: 3.0.1
- `fsspec` version: 2026.2.0
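A possible workaround, which is an assumption on my side and not a fix for the `download_config` behavior: `huggingface_hub` and `datasets` document the `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables, and as far as I know they are read when the libraries are imported. The sketch below sets both before any Hugging Face import. `redirect_hf_caches` and the `hub` / `datasets` subdirectory names are hypothetical.

```python
import os
from pathlib import Path


def redirect_hf_caches(base_dir):
    """Point the Hugging Face hub and datasets caches at base_dir.

    Call this before importing `datasets` or `huggingface_hub`; setting the
    variables after import is assumed to have no effect.
    """
    base = Path(base_dir).resolve()
    hub_cache = base / "hub"            # hypothetical subdirectory name
    datasets_cache = base / "datasets"  # hypothetical subdirectory name
    hub_cache.mkdir(parents=True, exist_ok=True)
    datasets_cache.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HUB_CACHE"] = str(hub_cache)
    os.environ["HF_DATASETS_CACHE"] = str(datasets_cache)
    return hub_cache, datasets_cache


# Intended usage (not executed here, since it triggers a large download):
#   hub_cache, datasets_cache = redirect_hf_caches(output_path)
#   from datasets import load_dataset
#   dataset = load_dataset('openwebtext', cache_dir=str(datasets_cache))
```

This only changes where the caches live for the current process; whether `DownloadConfig(cache_dir=...)` is supposed to control the hub download location is still the open question of this issue.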