
huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url #2375

Open
cs-mshah opened this issue Jul 5, 2024 · 1 comment
Labels: bug (Something isn't working)


cs-mshah commented Jul 5, 2024

Describe the bug

I am trying to upload a large dataset to the Hugging Face Hub, but I frequently hit timeouts with the following error:

    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: <url>

Reproduction

After a lot of searching, I ended up with the following script for uploading a large dataset (~100 GB) to Hugging Face datasets using the huggingface_hub API:

import autoroot
import os
from huggingface_hub import HfApi, CommitOperationAdd, HfFileSystem, preupload_lfs_files
from pathlib import Path
from loguru import logger as log
import argparse
import multiprocessing

api = HfApi(token=os.environ["token"])
fs = HfFileSystem(token=os.environ["token"])

def get_all_files(root: Path, include_patterns=None, ignore_patterns=None):
    # Yield every file under `root` that matches the include patterns
    # and none of the ignore patterns (patterns are plain substrings).
    include_patterns = include_patterns or []
    ignore_patterns = ignore_patterns or []

    def is_ignored(path):
        return any(pattern in str(path) for pattern in ignore_patterns)

    def is_included(path):
        # An empty include list means "include everything".
        if not include_patterns:
            return True
        return any(pattern in str(path) for pattern in include_patterns)

    dirs = [root]
    while dirs:
        directory = dirs.pop()
        for candidate in directory.iterdir():
            if candidate.is_file() and not is_ignored(candidate) and is_included(candidate):
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)


def get_groups_of_n(n: int, iterator):
    # Yield lists of at most `n` items from `iterator`.
    assert n >= 1
    buffer = []
    for elt in iterator:
        buffer.append(elt)
        if len(buffer) == n:
            yield buffer
            buffer = []
    if buffer:
        yield buffer


def main(args):
    if args.operation == "upload":
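        # Collect the .hdf5 files already on the Hub and add them to the ignore
        # patterns, so re-running the script skips files that were already uploaded.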
        remote_root = Path(os.path.join("datasets", args.repo_id))
        all_remote_files = fs.glob(os.path.join("datasets", args.repo_id, "**/*.hdf5"))
        all_remote_files = [
            str(Path(file).relative_to(remote_root)) for file in all_remote_files
        ]
        args.ignore_patterns.extend(all_remote_files)

        root = Path(args.root_directory)
        num_threads = args.num_threads
        if num_threads is None:
            num_threads = multiprocessing.cpu_count()
        for i, file_paths in enumerate(get_groups_of_n(args.group_size, get_all_files(root, args.include_patterns, args.ignore_patterns))):
            log.info(f"Committing {len(file_paths)} files...")
            # path_in_repo is file_path relative to `relative_root` (note: not `root_directory`)
            operations = []  # all `CommitOperationAdd` objects for this commit
            for file_path in file_paths:
                addition = CommitOperationAdd(
                    path_in_repo=str(file_path.relative_to(Path(args.relative_root))),
                    path_or_fileobj=str(file_path),
                )
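                # Upload the file's LFS payload immediately; `create_commit` below
                # then only registers the already-uploaded files.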
                preupload_lfs_files(
                    args.repo_id,
                    [addition],
                    token=os.environ["token"],
                    num_threads=num_threads,
                    repo_type="dataset",
                )
                operations.append(addition)

            commit_info = api.create_commit(
                repo_id=args.repo_id,
                operations=operations,
                commit_message=f"Upload part {i}",
                repo_type="dataset",
                token=os.environ["token"],
                num_threads=num_threads
            )
            log.info(f"Commit {i} done: {commit_info.commit_message}")

    elif args.operation == "delete":
        api.delete_folder(
            args.path_in_repo,
            repo_id=args.repo_id,
            repo_type="dataset",
            commit_description="Delete old folder",
            token=os.environ["token"],
        )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--operation", type=str, default="upload", choices=["upload", "delete"])
    parser.add_argument("--group_size", type=int, default=100)
    parser.add_argument("--repo_id", type=str)
    parser.add_argument(
        "--relative_root",
        type=str,
        help="Root that file paths are made relative to when computing path_in_repo.",
    )
    parser.add_argument("--root_directory", type=str, help="Root directory to upload (or delete).")
    parser.add_argument("--path_in_repo", type=str, help="Path in the repo to delete")
    parser.add_argument("--ignore_patterns", help="Patterns to ignore", nargs="+", default=["spurious", "resources"])
    parser.add_argument("--include_patterns", help="Patterns to include", nargs="+", default=["hdf5", "csv"])
    parser.add_argument("--num_threads", type=int, default=None, help="Number of threads to use for uploading.")
    args = parser.parse_args()
    main(args)
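
A possible mitigation (a sketch, not part of the original script): since the 504s are transient, the commit step can be wrapped in a retry loop with exponential backoff. The helper name, retry count, and delays below are illustrative; `api` and `log` are the objects defined above.

import time
from huggingface_hub.utils import HfHubHTTPError

def create_commit_with_retries(max_retries=5, base_delay=2.0, **commit_kwargs):
    # Retry `create_commit` on transient server errors (e.g. 504), doubling the delay each time.
    for attempt in range(max_retries):
        try:
            return api.create_commit(**commit_kwargs)
        except HfHubHTTPError as err:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2**attempt
            log.warning(f"Commit failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)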

Logs

No response

System info

- huggingface_hub version: 0.23.4
- Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /data/.cache/token
- Has saved token ?: True
- Who am I ?: cs-mshah
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.3.1
- Jinja2: 3.0.3
- Graphviz: N/A
- keras: 2.14.0
- Pydot: N/A
- Pillow: 9.5.0
- hf_transfer: 0.1.6
- gradio: 3.50.0
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.7.4
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /data/.cache/hub
- HF_ASSETS_CACHE: /data/.cache/assets
- HF_TOKEN_PATH: /data/.cache/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

I have separately added this to the `.env` file:


HF_HUB_ENABLE_HF_TRANSFER=1
HF_HUB_ETAG_TIMEOUT=500
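
Note: huggingface_hub reads these variables at import time, and the system info above still shows HF_HUB_ENABLE_HF_TRANSFER: False and HF_HUB_ETAG_TIMEOUT: 10, which suggests the `.env` file is loaded too late. A minimal sketch of setting them before the import:

import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"

# Import only after the environment is set so the values are picked up.
from huggingface_hub import HfApi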
cs-mshah added the bug label Jul 5, 2024
Wauplin (Contributor) commented Jul 9, 2024

Hi @cs-mshah, for uploading very large folders to the Hub, you might want to have a look at #2254. It's not merged yet but is starting to mature. It's an upload method with advanced retry mechanisms that should help you.
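
For reference, a sketch of how the API from that PR could be called once it lands (the method name upload_large_folder is taken from the PR; the exact signature, repo id, and path below are assumptions):

import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["token"])

# Per the PR description, this method retries transient errors internally.
api.upload_large_folder(
    repo_id="username/my-dataset",          # placeholder repo id
    repo_type="dataset",
    folder_path="/path/to/local/dataset",   # placeholder path
)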
