Error when using the max_samples parameter to set sample counts for multiple datasets #6349

Closed
1 task done
Zbaoli opened this issue Dec 16, 2024 · 1 comment
Labels
solved This problem has been already solved

Comments

Zbaoli commented Dec 16, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.31
- Python version: 3.10.15
- PyTorch version: 2.5.1+cu124 (GPU)
- Transformers version: 4.46.1
- Datasets version: 3.1.0
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 3090
- DeepSpeed version: 0.15.4
- vLLM version: 0.6.4.post1

Reproduction

Configuration file:

dataset: evol_instruct_zh_gpt4,identity,belle_1k
max_samples: 9000,1000,1000

Expected behavior

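The parameter documentation says that max_samples specifies the maximum number of samples for each dataset, separated by commas.
But the configuration above raises an error: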

[rank2]:     max_samples = min(data_args.max_samples, len(dataset))
[rank2]: TypeError: '<' not supported between instances of 'int' and 'str'

I traced it to this place in the code:

def _load_single_dataset(...):
    ...
    if data_args.max_samples is not None:  # truncate dataset
        max_samples = min(data_args.max_samples, len(dataset))
        dataset = dataset.select(range(max_samples))
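
The traceback is consistent with max_samples reaching this line as the raw string "9000,1000,1000" rather than an integer. A minimal sketch that reproduces the same TypeError outside LLaMA-Factory is below; `dataset_len` is a made-up stand-in for `len(dataset)`, and how the value ends up as a string inside the library is an assumption, not something confirmed in this thread.

```python
# Standalone reproduction; this is not LLaMA-Factory code.
# Assumption: the comma-separated config value arrives as a plain str.
max_samples = "9000,1000,1000"
dataset_len = 50_000  # hypothetical stand-in for len(dataset)

try:
    # min() compares the int against the str, which Python 3 does not allow.
    min(max_samples, dataset_len)
    message = None
except TypeError as err:
    message = str(err)

print(message)

# With a single integer the same expression works and truncates as intended.
truncated_len = min(9000, dataset_len)
print(truncated_len)
```

This suggests the code path shown above expects one integer cap applied to every dataset, not a per-dataset list.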

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 16, 2024
hiyouga (Owner) commented Dec 17, 2024

Use num_samples instead of max_samples
https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md
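
As a rough illustration of this suggestion, the sketch below adds a per-dataset `num_samples` value to dataset entries. It assumes, per the linked README, that `num_samples` is an optional key of each dataset's entry in dataset_info.json. The helper `with_num_samples` and the `file_name` values are hypothetical placeholders, not taken from the repository.

```python
import json

# Per-dataset caps taken from the configuration in this issue.
caps = {"evol_instruct_zh_gpt4": 9000, "identity": 1000, "belle_1k": 1000}


def with_num_samples(dataset_info: dict, caps: dict) -> dict:
    """Return a copy of dataset_info with num_samples set for each capped dataset."""
    updated = {name: dict(entry) for name, entry in dataset_info.items()}
    for name, cap in caps.items():
        if name not in updated:
            raise KeyError(f"dataset {name!r} is not defined in dataset_info")
        updated[name]["num_samples"] = cap
    return updated


# Placeholder entries; real entries carry whatever fields the README requires.
dataset_info = {
    "evol_instruct_zh_gpt4": {"file_name": "evol_instruct_zh_gpt4.json"},
    "identity": {"file_name": "identity.json"},
    "belle_1k": {"file_name": "belle_1k.json"},
}

result = with_num_samples(dataset_info, caps)
print(json.dumps(result, indent=2))
```

With the caps stored per dataset, the training config would keep only the `dataset:` line and drop the comma-separated `max_samples` value.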

@hiyouga hiyouga closed this as completed Dec 17, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 17, 2024