
Does preprocessing for pre-training ensure data shuffling? #6408

Closed
1 task done
coding2debug opened this issue Dec 20, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

@coding2debug

Reminder

  • I have read the README and searched the existing issues.

System Info

Python version: 3.11.10

Reproduction

Preprocessing file:

### model
model_name_or_path: Qwen2.5-3B

### method
stage: pt
do_train: true
finetuning_type: full

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 30
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
overwrite_output_dir: true

Training code

### model
model_name_or_path: Qwen2.5-3B
flash_attn: auto

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z0_config.json
enable_liger_kernel: true

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 16
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
logging_steps: 1000
save_steps: 50000
save_total_limit: 5
plot_loss: true
overwrite_output_dir: false
report_to: wandb
run_name: official_qwen_pre_training

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
disable_gradient_checkpointing: true

### eval
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 50000
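For reference, the effective global batch size implied by the train block above is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. The GPU count is not stated in the issue, so it is treated as an assumption in this sketch:

```python
# Values taken from the training config above.
per_device_train_batch_size = 3
gradient_accumulation_steps = 4
# Assumption: the issue does not state the world size; 8 GPUs is illustrative.
num_gpus = 8

# Number of examples contributing to each optimizer step across all devices.
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 96 with the assumed 8 GPUs
```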

Expected behavior

I want to know whether this configuration ensures data shuffling during pre-training and, if possible, where this happens in the code, as I am unable to find it.

Thanks

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 20, 2024

HBin013 commented Dec 20, 2024

That shuffle seems to apply only in "streaming" mode. Check LLaMA-Factory/src/llamafactory/data/loader.py, lines 249-250.

if data_args.streaming:
    dataset = dataset.shuffle(buffer_size=data_args.buffer_size, seed=training_args.seed)
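For context, the streaming `shuffle` in the `datasets` library is a buffered shuffle: it keeps a fixed-size buffer of examples and, as each new example arrives from the stream, emits a randomly chosen buffered one in its place. A minimal pure-Python sketch of that semantics (illustrative only, not the library's actual implementation):

```python
import random

def buffered_shuffle(iterable, buffer_size, seed):
    """Approximate shuffling of a stream: fill a fixed-size buffer,
    then swap each incoming item for a randomly chosen buffered one."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Drain the remaining buffered items in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
```

With a small `buffer_size` relative to the dataset, this is only a local shuffle, which is why `buffer_size` matters for streaming pre-training data.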

hiyouga (Owner) commented Dec 21, 2024

The data will be shuffled during pre-training.
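Some context on the non-streaming path (hedged, based on general Hugging Face Trainer behavior rather than anything shown in this thread): for a map-style tokenized dataset, the Trainer's training DataLoader uses a random sampler, so example order is re-permuted each epoch even though the preprocessing step writes examples to `tokenized_path` in their original order. A minimal sketch of per-epoch index shuffling with a seeded permutation:

```python
import random

def epoch_order(num_examples, seed, epoch):
    """Sketch of per-epoch shuffling: produce a fresh seeded permutation
    of dataset indices for each epoch, as a random sampler would."""
    # Folding the epoch into the seed is illustrative, not the Trainer's
    # exact mechanism.
    rng = random.Random(seed + epoch)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices

order0 = epoch_order(8, seed=42, epoch=0)
order1 = epoch_order(8, seed=42, epoch=1)
```

The same seed and epoch always reproduce the same order, which keeps training resumable and deterministic across runs.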

@hiyouga hiyouga closed this as completed Dec 21, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 21, 2024