
Does preprocessing for pre-training ensure data shuffling? #6408

Closed
1 task done
coding2debug opened this issue Dec 20, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

@coding2debug

Reminder

  • I have read the README and searched the existing issues.

System Info

Python version: 3.11.10

Reproduction

Preprocessing file:

### model
model_name_or_path: Qwen2.5-3B

### method
stage: pt
do_train: true
finetuning_type: full

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: true
preprocessing_num_workers: 30
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
overwrite_output_dir: true

Training code

### model
model_name_or_path: Qwen2.5-3B
flash_attn: auto

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z0_config.json
enable_liger_kernel: true

### dataset
dataset: llm_train
eval_dataset: llm_valid
cutoff_len: 4096
overwrite_cache: false
preprocessing_num_workers: 16
preprocessing_batch_size: 1000
tokenized_path: tokenized_data_2048

### output
output_dir: qwen2_out
logging_steps: 1000
save_steps: 50000
save_total_limit: 5
plot_loss: true
overwrite_output_dir: false
report_to: wandb
run_name: official_qwen_pre_training

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 4
learning_rate: 5.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
disable_gradient_checkpointing: true

### eval
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 50000
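For reference, the effective global batch size implied by the train block above is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs. The GPU count is not stated in the issue, so it is treated as an assumption in this sketch:

```python
# Values taken from the training config above.
per_device_train_batch_size = 3
gradient_accumulation_steps = 4
# Assumption: the issue does not state the world size; 8 GPUs is illustrative.
num_gpus = 8

# Number of examples contributing to each optimizer step across all devices.
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 96 with the assumed 8 GPUs
```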

Expected behavior

I want to know whether this configuration ensures data shuffling during pre-training and, if possible, where this happens in the code, as I am unable to find it.

Thanks

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Dec 20, 2024

HBin013 commented Dec 20, 2024

That shuffle seems to apply only in "streaming" mode. Check LLaMA-Factory/src/llamafactory/data/loader.py, lines 249-250.

if data_args.streaming:
    dataset = dataset.shuffle(buffer_size=data_args.buffer_size, seed=training_args.seed)
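For context, the streaming `shuffle` in the `datasets` library is a buffered shuffle: it keeps a fixed-size buffer of examples and, as each new example arrives from the stream, emits a randomly chosen buffered one in its place. A minimal pure-Python sketch of that semantics (illustrative only, not the library's actual implementation):

```python
import random

def buffered_shuffle(iterable, buffer_size, seed):
    """Approximate shuffling of a stream: fill a fixed-size buffer,
    then swap each incoming item for a randomly chosen buffered one."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Drain the remaining buffered items in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
```

With a small `buffer_size` relative to the dataset, this is only a local shuffle, which is why `buffer_size` matters for streaming pre-training data.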

hiyouga (Owner) commented Dec 21, 2024

The data will be shuffled during pre-training.
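Some context on the non-streaming path (hedged, based on general Hugging Face Trainer behavior rather than anything shown in this thread): for a map-style tokenized dataset, the Trainer's training DataLoader uses a random sampler, so example order is re-permuted each epoch even though the preprocessing step writes examples to `tokenized_path` in their original order. A minimal sketch of per-epoch index shuffling with a seeded permutation:

```python
import random

def epoch_order(num_examples, seed, epoch):
    """Sketch of per-epoch shuffling: produce a fresh seeded permutation
    of dataset indices for each epoch, as a random sampler would."""
    # Folding the epoch into the seed is illustrative, not the Trainer's
    # exact mechanism.
    rng = random.Random(seed + epoch)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices

order0 = epoch_order(8, seed=42, epoch=0)
order1 = epoch_order(8, seed=42, epoch=1)
```

The same seed and epoch always reproduce the same order, which keeps training resumable and deterministic across runs.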

@hiyouga hiyouga closed this as completed Dec 21, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 21, 2024