Hi,
I fine-tuned the LLaMA 7B model using Alpaca. Time passed and the fine-tuning is now complete. Below is the final part of the log.
[2023-07-20 18:05:19,608] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-20 18:05:19,608] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-20 18:05:29,973] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
ninja: no work to do.
Time to load cpu_adam op: 2.9780356884002686 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.93, 'learning_rate': 1.9131861575179e-05, 'epoch': 0.22}
{'loss': 0.8989, 'learning_rate': 1.7640214797136038e-05, 'epoch': 0.43}
[2023-07-21 10:45:40,546] [WARNING] [stage3.py:1850:step] 2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding get_accelerator().empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
{'loss': 0.8909, 'learning_rate': 1.614856801909308e-05, 'epoch': 0.65}
{'loss': 0.8855, 'learning_rate': 1.4656921241050121e-05, 'epoch': 0.87}
{'loss': 0.8753, 'learning_rate': 1.316527446300716e-05, 'epoch': 1.09}
{'loss': 0.8607, 'learning_rate': 1.1673627684964201e-05, 'epoch': 1.3}
{'loss': 0.8557, 'learning_rate': 1.0181980906921243e-05, 'epoch': 1.52}
{'loss': 0.8521, 'learning_rate': 8.690334128878282e-06, 'epoch': 1.74}
{'loss': 0.8464, 'learning_rate': 7.198687350835323e-06, 'epoch': 1.95}
{'loss': 0.7611, 'learning_rate': 5.707040572792363e-06, 'epoch': 2.17}
{'loss': 0.7234, 'learning_rate': 4.2153937947494036e-06, 'epoch': 2.39}
{'loss': 0.7142, 'learning_rate': 2.723747016706444e-06, 'epoch': 2.6}
{'loss': 0.7086, 'learning_rate': 1.2321002386634846e-06, 'epoch': 2.82}
{'train_runtime': 403223.9352, 'train_samples_per_second': 2.194, 'train_steps_per_second': 0.017, 'train_loss': 0.8234077206364384, 'epoch': 3.0}
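(Side note on the allocator cache flush warning in the log: DeepSpeed's own suggestion is to call get_accelerator().empty_cache() at the same point on all ranks. A minimal sketch below, assuming a hand-written training loop; this run uses the HF Trainer, so it is illustrative only, and the interval is an arbitrary choice.)

from deepspeed.accelerator import get_accelerator

def maybe_empty_cache(step, every=50):
    # Flush the allocator cache at the same step on every rank, as the
    # warning suggests; every=50 is an arbitrary illustrative interval.
    if step % every == 0:
        get_accelerator().empty_cache()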
Here's what I've tried: I created an output file named result.bin, and this file was created.
The model path is set to output_my_own_data/7B/checkpoint-6000 and the output_path to output_my_own_data/7B/result.bin.
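(The exact command that produced result.bin isn't shown here. Since the run uses DeepSpeed ZeRO-3, one common way is the zero_to_fp32 utility; a minimal sketch follows, assuming checkpoint-6000 is a ZeRO checkpoint directory. The function choice is an assumption, not necessarily what was actually run.)

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "output_my_own_data/7B/checkpoint-6000"   # model path from above
output_path = "output_my_own_data/7B/result.bin"           # output_path from above

# Consolidate the partitioned ZeRO-3 shards into a single fp32 state_dict
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
torch.save(state_dict, output_path)  # this is a bare state_dict, with no config.json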
I loaded the model as below.
from transformers import LlamaForCausalLM, LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("./output_my_own_data/7B/tokenizer.model")
model = LlamaForCausalLM.from_pretrained("./output_my_own_data/7B/result.bin")
print()
The tokenizer loads fine, but loading the model raises an error:
Exception has occurred: OSError
It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
File "/home/sulki/project/stanford_alpaca-main/inference copy.py", line 4, in <module>
model = LlamaForCausalLM.from_pretrained("/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin")
OSError: It looks like the config file at '/home/sulki/project/stanford_alpaca-main/output_tawos_34/7B/result.bin' is not a valid JSON file.
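(For comparison, from_pretrained expects a model directory containing config.json plus the weight file, not the path of a .bin file itself, which is why it tries to parse result.bin as JSON. A directory-based load would look roughly like this; it is only a sketch, and whether checkpoint-6000 actually contains config.json and consolidated weights is an assumption.)

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_dir = "./output_my_own_data/7B/checkpoint-6000"   # a directory, not a single .bin file
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
model = LlamaForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16)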
Which part is the problem?