Thanks for the repo. When I run torchrun as suggested on the webpage (see below), it fails with an error while compiling cpu_adam within the DeepSpeed library. The actual error comes from an nvcc compile command (see below). The error is:
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
This leads to a follow-up error: AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
I searched around and tried a few things:
NVlabs/instant-ngp#119
microsoft/DeepSpeed#1846
but that did not help. Does anyone have an idea?
My guess is that there is some version issue, either with gcc or the CUDA environment. But since I installed alpaca into a fresh virtual environment (I tried both conda and venv), versioning issues should not really happen. So maybe it is something else...
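In case it matters, these are the toolchain checks I would compare against (a rough sketch; as far as I know ds_report is the diagnostic script that ships with DeepSpeed):

gcc --version                  # host compiler picked up for the extension build
/usr/bin/nvcc --version        # the nvcc that DeepSpeed invokes (see the command below)
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # PyTorch and its CUDA build
ds_report                      # DeepSpeed's environment/compatibility report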
Commands causing errors:
torchrun --nproc_per_node=4 --master_port=23222 train.py --model_name_or_path /home/johannes/modelhf/llama-7b/ --data_path /home/johannes/alpaca/alpaca_data.json --bf16 True --output_dir /home/johannes/outalpaca --num_train_epochs 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --deepspeed "./configs/default_offload_opt_param.json" --tf32 True
Then, later during startup, it calls the following nvcc command, which raises the above error in /usr/include/c++/11/bits/std_function.h:
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/johannes/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /home/johannes/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
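If it helps with debugging, the cpu_adam build can presumably be triggered in isolation, without the full torchrun launch (a sketch, assuming DeepSpeed's op builder API and its DS_BUILD_CPU_ADAM build flag; I have not verified the exact invocation):

# pre-build the op at install time instead of JIT-compiling it during training
DS_BUILD_CPU_ADAM=1 pip install deepspeed --no-cache-dir --force-reinstall
# or force the JIT build directly and watch for the same nvcc error
python -c "from deepspeed.ops.op_builder import CPUAdamBuilder; CPUAdamBuilder().load()"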