FIX VeRA failure on multiple GPUs #2163

BenjaminBossan · 2024-10-18T10:23:33Z

The shared buffers vera_A and vera_B could be on the wrong device when using multiple GPUs, resulting in an error. This PR moves them to the correct device to fix the error.

Example of a failing run: https://github.com/huggingface/peft/actions/runs/11396317278/job/31709933958

Since these buffers are shared, I chose not to move the whole buffer to the device. Instead, when we create the slices from those buffers during forward, I move only those slices to the correct device. This could be inefficient in terms of runtime, but IIUC, the alternative would be to create new copies of these buffers per device, using more memory.
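To illustrate the idea, here is a minimal sketch of moving only the slices during forward. It is not the actual PEFT code: the class `SharedProjectionLayer`, its constructor arguments, and the names `lambda_d` and `lambda_b` are assumptions made for this example, and the shared tensors are stored as plain attributes rather than in PEFT's own buffer container.

```python
import torch
import torch.nn as nn


class SharedProjectionLayer(nn.Module):
    """Hypothetical, simplified VeRA-style layer (not the PEFT implementation)."""

    def __init__(self, base_layer: nn.Linear, vera_A: torch.Tensor, vera_B: torch.Tensor, r: int):
        super().__init__()
        self.base_layer = base_layer
        # Plain attributes: the tensors are shared by reference between layers,
        # so they stay on whatever device they were created on.
        self.vera_A = vera_A  # assumed shape: (r, max_in_features)
        self.vera_B = vera_B  # assumed shape: (max_out_features, r)
        self.lambda_d = nn.Parameter(torch.ones(r))
        self.lambda_b = nn.Parameter(torch.zeros(base_layer.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_features = self.base_layer.in_features
        out_features = self.base_layer.out_features
        # Slice first, then move only the slice to the device of the input.
        # The full shared tensors are left where they are.
        sliced_A = self.vera_A[:, :in_features].to(x.device)
        sliced_B = self.vera_B[:out_features, :].to(x.device)
        scaled_A = self.lambda_d.unsqueeze(-1) * sliced_A  # (r, in_features)
        scaled_B = self.lambda_b.unsqueeze(-1) * sliced_B  # (out_features, r)
        update = (x @ scaled_A.T) @ scaled_B.T
        return self.base_layer(x) + update


# Two layers of different sizes share the same projection tensors.
r = 4
shared_A = torch.randn(r, 16)
shared_B = torch.randn(32, r)
layer_1 = SharedProjectionLayer(nn.Linear(16, 32), shared_A, shared_B, r)
layer_2 = SharedProjectionLayer(nn.Linear(8, 8), shared_A, shared_B, r)
```

When the slice is already on the target device, `.to(x.device)` returns the tensor unchanged, so the extra cost only arises for layers placed on a different device than the shared tensors.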

The failing tests were introduced in #2076 but the error was already there beforehand.

I did not discover these failing tests earlier because we had a concurrent error caused by a transformers issue which looked very similar, and I wrongly assumed that the VeRA error was caused by the same issue. Now that the transformers issue has been fixed, the VeRA error still persists, which prompted me to investigate.

HuggingFaceDocBuilderDev · 2024-10-18T10:27:08Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan · 2024-10-18T11:09:51Z

@dkopi @vvvm23 It would be great if you could double check if this is the best solution or if there is a better way.

Merge branch 'main' into fix-vera-multi-gpu-device-error

5fdff83
