
POC: multiple model/configuration DeepSpeed support #3097

Merged: 65 commits merged into main on Sep 13, 2024

Conversation

@muellerzr (Collaborator) commented Sep 10, 2024

Multiple model/multiple configuration support for DeepSpeed

What does this add?

While we await a potential rewrite of much of the DeepSpeed (DS) implementation to get rid of the engine mechanic, this PR provides a way for users to use multiple DeepSpeed plugins (potentially with a single Accelerator) so that different models can be prepared in different ways/with different DS configurations.

Who is it for?

Solves #2496 (Finally)

Why is it needed?

There are many cases where users may want to use different DS configs or prepare multiple models when using DS. One such case is TRL, where we may want an untrained model distributed with ZeRO-3 while another uses ZeRO-2.

Note: for the case of multiple models being trained, a user should create a second Accelerator object (since we require the prepared engine to do backward()). Since the plugins are stored in the state, you don't need to re-pass the plugins; just create a secondary Accelerator.
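A minimal sketch of that pattern, reusing the naming from the review thread further below (the config paths and plugin names are placeholders):

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Hypothetical config paths; each plugin wraps its own DeepSpeed config
deepspeed_plugins = {
    "model1": DeepSpeedPlugin(hf_ds_config="ds_config_model1.json"),
    "model2": DeepSpeedPlugin(hf_ds_config="ds_config_model2.json"),
}

# The first Accelerator registers both plugins in the shared AcceleratorState
accelerator_0 = Accelerator(deepspeed_plugin=deepspeed_plugins)
# The second Accelerator picks the plugins up from that shared state,
# so they are not passed in again
accelerator_1 = Accelerator()
```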

What parts of the API does this impact?

User-facing:

Users may now pass multiple DeepSpeedPlugins to the Accelerator (a dict of named plugins in the examples below). The first plugin is the "default": if you don't set an active plugin, it is what's used during all calls to .prepare(). When a user wants a different configuration, they simply call plugin.enable(), which in turn sets up that particular plugin as the one accelerator.prepare() will use (see examples below).

Internal structure:

  • Accelerator/AcceleratorState now stores multiple DeepSpeedPlugins
  • get_active_deepspeed_plugin goes through the stored DeepSpeedPlugins and returns the one that is enabled (see the sketch after this list)
  • Only a single DS plugin should be "active" at a time
  • We always try to set up the cached transformers plugin, since we may constantly be swapping out which internals are in use
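As a quick illustration of that helper, here is a sketch assuming the `accelerator` and `plugins` dict from the basic usage example below (the exact import path is an assumption; the utility lives in accelerate's DeepSpeed utils):

```python
from accelerate.utils.deepspeed import get_active_deepspeed_plugin

# Returns whichever stored DeepSpeedPlugin is currently enabled; until another
# plugin is enabled, this is the default (first) one, "student" in the example below.
active_plugin = get_active_deepspeed_plugin(accelerator.state)
assert active_plugin is plugins["student"]
```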

Basic Usage Example(s):

```python
from accelerate import Accelerator, DeepSpeedPlugin

# `config_file_a`/`config_file_b` are paths to two different DeepSpeed configs
plugin1 = DeepSpeedPlugin(hf_ds_config=config_file_a)
plugin2 = DeepSpeedPlugin(hf_ds_config=config_file_b)

plugins = {"student": plugin1, "teacher": plugin2}

accelerator = Accelerator(deepspeed_plugin=plugins)
# The one we're training on; the first plugin ("student") is the default
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
# Our inference/secondary config
accelerator.state.enable_deepspeed_plugin("teacher")
model2 = accelerator.prepare(model2)
```

Anticipated maintenance burden? (What will happen in, say, 3 months if something changes)

Ideally this stays how it is, and we can remove the secondary/n-model Accelerators later once we've worked with the DS team to make the DS API more PyTorch-ic.

Who can review?

@BenjaminBossan
@SunMarc
@stas00


@stas00 (Contributor) left a comment:

Testing: I think you want an actual test where you do fwd/bwd with 2 models. This is insufficient to test that it works correctly, IMHO.

@muellerzr linked an issue on Sep 12, 2024 that may be closed by this pull request.
@SunMarc (Member) left a comment:

Thanks for the huge work @muellerzr! I mostly went through the examples and the tutorial. Just in case (even though dicts are ordered for Python >= 3.7), I think we should advise the user to always enable the DeepSpeed plugin before preparing the model: `accelerator_0.state.enable_deepspeed_plugin("model1")`.
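A minimal sketch of that recommendation (the plugin name and model variable are placeholders taken from the comment above):

```python
# Enable the intended plugin explicitly before each prepare() call, rather than
# relying on dict ordering to pick the default.
accelerator_0.state.enable_deepspeed_plugin("model1")
model_1 = accelerator_0.prepare(model_1)
```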

Comment on lines 162 to 163:

```python
accelerator_0 = Accelerator(deepspeed_plugin=deepspeed_plugins)
accelerator_1 = Accelerator()
```

@SunMarc (Member): What happens if you pass a deepspeed_plugin again in the second accelerator?

@muellerzr (Collaborator, Author): Well, now we'll raise an error :)
Comment on lines 143 to 144:

```python
# Run training loop
accelerator.state.enable_deepspeed_plugin("training")
```

@SunMarc (Member): If we forget to enable the training plugin during training, what happens?

@muellerzr (Collaborator, Author): On the accelerate side, not much/anything at all, since it's mostly for deepspeed.init(). But on the transformers side I believe this would eventually run into some errors between ZeRO-2 and ZeRO-3, since with ZeRO-3 we need to gather params, and this sets a flag that tells transformers we're using ZeRO-3 right now.
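For context, a rough sketch (not code from this PR) of the transformers-side flag being referred to; `is_deepspeed_zero3_enabled` is part of transformers' DeepSpeed integration:

```python
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# from_pretrained() consults this flag to decide whether to partition parameters
# at load time; if the ZeRO-3 plugin was never enabled, the flag reflects whichever
# config is currently active (e.g. ZeRO-2) instead.
if is_deepspeed_zero3_enabled():
    print("transformers will partition/gather parameters for ZeRO-3")
else:
    print("transformers treats the active config as non-ZeRO-3")
```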

Comment on lines 201 to 202:

```python
zero3_accelerator = Accelerator()
```

@SunMarc (Member): Let's add a quick comment that this accelerator will share the same AcceleratorState as zero2_accelerator, hence we don't need to pass deepspeed_plugins. However, it doesn't share mixed_precision, no?

@muellerzr (Collaborator, Author): For right now we assume the same mixed precision, as noted earlier; will look towards configuring multiple precision types later.

Comment on the following hunk:

```python
    deepspeed_plugin.enable()
else:
    for plugin in deepspeed_plugin.values():
        plugin.set_mixed_precision(mixed_precision)
```

@SunMarc (Member): Can two DeepSpeed plugins have different mixed precision? Not sure about the use case either, just asking.

@muellerzr (Collaborator, Author): Good question. I suppose technically they could, e.g. one has the model in bf16 and another in float32. A direct use case would be the ZeRO-3 model being in fp8 vs the trained model in bf16 (TE fp8 specifically).

@muellerzr (Collaborator, Author): Though at first glance... quite a challenge. I'll leave this to a future PR/enhancement.

@stas00 (Contributor): That would require 2 accelerate configs.

@muellerzr (Collaborator, Author): Yep, which right now we don't/can't do.

```python
from accelerate import Accelerator

accelerator = Accelerator(deepspeed_plugin=deepspeed_plugins)
```

@stas00 (Contributor): Can this be fixed to `deepspeed_plugins` as the key? Otherwise this is a poor API. Perhaps for BC it could be either?

Reply: This is a DS bug due to sanity checking never getting updated for inference. Please share a repro and issue. Thanks!

@stas00 (Contributor) commented Sep 12, 2024:

@muellerzr, where is `zero3_init_flag: true` (accelerate config) treated in this PR? Surely it's possible that not all models will want the same treatment.

@muellerzr (Collaborator, Author) commented Sep 12, 2024:

@stas00 it's done in that tweak to when we create the HF config file (notice how it's done always, rather than just under ZeRO-3). We then update the DeepSpeed config reference transformers looks at. See here: https://github.com/huggingface/accelerate/pull/3097/files#diff-bcbdd609996df7224cc9c21ecd97092122d45263d1323892474021d510ec1eefR1199-R1224

And then the first thing `enable()` does is set the reference on that plugin.
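Roughly, the "reference transformers looks at" is the global DeepSpeed config object exposed through transformers' integration; a conceptual sketch under that assumption (simplified, not the PR's actual code):

```python
from transformers.integrations.deepspeed import HfDeepSpeedConfig, is_deepspeed_zero3_enabled

# Constructing HfDeepSpeedConfig stores a weak global reference that transformers
# checks, e.g. when deciding whether to partition parameters in from_pretrained().
ds_config = {"zero_optimization": {"stage": 3}, "train_micro_batch_size_per_gpu": 1}
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

print(is_deepspeed_zero3_enabled())  # True while the ZeRO-3 reference is active
```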

"A wrapped DeepSpeed engine reference is currently tied for this `Accelerator()` instance.",
captured.output[0],
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just flagging that this test is missing the crucial part of actually running training.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, that's for the later test/the script

@stevhliu (Member) left a comment:

Very cool use case examples! Left some suggestions to reduce wordiness and be more direct, and made it clearer that the sections correspond to these use cases :)

@muellerzr linked an issue on Sep 12, 2024 that may be closed by this pull request.
@lewtun (Member) left a comment:

Really cool feature @muellerzr - this is going to unlock a ton of cool use cases for post-training LLMs! I only commented on the docs from the perspective of a user - overall looks great 🔥

```python
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
```

@lewtun (Member): It would be good to document somewhere if/how `accelerate launch` works when two configs are needed. As a user, I rely heavily on `accelerate launch --config_file=/path/to/deepspeed_zero{1,2,3}.yaml`, and I wonder if we can enable something like:

```bash
# comma separated?
accelerate launch --config_file=deepspeed_zero2.yaml,deepspeed_zero3.yaml
```

And then the training code would assume the first plugin comes from the first config, etc.

@muellerzr (Collaborator, Author) commented Sep 13, 2024: We can't do that currently; it is solely based on creating DeepSpeedPlugins manually (at least for the second one).

@muellerzr (Collaborator, Author): Most probably, setting up something like the following:

```yaml
deepspeed_configs:
    config_1:
        deepspeed_config_file: ...
        name: ...
    config_2:
        deepspeed_config_file: ...
        name: ...
```

@lewtun (Member): Indeed, having a "master" config file would be nice to have. That way I can have my Z2/Z3 configs fixed and just toggle their use based on the task at hand.

@muellerzr merged commit e9e5a73 into main on Sep 13, 2024.
28 checks passed.
@muellerzr deleted the muellerzr-multiple-model-deepspeed branch on September 13, 2024 at 11:28.

loadams added a commit to microsoft/DeepSpeed that referenced this pull request on Sep 25, 2024:

Adding the new tests in huggingface/accelerate#3097 caused the nv-accelerate-v100 tests to fail. Due to other CI issues we didn't notice this at first. This just skips the problematic test for now.

cc: @stas00 / @muellerzr