IBM Granite MoE Architecture #9438

Open · gabe-l-hart wants to merge 4 commits into master from GraniteMoE
Conversation

@gabe-l-hart (Contributor) commented on Sep 11, 2024

Dependencies

Description

This PR introduces the granitemoe model architecture from IBM. It mirrors the corresponding changes made to transformers in the linked transformers PR.

The granitemoe architecture follows a very similar pattern to the granite architecture and its changes relative to llama. For the MoE variant, the base architecture is mixtral (MoE branch of llama here in llama.cpp). The same four additional multipliers are added (embeddings_multiplier, attention_multiplier, residual_multiplier, and logits_scale).
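
For readers unfamiliar with the Granite scaling scheme, here is a small, self-contained numpy sketch (not llama.cpp code; the shapes, values, and single stand-in FFN are illustrative placeholders, and the real model uses mixtral-style routed experts) of where the four multipliers enter a decoder block:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16

# hypothetical values for the four Granite/GraniteMoE multipliers
embeddings_multiplier = 12.0
attention_multiplier  = 0.015625
residual_multiplier   = 0.22
logits_scale          = 8.0

tok_emb = rng.normal(size=(vocab, d_model))
w_ffn   = rng.normal(size=(d_model, d_model))
tokens  = np.array([1, 3, 5])

# embeddings_multiplier scales the token embeddings
h = tok_emb[tokens] * embeddings_multiplier

# attention_multiplier replaces the usual 1/sqrt(head_dim) score scaling
scores = (h @ h.T) * attention_multiplier
probs  = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# residual_multiplier scales both the attention and FFN branches
h = h + (probs @ h) * residual_multiplier
h = h + np.tanh(h @ w_ffn) * residual_multiplier   # stand-in for the MoE FFN

# logits_scale divides the final logits before softmax
logits = (h @ tok_emb.T) / logits_scale
```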

Testing

This PR can be tested using ibm/PowerMoE-3b from Hugging Face, following the same testing steps used for granite (here).
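
A rough sketch of that flow, driven from Python for convenience: the model directory, output file name, prompt, and binary location below are assumptions, while convert_hf_to_gguf.py and llama-cli are the standard llama.cpp entry points.

```python
import subprocess

model_dir = "PowerMoE-3b"           # assumed local clone of ibm/PowerMoE-3b
gguf_path = "powermoe-3b-f16.gguf"  # hypothetical output file name

# 1. Convert the HF checkpoint to GGUF with the conversion script touched by this PR
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outfile", gguf_path, "--outtype", "f16"],
    check=True,
)

# 2. Smoke-test generation with the converted model
#    (the llama-cli path depends on how llama.cpp was built)
subprocess.run(
    ["./llama-cli", "-m", gguf_path,
     "-p", "Tell me a story about a developer", "-n", "64"],
    check=True,
)
```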

gabe-l-hart force-pushed the GraniteMoE branch 2 times, most recently from 5f37be3 to 3219f58 on September 11, 2024
github-actions bot added the python (python script changes) label on Sep 11, 2024
gabe-l-hart force-pushed the GraniteMoE branch 2 times, most recently from cd41666 to 1b235d0 on September 16, 2024
This includes the addition of new tensor names for the new moe layers. These may not be correct at this point, given the hack needed in gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>
GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>
… and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: [email protected]

Signed-off-by: Gabe Goodhart <[email protected]>
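
To illustrate the split described in the commit above, here is a minimal sketch; the shapes, the stacking order of the two projections, and the variable names are assumptions for illustration, not the actual conversion code.

```python
import torch

# The HF checkpoint's per-layer "input_linear" expert tensor stacks the gate (w1)
# and up (w3) projections along one dimension, so conversion can slice it in half
# and emit the two standard mixtral-style expert tensors instead.
num_experts, ffn_dim, hidden_dim = 4, 32, 16
input_linear = torch.randn(num_experts, 2 * ffn_dim, hidden_dim)

# first half of the stacked dimension -> gate (w1), second half -> up (w3)
gate_exps = input_linear[:, :ffn_dim, :]
up_exps   = input_linear[:, ffn_dim:, :]

assert gate_exps.shape == up_exps.shape == (num_experts, ffn_dim, hidden_dim)
```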
GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>