IBM Granite MoE Architecture #9438

Open · gabe-l-hart wants to merge 4 commits into master from GraniteMoE
Conversation

@gabe-l-hart (Contributor) commented on Sep 11, 2024

Dependencies

Description

This PR introduces the granitemoe model architecture from IBM. It mirrors the corresponding changes made to transformers in the linked transformers PR.

The granitemoe architecture follows a very similar pattern to the granite architecture and its changes relative to llama. For the MoE variant, the base architecture is mixtral (MoE branch of llama here in llama.cpp). The same four additional multipliers are added (embeddings_multiplier, attention_multiplier, residual_multiplier, and logits_scale).
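
For readers unfamiliar with the Granite scaling scheme, here is a small, self-contained numpy sketch (not llama.cpp code; the shapes, values, and single stand-in FFN are illustrative placeholders, and the real model uses mixtral-style routed experts) of where the four multipliers enter a decoder block:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16

# hypothetical values for the four Granite/GraniteMoE multipliers
embeddings_multiplier = 12.0
attention_multiplier  = 0.015625
residual_multiplier   = 0.22
logits_scale          = 8.0

tok_emb = rng.normal(size=(vocab, d_model))
w_ffn   = rng.normal(size=(d_model, d_model))
tokens  = np.array([1, 3, 5])

# embeddings_multiplier scales the token embeddings
h = tok_emb[tokens] * embeddings_multiplier

# attention_multiplier replaces the usual 1/sqrt(head_dim) score scaling
scores = (h @ h.T) * attention_multiplier
probs  = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# residual_multiplier scales both the attention and FFN branches
h = h + (probs @ h) * residual_multiplier
h = h + np.tanh(h @ w_ffn) * residual_multiplier   # stand-in for the MoE FFN

# logits_scale divides the final logits before softmax
logits = (h @ tok_emb.T) / logits_scale
```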

Testing

This PR can be tested using ibm/PowerMoE-3b from Hugging Face, following the same testing steps used for granite (here).
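
A rough sketch of that flow, driven from Python for convenience: the model directory, output file name, prompt, and binary location below are assumptions, while convert_hf_to_gguf.py and llama-cli are the standard llama.cpp entry points.

```python
import subprocess

model_dir = "PowerMoE-3b"           # assumed local clone of ibm/PowerMoE-3b
gguf_path = "powermoe-3b-f16.gguf"  # hypothetical output file name

# 1. Convert the HF checkpoint to GGUF with the conversion script touched by this PR
subprocess.run(
    ["python", "convert_hf_to_gguf.py", model_dir,
     "--outfile", gguf_path, "--outtype", "f16"],
    check=True,
)

# 2. Smoke-test generation with the converted model
#    (the llama-cli path depends on how llama.cpp was built)
subprocess.run(
    ["./llama-cli", "-m", gguf_path,
     "-p", "Tell me a story about a developer", "-n", "64"],
    check=True,
)
```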

gabe-l-hart force-pushed the GraniteMoE branch 2 times, most recently from 5f37be3 to 3219f58 on September 11, 2024
github-actions bot added the python (python script changes) label on Sep 11, 2024
gabe-l-hart force-pushed the GraniteMoE branch 2 times, most recently from cd41666 to 1b235d0 on September 16, 2024
This includes the addition of new tensor names for the new moe layers. These may not be correct at this point, given the hack needed in gguf_writer.py to double-check the length of the shape for these layers.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>
GraniteMoe has the same configuration deltas as Granite

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>
… and up

After a lot of staring and squinting, it's clear that the standard mixtral
expert implementation is equivalent to the vectorized parallel experts in
granite. The difference is that in granite, the w1 and w3 are concatenated
into a single tensor "input_linear." Rather than reimplementing all of the
math on the llama.cpp side, the much simpler route is to just split this
tensor during conversion and follow the standard mixtral route.

Branch: GraniteMoE

Co-Authored-By: [email protected]

Signed-off-by: Gabe Goodhart <[email protected]>
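
To illustrate the split described in the commit above, here is a minimal sketch; the shapes, the stacking order of the two projections, and the variable names are assumptions for illustration, not the actual conversion code.

```python
import torch

# The HF checkpoint's per-layer "input_linear" expert tensor stacks the gate (w1)
# and up (w3) projections along one dimension, so conversion can slice it in half
# and emit the two standard mixtral-style expert tensors instead.
num_experts, ffn_dim, hidden_dim = 4, 32, 16
input_linear = torch.randn(num_experts, 2 * ffn_dim, hidden_dim)

# first half of the stacked dimension -> gate (w1), second half -> up (w3)
gate_exps = input_linear[:, :ffn_dim, :]
up_exps   = input_linear[:, ffn_dim:, :]

assert gate_exps.shape == up_exps.shape == (num_experts, ffn_dim, hidden_dim)
```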
GraniteMoE follows the mixtral architecture (once the input_linear layers
are split into gate_exps/up_exps). The main delta is the addition of the
same four multipliers used in Granite.

Branch: GraniteMoE

Signed-off-by: Gabe Goodhart <[email protected]>