Add GPT-NeoX and SAE #791
Conversation
Hi @StellaAthena! Thanks for opening this PR, it would be really nice to have those two libraries officially supported on the Hub! I've left a couple comments to help with the integration. Let me know if you have any question :)
@@ -222,6 +222,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
	filter: false,
	countDownloads: `path:"checkpoints/byt5_model.pt"`,
	},
	"gpt-neox": {
I noticed that `gpt_neox` has 3919 models while `gpt-neox` has 33 models. For consistency with other libraries, naming it `gpt-neox` as you did makes more sense, but it will mean most GPT-NeoX models will not be listed. A solution is to open a PR on all 3.9k models to update their metadata in the model card (we can provide a script for that), but it might be a bit too much. Any idea @osanseviero?
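Such a bulk update could be sketched with `huggingface_hub`. This is only a hedged sketch, not the script mentioned above: the `filter="gpt_neox"` listing, the `limit`, and the `create_pr` workflow are assumptions about how the real run would look.

```python
from huggingface_hub import HfApi, metadata_update


def build_patch(library_name: str) -> dict:
    # The metadata patch applied to each model card's YAML front matter.
    return {"library_name": library_name}


if __name__ == "__main__":
    api = HfApi()
    # Placeholder query: the real run would iterate over all ~3.9k repos
    # carrying the auto-detected `gpt_neox` tag, not just the first 5.
    repos = [m.id for m in api.list_models(filter="gpt_neox", limit=5)]
    for repo_id in repos:
        # create_pr=True opens a pull request on each repo instead of
        # pushing directly, so model authors can review the change.
        metadata_update(repo_id, build_patch("gpt-neox"), repo_type="model", create_pr=True)
```

Opening PRs rather than committing directly seems safer at this scale, since most of the affected repos are not maintained by the people running the script.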
This PR is actually a bit problematic. As we have 4k models with the `gpt-neox` tag (automatically determined due to `model_type` in `config.json`), this will lead to some issues. I wonder if we could explore a different name, `gpt-neox-original` or something like that, for the tag (to be added to all repos that can run inference with the `gpt-neox` library). WDYT?
As a side note, the above two comments are contradictory. To be clear, it looks to me that `gpt-neox` is the small one and `gpt_neox` is the big one.

I'm not deeply attached to either style. If we can rename 33 models and use a slightly different tag, that's fine with me. I think the core question is whether y'all like style consistency more than you dislike making a very large number of small edits. That said, if most of these models are getting automatically tagged, there may need to be a CI job that runs on new models to correct the tag each time. Would it be possible to hack the GPT-NeoX model class definition so that using `gpt_neox` induces the metadata entry `gpt-neox`? Or is that too wonky of a patch? I assume changing the config name is non-viable at this point in time.
	repoName: "sae",
	repoUrl: "https://github.com/EleutherAI/sae",
	filter: true,
	countDownloads: `path_extension:"safetensors" OR path_extension:"bin"`,
Two questions:
- are there SAE repos where weights are saved as `.bin`? I haven't found any, but I might have missed them
- if a user loads all layers, counting `.safetensors` files will count X downloads (for instance 30 layers == 30 downloads in sae-llama-3-8b-32x). It might be difficult to interpret the download number in the end, but I don't have a proper solution for that at the moment.
Maybe let's use the `embed_tokens/cfg.json` file? https://huggingface.co/EleutherAI/sae-llama-3-8b-32x/tree/main/embed_tokens
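If `embed_tokens/cfg.json` really does appear exactly once in every SAE repo, the entry might look like this (a sketch only; `prettyLabel` is assumed here, and the field names follow the existing `MODEL_LIBRARIES_UI_ELEMENTS` entries shown in this diff):

```typescript
"sae": {
	prettyLabel: "SAE",
	repoName: "sae",
	repoUrl: "https://github.com/EleutherAI/sae",
	filter: true,
	// Matches exactly one file per repo, so 1 usage == 1 counted download.
	countDownloads: `path:"embed_tokens/cfg.json"`,
},
```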
Like the other example, this was me guessing. To make sure I understand best practices, we're looking for a file path pattern that:
- Appears in every model
- Only matches one file in each model

Is that correct? Perhaps the easiest solution is to just create a new metadata file which (for now, at least) primarily exists to support the integration.
To be more precise, we are looking for a file path pattern that:
- Appears in every model
- Only matches one file when loading the model

For example, on Meta-Llama-3.1-8B-Instruct-GGUF any download of a `*.gguf` file is counted because we know that users won't download all GGUF files when instantiating the model (i.e. 1 usage == 1 GGUF file download).
@@ -358,6 +365,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
	filter: false,
	countDownloads: `path:"tokenizer.model"`,
	},
	"sae": {
I haven't seen any models on the Hub tracked as `sae` yet. It would be good to start adding a few (by adding `library_name: sae` or `tags: [sae, ...]` in the model card metadata of existing SAE models).
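Concretely, the YAML front matter of an existing SAE model card would gain something like the following (assuming the `sae` library name proposed in this PR):

```yaml
---
library_name: sae
tags:
  - sae
---
```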
Yes, it would be great to add `library_name: sae` to https://huggingface.co/EleutherAI/sae-llama-3-8b-32x and others.
@Wauplin Yes, we have trained and uploaded a bunch of models using this library (and are actively working on more) but I figured we should add the connection to the back-end first and then add the tag to the models? Is that not the preferred order?
We usually recommend tagging at least a few models before merging the back-end PR. There is no hard requirement for that, but asking ensures that the library definition will actually be used. We don't want to add libraries and then realize model authors forgot to update their metadata. Once the PR is merged, we know we don't have to check anything else afterwards.
	repoUrl: "https://github.com/EleutherAI/sae",
	filter: true,
	countDownloads: `path_extension:"safetensors" OR path_extension:"bin"`,
	},
If you want, you can also provide a snippet to let users know how to instantiate the models. The snippets are usually generated using the `id`, making it easy for end users to copy-paste from the browser. For instance, here you could have something like:
```python
from sae import Sae

# Load specific layer
sae = Sae.load_from_hub("EleutherAI/sae-llama-3-8b-32x", layer=10)

# Load all layers
saes = Sae.load_many_from_hub("EleutherAI/sae-llama-3-8b-32x")
saes["layer_10"]
```
It is not necessary to explain/document everything, since the SAE GitHub repo is better suited for that, but having the "Use this model" button on your repos could help them get some traction.
This sounds like a great idea. Tagging @norabelrose for suggested example code.
🔥
@@ -222,6 +222,13 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
	filter: false,
	countDownloads: `path:"checkpoints/byt5_model.pt"`,
	},
	"gpt-neox": {
It might be easier to split the PR into the two different libraries.
Co-authored-by: Lucain <[email protected]>
Hi @StellaAthena, thanks for your comment and changes. I left a few comments, but I'm still unsure how to deal with
Adds two libraries developed by EleutherAI: GPT-NeoX and SAE.