Merging models #1688
Could you share your full code? Without knowing exactly what is run, it is difficult to say what is happening here. It might indeed be related to the minimum similarity, since a value of .98 is quite high and I wonder whether it actually has any effect. |
Thanks @MaartenGr for your quick reply! I essentially have multiple topic models that I am trying to merge; both use your Llama methodology for representation. I was running into the issue of the merged model being the same as the first model even though the second one has many different topics, so I was incrementally increasing the min_similarity value. If I run a topic model on the whole combined text instead, I do get topics from both models.
from torch import cuda
from torch import bfloat16
import transformers
from huggingface_hub import login
import subprocess as sp
import os
import torch
import re
from random import sample
import pandas as pd
os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'; print(device)
cuda.empty_cache()
login(token = myToken)
model_id = 'meta-llama/Llama-2-13b-chat-hf'
cuda.empty_cache()
# Quantization to load an LLM with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,  # 4-bit quantization
    bnb_4bit_quant_type='nf4',  # Normalized float 4
    bnb_4bit_use_double_quant=True,  # Second quantization after the first
    bnb_4bit_compute_dtype=bfloat16  # Computation type
)
# Llama 2 Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id,token=myToken)
cuda.empty_cache()
# Llama 2 Model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()
cuda.empty_cache()
# Our text generator
generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)
cuda.empty_cache()
prompt = system_prompt + example_prompt + main_prompt
## This prompt is the same as yours, copying here was causing some issues
cuda.empty_cache()
import pandas as pd
df = pd.read_csv('myText.csv')
docs = [i.lower() for i in df.text]
df2 = pd.read_csv('myText2.csv')
docs2 = [i.lower() for i in df2.text]
from sentence_transformers import SentenceTransformer
# Pre-calculate embeddings
embedding_model = SentenceTransformer("BAAI/bge-small-en")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
cuda.empty_cache()
from umap import UMAP
from hdbscan import HDBSCAN
umap_model = UMAP(n_neighbors=100, n_components=5, min_dist=0.2, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean', cluster_selection_method='eom',
prediction_data=True,min_samples=10)
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration
# KeyBERT
keybert = KeyBERTInspired()
# MMR
mmr = MaximalMarginalRelevance(diversity=0.3)
# Text generation with Llama 2
llama2 = TextGeneration(generator, prompt=prompt)
cuda.empty_cache()
# All representation models
representation_model = {
    "KeyBERT": keybert,
    "Llama2": llama2,
    "MMR": mmr
}
from bertopic import BERTopic
topic_model = BERTopic(
    # Sub-models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    # Hyperparameters
    top_n_words=10,
    verbose=True,
    calculate_probabilities=False
)
cuda.empty_cache()
topic_model.fit(docs, embeddings)
cuda.empty_cache()
embeddings2 = embedding_model.encode(docs2, show_progress_bar=True)
cuda.empty_cache()
topic_model2 = BERTopic(
    # Sub-models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    representation_model=representation_model,
    # Hyperparameters
    top_n_words=10,
    verbose=True,
    calculate_probabilities=False
)
topic_model2.fit(docs2, embeddings2)
cuda.empty_cache()
l1 = len(topic_model.get_topic_info())
l2 = len(topic_model.get_topic_info())  # starts equal to l1 so the loop runs at least once
minS = 0.7
# Raise min_similarity until the merged model has a different number of topics
while l2 == l1:
    merged_model = BERTopic.merge_models([topic_model, topic_model2], min_similarity=minS)
    l2 = len(merged_model.get_topic_info())
    print('minS: {minS} --> [{l1},{l2}]'.format(minS=minS, l1=l1, l2=l2))
    minS = minS + 0.01
merged_model.save('/mnt/ebs1/data/Share/GlobalFilingNLP/topicModels/mergedRisk2')
cuda.empty_cache() |
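The loop above has no upper bound, so it keeps running with values past 1.0 if the topic count never changes. A bounded version of the same sweep might look like the sketch below. `sweep_min_similarity`, `merge_fn`, and `fake_merge` are hypothetical names used only for illustration; in practice `merge_fn` would wrap `BERTopic.merge_models` and return `len(merged_model.get_topic_info())`.

```python
def sweep_min_similarity(merge_fn, base_count, start=0.7, stop=1.0, step=0.01):
    """Return (threshold, topic_count) for the first threshold whose merged
    topic count differs from base_count, or None if no value up to stop does."""
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        # Recompute from the start value to avoid accumulating float error
        value = round(start + i * step, 10)
        count = merge_fn(value)
        if count != base_count:
            return value, count
    return None


def fake_merge(min_similarity):
    """Stand-in for a real merge: pretends topics are only added from 0.82 upward."""
    return 64 if min_similarity < 0.82 else 70


result = sweep_min_similarity(fake_merge, base_count=64)
print(result)
```

Returning `None` instead of looping forever makes it explicit when no threshold in the range changes the topic count.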
Based on your code, my guess would indeed be the relatively high |
I am facing similar issues with my models. I have a model with 64 topics (based on 4,000 text records) and another model with 115 topics (based on roughly 8,000 records). When I try to merge them, merge_models either does not add any topics or, when I increase the "min_similarity" value above a certain point, fails with various errors such as KeyError: '40', KeyError: '41', etc. If this issue could be looked into, it would be of great help. [UPDATED BELOW with actual values from latest run] |
@Anirudh-Munnangi Thanks for sharing this. Could you also share your full code? Without it, it is hard to see what exactly is happening here. Also, could you share your full error log? |
Thank you @MaartenGr for your quick response. Here is the following:
The code above works for "min_similarity" up to 0.81. No merging of topics happens until then.
ERRORS
0.82<=min_similarity<=0.87
0.88<=min_similarity<=0.90
0.91<=min_similarity<=0.93
0.94<=min_similarity<=0.95
min_similarity = 0.96
0.97<=min_similarity<=0.994
min_similarity = 0.995
min_similarity = 0.996
min_similarity > 0.996
I ran the code at different values of "min_similarity" and found these errors. Thanks, |
@Anirudh-Munnangi Thanks for the code. Can you also share a full error message? |
@MaartenGr Here is the full error log.
|
I'm having the same issue. I noticed it only happens when using a representation model. If I don't use a representation model, I don't get the error. Looking at the source code, I believe the problem is here: (line 3149 _bertopic.py)
Difficult to debug from my end, but I wonder if topic_aspects_ is being used properly. |
Also, something I would like clarity on: are we updating our Representative_Docs, or at least retaining the information from the base model, after merging models? What I am seeing is that this field gets converted to null. The same happens to the representation model results. This information shouldn't be lost, or we should be able to choose the base version. I see this explanation in the docs. I don't agree with this assumption. I think it overlooks some key functionality and desired control in the merge process. Why can't this behavior be optional? I think there is a lot of value to glean from the merge model method, but it needs some tweaks (tracking and retaining original information across merges, and possibly updating representations after the merge is complete). |
I had some time to dive deeper into debugging. The issue is here:
the dictionary for
Looks like I was able to get what I wanted by changing the dictionary to this format |
I confirm the issue is here.
and my which has 2 keys ("KeyBERT" and "MMR"), each one with |
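To make the reported mismatch concrete, here is a minimal sketch in plain Python. It assumes, based only on the descriptions in this thread, that `topic_aspects` is keyed by aspect name ("KeyBERT", "MMR") with topic IDs one level down, while the failing line looks up a topic ID at the top level. The data and the `copy_aspects` helper are hypothetical, and this is not the code from the PR.

```python
# Hypothetical data shaped like the description above:
# aspect name -> topic id (as a string) -> list of [word, score] pairs
topic_aspects = {
    "KeyBERT": {"0": [["a", 0.9]], "1": [["c", 0.8]]},
    "MMR": {"0": [["b", 0.7]], "1": [["d", 0.6]]},
}

# Indexing the outer dict by topic id fails, because its keys are aspect names
try:
    topic_aspects["1"]
    lookup_failed = False
except KeyError:
    lookup_failed = True


def copy_aspects(merged, selected, new_topic, new_topic_val):
    """Copy one topic's entry for every aspect, re-keyed to its new topic id."""
    for aspect, per_topic in selected.items():
        merged.setdefault(aspect, {})[str(new_topic_val)] = per_topic[str(new_topic)]
    return merged


merged = copy_aspects({}, topic_aspects, new_topic=1, new_topic_val=5)
print(lookup_failed, merged)
```

Iterating over the aspects first, then over topic IDs, keeps the nested layout intact in the merged dictionary.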
@aleianno90 @corsilt @ayushjainr @Anirudh-Munnangi I just created a PR that should fix this issue. You can install it as follows: pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1762/head Could you confirm this fix works? |
Still fails:
File ~/test/lib/python3.10/site-packages/bertopic/_bertopic.py:3150, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
Separately, after installing the fix I am not able to save the topic model:
File ~/test/lib/python3.10/site-packages/bertopic/_bertopic.py:2987, in BERTopic.save(self, path, serialization, save_embedding_model, save_ctfidf)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:555, in dump(value, filename, compress, protocol, cache_size)
File /usr/lib/python3.10/pickle.py:487, in _Pickler.dump(self, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:972, in _Pickler.save_dict(self, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:972, in _Pickler.save_dict(self, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File /usr/lib/python3.10/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:972, in _Pickler.save_dict(self, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:713, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:972, in _Pickler.save_dict(self, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:713, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:717, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:972, in _Pickler.save_dict(self, obj)
File /usr/lib/python3.10/pickle.py:998, in _Pickler._batch_setitems(self, items)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:603, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:692, in _Pickler.save_reduce(self, func, args, state, listitems, dictitems, state_setter, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:887, in _Pickler.save_tuple(self, obj)
File ~/test/lib/python3.10/site-packages/joblib/numpy_pickle.py:355, in NumpyPickler.save(self, obj)
File /usr/lib/python3.10/pickle.py:560, in _Pickler.save(self, obj, save_persistent_id)
File /usr/lib/python3.10/pickle.py:1071, in _Pickler.save_global(self, obj, name)
PicklingError: Can't pickle <function add_hook_to_module.<locals>.new_forward at 0x7f41a8421d80>: it's not found as accelerate.hooks.add_hook_to_module.<locals>.new_forward |
Based on your error message, it seems that you did not install the PR correctly. Could you check whether the PR was installed correctly? The lines of code do not match the PR in #1762
Quite sure that issue is not related to this since you are not using #1762, so opening up a new issue with either v0.16 or the commits from the main branch would be preferred. |
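For readers who hit the same PicklingError: the message points at a function defined inside another function (`add_hook_to_module.<locals>.new_forward`), which Python's pickle cannot serialize by reference. The standalone sketch below reproduces that class of failure with the standard library only. `add_hook`, `Wrapper`, and `can_pickle` are hypothetical names, and this is not a diagnosis of the exact object held by the model above.

```python
import pickle


def add_hook(obj):
    """Return a function defined inside another function (a 'local' function)."""
    def new_forward(*args, **kwargs):
        return obj
    return new_forward


class Wrapper:
    """Object that holds a locally defined function as an attribute."""
    def __init__(self):
        self.name = "topic-model"
        self.forward = add_hook(self)


def can_pickle(obj):
    """Report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


wrapped = Wrapper()
plain = {"name": wrapped.name}  # same data without the local function
print(can_pickle(wrapped), can_pickle(plain))
```

The `serialization` parameter visible in the `BERTopic.save` frame of the traceback suggests the save format is configurable; BERTopic's documentation describes non-pickle options such as `safetensors`, which may be worth trying when the model holds objects like this.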
Hi @MaartenGr, when I install with "pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/1762/head", "from bertopic import BERTopic" fails. Not sure if others are facing a similar problem. |
@Anirudh-Munnangi It is working for me without any issues. Can you create a completely new environment and try again? Also, you mention that it fails, but what exactly do you mean? Does it give any errors? Try to be as complete as possible. |
It says it couldn't find the branch, so it just reinstalls the master version: WARNING: Did not find branch or tag 'refs/pull/1762/head', assuming revision or ref. |
I believe it installed fine for you but you can also run the following: pip install git+https://github.com/MaartenGr/BERTopic.git@fix_merging Let me know if it works! |
Also experiencing this issue When using: I get this error:
Even after updating to the fix_merging branch |
I'm sorry, this commit did seem to fix the issue! Thanks. |
Error merging topic models -
mergedModels = BERTopic.merge_models([model1,model2], min_similarity=0.9)
KeyError Traceback (most recent call last)
Cell In[20], line 1
----> 1 mergedModels = BERTopic.merge_models([m1[2],m1[0]], min_similarity=0.98)
File ~/test/lib/python3.10/site-packages/bertopic/_bertopic.py:3150, in BERTopic.merge_models(cls, models, min_similarity, embedding_model)
3147 merged_topics["topic_labels"][str(new_topic_val)] = selected_topics["topic_labels"][str(new_topic)]
3149 if selected_topics["topic_aspects"]:
-> 3150 merged_topics["topic_aspects"][str(new_topic_val)] = selected_topics["topic_aspects"][str(new_topic)]
3152 # Add new embeddings
3153 new_tensors = tensors[new_topic - selected_topics["_outliers"]]
KeyError: '12'
One thing to note is that there's no error when I reduce the min_similarity value, but I see no topics getting added.
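The observation that lower values add no topics is consistent with a threshold rule in which a topic from the second model counts as new only when its best match in the base model falls below min_similarity. The NumPy sketch below illustrates that rule under this assumption. `find_new_topics` and the toy embeddings are hypothetical and do not reproduce BERTopic's internal code.

```python
import numpy as np


def find_new_topics(base_emb, other_emb, min_similarity):
    """Indices of rows in other_emb whose highest cosine similarity to any
    row of base_emb is below min_similarity (these would be added as new)."""
    base = base_emb / np.linalg.norm(base_emb, axis=1, keepdims=True)
    other = other_emb / np.linalg.norm(other_emb, axis=1, keepdims=True)
    sims = other @ base.T          # pairwise cosine similarities
    best = sims.max(axis=1)        # best match in the base model per topic
    return [int(i) for i in np.where(best < min_similarity)[0]]


# Toy topic embeddings: two base topics, three candidate topics
base = np.array([[1.0, 0.0], [0.0, 1.0]])
other = np.array([[0.9, 0.1], [0.6, 0.8], [-1.0, 0.0]])

for threshold in (0.7, 0.9, 0.999):
    print(threshold, find_new_topics(base, other, threshold))
```

Under this rule, raising the threshold can only add topics, which would explain why merges at low values return the base model unchanged.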