[Guided Topic Modeling] ValueError: setting an array element with a sequence. #2036
Comments
Hmmm, although I think I understand the issue, it is not clear to me why it suddenly appears when this has been working fine for a while now (aside from the underlying issues with [...]). Either way, I have seen the solution of tiling the embeddings before but was hesitant to implement it, since that would increase the size of the seeded topic embeddings quite a bit. If I'm not mistaken, your embeddings would now be twice as big.
Thanks for your prompt reply. I am using numpy version 1.26.4. Your concern totally makes sense to me. It seems that this issue with differing array shapes only occurs in [...]
# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings[indices] = embeddings[indices] * 0.75 + seed_topic_embeddings[seed_topic] * 0.25
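For anyone who wants to try the loop above in isolation, here is a minimal self-contained sketch. The toy arrays, the labels in y, and the use of -1 for documents without a seeded topic are assumptions made up for illustration; they are not BERTopic's real data.

```python
import numpy as np

# Toy stand-ins (assumed for illustration only):
# 5 documents with 4-dimensional embeddings and 2 seeded topics.
embeddings = np.arange(20, dtype=float).reshape(5, 4)
seed_topic_embeddings = np.array([[1.0, 1.0, 1.0, 1.0],
                                  [2.0, 2.0, 2.0, 2.0]])
seed_topic_list = [["a"], ["b"]]  # only its length is used here
y = [0, -1, 1, 0, -1]             # -1 marks documents without a seeded topic

original = embeddings.copy()      # kept to compare against afterwards

for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    # embeddings[indices] has shape (len(indices), 4) and the seed embedding
    # has shape (4,), so NumPy broadcasts the seed across every selected row.
    embeddings[indices] = embeddings[indices] * 0.75 + seed_topic_embeddings[seed_topic] * 0.25

print(embeddings)
```

Rows 0, 2 and 3 are pulled toward their seed embedding, while rows 1 and 4 are left untouched.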
@RTChou Would this also be possible even though the shapes of the embedding matrices differ?
@MaartenGr Yes, and it is essentially doing broadcasting under the hood in C, so I believe it will be more efficient compared to the previous solution that uses [...]. Here is a toy example showing the calculation of the weighted average of arrays/matrices with different shapes:
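import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # 2D, shape (3, 3)
array2 = np.array([2, 4, 6])                          # 1D, shape (3,)

# Broadcasting: array2 is combined with every row of array1
avg = array1 * 0.75 + array2 * 0.25
avg

# Explicitly tiling array2 to shape (3, 3) gives the same result
array2_broadcasted = np.tile(array2, (array1.shape[0], 1))
avg_broadcasted = array1 * 0.75 + array2_broadcasted * 0.25
avg_broadcasted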
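To make the size concern raised earlier in the thread concrete, the following sketch checks that the two variants agree and compares how much memory the tiled copy takes. The variable names same_result and array2_tiled are only illustrative, and the exact byte counts depend on the platform's default integer type.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array2 = np.array([2, 4, 6])

# Variant 1: rely on broadcasting, no explicit copy of array2 is built here.
avg = array1 * 0.75 + array2 * 0.25

# Variant 2: materialize one copy of array2 per row of array1.
array2_tiled = np.tile(array2, (array1.shape[0], 1))
avg_tiled = array1 * 0.75 + array2_tiled * 0.25

# Both variants should give the same weighted average.
same_result = np.allclose(avg, avg_tiled)

# The tiled array stores array2 once per row, so it is array1.shape[0]
# times larger than the original 1D array.
print(same_result, array2.nbytes, array2_tiled.nbytes)
```

With real embeddings the tiled copy grows with the number of documents assigned to the seed topic, which is the overhead the broadcasting version avoids.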
@MaartenGr Sounds good. I will make a PR after the merge then. Thanks for letting me know!
Wanted to ask whether @RTChou has already created the PR?
@dentro-innovation Thanks for the reminder. I just created a PR, which is currently waiting for approval.
Hi, I am trying to run the example code given in https://maartengr.github.io/BERTopic/getting_started/guided/guided.html#example and got an error.
Example code:
Error:
The issue happened when calculating the (weighted) averages between a set of documents (embeddings[indices]) and their seed topic embeddings (seed_topic_embeddings[seed_topic]), where np.average cannot calculate the averages between a 2D array and a 1D array.
This issue can be solved by broadcasting the 1D array to match the shape of the 2D array, and calculating the averages along axis 0.
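As a rough illustration of the explanation above, here is a small sketch with made-up arrays. The names docs and seed stand in for embeddings[indices] and seed_topic_embeddings[seed_topic], and the exact form of the failing np.average call (including weights=[3, 1]) is an assumption rather than a quote of the library code.

```python
import numpy as np

docs = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # stand-in for embeddings[indices], shape (2, 3)
seed = np.array([2.0, 4.0, 6.0])                     # stand-in for seed_topic_embeddings[seed_topic], shape (3,)

# Pairing a (2, 3) array with a (3,) array is ragged. Recent NumPy versions
# are expected to refuse to build a regular array from it and raise a
# ValueError; older versions may behave differently.
try:
    np.average([docs, seed], weights=[3, 1])
except ValueError as error:
    print("np.average failed:", error)

# Fix described above: repeat the 1D array so both inputs have shape (2, 3),
# then take the weighted average along axis 0 of the stacked pair.
seed_tiled = np.tile(seed, (docs.shape[0], 1))
fixed = np.average([docs, seed_tiled], axis=0, weights=[3, 1])

print(fixed)
```

The weights 3 and 1 correspond to 0.75 and 0.25 after normalization, so the result matches docs * 0.75 + seed * 0.25.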
Original code (https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py#L3766):
Modified code: