-
Notifications
You must be signed in to change notification settings - Fork 787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hierachical topics calculated with topic embeddings #1894
Hierachical topics calculated with topic embeddings #1894
Conversation
…en clusters for hierrachical topics. The topic embeddings should provide better representation that the term frequencies calculated based on c-TF-IDF.
Thank you for this PR! I might be mistaken but I think you might have missed the CONTRIBUTING file since I'm not seeing an attached issue. Having said that, we can start this feature request here for now. I definitely agree that having the option to switch between c-TF-IDF and topic embeddings would be great to have, not only in this specific example but throughout BERTopic. Keeping that in mind, I actually think it would be best that whenever such a parameter ( So I think this PR might be a bit bigger than the single function. I believe that if this parameter is in this single function, users will expect it to be used in others as well. |
Hi @MaartenGr, my apologies for not reading the contributing file. I am happy to make changes in the other methods as well. The list of the methods above is complete or they are only examples? |
Thanks, that would be great! A few things to keep in mind:
Hopefully, it is not too much. But if it is, please let me know! I'll see if I can help out. |
Thanks for the guidelines! I will look into it within the next two weeks. |
6ba19c7
to
51e0b3f
Compare
Included a unit test for selecting the representation.
51e0b3f
to
dc8d33b
Compare
@MaartenGr, please have a look. I included the embedding selection to all mentioned methods apart from I left the defaults for attribute While the function
|
@azikoss Thanks for the work on this! Hopefully, I will have some time this weekend or beginning of next week to look at this a bit more in-depth.
I agree. Unification would definitely be best here. It will likely change the output so it might be nice to additionally test this. However, since there is the option to easily switch between type of topic representation embedding I'm not too worried about this.
I just started the tests, so let's see what happens with end-to-end tests using the defaults. In an ideal world, I would like to see more tests that also cover this end-to-end, but they are already quite large and slow.
I would have to check this but I generally want to prevent casting sparse matrices into numpy arrays as that can increase the memory needed to hold that matrix significantly. |
@MaartenGr thanks for your comments. I just ran the test on Python 3.9. They all run successfully. If I understand correctly, the support is also for 3.8? If so, I will change the types. Have a look at the conversion to np.array pls and let me know. It is about 50/50 where the c_tfidf matrix is converted to np.array and where it is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this! I left some comments that we can discuss as they relate to, for instance casting to numpy. Also, the tests fail on python 3.8 most likely as a result of these kinds of type hints not yet being supported.
- removed unnecessary type checks - made it python 3.8 compatible
…eddings can be also a sparse matrix
@MaartenGr pls let me know if there is anything else to address. |
@azikoss Thanks for all the work so far! I was on holiday for the last two weeks and just getting back to a bunch of issues and PRs. Yours is high priority and I hope to get to it this weekend. Thanks for being patient with me. |
I just did a first pass through the code and wanted to see whether all checks pass correctly but there seems to be a problem with the pipeline. To get this pipeline working for you, I would advise implementing the same fix as #2008. That way, we can run the pipeline and see if everything works as intended. |
@MaartenGr I am sorry for the delay. I just merged the latest changes. Is there anything else I should do with respect to #2008 ? Locally the tests are running successfully. |
@azikoss I just merged the PR, congrats and thank you for all the hard work! Especially considering this PR started out small and grew into something bigger, I'm very grateful. Also, really cool feature to have as it will be helpful for instances where c-TF-IDF is not available (like after using |
Great! My pleasure contributing to this great toolkit! And thanks for your guidelines:) |
Added an option to use topic embeddings to calculate distances between clusters for hierarchical topics. The topic embeddings should provide a better representation of the term frequencies calculated based on c-TF-IDF.