Fix max_seq_length issue in BERT models. #124
base: master
Conversation
@MaartenGr
Thank you for this PR. Although splitting up into sentences is definitely helpful, there are a few things that make it a bit more complex than you propose in the PR. For example, joining and splitting text with empty spaces will not work for all languages. There are several non-western languages where tokens are not separated by spaces (such as Chinese). Moreover, the
Thanks for the reply & apologies for the late response.
In the coming weeks, I hope to have released a new version of BERTopic, after which I can look into this a bit deeper. It is important, though, that no assumptions are made with respect to the separator of tokens. Similarly, it would also open the need to have different tokenizers enabled based on the input model. I am not saying this potential feature will not be implemented, but it seems to me that the focus should be more on the CountVectorizer to do the splitting. Having said that, with differing n-gram levels and other parameters that do not take single-token splitting into account, this might prove to be difficult. Similarly, some users have suggested providing input embeddings as a parameter as a way to speed up experimentation. Using that feature, users could truncate or combine embeddings however they want, which may resolve this issue.
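To make the CountVectorizer idea concrete, here is a rough sketch of how a language-specific tokenizer could be plugged into the vectorizer that generates candidate keywords, so that no whitespace assumption is baked in. The `jieba` segmenter and the `vectorizer` parameter of `extract_keywords` are assumptions chosen for illustration rather than a guaranteed API.

```python
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
import jieba  # illustrative choice of Chinese word segmenter; any callable works


def zh_tokenizer(text):
    # Segment text for a language without whitespace-separated tokens.
    return jieba.lcut(text)


# Assumes extract_keywords accepts a custom CountVectorizer via `vectorizer`.
vectorizer = CountVectorizer(tokenizer=zh_tokenizer)
kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
keywords = kw_model.extract_keywords("这是一个关于关键词提取的中文文档。", vectorizer=vectorizer)
print(keywords)
```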
@MaartenGr I think optionally giving a param `spacy_max_tokens` and then using multilingual SpaCy to do sentence and word tokenization would be a nice solution. From ChatGPT: To count the number of words (tokens) in each sentence using spaCy, you can get the tokenization information directly from the processed `Doc`:

```python
import spacy

# Load the multilingual model
nlp = spacy.load("xx_ent_wiki_sm")
# The multilingual model ships without a parser, so add a rule-based
# sentencizer to get sentence boundaries.
nlp.add_pipe("sentencizer")

# Process the input text
text = "This is an English sentence. 这是一个中文句子。Ceci est une phrase française."
doc = nlp(text)

# Extract the sentences and their word counts
for sent in doc.sents:
    token_count = len(sent)
    print(f"Sentence: {sent}\nWord count: {token_count}\n")
```

This will output the number of words (tokens) in each sentence according to spaCy's tokenization for the various languages.
Please note that the tokenization behavior and, consequently, the word count may vary depending on the language and specific linguistic characteristics. In general, spaCy's models provide reasonable tokenization for most languages, but edge cases could exist.
Multi-lingual SpaCy tokenization is, and will often be, quite different from tokenization in one of the backends, as they all have different strategies for tokenization. The way tokenization is handled in transformer-based models often differs, especially with respect to sub-tokens, compared to a vocab-based technique such as SpaCy. As a result, the actual vocab and therefore the token count can differ significantly between techniques. Instead, I think it would be most appropriate to let the backends handle the token counting and adjust where necessary. One thing to note here, though, is that it can slow down computation quite a bit, and I think it should be regarded as an additional parameter. So something like
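A minimal sketch of what letting the backend handle the token count could look like for the sentence-transformers backend; the model name is only an example, and `model.tokenizer`/`model.max_seq_length` are the attributes assumed here.

```python
from sentence_transformers import SentenceTransformer

# Count sub-tokens with the backend's own tokenizer instead of spaCy,
# so the count matches what the embedding model actually sees.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "This is an English sentence.",
    "这是一个中文句子。",
    "Ceci est une phrase française.",
]
for sentence in sentences:
    n_subtokens = len(model.tokenizer.tokenize(sentence))
    truncated = n_subtokens > model.max_seq_length
    print(f"{sentence} -> {n_subtokens} sub-tokens, truncated: {truncated}")
```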
Is there any news on this topic? I'm new to this great library and assumed that it broke up a body of text that it's supposed to process into chunks if the entire body would exceed the max_seq_length of a particular embedding model. BTW, @beshr-eldebuch I believe that you're off a little as far as the length... here's a list regarding the models I use in another program I've created:

Regardless, however, if I want to get the keywords from a legal case, for example, oftentimes the case will exceed the sequence length (aka "context" in chat-model terminology), which is typically 512.

Basically, can you please explain how KeyBERT splits a large body of text, or if it does at all? And how hard would it be to implement the splitting? I briefly read that it does some math regarding words' relevance in relation to the body of text... so it seems like it might be difficult if that calculation now has to take into consideration multiple chunks (i.e. multiple bodies of text with words within them and their relation to that specific body...). In other words, should I simply assume that anything above 512 tokens will be truncated and only feed text up to that limit?

Also, any qualms with implementing the Alibaba models, which have a context limit of 8192? Should we start a new discussion on this?

Thanks for the great work! Luckily I found this repo.
@BBC-Esq There are no updates on this other than either creating the document embeddings yourself or splitting the documents beforehand and passing those. The former is tricky to implement since there are dozens of ways you can create (weighted) document embeddings to perform keyword extraction on. This is why the latter is often advised, since it allows you to separate topics. You otherwise run into the problem of having an embedding with dozens of different "topics", and keyword extraction might be inaccurate. That said, you could do the former manually with the

Also note that there are many ways you can split/chunk a given sequence (char/whitespace/model's tokenizer/etc.) and that some are quite a bit slower than others. Don't get me wrong, having such a feature would be nice, it's just a bit more complex than simply splitting by whitespace and counting tokens.
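A minimal sketch of the "split beforehand" route, using a naive whitespace chunker purely for illustration (the chunk size and the splitting strategy are entirely up to the user):

```python
from keybert import KeyBERT


def chunk_by_words(text, chunk_size=128):
    # Naive whitespace chunking; see the caveats about non-whitespace languages above.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


long_document = "..."  # a document longer than the embedding model's max_seq_length
chunks = chunk_by_words(long_document)

kw_model = KeyBERT()
# Passing a list of documents returns one keyword list per chunk.
keywords_per_chunk = kw_model.extract_keywords(chunks)
```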
KeyBERT does no splitting at all. What happens is that the underlying embedding model might truncate the input sequence before creating the embedding in order to stay within its token limit. The documents themselves, however (and thus the candidate keywords), are not truncated at all.
The embeddings, yes; the text itself that produces the candidate words, no.
You can already use them, since sentence-transformers supports them.
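As a sketch of how a longer-context sentence-transformers model could be plugged in as the backend; the model name below is a placeholder, so check the model card for its actual `max_seq_length` and any loading flags it may require:

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Any sentence-transformers-compatible model can be passed as the backend.
st_model = SentenceTransformer("your-long-context-model")  # placeholder model name
print(st_model.max_seq_length)  # how much of each document will actually be embedded

kw_model = KeyBERT(model=st_model)
keywords = kw_model.extract_keywords("A long legal case ...")
```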
Thanks, I'm not understanding the "candidate words" aspect of KeyBERT... Also not understanding what you're referring to as far as embeddings versus non-embeddings. For example, when I first tried this program I used a script that didn't specify any sentence-transformers model, basically your example in the readme... I couldn't see which LLM was being used, but I searched online and it said that "bert-base-nli-mean-tokens" is used under the hood. I'm not sure if this is accurate, but when I ran the script it did in fact seem to download something; I saw the traditional tqdm progress bar I see when downloading from Hugging Face...

I guess I'm not understanding the distinction between (1) core KeyBERT functionalities and (2) what sentence-transformers models specifically do when interacting with KeyBERT... If you'll skim the source code at the repo I posted, you'll notice it uses sentence-transformers models as well as the "default" behavior.

You seem like a normal dude and I'm sure we could nerd out on a few things, but if you want, let's do a 5-10 minute Discord call to go through the features of KeyBERT. I'm already very familiar with embeddings in general; my main repository that got me into Python programming as a hobby is here:
If I understand correctly... If I pass a 1000-page book to KeyBERT, the CountVectorizer from scikit-learn will create the array of tokens and their frequency throughout all 1000 pages. These are defined as "candidate" keywords or keyphrases? Then vector representations of each of these keywords/keyphrases are created? Then a vector representation of the 1000 pages is created... but it's truncated if it exceeds the context limit of the embedding model? So we end up with an array of keywords based on all 1000 pages of the text body... but when cosine similarity is done, it's only in fact using the 1000 pages up to the point where they were truncated because of the embedding model's context limit? If I'm understanding you correctly?

Regarding n-grams: how does that interplay? For example, does the CountVectorizer extract all words and permutations of sequences of words, albeit up to the n-gram limit? For example, if we have the sentence "I like cats" with an n-gram setting of two...

keywords/keyphrases:

And then it proceeds to create vectors... and perform cosine similarity with respect to the larger body of text from whence they came, like above?
That is correct. The tokenization step is independent of the embedding step since the tokenization is done in one sweep at the corpus level. For instance, if you were to have a million documents, it would only generate the word embeddings for all unique words instead of for every word it encounters in each document.
Yes, though do note that it would have to be a value of (1, 2) to indicate that the minimum is one token and that at most 2-grams are used.
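To make the (1, 2) setting concrete, here is what the candidate set looks like for "I like cats" with plain scikit-learn; the custom `token_pattern` only keeps the single-character token "I", which the default pattern would drop:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Candidates for "I like cats" with ngram_range=(1, 2): every unigram and bigram.
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(["I like cats"])
print(vectorizer.get_feature_names_out())
# ['cats' 'i' 'i like' 'like' 'like cats']
```

In KeyBERT this range is typically exposed through the `keyphrase_ngram_range` parameter of `extract_keywords`; treat the exact parameter name as an assumption if you are on an older version.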
First, thanks for your great effort in this repo. I highly appreciate it.
Regarding this PR:
**Problem**

BERT models have restrictions on their input: the number of words that each model can accept is known as `max_seq_length`. Usually, a sentence will be truncated if it is longer than `max_seq_length`.

**Context**

I'm trying to embed a couple of sentences with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model, which has a `max_seq_length` of 128. So trying to embed sentences with more than 128 words will be an issue here. This issue becomes clearly noticeable for larger sentence sizes (more than 500 words).

**Solution**

Split the sentence into chunks that each fit within `max_seq_length`, embed each chunk, and average the resulting embeddings.

**Example**

Let's say our model has a `max_seq_length` of 16, and we have the following sentence with a word count of 25:
"KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document."
We first take the first 16 words and embed them:
"KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords"
Then we take the rest and embed them:
"and keyphrases that are most similar to a document"
The final embedding will be `(embedding_1 + embedding_2) / 2`.
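A minimal sketch of this behaviour done manually with sentence-transformers; the 16-word limit mirrors the example above, and since real models measure `max_seq_length` in tokens rather than words, this is only an approximation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

sentence = (
    "KeyBERT is a minimal and easy-to-use keyword extraction technique that "
    "leverages BERT embeddings to create keywords and keyphrases that are "
    "most similar to a document."
)

# Split into chunks of at most 16 words, embed each chunk, then average.
max_words = 16
words = sentence.split()
chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunk_embeddings = model.encode(chunks)               # one embedding per chunk
final_embedding = np.mean(chunk_embeddings, axis=0)   # (embedding_1 + embedding_2) / 2
```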
Reference:
UKPLab/sentence-transformers#364