Fix max_seq_length issue in BERT models. #124

Open · wants to merge 1 commit into base: master
Conversation

beshr-eldebuch

First, thanks for your great effort in this repo. I highly appreciate it.

Regarding this PR:

Problem:
BERT models have a restriction on their input: the number of words each model can accept is known as max_seq_length. Sentences longer than max_seq_length are usually truncated.

Context
I'm trying to embed a couple of sentences with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model, which has a max_seq_length of 128. Trying to embed sentences with more than 128 words is therefore an issue, and it becomes clearly noticeable for larger sentences (more than 500 words).

Solution

  • Divide the large sentence into sub-sentences, each of size max_seq_length.
  • Embed each sub-sentence.
  • The final embedding is the mean of the sub-sentence embeddings (see the code sketch after the example below).

Example

  • Let's say our model has a max_seq_length of 16.

  • And we have the following sentence with a word count of 25:

    "KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document."

  • We first take the first 16 words and embed them:
    "KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords"

  • Then we take the rest and embed them:
    "and keyphrases that are most similar to a document"

  • The final embedding will be (embedding_1 + embedding_2) / 2
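A minimal sketch of this approach (assuming whitespace-separated words and a SentenceTransformer model; embed_long_sentence is an illustrative name, not existing KeyBERT API):

import numpy as np
from sentence_transformers import SentenceTransformer

def embed_long_sentence(sentence, model, max_seq_length):
    # Split into word chunks of at most max_seq_length words (assumes whitespace tokens).
    words = sentence.split()
    chunks = [" ".join(words[i:i + max_seq_length]) for i in range(0, len(words), max_seq_length)]
    # Embed each chunk and average the chunk embeddings.
    embeddings = model.encode(chunks)
    return np.mean(embeddings, axis=0)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embedding = embed_long_sentence("KeyBERT is a minimal and easy-to-use keyword extraction technique ...", model, max_seq_length=128)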

Reference:
UKPLab/sentence-transformers#364

@beshr-eldebuch
Author

@MaartenGr
If this looks good to you, I'll update the docs as well.

@MaartenGr
Owner

Thank you for this PR. Although splitting text up into sentences is definitely helpful, there are a few things that make it a bit more complex than what the PR proposes. For example, joining and splitting text on spaces will not work for all languages. There are several non-Western languages where tokens are not separated by spaces (such as Chinese). Moreover, the max_seq_length that you use and that you find within SentenceTransformers does not count individual words. It counts word pieces, which do not have a one-to-one relationship with words. For example, some words might be tokenized into several word pieces. In these cases, it might be more accurate to let the SentenceTransformer models do the tokenization. However, since other embedding models have different tokenization procedures, accounting for all of them, even the custom ones, might not be possible.
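To illustrate the word-piece point (a small sketch; the exact sub-tokens depend on the tokenizer):

from transformers import AutoTokenizer

# Tokenizer used by sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# A single word can map to several word pieces, so a word count
# does not equal the token count that max_seq_length refers to.
print(tokenizer.tokenize("keyphrases"))
print(len(tokenizer.tokenize("KeyBERT is a minimal and easy-to-use keyword extraction technique")))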

@beshr-eldebuch
Author

Thanks for the reply & apologies for the late response.
The point that you mentioned regarding different tokenizers working differently is totally right. However, to make the problem as general as possible, the assumption above was made.
Splitting the text into chunks of max_seq_length (or even lower to avoid issues with the tokenizers) is way better than truncating the remaining text, especially for those who are new to BERT and want to use it as is.
Interesting that Chinese doesn't use spaces; I generally work with Latin-script languages.
I understand your concerns regarding this PR, but you may at least want to mention this issue in the docs so people are aware of it (I'm not asking for credit).

@MaartenGr
Owner

In the coming weeks, I hope to have released a new version of BERTopic, after which I can look into this a bit deeper. It is important though that there are no assumptions made with respect to the separator of tokens. Similarly, it would also open the need to have different tokenizers enabled based on the input model. I am not saying this potential feature will not be implemented but it seems to me that the focus should be more on the CountVectorizer to do the splitting. Having said that, with differing n-gram levels and other parameters that do not take single token splitting into account, this might prove to be difficult.

Similarly, some users have suggested providing input embeddings as a parameter as a way to speed up experimentation. Using that feature, users could truncate or combine embeddings however they want which may resolve this issue.

@turian

turian commented May 5, 2023

@MaartenGr I think optionally giving a param spacy_max_tokens and then using multilingual SpaCy to do sentence and word tokenization would be a nice solution.

From ChatGPT:

To count the number of words (tokens) in each sentence using spaCy, you can get the tokenization information directly from the Doc and Span objects. Here's how you can do it:

import spacy

# Load the multilingual model
nlp = spacy.load("xx_ent_wiki_sm")

# The multilingual model ships without a parser, so add a rule-based
# sentencizer; otherwise doc.sents raises an error about unset
# sentence boundaries.
nlp.add_pipe("sentencizer")

# Process the input text
text = "This is an English sentence. 这是一个中文句子。Ceci est une phrase française."
doc = nlp(text)

# Extract the sentences and their word counts
for sent in doc.sents:
    token_count = len(sent)
    print(f"Sentence: {sent}\nWord count: {token_count}\n")

This will output the number of words (tokens) in each sentence according to spaCy's tokenization for various languages:

Sentence: This is an English sentence.
Word count: 5

Sentence: 这是一个中文句子。
Word count: 6

Sentence: Ceci est une phrase française.
Word count: 6

Please note that the tokenization behavior and, consequently, the word count may vary depending on the language and specific linguistic characteristics. In general, spaCy's models provide reasonable tokenization for most languages, but edge cases could exist.

@MaartenGr
Owner

Multi-lingual SpaCy tokenization is, and will often be, quite different from the tokenization in one of the backends, as they all have different strategies for tokenization. The way tokenization is handled in transformer-based models often differs, especially with respect to sub-tokens, from a vocab-based technique such as SpaCy. As a result, the actual vocab, and therefore the token count, can differ significantly between techniques.

Instead, I think it would be most appropriate to let the backends handle the token counting and adjust where necessary. One thing to note here, though, is that it can slow down computation quite a bit, so I think it should be an opt-in parameter. Something like truncate_documents: bool = False should suffice whilst letting the backends handle the embedding/truncation steps. That would require updating all backends to include this option but should be the most accurate representation of token limits.
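A rough sketch of what such an option could look like for a SentenceTransformer-style backend (truncate_documents and embed_documents below are hypothetical, not existing KeyBERT API):

from sentence_transformers import SentenceTransformer

def embed_documents(docs, model, truncate_documents=False):
    # Hypothetical sketch: let the model's own tokenizer decide where the
    # token limit falls instead of counting whitespace-separated words.
    if truncate_documents:
        max_tokens = model.get_max_seq_length()
        docs = [
            model.tokenizer.convert_tokens_to_string(
                model.tokenizer.tokenize(doc)[:max_tokens]
            )
            for doc in docs
        ]
    return model.encode(docs)

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embed_documents(["A very long document ..."], model, truncate_documents=True)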

@BBC-Esq

BBC-Esq commented Oct 10, 2024

Is there any news on this topic? I'm new to this great library and assumed that it broke up a body of text that it's supposed to process into chunks if the entire body would exceed the max_seq_length of a particular embedding model. BTW, @beshr-eldebuch I believe that you're off a little as far as the length...here's a list regarding the models I use in another program I've created:
[screenshot: table of embedding models and their max sequence lengths]

Regardless, if I want to get the keywords from a legal case, for example, the case will oftentimes exceed the sequence length (aka "context" in chat-model terminology), which is typically 512.

Basically, can you please explain how KeyBERT splits a large body of text, or whether it does so at all? And how hard would it be to implement the splitting? I briefly read that it does some math regarding word relevance in relation to the body of text... so it seems like it might be difficult if that calculation now has to take multiple chunks into consideration (i.e. multiple bodies of text with words within them and their relation to that specific body...).

In other words, should I simply assume that anything above 512 tokens will be truncated and only feed text up to that limit?

Also, any qualms with implementing the Alibaba models, which have a context limit of 8192? Should we start a new discussion on this? Thanks for the great work! Luckily I found this repo.

@MaartenGr
Owner

@BBC-Esq There are no updates on this other than either creating the document embeddings yourself or splitting the documents beforehand and passing those. The former is tricky to implement since there are dozens of ways you can create (weighted) document embeddings to perform keyword extraction on. This is why the latter is often advised since it allows you to separate topics. You otherwise run into the problem of having an embedding with dozens of different "topics" and keyword extraction might be inaccurate. That said, you could do the former manually with the document_embeddings parameter.
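For example, a rough sketch of the "split beforehand" route with precomputed embeddings (the word-count chunking and the chunk size of 300 are only for illustration, and the exact parameter name for passing document embeddings should be checked against the KeyBERT docs):

from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=st_model)

# Split a long document into passages beforehand (naive word-count chunking).
long_document = "..."  # e.g. a legal case that exceeds the model's token limit
words = long_document.split()
chunks = [" ".join(words[i:i + 300]) for i in range(0, len(words), 300)]

# Optionally precompute the document embeddings and pass them in.
doc_embeddings = st_model.encode(chunks)
keywords_per_chunk = kw_model.extract_keywords(chunks, doc_embeddings=doc_embeddings)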

Also note that there are many ways you can split/chunk a given sequence (char/whitespace/model's tokenizer/etc.) and that some are quite a bit slower than others.

Don't get me wrong, having such a feature would be nice, it's just a bit more complex than simply splitting by whitespace and counting tokens.

Basically, can you please explain how KeyBERT splits a large body of text, or whether it does so at all? And how hard would it be to implement the splitting? I briefly read that it does some math regarding word relevance in relation to the body of text... so it seems like it might be difficult if that calculation now has to take multiple chunks into consideration (i.e. multiple bodies of text with words within them and their relation to that specific body...).

KeyBERT does no splitting at all. What happens is that the underlying embedding model might truncate the input sequence before creating the embedding in order to stay within its token limit. The documents themselves, however (and therefore the candidate keywords), are not truncated at all.
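You can check the limit that applies to the embedding step directly on the model (a small illustration; the exact value depends on the model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Inputs longer than this many tokens are truncated when embedding.
print(model.max_seq_length)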

In other words, should I simply assume that anything above 512 tokens will be truncated and only feed text up to that limit?

The embeddings, yes; the text itself that produces the candidate words, no.

Also, any qualms with implementing the Alibaba models, which have a context limit of 8192? Should we start a new discussion on this? Thanks for the great work! Luckily I found this repo.

You can already use them, since sentence-transformers supports them.

@BBC-Esq

BBC-Esq commented Oct 11, 2024

Thanks. I'm not understanding the "candidate words" aspect of KeyBERT... I'm also not understanding what you're referring to as far as embeddings versus non-embeddings.

For example, when I first tried this program I used a script that didn't specify any sentence-transformers model, basically your example from the readme... I couldn't see which model was being used, but I searched online and it said "bert-base-nli-mean-tokens" is used under the hood. I'm not sure if this is accurate, but when I ran the script it did in fact seem to download something, showing the usual tqdm progress bar I see when downloading from Hugging Face...

I guess I'm not understanding the distinction between (1) core KeyBERT functionality and (2) what sentence-transformers models specifically do when interacting with KeyBERT...

If you skim the source code in the repo I posted, you'll notice it uses sentence-transformers models as well as the "default" behavior.

You seem like a normal dude and I'm sure we could nerd out on a few things, but if you want let's do a 5-10 minute discord call to go through the features of KeyBERT. I'm already very familiar with embeddings in general, my main repository that got me into python programming as a hobby is here:

https://github.com/BBC-Esq/VectorDB-Plugin-for-LM-Studio

@BBC-Esq

BBC-Esq commented Oct 11, 2024

Think I found the answer on your good docs:

[screenshot from the KeyBERT documentation]

@BBC-Esq

BBC-Esq commented Oct 11, 2024

If I understand correctly... if I pass a 1000-page book to KeyBERT, the CountVectorizer from scikit-learn will create the array of tokens and their frequencies throughout all 1000 pages. These are defined as the "candidate" keywords or keyphrases?

Then, vector representations of each of these keywords/keyphrases are created?

Then, a vector representation of the 1000 pages is created...but it's truncated if it exceeds the context limit of the embedding model?

So we end up with an array of keywords based on all 1000 pages of the text body...but when cosine similarity is done it's only in-fact using the 1000 pages up to the point it was truncated because of the embedding model's context limit?

Am I understanding you correctly?

Regarding n-grams:

How do n-grams interplay with this? For example, does the CountVectorizer extract all words and all permutations of word sequences, up to the n-gram limit?

For example, if we have the sentence "I like cats" with an n-gram setting of two...

keywords/keyphrases:
I
like
cats
I like
like cats

And then it proceeds to create vectors... and performs cosine similarity against the larger body of text they came from, like above?

@MaartenGr
Owner

That is correct. The tokenization step is independent of the embedding step, since the tokenization is done in one swoop at the corpus level. For instance, if you were to have a million documents, it would only generate word embeddings for all unique words instead of for every word it encounters in each document.
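A small illustration of the corpus-level vocabulary (plain scikit-learn, shown only to make the point about unique words and phrases):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like cats", "I like cats and dogs"]

# The vocabulary is built once over the whole corpus, so each unique
# word or phrase appears only once no matter how often it occurs.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())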

For example, if we have the sentence "I like cats" with an n-gram setting of two...

Yes, do note though that it would have to be a value of (1, 2) to indicate that the minimum n-gram size is one and that n-grams can be at most two tokens long.
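A quick sketch of what that looks like in KeyBERT (stop_words=None prevents English stop words such as "like" from being filtered out of the candidates):

from keybert import KeyBERT

kw_model = KeyBERT()
doc = "I like cats"

# keyphrase_ngram_range=(1, 2) keeps unigrams and bigrams as candidate keyphrases.
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
print(keywords)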
