KeyLLM to extract keywords from text with LLMs #180
Merged
A minimal method for keyword extraction with Large Language Models (LLMs). There are a number of implementations that allow you to mix and match KeyBERT with KeyLLM. You can also choose to use KeyLLM without KeyBERT.

We start with an example of some data:
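The input is a plain list of strings, one per document. A small illustrative set where the first two documents share a topic and the third does not (the exact texts here are a sketch, not the PR's own data):

```python
# Illustrative example data: documents 1 and 2 are similar (package delivery),
# document 3 covers an entirely different subject (LLM releases).
documents = [
    "The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
    "I received my package!",
    "Whereas the most powerful LLMs have generally been accessible only through limited APIs, Meta released LLaMA's model weights to the research community under a noncommercial license.",
]
```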
This data was chosen to show the different use cases and techniques. As you might have noticed, documents 1 and 2 are quite similar, whereas document 3 is about an entirely different subject. This similarity will be taken into account when using KeyBERT together with KeyLLM. Let's start with KeyLLM only.

Use Cases
If you want the best performance and the easiest method, you can skip the use cases below and go straight to number 5, where you will combine KeyBERT with KeyLLM.

1. Create Keywords with KeyLLM
We start by creating keywords for each document. This creation process simply asks the LLM to come up with a bunch of keywords for each document. The focus here is on creating keywords, meaning that the keywords do not necessarily need to appear in the input documents:
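A minimal sketch of this step, assuming the OpenAI backend added in this PR and the pre-1.0 openai client (the API key is a placeholder, and documents is the list of input texts):

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create the LLM wrapper (pre-1.0 openai client style; key is a placeholder)
openai.api_key = "sk-..."
llm = OpenAI()

# Load it into KeyLLM
kw_model = KeyLLM(llm)

# Ask the LLM to come up with keywords for each document
keywords = kw_model.extract_keywords(documents)
```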
This creates the following keywords:
2. Extract Keywords with KeyLLM
Instead of creating keywords out of thin air, we ask the LLM to check whether the keywords actually appear in the text and to limit the output to those that are found in the documents. We do this by using a custom prompt together with check_vocab=True:

This creates the following keywords:
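The extraction step above can be sketched as follows; the prompt wording is illustrative, and [DOCUMENT] is the tag KeyLLM replaces with each input document:

```python
from keybert.llm import OpenAI
from keybert import KeyLLM

# Custom prompt; [DOCUMENT] is replaced with each input document.
# The exact wording here is illustrative.
prompt = """
I have the following document:
[DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas:
"""

llm = OpenAI(prompt=prompt)
kw_model = KeyLLM(llm)

# check_vocab=True keeps only keywords that literally appear in the document
keywords = kw_model.extract_keywords(documents, check_vocab=True)
```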
3. Fine-tune Candidate Keywords
If you already have a list of candidate keywords, you could fine-tune them by asking the LLM to come up with nicer tags or names that we could use. We can use the [CANDIDATES] tag in the prompt to assign where they should go.

This creates the following keywords:
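The fine-tuning step above can be sketched as follows; the candidate lists and prompt wording are illustrative:

```python
from keybert.llm import OpenAI
from keybert import KeyLLM

# One list of candidate keywords per document (illustrative values)
candidate_keywords = [
    ["delivery", "website", "waiting"],
    ["package", "received"],
    ["LLM", "LLaMA", "Meta", "license"],
]

# [DOCUMENT] is replaced with the document, [CANDIDATES] with its candidates
prompt = """
I have the following document:
[DOCUMENT]

With the following candidate keywords:
[CANDIDATES]

Based on the information above, improve the candidate keywords to best describe the topic of the document. Separate them with commas:
"""

llm = OpenAI(prompt=prompt)
kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(documents, candidate_keywords=candidate_keywords)
```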
4. Efficient KeyLLM
If you have embeddings of your documents, you can use those to find the documents that are most similar to one another. Those documents can then all receive the same keywords, and only one document per group needs to be passed to the LLM. This makes computation much faster, since only a subset of documents needs to be sent to the LLM.
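A sketch of this step, assuming sentence-transformers for the document embeddings; the model name and threshold value are illustrative choices:

```python
from sentence_transformers import SentenceTransformer
from keybert.llm import OpenAI
from keybert import KeyLLM

# Pre-compute document embeddings (model choice is illustrative)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents, convert_to_tensor=True)

# Documents whose similarity exceeds the threshold are grouped together
# and share keywords; only one document per group is sent to the LLM.
llm = OpenAI()
kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=0.75)
```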
This creates the following keywords:
5. Efficient KeyLLM + KeyBERT
This is the best of both worlds. We use KeyBERT to generate a first pass of keywords and embeddings and give those to KeyLLM for a final pass. Again, the most similar documents will be clustered and will all receive the same keywords. You can change this behavior with the threshold parameter: a higher value reduces the number of documents that are clustered, and a lower value increases it.

This creates the following keywords:
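The combined pipeline can be sketched as follows, with the LLM loaded through KeyBERT's llm argument; the embedding model name and threshold value are illustrative choices:

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyBERT

openai.api_key = "sk-..."  # placeholder
llm = OpenAI()

# KeyBERT generates candidate keywords and embeddings,
# then hands both to KeyLLM for the final pass
kw_model = KeyBERT(llm=llm, model="all-MiniLM-L6-v2")

# threshold controls how aggressively similar documents are clustered
keywords = kw_model.extract_keywords(documents, threshold=0.75)
```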
Large Language Models
Currently, the following are supported:
keybert.llm.OpenAI
keybert.llm.Cohere
keybert.llm.TextGeneration
keybert.llm.LangChain
keybert.llm.LiteLLM
To do:
...