diff --git a/docs/api/cohere.md b/docs/api/cohere.md
new file mode 100644
index 00000000..d9eb8f2a
--- /dev/null
+++ b/docs/api/cohere.md
@@ -0,0 +1,3 @@
+# `Cohere`
+
+::: keybert.llm._cohere.Cohere
diff --git a/docs/api/keyllm.md b/docs/api/keyllm.md
new file mode 100644
index 00000000..24f8ce5e
--- /dev/null
+++ b/docs/api/keyllm.md
@@ -0,0 +1,3 @@
+# `KeyLLM`
+
+::: keybert._llm.KeyLLM
diff --git a/docs/api/langchain.md b/docs/api/langchain.md
new file mode 100644
index 00000000..d0087f05
--- /dev/null
+++ b/docs/api/langchain.md
@@ -0,0 +1,3 @@
+# `LangChain`
+
+::: keybert.llm._langchain.LangChain
diff --git a/docs/api/litellm.md b/docs/api/litellm.md
new file mode 100644
index 00000000..e3608f78
--- /dev/null
+++ b/docs/api/litellm.md
@@ -0,0 +1,3 @@
+# `LiteLLM`
+
+::: keybert.llm._litellm.LiteLLM
diff --git a/docs/api/openai.md b/docs/api/openai.md
new file mode 100644
index 00000000..9b6f36d0
--- /dev/null
+++ b/docs/api/openai.md
@@ -0,0 +1,3 @@
+# `OpenAI`
+
+::: keybert.llm._openai.OpenAI
diff --git a/docs/api/textgeneration.md b/docs/api/textgeneration.md
new file mode 100644
index 00000000..a18639f3
--- /dev/null
+++ b/docs/api/textgeneration.md
@@ -0,0 +1,3 @@
+# `TextGeneration`
+
+::: keybert.llm._textgeneration.TextGeneration
diff --git a/docs/guides/llms.md b/docs/guides/llms.md
index fc1728de..949e3a8a 100644
--- a/docs/guides/llms.md
+++ b/docs/guides/llms.md
@@ -1,149 +1,146 @@
-# Embedding Models
-In this tutorial we will be going through the embedding models that can be used in KeyBERT.
-Having the option to choose embedding models allow you to leverage pre-trained embeddings that suit your use-case.
+# Large Language Models (LLM)
+In this tutorial we will be going through the Large Language Models (LLMs) that can be used in KeyLLM.
+Having the option to choose the LLM allows you to leverage the model that best suits your use case.

-### **Sentence Transformers**
-You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
-and pass it through KeyBERT with `model`:
+### **OpenAI**
+To use OpenAI's external API, we need to define our API key and use the `keybert.llm.OpenAI` model:

```python
-from keybert import KeyBERT
-kw_model = KeyBERT(model="all-MiniLM-L6-v2")
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-Or select a SentenceTransformer model with your own parameters:
+# Create your OpenAI LLM
+openai.api_key = "sk-..."
+llm = OpenAI()

-```python
-from sentence_transformers import SentenceTransformer
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
-kw_model = KeyBERT(model=sentence_model)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
```

-### 🤗 **Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
-to any model found on their model hub (https://huggingface.co/models):
+If you want to use a chat-based model, please run the following instead:

```python
-from transformers.pipelines import pipeline
-
-hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
-kw_model = KeyBERT(model=hf_model)
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-!!! tip "Tip!"
-    These transformers also work quite well using `sentence-transformers` which has a number of
-    optimizations tricks that make using it a bit faster.
+# Create your LLM
+openai.api_key = "sk-..."
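+# chat=True is needed for chat-based models such as gpt-3.5-turbo; a custom
+# `prompt` template that uses the [DOCUMENT] placeholder can also be supplied
+# (an assumption based on the prompt handling added elsewhere in this PR)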
+llm = OpenAI(model="gpt-3.5-turbo", chat=True)

-### **Flair**
-[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
-is publicly available. Flair can be used as follows:
-
-```python
-from flair.embeddings import TransformerDocumentEmbeddings
-
-roberta = TransformerDocumentEmbeddings('roberta-base')
-kw_model = KeyBERT(model=roberta)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
```

-You can select any 🤗 transformers model [here](https://huggingface.co/models).
-
-Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
-Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
-pass it to KeyBERT in order to use those word embeddings as document embeddings:
+### **Cohere**
+To use Cohere's external API, we need to define our API key and use the `keybert.llm.Cohere` model:

```python
-from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
+import cohere
+from keybert.llm import Cohere
+from keybert import KeyLLM
+
+# Create your Cohere LLM
+co = cohere.Client(my_api_key)
+llm = Cohere(co)

-glove_embedding = WordEmbeddings('crawl')
-document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-kw_model = KeyBERT(model=document_glove_embeddings)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
```

-### **Spacy**
-[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
-many models available across many languages for modeling text.
+### **LiteLLM**
+[LiteLLM](https://github.com/BerriAI/litellm) allows you to use any closed-source LLM with KeyLLM.

-To use Spacy's non-transformer models in KeyBERT:
+Let's use OpenAI as an example:

```python
-import spacy
+import os
+from keybert.llm import LiteLLM
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+# Select LLM
+os.environ["OPENAI_API_KEY"] = "sk-..."
+llm = LiteLLM("gpt-3.5-turbo")

-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
```

-Using spacy-transformer models:
+### 🤗 **Hugging Face Transformers**
+To use a Hugging Face transformers model, load in a pipeline and point
+to any model found on their model hub (https://huggingface.co/models).
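+
+Any text-generation pipeline can be plugged in here. As a minimal sketch, assuming
+a small placeholder model id such as `gpt2` (swap in any model id from the hub):
+
+```python
+from transformers import pipeline
+
+# Illustrative small model; any text-generation model id from the hub works
+generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)
+```
+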
+Let's use Llama 2 as an example:

```python
-import spacy
-
-spacy.prefer_gpu()
-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-
-kw_model = KeyBERT(model=nlp)
+from torch import cuda, bfloat16
+import transformers
+
+model_id = 'meta-llama/Llama-2-7b-chat-hf'
+
+# 4-bit Quantization to load Llama 2 with less GPU memory
+bnb_config = transformers.BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type='nf4',
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=bfloat16
+)
+
+# Llama 2 Model & Tokenizer
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+model = transformers.AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    quantization_config=bnb_config,
+    device_map='auto',
+)
+model.eval()
+
+# Our text generator
+generator = transformers.pipeline(
+    model=model, tokenizer=tokenizer,
+    task='text-generation',
+    temperature=0.1,
+    max_new_tokens=500,
+    repetition_penalty=1.1
+)
```

-If you run into memory issues with spacy-transformer models, try:
+Then, we load the `generator` in `KeyLLM`:

```python
-import spacy
-from thinc.api import set_gpu_allocator, require_gpu
+from keybert.llm import TextGeneration
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-set_gpu_allocator("pytorch")
-require_gpu(0)
-
-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+llm = TextGeneration(generator)
+kw_model = KeyLLM(llm)
```

-### **Universal Sentence Encoder (USE)**
-The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
-for embedding the documents. The model is trained and optimized for greater-than-word length text,
-such as sentences, phrases or short paragraphs.
+### **LangChain**

-Using USE in KeyBERT is rather straightforward:
+To use LangChain, we can simply load in any LLM and pass that as a QA-chain to KeyLLM:

```python
-import tensorflow_hub
-embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
-kw_model = KeyBERT(model=embedding_model)
+from langchain.chains.question_answering import load_qa_chain
+from langchain.llms import OpenAI
+chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
```

-### **Gensim**
-For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any model word embedding model
-to be used in KeyBERT. Note that Gensim is primarily used for Word Embedding models. This works typically
-best for short documents since the word embeddings are pooled.
+Finally, you can pass the chain to KeyLLM as follows:

```python
-import gensim.downloader as api
-ft = api.load('fasttext-wiki-news-subwords-300')
-kw_model = KeyBERT(model=ft)
-```
+from keybert.llm import LangChain
+from keybert import KeyLLM

-### **Custom Backend**
-If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
-create your own backend.
-Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:
+# Create your LLM
+llm = LangChain(chain)

-```python
-from keybert.backend import BaseEmbedder
-from sentence_transformers import SentenceTransformer
-
-class CustomEmbedder(BaseEmbedder):
-    def __init__(self, embedding_model):
-        super().__init__()
-        self.embedding_model = embedding_model
-
-    def embed(self, documents, verbose=False):
-        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
-        return embeddings
-
-# Create custom backend
-distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
-custom_embedder = CustomEmbedder(embedding_model=distilbert)
-
-# Pass custom backend to keybert
-kw_model = KeyBERT(model=custom_embedder)
-```
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
+```
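+
+### **Candidate Keywords**
+Every LLM above can also fine-tune candidate keywords that were extracted beforehand,
+for instance by a regular KeyBERT pass. Custom prompts can reference those candidates
+through the `[CANDIDATES]` tag. A minimal sketch with hypothetical candidates (the
+keyword lists are matched positionally with the documents):
+
+```python
+# Hypothetical candidates, e.g. the output of a first KeyBERT pass
+candidate_keywords = [["information retrieval", "text mining"]]
+keywords = kw_model.extract_keywords(MY_DOCUMENTS, candidate_keywords=candidate_keywords)
+```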
\ No newline at end of file
diff --git a/keybert/_llm.py b/keybert/_llm.py
index 03b70e31..ca8ff22e 100644
--- a/keybert/_llm.py
+++ b/keybert/_llm.py
@@ -29,7 +29,7 @@ def extract_keywords(
        check_vocab: bool = False,
        candidate_keywords: List[List[str]] = None,
        threshold: float = None,
-        embeddings = None
+        embeddings=None
    ) -> Union[List[str], List[List[str]]]:
        """Extract keywords and/or keyphrases
@@ -85,7 +85,6 @@ def extract_keywords(
        out_cluster = set(list(range(len(docs)))).difference(in_cluster)

        # Extract keywords for all documents not in a cluster
-        if out_cluster: 
+        if out_cluster:
            selected_docs = [docs[index] for index in out_cluster]
-            print(out_cluster, selected_docs)
            if candidate_keywords is not None:
diff --git a/keybert/llm/_cohere.py b/keybert/llm/_cohere.py
index c2f9dd6e..fdf0d3d6 100644
--- a/keybert/llm/_cohere.py
+++ b/keybert/llm/_cohere.py
@@ -2,6 +2,7 @@
 from tqdm import tqdm
 from typing import List
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = """
@@ -93,19 +94,26 @@ def __init__(self,
        self.delay_in_seconds = delay_in_seconds
        self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
        """ Extract topics

        Arguments:
            documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.

        Returns:
            all_keywords: All keywords for each document
        """
        all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
            prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))

            # Delay
            if self.delay_in_seconds:
diff --git a/keybert/llm/_langchain.py b/keybert/llm/_langchain.py
index c6e4cf7c..4741c8ce 100644
--- a/keybert/llm/_langchain.py
+++ b/keybert/llm/_langchain.py
@@ -2,6 +2,7 @@
 from typing import List
 from langchain.docstore.document import Document
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = "What is this document about? Please provide keywords separated by commas."
@@ -75,18 +76,26 @@ def __init__(self,
        self.default_prompt_ = DEFAULT_PROMPT
        self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
        """ Extract topics

        Arguments:
            documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.

        Returns:
            all_keywords: All keywords for each document
        """
        all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
+            prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))
            input_document = Document(page_content=document)
-            keywords = self.chain.run(input_documents=input_document, question=self.prompt).strip()
+            keywords = self.chain.run(input_documents=input_document, question=prompt).strip()
            keywords = [keyword.strip() for keyword in keywords.split(",")]
diff --git a/keybert/llm/_litellm.py b/keybert/llm/_litellm.py
index 0702abc2..f0e55469 100644
--- a/keybert/llm/_litellm.py
+++ b/keybert/llm/_litellm.py
@@ -3,6 +3,7 @@
 from litellm import completion
 from typing import Mapping, Any, List
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = """
@@ -88,19 +89,26 @@ def __init__(self,
        if self.generator_kwargs.get("prompt"):
            del self.generator_kwargs["prompt"]

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
        """ Extract topics

        Arguments:
            documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.

Returns: all_keywords: All keywords for each document """ all_keywords = [] + candidate_keywords = process_candidate_keywords(documents, candidate_keywords) - for document in tqdm(documents, disable=not self.verbose): + for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose): prompt = self.prompt.replace("[DOCUMENT]", document) + if candidates is not None: + prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates)) # Delay if self.delay_in_seconds: diff --git a/keybert/llm/_openai.py b/keybert/llm/_openai.py index 9d155667..178abee6 100644 --- a/keybert/llm/_openai.py +++ b/keybert/llm/_openai.py @@ -3,7 +3,7 @@ from tqdm import tqdm from typing import Mapping, Any, List from keybert.llm._base import BaseLLM -from keybert.llm._utils import retry_with_exponential_backoff +from keybert.llm._utils import retry_with_exponential_backoff, process_candidate_keywords DEFAULT_PROMPT = """ @@ -145,21 +145,16 @@ def extract_keywords(self, documents: List[str], candidate_keywords: List[List[s Arguments: documents: The documents to extract keywords from candidate_keywords: A list of candidate keywords that the LLM will fine-tune - For example, it will create a nicer representation of - the candidate keywords, remove redundant keywords, or + For example, it will create a nicer representation of + the candidate keywords, remove redundant keywords, or shorten them depending on the input prompt. Returns: all_keywords: All keywords for each document """ all_keywords = [] - if candidate_keywords is None: - candidate_keywords = [None for _ in documents] - elif isinstance(candidate_keywords[0][0], str) and not isinstance(candidate_keywords[0], list): - candidate_keywords = [[keyword for keyword, _ in candidate_keywords]] - elif isinstance(candidate_keywords[0][0], tuple): - candidate_keywords = [[keyword for keyword, _ in keywords] for keywords in candidate_keywords] - + candidate_keywords = process_candidate_keywords(documents, candidate_keywords) + for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose): prompt = self.prompt.replace("[DOCUMENT]", document) if candidates is not None: diff --git a/keybert/llm/_textgeneration.py b/keybert/llm/_textgeneration.py index 54530841..f505caf4 100644 --- a/keybert/llm/_textgeneration.py +++ b/keybert/llm/_textgeneration.py @@ -3,6 +3,7 @@ from transformers.pipelines.base import Pipeline from typing import Mapping, List, Any, Union from keybert.llm._base import BaseLLM +from keybert.llm._utils import process_candidate_keywords DEFAULT_PROMPT = """ @@ -90,19 +91,26 @@ def __init__(self, self.pipeline_kwargs = pipeline_kwargs self.verbose = verbose - def extract_keywords(self, documents: List[str]): + def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None): """ Extract topics Arguments: documents: The documents to extract keywords from + candidate_keywords: A list of candidate keywords that the LLM will fine-tune + For example, it will create a nicer representation of + the candidate keywords, remove redundant keywords, or + shorten them depending on the input prompt. 
Returns: all_keywords: All keywords for each document """ all_keywords = [] + candidate_keywords = process_candidate_keywords(documents, candidate_keywords) - for document in tqdm(documents, disable=not self.verbose): + for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose): prompt = self.prompt.replace("[DOCUMENT]", document) + if candidates is not None: + prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates)) # Extract result from generator and use that as label keywords = self.model(prompt, **self.pipeline_kwargs)[0]["generated_text"].replace(prompt, "") diff --git a/keybert/llm/_utils.py b/keybert/llm/_utils.py index 665046d1..6ca8bdd3 100644 --- a/keybert/llm/_utils.py +++ b/keybert/llm/_utils.py @@ -2,6 +2,17 @@ import time +def process_candidate_keywords(documents, candidate_keywords): + """Create a common format for candidate keywords.""" + if candidate_keywords is None: + candidate_keywords = [None for _ in documents] + elif isinstance(candidate_keywords[0][0], str) and not isinstance(candidate_keywords[0], list): + candidate_keywords = [[keyword for keyword, _ in candidate_keywords]] + elif isinstance(candidate_keywords[0][0], tuple): + candidate_keywords = [[keyword for keyword, _ in keywords] for keywords in candidate_keywords] + return candidate_keywords + + def retry_with_exponential_backoff( func, initial_delay: float = 1, diff --git a/mkdocs.yml b/mkdocs.yml index a34c4e2a..029418ec 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,10 +15,18 @@ nav: - Embedding Models: guides/embeddings.md - CountVectorizer: guides/countvectorizer.md - KeyLLM: guides/keyllm.md + - LLMs: guides/llms.md - API: - KeyBERT: api/keybert.md - MMR: api/mmr.md - MaxSum: api/maxsum.md + - KeyLLM: api/keyllm.md + - LLM: + - OpenAI: api/openai.md + - Cohere: api/cohere.md + - LangChain: api/langchain.md + - TextGeneration: api/textgeneration.md + - LiteLLM: api/litellm.md - FAQ: faq.md - Changelog: changelog.md