More documentation

MaartenGr committed Sep 26, 2023
1 parent 21040b6 commit b270773
Showing 15 changed files with 186 additions and 124 deletions.
3 changes: 3 additions & 0 deletions docs/api/cohere.md
@@ -0,0 +1,3 @@
+# `Cohere`
+
+::: keybert.llm._cohere.Cohere
3 changes: 3 additions & 0 deletions docs/api/keyllm.md
@@ -0,0 +1,3 @@
+# `KeyLLM`
+
+::: keybert._llm.KeyLLM
3 changes: 3 additions & 0 deletions docs/api/langchain.md
@@ -0,0 +1,3 @@
+# `LangChain`
+
+::: keybert.llm._langchain.LangChain
3 changes: 3 additions & 0 deletions docs/api/litellm.md
@@ -0,0 +1,3 @@
+# `LiteLLM`
+
+::: keybert.llm._litellm.LiteLLM
3 changes: 3 additions & 0 deletions docs/api/openai.md
@@ -0,0 +1,3 @@
+# `OpenAI`
+
+::: keybert.llm._openai.OpenAI
3 changes: 3 additions & 0 deletions docs/api/textgeneration.md
@@ -0,0 +1,3 @@
+# `TextGeneration`
+
+::: keybert.llm._textgeneration.TextGeneration
205 changes: 101 additions & 104 deletions docs/guides/llms.md
@@ -1,149 +1,146 @@
-# Embedding Models
-In this tutorial we will be going through the embedding models that can be used in KeyBERT.
-Having the option to choose embedding models allow you to leverage pre-trained embeddings that suit your use-case.
+# Large Language Models (LLM)
+In this tutorial, we will go through the Large Language Models (LLM) that can be used in KeyLLM.
+Having the option to choose the LLM allows you to leverage the model that suits your use case.

-### **Sentence Transformers**
-You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
-and pass it through KeyBERT with `model`:
+### **OpenAI**
+To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model:

 ```python
-from keybert import KeyBERT
-kw_model = KeyBERT(model="all-MiniLM-L6-v2")
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-Or select a SentenceTransformer model with your own parameters:
+# Create your OpenAI LLM
+openai.api_key = "sk-..."
+llm = OpenAI()

-```python
-from sentence_transformers import SentenceTransformer
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
-kw_model = KeyBERT(model=sentence_model)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
 ```

-### 🤗 **Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
-to any model found on their model hub (https://huggingface.co/models):
+If you want to use a chat-based model, please run the following instead:

 ```python
-from transformers.pipelines import pipeline

-hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
-kw_model = KeyBERT(model=hf_model)
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-!!! tip "Tip!"
-    These transformers also work quite well using `sentence-transformers` which has a number of
-    optimizations tricks that make using it a bit faster.
+# Create your LLM
+openai.api_key = "sk-..."
+llm = OpenAI(model="gpt-3.5-turbo", chat=True)
-### **Flair**
-[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
-is publicly available. Flair can be used as follows:
-
-```python
-from flair.embeddings import TransformerDocumentEmbeddings
-
-roberta = TransformerDocumentEmbeddings('roberta-base')
-kw_model = KeyBERT(model=roberta)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```

-You can select any 🤗 transformers model [here](https://huggingface.co/models).
-
-Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
-Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
-pass it to KeyBERT in order to use those word embeddings as document embeddings:
+### **Cohere**
+To use Cohere's external API, we need to define our key and use the `keybert.llm.Cohere` model:

 ```python
-from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
+import cohere
+from keybert.llm import Cohere
+from keybert import KeyLLM

+# Create your Cohere LLM
+co = cohere.Client(my_api_key)
+llm = Cohere(co)
+
-glove_embedding = WordEmbeddings('crawl')
-document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-kw_model = KeyBERT(model=document_glove_embeddings)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
 ```

-### **Spacy**
-[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
-many models available across many languages for modeling text.
+### **LiteLLM**
+[LiteLLM](https://github.com/BerriAI/litellm) allows you to use any closed-source LLM with KeyLLM.

-To use Spacy's non-transformer models in KeyBERT:
+Let's use OpenAI as an example:

 ```python
-import spacy
+import os
+from keybert.llm import LiteLLM
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+# Select LLM
+os.environ["OPENAI_API_KEY"] = "sk-..."
+llm = LiteLLM("gpt-3.5-turbo")

-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```

-Using spacy-transformer models:
+### 🤗 **Hugging Face Transformers**
+To use a Hugging Face transformers model, load in a pipeline and point
+to any model found on their model hub (https://huggingface.co/models). Let's use Llama 2 as an example:

 ```python
-import spacy
-
-spacy.prefer_gpu()
-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-
-kw_model = KeyBERT(model=nlp)
+from torch import cuda, bfloat16
+import transformers

+model_id = 'meta-llama/Llama-2-7b-chat-hf'
+
+# 4-bit Quantization to load Llama 2 with less GPU memory
+bnb_config = transformers.BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type='nf4',
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=bfloat16
+)
+
+# Llama 2 Model & Tokenizer
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+model = transformers.AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    quantization_config=bnb_config,
+    device_map='auto',
+)
+model.eval()
+
+# Our text generator
+generator = transformers.pipeline(
+    model=model, tokenizer=tokenizer,
+    task='text-generation',
+    temperature=0.1,
+    max_new_tokens=500,
+    repetition_penalty=1.1
+)
 ```

-If you run into memory issues with spacy-transformer models, try:
+Then, we load the `generator` in `KeyLLM`:

 ```python
-import spacy
-from thinc.api import set_gpu_allocator, require_gpu
+from keybert.llm import TextGeneration
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-set_gpu_allocator("pytorch")
-require_gpu(0)
-
-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+llm = TextGeneration(generator)
+kw_model = KeyLLM(llm)
 ```

-### **Universal Sentence Encoder (USE)**
-The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
-for embedding the documents. The model is trained and optimized for greater-than-word length text,
-such as sentences, phrases or short paragraphs.
+### **LangChain**

-Using USE in KeyBERT is rather straightforward:
+To use LangChain, we can simply load in any LLM and pass that as a QA-chain to KeyLLM:

 ```python
-import tensorflow_hub
-embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
-kw_model = KeyBERT(model=embedding_model)
+from langchain.chains.question_answering import load_qa_chain
+from langchain.llms import OpenAI
+chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
 ```

-### **Gensim**
-For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model
-to be used in KeyBERT. Note that Gensim is primarily used for Word Embedding models. This typically works
-best for short documents since the word embeddings are pooled.
+Finally, you can pass the chain to KeyLLM as follows:

 ```python
-import gensim.downloader as api
-ft = api.load('fasttext-wiki-news-subwords-300')
-kw_model = KeyBERT(model=ft)
-```
+from keybert.llm import LangChain
+from keybert import KeyLLM

-### **Custom Backend**
-If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
-create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:
+# Create your LLM
+llm = LangChain(chain)

-```python
-from keybert.backend import BaseEmbedder
-from sentence_transformers import SentenceTransformer
-
-class CustomEmbedder(BaseEmbedder):
-    def __init__(self, embedding_model):
-        super().__init__()
-        self.embedding_model = embedding_model
-
-    def embed(self, documents, verbose=False):
-        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
-        return embeddings
-
-# Create custom backend
-distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
-custom_embedder = CustomEmbedder(embedding_model=distilbert)
-
-# Pass custom backend to keybert
-kw_model = KeyBERT(model=custom_embedder)
-```
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```
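Taken together, a minimal end-to-end run of the guide above might look as follows. This is a sketch rather than part of the commit: it assumes `MY_DOCUMENTS` is a list of strings and a valid OpenAI key, and it exercises the `candidate_keywords` parameter that this commit threads through the LLM wrappers.

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

MY_DOCUMENTS = [
    "The website mentions that it only takes a couple of days to deliver, but I still have not received mine.",
]

# Chat-based OpenAI LLM loaded into KeyLLM
openai.api_key = "sk-..."
llm = OpenAI(model="gpt-3.5-turbo", chat=True)
kw_model = KeyLLM(llm)

# candidate_keywords is optional; when given, the LLM fine-tunes the
# candidates instead of generating keywords from scratch
keywords = kw_model.extract_keywords(
    MY_DOCUMENTS,
    candidate_keywords=[["deliver", "days", "website"]],
)
```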
4 changes: 2 additions & 2 deletions keybert/_llm.py
@@ -29,7 +29,7 @@ def extract_keywords(
                          check_vocab: bool = False,
                          candidate_keywords: List[List[str]] = None,
                          threshold: float = None,
-                         embeddings = None
+                         embeddings=None
                          ) -> Union[List[str], List[List[str]]]:
         """Extract keywords and/or keyphrases
@@ -85,7 +85,7 @@ def extract_keywords(
         out_cluster = set(list(range(len(docs)))).difference(in_cluster)

         # Extract keywords for all documents not in a cluster
-        if out_cluster: 
+        if out_cluster:
             selected_docs = [docs[index] for index in out_cluster]
             print(out_cluster, selected_docs)
             if candidate_keywords is not None:
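Judging from the signature above, `extract_keywords` also accepts precomputed document `embeddings` together with a similarity `threshold`, grouping near-identical documents so the LLM is only called once per group. A sketch of that usage, with the parameter semantics inferred from this diff rather than spelled out in it:

```python
from sentence_transformers import SentenceTransformer
from keybert.llm import OpenAI
from keybert import KeyLLM

docs = [
    "KeyBERT extracts keywords with BERT embeddings.",
    "KeyBERT is a keyword extractor built on BERT embeddings.",
]

# Embed the documents once, up front
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, convert_to_tensor=True)

# Documents more similar than the threshold share a single LLM call
kw_model = KeyLLM(OpenAI())
keywords = kw_model.extract_keywords(docs, embeddings=embeddings, threshold=0.75)
```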
12 changes: 10 additions & 2 deletions keybert/llm/_cohere.py
@@ -2,6 +2,7 @@
 from tqdm import tqdm
 from typing import List
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = """
@@ -93,19 +94,26 @@ def __init__(self,
         self.delay_in_seconds = delay_in_seconds
         self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
         """ Extract topics
         Arguments:
             documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.
         Returns:
             all_keywords: All keywords for each document
         """
         all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
             prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))

             # Delay
             if self.delay_in_seconds:
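The imported `keybert.llm._utils.process_candidate_keywords` helper is not part of this diff. Given how the loop above consumes it, a plausible minimal implementation is just an aligner that pads with `None` when no candidates are supplied; treat this sketch as an assumption, not the actual `_utils` source:

```python
def process_candidate_keywords(documents, candidate_keywords):
    """Align candidate keywords with documents so the two can be zipped.

    Hypothetical sketch: the real implementation in keybert/llm/_utils.py
    is not shown in this commit.
    """
    if candidate_keywords is None:
        candidate_keywords = [None for _ in documents]
    return candidate_keywords
```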
13 changes: 11 additions & 2 deletions keybert/llm/_langchain.py
@@ -2,6 +2,7 @@
 from typing import List
 from langchain.docstore.document import Document
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = "What is this document about? Please provide keywords separated by commas."
@@ -75,18 +76,26 @@ def __init__(self,
         self.default_prompt_ = DEFAULT_PROMPT
         self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
         """ Extract topics
         Arguments:
             documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.
         Returns:
             all_keywords: All keywords for each document
         """
         all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
             prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))
             input_document = Document(page_content=document)
             keywords = self.chain.run(input_documents=[input_document], question=prompt).strip()
             keywords = [keyword.strip() for keyword in keywords.split(",")]
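With `[CANDIDATES]` now substituted into the prompt, any of the updated wrappers can be steered with a custom prompt. Below is a sketch using the `TextGeneration` wrapper and the `generator` built earlier in the guide; the prompt text and the `prompt` keyword are illustrative assumptions, not taken from this commit:

```python
from keybert.llm import TextGeneration
from keybert import KeyLLM

# [DOCUMENT] and [CANDIDATES] are replaced per document by the wrapper
prompt = """I have the following document:
[DOCUMENT]

With these candidate keywords: [CANDIDATES]

Improve the candidates and return a comma-separated list of keywords."""

llm = TextGeneration(generator, prompt=prompt)  # `generator` from the guide above
kw_model = KeyLLM(llm)

docs = ["Large language models can refine candidate keywords."]
keywords = kw_model.extract_keywords(
    docs,
    candidate_keywords=[["language models", "keywords"]],
)
```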