More documentation

MaartenGr committed Sep 26, 2023
1 parent 21040b6 commit b270773
Showing 15 changed files with 186 additions and 124 deletions.
3 changes: 3 additions & 0 deletions docs/api/cohere.md
@@ -0,0 +1,3 @@
+# `Cohere`
+
+::: keybert.llm._cohere.Cohere
3 changes: 3 additions & 0 deletions docs/api/keyllm.md
@@ -0,0 +1,3 @@
+# `KeyLLM`
+
+::: keybert._llm.KeyLLM
3 changes: 3 additions & 0 deletions docs/api/langchain.md
@@ -0,0 +1,3 @@
+# `LangChain`
+
+::: keybert.llm._langchain.LangChain
3 changes: 3 additions & 0 deletions docs/api/litellm.md
@@ -0,0 +1,3 @@
+# `LiteLLM`
+
+::: keybert.llm._litellm.LiteLLM
3 changes: 3 additions & 0 deletions docs/api/openai.md
@@ -0,0 +1,3 @@
+# `OpenAI`
+
+::: keybert.llm._openai.OpenAI
3 changes: 3 additions & 0 deletions docs/api/textgeneration.md
@@ -0,0 +1,3 @@
+# `TextGeneration`
+
+::: keybert.llm._textgeneration.TextGeneration
205 changes: 101 additions & 104 deletions docs/guides/llms.md
@@ -1,149 +1,146 @@
-# Embedding Models
-In this tutorial we will be going through the embedding models that can be used in KeyBERT.
-Having the option to choose embedding models allow you to leverage pre-trained embeddings that suit your use-case.
+# Large Language Models (LLM)
+In this tutorial, we will go through the Large Language Models (LLM) that can be used in KeyLLM.
+Having the option to choose the LLM allows you to leverage the model that suits your use case.

-### **Sentence Transformers**
-You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
-and pass it through KeyBERT with `model`:
+### **OpenAI**
+To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model:

 ```python
-from keybert import KeyBERT
-kw_model = KeyBERT(model="all-MiniLM-L6-v2")
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-Or select a SentenceTransformer model with your own parameters:
+# Create your OpenAI LLM
+openai.api_key = "sk-..."
+llm = OpenAI()

-```python
-from sentence_transformers import SentenceTransformer
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
-kw_model = KeyBERT(model=sentence_model)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
 ```

-### 🤗 **Hugging Face Transformers**
-To use a Hugging Face transformers model, load in a pipeline and point
-to any model found on their model hub (https://huggingface.co/models):
+If you want to use a chat-based model, please run the following instead:

 ```python
-from transformers.pipelines import pipeline

-hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
-kw_model = KeyBERT(model=hf_model)
-```
+import openai
+from keybert.llm import OpenAI
+from keybert import KeyLLM

-!!! tip "Tip!"
-    These transformers also work quite well using `sentence-transformers` which has a number of
-    optimizations tricks that make using it a bit faster.
+# Create your LLM
+openai.api_key = "sk-..."
+llm = OpenAI(model="gpt-3.5-turbo", chat=True)
-### **Flair**
-[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
-is publicly available. Flair can be used as follows:
-
-```python
-from flair.embeddings import TransformerDocumentEmbeddings
-
-roberta = TransformerDocumentEmbeddings('roberta-base')
-kw_model = KeyBERT(model=roberta)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```

-You can select any 🤗 transformers model [here](https://huggingface.co/models).
-
-Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
-Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
-pass it to KeyBERT in order to use those word embeddings as document embeddings:
+### **Cohere**
+To use Cohere's external API, we need to define our key and use the `keybert.llm.Cohere` model:

 ```python
-from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
+import cohere
+from keybert.llm import Cohere
+from keybert import KeyLLM

+# Create your Cohere LLM
+co = cohere.Client(my_api_key)
+llm = Cohere(co)
+
-glove_embedding = WordEmbeddings('crawl')
-document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)

-kw_model = KeyBERT(model=document_glove_embeddings)
+# Extract keywords
+keywords = kw_model.extract_keywords(MY_DOCUMENTS)
 ```

-### **Spacy**
-[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
-many models available across many languages for modeling text.
+### **LiteLLM**
+[LiteLLM](https://github.com/BerriAI/litellm) allows you to use any closed-source LLM with KeyLLM.

-To use Spacy's non-transformer models in KeyBERT:
+Let's use OpenAI as an example:

 ```python
-import spacy
+import os
+from keybert.llm import LiteLLM
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
+# Select LLM
+os.environ["OPENAI_API_KEY"] = "sk-..."
+llm = LiteLLM("gpt-3.5-turbo")

-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```

-Using spacy-transformer models:
+### 🤗 **Hugging Face Transformers**
+To use a Hugging Face transformers model, load in a pipeline and point
+to any model found on their model hub (https://huggingface.co/models). Let's use Llama 2 as an example:

 ```python
-import spacy
-
-spacy.prefer_gpu()
-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-
-kw_model = KeyBERT(model=nlp)
+from torch import cuda, bfloat16
+import transformers

+model_id = 'meta-llama/Llama-2-7b-chat-hf'
+
+# 4-bit Quantization to load Llama 2 with less GPU memory
+bnb_config = transformers.BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type='nf4',
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_compute_dtype=bfloat16
+)
+
+# Llama 2 Model & Tokenizer
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+model = transformers.AutoModelForCausalLM.from_pretrained(
+    model_id,
+    trust_remote_code=True,
+    quantization_config=bnb_config,
+    device_map='auto',
+)
+model.eval()
+
+# Our text generator
+generator = transformers.pipeline(
+    model=model, tokenizer=tokenizer,
+    task='text-generation',
+    temperature=0.1,
+    max_new_tokens=500,
+    repetition_penalty=1.1
+)
 ```

-If you run into memory issues with spacy-transformer models, try:
+Then, we load the `generator` in `KeyLLM`:

 ```python
-import spacy
-from thinc.api import set_gpu_allocator, require_gpu
+from keybert.llm import TextGeneration
+from keybert import KeyLLM

-nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
-set_gpu_allocator("pytorch")
-require_gpu(0)
-
-kw_model = KeyBERT(model=nlp)
+# Load it in KeyLLM
+llm = TextGeneration(generator)
+kw_model = KeyLLM(llm)
 ```

-### **Universal Sentence Encoder (USE)**
-The Universal Sentence Encoder encodes text into high dimensional vectors that are used here
-for embedding the documents. The model is trained and optimized for greater-than-word length text,
-such as sentences, phrases or short paragraphs.
+### **LangChain**

-Using USE in KeyBERT is rather straightforward:
+To use LangChain, we can simply load in any LLM and pass that as a QA-chain to KeyLLM:

 ```python
-import tensorflow_hub
-embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
-kw_model = KeyBERT(model=embedding_model)
+from langchain.chains.question_answering import load_qa_chain
+from langchain.llms import OpenAI
+chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key), chain_type="stuff")
 ```

-### **Gensim**
-For Gensim, KeyBERT supports its `gensim.downloader` module. Here, we can download any word embedding model
-to be used in KeyBERT. Note that Gensim is primarily used for Word Embedding models. This typically works
-best for short documents since the word embeddings are pooled.
+Finally, you can pass the chain to KeyLLM as follows:

 ```python
-import gensim.downloader as api
-ft = api.load('fasttext-wiki-news-subwords-300')
-kw_model = KeyBERT(model=ft)
-```
+from keybert.llm import LangChain
+from keybert import KeyLLM

-### **Custom Backend**
-If your backend or model cannot be found in the ones currently available, you can use the `keybert.backend.BaseEmbedder` class to
-create your own backend. Below, you will find an example of creating a SentenceTransformer backend for KeyBERT:
+# Create your LLM
+llm = LangChain(chain)

-```python
-from keybert.backend import BaseEmbedder
-from sentence_transformers import SentenceTransformer
-
-class CustomEmbedder(BaseEmbedder):
-    def __init__(self, embedding_model):
-        super().__init__()
-        self.embedding_model = embedding_model
-
-    def embed(self, documents, verbose=False):
-        embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
-        return embeddings
-
-# Create custom backend
-distilbert = SentenceTransformer("paraphrase-MiniLM-L6-v2")
-custom_embedder = CustomEmbedder(embedding_model=distilbert)
-
-# Pass custom backend to keybert
-kw_model = KeyBERT(model=custom_embedder)
-```
+# Load it in KeyLLM
+kw_model = KeyLLM(llm)
 ```
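Taken together, a minimal end-to-end run of the guide above might look as follows. This is a sketch rather than part of the commit: it assumes `MY_DOCUMENTS` is a list of strings and a valid OpenAI key, and it exercises the `candidate_keywords` parameter that this commit threads through the LLM wrappers.

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

MY_DOCUMENTS = [
    "The website mentions that it only takes a couple of days to deliver, but I still have not received mine.",
]

# Chat-based OpenAI LLM loaded into KeyLLM
openai.api_key = "sk-..."
llm = OpenAI(model="gpt-3.5-turbo", chat=True)
kw_model = KeyLLM(llm)

# candidate_keywords is optional; when given, the LLM fine-tunes the
# candidates instead of generating keywords from scratch
keywords = kw_model.extract_keywords(
    MY_DOCUMENTS,
    candidate_keywords=[["deliver", "days", "website"]],
)
```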
4 changes: 2 additions & 2 deletions keybert/_llm.py
@@ -29,7 +29,7 @@ def extract_keywords(
                          check_vocab: bool = False,
                          candidate_keywords: List[List[str]] = None,
                          threshold: float = None,
-                         embeddings = None
+                         embeddings=None
                          ) -> Union[List[str], List[List[str]]]:
         """Extract keywords and/or keyphrases
@@ -85,7 +85,7 @@ def extract_keywords(
         out_cluster = set(list(range(len(docs)))).difference(in_cluster)

         # Extract keywords for all documents not in a cluster
-        if out_cluster: 
+        if out_cluster:
             selected_docs = [docs[index] for index in out_cluster]
             print(out_cluster, selected_docs)
             if candidate_keywords is not None:
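Judging from the signature above, `extract_keywords` also accepts precomputed document `embeddings` together with a similarity `threshold`, grouping near-identical documents so the LLM is only called once per group. A sketch of that usage, with the parameter semantics inferred from this diff rather than spelled out in it:

```python
from sentence_transformers import SentenceTransformer
from keybert.llm import OpenAI
from keybert import KeyLLM

docs = [
    "KeyBERT extracts keywords with BERT embeddings.",
    "KeyBERT is a keyword extractor built on BERT embeddings.",
]

# Embed the documents once, up front
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(docs, convert_to_tensor=True)

# Documents more similar than the threshold share a single LLM call
kw_model = KeyLLM(OpenAI())
keywords = kw_model.extract_keywords(docs, embeddings=embeddings, threshold=0.75)
```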
12 changes: 10 additions & 2 deletions keybert/llm/_cohere.py
@@ -2,6 +2,7 @@
 from tqdm import tqdm
 from typing import List
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = """
@@ -93,19 +94,26 @@ def __init__(self,
         self.delay_in_seconds = delay_in_seconds
         self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
         """ Extract topics
         Arguments:
             documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.
         Returns:
             all_keywords: All keywords for each document
         """
         all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
             prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))

             # Delay
             if self.delay_in_seconds:
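The imported `keybert.llm._utils.process_candidate_keywords` helper is not part of this diff. Given how the loop above consumes it, a plausible minimal implementation is just an aligner that pads with `None` when no candidates are supplied; treat this sketch as an assumption, not the actual `_utils` source:

```python
def process_candidate_keywords(documents, candidate_keywords):
    """Align candidate keywords with documents so the two can be zipped.

    Hypothetical sketch: the real implementation in keybert/llm/_utils.py
    is not shown in this commit.
    """
    if candidate_keywords is None:
        candidate_keywords = [None for _ in documents]
    return candidate_keywords
```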
13 changes: 11 additions & 2 deletions keybert/llm/_langchain.py
@@ -2,6 +2,7 @@
 from typing import List
 from langchain.docstore.document import Document
 from keybert.llm._base import BaseLLM
+from keybert.llm._utils import process_candidate_keywords


 DEFAULT_PROMPT = "What is this document about? Please provide keywords separated by commas."
@@ -75,18 +76,26 @@ def __init__(self,
         self.default_prompt_ = DEFAULT_PROMPT
         self.verbose = verbose

-    def extract_keywords(self, documents: List[str]):
+    def extract_keywords(self, documents: List[str], candidate_keywords: List[List[str]] = None):
         """ Extract topics
         Arguments:
             documents: The documents to extract keywords from
+            candidate_keywords: A list of candidate keywords that the LLM will fine-tune
+                                For example, it will create a nicer representation of
+                                the candidate keywords, remove redundant keywords, or
+                                shorten them depending on the input prompt.
         Returns:
             all_keywords: All keywords for each document
         """
         all_keywords = []
+        candidate_keywords = process_candidate_keywords(documents, candidate_keywords)

-        for document in tqdm(documents, disable=not self.verbose):
+        for document, candidates in tqdm(zip(documents, candidate_keywords), disable=not self.verbose):
             prompt = self.prompt.replace("[DOCUMENT]", document)
+            if candidates is not None:
+                prompt = prompt.replace("[CANDIDATES]", ", ".join(candidates))
             input_document = Document(page_content=document)
             keywords = self.chain.run(input_documents=[input_document], question=prompt).strip()
             keywords = [keyword.strip() for keyword in keywords.split(",")]
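With `[CANDIDATES]` now substituted into the prompt, any of the updated wrappers can be steered with a custom prompt. Below is a sketch using the `TextGeneration` wrapper and the `generator` built earlier in the guide; the prompt text and the `prompt` keyword are illustrative assumptions, not taken from this commit:

```python
from keybert.llm import TextGeneration
from keybert import KeyLLM

# [DOCUMENT] and [CANDIDATES] are replaced per document by the wrapper
prompt = """I have the following document:
[DOCUMENT]

With these candidate keywords: [CANDIDATES]

Improve the candidates and return a comma-separated list of keywords."""

llm = TextGeneration(generator, prompt=prompt)  # `generator` from the guide above
kw_model = KeyLLM(llm)

docs = ["Large language models can refine candidate keywords."]
keywords = kw_model.extract_keywords(
    docs,
    candidate_keywords=[["language models", "keywords"]],
)
```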