Commit

style: apply pre-commit hooks
afuetterer committed Jun 26, 2024
1 parent 929e389 commit 25f9931
Showing 37 changed files with 207 additions and 320 deletions.
6 changes: 3 additions & 3 deletions docs/changelog.md
@@ -59,7 +59,7 @@ kw_model = KeyLLM(llm)

* Use `KeyLLM` to leverage LLMs for extracting keywords
* Use it either with or without candidate keywords generated through `KeyBERT`
* Multiple LLMs are integrated: OpenAI, Cohere, LangChain, HF, and LiteLLM

```python
import openai
@@ -101,7 +101,7 @@ doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**:

@@ -137,7 +137,7 @@ kw_model = KeyBERT(model=hf_model)

**NOTE**: Although highlighting for Chinese texts is improved, since I am not familiar with the Chinese language there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!

**Fixes**:

* Fix typo in ReadMe by [@priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [@yusuke1997](https://github.com/yusuke1997)
6 changes: 3 additions & 3 deletions docs/faq.md
@@ -21,11 +21,11 @@ topic modeling to HTML-code to extract topics of code, then it becomes important


## **How can I speed up the model?**
Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

A second method for speeding up KeyBERT is by passing it multiple documents at once. By doing this, words
only need to be embedded a single time, which can result in a major speed-up.

This is **faster**:

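A minimal sketch of such a batched call (the documents are placeholders; `extract_keywords` accepts a list of documents and returns one keyword list per document):

```python
from keybert import KeyBERT

docs = [
    "First document about machine learning.",
    "Second document about keyword extraction.",
]

kw_model = KeyBERT()

# One call over the whole list: the vocabulary is embedded a single time and reused
keywords_per_doc = kw_model.extract_keywords(docs)
```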
6 changes: 3 additions & 3 deletions docs/guides/embeddings.md
@@ -21,7 +21,7 @@ kw_model = KeyBERT(model=sentence_model)
```

### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

```python
@@ -32,8 +32,8 @@ kw_model = KeyBERT(model=hf_model)
```

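In full, such a setup typically looks roughly like the following sketch (the model name is only an example):

```python
from transformers.pipelines import pipeline
from keybert import KeyBERT

# Token embeddings from a "feature-extraction" pipeline can serve as a KeyBERT backend
hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
```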
!!! tip "Tip!"
    These transformers also work quite well using `sentence-transformers`, which has a number of
    optimization tricks that make using it a bit faster.

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
6 changes: 3 additions & 3 deletions docs/guides/keyllm.md
@@ -16,7 +16,7 @@ documents = [

This data was chosen to show the different use cases and techniques. As you might have noticed, documents 1 and 2 are quite similar, whereas document 3 is about an entirely different subject. This similarity will be taken into account when using `KeyBERT` together with `KeyLLM`.

Let's start with `KeyLLM` only.


# Use Cases
@@ -180,7 +180,7 @@ If you have embeddings of your documents, you could use those to find documents
</div>

!!! Tip
    Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
@@ -231,7 +231,7 @@ This is the best of both worlds. We use `KeyBERT` to generate a first pass of ke
</div>

!!! Tip
    Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
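Pieced together, such a combined setup might look roughly as follows. The embedding model, the placeholder API key, and the `OpenAI(client)` wrapper call are assumptions, not taken from this diff; the `extract_keywords` arguments follow the `KeyLLM` signature shown further down in `keybert/_llm.py`:

```python
import openai
from keybert import KeyBERT, KeyLLM
from keybert.llm import OpenAI
from sentence_transformers import SentenceTransformer

documents = [
    "The package was supposed to arrive in two days but it has not been delivered yet.",
    "I still have not received my order even though delivery was promised within days.",
    "Transformers are a neural network architecture built around self-attention.",
]

# First pass: candidate keywords from KeyBERT plus embeddings for grouping similar documents
kw_model = KeyBERT()
candidate_keywords = [[kw for kw, _ in kws] for kws in kw_model.extract_keywords(documents)]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(documents, convert_to_tensor=True)

# Second pass: KeyLLM only queries the LLM once per group of similar documents
llm = OpenAI(openai.OpenAI(api_key="MY_API_KEY"))  # placeholder key
keywords = KeyLLM(llm).extract_keywords(
    documents,
    candidate_keywords=candidate_keywords,
    embeddings=embeddings,
    threshold=0.75,
)
```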
14 changes: 7 additions & 7 deletions docs/guides/llms.md
@@ -3,7 +3,7 @@ In this tutorial we will be going through the Large Language Models (LLM) that c
Having the option to choose the LLM allows you to leverage the model that suits your use case.

### **OpenAI**
To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model.

We install the package first:

@@ -98,7 +98,7 @@ kw_model = KeyLLM(llm)
```

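An end-to-end sketch of that setup (the client construction and placeholder key are illustrative):

```python
import openai
from keybert import KeyLLM
from keybert.llm import OpenAI

# Define the key on the client and wrap it for KeyLLM
client = openai.OpenAI(api_key="MY_API_KEY")  # placeholder key
llm = OpenAI(client)

kw_model = KeyLLM(llm)
keywords = kw_model.extract_keywords(["KeyLLM asks a large language model for keywords."])
```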
### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models). Let's use Llama 2 as an example:

```python
@@ -109,8 +109,8 @@ model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
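
# A sketch of how the setup typically continues (tokenizer, model, pipeline) using the
# standard `transformers` API; the generation settings are only examples, and
# `import transformers` plus `from torch import bfloat16` are assumed earlier in the script.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
generator = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=50,
    repetition_penalty=1.1,
)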
@@ -152,15 +152,15 @@ I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken [INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
@@ -200,4 +200,4 @@ llm = LangChain(chain)

# Load it in KeyLLM
kw_model = KeyLLM(llm)
```
24 changes: 12 additions & 12 deletions docs/guides/quickstart.md
@@ -78,9 +78,9 @@ keywords = kw_model.extract_keywords(doc, highlight=True)

## **Fine-tuning**

As a default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead
to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure they are a bit more diversified, there are two
approaches that we can take in order to fine-tune our output, **Max Sum Distance** and **Maximal Marginal Relevance**.

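In code, the two strategies map onto the `use_maxsum` and `use_mmr` arguments of `.extract_keywords` (a sketch; the document text and parameter values are only illustrative):

```python
from keybert import KeyBERT

doc = (
    "Supervised learning is the machine learning task of learning a function that maps an "
    "input to an output based on example input-output pairs. It infers a function from "
    "labeled training data consisting of a set of training examples."
)
kw_model = KeyBERT()

# Max Sum Distance: pick the top_n keywords out of nr_candidates that are least similar to each other
keywords = kw_model.extract_keywords(doc, stop_words="english",
                                     use_maxsum=True, nr_candidates=10, top_n=5)

# Maximal Marginal Relevance: trade off relevance against diversity via `diversity`
keywords = kw_model.extract_keywords(doc, stop_words="english",
                                     use_mmr=True, diversity=0.7)
```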
### **Max Sum Distance**

@@ -165,8 +165,8 @@ keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)

## **Prepare embeddings**

When you have a large dataset and you want to fine-tune parameters such as `diversity` it can take quite a while to re-calculate the document and
word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` such that
we only have to calculate it once:


@@ -183,15 +183,15 @@ You can then use these embeddings and pass them to `.extract_keywords` to speed
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:

* `candidates`
* `keyphrase_ngram_range`
* `stop_words`
* `min_df`
* `vectorizer`

The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.

In other words, the following will work as they use the same parameter subset:

@@ -200,8 +200,8 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
                                     doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```

@@ -212,7 +212,7 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
                                     doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```
4 changes: 2 additions & 2 deletions docs/images/guided.svg
4 changes: 2 additions & 2 deletions docs/images/pipeline.svg
2 changes: 1 addition & 1 deletion docs/index.md
@@ -99,4 +99,4 @@ of words you would like in the resulting keyphrases:
```

!!! note "NOTE"
    You can also pass multiple documents at once if you are looking for a major speed-up!
2 changes: 1 addition & 1 deletion docs/stylesheets/extra.css
@@ -6,7 +6,7 @@
--md-typeset-a-color: #0277BD;
}

body[data-md-color-primary="black"] .excalidraw svg {
filter: invert(100%) hue-rotate(180deg);
}

5 changes: 5 additions & 0 deletions keybert/__init__.py
@@ -4,3 +4,8 @@
from keybert._model import KeyBERT

__version__ = version("keybert")

__all__ = [
"KeyBERT",
"KeyLLM",
]
31 changes: 9 additions & 22 deletions keybert/_highlight.py
@@ -11,10 +11,8 @@ class NullHighlighter(RegexHighlighter):
highlights = [r""]


-def highlight_document(
-    doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer
-):
-    """Highlight keywords in a document
+def highlight_document(doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer):
+    """Highlight keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -38,10 +36,8 @@ def highlight_document(
console.print(highlighted_text)


-def _highlight_one_gram(
-    doc: str, keywords: List[str], vectorizer: CountVectorizer
-) -> str:
-    """Highlight 1-gram keywords in a document
+def _highlight_one_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
+    """Highlight 1-gram keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -57,18 +53,13 @@ def _highlight_one_gram(
separator = "" if "zh" in str(tokenizer) else " "

highlighted_text = separator.join(
-        [
-            f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}"
-            for token in tokens
-        ]
+        [f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}" for token in tokens]
).strip()
return highlighted_text


-def _highlight_n_gram(
-    doc: str, keywords: List[str], vectorizer: CountVectorizer
-) -> str:
-    """Highlight n-gram keywords in a document
+def _highlight_n_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
+    """Highlight n-gram keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -85,8 +76,7 @@ def _highlight_n_gram(
separator = "" if "zh" in str(tokenizer) else " "

n_gram_tokens = [
-        [separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)]
-        for i, _ in enumerate(tokens)
+        [separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)
]
highlighted_text = []
skip = False
@@ -96,11 +86,8 @@ def _highlight_n_gram(

if not skip:
for index, n_gram in enumerate(n_grams):

                if n_gram.lower() in keywords:
-                    candidate = (
-                        f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
-                    )
+                    candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
skip = index + 1

if not candidate:
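For orientation, a rough sketch of calling this helper directly (it is normally reached through `KeyBERT.extract_keywords(..., highlight=True)`; the document and scores below are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from keybert._highlight import highlight_document

doc = "Supervised learning is the machine learning task of learning a function."
keywords = [("supervised learning", 0.7), ("machine learning", 0.6)]
vectorizer = CountVectorizer(ngram_range=(1, 2)).fit([doc])

# Prints the document to the console with the matched keyphrases highlighted
highlight_document(doc, keywords, vectorizer)
```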
22 changes: 8 additions & 14 deletions keybert/_llm.py
@@ -2,21 +2,21 @@

try:
from sentence_transformers import util

HAS_SBERT = True
except ModuleNotFoundError:
HAS_SBERT = False


class KeyLLM:
"""
A minimal method for keyword extraction with Large Language Models (LLM)
"""A minimal method for keyword extraction with Large Language Models (LLM).
The keyword extraction is done by simply asking the LLM to extract a
number of keywords from a single piece of text.
"""

def __init__(self, llm):
"""KeyBERT initialization
"""KeyBERT initialization.
Arguments:
llm: The Large Language Model to use
Expand All @@ -29,9 +29,9 @@ def extract_keywords(
check_vocab: bool = False,
candidate_keywords: List[List[str]] = None,
threshold: float = None,
-        embeddings=None
+        embeddings=None,
    ) -> Union[List[str], List[List[str]]]:
-        """Extract keywords and/or keyphrases
+        """Extract keywords and/or keyphrases.
To get the biggest speed-up, make sure to pass multiple documents
at once instead of iterating over a single document.
Expand Down Expand Up @@ -78,7 +78,6 @@ def extract_keywords(
return []

if HAS_SBERT and threshold is not None and embeddings is not None:

# Find similar documents
clusters = util.community_detection(embeddings, min_community_size=2, threshold=threshold)
in_cluster = set([cluster for cluster_set in clusters for cluster in cluster_set])
Expand All @@ -97,21 +96,16 @@ def extract_keywords(
)
out_cluster_keywords = {index: words for words, index in zip(out_cluster_keywords, out_cluster)}

            # Extract keywords for only the first document in a cluster
if in_cluster:
selected_docs = [docs[cluster[0]] for cluster in clusters]
if candidate_keywords is not None:
selected_keywords = [candidate_keywords[cluster[0]] for cluster in clusters]
else:
selected_keywords = None
-                in_cluster_keywords = self.llm.extract_keywords(
-                    selected_docs,
-                    selected_keywords
-                )
+                in_cluster_keywords = self.llm.extract_keywords(selected_docs, selected_keywords)
                in_cluster_keywords = {
-                    doc_id: in_cluster_keywords[index]
-                    for index, cluster in enumerate(clusters)
-                    for doc_id in cluster
+                    doc_id: in_cluster_keywords[index] for index, cluster in enumerate(clusters) for doc_id in cluster
}

# Update out cluster keywords with in cluster keywords
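To illustrate the grouping step on its own (the embedding model, threshold, and documents are arbitrary; the exact communities found depend on the model):

```python
from sentence_transformers import SentenceTransformer, util

docs = [
    "The shipment was delayed and customer support has not replied to my emails.",
    "Support is not answering my emails about the delayed shipment.",
    "KeyBERT extracts keywords by comparing document and word embeddings.",
]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, convert_to_tensor=True)

# Documents in the same community share the keywords extracted for the first member,
# so the LLM is queried only once per community
clusters = util.community_detection(embeddings, min_community_size=2, threshold=0.7)
print(clusters)  # e.g. [[0, 1]] if the first two documents end up in one community
```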

0 comments on commit 25f9931
