ci: set up lint job using pre-commit/action (#238)
afuetterer authored Jul 16, 2024
1 parent 5740fd5 commit d8c2487
Showing 40 changed files with 260 additions and 341 deletions.
8 changes: 8 additions & 0 deletions .github/workflows/testing.yml
@@ -11,6 +11,14 @@ on:
- dev

jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
# Ref: https://github.com/pre-commit/action
- uses: pre-commit/[email protected]

build:
runs-on: ubuntu-latest
strategy:
16 changes: 5 additions & 11 deletions .pre-commit-config.yaml
@@ -11,15 +11,9 @@ repos:
- id: end-of-file-fixer
- id: check-yaml
- id: check-added-large-files
- repo: https://github.com/PyCQA/flake8
rev: 7.1.0
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.5.1
hooks:
- id: flake8
- repo: https://github.com/psf/black
rev: 24.4.2
hooks:
- id: black
exclude: |
(?x)^(
README.md
)$
- id: ruff
args: [--fix, --show-fixes, --exit-non-zero-on-fix]
- id: ruff-format
6 changes: 3 additions & 3 deletions docs/changelog.md
@@ -59,7 +59,7 @@ kw_model = KeyLLM(llm)

* Use `KeyLLM` to leverage LLMs for extracting keywords
* Use it either with or without candidate keywords generated through `KeyBERT`
* Multiple LLMs are integrated: OpenAI, Cohere, LangChain, HF, and LiteLLM

```python
import openai
@@ -101,7 +101,7 @@ doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

Do note that the parameters passed to `.extract_embeddings` for creating the vectorizer should be exactly the same as those in `.extract_keywords`.

**Fixes**:

@@ -137,7 +137,7 @@ kw_model = KeyBERT(model=hf_model)

**NOTE**: Although highlighting for Chinese texts is improved, I am not familiar with the Chinese language, so there is a good chance it is not yet as optimized as it is for other languages. Any feedback in this regard is highly appreciated!

**Fixes**:

* Fix typo in ReadMe by [@priyanshul-govil](https://github.com/priyanshul-govil) in [#117](https://github.com/MaartenGr/KeyBERT/pull/117)
* Add missing optional dependencies (gensim, use, and spacy) by [@yusuke1997](https://github.com/yusuke1997)
6 changes: 3 additions & 3 deletions docs/faq.md
@@ -21,11 +21,11 @@ topic modeling to HTML-code to extract topics of code, then it becomes important


## **How can I speed up the model?**
Since KeyBERT uses large language models as its backend, a GPU is typically preferred when using this package.
Although it is possible to use it without a dedicated GPU, the inference speed will be significantly slower.

A second method for speeding up KeyBERT is by passing it multiple documents at once. By doing this, words
only need to be embedded a single time, which can result in a major speed-up.

This is **faster**:

6 changes: 3 additions & 3 deletions docs/guides/embeddings.md
@@ -21,7 +21,7 @@ kw_model = KeyBERT(model=sentence_model)
```

### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):

```python
@@ -32,8 +32,8 @@ kw_model = KeyBERT(model=hf_model)
```

!!! tip "Tip!"
These transformers also work quite well using `sentence-transformers`, which has a number of
optimization tricks that make using it a bit faster.
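
As a minimal sketch of that route (the model name below is an assumption, not part of the original guide):

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Any sentence-transformers model can be used; this name is illustrative
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=sentence_model)
```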

### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
6 changes: 3 additions & 3 deletions docs/guides/keyllm.md
@@ -16,7 +16,7 @@ documents = [

This data was chosen to show the different use cases and techniques. As you might have noticed, documents 1 and 2 are quite similar whereas document 3 is about an entirely different subject. This similarity will be taken into account when using `KeyBERT` together with `KeyLLM`.

Let's start with `KeyLLM` only.
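
As a rough sketch of that starting point (the OpenAI backend and key handling below are assumptions; the rest of the guide covers the supported backends):

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Assumed backend: any LLM supported by keybert.llm could be used instead
client = openai.OpenAI(api_key="sk-...")
llm = OpenAI(client)
kw_model = KeyLLM(llm)

keywords = kw_model.extract_keywords(documents)
```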


# Use Cases
@@ -180,7 +180,7 @@ If you have embeddings of your documents, you could use those to find documents
</div>

!!! Tip
Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
@@ -231,7 +231,7 @@ This is the best of both worlds. We use `KeyBERT` to generate a first pass of keywords
</div>

!!! Tip
Before you get started, it might be worthwhile to uninstall sentence-transformers and re-install it from the main branch.
There is an issue with community detection (cluster) that might make the model run without finishing. It is as straightforward as:
`pip uninstall sentence-transformers`
`pip install --upgrade git+https://github.com/UKPLab/sentence-transformers`
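
A minimal sketch of this combined setup (the OpenAI backend, model name, and threshold value are assumptions for illustration):

```python
import openai
from keybert.llm import OpenAI
from keybert import KeyBERT

# Assumed LLM backend; any supported LLM can be plugged in here
client = openai.OpenAI(api_key="sk-...")
llm = OpenAI(client)

# KeyBERT generates candidate keywords and embeddings, the LLM refines them;
# the threshold groups similar documents so they can share a single LLM call
kw_model = KeyBERT(llm=llm, model="all-MiniLM-L6-v2")
keywords = kw_model.extract_keywords(documents, threshold=0.75)
```
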
14 changes: 7 additions & 7 deletions docs/guides/llms.md
@@ -3,7 +3,7 @@ In this tutorial we will be going through the Large Language Models (LLM) that c
Having the option to choose the LLM allows you to leverage the model that suits your use case.

### **OpenAI**
To use OpenAI's external API, we need to define our key and use the `keybert.llm.OpenAI` model.

We install the package first:

@@ -98,7 +98,7 @@ kw_model = KeyLLM(llm)
```

### 🤗 **Hugging Face Transformers**
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models). Let's use Llama 2 as an example:

```python
@@ -109,8 +109,8 @@ model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit Quantization to load Llama 2 with less GPU memory
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
@@ -152,15 +152,15 @@ I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.
Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken [INST]
I have the following document:
- [DOCUMENT]
Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
@@ -200,4 +200,4 @@ llm = LangChain(chain)

# Load it in KeyLLM
kw_model = KeyLLM(llm)
```
24 changes: 12 additions & 12 deletions docs/guides/quickstart.md
@@ -78,9 +78,9 @@ keywords = kw_model.extract_keywords(doc, highlight=True)

## **Fine-tuning**

As a default, KeyBERT simply compares the documents and candidate keywords/keyphrases based on their cosine similarity. However, this might lead
to very similar words ending up in the list of most accurate keywords/keyphrases. To make sure they are a bit more diversified, there are two
approaches that we can take in order to fine-tune our output: **Max Sum Distance** and **Maximal Marginal Relevance**.
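
Both sections below are truncated in this diff; as a condensed sketch (parameter values illustrative, document text a placeholder), the two options are toggled like this:

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."  # placeholder
kw_model = KeyBERT()

# Max Sum Distance: pick the least mutually similar combination among the top candidates
keywords_maxsum = kw_model.extract_keywords(doc, use_maxsum=True, nr_candidates=20, top_n=5)

# Maximal Marginal Relevance: trade off relevance against diversity
keywords_mmr = kw_model.extract_keywords(doc, use_mmr=True, diversity=0.7)
```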

### **Max Sum Distance**

@@ -165,8 +165,8 @@ keywords = kw_model.extract_keywords(doc, seed_keywords=seed_keywords)

## **Prepare embeddings**

When you have a large dataset and you want to fine-tune parameters such as `diversity` it can take quite a while to re-calculate the document and
word embeddings each time you change a parameter. Instead, we can pre-calculate these embeddings and pass them to `.extract_keywords` such that
we only have to calculate it once:


@@ -183,15 +183,15 @@ You can then use these embeddings and pass them to `.extract_keywords` to speed
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
```

There are several parameters in `.extract_embeddings` that define how the list of candidate keywords/keyphrases is generated:

* `candidates`
* `keyphrase_ngram_range`
* `stop_words`
* `min_df`
* `vectorizer`

The values of these parameters need to be exactly the same in `.extract_embeddings` as they are in `.extract_keywords`.

In other words, the following will work as they use the same parameter subset:

@@ -200,8 +200,8 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=1, stop_words="english")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```

@@ -212,7 +212,7 @@ from keybert import KeyBERT

kw_model = KeyBERT()
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs, min_df=3, stop_words="dutch")
keywords = kw_model.extract_keywords(docs, min_df=1, stop_words="english",
doc_embeddings=doc_embeddings,
word_embeddings=word_embeddings)
```
4 changes: 2 additions & 2 deletions docs/images/guided.svg
4 changes: 2 additions & 2 deletions docs/images/pipeline.svg
2 changes: 1 addition & 1 deletion docs/index.md
@@ -99,4 +99,4 @@ of words you would like in the resulting keyphrases:
```

!!! note "NOTE"
You can also pass multiple documents at once if you are looking for a major speed-up!
2 changes: 1 addition & 1 deletion docs/stylesheets/extra.css
@@ -6,7 +6,7 @@
--md-typeset-a-color: #0277BD;
}

body[data-md-color-primary="black"] .excalidraw svg {
filter: invert(100%) hue-rotate(180deg);
}

5 changes: 5 additions & 0 deletions keybert/__init__.py
@@ -4,3 +4,8 @@
from keybert._model import KeyBERT

__version__ = version("keybert")

__all__ = [
"KeyBERT",
"KeyLLM",
]
31 changes: 9 additions & 22 deletions keybert/_highlight.py
@@ -11,10 +11,8 @@ class NullHighlighter(RegexHighlighter):
highlights = [r""]


def highlight_document(
doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer
):
"""Highlight keywords in a document
def highlight_document(doc: str, keywords: List[Tuple[str, float]], vectorizer: CountVectorizer):
"""Highlight keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -38,10 +38,8 @@ def highlight_document(
console.print(highlighted_text)


def _highlight_one_gram(
doc: str, keywords: List[str], vectorizer: CountVectorizer
) -> str:
"""Highlight 1-gram keywords in a document
def _highlight_one_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
"""Highlight 1-gram keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -57,18 +57,13 @@ def _highlight_one_gram(
separator = "" if "zh" in str(tokenizer) else " "

highlighted_text = separator.join(
[
f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}"
for token in tokens
]
[f"[black on #FFFF00]{token}[/]" if token.lower() in keywords else f"{token}" for token in tokens]
).strip()
return highlighted_text


def _highlight_n_gram(
doc: str, keywords: List[str], vectorizer: CountVectorizer
) -> str:
"""Highlight n-gram keywords in a document
def _highlight_n_gram(doc: str, keywords: List[str], vectorizer: CountVectorizer) -> str:
"""Highlight n-gram keywords in a document.
Arguments:
doc: The document for which to extract keywords/keyphrases.
@@ -85,8 +76,7 @@ def _highlight_n_gram(
separator = "" if "zh" in str(tokenizer) else " "

n_gram_tokens = [
[separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)]
for i, _ in enumerate(tokens)
[separator.join(tokens[i : i + max_len][0 : j + 1]) for j in range(max_len)] for i, _ in enumerate(tokens)
]
highlighted_text = []
skip = False
@@ -96,11 +86,8 @@

if not skip:
for index, n_gram in enumerate(n_grams):

if n_gram.lower() in keywords:
candidate = (
f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
)
candidate = f"[black on #FFFF00]{n_gram}[/]" + n_grams[-1].split(n_gram)[-1]
skip = index + 1

if not candidate:
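
For context, the helpers above back the `highlight=True` option shown in the quickstart; a minimal sketch of how they are reached (document text is a placeholder):

```python
from keybert import KeyBERT

doc = "Supervised learning is the machine learning task of learning a function."  # placeholder
kw_model = KeyBERT()

# highlight=True routes the extracted keywords through highlight_document()
keywords = kw_model.extract_keywords(doc, highlight=True)
```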