Add DataMapPlot documentation (#1854)
dkapitan authored Mar 9, 2024
1 parent b59aab8 commit 8985f26
Showing 6 changed files with 55 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -81,3 +81,6 @@ venv.bak/
.idea/
.vscode
.DS_Store

# mkdocs
site/
2 changes: 1 addition & 1 deletion docs/api/plotting/document_datamap.md
@@ -1,3 +1,3 @@
-# `Document Data Map`
+# `Documents with DataMapPlot`

::: bertopic.plotting._datamap.visualize_document_datamap
20 changes: 20 additions & 0 deletions docs/faq.md
@@ -311,3 +311,23 @@ are important in understanding the general topic of the document. Although this
have data that contains a lot of noise, for example, HTML-tags, then it would be best to remove them. HTML-tags
typically do not contribute to the meaning of a document and should therefore be removed. However, if you apply
topic modeling to HTML-code to extract topics of code, then it becomes important.

## **I run into issues running on Apple Silicon. What should I do?**
Apple Silicon chips (M1 & M2) are based on the ARM64 architecture (also known as [AArch64](https://apple.stackexchange.com/questions/451238/is-m1-chip-aarch64-or-amd64), not to be confused with AMD64). There are known issues with upstream dependencies on this architecture, for example [numba](https://github.com/numba/numba/issues/5520). Depending on the extras that you need, you may not run into these issues at all.
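
As a quick check before debugging dependency issues, the following minimal sketch reports whether the current Python interpreter is running on ARM64. It only uses the standard library; the helper name `is_arm64` is hypothetical and not part of BERTopic. Note that an x86_64 Python running under Rosetta 2 typically reports `x86_64`, even on Apple Silicon hardware.

```python
import platform


def is_arm64(machine=None):
    """Return True if the given (or current) machine string denotes ARM64.

    `is_arm64` is an illustrative helper, not a BERTopic function.
    macOS reports "arm64" and Linux reports "aarch64" for the same architecture.
    """
    machine = (machine or platform.machine()).lower()
    return machine in {"arm64", "aarch64"}


if __name__ == "__main__":
    print("Machine:", platform.machine())
    print("Running on ARM64:", is_arm64())
```

If this prints `False` on an Apple Silicon machine, the interpreter itself is likely an x86_64 build, which is worth ruling out before looking at individual packages.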

One possible solution is to use [VS Code Dev Containers](https://code.visualstudio.com/docs/devcontainers/containers), which allow you to set up a Linux-based environment. To run BERTopic effectively, you need to be aware of two things:

- Make sure to use a Docker image specifically compiled for ARM64
- Make sure to use a `volume` instead of a bind mount, since the latter significantly reduces disk I/O speed
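
To get a rough feel for the I/O difference on your own machine, a small write benchmark can help. The sketch below is an assumption-laden illustration, not part of BERTopic: the helper `write_throughput_mb_s` is hypothetical, and the two paths in the usage section are only candidates, since which directory is volume-backed depends on your container setup.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=16):
    """Write `size_mb` MiB to a temporary file in `directory` and return MiB/s.

    Illustrative helper only; results vary with caching and disk load.
    """
    chunk = os.urandom(1024 * 1024)  # 1 MiB of random bytes
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(size_mb):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # force the data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    return size_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Hypothetical candidate paths: adjust to the directories in your container.
    for path in ("/home/vscode", "/workspaces"):
        if os.path.isdir(path) and os.access(path, os.W_OK):
            print(f"{path}: {write_throughput_mb_s(path):.1f} MiB/s")
```

Run it a few times per directory and compare the numbers; a single run is too noisy to draw conclusions from.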

Using the pre-configured [Data Science Devcontainers](https://github.com/b-data/data-science-devcontainers) ensures these settings are optimized. To start using them, do the following:

- Install and run Docker
- Install the `python-base` or `python-scipy` [devcontainer](https://github.com/b-data/data-science-devcontainers)
    - ℹ️ Change `PYTHON_VERSION` to 3.11 in `devcontainer.json` to work with the latest version of Python 3.11 (currently 3.11.8)
- Open VS Code, build the container, and start working
    - Note that data is persisted in the container
    - When using an unmodified `devcontainer.json`, work in `/home/vscode`, which is the `home` directory of user `vscode`
    - Python packages are installed to the home directory by default, because the environment variable `PIP_USER=1` is set
    - Note that the directory `/workspaces` is also persisted
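
To confirm where packages will end up inside the container, you can inspect the environment from Python. This is a minimal sketch using only the standard library; the helper `pip_user_enabled` is hypothetical, and the set of values it treats as "enabled" is an assumption about how the variable is set.

```python
import os
import site


def pip_user_enabled(environ=None):
    """Return True if PIP_USER is set to a truthy value in the given mapping.

    Illustrative helper; defaults to the current process environment.
    """
    env = os.environ if environ is None else environ
    return env.get("PIP_USER", "").strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    print("PIP_USER enabled:", pip_user_enabled())
    # Directory that user-level installs go to
    print("User site-packages:", site.getusersitepackages())
```

In an unmodified devcontainer, the reported user site-packages directory should sit under `/home/vscode`.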

27 changes: 27 additions & 0 deletions docs/getting_started/visualization/visualize_documents.md
@@ -1,3 +1,5 @@
## **Visualize documents with Plotly**

Using `.visualize_topics`, we can visualize the topics and get insight into their relationships. However,
you might want a more fine-grained approach where we can visualize the documents inside the topics to see
if they were assigned correctly or whether they make sense. To do so, we can use the `topic_model.visualize_documents()`
@@ -43,6 +45,30 @@ When you visualize the documents, you might not always want to see the complete
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings)
```

## **Visualize documents with DataMapPlot**

`.visualize_document_datamap` provides an alternative way to visualize the documents inside the topics as a static [DataMapPlot](https://datamapplot.readthedocs.io/en/latest/intro_splash.html). Using the same pipeline as above, you can generate a DataMapPlot by running:

```python
# With the original embeddings
topic_model.visualize_document_datamap(docs, embeddings=embeddings)

# With the reduced embeddings
topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
```

<br><br>
<img src="./datamapplot.png">
<br><br>

Or, if you want to save the resulting figure:

```python
fig = topic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings)
fig.savefig("path/to/file.png", bbox_inches="tight")
```

## **Visualize Probabilities or Distribution**

We can generate the topic-document probability matrix by simply setting `calculate_probabilities=True` if an HDBSCAN model is used:
@@ -100,3 +126,4 @@ df
the distribution of the frequencies of topics across a document. It merely shows
how confident BERTopic is that certain topics can be found in a document.


4 changes: 3 additions & 1 deletion docs/index.md
@@ -246,6 +246,7 @@ to tweak the model to your liking.
|-----------------------|---|
| Visualize Topics | `.visualize_topics()` |
| Visualize Documents | `.visualize_documents()` |
| Visualize Documents with DataMapPlot | `.visualize_document_datamap()` |
| Visualize Document Hierarchy | `.visualize_hierarchical_documents()` |
| Visualize Topic Hierarchy | `.visualize_hierarchy()` |
| Visualize Topic Tree | `.get_topic_tree(hierarchical_topics)` |
@@ -254,7 +255,8 @@ to tweak the model to your liking.
| Visualize Term Score Decline | `.visualize_term_rank()` |
| Visualize Topic Probability Distribution | `.visualize_distribution(probs[0])` |
| Visualize Topics over Time | `.visualize_topics_over_time(topics_over_time)` |
-| Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |
+| Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |



## **Citation**
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -83,6 +83,7 @@ nav:
- Plotting:
- Barchart: api/plotting/barchart.md
- Documents: api/plotting/documents.md
- Documents with DataMapPlot: api/plotting/document_datamap.md
- DTM: api/plotting/dtm.md
- Hierarchical documents: api/plotting/hierarchical_documents.md
- Hierarchical topics: api/plotting/hierarchy.md