docs: knowledge bases #1019

Open
wants to merge 4 commits into
base: feat/knowledge-base
2 changes: 1 addition & 1 deletion docs/api/embedding/embedding_gallery.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Embedding Gallery

This section contains the existing [`Embeddings`][distilabel.embeddings] subclasses implemented in `distilabel`.
This section contains the existing [`Embeddings`][distilabel.embeddings.Embeddings] subclasses implemented in `distilabel`.

::: distilabel.embeddings
options:
Expand Down
2 changes: 1 addition & 1 deletion docs/api/embedding/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,6 @@

This section contains the API reference for the `distilabel` embeddings.

For more information on how the [`Embeddings`][distilabel.steps.tasks.Task] works and see some examples.
For more information on how the [`Embeddings`][distilabel.embeddings.Embeddings] works, see the reference below.

::: distilabel.embeddings.base
7 changes: 7 additions & 0 deletions docs/api/knowledge_bases/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Knowledge Bases

This section contains the API reference for the `distilabel` knowledge bases.

For more information on how the [`KnowledgeBase`][distilabel.knowledge_bases.base.KnowledgeBase] works, see the reference below.

::: distilabel.knowledge_bases.base.KnowledgeBase
8 changes: 8 additions & 0 deletions docs/api/knowledge_bases/knowledge_base_gallery.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Knowledge Base Gallery

This section contains the existing [`KnowledgeBase`][distilabel.knowledge_bases.base.KnowledgeBase] subclasses implemented in `distilabel`.

::: distilabel.knowledge_bases
options:
filters:
- "!^KnowledgeBase$"
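Review note: the `"!^KnowledgeBase$"` filter is meant to hide only the base class from the gallery page while keeping its subclasses. The snippet below is a simplified model of that behaviour, assuming a leading `!` marks an exclusion pattern that is matched against the object name; `is_documented` is a hypothetical helper written for illustration and is not part of mkdocstrings.

```python
import re
from typing import List


def is_documented(name: str, filters: List[str]) -> bool:
    """Return False when `name` matches any exclusion pattern (prefixed with '!')."""
    for pattern in filters:
        if pattern.startswith("!") and re.search(pattern[1:], name):
            return False
    return True


filters = ["!^KnowledgeBase$"]
names = ["KnowledgeBase", "ArgillaKnowledgeBase", "LanceDBKnowledgeBase"]

# The anchors ^ and $ make the pattern match the base class name exactly,
# so subclasses whose names merely end in "KnowledgeBase" are kept.
kept = [name for name in names if is_documented(name, filters)]
print(kept)
```

Without the `^` anchor the pattern would also match `ArgillaKnowledgeBase` and `LanceDBKnowledgeBase`, which would leave the gallery empty.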
3 changes: 3 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -241,6 +241,9 @@ nav:
- Embedding:
- "api/embedding/index.md"
- Embedding Gallery: "api/embedding/embedding_gallery.md"
- KnowledgeBase:
- "api/knowledge_bases/index.md"
- KnowledgeBase Gallery: "api/knowledge_bases/knowledge_base_gallery.md"
- Pipeline:
- "api/pipeline/index.md"
- Routing Batch Function: "api/pipeline/routing_batch_function.md"
Expand Down
9 changes: 9 additions & 0 deletions src/distilabel/knowledge_bases/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from distilabel.knowledge_bases.argilla import ArgillaKnowledgeBase
from distilabel.knowledge_bases.base import KnowledgeBase
from distilabel.knowledge_bases.lancedb import LanceDBKnowledgeBase

__all__ = [
"KnowledgeBase",
"ArgillaKnowledgeBase",
"LanceDBKnowledgeBase",
]
28 changes: 28 additions & 0 deletions src/distilabel/knowledge_bases/argilla.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,34 @@


class ArgillaKnowledgeBase(KnowledgeBase, ArgillaBase):
"""`argilla` library implementation for knowledge base.

Attributes:
dataset_name: the name of the dataset to use.
dataset_workspace: the workspace of the dataset to use.
api_url: the URL of the Argilla API to use. Defaults to the `ARGILLA_API_URL` environment variable.
api_key: the key of the Argilla API to use. Defaults to the `ARGILLA_API_KEY` environment variable.
vector_field: the name of the field containing the vector.

References:
- [Argilla Knowledge Base](https://docs.argilla.io/)

Examples:
Connecting to an Argilla Knowledge Base.

```python
from distilabel.knowledge_bases import ArgillaKnowledgeBase

knowledge_base = ArgillaKnowledgeBase(
dataset_name="my_dataset",
dataset_workspace="my_workspace",
vector_field="my_vector_field",
)
knowledge_base.load()
```

"""

vector_field: RuntimeParameter[str] = Field(
None, description="The name of the field containing the vector."
)
Expand Down
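Review note: the docstring example omits `api_url` and `api_key`, relying on the documented fallback to the `ARGILLA_API_URL` and `ARGILLA_API_KEY` environment variables. A minimal sketch of that kind of fallback is shown here; `resolve_setting` is a hypothetical helper and the URLs are placeholders, so this only illustrates the pattern and is not how `ArgillaBase` is implemented.

```python
import os
from typing import Optional


def resolve_setting(explicit: Optional[str], env_var: str) -> Optional[str]:
    """Prefer an explicitly passed value, otherwise fall back to the environment."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_var)


# Placeholder value set only for the purpose of this sketch.
os.environ["ARGILLA_API_URL"] = "http://localhost:6900"

api_url = resolve_setting(None, "ARGILLA_API_URL")
override = resolve_setting("https://example.invalid", "ARGILLA_API_URL")
print(api_url, override)
```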
28 changes: 28 additions & 0 deletions src/distilabel/knowledge_bases/lancedb.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,34 @@


class LanceDBKnowledgeBase(KnowledgeBase):
"""
`lancedb` library implementation for knowledge base.

Attributes:
uri: the uri of the lancedb database.
table_name: the name of the table to use.
api_key: the api key to use to connect to the lancedb database.
region: the region of the lancedb database.
read_consistency_interval: the read consistency interval of the lancedb database.
request_thread_pool_size: the request thread pool size of the lancedb database.
index_cache_size: the index cache size of the lancedb database.

References:
- [LanceDB](https://lancedb.github.io/lancedb/)

Examples:
Connecting to a LanceDB Knowledge Base.

```python
from distilabel.knowledge_bases import LanceDBKnowledgeBase

knowledge_base = LanceDBKnowledgeBase(uri="my_uri", table_name="my_table")

knowledge_base.load()
```

"""

uri: str = Field(..., description="The URI of the LanceDB database.")
table_name: str = Field(..., description="The name of the table to use.")
api_key: Optional[str] = Field(
Expand Down
2 changes: 2 additions & 0 deletions src/distilabel/steps/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@
)
from distilabel.steps.generators.utils import make_generator_step
from distilabel.steps.globals.huggingface import PushToHub
from distilabel.steps.knowledge_bases.vector_search import VectorSearch
from distilabel.steps.reward_model import RewardModelScore
from distilabel.steps.truncate import TruncateTextColumn
from distilabel.steps.typing import GeneratorStepOutput, StepOutput
Expand Down Expand Up @@ -96,4 +97,5 @@
"TruncateTextColumn",
"GeneratorStepOutput",
"StepOutput",
"VectorSearch",
]
32 changes: 16 additions & 16 deletions src/distilabel/steps/knowledge_bases/vector_search.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,26 +24,26 @@


class VectorSearch(Step):
"""Execute Vector Search using a `KnowledgeBase` and a potential `Embeddings`.

`VectorSearch` is a `Step` that using an `KnowledgeBase` generates sentence
embeddings for the provided input texts.
"""`VectorSearch` is a `Step` that uses a `KnowledgeBase` to perform vector search
for the provided input texts or embeddings.

Attributes:
knowledge_base: the `KnowledgeBase` used to generate the sentence embeddings.
knowledge_base: The `KnowledgeBase` used to perform the vector search.
embeddings: Optional `Embeddings` model to generate embeddings if not provided in the input.
n_retrieved_documents: The number of documents to retrieve from the knowledge base.

Input columns:
- text (Optional[`str`]): The text for which the sentence embedding has to be generated.
- embedding (`Optional[List[Union[float, int]]]`): The sentence embedding generated for the input text.
- text (`str`): The text for which to perform the vector search (if embeddings are not provided).
- embedding (`List[Union[float, int]]`): The embedding to use for vector search (if provided).

Output columns:
- dynamic: The dynamic columns of the `KnowledgeBase`.
- dynamic (`Any`): The columns returned by the `KnowledgeBase` for the retrieved documents.

Categories:
- knowledge_base

Examples:
Do vector search using a `KnowledgeBase` and a potential `Embeddings`.
Perform vector search using a `KnowledgeBase` and an `Embeddings` model.

```python
from distilabel.embeddings import SentenceTransformerEmbeddings
Expand All @@ -59,10 +59,10 @@ class VectorSearch(Step):
table_name="my_table",
)

vector_search = VectorSearch(
knowledge_base=knowledge_base,
embeddings=embedding,
n_retrieved_documents=5
vector_search = VectorSearch(
knowledge_base=knowledge_base,
embeddings=embedding,
n_retrieved_documents=5
)

vector_search.load()
Expand All @@ -75,10 +75,9 @@ class VectorSearch(Step):
# }]
```

Do vector search using a `KnowledgeBase`.
Perform vector search using only a `KnowledgeBase` with pre-computed embeddings.

```python
from distilabel.embeddings import SentenceTransformerEmbeddings
from distilabel.knowledge_bases.lancedb import LanceDBKnowledgeBase
from distilabel.steps.knowledge_bases.vector_search import VectorSearch

Expand All @@ -93,9 +92,10 @@ class VectorSearch(Step):
)

vector_search.load()
result = next(embedding_generation.process([{'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]))
result = next(vector_search.process([{'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]))
# [{'embedding': [0.06209656596183777, -0.015797119587659836, ...], "knowledge_base_col_1": [10.0], "knowledge_base_col_2": ["foo"]}]
```

"""

knowledge_base: KnowledgeBase
Expand Down
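Review note: for readers unfamiliar with what the step does conceptually, here is a brute-force sketch of vector search over an in-memory table. It assumes cosine similarity as the metric and uses made-up rows; `cosine_similarity` and `vector_search` are hypothetical names, and real backends such as LanceDB typically use indexed approximate search instead. The output mirrors the docstring's shape, where each knowledge base column holds a list with one entry per retrieved document.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(
    query: Sequence[float],
    rows: List[Dict[str, Any]],
    n_retrieved_documents: int = 5,
) -> Dict[str, List[Any]]:
    """Rank rows by similarity to `query` and return the top rows column-wise."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row["embedding"]),
        reverse=True,
    )
    top = ranked[:n_retrieved_documents]
    # Return every stored column except the vector itself.
    columns = [key for key in top[0] if key != "embedding"] if top else []
    return {column: [row[column] for row in top] for column in columns}


rows = [
    {"embedding": [1.0, 0.0], "text": "foo"},
    {"embedding": [0.0, 1.0], "text": "bar"},
    {"embedding": [0.9, 0.1], "text": "baz"},
]
result = vector_search([1.0, 0.0], rows, n_retrieved_documents=2)
print(result)
```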
22 changes: 22 additions & 0 deletions src/distilabel/utils/export_components_info.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
from typing import Generator, List, Type, TypedDict, TypeVar

from distilabel.embeddings.base import Embeddings
from distilabel.knowledge_bases.base import KnowledgeBase
from distilabel.llms.base import LLM
from distilabel.steps.base import _Step
from distilabel.steps.tasks.base import _Task
Expand All @@ -31,6 +32,7 @@ class ComponentsInfo(TypedDict):
steps: List
tasks: List
embeddings: List
knowledge_bases: List


def export_components_info() -> ComponentsInfo:
Expand Down Expand Up @@ -62,6 +64,13 @@ def export_components_info() -> ComponentsInfo:
}
for embeddings_type in _get_embeddings()
],
"knowledge_bases": [
{
"name": knowledge_base_type.__name__,
"docstring": parse_google_docstring(knowledge_base_type),
}
for knowledge_base_type in _get_knowledge_bases()
],
}


Expand Down Expand Up @@ -126,6 +135,19 @@ def _get_embeddings() -> List[Type["Embeddings"]]:
]


def _get_knowledge_bases() -> List[Type["KnowledgeBase"]]:
"""Get all `KnowledgeBase` subclasses that are not abstract classes.

Returns:
A list of `KnowledgeBase` subclasses.
"""
return [
knowledge_base_type
for knowledge_base_type in _recursive_subclasses(KnowledgeBase)
if not inspect.isabstract(knowledge_base_type)
]


# Reference: https://adamj.eu/tech/2024/05/10/python-all-subclasses/
def _recursive_subclasses(klass: Type[T]) -> Generator[Type[T], None, None]:
"""Recursively get all subclasses of a class.
Expand Down
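Review note: the body of `_recursive_subclasses` is collapsed in this view, so the sketch below is an assumption about the general technique and not a copy of the project's code. It walks `__subclasses__()` recursively and keeps only classes for which `inspect.isabstract` returns `False`; `Base`, `Partial` and `Concrete` are toy classes invented for the demonstration.

```python
import inspect
from abc import ABC, abstractmethod
from typing import Generator, List, Type, TypeVar

T = TypeVar("T")


def recursive_subclasses(klass: Type[T]) -> Generator[Type[T], None, None]:
    """Yield direct and indirect subclasses of `klass`, depth first."""
    for subclass in klass.__subclasses__():
        yield subclass
        yield from recursive_subclasses(subclass)


class Base(ABC):
    @abstractmethod
    def search(self) -> None:
        ...


class Partial(Base):
    """Still abstract: it does not implement `search`."""


class Concrete(Partial):
    def search(self) -> None:
        return None


concrete: List[type] = [
    subclass
    for subclass in recursive_subclasses(Base)
    if not inspect.isabstract(subclass)
]
names = [subclass.__name__ for subclass in concrete]
print(names)
```

One thing to keep in mind: `inspect.isabstract` only reports `True` for classes that use `ABCMeta` and still have unimplemented abstract methods, so an intermediate helper class without abstract methods would be listed as a component.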
62 changes: 59 additions & 3 deletions src/distilabel/utils/mkdocs/components_gallery.py
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,7 @@
"scorer": ":octicons-number-16:",
"preference": ":material-poll:",
"embedding": ":material-vector-line:",
"knowledge_base": ":material-database:",
"clustering": ":material-scatter-plot:",
"columns": ":material-table-column:",
"filtering": ":material-filter:",
Expand All @@ -104,6 +105,7 @@
"scorer": "Scorer steps are used to evaluate and score the data with a numerical value.",
"preference": "Preference steps are used to collect preferences on the data with numerical values or ranks.",
"embedding": "Embedding steps are used to generate embeddings for the data.",
"knowledge_base": "Knowledge bases are used to store and retrieve data.",
"clustering": "Clustering steps are used to group similar data points together.",
"columns": "Columns steps are used to manipulate columns in the data.",
"filtering": "Filtering steps are used to filter the data based on some criteria.",
Expand All @@ -113,9 +115,18 @@
"save": "Save steps are used to save the data.",
}

assert list(_STEP_CATEGORY_TO_DESCRIPTION.keys()) == list(
_STEPS_CATEGORY_TO_ICON.keys()
)
if list(_STEP_CATEGORY_TO_DESCRIPTION.keys()) != list(_STEPS_CATEGORY_TO_ICON.keys()):
missing_from_icon = set(_STEP_CATEGORY_TO_DESCRIPTION.keys()) - set(
_STEPS_CATEGORY_TO_ICON.keys()
)
missing_from_description = set(_STEPS_CATEGORY_TO_ICON.keys()) - set(
_STEP_CATEGORY_TO_DESCRIPTION.keys()
)
raise ValueError(
f"The following keys are in _STEPS_CATEGORY_TO_ICON but not in _STEP_CATEGORY_TO_DESCRIPTION: {missing_from_description}.\n"
f"The following keys are in _STEP_CATEGORY_TO_DESCRIPTION but not in _STEPS_CATEGORY_TO_ICON: {missing_from_icon}.\n"
"The keys in _STEP_CATEGORY_TO_DESCRIPTION and _STEPS_CATEGORY_TO_ICON must match."
)

_STEP_CATEGORIES = list(_STEP_CATEGORY_TO_DESCRIPTION.keys())
_STEP_CATEGORY_TABLE = pd.DataFrame(
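Review note: the new check compares the two key lists, so it is sensitive to ordering, while the error message is built from set differences. If the keys were identical but ordered differently, the error would report two empty sets. The sketch below shows a variant that compares sets only; `check_matching_keys` and the sample dictionaries are hypothetical, and whether key order should also be enforced is a decision for the authors.

```python
from typing import Dict


def check_matching_keys(descriptions: Dict[str, str], icons: Dict[str, str]) -> None:
    """Raise ValueError listing the keys present in one mapping but not the other."""
    missing_from_icons = set(descriptions) - set(icons)
    missing_from_descriptions = set(icons) - set(descriptions)
    if missing_from_icons or missing_from_descriptions:
        raise ValueError(
            f"Keys missing from icons: {sorted(missing_from_icons)}; "
            f"keys missing from descriptions: {sorted(missing_from_descriptions)}."
        )


descriptions = {
    "embedding": "Embedding steps are used to generate embeddings for the data.",
    "knowledge_base": "Knowledge bases are used to store and retrieve data.",
}
icons = {"embedding": ":material-vector-line:"}

message = ""
try:
    check_matching_keys(descriptions, icons)
except ValueError as error:
    message = str(error)
print(message)
```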
Expand Down Expand Up @@ -199,6 +210,9 @@ def on_files(
self.file_paths["embeddings"] = self._generate_embeddings_pages(
src_dir=src_dir, embeddings=components_info["embeddings"]
)
self.file_paths["knowledge_bases"] = self._generate_knowledge_bases_pages(
src_dir=src_dir, knowledge_bases=components_info["knowledge_bases"]
)

# Add the new files to the files collections
for relative_file_path in [
Expand All @@ -207,6 +221,7 @@ def on_files(
*self.file_paths["tasks"],
*self.file_paths["llms"],
*self.file_paths["embeddings"],
*self.file_paths["knowledge_bases"],
]:
file = File(
path=relative_file_path,
Expand Down Expand Up @@ -468,6 +483,47 @@ def _generate_embeddings_pages(self, src_dir: Path, embeddings: list) -> List[st

return paths

def _generate_knowledge_bases_pages(
self, src_dir: Path, knowledge_bases: list
) -> List[str]:
"""Generates the files for the `Knowledge Bases` subsection of the components gallery.

Args:
src_dir: The path to the source directory.
knowledge_bases: The list of `Knowledge Base` components.

Returns:
The relative paths to the generated files.
"""

paths = ["components-gallery/knowledge_bases/index.md"]
steps_gallery_page_path = src_dir / paths[0]
steps_gallery_page_path.parent.mkdir(parents=True, exist_ok=True)

# Create detail page for each `Knowledge Base`
for knowledge_base in knowledge_bases:
content = _LLM_DETAIL_TEMPLATE.render(llm=knowledge_base)
knowledge_base_path = f"components-gallery/knowledge_bases/{knowledge_base['name'].lower()}.md"
path = src_dir / knowledge_base_path
with open(path, "w") as f:
f.write(content)

paths.append(knowledge_base_path)

# Create the `components-gallery/knowledge_bases/index.md` file
content = _COMPONENTS_LIST_TEMPLATE.render(
title="KnowledgeBases Gallery",
description="",
components=knowledge_bases,
component_group="knowledge_bases",
default_icon=":material-database:",
)

with open(steps_gallery_page_path, "w") as f:
f.write(content)

return paths

def on_nav(
self, nav: "Navigation", *, config: "MkDocsConfig", files: "Files"
) -> Union["Navigation", None]:
Expand Down
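Review note: `_generate_knowledge_bases_pages` follows the same layout as the other gallery generators: one index page plus one detail page per component, returned as paths relative to `src_dir`. The sketch below reproduces only that file layout in a temporary directory. It assumes plain strings in place of the Jinja templates, and `generate_pages` is a hypothetical stand-in, not the plugin method.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def generate_pages(
    src_dir: Path, components: List[Dict[str, str]], group: str
) -> List[str]:
    """Write an index page and one detail page per component; return relative paths."""
    paths = [f"components-gallery/{group}/index.md"]
    index_path = src_dir / paths[0]
    # Detail pages live next to the index, so one mkdir covers both.
    index_path.parent.mkdir(parents=True, exist_ok=True)

    for component in components:
        relative = f"components-gallery/{group}/{component['name'].lower()}.md"
        (src_dir / relative).write_text(f"# {component['name']}\n", encoding="utf-8")
        paths.append(relative)

    listing = "\n".join(f"- {component['name']}" for component in components)
    index_path.write_text(f"# Gallery\n\n{listing}\n", encoding="utf-8")
    return paths


with tempfile.TemporaryDirectory() as tmp:
    generated = generate_pages(
        Path(tmp), [{"name": "LanceDBKnowledgeBase"}], "knowledge_bases"
    )
    all_exist = all((Path(tmp) / path).exists() for path in generated)
print(generated, all_exist)
```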
Original file line number Diff line number Diff line change
Expand Up @@ -39,4 +39,12 @@ hide:

[:octicons-arrow-right-24: Embeddings](embeddings/index.md){ .bottom }

- :material-database:{ .lg .middle } __Knowledge Bases__

---

Explore all the available `KnowledgeBase` implementations integrated with `distilabel`.

[:octicons-arrow-right-24: Knowledge Bases](knowledge_bases/index.md){ .bottom }

</div>