Updated files to include the notebooks and updated .yml file #2

Open · wants to merge 30 commits into main

Commits (changes shown from 18 of 30 commits)
be09de9
This Jupyter notebook is intended to replace DEMO-semantic-search-pod…
tdubon Oct 4, 2023
3897004
Edits to the text and added nearText example.
tdubon Oct 4, 2023
d57349b
Deleted text2vec-cohere
tdubon Oct 4, 2023
3dd4ba5
Draft update to README.md
tdubon Oct 5, 2023
e57224c
Changed the key parameter name
tdubon Oct 5, 2023
b551da9
Updated README.md
tdubon Oct 6, 2023
2784616
Update Embedded_Weaviate.ipynb
tdubon Oct 6, 2023
147908b
Add files via upload
tdubon Oct 10, 2023
562aaea
Updated docker-compose.yml for new module
tdubon Oct 10, 2023
f4c0897
Edited notebook to include an example connecting to the server with d…
tdubon Oct 10, 2023
850fac7
Deleted .py files that are no longer needed
tdubon Oct 10, 2023
129260b
Deleted pycache and .DS_Store files
tdubon Oct 11, 2023
98e6b98
Delete .DS_Store file
tdubon Oct 11, 2023
1fecf2b
Second attempt to delete .DS_Store
tdubon Oct 11, 2023
b02f19e
Adding Embedded_Weaviate.ipynb file back to repository
tdubon Oct 11, 2023
6bea8a2
Removed my key
tdubon Oct 11, 2023
5943e46
Update README.md
tdubon Oct 11, 2023
de50725
Update README.md
tdubon Oct 11, 2023
6085913
added helper.py and import.py file back to repo
tdubon Oct 12, 2023
a071916
Updated files based on reviewer feedback
tdubon Oct 14, 2023
db7fe4e
added requirements file
tdubon Oct 14, 2023
02b67b6
Updated files based on reviewer feedback
tdubon Oct 17, 2023
c4904ba
Updated files based on reviewer feedback
tdubon Oct 17, 2023
235a6f4
removed keys from .yml file
tdubon Oct 17, 2023
0dde0a9
removed keys
tdubon Oct 17, 2023
62bc444
Revised files to address test failures
tdubon Oct 18, 2023
6f28974
deleted __pycache__
tdubon Oct 18, 2023
e645981
deleted .DS_Store file
tdubon Oct 18, 2023
1202d73
Changed the files to use text2vec-transformers
tdubon Oct 26, 2023
abe034d
added helper function
tdubon Oct 26, 2023
236 changes: 236 additions & 0 deletions Docker_Weaviate.ipynb
Review comment from @iamleonie (Collaborator):
These demos do not intend to have Jupyter Notebooks. Instead, we are aiming to have standalone demo application.
Since you did a great job at describing each step, maybe it would be nice to add your explanations as comments in the import.py and helper.py files?

Review comment:

To add to the comments from @iamleonie - I see that certain files (like helper.py and import.py) have been removed.

The notebook as it is will throw an error at import helper because helper.py is missing. That will be remedied by restoring those files - but, just as a reminder, it's good to check that the notebook runs from start to finish.
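
For reference, a stub like the one below would at least make the notebook importable again. The real helper.py is not shown in this diff, so these contents are purely hypothetical:

    # helper.py -- hypothetical stub; the actual file in the repo is not shown here.
    import json

    def print_response(response: dict) -> None:
        """Pretty-print a Weaviate query response."""
        print(json.dumps(response, indent=2))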

@@ -0,0 +1,236 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Semantic Search using Weaviate and Docker "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial we create a vector store that can be queried using semantic search on a sample dataset composed of transcribed podcasts. The steps will include uploading your data from a local store, and creating a schema as well as an object store.\n"
Review comment (Collaborator): This would be great as a description in the README.md.

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In your terminal: \n",
Review comment (Collaborator): This would be great as setup instructions in the README.md.

"1. Run your virtual environment: conda activate /Users/your_path/environment_name OR source path_to_your_VR/bin/activate\n",
Review comment from @databyjp (Oct 12, 2023):

I think the language here needs to be improved.

The canonical conda syntax is conda activate myenv where myenv can be the name or the path (source: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment).

Also, this line is confusing (path_to_your_VR) - what is VR?

Instructions 2 and 3 are confusing as they look like parts of the same instruction. If they cloned the repo, they would not need to separately download this file.

I would suggest something like:

1. Create and activate a virtual environment, for example using conda or venv
2. Install the required libraries with `pip install -r requirements.txt`
3. Run Weaviate using Docker, for example with `docker-compose up -d`
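
For step 3, a minimal docker-compose.yml along these lines could run Weaviate with the module this notebook uses (illustrative only; the actual .yml file in this PR is not shown here, and the image version is an assumption):

    version: '3.4'
    services:
      weaviate:
        image: semitechnologies/weaviate:1.21.2
        ports:
          - "8080:8080"
        environment:
          ENABLE_MODULES: text2vec-openai
          DEFAULT_VECTORIZER_MODULE: text2vec-openai
          OPENAI_APIKEY: $OPENAI_APIKEY
          AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
          PERSISTENCE_DATA_PATH: /var/lib/weaviate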

"2. Download and run the yml image doc in this repo\n",
"3. Run docker-compose up -d\n",
"4. Run pip install -r requirements.txt"
Review comment (Collaborator): Did you create a requirements.txt? It would be nice if you could commit it as well.
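
A minimal requirements.txt for this notebook might contain only the Weaviate client, since json is in the standard library and generate_uuid5 ships with the client (the version pin is illustrative, matching the v3 Python client API used here):

    # requirements.txt -- illustrative pin for the v3 client API used in this notebook
    weaviate-client>=3.24,<4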

]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"{\"action\":\"hnsw_vector_cache_prefill\",\"count\":1000,\"index_id\":\"podcast_7gZn71E8okke\",\"level\":\"info\",\"limit\":1000000000000,\"msg\":\"prefilled vector cache\",\"time\":\"2023-10-09T08:38:20-07:00\",\"took\":88134}\n"
]
}
],
"source": [
"import weaviate\n",
"from weaviate.util import generate_uuid5\n",
"import json\n",
"import helper\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Local docker container setup with text2vec-openai vectorizer module specified in yml file\n",
"More on modules: https://weaviate.io/developers/weaviate/modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Instatiate the client with rest API\n",
"client = weaviate.Client(\"http://localhost:8080\")\n",
"\n",
"meta_info = client.get_meta()\n",
"print(json.dumps(meta_info, indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Print the client information to confirm the modules are loaded.\n",
"meta_info = client.get_meta()\n",
"print(json.dumps(meta_info, indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell below we setup the schema, an outline requiring the data type, vectorizer and the list of classes. Note that it is essential to have your data cleaned and the categories clearly identified for this step. If using your own vectorizer, \"none\" should be specified for \"vectorizer\". "
tdubon marked this conversation as resolved.
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.schema.delete_all()\n",

Review comment:
Suggest using client.schema.delete_class("Podcast")

"schema = {\n",
" \"classes\": [\n",
" {\n",
" \"class\": \"Podcast\",\n",
" \"vectorizer\": \"text2vec-openai\",\n",
" \"properties\": [\n",
" {\n",
" \"name\": \"title\",\n",
" \"dataType\": [\"text\"]\n",
" },\n",
" {\n",
" \"name\": \"transcript\",\n",
" \"dataType\": [\"text\"]\n",
" }\n",
" ]\n",
" }\n",
" ]\n",
"}\n",
"client.schema.create(schema)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following cells we load the locally stored data (in json format) and create a function definition for an add_podcast object. \n",
Review comment (Collaborator): This would be great to move to import.py as a descriptive comment.

"\n",
"The name of the object represents the highest level classification for your data, indicated below as podcast_object (in dictionary type). Target class represents the next level in the classification of your data. Here we indicate it below as the string \"Podcast\", but note that multiple classes could have been specified, for example, if we had different categories of podcasts, such as English, Spanish, etc.\n",
"\n",
"The function definition below is implementing batch_size=1. Note that with larger amounts of data you will want to adjust this setting. Per the documentation: \"batch imports are used to maximize import speed and minimize network latency. Batch import processes multiple objects per request, and clients can parallelize the process.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(\"/Users/tdubon/DEMO-semantic-search-podcast/data/podcast_ds.json\", 'r') as f:\n",
" datastore = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(json.dumps(datastore, indent=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the cell below we define the batch and the uuid.\n",
"\n",
"Batch definition is helpful because it's \"a way of importing/creating objects and references in bulk using a single API request to the Weaviate server.\" "
Review comment (Collaborator): This would be great to move to import.py as a descriptive comment.

Review comment: FYI a good starting batch size is ~50-100 or so. A 'batch' sends data objects in groups to speed up the import, so a batch size of 1 removes the benefit of using batches.

Typically the only time you might use a batch size of 1 is to troubleshoot.
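
A hedged sketch of that recommendation against the v3 Python client (the value 50 is illustrative):

    # Batch sizes of roughly 50-100 amortize network round trips;
    # a batch size of 1 is mainly useful for troubleshooting.
    client.batch.configure(batch_size=50)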

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def add_podcasts(batch_size = 1):\n",
" client.batch.configure(batch_size=1)\n",
" with client.batch as batch:\n",
" for i, d in enumerate(datastore):\n",
" print(f\"importing podcast: {i+1}\")\n",
" properties = {\n",
" \"title\": d[\"title\"],\n",
" \"transcript\": d[\"transcript\"]\n",
" }\n",
" podcast_uuid = generate_uuid5('podcast', d[\"title\"] + d[\"transcript\"])\n",

Review comment:
podcast_uuid here does not get used. Recommend using it like so:

            batch.add_data_object(
                data_object=properties, 
                class_name= "Podcast",
                uuid=podcast_uuid
            )
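
Putting the two suggestions together (a pass-through batch size and the deterministic UUID actually being used), the import function might look like the sketch below, assuming the client and datastore objects defined earlier:

    def add_podcasts(batch_size=50):
        # Let the caller control the batch size instead of hard-coding 1.
        client.batch.configure(batch_size=batch_size)
        with client.batch as batch:
            for i, d in enumerate(datastore):
                print(f"importing podcast: {i + 1}")
                properties = {
                    "title": d["title"],
                    "transcript": d["transcript"],
                }
                # Deterministic UUID: re-running the import updates existing
                # objects instead of creating duplicates.
                podcast_uuid = generate_uuid5("podcast", d["title"] + d["transcript"])
                batch.add_data_object(
                    data_object=properties,
                    class_name="Podcast",
                    uuid=podcast_uuid,
                )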

" \n",
" batch.add_data_object(\n",
" data_object=properties, \n",
" class_name= \"Podcast\")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"add_podcasts(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next you implement the pipeline and query your data, such as semantic search, generative search, question/answering. In this example we use nearText with the module text2vec-openai which implments text-embedding-ada-002. "
Review comment (Collaborator): It would be great if you could create a file called "query.py" and add this part there.

]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Question answering - search \n",

Review comment:
FYI the query here is a semantic search. Question answering is a separate feature. So I would recommend updating the comment here.
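
For contrast, a question-answering query would use with_ask rather than with_near_text. The sketch below is illustrative only: it assumes a QnA module such as qna-openai, which this setup does not enable.

    # Hypothetical question-answering query -- requires a QnA module
    # (e.g. qna-openai), which is not enabled in this docker-compose setup.
    response = (
        client.query
        .get("Podcast", ["transcript"])
        .with_ask({"question": "What is semantic search?"})
        .with_limit(1)
        .do()
    )
    print(json.dumps(response, indent=2))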

"\n",
"response = (\n",
" client.query\n",
" .get(\"Podcast\", [\"transcript\"])\n",
" .with_near_text({\"concepts\": [\"semantic search\"]})\n",
" .with_limit(3)\n",
" .with_additional([\"distance\"])\n",
" .do()\n",
")\n",
"\n",
"print(json.dumps(response, indent=2))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "SemanticSearch",
"language": "python",
"name": "semanticsearch"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}