Updated files to include the notebooks and updated .yml file #2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,236 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Semantic Search using Weaviate and Docker " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In this tutorial we create a vector store that can be queried using semantic search on a sample dataset composed of transcribed podcasts. The steps will include uploading your data from a local store, and creating a schema as well as an object store.\n" | ||
This would be great as a description in the README.md |
||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In your terminal: \n", | ||
This would be great as setup instructions in the README.md |
||
"1. Run your virtual environment: conda activate /Users/your_path/environment_name OR source path_to_your_VR/bin/activate\n", | ||
I think the language here needs to be improved. The canonical conda syntax is `conda activate <environment_name>`. Also, this line is confusing. Instructions 2 and 3 are confusing as they look like parts of the same instruction. If they cloned the repo, they would not need to separately download this file. I would suggest something like:
|
||
"2. Download and run the yml image doc in this repo\n", | ||
"3. Run docker-compose up -d\n", | ||
"4. Run pip install -r requirements.txt" | ||
Did you create a requirements.txt? It would be nice if you could commit it as well. |
||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"{\"action\":\"hnsw_vector_cache_prefill\",\"count\":1000,\"index_id\":\"podcast_7gZn71E8okke\",\"level\":\"info\",\"limit\":1000000000000,\"msg\":\"prefilled vector cache\",\"time\":\"2023-10-09T08:38:20-07:00\",\"took\":88134}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import weaviate\n", | ||
"from weaviate.util import generate_uuid5\n", | ||
"import json\n", | ||
"import helper\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Local docker container setup with text2vec-openai vectorizer module specified in yml file\n", | ||
"More on modules: https://weaviate.io/developers/weaviate/modules" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#Instatiate the client with rest API\n", | ||
"client = weaviate.Client(\"http://localhost:8080\")\n", | ||
"\n", | ||
"meta_info = client.get_meta()\n", | ||
"print(json.dumps(meta_info, indent=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#Print the client information to confirm the modules are loaded.\n", | ||
"meta_info = client.get_meta()\n", | ||
"print(json.dumps(meta_info, indent=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In the cell below we setup the schema, an outline requiring the data type, vectorizer and the list of classes. Note that it is essential to have your data cleaned and the categories clearly identified for this step. If using your own vectorizer, \"none\" should be specified for \"vectorizer\". " | ||
tdubon marked this conversation as resolved.
|
||
] | ||
}, | ||
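The cell above mentions that "none" should be used for the vectorizer when you bring your own vectors. As a hedged sketch (not part of this PR; my_embed is a placeholder for whatever external embedding function you use), that variant looks roughly like this with the v3 Python client:

# Hypothetical variant: declare the class with no server-side vectorizer and
# supply your own embedding for each object at import time.
schema_own_vectors = {
    "classes": [{
        "class": "Podcast",
        "vectorizer": "none",  # Weaviate will not vectorize objects for you
        "properties": [
            {"name": "title", "dataType": ["text"]},
            {"name": "transcript", "dataType": ["text"]},
        ],
    }]
}
client.schema.create(schema_own_vectors)

client.data_object.create(
    data_object={"title": "Example episode", "transcript": "..."},
    class_name="Podcast",
    vector=my_embed("..."),  # explicit vector; needed for vector search when vectorizer is "none"
)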
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client.schema.delete_all()\n", | ||
Suggest using |
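The suggestion above is cut off in this view. As a hedged sketch of one narrower alternative - an assumption, not necessarily what the reviewer had in mind - you could drop only the Podcast class instead of wiping the whole schema:

# Delete only the Podcast class, leaving any other classes in the instance intact.
existing = client.schema.get().get("classes", [])
if any(c["class"] == "Podcast" for c in existing):
    client.schema.delete_class("Podcast")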
||
"schema = {\n", | ||
" \"classes\": [\n", | ||
" {\n", | ||
" \"class\": \"Podcast\",\n", | ||
" \"vectorizer\": \"text2vec-openai\",\n", | ||
" \"properties\": [\n", | ||
" {\n", | ||
" \"name\": \"title\",\n", | ||
" \"dataType\": [\"text\"]\n", | ||
" },\n", | ||
" {\n", | ||
" \"name\": \"transcript\",\n", | ||
" \"dataType\": [\"text\"]\n", | ||
" }\n", | ||
" ]\n", | ||
" }\n", | ||
" ]\n", | ||
"}\n", | ||
"client.schema.create(schema)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In the following cells we load the locally stored data (in json format) and create a function definition for an add_podcast object. \n", | ||
This would be great to move to import.py as a descriptive comment. |
||
"\n", | ||
"The name of the object represents the highest level classification for your data, indicated below as podcast_object (in dictionary type). Target class represents the next level in the classification of your data. Here we indicate it below as the string \"Podcast\", but note that multiple classes could have been specified, for example, if we had different categories of podcasts, such as English, Spanish, etc.\n", | ||
"\n", | ||
"The function definition below is implementing batch_size=1. Note that with larger amounts of data you will want to adjust this setting. Per the documentation: \"batch imports are used to maximize import speed and minimize network latency. Batch import processes multiple objects per request, and clients can parallelize the process.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"with open(\"/Users/tdubon/DEMO-semantic-search-podcast/data/podcast_ds.json\", 'r') as f:\n", | ||
" datastore = json.load(f)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print(json.dumps(datastore, indent=2))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In the cell below we define the batch and the uuid.\n", | ||
"\n", | ||
"Batch definition is helpful because it's \"a way of importing/creating objects and references in bulk using a single API request to the Weaviate server.\" " | ||
This would be great to move to import.py as a descriptive comment.
FYI a good starting batch is ~50-100 or so. A 'batch' sends data objects in groups to speed up import, so a batch size of 1 removes the benefit of using batches. Typically the only time you might use a batch size of 1 is to troubleshoot. |
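Building on the note above, a hedged sketch of a more typical batch configuration with the v3 Python client; the exact values are illustrative and not part of this PR:

# Send objects in groups of ~50 instead of one at a time. dynamic=True lets the
# client adjust the batch size based on how quickly the server responds.
client.batch.configure(batch_size=50, dynamic=True)
with client.batch as batch:
    for d in datastore:
        batch.add_data_object(
            data_object={"title": d["title"], "transcript": d["transcript"]},
            class_name="Podcast",
        )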
||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def add_podcasts(batch_size = 1):\n", | ||
" client.batch.configure(batch_size=1)\n", | ||
" with client.batch as batch:\n", | ||
" for i, d in enumerate(datastore):\n", | ||
" print(f\"importing podcast: {i+1}\")\n", | ||
" properties = {\n", | ||
" \"title\": d[\"title\"],\n", | ||
" \"transcript\": d[\"transcript\"]\n", | ||
" }\n", | ||
" podcast_uuid = generate_uuid5('podcast', d[\"title\"] + d[\"transcript\"])\n", | ||
The generated podcast_uuid is not passed to batch.add_data_object below. Suggested:
batch.add_data_object(
    data_object=properties,
    class_name="Podcast",
    uuid=podcast_uuid
) |
||
" \n", | ||
" batch.add_data_object(\n", | ||
" data_object=properties, \n", | ||
" class_name= \"Podcast\")\n", | ||
" " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"add_podcasts(1)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Next you implement the pipeline and query your data, such as semantic search, generative search, question/answering. In this example we use nearText with the module text2vec-openai which implments text-embedding-ada-002. " | ||
It would be great if you could create a file called "query.py" and add this part there. |
||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"#Question answering - search \n", | ||
FYI the query here is a semantic search. Question answering is a separate feature. So I would recommend updating the comment here. |
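To illustrate the distinction above, a hedged sketch of what an actual question-answering query could look like with the v3 Python client; this assumes the qna-openai module is enabled in the docker-compose yml, which may not be the case for this repo:

# Extractive question answering (a separate feature from semantic search);
# requires the qna-openai module on the Weaviate instance.
response = (
    client.query
    .get("Podcast", ["title"])
    .with_ask({"question": "What is semantic search?", "properties": ["transcript"]})
    .with_additional("answer { hasAnswer result }")
    .with_limit(1)
    .do()
)
print(json.dumps(response, indent=2))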
||
"\n", | ||
"response = (\n", | ||
" client.query\n", | ||
" .get(\"Podcast\", [\"transcript\"])\n", | ||
" .with_near_text({\"concepts\": [\"semantic search\"]})\n", | ||
" .with_limit(3)\n", | ||
" .with_additional([\"distance\"])\n", | ||
" .do()\n", | ||
")\n", | ||
"\n", | ||
"print(json.dumps(response, indent=2))" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "SemanticSearch", | ||
"language": "python", | ||
"name": "semanticsearch" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.5" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
These demos are not intended to include Jupyter Notebooks; instead, we are aiming for standalone demo applications.
Since you did a great job of describing each step, maybe it would be nice to add your explanations as comments in the import.py and helper.py files?
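A hedged sketch of how the notebook's steps could map onto a standalone import.py, with the explanations carried over as comments; the file layout, function name, and relative data path are assumptions rather than part of this PR:

# import.py - standalone version of the notebook's import flow (sketch).
import json
import weaviate

def main():
    # Connect to the local Docker instance (text2vec-openai module configured in the yml file).
    client = weaviate.Client("http://localhost:8080")

    # Create the Podcast class; "title" and "transcript" are the searchable properties.
    schema = {
        "classes": [{
            "class": "Podcast",
            "vectorizer": "text2vec-openai",
            "properties": [
                {"name": "title", "dataType": ["text"]},
                {"name": "transcript", "dataType": ["text"]},
            ],
        }]
    }
    client.schema.create(schema)

    # Load the locally stored podcast transcripts and import them in batches.
    with open("data/podcast_ds.json") as f:
        datastore = json.load(f)

    client.batch.configure(batch_size=50)
    with client.batch as batch:
        for d in datastore:
            batch.add_data_object(
                data_object={"title": d["title"], "transcript": d["transcript"]},
                class_name="Podcast",
            )

if __name__ == "__main__":
    main()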
To add to the comments from @iamleonie - I see that certain files (like `helper.py` and `import.py`) have been removed. The notebook as it is will throw an error at `import helper` because `helper.py` is missing. That will be remedied by restoring those files - but, just as a reminder, it's good to check that the notebook runs from start to finish.