
Updated files to include the notebooks and updated .yml file #2

Open · wants to merge 30 commits into base: main
Conversation


@tdubon tdubon commented Oct 10, 2023

Hi, I updated the files but I need help testing them as I don't have any OpenAI credits to get the final output.

Refers to issue: #1


tdubon commented Oct 11, 2023

Hi @iamleonie, please review the pull request and let me know if there are any changes that I need to make.

@tdubon tdubon closed this Oct 11, 2023
@tdubon tdubon reopened this Oct 11, 2023
@iamleonie (Collaborator):

Hi @tdubon, could you please add the following to the PR description:

  • a short description of the changes you made
  • a link to the issue this PR refers to

Could you also please remove the binary files from the PR (.DS_Store and __pycache__)?

I will review the PR in detail tomorrow.

@iamleonie iamleonie self-requested a review October 11, 2023 16:11

tdubon commented Oct 11, 2023

Hi @iamleonie, sure. I deleted the named files and here is the additional information you requested:

The original demo import.py was integrated into two Jupyter notebooks. The differences are as follows:

  • Method of connecting to the server: one notebook uses the embedded Weaviate client and the other uses Docker.
  • The text2vec-openai vectorization module is used instead of the transformers module, which requires a GPU.
  • Narrative explanations of each step.

The original .yml file was updated to include the information needed for the new module.

Let me know if you have any questions or need any other information.

@iamleonie (Collaborator) left a comment:

Hi @tdubon,
Thank you for your contribution. Since these demo projects are intended to be runnable applications rather than Jupyter Notebooks, it would be great to convert your notebooks into Python files so that other contributors, e.g. those working on the frontend, can build on your work.

Specifically, we would like to keep the helper.py and import.py files, move the contents of your notebooks there, and remove the notebooks.

I apologize for the additional effort. Thank you for your understanding.

Embedded_Weaviate.ipynb (outdated review thread)
helper.py (review thread)
Collaborator:

Please don't delete these files, as other contributors are building their solutions on top of them.

docker-compose.yml (outdated review thread)
Collaborator:

These demos are not intended to include Jupyter Notebooks. Instead, we are aiming for a standalone demo application.
Since you did a great job at describing each step, maybe it would be nice to add your explanations as comments in the import.py and helper.py files?


To add to the comments from @iamleonie: I see that certain files (like helper.py and import.py) have been removed.

The notebook as it is will throw an error at import helper because helper.py is missing. That will be remedied by restoring those files - but, just as a reminder, it's good to check that the notebook runs from start to finish.

"1. Run your virtual environment: conda activate /Users/your_path/environment_name OR source path_to_your_VR/bin/activate\n",
"2. Download and run the yml image doc in this repo\n",
"3. Run docker-compose up -d\n",
"4. Run pip install -r requirements.txt"
Collaborator:

Did you create a requirements.txt? It would be nice if you could commit it as well.

Docker_Weaviate.ipynb (outdated review thread)
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following cells we load the locally stored data (in json format) and create a function definition for an add_podcast object. \n",
Collaborator:

This would be great to move to import.py as a descriptive comment.
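
For illustration, a sketch of how this step could look once moved into import.py as code plus a descriptive comment (assuming the repository's podcast_ds.json file and the title/transcript fields used in the notebook; the path and function name are illustrative):

    import json

    # Load the locally stored podcast data (JSON format). Each record is
    # expected to carry at least a "title" and a "transcript" field.
    with open("podcast_ds.json", "r") as f:
        data = json.load(f)

    def add_podcast(batch, podcast):
        # Queue a single podcast object for import into the "Podcast" class.
        properties = {
            "title": podcast["title"],
            "transcript": podcast["transcript"],
        }
        batch.add_data_object(data_object=properties, class_name="Podcast")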

"source": [
"In the cell below we define the batch and the uuid.\n",
"\n",
"Batch definition is helpful because it's \"a way of importing/creating objects and references in bulk using a single API request to the Weaviate server.\" "
Collaborator:

This would be great to move to import.py as a descriptive comment.


FYI a good starting batch size is ~50-100 or so. A 'batch' sends data objects in groups to speed up the import, so a batch size of 1 removes the benefit of using batches.

Typically the only time you might use a batch size of 1 is to troubleshoot.
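
A minimal sketch of that recommendation with the v3 Python client (assuming the client object already created in the notebook):

    # Send objects to Weaviate in groups; ~50-100 per batch is a
    # reasonable starting point.
    client.batch.configure(batch_size=100)

    with client.batch as batch:
        for d in data:
            ...  # add_data_object calls go here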

"cell_type": "markdown",
"metadata": {},
"source": [
"Next you implement the pipeline and query your data, such as semantic search, generative search, question/answering. In this example we use nearText with the module text2vec-openai which implments text-embedding-ada-002. "
Collaborator:

It would be great if you could create a file called "query.py" and add this part there.
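
A minimal sketch of what query.py could contain (assuming a local Weaviate instance with the text2vec-openai module enabled and a Podcast class with title/transcript properties; the example query text is illustrative):

    import weaviate

    client = weaviate.Client("http://localhost:8080")

    # Semantic search: nearText vectorizes the query with text2vec-openai
    # (text-embedding-ada-002) and returns the closest podcasts.
    response = (
        client.query.get("Podcast", ["title", "transcript"])
        .with_near_text({"concepts": ["vector databases"]})
        .with_limit(3)
        .do()
    )
    print(response)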

"metadata": {},
"outputs": [],
"source": [
"client.schema.delete_all()\n",


Suggest using client.schema.delete_class("Podcast")
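
For context, the difference between the two calls (v3 client):

    # Drops every class in the schema, which would wipe data other demos rely on.
    client.schema.delete_all()

    # Drops only the Podcast class and leaves the rest of the schema untouched.
    client.schema.delete_class("Podcast")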

" \"title\": d[\"title\"],\n",
" \"transcript\": d[\"transcript\"]\n",
" }\n",
" podcast_uuid = generate_uuid5('podcast', d[\"title\"] + d[\"transcript\"])\n",


podcast_uuid here does not get used. Recommend using it like so:

            batch.add_data_object(
                data_object=properties, 
                class_name= "Podcast",
                uuid=podcast_uuid
            )
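
In context, the surrounding import loop could then look roughly like this (a sketch assuming the notebook's data list and the batch configuration above):

    from weaviate.util import generate_uuid5

    with client.batch as batch:
        for d in data:
            properties = {
                "title": d["title"],
                "transcript": d["transcript"],
            }
            # Deterministic ID derived from the content, so re-importing the
            # same record reuses the same ID instead of creating a duplicate.
            podcast_uuid = generate_uuid5('podcast', d["title"] + d["transcript"])
            batch.add_data_object(
                data_object=properties,
                class_name="Podcast",
                uuid=podcast_uuid,
            )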

"metadata": {},
"outputs": [],
"source": [
"#Question answering - search \n",


FYI the query here is a semantic search. Question answering is a separate feature. So I would recommend updating the comment here.
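
For reference, the two queries look like this with the v3 client (a sketch; the question-answering variant assumes a QA module such as qna-openai is enabled, and the query strings are illustrative):

    # Semantic search: retrieves the podcasts closest in meaning to the query.
    semantic = (
        client.query.get("Podcast", ["title", "transcript"])
        .with_near_text({"concepts": ["vector databases"]})
        .with_limit(2)
        .do()
    )

    # Question answering: extracts an answer span from the transcripts;
    # requires a question-answering module to be enabled.
    qa = (
        client.query.get("Podcast", ["title"])
        .with_ask({"question": "What is a vector database?", "properties": ["transcript"]})
        .with_additional(["answer {hasAnswer result}"])
        .with_limit(1)
        .do()
    )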

@@ -0,0 +1,221 @@
{


Please see the comments on the Docker-based file, as I think they apply here also.

"metadata": {},
"source": [
"In your terminal: \n",
"1. Run your virtual environment: conda activate /Users/your_path/environment_name OR source path_to_your_VR/bin/activate\n",
@databyjp databyjp Oct 12, 2023:

I think the language here needs to be improved.

The canonical conda syntax is `conda activate myenv`, where myenv can be the name or the path (source: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#activating-an-environment).

Also, this line is confusing (path_to_your_VR) - what is VR?

Instructions 2 and 3 are confusing as they look like parts of the same instruction. If they cloned the repo, they would not need to separately download this file.

I would suggest something like:

1. Create and activate a virtual environment, for example using conda or venv
2. Install the required libraries with `pip install -r requirements.txt`
3. Run Weaviate using Docker, for example with `docker-compose up -d`


tdubon commented Oct 17, 2023

Hi @iamleonie and @databyjp, I've incorporated the requested changes in the data_import.py, query.py, and README files. Please note that the helper.py and import.py files should be deleted for the reasons mentioned before.

Finally, please note that this is the last contribution I am able to make to the project. Any further changes will have to be delegated to someone else, if you so choose.

Thanks again

@iamleonie (Collaborator):

Thank you for incorporating the feedback. I will review the PR shortly. Please note that this may take a few days.

@iamleonie (Collaborator):

Hi @tdubon, thank you for implementing most of the requested changes.

I did an in-depth review of your changes and, from my point of view, the following changes are still required before this PR can be merged:

  • Revert the vectorizer from text2vec-openai back to text2vec-transformers (see the sketch after this list). From what I can see in the original docker-compose.yml, GPU usage is disabled, so using the transformers module does not require a GPU.
  • Remove the new files podcast_ds2.json and the accompanying data_import.py. From what I understand, data_import.py is very similar to import.py, with the difference that the reduced podcast_ds2.json file is used instead of the original podcast_ds.json file. Having a reduced file for local testing can be helpful, but I would not commit it to the repository.
  • I will leave some smaller comments directly in the code.
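
A sketch of what the reverted class definition could look like (assuming the Podcast class and the title/transcript properties used in the notebooks; the property types are illustrative):

    podcast_class = {
        "class": "Podcast",
        # Revert from "text2vec-openai" back to the local transformers module;
        # the original docker-compose.yml runs it with GPU usage disabled.
        "vectorizer": "text2vec-transformers",
        "properties": [
            {"name": "title", "dataType": ["text"]},
            {"name": "transcript", "dataType": ["text"]},
        ],
    }

    client.schema.create_class(podcast_class)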

As you mentioned that you won't be able to make any further modifications, please let us know how you'd like to proceed.

README.md (review thread)
(TO DO)

## Setup instructions
1. Set-up Weaviate: `docker-compose up -d`*
Collaborator:

I'd actually prefer to keep this.

README.md (review thread)

![Screenshot 2022-03-29 191123](https://user-images.githubusercontent.com/72981484/160694464-38a49b47-cd8f-4492-ae25-1cffaa7d85c2.jpg)
Collaborator:

Why does this have to be removed?

README.md (outdated review thread)
import.py (review thread)
import.py (outdated review thread)

message = str(item["title"]) + ' imported'
helper.log(message)
Collaborator:

Please don't remove the logging functionality. This is a very helpful output in the console.

query.py (outdated review thread)
urllib3==2.0.6
validators==0.22.0
wcwidth==0.2.8
weaviate-client==3.24.2
Collaborator:

Do we really need all the above packages? I am assuming weaviate-client is sufficient and the rest could be removed?
