Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial implementation for vectordb #3879

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

lspinheiro
Copy link
Collaborator

Why are these changes needed?

Related issue number

Checks

@lspinheiro
Copy link
Collaborator Author

lspinheiro commented Oct 22, 2024

@jackgerrits @ekzhu , wondering if storage should be an abstraction in the core package and only the implementation should go into autogen-ext. Any thoughts? Still an early draft, just looking for some feedback on the design. I think libraries like langchain and llamaindex have similar abstractions which are useful for rag but could have other uses for for agents that do few-shot prompting

@lspinheiro lspinheiro requested a review from thinkall October 22, 2024 22:02
Copy link
Collaborator

@thinkall thinkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lspinheiro for the PR! I've just minor suggestions, others look good to me.

python/packages/autogen-ext/pyproject.toml Outdated Show resolved Hide resolved
python/uv.lock Outdated Show resolved Hide resolved
Copy link
Collaborator

@ekzhu ekzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rebase the PR branch to main? right now there are too many merge conflicts

@lspinheiro lspinheiro force-pushed the lpinheiro/feat/add-vectordb-chroma-store branch from 22cb336 to 7da80e9 Compare October 23, 2024 21:19
@lspinheiro lspinheiro force-pushed the lpinheiro/feat/add-vectordb-chroma-store branch 2 times, most recently from 22cb336 to 4b886b0 Compare October 24, 2024 04:32
@lspinheiro lspinheiro marked this pull request as ready for review October 24, 2024 06:48
@lspinheiro lspinheiro requested review from thinkall and ekzhu October 24, 2024 07:20
@lspinheiro
Copy link
Collaborator Author

@ekzhu , small nudge. This should be ready for review now.

@lspinheiro lspinheiro force-pushed the lpinheiro/feat/add-vectordb-chroma-store branch from 8bb13f0 to b3f0672 Compare October 27, 2024 01:49
Copy link
Collaborator

@thinkall thinkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @lspinheiro , LGTM!

"""Define Document according to autogen 0.4 specifications."""

id: ItemID
content: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why assuming string?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same representation used in 0.2 and I believe it assumes the content has already been preprocessed to a string format. More complex data types can lead to more complex operations in general. Would you suggest something different?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the vector store can have a payload object, which is usually a json. Unless you know the schema mapping returning string is a problem, maybe we can use dictionary or some other model?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is meant to capture the full payload but only the main text content of the associated document and the transformation between autogen doc and vectordb doc is handled in each implementation. The pydantic model is defined with arbitrary_types_allowed which should allow implementations to unpack additional attributes as needed at the document level instead of the content. Would that work for the scenario you have in mind?

type: str = ""
embedding_function: Optional[Callable[..., Any]] = None # embeddings = embedding_function(sentences)

async def create_collection(self, collection_name: str, overwrite: bool = False, get_or_create: bool = True) -> Any:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some vector store requires schema to create collection and some other parameters, may using variable args here can help generalise

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great callout. Checked the old code and those additional arguments were being passed in through the class constructor. Not the best design. Pushing a change as soon as the checks complete.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be updated now

"""
...

def update_docs(self, docs: List[Document], collection_name: Optional[str] = None, **kwargs: Any) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most used method is upsert

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the protocol, my understanding is that the idea is to keep it closer in design the DB theory and enforce CRUD like operations. Implementations can add upsert but they probably shouldn't be forced to

@MohMaz MohMaz added the awaiting-op-response Issue or pr has been triaged or responded to and is now awaiting a reply from the original poster label Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-op-response Issue or pr has been triaged or responded to and is now awaiting a reply from the original poster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants