Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot use non-vectorstore retrievers in LangChainKnowledgeBase #1096

Closed
Joelius300 opened this issue Aug 15, 2024 · 1 comment · Fixed by #1097
Closed

Cannot use non-vectorstore retrievers in LangChainKnowledgeBase #1096

Joelius300 opened this issue Aug 15, 2024 · 1 comment · Fixed by #1097

Comments

@Joelius300
Copy link
Contributor

Joelius300 commented Aug 15, 2024

LangChainKnowledgeBase works great when you already have a LangChain-based retrieval setup that uses vectorstores, but when you want something more basic, like a TFIDFRetriever, you cannot use it because of an instance check.

if not isinstance(self.retriever, VectorStoreRetriever):
raise ValueError(f"Retriever is not of type VectorStoreRetriever: {self.retriever}")

Looking at the source of VectorStoreRetriever, it doesn't override the invoke method, which is the only one the LangChainKnowledgeBase uses (here specifically).

After some manual testing for confirmation, I believe it would be best to relax the check to BaseRetriever instead, which defines/implements1 the required invoke method. What do you think? :)

Footnotes

  1. technically, it's already defined by Runnable, but I would argue it makes much more sense semantically to do an instance check on BaseRetriever.

@Joelius300
Copy link
Contributor Author

Motivation / Example

The following snippet does not work and throws the above mentioned ValueError. This is despite that fact that it would work perfectly fine and traditional retrieval methods that aren't based on vector stores are just as valid in that scenario.

Taken from Joelius300/cudeschin-rag@73ae0992b9

loader = DirectoryLoader(cudeschin_path / "content/de", glob="**/*.md",
                         loader_cls=TextLoader,
                         loader_kwargs=dict(encoding="utf-8"))
splitter = MarkdownTextSplitter(is_separator_regex=True)
docs = splitter.split_documents(loader.load())

retriever = TFIDFRetriever.from_documents(docs, k=n_documents)
return LangChainKnowledgeBase(retriever=retriever, num_documents=n_documents)

To work around this issue, I created a custom knowledge base as follows and replaced LangChainKnowledgeBase above with it:

class LessPickyLangChainKnowledgeBase(LangChainKnowledgeBase):
    def search(self, query: str, num_documents: Optional[int] = None) -> List[LangChainDocument]:
        if self.retriever is None:
            raise ValueError("must provide retriever")
        if not isinstance(self.retriever, BaseRetriever):  # no reason for VectorStoreRetriever
            raise ValueError(f"Retriever is not of type BaseRetriever: {self.retriever}")
        _num_documents = num_documents or self.num_documents
        # logger.debug(f"Getting {_num_documents} relevant documents for query: {query}")
        lc_documents: List[LangChainDocument] = self.retriever.invoke(input=query)
        documents = []
        for lc_doc in lc_documents:
            documents.append(
                Document(
                    content=lc_doc.page_content,
                    meta_data=lc_doc.metadata,
                )
            )
        return documents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant