retrieval langchain with turkish dataset #36

Open
4entertainment opened this issue Nov 13, 2023 · 2 comments

@4entertainment

I have the following code:

# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from transformers import AutoTokenizer, AutoModel

from silly import no_ssl_verification
from langchain.embeddings.huggingface import HuggingFaceEmbeddings


with no_ssl_verification():
    # load the document
    loader = TextLoader("paul_graham/paul_graham_essay_tr.txt")
    documents = loader.load()

    # split it into chunks
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    # create the Turkish embedding function
    # tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
    # model = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
    embedding_function = SentenceTransformerEmbeddings(model_name="dbmdz/bert-base-turkish-cased")

    # load it into Chroma
    db = Chroma.from_documents(docs, embedding_function)

    # query it
    query = "Yazarın üniversiteden önce üzerinde çalıştığı iki ana şey neydi?"
    docs = db.similarity_search(query)

    # print results
    print(docs[0].page_content)

How can I fix my code to do QA retrieval with LangChain using Turkish BERT embeddings? Please help me.

@kapusuzoglu

What is the issue? Have you tried another model?

@4entertainment
Author

Thank you for your reply, @kapusuzoglu.

My main task is to build a RAG pipeline in my own language. I have an LLM and embeddings. Do you have any example code or documentation for this task?

Also, how can I compare vector databases (such as Chroma, FAISS, etc.)?
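
On the vector database question, one library-agnostic approach is to run the same queries through each store and compare the results against an exact brute-force search, recording recall@k and time per query. The sketch below is only an illustration under assumptions: `search` is a hypothetical wrapper you would write around each store (for example around Chroma or FAISS) that returns the indices of the top-k chunks, and the helper names are made up for this example, not part of any library.

```python
import math
import time
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: Vector, vectors: List[Vector], k: int) -> List[int]:
    """Brute-force ground truth: indices of the k most similar vectors."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(expected: List[int], returned: List[int]) -> float:
    """Fraction of the exact top-k results that the store also returned."""
    if not expected:
        return 1.0
    return len(set(expected) & set(returned)) / len(expected)


def benchmark(
    search: Callable[[Vector, int], List[int]],  # hypothetical store wrapper
    queries: List[Vector],
    vectors: List[Vector],
    k: int,
) -> Dict[str, float]:
    """Time the store's search, then score it against the exact search."""
    if not queries:
        raise ValueError("need at least one query")
    start = time.perf_counter()
    results = [search(q, k) for q in queries]
    elapsed = time.perf_counter() - start
    recalls = [
        recall_at_k(exact_top_k(q, vectors, k), r)
        for q, r in zip(queries, results)
    ]
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "seconds_per_query": elapsed / len(queries),
    }
```

To use it, embed the chunks and the queries once with the same embedding model, pass the same vectors to every store, and call `benchmark` once per store. Recall only measures how closely a store matches exact search; whether the retrieved chunks actually answer the Turkish questions depends mostly on the embedding model and would need a separate check.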
