
5th of July Updates #6

Open
alexandersimoes opened this issue Jul 5, 2024 · 2 comments


@alexandersimoes

No description provided.

@alebjanes
Contributor

Updates on my side:

  1. Evaluation set: the 100-question evaluation set we'll use across all approaches is ready, along with the correct answers and the values those answers should contain.
  2. New content in the corpus: for the RAG to work, I added the content needed to answer these questions to the corpus (for all available years). This includes content for broad product categories (like dairy, salmon, chicken, etc.) that are composed of multiple HS codes.
  3. RAG evaluation: I'm now running the RAG evaluation, which takes each of these 100 questions, fetches the top-k results using similarity search, and passes them as context to an LLM. I'll try a few combinations, varying top k (5 or 10), the embedding model, and the final LLM (to evaluate the cost of using gpt-4 vs. gpt-3.5 here). As an initial result, the first run got 81/100 questions correct. A sketch of the loop is below.
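
For reference, here is a minimal sketch of that retrieve-then-answer loop. It assumes the OpenAI Python client, a vector index exposing a `similarity_search(query, k)` method, and an `eval_set.json` file of question/expected-value pairs; those names are assumptions for illustration, not our actual code.

```python
"""Minimal sketch of the RAG evaluation loop described above.

Assumed (not from the comment): the OpenAI Python client, a vector index
with a `similarity_search(query, k)` method, and an `eval_set.json` file
of {"question": ..., "expected_value": ...} records.
"""
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_with_rag(index, question, top_k=5, model="gpt-3.5-turbo"):
    # Retrieve the top-k most similar corpus chunks via similarity search
    # and pass them as context to the LLM.
    chunks = index.similarity_search(question, k=top_k)
    context = "\n\n".join(chunk.text for chunk in chunks)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content


def run_eval(index, path="eval_set.json", top_k=5, model="gpt-3.5-turbo"):
    # An answer counts as correct when it contains the expected value --
    # a simple proxy for the grading described in the comment.
    with open(path) as f:
        eval_set = json.load(f)
    correct = sum(
        item["expected_value"] in answer_with_rag(index, item["question"], top_k, model)
        for item in eval_set
    )
    print(f"{correct}/{len(eval_set)} correct")
```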

@pippo-sci
Contributor

pippo-sci commented Jul 8, 2024

Fine-tuning results with both a random sample and Ale's test set (RAG-only questions):

| Model | Accuracy |
| ----- | -------- |
| TinyLlama, 1 epoch | 0% |
| TinyLlama, 10 epochs | 0% |
| TinyLlama, 50 epochs | 0% |
| Llama2, 1 epoch | 0% |

The main issue is that the model learns the text around the numbers but gets the numbers wrong; in fact, it returns a different number every time it is queried. As a side effect, the TinyLlama models lost their ability to answer other inputs.

Next steps:

  • Apply another metric that measures how far off the predicted values are, to check whether there is any difference between the models (sketched below)
  • Test fine-tuning on API URLs
  • Test a stripped-down version of the multilayer approach
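
One way that "how far off" metric could look, as a sketch: pull the first number out of each generated answer and score it with relative error against the gold value. The regex, the penalty for missing numbers, and the aggregation are all assumptions, not the metric pippo-sci necessarily has in mind.

```python
# Sketch of a numeric-distance metric for fine-tuned model outputs.
# Assumption: answers embed the target value as the first numeric token.
import re


def first_number(text):
    # Grab the first numeric token, allowing thousands separators and decimals.
    match = re.search(r"-?\d[\d,]*\.?\d*", text)
    return float(match.group().replace(",", "")) if match else None


def mean_relative_error(predictions, expected_values):
    errors = []
    for pred_text, expected in zip(predictions, expected_values):
        value = first_number(pred_text)
        if value is None:
            errors.append(1.0)  # no number produced: count as maximally wrong
        else:
            errors.append(abs(value - expected) / max(abs(expected), 1e-9))
    return sum(errors) / len(errors)


# Example: compare two fine-tuned models on the same question set.
# mre_tiny = mean_relative_error(tinyllama_answers, gold_values)
# mre_llama2 = mean_relative_error(llama2_answers, gold_values)
```

Unlike the exact-match accuracy above (where every model scores 0%), a continuous error like this can still rank models whose outputs are all wrong but wrong by different amounts.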
