
Add prompt testing using an LLM testing suite #174

Open · vmesel opened this issue Apr 7, 2024 · 5 comments

Comments

@vmesel (Member) commented Apr 7, 2024

We need to help our users avoid prompt regressions that slip through unnoticed. To achieve this, we should implement a way for users to run a test suite over their own test cases with a user-defined similarity score threshold. It must be simple to configure and extensible beyond TOML files.
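As a rough illustration only (the TOML layout, file name, and the pluggable `ask_llm`/`embed` callables are all hypothetical, not existing dialog code), a TOML-driven suite with a user-set similarity threshold could look something like this, assuming Python 3.11+ for `tomllib`:

```python
import tomllib
from typing import Callable, Sequence

# Hypothetical shape of a user-written suite file (e.g. prompt_tests.toml).
EXAMPLE_SUITE = """
[settings]
similarity_threshold = 0.85   # user-defined pass/fail cutoff

[[cases]]
question = "What is your refund policy?"
expected_answer = "We offer a 30-day full refund."
"""

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

def run_suite(
    raw_toml: str,
    ask_llm: Callable[[str], str],            # the app's prompt/LLM call
    embed: Callable[[str], Sequence[float]],  # any embedding model
) -> list[tuple[str, float, bool]]:
    """Return (question, similarity, passed) for every case in the suite."""
    suite = tomllib.loads(raw_toml)
    threshold = suite["settings"]["similarity_threshold"]
    results = []
    for case in suite["cases"]:
        answer = ask_llm(case["question"])
        score = cosine(embed(answer), embed(case["expected_answer"]))
        results.append((case["question"], score, score >= threshold))
    return results
```

Keeping the parsing step separate from `run_suite` would make it straightforward to plug in sources other than TOML later.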

@avelino (Member) commented Apr 7, 2024

https://github.com/confident-ai/deepeval?tab=readme-ov-file#writing-your-first-test-case

This week @lgabs shared this project with me; apparently we could use it to test our prompts.
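Based on the linked README, a first deepeval test case looks roughly like the sketch below; the class and metric names are taken from that page and may differ in newer deepeval releases:

```python
# Roughly the "write your first test case" pattern from the deepeval README;
# typically run with `deepeval test run test_prompts.py`.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_prompt():
    metric = AnswerRelevancyMetric(threshold=0.7)  # user-set cutoff
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # in practice this would be the output of our own prompt/chain
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30-day full refund."],
    )
    assert_test(test_case, [metric])
```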

@vmesel (Member, Author) commented Apr 7, 2024 via email

@lgabs (Collaborator) commented Apr 7, 2024

Yeah, I haven't been able to study LLM evals in much depth, but I did see that the community seems to evaluate LLM applications with several standard metrics, where another LLM grades the application's outputs against expected results (this evaluator could even be a free local LLM, since its task is much simpler).

I saved this Andrew Ng short course to watch soon; maybe it'll help 🤘
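For reference, a bare-bones sketch of the LLM-as-judge idea described above; the endpoint and model name are placeholders (e.g. a free local model served behind an OpenAI-compatible API), not anything dialog currently ships:

```python
# A second model grades the application's output against the expected answer
# and returns a 0-1 score. Uses the OpenAI Python client against a
# placeholder local endpoint/model.
from openai import OpenAI

JUDGE_PROMPT = (
    "Rate from 0.0 to 1.0 how well the candidate answer matches the expected "
    "answer in meaning. Reply with only the number.\n\n"
    "Expected: {expected}\nCandidate: {candidate}"
)

def judge(expected: str, candidate: str,
          base_url: str = "http://localhost:11434/v1",  # placeholder endpoint
          model: str = "llama3") -> float:              # placeholder model
    client = OpenAI(base_url=base_url, api_key="not-needed")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            expected=expected, candidate=candidate)}],
        temperature=0,
    )
    # naive parsing; a real implementation should guard against
    # non-numeric replies from the judge model
    return float(reply.choices[0].message.content.strip())
```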

@lgabs (Collaborator) commented Apr 7, 2024

Also, I think it would be a good idea for us to use the same common dataset for local dev and for tests that depend on a dataset (generating embeddings, or even these LLM evals). One idea is to download one from Hugging Face, like this wiki_qa. What do you think? This would be a new issue, of course.

These LLM evals change a lot depending on the domain, so I think we could just write good documentation on how to add these tests rather than trying to write generic cases for all datasets.
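A sketch of what pulling wiki_qa from Hugging Face for shared local-dev/eval fixtures could look like; the dataset id, split name, and column names come from the dataset card and should be double-checked:

```python
# label == 1 marks answers annotated as correct in wiki_qa.
from datasets import load_dataset

wiki_qa = load_dataset("wiki_qa", split="validation")
positives = wiki_qa.filter(lambda row: row["label"] == 1)

for row in positives.select(range(3)):
    print(row["question"], "->", row["answer"])
```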

@vmesel (Member, Author) commented Apr 7, 2024 via email
