Hi, thanks for the great open-source model! I am trying to reproduce the MATH benchmark, but I currently only reach 50.9% (averaged over 10 runs) instead of the 51.9% reported by the official Llama evaluation. Is a difference of this size expected, or am I doing something wrong? In particular, it would be great to know the exact prompts and templates used to evaluate MATH; the sketch below shows what I am currently doing.
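For reference, here is a minimal sketch of the 0-shot CoT prompt and answer extraction I am currently using. The instruction wording and the `\boxed{}` extraction regex are my own guesses, not taken from any official Llama eval script, so please correct anything that differs from your setup:

```python
# Minimal sketch of my current MATH evaluation prompt and answer extraction.
# The instruction text and the \boxed{} regex are my own assumptions, not the
# official Llama evaluation template.
import re

def build_zero_shot_cot_prompt(problem: str) -> str:
    """Format a single MATH problem as a 0-shot chain-of-thought prompt."""
    return (
        "Solve the following math problem. Reason step by step, and put "
        "your final answer in \\boxed{}.\n\n"
        f"Problem: {problem}\n\nSolution:"
    )

def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output.

    Note: this simple regex does not handle nested braces inside \\boxed{}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

if __name__ == "__main__":
    print(build_zero_shot_cot_prompt("What is 2 + 2?"))
    print(extract_boxed_answer(r"Step 1: add. The answer is \boxed{4}."))
```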
I also notice a slight inconsistency between the Llama docs: https://github.com/meta-llama/llama3/blob/main/eval_details.md says "4-shot", while the table at https://ai.meta.com/blog/meta-llama-3-1/ says "0-shot CoT". Which setting do the reported numbers use, 4-shot or 0-shot?