Commit

revised scoring results without truncations
yuchenlin committed Jun 26, 2024
1 parent 4ffcbde commit 88704b7
Showing 54 changed files with 1,023,115 additions and 8 deletions.
5 changes: 3 additions & 2 deletions .gitignore
@@ -1,4 +1,5 @@
!evaluation/results_v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=gpt-4-turbo-2024-04-09/*.json
!eval_results/v2.0625/score.v2
local_scripts/
# Byte-compiled / optimized / DLL files
__pycache__/
@@ -203,5 +204,5 @@ evaluation/eval_template.no_checklist.md
*.768-1024.json

result_dirs/
*.instant.json
*.instant.json
eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/_merged_0625_truncation.json
2 changes: 1 addition & 1 deletion README.md
@@ -166,7 +166,7 @@ To analyze the correlation between WildBench (v2) and human evaluation, we consi

- [ ] LLM360/K2-Chat
- [x] DeepSeek-V2-Code
- [ ] Yi-large-preview
- [x] Yi-large-preview
- [x] THUDM/glm-4-9b-chat
- [x] chujiezheng/neo_7b_instruct_v0.1-ExPO
- [x] ZhangShenao/SELM-Llama-3-8B-Instruct-iter-3
20,464 changes: 20,464 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Hermes-2-Theta-Llama-3-8B.json

Large diffs are not rendered by default.

20,482 changes: 20,482 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Llama-2-70b-chat-hf.json

Large diffs are not rendered by default.

20,444 changes: 20,444 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Llama-2-7b-chat-hf.json

Large diffs are not rendered by default.

20,463 changes: 20,463 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Llama-3-Instruct-8B-SimPO.json

Large diffs are not rendered by default.

20,463 changes: 20,463 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Meta-Llama-3-70B-Instruct.json

Large diffs are not rendered by default.

20,463 changes: 20,463 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Meta-Llama-3-8B-Instruct.json

Large diffs are not rendered by default.

20,462 changes: 20,462 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Mistral-7B-Instruct-v0.2.json

Large diffs are not rendered by default.

20,443 changes: 20,443 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Mixtral-8x7B-Instruct-v0.1.json

Large diffs are not rendered by default.

20,463 changes: 20,463 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Phi-3-medium-128k-instruct.json

Large diffs are not rendered by default.

20,442 changes: 20,442 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Phi-3-mini-128k-instruct.json

Large diffs are not rendered by default.

20,423 changes: 20,423 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Qwen1.5-72B-Chat-greedy.json

Large diffs are not rendered by default.

20,462 changes: 20,462 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/[email protected]

Large diffs are not rendered by default.

20,484 changes: 20,484 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Qwen2-72B-Instruct.json

Large diffs are not rendered by default.

20,482 changes: 20,482 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/SELM-Zephyr-7B-iter-3.json

Large diffs are not rendered by default.

20,464 changes: 20,464 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Starling-LM-7B-beta-ExPO.json

Large diffs are not rendered by default.

20,462 changes: 20,462 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Starling-LM-7B-beta.json

Large diffs are not rendered by default.

20,462 changes: 20,462 additions & 0 deletions eval_results/v2.0625/score.v2/eval=gpt-4o-2024-05-13/Yi-1.5-34B-Chat.json

Large diffs are not rendered by default.
