TREC AutoJudge (Meta-) Evaluation & Leaderboard




This repository contains the code used for evaluation and approaches for the TREC AutoJudge shared tasks.

  • trec25: the AutoJudge Pilot at TREC 2025 (work in progress)
  • trec26: the upcoming iteration at TREC 2026 (work in progress)

We are currently developing a step-by-step guide on how to submit at documentation/README.md (work in progress).

What is TREC AutoJudge?

TREC AutoJudge offers the first rigorous, cross-task benchmark for Large Language Model (LLM) judges.

LLM judges have emerged as a pragmatic solution when manual relevance assessment is costly or infeasible. However, recent studies reveal wide variation in their accuracy across tasks, prompts, and model sizes.

Currently, shared task organizers choose an LLM judge per track ad hoc, risking inconsistent baselines and hidden biases.

AutoJudge provides a testbed for comparing different LLM judging approaches across several tasks and for correlating the results with manually created relevance judgments. It also serves as a testbed for studying emerging evaluation approaches, vulnerabilities of LLM judges, and the efficacy of safeguards against those vulnerabilities.

This AutoJudge evaluation code standardizes data handling and evaluation across multiple shared tasks/TREC tracks that rely on LLM judging and provides a centralized, comparative evaluation of LLM judges under realistic conditions.

What is this code for?

This project provides a means to evaluate AutoJudge approaches and to produce a system ranking / leaderboard.

It will be used by TREC AutoJudge coordinators to score submissions. We encourage prospective participants to run this locally for method development.

This code will handle obtaining data sets (akin to ir_datasets), input/output and format conversions, and evaluation measures.
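
To make this concrete, here is a purely hypothetical Python sketch of the flow such an evaluation follows: read query/document pairs, let an automatic judge label them, and compare the labels against manually created judgments. None of the names below come from trec_auto_judge; the toy judge and the agreement measure are illustrative placeholders only.

# Purely illustrative sketch (not the trec_auto_judge API): it mimics the flow of
# (1) loading query/document pairs, (2) producing automatic relevance judgments,
# and (3) comparing them against manually created judgments.

def toy_llm_judge(query: str, doc: str) -> int:
    """Stand-in for an LLM judge: relevant (1) iff the document mentions the query."""
    return int(query.lower() in doc.lower())

def agreement(auto: dict, manual: dict) -> float:
    """Fraction of shared (query_id, doc_id) pairs with identical labels."""
    shared = auto.keys() & manual.keys()
    return sum(auto[k] == manual[k] for k in shared) / max(len(shared), 1)

if __name__ == "__main__":
    # (query_id, doc_id, document text) triples standing in for a real data set
    pairs = [("q1", "d1", "TREC evaluates retrieval systems."),
             ("q1", "d2", "Cooking recipes for beginners.")]
    auto = {(q, d): toy_llm_judge("TREC", text) for q, d, text in pairs}
    manual = {("q1", "d1"): 1, ("q1", "d2"): 0}  # manually created judgments
    print(f"agreement with manual judgments: {agreement(auto, manual):.2f}")

In the actual evaluation, the simple agreement measure above is replaced by the tracks' correlation of results against the manually created relevance judgments.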

Code Setup

Purpose

Initial code is in the trec_auto_judge directory (still in a very early brainstorming phase).

Installation

You can install the early prototype via:

pip3 install git+https://github.com/trec-auto-judge/auto-judge-code.git
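
As a quick sanity check after installation, you can verify that the package imports. This assumes the installed distribution exposes a module named trec_auto_judge, matching the source directory mentioned above:

# Sanity check: assumes the installed package keeps the trec_auto_judge module name.
import importlib

module = importlib.import_module("trec_auto_judge")
print("trec_auto_judge imported from:", getattr(module, "__file__", module))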

Rationale

After installing the code base, you will have to customize it for your own AutoJudge approach.

Command Line Usage

After the installation above, you can get an overview of the available commands via:

trec-auto-judge --help
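
If you want to script that check, the help screen can also be invoked from Python; only the --help flag mentioned above is assumed to exist:

# Call the trec-auto-judge CLI from Python; only the documented --help flag is assumed.
import subprocess

result = subprocess.run(["trec-auto-judge", "--help"],
                        capture_output=True, text=True, check=True)
print(result.stdout)  # overview of the available commands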

Developer Section

Unit Tests

Run unit tests via:

PYTHONPATH=. pytest tests

Create the coverage badge (TODO: add this to CI):

PYTHONPATH=. python3 -m pytest --cov-report term --cov=trec_auto_judge tests
coverage report --data-file=.coverage > test-coverage
coverage-badge -o tests/coverage.svg
