
Conversation

@DominikRemo DominikRemo commented Feb 11, 2026

Motivation and Context

The current evaluation workflow treats expert feedback and LLM-based feedback evaluation separately, making direct comparison difficult. We needed a robust way to:

  • Consolidate split expert evaluation configurations into a single source of truth.
  • De-pseudonymize expert results automatically (mapping blinded labels back to "LLM", "Cofee", "Tutor", etc.).
  • Run "LLM-as-a-Judge" evaluations using the exact same metrics as human experts.
  • Export all results (Human vs. LLM) into a single dataset for analysis.

Description

This PR significantly enhances the evaluation pipeline across steps 4, 5, and 6:

  • Expert Evaluation (4_expert_evaluation.ipynb):
    • Added create_common_config: Merges multiple split evaluation configs into a single common_evaluation_config.json.
    • Added De-pseudonymization: Automatically resolves randomized feedback types (e.g., "A", "B") and metric IDs in progress files back to their original names, ensuring consistent data for analysis. A short sketch of this mapping follows after this list.
    • Outputs cleaned data to data/4_expert_evaluation/output_depseudonymized.
  • LLM as a Judge (5_llm_as_a_judge.ipynb):
    • Metric Synchronization: Now loads metrics directly from common_evaluation_config.json instead of defining them separately (and potentially inconsistently). This ensures the LLM judges against exactly the same criteria as the human experts.
    • Generates evaluation results in the standardized "progress file" format, treating the LLM effectively as another expert evaluator.
  • Analysis Data Export (6_analysis_data_export.ipynb):
    • Integrated the new depseudonymized expert data and LLM evaluation results.
    • Explicitly registers the LLM as a "participant" in the dataset.
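
A minimal sketch of the de-pseudonymization idea referenced above. The mappings and the helper name here are illustrative; the real logic lives in json_service.resolve_feedback_types_and_metric_titles.

# Illustrative sketch only, not the actual implementation.
pseudonym_to_feedback_type = {"A": "LLM", "B": "Tutor"}                # hypothetical mapping
metric_ids_to_titles = {"m1": "Completeness", "m2": "Actionability"}   # hypothetical mapping

def resolve_ratings(ratings_by_pseudonym: dict) -> dict:
    """Map blinded feedback-type labels and metric IDs back to readable names."""
    resolved = {}
    for pseudonym, ratings in ratings_by_pseudonym.items():
        feedback_type = pseudonym_to_feedback_type[pseudonym]
        resolved[feedback_type] = {
            metric_ids_to_titles[metric_id]: score for metric_id, score in ratings.items()
        }
    return resolved

print(resolve_ratings({"A": {"m1": 4, "m2": 3}}))  # {'LLM': {'Completeness': 4, 'Actionability': 3}}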

Steps for Testing (local only)

  1. Expert Eval: Run 4_expert_evaluation.ipynb. Ensure common_evaluation_config.json is created and files in output_depseudonymized have resolved names (not IDs).
  2. LLM Eval: Run 5_llm_as_a_judge.ipynb. Select "Reuse Metrics from Expert Evaluation" and verify that it loads the metrics from step 4. Run the evaluation to generate the output JSON.
  3. Export: Run 6_analysis_data_export.ipynb. Verify it loads both the expert files and the LLM file, merges them, and exports an evaluation_data.csv where expert_id includes llm.
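
As a quick sanity check for step 3, a small pandas snippet can confirm the merged export. The path and the expert_id column are taken from the description above; treat them as assumptions if your local layout differs.

import pandas as pd

df = pd.read_csv("data/6_analysis/evaluation_data.csv")
assert "llm" in df["expert_id"].astype(str).unique(), "LLM rows missing from the export"
print(df["expert_id"].value_counts())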

Summary by CodeRabbit

  • New Features

    • LLM-as-a-judge workflow with batched request generation, structured evaluation processing, and JSON output.
  • Improvements

    • Unified expert-evaluation pipeline with de‑pseudonymization, shared metrics config, and an analysis data export that merges progress and participant info.
  • Bug Fixes / Removals

    • Removed legacy single-call evaluation path and simplified prompt/output formatting.
  • Chores

    • Kernel metadata and dependency updates; repository ignore rules refined.

@DominikRemo DominikRemo self-assigned this Feb 11, 2026
@DominikRemo DominikRemo requested a review from a team as a code owner February 11, 2026 12:39

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Adds utilities to discover and de‑pseudonymize expert evaluation files, an LLM-as-judge request/processing pipeline, new analysis export notebook, model and prompt adjustments, removal of the legacy llm_service, minor notebook kernel metadata updates, dependency additions, and .gitignore changes.

Changes

Cohort / File(s) Summary
Notebook Kernel Updates
athena/evaluation/0_database_setup.ipynb, athena/evaluation/1_sampling_exercises.ipynb, athena/evaluation/2_sampling_submissions.ipynb, athena/evaluation/3_feedback_generation.ipynb, athena/evaluation/5_llm_as_a_judge.ipynb
Notebook metadata updated: kernelspec display_name changed to .venv (from evaluation-py3.12) and related language metadata edits.
Expert Evaluation Notebook
athena/evaluation/4_expert_evaluation.ipynb
Major rework: switched input to data/4_expert_evaluation/input_pseudonymized; removed prior read/save flow; added discovery, common-config creation, and depseudonymization steps using new service functions; outputs to data/4_expert_evaluation/output_depseudonymized.
New Export Notebook
athena/evaluation/6_analysis_data_export.ipynb
New notebook that loads evaluation progress and common config, merges with participant info (optionally includes LLM participant), selects columns, and writes data/6_analysis/evaluation_data.csv.
Evaluation Model Additions
athena/evaluation/model/evaluation_model.py
Added MetricEvaluationRequest Pydantic model (prompt: list[BaseMessage], exercise_id, submission_id, feedback_type, metrics) and adjusted imports to use BaseMessage.
Domain Model Updates
athena/evaluation/model/model.py
Submission and Exercise constructors accept meta: Optional[dict] = None and initialize self.meta defensively.
Prompt Simplification
athena/evaluation/prompts/llm_evaluation_prompt.py
Removed PydanticOutputParser and all format_instructions/output-parser usage; prompt formatting no longer injects parser instructions.
JSON & Utility Service
athena/evaluation/service/json_service.py
Large additions: secure zip extraction and file discovery (get_evaluation_files), create_common_config, resolve_feedback_types_and_metric_titles, load_evaluation_progress, load_common_evaluation_config; updated group_exercise_data signature to accept Optional[str].
New LLM-as-Judge Service
athena/evaluation/service/llm_as_a_judge_service.py
New module with generate_evaluation_requests(...) (builds MetricEvaluationRequest list) and process_feedback_evaluations(...) (invokes model, validates metric titles, accumulates results and cost, writes JSON output).
Removed Legacy Service
athena/evaluation/service/llm_service.py
Deleted legacy file and evaluate_feedback_with_model; logic replaced by new llm_as_a_judge_service.
Dependencies
athena/evaluation/pyproject.toml
Added dependencies: ipykernel = "7.2.0" and tqdm = "4.67.3".
Gitignore Update
athena/evaluation/.gitignore
Replaced flat data/* rule with recursive data/**/* and re-included directories and .gitkeep via negations.
Public API Surface Changes
athena/evaluation/prompts/..., athena/evaluation/model/..., athena/evaluation/service/...
New public functions generate_evaluation_requests, process_feedback_evaluations, added MetricEvaluationRequest model, updated AzureChatOpenAI constructor parameter usage, removed evaluate_feedback_with_model.

Sequence Diagram(s)

sequenceDiagram
    participant Notebooks as Notebooks/Orchestration
    participant JSONSvc as json_service
    participant JudgeSvc as llm_as_a_judge_service
    participant Azure as AzureOpenAI

    rect rgba(200,200,255,0.5)
    Notebooks->>JSONSvc: get_evaluation_files / load_common_evaluation_config / load_evaluation_progress
    JSONSvc-->>Notebooks: evaluation files, common config, progress dataframe
    end

    rect rgba(200,255,200,0.5)
    Notebooks->>JudgeSvc: generate_evaluation_requests(data, metrics, filter?)
    JudgeSvc-->>Notebooks: List[MetricEvaluationRequest]
    Notebooks->>JudgeSvc: process_feedback_evaluations(requests, output_path, model, metrics)
    end

    rect rgba(255,200,200,0.5)
    JudgeSvc->>Azure: invoke model per request (prompts)
    Azure-->>JudgeSvc: responses (evaluations)
    JudgeSvc->>JudgeSvc: validate metrics, accumulate results, track cost
    JudgeSvc-->>Notebooks: write evaluation_progress_llm-as-a-judge.json
    end
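
Assuming the call shapes shown in the diagram (module paths, return values, and argument order are not verified against the code), a notebook cell could drive the pipeline roughly like this:

# Sketch only: function names and parameters are taken from the diagram above; paths,
# return shapes, and the earlier definition of `metrics` and `model` are assumptions.
from service.json_service import get_evaluation_files, load_common_evaluation_config
from service.llm_as_a_judge_service import generate_evaluation_requests, process_feedback_evaluations

# Discover the pseudonymized expert files and the shared metric definitions.
config_files, progress_files = get_evaluation_files("data/4_expert_evaluation/input_pseudonymized")
config_df = load_common_evaluation_config("data/4_expert_evaluation/common_evaluation_config.json")

# `metrics` and `model` (an AzureChatOpenAI instance) are assumed to be defined in earlier cells.
requests = generate_evaluation_requests(config_df, metrics)
process_feedback_evaluations(requests, "data/5_llm_evaluation", model, metrics)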

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through configs and prompts in the night,
I stitched files and metrics till they looked just right,
Old service slept; new requests took flight,
JSON and costs counted under moonlight,
Hooray — evaluations ready, carrot-bright! 🥕✨

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25%, which is insufficient; the required threshold is 80.00%. | Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check | ✅ Passed | The title clearly and specifically describes the primary changes: unifying expert and LLM evaluation pipelines with integrated data export for analysis, which aligns with the major notebook updates and new service functions.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
athena/evaluation/service/llm_as_a_judge_service.py (2)

75-88: Hoist model.with_structured_output(MetricEvaluations) outside the loop.

model.with_structured_output(MetricEvaluations) creates a new runnable wrapper on every iteration. Since the schema doesn't change, build it once before the loop.

♻️ Proposed fix
+    structured_model = model.with_structured_output(MetricEvaluations)
+
     progress_bar = tqdm(requests, desc="Processing")
 
     for request in progress_bar:
         evaluation_progress["has_started_evaluating"] = True
         try:
             with get_openai_callback() as cb:
-                metric_evaluations = model.with_structured_output(MetricEvaluations).invoke(
+                metric_evaluations = structured_model.invoke(
                     request.prompt, max_tokens=100, temperature=0
                 )

55-60: Consider using requests[0].metrics instead of a separate metrics parameter.

Each MetricEvaluationRequest already carries its own metrics list. The separate metrics parameter on process_feedback_evaluations is redundant and could drift out of sync with what's in the requests. If all requests share the same metrics, you could derive expected_metric_titles from request.metrics inside the loop instead.
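
A sketch of that derivation; the field names request.metrics and metric.title are assumptions based on the walkthrough, not verified against the model definition.

def expected_titles(request) -> set[str]:
    # Derive the expected titles from the request itself instead of a separate `metrics` argument.
    return {metric.title for metric in request.metrics}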

athena/evaluation/4_expert_evaluation.ipynb (1)

115-131: Unnecessary f-string prefix on Line 127.

f"Created Common Evaluation Config With:" has no placeholders. Use a plain string.

♻️ Proposed fix
-    "print(f\"Created Common Evaluation Config With:\")\n",
+    "print(\"Created Common Evaluation Config With:\")\n",
athena/evaluation/5_llm_as_a_judge.ipynb (1)

107-127: Both "Define New Metrics" and "Reuse Metrics" cells assign to metrics — add a usage note.

Both cells (lines 87-97 and 113-127) write to the same metrics variable. If a user runs both cells sequentially, the second silently overwrites the first. This is a common notebook pattern, but consider adding a brief markdown note (e.g., "Run one of the two cells below") in the "Define Metrics" section (line 69-70) to make the mutual exclusivity explicit.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.4)
athena/evaluation/1_sampling_exercises.ipynb
athena/evaluation/0_database_setup.ipynb
athena/evaluation/2_sampling_submissions.ipynb
  • 9 others


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
athena/evaluation/prompts/llm_evaluation_prompt.py (1)

134-149: ⚠️ Potential issue | 🟡 Minor

grading_instruction_feedback can become None, rendering as literal "None" in the description string.

When feedback.structured_grading_instruction_id is truthy but the ID isn't found in the exercise's grading criteria, grading_instructions.get(...) returns None, causing the ternary on line 145 to assign None to grading_instruction_feedback — overwriting the "" default from line 134. The f-string on line 149 then produces "None\n{feedback.description}".

Proposed fix
             grading_instruction_feedback = (
-                grading_instruction.feedback + ": " if grading_instruction else None
+                grading_instruction.feedback + ": " if grading_instruction else ""
             )
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name can be None if env var is unset, causing AttributeError on .replace().

os.getenv("LLM_EVALUATION_MODEL") returns None when the variable isn't set. Calling .replace() on None at line 54 will crash. Consider adding a guard or using a default value.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"LLM_EVALUATION_MODEL environment variable is not set.\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/4_expert_evaluation.ipynb`:
- Around line 128-130: The default for the metrics field is wrong: replace
occurrences of common_evaluation_config.get("metrics", {}) with
common_evaluation_config.get("metrics", []) so the default is a list not a dict;
update both places referenced in the diff (the print statement using len(...)
and the later use at line ~152) and ensure this matches how create_common_config
constructs metrics.

In `@athena/evaluation/5_llm_as_a_judge.ipynb`:
- Around line 193-196: The call random.sample(requests, 10) can raise ValueError
when requests has fewer than 10 items; change the sampling to use
random.sample(requests, k) with k = min(10, len(requests)) so it gracefully
handles small lists (locate the sampling line using the symbol random.sample and
the requests variable).

In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 79-96: The notebook kernel metadata lists "version": "3.11.9"
which is inconsistent with the project target of Python 3.12; update the
notebook's metadata block (the "kernelspec"/"language_info" fields) to reflect
Python 3.12 by changing "version": "3.11.9" to "3.12" and optionally update
"kernelspec.display_name" to "Python 3.12" and "kernelspec.name" to a matching
identifier (e.g., "python3.12") so the notebook metadata aligns with
pyproject.toml and other notebooks.
- Around line 24-34: The code defines llm_evaluation_progress_path but never
uses it, so load_evaluation_progress(evaluation_progress_path, None) omits LLM
data; update the call to pass the llm_evaluation_progress_path variable as the
second argument to load_evaluation_progress so that both expert
(evaluation_progress_path) and LLM (llm_evaluation_progress_path) progress files
are loaded and merged by that function.

In `@athena/evaluation/model/model.py`:
- Line 62: The constructor parameter using a mutable default (meta: dict = {})
in Submission.__init__ should be changed to use None as the default and
initialize a new dict inside the method to avoid shared state; update the
signature to use meta: Optional[dict] = None and then set self.meta = {} if meta
is None else dict(meta) (or similar) within Submission.__init__; apply the same
change to Exercise.__init__ (the similar meta/default pattern noted at line
~140) to prevent shared mutable defaults across instances.
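
The standard pattern for that fix, shown on an illustrative class rather than the actual Submission/Exercise definitions:

from typing import Optional

class Submission:
    def __init__(self, meta: Optional[dict] = None):
        # Fresh dict per instance; copying also guards against callers mutating their own dict later.
        self.meta = {} if meta is None else dict(meta)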

In `@athena/evaluation/pyproject.toml`:
- Around line 15-20: The project declares several Python deps but omits tqdm
which is imported by athena.evaluation.service.llm_as_a_judge_service (import
tqdm on top of that module), causing ImportError for consumers; fix by adding a
stable tqdm entry to pyproject.toml dependencies (e.g., a compatible caret
version like "^4.65.0" or similar) so the package manager installs tqdm
alongside the other libs.

In `@athena/evaluation/service/json_service.py`:
- Around line 616-621: The zip extraction loop using zipfile.ZipFile(...) and
zip_ref.extractall(data_dir) is vulnerable to ZipSlip; update the extraction to
iterate over zip_ref.infolist() (or .namelist()), validate each member path by
joining with data_dir and ensuring the normalized absolute path starts with the
intended data_dir prefix, and only then extract that member (skip or raise on
invalid paths) rather than calling extractall; apply this change where files is
iterated and file.endswith(".zip") is checked to securely extract archives. A
minimal sketch of this pattern follows after this list.
- Around line 660-670: The while-loop resolving transitive mappings can
infinite-loop if mappings become self-referential (e.g., A→A); add a guard by
tracking iterations or detecting self-mappings: introduce a max_iterations
(e.g., len(mappings) or 1000) and increment each loop, breaking when exceeded,
or after updating mappings check for any key where mappings[key] == key and
remove that key from the intersection (or set mappings[key] = None/remove the
entry) so it no longer appears in pseudonyms and feedback_types; apply this fix
around the variables mappings, pseudonyms, feedback_types, intersection and the
while-loop to ensure termination.
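
For the zip-extraction point above, a minimal ZipSlip-safe sketch; the helper name and paths are illustrative:

import os
import zipfile

def safe_extract(zip_path: str, data_dir: str) -> None:
    base = os.path.abspath(data_dir) + os.sep  # trailing separator prevents sibling-path bypasses
    with zipfile.ZipFile(zip_path) as zip_ref:
        for member in zip_ref.infolist():
            target = os.path.abspath(os.path.join(data_dir, member.filename))
            if not target.startswith(base):
                raise ValueError(f"Unsafe path in archive: {member.filename}")
            zip_ref.extract(member, data_dir)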

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 92-101: The current loop breaks on the first validation failure
which abandons the remaining requests; instead, when evaluated_metric_titles !=
expected_metric_titles or the LLM returns an unexpected format, log the error
(including evaluated_metric_titles, expected_metric_titles and request) and use
continue to skip just that request so processing continues for the rest, and
ensure is_finished_evaluating is set to False for any skipped request so the
output indicates incomplete evaluation; update the logic around
evaluated_metric_titles/expected_metric_titles and the else branch to perform
logging + continue (or at minimum set is_finished_evaluating = False before any
early exit) rather than break.
- Around line 138-142: Remove unnecessary f-string prefixes used where there are
no placeholders: change the filename construction that uses
f"evaluation_progress_llm-as-a-judge.json" when building file_path (the
os.path.join call that assigns file_path) to a plain string, and change the
print call print(f"\nSummary of Evaluation:") to print("\nSummary of
Evaluation:"). Keep the rest of the logic (writing evaluation_progress with
json.dump and variable names file_path and evaluation_progress) unchanged.
- Around line 67-140: In process_feedback_evaluations, evaluation_progress is
only set inside the for-loop which causes a NameError when requests is empty;
initialize evaluation_progress before the loop (e.g., set
current_submission_index/current_exercise_index to None, selected_values to {}
and has_started_evaluating/is_finished_evaluating appropriately) so the variable
always exists, then update it inside the loop as currently done; ensure the
final json dump writes that preinitialized evaluation_progress when no
iterations occur.

In `@athena/evaluation/service/plot_service.py`:
- Around line 30-31: Several functions in
athena/evaluation/service/plot_service.py call plt.show() before plt.savefig(),
which causes saved files to be blank; locate the three occurrences of the
plt.show() / plt.savefig() pair (the end of each plotting function in this
module) and swap the calls so plt.savefig(os.path.join(plot_path, filename)) is
executed before plt.show(); apply the same swap in the other two functions that
have the identical pair so all saved plots are written before being displayed.
- Around line 8-31: The plot_boxplot function currently draws on the global
matplotlib state so successive calls overlay plots; fix it by explicitly
creating a new figure/axes at the start (e.g., call plt.figure() or use fig, ax
= plt.subplots()) before calling sns.boxplot and by closing the figure at the
end (e.g., plt.close(fig) or plt.close()) after saving; update plot_boxplot (and
mirror the same pattern in the other plotting functions) so all plotting,
layout, grid, savefig, show, and close operations are performed on the newly
created figure/axes to avoid bleed between calls.
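
A sketch combining both plot_service points (explicit figure, save before show, close afterwards); the parameter names are illustrative rather than the module's actual signature:

import os
import matplotlib.pyplot as plt
import seaborn as sns

def plot_boxplot(data, x, y, hue, plot_path, filename):
    fig, ax = plt.subplots()                          # draw on a fresh figure, not global state
    sns.boxplot(data=data, x=x, y=y, hue=hue, ax=ax)
    ax.grid(True)
    fig.tight_layout()
    fig.savefig(os.path.join(plot_path, filename))    # save before show so the file is not blank
    plt.show()
    plt.close(fig)                                    # avoid bleed into the next plot
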
🧹 Nitpick comments (4)
athena/evaluation/service/plot_service.py (1)

34-83: plot_feedback_type_metric and plot_metric_feedback_type are near-duplicates.

They differ only in which column is x vs hue and the legend/filename labels. Consider delegating to plot_boxplot (which already accepts all these parameters) to eliminate the duplication.

athena/evaluation/service/json_service.py (2)

617-631: Inconsistent path construction: f-strings vs os.path.join.

Lines 620, 629, and 631 use f-string interpolation (f"{data_dir}/{file}") while most of the rest of this file uses os.path.join. Prefer os.path.join throughout for cross-platform safety.


735-735: Function signature uses str | None alongside Optional[str] in the same file.

Line 735 uses str | None for llm_evaluation_progress_path while other functions (e.g., group_exercise_data at Line 56) use Optional[str]. Pick one style for consistency.

athena/evaluation/service/llm_as_a_judge_service.py (1)

129-135: evaluation_progress is redundantly rebuilt on every iteration.

This dict is reassigned in every loop iteration but only the final value matters (after the loop). Move it outside the loop (as part of the NameError fix above) and update has_started_evaluating / is_finished_evaluating once after the loop completes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
athena/evaluation/1_sampling_exercises.ipynb (1)

102-103: ⚠️ Potential issue | 🟡 Minor

Missing pandas import for the commented-out read_csv hint.

The commented line references pd.read_csv, but pandas is never imported (as pd or otherwise) in this notebook. If a user uncomments it, they'll get a NameError.

Proposed fix
-    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
+    "# import pandas as pd\n",
+    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name.replace(...) will raise AttributeError if the env var is unset.

If LLM_EVALUATION_MODEL is not configured in .env, os.getenv returns None and the .replace() call on Line 54 crashes. Consider adding a guard or using a default.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
 "\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
 "\n",
 "model = AzureChatOpenAI(\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 71-74: The columns exercise, submission, and feedback in the
DataFrame (populated by load_common_evaluation_config) contain nested dicts that
will be written as Python reprs by df.to_csv and are hard to parse; before
calling df.to_csv (the line using analysis_output_path and
"evaluation_data.csv"), either flatten the dicts into primitive columns (e.g.,
exercise_id, exercise_title, submission_timestamp, feedback_text) or
JSON-serialize those three columns (use json.dumps on each cell) so the CSV
contains valid JSON strings, then write the resulting DataFrame with df.to_csv.

In `@athena/evaluation/service/json_service.py`:
- Around line 654-658: The current aggressive sanitization in the metrics
collection loop (the metric["title"] assignment) removes useful punctuation like
hyphens and parentheses; change the filter to only remove control/invisible
characters instead of stripping most punctuation—import unicodedata and replace
the comprehension with one that keeps characters whose Unicode category does not
start with 'C' (control) or, alternatively, allow a small whitelist of
punctuation (e.g., "-.()/_:") while trimming whitespace; update the
metric["title"] assignment inside the loop that iterates over
config_data.get("metrics", []) and ensure the new logic preserves meaningful
characters while still trimming surrounding spaces.
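
A sketch of that gentler sanitization (drop only characters in the Unicode "C*" categories, then trim); the helper name is illustrative:

import unicodedata

def clean_title(title: str) -> str:
    # Keep hyphens, parentheses, etc.; remove only control/format and other "C*"-category characters.
    return "".join(ch for ch in title if not unicodedata.category(ch).startswith("C")).strip()

print(clean_title("Actionability (1-5)\u200b"))  # -> 'Actionability (1-5)'
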
🧹 Nitpick comments (6)
athena/evaluation/.gitignore (1)

1-3: Consider adding a comment to document the recursive ignore pattern.

The recursive pattern correctly handles nested directories and .gitkeep files, preserving the directory structure (e.g., data/4_expert_evaluation/input_pseudonymized/.gitkeep, data/6_analysis/.gitkeep). However, the pattern is not immediately obvious and adding documentation would improve maintainability for future contributors.

Suggested documentation
+# Ignore all files recursively under data/, but keep directory structure via .gitkeep
 data/**/*
 !data/**/
 !data/**/.gitkeep
athena/evaluation/service/json_service.py (3)

610-637: Inconsistent path construction: mix of os.path.join and f-string with /.

Lines 620 and 623 correctly use os.path.join, but Lines 633 and 635 use f"{data_dir}/{file}". This is inconsistent and can break on Windows.

Proposed fix
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

715-743: Misleading variable names obscure the data structure.

In the nested loops, feedback_id is actually a feedback-type pseudonym and ratings is a dict of {metric_id: score}. Similarly, the outer submission variable holds per-submission data keyed by submission IDs. Consider renaming for clarity:

Suggested renames
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
-                    resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
-                        if feedback_id not in pseudonym_to_feedback_type:
-                            raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
+            for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+                for submission_id, feedbacks_by_type in submissions.items():
+                    resolved_feedback = {}
+                    for pseudonym, ratings in feedbacks_by_type.items():
+                        if pseudonym not in pseudonym_to_feedback_type:
+                            raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")

794-819: Store nested dicts as proper JSON strings or flatten the structure before DataFrame creation.

The exercise, submission, and feedback columns are exported to CSV by the downstream analysis notebook (6_analysis_data_export.ipynb). Pandas will serialize these nested dicts as string representations (e.g., "{'id': ...}") rather than valid JSON, making them fragile and difficult to parse. Either flatten these columns into separate DataFrame columns, or serialize them with json.dumps() before storing in the DataFrame.

athena/evaluation/6_analysis_data_export.ipynb (1)

39-53: LLM participant row is always appended even when LLM data is not loaded.

If llm_evaluation_progress_path remains None, no LLM scores exist in df, but the LLM participant row is unconditionally added to participant_info_df. After the left merge on expert_id, this won't cause incorrect data (the LLM row just won't match anything), but it's misleading. Consider guarding the append:

Proposed fix
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
+"if llm_evaluation_progress_path:\n",
 "participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",

(with appropriate indentation)

athena/evaluation/4_expert_evaluation.ipynb (1)

148-162: count variable is incremented but never used.

The count variable on Lines 155 and 157 is dead code. Consider either printing it after the loop or removing it.

Also, os.makedirs(output_dir, exist_ok=True) on Line 159 is redundant inside the loop — output_dir was already created in the previous cell (Line 123).

Proposed fix
-"count = 0\n",
 "for progress_file, progress_data in resolve_feedback_types_and_metric_titles(evaluation_progress_files, metric_ids_to_titles, pseudonym_to_feedback_type).items():\n",
-"    count += 1\n",
 "    output_progress_path = f\"{output_dir}/{os.path.basename(progress_file)}\"\n",
-"    os.makedirs(output_dir, exist_ok=True)\n",
 "    with open(output_progress_path, \"w\") as output_file:\n",
-"        json.dump(progress_data, output_file, indent=4)"
+"        json.dump(progress_data, output_file, indent=4)\n",
+"\n",
+"print(f\"Resolved {len(evaluation_progress_files)} progress files.\")"

Updated expert evaluation notebook to support multiple config/progress files, added steps for depseudonymization, and improved error handling. Expanded LLM-as-a-judge notebook with options to reuse expert metrics, improved output, and clarified evaluation steps. Added new analysis notebooks and introduced plot service for visualization. Updated dependencies and refactored model and service code to support new workflow.
@DominikRemo DominikRemo force-pushed the athena/evaluation/notebooks/llm-as-a-judge branch from 863c57f to 0e0520b on February 11, 2026 at 16:30
@github-actions

Athena Test Results Summary

Tests | Ran | Passed ✅ | Skipped | Failed
Athena Test Report | 10 | 10 | 0 | 0

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package | Line Rate | Branch Rate | Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name will be None if the env var is unset, causing AttributeError on .replace().

If LLM_EVALUATION_MODEL is not set in the .env file, os.getenv(...) returns None, and model_name.replace("azure_openai_", "") on line 54 will raise AttributeError: 'NoneType' object has no attribute 'replace'. Consider adding a guard or a clear error message.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"Environment variable LLM_EVALUATION_MODEL is not set.\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 788-791: The current return uses
pd.DataFrame.from_records(records).astype({'exercise_id': 'int64',
'submission_id': 'int64'}) which will KeyError on an empty DataFrame with no
columns; modify the code to first build the DataFrame (df =
pd.DataFrame.from_records(records)) and if df.empty (or if not records) return
an explicitly typed empty DataFrame (e.g. pd.DataFrame({'exercise_id':
pd.Series(dtype='int64'), 'submission_id': pd.Series(dtype='int64')}), otherwise
perform df.astype({'exercise_id': 'int64', 'submission_id': 'int64'}) and return
the result to avoid the KeyError.
- Around line 620-625: The ZipSlip check around extracting zip entries is
vulnerable because target_path.startswith(os.path.abspath(data_dir)) can be
fooled by sibling paths; update the extraction logic in the block using
zipfile.ZipFile (where target_path and data_dir are computed and zip_ref.extract
is called) to compute a base path with a trailing separator (e.g., base_path =
os.path.abspath(data_dir) + os.sep) and then verify
target_path.startswith(base_path) before calling zip_ref.extract; this ensures
entries like "/app/data_evil/..." do not pass the check.
🧹 Nitpick comments (7)
athena/evaluation/pyproject.toml (1)

18-19: Nit: Inconsistent version pinning for tqdm.

All other dependencies in this file use exact version pins (e.g., "7.1.0"), but tqdm uses a caret range ("^4.65.0"). Consider pinning it exactly for consistency and reproducibility, e.g., "4.67.1".

athena/evaluation/service/json_service.py (4)

627-636: Inconsistent path construction: f-string interpolation vs os.path.join.

Lines 633 and 635 use f"{data_dir}/{file}" while the zip extraction above correctly uses os.path.join. This can break on Windows where the separator is \.

Proposed fix
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

724-742: Confusing variable names obscure the nested structure.

The loop variable submission on line 728 actually represents per-exercise data (a dict of submissions), and feedback on line 729 represents per-submission data (a dict of feedback types). This makes the code hard to follow.

Suggested rename
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
+            for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+                for submission_id, feedback_types in submissions.items():
                     resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
+                    for feedback_id, ratings in feedback_types.items():
                         if feedback_id not in pseudonym_to_feedback_type:
                             raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
                         resolved_ratings = {}
                         for metric_id, score in ratings.items():
                             if metric_id not in metric_ids_to_titles:
                                 raise ValueError(f"Metric ID {metric_id} not found in metrics.")
                             title = metric_ids_to_titles[metric_id]
                             resolved_ratings[title] = score
                         resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-                    submission[submission_id] = resolved_feedback
+                    submissions[submission_id] = resolved_feedback

757-761: Expert ID extraction is fragile if the filename contains extra underscores or dots.

file.replace("evaluation_progress_", "").replace(".json", "") assumes a single occurrence of these substrings. A filename like evaluation_progress_expert_1.json.bak or evaluation_progress_A_B.json could produce unexpected IDs. Using re or splitting on a known pattern would be more robust, though the current approach works for the expected naming convention.


804-811: Direct dict key access (exercise['id'], submission['id']) will raise KeyError if the key is missing.

Other parts of this file use .get() with defaults for defensive access. Consider using .get("id") here too, or at least documenting that id is a required field so failures are intentional.

athena/evaluation/4_expert_evaluation.ipynb (1)

148-162: count variable is incremented but never used.

The count variable on lines 155/157 is accumulated but never printed or returned. Either remove it or add a summary print after the loop (e.g., print(f"Resolved {count} progress files.")).

Also, os.makedirs(output_dir, exist_ok=True) on line 159 is called inside the loop on every iteration. It's harmless due to exist_ok=True, but moving it before the loop would be cleaner — it's already called on line 123 in the previous cell anyway.

athena/evaluation/5_llm_as_a_judge.ipynb (1)

217-223: Confirm that the output path data/5_llm_evaluation/ produces a file that notebook 6 can consume.

Notebook 6 currently sets llm_evaluation_progress_path = None (line 25 of 6_analysis_data_export.ipynb). For the end-to-end pipeline to work, the user would need to update that path to point to the LLM output from this cell. Consider adding a comment here noting the expected output filename so the user knows what to reference in step 6.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 27-29: Fix the typo in the notebook comment: change "evalution
progress" to "evaluation progress" in the explanatory string that mentions the
5_llm_as_a_judge notebook and the stored path (the comment line referencing
"data/5_llm_evaluation/evaluation_progress_llm-as-a-judge.json"); leave the
variable llm_evaluation_progress_path as-is.
🧹 Nitpick comments (6)
athena/evaluation/6_analysis_data_export.ipynb (1)

18-57: LLM participant is always appended even when no LLM data is loaded.

Lines 44–55 unconditionally add an 'llm' row to participant_info_df, but llm_evaluation_progress_path defaults to None (line 29), so load_evaluation_progress won't produce any expert_id == 'llm' rows. After the left merge on expert_id (line 56), this phantom participant row won't appear in the final data, so there's no data corruption — but it's misleading. Consider guarding the concat:

Suggested guard
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
-"participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
+"if llm_evaluation_progress_path:\n",
+"    participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
athena/evaluation/4_expert_evaluation.ipynb (1)

142-160: os.makedirs on line 157 is called inside the loop — consider hoisting it.

os.makedirs(output_dir, exist_ok=True) is already called once at line 123. The repeated call inside the loop (line 157) is redundant. It's harmless due to exist_ok=True, but removing it keeps the loop body focused on file I/O.

athena/evaluation/service/json_service.py (4)

610-638: Mix of os.path.join and f-string path concatenation.

Lines 620, 624 correctly use os.path.join / os.path.abspath, but lines 634 and 636 revert to f-string concatenation (f"{data_dir}/{file}"). Use os.path.join consistently for cross-platform safety.

Suggested fix
         if file.startswith("evaluation_config_"):
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

716-744: Line 716: very long function signature — consider wrapping for readability.

The signature spans well beyond typical line-length limits. Also, the docstring is missing a Returns section.

Suggested formatting
-def resolve_feedback_types_and_metric_titles(evaluation_progress_files: list[str], metric_ids_to_titles: dict[str, str], pseudonym_to_feedback_type: dict[str, str]) -> dict[str, dict]:
+def resolve_feedback_types_and_metric_titles(
+    evaluation_progress_files: list[str],
+    metric_ids_to_titles: dict[str, str],
+    pseudonym_to_feedback_type: dict[str, str],
+) -> dict[str, dict]:

747-798: Type annotation style inconsistency and empty-DataFrame guard looks good.

The empty-DataFrame guard (lines 790–794) properly addresses the prior review concern.

Minor style nit: line 747 uses str | None (PEP 604 union syntax) while line 56 uses Optional[str]. Both are valid in Python 3.12+, but mixing styles within the same file reduces consistency.


2-8: import os.path on line 6 is redundant.

os is already imported on line 3, and os.path is automatically available as a submodule. The explicit import os.path can be removed.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 798-807: The empty-DataFrame branch currently returns only
exercise_id and submission_id causing KeyError downstream; change the
empty-return to construct and return a fully-typed DataFrame with the same
schema as the non-empty case (columns: expert_id, exercise_id, submission_id,
feedback_type, metric, score) using pd.DataFrame({ 'expert_id':
pd.Series(dtype='int64'), 'exercise_id': pd.Series(dtype='int64'),
'submission_id': pd.Series(dtype='int64'), 'feedback_type':
pd.Series(dtype='object'), 'metric': pd.Series(dtype='object'), 'score':
pd.Series(dtype='float64') }) so code that accesses df['expert_id'] etc. works;
keep the existing pd.DataFrame.from_records(records) and df.astype({...}) path
for non-empty results.
🧹 Nitpick comments (3)
athena/evaluation/service/json_service.py (3)

627-637: No file-extension filter on discovered evaluation files.

Lines 632 and 634 match any file whose name starts with evaluation_config_ or evaluation_progress_, regardless of extension. A stray file like evaluation_config_backup.txt or a .zip would be included and would later fail when json.load is called. Consider adding an .endswith(".json") guard.

Proposed fix
     for file in files:
-        if file.startswith("evaluation_config_"):
+        if file.startswith("evaluation_config_") and file.endswith(".json"):
             evaluation_config_files.append(os.path.join(data_dir, file))
-        elif file.startswith("evaluation_progress_"):
+        elif file.startswith("evaluation_progress_") and file.endswith(".json"):
             evaluation_progress_files.append(os.path.join(data_dir, file))

730-749: Misleading variable names obscure the nested data structure.

The variable names don't match the data they hold:

  • Line 734: submission is actually the exercise-level dict mapping submission_id → feedback_data, not a single submission.
  • Line 735: feedback is actually a dict of feedback_pseudonym → ratings, not a single feedback item.
  • Line 737: feedback_id is actually a pseudonym key.

Also, Line 747 mutates submission (the dict being iterated on Line 735) by reassigning submission[submission_id]. While updating an existing key during iteration is safe in CPython, it's a readability concern. Building a new dict and replacing after iteration would be clearer.

Proposed refactor for clarity
     for progress_file in evaluation_progress_files:
         with open(progress_file, "r") as file:
             progress_data = json.load(file)
             # De-pseudonymize feedbacks in progress file
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
-                    resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
-                        if feedback_id not in pseudonym_to_feedback_type:
-                            raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
-                        resolved_ratings = {}
-                        for metric_id, score in ratings.items():
-                            if metric_id not in metric_ids_to_titles:
-                                raise ValueError(f"Metric ID {metric_id} not found in metrics.")
-                            title = metric_ids_to_titles[metric_id]
-                            resolved_ratings[title] = score
-                        resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-                    submission[submission_id] = resolved_feedback
+            for exercise_id, submissions_data in progress_data.get("selected_values", {}).items():
+                resolved_submissions = {}
+                for submission_id, feedback_types in submissions_data.items():
+                    resolved_feedback = {}
+                    for pseudonym, ratings in feedback_types.items():
+                        if pseudonym not in pseudonym_to_feedback_type:
+                            raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")
+                        resolved_ratings = {}
+                        for metric_id, score in ratings.items():
+                            if metric_id not in metric_ids_to_titles:
+                                raise ValueError(f"Metric ID {metric_id} not found in metrics.")
+                            resolved_ratings[metric_ids_to_titles[metric_id]] = score
+                        resolved_feedback[pseudonym_to_feedback_type[pseudonym]] = resolved_ratings
+                    resolved_submissions[submission_id] = resolved_feedback
+                progress_data["selected_values"][exercise_id] = resolved_submissions
         progress[progress_file] = progress_data

664-681: Silent exit on unresolvable cyclic mappings.

If the while loop exhausts max_iterations without resolving all transitive mappings, execution continues silently with partially-resolved mappings. This will surface later as opaque ValueError("Pseudonym ... not found in mappings") exceptions in resolve_feedback_types_and_metric_titles or create_common_config's exercise-collection pass, making debugging difficult.

Consider raising an explicit error when the loop terminates without convergence:

Proposed fix
         iterations += 1
 
+    if len(intersection) > 0:
+        raise ValueError(
+            f"Could not resolve all transitive mappings after {max_iterations} iterations. "
+            f"Unresolved pseudonyms: {intersection}"
+        )
+
     for config_file in evaluation_config_files:

@ls1intum ls1intum deleted a comment from coderabbitai bot Feb 11, 2026
@ls1intum ls1intum deleted a comment from coderabbitai bot Feb 11, 2026

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@DominikRemo (Contributor Author) commented:

@coderabbitai resolve


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Comments resolved.

@DominikRemo (Contributor Author) commented:

@coderabbitai full review


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@DominikRemo (Contributor Author) commented:

@coderabbitai full review


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name will be None if the env var is missing, causing AttributeError on .replace().

If LLM_EVALUATION_MODEL is not set, os.getenv returns None and Line 54 crashes with AttributeError: 'NoneType' object has no attribute 'replace'. The same applies to the other env vars passed to AzureChatOpenAI. Consider adding a guard or using a default.

Suggested fix
 model_name = os.getenv("LLM_EVALUATION_MODEL")
 api_key = os.getenv("AZURE_OPENAI_API_KEY")
 api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
 api_version = os.getenv("OPENAI_API_VERSION")
+
+if not all([model_name, api_key, api_base, api_version]):
+    raise ValueError("Missing required environment variables. Check your .env file.")
🤖 Fix all issues with AI agents
In `@athena/evaluation/pyproject.toml`:
- Around line 18-19: Update the pinned package versions in pyproject.toml for
the dependencies ipykernel and tqdm: change ipykernel from 7.1.0 to 7.2.0 and
tqdm from 4.65.0 to 4.67.3, then regenerate the lockfile (e.g., run your
dependency manager's update/lock command such as poetry lock or pip-compile) and
run tests to ensure nothing breaks.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 89-94: The hardcoded max_tokens=100 passed to
model.with_structured_output(MetricEvaluations).invoke(...) is too small for
multi-metric structured JSON and can cause truncation; update this call to use a
larger default (e.g., 500–1000) or make it configurable via a constant/parameter
(e.g., MAX_EVAL_TOKENS or an argument to the LLM judge service) so
get_openai_callback(), model.with_structured_output(MetricEvaluations), and
invoke(...) use that value; ensure any configuration name is documented in the
function/class signature and replace the inline 100 with the new variable.
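
A sketch of that change. The constant name is a suggestion, and `model`, `MetricEvaluations`, and `request` refer to the names quoted from the service above; the call shape mirrors the quoted snippet rather than the actual code:

# Assumed constant; 500-1000 tokens leaves room for multi-metric structured JSON.
MAX_EVAL_TOKENS = 800

structured_model = model.with_structured_output(MetricEvaluations)
metric_evaluations = structured_model.invoke(
    request.prompt, max_tokens=MAX_EVAL_TOKENS, temperature=0
)
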
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)

24-31: Inconsistent generic syntax: List[BaseMessage] vs list[Metric].

Line 25 uses List[BaseMessage] (from typing) while line 31 uses the built-in list[Metric]. Since the project targets Python 3.12+, prefer the built-in list consistently (matching the style already used by MetricEvaluations on line 19).

♻️ Suggested fix
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(

This would also allow removing the List import from typing on line 1.

athena/evaluation/6_analysis_data_export.ipynb (1)

44-56: pd.concat without ignore_index=True produces a duplicate index.

The appended LLM row will carry index 0, duplicating the existing first row's index. While this doesn't break the subsequent merge, it can cause subtle issues if the DataFrame is later indexed by position (e.g., df.loc[0] returns two rows).

♻️ Suggested fix

Add ignore_index=True to the concat call:

participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{
    ...
}])], ignore_index=True)
athena/evaluation/service/json_service.py (2)

689-704: Exercise collection re-reads all config files from disk a second time.

The config files are already loaded in the first loop (lines 651–662) for metrics/mappings. Consider collecting exercises in the same pass to avoid redundant I/O, or at minimum document why the two-pass approach is needed (e.g., mappings must be fully resolved before exercise de-pseudonymization).


821-846: load_common_evaluation_config keeps all of its logic nested under the with block and has no error handling for a missing config file.

Minor: the entire function body is inside the with block, meaning the DataFrame construction happens while the file is still open. This works but is slightly unusual. More importantly, there's no error handling if evaluation_config_path doesn't exist.

Also, the function returns a pd.DataFrame, but its docstring is missing a Returns section.

Suggested improvement
 def load_common_evaluation_config(evaluation_config_path: str) -> pd.DataFrame:
     """
     Loads the common evaluation configuration from a JSON file.
     Args:
         evaluation_config_path (str): The path to the common evaluation configuration file.
+    Returns:
+        pd.DataFrame: A DataFrame containing the evaluation configuration records.
     """
     with open(evaluation_config_path, "r") as config_file:
         common_evaluation_config = json.load(config_file)
-        
-        records = []
-        ...
-        return pd.DataFrame.from_records(records)
+
+    records = []
+    for raw_exercise in common_evaluation_config.get("exercises", []):
+        exercise = {k: v for k, v in raw_exercise.items() if k != "submissions"}
+        for raw_submission in raw_exercise.get("submissions", []):
+            submission = {k: v for k, v in raw_submission.items() if k != "feedbacks"}
+            for feedback_type, raw_feedback in raw_submission.get("feedbacks", {}).items():
+                record = {
+                    'exercise_id': exercise['id'],
+                    'submission_id': submission['id'],
+                    'feedback_type': feedback_type,
+                    'exercise': exercise,
+                    'submission': submission,
+                    'feedback': raw_feedback,
+                }
+                records.append(record)
+
+    return pd.DataFrame.from_records(records)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

85-125: Two mutually exclusive metric-loading cells — consider adding a clearer guard.

Both "Define New Metrics" (Lines 85–95) and "Reuse Metrics from Expert Evaluation" (Lines 111–125) assign to the same metrics variable. Running both cells sequentially means the second silently overwrites the first. This is a common notebook pattern but could confuse users. A brief inline comment like # NOTE: Only run ONE of the two metrics cells below in each cell header would help.

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

@DominikRemo
Contributor Author

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@DominikRemo
Contributor Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

40-59: ⚠️ Potential issue | 🟡 Minor

model_name can be None if the env var is unset, causing AttributeError on Line 54.

os.getenv("LLM_EVALUATION_MODEL") returns None when the variable isn't configured. The .replace(...) call on Line 54 will then fail. Consider adding a guard or using a clearer error message:

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
+"\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 18-36: The notebook sets analysis_output_path but never ensures
the directory exists before writing output; before exporting the CSV (where
df.to_csv / writing occurs) add a directory creation step using
os.makedirs(analysis_output_path, exist_ok=True) so the path is created if
missing; locate and update the cell around the export step that references
analysis_output_path to call os.makedirs(...) just prior to the write.

In `@athena/evaluation/service/json_service.py`:
- Around line 821-846: The function load_common_evaluation_config currently
returns pd.DataFrame.from_records(records) which yields a column-less DataFrame
when records is empty; change load_common_evaluation_config to detect the
empty-records case and return an empty DataFrame with the expected columns (at
minimum "exercise_id", "submission_id", "feedback_type", "exercise",
"submission", "feedback") so downstream merges on
["exercise_id","submission_id","feedback_type"] don't KeyError; update the
return path in load_common_evaluation_config to construct
pd.DataFrame.from_records(records, columns=[...]) or explicitly create
pd.DataFrame(columns=[...]) when records is empty.
- Around line 670-681: The loop mutates mappings while iterating which can
produce non-deterministic results; change the inner iteration to work on a
stable snapshot (e.g., iterate over list(mappings.items()) or create a new dict)
so updates do not read values updated earlier in the same pass, then apply
deletions as before; specifically modify the block that iterates "for key, value
in mappings.items()" to instead iterate a snapshot (or build new_mappings and
replace mappings) and preserve the subsequent self-reference removal and
recomputation of pseudonyms/feedback_types/intersection.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 87-93: The LLM call using
model.with_structured_output(MetricEvaluations).invoke(...) inside the
progress_bar loop has no error handling so any transient API failure will abort
the loop and lose partial evaluation_progress and total_cost; wrap the
invocation and callback usage (get_openai_callback, metric_evaluations,
cb.total_cost) in a try/except that catches API/network exceptions, log the
exception, mark the current item as errored in evaluation_progress (same pattern
as the existing validation-failure handling), ensure you still accumulate
cb.total_cost if available, persist/write evaluation_progress to disk after the
exception, and continue the loop so remaining requests are processed.
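
A sketch of that error-handling shape, assuming the loop iterates over MetricEvaluationRequest objects; record_success, record_error, and save_progress are hypothetical callables standing in for the service's existing progress handling, not its real helpers:

import logging

from langchain_community.callbacks import get_openai_callback
from tqdm import tqdm

def process_requests(model, requests, schema, evaluation_progress,
                     record_success, record_error, save_progress):
    """Evaluate each request, tolerating per-item API failures."""
    total_cost = 0.0
    for request in tqdm(requests):
        try:
            with get_openai_callback() as cb:
                result = model.with_structured_output(schema).invoke(request.prompt)
            total_cost += cb.total_cost
            record_success(evaluation_progress, request, result)
        except Exception:
            # Log, mark the item as errored, and keep going with the remaining requests.
            logging.exception("Evaluation failed for submission %s", request.submission_id)
            record_error(evaluation_progress, request)
        finally:
            save_progress(evaluation_progress)  # persist partial progress after every item
    return evaluation_progress, total_cost
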
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)

24-31: New MetricEvaluationRequest model looks good overall.

Minor style nit: lines 25 and 31 mix List[BaseMessage] (from typing) with list[Metric] (builtin). Since the project targets Python 3.12, you could use list[BaseMessage] consistently and drop the List import.

Suggested consistency fix
-from typing import List
-
-from langchain_core.messages import BaseMessage
+from langchain_core.messages import BaseMessage
 from pydantic import BaseModel, Field
...
 class MetricEvaluationRequest(BaseModel):
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(
         ..., description="The prompt to evaluate the metrics."
     )
athena/evaluation/6_analysis_data_export.ipynb (1)

42-57: Left merges may silently drop data — consider validating merge results.

The left merge on Line 40 and Line 57 could leave NaN in config/participant columns for unmatched records, and no warning is emitted. For a data-export notebook, this could lead to silently incomplete CSVs. Consider adding a brief check (e.g., logging the count of rows with missing config/participant data) after each merge so users are aware of data gaps.
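
A small helper along those lines, as a sketch only; merge_with_report and its label argument are illustrative names, and the join keys in any real call would be the notebook's own:

import pandas as pd

def merge_with_report(left: pd.DataFrame, right: pd.DataFrame, on: list[str], label: str) -> pd.DataFrame:
    """Left-merge and report how many rows found no match on the right-hand side."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    missing = int((merged["_merge"] == "left_only").sum())
    if missing:
        print(f"Warning: {missing} rows have no matching {label} data")
    return merged.drop(columns="_merge")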

athena/evaluation/4_expert_evaluation.ipynb (1)

142-159: Cross-cell dependency on json import — consider adding it to this cell.

This cell uses json.dump (Line 158) but json is imported in the previous cell (Line 116). If a user runs this cell in isolation or restarts and runs from here, they'll get a NameError. Since os is re-imported on Line 150, it would be consistent to also import json here.

athena/evaluation/service/json_service.py (1)

689-704: Config files are read twice — once for metrics/mappings and again for exercises.

Lines 651-662 open and parse each config file, and Lines 689-704 re-open and re-parse the same files. This is a minor inefficiency.

Suggested optimization

Cache the parsed JSON on the first pass:

+    parsed_configs = []
     for config_file in evaluation_config_files:
         with open(config_file, "r") as file:
             config_data = json.load(file)
+            parsed_configs.append(config_data)
             # Collect metrics
             ...
 
-    for config_file in evaluation_config_files:
-        with open(config_file, "r") as file:
-            config_data = json.load(file)
+    for config_data in parsed_configs:
             # Collect exercises
             ...
athena/evaluation/service/llm_as_a_judge_service.py (1)

146-148: Summary counts use nested list comprehensions that are hard to read.

These can be simplified with sum(...) for clarity:

Suggested refactor
-    print(f"Total evaluated submissions: {len([sub for ex in selected_values.values() for sub in ex.values()])}")
-    print(f"Total evaluated feedbacks: {len([fb for ex in selected_values.values() for sub in ex.values() for fb in sub.values()])}")
+    print(f"Total evaluated submissions: {sum(len(ex) for ex in selected_values.values())}")
+    print(f"Total evaluated feedbacks: {sum(len(sub) for ex in selected_values.values() for sub in ex.values())}")

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 806-822: The df.empty branch is unreachable because the code
raises ValueError when records is empty; remove the dead empty-DataFrame branch
(the block that constructs and returns the typed empty DataFrame) and keep the
initial raise, so the function simply builds df =
pd.DataFrame.from_records(records) and returns
df.astype({'exercise_id':'int64','submission_id':'int64'}); reference symbols:
records, pd.DataFrame.from_records, df, df.empty, ValueError, and df.astype.
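
The simplified tail would look roughly like the sketch below; build_progress_dataframe is a stand-in name for the affected code path in json_service.py, not its actual signature:

import pandas as pd

def build_progress_dataframe(records: list[dict]) -> pd.DataFrame:
    """Raise early on empty input, then build and type the DataFrame; no dead df.empty branch."""
    if not records:
        raise ValueError("No evaluation progress records found.")
    df = pd.DataFrame.from_records(records)
    return df.astype({"exercise_id": "int64", "submission_id": "int64"})
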
🧹 Nitpick comments (3)
athena/evaluation/service/llm_as_a_judge_service.py (2)

25-48: Minor code duplication between None and list branches.

The None and list cases differ only in the feedbacks argument passed to get_formatted_prompt ([] vs submission.feedbacks). Consider merging them to reduce repetition.

♻️ Proposed consolidation
-            if submission.feedbacks is None:
-                prompt = get_formatted_prompt(exercise, submission, [], metrics)
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, list):
-                prompt = get_formatted_prompt(
-                    exercise, submission, submission.feedbacks, metrics
-                )
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, dict):
+            if submission.feedbacks is None or isinstance(submission.feedbacks, list):
+                feedbacks = submission.feedbacks or []
+                prompt = get_formatted_prompt(exercise, submission, feedbacks, metrics)
+                requests.append(
+                    MetricEvaluationRequest(
+                        prompt=prompt,
+                        exercise_id=exercise.id,
+                        submission_id=submission.id,
+                        feedback_type="default",
+                        metrics=metrics,
+                    )
+                )
+            elif isinstance(submission.feedbacks, dict):

87-97: The LLM API call catches bare Exception; consider narrowing the exception type.

The try/except was added to address a prior review concern (good), but catching bare Exception is flagged by Ruff (BLE001). Since LLM API calls typically raise specific exceptions (e.g., openai.APIError, requests.exceptions.RequestException), narrowing the catch would improve signal. That said, for a notebook-driven evaluation pipeline, the current broad catch is pragmatic and acceptable.
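
One possible narrowing, assuming the relevant failures surface as openai SDK errors underneath langchain_openai; safe_invoke is an illustrative wrapper, not the service's code:

import logging

import openai

def safe_invoke(model, schema, prompt):
    """Invoke the structured-output chain, catching only OpenAI SDK errors."""
    try:
        return model.with_structured_output(schema).invoke(prompt)
    except openai.APIError:
        # Base class of the SDK's connection, status, and rate-limit errors.
        logging.exception("LLM evaluation call failed")
        return None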

athena/evaluation/service/json_service.py (1)

825-853: load_common_evaluation_config accesses exercise['id'] and submission['id'] without guarding for missing keys.

If a malformed config entry is missing "id", this will raise a KeyError with no context about which entry caused the problem. Consider using .get() with validation or wrapping in a try/except for a more informative error message. This is a minor robustness concern given this is notebook-driven code.
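
A minimal guard in that spirit; require_id is a hypothetical helper, not part of json_service.py:

def require_id(entry: dict, kind: str) -> int:
    """Return entry['id'], failing with a message that names the offending entry."""
    entry_id = entry.get("id")
    if entry_id is None:
        raise ValueError(f"Malformed config: {kind} entry without an 'id': {entry}")
    return entry_id

# Hypothetical usage inside the loading loop:
# exercise_id = require_id(raw_exercise, "exercise")
# submission_id = require_id(raw_submission, "submission")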

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 760-813: The issue is a dtype mismatch: load_evaluation_progress
casts exercise_id and submission_id to int64 but load_common_evaluation_config
leaves exercise['id'] and submission['id'] as-is, causing merges to fail; update
load_common_evaluation_config to coerce the DataFrame columns for exercise_id
and submission_id to int64 (same as load_evaluation_progress) after building its
records/DataFrame so both DataFrames share explicit int64 dtypes for those
columns before any merge.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 134-143: The "Total evaluated feedbacks" print counts
feedback_type keys including the per-submission "meta" key, inflating the
number; update the calculation that uses selected_values (the expression summing
len(sub) for ex in selected_values.values() for sub in ex.values()) to exclude
the "meta" key when counting each submission's feedbacks (e.g., sum only keys !=
"meta" or subtract 1 when "meta" in sub.keys()) so the printed total reflects
actual feedback items.
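
A sketch of the corrected count, assuming selected_values is the nested exercise -> submission -> feedback_type mapping described above; count_feedbacks is an illustrative helper name:

def count_feedbacks(selected_values: dict) -> int:
    """Count feedback entries, skipping each submission's 'meta' key."""
    return sum(
        1
        for exercise in selected_values.values()
        for submission in exercise.values()
        for feedback_type in submission
        if feedback_type != "meta"
    )

# Hypothetical usage in the summary print:
# print(f"Total evaluated feedbacks: {count_feedbacks(selected_values)}")
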
🧹 Nitpick comments (2)
athena/evaluation/service/json_service.py (2)

692-705: exercise["submissions"].sort(key=lambda x: x["id"]) mutates data from parsed_configs in-place.

Since parsed_configs holds references to the originally parsed JSON dicts, this sort mutates the source data. This is likely fine for the current single-use notebook workflow, but worth noting if parsed_configs is ever reused.

More importantly: if a submission lacks an "id" key, this will raise a KeyError. A .get("id", 0) fallback or validation step before this point would be safer.
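
A sketch of a safer variant; sorted_submissions is an illustrative helper that copies before sorting and tolerates a missing "id":

import copy

def sorted_submissions(exercise: dict) -> list[dict]:
    """Return the exercise's submissions sorted by id without mutating the parsed config."""
    submissions = copy.deepcopy(exercise.get("submissions", []))
    return sorted(submissions, key=lambda s: s.get("id", 0))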


816-844: load_common_evaluation_config docstring is missing the Returns section.

The function returns a pd.DataFrame, but the docstring only documents Args. Minor documentation gap.

@DominikRemo
Contributor Author

@coderabbitai resolve

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Comments resolved.

@DominikRemo
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@laadvo laadvo self-requested a review February 11, 2026 21:55
Contributor

@laadvo laadvo left a comment


Looked over the code and tested the notebooks locally. Working as expected, nice improvement :)
