
Conversation

@DominikRemo DominikRemo commented Feb 11, 2026

Motivation and Context

The current evaluation workflow treats expert feedback and LLM-based feedback evaluation separately, making direct comparison difficult. We needed a robust way to:

  • Consolidate split expert evaluation configurations into a single source of truth.
  • De-pseudonymize expert results automatically (mapping blinded labels back to "LLM", "Cofee", "Tutor", etc.).
  • Run "LLM-as-a-Judge" evaluations using the exact same metrics as human experts.
  • Export all results (Human vs. LLM) into a single dataset for analysis.

Description

This PR significantly enhances the evaluation pipeline across steps 4, 5, and 6:

  • Expert Evaluation (4_expert_evaluation.ipynb):
    • Added create_common_config: Merges multiple split evaluation configs into a single common_evaluation_config.json.
    • Added De-pseudonymization: Automatically resolves randomized feedback types (e.g., "A", "B") and metric IDs in progress files back to their original names, ensuring consistent data for analysis. A short sketch of this mapping follows after this list.
    • Outputs cleaned data to data/4_expert_evaluation/output_depseudonymized.
  • LLM as a Judge (5_llm_as_a_judge.ipynb):
    • Metric Synchronization: Now loads metrics directly from common_evaluation_config.json instead of defining them separately (and potentially inconsistently). This ensures the LLM judges against exactly the same criteria as the human experts.
    • Generates evaluation results in the standardized "progress file" format, treating the LLM effectively as another expert evaluator.
  • Analysis Data Export (6_analysis_data_export.ipynb):
    • Integrated the new depseudonymized expert data and LLM evaluation results.
    • Explicitly registers the LLM as a "participant" in the dataset.
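
A minimal sketch of the de-pseudonymization idea referenced above. The mappings and the helper name here are illustrative; the real logic lives in json_service.resolve_feedback_types_and_metric_titles.

# Illustrative sketch only, not the actual implementation.
pseudonym_to_feedback_type = {"A": "LLM", "B": "Tutor"}                # hypothetical mapping
metric_ids_to_titles = {"m1": "Completeness", "m2": "Actionability"}   # hypothetical mapping

def resolve_ratings(ratings_by_pseudonym: dict) -> dict:
    """Map blinded feedback-type labels and metric IDs back to readable names."""
    resolved = {}
    for pseudonym, ratings in ratings_by_pseudonym.items():
        feedback_type = pseudonym_to_feedback_type[pseudonym]
        resolved[feedback_type] = {
            metric_ids_to_titles[metric_id]: score for metric_id, score in ratings.items()
        }
    return resolved

print(resolve_ratings({"A": {"m1": 4, "m2": 3}}))  # {'LLM': {'Completeness': 4, 'Actionability': 3}}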

Steps for Testing (local only)

  1. Expert Eval: Run 4_expert_evaluation.ipynb. Ensure common_evaluation_config.json is created and files in output_depseudonymized have resolved names (not IDs).
  2. LLM Eval: Run 5_llm_as_a_judge.ipynb. Select "Reuse Metrics from Expert Evaluation" and verify that it loads the metrics from step 4. Run the evaluation to generate the output JSON.
  3. Export: Run 6_analysis_data_export.ipynb. Verify it loads both the expert files and the LLM file, merges them, and exports an evaluation_data.csv where expert_id includes llm.
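
As a quick sanity check for step 3, a small pandas snippet can confirm the merged export. The path and the expert_id column are taken from the description above; treat them as assumptions if your local layout differs.

import pandas as pd

df = pd.read_csv("data/6_analysis/evaluation_data.csv")
assert "llm" in df["expert_id"].astype(str).unique(), "LLM rows missing from the export"
print(df["expert_id"].value_counts())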

Summary by CodeRabbit

  • New Features

    • LLM-as-a-judge workflow with batched request generation, structured evaluation processing, and JSON output.
  • Improvements

    • Unified expert-evaluation pipeline with de‑pseudonymization, shared metrics config, and an analysis data export that merges progress and participant info.
  • Bug Fixes / Removals

    • Removed legacy single-call evaluation path and simplified prompt/output formatting.
  • Chores

    • Kernel metadata and dependency updates; repository ignore rules refined.

@DominikRemo DominikRemo self-assigned this Feb 11, 2026
@DominikRemo DominikRemo requested a review from a team as a code owner February 11, 2026 12:39

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Adds utilities to discover and de‑pseudonymize expert evaluation files, an LLM-as-judge request/processing pipeline, new analysis export notebook, model and prompt adjustments, removal of the legacy llm_service, minor notebook kernel metadata updates, dependency additions, and .gitignore changes.

Changes

Cohort / File(s) Summary
Notebook Kernel Updates
athena/evaluation/0_database_setup.ipynb, athena/evaluation/1_sampling_exercises.ipynb, athena/evaluation/2_sampling_submissions.ipynb, athena/evaluation/3_feedback_generation.ipynb, athena/evaluation/5_llm_as_a_judge.ipynb
Notebook metadata updated: kernelspec display_name changed to .venv (from evaluation-py3.12) and related language metadata edits.
Expert Evaluation Notebook
athena/evaluation/4_expert_evaluation.ipynb
Major rework: switched input to data/4_expert_evaluation/input_pseudonymized; removed prior read/save flow; added discovery, common-config creation, and depseudonymization steps using new service functions; outputs to data/4_expert_evaluation/output_depseudonymized.
New Export Notebook
athena/evaluation/6_analysis_data_export.ipynb
New notebook that loads evaluation progress and common config, merges with participant info (optionally includes LLM participant), selects columns, and writes data/6_analysis/evaluation_data.csv.
Evaluation Model Additions
athena/evaluation/model/evaluation_model.py
Added MetricEvaluationRequest Pydantic model (prompt: list[BaseMessage], exercise_id, submission_id, feedback_type, metrics) and adjusted imports to use BaseMessage.
Domain Model Updates
athena/evaluation/model/model.py
Submission and Exercise constructors accept meta: Optional[dict] = None and initialize self.meta defensively.
Prompt Simplification
athena/evaluation/prompts/llm_evaluation_prompt.py
Removed PydanticOutputParser and all format_instructions/output-parser usage; prompt formatting no longer injects parser instructions.
JSON & Utility Service
athena/evaluation/service/json_service.py
Large additions: secure zip extraction and file discovery (get_evaluation_files), create_common_config, resolve_feedback_types_and_metric_titles, load_evaluation_progress, load_common_evaluation_config; updated group_exercise_data signature to accept Optional[str].
New LLM-as-Judge Service
athena/evaluation/service/llm_as_a_judge_service.py
New module with generate_evaluation_requests(...) (builds MetricEvaluationRequest list) and process_feedback_evaluations(...) (invokes model, validates metric titles, accumulates results and cost, writes JSON output).
Removed Legacy Service
athena/evaluation/service/llm_service.py
Deleted legacy file and evaluate_feedback_with_model; logic replaced by new llm_as_a_judge_service.
Dependencies
athena/evaluation/pyproject.toml
Added dependencies: ipykernel = "7.2.0" and tqdm = "4.67.3".
Gitignore Update
athena/evaluation/.gitignore
Replaced flat data/* rule with recursive data/**/* and re-included directories and .gitkeep via negations.
Public API Surface Changes
athena/evaluation/prompts/..., athena/evaluation/model/..., athena/evaluation/service/...
New public functions generate_evaluation_requests, process_feedback_evaluations, added MetricEvaluationRequest model, updated AzureChatOpenAI constructor parameter usage, removed evaluate_feedback_with_model.

Sequence Diagram(s)

sequenceDiagram
    participant Notebooks as Notebooks/Orchestration
    participant JSONSvc as json_service
    participant JudgeSvc as llm_as_a_judge_service
    participant Azure as AzureOpenAI

    rect rgba(200,200,255,0.5)
    Notebooks->>JSONSvc: get_evaluation_files / load_common_evaluation_config / load_evaluation_progress
    JSONSvc-->>Notebooks: evaluation files, common config, progress dataframe
    end

    rect rgba(200,255,200,0.5)
    Notebooks->>JudgeSvc: generate_evaluation_requests(data, metrics, filter?)
    JudgeSvc-->>Notebooks: List[MetricEvaluationRequest]
    Notebooks->>JudgeSvc: process_feedback_evaluations(requests, output_path, model, metrics)
    end

    rect rgba(255,200,200,0.5)
    JudgeSvc->>Azure: invoke model per request (prompts)
    Azure-->>JudgeSvc: responses (evaluations)
    JudgeSvc->>JudgeSvc: validate metrics, accumulate results, track cost
    JudgeSvc-->>Notebooks: write evaluation_progress_llm-as-a-judge.json
    end
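
Assuming the call shapes shown in the diagram (module paths, return values, and argument order are not verified against the code), a notebook cell could drive the pipeline roughly like this:

# Sketch only: function names and parameters are taken from the diagram above; paths,
# return shapes, and the earlier definition of `metrics` and `model` are assumptions.
from service.json_service import get_evaluation_files, load_common_evaluation_config
from service.llm_as_a_judge_service import generate_evaluation_requests, process_feedback_evaluations

# Discover the pseudonymized expert files and the shared metric definitions.
config_files, progress_files = get_evaluation_files("data/4_expert_evaluation/input_pseudonymized")
config_df = load_common_evaluation_config("data/4_expert_evaluation/common_evaluation_config.json")

# `metrics` and `model` (an AzureChatOpenAI instance) are assumed to be defined in earlier cells.
requests = generate_evaluation_requests(config_df, metrics)
process_feedback_evaluations(requests, "data/5_llm_evaluation", model, metrics)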

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through configs and prompts in the night,
I stitched files and metrics till they looked just right,
Old service slept; new requests took flight,
JSON and costs counted under moonlight,
Hooray — evaluations ready, carrot-bright! 🥕✨

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25%, which is insufficient; the required threshold is 80.00%. | Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check | ✅ Passed | The title clearly and specifically describes the primary changes: unifying expert and LLM evaluation pipelines with integrated data export for analysis, which aligns with the major notebook updates and new service functions.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
athena/evaluation/service/llm_as_a_judge_service.py (2)

75-88: Hoist model.with_structured_output(MetricEvaluations) outside the loop.

model.with_structured_output(MetricEvaluations) creates a new runnable wrapper on every iteration. Since the schema doesn't change, build it once before the loop.

♻️ Proposed fix
+    structured_model = model.with_structured_output(MetricEvaluations)
+
     progress_bar = tqdm(requests, desc="Processing")
 
     for request in progress_bar:
         evaluation_progress["has_started_evaluating"] = True
         try:
             with get_openai_callback() as cb:
-                metric_evaluations = model.with_structured_output(MetricEvaluations).invoke(
+                metric_evaluations = structured_model.invoke(
                     request.prompt, max_tokens=100, temperature=0
                 )

55-60: Consider using requests[0].metrics instead of a separate metrics parameter.

Each MetricEvaluationRequest already carries its own metrics list. The separate metrics parameter on process_feedback_evaluations is redundant and could drift out of sync with what's in the requests. If all requests share the same metrics, you could derive expected_metric_titles from request.metrics inside the loop instead.
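
A sketch of that derivation; the field names request.metrics and metric.title are assumptions based on the walkthrough, not verified against the model definition.

def expected_titles(request) -> set[str]:
    # Derive the expected titles from the request itself instead of a separate `metrics` argument.
    return {metric.title for metric in request.metrics}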

athena/evaluation/4_expert_evaluation.ipynb (1)

115-131: Unnecessary f-string prefix on Line 127.

f"Created Common Evaluation Config With:" has no placeholders. Use a plain string.

♻️ Proposed fix
-    "print(f\"Created Common Evaluation Config With:\")\n",
+    "print(\"Created Common Evaluation Config With:\")\n",
athena/evaluation/5_llm_as_a_judge.ipynb (1)

107-127: Both "Define New Metrics" and "Reuse Metrics" cells assign to metrics — add a usage note.

Both cells (lines 87-97 and 113-127) write to the same metrics variable. If a user runs both cells sequentially, the second silently overwrites the first. This is a common notebook pattern, but consider adding a brief markdown note (e.g., "Run one of the two cells below") in the "Define Metrics" section (line 69-70) to make the mutual exclusivity explicit.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.4)
athena/evaluation/1_sampling_exercises.ipynb
athena/evaluation/0_database_setup.ipynb
athena/evaluation/2_sampling_submissions.ipynb
  • 9 others


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
athena/evaluation/prompts/llm_evaluation_prompt.py (1)

134-149: ⚠️ Potential issue | 🟡 Minor

grading_instruction_feedback can become None, rendering as literal "None" in the description string.

When feedback.structured_grading_instruction_id is truthy but the ID isn't found in the exercise's grading criteria, grading_instructions.get(...) returns None, causing the ternary on line 145 to assign None to grading_instruction_feedback — overwriting the "" default from line 134. The f-string on line 149 then produces "None\n{feedback.description}".

Proposed fix
             grading_instruction_feedback = (
-                grading_instruction.feedback + ": " if grading_instruction else None
+                grading_instruction.feedback + ": " if grading_instruction else ""
             )
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name can be None if env var is unset, causing AttributeError on .replace().

os.getenv("LLM_EVALUATION_MODEL") returns None when the variable isn't set. Calling .replace() on None at line 54 will crash. Consider adding a guard or using a default value.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"LLM_EVALUATION_MODEL environment variable is not set.\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/4_expert_evaluation.ipynb`:
- Around line 128-130: The default for the metrics field is wrong: replace
occurrences of common_evaluation_config.get("metrics", {}) with
common_evaluation_config.get("metrics", []) so the default is a list not a dict;
update both places referenced in the diff (the print statement using len(...)
and the later use at line ~152) and ensure this matches how create_common_config
constructs metrics.

In `@athena/evaluation/5_llm_as_a_judge.ipynb`:
- Around line 193-196: The call random.sample(requests, 10) can raise ValueError
when requests has fewer than 10 items; change the sampling to use
random.sample(requests, k) with k = min(10, len(requests)) so it gracefully
handles small lists (locate the sampling line using the symbol random.sample and
the requests variable).

In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 79-96: The notebook kernel metadata lists "version": "3.11.9"
which is inconsistent with the project target of Python 3.12; update the
notebook's metadata block (the "kernelspec"/"language_info" fields) to reflect
Python 3.12 by changing "version": "3.11.9" to "3.12" and optionally update
"kernelspec.display_name" to "Python 3.12" and "kernelspec.name" to a matching
identifier (e.g., "python3.12") so the notebook metadata aligns with
pyproject.toml and other notebooks.
- Around line 24-34: The code defines llm_evaluation_progress_path but never
uses it, so load_evaluation_progress(evaluation_progress_path, None) omits LLM
data; update the call to pass the llm_evaluation_progress_path variable as the
second argument to load_evaluation_progress so that both expert
(evaluation_progress_path) and LLM (llm_evaluation_progress_path) progress files
are loaded and merged by that function.

In `@athena/evaluation/model/model.py`:
- Line 62: The constructor parameter using a mutable default (meta: dict = {})
in Submission.__init__ should be changed to use None as the default and
initialize a new dict inside the method to avoid shared state; update the
signature to use meta: Optional[dict] = None and then set self.meta = {} if meta
is None else dict(meta) (or similar) within Submission.__init__; apply the same
change to Exercise.__init__ (the similar meta/default pattern noted at line
~140) to prevent shared mutable defaults across instances.
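
The standard pattern for that fix, shown on an illustrative class rather than the actual Submission/Exercise definitions:

from typing import Optional

class Submission:
    def __init__(self, meta: Optional[dict] = None):
        # Fresh dict per instance; copying also guards against callers mutating their own dict later.
        self.meta = {} if meta is None else dict(meta)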

In `@athena/evaluation/pyproject.toml`:
- Around line 15-20: The project declares several Python deps but omits tqdm
which is imported by athena.evaluation.service.llm_as_a_judge_service (import
tqdm on top of that module), causing ImportError for consumers; fix by adding a
stable tqdm entry to pyproject.toml dependencies (e.g., a compatible caret
version like "^4.65.0" or similar) so the package manager installs tqdm
alongside the other libs.

In `@athena/evaluation/service/json_service.py`:
- Around line 616-621: The zip extraction loop using zipfile.ZipFile(...) and
zip_ref.extractall(data_dir) is vulnerable to ZipSlip; update the extraction to
iterate over zip_ref.infolist() (or .namelist()), validate each member path by
joining with data_dir and ensuring the normalized absolute path starts with the
intended data_dir prefix, and only then extract that member (skip or raise on
invalid paths) rather than calling extractall; apply this change where files is
iterated and file.endswith(".zip") is checked to securely extract archives. A
minimal sketch of this pattern follows after this list.
- Around line 660-670: The while-loop resolving transitive mappings can
infinite-loop if mappings become self-referential (e.g., A→A); add a guard by
tracking iterations or detecting self-mappings: introduce a max_iterations
(e.g., len(mappings) or 1000) and increment each loop, breaking when exceeded,
or after updating mappings check for any key where mappings[key] == key and
remove that key from the intersection (or set mappings[key] = None/remove the
entry) so it no longer appears in pseudonyms and feedback_types; apply this fix
around the variables mappings, pseudonyms, feedback_types, intersection and the
while-loop to ensure termination.
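
For the zip-extraction point above, a minimal ZipSlip-safe sketch; the helper name and paths are illustrative:

import os
import zipfile

def safe_extract(zip_path: str, data_dir: str) -> None:
    base = os.path.abspath(data_dir) + os.sep  # trailing separator prevents sibling-path bypasses
    with zipfile.ZipFile(zip_path) as zip_ref:
        for member in zip_ref.infolist():
            target = os.path.abspath(os.path.join(data_dir, member.filename))
            if not target.startswith(base):
                raise ValueError(f"Unsafe path in archive: {member.filename}")
            zip_ref.extract(member, data_dir)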

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 92-101: The current loop breaks on the first validation failure
which abandons the remaining requests; instead, when evaluated_metric_titles !=
expected_metric_titles or the LLM returns an unexpected format, log the error
(including evaluated_metric_titles, expected_metric_titles and request) and use
continue to skip just that request so processing continues for the rest, and
ensure is_finished_evaluating is set to False for any skipped request so the
output indicates incomplete evaluation; update the logic around
evaluated_metric_titles/expected_metric_titles and the else branch to perform
logging + continue (or at minimum set is_finished_evaluating = False before any
early exit) rather than break.
- Around line 138-142: Remove unnecessary f-string prefixes used where there are
no placeholders: change the filename construction that uses
f"evaluation_progress_llm-as-a-judge.json" when building file_path (the
os.path.join call that assigns file_path) to a plain string, and change the
print call print(f"\nSummary of Evaluation:") to print("\nSummary of
Evaluation:"). Keep the rest of the logic (writing evaluation_progress with
json.dump and variable names file_path and evaluation_progress) unchanged.
- Around line 67-140: In process_feedback_evaluations, evaluation_progress is
only set inside the for-loop which causes a NameError when requests is empty;
initialize evaluation_progress before the loop (e.g., set
current_submission_index/current_exercise_index to None, selected_values to {}
and has_started_evaluating/is_finished_evaluating appropriately) so the variable
always exists, then update it inside the loop as currently done; ensure the
final json dump writes that preinitialized evaluation_progress when no
iterations occur.

In `@athena/evaluation/service/plot_service.py`:
- Around line 30-31: Several functions in
athena/evaluation/service/plot_service.py call plt.show() before plt.savefig(),
which causes saved files to be blank; locate the three occurrences of the
plt.show() / plt.savefig() pair (the end of each plotting function in this
module) and swap the calls so plt.savefig(os.path.join(plot_path, filename)) is
executed before plt.show(); apply the same swap in the other two functions that
have the identical pair so all saved plots are written before being displayed.
- Around line 8-31: The plot_boxplot function currently draws on the global
matplotlib state so successive calls overlay plots; fix it by explicitly
creating a new figure/axes at the start (e.g., call plt.figure() or use fig, ax
= plt.subplots()) before calling sns.boxplot and by closing the figure at the
end (e.g., plt.close(fig) or plt.close()) after saving; update plot_boxplot (and
mirror the same pattern in the other plotting functions) so all plotting,
layout, grid, savefig, show, and close operations are performed on the newly
created figure/axes to avoid bleed between calls.
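
A sketch combining both plot_service points (explicit figure, save before show, close afterwards); the parameter names are illustrative rather than the module's actual signature:

import os
import matplotlib.pyplot as plt
import seaborn as sns

def plot_boxplot(data, x, y, hue, plot_path, filename):
    fig, ax = plt.subplots()                          # draw on a fresh figure, not global state
    sns.boxplot(data=data, x=x, y=y, hue=hue, ax=ax)
    ax.grid(True)
    fig.tight_layout()
    fig.savefig(os.path.join(plot_path, filename))    # save before show so the file is not blank
    plt.show()
    plt.close(fig)                                    # avoid bleed into the next plot
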
🧹 Nitpick comments (4)
athena/evaluation/service/plot_service.py (1)

34-83: plot_feedback_type_metric and plot_metric_feedback_type are near-duplicates.

They differ only in which column is x vs hue and the legend/filename labels. Consider delegating to plot_boxplot (which already accepts all these parameters) to eliminate the duplication.

athena/evaluation/service/json_service.py (2)

617-631: Inconsistent path construction: f-strings vs os.path.join.

Lines 620, 629, and 631 use f-string interpolation (f"{data_dir}/{file}") while most of the rest of this file uses os.path.join. Prefer os.path.join throughout for cross-platform safety.


735-735: Function signature uses str | None alongside Optional[str] in the same file.

Line 735 uses str | None for llm_evaluation_progress_path while other functions (e.g., group_exercise_data at Line 56) use Optional[str]. Pick one style for consistency.

athena/evaluation/service/llm_as_a_judge_service.py (1)

129-135: evaluation_progress is redundantly rebuilt on every iteration.

This dict is reassigned in every loop iteration but only the final value matters (after the loop). Move it outside the loop (as part of the NameError fix above) and update has_started_evaluating / is_finished_evaluating once after the loop completes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
athena/evaluation/1_sampling_exercises.ipynb (1)

102-103: ⚠️ Potential issue | 🟡 Minor

Missing pandas import for the commented-out read_csv hint.

The commented line references pd.read_csv, but pandas is never imported (as pd or otherwise) in this notebook. If a user uncomments it, they'll get a NameError.

Proposed fix
-    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
+    "# import pandas as pd\n",
+    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name.replace(...) will raise AttributeError if the env var is unset.

If LLM_EVALUATION_MODEL is not configured in .env, os.getenv returns None and the .replace() call on Line 54 crashes. Consider adding a guard or using a default.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
 "\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
 "\n",
 "model = AzureChatOpenAI(\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 71-74: The columns exercise, submission, and feedback in the
DataFrame (populated by load_common_evaluation_config) contain nested dicts that
will be written as Python reprs by df.to_csv and are hard to parse; before
calling df.to_csv (the line using analysis_output_path and
"evaluation_data.csv"), either flatten the dicts into primitive columns (e.g.,
exercise_id, exercise_title, submission_timestamp, feedback_text) or
JSON-serialize those three columns (use json.dumps on each cell) so the CSV
contains valid JSON strings, then write the resulting DataFrame with df.to_csv.

In `@athena/evaluation/service/json_service.py`:
- Around line 654-658: The current aggressive sanitization in the metrics
collection loop (the metric["title"] assignment) removes useful punctuation like
hyphens and parentheses; change the filter to only remove control/invisible
characters instead of stripping most punctuation—import unicodedata and replace
the comprehension with one that keeps characters whose Unicode category does not
start with 'C' (control) or, alternatively, allow a small whitelist of
punctuation (e.g., "-.()/_:") while trimming whitespace; update the
metric["title"] assignment inside the loop that iterates over
config_data.get("metrics", []) and ensure the new logic preserves meaningful
characters while still trimming surrounding spaces.
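
A sketch of that gentler sanitization (drop only characters in the Unicode "C*" categories, then trim); the helper name is illustrative:

import unicodedata

def clean_title(title: str) -> str:
    # Keep hyphens, parentheses, etc.; remove only control/format and other "C*"-category characters.
    return "".join(ch for ch in title if not unicodedata.category(ch).startswith("C")).strip()

print(clean_title("Actionability (1-5)\u200b"))  # -> 'Actionability (1-5)'
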
🧹 Nitpick comments (6)
athena/evaluation/.gitignore (1)

1-3: Consider adding a comment to document the recursive ignore pattern.

The recursive pattern correctly handles nested directories and .gitkeep files, preserving the directory structure (e.g., data/4_expert_evaluation/input_pseudonymized/.gitkeep, data/6_analysis/.gitkeep). However, the pattern is not immediately obvious and adding documentation would improve maintainability for future contributors.

Suggested documentation
+# Ignore all files recursively under data/, but keep directory structure via .gitkeep
 data/**/*
 !data/**/
 !data/**/.gitkeep
athena/evaluation/service/json_service.py (3)

610-637: Inconsistent path construction: mix of os.path.join and f-string with /.

Lines 620 and 623 correctly use os.path.join, but Lines 633 and 635 use f"{data_dir}/{file}". This is inconsistent and can break on Windows.

Proposed fix
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

715-743: Misleading variable names obscure the data structure.

In the nested loops, feedback_id is actually a feedback-type pseudonym and ratings is a dict of {metric_id: score}. Similarly, the outer submission variable holds per-submission data keyed by submission IDs. Consider renaming for clarity:

Suggested renames
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
-                    resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
-                        if feedback_id not in pseudonym_to_feedback_type:
-                            raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
+            for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+                for submission_id, feedbacks_by_type in submissions.items():
+                    resolved_feedback = {}
+                    for pseudonym, ratings in feedbacks_by_type.items():
+                        if pseudonym not in pseudonym_to_feedback_type:
+                            raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")

794-819: Store nested dicts as proper JSON strings or flatten the structure before DataFrame creation.

The exercise, submission, and feedback columns are exported to CSV by the downstream analysis notebook (6_analysis_data_export.ipynb). Pandas will serialize these nested dicts as string representations (e.g., "{'id': ...}") rather than valid JSON, making them fragile and difficult to parse. Either flatten these columns into separate DataFrame columns, or serialize them with json.dumps() before storing in the DataFrame.

athena/evaluation/6_analysis_data_export.ipynb (1)

39-53: LLM participant row is always appended even when LLM data is not loaded.

If llm_evaluation_progress_path remains None, no LLM scores exist in df, but the LLM participant row is unconditionally added to participant_info_df. After the left merge on expert_id, this won't cause incorrect data (the LLM row just won't match anything), but it's misleading. Consider guarding the append:

Proposed fix
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
+"if llm_evaluation_progress_path:\n",
 "participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",

(with appropriate indentation)

athena/evaluation/4_expert_evaluation.ipynb (1)

148-162: count variable is incremented but never used.

The count variable on Lines 155 and 157 is dead code. Consider either printing it after the loop or removing it.

Also, os.makedirs(output_dir, exist_ok=True) on Line 159 is redundant inside the loop — output_dir was already created in the previous cell (Line 123).

Proposed fix
-"count = 0\n",
 "for progress_file, progress_data in resolve_feedback_types_and_metric_titles(evaluation_progress_files, metric_ids_to_titles, pseudonym_to_feedback_type).items():\n",
-"    count += 1\n",
 "    output_progress_path = f\"{output_dir}/{os.path.basename(progress_file)}\"\n",
-"    os.makedirs(output_dir, exist_ok=True)\n",
 "    with open(output_progress_path, \"w\") as output_file:\n",
-"        json.dump(progress_data, output_file, indent=4)"
+"        json.dump(progress_data, output_file, indent=4)\n",
+"\n",
+"print(f\"Resolved {len(evaluation_progress_files)} progress files.\")"

Updated expert evaluation notebook to support multiple config/progress files, added steps for depseudonymization, and improved error handling. Expanded LLM-as-a-judge notebook with options to reuse expert metrics, improved output, and clarified evaluation steps. Added new analysis notebooks and introduced plot service for visualization. Updated dependencies and refactored model and service code to support new workflow.
@DominikRemo DominikRemo force-pushed the athena/evaluation/notebooks/llm-as-a-judge branch from 863c57f to 0e0520b on February 11, 2026 at 16:30
@github-actions

Athena Test Results Summary

Tests | Ran | Passed ✅ | Skipped | Failed
Athena Test Report | 10 | 10 | 0 | 0

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package | Line Rate | Branch Rate | Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name will be None if the env var is unset, causing AttributeError on .replace().

If LLM_EVALUATION_MODEL is not set in the .env file, os.getenv(...) returns None, and model_name.replace("azure_openai_", "") on line 54 will raise AttributeError: 'NoneType' object has no attribute 'replace'. Consider adding a guard or a clear error message.

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"Environment variable LLM_EVALUATION_MODEL is not set.\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 788-791: The current return uses
pd.DataFrame.from_records(records).astype({'exercise_id': 'int64',
'submission_id': 'int64'}) which will KeyError on an empty DataFrame with no
columns; modify the code to first build the DataFrame (df =
pd.DataFrame.from_records(records)) and if df.empty (or if not records) return
an explicitly typed empty DataFrame (e.g. pd.DataFrame({'exercise_id':
pd.Series(dtype='int64'), 'submission_id': pd.Series(dtype='int64')}), otherwise
perform df.astype({'exercise_id': 'int64', 'submission_id': 'int64'}) and return
the result to avoid the KeyError.
- Around line 620-625: The ZipSlip check around extracting zip entries is
vulnerable because target_path.startswith(os.path.abspath(data_dir)) can be
fooled by sibling paths; update the extraction logic in the block using
zipfile.ZipFile (where target_path and data_dir are computed and zip_ref.extract
is called) to compute a base path with a trailing separator (e.g., base_path =
os.path.abspath(data_dir) + os.sep) and then verify
target_path.startswith(base_path) before calling zip_ref.extract; this ensures
entries like "/app/data_evil/..." do not pass the check.
🧹 Nitpick comments (7)
athena/evaluation/pyproject.toml (1)

18-19: Nit: Inconsistent version pinning for tqdm.

All other dependencies in this file use exact version pins (e.g., "7.1.0"), but tqdm uses a caret range ("^4.65.0"). Consider pinning it exactly for consistency and reproducibility, e.g., "4.67.1".

athena/evaluation/service/json_service.py (4)

627-636: Inconsistent path construction: f-string interpolation vs os.path.join.

Lines 633 and 635 use f"{data_dir}/{file}" while the zip extraction above correctly uses os.path.join. This can break on Windows where the separator is \.

Proposed fix
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

724-742: Confusing variable names obscure the nested structure.

The loop variable submission on line 728 actually represents per-exercise data (a dict of submissions), and feedback on line 729 represents per-submission data (a dict of feedback types). This makes the code hard to follow.

Suggested rename
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
+            for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+                for submission_id, feedback_types in submissions.items():
                     resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
+                    for feedback_id, ratings in feedback_types.items():
                         if feedback_id not in pseudonym_to_feedback_type:
                             raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
                         resolved_ratings = {}
                         for metric_id, score in ratings.items():
                             if metric_id not in metric_ids_to_titles:
                                 raise ValueError(f"Metric ID {metric_id} not found in metrics.")
                             title = metric_ids_to_titles[metric_id]
                             resolved_ratings[title] = score
                         resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-                    submission[submission_id] = resolved_feedback
+                    submissions[submission_id] = resolved_feedback

757-761: Expert ID extraction is fragile if the filename contains extra underscores or dots.

file.replace("evaluation_progress_", "").replace(".json", "") assumes a single occurrence of these substrings. A filename like evaluation_progress_expert_1.json.bak or evaluation_progress_A_B.json could produce unexpected IDs. Using re or splitting on a known pattern would be more robust, though the current approach works for the expected naming convention.


804-811: Direct dict key access (exercise['id'], submission['id']) will raise KeyError if the key is missing.

Other parts of this file use .get() with defaults for defensive access. Consider using .get("id") here too, or at least documenting that id is a required field so failures are intentional.

athena/evaluation/4_expert_evaluation.ipynb (1)

148-162: count variable is incremented but never used.

The count variable on lines 155/157 is accumulated but never printed or returned. Either remove it or add a summary print after the loop (e.g., print(f"Resolved {count} progress files.")).

Also, os.makedirs(output_dir, exist_ok=True) on line 159 is called inside the loop on every iteration. It's harmless due to exist_ok=True, but moving it before the loop would be cleaner — it's already called on line 123 in the previous cell anyway.

athena/evaluation/5_llm_as_a_judge.ipynb (1)

217-223: Confirm that the output path data/5_llm_evaluation/ produces a file that notebook 6 can consume.

Notebook 6 currently sets llm_evaluation_progress_path = None (line 25 of 6_analysis_data_export.ipynb). For the end-to-end pipeline to work, the user would need to update that path to point to the LLM output from this cell. Consider adding a comment here noting the expected output filename so the user knows what to reference in step 6.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 27-29: Fix the typo in the notebook comment: change "evalution
progress" to "evaluation progress" in the explanatory string that mentions the
5_llm_as_a_judge notebook and the stored path (the comment line referencing
"data/5_llm_evaluation/evaluation_progress_llm-as-a-judge.json"); leave the
variable llm_evaluation_progress_path as-is.
🧹 Nitpick comments (6)
athena/evaluation/6_analysis_data_export.ipynb (1)

18-57: LLM participant is always appended even when no LLM data is loaded.

Lines 44–55 unconditionally add an 'llm' row to participant_info_df, but llm_evaluation_progress_path defaults to None (line 29), so load_evaluation_progress won't produce any expert_id == 'llm' rows. After the left merge on expert_id (line 56), this phantom participant row won't appear in the final data, so there's no data corruption — but it's misleading. Consider guarding the concat:

Suggested guard
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
-"participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
+"if llm_evaluation_progress_path:\n",
+"    participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
athena/evaluation/4_expert_evaluation.ipynb (1)

142-160: os.makedirs on line 157 is called inside the loop — consider hoisting it.

os.makedirs(output_dir, exist_ok=True) is already called once at line 123. The repeated call inside the loop (line 157) is redundant. It's harmless due to exist_ok=True, but removing it keeps the loop body focused on file I/O.

athena/evaluation/service/json_service.py (4)

610-638: Mix of os.path.join and f-string path concatenation.

Lines 620, 624 correctly use os.path.join / os.path.abspath, but lines 634 and 636 revert to f-string concatenation (f"{data_dir}/{file}"). Use os.path.join consistently for cross-platform safety.

Suggested fix
         if file.startswith("evaluation_config_"):
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))

716-744: Line 716: very long function signature — consider wrapping for readability.

The signature spans well beyond typical line-length limits. Also, the docstring is missing a Returns section.

Suggested formatting
-def resolve_feedback_types_and_metric_titles(evaluation_progress_files: list[str], metric_ids_to_titles: dict[str, str], pseudonym_to_feedback_type: dict[str, str]) -> dict[str, dict]:
+def resolve_feedback_types_and_metric_titles(
+    evaluation_progress_files: list[str],
+    metric_ids_to_titles: dict[str, str],
+    pseudonym_to_feedback_type: dict[str, str],
+) -> dict[str, dict]:

747-798: Type annotation style inconsistency and empty-DataFrame guard looks good.

The empty-DataFrame guard (lines 790–794) properly addresses the prior review concern.

Minor style nit: line 747 uses str | None (PEP 604 union syntax) while line 56 uses Optional[str]. Both are valid in Python 3.12+, but mixing styles within the same file reduces consistency.


2-8: import os.path on line 6 is redundant.

os is already imported on line 3, and os.path is automatically available as a submodule. The explicit import os.path can be removed.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 798-807: The empty-DataFrame branch currently returns only
exercise_id and submission_id causing KeyError downstream; change the
empty-return to construct and return a fully-typed DataFrame with the same
schema as the non-empty case (columns: expert_id, exercise_id, submission_id,
feedback_type, metric, score) using pd.DataFrame({ 'expert_id':
pd.Series(dtype='int64'), 'exercise_id': pd.Series(dtype='int64'),
'submission_id': pd.Series(dtype='int64'), 'feedback_type':
pd.Series(dtype='object'), 'metric': pd.Series(dtype='object'), 'score':
pd.Series(dtype='float64') }) so code that accesses df['expert_id'] etc. works;
keep the existing pd.DataFrame.from_records(records) and df.astype({...}) path
for non-empty results.
🧹 Nitpick comments (3)
athena/evaluation/service/json_service.py (3)

627-637: No file-extension filter on discovered evaluation files.

Lines 632 and 634 match any file whose name starts with evaluation_config_ or evaluation_progress_, regardless of extension. A stray file like evaluation_config_backup.txt or a .zip would be included and would later fail when json.load is called. Consider adding an .endswith(".json") guard.

Proposed fix
     for file in files:
-        if file.startswith("evaluation_config_"):
+        if file.startswith("evaluation_config_") and file.endswith(".json"):
             evaluation_config_files.append(os.path.join(data_dir, file))
-        elif file.startswith("evaluation_progress_"):
+        elif file.startswith("evaluation_progress_") and file.endswith(".json"):
             evaluation_progress_files.append(os.path.join(data_dir, file))

730-749: Misleading variable names obscure the nested data structure.

The variable names don't match the data they hold:

  • Line 734: submission is actually the exercise-level dict mapping submission_id → feedback_data, not a single submission.
  • Line 735: feedback is actually a dict of feedback_pseudonym → ratings, not a single feedback item.
  • Line 737: feedback_id is actually a pseudonym key.

Also, Line 747 mutates submission (the dict being iterated on Line 735) by reassigning submission[submission_id]. While updating an existing key during iteration is safe in CPython, it's a readability concern. Building a new dict and replacing after iteration would be clearer.

Proposed refactor for clarity
     for progress_file in evaluation_progress_files:
         with open(progress_file, "r") as file:
             progress_data = json.load(file)
             # De-pseudonymize feedbacks in progress file
-            for _, submission in progress_data.get("selected_values", {}).items():
-                for submission_id, feedback in submission.items():
-                    resolved_feedback = {}
-                    for feedback_id, ratings in feedback.items():
-                        if feedback_id not in pseudonym_to_feedback_type:
-                            raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
-                        resolved_ratings = {}
-                        for metric_id, score in ratings.items():
-                            if metric_id not in metric_ids_to_titles:
-                                raise ValueError(f"Metric ID {metric_id} not found in metrics.")
-                            title = metric_ids_to_titles[metric_id]
-                            resolved_ratings[title] = score
-                        resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-                    submission[submission_id] = resolved_feedback
+            for exercise_id, submissions_data in progress_data.get("selected_values", {}).items():
+                resolved_submissions = {}
+                for submission_id, feedback_types in submissions_data.items():
+                    resolved_feedback = {}
+                    for pseudonym, ratings in feedback_types.items():
+                        if pseudonym not in pseudonym_to_feedback_type:
+                            raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")
+                        resolved_ratings = {}
+                        for metric_id, score in ratings.items():
+                            if metric_id not in metric_ids_to_titles:
+                                raise ValueError(f"Metric ID {metric_id} not found in metrics.")
+                            resolved_ratings[metric_ids_to_titles[metric_id]] = score
+                        resolved_feedback[pseudonym_to_feedback_type[pseudonym]] = resolved_ratings
+                    resolved_submissions[submission_id] = resolved_feedback
+                progress_data["selected_values"][exercise_id] = resolved_submissions
         progress[progress_file] = progress_data

664-681: Silent exit on unresolvable cyclic mappings.

If the while loop exhausts max_iterations without resolving all transitive mappings, execution continues silently with partially-resolved mappings. This will surface later as opaque ValueError("Pseudonym ... not found in mappings") exceptions in resolve_feedback_types_and_metric_titles or create_common_config's exercise-collection pass, making debugging difficult.

Consider raising an explicit error when the loop terminates without convergence:

Proposed fix
         iterations += 1
 
+    if len(intersection) > 0:
+        raise ValueError(
+            f"Could not resolve all transitive mappings after {max_iterations} iterations. "
+            f"Unresolved pseudonyms: {intersection}"
+        )
+
     for config_file in evaluation_config_files:

@ls1intum ls1intum deleted a comment from coderabbitai bot Feb 11, 2026
@ls1intum ls1intum deleted a comment from coderabbitai bot Feb 11, 2026

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@DominikRemo (Contributor Author) commented:

@coderabbitai resolve


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Comments resolved.

@DominikRemo (Contributor Author) commented:

@coderabbitai full review


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@DominikRemo (Contributor Author) commented:

@coderabbitai full review


coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

47-59: ⚠️ Potential issue | 🟡 Minor

model_name will be None if the env var is missing, causing AttributeError on .replace().

If LLM_EVALUATION_MODEL is not set, os.getenv returns None and Line 54 crashes with AttributeError: 'NoneType' object has no attribute 'replace'. The same applies to the other env vars passed to AzureChatOpenAI. Consider adding a guard or using a default.

Suggested fix
 model_name = os.getenv("LLM_EVALUATION_MODEL")
 api_key = os.getenv("AZURE_OPENAI_API_KEY")
 api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
 api_version = os.getenv("OPENAI_API_VERSION")
+
+if not all([model_name, api_key, api_base, api_version]):
+    raise ValueError("Missing required environment variables. Check your .env file.")
🤖 Fix all issues with AI agents
In `@athena/evaluation/pyproject.toml`:
- Around line 18-19: Update the pinned package versions in pyproject.toml for
the dependencies ipykernel and tqdm: change ipykernel from 7.1.0 to 7.2.0 and
tqdm from 4.65.0 to 4.67.3, then regenerate the lockfile (e.g., run your
dependency manager's update/lock command such as poetry lock or pip-compile) and
run tests to ensure nothing breaks.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 89-94: The hardcoded max_tokens=100 passed to
model.with_structured_output(MetricEvaluations).invoke(...) is too small for
multi-metric structured JSON and can cause truncation; update this call to use a
larger default (e.g., 500–1000) or make it configurable via a constant/parameter
(e.g., MAX_EVAL_TOKENS or an argument to the LLM judge service) so
get_openai_callback(), model.with_structured_output(MetricEvaluations), and
invoke(...) use that value; ensure any configuration name is documented in the
function/class signature and replace the inline 100 with the new variable.
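
A sketch of that change. The constant name is a suggestion, and `model`, `MetricEvaluations`, and `request` refer to the names quoted from the service above; the call shape mirrors the quoted snippet rather than the actual code:

# Assumed constant; 500-1000 tokens leaves room for multi-metric structured JSON.
MAX_EVAL_TOKENS = 800

structured_model = model.with_structured_output(MetricEvaluations)
metric_evaluations = structured_model.invoke(
    request.prompt, max_tokens=MAX_EVAL_TOKENS, temperature=0
)
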
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)

24-31: Inconsistent generic syntax: List[BaseMessage] vs list[Metric].

Line 25 uses List[BaseMessage] (from typing) while line 31 uses the built-in list[Metric]. Since the project targets Python 3.12+, prefer the built-in list consistently (matching the style already used by MetricEvaluations on line 19).

♻️ Suggested fix
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(

This would also allow removing the List import from typing on line 1.

athena/evaluation/6_analysis_data_export.ipynb (1)

44-56: pd.concat without ignore_index=True produces a duplicate index.

The appended LLM row will carry index 0, duplicating the existing first row's index. While this doesn't break the subsequent merge, it can cause subtle issues if the DataFrame is later indexed by position (e.g., df.loc[0] returns two rows).

♻️ Suggested fix

Add ignore_index=True to the concat call:

participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{
    ...
}])], ignore_index=True)
athena/evaluation/service/json_service.py (2)

689-704: Exercise collection re-reads all config files from disk a second time.

The config files are already loaded in the first loop (lines 651–662) for metrics/mappings. Consider collecting exercises in the same pass to avoid redundant I/O, or at minimum document why the two-pass approach is needed (e.g., mappings must be fully resolved before exercise de-pseudonymization).


821-846: load_common_evaluation_config keeps all of its logic nested under the with block and has no error handling for a missing config file.

Minor: the entire function body is inside the with block, meaning the DataFrame construction happens while the file is still open. This works but is slightly unusual. More importantly, there's no error handling if evaluation_config_path doesn't exist.

Also, the function returns a pd.DataFrame, but its docstring is missing a Returns section.

Suggested improvement
 def load_common_evaluation_config(evaluation_config_path: str) -> pd.DataFrame:
     """
     Loads the common evaluation configuration from a JSON file.
     Args:
         evaluation_config_path (str): The path to the common evaluation configuration file.
+    Returns:
+        pd.DataFrame: A DataFrame containing the evaluation configuration records.
     """
     with open(evaluation_config_path, "r") as config_file:
         common_evaluation_config = json.load(config_file)
-        
-        records = []
-        ...
-        return pd.DataFrame.from_records(records)
+
+    records = []
+    for raw_exercise in common_evaluation_config.get("exercises", []):
+        exercise = {k: v for k, v in raw_exercise.items() if k != "submissions"}
+        for raw_submission in raw_exercise.get("submissions", []):
+            submission = {k: v for k, v in raw_submission.items() if k != "feedbacks"}
+            for feedback_type, raw_feedback in raw_submission.get("feedbacks", {}).items():
+                record = {
+                    'exercise_id': exercise['id'],
+                    'submission_id': submission['id'],
+                    'feedback_type': feedback_type,
+                    'exercise': exercise,
+                    'submission': submission,
+                    'feedback': raw_feedback,
+                }
+                records.append(record)
+
+    return pd.DataFrame.from_records(records)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

85-125: Two mutually exclusive metric-loading cells — consider adding a clearer guard.

Both "Define New Metrics" (Lines 85–95) and "Reuse Metrics from Expert Evaluation" (Lines 111–125) assign to the same metrics variable. Running both cells sequentially means the second silently overwrites the first. This is a common notebook pattern but could confuse users. A brief inline comment like # NOTE: Only run ONE of the two metrics cells below in each cell header would help.

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

@DominikRemo
Contributor Author

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@DominikRemo
Contributor Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)

40-59: ⚠️ Potential issue | 🟡 Minor

model_name can be None if the env var is unset, causing AttributeError on Line 54.

os.getenv("LLM_EVALUATION_MODEL") returns None when the variable isn't configured. The .replace(...) call on Line 54 will then fail. Consider adding a guard or using a clearer error message:

Proposed fix
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
+"\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 18-36: The notebook sets analysis_output_path but never ensures
the directory exists before writing output; before exporting the CSV (where
df.to_csv / writing occurs) add a directory creation step using
os.makedirs(analysis_output_path, exist_ok=True) so the path is created if
missing; locate and update the cell around the export step that references
analysis_output_path to call os.makedirs(...) just prior to the write.

In `@athena/evaluation/service/json_service.py`:
- Around line 821-846: The function load_common_evaluation_config currently
returns pd.DataFrame.from_records(records) which yields a column-less DataFrame
when records is empty; change load_common_evaluation_config to detect the
empty-records case and return an empty DataFrame with the expected columns (at
minimum "exercise_id", "submission_id", "feedback_type", "exercise",
"submission", "feedback") so downstream merges on
["exercise_id","submission_id","feedback_type"] don't KeyError; update the
return path in load_common_evaluation_config to construct
pd.DataFrame.from_records(records, columns=[...]) or explicitly create
pd.DataFrame(columns=[...]) when records is empty.
- Around line 670-681: The loop mutates mappings while iterating which can
produce non-deterministic results; change the inner iteration to work on a
stable snapshot (e.g., iterate over list(mappings.items()) or create a new dict)
so updates do not read values updated earlier in the same pass, then apply
deletions as before; specifically modify the block that iterates "for key, value
in mappings.items()" to instead iterate a snapshot (or build new_mappings and
replace mappings) and preserve the subsequent self-reference removal and
recomputation of pseudonyms/feedback_types/intersection.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 87-93: The LLM call using
model.with_structured_output(MetricEvaluations).invoke(...) inside the
progress_bar loop has no error handling so any transient API failure will abort
the loop and lose partial evaluation_progress and total_cost; wrap the
invocation and callback usage (get_openai_callback, metric_evaluations,
cb.total_cost) in a try/except that catches API/network exceptions, log the
exception, mark the current item as errored in evaluation_progress (same pattern
as the existing validation-failure handling), ensure you still accumulate
cb.total_cost if available, persist/write evaluation_progress to disk after the
exception, and continue the loop so remaining requests are processed.
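
A sketch of that error-handling shape, assuming the loop iterates over MetricEvaluationRequest objects; record_success, record_error, and save_progress are hypothetical callables standing in for the service's existing progress handling, not its real helpers:

import logging

from langchain_community.callbacks import get_openai_callback
from tqdm import tqdm

def process_requests(model, requests, schema, evaluation_progress,
                     record_success, record_error, save_progress):
    """Evaluate each request, tolerating per-item API failures."""
    total_cost = 0.0
    for request in tqdm(requests):
        try:
            with get_openai_callback() as cb:
                result = model.with_structured_output(schema).invoke(request.prompt)
            total_cost += cb.total_cost
            record_success(evaluation_progress, request, result)
        except Exception:
            # Log, mark the item as errored, and keep going with the remaining requests.
            logging.exception("Evaluation failed for submission %s", request.submission_id)
            record_error(evaluation_progress, request)
        finally:
            save_progress(evaluation_progress)  # persist partial progress after every item
    return evaluation_progress, total_cost
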
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)

24-31: New MetricEvaluationRequest model looks good overall.

Minor style nit: lines 25 and 31 mix List[BaseMessage] (from typing) with list[Metric] (builtin). Since the project targets Python 3.12, you could use list[BaseMessage] consistently and drop the List import.

Suggested consistency fix
-from typing import List
-
-from langchain_core.messages import BaseMessage
+from langchain_core.messages import BaseMessage
 from pydantic import BaseModel, Field
...
 class MetricEvaluationRequest(BaseModel):
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(
         ..., description="The prompt to evaluate the metrics."
     )
athena/evaluation/6_analysis_data_export.ipynb (1)

42-57: Left merges may silently drop data — consider validating merge results.

The left merge on Line 40 and Line 57 could leave NaN in config/participant columns for unmatched records, and no warning is emitted. For a data-export notebook, this could lead to silently incomplete CSVs. Consider adding a brief check (e.g., logging the count of rows with missing config/participant data) after each merge so users are aware of data gaps.
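
A small helper along those lines, as a sketch only; merge_with_report and its label argument are illustrative names, and the join keys in any real call would be the notebook's own:

import pandas as pd

def merge_with_report(left: pd.DataFrame, right: pd.DataFrame, on: list[str], label: str) -> pd.DataFrame:
    """Left-merge and report how many rows found no match on the right-hand side."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    missing = int((merged["_merge"] == "left_only").sum())
    if missing:
        print(f"Warning: {missing} rows have no matching {label} data")
    return merged.drop(columns="_merge")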

athena/evaluation/4_expert_evaluation.ipynb (1)

142-159: Cross-cell dependency on json import — consider adding it to this cell.

This cell uses json.dump (Line 158) but json is imported in the previous cell (Line 116). If a user runs this cell in isolation or restarts and runs from here, they'll get a NameError. Since os is re-imported on Line 150, it would be consistent to also import json here.

athena/evaluation/service/json_service.py (1)

689-704: Config files are read twice — once for metrics/mappings and again for exercises.

Lines 651-662 open and parse each config file, and Lines 689-704 re-open and re-parse the same files. This is a minor inefficiency.

Suggested optimization

Cache the parsed JSON on the first pass:

+    parsed_configs = []
     for config_file in evaluation_config_files:
         with open(config_file, "r") as file:
             config_data = json.load(file)
+            parsed_configs.append(config_data)
             # Collect metrics
             ...
 
-    for config_file in evaluation_config_files:
-        with open(config_file, "r") as file:
-            config_data = json.load(file)
+    for config_data in parsed_configs:
             # Collect exercises
             ...
athena/evaluation/service/llm_as_a_judge_service.py (1)

146-148: Summary counts use nested list comprehensions that are hard to read.

These can be simplified with sum(...) for clarity:

Suggested refactor
-    print(f"Total evaluated submissions: {len([sub for ex in selected_values.values() for sub in ex.values()])}")
-    print(f"Total evaluated feedbacks: {len([fb for ex in selected_values.values() for sub in ex.values() for fb in sub.values()])}")
+    print(f"Total evaluated submissions: {sum(len(ex) for ex in selected_values.values())}")
+    print(f"Total evaluated feedbacks: {sum(len(sub) for ex in selected_values.values() for sub in ex.values())}")

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 806-822: The df.empty branch is unreachable because the code
raises ValueError when records is empty; remove the dead empty-DataFrame branch
(the block that constructs and returns the typed empty DataFrame) and keep the
initial raise, so the function simply builds df =
pd.DataFrame.from_records(records) and returns
df.astype({'exercise_id':'int64','submission_id':'int64'}); reference symbols:
records, pd.DataFrame.from_records, df, df.empty, ValueError, and df.astype.
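
The simplified tail would look roughly like the sketch below; build_progress_dataframe is a stand-in name for the affected code path in json_service.py, not its actual signature:

import pandas as pd

def build_progress_dataframe(records: list[dict]) -> pd.DataFrame:
    """Raise early on empty input, then build and type the DataFrame; no dead df.empty branch."""
    if not records:
        raise ValueError("No evaluation progress records found.")
    df = pd.DataFrame.from_records(records)
    return df.astype({"exercise_id": "int64", "submission_id": "int64"})
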
🧹 Nitpick comments (3)
athena/evaluation/service/llm_as_a_judge_service.py (2)

25-48: Minor code duplication between None and list branches.

The None and list cases differ only in the feedbacks argument passed to get_formatted_prompt ([] vs submission.feedbacks). Consider merging them to reduce repetition.

♻️ Proposed consolidation
-            if submission.feedbacks is None:
-                prompt = get_formatted_prompt(exercise, submission, [], metrics)
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, list):
-                prompt = get_formatted_prompt(
-                    exercise, submission, submission.feedbacks, metrics
-                )
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, dict):
+            if submission.feedbacks is None or isinstance(submission.feedbacks, list):
+                feedbacks = submission.feedbacks or []
+                prompt = get_formatted_prompt(exercise, submission, feedbacks, metrics)
+                requests.append(
+                    MetricEvaluationRequest(
+                        prompt=prompt,
+                        exercise_id=exercise.id,
+                        submission_id=submission.id,
+                        feedback_type="default",
+                        metrics=metrics,
+                    )
+                )
+            elif isinstance(submission.feedbacks, dict):

87-97: The LLM API call catches bare Exception; consider narrowing the exception type.

The try/except was added to address a prior review concern (good), but catching bare Exception is flagged by Ruff (BLE001). Since LLM API calls typically raise specific exceptions (e.g., openai.APIError, requests.exceptions.RequestException), narrowing the catch would improve signal. That said, for a notebook-driven evaluation pipeline, the current broad catch is pragmatic and acceptable.
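
One possible narrowing, assuming the relevant failures surface as openai SDK errors underneath langchain_openai; safe_invoke is an illustrative wrapper, not the service's code:

import logging

import openai

def safe_invoke(model, schema, prompt):
    """Invoke the structured-output chain, catching only OpenAI SDK errors."""
    try:
        return model.with_structured_output(schema).invoke(prompt)
    except openai.APIError:
        # Base class of the SDK's connection, status, and rate-limit errors.
        logging.exception("LLM evaluation call failed")
        return None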

athena/evaluation/service/json_service.py (1)

825-853: load_common_evaluation_config accesses exercise['id'] and submission['id'] without guarding for missing keys.

If a malformed config entry is missing "id", this will raise a KeyError with no context about which entry caused the problem. Consider using .get() with validation or wrapping in a try/except for a more informative error message. This is a minor robustness concern given this is notebook-driven code.
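
A minimal guard in that spirit; require_id is a hypothetical helper, not part of json_service.py:

def require_id(entry: dict, kind: str) -> int:
    """Return entry['id'], failing with a message that names the offending entry."""
    entry_id = entry.get("id")
    if entry_id is None:
        raise ValueError(f"Malformed config: {kind} entry without an 'id': {entry}")
    return entry_id

# Hypothetical usage inside the loading loop:
# exercise_id = require_id(raw_exercise, "exercise")
# submission_id = require_id(raw_submission, "submission")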

@github-actions

Athena Test Results Summary

Tests | Passed ✅ | Skipped | Failed
Athena Test Report: 10 ran | 10 passed | 0 skipped | 0 failed

Failing Tests Summary

Test | Result
No test annotations available

@github-actions

📊 Detailed Coverage Table

Combining 3 coverage files...
Parsing test-results/programming_module_programming_llm_coverage.xml...
Parsing test-results/text_module_text_llm_coverage.xml...
Parsing test-results/modeling_module_modeling_llm_coverage.xml...
Combining duplicate packages...
Creating combined coverage file: test-results/combined_coverage.xml
✅ Combined coverage saved to test-results/combined_coverage.xml
📊 Combined 31 unique packages

📊 Combined Coverage Summary:

Package Line Rate Branch Rate Status
athena 37.8% 3.3%
athena.helpers 100.0% 100.0%
athena.helpers.programming 33.0% 0.0%
athena.helpers.text 0.0% 100.0%
athena.models 0.0% 0.0%
athena.schemas 76.5% 8.3%
athena.storage 21.1% 0.0%
llm_core 100.0% 100.0%
llm_core.core 26.0% 6.2%
llm_core.loaders 79.3% 37.5%
llm_core.loaders.model_loaders 68.5% 37.5%
llm_core.models 66.7% 35.7%
llm_core.models.providers 77.2% 56.2%
llm_core.utils 52.8% 18.5%
modeling.module_modeling_llm.module_modeling_llm 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer 71.4% 50.0%
modeling.module_modeling_llm.module_modeling_llm.apollon_transformer.parser 79.2% 60.2%
modeling.module_modeling_llm.module_modeling_llm.core 88.9% 50.0%
modeling.module_modeling_llm.module_modeling_llm.models 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.prompts 100.0% 100.0%
modeling.module_modeling_llm.module_modeling_llm.utils 100.0% 50.0%
programming.module_programming_llm.module_programming_llm 100.0% 100.0%
programming.module_programming_llm.module_programming_llm.helpers 27.6% 0.0%
programming.module_programming_llm.module_programming_llm.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm 72.7% 12.5%
text.module_text_llm.module_text_llm.default_approach 66.4% 36.1%
text.module_text_llm.module_text_llm.default_approach.prompts 100.0% 100.0%
text.module_text_llm.module_text_llm.default_approach.schemas 100.0% 100.0%
text.module_text_llm.module_text_llm.divide_and_conquer 34.0% 0.0%
text.module_text_llm.module_text_llm.helpers 55.4% 26.7%
text.module_text_llm.module_text_llm.self_consistency 46.2% 0.0%

Total packages: 31

Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 760-813: The issue is a dtype mismatch: load_evaluation_progress
casts exercise_id and submission_id to int64 but load_common_evaluation_config
leaves exercise['id'] and submission['id'] as-is, causing merges to fail; update
load_common_evaluation_config to coerce the DataFrame columns for exercise_id
and submission_id to int64 (same as load_evaluation_progress) after building its
records/DataFrame so both DataFrames share explicit int64 dtypes for those
columns before any merge.

In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 134-143: The "Total evaluated feedbacks" print counts
feedback_type keys including the per-submission "meta" key, inflating the
number; update the calculation that uses selected_values (the expression summing
len(sub) for ex in selected_values.values() for sub in ex.values()) to exclude
the "meta" key when counting each submission's feedbacks (e.g., sum only keys !=
"meta" or subtract 1 when "meta" in sub.keys()) so the printed total reflects
actual feedback items.
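
A sketch of the corrected count, assuming selected_values is the nested exercise -> submission -> feedback_type mapping described above; count_feedbacks is an illustrative helper name:

def count_feedbacks(selected_values: dict) -> int:
    """Count feedback entries, skipping each submission's 'meta' key."""
    return sum(
        1
        for exercise in selected_values.values()
        for submission in exercise.values()
        for feedback_type in submission
        if feedback_type != "meta"
    )

# Hypothetical usage in the summary print:
# print(f"Total evaluated feedbacks: {count_feedbacks(selected_values)}")
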
🧹 Nitpick comments (2)
athena/evaluation/service/json_service.py (2)

692-705: exercise["submissions"].sort(key=lambda x: x["id"]) mutates data from parsed_configs in-place.

Since parsed_configs holds references to the originally parsed JSON dicts, this sort mutates the source data. This is likely fine for the current single-use notebook workflow, but worth noting if parsed_configs is ever reused.

More importantly: if a submission lacks an "id" key, this will raise a KeyError. A .get("id", 0) fallback or validation step before this point would be safer.
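
A sketch of a safer variant; sorted_submissions is an illustrative helper that copies before sorting and tolerates a missing "id":

import copy

def sorted_submissions(exercise: dict) -> list[dict]:
    """Return the exercise's submissions sorted by id without mutating the parsed config."""
    submissions = copy.deepcopy(exercise.get("submissions", []))
    return sorted(submissions, key=lambda s: s.get("id", 0))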


816-844: load_common_evaluation_config docstring is missing the Returns section.

The function returns a pd.DataFrame, but the docstring only documents Args. Minor documentation gap.

@DominikRemo
Contributor Author

@coderabbitai resolve

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Comments resolved.

@DominikRemo
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 11, 2026

✅ Actions performed

Full review triggered.

@laadvo laadvo self-requested a review February 11, 2026 21:55
Contributor

@laadvo laadvo left a comment


Looked over the code and tested the notebooks locally. Working as expected, nice improvement :)
