Athena: Unified Expert and LLM-as-Judge Evaluation Pipeline with Data Export for Analysis
#415
base: main
Conversation
📝 Walkthrough
Adds utilities to discover and de-pseudonymize expert evaluation files, an LLM-as-a-judge request/processing pipeline, a new analysis export notebook, model and prompt adjustments, removal of the legacy llm_service, minor notebook kernel metadata updates, dependency additions, and .gitignore changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Notebooks as Notebooks/Orchestration
    participant JSONSvc as json_service
    participant JudgeSvc as llm_as_a_judge_service
    participant Azure as AzureOpenAI

    rect rgba(200,200,255,0.5)
        Notebooks->>JSONSvc: get_evaluation_files / load_common_evaluation_config / load_evaluation_progress
        JSONSvc-->>Notebooks: evaluation files, common config, progress dataframe
    end

    rect rgba(200,255,200,0.5)
        Notebooks->>JudgeSvc: generate_evaluation_requests(data, metrics, filter?)
        JudgeSvc-->>Notebooks: List[MetricEvaluationRequest]
        Notebooks->>JudgeSvc: process_feedback_evaluations(requests, output_path, model, metrics)
    end

    rect rgba(255,200,200,0.5)
        JudgeSvc->>Azure: invoke model per request (prompts)
        Azure-->>JudgeSvc: responses (evaluations)
        JudgeSvc->>JudgeSvc: validate metrics, accumulate results, track cost
        JudgeSvc-->>Notebooks: write evaluation_progress_llm-as-a-judge.json
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
No actionable comments were generated in the recent review. 🎉
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Pylint (4.0.4): athena/evaluation/1_sampling_exercises.ipynb, athena/evaluation/0_database_setup.ipynb, athena/evaluation/2_sampling_submissions.ipynb
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
athena/evaluation/prompts/llm_evaluation_prompt.py (1)
134-149: ⚠️ Potential issue | 🟡 Minor
`grading_instruction_feedback` can become `None`, rendering as the literal "None" in the description string. When `feedback.structured_grading_instruction_id` is truthy but the ID isn't found in the exercise's grading criteria, `grading_instructions.get(...)` returns `None`, causing the ternary on line 145 to assign `None` to `grading_instruction_feedback`, overwriting the `""` default from line 134. The f-string on line 149 then produces `"None\n{feedback.description}"`.
Proposed fix:
```diff
 grading_instruction_feedback = (
-    grading_instruction.feedback + ": " if grading_instruction else None
+    grading_instruction.feedback + ": " if grading_instruction else ""
 )
```
athena/evaluation/5_llm_as_a_judge.ipynb (1)
47-59: ⚠️ Potential issue | 🟡 Minor
`model_name` can be `None` if the env var is unset, causing `AttributeError` on `.replace()`. `os.getenv("LLM_EVALUATION_MODEL")` returns `None` when the variable isn't set. Calling `.replace()` on `None` at line 54 will crash. Consider adding a guard or using a default value.
Proposed fix:
```diff
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"LLM_EVALUATION_MODEL environment variable is not set.\")\n",
```
🤖 Fix all issues with AI agents
In `@athena/evaluation/4_expert_evaluation.ipynb`:
- Around line 128-130: The default for the metrics field is wrong: replace
occurrences of common_evaluation_config.get("metrics", {}) with
common_evaluation_config.get("metrics", []) so the default is a list not a dict;
update both places referenced in the diff (the print statement using len(...)
and the later use at line ~152) and ensure this matches how create_common_config
constructs metrics.
In `@athena/evaluation/5_llm_as_a_judge.ipynb`:
- Around line 193-196: The call random.sample(requests, 10) can raise ValueError
when requests has fewer than 10 items; change the sampling to use
random.sample(requests, k) with k = min(10, len(requests)) so it gracefully
handles small lists (locate the sampling line using the symbol random.sample and
the requests variable).
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 79-96: The notebook kernel metadata lists "version": "3.11.9"
which is inconsistent with the project target of Python 3.12; update the
notebook's metadata block (the "kernelspec"/"language_info" fields) to reflect
Python 3.12 by changing "version": "3.11.9" to "3.12" and optionally update
"kernelspec.display_name" to "Python 3.12" and "kernelspec.name" to a matching
identifier (e.g., "python3.12") so the notebook metadata aligns with
pyproject.toml and other notebooks.
- Around line 24-34: The code defines llm_evaluation_progress_path but never
uses it, so load_evaluation_progress(evaluation_progress_path, None) omits LLM
data; update the call to pass the llm_evaluation_progress_path variable as the
second argument to load_evaluation_progress so that both expert
(evaluation_progress_path) and LLM (llm_evaluation_progress_path) progress files
are loaded and merged by that function.
In `@athena/evaluation/model/model.py`:
- Line 62: The constructor parameter using a mutable default (meta: dict = {})
in Submission.__init__ should be changed to use None as the default and
initialize a new dict inside the method to avoid shared state; update the
signature to use meta: Optional[dict] = None and then set self.meta = {} if meta
is None else dict(meta) (or similar) within Submission.__init__; apply the same
change to Exercise.__init__ (the similar meta/default pattern noted at line
~140) to prevent shared mutable defaults across instances (a sketch of this pattern follows after this list of fixes).
In `@athena/evaluation/pyproject.toml`:
- Around line 15-20: The project declares several Python deps but omits tqdm
which is imported by athena.evaluation.service.llm_as_a_judge_service (import
tqdm on top of that module), causing ImportError for consumers; fix by adding a
stable tqdm entry to pyproject.toml dependencies (e.g., a compatible caret
version like "^4.65.0" or similar) so the package manager installs tqdm
alongside the other libs.
In `@athena/evaluation/service/json_service.py`:
- Around line 616-621: The zip extraction loop using zipfile.ZipFile(... ) and
zip_ref.extractall(data_dir) is vulnerable to ZipSlip; update the extraction to
iterate over zip_ref.infolist() (or .namelist()), validate each member path by
joining with data_dir and ensuring the normalized absolute path starts with the
intended data_dir prefix, and only then extract that member (skip or raise on
invalid paths) rather than calling extractall; apply this change where files is
iterated and file.endswith(".zip") is checked to securely extract archives.
- Around line 660-670: The while-loop resolving transitive mappings can
infinite-loop if mappings become self-referential (e.g., A→A); add a guard by
tracking iterations or detecting self-mappings: introduce a max_iterations
(e.g., len(mappings) or 1000) and increment each loop, breaking when exceeded,
or after updating mappings check for any key where mappings[key] == key and
remove that key from the intersection (or set mappings[key] = None/remove the
entry) so it no longer appears in pseudonyms and feedback_types; apply this fix
around the variables mappings, pseudonyms, feedback_types, intersection and the
while-loop to ensure termination.
In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 92-101: The current loop breaks on the first validation failure
which abandons the remaining requests; instead, when evaluated_metric_titles !=
expected_metric_titles or the LLM returns an unexpected format, log the error
(including evaluated_metric_titles, expected_metric_titles and request) and use
continue to skip just that request so processing continues for the rest, and
ensure is_finished_evaluating is set to False for any skipped request so the
output indicates incomplete evaluation; update the logic around
evaluated_metric_titles/expected_metric_titles and the else branch to perform
logging + continue (or at minimum set is_finished_evaluating = False before any
early exit) rather than break.
- Around line 138-142: Remove unnecessary f-string prefixes used where there are
no placeholders: change the filename construction that uses
f"evaluation_progress_llm-as-a-judge.json" when building file_path (the
os.path.join call that assigns file_path) to a plain string, and change the
print call print(f"\nSummary of Evaluation:") to print("\nSummary of
Evaluation:"). Keep the rest of the logic (writing evaluation_progress with
json.dump and variable names file_path and evaluation_progress) unchanged.
- Around line 67-140: In process_feedback_evaluations, evaluation_progress is
only set inside the for-loop which causes a NameError when requests is empty;
initialize evaluation_progress before the loop (e.g., set
current_submission_index/current_exercise_index to None, selected_values to {}
and has_started_evaluating/is_finished_evaluating appropriately) so the variable
always exists, then update it inside the loop as currently done; ensure the
final json dump writes that preinitialized evaluation_progress when no
iterations occur.
In `@athena/evaluation/service/plot_service.py`:
- Around line 30-31: Several functions in
athena/evaluation/service/plot_service.py call plt.show() before plt.savefig(),
which causes saved files to be blank; locate the three occurrences of the
plt.show() / plt.savefig() pair (the end of each plotting function in this
module) and swap the calls so plt.savefig(os.path.join(plot_path, filename)) is
executed before plt.show(); apply the same swap in the other two functions that
have the identical pair so all saved plots are written before being displayed.
- Around line 8-31: The plot_boxplot function currently draws on the global
matplotlib state so successive calls overlay plots; fix it by explicitly
creating a new figure/axes at the start (e.g., call plt.figure() or use fig, ax
= plt.subplots()) before calling sns.boxplot and by closing the figure at the
end (e.g., plt.close(fig) or plt.close()) after saving; update plot_boxplot (and
mirror the same pattern in the other plotting functions) so all plotting,
layout, grid, savefig, show, and close operations are performed on the newly
created figure/axes to avoid bleed between calls.
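For the mutable-default fix in `model.py`, a minimal sketch of the suggested pattern; the constructor shape and extra parameters here are assumptions, not the actual `Submission.__init__` signature:

```python
from typing import Optional


class Submission:
    # Hypothetical constructor; only the `meta` handling reflects the suggested fix.
    def __init__(self, id: int, meta: Optional[dict] = None):
        self.id = id
        # Create a fresh dict per instance instead of sharing one mutable default.
        self.meta = {} if meta is None else dict(meta)


# Each instance now gets its own dict, so mutating one does not leak into others.
a, b = Submission(1), Submission(2)
a.meta["seen"] = True
assert b.meta == {}
```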
🧹 Nitpick comments (4)
athena/evaluation/service/plot_service.py (1)
34-83: `plot_feedback_type_metric` and `plot_metric_feedback_type` are near-duplicates. They differ only in which column is `x` vs `hue` and the legend/filename labels. Consider delegating to `plot_boxplot` (which already accepts all these parameters) to eliminate the duplication; a consolidated sketch follows after these comments.
athena/evaluation/service/json_service.py (2)
617-631: Inconsistent path construction: f-strings vs `os.path.join`. Lines 620, 629, and 631 use f-string interpolation (`f"{data_dir}/{file}"`) while most of the rest of this file uses `os.path.join`. Prefer `os.path.join` throughout for cross-platform safety.
735-735: Function signature uses `str | None` alongside `Optional[str]` in the same file. Line 735 uses `str | None` for `llm_evaluation_progress_path` while other functions (e.g., `group_exercise_data` at Line 56) use `Optional[str]`. Pick one style for consistency.
athena/evaluation/service/llm_as_a_judge_service.py (1)
129-135: `evaluation_progress` is redundantly rebuilt on every iteration. This dict is reassigned in every loop iteration but only the final value matters (after the loop). Move it outside the loop (as part of the NameError fix above) and update `has_started_evaluating` / `is_finished_evaluating` once after the loop completes.
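A consolidated sketch of what `plot_boxplot` could look like once the wrappers delegate to it, folding in the figure/axes creation, savefig-before-show ordering, and figure cleanup suggested above; the parameter names are assumptions rather than the module's actual signature:

```python
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_boxplot(df: pd.DataFrame, x: str, y: str, hue: str, plot_path: str, filename: str) -> None:
    # Draw on a dedicated figure so successive calls do not bleed into each other.
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.boxplot(data=df, x=x, y=y, hue=hue, ax=ax)
    ax.grid(True)
    fig.tight_layout()
    # Save before showing: showing first can leave the saved file blank on some backends.
    os.makedirs(plot_path, exist_ok=True)
    fig.savefig(os.path.join(plot_path, filename))
    plt.show()
    plt.close(fig)
```

`plot_feedback_type_metric` and `plot_metric_feedback_type` could then become thin wrappers that only swap the `x` and `hue` arguments.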
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
athena/evaluation/1_sampling_exercises.ipynb (1)
102-103: ⚠️ Potential issue | 🟡 Minor
Missing `pandas` import for the commented-out `read_csv` hint. The commented line references `pd.read_csv`, but `pandas` is never imported (as `pd` or otherwise) in this notebook. If a user uncomments it, they'll get a `NameError`.
Proposed fix:
```diff
-    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
+    "# import pandas as pd\n",
+    "# data = pd.read_csv(\"data/1_exercises/exercises.csv\")"
```
athena/evaluation/5_llm_as_a_judge.ipynb (1)
47-59: ⚠️ Potential issue | 🟡 Minor
`model_name.replace(...)` will raise `AttributeError` if the env var is unset. If `LLM_EVALUATION_MODEL` is not configured in `.env`, `os.getenv` returns `None` and the `.replace()` call on Line 54 crashes. Consider adding a guard or using a default.
Proposed fix:
```diff
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
 "\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
 "\n",
 "model = AzureChatOpenAI(\n",
```
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 71-74: The columns exercise, submission, and feedback in the
DataFrame (populated by load_common_evaluation_config) contain nested dicts that
will be written as Python reprs by df.to_csv and are hard to parse; before
calling df.to_csv (the line using analysis_output_path and
"evaluation_data.csv"), either flatten the dicts into primitive columns (e.g.,
exercise_id, exercise_title, submission_timestamp, feedback_text) or
JSON-serialize those three columns (use json.dumps on each cell) so the CSV
contains valid JSON strings, then write the resulting DataFrame with df.to_csv.
In `@athena/evaluation/service/json_service.py`:
- Around line 654-658: The current aggressive sanitization in the metrics
collection loop (the metric["title"] assignment) removes useful punctuation like
hyphens and parentheses; change the filter to only remove control/invisible
characters instead of stripping most punctuation—import unicodedata and replace
the comprehension with one that keeps characters whose Unicode category does not
start with 'C' (control) or, alternatively, allow a small whitelist of
punctuation (e.g., "-.()/_:") while trimming whitespace; update the
metric["title"] assignment inside the loop that iterates over
config_data.get("metrics", []) and ensure the new logic preserves meaningful
characters while still trimming surrounding spaces. A sketch of the narrower filter follows after this list of fixes.
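A sketch of the narrower sanitization for metric titles; the helper name is illustrative, and only the filtering logic is the point:

```python
import unicodedata


def clean_title(title: str) -> str:
    # Drop only control/format/invisible characters (Unicode category "C*") and trim
    # surrounding whitespace; hyphens, parentheses, and other punctuation survive.
    return "".join(ch for ch in title if not unicodedata.category(ch).startswith("C")).strip()


# The zero-width space is removed, the hyphen and parentheses are kept.
assert clean_title("  Re-usability (code)\u200b ") == "Re-usability (code)"
```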
🧹 Nitpick comments (6)
athena/evaluation/.gitignore (1)
1-3: Consider adding a comment to document the recursive ignore pattern. The recursive pattern correctly handles nested directories and .gitkeep files, preserving the directory structure (e.g., data/4_expert_evaluation/input_pseudonymized/.gitkeep, data/6_analysis/.gitkeep). However, the pattern is not immediately obvious and adding documentation would improve maintainability for future contributors.
Suggested documentation:
```diff
+# Ignore all files recursively under data/, but keep directory structure via .gitkeep
 data/**/*
 !data/**/
 !data/**/.gitkeep
```
athena/evaluation/service/json_service.py (3)
610-637: Inconsistent path construction: mix of `os.path.join` and f-string with `/`. Lines 620 and 623 correctly use `os.path.join`, but Lines 633 and 635 use `f"{data_dir}/{file}"`. This is inconsistent and can break on Windows.
Proposed fix:
```diff
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))
```
715-743: Misleading variable names obscure the data structure. In the nested loops, `feedback_id` is actually a feedback-type pseudonym and `ratings` is a dict of `{metric_id: score}`. Similarly, the outer `submission` variable holds per-submission data keyed by submission IDs. Consider renaming for clarity:
Suggested renames:
```diff
-    for _, submission in progress_data.get("selected_values", {}).items():
-        for submission_id, feedback in submission.items():
-            resolved_feedback = {}
-            for feedback_id, ratings in feedback.items():
-                if feedback_id not in pseudonym_to_feedback_type:
-                    raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
+    for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+        for submission_id, feedbacks_by_type in submissions.items():
+            resolved_feedback = {}
+            for pseudonym, ratings in feedbacks_by_type.items():
+                if pseudonym not in pseudonym_to_feedback_type:
+                    raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")
```
794-819: Store nested dicts as proper JSON strings or flatten the structure before DataFrame creation. The `exercise`, `submission`, and `feedback` columns are exported to CSV by the downstream analysis notebook (6_analysis_data_export.ipynb). Pandas will serialize these nested dicts as string representations (e.g., `"{'id': ...}"`) rather than valid JSON, making them fragile and difficult to parse. Either flatten these columns into separate DataFrame columns, or serialize them with `json.dumps()` before storing in the DataFrame; a sketch of the serialization approach follows after these comments.
athena/evaluation/6_analysis_data_export.ipynb (1)
39-53: LLM participant row is always appended even when LLM data is not loaded. If `llm_evaluation_progress_path` remains `None`, no LLM scores exist in `df`, but the LLM participant row is unconditionally added to `participant_info_df`. After the left merge on `expert_id`, this won't cause incorrect data (the LLM row just won't match anything), but it's misleading. Consider guarding the append:
Proposed fix:
```diff
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
+"if llm_evaluation_progress_path:\n",
 "participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
```
(with appropriate indentation)
athena/evaluation/4_expert_evaluation.ipynb (1)
148-162: `count` variable is incremented but never used. The `count` variable on Lines 155 and 157 is dead code. Consider either printing it after the loop or removing it. Also, `os.makedirs(output_dir, exist_ok=True)` on Line 159 is redundant inside the loop; `output_dir` was already created in the previous cell (Line 123).
Proposed fix:
```diff
-"count = 0\n",
 "for progress_file, progress_data in resolve_feedback_types_and_metric_titles(evaluation_progress_files, metric_ids_to_titles, pseudonym_to_feedback_type).items():\n",
-"    count += 1\n",
 "    output_progress_path = f\"{output_dir}/{os.path.basename(progress_file)}\"\n",
-"    os.makedirs(output_dir, exist_ok=True)\n",
 "    with open(output_progress_path, \"w\") as output_file:\n",
-"        json.dump(progress_data, output_file, indent=4)"
+"        json.dump(progress_data, output_file, indent=4)\n",
+"\n",
+"print(f\"Resolved {len(evaluation_progress_files)} progress files.\")"
```
Updated expert evaluation notebook to support multiple config/progress files, added steps for depseudonymization, and improved error handling. Expanded LLM-as-a-judge notebook with options to reuse expert metrics, improved output, and clarified evaluation steps. Added new analysis notebooks and introduced plot service for visualization. Updated dependencies and refactored model and service code to support new workflow.
863c57f to 0e0520b (Compare)
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)
47-59: ⚠️ Potential issue | 🟡 Minor
`model_name` will be `None` if the env var is unset, causing `AttributeError` on `.replace()`. If `LLM_EVALUATION_MODEL` is not set in the `.env` file, `os.getenv(...)` returns `None`, and `model_name.replace("azure_openai_", "")` on line 54 will raise `AttributeError: 'NoneType' object has no attribute 'replace'`. Consider adding a guard or a clear error message.
Proposed fix:
```diff
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
+"if not model_name:\n",
+"    raise ValueError(\"Environment variable LLM_EVALUATION_MODEL is not set.\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
```
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 788-791: The current return uses
pd.DataFrame.from_records(records).astype({'exercise_id': 'int64',
'submission_id': 'int64'}) which will KeyError on an empty DataFrame with no
columns; modify the code to first build the DataFrame (df =
pd.DataFrame.from_records(records)) and if df.empty (or if not records) return
an explicitly typed empty DataFrame (e.g. pd.DataFrame({'exercise_id':
pd.Series(dtype='int64'), 'submission_id': pd.Series(dtype='int64')}), otherwise
perform df.astype({'exercise_id': 'int64', 'submission_id': 'int64'}) and return
the result to avoid the KeyError.
- Around line 620-625: The ZipSlip check around extracting zip entries is
vulnerable because target_path.startswith(os.path.abspath(data_dir)) can be
fooled by sibling paths; update the extraction logic in the block using
zipfile.ZipFile (where target_path and data_dir are computed and zip_ref.extract
is called) to compute a base path with a trailing separator (e.g., base_path =
os.path.abspath(data_dir) + os.sep) and then verify
target_path.startswith(base_path) before calling zip_ref.extract; this ensures
entries like "/app/data_evil/..." do not pass the check. A combined sketch of the hardened extraction follows after this list of fixes.
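A sketch of the hardened extraction combining both points raised so far: member-by-member extraction and the trailing-separator base-path check. The function wrapper and names are illustrative; only the validation logic is the point:

```python
import os
import zipfile


def safe_extract_zip(zip_path: str, data_dir: str) -> None:
    # Trailing separator ensures sibling paths like "/app/data_evil/..." fail the check.
    base_path = os.path.abspath(data_dir) + os.sep
    with zipfile.ZipFile(zip_path) as zip_ref:
        for member in zip_ref.infolist():
            target_path = os.path.abspath(os.path.join(data_dir, member.filename))
            if not target_path.startswith(base_path):
                raise ValueError(f"Refusing to extract unsafe zip entry: {member.filename}")
            zip_ref.extract(member, data_dir)
```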
🧹 Nitpick comments (7)
athena/evaluation/pyproject.toml (1)
18-19: Nit: Inconsistent version pinning for `tqdm`. All other dependencies in this file use exact version pins (e.g., "7.1.0"), but `tqdm` uses a caret range ("^4.65.0"). Consider pinning it exactly for consistency and reproducibility, e.g., "4.67.1".
athena/evaluation/service/json_service.py (4)
627-636: Inconsistent path construction: f-string interpolation vs `os.path.join`. Lines 633 and 635 use `f"{data_dir}/{file}"` while the zip extraction above correctly uses `os.path.join`. This can break on Windows where the separator is `\`.
Proposed fix:
```diff
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))
```
724-742: Confusing variable names obscure the nested structure. The loop variable `submission` on line 728 actually represents per-exercise data (a dict of submissions), and `feedback` on line 729 represents per-submission data (a dict of feedback types). This makes the code hard to follow.
Suggested rename:
```diff
-    for _, submission in progress_data.get("selected_values", {}).items():
-        for submission_id, feedback in submission.items():
+    for _exercise_id, submissions in progress_data.get("selected_values", {}).items():
+        for submission_id, feedback_types in submissions.items():
             resolved_feedback = {}
-            for feedback_id, ratings in feedback.items():
+            for feedback_id, ratings in feedback_types.items():
                 if feedback_id not in pseudonym_to_feedback_type:
                     raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
                 resolved_ratings = {}
                 for metric_id, score in ratings.items():
                     if metric_id not in metric_ids_to_titles:
                         raise ValueError(f"Metric ID {metric_id} not found in metrics.")
                     title = metric_ids_to_titles[metric_id]
                     resolved_ratings[title] = score
                 resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-            submission[submission_id] = resolved_feedback
+            submissions[submission_id] = resolved_feedback
```
757-761: Expert ID extraction is fragile if the filename contains extra underscores or dots. `file.replace("evaluation_progress_", "").replace(".json", "")` assumes a single occurrence of these substrings. A filename like `evaluation_progress_expert_1.json.bak` or `evaluation_progress_A_B.json` could produce unexpected IDs. Using `re` or splitting on a known pattern would be more robust (a sketch follows after these comments), though the current approach works for the expected naming convention.
804-811: Direct dict key access (`exercise['id']`, `submission['id']`) will raise `KeyError` if the key is missing. Other parts of this file use `.get()` with defaults for defensive access. Consider using `.get("id")` here too, or at least documenting that `id` is a required field so failures are intentional.
athena/evaluation/4_expert_evaluation.ipynb (1)
148-162: `count` variable is incremented but never used. The `count` variable on lines 155/157 is accumulated but never printed or returned. Either remove it or add a summary print after the loop (e.g., `print(f"Resolved {count} progress files.")`). Also, `os.makedirs(output_dir, exist_ok=True)` on line 159 is called inside the loop on every iteration. It's harmless due to `exist_ok=True`, but moving it before the loop would be cleaner; it's already called on line 123 in the previous cell anyway.
athena/evaluation/5_llm_as_a_judge.ipynb (1)
217-223: Confirm that the output path `data/5_llm_evaluation/` produces a file that notebook 6 can consume. Notebook 6 currently sets `llm_evaluation_progress_path = None` (line 25 of 6_analysis_data_export.ipynb). For the end-to-end pipeline to work, the user would need to update that path to point to the LLM output from this cell. Consider adding a comment here noting the expected output filename so the user knows what to reference in step 6.
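A sketch of the `re`-based expert-ID extraction mentioned above; the expected filename convention (`evaluation_progress_<id>.json`) is taken from the comment, and the helper name is illustrative:

```python
import os
import re


def extract_expert_id(file_path: str) -> str:
    # Match the whole basename so stray suffixes such as ".json.bak" are rejected
    # instead of silently producing a mangled ID.
    match = re.fullmatch(r"evaluation_progress_(.+)\.json", os.path.basename(file_path))
    if match is None:
        raise ValueError(f"Unexpected progress filename: {file_path}")
    return match.group(1)


assert extract_expert_id("data/evaluation_progress_expert_1.json") == "expert_1"
```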
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 27-29: Fix the typo in the notebook comment: change "evalution
progress" to "evaluation progress" in the explanatory string that mentions the
5_llm_as_a_judge notebook and the stored path (the comment line referencing
"data/5_llm_evaluation/evaluation_progress_llm-as-a-judge.json"); leave the
variable llm_evaluation_progress_path as-is.
🧹 Nitpick comments (6)
athena/evaluation/6_analysis_data_export.ipynb (1)
18-57: LLM participant is always appended even when no LLM data is loaded. Lines 44–55 unconditionally add an `'llm'` row to `participant_info_df`, but `llm_evaluation_progress_path` defaults to `None` (line 29), so `load_evaluation_progress` won't produce any `expert_id == 'llm'` rows. After the left merge on `expert_id` (line 56), this phantom participant row won't appear in the final data, so there's no data corruption, but it's misleading. Consider guarding the concat:
Suggested guard:
```diff
 "# Load participant info and add LLM as a participant\n",
 "participant_info_df = pd.read_csv(participant_info_path, delimiter=\";\")\n",
-"participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
+"if llm_evaluation_progress_path:\n",
+"    participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
```
athena/evaluation/4_expert_evaluation.ipynb (1)
142-160: `os.makedirs` on line 157 is called inside the loop; consider hoisting it. `os.makedirs(output_dir, exist_ok=True)` is already called once at line 123. The repeated call inside the loop (line 157) is redundant. It's harmless due to `exist_ok=True`, but removing it keeps the loop body focused on file I/O.
athena/evaluation/service/json_service.py (4)
610-638: Mix of `os.path.join` and f-string path concatenation. Lines 620, 624 correctly use `os.path.join` / `os.path.abspath`, but lines 634 and 636 revert to f-string concatenation (`f"{data_dir}/{file}"`). Use `os.path.join` consistently for cross-platform safety.
Suggested fix:
```diff
         if file.startswith("evaluation_config_"):
-            evaluation_config_files.append(f"{data_dir}/{file}")
+            evaluation_config_files.append(os.path.join(data_dir, file))
         elif file.startswith("evaluation_progress_"):
-            evaluation_progress_files.append(f"{data_dir}/{file}")
+            evaluation_progress_files.append(os.path.join(data_dir, file))
```
716-744: Line 716: very long function signature; consider wrapping for readability. The signature spans well beyond typical line-length limits. Also, the docstring is missing a `Returns` section.
Suggested formatting:
```diff
-def resolve_feedback_types_and_metric_titles(evaluation_progress_files: list[str], metric_ids_to_titles: dict[str, str], pseudonym_to_feedback_type: dict[str, str]) -> dict[str, dict]:
+def resolve_feedback_types_and_metric_titles(
+    evaluation_progress_files: list[str],
+    metric_ids_to_titles: dict[str, str],
+    pseudonym_to_feedback_type: dict[str, str],
+) -> dict[str, dict]:
```
747-798: Type annotation style inconsistency; the empty-DataFrame guard looks good. The empty-DataFrame guard (lines 790–794) properly addresses the prior review concern. Minor style nit: line 747 uses `str | None` (PEP 604 union syntax) while line 56 uses `Optional[str]`. Both are valid in Python 3.12+, but mixing styles within the same file reduces consistency.
2-8: `import os.path` on line 6 is redundant. `os` is already imported on line 3, and `os.path` is automatically available as a submodule. The explicit `import os.path` can be removed.
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 798-807: The empty-DataFrame branch currently returns only
exercise_id and submission_id causing KeyError downstream; change the
empty-return to construct and return a fully-typed DataFrame with the same
schema as the non-empty case (columns: expert_id, exercise_id, submission_id,
feedback_type, metric, score) using pd.DataFrame({ 'expert_id':
pd.Series(dtype='int64'), 'exercise_id': pd.Series(dtype='int64'),
'submission_id': pd.Series(dtype='int64'), 'feedback_type':
pd.Series(dtype='object'), 'metric': pd.Series(dtype='object'), 'score':
pd.Series(dtype='float64') }) so code that accesses df['expert_id'] etc. works;
keep the existing pd.DataFrame.from_records(records) and df.astype({...}) path
for non-empty results. A sketch of the fully-typed empty frame follows after this list of fixes.
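A sketch of the fully-typed empty frame described above; the column names and dtypes are the ones listed in the comment, and the helper wrapper is illustrative:

```python
import pandas as pd


def empty_progress_frame() -> pd.DataFrame:
    # Same schema as the populated case, so downstream code that selects
    # df["expert_id"], merges, or casts does not hit a KeyError on empty input.
    return pd.DataFrame(
        {
            "expert_id": pd.Series(dtype="int64"),
            "exercise_id": pd.Series(dtype="int64"),
            "submission_id": pd.Series(dtype="int64"),
            "feedback_type": pd.Series(dtype="object"),
            "metric": pd.Series(dtype="object"),
            "score": pd.Series(dtype="float64"),
        }
    )
```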
🧹 Nitpick comments (3)
athena/evaluation/service/json_service.py (3)
627-637: No file-extension filter on discovered evaluation files. Lines 632 and 634 match any file whose name starts with `evaluation_config_` or `evaluation_progress_`, regardless of extension. A stray file like `evaluation_config_backup.txt` or a `.zip` would be included and would later fail when `json.load` is called. Consider adding an `.endswith(".json")` guard.
Proposed fix:
```diff
     for file in files:
-        if file.startswith("evaluation_config_"):
+        if file.startswith("evaluation_config_") and file.endswith(".json"):
             evaluation_config_files.append(os.path.join(data_dir, file))
-        elif file.startswith("evaluation_progress_"):
+        elif file.startswith("evaluation_progress_") and file.endswith(".json"):
             evaluation_progress_files.append(os.path.join(data_dir, file))
```
730-749: Misleading variable names obscure the nested data structure. The variable names don't match the data they hold:
- Line 734: `submission` is actually the exercise-level dict mapping `submission_id → feedback_data`, not a single submission.
- Line 735: `feedback` is actually a dict of `feedback_pseudonym → ratings`, not a single feedback item.
- Line 737: `feedback_id` is actually a pseudonym key.
Also, Line 747 mutates `submission` (the dict being iterated on Line 735) by reassigning `submission[submission_id]`. While updating an existing key during iteration is safe in CPython, it's a readability concern. Building a new dict and replacing after iteration would be clearer.
Proposed refactor for clarity:
```diff
     for progress_file in evaluation_progress_files:
         with open(progress_file, "r") as file:
             progress_data = json.load(file)

         # De-pseudonymize feedbacks in progress file
-        for _, submission in progress_data.get("selected_values", {}).items():
-            for submission_id, feedback in submission.items():
-                resolved_feedback = {}
-                for feedback_id, ratings in feedback.items():
-                    if feedback_id not in pseudonym_to_feedback_type:
-                        raise ValueError(f"Pseudonym {feedback_id} not found in mappings.")
-                    resolved_ratings = {}
-                    for metric_id, score in ratings.items():
-                        if metric_id not in metric_ids_to_titles:
-                            raise ValueError(f"Metric ID {metric_id} not found in metrics.")
-                        title = metric_ids_to_titles[metric_id]
-                        resolved_ratings[title] = score
-                    resolved_feedback[pseudonym_to_feedback_type[feedback_id]] = resolved_ratings
-                submission[submission_id] = resolved_feedback
+        for exercise_id, submissions_data in progress_data.get("selected_values", {}).items():
+            resolved_submissions = {}
+            for submission_id, feedback_types in submissions_data.items():
+                resolved_feedback = {}
+                for pseudonym, ratings in feedback_types.items():
+                    if pseudonym not in pseudonym_to_feedback_type:
+                        raise ValueError(f"Pseudonym {pseudonym} not found in mappings.")
+                    resolved_ratings = {}
+                    for metric_id, score in ratings.items():
+                        if metric_id not in metric_ids_to_titles:
+                            raise ValueError(f"Metric ID {metric_id} not found in metrics.")
+                        resolved_ratings[metric_ids_to_titles[metric_id]] = score
+                    resolved_feedback[pseudonym_to_feedback_type[pseudonym]] = resolved_ratings
+                resolved_submissions[submission_id] = resolved_feedback
+            progress_data["selected_values"][exercise_id] = resolved_submissions

         progress[progress_file] = progress_data
```
664-681: Silent exit on unresolvable cyclic mappings. If the `while` loop exhausts `max_iterations` without resolving all transitive mappings, execution continues silently with partially-resolved mappings. This will surface later as opaque `ValueError("Pseudonym ... not found in mappings")` exceptions in `resolve_feedback_types_and_metric_titles` or `create_common_config`'s exercise-collection pass, making debugging difficult. Consider raising an explicit error when the loop terminates without convergence:
Proposed fix:
```diff
         iterations += 1

+    if len(intersection) > 0:
+        raise ValueError(
+            f"Could not resolve all transitive mappings after {max_iterations} iterations. "
+            f"Unresolved pseudonyms: {intersection}"
+        )
+
     for config_file in evaluation_config_files:
```
✅ Actions performed: Review triggered.
@coderabbitai resolve
✅ Actions performed: Comments resolved.
@coderabbitai full review
✅ Actions performed: Full review triggered.
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)
47-59: ⚠️ Potential issue | 🟡 Minor
`model_name` will be `None` if the env var is missing, causing `AttributeError` on `.replace()`. If `LLM_EVALUATION_MODEL` is not set, `os.getenv` returns `None` and Line 54 crashes with `AttributeError: 'NoneType' object has no attribute 'replace'`. The same applies to the other env vars passed to `AzureChatOpenAI`. Consider adding a guard or using a default.
Suggested fix:
```diff
 model_name = os.getenv("LLM_EVALUATION_MODEL")
 api_key = os.getenv("AZURE_OPENAI_API_KEY")
 api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
 api_version = os.getenv("OPENAI_API_VERSION")
+
+if not all([model_name, api_key, api_base, api_version]):
+    raise ValueError("Missing required environment variables. Check your .env file.")
```
🤖 Fix all issues with AI agents
In `@athena/evaluation/pyproject.toml`:
- Around line 18-19: Update the pinned package versions in pyproject.toml for
the dependencies ipykernel and tqdm: change ipykernel from 7.1.0 to 7.2.0 and
tqdm from 4.65.0 to 4.67.3, then regenerate the lockfile (e.g., run your
dependency manager's update/lock command such as poetry lock or pip-compile) and
run tests to ensure nothing breaks.
In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 89-94: The hardcoded max_tokens=100 passed to
model.with_structured_output(MetricEvaluations).invoke(...) is too small for
multi-metric structured JSON and can cause truncation; update this call to use a
larger default (e.g., 500–1000) or make it configurable via a constant/parameter
(e.g., MAX_EVAL_TOKENS or an argument to the LLM judge service) so
get_openai_callback(), model.with_structured_output(MetricEvaluations), and
invoke(...) use that value; ensure any configuration name is documented in the
function/class signature and replace the inline 100 with the new variable. A sketch follows after this list of fixes.
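A sketch of the configurable limit; the constant name `MAX_EVAL_TOKENS` comes from the comment above, the default value is an assumption to tune, and whether `invoke` forwards `max_tokens` here simply mirrors the call shape the comment describes for the existing code:

```python
from langchain_core.language_models.chat_models import BaseChatModel
from pydantic import BaseModel

# Suggested constant; generous default for multi-metric structured JSON (tune as needed).
MAX_EVAL_TOKENS = 1000


def evaluate_with_limit(model: BaseChatModel, prompt, schema: type[BaseModel]):
    # Same call shape as the existing code, with the hardcoded 100 replaced by the constant.
    return model.with_structured_output(schema).invoke(prompt, max_tokens=MAX_EVAL_TOKENS)
```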
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)
24-31: Inconsistent generic syntax: `List[BaseMessage]` vs `list[Metric]`. Line 25 uses `List[BaseMessage]` (from `typing`) while line 31 uses the built-in `list[Metric]`. Since the project targets Python 3.12+, prefer the built-in `list` consistently (matching the style already used by `MetricEvaluations` on line 19).
♻️ Suggested fix:
```diff
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(
```
This would also allow removing the `List` import from `typing` on line 1.
athena/evaluation/6_analysis_data_export.ipynb (1)
44-56: `pd.concat` without `ignore_index=True` produces a duplicate index. The appended LLM row will carry index 0, duplicating the existing first row's index. While this doesn't break the subsequent merge, it can cause subtle issues if the DataFrame is later indexed by position (e.g., `df.loc[0]` returns two rows).
♻️ Suggested fix:
```diff
-    " participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
+    " participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{\n",
```
Add `ignore_index=True` to the concat call: `participant_info_df = pd.concat([participant_info_df, pd.DataFrame([{ ... }])], ignore_index=True)`
athena/evaluation/service/json_service.py (2)
689-704: Exercise collection re-reads all config files from disk a second time. The config files are already loaded in the first loop (lines 651–662) for metrics/mappings. Consider collecting exercises in the same pass to avoid redundant I/O, or at minimum document why the two-pass approach is needed (e.g., mappings must be fully resolved before exercise de-pseudonymization).
821-846: `load_common_evaluation_config` doesn't close the file handle on exception and nests all logic under `with`. Minor: the entire function body is inside the `with` block, meaning the `DataFrame` construction happens while the file is still open. This works but is slightly unusual. More importantly, there's no error handling if `evaluation_config_path` doesn't exist. Also, the function docstring says it returns `pd.DataFrame` but the return type annotation is missing.
Suggested improvement:
```diff
-def load_common_evaluation_config(evaluation_config_path: str) -> pd.DataFrame:
+def load_common_evaluation_config(evaluation_config_path: str) -> pd.DataFrame:
     """
     Loads the common evaluation configuration from a JSON file.

     Args:
         evaluation_config_path (str): The path to the common evaluation configuration file.
+    Returns:
+        pd.DataFrame: A DataFrame containing the evaluation configuration records.
     """
     with open(evaluation_config_path, "r") as config_file:
         common_evaluation_config = json.load(config_file)
-
-    records = []
-    ...
-    return pd.DataFrame.from_records(records)
+
+    records = []
+    for raw_exercise in common_evaluation_config.get("exercises", []):
+        exercise = {k: v for k, v in raw_exercise.items() if k != "submissions"}
+        for raw_submission in raw_exercise.get("submissions", []):
+            submission = {k: v for k, v in raw_submission.items() if k != "feedbacks"}
+            for feedback_type, raw_feedback in raw_submission.get("feedbacks", {}).items():
+                record = {
+                    'exercise_id': exercise['id'],
+                    'submission_id': submission['id'],
+                    'feedback_type': feedback_type,
+                    'exercise': exercise,
+                    'submission': submission,
+                    'feedback': raw_feedback,
+                }
+                records.append(record)
+
+    return pd.DataFrame.from_records(records)
```
athena/evaluation/5_llm_as_a_judge.ipynb (1)
85-125: Two mutually exclusive metric-loading cells; consider adding a clearer guard. Both "Define New Metrics" (Lines 85–95) and "Reuse Metrics from Expert Evaluation" (Lines 111–125) assign to the same `metrics` variable. Running both cells sequentially means the second silently overwrites the first. This is a common notebook pattern but could confuse users. A brief inline comment like `# NOTE: Only run ONE of the two metrics cells below` in each cell header would help.
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
@CodeRabbit review
✅ Actions performed: Review triggered.
@CodeRabbit full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
athena/evaluation/5_llm_as_a_judge.ipynb (1)
40-59: ⚠️ Potential issue | 🟡 Minor
`model_name` can be `None` if the env var is unset, causing `AttributeError` on Line 54. `os.getenv("LLM_EVALUATION_MODEL")` returns `None` when the variable isn't configured. The `.replace(...)` call on Line 54 will then fail. Consider adding a guard or using a clearer error message:
Proposed fix:
```diff
 "model_name = os.getenv(\"LLM_EVALUATION_MODEL\")\n",
 "api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
 "api_base = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
 "api_version = os.getenv(\"OPENAI_API_VERSION\")\n",
+"\n",
+"if not all([model_name, api_key, api_base, api_version]):\n",
+"    raise ValueError(\"Missing required environment variables. Check your .env file.\")\n",
```
🤖 Fix all issues with AI agents
In `@athena/evaluation/6_analysis_data_export.ipynb`:
- Around line 18-36: The notebook sets analysis_output_path but never ensures
the directory exists before writing output; before exporting the CSV (where
df.to_csv / writing occurs) add a directory creation step using
os.makedirs(analysis_output_path, exist_ok=True) so the path is created if
missing; locate and update the cell around the export step that references
analysis_output_path to call os.makedirs(...) just prior to the write.
In `@athena/evaluation/service/json_service.py`:
- Around line 821-846: The function load_common_evaluation_config currently
returns pd.DataFrame.from_records(records) which yields a column-less DataFrame
when records is empty; change load_common_evaluation_config to detect the
empty-records case and return an empty DataFrame with the expected columns (at
minimum "exercise_id", "submission_id", "feedback_type", "exercise",
"submission", "feedback") so downstream merges on
["exercise_id","submission_id","feedback_type"] don't KeyError; update the
return path in load_common_evaluation_config to construct
pd.DataFrame.from_records(records, columns=[...]) or explicitly create
pd.DataFrame(columns=[...]) when records is empty.
- Around line 670-681: The loop mutates mappings while iterating which can
produce non-deterministic results; change the inner iteration to work on a
stable snapshot (e.g., iterate over list(mappings.items()) or create a new dict)
so updates do not read values updated earlier in the same pass, then apply
deletions as before; specifically modify the block that iterates "for key, value
in mappings.items()" to instead iterate a snapshot (or build new_mappings and
replace mappings) and preserve the subsequent self-reference removal and
recomputation of pseudonyms/feedback_types/intersection.
In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 87-93: The LLM call using
model.with_structured_output(MetricEvaluations).invoke(...) inside the
progress_bar loop has no error handling so any transient API failure will abort
the loop and lose partial evaluation_progress and total_cost; wrap the
invocation and callback usage (get_openai_callback, metric_evaluations,
cb.total_cost) in a try/except that catches API/network exceptions, log the
exception, mark the current item as errored in evaluation_progress (same pattern
as the existing validation-failure handling), ensure you still accumulate
cb.total_cost if available, persist/write evaluation_progress to disk after the
exception, and continue the loop so remaining requests are processed. A sketch of this loop shape follows after this list of fixes.
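A sketch of the loop shape described above; the names (`get_openai_callback`, `request.prompt`, `request.submission_id`) follow the comments, while the bookkeeping is intentionally simplified compared to the service's real `evaluation_progress` structure:

```python
import logging

from langchain_community.callbacks import get_openai_callback
from tqdm import tqdm

logger = logging.getLogger(__name__)


def evaluate_all(model, requests, schema):
    results, failed, total_cost = [], [], 0.0
    for request in tqdm(requests, desc="Evaluating feedback"):
        try:
            with get_openai_callback() as cb:
                evaluation = model.with_structured_output(schema).invoke(request.prompt)
            total_cost += cb.total_cost
            results.append((request, evaluation))
        except Exception:  # noqa: BLE001 - keep the batch alive on transient API errors
            logger.exception("Evaluation failed for submission %s", request.submission_id)
            failed.append(request)
            continue  # skip only this request; the rest keep processing
    return results, failed, total_cost
```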
🧹 Nitpick comments (5)
athena/evaluation/model/evaluation_model.py (1)
24-31: New `MetricEvaluationRequest` model looks good overall. Minor style nit: lines 25 and 31 mix `List[BaseMessage]` (from `typing`) with `list[Metric]` (builtin). Since the project targets Python 3.12, you could use `list[BaseMessage]` consistently and drop the `List` import.
Suggested consistency fix:
```diff
-from typing import List
-
-from langchain_core.messages import BaseMessage
+from langchain_core.messages import BaseMessage
 from pydantic import BaseModel, Field
 ...
 class MetricEvaluationRequest(BaseModel):
-    prompt: List[BaseMessage] = Field(
+    prompt: list[BaseMessage] = Field(
         ..., description="The prompt to evaluate the metrics."
     )
```
athena/evaluation/6_analysis_data_export.ipynb (1)
42-57: Left merges may silently drop data; consider validating merge results. The left merge on Line 40 and Line 57 could leave `NaN` in config/participant columns for unmatched records, and no warning is emitted. For a data-export notebook, this could lead to silently incomplete CSVs. Consider adding a brief check (e.g., logging the count of rows with missing config/participant data) after each merge so users are aware of data gaps; a small sketch follows after these comments.
athena/evaluation/4_expert_evaluation.ipynb (1)
142-159: Cross-cell dependency on the `json` import; consider adding it to this cell. This cell uses `json.dump` (Line 158) but `json` is imported in the previous cell (Line 116). If a user runs this cell in isolation or restarts and runs from here, they'll get a `NameError`. Since `os` is re-imported on Line 150, it would be consistent to also import `json` here.
athena/evaluation/service/json_service.py (1)
689-704: Config files are read twice: once for metrics/mappings and again for exercises. Lines 651-662 open and parse each config file, and Lines 689-704 re-open and re-parse the same files. This is a minor inefficiency.
Suggested optimization
Cache the parsed JSON on the first pass:
```diff
+    parsed_configs = []
     for config_file in evaluation_config_files:
         with open(config_file, "r") as file:
             config_data = json.load(file)
+        parsed_configs.append(config_data)

         # Collect metrics
         ...

-    for config_file in evaluation_config_files:
-        with open(config_file, "r") as file:
-            config_data = json.load(file)
+    for config_data in parsed_configs:

         # Collect exercises
         ...
```
athena/evaluation/service/llm_as_a_judge_service.py (1)
146-148: Summary counts use nested list comprehensions that are hard to read. These can be simplified with `sum(...)` for clarity:
Suggested refactor:
```diff
-    print(f"Total evaluated submissions: {len([sub for ex in selected_values.values() for sub in ex.values()])}")
-    print(f"Total evaluated feedbacks: {len([fb for ex in selected_values.values() for sub in ex.values() for fb in sub.values()])}")
+    print(f"Total evaluated submissions: {sum(len(ex) for ex in selected_values.values())}")
+    print(f"Total evaluated feedbacks: {sum(len(sub) for ex in selected_values.values() for sub in ex.values())}")
```
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 806-822: The df.empty branch is unreachable because the code
raises ValueError when records is empty; remove the dead empty-DataFrame branch
(the block that constructs and returns the typed empty DataFrame) and keep the
initial raise, so the function simply builds df =
pd.DataFrame.from_records(records) and returns
df.astype({'exercise_id':'int64','submission_id':'int64'}); reference symbols:
records, pd.DataFrame.from_records, df, df.empty, ValueError, and df.astype.
🧹 Nitpick comments (3)
athena/evaluation/service/llm_as_a_judge_service.py (2)
25-48: Minor code duplication between `None` and `list` branches. The `None` and `list` cases differ only in the feedbacks argument passed to `get_formatted_prompt` (`[]` vs `submission.feedbacks`). Consider merging them to reduce repetition.
♻️ Proposed consolidation:
```diff
-            if submission.feedbacks is None:
-                prompt = get_formatted_prompt(exercise, submission, [], metrics)
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, list):
-                prompt = get_formatted_prompt(
-                    exercise, submission, submission.feedbacks, metrics
-                )
-                requests.append(
-                    MetricEvaluationRequest(
-                        prompt=prompt,
-                        exercise_id=exercise.id,
-                        submission_id=submission.id,
-                        feedback_type="default",
-                        metrics=metrics,
-                    )
-                )
-            elif isinstance(submission.feedbacks, dict):
+            if submission.feedbacks is None or isinstance(submission.feedbacks, list):
+                feedbacks = submission.feedbacks or []
+                prompt = get_formatted_prompt(exercise, submission, feedbacks, metrics)
+                requests.append(
+                    MetricEvaluationRequest(
+                        prompt=prompt,
+                        exercise_id=exercise.id,
+                        submission_id=submission.id,
+                        feedback_type="default",
+                        metrics=metrics,
+                    )
+                )
+            elif isinstance(submission.feedbacks, dict):
```
87-97: No error handling around the LLM API call; consider narrowing the exception type. The try/except was added to address a prior review concern (good), but catching bare `Exception` is flagged by Ruff (BLE001). Since LLM API calls typically raise specific exceptions (e.g., `openai.APIError`, `requests.exceptions.RequestException`), narrowing the catch would improve signal. That said, for a notebook-driven evaluation pipeline, the current broad catch is pragmatic and acceptable.
athena/evaluation/service/json_service.py (1)
825-853: `load_common_evaluation_config` accesses `exercise['id']` and `submission['id']` without guarding for missing keys. If a malformed config entry is missing `"id"`, this will raise a `KeyError` with no context about which entry caused the problem. Consider using `.get()` with validation or wrapping in a try/except for a more informative error message (a sketch follows below). This is a minor robustness concern given this is notebook-driven code.
Athena Test Results Summary
Failing Tests Summary
📊 Detailed Coverage Table: Combining 3 coverage files...
📊 Combined Coverage Summary: Total packages: 31
Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@athena/evaluation/service/json_service.py`:
- Around line 760-813: The issue is a dtype mismatch: load_evaluation_progress
casts exercise_id and submission_id to int64 but load_common_evaluation_config
leaves exercise['id'] and submission['id'] as-is, causing merges to fail; update
load_common_evaluation_config to coerce the DataFrame columns for exercise_id
and submission_id to int64 (same as load_evaluation_progress) after building its
records/DataFrame so both DataFrames share explicit int64 dtypes for those
columns before any merge.
In `@athena/evaluation/service/llm_as_a_judge_service.py`:
- Around line 134-143: The "Total evaluated feedbacks" print counts
feedback_type keys including the per-submission "meta" key, inflating the
number; update the calculation that uses selected_values (the expression summing
len(sub) for ex in selected_values.values() for sub in ex.values()) to exclude
the "meta" key when counting each submission's feedbacks (e.g., sum only keys !=
"meta" or subtract 1 when "meta" in sub.keys()) so the printed total reflects
actual feedback items. A sketch of the corrected count follows after this list of fixes.
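A sketch of the corrected count; the nested exercise → submission → feedback-type shape of `selected_values` and the per-submission "meta" key follow the comment above, and the example dict is purely illustrative:

```python
def count_evaluated_feedbacks(selected_values: dict) -> int:
    # Count feedback-type entries per submission, skipping the bookkeeping "meta" key.
    return sum(
        1
        for exercise in selected_values.values()
        for submission in exercise.values()
        for feedback_type in submission
        if feedback_type != "meta"
    )


example = {"101": {"2001": {"meta": {}, "Tutor": {"m1": 4}, "LLM": {"m1": 3}}}}
assert count_evaluated_feedbacks(example) == 2
```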
🧹 Nitpick comments (2)
athena/evaluation/service/json_service.py (2)
692-705: `exercise["submissions"].sort(key=lambda x: x["id"])` mutates data from `parsed_configs` in-place. Since `parsed_configs` holds references to the originally parsed JSON dicts, this sort mutates the source data. This is likely fine for the current single-use notebook workflow, but worth noting if `parsed_configs` is ever reused. More importantly: if a submission lacks an `"id"` key, this will raise a `KeyError`. A `.get("id", 0)` fallback or validation step before this point would be safer.
816-844: `load_common_evaluation_config` docstring is missing the `Returns` section. The function returns a `pd.DataFrame`, but the docstring only documents `Args`. Minor documentation gap.
@coderabbitai resolve
✅ Actions performed: Comments resolved.
@coderabbitai full review
✅ Actions performed: Full review triggered.
laadvo left a comment
Looked over the code and tested the notebooks locally. Working as expected, nice improvement :)
Motivation and Context
The current evaluation workflow treats expert feedback and LLM-based feedback evaluation separately, making direct comparison difficult. We needed a robust way to:
Description
This PR significantly enhances the evaluation pipeline across steps 4, 5, and 6:
Steps for Testing (Only locally)
Summary by CodeRabbit
New Features
Improvements
Bug Fixes / Removals
Chores