Conversation
…-programming-quality-llm-module
📝 Walkthrough
This pull request introduces a new LLM-based programming quality assessment module that generates both graded and non-graded feedback suggestions. The module integrates with existing infrastructure via configuration files, includes Docker deployment setup, implements file-level analysis using LLM prompts, and adds utilities for repository diffing and content processing.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Pylint (4.0.4):
- athena/assessment_module_manager/assessment_module_manager/module/request_to_module.py
- athena/llm_core/llm_core/core/predict_and_parse.py
- athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/__init__.py
Athena Test Results Summary
Failing Tests Summary (table omitted)
📊 Detailed Coverage Table — Combining 3 coverage files... Combined Coverage Summary: Total packages: 31. Note: Coverage thresholds: ✅ (≥70%), ❌ (<70%)
There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions. |
…uality-llm-module' into athena/feature/add-programming-quality-llm-module
Actionable comments posted: 13
🤖 Fix all issues with AI agents
In `@athena/modules/programming/module_programming_quality_llm/.env.example`:
- Around line 4-7: The .env.example currently contains a concrete SECRET value
which should be replaced with a placeholder to avoid accidental insecure copies;
update the SECRET entry in .env.example to a placeholder like
SECRET=<your-secret-here> (or SECRET=generate_a_strong_random_value) and add a
short comment or README note instructing users to generate and set a strong,
unique secret (e.g., using a secure random generator) rather than copying the
example value.
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/__init__.py`:
- Around line 1-4: The module currently calls dotenv.load_dotenv(override=True)
at import time in the package __init__.py, which can silently overwrite existing
process env vars; remove or change that behavior by either (A) removing the
import-time call from __init__.py and moving dotenv.load_dotenv() to the
application entrypoint, or (B) gate the call behind an explicit flag and stop
forcing overrides (e.g., call dotenv.load_dotenv(override=False) only when a
runtime config or env var like LOAD_DOTENV_OVERRIDE is set). Locate the call to
dotenv.load_dotenv in the module_programming_quality_llm __init__.py (and apply
the same change to the other modules mentioned) and implement one of these two
approaches so imports no longer override environment variables by default.
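As a rough illustration of option (B), a sketch like the following gates the load behind an opt-in flag; the `LOAD_DOTENV_OVERRIDE` name comes from the suggestion above and is not an existing variable in the module:

```python
# Sketch for __init__.py: only read .env when explicitly requested, and never
# overwrite variables that are already set in the process environment.
import os

import dotenv

if os.getenv("LOAD_DOTENV_OVERRIDE"):  # hypothetical opt-in flag
    dotenv.load_dotenv(override=False)
```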
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/__main__.py`:
- Around line 49-52: The second return statement (calling
generate_non_graded_suggestions_by_file) is over-indented causing Flake8 E127;
fix the indentation so that the return aligns with the preceding return (both
return statements start at the same indentation level), keeping the same
arguments (exercise, submission, module_config.non_graded_approach,
module_config.debug) and preserving the existing conditional flow around
generate_graded_suggestions_by_file and generate_non_graded_suggestions_by_file.
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/generate_graded_suggestions_by_file.py`:
- Around line 160-175: The Feedback objects are being constructed with
structured_grading_instruction_id=None while the source FeedbackModel exposes
grading_instruction_id; update the mapping so that when iterating
result.feedbacks you set
structured_grading_instruction_id=feedback.grading_instruction_id (or None if
absent). Locate the loop that builds Feedback instances (iterating
result.feedbacks) and replace the hardcoded None for
structured_grading_instruction_id with feedback.grading_instruction_id to
preserve the original grading_instruction_id value.
- Around line 62-70: The add_line_numbers utility currently uses
enumerate(lines) producing 0-based line numbers which leads to off-by-one errors
in reported line_start/line_end; update add_line_numbers (in helpers/utils.py)
to use enumerate(lines, start=1) so it emits 1-based line numbers, and verify
any consumers of add_line_numbers (e.g., generate_graded_suggestions_by_file.py
usage) expect 1-based indices or adjust their mappings accordingly.
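Two rough sketches for the comments above, using stand-in dataclasses rather than the module's real `FeedbackModel` and Athena `Feedback` types; only the fields mentioned in the review are modeled. First, the corrected id mapping:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeedbackModel:  # stand-in for the LLM-side feedback model
    title: str
    description: str
    grading_instruction_id: Optional[int] = None

@dataclass
class Feedback:  # stand-in for the Athena feedback schema
    title: str
    description: str
    structured_grading_instruction_id: Optional[int] = None

def map_feedbacks(result_feedbacks: List[FeedbackModel]) -> List[Feedback]:
    # Propagate grading_instruction_id instead of hardcoding None.
    return [
        Feedback(
            title=fb.title,
            description=fb.description,
            structured_grading_instruction_id=fb.grading_instruction_id,
        )
        for fb in result_feedbacks
    ]
```

And a possible shape for the 1-based numbering utility (the exact separator used by the real helper may differ):

```python
def add_line_numbers(content: str) -> str:
    """Prefix each line with a 1-based, right-aligned line number."""
    lines = content.splitlines()
    line_number_max_length = len(str(len(lines)))
    return "\n".join(
        f"{str(line_number).rjust(line_number_max_length)} {line}"
        for line_number, line in enumerate(lines, start=1)
    )
```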
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/generate_non_graded_suggestions_by_file.py`:
- Around line 210-228: The filtering loop can add duplicates because
filtered_prompt_inputs is built from prompt_inputs and then items are appended
again from prompt_inputs; to fix, after building filtered_prompt_inputs (which
filters by programming_language_extension) remove those same entries from
prompt_inputs (e.g., by file_path or a unique key) before the while loop so the
subsequent prompt_inputs.pop(0) cannot re-add already-included files, ensuring
prompt_inputs, filtered_prompt_inputs and the final prompt_inputs contain unique
file entries up to config.max_number_of_files.
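A minimal sketch of the deduplicated selection, assuming each prompt input is a dict carrying a `file_path` key (the real module may use a Pydantic model instead):

```python
from typing import Dict, List

def select_prompt_inputs(
    prompt_inputs: List[Dict], programming_language_extension: str, max_number_of_files: int
) -> List[Dict]:
    """Pick files matching the exercise's language first, then fill up with the
    remaining files, without adding any file twice."""
    filtered = [
        p for p in prompt_inputs if p["file_path"].endswith(programming_language_extension)
    ]
    selected_paths = {p["file_path"] for p in filtered}
    # Remove already-selected entries so the fill-up loop cannot re-add them.
    remaining = [p for p in prompt_inputs if p["file_path"] not in selected_paths]
    selected = filtered[:max_number_of_files]
    while len(selected) < max_number_of_files and remaining:
        selected.append(remaining.pop(0))
    return selected
```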
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/generate_summary_by_file.py`:
- Around line 145-150: The code uses the model-returned file_summary.file_name
when populating items_dict, which can be incorrect; instead, use the original
prompt input's file path for the map key. Update the loop that iterates over
results (variable results and local file_summary) to reference the corresponding
prompt input (the list of file inputs passed to the LLM, e.g., the same-indexed
item containing file_path) rather than file_summary.file_name, and add a safety
check if results and prompt inputs differ in length before assigning
items_dict[file_path] = file_summary.description.
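One possible shape for that loop, with a stand-in summary type; the `file_path` key on the prompt inputs is assumed from the review comment:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class FileSummary:  # stand-in for the model-returned summary object
    file_name: str
    description: str

def build_summary_map(
    prompt_inputs: List[Dict], results: List[Optional[FileSummary]]
) -> Dict[str, str]:
    # Guard against the LLM returning a different number of results than inputs.
    if len(prompt_inputs) != len(results):
        raise ValueError("prompt inputs and results are out of sync")
    items_dict: Dict[str, str] = {}
    for prompt_input, file_summary in zip(prompt_inputs, results):
        if file_summary is None:
            continue  # skip failed generations
        # Key by the original input's path, not the model-returned file name.
        items_dict[prompt_input["file_path"]] = file_summary.description
    return items_dict
```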
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/helpers/utils.py`:
- Around line 62-69: The add_line_numbers function produces 0-based line numbers
because it calls enumerate(lines) with the default start; change it to
enumerate(lines, start=1) so line numbering begins at 1, keep the existing
line_number_max_length calculation and the f-string formatting (still using
str(line_number).rjust(line_number_max_length) and the same join) to preserve
alignment and output formatting.
- Around line 88-104: temporary_remote currently reuses an existing remote
blindly; update it to verify the existing remote's URL before yielding: call
repo.remote(remote_name) and inspect its URLs (remote.urls or remote.url) and
compare them to remote_url; if they match, yield that remote as it does now, but if they
differ, create a uniquely named temporary remote (e.g. remote_name + "_tmp" or
timestamped) via repo.create_remote(remote_name, remote_url), call
remote.fetch(), yield that temp remote, and ensure you delete only the temp
remote with repo.delete_remote(temp_remote) after use; reference
temporary_remote, repo.remote, repo.create_remote, remote.fetch, and
repo.delete_remote when making the change.
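A GitPython-based sketch of the URL check described above; the `_tmp` suffix is one possible naming choice, not a requirement:

```python
from contextlib import contextmanager

from git import Repo

@contextmanager
def temporary_remote(remote_name: str, repo: Repo, remote_url: str):
    """Yield a remote pointing at remote_url; reuse an existing remote only if
    its URL matches, otherwise create and later delete a temporary remote."""
    existing = next((r for r in repo.remotes if r.name == remote_name), None)
    if existing is not None and remote_url in set(existing.urls):
        yield existing
        return
    temp_remote = repo.create_remote(f"{remote_name}_tmp", remote_url)
    try:
        temp_remote.fetch()
        yield temp_remote
    finally:
        repo.delete_remote(temp_remote)
```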
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/prompts/generate_graded_suggestions_by_file.py`:
- Around line 23-28: Update the "Output" instructions in the prompt to require a
strict JSON schema instead of freeform text: replace the ambiguous "Return an
array of feedback items... or a positive note" guidance with a precise JSON
object/array schema (e.g., an array of objects with keys title, description,
optional line_start, line_end) and state "Output MUST be valid JSON" and that
empty results should be an empty JSON array; modify the prompt text in
generate_graded_suggestions_by_file.py's Output section so the parser receives
only the structured JSON schema and no conflicting freeform alternatives.
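One way the reworked Output section could read; the field names mirror the suggestion above and are not taken from the module's current prompt:

```python
# Illustrative replacement for the prompt's "Output" section.
output_section = """\
# Output
Output MUST be valid JSON: a JSON array of feedback objects, each with the keys
"title" (string), "description" (string) and, where applicable, "line_start" and
"line_end" (integers). If there is nothing to report, return an empty JSON array: []
"""
```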
In
`@athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/split_problem_statement_by_file.py`:
- Line 65: The local variable "model" is assigned from config.model.get_model()
but never used; remove the unused assignment to eliminate the dead variable by
either deleting the line or, if the call is needed for side effects, invoke
config.model.get_model() without assigning its result (or assign to "_" to
signal intentional unused). Update the occurrence where "model" is created in
the split_problem_statement_by_file module to reflect this change.
- Around line 44-52: The docstring for split_problem_statement_by_file
references GradedBasicApproachConfig but the function signature/parameter is
BasicApproachConfig; make them consistent by updating the docstring to use
BasicApproachConfig (or if the intended type is GradedBasicApproachConfig,
change the parameter annotation and any imports accordingly). Locate the
function split_problem_statement_by_file (or split_problem_statement) and update
the docstring parameter type for config to BasicApproachConfig (or adjust the
signature to GradedBasicApproachConfig) so the documented type matches the
actual parameter and imports remain correct.
In `@athena/modules/programming/module_programming_quality_llm/README.md`:
- Around line 7-17: The fenced code blocks in README.md lack language
identifiers (triggering MD040); update the two code fences around the shell
commands to include a language tag (e.g., change the opening backticks for the
`cp .env.example .env` block and the `poetry install` block to use ```bash) so
both fenced blocks are annotated with "bash" for proper syntax highlighting and
linting.
🧹 Nitpick comments (10)
athena/llm_core/llm_core/core/predict_and_parse.py (1)
110-120: Remove unnecessary Pydantic v1 compatibility layer; focus on silent failure handling in LM Studio fallback.
`model_validate` is correctly used for Pydantic v2. The project explicitly requires Pydantic 2.11.7 in `iris/pyproject.toml`, so the suggested `getattr`/`parse_obj` fallback for v1 compatibility is unnecessary. However, the silent `None` return on parse/validation errors (lines 119–120) does mask malformed output. Consider logging errors or raising exceptions instead of silently returning `None`, as this fallback path is critical for LM Studio handling and debugging parse failures is important.
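A sketch of what a noisier fallback could look like; the function shape and the `pydantic_object` parameter are illustrative and do not mirror the actual `predict_and_parse` implementation:

```python
import logging
from typing import Any, Optional, Type

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

def parse_lm_studio_output(raw: Any, pydantic_object: Type[BaseModel]) -> Optional[BaseModel]:
    """Validate the raw LM Studio output, logging failures instead of silently
    returning None."""
    try:
        return pydantic_object.model_validate(raw)
    except ValidationError:
        logger.exception("LLM output failed validation against %s", pydantic_object.__name__)
        return None
```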
athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/prompts/generate_non_graded_suggestions_by_file.py (1)

8-12: Clarify the "no solution suggestions" constraint to avoid conflicting guidance.

Line 11 bans "solution suggestions," but the task is to provide improvement guidance. Consider disallowing full corrected code/complete solutions instead, while still allowing actionable advice.

✏️ Proposed wording tweak

```diff
-Create non graded improvement suggestions for a student\'s programming submission that a human tutor would recommend. \
+Create non-graded improvement suggestions for a student\'s programming submission that a human tutor would recommend. \
 ...
-Important: the answer you generate must not contain any solution suggestions or contain corrected errors.
+Important: do not provide full corrected code or complete solutions; focus on principles and guidance the student can apply.
```

athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/generate_graded_suggestions_by_file.py (1)
151-156: Consider adding `strict=True` to `zip()` calls for safety.

While `prompt_inputs` and `results` originate from the same iteration and should always have equal length, adding `strict=True` provides a defensive check that would surface bugs immediately if assumptions change.

Proposed fix

```diff
-        for prompt_input, result in zip(prompt_inputs, results)
+        for prompt_input, result in zip(prompt_inputs, results, strict=True)
```

```diff
-    for prompt_input, result in zip(prompt_inputs, results):
+    for prompt_input, result in zip(prompt_inputs, results, strict=True):
```

athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/split_grading_instructions_by_file.py (2)
31-49: Docstring is missing the `debug` parameter.

The function signature includes `debug: bool` but the docstring Args section doesn't document it.

Proposed fix

```diff
     Args:
         exercise (Exercise): Exercise to split the grading instructions for (respecting the changed files)
         submission (Submission): Submission to split the grading instructions for (respecting the changed files)
         prompt (ChatPromptTemplate): Prompt template to check for grading_instructions
         config (GradedBasicApproachConfig): Configuration
+        debug (bool): Whether to emit debug metadata

     Returns:
         Optional[SplitGradingInstructions]: Split grading instructions, None if it is too short or too long
```
127-136: Mutating `items` on a Pydantic model with `Sequence` type may cause validation issues.

Directly assigning a list to `split_grading_instructions.items` (which is typed as `Sequence[FileGradingInstruction]`) works but bypasses Pydantic validation. Consider creating a new `SplitGradingInstructions` instance instead.

Proposed fix

```diff
-    split_grading_instructions.items = [
+    deduplicated_items = [
         FileGradingInstruction(
             file_name=file_name,
             grading_instructions="\n".join(
                 file_grading_instruction.grading_instructions
                 for file_grading_instruction in file_grading_instructions
             ),
         )
         for file_name, file_grading_instructions in file_grading_instructions_by_file_name.items()
     ]
-    return split_grading_instructions
+    return SplitGradingInstructions(items=deduplicated_items)
```

athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/split_problem_statement_by_file.py (1)
137-146: Same concern: mutating Pydantic model's `Sequence` field directly.

Similar to `split_grading_instructions_by_file.py`, consider returning a new model instance.

Proposed fix

```diff
-    split_problem_statement.items = [
+    deduplicated_items = [
         FileProblemStatement(
             file_name=file_name,
             problem_statement="\n".join(
                 file_problem_statement.problem_statement
                 for file_problem_statement in file_problem_statements
             ),
         )
         for file_name, file_problem_statements in file_problem_statements_by_file_name.items()
     ]
-    return split_problem_statement
+    return SplitProblemStatement(items=deduplicated_items)
```

athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/generate_non_graded_suggestions_by_file.py (3)
91-102: Translate German comments to English for consistency, and verify the token limit logic.

The comments are in German. Additionally, the logic seems counterintuitive: smaller submissions (< 40000 chars) get a lower token limit (5000) while larger submissions get the full config limit. Typically, you'd want more tokens per file when there are fewer files.

Proposed translation and consideration

```diff
-    # Dynamische Token-Berechnung
+    # Dynamic token calculation
     total_content_size = sum(len(content) for content in changed_files.values())
-    if total_content_size < 40000:  # Kleine Submission
+    if total_content_size < 40000:  # Small submission
         effective_max_tokens = 5000
-    elif total_content_size < 100000:  # Mittlere Submission
+    elif total_content_size < 100000:  # Medium submission
         effective_max_tokens = 3000
-    else:  # Große Submission
+    else:  # Large submission
         effective_max_tokens = config.max_input_tokens

-    # Config-Objekt temporär überschreiben (Kopie erstellen um Original nicht zu verändern)
+    # Temporarily override config object (create copy to avoid modifying original)
     config = config.model_copy(update={"max_input_tokens": effective_max_tokens})
```

Please verify this token scaling is intentional—the current logic gives large submissions more tokens per file than small ones.
288-352: Priority filtering relies on fragile string prefix matching.

The `_filter_feedbacks_by_priority` function categorizes feedbacks by checking if `description.startswith("CRITICAL")`, etc. This creates tight coupling with the LLM output format. If the LLM doesn't prefix descriptions exactly as expected, feedbacks fall into `other` and aren't prioritized correctly.

Consider documenting this contract clearly or using a more robust categorization mechanism (e.g., a dedicated `priority` or `severity` field in the model).
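A sketch of the more structural alternative, with an explicit severity field on an illustrative model (the class names are not the module's):

```python
from enum import Enum
from typing import List

from pydantic import BaseModel

class Severity(str, Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    MINOR = "minor"

class SuggestionModel(BaseModel):  # illustrative, not the module's actual model
    title: str
    description: str
    severity: Severity = Severity.MINOR

def order_by_priority(feedbacks: List[SuggestionModel]) -> List[SuggestionModel]:
    """Sort on the dedicated severity field instead of matching description prefixes."""
    rank = {Severity.CRITICAL: 0, Severity.IMPORTANT: 1, Severity.MINOR: 2}
    return sorted(feedbacks, key=lambda fb: rank[fb.severity])
```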
258-263: Consider adding `strict=True` to `zip()` calls.

Same as in the graded suggestions file—adding `strict=True` provides defensive checking.

athena/modules/programming/module_programming_quality_llm/module_programming_quality_llm/helpers/utils.py (1)
107-141: Hardcoded `branch="main"` will fail when repositories use different default branches.

The parameter is used consistently across all callers without override, and external repositories (template, submission, solution) may use `master` or custom branch names. Consider detecting the active branch dynamically or making it configurable at the call site.
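A small GitPython sketch of the dynamic alternative; the fallback default is an assumption, not something prescribed by the module:

```python
from git import Repo

def resolve_branch(repo: Repo, fallback: str = "main") -> str:
    """Use the repository's active branch when available instead of a hardcoded
    'main'; fall back only when HEAD is detached."""
    try:
        return repo.active_branch.name
    except TypeError:  # detached HEAD has no active branch
        return fallback
```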
Motivation and Context
Introduction of a new `module_programming_quality_llm` module that extends the programming feedback capabilities. While the existing `module_programming_llm` focuses on verifying code correctness, this new module analyzes code quality based on established criteria including readability, complexity, architectural design, and maintainability.

Description
New module: module_programming_quality_llm
This is a new module designed to provide LLM-based code quality analysis for student programming submissions. The module evaluates submissions against standardized quality criteria:
Evaluation Categories:
Module Structure:
- `generate_graded_suggestions_by_file.py` - Generate quality feedback with grades
- `generate_non_graded_suggestions_by_file.py` - Generate quality feedback without grades
- `generate_summary_by_file.py` - Summary of quality analysis per file

Steps for Testing
Testserver States
Note
These badges show the state of the test servers.
Green = Currently available, Red = Currently locked
Click on the badges to get to the test servers.
Screenshots
Summary by CodeRabbit
New Features
Performance