Skip to content

Conversation

@Jerryguan777
Copy link

@Jerryguan777 Jerryguan777 commented Jan 15, 2026

Implemented a iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. Achieved 70% success rate (7/10) in initial evaluation.

  • Add IterativeAgent
  • Add config_iterative.yml
  • Add git tools
  • Add SweBenchPredictorIterativeConfig
  • Register iterative predictor and git tool
  • Update README.md

How Has This Been Tested?

export ANTHROPIC_API_KEY=sk-xxxxxx

nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml
=== EVALUATION SUMMARY ===
Workflow Status: COMPLETED (workflow_output.json)
Total Runtime: 96.21s

Per evaluator results:
| Evaluator   |   Avg Score | Output File           |
|-------------|-------------|-----------------------|
| swe_bench   |           1 | swe_bench_output.json |

Description

Closes #1397

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Iterative agent predictor for SWE Bench: step-by-step LLM-driven edits that produce patches.
    • Git repository tool to prepare and clean workspace repos used by predictors.
    • Configurable iterative-run settings (step limits, command timeouts, max output handling).
  • Documentation

    • README updated to document the new "iterative" predictor and example configuration for running iterative evaluations.

✏️ Tip: You can customize this high-level summary in your review settings.

- Add IterativeAgent
- Add config_iterative.yml
- Add git tools
- Add SweBenchPredictorIterativeConfig
- Register iterative predictor and git tool
- Update README.md

Signed-off-by: Jerry Guan <[email protected]>
@Jerryguan777 Jerryguan777 requested a review from a team as a code owner January 15, 2026 02:26
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link

coderabbitai bot commented Jan 15, 2026

Walkthrough

Adds an iterative agent-based SWE-bench predictor that runs shell commands step-by-step against a checked-out repo, observes outputs, iteratively refines fixes via an LLM until completion or limits, and provides a git-diff patch; includes repo management tools and config/registration updates.

Changes

Cohort / File(s) Summary
Documentation
examples/evaluation_and_profiling/swe_bench/README.md
Added an "iterative" predictor entry describing the agent's step-by-step bash-based iterative workflow.
Config schema
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
Added SweBenchPredictorIterativeConfig (name="iterative") with llm_name, step_limit, timeout; extended SweBenchPredictorConfig discriminated union; imported Field, FunctionRef, LLMRef; added docstrings.
Config example
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
New YAML example defining LLMs, an iterative predictor section (llm, step_limit, timeout), git_repo_tool settings, dataset/evaluator and concurrency/output settings.
Iterative predictor implementation
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
New IterativeAgent and SweBenchPredictor classes; exception hierarchy; prompt/message management; LLM querying; async command execution with timeout/truncation; iterative reason-action-observe loop producing git-diff patch; predictor registration.
Git workspace tool
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
Added RepoContext dataclass and RepoManager with async setup_repository (clone/checkout), cleanup, and helper clone/checkout functions using GitPython.
Tool registration
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
Added GitRepoToolConfig and registered git_repo_tool async function that yields a FunctionInfo exposing setup and cleanup operations backed by RepoManager.
Predictor registry wiring
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
Imported the iterative predictor (SweBenchPredictor as IterativePredictor) to include it in the predictor registry.
Tool registry exposure
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
Imported git_repo_tool so the git repo tool is available in the tools registry.

Sequence Diagram(s)

sequenceDiagram
    participant Client as SWE-Bench Client
    participant Predictor as SweBenchPredictor
    participant Repo as RepoManager
    participant Agent as IterativeAgent
    participant LLM as LLM Backend
    participant Executor as Command Executor

    Client->>Predictor: predict_fn(swebench_input)
    Predictor->>Repo: setup_repository(repo_url, commit)
    Repo-->>Predictor: RepoContext
    Predictor->>Agent: instantiate with config & builder
    Predictor->>Agent: run(task_description, repo_path)

    loop until COMPLETE or limits
        Agent->>LLM: _query_llm(prompt/messages)
        LLM-->>Agent: response (one bash code block)
        Agent->>Executor: _execute_action(bash_command)
        Executor->>Repo: run command in repo workspace
        Repo-->>Executor: stdout/stderr/return_code
        Executor-->>Agent: observation (truncated if needed)
        Agent->>Agent: add_message(assistant,response)
        Agent->>Agent: add_message(user,observation)
        Agent->>Agent: check for COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
    end

    alt Completed
        Agent-->>Predictor: (patch, status)
    else Error/Timeout/Limits
        Agent-->>Predictor: (error_message, status)
    end

    Predictor->>Repo: cleanup()
    Predictor-->>Client: final patch or error
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 73.68% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly describes the main change: implementing an iterative predictor for SWE-bench, following imperative mood and staying within the 72-character limit.
Linked Issues check ✅ Passed All objectives from issue #1397 are met: iterative agent implementation with step-by-step execution, test-driven validation, dynamic feedback loops, and integration with SweBenchPredictorBase framework.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the iterative predictor: config, agent, tools, registration, and documentation updates align with the stated PR objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 1-6: Add the standard SPDX Apache-2.0 license header as the very
first lines of the YAML (before the "llms" key); update the top of the file
containing the "llms" / "claude_sonnet_llm" entries to begin with the SPDX
Apache-2.0 header so the file complies with the repo policy.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 76-79: The async function clone_repository uses the synchronous
blocking call Repo.clone_from which will block the event loop; change the
implementation to run Repo.clone_from in a background thread (e.g., via
asyncio.to_thread) and await that result so the function remains async and
non-blocking. Locate the clone_repository function and replace the direct call
to Repo.clone_from(repo_url, target_path) with an awaited asyncio.to_thread call
(or equivalent executor) that invokes Repo.clone_from, and keep the logger.info
call as-is.
- Around line 82-85: The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 41-53: The git_operations function lacks input validation and
error handling: catch JSONDecodeError around json.loads(args_str) and return or
raise a clear error message, validate presence of 'operation' and for operation
== "setup" ensure required keys 'repo_url' and 'base_commit' exist before
calling repo_manager.setup_repository (raise ValueError or return a descriptive
error if missing), and wrap the repo_manager.setup_repository and
repo_manager.cleanup calls to catch and log exceptions so callers receive
actionable error messages referencing git_operations and
repo_manager.setup_repository/cleanup.
🧹 Nitpick comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

24-25: Unused import: FunctionRef

FunctionRef is imported but not used in this file. Only LLMRef is used for the llm_name field.

🧹 Remove unused import
 from nat.data_models.common import TypedBaseModel
-from nat.data_models.component_ref import FunctionRef
 from nat.data_models.component_ref import LLMRef
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)

37-67: Add type hints per coding guidelines.

The class is missing type hints on __init__, active_repos, and cleanup(). Per coding guidelines, all public APIs require type hints.

📝 Add type hints
 class RepoManager:
+    active_repos: dict[str, RepoContext]
 
-    def __init__(self, workspace_dir: str):
+    def __init__(self, workspace_dir: str) -> None:
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
 
     # ... setup_repository unchanged ...
 
-    async def cleanup(self):
+    async def cleanup(self) -> None:
         """Clean up all managed repositories."""

25-34: Misleading docstring: not a context manager.

The docstring states "Context manager for repository operations" but RepoContext is a plain dataclass without __enter__/__exit__ methods. Consider updating the docstring to reflect its actual purpose as a data container.

📝 Fix docstring
 `@dataclass`
 class RepoContext:
-    """Context manager for repository operations."""
+    """Data container holding repository state and paths."""
     repo_url: str
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

25-29: Redundant _type field.

The _type field is redundant since TypedBaseModel (parent of FunctionBaseConfig) already manages the type discriminator via name="git_repo_tool". This creates potential confusion with two type fields.

🧹 Remove redundant field
 class GitRepoToolConfig(FunctionBaseConfig, name="git_repo_tool"):
     """Configuration for git repository management tool."""
-    _type: typing.Literal["git_repo_tool"] = "git_repo_tool"
     workspace_dir: str = "./.workspace"  # Base directory for cloning repositories
     cleanup_on_exit: bool = True  # Whether to clean up repos after use

32-60: Unused builder parameter is acceptable for interface consistency.

The builder parameter is unused (as flagged by static analysis) but is likely required by the register_function decorator's expected signature. The cleanup pattern using try/finally is well implemented.

Consider adding a return type hint for the async generator:

📝 Add return type hint
+from collections.abc import AsyncGenerator
+
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder) -> AsyncGenerator[FunctionInfo, None]:
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (6)

76-81: Consider adding docstrings for the configuration fields.

The dataclass lacks documentation for its fields. While the class docstring exists, individual field descriptions would improve clarity.

📝 Suggested improvement
 `@dataclass`
 class IterativeAgentConfig:
     """Configuration for the iterative agent."""
-    step_limit: int = 250
-    timeout: int = 60
-    max_output_length: int = 10000
+    step_limit: int = 250  # Maximum number of agent steps before termination
+    timeout: int = 60  # Command execution timeout in seconds
+    max_output_length: int = 10000  # Maximum characters before output truncation

105-110: Add type hint for llm parameter.

The llm parameter lacks a type annotation. Per coding guidelines, all public APIs require type hints on parameters.

📝 Suggested fix
-    def __init__(self, llm, repo_path: Path, config: IterativeAgentConfig):
+    def __init__(self, llm: typing.Any, repo_path: Path, config: IterativeAgentConfig):
         self.llm = llm
         self.repo_path = repo_path
         self.config = config
-        self.messages: list = []
+        self.messages: list[SystemMessage | HumanMessage | AIMessage] = []
         self.n_steps = 0

Note: Add import typing at the top if not already present. Ideally, use the actual LLM interface type if available from the framework.


360-363: Chain exception and use explicit conversion.

Per coding guidelines, use raise ... from err to preserve the exception chain and use explicit conversion flag instead of str(e).

🔧 Proposed fix
         except Exception as e:
             logger.error("LLM invocation failed: %s", e, exc_info=True)
-            # recoverable error, let the agent continue
-            raise NonTerminatingException(f"LLM call failed: {str(e)}")
+            # recoverable error, let the agent continue
+            raise NonTerminatingException(f"LLM call failed: {e!s}") from e

414-427: Chain exceptions and narrow the exception type.

Multiple issues flagged by static analysis:

  1. Missing exception chaining at lines 425 and 427
  2. Catching broad Exception at line 426 masks specific errors
🔧 Proposed fix
         except (TimeoutError, subprocess.TimeoutExpired) as e:
             # Extract output from exception if available (only subprocess.TimeoutExpired has output attribute)
             if isinstance(e, subprocess.TimeoutExpired) and hasattr(e, "output") and e.output:
                 output = e.output.decode("utf-8", errors="replace")
             else:
                 output = ""
             # Format timeout message using template
             timeout_message = self._TIMEOUT_TEMPLATE.format(
                 action=command,
                 output=output
             )
-            raise ExecutionTimeoutError(timeout_message)
-        except Exception as e:
-            raise NonTerminatingException(f"Error executing command: {str(e)}")
+            raise ExecutionTimeoutError(timeout_message) from e
+        except OSError as e:
+            raise NonTerminatingException(f"Error executing command: {e!s}") from e

Using OSError (or subprocess.SubprocessError) is more appropriate than catching all exceptions, as it covers typical subprocess failures without masking unexpected errors.


462-464: Remove redundant exception object from logger.exception.

When using logger.exception(), the exception info is automatically included. Including e as an argument is redundant (TRY401).

🔧 Proposed fix
         except Exception as e:
-            logger.exception("Failed to setup repository: %s", e)
-            return f"Error: Failed to setup repository - {str(e)}"
+            logger.exception("Failed to setup repository")
+            return f"Error: Failed to setup repository - {e!s}"

493-495: Remove redundant exception object and use explicit conversion.

Same pattern as above - logger.exception() automatically includes exception info.

🔧 Proposed fix
         except Exception as e:
-            logger.exception(f"Error processing {swebench_input.instance_id}: {e}")
-            return f"Error: {str(e)}"
+            logger.exception("Error processing %s", swebench_input.instance_id)
+            return f"Error: {e!s}"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e15df42 and f5771ff.

📒 Files selected for processing (10)
  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/__init__.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{md,mdx}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx}: Use 'NVIDIA NeMo Agent toolkit' for full name (first use), 'NeMo Agent toolkit' or 'the toolkit' for subsequent references, and 'Toolkit' (capital T) in titles/headings, 'toolkit' (lowercase t) in body text
Never use deprecated names: 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' in documentation; update any occurrences unless intentionally referring to deprecated versions or implementing compatibility layers

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
**/*.{md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{md,mdx,rst}: Documentation must be clear, comprehensive, and free of TODOs, FIXMEs, placeholder text, offensive or outdated terms, and spelling mistakes
Do not use words listed in 'ci/vale/styles/config/vocabularies/nat/reject.txt' in documentation
Words listed in 'ci/vale/styles/config/vocabularies/nat/accept.txt' are acceptable even if they appear to be spelling mistakes

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/README.md
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧠 Learnings (1)
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (2)
src/nat/data_models/common.py (3)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (2)
  • FunctionRef (94-102)
  • LLMRef (116-124)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (3)
src/nat/builder/function_info.py (2)
  • FunctionInfo (290-625)
  • from_fn (552-625)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (37-67)
  • setup_repository (44-58)
  • cleanup (60-67)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (1)
  • SweBenchPredictor (431-502)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (3)
src/nat/builder/builder.py (1)
  • Builder (84-811)
src/nat/builder/framework_enum.py (1)
  • LLMFrameworkEnum (19-27)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)
  • SweBenchWorkflowConfig (51-52)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


53-53: Avoid specifying long messages outside the exception class

(TRY003)

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py

127-127: Avoid specifying long messages outside the exception class

(TRY003)


327-327: Avoid specifying long messages outside the exception class

(TRY003)


359-359: Consider moving this statement to an else block

(TRY300)


363-363: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


363-363: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Use explicit conversion flag

Replace with conversion flag

(RUF010)


379-379: subprocess call with shell=True identified, security issue

(S602)


412-412: Consider moving this statement to an else block

(TRY300)


425-425: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


426-426: Do not catch blind exception: Exception

(BLE001)


427-427: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


427-427: Avoid specifying long messages outside the exception class

(TRY003)


427-427: Use explicit conversion flag

Replace with conversion flag

(RUF010)


463-463: Redundant exception object included in logging.exception call

(TRY401)


464-464: Use explicit conversion flag

Replace with conversion flag

(RUF010)


494-494: Redundant exception object included in logging.exception call

(TRY401)


495-495: Use explicit conversion flag

Replace with conversion flag

(RUF010)

🔇 Additional comments (11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

40-49: LGTM!

The SweBenchPredictorIterativeConfig follows the established pattern, with appropriate type hints and field descriptions. The discriminated union is correctly extended to include the new iterative variant.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

70-73: LGTM!

Simple helper with proper type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/register_tools.py (1)

19-19: LGTM!

The import correctly triggers registration of the git_repo_tool via its decorator, following the established pattern in this file.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/register.py (1)

20-20: The import follows the established pattern for predictor registration.

The addition of IterativePredictor on line 20 mirrors the existing GoldPredictor import on line 19, with proper flake8: noqa directives to allow unused imports (which are intentionally present to trigger registration side-effects). The file maintains proper Apache 2.0 licensing and copyright headers.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py (4)

1-14: LGTM!

License header is correctly formatted with SPDX identifier and Apache 2.0 license text.


52-74: LGTM!

Well-structured exception hierarchy with clear separation between recoverable (NonTerminatingException) and terminal (TerminatingException) conditions. This enables clean control flow in the agent loop.


376-389: Acknowledged: shell=True is intentional for this agent design.

The static analysis flags shell=True as a security concern (S602). However, this is expected behavior since the agent is explicitly designed to execute arbitrary bash commands provided by the LLM. The commands are sandboxed within the repository workspace directory.


497-502: LGTM!

The _build_task_description method is clean and well-typed with appropriate type hints.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (2)

23-35: LGTM!

Workflow and functions configuration is well-structured with appropriate iterative predictor settings and git tool configuration. The 2-space indentation follows repository conventions. Based on learnings, this matches the existing YAML style in the repository.


37-70: LGTM!

Evaluation configuration is properly structured with dataset filtering for focused testing. The single-instance allowlist is appropriate for initial development/testing.

examples/evaluation_and_profiling/swe_bench/README.md (1)

159-159: LGTM!

Documentation entry for the iterative predictor follows the established pattern and provides clear, concise description with appropriate links to implementation and configuration files.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 82 to 85
async def checkout_commit(repo: Repo, commit_hash: str):
"""Checkout a specific commit in the repository."""
logger.info("Checking out commit %s", commit_hash)
repo.git.checkout(commit_hash)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Same blocking I/O issue; also missing return type hint.

repo.git.checkout() is synchronous. Wrap with asyncio.to_thread(). Per coding guidelines, add return type hint.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
     logger.info("Checking out commit %s", commit_hash)
-    repo.git.checkout(commit_hash)
+    await asyncio.to_thread(repo.git.checkout, commit_hash)
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 82 - 85, The checkout_commit function performs blocking I/O by
calling the synchronous repo.git.checkout; change checkout_commit to have an
explicit return type hint (-> None) and call the blocking operation inside
asyncio.to_thread (e.g., await asyncio.to_thread(repo.git.checkout,
commit_hash)) so the checkout runs off the event loop; keep the logger.info call
and docstring unchanged and reference the function name checkout_commit and the
blocking call repo.git.checkout when making the change.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py`:
- Around line 69-74: Docstring for SweBenchWorkflowConfig incorrectly lists a
"full" predictor option that doesn't exist; update the docstring to reflect the
actual predictor options defined in SweBenchPredictorConfig (e.g., "gold,
skeleton, or iterative") or enumerate the exact union variants from
SweBenchPredictorConfig so the docs and code are consistent; locate the class
SweBenchWorkflowConfig and its docstring and replace "full, gold, skeleton, or
iterative" with the correct set of predictor types from SweBenchPredictorConfig.
- Around line 24-25: Remove the unused import FunctionRef from the top of the
module: delete the "FunctionRef" import token in the import statement that
currently reads "from nat.data_models.component_ref import FunctionRef" so only
LLMRef remains imported (referenced symbol: FunctionRef).

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 71-74: get_repo_path currently builds a path from only the repo
name causing collisions; update get_repo_path to parse the repo URL and extract
the owner/organization component (e.g., the segment immediately preceding the
repo name for HTTPS and the part after ":" for SSH forms) and return
Path(workspace_dir) / owner / repo_name so repositories with the same name under
different orgs are distinct; ensure you handle URLs like
"https://host/org/repo.git" and "git@host:org/repo.git" and strip ".git" from
repo_name.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`:
- Around line 32-38: The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

16-28: Duplicate llms key causes configuration to be overwritten.

The YAML has two separate llms: keys (lines 16 and 23). In YAML, duplicate keys at the same level cause the second to overwrite the first, meaning nim_llm will be silently discarded and only claude_sonnet_llm will be available.

Additionally, nim_llm uses 1-space indentation while claude_sonnet_llm uses 2-space indentation. Per learnings, the repository uses 2-space indentation for YAML files.

🔧 Proposed fix - merge into single llms block with consistent 2-space indentation
-llms:
- nim_llm:
-   _type: nim
-   model_name: mistralai/mistral-nemotron
-   temperature: 0.6
-   max_tokens: 4096    
-
-llms:
-  claude_sonnet_llm:
-    _type: litellm
-    model_name: anthropic/claude-sonnet-4-5-20250929
-    temperature: 0.0
-    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
+llms:
+  nim_llm:
+    _type: nim
+    model_name: mistralai/mistral-nemotron
+    temperature: 0.6
+    max_tokens: 4096
+
+  claude_sonnet_llm:
+    _type: litellm
+    model_name: anthropic/claude-sonnet-4-5-20250929
+    temperature: 0.0
+    api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
🧹 Nitpick comments (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (2)

84-88: Add return type hint for checkout_commit.

Per coding guidelines, all public APIs require type hints on return values. This function returns None implicitly.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
     logger.info("Checking out commit %s", commit_hash)
     # Use asyncio.to_thread to avoid blocking the event loop during checkout
     await asyncio.to_thread(repo.git.checkout, commit_hash)

38-43: Add type hint for active_repos dictionary.

Per coding guidelines, type hints are required. The dictionary maps repo paths to RepoContext objects.

🔧 Proposed fix
 class RepoManager:

     def __init__(self, workspace_dir: str):
         self.workspace = Path(workspace_dir)
         self.workspace.mkdir(parents=True, exist_ok=True)
-        self.active_repos = {}
+        self.active_repos: dict[str, RepoContext] = {}
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

80-82: Remove trailing blank lines.

YAMLlint reports too many blank lines at the end of the file. Files should end with a single newline.

🔧 Proposed fix
   evaluators:
     swe_bench:
       _type: swe_bench
       run_id: nat_iterative_1
       clean: true
-
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5771ff and a39f138.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (38-68)
  • setup_repository (45-59)
  • cleanup (61-68)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
  • BaseModelRegistryTag (91-93)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (2)
  • FunctionRef (94-102)
  • LLMRef (116-124)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)


[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (5)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

77-81: LGTM - blocking I/O properly handled.

The use of asyncio.to_thread correctly offloads the blocking Repo.clone_from call to a thread pool, preventing event loop blocking.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Input validation improvements look good.

The JSON parsing is now wrapped in a try-except, and required keys are validated before use. This addresses the past review feedback.


61-66: Good resource cleanup pattern with finally block.

The use of try/finally ensures repo_manager.cleanup() is called even if an exception occurs during the tool's lifecycle, preventing orphaned repository directories.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (1)

51-61: LGTM - SweBenchPredictorIterativeConfig is well-structured.

The configuration properly uses LLMRef for the LLM reference, and Field with descriptions for step_limit and timeout. The defaults (250 steps, 60s timeout) are reasonable for iterative problem solving.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

1-14: LGTM - SPDX license header is present.

The required Apache-2.0 license header has been added, addressing the past review feedback.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +32 to +38
@register_function(config_type=GitRepoToolConfig)
async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
"""Git repository management tool for SWE Bench."""
import json

from .git_tool import RepoManager
repo_manager = RepoManager(tool_config.workspace_dir)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find and read the register.py file
fd "register.py" -p "examples/evaluation_and_profiling/swe_bench"

# Get the full content to understand context
find . -path "*/examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py" -type f -exec wc -l {} \;

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 450


🏁 Script executed:

# Search for all `@register_function` decorated functions to check builder parameter usage
rg "@register_function" -A 3 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 50381


🏁 Script executed:

# Find the definition of register_function decorator
rg "def register_function" -B 2 -A 10 --type=py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 6591


🏁 Script executed:

cat -n examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

Repository: NVIDIA/NeMo-Agent-Toolkit

Length of output: 3277


Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is not used in this function. Follow the established pattern in the codebase by renaming it to _builder to indicate intentional non-use.

🔧 Suggested fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
🧰 Tools
🪛 Ruff (0.14.11)

33-33: Unused function argument: builder

(ARG001)

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py`
around lines 32 - 38, The git_repo_tool function declares an unused parameter
named builder; rename it to _builder to follow the codebase convention for
intentionally unused parameters (update the function signature async def
git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder): and any
references in the decorator/register_function call if necessary) so
linters/readers know it is intentionally unused.

Signed-off-by: Jerry Guan <[email protected]>
@Jerryguan777 Jerryguan777 force-pushed the feat/iterative-predictor branch from a39f138 to dbc2dd6 Compare January 15, 2026 06:29
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`:
- Around line 16-28: The YAML defines two separate top-level llms mappings which
causes the first (nim_llm) to be overwritten by the second (claude_sonnet_llm);
merge both entries under a single llms key so both nim_llm and claude_sonnet_llm
are present, and fix nim_llm’s indentation to match the file’s 2-space style;
locate the nim_llm and claude_sonnet_llm blocks and combine them into one llms
mapping preserving their model_name, temperature, max_tokens and api_key fields.

In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`:
- Around line 33-35: RepoContext.__post_init__ sets repo_path = base_path /
repo_name which omits the organization and disagrees with get_repo_path; update
__post_init__ so repo_path includes the org (e.g., repo_path = base_path /
self.org_name / self.repo_name) or call the existing get_repo_path logic to
compute it, ensuring RepoContext.repo_path matches the path used by
setup_repository and clone operations.
♻️ Duplicate comments (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

88-92: Add missing return type hint.

Per coding guidelines, all public functions require type hints. This async function returns None.

🔧 Proposed fix
-async def checkout_commit(repo: Repo, commit_hash: str):
+async def checkout_commit(repo: Repo, commit_hash: str) -> None:
     """Checkout a specific commit in the repository."""
🧹 Nitpick comments (4)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml (1)

81-82: Remove extra trailing blank line.

YAMLlint reports too many blank lines at the end. Files should end with exactly one newline.

🔧 Proposed fix
       clean: true
-
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

1-1: Copyright year should be updated to 2025-2026.

Other files in this PR use 2025-2026 in the copyright header. This file uses only 2025.

🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

32-34: Prefix unused builder parameter with underscore.

The builder parameter is required by the @register_function interface but is unused. Follow the codebase convention by renaming to _builder.

🔧 Proposed fix
 `@register_function`(config_type=GitRepoToolConfig)
-async def git_repo_tool(tool_config: GitRepoToolConfig, builder: Builder):
+async def git_repo_tool(tool_config: GitRepoToolConfig, _builder: Builder):
     """Git repository management tool for SWE Bench."""
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

1-1: Copyright year should be updated to 2025-2026.

For consistency with other files in this PR.

🔧 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a39f138 and dbc2dd6.

📒 Files selected for processing (4)
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*: - This directory contains example code and usage scenarios for the toolkit, at a minimum an example should
contain a README.md or file README.ipynb.

  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/. - If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py
🧠 Learnings (6)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.py : Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Every file must start with the standard SPDX Apache-2.0 header

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : All source files must include the SPDX Apache-2.0 header template

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-03T18:42:23.494Z
Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1147
File: packages/nvidia_nat_a2a/pyproject.toml:1-10
Timestamp: 2025-12-03T18:42:23.494Z
Learning: In the packages/ directory, pyproject.toml files typically do not include SPDX license headers. Out of 34 packages, only nvidia_nat_strands is an exception. This pattern differs from the requirement for SPDX headers in source code files (.py, .js, .ts, etc.).

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-11-05T11:45:35.119Z
Learnt from: thepatrickchin
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1152
File: examples/config_inheritance/pyproject.toml:1-25
Timestamp: 2025-11-05T11:45:35.119Z
Learning: In the examples/ directory, pyproject.toml files typically do not include SPDX license headers, with only one exception (adk_demo). This is an established pattern that differs from the general guideline requiring SPDX headers in all .toml files.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
📚 Learning: 2025-12-12T20:49:44.305Z
Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:44.305Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, YAML files generally use 2-space indentation. When reviewing YAML, prefer 2-space indentation to match the existing style over a 4-space guideline until a repo-wide standardization is performed. This applies to YAML configuration files (e.g., red-teaming.yml) and, more broadly, all *.yml files in the project.

Applied to files:

  • examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
🧬 Code graph analysis (3)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)
src/nat/data_models/common.py (4)
  • BaseModelRegistryTag (91-93)
  • TypedBaseModel (96-171)
  • static_type (157-158)
  • discriminator (165-171)
src/nat/data_models/component_ref.py (1)
  • LLMRef (116-124)
src/nat/data_models/function.py (1)
  • FunctionBaseConfig (26-36)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)
src/nat/runtime/runner.py (1)
  • context (93-94)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (3)
  • RepoManager (38-68)
  • setup_repository (45-59)
  • cleanup (61-68)
🪛 Ruff (0.14.11)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py

33-33: Unused function argument: builder

(ARG001)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

🪛 YAMLlint (1.37.1)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)


[warning] 82-82: too many blank lines (2 > 0)

(empty-lines)

🔇 Additional comments (6)
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/config.py (3)

50-60: LGTM! Well-structured iterative predictor configuration.

The new SweBenchPredictorIterativeConfig is properly defined with appropriate type hints and sensible defaults. The LLMRef type correctly references the component reference pattern used elsewhere in the codebase.

Minor nit: Line 57 has trailing whitespace after the closing triple quotes.


62-66: Discriminated union correctly extended.

The SweBenchPredictorConfig union properly includes the new iterative config using the established Tag/Discriminator pattern.


68-74: Docstring now correctly lists available predictor types.

The docstring accurately reflects the available options: "gold, skeleton, or iterative".

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/register.py (2)

41-59: Good input validation and error handling.

The function properly validates JSON input, checks for required keys on the setup operation, and provides clear error messages. This addresses the concerns from prior reviews.


61-66: Correct async generator pattern with guaranteed cleanup.

The try/finally ensures cleanup() runs when the tool is disposed, respecting the cleanup_on_exit config flag.

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py (1)

81-85: LGTM! Blocking I/O correctly wrapped with asyncio.to_thread.

The clone operation now properly yields to the event loop, addressing the prior review concern.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +16 to +28
llms:
nim_llm:
_type: nim
model_name: mistralai/mistral-nemotron
temperature: 0.6
max_tokens: 4096

llms:
claude_sonnet_llm:
_type: litellm
model_name: anthropic/claude-sonnet-4-5-20250929
temperature: 0.0
api_key: "${ANTHROPIC_API_KEY}" # Set this environment variable before running
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Duplicate llms key causes first LLM configuration to be silently overwritten.

The YAML has two separate llms: keys (lines 16 and 23). YAML spec allows duplicate keys but only the last one is retained, so nim_llm will be silently discarded and only claude_sonnet_llm will be available at runtime.

Additionally, the nim_llm block uses 1-space indentation (lines 17-21) instead of the 2-space standard used elsewhere.

🔧 Proposed fix - merge into single llms block
 llms:
-nim_llm:
-  _type: nim
-  model_name: mistralai/mistral-nemotron
-  temperature: 0.6
-  max_tokens: 4096    
-
-llms:
+  nim_llm:
+    _type: nim
+    model_name: mistralai/mistral-nemotron
+    temperature: 0.6
+    max_tokens: 4096
   claude_sonnet_llm:
     _type: litellm
     model_name: anthropic/claude-sonnet-4-5-20250929
     temperature: 0.0
     api_key: "${ANTHROPIC_API_KEY}"  # Set this environment variable before running
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 23-23: duplication of key "llms" in mapping

(key-duplicates)

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`
around lines 16 - 28, The YAML defines two separate top-level llms mappings
which causes the first (nim_llm) to be overwritten by the second
(claude_sonnet_llm); merge both entries under a single llms key so both nim_llm
and claude_sonnet_llm are present, and fix nim_llm’s indentation to match the
file’s 2-space style; locate the nim_llm and claude_sonnet_llm blocks and
combine them into one llms mapping preserving their model_name, temperature,
max_tokens and api_key fields.

Comment on lines +33 to +35
def __post_init__(self):
self.repo_name = self.repo_url.split('/')[-1].replace('.git', '')
self.repo_path = self.base_path / self.repo_name
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

RepoContext.repo_path calculation is inconsistent with get_repo_path.

RepoContext.__post_init__ computes repo_path = base_path / repo_name (without organization), but get_repo_path returns workspace_dir / org_name / repo_name (with organization).

In setup_repository (line 57), a RepoContext is created with base_path=self.workspace, so context.repo_path will be workspace/repo_name. However, the actual cloned repository is at workspace/org/repo_name (from get_repo_path at line 47). This means context.repo_path points to the wrong location.

🔧 Proposed fix - align RepoContext with get_repo_path
 `@dataclass`
 class RepoContext:
     """Context manager for repository operations."""
     repo_url: str
     base_path: Path
     repo: Repo | None = None

     def __post_init__(self):
-        self.repo_name = self.repo_url.split('/')[-1].replace('.git', '')
-        self.repo_path = self.base_path / self.repo_name
+        parts = self.repo_url.rstrip('/').split('/')
+        self.repo_name = parts[-1].replace('.git', '')
+        self.org_name = parts[-2]
+        self.repo_path = self.base_path / self.org_name / self.repo_name
🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/tools/git_tool.py`
around lines 33 - 35, RepoContext.__post_init__ sets repo_path = base_path /
repo_name which omits the organization and disagrees with get_repo_path; update
__post_init__ so repo_path includes the org (e.g., repo_path = base_path /
self.org_name / self.repo_name) or call the existing get_repo_path logic to
compute it, ensuring RepoContext.repo_path matches the path used by
setup_repository and clone operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Iterative Predictor for Improved SWE-bench Issue Resolution

1 participant