Conversation

@XiaoBoAI
Collaborator

  • Add main application entry (app.py)
  • Add UI components (input_panel, result_panel, sidebar, multimodal)
  • Add grader registry and factory services
  • Add theme and styling configuration
  • Add utility helpers and constants
  • Update pre-commit config to use Python 3.11

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a significant new feature: a Streamlit-powered web application called "OpenJudge Studio." This application provides an intuitive graphical interface for users to interact with the OpenJudge framework, enabling them to easily configure API endpoints and keys, select from a wide range of LLM graders across different categories (e.g., correctness, relevance, multimodal, agent tool use), input evaluation data, and visualize the results. The goal is to streamline the process of evaluating LLM responses, making it more accessible and efficient for developers and researchers.

Highlights

  • Streamlit UI Introduction: A new Streamlit-based user interface, "OpenJudge Studio," has been added for interactive LLM grader evaluation.
  • Modular UI Components: The UI is built with modular components for input, result display, sidebar configuration, and multimodal data handling.
  • Grader and Model Services: New services for dynamic grader and model creation are introduced, allowing flexible selection and configuration of evaluation tools.
  • Comprehensive Grader Registry: A detailed registry categorizes and defines various graders (common, text, format, code, math, multimodal, agent) with their specific requirements and parameters; a hypothetical entry is sketched after this list.
  • Theming and Utilities: Custom dark theme styling and a suite of utility helpers for async operations, JSON parsing, and data formatting are included to enhance user experience and development.
  • Pre-commit Configuration Update: The pre-commit configuration has been updated to use Python 3.11, ensuring compatibility and modern development practices.
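For orientation, the registry described above is essentially a mapping from grader names to the metadata that the sidebar and input panel consume. The entry below is a hypothetical sketch, not the actual schema in the registry module; only the input_fields and requires_reference keys are taken from the input panel code shown later in this review, and the rest is illustrative.

# Hypothetical registry entries; real names, categories, and keys may differ.
GRADER_REGISTRY = {
    "relevance": {
        "category": "common",
        "input_fields": ["query", "response"],
        "requires_reference": False,
    },
    "correctness": {
        "category": "common",
        "input_fields": ["query", "response", "reference_response"],
        "requires_reference": True,
    },
}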


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a comprehensive Streamlit-based user interface for evaluating LLM responses using the OpenJudge framework. The new UI is well-structured, with clear separation of concerns into components, services, and configuration. My review focuses on improving maintainability, robustness, and error handling. Key suggestions include refactoring the import system to avoid sys.path manipulation, improving exception handling and logging for easier debugging, and replacing fragile code introspection with more robust methods. Overall, this is a great addition that significantly enhances the usability of the OpenJudge framework.
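On the introspection point: __code__.co_varnames lists local variables as well as parameters and gives no way to detect **kwargs, whereas inspect.signature reports only real parameters and their kinds. The snippet below is a minimal sketch of that idea; build_grader and grader_cls are placeholder names rather than the actual grader_factory.py API.

import inspect
from typing import Any


def build_grader(grader_cls: type, config: dict[str, Any]) -> Any:
    """Instantiate a grader, forwarding only the kwargs its constructor accepts."""
    params = inspect.signature(grader_cls).parameters
    # If the constructor takes **kwargs, pass everything through; otherwise filter.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        accepted = dict(config)
    else:
        accepted = {k: v for k, v in config.items() if k in params}
    return grader_cls(**accepted)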

Comment on lines 35 to 241
def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    """Render the input panel and return input data.

    Args:
        sidebar_config: Configuration from sidebar

    Returns:
        Dictionary containing all input data
    """
    grader_config = sidebar_config.get("grader_config")
    grader_name = sidebar_config.get("grader_name", "")
    category = sidebar_config.get("grader_category", "common")

    render_section_header("Input Data")

    # Action buttons
    col_btn1, col_btn2 = st.columns([1, 1])
    with col_btn1:
        load_example = st.button("Load Example", use_container_width=True)
    with col_btn2:
        clear_all = st.button("Clear All", use_container_width=True)

    # Handle button actions
    if load_example:
        st.session_state.example_loaded = True
        st.session_state.evaluation_result = None
    if clear_all:
        st.session_state.example_loaded = False
        st.session_state.evaluation_result = None

    # Get default values
    if st.session_state.get("example_loaded", False):
        defaults = _get_example_data(grader_name, category)
    else:
        defaults = {
            "query": "",
            "response": "",
            "reference_response": "",
            "context": "",
            "tool_definitions": "",
            "tool_calls": "",
        }

    input_data: dict[str, Any] = {}

    # =========================================================================
    # Render appropriate input fields based on grader type
    # =========================================================================

    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return input_data

    input_fields = grader_config.get("input_fields", ["query", "response"])

    # -------------------------------------------------------------------------
    # Multimodal Graders (Image + Text)
    # -------------------------------------------------------------------------
    if "response_multimodal" in input_fields:
        content_list, context = render_multimodal_input()
        input_data["response"] = content_list
        input_data["has_content"] = len(content_list) > 0
        return input_data

    if "response_image" in input_fields:
        # Text-to-Image grader
        text_prompt, image = render_text_to_image_input()
        input_data["query"] = text_prompt
        input_data["response"] = image
        input_data["has_content"] = bool(text_prompt and image)
        return input_data

    # -------------------------------------------------------------------------
    # Agent Graders (Tool definitions and calls)
    # -------------------------------------------------------------------------
    if "tool_definitions" in input_fields:
        tab_main, tab_tools, tab_context = st.tabs(["Query", "Tools", "Context"])

        with tab_main:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's query to the agent...",
                help="The task or question given to the agent",
            )
            input_data["query"] = query

        with tab_tools:
            st.markdown(
                """
                <div class="info-card">
                    <div style="font-size: 0.85rem; color: #94A3B8;">
                        Enter tool definitions and calls in JSON format
                    </div>
                </div>
                """,
                unsafe_allow_html=True,
            )

            tool_definitions = st.text_area(
                "Available Tool Definitions (JSON)",
                value=defaults.get("tool_definitions", ""),
                height=200,
                placeholder='[{"name": "get_weather", "description": "...", "parameters": {...}}]',
                help="JSON array of available tool definitions",
            )
            input_data["tool_definitions"] = tool_definitions

            tool_calls = st.text_area(
                "Agent's Tool Calls (JSON)",
                value=defaults.get("tool_calls", ""),
                height=150,
                placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                help="JSON array of tool calls made by the agent",
            )
            input_data["tool_calls"] = tool_calls

            # Reference tool calls for accuracy evaluation
            if "reference_tool_calls" in input_fields:
                reference_tool_calls = st.text_area(
                    "Expected Tool Calls (JSON)",
                    value="",
                    height=150,
                    placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                    help="JSON array of expected/correct tool calls",
                )
                input_data["reference_tool_calls"] = reference_tool_calls

        with tab_context:
            context = st.text_area(
                "Additional Context",
                value="",
                height=200,
                placeholder="Enter any additional context...",
                help="Optional background information",
            )
            input_data["context"] = context

        input_data["has_content"] = bool(query and tool_definitions and tool_calls)
        return input_data

    # -------------------------------------------------------------------------
    # Standard Graders (Query/Response/Reference)
    # -------------------------------------------------------------------------
    tab_main, tab_context = st.tabs(["Main Input", "Context"])

    with tab_main:
        # Query field
        if "query" in input_fields:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's question or prompt...",
                help="The original question or prompt from the user",
            )
            input_data["query"] = query

        # Response field (always present for standard graders)
        response = st.text_area(
            "Response to Evaluate",
            value=defaults.get("response", ""),
            height=150,
            placeholder="Enter the response to be evaluated...",
            help="The model's response that needs to be evaluated",
        )
        input_data["response"] = response

        # Reference response field
        requires_reference = grader_config.get("requires_reference", False)
        if "reference_response" in input_fields or requires_reference:
            ref_label = (
                "Reference Response *"
                if requires_reference
                else "Reference Response (Optional)"
            )
            reference_response = st.text_area(
                ref_label,
                value=defaults.get("reference_response", ""),
                height=120,
                placeholder="Enter the reference/golden answer..."
                + (" (Required)" if requires_reference else ""),
                help="The expected or ideal response for comparison",
            )
            input_data["reference_response"] = reference_response

    with tab_context:
        context = st.text_area(
            "Additional Context",
            value=defaults.get("context", ""),
            height=200,
            placeholder="Enter any additional context that might help with evaluation...",
            help="Optional background information for the evaluation",
        )
        input_data["context"] = context

    # Determine if we have enough content to run
    has_content = bool(input_data.get("response", ""))
    if "query" in input_fields:
        has_content = has_content and bool(input_data.get("query", ""))
    if requires_reference:
        has_content = has_content and bool(input_data.get("reference_response", ""))

    input_data["has_content"] = has_content

    return input_data


Severity: medium

The render_input_panel function is quite long and handles rendering logic for multiple types of graders (multimodal, agent, standard). This reduces its readability and maintainability.

Consider refactoring this function by extracting the logic for each grader type into its own private helper function, for example:

  • _render_multimodal_inputs(...)
  • _render_agent_inputs(...)
  • _render_standard_inputs(...)

This would make render_input_panel a dispatcher function, which would be much easier to read and modify in the future.
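A minimal sketch of that dispatcher shape, using the helper names suggested above (their exact signatures are left to the author; _get_defaults is a hypothetical stand-in for the existing example/clear-all handling):

from typing import Any

import streamlit as st


def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    grader_config = sidebar_config.get("grader_config")
    render_section_header("Input Data")
    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return {}

    defaults = _get_defaults(sidebar_config)  # hypothetical helper for example/clear state
    input_fields = grader_config.get("input_fields", ["query", "response"])

    # Dispatch to one focused renderer per grader family.
    if "response_multimodal" in input_fields or "response_image" in input_fields:
        return _render_multimodal_inputs(input_fields, defaults)
    if "tool_definitions" in input_fields:
        return _render_agent_inputs(input_fields, defaults)
    return _render_standard_inputs(grader_config, input_fields, defaults)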

Comment on lines 79 to 81
    except Exception:
        st.warning("Could not load image preview")
        return url


Severity: medium

Catching a broad Exception can hide the actual cause of an error and make debugging difficult. It's better to catch more specific exceptions related to network or image loading. Additionally, logging the exception and including its message in the warning would provide valuable debugging context by surfacing the error to the user in the UI.

Suggested change
-    except Exception:
-        st.warning("Could not load image preview")
-        return url
+    except Exception as e:
+        st.warning(f"Could not load image preview. Error: {e}")
+        return url
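If the preview is fetched over HTTP and decoded with Pillow (an assumption about multimodal.py, not confirmed here), a narrower handler could look like the hypothetical sketch below, logging the full traceback while the UI shows a short message:

import io
import logging

import requests
import streamlit as st
from PIL import Image, UnidentifiedImageError

logger = logging.getLogger(__name__)


def _preview_image(url: str) -> str:
    """Hypothetical helper: render a preview for an image URL, degrading gracefully."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        st.image(Image.open(io.BytesIO(resp.content)), caption="Preview")
    except (requests.RequestException, UnidentifiedImageError, OSError) as e:
        # logger.exception records the traceback; the UI only shows a short message.
        logger.exception("Failed to load image preview for %s", url)
        st.warning(f"Could not load image preview. Error: {e}")
    return url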

- Fix wrong-import-position in app.py
- Remove unused grader_name parameter in input_panel.py
- Split render_input_panel into smaller helper functions
- Split render_result_panel into smaller helper functions
- Split render_sidebar into smaller helper functions
- Fix unnecessary list comprehension in sidebar.py
- Remove unused col3 variable in sidebar.py
- Fix line-too-long in result_panel.py and constants.py
- Add logging for exception handling in result_panel.py
- Use inspect.signature() instead of __code__.co_varnames in grader_factory.py
- Replace sys.path modification with relative imports across ui package
- Use render_divider() helper function in app.py
- Extract _render_multimodal_inputs() for better code organization
- Show error details in image preview warning in multimodal.py
- Remove unused render_loading_state() function

Relative imports add complexity for Streamlit apps that are typically run directly with 'streamlit run'. Keep other review fixes (logging, inspect.signature, render_divider, etc.).
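For context on keeping the sys.path approach: streamlit run executes app.py as a plain top-level script, so package-relative imports only work when the app is launched as a module. A common pattern, shown here as a hypothetical sketch (the ui.components import path is assumed, not taken from the actual app.py), is to put the project root on sys.path before importing the UI package:

# Top of app.py: make the ui package importable under `streamlit run app.py`.
import sys
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parent.parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from ui.components.input_panel import render_input_panel  # noqa: E402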
