feat(ui): Add Streamlit-based grader evaluation UI #71
base: main
Conversation
- Add main application entry (app.py)
- Add UI components (input_panel, result_panel, sidebar, multimodal)
- Add grader registry and factory services
- Add theme and styling configuration
- Add utility helpers and constants
- Update pre-commit config to use Python 3.11
Summary of Changes
Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a significant new feature: a Streamlit-powered web application called "OpenJudge Studio." This application provides an intuitive graphical interface for users to interact with the OpenJudge framework, enabling them to easily configure API endpoints and keys, select from a wide range of LLM graders across different categories (e.g., correctness, relevance, multimodal, agent tool use), input evaluation data, and visualize the results. The goal is to streamline the process of evaluating LLM responses, making it more accessible and efficient for developers and researchers.
Code Review
This pull request introduces a comprehensive Streamlit-based user interface for evaluating LLM responses using the OpenJudge framework. The new UI is well-structured, with clear separation of concerns into components, services, and configuration. My review focuses on improving maintainability, robustness, and error handling. Key suggestions include refactoring the import system to avoid sys.path manipulation, improving exception handling and logging for easier debugging, and replacing fragile code introspection with more robust methods. Overall, this is a great addition that significantly enhances the usability of the OpenJudge framework.
ui/components/input_panel.py
Outdated
```python
def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    """Render the input panel and return input data.

    Args:
        sidebar_config: Configuration from sidebar

    Returns:
        Dictionary containing all input data
    """
    grader_config = sidebar_config.get("grader_config")
    grader_name = sidebar_config.get("grader_name", "")
    category = sidebar_config.get("grader_category", "common")

    render_section_header("Input Data")

    # Action buttons
    col_btn1, col_btn2 = st.columns([1, 1])
    with col_btn1:
        load_example = st.button("Load Example", use_container_width=True)
    with col_btn2:
        clear_all = st.button("Clear All", use_container_width=True)

    # Handle button actions
    if load_example:
        st.session_state.example_loaded = True
        st.session_state.evaluation_result = None
    if clear_all:
        st.session_state.example_loaded = False
        st.session_state.evaluation_result = None

    # Get default values
    if st.session_state.get("example_loaded", False):
        defaults = _get_example_data(grader_name, category)
    else:
        defaults = {
            "query": "",
            "response": "",
            "reference_response": "",
            "context": "",
            "tool_definitions": "",
            "tool_calls": "",
        }

    input_data: dict[str, Any] = {}

    # =========================================================================
    # Render appropriate input fields based on grader type
    # =========================================================================

    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return input_data

    input_fields = grader_config.get("input_fields", ["query", "response"])

    # -------------------------------------------------------------------------
    # Multimodal Graders (Image + Text)
    # -------------------------------------------------------------------------
    if "response_multimodal" in input_fields:
        content_list, context = render_multimodal_input()
        input_data["response"] = content_list
        input_data["has_content"] = len(content_list) > 0
        return input_data

    if "response_image" in input_fields:
        # Text-to-Image grader
        text_prompt, image = render_text_to_image_input()
        input_data["query"] = text_prompt
        input_data["response"] = image
        input_data["has_content"] = bool(text_prompt and image)
        return input_data

    # -------------------------------------------------------------------------
    # Agent Graders (Tool definitions and calls)
    # -------------------------------------------------------------------------
    if "tool_definitions" in input_fields:
        tab_main, tab_tools, tab_context = st.tabs(["Query", "Tools", "Context"])

        with tab_main:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's query to the agent...",
                help="The task or question given to the agent",
            )
            input_data["query"] = query

        with tab_tools:
            st.markdown(
                """
                <div class="info-card">
                    <div style="font-size: 0.85rem; color: #94A3B8;">
                        Enter tool definitions and calls in JSON format
                    </div>
                </div>
                """,
                unsafe_allow_html=True,
            )

            tool_definitions = st.text_area(
                "Available Tool Definitions (JSON)",
                value=defaults.get("tool_definitions", ""),
                height=200,
                placeholder='[{"name": "get_weather", "description": "...", "parameters": {...}}]',
                help="JSON array of available tool definitions",
            )
            input_data["tool_definitions"] = tool_definitions

            tool_calls = st.text_area(
                "Agent's Tool Calls (JSON)",
                value=defaults.get("tool_calls", ""),
                height=150,
                placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                help="JSON array of tool calls made by the agent",
            )
            input_data["tool_calls"] = tool_calls

            # Reference tool calls for accuracy evaluation
            if "reference_tool_calls" in input_fields:
                reference_tool_calls = st.text_area(
                    "Expected Tool Calls (JSON)",
                    value="",
                    height=150,
                    placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                    help="JSON array of expected/correct tool calls",
                )
                input_data["reference_tool_calls"] = reference_tool_calls

        with tab_context:
            context = st.text_area(
                "Additional Context",
                value="",
                height=200,
                placeholder="Enter any additional context...",
                help="Optional background information",
            )
            input_data["context"] = context

        input_data["has_content"] = bool(query and tool_definitions and tool_calls)
        return input_data

    # -------------------------------------------------------------------------
    # Standard Graders (Query/Response/Reference)
    # -------------------------------------------------------------------------
    tab_main, tab_context = st.tabs(["Main Input", "Context"])

    with tab_main:
        # Query field
        if "query" in input_fields:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's question or prompt...",
                help="The original question or prompt from the user",
            )
            input_data["query"] = query

        # Response field (always present for standard graders)
        response = st.text_area(
            "Response to Evaluate",
            value=defaults.get("response", ""),
            height=150,
            placeholder="Enter the response to be evaluated...",
            help="The model's response that needs to be evaluated",
        )
        input_data["response"] = response

        # Reference response field
        requires_reference = grader_config.get("requires_reference", False)
        if "reference_response" in input_fields or requires_reference:
            ref_label = (
                "Reference Response *"
                if requires_reference
                else "Reference Response (Optional)"
            )
            reference_response = st.text_area(
                ref_label,
                value=defaults.get("reference_response", ""),
                height=120,
                placeholder="Enter the reference/golden answer..."
                + (" (Required)" if requires_reference else ""),
                help="The expected or ideal response for comparison",
            )
            input_data["reference_response"] = reference_response

    with tab_context:
        context = st.text_area(
            "Additional Context",
            value=defaults.get("context", ""),
            height=200,
            placeholder="Enter any additional context that might help with evaluation...",
            help="Optional background information for the evaluation",
        )
        input_data["context"] = context

    # Determine if we have enough content to run
    has_content = bool(input_data.get("response", ""))
    if "query" in input_fields:
        has_content = has_content and bool(input_data.get("query", ""))
    if requires_reference:
        has_content = has_content and bool(input_data.get("reference_response", ""))

    input_data["has_content"] = has_content

    return input_data
```
The render_input_panel function is quite long and handles rendering logic for multiple types of graders (multimodal, agent, standard). This reduces its readability and maintainability.
Consider refactoring this function by extracting the logic for each grader type into its own private helper function, for example:
- _render_multimodal_inputs(...)
- _render_agent_inputs(...)
- _render_standard_inputs(...)
This would make render_input_panel a dispatcher function, which would be much easier to read and modify in the future.
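A minimal sketch of what that dispatcher shape could look like; `_get_defaults` and the three `_render_*` helpers are hypothetical names taken from the suggestion above, not code from this PR:

```python
# Sketch only: the helper functions referenced here are placeholders for the extracted logic.
from typing import Any

import streamlit as st


def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    """Dispatch to one renderer per grader family and return the collected input."""
    grader_config = sidebar_config.get("grader_config")
    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return {}

    input_fields = grader_config.get("input_fields", ["query", "response"])
    defaults = _get_defaults(sidebar_config)  # hypothetical: wraps Load Example / Clear All handling

    if "response_multimodal" in input_fields or "response_image" in input_fields:
        return _render_multimodal_inputs(input_fields, defaults)
    if "tool_definitions" in input_fields:
        return _render_agent_inputs(input_fields, defaults)
    return _render_standard_inputs(grader_config, input_fields, defaults)
```

Each helper would own the widgets and the `has_content` rule for its grader family, so adding a new family later only means adding one branch and one helper.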
ui/components/multimodal.py
Outdated
```python
except Exception:
    st.warning("Could not load image preview")
    return url
```
Catching a broad Exception can hide the actual cause of an error and make debugging difficult. It's better to catch more specific exceptions related to network or image loading. Additionally, logging the exception and including its message in the UI warning would provide valuable context for debugging.
Suggested change:

```diff
-except Exception:
-    st.warning("Could not load image preview")
-    return url
+except Exception as e:
+    st.warning(f"Could not load image preview. Error: {e}")
+    return url
```
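For illustration, a hedged sketch of the narrower handling the review asks for; the helper name and the use of requests/PIL for the preview are assumptions about multimodal.py rather than its actual code:

```python
# Sketch: catch network and image-decoding errors specifically, log them, and surface the reason.
import io
import logging

import requests
import streamlit as st
from PIL import Image, UnidentifiedImageError

logger = logging.getLogger(__name__)


def _preview_image_from_url(url: str) -> str:
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        st.image(Image.open(io.BytesIO(resp.content)))
    except (requests.RequestException, UnidentifiedImageError, OSError) as e:
        # Log for the server-side trace and show the reason in the UI warning.
        logger.warning("Image preview failed for %s: %s", url, e)
        st.warning(f"Could not load image preview. Error: {e}")
    return url
```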
- Fix wrong-import-position in app.py
- Remove unused grader_name parameter in input_panel.py
- Split render_input_panel into smaller helper functions
- Split render_result_panel into smaller helper functions
- Split render_sidebar into smaller helper functions
- Fix unnecessary list comprehension in sidebar.py
- Remove unused col3 variable in sidebar.py
- Fix line-too-long in result_panel.py and constants.py
- Add logging for exception handling in result_panel.py
- Use inspect.signature() instead of __code__.co_varnames in grader_factory.py
- Replace sys.path modification with relative imports across ui package
- Use render_divider() helper function in app.py
- Extract _render_multimodal_inputs() for better code organization
- Show error details in image preview warning in multimodal.py
- Remove unused render_loading_state() function
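To illustrate the inspect.signature() change mentioned above: unlike __code__.co_varnames, which also picks up local variables, signature() reports only the declared parameters. The function name and arguments below are illustrative, not the actual grader_factory.py code:

```python
# Sketch: pass a grader class only the keyword arguments its __init__ actually declares.
import inspect
from typing import Any


def build_grader(grader_cls: type, **kwargs: Any) -> Any:
    """Instantiate a grader, filtering kwargs against its constructor signature."""
    accepted = inspect.signature(grader_cls.__init__).parameters
    filtered = {name: value for name, value in kwargs.items() if name in accepted}
    return grader_cls(**filtered)
```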
Relative imports add complexity for Streamlit apps, which are typically run directly with 'streamlit run', so that change is reverted. Keep the other review fixes (logging, inspect.signature, render_divider, etc.).
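A minimal sketch of the direct-run pattern this comment refers to; the exact contents of app.py are an assumption. When launched with `streamlit run ui/app.py`, the repository root may not be on sys.path, so absolute `ui.*` imports need it added explicitly:

```python
# Sketch of a top-of-app.py bootstrap for direct `streamlit run` execution (assumed layout).
import sys
from pathlib import Path

# Add the repository root (two levels up from ui/app.py) so `ui.*` imports resolve.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from ui.components.input_panel import render_input_panel  # noqa: E402
```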
OpenJudge Version
[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-files command