feat(ui): Add Streamlit-based grader evaluation UI #71
base: main
Conversation
- Add main application entry (app.py)
- Add UI components (input_panel, result_panel, sidebar, multimodal)
- Add grader registry and factory services
- Add theme and styling configuration
- Add utility helpers and constants
- Update pre-commit config to use Python 3.11
Summary of Changes
Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers a significant new feature: a Streamlit-powered web application called "OpenJudge Studio." This application provides an intuitive graphical interface for users to interact with the OpenJudge framework, enabling them to easily configure API endpoints and keys, select from a wide range of LLM graders across different categories (e.g., correctness, relevance, multimodal, agent tool use), input evaluation data, and visualize the results. The goal is to streamline the process of evaluating LLM responses, making it more accessible and efficient for developers and researchers.
Code Review
This pull request introduces a comprehensive Streamlit-based user interface for evaluating LLM responses using the OpenJudge framework. The new UI is well-structured, with clear separation of concerns into components, services, and configuration. My review focuses on improving maintainability, robustness, and error handling. Key suggestions include refactoring the import system to avoid sys.path manipulation, improving exception handling and logging for easier debugging, and replacing fragile code introspection with more robust methods. Overall, this is a great addition that significantly enhances the usability of the OpenJudge framework.
ui/components/input_panel.py
Outdated
```python
def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    """Render the input panel and return input data.

    Args:
        sidebar_config: Configuration from sidebar

    Returns:
        Dictionary containing all input data
    """
    grader_config = sidebar_config.get("grader_config")
    grader_name = sidebar_config.get("grader_name", "")
    category = sidebar_config.get("grader_category", "common")

    render_section_header("Input Data")

    # Action buttons
    col_btn1, col_btn2 = st.columns([1, 1])
    with col_btn1:
        load_example = st.button("Load Example", use_container_width=True)
    with col_btn2:
        clear_all = st.button("Clear All", use_container_width=True)

    # Handle button actions
    if load_example:
        st.session_state.example_loaded = True
        st.session_state.evaluation_result = None
    if clear_all:
        st.session_state.example_loaded = False
        st.session_state.evaluation_result = None

    # Get default values
    if st.session_state.get("example_loaded", False):
        defaults = _get_example_data(grader_name, category)
    else:
        defaults = {
            "query": "",
            "response": "",
            "reference_response": "",
            "context": "",
            "tool_definitions": "",
            "tool_calls": "",
        }

    input_data: dict[str, Any] = {}

    # =========================================================================
    # Render appropriate input fields based on grader type
    # =========================================================================

    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return input_data

    input_fields = grader_config.get("input_fields", ["query", "response"])

    # -------------------------------------------------------------------------
    # Multimodal Graders (Image + Text)
    # -------------------------------------------------------------------------
    if "response_multimodal" in input_fields:
        content_list, context = render_multimodal_input()
        input_data["response"] = content_list
        input_data["has_content"] = len(content_list) > 0
        return input_data

    if "response_image" in input_fields:
        # Text-to-Image grader
        text_prompt, image = render_text_to_image_input()
        input_data["query"] = text_prompt
        input_data["response"] = image
        input_data["has_content"] = bool(text_prompt and image)
        return input_data

    # -------------------------------------------------------------------------
    # Agent Graders (Tool definitions and calls)
    # -------------------------------------------------------------------------
    if "tool_definitions" in input_fields:
        tab_main, tab_tools, tab_context = st.tabs(["Query", "Tools", "Context"])

        with tab_main:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's query to the agent...",
                help="The task or question given to the agent",
            )
            input_data["query"] = query

        with tab_tools:
            st.markdown(
                """
                <div class="info-card">
                    <div style="font-size: 0.85rem; color: #94A3B8;">
                        Enter tool definitions and calls in JSON format
                    </div>
                </div>
                """,
                unsafe_allow_html=True,
            )

            tool_definitions = st.text_area(
                "Available Tool Definitions (JSON)",
                value=defaults.get("tool_definitions", ""),
                height=200,
                placeholder='[{"name": "get_weather", "description": "...", "parameters": {...}}]',
                help="JSON array of available tool definitions",
            )
            input_data["tool_definitions"] = tool_definitions

            tool_calls = st.text_area(
                "Agent's Tool Calls (JSON)",
                value=defaults.get("tool_calls", ""),
                height=150,
                placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                help="JSON array of tool calls made by the agent",
            )
            input_data["tool_calls"] = tool_calls

            # Reference tool calls for accuracy evaluation
            if "reference_tool_calls" in input_fields:
                reference_tool_calls = st.text_area(
                    "Expected Tool Calls (JSON)",
                    value="",
                    height=150,
                    placeholder='[{"name": "get_weather", "arguments": {"location": "Beijing"}}]',
                    help="JSON array of expected/correct tool calls",
                )
                input_data["reference_tool_calls"] = reference_tool_calls

        with tab_context:
            context = st.text_area(
                "Additional Context",
                value="",
                height=200,
                placeholder="Enter any additional context...",
                help="Optional background information",
            )
            input_data["context"] = context

        input_data["has_content"] = bool(query and tool_definitions and tool_calls)
        return input_data

    # -------------------------------------------------------------------------
    # Standard Graders (Query/Response/Reference)
    # -------------------------------------------------------------------------
    tab_main, tab_context = st.tabs(["Main Input", "Context"])

    with tab_main:
        # Query field
        if "query" in input_fields:
            query = st.text_area(
                "Query",
                value=defaults.get("query", ""),
                height=100,
                placeholder="Enter the user's question or prompt...",
                help="The original question or prompt from the user",
            )
            input_data["query"] = query

        # Response field (always present for standard graders)
        response = st.text_area(
            "Response to Evaluate",
            value=defaults.get("response", ""),
            height=150,
            placeholder="Enter the response to be evaluated...",
            help="The model's response that needs to be evaluated",
        )
        input_data["response"] = response

        # Reference response field
        requires_reference = grader_config.get("requires_reference", False)
        if "reference_response" in input_fields or requires_reference:
            ref_label = (
                "Reference Response *"
                if requires_reference
                else "Reference Response (Optional)"
            )
            reference_response = st.text_area(
                ref_label,
                value=defaults.get("reference_response", ""),
                height=120,
                placeholder="Enter the reference/golden answer..."
                + (" (Required)" if requires_reference else ""),
                help="The expected or ideal response for comparison",
            )
            input_data["reference_response"] = reference_response

    with tab_context:
        context = st.text_area(
            "Additional Context",
            value=defaults.get("context", ""),
            height=200,
            placeholder="Enter any additional context that might help with evaluation...",
            help="Optional background information for the evaluation",
        )
        input_data["context"] = context

    # Determine if we have enough content to run
    has_content = bool(input_data.get("response", ""))
    if "query" in input_fields:
        has_content = has_content and bool(input_data.get("query", ""))
    if requires_reference:
        has_content = has_content and bool(input_data.get("reference_response", ""))

    input_data["has_content"] = has_content

    return input_data
```
The render_input_panel function is quite long and handles rendering logic for multiple types of graders (multimodal, agent, standard). This reduces its readability and maintainability.
Consider refactoring this function by extracting the logic for each grader type into its own private helper function, for example:
- _render_multimodal_inputs(...)
- _render_agent_inputs(...)
- _render_standard_inputs(...)
This would make render_input_panel a dispatcher function, which would be much easier to read and modify in the future.
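A minimal sketch of what that dispatcher shape could look like; `_get_defaults` and the three `_render_*` helpers are hypothetical names taken from the suggestion above, not code from this PR:

```python
# Sketch only: the helper functions referenced here are placeholders for the extracted logic.
from typing import Any

import streamlit as st


def render_input_panel(sidebar_config: dict[str, Any]) -> dict[str, Any]:
    """Dispatch to one renderer per grader family and return the collected input."""
    grader_config = sidebar_config.get("grader_config")
    if not grader_config:
        st.warning("Please select a grader from the sidebar")
        return {}

    input_fields = grader_config.get("input_fields", ["query", "response"])
    defaults = _get_defaults(sidebar_config)  # hypothetical: wraps Load Example / Clear All handling

    if "response_multimodal" in input_fields or "response_image" in input_fields:
        return _render_multimodal_inputs(input_fields, defaults)
    if "tool_definitions" in input_fields:
        return _render_agent_inputs(input_fields, defaults)
    return _render_standard_inputs(grader_config, input_fields, defaults)
```

Each helper would own the widgets and the `has_content` rule for its grader family, so adding a new family later only means adding one branch and one helper.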
ui/components/multimodal.py
Outdated
```python
except Exception:
    st.warning("Could not load image preview")
    return url
```
Catching a broad Exception can hide the actual cause of an error and make debugging difficult. It's better to catch more specific exceptions related to network or image loading. Additionally, logging the exception and including its message in the UI warning would provide valuable context for debugging.
Suggested change:

```diff
-except Exception:
-    st.warning("Could not load image preview")
-    return url
+except Exception as e:
+    st.warning(f"Could not load image preview. Error: {e}")
+    return url
```
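For illustration, a hedged sketch of the narrower handling the review asks for; the helper name and the use of requests/PIL for the preview are assumptions about multimodal.py rather than its actual code:

```python
# Sketch: catch network and image-decoding errors specifically, log them, and surface the reason.
import io
import logging

import requests
import streamlit as st
from PIL import Image, UnidentifiedImageError

logger = logging.getLogger(__name__)


def _preview_image_from_url(url: str) -> str:
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        st.image(Image.open(io.BytesIO(resp.content)))
    except (requests.RequestException, UnidentifiedImageError, OSError) as e:
        # Log for the server-side trace and show the reason in the UI warning.
        logger.warning("Image preview failed for %s: %s", url, e)
        st.warning(f"Could not load image preview. Error: {e}")
    return url
```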
- Fix wrong-import-position in app.py
- Remove unused grader_name parameter in input_panel.py
- Split render_input_panel into smaller helper functions
- Split render_result_panel into smaller helper functions
- Split render_sidebar into smaller helper functions
- Fix unnecessary list comprehension in sidebar.py
- Remove unused col3 variable in sidebar.py
- Fix line-too-long in result_panel.py and constants.py
- Add logging for exception handling in result_panel.py
- Use inspect.signature() instead of __code__.co_varnames in grader_factory.py
- Replace sys.path modification with relative imports across ui package
- Use render_divider() helper function in app.py
- Extract _render_multimodal_inputs() for better code organization
- Show error details in image preview warning in multimodal.py
- Remove unused render_loading_state() function
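To illustrate the inspect.signature() change mentioned above: unlike __code__.co_varnames, which also picks up local variables, signature() reports only the declared parameters. The function name and arguments below are illustrative, not the actual grader_factory.py code:

```python
# Sketch: pass a grader class only the keyword arguments its __init__ actually declares.
import inspect
from typing import Any


def build_grader(grader_cls: type, **kwargs: Any) -> Any:
    """Instantiate a grader, filtering kwargs against its constructor signature."""
    accepted = inspect.signature(grader_cls.__init__).parameters
    filtered = {name: value for name, value in kwargs.items() if name in accepted}
    return grader_cls(**filtered)
```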
Relative imports add complexity for Streamlit apps, which are typically run directly with 'streamlit run', so that change is reverted. Keep the other review fixes (logging, inspect.signature, render_divider, etc.).
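A minimal sketch of the direct-run pattern this comment refers to; the exact contents of app.py are an assumption. When launched with `streamlit run ui/app.py`, the repository root may not be on sys.path, so absolute `ui.*` imports need it added explicitly:

```python
# Sketch of a top-of-app.py bootstrap for direct `streamlit run` execution (assumed layout).
import sys
from pathlib import Path

# Add the repository root (two levels up from ui/app.py) so `ui.*` imports resolve.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from ui.components.input_panel import render_input_panel  # noqa: E402
```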
OpenJudge Version
[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-files command