Fixed current agent citation issues and added sub_section_writer agen… #36
…t for upcoming SurfSense research agent
Walkthrough
This update introduces a new submodule for the LangGraph Agent within the backend. A collection of files is added to define and execute a custom state graph workflow for writing sub-sections, including configuration, state handling, asynchronous nodes, and prompt definitions. Additionally, minor changes enforce type safety in routes, tasks, and connector services, and the project dependency on LangGraph has been updated.
Sequence Diagram(s)
sequenceDiagram
participant Config as Configuration
participant Graph as StateGraph
participant FD as fetch_relevant_documents
participant WS as write_sub_section
participant State as ExecutionState
Config->>Graph: Build workflow with configuration schema
Graph->>FD: Invoke document fetching
FD-->>State: Return fetched documents
Graph->>WS: Invoke sub-section writing
WS-->>State: Return final answer
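The sequential flow in the diagram can be sketched in plain Python. This is a stand-in that uses only asyncio, not the LangGraph API; the node bodies, the state keys (relevant_documents, final_answer) and the run_workflow helper are illustrative assumptions, not the PR's code.

```python
import asyncio
from typing import Any, Awaitable, Callable

State = dict[str, Any]
Node = Callable[[State, dict[str, Any]], Awaitable[State]]


async def fetch_relevant_documents(state: State, config: dict[str, Any]) -> State:
    # Hypothetical stand-in: the real node queries the configured connectors.
    query = config["sub_section_title"]
    return {"relevant_documents": [f"document about {query}"]}


async def write_sub_section(state: State, config: dict[str, Any]) -> State:
    # Hypothetical stand-in: the real node calls an LLM with the fetched documents.
    docs = state["relevant_documents"]
    return {"final_answer": f"Section written from {len(docs)} document(s)"}


async def run_workflow(nodes: list[Node], config: dict[str, Any]) -> State:
    """Run nodes in order, merging each node's partial update into the state."""
    state: State = {}
    for node in nodes:
        update = await node(state, config)
        state = {**state, **update}
    return state


def run_example() -> State:
    # Start -> fetch_relevant_documents -> write_sub_section -> End
    return asyncio.run(
        run_workflow(
            [fetch_relevant_documents, write_sub_section],
            {"sub_section_title": "citations"},
        )
    )
```

Each node returns only the keys it produced, and the runner merges them into the shared state, which is the same division of labour the diagram shows between the nodes and ExecutionState.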
Actionable comments posted: 1
🔭 Outside diff range comments (1)
surfsense_backend/app/utils/connector_service.py (1)
445-515: 🛠️ Refactor suggestion
Validate YouTube metadata fields and unique ID generation.
- Line 470 assigns self.source_id_counter to the chunk’s document ID. If concurrency is possible, consider a safer ID generation strategy.
- At lines 488–490, the code truncates the description to 100 characters. Confirm that the appended ellipsis (“...”) is consistent with the rest of your project’s user experience.
🧰 Tools
🪛 Ruff (0.8.2)
480-480: Local variable published_date is assigned to but never used
Remove assignment to unused variable published_date
(F841)
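One way to keep the ellipsis consistent across connectors is to route every truncation through a single helper. The sketch below is an assumption about how that could look; truncate_description and its parameters are hypothetical names, not functions from the PR.

```python
def truncate_description(text: str, limit: int = 100, ellipsis: str = "...") -> str:
    """Return text unchanged if it fits, else its first `limit` characters plus an ellipsis.

    Note that a truncated result is longer than `limit` by the length of the ellipsis.
    """
    if len(text) <= limit:
        return text
    # Strip trailing whitespace so the ellipsis never follows a dangling space.
    return text[:limit].rstrip() + ellipsis
```

With one helper, changing the limit or the ellipsis style is a single edit rather than a search through each connector method.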
🧹 Nitpick comments (8)
surfsense_backend/app/agents/researcher/sub_section_writer/prompts.py (1)
1-2: Remove extraneous 'f' prefix from string.
The string is defined with an f-prefix but doesn't contain any placeholders.
-citation_system_prompt = f"""
+citation_system_prompt = """
🧰 Tools
🪛 Ruff (0.8.2)
1-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
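A short illustration of why the prefix is worth dropping, using made-up prompt text rather than the PR's prompt: without placeholders the two literals are identical, but the f-prefix forces any literal brace added later to be doubled.

```python
# Without placeholders, the f-prefix changes nothing about the resulting string.
plain = """Cite sources as [1], [2]."""
prefixed = f"""Cite sources as [1], [2]."""
same_without_placeholders = plain == prefixed

# With literal braces, an f-string needs them doubled; a plain string does not.
plain_with_braces = """Return JSON like {"source_id": 1}."""
fstring_with_braces = f"""Return JSON like {{"source_id": 1}}."""
same_with_escaped_braces = plain_with_braces == fstring_with_braces
```

A prompt that shows JSON or template examples is therefore easier to maintain as a plain string.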
surfsense_backend/app/agents/researcher/sub_section_writer/nodes.py (2)
11-161: Consider stabilizing content deduplication and adding connector error handling.
- The built-in hash() used at lines 118–124 can vary between runs due to Python’s hash randomization. If consistency is required for deduplication across processes or sessions, consider using a stable hash function like md5 or sha256.
- When iterating over multiple connectors (lines 40–108), if a connector fails or is missing, the code silently proceeds. This might mask errors. Consider adding try/except blocks or skipping unavailable connectors with clear logging to improve resilience.
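A minimal sketch of stable deduplication with hashlib, assuming each document is a dict with a "content" string; that shape and both helper names are assumptions for illustration, not the PR's data model.

```python
import hashlib


def content_fingerprint(content: str) -> str:
    """Return a SHA-256 hex digest that is identical across processes and runs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def deduplicate_documents(documents: list[dict]) -> list[dict]:
    """Keep the first document for each distinct content string, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for doc in documents:
        fingerprint = content_fingerprint(doc["content"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(doc)
    return unique
```

Within a single process the built-in hash() is self-consistent, so the switch matters only if fingerprints are stored or compared across processes or sessions.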
164-245: Incorporate the sub-section title in the final LLM prompt and handle potential LLM failures.
- The code references configuration.sub_section_title (line 212) but it is commented out, resulting in no mention of the actual title in lines 217–229. If the sub-section title is important, reintroduce it into the prompt to provide full context to the LLM.
- At lines 238–239, there is no exception handling if the LLM invocation fails. Consider wrapping this call in a try/except block to handle potential network errors or LLM unavailability gracefully.
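The try/except suggestion could take roughly the following shape. This is a sketch under assumptions: safe_invoke is a hypothetical helper, and invoke stands for whatever awaitable the node uses to call the LLM, since the review does not show that call.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)


async def safe_invoke(
    invoke: Callable[[str], Awaitable[str]],
    prompt: str,
    retries: int = 2,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Call invoke(prompt), retrying on failure; return None if every attempt fails."""
    # One initial attempt plus `retries` further attempts.
    for attempt in range(1, retries + 2):
        try:
            return await invoke(prompt)
        except Exception as exc:  # narrow this to the client's error types in real code
            logger.warning("LLM call failed (attempt %d): %s", attempt, exc)
            await asyncio.sleep(delay_seconds)
    return None
```

Returning None lets the calling node decide whether to emit a fallback message or propagate the failure, instead of crashing the whole workflow.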
surfsense_backend/app/utils/connector_service.py (5)
16-60: Evaluate concurrency risks with source_id_counter and improve logging for missing connectors.
- The method search_crawled_urls (lines 16–60) increments self.source_id_counter at line 48 in a simple loop. If this service is used concurrently, you risk race conditions. Consider making the counter thread-safe or generating IDs differently.
- No fallback or error handling is present if no results are found (returns an empty list). Logging or messages to guide the caller might be beneficial.
61-105: Check concurrency for source_id_counter and maintain consistent behavior with file searches.
Just like in search_crawled_urls, source_id_counter is incremented at line 93. If concurrency is possible, address potential race conditions. If concurrency is not intended, consider documenting that limitation.
126-218: Guard against missing or malformed Tavily configuration and concurrency on source IDs.
- Line 139 fetches a connector but handles an empty result by returning an empty list. Consider logging that no Tavily connector was found to help debug misconfiguration.
- As with other methods, self.source_id_counter increments at line 197 in a loop; concurrency might create duplicate IDs.
219-282: Improve Slack-specific metadata usage and concurrency caution.
- The Slack metadata logic (lines 243–268) is helpful but consider adding fallback logs if the needed fields are missing (e.g., channel_name).
- The shared concurrency concern with source_id_counter (line 270) remains. If these methods are called in parallel, you might encounter duplicates.
283-354: Enhance code clarity for Notion page retrieval and concurrency handling.
- Lines 320–323 build a title from the Notion page’s metadata. Consider logging or skipping if critical metadata is missing.
- Incrementing self.source_id_counter at line 342 can create ID collisions if multiple coroutines access the same connector.
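The repeated source_id_counter concern could be addressed with one shared generator. The following is a sketch, not the service's actual design; SourceIdGenerator and next_id are hypothetical names.

```python
import itertools
import threading


class SourceIdGenerator:
    """Hand out increasing integer IDs; safe to call from multiple threads."""

    def __init__(self, start: int = 1) -> None:
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self) -> int:
        # The lock guarantees that no two callers ever receive the same value.
        with self._lock:
            return next(self._counter)
```

In a single-threaded asyncio event loop, an increment with no await between the read and the write cannot be interleaved with another coroutine, so the lock matters mainly when the service is also reached from threads. If IDs do not need to be small integers, uuid.uuid4() from the standard library avoids shared state altogether.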
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- surfsense_backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
- surfsense_backend/app/agents/researcher/sub_section_writer/__init__.py (1 hunks)
- surfsense_backend/app/agents/researcher/sub_section_writer/configuration.py (1 hunks)
- surfsense_backend/app/agents/researcher/sub_section_writer/graph.py (1 hunks)
- surfsense_backend/app/agents/researcher/sub_section_writer/nodes.py (1 hunks)
- surfsense_backend/app/agents/researcher/sub_section_writer/prompts.py (1 hunks)
- surfsense_backend/app/agents/researcher/sub_section_writer/state.py (1 hunks)
- surfsense_backend/app/routes/chats_routes.py (1 hunks)
- surfsense_backend/app/tasks/stream_connector_search_results.py (1 hunks)
- surfsense_backend/app/utils/connector_service.py (21 hunks)
- surfsense_backend/pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
surfsense_backend/app/agents/researcher/sub_section_writer/graph.py (3)
surfsense_backend/app/agents/researcher/sub_section_writer/state.py (1)
- State (10-22)
surfsense_backend/app/agents/researcher/sub_section_writer/nodes.py (2)
- fetch_relevant_documents (11-160)
- write_sub_section (164-243)
surfsense_backend/app/agents/researcher/sub_section_writer/configuration.py (1)
- Configuration (12-31)
surfsense_backend/app/agents/researcher/sub_section_writer/nodes.py (4)
surfsense_backend/app/agents/researcher/sub_section_writer/configuration.py (2)
- Configuration (12-31)
- from_runnable_config (25-31)
surfsense_backend/app/agents/researcher/sub_section_writer/state.py (1)
- State (10-22)
surfsense_backend/app/utils/connector_service.py (6)
- ConnectorService (10-515)
- search_crawled_urls (16-59)
- search_files (61-104)
- search_tavily (126-217)
- search_slack (219-281)
- search_notion (283-353)
surfsense_backend/app/utils/reranker_service.py (2)
- RerankerService (5-95)
- rerank_documents (19-80)
🪛 Ruff (0.8.2)
surfsense_backend/app/agents/researcher/sub_section_writer/prompts.py
1-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (10)
surfsense_backend/app/tasks/stream_connector_search_results.py (1)
17-17: Type change from int to str looks good
The parameter type change for user_id from int to str aligns with similar changes in the connector service. This change maintains type consistency across the application.
surfsense_backend/pyproject.toml (1)
16-16: Dependency addition for LangGraph looks good
The addition of the langgraph dependency with a minimum version constraint is appropriate for supporting the new sub_section_writer agent functionality.
surfsense_backend/app/agents/researcher/sub_section_writer/__init__.py (1)
1-8: Well-structured module initialization
The new module file follows Python best practices with a clear docstring, appropriate imports, and explicit public API definition through __all__.
surfsense_backend/app/agents/researcher/sub_section_writer/graph.py (1)
1-23: Well-structured graph definition for the sub-section writer workflow.
The implementation follows a clean pattern for defining a LangGraph workflow with a sequential flow:
- Start → Fetch relevant documents
- Fetch relevant documents → Write sub-section
- Write sub-section → End
This straightforward structure makes the workflow easy to understand and maintain.
surfsense_backend/app/agents/researcher/sub_section_writer/prompts.py (1)
2-82: Comprehensive citation system prompt that addresses the PR's citation issues.
The prompt provides detailed instructions for IEEE citation format usage, including:
- Clear steps for analyzing documents and extracting information
- Specific citation format requirements (square brackets with source_id)
- Input/output examples demonstrating proper citation usage
- Common incorrect citation formats to avoid
This well-structured prompt should effectively resolve the citation issues mentioned in the PR objectives.
🧰 Tools
🪛 Ruff (0.8.2)
1-82: f-string without any placeholders
Remove extraneous f prefix
(F541)
surfsense_backend/app/agents/researcher/sub_section_writer/state.py (1)
9-22: Clean state implementation with proper typing.
The State class properly defines:
- A runtime context with the database session
- Output fields to store the results from each node in the workflow
The types are well-defined, and the optional fields have appropriate default values of None.
surfsense_backend/app/agents/researcher/sub_section_writer/configuration.py (2)
11-21: Well-defined configuration with appropriate types and defaults.
The Configuration class effectively captures all the input parameters needed for the sub-section writer agent. Using kw_only=True on the dataclass ensures a clean instantiation pattern.
24-31: Robust configuration factory method with proper field filtering.
The from_runnable_config class method safely:
- Handles the case where config might be None
- Extracts only the fields that exist in the Configuration class
- Returns a properly constructed instance
This approach prevents errors when passing in configurations with extra fields.
surfsense_backend/app/utils/connector_service.py (2)
106-125: Ensure connector is properly typed and validated.
The signature change to accept user_id: str (line 106) may expose potential mismatches if existing code still passes an integer. Verify that all call sites use a string user ID. Additionally, consider logging when a connector is not found for troubleshooting.
355-444: Check extension-specific metadata reliability.
- At lines 393–424, parsing extension data includes date/time handling and visit duration logic. Ensure each key is reliably present, or handle missing fields gracefully.
- source_id_counter concurrency concerns apply at line 432.
🧰 Tools
🪛 Ruff (0.8.2)
391-391: Local variable browsing_session_id is assigned to but never used
Remove assignment to unused variable browsing_session_id
(F841)
401-401: Do not use bare except
(E722)
421-421: Do not use bare except
(E722)
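The E722 findings and the missing-field concern can be handled together by catching only the parsing error. A minimal sketch, assuming the extension stores ISO 8601 timestamps; the format and the parse_visit_date helper are assumptions, since the review does not show the parsing code.

```python
from datetime import datetime
from typing import Optional


def parse_visit_date(raw: object) -> Optional[str]:
    """Return raw formatted as YYYY-MM-DD, or None if it is missing or malformed."""
    if not isinstance(raw, str):
        # Covers a missing key (None) as well as unexpected types.
        return None
    try:
        return datetime.fromisoformat(raw).strftime("%Y-%m-%d")
    except ValueError:
        # Catch only the parsing error; a bare except would also swallow
        # KeyboardInterrupt and SystemExit.
        return None
```

Callers can then pass metadata.get("some_key") directly and treat None as "no date available".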
…chat data handling