Skip to content

Conversation

@qbc2016
Copy link
Member

@qbc2016 qbc2016 commented Oct 28, 2025

AgentScope Version

[The version of AgentScope you are working on, e.g. import agentscope; print(agentscope.__version__)]

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@DavdGao DavdGao self-requested a review October 29, 2025 02:12
@DavdGao DavdGao changed the title Implement Excel Reader feat(rag): add excel reader to support excel files in the rag module Nov 3, 2025
@cla-assistant
Copy link

cla-assistant bot commented Dec 2, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds an ExcelReader to the RAG module to support reading and chunking Excel files (.xlsx, .xls). The implementation follows the pattern of existing readers (PDFReader, WordReader) and includes support for extracting text, tables, and images from Excel sheets.

Changes:

  • Added new ExcelReader class with support for table extraction in Markdown or JSON format, image extraction, and sheet-level processing control
  • Updated WordReader hash algorithm from MD5 to SHA256 for document ID generation
  • Added comprehensive test cases for the Excel reader functionality
  • Added required dependencies (pandas, openpyxl) to the full extras list

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/agentscope/rag/_reader/_excel_reader.py New ExcelReader implementation with 630 lines including table/image extraction
tests/rag_reader_test.py Added test case for Excel reader with images and tables
tests/test.xlsx Binary Excel test file with sample data
src/agentscope/rag/_reader/_word_reader.py Changed hash algorithm from MD5 to SHA256
src/agentscope/rag/_reader/init.py Exported ExcelReader class
src/agentscope/rag/init.py Exported ExcelReader at package level
pyproject.toml Added pandas and openpyxl to full dependencies

Copy link
Member

@DavdGao DavdGao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar, why we remove the example.docx here?

@DavdGao
Copy link
Member

DavdGao commented Jan 17, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new ExcelReader for the RAG module, which is a great addition for handling Excel files. The implementation correctly uses pandas and openpyxl to extract table data and images. The new reader is properly integrated and new dependencies are added to pyproject.toml.

I've found a couple of issues that need attention:

  • The include_cell_coordinates parameter in ExcelReader is currently unused.
  • There's a bug in how Markdown tables are generated: pipe characters are not escaped correctly, which will lead to rendering issues.
  • The new tests include some debugging code that should be removed and also have assertions that rely on the incorrect table generation, which will need to be updated.

Additionally, the update to use sha256 for document ID generation in WordReader is a good improvement for consistency.

Overall, this is a solid contribution with a few minor fixes required.

@qbc2016
Copy link
Member Author

qbc2016 commented Jan 19, 2026

Similar, why we remove the example.docx here?

Since it's not used anywhere and is just a legacy dependency.

@qbc2016
Copy link
Member Author

qbc2016 commented Jan 19, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new ExcelReader to the RAG module, enabling the processing of Excel files. The implementation is robust, covering text, tables, and images, and includes good error handling and comprehensive unit tests. The use of lazy loading for dependencies and the structured approach to extracting and ordering content from sheets are well done.

I have a couple of suggestions for improvement:

  • For performance and safety, it's better to open the Excel workbook in read-only mode.
  • There is an inconsistency in the hashing algorithm used for generating document IDs across different readers, which could be standardized for better maintainability.

Overall, this is a great addition to the project's capabilities.

Copy link
Member

@DavdGao DavdGao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@DavdGao DavdGao merged commit c29bdde into agentscope-ai:main Jan 19, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants