-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat(rag): add excel reader to support excel files in the rag module #872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds an ExcelReader to the RAG module to support reading and chunking Excel files (.xlsx, .xls). The implementation follows the pattern of existing readers (PDFReader, WordReader) and includes support for extracting text, tables, and images from Excel sheets.
Changes:
- Added new
ExcelReaderclass with support for table extraction in Markdown or JSON format, image extraction, and sheet-level processing control - Updated
WordReaderhash algorithm from MD5 to SHA256 for document ID generation - Added comprehensive test cases for the Excel reader functionality
- Added required dependencies (pandas, openpyxl) to the full extras list
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/agentscope/rag/_reader/_excel_reader.py | New ExcelReader implementation with 630 lines including table/image extraction |
| tests/rag_reader_test.py | Added test case for Excel reader with images and tables |
| tests/test.xlsx | Binary Excel test file with sample data |
| src/agentscope/rag/_reader/_word_reader.py | Changed hash algorithm from MD5 to SHA256 |
| src/agentscope/rag/_reader/init.py | Exported ExcelReader class |
| src/agentscope/rag/init.py | Exported ExcelReader at package level |
| pyproject.toml | Added pandas and openpyxl to full dependencies |
DavdGao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar, why we remove the example.docx here?
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new ExcelReader for the RAG module, which is a great addition for handling Excel files. The implementation correctly uses pandas and openpyxl to extract table data and images. The new reader is properly integrated and new dependencies are added to pyproject.toml.
I've found a couple of issues that need attention:
- The
include_cell_coordinatesparameter inExcelReaderis currently unused. - There's a bug in how Markdown tables are generated: pipe characters are not escaped correctly, which will lead to rendering issues.
- The new tests include some debugging code that should be removed and also have assertions that rely on the incorrect table generation, which will need to be updated.
Additionally, the update to use sha256 for document ID generation in WordReader is a good improvement for consistency.
Overall, this is a solid contribution with a few minor fixes required.
Since it's not used anywhere and is just a legacy dependency. |
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new ExcelReader to the RAG module, enabling the processing of Excel files. The implementation is robust, covering text, tables, and images, and includes good error handling and comprehensive unit tests. The use of lazy loading for dependencies and the structured approach to extracting and ordering content from sheets are well done.
I have a couple of suggestions for improvement:
- For performance and safety, it's better to open the Excel workbook in read-only mode.
- There is an inconsistency in the hashing algorithm used for generating document IDs across different readers, which could be standardized for better maintainability.
Overall, this is a great addition to the project's capabilities.
DavdGao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
AgentScope Version
[The version of AgentScope you are working on, e.g.
import agentscope; print(agentscope.__version__)]Description
As the title says.
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-filescommand