-
Notifications
You must be signed in to change notification settings - Fork 1.3k
feat(rag): add powerpoint reader for slide files #873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
# Conflicts: # pyproject.toml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds PowerPoint file reading capability to AgentScope's RAG module, enabling extraction of text, tables, and images from .pptx files.
Changes:
- Adds new
PowerPointReaderclass that supports chunking text content and extracting images and tables from PowerPoint presentations - Updates dependency management in
pyproject.tomlto includepython-pptxlibrary and fix dependency group naming (hyphens to underscores) - Changes
WordReader.get_doc_id()to use SHA-256 instead of MD5 for document ID generation - Adds comprehensive test coverage for PowerPoint reading functionality
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/agentscope/rag/_reader/_ppt_reader.py | New PowerPoint reader implementation with support for text extraction, table parsing (markdown/JSON), and image extraction |
| tests/rag_reader_test.py | Adds comprehensive test case for PowerPoint reader with images and tables |
| src/agentscope/rag/_reader/init.py | Exports new PowerPointReader class |
| src/agentscope/rag/init.py | Exports PowerPointReader at package level |
| pyproject.toml | Adds ppt_reader dependency group and fixes dependency naming conventions |
| src/agentscope/rag/_reader/_word_reader.py | Changes document ID hashing from MD5 to SHA-256 |
| tests/test.pptx | Test fixture PowerPoint file (binary) |
Comments suppressed due to low confidence (1)
src/agentscope/rag/_reader/_ppt_reader.py:639
- The
get_doc_idmethod uses SHA-256 hashing, but the WordReader class now also uses SHA-256 (changed from MD5 in line 508 of _word_reader.py). However, the docstring in PowerPointReader still mentions MD5. The docstring should be updated to reflect that SHA-256 is being used instead of MD5.
Args:
ppt_path (`str`):
The path to the PowerPoint file.
Returns:
`str`:
The document ID (SHA256 hash of the file path).
"""
return hashlib.sha256(ppt_path.encode("utf-8")).hexdigest()
tests/rag_reader_test.py
Outdated
| import json | ||
|
|
||
| print( | ||
| json.dumps( | ||
| [_.metadata.content.get("text") for _ in docs], | ||
| indent=4, | ||
| ensure_ascii=False, | ||
| ), | ||
| ) | ||
|
|
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statement detected. The test contains a print() statement with JSON dumps that appears to be leftover from debugging. This should be removed before merging as it will clutter test output.
| import json | |
| print( | |
| json.dumps( | |
| [_.metadata.content.get("text") for _ in docs], | |
| indent=4, | |
| ensure_ascii=False, | |
| ), | |
| ) |
tests/rag_reader_test.py
Outdated
| import json | ||
|
|
||
| print( | ||
| json.dumps( | ||
| [_.metadata.content.get("text") for _ in docs], | ||
| indent=4, | ||
| ensure_ascii=False, | ||
| ), | ||
| ) | ||
|
|
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import statement. The json import on line 181 is only used for the debug print statement that should be removed. Once the print statement is removed, this import should also be removed.
| import json | |
| print( | |
| json.dumps( | |
| [_.metadata.content.get("text") for _ in docs], | |
| indent=4, | |
| ensure_ascii=False, | |
| ), | |
| ) |
| try: | ||
| # Check if text frame contains inline images | ||
| # This is a simplified check - pptx doesn't directly support | ||
| # inline images in text frames like Word does | ||
| pass | ||
| except Exception: | ||
| pass | ||
|
|
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| try: | |
| # Check if text frame contains inline images | |
| # This is a simplified check - pptx doesn't directly support | |
| # inline images in text frames like Word does | |
| pass | |
| except Exception: | |
| pass | |
| # Inline images in text frames are currently not supported by | |
| # python-pptx. This conditional block is kept as a placeholder | |
| # for future enhancements when inline image extraction becomes | |
| # feasible. | |
| pass |
| try: | ||
| # Check if text frame contains inline images | ||
| # This is a simplified check - pptx doesn't directly support | ||
| # inline images in text frames like Word does | ||
| pass | ||
| except Exception: | ||
| pass |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This statement is unreachable.
| try: | |
| # Check if text frame contains inline images | |
| # This is a simplified check - pptx doesn't directly support | |
| # inline images in text frames like Word does | |
| pass | |
| except Exception: | |
| pass | |
| # Check if text frame contains inline images | |
| # This is a simplified check - pptx doesn't directly support | |
| # inline images in text frames like Word does | |
| # Currently, no additional handling is implemented here. | |
| ... |
DavdGao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks good to me. Plz note:
- why we delete the example.docx from the example directory?
- the
PowerpointReaderworks well, but we should add more tests for it.
| ["image/png", "image/png", "image/png"], | ||
| ) | ||
|
|
||
| async def test_ppt_reader_with_images_and_tables(self) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add more tests about the different argument combinations, e.g. when extracing table in JSON format
tests/rag_reader_test.py
Outdated
|
|
||
| import json | ||
|
|
||
| print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the printing statements here
|
|
||
| # Also check for images in text frames (e.g., inline images) | ||
| if hasattr(shape, "has_text_frame") and shape.has_text_frame: | ||
| try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
| return await self._process_slides_separately(prs, doc_id) | ||
| else: | ||
| return await self._process_slides_merged(prs, doc_id) | ||
| finally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary statements here
| """ | ||
| blocks: list[TextBlock | ImageBlock] = [] | ||
| last_type = None | ||
| slide_header = f"Slide {slide_idx + 1}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is appropriate header for the ppt reader, maybe we can refer to the other rag libraries like llamaindex
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a PowerPointReader for extracting content from .pptx files, a valuable addition to the RAG capabilities. The implementation is well-structured, handling text, tables, and images, with good configuration options and error handling. The PR also includes a consistency improvement by switching from MD5 to SHA256 for document ID generation and standardizing dependency names in pyproject.toml.
My review focuses on a few areas for improvement: I've identified a bug in the JSON table serialization, pointed out some redundant code that can be cleaned up, and suggested removing leftover debugging statements from the new test case. Overall, this is a solid contribution.
| def _table_to_json(table_data: list[list[str]]) -> str: | ||
| """Convert table data to JSON string. | ||
| Args: | ||
| table_data (`list[list[str]]`): | ||
| Table data represented as a 2D list. | ||
| Returns: | ||
| `str`: | ||
| Table in JSON string format. | ||
| """ | ||
| json_strs = [ | ||
| "<system-info>A table loaded as a JSON array:</system-info>", | ||
| ] | ||
|
|
||
| for row in table_data: | ||
| json_strs.append( | ||
| json.dumps(row, ensure_ascii=False), | ||
| ) | ||
|
|
||
| return "\n".join(json_strs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of _table_to_json does not align with its docstring. The docstring specifies that it returns 'a JSON string of a list[list[str]] object', but the current implementation generates a string with newline-separated JSON arrays for each row. This is not a valid single JSON object and could cause parsing issues for consumers. To match the documentation and provide a more standard output, the entire table_data should be serialized into a single JSON array.
@staticmethod
def _table_to_json(table_data: list[list[str]]) -> str:
"""Convert table data to JSON string.
Args:
table_data (`list[list[str]]`):
Table data represented as a 2D list.
Returns:
`str`:
Table in JSON string format.
"""
return "\n".join(
[
"<system-info>A table loaded as a JSON array:</system-info>",
json.dumps(table_data, ensure_ascii=False),
],
)| try: | ||
| # Check if text frame contains inline images | ||
| # This is a simplified check - pptx doesn't directly support | ||
| # inline images in text frames like Word does | ||
| pass | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try...except Exception: pass block is redundant because it only wraps a pass statement and comments. It appears to be a placeholder but currently adds unnecessary code. It can be simplified by removing the try...except wrapper.
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass| The generated document ID. | ||
| """ | ||
| return hashlib.md5(word_path.encode("utf-8")).hexdigest() | ||
| return hashlib.sha256(word_path.encode("utf-8")).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see the hashing algorithm for document ID generation being updated from MD5 to SHA256. SHA256 is more secure and has a much lower probability of collisions, making it a better choice for ensuring document uniqueness. This change also aligns with the implementation in the new PowerPointReader.
tests/rag_reader_test.py
Outdated
| import json | ||
|
|
||
| print( | ||
| json.dumps( | ||
| [_.metadata.content.get("text") for _ in docs], | ||
| indent=4, | ||
| ensure_ascii=False, | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…gentscope-ai#869) --------- Signed-off-by: Li Yi <[email protected]> Co-authored-by: DavdGao <[email protected]>
…gentscope-ai#1115) --------- Co-authored-by: DavdGao <[email protected]>
AgentScope Version
1.0.6
Description
As the title says.
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-filescommand