Skip to content

Conversation

@qbc2016
Copy link
Member

@qbc2016 qbc2016 commented Oct 28, 2025

AgentScope Version

1.0.6

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@DavdGao DavdGao self-requested a review October 29, 2025 02:12
@DavdGao DavdGao changed the title Implement ppt reader feat(rag): add powerpoint reader for slide files Nov 3, 2025
@DavdGao DavdGao linked an issue Nov 25, 2025 that may be closed by this pull request
@cla-assistant
Copy link

cla-assistant bot commented Dec 2, 2025

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds PowerPoint file reading capability to AgentScope's RAG module, enabling extraction of text, tables, and images from .pptx files.

Changes:

  • Adds new PowerPointReader class that supports chunking text content and extracting images and tables from PowerPoint presentations
  • Updates dependency management in pyproject.toml to include python-pptx library and fix dependency group naming (hyphens to underscores)
  • Changes WordReader.get_doc_id() to use SHA-256 instead of MD5 for document ID generation
  • Adds comprehensive test coverage for PowerPoint reading functionality

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/agentscope/rag/_reader/_ppt_reader.py New PowerPoint reader implementation with support for text extraction, table parsing (markdown/JSON), and image extraction
tests/rag_reader_test.py Adds comprehensive test case for PowerPoint reader with images and tables
src/agentscope/rag/_reader/init.py Exports new PowerPointReader class
src/agentscope/rag/init.py Exports PowerPointReader at package level
pyproject.toml Adds ppt_reader dependency group and fixes dependency naming conventions
src/agentscope/rag/_reader/_word_reader.py Changes document ID hashing from MD5 to SHA-256
tests/test.pptx Test fixture PowerPoint file (binary)
Comments suppressed due to low confidence (1)

src/agentscope/rag/_reader/_ppt_reader.py:639

  • The get_doc_id method uses SHA-256 hashing, but the WordReader class now also uses SHA-256 (changed from MD5 in line 508 of _word_reader.py). However, the docstring in PowerPointReader still mentions MD5. The docstring should be updated to reflect that SHA-256 is being used instead of MD5.
        Args:
            ppt_path (`str`):
                The path to the PowerPoint file.

        Returns:
            `str`:
                The document ID (SHA256 hash of the file path).
        """
        return hashlib.sha256(ppt_path.encode("utf-8")).hexdigest()

Comment on lines 181 to 190
import json

print(
json.dumps(
[_.metadata.content.get("text") for _ in docs],
indent=4,
ensure_ascii=False,
),
)

Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statement detected. The test contains a print() statement with JSON dumps that appears to be leftover from debugging. This should be removed before merging as it will clutter test output.

Suggested change
import json
print(
json.dumps(
[_.metadata.content.get("text") for _ in docs],
indent=4,
ensure_ascii=False,
),
)

Copilot uses AI. Check for mistakes.
Comment on lines 181 to 190
import json

print(
json.dumps(
[_.metadata.content.get("text") for _ in docs],
indent=4,
ensure_ascii=False,
),
)

Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import statement. The json import on line 181 is only used for the debug print statement that should be removed. Once the print statement is removed, this import should also be removed.

Suggested change
import json
print(
json.dumps(
[_.metadata.content.get("text") for _ in docs],
indent=4,
ensure_ascii=False,
),
)

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +94
try:
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass
except Exception:
pass

Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
try:
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass
except Exception:
pass
# Inline images in text frames are currently not supported by
# python-pptx. This conditional block is kept as a placeholder
# for future enhancements when inline image extraction becomes
# feasible.
pass

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +93
try:
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass
except Exception:
pass
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is unreachable.

Suggested change
try:
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass
except Exception:
pass
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
# Currently, no additional handling is implemented here.
...

Copilot uses AI. Check for mistakes.
Copy link
Member

@DavdGao DavdGao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good to me. Plz note:

  1. why we delete the example.docx from the example directory?
  2. the PowerpointReader works well, but we should add more tests for it.

["image/png", "image/png", "image/png"],
)

async def test_ppt_reader_with_images_and_tables(self) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add more tests about the different argument combinations, e.g. when extracing table in JSON format


import json

print(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the printing statements here


# Also check for images in text frames (e.g., inline images)
if hasattr(shape, "has_text_frame") and shape.has_text_frame:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

return await self._process_slides_separately(prs, doc_id)
else:
return await self._process_slides_merged(prs, doc_id)
finally:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary statements here

"""
blocks: list[TextBlock | ImageBlock] = []
last_type = None
slide_header = f"Slide {slide_idx + 1}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is appropriate header for the ppt reader, maybe we can refer to the other rag libraries like llamaindex

@DavdGao
Copy link
Member

DavdGao commented Jan 17, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a PowerPointReader for extracting content from .pptx files, a valuable addition to the RAG capabilities. The implementation is well-structured, handling text, tables, and images, with good configuration options and error handling. The PR also includes a consistency improvement by switching from MD5 to SHA256 for document ID generation and standardizing dependency names in pyproject.toml.

My review focuses on a few areas for improvement: I've identified a bug in the JSON table serialization, pointed out some redundant code that can be cleaned up, and suggested removing leftover debugging statements from the new test case. Overall, this is a solid contribution.

Comment on lines +605 to +625
def _table_to_json(table_data: list[list[str]]) -> str:
"""Convert table data to JSON string.
Args:
table_data (`list[list[str]]`):
Table data represented as a 2D list.
Returns:
`str`:
Table in JSON string format.
"""
json_strs = [
"<system-info>A table loaded as a JSON array:</system-info>",
]

for row in table_data:
json_strs.append(
json.dumps(row, ensure_ascii=False),
)

return "\n".join(json_strs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The implementation of _table_to_json does not align with its docstring. The docstring specifies that it returns 'a JSON string of a list[list[str]] object', but the current implementation generates a string with newline-separated JSON arrays for each row. This is not a valid single JSON object and could cause parsing issues for consumers. To match the documentation and provide a more standard output, the entire table_data should be serialized into a single JSON array.

    @staticmethod
    def _table_to_json(table_data: list[list[str]]) -> str:
        """Convert table data to JSON string.

        Args:
            table_data (`list[list[str]]`):
                Table data represented as a 2D list.

        Returns:
            `str`:
                Table in JSON string format.
        """
        return "\n".join(
            [
                "<system-info>A table loaded as a JSON array:</system-info>",
                json.dumps(table_data, ensure_ascii=False),
            ],
        )

Comment on lines +87 to +93
try:
# Check if text frame contains inline images
# This is a simplified check - pptx doesn't directly support
# inline images in text frames like Word does
pass
except Exception:
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This try...except Exception: pass block is redundant because it only wraps a pass statement and comments. It appears to be a placeholder but currently adds unnecessary code. It can be simplified by removing the try...except wrapper.

        # Check if text frame contains inline images
        # This is a simplified check - pptx doesn't directly support
        # inline images in text frames like Word does
        pass

The generated document ID.
"""
return hashlib.md5(word_path.encode("utf-8")).hexdigest()
return hashlib.sha256(word_path.encode("utf-8")).hexdigest()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Good to see the hashing algorithm for document ID generation being updated from MD5 to SHA256. SHA256 is more secure and has a much lower probability of collisions, making it a better choice for ensuring document uniqueness. This change also aligns with the implementation in the new PowerPointReader.

Comment on lines 181 to 189
import json

print(
json.dumps(
[_.metadata.content.get("text") for _ in docs],
indent=4,
ensure_ascii=False,
),
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These print statements and the associated import json appear to be for debugging purposes. It's best practice to remove them from the final test suite to keep the test output clean and focused on success or failure results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Provide Microsoft Powerpoint reader

4 participants