feat(rag): add powerpoint reader for slide files #873

qbc2016 · 2025-10-28T09:30:16Z

AgentScope Version

1.0.6

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

Code has been formatted with pre-commit run --all-files command
All tests are passing
Docstrings are in Google style
Related documentation has been updated (e.g. links, examples, etc.)
Code is ready for review

cla-assistant · 2025-12-02T09:51:27Z

All committers have signed the CLA.

# Conflicts: # pyproject.toml

Copilot

Pull request overview

This pull request adds PowerPoint file reading capability to AgentScope's RAG module, enabling extraction of text, tables, and images from .pptx files.

Changes:

Adds new PowerPointReader class that supports chunking text content and extracting images and tables from PowerPoint presentations
Updates dependency management in pyproject.toml to include python-pptx library and fix dependency group naming (hyphens to underscores)
Changes WordReader.get_doc_id() to use SHA-256 instead of MD5 for document ID generation
Adds comprehensive test coverage for PowerPoint reading functionality

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
src/agentscope/rag/_reader/_ppt_reader.py	New PowerPoint reader implementation with support for text extraction, table parsing (markdown/JSON), and image extraction
tests/rag_reader_test.py	Adds comprehensive test case for PowerPoint reader with images and tables
src/agentscope/rag/_reader/init.py	Exports new PowerPointReader class
src/agentscope/rag/init.py	Exports PowerPointReader at package level
pyproject.toml	Adds ppt_reader dependency group and fixes dependency naming conventions
src/agentscope/rag/_reader/_word_reader.py	Changes document ID hashing from MD5 to SHA-256
tests/test.pptx	Test fixture PowerPoint file (binary)

Comments suppressed due to low confidence (1)

src/agentscope/rag/_reader/_ppt_reader.py:639

The get_doc_id method uses SHA-256 hashing, but the WordReader class now also uses SHA-256 (changed from MD5 in line 508 of _word_reader.py). However, the docstring in PowerPointReader still mentions MD5. The docstring should be updated to reflect that SHA-256 is being used instead of MD5.

        Args:
            ppt_path (`str`):
                The path to the PowerPoint file.

        Returns:
            `str`:
                The document ID (SHA256 hash of the file path).
        """
        return hashlib.sha256(ppt_path.encode("utf-8")).hexdigest()

src/agentscope/rag/_reader/_ppt_reader.py

Copilot · 2026-01-16T05:49:01Z

tests/rag_reader_test.py

+        import json
+
+        print(
+            json.dumps(
+                [_.metadata.content.get("text") for _ in docs],
+                indent=4,
+                ensure_ascii=False,
+            ),
+        )
+


Debug print statement detected. The test contains a print() statement with JSON dumps that appears to be leftover from debugging. This should be removed before merging as it will clutter test output.

Suggested change

import json

print(

json.dumps(

[_.metadata.content.get("text") for _ in docs],

indent=4,

ensure_ascii=False,

),

)

Copilot · 2026-01-16T05:49:02Z

tests/rag_reader_test.py

+        import json
+
+        print(
+            json.dumps(
+                [_.metadata.content.get("text") for _ in docs],
+                indent=4,
+                ensure_ascii=False,
+            ),
+        )
+


Unused import statement. The json import on line 181 is only used for the debug print statement that should be removed. Once the print statement is removed, this import should also be removed.

Suggested change

import json

print(

json.dumps(

[_.metadata.content.get("text") for _ in docs],

indent=4,

ensure_ascii=False,

),

)

Copilot · 2026-01-16T05:49:02Z

src/agentscope/rag/_reader/_ppt_reader.py

+        try:
+            # Check if text frame contains inline images
+            # This is a simplified check - pptx doesn't directly support
+            # inline images in text frames like Word does
+            pass
+        except Exception:
+            pass
+


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

try:

# Check if text frame contains inline images

# This is a simplified check - pptx doesn't directly support

# inline images in text frames like Word does

pass

except Exception:

pass

# Inline images in text frames are currently not supported by

# python-pptx. This conditional block is kept as a placeholder

# for future enhancements when inline image extraction becomes

# feasible.

pass

Copilot · 2026-01-16T05:49:02Z

src/agentscope/rag/_reader/_ppt_reader.py

+        try:
+            # Check if text frame contains inline images
+            # This is a simplified check - pptx doesn't directly support
+            # inline images in text frames like Word does
+            pass
+        except Exception:
+            pass


This statement is unreachable.

Suggested change

try:

# Check if text frame contains inline images

# This is a simplified check - pptx doesn't directly support

# inline images in text frames like Word does

pass

except Exception:

pass

# Check if text frame contains inline images

# This is a simplified check - pptx doesn't directly support

# inline images in text frames like Word does

# Currently, no additional handling is implemented here.

...

DavdGao

Overall, it looks good to me. Plz note:

why we delete the example.docx from the example directory?
the PowerpointReader works well, but we should add more tests for it.

DavdGao · 2026-01-16T05:47:19Z

tests/rag_reader_test.py

            ["image/png", "image/png", "image/png"],
        )
+
+    async def test_ppt_reader_with_images_and_tables(self) -> None:


We should add more tests about the different argument combinations, e.g. when extracing table in JSON format

DavdGao · 2026-01-16T05:50:49Z

tests/rag_reader_test.py

+
+        import json
+
+        print(


remove the printing statements here

DavdGao · 2026-01-16T05:52:10Z

src/agentscope/rag/_reader/_ppt_reader.py

+
+    # Also check for images in text frames (e.g., inline images)
+    if hasattr(shape, "has_text_frame") and shape.has_text_frame:
+        try:


What's this for?

DavdGao · 2026-01-16T05:52:59Z

src/agentscope/rag/_reader/_ppt_reader.py

+                return await self._process_slides_separately(prs, doc_id)
+            else:
+                return await self._process_slides_merged(prs, doc_id)
+        finally:


Remove unnecessary statements here

DavdGao · 2026-01-16T05:54:18Z

src/agentscope/rag/_reader/_ppt_reader.py

+        """
+        blocks: list[TextBlock | ImageBlock] = []
+        last_type = None
+        slide_header = f"Slide {slide_idx + 1}"


I'm not sure if this is appropriate header for the ppt reader, maybe we can refer to the other rag libraries like llamaindex

DavdGao · 2026-01-17T13:32:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a PowerPointReader for extracting content from .pptx files, a valuable addition to the RAG capabilities. The implementation is well-structured, handling text, tables, and images, with good configuration options and error handling. The PR also includes a consistency improvement by switching from MD5 to SHA256 for document ID generation and standardizing dependency names in pyproject.toml.

My review focuses on a few areas for improvement: I've identified a bug in the JSON table serialization, pointed out some redundant code that can be cleaned up, and suggested removing leftover debugging statements from the new test case. Overall, this is a solid contribution.

gemini-code-assist · 2026-01-17T13:33:22Z

src/agentscope/rag/_reader/_ppt_reader.py

+    def _table_to_json(table_data: list[list[str]]) -> str:
+        """Convert table data to JSON string.
+
+        Args:
+            table_data (`list[list[str]]`):
+                Table data represented as a 2D list.
+
+        Returns:
+            `str`:
+                Table in JSON string format.
+        """
+        json_strs = [
+            "<system-info>A table loaded as a JSON array:</system-info>",
+        ]
+
+        for row in table_data:
+            json_strs.append(
+                json.dumps(row, ensure_ascii=False),
+            )
+
+        return "\n".join(json_strs)


The implementation of _table_to_json does not align with its docstring. The docstring specifies that it returns 'a JSON string of a list[list[str]] object', but the current implementation generates a string with newline-separated JSON arrays for each row. This is not a valid single JSON object and could cause parsing issues for consumers. To match the documentation and provide a more standard output, the entire table_data should be serialized into a single JSON array.

@staticmethod def _table_to_json(table_data: list[list[str]]) -> str: """Convert table data to JSON string. Args: table_data (`list[list[str]]`): Table data represented as a 2D list. Returns: `str`: Table in JSON string format. """ return "\n".join( [ "<system-info>A table loaded as a JSON array:</system-info>", json.dumps(table_data, ensure_ascii=False), ], )

gemini-code-assist · 2026-01-17T13:33:22Z

src/agentscope/rag/_reader/_ppt_reader.py

+        try:
+            # Check if text frame contains inline images
+            # This is a simplified check - pptx doesn't directly support
+            # inline images in text frames like Word does
+            pass
+        except Exception:
+            pass


This try...except Exception: pass block is redundant because it only wraps a pass statement and comments. It appears to be a placeholder but currently adds unnecessary code. It can be simplified by removing the try...except wrapper.

# Check if text frame contains inline images # This is a simplified check - pptx doesn't directly support # inline images in text frames like Word does pass

gemini-code-assist · 2026-01-17T13:33:23Z

src/agentscope/rag/_reader/_word_reader.py

                The generated document ID.
        """
-        return hashlib.md5(word_path.encode("utf-8")).hexdigest()
+        return hashlib.sha256(word_path.encode("utf-8")).hexdigest()


Good to see the hashing algorithm for document ID generation being updated from MD5 to SHA256. SHA256 is more secure and has a much lower probability of collisions, making it a better choice for ensuring document uniqueness. This change also aligns with the implementation in the new PowerPointReader.

gemini-code-assist · 2026-01-17T13:33:23Z

tests/rag_reader_test.py

+        import json
+
+        print(
+            json.dumps(
+                [_.metadata.content.get("text") for _ in docs],
+                indent=4,
+                ensure_ascii=False,
+            ),
+        )


These print statements and the associated import json appear to be for debugging purposes. It's best practice to remove them from the final test suite to keep the test output clean and focused on success or failure results.

…gentscope-ai#869) --------- Signed-off-by: Li Yi <[email protected]> Co-authored-by: DavdGao <[email protected]>

…gentscope-ai#1115) --------- Co-authored-by: DavdGao <[email protected]>

…#860)

…ope-ai#1135)

qbc2016 added 2 commits October 28, 2025 17:27

Implememt ppt reader

cb0472a

update

b971cb1

DavdGao self-requested a review October 29, 2025 02:12

DavdGao added Ready for Review Roadmap The development plan module: rag labels Oct 29, 2025

qbc2016 added 2 commits November 3, 2025 12:14

merge main and fix conflicts

09a35a6

update

c081c62

DavdGao changed the title ~~Implement ppt reader~~ feat(rag): add powerpoint reader for slide files Nov 3, 2025

DavdGao mentioned this pull request Nov 25, 2025

[Feature]: Provide Microsoft Powerpoint reader #963

Closed

DavdGao linked an issue Nov 25, 2025 that may be closed by this pull request

[Feature]: Provide Microsoft Powerpoint reader #963

Closed

Merge remote-tracking branch 'agentscope/main' into bc/ppt_reader

045d726

# Conflicts: # pyproject.toml

DavdGao requested a review from Copilot January 16, 2026 05:46

Copilot started reviewing on behalf of DavdGao January 16, 2026 05:46 View session

Copilot AI reviewed Jan 16, 2026

View reviewed changes

DavdGao reviewed Jan 16, 2026

View reviewed changes

DavdGao mentioned this pull request Jan 17, 2026

Extend the current RAG implementation #811

Open

10 tasks

gemini-code-assist bot reviewed Jan 17, 2026

View reviewed changes

pan-x-c and others added 9 commits January 19, 2026 16:59

fix(docs): fix links and typos in the tuner example (agentscope-ai#1131)

8878be8

fix(tuner): update links of the trinity-rft (agentscope-ai#1132)

4afd9fc

feat(example): add example integration for Qwen-Deep-Research model (a…

f908449

…gentscope-ai#869) --------- Signed-off-by: Li Yi <[email protected]> Co-authored-by: DavdGao <[email protected]>

feat(tool): add func_name parameter to support custom tool renaming (a…

4623b9d

…gentscope-ai#1115) --------- Co-authored-by: DavdGao <[email protected]>

feat(rag): add MongoDB vector store in the RAG module. (agentscope-ai…

d957ceb

…#860)

ci(gemini): add code review guide for the gemini code assist (agentsc…

7f7bffe

…ope-ai#1135)

fix(docs): fix the typos in tutorial (agentscope-ai#1137)

10c03d7

merge main and fix conflicts

84e2ff9

update

df641b9

qbc2016 added 2 commits January 19, 2026 17:16

update

87e3f30

update

ce3d0ba

feat(rag): add powerpoint reader for slide files #873

Are you sure you want to change the base?

feat(rag): add powerpoint reader for slide files #873

Conversation

qbc2016 commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AgentScope Version

Description

Checklist

Uh oh!

cla-assistant bot commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao left a comment

Choose a reason for hiding this comment

Uh oh!

DavdGao Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

DavdGao commented Jan 17, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

qbc2016 commented Oct 28, 2025 •

edited

Loading

cla-assistant bot commented Dec 2, 2025 •

edited

Loading