Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix testcases for updated model #73

Merged
merged 11 commits into from
Dec 24, 2024
Merged

fix testcases for updated model #73

merged 11 commits into from
Dec 24, 2024

Conversation

boqiny
Copy link
Member

@boqiny boqiny commented Dec 16, 2024

User description

Description

Update the ground truth to match the output of updated anyparser model

Related Issue

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

Screenshots (if applicable)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes


PR Type

Bug fix, Tests


Description

  • Updated test cases in tests/test.py to use a new sample file (resume_1.pdf) and improved parameter formatting for readability.
  • Replaced and reformatted ground truth files (correct_docx_output.txt, correct_pdf_output.txt, correct_png_output.txt, correct_pptx_output.txt) to align with the updated model outputs.
  • Enhanced table formatting, corrected typos, and ensured consistency across all output files.
  • Removed outdated or redundant content in ground truth files to reflect the new model's behavior.

Changes walkthrough 📝

Relevant files
Tests
test.py
Update test cases to align with new model and file structure

tests/test.py

  • Updated test cases to use a new sample file (resume_1.pdf) instead of
    the previous file.
  • Adjusted the async_parse method to improve readability by formatting
    parameters across multiple lines.
  • Ensured consistency in file paths and content handling across multiple
    test cases.
  • +7/-6     
    correct_docx_output.txt
    Update ground truth for DOCX output with formatting fixes

    tests/outputs/correct_docx_output.txt

  • Modified table formatting for better alignment and readability.
  • Corrected a typo in the growth rates description.
  • Removed unnecessary trailing lines and ensured proper file formatting.

  • +6/-6     
    correct_pdf_output.txt
    Replace PDF ground truth with updated resume-based content

    tests/outputs/correct_pdf_output.txt

  • Replaced outdated content with a new resume-based example.
  • Simplified and reformatted the structure to reflect updated output.
  • Removed extensive index-related content to align with the new model.
  • +31/-125
    correct_png_output.txt
    Update PNG ground truth with consistent table formatting 

    tests/outputs/correct_png_output.txt

  • Adjusted table formatting for consistency with other outputs.
  • Removed redundant lines and ensured proper alignment.
  • +1/-3     
    correct_pptx_output.txt
    Update PPTX ground truth with improved formatting               

    tests/outputs/correct_pptx_output.txt

  • Improved table formatting for better readability.
  • Added spacing and alignment adjustments to match updated standards.
  • Ensured proper file termination with consistent formatting.
  • +4/-2     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link

    github-actions bot commented Dec 16, 2024

    PR Reviewer Guide 🔍

    (Review updated until commit 9aa3f48)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Code Smell
    The hardcoded file path ./examples/sample_data/resume_1.pdf is repeated multiple times in the test cases. Consider refactoring to use a constant or a setup method to avoid duplication and improve maintainability.

    Formatting Issue
    The table formatting in the updated ground truth file introduces inconsistent spacing and alignment. Ensure the formatting adheres to a consistent style for better readability and maintainability.

    Content Accuracy
    The updated ground truth content for the PDF output has been entirely replaced. Verify that the new content aligns with the expected output of the updated model and does not introduce discrepancies.

    Formatting Issue
    The table formatting in the updated ground truth file introduces inconsistent spacing and alignment. Ensure the formatting adheres to a consistent style for better readability and maintainability.

    Copy link

    github-actions bot commented Dec 16, 2024

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Ensure the file path is valid and the file exists before processing it

    Validate that the file at working_file exists before attempting to parse it to
    prevent runtime errors if the file is missing or incorrectly specified.

    tests/test.py [51]

     working_file = "./examples/sample_data/resume_1.pdf"
    +if not os.path.exists(working_file):
    +    raise FileNotFoundError(f"File not found: {working_file}")
    Suggestion importance[1-10]: 9

    Why: The suggestion adds a crucial validation step to ensure the file exists before processing, which prevents potential runtime errors and improves the robustness of the test cases.

    9
    Verify that the file content is not empty before proceeding with parsing

    Add a check to ensure that the file_content variable is not empty after reading the
    file to avoid passing invalid data to the async_parse method.

    tests/test.py [115]

     file_content = base64.b64encode(file.read()).decode("utf-8")
    +if not file_content:
    +    raise ValueError("File content is empty after reading the file.")
    Suggestion importance[1-10]: 8

    Why: This suggestion introduces a validation to ensure the file content is not empty, which is important to avoid passing invalid data to the async_parse method, thereby enhancing the reliability of the test.

    8

    @boqiny boqiny closed this Dec 16, 2024
    Copy link

    Persistent review updated to latest commit 9aa3f48

    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Ensure the file path is valid and the file exists before processing it

    Validate that the file at working_file exists before attempting to parse it to
    prevent runtime errors if the file is missing or incorrectly specified.

    tests/test.py [51]

     working_file = "./examples/sample_data/resume_1.pdf"
    +if not os.path.exists(working_file):
    +    raise FileNotFoundError(f"File not found: {working_file}")
    Suggestion importance[1-10]: 9

    Why: The suggestion adds a crucial validation step to ensure the file exists before processing, which prevents potential runtime errors and improves the robustness of the test cases.

    9
    Verify that the file content is not empty before proceeding with parsing

    Add a check to ensure that the file_content variable is not empty after reading the
    file to avoid passing invalid data to the async_parse method.

    tests/test.py [115]

     file_content = base64.b64encode(file.read()).decode("utf-8")
    +if not file_content:
    +    raise ValueError("File content is empty after reading the file.")
    Suggestion importance[1-10]: 8

    Why: This suggestion introduces a validation to ensure the file content is not empty, which is important to avoid passing invalid data to the async_parse method, thereby enhancing the reliability of the test.

    8

    Charles Yuan added 2 commits December 16, 2024 18:23
    @@ -1,137 +1,43 @@
    STOXX INDEX METHODOLOGY GUIDE CONTENTS
    John Doe
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    If I recall correctly, this is a demo file that Rachel was using. However, the output looks completely different from before and are you sure this is the correct output?

    Copy link
    Member Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    changed to input file due to model output instability for now

    tests/test.py Outdated
    @@ -54,7 +54,7 @@ def test_pdf_sync_parse(self):
    # extract
    markdown_list, elapsed_time = self.ap.parse(file_path=working_file)
    markdown = "\n".join(markdown_list)

    print(markdown)
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    nit: remove this.

    Comment on lines 7 to 11
    49
    7.11. STOXX SEMICONDUCTOR 30 INDEX
    76
    6.5.2. INDEX REVIEW
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    It looks like the original model follow the the proper layout to output the left column first and then the right. However, the new model does not proper follow the layout?

    Copy link
    Member Author

    @boqiny boqiny Dec 23, 2024

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    resolved with the sgl fix

    @boqiny boqiny requested a review from lingjiekong December 23, 2024 09:32
    @boqiny boqiny closed this Dec 24, 2024
    @boqiny boqiny reopened this Dec 24, 2024
    @boqiny boqiny merged commit 4299488 into main Dec 24, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants