Skip to content

Conversation

@eyupcanakman
Copy link

Closes #720: Review HTML element list and conversion.

Ensured all MDN HTML elements are accounted for and correctly mapped to XML.

  • Created html_elements_reference.py snapshot including all 95+ MDN elements (modern, legacy, deprecated).
  • Preserved explicit semantic mappings (headings → head, lists → list).
  • Added default identity mappings (tagtag) ensuring no elements are lost.
  • Explicit mappings defined for semantic elements.
  • Automatic identity mappings provided as defaults.
  • New tests verify complete MDN coverage.

Copilot AI review requested due to automatic review settings June 8, 2025 22:49
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR ensures that all MDN HTML elements are correctly accounted for and mapped to XML, thereby resolving issue #720.

  • Introduces an explicit conversion mapping (HTML_EL_TO_XML_EL) for HTML-to-XML element conversions.
  • Adds a loop that fills any missing mapping with an identity rule based on MDN_ELEMENTS.
  • Provides new tests to validate both explicit and default identity mappings.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
trafilatura/htmlprocessing.py Added new conversion map with explicit mappings and identity mappings.
trafilatura/html_elements_reference.py Added a frozen snapshot of MDN HTML element names.
tests/test_html_elements.py Added tests to verify complete mapping coverage against MDN elements.
Comments suppressed due to low confidence (1)

trafilatura/htmlprocessing.py:111

  • [nitpick] Using the variable name '_tag' might suggest that the variable is unused. Consider renaming it to 'tag' for improved clarity.
for _tag in MDN_ELEMENTS:

@codecov
Copy link

codecov bot commented Jun 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.29%. Comparing base (badd594) to head (a4c6730).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #802   +/-   ##
=======================================
  Coverage   99.29%   99.29%           
=======================================
  Files          21       22    +1     
  Lines        3664     3680   +16     
=======================================
+ Hits         3638     3654   +16     
  Misses         26       26           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@adbar
Copy link
Owner

adbar commented Jun 10, 2025

Hi @eyupcanakman, the idea looks good but as it stands your code isn't actually used during the extraction. So it's hard to tell what would be the benefit here.

@adbar
Copy link
Owner

adbar commented Jul 14, 2025

@eyupcanakman Your PR doesn't change anything in the way documents are processed, I will close it if you don't integrate it into the actual code.

@eyupcanakman eyupcanakman force-pushed the feat/html-element-audit-720 branch from 2a28678 to a4c6730 Compare July 15, 2025 06:10
@eyupcanakman
Copy link
Author

Hi @adbar

Thanks for your feedback and the reminder. Sorry for the late reply.

I have pushed a new update. The new HTML mapping is now fully connected to the convert_tags function, so the logic is being used as you suggested.

Changes:

  • All tags are included: All HTML tags from the MDN list are now processed. They are either converted to a new tag or kept the same (for example, "search" stays "search").
  • Works with old rules: The new code does not conflict with existing special rules for handling tables, lists, or headings.
  • New tests: I added new tests to ensure that the new logic works correctly and doesn't cause any issues.

I believe this update resolves the issue you pointed out. It is ready for your review. I look forward to your feedback!

@adbar
Copy link
Owner

adbar commented Jul 25, 2025

@eyupcanakman It works but it doesn't make much sense to keep both conversions active, or am I getting it wrong?

  • for elem in tree.iter(CONVERSIONS.keys()):
  • for elem in tree.iter(*_ALL_TAGS_TO_CONVERT):

The code is slower with both (obviously).

@eyupcanakman eyupcanakman force-pushed the feat/html-element-audit-720 branch from a4c6730 to a12f09f Compare July 25, 2025 19:29
@eyupcanakman
Copy link
Author

@adbar You're right the code was processing the DOM twice which doesn't make sense. I just pushed a fix for that.

@eyupcanakman eyupcanakman force-pushed the feat/html-element-audit-720 branch from a12f09f to 2824d95 Compare August 12, 2025 12:39
@adbar
Copy link
Owner

adbar commented Sep 12, 2025

@eyupcanakman The last change looks good but I still need to think about the PR. There is a small negative impact on the benchmark.
You get more coverage if you use all the elements but it may the cause of unexpected behavior further down the line as the elements are converted to the output format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Review HTML element list and conversion

2 participants