Review HTML element list and ensure complete XML conversion coverage #802
Pull Request Overview
This PR ensures that all MDN HTML elements are correctly accounted for and mapped to XML, thereby resolving issue #720.
- Introduces an explicit conversion mapping (HTML_EL_TO_XML_EL) for HTML-to-XML element conversions.
- Adds a loop that fills any missing mapping with an identity rule based on MDN_ELEMENTS.
- Provides new tests to validate both explicit and default identity mappings.
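The two mapping steps above can be sketched roughly as follows. The names HTML_EL_TO_XML_EL and MDN_ELEMENTS come from the PR, but the entries shown here are a small illustrative subset chosen for this example, not the actual contents of the changed files.

```python
# Hypothetical subset of the frozen MDN snapshot (the real one lists 95+ names).
MDN_ELEMENTS = frozenset({"h1", "h2", "ul", "ol", "p", "section", "video"})

# Explicit conversions: assumed examples (headings to "head", lists to "list").
HTML_EL_TO_XML_EL = {
    "h1": "head",
    "h2": "head",
    "ul": "list",
    "ol": "list",
}

# Fill every element that has no explicit rule with an identity mapping,
# so that no known element is left without a conversion target.
for tag in MDN_ELEMENTS:
    HTML_EL_TO_XML_EL.setdefault(tag, tag)
```

Using `setdefault` keeps the explicit rules intact and only touches the names that are still missing.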
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| trafilatura/htmlprocessing.py | Added new conversion map with explicit mappings and identity mappings. |
| trafilatura/html_elements_reference.py | Added a frozen snapshot of MDN HTML element names. |
| tests/test_html_elements.py | Added tests to verify complete mapping coverage against MDN elements. |
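A coverage check of the kind described for tests/test_html_elements.py might look like the sketch below. The helper `missing_mappings` and the sample data are hypothetical and only illustrate the idea; the PR's actual tests may be structured differently.

```python
def missing_mappings(mapping, elements):
    """Return the sorted element names that have no entry in the mapping."""
    return sorted(set(elements) - set(mapping))


# Illustrative data, not the real snapshot or conversion map.
sample_elements = frozenset({"h1", "ul", "p", "video"})
incomplete_map = {"h1": "head", "ul": "list"}
complete_map = {**{tag: tag for tag in sample_elements}, **incomplete_map}


def test_complete_coverage():
    # Every element in the snapshot must have a conversion target.
    assert missing_mappings(complete_map, sample_elements) == []
    # Explicit rules must survive the identity fill.
    assert complete_map["h1"] == "head"
```

Reporting the missing names rather than a bare boolean makes a failing test easier to act on.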
Comments suppressed due to low confidence (1)
trafilatura/htmlprocessing.py:111
- [nitpick] Using the variable name '_tag' might suggest that the variable is unused. Consider renaming it to 'tag' for improved clarity.
for _tag in MDN_ELEMENTS:
Codecov Report
All modified and coverable lines are covered by tests ✅
Coverage diff, master compared with #802:
| Metric | master | #802 | +/- |
|---|---|---|---|
| Coverage | 99.29% | 99.29% | |
| Files | 21 | 22 | +1 |
| Lines | 3664 | 3680 | +16 |
| Hits | 3638 | 3654 | +16 |
| Misses | 26 | 26 | |
Hi @eyupcanakman, the idea looks good, but as it stands your code isn't actually used during the extraction, so it's hard to tell what the benefit would be here.
@eyupcanakman Your PR doesn't change anything in the way documents are processed. I will close it if you don't integrate it into the actual code.
Branch updated from 2a28678 to a4c6730.
Hi @adbar, thanks for your feedback and the reminder. Sorry for the late reply. I have pushed a new update. The new HTML mapping is now fully connected to the convert_tags function, so the logic is being used as you suggested. Changes:
I believe this update resolves the issue you pointed out. It is ready for your review. I look forward to your feedback!
@eyupcanakman It works but it doesn't make much sense to keep both conversions active, or am I getting it wrong?
The code is slower with both (obviously).
Branch updated from a4c6730 to a12f09f.
@adbar You're right, the code was processing the DOM twice, which doesn't make sense. I just pushed a fix for that.
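To illustrate the point about processing the DOM only once, here is a minimal sketch of a single-pass tag conversion. It uses the standard library's xml.etree.ElementTree for self-containment, and both `rename_tags` and `TAG_MAP` are hypothetical names for this example; it is not the convert_tags function from trafilatura.

```python
import xml.etree.ElementTree as ET

# Assumed example rules, not the PR's actual mapping.
TAG_MAP = {"h1": "head", "h2": "head", "ul": "list", "ol": "list", "li": "item"}


def rename_tags(tree, tag_map):
    """Rename elements in place during one traversal; return how many changed."""
    changed = 0
    for elem in tree.iter():
        new_tag = tag_map.get(elem.tag)
        if new_tag is not None and new_tag != elem.tag:
            elem.tag = new_tag
            changed += 1
    return changed


doc = ET.fromstring(
    "<body><h1>Title</h1><ul><li>a</li><li>b</li></ul><p>text</p></body>"
)
n_changed = rename_tags(doc, TAG_MAP)
result = ET.tostring(doc, encoding="unicode")
```

A single dictionary lookup per element avoids running one tree query per tag name, and identity rules cost nothing because unchanged tags are skipped.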
…dled HTML elements to XML counterparts (adbar#720)
Branch updated from a12f09f to 2824d95.
@eyupcanakman The last change looks good, but I still need to think about the PR. There is a small negative impact on the benchmark.
Closes #720: Review HTML element list and conversion.
Ensured all MDN HTML elements are accounted for and correctly mapped to XML.
- html_elements_reference.py snapshot including all 95+ MDN elements (modern, legacy, deprecated).
- … head, lists → list).
- … tag → tag) ensuring no elements are lost.