Label offset not accurate in case of table #92

Open
ShakedAharonn opened this issue Feb 4, 2025 · 3 comments
Comments

@ShakedAharonn

ShakedAharonn commented Feb 4, 2025

Hi,
I encountered this bug while trying to scrape a specific site:

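```python
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

# The HTML markup of the page is missing here. The structure below is an
# assumption based on the reported output: a one-row table whose two cells
# each contain a four-item list.
page = """
<table>
  <tr>
    <td><ul><li>item1</li><li>item2</li><li>item3</li><li>item4</li></ul></td>
    <td><ul><li>item5</li><li>item6</li><li>item7</li><li>item8</li></ul></td>
  </tr>
</table>
"""

# Annotate every <ul> as 'ul' and every <table> as 'table'.
rules = {'ul': ['ul'], 'table': ['table']}

output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# Reported output:
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n', 'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}

# Slice the text with the offsets of the first 'ul' label.
(start_index, end_index, annotation) = output['label'][1]
print(output['text'][start_index:end_index])
# Reported result: ' * item1 * item5\n * item2 * item'
```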

As can be seen, slicing the text with a label's offsets does not return the labelled content: the offsets are not accurate when the annotated element is inside a table.

@AlbertWeichselbraun
Contributor

Would annotating li rather than ul fix your problem?

@ShakedAharonn
Author

I can try, but that would defeat the purpose of capturing the full list as a single segment, wouldn't it?

@AlbertWeichselbraun
Contributor

With the current implementation, an annotation covers the area between an element's start and end tags.

For a ul inside a table cell, this leads to overlaps with the start tags of the ul elements in the following cells. Avoiding that would require one annotation to yield multiple areas (i.e., one for each line) rather than a single one.
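
To make the overlap concrete, here is a toy illustration in plain Python. The row-wise layout below is a simplified stand-in and an assumption for illustration only; it is not how inscriptis renders tables internally. It shows that a single (start, end) span covering the whole left cell also contains lines from the right cell, whereas a list of per-line spans does not.

```python
# Toy row-wise layout: every output line holds one line from each cell.
left = [" * item1", " * item2", " * item3", " * item4"]
right = [" * item5", " * item6", " * item7", " * item8"]

text = ""
left_spans = []  # one (start, end) pair per rendered line of the left cell
for left_line, right_line in zip(left, right):
    start = len(text)
    text += left_line
    left_spans.append((start, len(text)))
    text += right_line + "\n"

# The smallest single span that covers the whole left cell ...
single_span = (left_spans[0][0], left_spans[-1][1])
covered = text[single_span[0]:single_span[1]]
# ... also contains content that belongs to the right cell.
print(repr(covered))

# Only the per-line spans reproduce the left cell exactly.
exact = "\n".join(text[s:e] for s, e in left_spans)
print(exact)
```

In this sketch the single span necessarily includes item5 through item7, which is the same kind of overlap described above.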

In my opinion, this is a use case where it makes more sense to capture the content of the ul tag with an XPath expression (e.g., via lxml) and then use inscriptis to convert the extracted content to text.
