Label offset not accurate in case of table #92

ShakedAharonn · 2025-02-04T09:28:33Z

Hi,
I encountered this bug while trying to scrape a specific site:

```python
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

page = """

item1 item2 item3 item4	item5 item6 item7 item8

"""

# Annotate every <ul> as 'ul' and every <table> as 'table'.
rules = {'ul': ['ul'], 'table': ['table']}

output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n',
#  'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}

# Slice the text covered by the first 'ul' annotation.
start_index, end_index, annotation = output['label'][1]
output['text'][start_index:end_index]
# ' * item1 * item5\n * item2 * item'
```

As can be seen, slicing the text with a label's offsets does not return that list's content: inside a table, the span for the first ul also contains items from the second ul.

AlbertWeichselbraun · 2025-02-04T18:12:26Z

Would annotating li rather than ul fix your problem?

ShakedAharonn · 2025-02-05T09:40:36Z

I can try, but it will miss the point of me trying to capture the full list as a single segment, wouldn't it?

AlbertWeichselbraun · 2025-02-05T10:19:18Z

With the current implementation, annotations cover the area between an element's start and end tags.

For a ul in a table cell, this leads to overlaps with the start tags of the uls in the following cells; otherwise, one annotation would need to yield multiple areas (i.e., one for each line) rather than a single one.
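To illustrate why a single contiguous span cannot isolate one column, here is a minimal sketch in plain Python. It does not use inscriptis; the two-column layout, the variable names, and the two-space column gap are all assumptions made up for the illustration.

```python
# Hypothetical two-column layout built by hand (not produced by inscriptis).
left = ["* item1", "* item2"]
right = ["* item5", "* item6"]
lines = [f"{l}  {r}" for l, r in zip(left, right)]
text = "\n".join(lines) + "\n"

# One contiguous span for the left list: from the start of its first item
# to the end of its last item.
start = text.index(left[0])
end = text.index(left[-1]) + len(left[-1])
span = text[start:end]

# The span necessarily swallows the right column's first row.
overlaps = right[0] in span

# Isolating the left list would instead need one (start, end) pair per line.
per_line = []
offset = 0
for line, item in zip(lines, left):
    per_line.append((offset, offset + len(item)))
    offset += len(line) + 1  # +1 for the newline

left_only = [text[s:e] for s, e in per_line]
```

In this sketch `overlaps` is true, while `left_only` contains just the left column's items, which is the "multiple areas" alternative mentioned above.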

In my opinion, this is a use case where it makes more sense to capture the content of the ul tag with an XPath expression (e.g., via lxml) and then use inscriptis to convert the extracted content to text.
