With the current implementation, annotations cover the area between an element's start and end tag.
For a ul inside a table cell, this leads to overlaps with the ul start tags in the following cells (otherwise a single annotation would need to yield multiple areas, i.e., one for each line, rather than a single one).
In my opinion, this is a use case where it makes more sense to capture the content of the ul tag with an XPath expression (e.g., via lxml) and then use inscriptis to convert the extracted content to text.
Hi,
I encountered this bug while trying to scrape a specific site:
`
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig

page = """
rules = {'ul': ['ul'], 'table': ['table']}
output = get_annotated_text(page, ParserConfig(annotation_rules=rules))
# {'text': ' * item1 * item5\n * item2 * item6\n * item3 * item7\n * item4 * item8\n', 'label': [(0, 85, 'table'), (0, 40, 'ul'), (11, 51, 'ul')]}
(start_index, end_index, annotation) = output['label'][1]
output['text'][start_index:end_index]  # ' * item1 * item5\n * item2 * item'
`
As can be seen, accessing the text of the relevant label doesn't work: the offsets aren't accurate when the annotated element sits inside a table.
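To illustrate why a single (start, end) pair cannot isolate a list that is rendered as a table column, here is a small sketch in plain Python. The layout is an assumption (a fixed column width of 10 characters, chosen only for illustration) and does not reproduce the exact padding inscriptis emits; `column_spans` is a hypothetical helper, not part of inscriptis.

```python
# Hypothetical rendering of a two-column table; COL_WIDTH is an assumption.
rows = [
    ("* item1", "* item5"),
    ("* item2", "* item6"),
    ("* item3", "* item7"),
    ("* item4", "* item8"),
]
COL_WIDTH = 10
text = "".join(left.ljust(COL_WIDTH) + right + "\n" for left, right in rows)


def column_spans(text, col_start, col_end):
    """Return one (start, end) offset pair per line for columns [col_start, col_end)."""
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\n")
        start = min(col_start, len(body))
        end = min(col_end, len(body))
        spans.append((offset + start, offset + end))
        offset += len(line)
    return spans


# One span per line recovers only the first list.
first_list = column_spans(text, 0, COL_WIDTH)
per_line = [text[s:e].strip() for s, e in first_list]

# A single span from the first start to the last end also swallows
# the neighbouring cell on every line but the last.
single = text[first_list[0][0]:first_list[-1][1]]

print(per_line)
print(repr(single))
```

The per-line spans recover only the items of the first list, whereas the single contiguous slice also contains items from the second cell, which is the overlap seen in the output above.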