Issues with ToC parsing and importing pages. #249

a-velasco · 2024-06-17T14:05:52Z

Issues

The action fails to parse this index.md table of contents.
~~Running the action on a repo with no /docs folder creates a complete table of contents in index.md, but only imports some of those pages into the repo.~~ Solved: see this comment.

Steps taken

The repository has an existing /docs folder with all the pages published to Charmhub. This migration was done by an old fork of the discourse-gatekeeper action.

Issue 1: Table of contents nesting

We updated our workflow from the fork to canonical/discourse-gatekeeper@stable. Then, we re-triggered the workflow with no changes to discourse or GitHub.

The action failed due to an index/ToC related error (link to logs):

src.exceptions.InputError: A nested item is a reference to a path that is not within the directory of its parent. item=_ParsedListItem(whitespace_count=0, reference_title='Reference', reference_value='reference', rank=19, hidden=False), expected parent path: PosixPath('how-to')

It fails on this item of the contents table.

The same error happens with canonical/discourse-gatekeeper@main.

Issue 2 (solved): Action does not import all documentation pages

See this comment

We deleted the entire /docs directory to allow the action to re-generate all docs from scratch with the expected structure. However, the action only downloaded a select few pages.
This older revision shows the expected pages.
This newer revision shows the result of deleting /docs and running the action.

This might be an entirely different issue altogether. It happens with both main and stable branches of discourse-gatekeeper.

After this, subsequent runs of the action (example run) yield the following error:

gatekeeper.exceptions.InputError: An item is not a file, directory or external HTTP resource. item=_ParsedListItem(whitespace_count=2, reference_title='1. Introduction', reference_value='tutorial/t-overview.md', rank=1, hidden=False)

The item referenced in the error above is the first page that exists in the table of contents but was not correctly imported to GitHub, so I guess that is why it could not be parsed.

The text was updated successfully, but these errors were encountered:

a-velasco · 2024-06-18T10:45:59Z

Issue 2 resolved: We had not deleted the /base-content tag when resetting the docs.
We made sure to delete it completely before running the action, and it worked as expected (imported all documentation pages)

a-velasco · 2024-06-19T19:32:16Z

RE: Issue 1

I think this might be due to the logic in the _calculate_contents_hierarchy function.

When checking if the current item has a lower whitespace count than expected, it seems to assume the item is just one level away, so it always corrects the hierarchy by subtracting 1.

In a case where it is two or more levels away, its hierarchy value will be incorrect, so the _check_contents_item function will complain that this item isn't part of the group that its hierarchy indicates (i.e. raises that InputError I quoted in the OP)

I've had some success by modifying _calculate_contents_hierarchy like so:

while item:
        # All items in the current directory have been processed
        if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
+           if item.whitespace_count == 0:
+               hierarchy = 0
+               aggregate_dir = Path('.')
+           else:
+               hierarchy = hierarchy - 1
+               parent = parents.pop()
+               aggregate_dir = Path(parent.reference_value).parent
-           hierarchy = hierarchy - 1
-           parent = parents.pop()
-           aggregate_dir = Path(parent.reference_value).parent

This passes the unit tests in tests/unit/test_index_contents_hierarchy.py, and parses the contents as expected when I test it on the repository's CI.

I also added a new test case to simulate this scenario:

        pytest.param(
            (
                item_1 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_1 := "dir1")
                ),
                item_2 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_2 := f"{value_1}/dir2")
                ),
                item_3 := factories.IndexParsedListItemFactory(
                    whitespace_count=2, reference_value=(value_3 := f"{value_2}/file3.md")
                ),
                item_4 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_4 := "dir4")
                ),
                item_5 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_5 :=f"{value_4}/file5.md")
                ),    
            ),
            ("dir", "dir", "file", "dir", "file"),
            (
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_1.reference_title,
                    reference_value=value_1,
                    rank=item_1.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_2.reference_title,
                    reference_value=value_2,
                    rank=item_2.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=3,
                    reference_title=item_3.reference_title,
                    reference_value=value_3,
                    rank=item_3.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_4.reference_title,
                    reference_value=value_4,
                    rank=item_4.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_5.reference_title,
                    reference_value=value_5,
                    rank=item_5.rank,
                ),
            ),
            id="directory following a file nested twice",
        )

@jdkandersson I would love to hear your thoughts about this and get some more insights into how the hierarchy calculation is designed! I haven't raised a PR because I'm not sure if this is the best approach, and I feel like there are probably more considerations to take into account when modifying this logic.

jdkandersson · 2024-06-26T06:42:20Z

I'm taking a look now

jdkandersson · 2024-06-26T06:52:37Z

I think the reduction of level by 1 shouldn't be the problem since, if the whitespace count is still lower than expected, on the next loop it should be popped again. I haven't had the time yet, I would write a test potentially here:

discourse-gatekeeper/tests/unit/test_index_contents_parse.py

Line 522 in 50d13cb

def test__get_contents_parsed_items(

with the case and see if it fails and check why

jdkandersson · 2024-06-26T06:54:38Z

actually, you already wrote the test, I'll use that as a starting point and see if I can fix the issue

jdkandersson · 2024-06-26T06:58:54Z

Ok, also got the test failure, now need to check why

jdkandersson · 2024-06-26T07:00:35Z

Part of the reason why I'm not sure we should use the proposed code is that it only takes care of the case where the whitespace count is 0, it should also be solved for cases where there is a 2 or more drop in hierarchy when the parent isn't at the root level

a-velasco · 2024-06-26T08:25:14Z

@jdkandersson Thanks a lot for taking a look!

I wrote that code to check whether this was the part of the logic that was failing for multiple level jumps, but if you agree that this is where the hierarchy mis-calculation is happening, then I definitely agree that it should be generalized for other cases.

I'll try to come up with a more clean and general solution, and submit a PR.

jdkandersson · 2024-06-27T04:20:26Z

The solution is quite simple:

if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
    hierarchy = hierarchy - 1
    parent = parents.pop()
    aggregate_dir = Path(parent.reference_value).parent
    continue

Just needed to add the continue so that the if statement is evaluated again rather than processing continues after only going up one hierarchy

jdkandersson · 2024-06-27T04:21:11Z

Would you like to raise a PR with this and the new test case you have written?

jdkandersson · 2024-06-27T04:22:55Z

Thanks for finding, reporting and even creating a test case for this!

a-velasco · 2024-06-27T07:59:48Z

Ahh no way, I reached almost the same solution but thought the aggregate directory calculation had to be different. Thanks! I'll be happy to raise the PR with the solution + test case :)

a-velasco mentioned this issue Jun 17, 2024

Readd sync_docs canonical/opensearch-operator#330

Merged

a-velasco mentioned this issue Jun 27, 2024

Extend hierarchy calculation for deeply nested pages #250

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with ToC parsing and importing pages. #249

Issues with ToC parsing and importing pages. #249

a-velasco commented Jun 17, 2024 •

edited

Loading

a-velasco commented Jun 18, 2024 •

edited

Loading

a-velasco commented Jun 19, 2024 •

edited

Loading

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

a-velasco commented Jun 26, 2024

jdkandersson commented Jun 27, 2024 •

edited

Loading

jdkandersson commented Jun 27, 2024

jdkandersson commented Jun 27, 2024

a-velasco commented Jun 27, 2024 •

edited

Loading

Issues with ToC parsing and importing pages. #249

Issues with ToC parsing and importing pages. #249

Comments

a-velasco commented Jun 17, 2024 • edited Loading

Issues

Steps taken

Issue 1: Table of contents nesting

Issue 2 (solved): Action does not import all documentation pages

a-velasco commented Jun 18, 2024 • edited Loading

a-velasco commented Jun 19, 2024 • edited Loading

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

jdkandersson commented Jun 26, 2024

a-velasco commented Jun 26, 2024

jdkandersson commented Jun 27, 2024 • edited Loading

jdkandersson commented Jun 27, 2024

jdkandersson commented Jun 27, 2024

a-velasco commented Jun 27, 2024 • edited Loading

a-velasco commented Jun 17, 2024 •

edited

Loading

a-velasco commented Jun 18, 2024 •

edited

Loading

a-velasco commented Jun 19, 2024 •

edited

Loading

jdkandersson commented Jun 27, 2024 •

edited

Loading

a-velasco commented Jun 27, 2024 •

edited

Loading