Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with ToC parsing and importing pages. #249

Open
a-velasco opened this issue Jun 17, 2024 · 12 comments
Open

Issues with ToC parsing and importing pages. #249

a-velasco opened this issue Jun 17, 2024 · 12 comments

Comments

@a-velasco
Copy link

a-velasco commented Jun 17, 2024

Issues

  1. The action fails to parse this index.md table of contents.
  2. Running the action on a repo with no /docs folder creates a complete table of contents in index.md, but only imports some of those pages into the repo. Solved: see this comment.

Steps taken

The repository has an existing /docs folder with all the pages published to Charmhub. This migration was done by an old fork of the discourse-gatekeeper action.

Issue 1: Table of contents nesting

We updated our workflow from the fork to canonical/discourse-gatekeeper@stable. Then, we re-triggered the workflow with no changes to discourse or GitHub.

The action failed due to an index/ToC related error (link to logs):

src.exceptions.InputError: A nested item is a reference to a path that is not within the directory of its parent. item=_ParsedListItem(whitespace_count=0, reference_title='Reference', reference_value='reference', rank=19, hidden=False), expected parent path: PosixPath('how-to')

It fails on this item of the contents table.

The same error happens with canonical/discourse-gatekeeper@main.

Issue 2 (solved): Action does not import all documentation pages

See this comment

We deleted the entire /docs directory to allow the action to re-generate all docs from scratch with the expected structure. However, the action only downloaded a select few pages.
This older revision shows the expected pages.
This newer revision shows the result of deleting /docs and running the action.

This might be an entirely different issue altogether. It happens with both main and stable branches of discourse-gatekeeper.

After this, subsequent runs of the action (example run) yield the following error:

gatekeeper.exceptions.InputError: An item is not a file, directory or external HTTP resource. item=_ParsedListItem(whitespace_count=2, reference_title='1. Introduction', reference_value='tutorial/t-overview.md', rank=1, hidden=False)

The item referenced in the error above is the first page that exists in the table of contents but was not correctly imported to GitHub, so I guess that is why it could not be parsed.

@a-velasco
Copy link
Author

a-velasco commented Jun 18, 2024

Issue 2 resolved: We had not deleted the /base-content tag when resetting the docs.
We made sure to delete it completely before running the action, and it worked as expected (imported all documentation pages)

@a-velasco
Copy link
Author

a-velasco commented Jun 19, 2024

RE: Issue 1

I think this might be due to the logic in the _calculate_contents_hierarchy function.

When checking if the current item has a lower whitespace count than expected, it seems to assume the item is just one level away, so it always corrects the hierarchy by subtracting 1.

In a case where it is two or more levels away, its hierarchy value will be incorrect, so the _check_contents_item function will complain that this item isn't part of the group that its hierarchy indicates (i.e. raises that InputError I quoted in the OP)

I've had some success by modifying _calculate_contents_hierarchy like so:

while item:
        # All items in the current directory have been processed
        if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
+           if item.whitespace_count == 0:
+               hierarchy = 0
+               aggregate_dir = Path('.')
+           else:
+               hierarchy = hierarchy - 1
+               parent = parents.pop()
+               aggregate_dir = Path(parent.reference_value).parent
-           hierarchy = hierarchy - 1
-           parent = parents.pop()
-           aggregate_dir = Path(parent.reference_value).parent

This passes the unit tests in tests/unit/test_index_contents_hierarchy.py, and parses the contents as expected when I test it on the repository's CI.

I also added a new test case to simulate this scenario:

        pytest.param(
            (
                item_1 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_1 := "dir1")
                ),
                item_2 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_2 := f"{value_1}/dir2")
                ),
                item_3 := factories.IndexParsedListItemFactory(
                    whitespace_count=2, reference_value=(value_3 := f"{value_2}/file3.md")
                ),
                item_4 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_4 := "dir4")
                ),
                item_5 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_5 :=f"{value_4}/file5.md")
                ),    
            ),
            ("dir", "dir", "file", "dir", "file"),
            (
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_1.reference_title,
                    reference_value=value_1,
                    rank=item_1.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_2.reference_title,
                    reference_value=value_2,
                    rank=item_2.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=3,
                    reference_title=item_3.reference_title,
                    reference_value=value_3,
                    rank=item_3.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_4.reference_title,
                    reference_value=value_4,
                    rank=item_4.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_5.reference_title,
                    reference_value=value_5,
                    rank=item_5.rank,
                ),
            ),
            id="directory following a file nested twice",
        )

@jdkandersson I would love to hear your thoughts about this and get some more insights into how the hierarchy calculation is designed! I haven't raised a PR because I'm not sure if this is the best approach, and I feel like there are probably more considerations to take into account when modifying this logic.

@jdkandersson
Copy link
Contributor

I'm taking a look now

@jdkandersson
Copy link
Contributor

I think the reduction of level by 1 shouldn't be the problem since, if the whitespace count is still lower than expected, on the next loop it should be popped again. I haven't had the time yet, I would write a test potentially here:

def test__get_contents_parsed_items(
with the case and see if it fails and check why

@jdkandersson
Copy link
Contributor

actually, you already wrote the test, I'll use that as a starting point and see if I can fix the issue

@jdkandersson
Copy link
Contributor

Ok, also got the test failure, now need to check why

@jdkandersson
Copy link
Contributor

Part of the reason why I'm not sure we should use the proposed code is that it only takes care of the case where the whitespace count is 0, it should also be solved for cases where there is a 2 or more drop in hierarchy when the parent isn't at the root level

@a-velasco
Copy link
Author

@jdkandersson Thanks a lot for taking a look!

I wrote that code to check whether this was the part of the logic that was failing for multiple level jumps, but if you agree that this is where the hierarchy mis-calculation is happening, then I definitely agree that it should be generalized for other cases.

I'll try to come up with a more clean and general solution, and submit a PR.

@jdkandersson
Copy link
Contributor

jdkandersson commented Jun 27, 2024

The solution is quite simple:

if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
    hierarchy = hierarchy - 1
    parent = parents.pop()
    aggregate_dir = Path(parent.reference_value).parent
    continue

Just needed to add the continue so that the if statement is evaluated again rather than processing continues after only going up one hierarchy

@jdkandersson
Copy link
Contributor

Would you like to raise a PR with this and the new test case you have written?

@jdkandersson
Copy link
Contributor

Thanks for finding, reporting and even creating a test case for this!

@a-velasco
Copy link
Author

a-velasco commented Jun 27, 2024

Ahh no way, I reached almost the same solution but thought the aggregate directory calculation had to be different. Thanks! I'll be happy to raise the PR with the solution + test case :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants