"Un-parametrize" webcrawler tests #2966
Conversation
Test results: 9 files, 9 suites, 7m 55s ⏱️ Results for commit 809a18a. ♻️ This comment has been updated with latest results.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master    #2966      +/-   ##
    ==========================================
    + Coverage   56.60%   56.83%   +0.22%
    ==========================================
      Files         602      602
      Lines       43713    43715       +2
      Branches       48       48
    ==========================================
    + Hits        24744    24845     +101
    + Misses      18957    18858      -99
      Partials       12       12

☔ View full report in Codecov by Sentry.
While pytest can accomplish a lot of exciting things, it cannot use fixtures as input to test parametrization. While we can make a test depend on a fixture for getting a running webserver instance, the test discovery phase that generates the tests cannot. That is, we cannot get our test parameters from the webcrawler unless the web server was already up and running when the test discovery phase started. Sad, but true.

This changes the webcrawler into a fixture, and changes the web link reachability and HTML validation tests to iterate over the pages provided by this session-scoped crawler.

This also considerably shortens the discovery phase, since the crawling now takes place during the running of the first test that uses the fixture.
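As an illustration, a minimal sketch of this pattern. The names WebCrawler, base_url, and crawled_pages are placeholders standing in for the real crawler entry points, not the PR's actual code:

    import pytest

    @pytest.fixture(scope="session")
    def crawled_pages(base_url):
        # Crawl the site once per test session and hand the resulting
        # page list to every test that asks for it
        crawler = WebCrawler(base_url)  # hypothetical crawler entry point
        return list(crawler.crawl())

    def test_all_pages_are_reachable(crawled_pages):
        # One test iterates all pages, instead of one generated test per page
        for page in crawled_pages:
            assert page.response == 200, "{} is not reachable".format(page.url)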
Force-pushed from 0d24838 to e820b88
This is mostly relevant for devs.
🦙 MegaLinter status: ✅ SUCCESS. See detailed report in MegaLinter reports.
See comments. I think this is a good idea generally, but IMO we should not stop at the first page that is unreachable or has invalid HTML; rather, we should collect all the errors and show them at the end.
    if page.response != 200:
        # No need to fill up the test report files with contents of OK pages
        print(_content_as_string(page.content))
    assert page.response == 200, "{} is not reachable".format(page.url)
My only complaint here is that it stops as soon as the first page is not reachable. Before, you could see all pages that aren't reachable, which might help when troubleshooting, making it easier to figure out what is broken.
Fixed
    assert not errors, "Found following validation errors:\n" + errors
    assert not errors, "{} did not validate as HTML".format(page.url)
Same comment here as above
Fixed
Yeah. I would tend to agree :)
The page reachability and validation tests would both exit on the first page that failed the check. This differs from the previous behavior, where each page generated a dynamic test. It was rightly pointed out in code review that this may mask multiple failures, and that all failures should be reported.
For some HTTPErrors, it seems the original crawler code overloads the meaning of the Page.content_type attribute, putting an exception there instead of a MIME type string. It is not entirely clear what the reasoning could be, so this just adds handling of that potential case to the `should_validate` function.
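A hedged sketch of the kind of guard described here. The attribute name comes from the PR; the function body is illustrative, not the actual implementation:

    def should_validate(page):
        # For some HTTPErrors the crawler stores an exception object in
        # content_type instead of a MIME type string; skip validation then
        if not isinstance(page.content_type, str):
            return False
        # Only HTML responses are worth running through the validator
        return page.content_type.startswith("text/html")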
It turns out that it is exceedingly difficult to provoke an HTML validation failure to test that this still works. Our default config suppresses tidy warnings in favor of only fetching tidy errors. However, it seems most problems tidy finds in HTML are classed as warnings, not errors (HTML5 is very forgiving). We may want to reconsider the warning suppression after a team discussion.
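For reference, a minimal sketch of the warning-vs-error split, assuming the validation goes through pytidylib (the sample HTML and option value are illustrative):

    from tidylib import tidy_document

    html = "<html><body><p>Unclosed paragraph</body></html>"

    # With "show-warnings": 0, tidy reports only hard errors; most HTML5
    # quirks are classed as warnings and are therefore suppressed entirely
    document, errors = tidy_document(html, options={"show-warnings": 0})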
If there are multiple unreachable or non-validating pages, pytest will eventually truncate the list it prints. This explicitly adds an assertion message that includes the full list of failures, separated by newlines.
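Put together, a sketch of the collect-then-assert pattern this and the earlier commit describe, evolving the earlier fixture sketch (crawled_pages remains a placeholder name):

    def test_all_pages_are_reachable(crawled_pages):
        errors = [
            "{} is not reachable (status {})".format(page.url, page.response)
            for page in crawled_pages
            if page.response != 200
        ]
        # Joining the failures into the assertion message ourselves keeps
        # pytest from truncating the list when many pages fail
        assert not errors, "Unreachable pages:\n" + "\n".join(errors)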
Force-pushed from c16b1f7 to 809a18a
This is a necessary step towards moving the test web server setup from tests/integration/conftest.py to a test suite fixture in the future.

Before this PR, a reachability test and an HTML validation test were dynamically generated for each reachable page, by crawling a NAV web server at test discovery time and using the results to parametrize the reachability and validation tests (which artificially inflated the number of tests in the test suite).
However, it is much more useful to make a resource fixture for the web server, so tests that need it can declare their dependency on it. Even if it were possible to use a fixture as input to test parametrization, the test discovery phase would still depend on a web server that was already running.
This leaves us with the suggestion of this PR: make the web crawler itself a fixture that generates a list of pages, and have a single validation test and a single reachability test that loop over all the results generated by this fixture.
This vastly reduces the total number of tests in the test suite - and the number will not "randomly" increase every time new web pages become crawlable. It also considerably reduces the time spent in the test discovery phase, as the crawler will only run if the current test session needs it.
Later, we can make the crawler fixture depend on a web server fixture, thereby also ensuring a web server is only started when tests that need it are selected (unlike the current solution, which always starts a web server if any integration test is selected).
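That future step might look something like this sketch. The server startup helper and attribute names are hypothetical, not existing NAV test suite code:

    import pytest

    @pytest.fixture(scope="session")
    def webserver():
        # Hypothetical helper; starts NAV's web server for the test session
        process = start_nav_webserver()
        yield process
        process.terminate()

    @pytest.fixture(scope="session")
    def crawled_pages(webserver):
        # Depending on the webserver fixture means the server only starts
        # when a selected test actually needs crawled pages
        return list(WebCrawler(webserver.base_url).crawl())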
This was extracted and made independent from #2675