"Un-parametrize" webcrawler tests #2966
Conversation
Test results: 9 files, 9 suites, 7m 55s ⏱️ Results for commit 809a18a. ♻️ This comment has been updated with latest results.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##           master    #2966      +/-   ##
    ==========================================
    + Coverage   56.60%   56.83%   +0.22%
    ==========================================
      Files         602      602
      Lines       43713    43715       +2
      Branches       48       48
    ==========================================
    + Hits        24744    24845     +101
    + Misses      18957    18858      -99
      Partials       12       12

☔ View full report in Codecov by Sentry.
While pytest can accomplish a lot of exciting things, it cannot use fixtures as input to test parametrization. While we can make a test depend on a fixture for getting a running webserver instance, the test discovery phase that generates the tests cannot. That is, we cannot get our test parameters from the webcrawler unless the web server was already up and running when the test discovery phase started. Sad, but true.

This changes the webcrawler into a fixture, and changes the web link reachability and HTML validation tests to iterate over the pages provided by this session-scoped crawler.

This also considerably shortens the discovery phase, since the crawling now takes place during the running of the first test that uses the fixture.
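As an illustration, a minimal sketch of this pattern. The names WebCrawler, base_url, and crawled_pages are placeholders standing in for the real crawler entry points, not the PR's actual code:

    import pytest

    @pytest.fixture(scope="session")
    def crawled_pages(base_url):
        # Crawl the site once per test session and hand the resulting
        # page list to every test that asks for it
        crawler = WebCrawler(base_url)  # hypothetical crawler entry point
        return list(crawler.crawl())

    def test_all_pages_are_reachable(crawled_pages):
        # One test iterates all pages, instead of one generated test per page
        for page in crawled_pages:
            assert page.response == 200, "{} is not reachable".format(page.url)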
Force-pushed from 0d24838 to e820b88
This is mostly relevant for devs.
🦙 MegaLinter status: ✅ SUCCESS. See detailed report in MegaLinter reports.
See comments. I think this is a good idea generally, but IMO we should not stop at the first page that is unreachable or has invalid HTML; rather, we should collect all the errors and show them at the end.
    if page.response != 200:
        # No need to fill up the test report files with contents of OK pages
        print(_content_as_string(page.content))
    assert page.response == 200, "{} is not reachable".format(page.url)
My only complaint here is that it stops as soon as the first page is not reachable. Before, you could see all pages that aren't reachable, which might help when troubleshooting, making it easier to figure out what is broken.
Fixed
    assert not errors, "Found following validation errors:\n" + errors
    assert not errors, "{} did not validate as HTML".format(page.url)
Same comment here as above
Fixed
Yeah. I would tend to agree :)
The page reachability and validation tests would both exit on the first page that failed the check. This differs from the previous behavior, where each page generated a dynamic test. It was rightly pointed out in code review that this may mask multiple failures, and that all failures should be reported.
For some HTTPErrors, it seems the original crawler code overloads the meaning of the Page.content_type attribute, putting an exception there instead of a MIME type string. It is not entirely clear what the reasoning could be, so this just adds handling of that potential case to the `should_validate` function.
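A hedged sketch of the kind of guard described here. The attribute name comes from the PR; the function body is illustrative, not the actual implementation:

    def should_validate(page):
        # For some HTTPErrors the crawler stores an exception object in
        # content_type instead of a MIME type string; skip validation then
        if not isinstance(page.content_type, str):
            return False
        # Only HTML responses are worth running through the validator
        return page.content_type.startswith("text/html")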
It turns out that it is exceedingly difficult to provoke an HTML validation failure to test that this still works. Our default config suppresses tidy warnings in favor of only fetching tidy errors. However, it seems most problems tidy finds in HTML are classed as warnings, not errors (HTML5 is very forgiving). We may want to reconsider the warning suppression after a team discussion.
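For reference, a minimal sketch of the warning-vs-error split, assuming the validation goes through pytidylib (the sample HTML and option value are illustrative):

    from tidylib import tidy_document

    html = "<html><body><p>Unclosed paragraph</body></html>"

    # With "show-warnings": 0, tidy reports only hard errors; most HTML5
    # quirks are classed as warnings and are therefore suppressed entirely
    document, errors = tidy_document(html, options={"show-warnings": 0})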
If there are multiple unreachable or non-validating pages, pytest will eventually truncate the list it prints. This explicitly adds an assertion message that includes the full list of failures, separated by newlines.
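Put together, a sketch of the collect-then-assert pattern this and the earlier commit describe, evolving the earlier fixture sketch (crawled_pages remains a placeholder name):

    def test_all_pages_are_reachable(crawled_pages):
        errors = [
            "{} is not reachable (status {})".format(page.url, page.response)
            for page in crawled_pages
            if page.response != 200
        ]
        # Joining the failures into the assertion message ourselves keeps
        # pytest from truncating the list when many pages fail
        assert not errors, "Unreachable pages:\n" + "\n".join(errors)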
Force-pushed from c16b1f7 to 809a18a
This is a necessary step towards moving the test web server setup from tests/integration/conftest.py to a test suite fixture in the future.

Before this PR, a reachability test and an HTML validation test were dynamically generated for each reachable page, by crawling a NAV web server at test discovery time and using the results to parametrize the reachability and validation tests (which artificially inflated the number of tests in the test suite).
However, it is much more useful to make a resource fixture for the web server, so tests that need it can declare their dependency on it. Even if it were possible to use a fixture as input to test parametrization, the test discovery phase would still depend on a web server that was already running.
This leaves us with the suggestion of this PR: make the web crawler itself a fixture that generates a list of pages, and have a single validation test and a single reachability test that loop over all the results generated by this fixture.
This vastly reduces the total number of tests in the test suite - and the number will not "randomly" increase every time new web pages become crawlable. It also considerably reduces the time spent in the test discovery phase, as the crawler will only run if the current test session needs it.
Later, we can make the crawler fixture depend on a web server fixture, thereby also ensuring a web server is only started when tests that need it are selected (unlike the current solution, which always starts a web server if any integration test is selected).
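That future step might look something like this sketch. The server startup helper and attribute names are hypothetical, not existing NAV test suite code:

    import pytest

    @pytest.fixture(scope="session")
    def webserver():
        # Hypothetical helper; starts NAV's web server for the test session
        process = start_nav_webserver()
        yield process
        process.terminate()

    @pytest.fixture(scope="session")
    def crawled_pages(webserver):
        # Depending on the webserver fixture means the server only starts
        # when a selected test actually needs crawled pages
        return list(WebCrawler(webserver.base_url).crawl())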
This was extracted and made independent from #2675