docs-ci: `make linkcheck` prone to transient network failures #106

Closed
victorlin opened this issue Aug 23, 2024 · 19 comments

@victorlin
Member

victorlin commented Aug 23, 2024

I've just run into this error on an Augur PR which did not change any docs links:

(api/developer/augur.merge: line    7) broken    https://www.gnu.org/software/bash/manual/bash.html#ANSI_002dC-Quoting - HTTPSConnectionPool(host='www.gnu.org', port=443): Max retries exceeded with url: /software/bash/manual/bash.html (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4835cbc1c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
build finished with problems.
make: *** [Makefile:20: linkcheck] Error 1

This seems like a transient network error, which shows up as a failing check ❌ on the PR and confused me at first.

Possible solutions

  1. ⛔️ Split linkcheck into a separate job on docs-ci and use continue-on-error: true
  2. Don't run linkcheck in CI but instead on a weekly schedule
  3. Ignore timeout codes
  4. (per-project) Handle `broken` results on a case-by-case basis. If a link is intermittently broken, add it to linkcheck_ignore (see the conf.py sketch below). Otherwise, update or remove the link.
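For option 4, something like this in each project's conf.py could work (a minimal sketch with a hypothetical URL pattern; linkcheck_ignore takes regular expressions matched against URIs):

```python
# conf.py -- minimal sketch for option 4 (hypothetical pattern, adjust per project).
# linkcheck_ignore is a list of regular expressions matched against URIs;
# matching links are skipped entirely by the linkcheck builder.
linkcheck_ignore = [
    r"^https://example\.com/intermittently-broken/",
]
```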
@genehack
Contributor

> This seems like a transient network error

FWIW, I did see these types of errors occasionally (locally) while I was working on correcting links across the various repos.

Thanks for creating the issue; if this happens frequently, I'll handle the split/continue-on-error changes.

@victorlin
Member Author

Documenting another occurrence:

(installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/
(releases/changelog: line  646) broken    https://github.com/nextstrain/augur/pull/1033 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1033
(releases/changelog: line  642) broken    https://github.com/nextstrain/augur/pull/1034 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1034
(releases/changelog: line  626) ok        https://github.com/nextstrain/augur/pull/1070
(releases/changelog: line  598) broken    https://github.com/nextstrain/augur/pull/1039 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1039
(releases/changelog: line  643) broken    https://github.com/nextstrain/augur/pull/1042 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1042

@tsibley
Member

tsibley commented Aug 28, 2024

And another, twice in a row.

@genehack
Contributor

> And another, twice in a row.

BOOOOOO.

I will pick this up and make it continue-on-error: true in the next work cycle.

@genehack genehack self-assigned this Aug 28, 2024
genehack added a commit that referenced this issue Sep 3, 2024
Also add `continue-on-error-comment` step, so that if linkcheck fails,
a comment will be added to the PR, instead of a silent failure with a
green check.
@victorlin
Member Author

I'm wondering if continue-on-error: true is the right solution here. With this setting as-is, "real" linkcheck issues are likely to go unnoticed.

On the other hand, with something like the CI failures we have currently or mainmatter/continue-on-error-comment, I'm worried that it could be unnecessarily noisy given the high rate of these failures lately (I've seen many in Augur, but I've stopped linking them back to here).

Some alternatives:

  1. If the network failures are only on a few URLs/domains, configure linkcheck to ignore those domains
  2. Don't run linkcheck in CI but instead on a weekly schedule with retries + cooldown periods between each try. This would reduce the impact of transient network failures while still making sure links are valid (Sphinx's own retry settings, sketched below, could complement this).
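For reference, the linkcheck builder already exposes a few per-run knobs in conf.py that can soften transient failures (illustrative values only, not tested recommendations):

```python
# conf.py -- possible per-run mitigations (illustrative values, not recommendations).
linkcheck_timeout = 30   # seconds before a request counts as a timeout
linkcheck_retries = 3    # how many times linkcheck attempts each URL before reporting it
linkcheck_workers = 5    # fewer parallel requests may reduce rate limiting
```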

I realize this comment is coming a bit late, but it's longer-term thinking. continue-on-error: true should be good for reducing CI failures in the short term.

@tsibley
Member

tsibley commented Sep 4, 2024

I generally agree with @victorlin here.

@victorlin
Member Author

victorlin commented Oct 10, 2024

Actually, I think what we want is a filter on the HTTP response. 404s are useful (example) but network errors are not.

Sphinx 8.0 (released Jul 29, 2024) changed the default value of linkcheck_report_timeouts_as_broken to False, which is a step in the right direction.

However, the new timeout code still results in a GitHub Actions ❌ because it has the same exit code as broken (example, src). Ideally linkcheck should allow configuring timeout to not cause a non-zero exit code.
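For projects still on Sphinx < 8.0, the new default can be opted into explicitly in conf.py (a one-line sketch; I believe the option itself was added in Sphinx 7.3):

```python
# conf.py -- report timeouts with a separate "timeout" status instead of "broken"
# (the default as of Sphinx 8.0).
linkcheck_report_timeouts_as_broken = False
```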

@victorlin
Member Author

New proposal: run `make linkcheck` and ignore the exit code. Then run a custom script that errors only when there is a `status=broken` entry in the summary file $BUILDDIR/linkcheck/output.json.

@genehack what do you think about this approach over #107? It doesn't require splitting linkcheck into a separate job. I can open a PR to demonstrate.
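Roughly something like this (a minimal sketch, not the final implementation; it assumes output.json contains one JSON object per checked link with "status", "uri", "filename", "lineno", and "info" keys):

```python
#!/usr/bin/env python3
"""Fail only on "broken" links in Sphinx linkcheck's JSON summary.

Minimal sketch, not the final implementation. Assumes
$BUILDDIR/linkcheck/output.json has one JSON object per checked link.
"""
import json
import sys
from pathlib import Path

summary = Path(sys.argv[1] if len(sys.argv) > 1 else "build/linkcheck/output.json")

broken = []
for line in summary.read_text().splitlines():
    if not line.strip():
        continue
    entry = json.loads(line)
    if entry.get("status") == "broken":
        broken.append(entry)

for entry in broken:
    print(f'{entry["filename"]}:{entry["lineno"]}: broken {entry["uri"]} - {entry.get("info", "")}')

# Exit non-zero only for "broken"; timeouts and other statuses don't fail the job.
sys.exit(1 if broken else 0)
```

In CI this would run after `make linkcheck || true`, so Sphinx's own exit code is ignored.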

@genehack
Contributor

> @genehack what do you think about this approach over #107? It doesn't require splitting linkcheck into a separate job. I can open a PR to demonstrate.

yeah, sure, seems like it may solve the one problem without causing the other one...

@victorlin
Member Author

Noting that broken includes some transient failures, for example:

(installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/

but ignoring timeout should still be an improvement. If it's just a few domains that return broken transiently, we can configure linkcheck to ignore those domains.

@genehack
Contributor

> Noting that broken includes some transient failures, for example:
>
> (installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/

So, that's actually a 403; that's unlikely to be transient (or if it is, it reflects some sort of oddness on the remote end, e.g., load-balanced servers giving different responses).

It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

@victorlin
Member Author

> So, that's actually a 403; that's unlikely to be transient (or if it is, it reflects some sort of oddness on the remote end, e.g., load-balanced servers giving different responses).

For this specific URL it is some sort of oddness on the remote end – see nextstrain/augur#1593 (comment)

@victorlin
Member Author

> It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

  1. There doesn't seem to be a way to configure linkcheck to ignore certain HTTP responses. It only distinguishes between `timeout` vs. `broken`.
  2. I don't think we should ignore 403s. We should check if it is no longer publicly accessible, in which case we should update or remove the link.

@genehack
Contributor

> It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

> 1. There doesn't seem to be a way to configure linkcheck to ignore certain HTTP responses. It only distinguishes between `timeout` vs. `broken`.

I'm suggesting adding an additional filtering step, analogous to the current "broken" one, that specifically looks for this 403 -- say, by using an additional "grep" step.

It wouldn't even need to exit 1 to be useful; just reporting out the collection of things returning 403 would be helpful over time.
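For illustration, that extra reporting step could be a standalone variant of the same idea (hypothetical sketch; it assumes the "403 Client Error" text appears in the "info" field of output.json):

```python
#!/usr/bin/env python3
"""Report (but never fail on) links that come back 403 Forbidden.

Hypothetical sketch only; assumes the "403 Client Error" text shows up in the
"info" field of Sphinx linkcheck's output.json.
"""
import json
import sys
from pathlib import Path

summary = Path(sys.argv[1] if len(sys.argv) > 1 else "build/linkcheck/output.json")

for line in summary.read_text().splitlines():
    if not line.strip():
        continue
    entry = json.loads(line)
    if entry.get("status") == "broken" and "403 Client Error" in entry.get("info", ""):
        print(f'note: 403 (possibly server-side client sniffing): {entry["uri"]}')

# Always exit 0: this step only reports, it never fails the build.
sys.exit(0)
```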

> 2. I don't think we should ignore 403s. We should check if it is no longer publicly accessible, in which case we should update or remove the link.

IME, when putting together the initial linkcheck stuff, pretty much every link that returned a 403 "client denied" error like this was perfectly accessible in a browser, but had some sort of server-side client-sniffing that was denying programmatic requests.

@victorlin
Member Author

> pretty much every link that returned a 403 "client denied" error like this was perfectly accessible in a browser, but had some sort of server-side client-sniffing that was denying programmatic requests.

Did this happen for URLs other than http://www.microbesonline.org/fasttree/? That's the only one I'm aware of. I think we agree on the weirdness of that one and that we should ignore it. For anything else that doesn't show a pattern yet, new 403 errors should be evaluated on a case-by-case basis. I'm not sure a separate jq query to report 403s would be helpful – it seems fine to just evaluate all broken results on a case-by-case basis regardless of HTTP error code.

@tsibley
Member

tsibley commented Oct 24, 2024

What @victorlin said.

@genehack
Contributor

> Did this happen for URLs other than http://www.microbesonline.org/fasttree/?

When I initially did the link check cleanup, there were a number of sites that consistently returned this 403 error; I added those to the ignore config (and I think commented that the links worked in a browser but failed with the link check only). There weren't any intermittent 403s that I saw.

> new 403 errors should be evaluated on a case-by-case basis. I'm not sure a separate jq query to report 403s would be helpful – it seems fine to just evaluate all broken results on a case-by-case basis regardless of HTTP error code.

fair 'nuff.

@victorlin
Member Author

> I added those to the ignore config (and I think commented that the links worked in a browser but failed with the link check only)

I see, thanks! Doing the same to the FastTree link in nextstrain/augur#1660.

@victorlin
Member Author

I'm going to consider this closed by #110 and nextstrain/augur#1660.

I think it's still worth reconsidering how we are using linkcheck, but I've written up a separate issue for that: #116
