docs-ci: `make linkcheck` prone to transient network failures #106

Closed
victorlin opened this issue Aug 23, 2024 · 19 comments

@victorlin
Member

victorlin commented Aug 23, 2024

I've just run into this error on an Augur PR which did not change any docs links:

(api/developer/augur.merge: line    7) broken    https://www.gnu.org/software/bash/manual/bash.html#ANSI_002dC-Quoting - HTTPSConnectionPool(host='www.gnu.org', port=443): Max retries exceeded with url: /software/bash/manual/bash.html (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4835cbc1c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
build finished with problems.
make: *** [Makefile:20: linkcheck] Error 1

This seems like a transient network error, which shows up as a failing check ❌ on the PR and confused me at first.

Possible solutions

  1. ⛔️ Split linkcheck into a separate job on docs-ci and use continue-on-error: true
  2. Don't run linkcheck in CI but instead on a weekly schedule
  3. Ignore timeout codes
  4. (per-project) Handle `broken` results on a case-by-case basis. If a link is intermittently broken, add it to linkcheck_ignore (see the conf.py sketch below). Otherwise, update or remove the link.
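For option 4, something like this in each project's conf.py could work (a minimal sketch with a hypothetical URL pattern; linkcheck_ignore takes regular expressions matched against URIs):

```python
# conf.py -- minimal sketch for option 4 (hypothetical pattern, adjust per project).
# linkcheck_ignore is a list of regular expressions matched against URIs;
# matching links are skipped entirely by the linkcheck builder.
linkcheck_ignore = [
    r"^https://example\.com/intermittently-broken/",
]
```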
@genehack
Contributor

> This seems like a transient network error

FWIW, I did see these types of errors occasionally (locally) while I was working on correcting links across the various repos.

Thanks for creating the issue; if this happens frequently, I'll handle the split/continue-on-error changes.

@victorlin
Member Author

Documenting another occurrence:

(installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/
(releases/changelog: line  646) broken    https://github.com/nextstrain/augur/pull/1033 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1033
(releases/changelog: line  642) broken    https://github.com/nextstrain/augur/pull/1034 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1034
(releases/changelog: line  626) ok        https://github.com/nextstrain/augur/pull/1070
(releases/changelog: line  598) broken    https://github.com/nextstrain/augur/pull/1039 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1039
(releases/changelog: line  643) broken    https://github.com/nextstrain/augur/pull/1042 - 502 Server Error: Bad Gateway for url: https://github.com/nextstrain/augur/pull/1042

@tsibley
Member

tsibley commented Aug 28, 2024

And another, twice in a row.

@genehack
Contributor

> And another, twice in a row.

BOOOOOO.

I will pick this up and make it continue-on-error: true in the next work cycle.

@genehack genehack self-assigned this Aug 28, 2024
genehack added a commit that referenced this issue Sep 3, 2024
Also add `continue-on-error-comment` step, so that if linkcheck fails,
a comment will be added to the PR, instead of a silent failure with a
green check.
@victorlin
Member Author

I'm wondering if continue-on-error: true is the right solution here. With this setting as-is, "real" linkcheck issues are likely to go unnoticed.

On the other hand, with something like the CI failures we have currently or mainmatter/continue-on-error-comment, I'm worried that it could be unnecessarily noisy given the high rate of these failures lately (I've seen many in Augur, but I've stopped linking them back to here).

Some alternatives:

  1. If the network failures are only on a few URLs/domains, configure linkcheck to ignore those domains
  2. Don't run linkcheck in CI but instead on a weekly schedule with retries + cooldown periods between each try. This would reduce the impact of transient network failures while still making sure links are valid (Sphinx's own retry settings, sketched below, could complement this).
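For reference, the linkcheck builder already exposes a few per-run knobs in conf.py that can soften transient failures (illustrative values only, not tested recommendations):

```python
# conf.py -- possible per-run mitigations (illustrative values, not recommendations).
linkcheck_timeout = 30   # seconds before a request counts as a timeout
linkcheck_retries = 3    # how many times linkcheck attempts each URL before reporting it
linkcheck_workers = 5    # fewer parallel requests may reduce rate limiting
```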

I realize this comment is coming a bit late, but it's longer-term thinking. continue-on-error: true should be good for reducing CI failures in the short term.

@tsibley
Member

tsibley commented Sep 4, 2024

I generally agree with @victorlin here.

@victorlin
Member Author

victorlin commented Oct 10, 2024

Actually, I think what we want is a filter on the HTTP response. 404s are useful (example) but network errors are not.

Sphinx 8.0 (released Jul 29, 2024) changed the default value of linkcheck_report_timeouts_as_broken to False, which is a step in the right direction.

However, the new timeout code still results in a GitHub Actions ❌ because it has the same exit code as broken (example, src). Ideally linkcheck should allow configuring timeout to not cause a non-zero exit code.
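For projects still on Sphinx < 8.0, the new default can be opted into explicitly in conf.py (a one-line sketch; I believe the option itself was added in Sphinx 7.3):

```python
# conf.py -- report timeouts with a separate "timeout" status instead of "broken"
# (the default as of Sphinx 8.0).
linkcheck_report_timeouts_as_broken = False
```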

@victorlin
Member Author

New proposal: run `make linkcheck` and ignore the exit code. Then run a custom script that errors only when there is a `status=broken` entry in the summary file $BUILDDIR/linkcheck/output.json.

@genehack what do you think about this approach over #107? It doesn't require splitting linkcheck into a separate job. I can open a PR to demonstrate.
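Roughly something like this (a minimal sketch, not the final implementation; it assumes output.json contains one JSON object per checked link with "status", "uri", "filename", "lineno", and "info" keys):

```python
#!/usr/bin/env python3
"""Fail only on "broken" links in Sphinx linkcheck's JSON summary.

Minimal sketch, not the final implementation. Assumes
$BUILDDIR/linkcheck/output.json has one JSON object per checked link.
"""
import json
import sys
from pathlib import Path

summary = Path(sys.argv[1] if len(sys.argv) > 1 else "build/linkcheck/output.json")

broken = []
for line in summary.read_text().splitlines():
    if not line.strip():
        continue
    entry = json.loads(line)
    if entry.get("status") == "broken":
        broken.append(entry)

for entry in broken:
    print(f'{entry["filename"]}:{entry["lineno"]}: broken {entry["uri"]} - {entry.get("info", "")}')

# Exit non-zero only for "broken"; timeouts and other statuses don't fail the job.
sys.exit(1 if broken else 0)
```

In CI this would run after `make linkcheck || true`, so Sphinx's own exit code is ignored.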

@genehack
Contributor

> @genehack what do you think about this approach over #107? It doesn't require splitting linkcheck into a separate job. I can open a PR to demonstrate.

yeah, sure, seems like it may solve the one problem without causing the other one...

@victorlin
Member Author

Noting that broken includes some transient failures, for example:

(installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/

but ignoring timeout should still be an improvement. If it's just a few domains that return broken transiently, we can configure linkcheck to ignore those domains.

@genehack
Contributor

> Noting that broken includes some transient failures, for example:
>
> (installation/installation: line    9) broken    http://www.microbesonline.org/fasttree/ - 403 Client Error: Forbidden for url: http://www.microbesonline.org/fasttree/

So, that's actually a 403; that's unlikely to be transient (or if it is, it reflects some sort of oddness on the remote end, e.g., load-balanced servers giving different responses).

It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

@victorlin
Member Author

> So, that's actually a 403; that's unlikely to be transient (or if it is, it reflects some sort of oddness on the remote end, e.g., load-balanced servers giving different responses).

For this specific URL it is some sort of oddness on the remote end – see nextstrain/augur#1593 (comment)

@victorlin
Member Author

> It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

  1. There doesn't seem to be a way to configure linkcheck to ignore certain HTTP responses. It only distinguishes between `timeout` vs. `broken`.
  2. I don't think we should ignore 403s. We should check if it is no longer publicly accessible, in which case we should update or remove the link.

@genehack
Contributor

> It's probably worth extending the filtering so that 403 Client Error: Forbidden is also ignored.

> 1. There doesn't seem to be a way to configure linkcheck to ignore certain HTTP responses. It only distinguishes between `timeout` vs. `broken`.

I'm suggesting adding an additional filtering step, analogous to the current "broken" one, that specifically looks for this 403 -- say, by using an additional "grep" step.

It wouldn't even need to exit 1 to be useful; just reporting out the collection of things returning 403 would be helpful over time.
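For illustration, that extra reporting step could be a standalone variant of the same idea (hypothetical sketch; it assumes the "403 Client Error" text appears in the "info" field of output.json):

```python
#!/usr/bin/env python3
"""Report (but never fail on) links that come back 403 Forbidden.

Hypothetical sketch only; assumes the "403 Client Error" text shows up in the
"info" field of Sphinx linkcheck's output.json.
"""
import json
import sys
from pathlib import Path

summary = Path(sys.argv[1] if len(sys.argv) > 1 else "build/linkcheck/output.json")

for line in summary.read_text().splitlines():
    if not line.strip():
        continue
    entry = json.loads(line)
    if entry.get("status") == "broken" and "403 Client Error" in entry.get("info", ""):
        print(f'note: 403 (possibly server-side client sniffing): {entry["uri"]}')

# Always exit 0: this step only reports, it never fails the build.
sys.exit(0)
```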

> 2. I don't think we should ignore 403s. We should check if it is no longer publicly accessible, in which case we should update or remove the link.

IME, when putting together the initial linkcheck stuff, pretty much every link that returned a 403 "client denied" error like this was perfectly accessible in a browser, but had some sort of server-side client-sniffing that was denying programmatic requests.

@victorlin
Member Author

> pretty much every link that returned a 403 "client denied" error like this was perfectly accessible in a browser, but had some sort of server-side client-sniffing that was denying programmatic requests.

Did this happen for URLs other than http://www.microbesonline.org/fasttree/? That's the only one I'm aware of. I think we agree on the weirdness of that one and that we should ignore it. For anything else that doesn't show a pattern yet, new 403 errors should be evaluated on a case-by-case basis. I'm not sure a separate jq query to report 403s would be helpful – it seems fine to just evaluate all broken results on a case-by-case basis regardless of HTTP error code.

@tsibley
Member

tsibley commented Oct 24, 2024

What @victorlin said.

@genehack
Contributor

> Did this happen for URLs other than http://www.microbesonline.org/fasttree/?

When I initially did the link check cleanup, there were a number of sites that consistently returned this 403 error; I added those to the ignore config (and I think commented that the links worked in a browser but failed with the link check only). There weren't any intermittent 403s that I saw.

> new 403 errors should be evaluated on a case-by-case basis. I'm not sure a separate jq query to report 403s would be helpful – it seems fine to just evaluate all broken results on a case-by-case basis regardless of HTTP error code.

fair 'nuff.

@victorlin
Member Author

> I added those to the ignore config (and I think commented that the links worked in a browser but failed with the link check only)

I see, thanks! Doing the same to the FastTree link in nextstrain/augur#1660.

@victorlin
Member Author

I'm going to consider this closed by #110 and nextstrain/augur#1660.

I think it's still worth reconsidering how we are using linkcheck, but I've written up a separate issue for that: #116
