
Failure of www.wikidoc.org due to missing CSS dependency #2091

Open
benoit74 opened this issue Oct 7, 2024 · 20 comments

@benoit74
Contributor

benoit74 commented Oct 7, 2024

I tried to create a ZIM of https://www.wikidoc.org/ with docker run --rm --name mwoffliner_test ghcr.io/openzim/mwoffliner:dev mwoffliner --adminEmail="[email protected]" --customZimDescription="Desc" --format="novid:maxi" --mwUrl="https://www.wikidoc.org/" --mwWikiPath "index.php" --mwActionApiPath "api.php" --mwRestApiPath "rest.php" --publisher="openZIM" --webp --customZimTitle="Custom title" --verbose

It fails with the following error:

[error] [2024-10-07T07:51:01.434Z] Unable to retrieve js/css dependencies for article 'CSA Trust': nosuchrevid
[log] [2024-10-07T07:51:01.434Z] Exiting with code [1]
[log] [2024-10-07T07:51:01.434Z] Deleting temporary directory [/tmp/mwoffliner-1728287394504]
file:///tmp/mwoffliner/lib/Downloader.js:570
            throw new Error(errorMessage);
                  ^

Error: Unable to retrieve js/css dependencies for article 'CSA Trust': nosuchrevid
    at Downloader.getModuleDependencies (file:///tmp/mwoffliner/lib/Downloader.js:570:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async file:///tmp/mwoffliner/lib/util/saveArticles.js:250:45

Does it mean we cannot ZIM this wiki just because one CSS dependency is badly configured? Is there a way to ignore it (it is probably not used on the live website anyway if it does not exist)?

@audiodude
Member

I tried running this locally, and got the same error, except for a different article:

Unable to retrieve js/css dependencies for article 'MPP+': nosuchrevid

Looking at the source website, I can't find articles for MPP+ or CSA Trust (which is the article mentioned in the OP). So there are two questions:

  1. Should we fail the entire ZIM if we can't load the module dependencies of one page?
  2. Are we making a mistake in how we get the list of articles to download?

For 1), my impression was that the general approach of mwoffliner is to fail if an article cannot be retrieved, except in the narrow case that it was deleted between the time the article list was built and when the data was requested. @kelson42 what are your thoughts?

For 2), I would need to dig more into the way the article list is built, because I'm not immediately familiar with it.

@benoit74
Contributor Author

benoit74 commented Oct 8, 2024

Thank you!

Regarding the fact that our attempts stop at a different article, this is not a surprise to me. From my experience, the order of the article list seems to be "random".

@audiodude
Member

They're not really "random", just highly asynchronous, as you pointed out in #2092.

@audiodude
Member

So I tried to start a PR that would ignore the nosuchrevid error code. However, the scraping then fails with:

[error] [2024-10-09T00:41:01.659Z] Error downloading article MPP+

So I think the problem is definitely in the methodology for figuring out which articles to scrape.
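
For reference, the change I tried was roughly along these lines. This is a simplified sketch in the spirit of the PR, not the actual diff: the wrapper name is made up, and the signature of the real Downloader method is assumed.

```ts
// Simplified sketch of the idea (not the actual diff): swallow only the
// 'nosuchrevid' error and treat the article as having no extra js/css modules.
// The real method lives on Downloader; its exact signature is assumed here.
declare function getModuleDependencies(articleTitle: string): Promise<string[]>;

async function getModuleDependenciesIgnoringNoSuchRevid(articleTitle: string): Promise<string[]> {
  try {
    return await getModuleDependencies(articleTitle);
  } catch (err) {
    if (err instanceof Error && err.message.includes('nosuchrevid')) {
      console.warn(`No current revision for '${articleTitle}', skipping its js/css modules`);
      return [];
    }
    throw err; // any other failure should still abort the scrape as before
  }
}
```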

@audiodude
Member

Okay, so mwoffliner fetches API responses from URLs like:

https://www.wikidoc.org/api.php?action=query&format=json&prop=redirects%7Crevisions%7Ccoordinates&rdlimit=max&rdnamespace=0&formatversion=2&colimit=max&rawcontinue=true&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&gapnamespace=0&gapcontinue=MCM7

And uses the result to get a list of article titles to later download.
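
Roughly speaking, the title list is assembled like this. This is only an illustration using plain `fetch`, not the actual mwoffliner code, and some parameters from the URL above (rdlimit, colimit, ...) are omitted for brevity.

```ts
// Illustration only: walk the allpages generator and collect the titles that
// are later downloaded.
const apiBase = 'https://www.wikidoc.org/api.php';

async function listArticleTitles(): Promise<string[]> {
  const titles: string[] = [];
  let gapcontinue: string | undefined;
  do {
    const params = new URLSearchParams({
      action: 'query',
      format: 'json',
      formatversion: '2',
      prop: 'redirects|revisions|coordinates',
      generator: 'allpages',
      gapfilterredir: 'nonredirects',
      gaplimit: 'max',
      gapnamespace: '0',
      rawcontinue: 'true',
    });
    if (gapcontinue) params.set('gapcontinue', gapcontinue);
    const data = await (await fetch(`${apiBase}?${params}`)).json();
    for (const page of data.query?.pages ?? []) {
      titles.push(page.title);
    }
    // With rawcontinue, the next batch marker comes back under 'query-continue'.
    gapcontinue = data['query-continue']?.allpages?.gapcontinue;
  } while (gapcontinue);
  return titles;
}
```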

This endpoint is returning the following:

 {
    "pageid": 112260,
    "ns": 0,
    "title": "MS Bike Tour",
    "revisions": [
      {
        "revid": 718699,
        "parentid": 678940,
        "minor": true,
        "user": "WikiBot",
        "timestamp": "2012-09-04T19:21:10Z",
        "comment": "Robot: Automated text replacement (-{{WikiDoc Cardiology Network Infobox}} +, -<references /> +{{reflist|2}}, -{{reflist}} +{{reflist|2}})"
      }
    ]
  },
  {
    "pageid": 119777,
    "ns": 0,
    "title": "MEND-CABG II trial does not suggest improved outcomes with the novel drug MC1 in patients undergoing high risk coronary artery bypass surgery"
  },
  {
    "pageid": 120569,
    "ns": 0,
    "title": "MPP+"
  },
  {
    "pageid": 123474,
    "ns": 0,
    "title": "MDL Chime",
    "revisions": [
      {
        "revid": 678905,
        "parentid": 352108,
        "minor": true,
        "user": "WikiBot",
        "timestamp": "2012-08-09T17:05:45Z",
        "comment": "Robot: Automated text replacement (-{{SIB}} + & -{{EH}} + & -{{EJ}} + & -{{Editor Help}} + & -{{Editor Join}} +)"
      }
    ]
  },

So I hate to say it, but I think the wiki is misconfigured. It's returning an MPP+ page with no revisions, which leads to the errors later.

This page recommends running update.php, which I believe is part of the MediaWiki install. Can we reach out to the operators of the wiki to do that, and then try again?

@audiodude
Member

If we really need to, we can also filter pages with no revisions from the scrape.
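
Something along these lines, for example. This is a hypothetical sketch, not an existing mwoffliner option:

```ts
// Hypothetical filter: drop generator results that carry no revisions and log
// them, instead of letting them fail later with 'nosuchrevid'.
interface QueryPage {
  pageid: number;
  ns: number;
  title: string;
  revisions?: Array<{ revid: number }>;
}

function filterPagesWithRevisions(pages: QueryPage[]): QueryPage[] {
  return pages.filter((page) => {
    const hasRevision = Array.isArray(page.revisions) && page.revisions.length > 0;
    if (!hasRevision) {
      console.warn(`Skipping '${page.title}' (pageid ${page.pageid}): no revisions returned`);
    }
    return hasRevision;
  });
}
```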

@audiodude
Member

Now I'm getting this:

{
  "error": {
    "code": "parsoid-stash-rate-limit-error",
    "info": "Stashing failed because rate limit was exceeded. Please try again later.",
    "docref": "See https://www.wikidoc.org/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
  }
}

for this URL:

https://www.wikidoc.org/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=Lymphangiomyomatosis_surgery
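
If we want to tolerate this, one option would be to back off and retry while that specific error code comes back. This is only a sketch of the idea, not current mwoffliner behaviour:

```ts
// Sketch: retry the parse request with exponential backoff while the API keeps
// answering with 'parsoid-stash-rate-limit-error'.
async function parseWithBackoff(url: string, maxRetries = 5): Promise<any> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const body = await (await fetch(url)).json();
    if (body?.error?.code !== 'parsoid-stash-rate-limit-error') {
      return body; // success, or a different error for the caller to handle
    }
    const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
    console.warn(`Rate limited, retrying in ${delayMs} ms (attempt ${attempt + 1}/${maxRetries})`);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Still rate limited after ${maxRetries} retries: ${url}`);
}
```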

@audiodude
Member

See wikimedia/restbase#1140

@benoit74
Contributor Author

So I hate to say it, but I think the wiki is misconfigured. It's returning an MPP+ page with no revisions, which leads to the errors later.

Is it a wiki misconfiguration, or just slightly broken database content?

I have already managed to break database content on multiple occasions due to MediaWiki bugs.

Anyway, I really think the scraper should be capable of continuing past such errors, and only stop if too many errors occur. From my PoV it is sad not to put content offline just because the website has a few small issues and the website maintainer is no longer around / able to fix them. This happens way too often. And we cannot expect the scraper user to manually list, one by one, all the pages which turn out to have an issue. Or, at the very least, we should report all problematic articles at once and then fail the scrape, letting the user decide whether it is OK to ignore these articles (by adding them to the ignore list).

@kelson42
Collaborator

kelson42 commented Oct 15, 2024

It is tolerated that a whole article is missing (i.e. an HTTP 404). This scenario can happen at any time, because a user can delete an article at any moment.

What is not tolerated is the backend not delivering as it should (for example timeouts or HTTP 5xx errors).

But here the situation seems different and not that easy to assess.

@audiodude
Member

Here are some more examples from the action=query endpoint:

{
  pageid: 22273,
  ns: 0,
  title: "MLN64"
},
{
  pageid: 23910,
  ns: 0,
  title: "MSin3 interaction domain"
},
{
  pageid: 31260,
  ns: 0,
  title: "MHC restriction"
},

None of these entries have a revisions key, so I assume they would also error out. In fact, when I added code to skip pages with no revisions key and log a warning, I got hundreds and hundreds of warnings. So many that I thought my code was broken and that you can't count on data items from this request to have revisions. However, when I search wikidoc.org for these article titles, I get no results.

So even if we had a configurable limit of missing/broken articles, in this case we would likely exceed it anyway. I don't think mwoffliner can do much when the wiki in question is very broken.
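
For what it's worth, the kind of limit I mean would look something like this. It is purely hypothetical: neither the option nor the class exists in mwoffliner today.

```ts
// Purely hypothetical: what a "configurable limit of missing/broken articles"
// could look like.
class BrokenArticleBudget {
  private skipped: string[] = [];

  constructor(private readonly maxBroken: number) {}

  // Record a page skipped for a missing revision; abort once the budget is spent.
  recordSkip(title: string): void {
    this.skipped.push(title);
    if (this.skipped.length > this.maxBroken) {
      throw new Error(
        `Too many broken articles (${this.skipped.length} > ${this.maxBroken}), aborting the scrape`,
      );
    }
  }

  // Summarize at the end of the scrape so the user can build an ignore list.
  report(): void {
    console.warn(`${this.skipped.length} articles skipped for missing revisions:`);
    for (const title of this.skipped) console.warn(`  - ${title}`);
  }
}
```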

@kelson42
Collaborator

kelson42 commented Oct 17, 2024

We should definitely not skip articles without a revision.

At this stage we should understand why there is no revid.

If this is a feature, then we will have to handle it. AFAIK, and from a technical POV, we could make all requests without giving a revid... and then it will take the latest version. Again, this has to be confirmed.

If this is somehow a bug, we should stop the scraping process properly with a proper error.

@benoit74
Contributor Author

Looking at https://www.mediawiki.org/wiki/Manual:RevisionDelete, it seems entirely possible to completely hide/remove all revisions of a page. To be tested on a MediaWiki instance to confirm, of course.

@audiodude
Member

we could make all requests without giving a revid... and then it will take latest version. Again, this has to be confirmed.

We do not currently use the revision ID in the request. We do not extract it from the article list scrape either. The fact that it's missing is a symptom of a more fundamental problem, probably the one that @benoit74 pointed out.

I assume the practical reason is that they have articles that they don't want to be public facing, but that they don't want to delete either.

See this URL:

https://www.wikidoc.org/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=The_Living_Guidelines%3A_UA%2FNSTEMI_Recomendations_for_CABG_View_the_Current_CLASS_IIa_Guidelines

The error is:

{
  "error": {
    "code": "nosuchrevid",
    "info": "No current revision of title The Living Guidelines: UA/NSTEMI Recomendations for CABG View the Current CLASS IIa Guidelines.",
    "docref": "See https://www.wikidoc.org/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
  }
}
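
If we decide to skip such pages, one way to detect them up front would be to ask for the current revision ids of a title before issuing any parse request. This is only a sketch, not what mwoffliner does today:

```ts
// Sketch: a page without a current revision can be treated as "hidden" before
// any parse request is issued.
async function hasCurrentRevision(apiBase: string, title: string): Promise<boolean> {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'revisions',
    rvprop: 'ids',
    titles: title,
  });
  const data = await (await fetch(`${apiBase}?${params}`)).json();
  const page = data.query?.pages?.[0];
  return Boolean(page && !page.missing && page.revisions?.length);
}
```

Based on the query output above, this check should come back false for 'MPP+' and for the "Living Guidelines" title, even though their page ids exist.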

@audiodude
Member

So with that said, I believe we should skip articles with no revisions. They have no public facing pages and are not a tangible part of the wiki. They are essentially "hidden".

@benoit74
Contributor Author

So with that said, I believe we should skip articles with no revisions. They have no public facing pages and are not a tangible part of the wiki. They are essentially "hidden".

This makes sense to me, and it is not that different from a page returning a 404.

The more complex question is "what should we do when we encounter a link to a page with no revid?". But this could be tracked in a distinct issue, and it may even already be handled by the scraper.

@kelson42
Collaborator

I'm a bit surprised that the revid is not used at all; that is hard to believe for me.

The reason why the revid should not be totally ignored is that ultimately I want to be able to deal with it, see #982 or #2072 for example.

Here we need to confirm whether this is the consequence of manually deleting a revision. I somewhat doubt that.

My question would be: why, when we retrieve the whole list of article titles of the wiki, do we get these articles listed even though they are not available? If we are facing some kind of feature here, we should probably fix the problem at that point. If at this stage MediaWiki is not able to deliver a revid for an article we will scrape later, then we should maybe skip it... but we clearly need to understand why this could happen.

@audiodude
Member

So someone (probably me) has to spin up a MediaWiki instance, install the plugin for hiding revisions, and confirm that the JSON I posted above is what is returned in that case?

@audiodude
Member

So I did it: I installed a local MediaWiki instance and enabled revision deletion. I tried to delete all revisions of a page and got this:

[Screenshot: error shown by MediaWiki when attempting to delete all revisions of the page]

So, as we expected, the MediaWiki software requires every article to have at least one revision. These wikidoc pages must have been altered in some other way.

@kelson42
Collaborator

@audiodude Thanks for the effort, even if I'm not surprised by the conclusion. IMHO I should really have a look myself.
