
The WP1 Selection tools gather and compile multiple indicators to produce selections of Wikipedia article subsets. They were created for the Wikipedia 1.0 project and complement the WP1 engine.

The results are made available at https://download.openzim.org/wp1.

Requirements

To run the tools, you need:

  • MANDATORY: a GNU/Linux system
  • MANDATORY: Internet access
  • MANDATORY: access to a Wikipedia database
  • OPTIONAL: access to the enwp10 rating database for Wikipedia in English

Context

Many Wikipedias, in different languages, have more than 500,000 articles. Even if we can provide offline versions of a reasonable size, this is still too much for many devices. That is why we need to build offline versions containing only a selection of the best articles.

Principle

This tool builds lists of key values (pageviews, links, ...) about Wikipedia articles and puts them in a directory. These key values are the only input available for building smart selection algorithms. For more details about the lists, read the README in the language-based directory.

Tools

  • build_biggest_wikipedia_list.sh gives you the list of all Wikipedias/languages with more than 500,000 entries.

  • build_selections.sh takes a language code ('en' for example) as its first argument and creates the directory with all the key values.

  • build_all_selections.sh builds and uploads the lists for all Wikipedias with more than 500,000 pages.

  • build_en_vital_articles_list.sh generates the list of vital articles of Wikipedia in English (https://en.wikipedia.org/wiki/Wikipedia:Vital_articles).

  • build_custom_selections.sh generates selections which need custom (non-standard) handling.

  • build_projects_lists.pl generates the lists for projects, with articles sorted by score in reverse order. Works only for Wikipedia in English.

  • build_translated_list.pl translates a list into the given language, based on the language links of Wikipedia in English and on local-language scores.
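
The 500,000-entry threshold mentioned for build_biggest_wikipedia_list.sh and build_all_selections.sh can be illustrated with a small filter. This is only a sketch: the function name filter_big_wikis, the two-column input format (language code, entry count) and the sample values are hypothetical, and the actual scripts may obtain and filter the counts differently.

```shell
# Print the language codes whose entry count is strictly above a threshold.
# Input (hypothetical format): one "<lang> <count>" pair per line on stdin.
# The threshold is the first argument and defaults to 500000.
filter_big_wikis() {
    awk -v min="${1:-500000}" '$2 + 0 > min + 0 { print $1 }'
}

# Placeholder codes and counts, for illustration only.
printf '%s\n' 'aa 750000' 'bb 499999' 'cc 500001' | filter_big_wikis 500000
```

With the placeholder input above, only the codes whose count exceeds the threshold are printed, one per line.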

Download

You can download the output of these scripts directly from download.kiwix.org/wp1/ using FTP, HTTP(S) or rsync.

If you only need the latest version, the following rsync-based command prints the most recent directory name for each wiki.

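# List the entries newest first, then print only the first (most recent) entry of each wiki.
for ENTRY in $(rsync --recursive --list-only download.kiwix.org::download.kiwix.org/wp1/ | tr -s ' ' | cut -d ' ' -f5 | grep wiki | grep -v '/' | sort -r)
do
    # Strip the _YYYY-MM date suffix to get the wiki name.
    RADICAL=$(echo "$ENTRY" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
    if [[ "$LAST" != "$RADICAL" ]]
    then
        echo "$ENTRY"
        LAST=$RADICAL
    fi
done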
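
The "keep only the newest entry per wiki" step of the loop above can be tried offline, without rsync. The sketch below is not part of the tools: the function name latest_per_wiki is hypothetical, the sample names are made up, and it assumes entries are named <wiki>_<YYYY-MM>, as the sed pattern suggests.

```shell
# Read entry names on stdin and print only the newest entry for each wiki.
# Assumes names of the form <wiki>_<YYYY-MM> (inferred from the sed pattern).
latest_per_wiki() {
    # Reverse sort so that, for each wiki, the most recent date comes first.
    LC_ALL=C sort -r | {
        last=""
        while IFS= read -r entry; do
            # Strip the date suffix to get the wiki name.
            radical=$(printf '%s\n' "$entry" | sed 's/_20[0-9][0-9]-[0-9][0-9]//g')
            # Print the entry only the first time its wiki name is seen.
            if [ "$last" != "$radical" ]; then
                printf '%s\n' "$entry"
                last=$radical
            fi
        done
    }
}

# Hypothetical sample names, for illustration only.
printf '%s\n' enwiki_2023-01 enwiki_2023-02 frwiki_2022-12 frwiki_2023-01 | latest_per_wiki
```

Because the list is sorted in reverse order first, comparing each name with the previous one is enough to detect the first, and therefore newest, entry of every wiki.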

VPS

To run it on a VPS with Docker:

docker run -d --name wp1_selection_tools \
  -v /srv/wp1_selection_tools/data:/data \
  -v /srv/wp1_selection_tools/.ssh/:/root/.ssh \
  -v /srv/wp1_selection_tools/replica.my.cnf:/root/replica.my.cnf \
  ghcr.io/openzim/wp1_selection_tools

License

GPLv3 or later, see LICENSE for more details.