Wikimedia is the driving force behind Wikipedia. They provide a monthly full backup of all the data on Wikipedia as well as their other properties. The purpose of this repo is to convert a Wikimedia dump from its published format into the text corpus format we use, i.e.:
- The full corpus consisting of one or more TXT files in a single folder
- One or more articles in a single TXT file
- Each article will have a header in the form "--- {id} ---"
- Each article will have its abstract and body extracted
- One sentence per line
- Paragraphs are separated by a blank line
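To make the layout concrete, here is a hypothetical sketch of what a slice of one output TXT file could look like (the article IDs and sentences are placeholders, not real Wikipedia content):

```
--- 12 ---
First sentence of the article's abstract.
Second sentence of the abstract.

First sentence of the body's opening paragraph.
Second sentence of the same paragraph.

--- 25 ---
First sentence of the next article's abstract.
```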
You can install the package using the following steps:
Using pip, install from an admin prompt:
pip uninstall wikimedia
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/wikimedia.git
or, if you have the code locally:
pip uninstall wikimedia
python -OO -m pip install -v c:/repos/TextCorpusLabs/wikimedia
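If you want to confirm the install took, pip's standard show command (not specific to this project) reports the installed version and location:

```
pip show wikimedia
```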
You are responsible for getting the source files. They can be found at this site. You will need to navigate further into the particular wiki you want to download.
You are responsible for un-compressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.
The reason these steps are left to you is that the dump is a single MASSIVE file. Sometimes Wikimedia will be busy and the download will be slow; modern browsers support resuming a download for exactly this case. As of 2023/01/22 the dump is over 90 GB in .xml form, so make sure you have enough disk space before you start.
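As one example of the uncompress step, 7-Zip's command line client (7z) can extract the archive in place; the file name below is illustrative and will differ depending on the dump you downloaded:

```
# 'x' extracts the .xml next to the archive
7z x enwiki-latest-pages-articles.xml.bz2
```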
All of the commands below assume the corpus is an extracted .xml file.
- Extracts the metadata from the corpus.

wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv

The following are required parameters:

- `source` is the .xml file sourced from Wikimedia.
- `dest` is the CSV file used to store the metadata.

The following are optional parameters:

- `log` is the folder of raw XML chunks that did not process. It defaults to empty (not saved).
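For example, assuming the optional parameter uses the same single-dash flag style as source and dest, a run that also keeps the failed XML chunks might look like this (the log folder path is illustrative):

```
wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv -log d:/data/wiki/enwiki.log
```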
- Convert the data to our standard format.

wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std

The following are required parameters:

- `source` is the .xml file sourced from Wikimedia.
- `dest` is the folder for the converted TXT files.

The following are optional parameters:

- `lines` is the number of lines per TXT file. The default is 1000000.
- `dest_pattern` is the format of the TXT file name. It defaults to `wikimedia.{id:04}.txt`. `id` is an increasing value that increments after `lines` lines are stored in a file.
- `log` is the folder of raw XML chunks that did not process. It defaults to empty (not saved).
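As with the metadata command, and again assuming the optional parameters follow the same single-dash flag style, a fully specified call might look like the sketch below (the values are illustrative):

```
wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std -lines 500000 -dest_pattern "enwiki.{id:04}.txt" -log d:/data/wiki/enwiki.log
```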
The code in this repo is set up as a module. Debugging and testing are based on the assumption that the module is already installed. In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).
pip uninstall wikimedia
python -m pip install -e c:/repos/TextCorpusLabs/wikimedia
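Once the editable install is in place, one way to sanity check it from a plain terminal (outside the editor shortcuts above) is to run Python's built-in test discovery from the repo root; this assumes the tests follow unittest-style discovery conventions:

```
cd c:/repos/TextCorpusLabs/wikimedia
python -m unittest discover
```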
Below is the suggested text to add to the "Methods and Materials" section of your paper when using this process. The references can be found here.
The 2022/10/01 English version of Wikipedia [@wikipedia2020] was downloaded using Wikimedia's download service [@wikimedia2020]. The single-file data dump was then converted to a corpus of plain text articles using the process described in [@wikicorpus2020].