This script helps create a long wordlist from the Wikimedia dumps, where articles of the Odia-language Wikipedia, word entries of the Odia Wiktionary and texts of the Odia Wikisource are published on a regular basis for community use and research. A wordlist is generally used in a range of Natural Language Processing (NLP) research. Some common use cases of a wordlist include creating a spell-check engine (or predictive text to help with input on mobile devices), building a dictionary, or even recording the pronunciation of words in a language. The original script was written by our friend T. Shrinivasan, who then guided OFDN's Subhashish Panigrahi during a session in adapting it to the needs of Odia.
Download the Wikimedia dumps. You can find all the latest dumps at this link (look for "orwiki" for Odia Wikipedia, "orwiktionary" for Odia Wiktionary and "orwikisource" for Odia Wikisource).
Alternatively, you can download specific files for each project (for instance, if you want only the titles of Odia Wikipedia articles rather than their full content, or just the category names). Check here for Odia Wikipedia, here for Odia Wikisource, here for Odia Wiktionary. The folder named "latest" holds the latest dump, and the folder links above it hold some recent historical dumps.
After downloading, keep the file in a specific folder.
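If you prefer to script the download, the following is a minimal Python sketch. It assumes the dumps are served from dumps.wikimedia.org under a "<project>/latest/" folder, following the file-name pattern of the Odia Wikipedia dump; the function names and the User-Agent string are hypothetical and only for illustration.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: base address of the public Wikimedia dump server.
DUMP_BASE_URL = "https://dumps.wikimedia.org"


def dump_url(project, suffix="pages-articles-multistream.xml.bz2"):
    """Build the address of a file in a project's "latest" dump folder."""
    return f"{DUMP_BASE_URL}/{project}/latest/{project}-latest-{suffix}"


def download_dump(project, folder="."):
    """Stream the dump to disk so the whole file never sits in memory."""
    url = dump_url(project)
    target = Path(folder) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical User-Agent; replace it with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "odia-wordlist-example/0.1"}
    )
    with urllib.request.urlopen(request) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling download_dump("orwiki", "Documents/Wiki") would then save the Odia Wikipedia dump into that folder; "orwiktionary" and "orwikisource" work the same way.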
We use the example of Odia Wikipedia below on a Unix-like computer (Linux and macOS included), but the same process applies to a file from any other Wikimedia project. If you visit the folder link as explained above, you will see self-explanatory file names such as "orwiki-latest-pages-articles-multistream.xml.bz2". Download the file from the directory and extract (unzip) it. You can use the commands below after opening your computer's terminal (on macOS, press Cmd+Space bar >> type "terminal" >> Enter).
Navigate to the folder where you saved the dump file. If you kept it in the "Wiki" subfolder inside the "Documents" folder and you are currently in your home folder, type
cd Documents/Wiki
in the terminal and press Enter. When you are unsure which folder you are in, type
pwd
to see it. Typing
cd ..
and pressing Enter takes you one folder up. Once you are in the right folder, extract the dump:
bunzip2 orwiki-latest-pages-articles-multistream.xml.bz2
This will create a new file called "orwiki-latest-pages-articles-multistream.xml". Rename it to "orwiki.xml" with the command
mv orwiki-latest-pages-articles-multistream.xml orwiki.xml
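If bunzip2 is not available on your computer, the extraction and renaming can also be done from Python. This is a minimal sketch using only the standard library; the function name extract_bz2 is hypothetical. Unlike bunzip2, it leaves the compressed file in place.

```python
import bz2
import shutil


def extract_bz2(source, target):
    """Decompress a .bz2 file into target, streaming it in chunks."""
    # bz2.open also reads files made of several concatenated streams,
    # which is how the "multistream" dumps are packaged.
    with bz2.open(source, "rb") as compressed, open(target, "wb") as out:
        shutil.copyfileobj(compressed, out)
```

For the file used in this guide, the call would be extract_bz2("orwiki-latest-pages-articles-multistream.xml.bz2", "orwiki.xml"), which writes the extracted data directly under the shorter name.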
To create a wordlist you will need to have Python installed (it comes pre-installed on many modern Unix-like computers). Download and extract this GitHub repository, either from the command line or as a ZIP file. Once it is unzipped, copy the file called "create_wordlist.py" to the folder where you have the Wikimedia data dump.
Run the script in the terminal, then count the lines in the wordlist it produces:
python create_wordlist.py
wc -l unique_odia_words.txt
This will show a result such as
1200 unique_odia_words.txt
where the number is the count of lines in the file, that is, the number of unique words.
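To give an idea of what a wordlist script of this kind does, here is a minimal sketch. It is not the actual create_wordlist.py from the repository: it assumes a word is any unbroken run of characters from the Odia (Oriya) Unicode block, U+0B00 to U+0B7F, and the function names are hypothetical.

```python
import re

# Assumption: a "word" is a run of characters from the Odia (Oriya)
# Unicode block, which spans U+0B00 to U+0B7F.
ODIA_WORD = re.compile(r"[\u0B00-\u0B7F]+")


def unique_odia_words(lines):
    """Collect every distinct run of Odia characters found in the lines."""
    words = set()
    for line in lines:
        words.update(ODIA_WORD.findall(line))
    return words


def write_wordlist(xml_path, out_path):
    """Scan the dump line by line and write one unique word per line."""
    with open(xml_path, encoding="utf-8") as dump:
        words = unique_odia_words(dump)
    with open(out_path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")
    # The returned count matches what "wc -l" reports for the output file.
    return len(words)
```

Reading the dump line by line keeps memory use low even for a large XML file. This simple pattern also has limits: Odia digits count as words, and words containing zero-width joiners are split, so a real script may need extra filtering.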
Run the command
sort unique_odia_words.txt > unique_odia_sorted_words.txt
This writes the words, in sorted order, to the file "unique_odia_sorted_words.txt".
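The order produced by the sort command can depend on your terminal's locale settings. If you want an order that is the same on every computer, a Python sketch like the one below sorts the words by Unicode code point instead; the function name sort_wordlist is hypothetical, and code-point order is not necessarily the dictionary order an Odia reader would expect.

```python
def sort_wordlist(in_path, out_path):
    """Read one word per line, sort by code point, and write the result."""
    with open(in_path, encoding="utf-8") as src:
        # Skip blank lines and drop the trailing newline of each word.
        words = [line.rstrip("\n") for line in src if line.strip()]
    words.sort()
    with open(out_path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")
    return words
```

For the files in this guide, the call would be sort_wordlist("unique_odia_words.txt", "unique_odia_sorted_words.txt").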