This script helps create a long wordlist from the Wikimedia dumps, where articles of the Odia-language Wikipedia, word entries of the Odia Wiktionary and texts of the Odia Wikisource are published on a regular basis for community use and research. A wordlist is used in a range of Natural Language Processing (NLP) research. Common use cases include building a spell-check engine (or predictive text to help with input on mobile devices), compiling a dictionary, or even recording the pronunciation of words in a language. The original script was written by our friend T. Shrinivasan, who then guided OFDN's Subhashish Panigrahi during a session in adapting it to the needs of Odia.

Step 1: Download data dump

Download the Wikimedia dumps. You can find all the latest dumps at this link (look for "orwiki" for Odia Wikipedia, "orwiktionary" for Odia Wiktionary and "orwikisource" for Odia Wikisource).

Alternatively, you can download specific files for each project (for instance, if you want only the titles of Odia Wikipedia articles rather than their full content, or just the category names). Check here for Odia Wikipedia, here for Odia Wikisource, here for Odia Wiktionary. The folder named "latest" holds the latest dump, and the folder links listed above it hold some recent historical dumps.

After downloading, keep the file in a specific folder.
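If you prefer to script the download, the sketch below shows one way to do it in Python. It assumes the dumps are served from dumps.wikimedia.org under a "latest" folder per project and that the file names follow the pattern shown later in this README; the function names and the `suffix` parameter are made up for this illustration and are not part of the repository.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: base address of the public Wikimedia dump server.
DUMP_BASE = "https://dumps.wikimedia.org"


def dump_url(project, suffix="pages-articles-multistream.xml.bz2"):
    """Build the URL of a file in a project's "latest" dump folder."""
    return f"{DUMP_BASE}/{project}/latest/{project}-latest-{suffix}"


def download_dump(project, folder="."):
    """Stream the dump to disk so the whole file is never held in memory."""
    url = dump_url(project)
    target = Path(folder) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_dump("orwiki")` would then save the Odia Wikipedia dump in the current folder; pass "orwiktionary" or "orwikisource" for the other two projects.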

Step 2: Extract XML file

We use the example of Odia Wikipedia below on a Unix computer (Linux and macOS included), but the same process applies to a file from any other Wikimedia project. If you visit the folder link as explained above, you will see self-explanatory file names such as "orwiki-latest-pages-articles-multistream.xml.bz2". Download the file from the directory and extract (unzip) it. You can run the command below in your computer's terminal (on macOS, press Cmd+Space bar >> type "terminal" >> Enter).

Navigate to the folder where you have saved the dump file. If you have kept it in the "Wiki" subfolder inside the "Documents" folder, type cd Documents/Wiki in the terminal and press Enter, assuming you are in your home folder (where a new terminal window usually starts). If unsure, type pwd to see which folder you are in. Typing cd .. and pressing Enter takes you one folder up.

bunzip2 orwiki-latest-pages-articles-multistream.xml.bz2

This will create a new file called "orwiki-latest-pages-articles-multistream.xml". Rename it to "orwiki.xml" (use the command mv orwiki-latest-pages-articles-multistream.xml orwiki.xml).
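The extraction and renaming can also be done in one step from Python, which may help on systems without bunzip2. This is a minimal sketch using the standard bz2 module; the function name `extract_bz2` is hypothetical. Unlike bunzip2, which by default removes the compressed file after extracting, this version leaves the .bz2 file in place.

```python
import bz2
import shutil


def extract_bz2(source, target):
    """Decompress a .bz2 file to `target` in chunks, keeping the original."""
    # bz2.open also reads files made of several concatenated streams,
    # which is how the "multistream" dumps are built.
    with bz2.open(source, "rb") as compressed, open(target, "wb") as out:
        shutil.copyfileobj(compressed, out)
```

For example, `extract_bz2("orwiki-latest-pages-articles-multistream.xml.bz2", "orwiki.xml")` writes the extracted dump directly under the shorter name.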

Remove all English (Latin) characters that are not required

To create a wordlist you need Python installed (it is often pre-installed on modern Unix computers). Download this GitHub repository, either with the command line or as a ZIP file, and extract it if needed. Then copy the file called "create_wordlist.py" to the folder where you have the Wikimedia data dump.

Run in the terminal:

python create_wordlist.py
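To give an idea of what this step involves, here is a rough sketch of how Odia words could be separated from Latin characters and XML markup. It is not the code of create_wordlist.py; it assumes that a "word" is any run of characters from the Odia Unicode block (U+0B00 to U+0B7F), that the input is "orwiki.xml" and that the output is "unique_odia_words.txt". The function names are invented for this example.

```python
import re

# Runs of characters from the Odia Unicode block (U+0B00 to U+0B7F).
# Assumption: anything outside this block (Latin letters, markup,
# punctuation, ASCII digits) separates words.
ODIA_WORD = re.compile(r"[\u0B00-\u0B7F]+")


def unique_odia_words(lines):
    """Collect the distinct Odia-script words found in an iterable of lines."""
    words = set()
    for line in lines:
        words.update(ODIA_WORD.findall(line))
    return words


def write_wordlist(xml_path="orwiki.xml", out_path="unique_odia_words.txt"):
    """Read the dump line by line and write one unique word per line."""
    with open(xml_path, encoding="utf-8") as dump:
        words = unique_odia_words(dump)
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(words):
            out.write(word + "\n")
    return len(words)
```

Reading the dump line by line keeps memory use tied to the size of the wordlist rather than the size of the XML file. Note that this simple pattern also keeps Odia digits, since they live in the same Unicode block.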

BONUS 1: Count the total number of unique words

wc -l unique_odia_words.txt

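This prints the number of lines followed by the file name, for example 1200 unique_odia_words.txt. Assuming the file holds one word per line, that number is the count of unique words.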
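The same count can be obtained in Python, which is handy if you want to use the number in a further script. This sketch assumes the file holds one word per line; the function name `count_words` is hypothetical. It skips blank lines, so its result can differ slightly from wc -l, which counts newline characters.

```python
def count_words(path="unique_odia_words.txt"):
    """Count the non-empty lines of a wordlist (one word per line)."""
    with open(path, encoding="utf-8") as wordlist:
        return sum(1 for line in wordlist if line.strip())
```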

Sort the words alphabetically

Run the command

sort unique_odia_words.txt > unique_odia_sorted_words.txt

This writes the alphabetically sorted words to the file "unique_odia_sorted_words.txt" and leaves "unique_odia_words.txt" unchanged.
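A Python equivalent of the sorting step is sketched below; the function name `sort_wordlist` is hypothetical. It orders words by Unicode code point, which may differ from the locale-aware order that the sort command applies on your system, so treat the result as one possible ordering rather than the definitive alphabetical one.

```python
def sort_wordlist(source="unique_odia_words.txt",
                  target="unique_odia_sorted_words.txt"):
    """Write the words of `source` to `target`, sorted by code point."""
    with open(source, encoding="utf-8") as wordlist:
        words = [line.strip() for line in wordlist if line.strip()]
    words.sort()
    with open(target, "w", encoding="utf-8") as out:
        out.writelines(word + "\n" for word in words)
    return words
```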

ofdn/odia-wordlist-from-wikimedia-dump
