Kaggle has provided an excellent data source for COVID-19, courtesy of AI2. The purpose of this repo is to convert it from the given format into the normal text corpus format, i.e. one document per file, one sentence per line, and a blank line between paragraphs.
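As a rough illustration of the target layout, the sketch below renders one document with one sentence per line and a blank line between paragraphs. The function name format_document and the pre-split input (a list of paragraphs, each a list of sentences) are assumptions made for this example; the scripts in this repo may differ in detail.

```python
from typing import List


def format_document(paragraphs: List[List[str]]) -> str:
    """Render paragraphs (each a list of sentences) in the corpus layout."""
    blocks = []
    for sentences in paragraphs:
        # Drop empty sentences and put each remaining sentence on its own line
        lines = [s.strip() for s in sentences if s.strip()]
        if lines:
            blocks.append("\n".join(lines))
    # A single blank line separates paragraphs
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    doc = [["First sentence.", "Second sentence."], ["New paragraph."]]
    print(format_document(doc), end="")
```

Each document would then be written to its own file, so a corpus is simply a folder of such files.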
The following packages need to be installed. I recommend using Chocolatey.
if('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
refreshenv
choco install 7zip.install -y
choco install python3 -y
All scripts have been tested on Python 3.8.2.
The below modules are needed to run the scripts.
The scripts were tested on the noted versions, so YMMV.
Note: not all modules are required for all scripts.
If this is the first time running the scripts, the modules will need to be installed.
They can be installed by navigating to the ~/code folder, then using the below code.
- nltk 3.4.5
- progressbar2 3.47.0
pip install -r requirments.txt
python -c "import nltk;nltk.download('punkt')"
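Since the scripts were only tested on the versions noted above, it can help to compare them with what is actually installed. The following is a minimal sketch using the standard library's importlib.metadata (available from Python 3.8); the helper names installed_version and report are hypothetical and not part of this repo.

```python
from importlib import metadata  # in the standard library from Python 3.8

# Versions the scripts were tested with, taken from the module list
TESTED = {"nltk": "3.4.5", "progressbar2": "3.47.0"}


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(tested=TESTED):
    """Build one status line per package."""
    lines = []
    for name, wanted in tested.items():
        found = installed_version(name)
        if found is None:
            status = "not installed"
        elif found == wanted:
            status = "matches tested version"
        else:
            status = f"installed {found}, tested {wanted}"
        lines.append(f"{name}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

A version mismatch does not necessarily mean the scripts will fail; it only flags where behaviour might differ from the tested setup.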
The below document describes how to recreate the text corpus.
It assumes that a particular path structure will be used, but the commands can be modified to target a different directory structure without changing the code.
I am choosing the d:/covid19 directory because my d drive is big enough to hold everything.
- Clone this repo then open a shell to the ~/code directory.
- Retrieve the dataset by hand. Click on the download link, saving the file to a known location.
- Extract the data in-place with no folder structure.
- The e switch flattens the extract so the custom code does not need to recursively search the folder structure.
"C:/Program Files/7-Zip/7z.exe" e -od:/covid19/raw "d:/covid19/*.zip"
- Extract the meta-data.
This will create a single metadata.csv containing some useful information. In general this would be used as part of segmentation or as part of a MANOVA.
python extract_metadata.py -in d:/covid19/raw -out d:/covid19/metadata.csv
- Convert the raw JSON files into the normal folder corpus format.
This will create a text corpus folder at the given location (e.g. ./corpus) containing two subfolders, one for the abstract and one for the body. Some of the files provided by Kaggle are not full-text articles, i.e. they have an empty abstract or body. These incomplete files are filtered out of the final folders and noted in error.csv.
python convert_to_corpus.py -in d:/covid19/raw -out d:/covid19/corpus
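The filtering of incomplete files described in the conversion step can be sketched roughly as follows. This is not the code in convert_to_corpus.py. It assumes each raw JSON file stores its text under abstract and body_text keys, each a list of entries with a text field, which is an assumption about the Kaggle format rather than something stated here. The function names section_text and split_complete are hypothetical.

```python
import json
import pathlib


def section_text(document: dict, key: str) -> list:
    """Collect the non-empty paragraph texts stored under the given key."""
    entries = document.get(key) or []
    texts = ((entry.get("text") or "") for entry in entries)
    return [t.strip() for t in texts if t.strip()]


def split_complete(folder: pathlib.Path):
    """Separate usable files from those with an empty abstract or body."""
    complete, incomplete = [], []
    for path in sorted(folder.glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            document = json.load(handle)
        # Assumed keys: "abstract" and "body_text"
        has_abstract = bool(section_text(document, "abstract"))
        has_body = bool(section_text(document, "body_text"))
        if has_abstract and has_body:
            complete.append(path.name)
        else:
            incomplete.append(path.name)
    return complete, incomplete
```

With the path layout used above, split_complete(pathlib.Path("d:/covid19/raw")) would return the file names to convert and the file names to record as errors.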