OAS To Text Corpus

The National Institutes of Health (NIH) provides an excellent data source for text mining. It covers not only medical journals but also journals in other fields, from mathematics to chemistry. The purpose of this repo is to convert the PMC Open Access Subset (OAS) from its published format into the text corpus format we use, that is:

The full corpus consisting of one or more TXT files in a single folder
One or more articles in a single TXT file

Each article will have a header in the form:

--- {id} ---
--- {journal} ---
--- {title} ---

Each article will have its abstract and body extracted
One sentence per line
Paragraphs are separated by a blank line
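
The layout above is simple enough to read back with a few lines of Python. What follows is a minimal sketch, not part of the oas package. The function name parse_articles is hypothetical, and the sketch assumes that any line of the form "--- text ---" is a header line and that headers always arrive in groups of three (id, journal, title). A body sentence that happens to match that shape would be misread as a header.

```python
import re

# Assumption: header lines look exactly like "--- value ---".
HEADER = re.compile(r"^--- (.*) ---$")


def parse_articles(text: str) -> list[dict]:
    """Split corpus text into articles with id, journal, title, and paragraphs."""
    articles: list[dict] = []
    header: list[str] = []      # header values collected for the next article
    current: dict | None = None
    paragraph: list[str] = []   # sentences of the paragraph being read

    def flush() -> None:
        # Close the open paragraph, if any, and attach it to the current article.
        nonlocal paragraph
        if paragraph and current is not None:
            current["paragraphs"].append(paragraph)
        paragraph = []

    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            flush()  # a header line ends the previous article's last paragraph
            header.append(match.group(1))
            if len(header) == 3:
                current = {
                    "id": header[0],
                    "journal": header[1],
                    "title": header[2],
                    "paragraphs": [],
                }
                articles.append(current)
                header = []
        elif line == "":
            flush()  # a blank line separates paragraphs
        elif current is not None:
            paragraph.append(line)  # one sentence per line
    flush()
    return articles
```

Each returned article holds a list of paragraphs, and each paragraph is a list of sentences, which mirrors the one-sentence-per-line rule directly.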

Operation

Install

You can install the package using the following steps:

Install with pip from an admin prompt:

pip uninstall oas
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/oas.git

Or, if you have the code locally:

pip uninstall oas
python -OO -m pip install -v c:/repos/TextCorpusLabs/oas

Run

You are responsible for getting the source files. They can be found on this FTP site. You will need to further navigate into the three sub-folders: oa_comm, oa_noncomm, and oa_other. I recommend using FileZilla. I installed my copy using Chocolatey.

You are responsible for un-compressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.

You are responsible for this because the server the NIH keeps the files on is fickle and sometimes serves corrupted files. Those files need to be re-downloaded and re-verified, and the .tar file inside each .tar.gz archive needs to be verified too. OAS is also huge: as of 2024/03/25 it is almost 500 GB in .tar form. Make sure you have enough space before you start.
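
If you prefer to script the validation instead of checking each archive by hand, the standard library is enough. The sketch below is an illustration only and is not part of the oas package. The function names archive_is_readable and has_free_space are hypothetical. It assumes that "valid" means every member of the archive can be read to the end without an error, which catches truncated or corrupted downloads but does not compare against any checksum published by the NIH.

```python
import shutil
import tarfile
import zlib
from pathlib import Path


def archive_is_readable(path: Path) -> bool:
    """Return True if every file in a .tar or .tar.gz can be read to the end."""
    try:
        # "r:*" lets tarfile detect the compression (plain tar, gzip, ...).
        with tarfile.open(path, mode="r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                handle = archive.extractfile(member)
                # Read in 1 MiB chunks so large members do not fill memory.
                while handle.read(1024 * 1024):
                    pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        # Truncated or corrupted archives surface as one of these errors.
        return False


def has_free_space(folder: Path, required_bytes: int) -> bool:
    """Return True if the drive holding `folder` has at least `required_bytes` free."""
    return shutil.disk_usage(folder).free >= required_bytes
```

A typical use would loop over the download folder, call archive_is_readable on each file, and re-download anything that returns False before running the commands in this section.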

All the below commands assume the corpus is a folder of .tar files.

Extract the metadata from the corpus.

oas metadata -source c:/data/oas -dest c:/data/oas.meta.csv

The following are required parameters:

source is the folder containing the .tar'ed JATS files.
dest is the CSV file used to store the metadata.

The following are optional parameters:

log is the folder that receives the raw JATS files that failed to process. It defaults to empty (failed files are not saved).

Convert the data to our standard format.

oas convert -source c:/data/oas -dest c:/data/oas.std

The following are required parameters:

source is the folder containing the .tar'ed JATS files.
dest is the folder for the converted TXT files.

The following are optional parameters:

lines is the number of lines per TXT file. The default is 250000.
dest_pattern is the format of the TXT file name. It defaults to {source}.{id:04}.txt. source is the stem of the source file's name. id is a counter that increments each time the number of lines given by lines has been stored in a file.
log is the folder that receives the raw JATS files that failed to process. It defaults to empty (failed files are not saved).
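
To see what file names the default dest_pattern would produce, it helps to expand it by hand. The sketch below assumes the pattern follows Python's str.format rules, which the {id:04} syntax suggests but this README does not confirm. The function output_name and the source file example.tar are hypothetical, and whether the real counter starts at 0 or 1 is not stated here.

```python
from pathlib import PurePosixPath


def output_name(pattern: str, source_path: str, file_id: int) -> str:
    """Expand a dest_pattern-style string for one output file."""
    # The stem is the file name without its final extension.
    stem = PurePosixPath(source_path).stem
    # {id:04} pads the counter with zeros to at least four digits.
    return pattern.format(source=stem, id=file_id)


print(output_name("{source}.{id:04}.txt", "c:/data/oas/example.tar", 3))
# prints "example.0003.txt"
```

Under these assumptions, a source archive that yields more lines than the lines limit would be spread over several files that differ only in the zero-padded counter.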

Debug/Test

The code in this repo is set up as a module. Debugging and testing assume that the module is already installed. To debug (F5) or run the tests (Ctrl + ; Ctrl + A), install the module as editable (see below).

pip uninstall oas
python -m pip install -e c:/repos/TextCorpusLabs/oas
