Generating a Knowledge Graph of COVID-19 Literature

Section	Description
Installing	Installing the requirements
Downloading	Downloading the data
Preparing	Preparing the JSONs
Running	Running RML
Querying	Linked Data Fragments endpoint
Analyzing	Knowledge Graph Applications

Installing the requirements

Generating the COVID-KG requires a few dependencies. First, install Python 3 with the following libraries:

SPARQLWrapper
nltk
numpy
pandas
requests
scipy
scikit-learn (imported as sklearn)
tqdm
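
Before running the scripts, it can help to confirm that the libraries above are importable. The following is a minimal sketch and not part of the repository; the helper name `missing_modules` is made up for illustration, and the entries in `REQUIRED` are the import names of the libraries listed above.

```python
import importlib.util

# Import names of the libraries listed above (scikit-learn imports as sklearn).
REQUIRED = ["SPARQLWrapper", "nltk", "numpy", "pandas",
            "requests", "scipy", "sklearn", "tqdm"]


def missing_modules(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are importable.")
```

Any name that is reported as missing can then be installed with pip before continuing.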

Further, to run RML, you will need:

a recent version of Node.js, with the YARRRML parser installed globally: npm i @rmlio/yarrrml-parser -g
a recent Java version and the RMLMapper jar

Downloading the data

The dataset can be retrieved from Kaggle. After downloading the dataset, re-arrange the files to have the following directory structure:

data
|-metadata.csv
|-papers
  |-<PAPER_1>.json
  |-...
  |-<PAPER_N>.json
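
The layout above can be checked before any script is run. This is a small sketch under the assumption that the root directory is called data, as in the tree above; `check_layout` is a hypothetical helper, not something shipped with the repository.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems found in the expected data layout."""
    root = Path(root)
    problems = []
    if not (root / "metadata.csv").is_file():
        problems.append("metadata.csv is missing")
    papers = root / "papers"
    if not papers.is_dir():
        problems.append("papers directory is missing")
    elif not any(papers.glob("*.json")):
        problems.append("papers directory contains no .json files")
    return problems
```

Calling `check_layout("data")` returns an empty list when the directory matches the structure shown above.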

Preparing the JSONs

Create a sample dataset (OPTIONAL)

Make sure a directory called sample exists and contains a papers subdirectory. Then run python3 scripts/generate_sample_data.
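
The sampling script itself is not reproduced in this README. As a rough illustration of what drawing a sample could look like, the sketch below copies a random subset of paper files into a target directory; the function name `copy_sample`, its parameters, and the fixed seed are all assumptions and may differ from what the repository script does.

```python
import random
import shutil
from pathlib import Path


def copy_sample(src_dir, dst_dir, n, seed=0):
    """Copy up to n randomly chosen .json files from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(src.glob("*.json"))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(files, min(n, len(files)))
    for path in chosen:
        shutil.copy(path, dst / path.name)
    return [path.name for path in chosen]
```

For example, `copy_sample("data/papers", "sample/papers", 100)` would place at most 100 papers in sample/papers.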

Create bag of words of the content

To create a bag of words for title, abstract and body, run python3 scripts/create_bow.py <INPUT_DIR> <OUTPUT_DIR>. As an example, you could run: python3 scripts/create_bow.py sample output.
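
To make the idea concrete, here is a minimal bag-of-words sketch. It assumes that a paper JSON stores its title under metadata.title and its abstract and body as lists of objects with a text field; that layout, the simple regular-expression tokenizer, and the function names are assumptions for illustration, and scripts/create_bow.py may tokenize differently (for instance with nltk).

```python
import re
from collections import Counter

# Very simple tokenizer: runs of lowercase letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in a string."""
    return Counter(TOKEN.findall(text.lower()))


def paper_bow(paper):
    """Build one bag of words per section of a paper dictionary.

    Assumed layout: paper["metadata"]["title"] is a string, while
    paper["abstract"] and paper["body_text"] are lists of {"text": ...}.
    """
    title = paper.get("metadata", {}).get("title", "")
    abstract = " ".join(p.get("text", "") for p in paper.get("abstract", []))
    body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
    return {
        "title": bag_of_words(title),
        "abstract": bag_of_words(abstract),
        "body": bag_of_words(body),
    }
```

Keeping the three sections separate makes it possible to weight title words differently from body words later on.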

Mapping the string representations to known resources

Run python3 scripts/map_entities.py <INPUT_DIR> <OUTPUT_DIR> to generate different pickled dictionaries with the following structure: {string: URI}.
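
The {string: URI} dictionaries are ordinary Python dictionaries serialized with pickle. The sketch below shows how such a file could be written and read back; the helper names and the example entry are illustrative assumptions, not output of the script.

```python
import pickle


def save_mapping(mapping, path):
    """Write a {string: URI} dictionary to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(mapping, handle)


def load_mapping(path):
    """Read a {string: URI} dictionary back from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Illustrative entry only: a country name mapped to a DBpedia resource.
EXAMPLE = {"Belgium": "http://dbpedia.org/resource/Belgium"}
```

Only load pickle files that you generated yourself, since unpickling untrusted data can execute arbitrary code.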

Run python3 scripts/get_db_resources.py <INPUT_DIR> <OUTPUT_DIR> to get the DBpedia N-Triples files of the known resources. Here, <INPUT_DIR> is usually the <OUTPUT_DIR> of the previous script.
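
As a sketch of how the N-Triples description of a single resource could be retrieved, the code below relies on the DBpedia convention of serving data documents under http://dbpedia.org/data/<name>.ntriples. Treat that URL pattern and both function names as assumptions; the repository script may use a different mechanism, such as a SPARQL query.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def ntriples_url(resource_uri):
    """Turn a DBpedia resource URI into its N-Triples data URL."""
    parsed = urlparse(resource_uri)
    prefix = "/resource/"
    if parsed.netloc != "dbpedia.org" or not parsed.path.startswith(prefix):
        raise ValueError(f"not a DBpedia resource URI: {resource_uri}")
    name = parsed.path[len(prefix):]
    return f"http://dbpedia.org/data/{name}.ntriples"


def fetch_ntriples(resource_uri, timeout=30):
    """Download the N-Triples description of one resource (needs network)."""
    with urlopen(ntriples_url(resource_uri), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The URIs passed to `fetch_ntriples` would be the values of the pickled dictionaries from the previous step.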

Change the json representation and add links to known resources

To add the information of the known resources to each paper's JSON representation, run python3 scripts/country_institution_json.py <INPUT_DIR> <PICKLE_DIR> <OUTPUT_DIR>. This script iterates over all files in <INPUT_DIR> and adds the country and institution external links to their JSON dictionaries.
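
The enrichment step can be pictured as follows. This sketch assumes that each author entry under metadata.authors carries an affiliation object with an institution string and a location.country string, and it stores the links under the made-up keys institution_uri and country_uri; the real script may use other field names.

```python
def add_links(paper, country_uris, institution_uris):
    """Attach external links to each author's affiliation, in place."""
    for author in paper.get("metadata", {}).get("authors", []):
        affiliation = author.get("affiliation") or {}
        institution = affiliation.get("institution")
        country = (affiliation.get("location") or {}).get("country")
        # Only add a link when the string is a known key in the mapping.
        if institution in institution_uris:
            affiliation["institution_uri"] = institution_uris[institution]
        if country in country_uris:
            affiliation["country_uri"] = country_uris[country]
    return paper
```

Authors without an affiliation, or with strings that are not in the dictionaries, are left unchanged.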

Change the metadata csv and add links to known resources

To add the external links to the metadata.csv file, run python3 scripts/csv_transform.py <INPUT_DIR> <PICKLE_DIR> <OUTPUT_DIR>. This script adds a column containing the DBpedia link of the journal.
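
A minimal sketch of this transformation, using only the standard library, is shown below. It assumes that metadata.csv has a column named journal and writes the link to a new column called journal_uri; both column names and the function name are assumptions for illustration.

```python
import csv


def add_journal_links(in_path, out_path, journal_uris, column="journal_uri"):
    """Copy a CSV file and append a column with the journal's external link."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + [column]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Journals without a known link get an empty cell.
            row[column] = journal_uris.get(row.get("journal", ""), "")
            writer.writerow(row)
```

The `journal_uris` argument would be one of the pickled {string: URI} dictionaries from <PICKLE_DIR>.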

Running RML

After preparing the JSONs, we can convert them to RDF using RML. The script scripts/loop.py, run as python3 scripts/loop.py <INPUT_DIR> <JOBS>, shows how this transformation can be performed from Python by calling external commands:

yarrrml-parser -i rules.yml -o rules.rml.ttl
java -jar /path/to/rmlmapper.jar -m rules.rml.ttl

In this script, all JSON files from <INPUT_DIR> are first copied to the tmp/ folder, which is the source entry point defined by our YARRRML rules. You can change this location by changing the sources in the rules.yml file. The conversion can be executed in parallel; the <JOBS> parameter indicates how many threads can be used at the same time.
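
The parallel execution can be sketched with a thread pool that launches external commands. This is not the content of scripts/loop.py; the names `mapper_command` and `run_all` and the injectable `runner` argument are assumptions for illustration, while the -m and -o flags are the ones used in this README's rmlmapper commands.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def mapper_command(rules, output, jar="/path/to/rmlmapper.jar"):
    """Build an rmlmapper command line that writes to one output file."""
    return ["java", "-jar", jar, "-m", rules, "-o", output]


def run_all(commands, jobs, runner=subprocess.run):
    """Run every command, with at most `jobs` running at the same time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(runner, commands))
```

Threads are sufficient here because each worker only waits for an external Java process, so the Python interpreter is not the bottleneck.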

Analogously, metadata.csv and bow.json can be transformed to RDF using the corresponding yml files in the rml folder.

yarrrml-parser -i mapping-csv.yml -o csv.rml.ttl
java -jar /path/to/rmlmapper.jar -m csv.rml.ttl -o <DIR>/metadata.nt

yarrrml-parser -i mapping-bow.yml -o bow.rml.ttl
java -jar /path/to/rmlmapper.jar -m bow.rml.ttl -o <DIR>/bow.nt

Create KG

Executing all these rmlmapper commands results in a large set of .nt files. We combined all of them into a single file to represent the KG. Simply concatenate them using the following bash command:

for i in *.nt; do [ "$i" = kg.nt ] || cat "$i" >> kg.nt; done
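
The same concatenation can be done from Python, which fits the rest of the pipeline. The sketch below is an assumption-labelled alternative, not part of the repository: `concat_nt` is a hypothetical helper that overwrites the output file instead of appending to it, skips the output file itself, and makes sure every triple ends with a newline.

```python
from pathlib import Path


def concat_nt(directory, output_name="kg.nt"):
    """Concatenate all .nt files in a directory into one output file."""
    directory = Path(directory)
    target = directory / output_name
    # Exclude the output file so that a rerun does not read its own result.
    parts = sorted(p for p in directory.glob("*.nt") if p.name != output_name)
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                for line in src:
                    # N-Triples is line based: keep one triple per line.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
    return len(parts)
```

Because the output is rewritten on every call, running the function twice yields the same kg.nt rather than a file with duplicated triples.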

Linked Data Fragments endpoint

We are hosting an endpoint that can be used for querying here. The corresponding repository for this can be found here.

Knowledge Graph Applications

Citation

A paper on this work has been accepted to the resource track of ISWC 2020! Our paper will be made available soon. If you use the COVID-KG in scientific work, we would appreciate citations:

"Steenwinckel B., Vandewiele G., Rausch I., Heyvaert P., Taelman R., Colpaert P., Simoens P., Dimou A., De Turck F. and Ongenae F. Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph. In Proc. of 19th International Semantic Web Conference (ISWC), 2-6 November 2020 (accepted)"

or

@inproceedings{covid_kg,
  title={Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph},
  author={Bram Steenwinckel and Gilles Vandewiele and
          Ilja Rausch and Pieter Heyvaert and
          Ruben Taelman and Pieter Colpaert and
          Pieter Simoens and Anastasia Dimou and
          Filip De Turck and Femke Ongenae},
  booktitle={Accepted in Proc. of 19th International Semantic Web Conference (ISWC)},
  year={2020}
}

Acknowledgements

This has been a collaboration between a lot of people:

Bram Steenwinckel
Pieter Heyvaert
Michael Weyns
Anastasia Dimou
Pieter Colpaert
Ruben Taelman
Ruben Dedecker
Dylan Van Assche
Femke Ongenae
