This project consists of a series of extract-transform-load (ETL) pipelines for adding common sense knowledge triples to the MOWGLI Common Sense Knowledge Graph (CSKG).
The CSKG is used by downstream applications such as question answering systems and knowledge graph browsers. The graph consists of nodes and edges serialized in KGTK edge format, which is a specialization of the general KGTK format.
ConceptNet serves as the core of the CSKG, and other sources such as Wikidata are linked to it. The majority of the predicates/relations in the CSKG are reused from ConceptNet.
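For orientation, a KGTK edge file is tab-separated, with `node1`, `label`, and `node2` as its core columns (an `id` column is also common). The row below is purely illustrative, not actual CSKG data; the `id` value in particular is just an example:

```
node1	label	node2	id
/c/en/dog	/r/IsA	/c/en/animal	/c/en/dog-/r/IsA-/c/en/animal
```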
From the current directory, create a virtual environment:

```
python3 -m venv venv
```

Then activate it. On Unix:

```
source venv/bin/activate
```

On Windows:

```
venv\Scripts\activate
```

Finally, install the dependencies:

```
pip install -r requirements.txt
```
The framework uses LevelDB for whole-graph operations such as duplicate checking.

On OS X:

```
brew install leveldb
CFLAGS=-I$(brew --prefix)/include LDFLAGS=-L$(brew --prefix)/lib pip install plyvel
```

On Linux:

```
pip install plyvel
```
The RDF loader can use the rdflib "Sleepycat" store if the `bsddb3` module is present.

On Linux:

```
pip install bsddb3
```
To run the unit tests, activate the virtual environment as above, then run:

```
pytest
```
To run the ETL pipelines, activate the virtual environment as above, then run:

```
python3 -m mowgli_etl.cli etl rpi_combined
```

This runs all of the available pipelines and combines their output.
The extract, transform, and load stages of the pipelines write data to the `data` directory. (The path to this directory can be changed on the command line.) The structure of the `data` directory is `data/<pipeline id>/<stage>`: for example, `data/swow/loaded` contains the final products of the `swow` pipeline.
The `rpi_combined` pipeline "loads" the outputs of the other pipelines into its `data/rpi_combined/loaded` directory in the CSKG CSV format.
The mowgli-etl code base consists of:
- a minimal bespoke framework for implementing ETL pipelines
- pipeline implementations for different data sources, such as the `swow` pipeline for the Small World of Words word association lexicon
A pipeline consists of:
- an extractor, inheriting from the `_Extractor` abstract base class
- a transformer, inheriting from the `_Transformer` abstract base class
- an optional loader (a `_Loader` subclass), which is usually not explicitly specified by pipelines; a default is provided instead
- a `_Pipeline` subclass that ties everything together
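As a minimal sketch, a new pipeline class might look like the following. The `_Pipeline` import path and constructor keywords are assumptions, and `ExampleExtractor`/`ExampleTransformer` are hypothetical; consult the `swow` pipeline for the actual conventions.

```python
# Hypothetical pipeline sketch; the import path and constructor
# signature of _Pipeline are assumptions, not the documented API.
from mowgli_etl.pipeline._pipeline import _Pipeline  # assumed import path

from .example_extractor import ExampleExtractor      # hypothetical extractor
from .example_transformer import ExampleTransformer  # hypothetical transformer


class ExamplePipeline(_Pipeline):
    def __init__(self, **kwds):
        super().__init__(
            extractor=ExampleExtractor(),
            id="example",  # the pipeline id used on the command line
            transformer=ExampleTransformer(),
            **kwds,
        )
```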
Running a pipeline with a command such as

```
python3 -m mowgli_etl.cli etl swow
```

initiates the following process, where `swow` is the pipeline id:
- Instantiate the pipeline by
  - finding a module named exactly `mowgli_etl.pipeline.swow.swow_pipeline` (or adapted from another pipeline id)
  - finding a subclass of `_Pipeline` declared in that module
  - instantiating that subclass with a few arguments from the command line as constructor parameters
- Call the `extract` method of the pipeline's `extractor`. See the docstring of `_Extractor.extract` for information on the contract of `extract`.
- Call the `transform` method of the pipeline's `transformer`, passing in the `**kwds` dictionary returned by `extract`. See the docstring of `_Transformer.transform` for more information.
- The `transform` method is a generator over a sequence of models, typically `KgEdge`s and `KgNode`s to add to the CSKG. This generator is passed to the loader, which iterates over it, loading data as it goes. For example, the default KGTK loader buffers nodes and appends edge rows to an output KGTK file. This loading process does not usually need to be handled by the pipeline implementations, most of which rely on the default loader. A sketch of this flow follows.
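In rough terms, the driver does something like the sketch below. The names and signatures here are illustrative only, not the actual CLI internals:

```python
def run_pipeline(pipeline, storage, force: bool, loader) -> None:
    """Illustrative control flow only; actual mowgli-etl internals and
    method signatures may differ."""
    extract_kwds = pipeline.extractor.extract(storage=storage, force=force)
    models = pipeline.transformer.transform(**extract_kwds)  # generator of KgNode/KgEdge models
    loader.load(models)  # hypothetical loader API: iterates over the generator, writing as it goes
```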
The code base makes heavy use of the following Python features, which contributors should be familiar with:
- Generators
- Type hints and the `typing` module, especially `NamedTuple` dataclasses
- Keyword-only arguments (`def f(*, x, y)`) and `**kwds` keyword variadic arguments
- Abstract base classes, abstract methods, and the `abc` module
- The `pytest` framework for unit testing
- The `pathlib` module
- Class methods
We follow PEP8 and the Google Python Style Guide, preferring the former where the two are inconsistent.
We encourage using an IDE such as PyCharm. Please format your code with Black before committing it; the formatter can be integrated into most editors to format on save.
Most code should be part of a class. There should be one class per file, and the file should be named after the class (`SomeClass` in `some_class.py`).
The `swow` pipeline is the best model for new pipelines.
Extractors typically work in one of two ways:
- Using pre-downloaded data that is committed to the per-pipeline `data` subdirectory. This is the best approach for smaller data sets that change infrequently.
- Downloading source data when the `extract` method is called. The data can be cached in the per-pipeline `data` subdirectory and reused if `force` is not specified. Cached data should be `.gitignore`d. Use an implementation of `EtlHttpClient` rather than using `urllib`, `requests`, or another HTTP client directly; this makes it easier to mock the HTTP client in unit tests.
The `extract` method receives a `storage` parameter that points to a `PipelineStorage` instance, which has the path to the appropriate subdirectory of `data`. Extractors should use this path (`storage.extracted_data_dir_path`) rather than trying to locate `data` directly, since the path to `data` can be changed on the command line. A sketch of an extractor following these conventions appears below.
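A minimal sketch of a download-and-cache extractor follows. `storage.extracted_data_dir_path` is the real attribute named above, but the `extract` signature shown and the elided download step are assumptions:

```python
# Hypothetical extractor sketch; the extract signature and the elided
# EtlHttpClient call are assumptions about the framework's API.
from pathlib import Path


class ExampleExtractor:  # a real extractor would inherit from _Extractor
    def extract(self, *, force: bool, storage, **kwds) -> dict:
        cached_file_path: Path = storage.extracted_data_dir_path / "source.csv"
        if force or not cached_file_path.exists():
            # Download via an EtlHttpClient implementation here
            # (call elided; see the framework for its actual API).
            ...
        return {"source_csv_file_path": cached_file_path}
```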
Once the data is available, the extractor must pass it to the transformer by returning a `**kwds` dictionary. This is typically done in one of two ways (both are sketched after this list):
- Returning `{"path_to_file": Path("the/file/path")}` from `extract`, so that `transform` is `def transform(self, *, path_to_file: Path)`. This is the preferred approach for large files.
- Reading the file in the extractor and returning `{"file_data": "..."}`, in which case `transform` is `def transform(self, *, file_data: str)` or similar. This is acceptable for small data.
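The class names below are hypothetical and the bodies elided, but the keyword names show the contract: the keys of the dictionary returned by `extract` must match the keyword-only parameters of `transform`.

```python
# Hypothetical sketch of the extract/transform keyword contract.
from pathlib import Path


class PathExtractor:
    def extract(self, **kwds) -> dict:
        # Approach 1: hand the transformer a file path (preferred for large files)
        return {"path_to_file": Path("the/file/path")}


class PathTransformer:
    def transform(self, *, path_to_file: Path):
        ...  # stream the file, yielding models as they are parsed


class DataExtractor:
    def extract(self, **kwds) -> dict:
        # Approach 2: read the file eagerly (acceptable for small data)
        return {"file_data": "..."}


class DataTransformer:
    def transform(self, *, file_data: str):
        ...  # parse the in-memory string, yielding models
```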
Given extracted data in one of the forms listed above, the transformer's task is to:
- parse the data in its source format
- create a sequence of `KgEdge` and `KgNode` models that capture the data
- yield those models
Transformers can be implemented in a variety of ways, as long as they conform to the `_Transformer` abstract base class. For example, in many implementations the top-level `transform` method delegates to multiple private helper methods or helper classes. It is easier to test the code if the logic of the transformer is broken up into relatively small methods that can be tested individually, rather than one large `transform` method with many branches.
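For example, a sketch of that decomposition (parsing details and model construction are elided, since `KgNode`/`KgEdge` fields are not documented here):

```python
# Sketch of a transformer broken into small, individually testable
# helper methods; KgNode/KgEdge construction is elided.
class ExampleTransformer:  # a real transformer would inherit from _Transformer
    def transform(self, *, file_data: str):
        for row in self.__parse_rows(file_data):
            yield self.__row_to_node(row)  # a KgNode
            yield self.__row_to_edge(row)  # a KgEdge

    def __parse_rows(self, file_data: str):
        for line in file_data.splitlines():
            yield line.split("\t")

    def __row_to_node(self, row):
        ...  # construct and return a KgNode from the row

    def __row_to_edge(self, row):
        ...  # construct and return a KgEdge from the row
```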
Note that `KgEdge` and `KgNode` have legacy factory classmethods (`.legacy` in both cases) corresponding to an older data model. These should not be used in new code. New code should instantiate the models directly or use one of the other factory classmethods as a convenience.
The `swow` pipeline tests in `tests/mowgli_etl_test/pipeline/swow` can be used as a model for how to test a pipeline. Familiarity with the `pytest` framework is necessary.
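A hypothetical test along those lines (the `ExampleTransformer` from the sketch above and the sample data are illustrative, not from the `swow` tests):

```python
# Hypothetical pytest sketch; pytest discovers test_-prefixed functions.
def test_transform():
    transformer = ExampleTransformer()  # stands in for the real transformer under test
    models = list(transformer.transform(file_data="dog\tanimal"))
    # Assert on the yielded KgNode/KgEdge models, e.g. counts and ids.
    assert len(models) > 0
```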
We use the GitHub flow with feature branches on this code base. Branches should be named after an issue in the issue tracker (e.g., `GH-###`) or otherwise linked to one. Please tag a staff person for code reviews, and re-tag when you have addressed the staff person's comments in the code or rebutted them in the PR. See the Google Code Review Developer Guide for more information on code reviews.
We use CircleCI for continuous integration. CircleCI runs the tests in `tests/` on every push to `origin`. Merging a feature branch is contingent on having adequate tests and all tests passing. We encourage test-driven development.
Recommended background reading:
- conceptnet.io and the ConceptNet paper, for understanding common sense knowledge graphs
- The Storks et al. survey "Recent Advances in Natural Language Inference"
- The Missing Semester of Your CS Education