This repository contains code and data for the paper "Relational Data Embeddings for Feature Enrichment with Background Information".
1) The folder `KEN` contains the implementation of our approach, KEN, as described in the paper. It includes:
- `KEN/models/entity_embedding`: classes for the TransE, DistMult and MuRE knowledge-graph embedding models (based on the PyKEEN package).
- `KEN/models/numerical_embedding`: classes implementing our approach (a linear layer with ReLU activation, see `linear2.py`) and a binning approach to embed numerical values (`binning.py`). A minimal sketch of the linear approach follows this list.
- `KEN/sampling/pseudo_type.py`: an adaptation of PyKEEN's `PseudoTypedNegativeSampler`, which replaces head entities with a random entity occurring in the same relation.
- `KEN/training/hpp_trainer.py`: a class to train embedding models with or without KEN, possibly with multiple hyperparameter configurations. It also measures the time and memory needed for training, and saves the results in a .parquet file.
- `KEN/baselines/dfs.py`: a class to perform Deep Feature Synthesis using the implementation from featuretools. It also measures the time/memory needed and the number of generated features.
- `KEN/evaluation/prediction_scores.py`: a set of functions to compute the cross-validation scores of embeddings / deep features on a target dataset.
- `KEN/dataloader/dataloader.py`: a class to load triples in the .npy format and convert them to a `TriplesFactory` object that can be used by PyKEEN.
- `KEN/dataloader/make_triples.py`: a function that takes tables/knowledge graphs as input and turns them into a set of triples saved in the .npy format (see the second sketch below).
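For intuition, here is a minimal sketch of embedding numerical values with a linear layer followed by a ReLU, the idea behind `KEN/models/numerical_embedding`. This is an illustration, not the actual `linear2.py` code; the class name and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LinearNumericalEmbedding(nn.Module):
    """Embed a scalar numerical value into the embedding space.

    Sketch of the idea: a learned linear map from the value to a
    d-dimensional vector, followed by a ReLU non-linearity.
    """

    def __init__(self, embedding_dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(1, embedding_dim)
        self.activation = nn.ReLU()

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: shape (batch,) of numerical attribute values
        return self.activation(self.linear(values.unsqueeze(-1)))

# Embed three numerical values into a 64-dimensional space.
embedder = LinearNumericalEmbedding(embedding_dim=64)
embeddings = embedder(torch.tensor([0.5, 1.2, -3.0]))
print(embeddings.shape)  # torch.Size([3, 64])
```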
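Second sketch: the dataloading round-trip described above, i.e. saving (head, relation, tail) triples in .npy format and loading them into a PyKEEN `TriplesFactory`. The example triples and file path are hypothetical; the actual `make_triples.py` / `dataloader.py` may differ in detail.

```python
import numpy as np
from pykeen.triples import TriplesFactory

# Hypothetical triples extracted from a table of counties.
rows = [("Autauga_County", "hasState", "Alabama"),
        ("Autauga_County", "hasPopulation", "55869")]

# Save the triples as an (n, 3) array of strings in .npy format.
triples = np.array(rows, dtype=str)
np.save("triples.npy", triples)

# Load them back and build a TriplesFactory that PyKEEN models can consume.
loaded = np.load("triples.npy")
factory = TriplesFactory.from_labeled_triples(loaded)
print(factory.num_entities, factory.num_relations)
```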
2) The folder `experiments` contains the datasets and the code to run our experiments.
- `experiments/model_training`: code to train embedding models (TransE, DistMult, MuRE, RDF2Vec), save them as checkpoints during training, and store metadata about the checkpoints (parameters, time/memory complexity) in a .parquet file.
- `experiments/deep_feature_synthesis`: code to perform Deep Feature Synthesis, save the generated features, and store metadata (time/memory complexity, number of features) in a .parquet file.
- `experiments/manual_feature_engineering`: code to manually build features and store them in .parquet files.
- `experiments/prediction_scores`: code to compute the cross-validation scores of all methods under study, and store the results (scores, time complexity) in .parquet files. A sketch of such a scoring step follows this list.
- `experiments/attribute_reconstruction`: code to compute cross-validation scores when reconstructing entities' numerical attributes (e.g. county population) from their embeddings. The results are stored in a .parquet file.
- `experiments/embedding_visualization`: code to visualize in 2D the MuRE and MuRE + KEN embeddings trained on YAGO3.
- `experiments/results_visualization`: a set of functions to visualize the results of the experiments.
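To illustrate the kind of evaluation run in `prediction_scores` and `attribute_reconstruction`, here is a hedged sketch of scoring entity embeddings as features with scikit-learn. The function, variable names, and estimator choice are assumptions, not the exact code from the repository.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def embedding_cv_score(target: pd.DataFrame, embeddings: dict) -> float:
    """Cross-validated R2 of a model predicting the target from embeddings.

    target: DataFrame with columns "entity" and "target" (assumed layout).
    embeddings: maps each entity name to its embedding vector.
    """
    # Build the feature matrix by looking up each entity's embedding.
    X = np.stack([embeddings[e] for e in target["entity"]])
    y = target["target"].to_numpy()
    scores = cross_val_score(HistGradientBoostingRegressor(), X, y,
                             cv=5, scoring="r2")
    return scores.mean()
```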
3) The datasets used in our experiments are available here in the form of a zip file. The unzipped `datasets` folder should be placed in `experiments`. For each dataset xxx, `experiments/datasets/xxx` contains:
- a file `target.parquet` that contains the entities of interest (e.g. counties, cities) and the target to predict.
- a folder `triplets` that contains the training triples in .npy format and their metadata.
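Assuming this layout, a dataset can be inspected as follows; `xxx` stands for an actual dataset name, and the file name inside `triplets` is hypothetical.

```python
import numpy as np
import pandas as pd

# Entities of interest and the prediction target.
target = pd.read_parquet("experiments/datasets/xxx/target.parquet")
print(target.head())

# Training triples as an (n, 3) array of (head, relation, tail).
# The exact file name inside "triplets" is an assumption.
triples = np.load("experiments/datasets/xxx/triplets/triples.npy")
print(triples.shape)
```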
4) To reproduce our experiments:
- Install KEN using the `setup.py` file.
- Run the experiments (in order: `model_training`, `deep_feature_synthesis`, then `prediction_scores` and `attribute_reconstruction`).
- To avoid re-running the experiments, we provide the result files used in the paper. You can visualize them with the functions from `experiments/results_visualization/results_visualization.py`.
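Since the result files are .parquet, they can also be inspected directly with pandas before using the plotting helpers; the path below is a placeholder, not an actual file from the repository.

```python
import pandas as pd

# Load one of the provided result files (placeholder path).
results = pd.read_parquet("path/to/some_results.parquet")
print(results.columns.tolist())
print(results.head())
```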