
multiParamNewsBenchmark

Introduction

This repository hosts a semi-synthetic benchmark based on the UCI NY Times corpus (Newman, 2008) and the simulation described in "Learning Representations for Counterfactual Inference" [1].

The benchmark realizes an arbitrary number of binary or parametric (continuous) treatment options. Its covariates are based on real-world data, but the treatment assignments and outcomes are simulated.

BEFORE RUNNING

Before simulating the data, you need to download the NY Times corpus ( https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/ ) and place it in the 'data' folder. Download both the docword and vocab files.
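For convenience, the two corpus files can be fetched with a short script along the lines of the sketch below. The file names docword.nytimes.txt.gz and vocab.nytimes.txt are the ones in the linked UCI directory; check preprocess.py and config.py for the exact paths expected and whether the docword file should be unpacked first.

```python
import os
import urllib.request

# UCI Bag of Words base URL (from the link above); the NYT corpus consists of
# a compressed docword file and a plain-text vocab file.
BASE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/"
FILES = ["docword.nytimes.txt.gz", "vocab.nytimes.txt"]

os.makedirs("data", exist_ok=True)
for name in FILES:
    target = os.path.join("data", name)
    if not os.path.exists(target):
        print("Downloading", name)
        urllib.request.urlretrieve(BASE_URL + name, target)
```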

How to run

After downloading the corpus, run preprocess.py to generate an LDA representation of the data.
Then run simulate.py to generate the simulated dataset.
The Jupyter notebook analysis.ipynb is provided as a starting point to explore the dataset.
The config.py file contains options for the simulation process, among them how many treatments should be simulated and how many datasets should be created.
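As an illustration only, the kind of settings exposed there might look like the snippet below. These names are hypothetical; consult config.py for the actual option names and defaults.

```python
# Hypothetical illustration of the kind of options found in config.py.
# The real option names and defaults are defined in the repository's config.py.
N_TREATMENTS = 4                # number of treatment options besides control
N_DATASETS = 10                 # number of independent simulation runs / result files
N_TOPICS = 50                   # dimensionality of the LDA topic space Z
TREATMENT_TYPES = [0, 1, 1, 1]  # 0 = binary, 1 = parametric, one entry per treatment
```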

Result files

For each simulated dataset, a separate file is created and the simulation outcome is saved as a dictionary in a numpy file containing the following keys:
'centroids_z': A list of the treatment centroids in topic space.
'centroids_x': Same in word space.
'z': All sampled documents in topic space. [n_samples, n_topics]
'x': Same in word space. [n_samples, n_words]
't': The treatment given for each sample. 0 represents the control group. [n_samples]
'mu': The deterministic (true) outcome based on z and t for each sample and treatment. [n_samples, n_treatments]
'y': The measured (noisy) outcome based on mu + noise for each sample and treatment. [n_samples, n_treatments]
's': The treatment 'strength' for each sample and treatment. It is 1 for binary treatments and in [0,1] for parametric treatments. [n_samples, n_treatments]
'treatment_types': A boolean list signifying whether a treatment is binary (0) or parametric (1). [n_treatments]
To allow a more accurate counterfactual error to be calculated for the parametric treatment options, additional counterfactual samples are provided in:
'mu_pcf': [n_samples, n_parametric_treatments, n_additional_samples]
'y_pcf': [n_samples, n_parametric_treatments, n_additional_samples]
's_pcf': Uniform random numbers in [0,1]. [n_samples, n_parametric_treatments, n_additional_samples]

In general, you only need x, t, y, s, y_pcf, and s_pcf to train and evaluate your model.
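A result file can be loaded as in the sketch below. The file name is a placeholder for whatever simulate.py produces in your configuration; the dictionary keys are the ones listed above.

```python
import numpy as np

# Load one simulated dataset; the file name here is a placeholder for the
# actual output of simulate.py in your configuration.
data = np.load("data/simulation_0.npy", allow_pickle=True).item()

x = data["x"]          # word-count covariates, [n_samples, n_words]
t = data["t"]          # assigned treatment per sample (0 = control), [n_samples]
y = data["y"]          # noisy outcomes, [n_samples, n_treatments]
s = data["s"]          # treatment strengths, [n_samples, n_treatments]
y_pcf = data["y_pcf"]  # extra counterfactual outcomes for parametric treatments
s_pcf = data["s_pcf"]  # corresponding uniformly sampled strengths

print(x.shape, t.shape, y.shape, s.shape, y_pcf.shape, s_pcf.shape)
```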

Note: The data can currently also be saved in a binary format. For this, multi-dimensional matrices are flattened to 2D and the file format changes; using this option is not recommended so far.

Generation process

As in [1] the NYT corpus is used as a basis for the simulation. Each unit x represents a document as a vector of word counts.
On this data, LDA is run to generate a vector z in topic space Z for each document. The dimensionality of Z is a parameter.
The vectors in X are reduced to keep only the dimensions corresponding to the union of the most probable words in the topics identified by LDA.
For each treatment, a centroid is defined in Z space. For t=0 it is the mean vector of all documents. For all others it is a randomly chosen document.
The probability of treatment assignment for each unit and treatment depends on the parameter k and the proximity of the unit to the treatment centroids in Z space (see the sketch below).
The treatment strength is determined as a function of the treatment and the proximity to the treatment centroid.
The treatment outcome is based on treatment, treatment strength, and proximity to treatment centroid.
Both outcome y and treatment strength s are augmented with random noise.
For each simulation run, centroids are rerandomized and treatment assignments and outcomes are calculated.
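One way to realize such an assignment rule, in the spirit of [1], is a softmax over similarities to the centroids scaled by k. This is only an illustrative sketch; the exact functional form used by simulate.py may differ.

```python
import numpy as np

def assignment_probabilities(z, centroids, k):
    """Illustrative assignment rule in the spirit of [1]: the probability of each
    treatment grows with the unit's similarity to that treatment's centroid in
    topic space, with k controlling the strength of the assignment bias.
    This is a sketch; the exact functional form lives in simulate.py.

    z:         [n_samples, n_topics] documents in topic space
    centroids: [n_treatments, n_topics] treatment centroids (index 0 = control)
    k:         scalar assignment-bias parameter
    """
    similarity = z @ centroids.T                 # [n_samples, n_treatments]
    logits = k * similarity
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```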

References

[1] Fredrik D. Johansson, Uri Shalit & David Sontag. Learning Representations for Counterfactual Inference. 33rd International Conference on Machine Learning (ICML), June 2016.
