
# Bayesian Learning of Latent Representations of Language Structures

## Requirements

- Python 3
  - numpy
  - scipy
- R (for missing data imputation)
  - missMDA (currently required, although the dependency would not be hard to remove)
  - NPBayesImpute (only for comparison)
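
If the dependencies are not yet installed, something like the following should work, assuming `pip` and a standard CRAN setup are available (adjust to your environment):

```sh
pip install numpy scipy
R -e 'install.packages(c("missMDA", "NPBayesImpute"))'
```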

## Preprocessing

### WALS

- Download `wals_language.csv.zip` from WALS (http://wals.info/) and extract it to obtain `data/wals/language.csv` (already included in this repository).

- Convert the CSV into two JSON files:

```sh
python format_wals.py ../data/wals/language.csv ../data/wals/langs.json ../data/wals/flist.json
```
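
As a quick sanity check, you can pretty-print the generated feature list with Python's standard `json.tool` module (the JSON layout itself is specific to this repository):

```sh
python -m json.tool ../data/wals/flist.json | head -n 20
```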
- Missing data imputation for initialization:

```sh
python -m mv.json2tsv ../data/wals/langs.json ../data/wals/flist.json ../data/wals/langs.tsv
R --vanilla -f mv/impute_mca.r --args ../data/wals/langs.tsv ../data/wals/langs.filled.tsv
python -m mv.tsv2json ../data/wals/langs.json ../data/wals/langs.filled.tsv ../data/wals/flist.json ../data/wals/langs.filled.json
```

TODO: Remove the dependency on missMDA as our model is now insensitive to initialization.

### Autotyp

- Suppose we are at `~/download`. First, download the database:

```sh
git clone git@github.com:autotyp/autotyp-data.git
```

or, if you do not have a GitHub account with SSH keys:

```sh
git clone https://github.com/autotyp/autotyp-data.git
```

- (Optional) For replicability, you may want to check out the same version we used (inside the cloned repository):

```sh
git checkout 98cae32c387bfe0c7fb1b7151070d834b120a0f1
```
- Convert the data into two JSON files:

```sh
mkdir -p ../data/autotyp
python format_autotyp.py ~/download/autotyp-data ../data/autotyp/langs.json ../data/autotyp/flist.json
```
- Missing data imputation for initialization:

```sh
python -m mv.json2tsv ../data/autotyp/langs.json ../data/autotyp/flist.json ../data/autotyp/langs.tsv
R --vanilla -f mv/impute_mca.r --args ../data/autotyp/langs.tsv ../data/autotyp/langs.filled.tsv
python -m mv.tsv2json ../data/autotyp/langs.json ../data/autotyp/langs.filled.tsv ../data/autotyp/flist.json ../data/autotyp/langs.filled.json
```
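
The three imputation commands are identical for WALS and Autotyp apart from the data directory, so a small shell helper can run the whole pipeline per dataset (a convenience sketch, not part of the repository):

```sh
# Hypothetical helper: json2tsv -> MCA imputation in R -> tsv2json.
impute() {
    local dir="../data/$1"
    python -m mv.json2tsv "$dir/langs.json" "$dir/flist.json" "$dir/langs.tsv"
    R --vanilla -f mv/impute_mca.r --args "$dir/langs.tsv" "$dir/langs.filled.tsv"
    python -m mv.tsv2json "$dir/langs.json" "$dir/langs.filled.tsv" "$dir/flist.json" "$dir/langs.filled.json"
}

impute wals
impute autotyp
```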

## Run the model

- Perform posterior inference. Adjust the hyperparameter settings as needed. Note that inference is extremely slow (1-2 hours per iteration for WALS with K=100) and scales linearly with K:

```sh
python train_mda.py --seed=10 --K=100 --iter=1000 --bias --hmc_epsilon=0.025 --maxanneal=100 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --output ../data/wals/mda_K100.pkl ../data/wals/langs.filled.json ../data/wals/flist.json
python train_mda.py --seed=10 --K=50 --iter=1000 --bias --maxanneal=100 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --output ../data/autotyp/mda_K50.pkl ../data/autotyp/langs.filled.json ../data/autotyp/flist.json
```
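
Given the runtime above, you may want to run training detached and keep a log, mirroring the `nice`/`tee` pattern used later in this README (an assumed convenience, not a repository requirement):

```sh
# Run WALS training in the background, logging stdout/stderr (same flags as above).
nohup nice -19 python train_mda.py --seed=10 --K=100 --iter=1000 --bias --hmc_epsilon=0.025 \
    --maxanneal=100 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if \
    --output ../data/wals/mda_K100.pkl \
    ../data/wals/langs.filled.json ../data/wals/flist.json \
    > ../data/wals/mda_K100.log 2>&1 &
```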
- Collect samples:

```sh
python sample_auto.py --seed=10 --a_repeat=5 --iter=100 ../data/wals/mda_K100.pkl.final - | bzip2 -c > ../data/wals/mda_K100.xz.json.bz2
python convert_auto_xz.py --burnin=0 --update --input=../data/wals/mda_K100.xz.json.bz2 ../data/wals/langs.filled.json ../data/wals/flist.json > ../data/wals/mda_K100.xz.merged.json
python sample_auto.py --seed=10 --a_repeat=5 --iter=100 ../data/autotyp/mda_K50.pkl.final - | bzip2 -c > ../data/autotyp/mda_K50.xz.json.bz2
python convert_auto_xz.py --burnin=0 --update --input=../data/autotyp/mda_K50.xz.json.bz2 ../data/autotyp/langs.filled.json ../data/autotyp/flist.json > ../data/autotyp/mda_K50.xz.merged.json
```
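
To verify that sampling produced output, you can peek at the compressed sample stream (the record format is internal to `sample_auto.py`):

```sh
bzcat ../data/wals/mda_K100.xz.json.bz2 | head -c 300
```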

## Evaluation of missing data imputation

Run 10-fold cross-validation for both datasets:

```sh
make -j 20 -f eval_mv.make DATATYPE=wals CV=10 MODEL_PREFIX=mda TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --hmc_epsilon=0.025 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if" mda
make -j 20 -f eval_mv.make DATATYPE=wals CV=10 MODEL_PREFIX=mda_dv TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --hmc_epsilon=0.025 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --drop_vs" mda
make -j 20 -f eval_mv.make DATATYPE=wals CV=10 MODEL_PREFIX=mda_dh TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --hmc_epsilon=0.025 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --drop_hs" mda
make -j 20 -f eval_mv.make DATATYPE=wals CV=10 MODEL_PREFIX=mda_oa TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --hmc_epsilon=0.025 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --only_alphas" mda
make -j 100 -f eval_mv.make al DATATYPE=wals CV=10
make -j 20 -f eval_mv.make DATATYPE=autotyp CV=10 MODEL_PREFIX=mda TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --norm_sigma=10.0 --gamma_scale=1.0 --resume_if" mda
make -j 20 -f eval_mv.make DATATYPE=autotyp CV=10 MODEL_PREFIX=mda_dv TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --drop_vs" mda
make -j 20 -f eval_mv.make DATATYPE=autotyp CV=10 MODEL_PREFIX=mda_dh TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --drop_hs" mda
make -j 20 -f eval_mv.make DATATYPE=autotyp CV=10 MODEL_PREFIX=mda_oa TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --norm_sigma=10.0 --gamma_scale=1.0 --resume_if --only_alphas" mda
make -j 100 -f eval_mv.make al DATATYPE=autotyp CV=10
```
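
The four runs per dataset differ only in `MODEL_PREFIX` and one extra training flag, so a loop can generate them; here is a sketch for WALS (the flag semantics are defined by `train_mda.py`):

```sh
# Variant name and its extra flag, separated by a colon (empty for the baseline).
for variant in "mda:" "mda_dv:--drop_vs" "mda_dh:--drop_hs" "mda_oa:--only_alphas"; do
    prefix="${variant%%:*}"
    extra="${variant#*:}"
    make -j 20 -f eval_mv.make DATATYPE=wals CV=10 MODEL_PREFIX="$prefix" \
        TRAIN_OPTS="--maxanneal=100 --iter=500 --bias --hmc_epsilon=0.025 --norm_sigma=10.0 --gamma_scale=1.0 --resume_if $extra" mda
done
```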

# Bayesian Analysis of Correlated Evolution Involving Multiple Discrete Features

## About

Yugo Murawaki. Analyzing Correlated Evolution of Multiple Features Using Latent Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pp. 4371-4382, Brussels, Belgium, 2018.

## Preprocessing

- Convert Glottolog trees:

```sh
python newick_tree.py ../data/glottolog/tree_glottolog_newick.txt ../data/glottolog/trees_all.pkl
```

- Combine WALS languages and Glottolog trees:

```sh
python merge_glottolog.py --npriors ../data/node_priors.json ../data/wals/langs.json ../data/glottolog/trees_all.pkl ../data/wals/trees_attached.pkl
```

## Train the model

- Run the main inference:

```sh
nice -19 python train_bin_ctmc.py --has_bias --resume_if --seed=0 --npriors ../data/node_priors.json ../data/wals/trees_attached.pkl ../data/wals/mda_K100.0.xz.merged.json ../data/wals/paramevo_K100.0.tree.pkl 2>&1 | tee -a ../data/wals/paramevo_K100.0.tree.log
```

- Collect samples:

```sh
nice -19 python train_bin_ctmc.py --iter=1100 --save_interval=10 --has_bias --resume ../data/wals/paramevo_K100.0.tree.pkl.final --seed=0 --npriors ../data/node_priors.json ../data/wals/trees_attached.pkl ../data/wals/mda_K100.0.xz.merged.json ../data/wals/paramevo_K100.0.tree_plus.pkl 2>&1 | tee -a ../data/wals/paramevo_K100.0.tree_plus.log
```

- Estimate CTMC parameters for the surface features:

```sh
nice -19 python train_surface_ctmc.py --seed=0 ../data/wals/paramevo_K100.0.tree.pkl.final ../data/wals/flist.json ../data/wals/mda_K100.0.xz.merged.json ../data/wals/paramevo_K100.0.surface_tree.pkl 2>&1 | tee ../data/wals/paramevo_K100.0.surface_tree.log
```

TODO: Clean up the Jupyter notebook used for further analysis and add it to the repository.