The BREAK dataset is stored in BREAK=$QDMR_ROOT/data/break
.
We use the logical form files - QDMRs (those are started as all subsets of BREAK logical forms that are built for SPIDER):
$BREAK/logical-forms/dev_spider.csv
$BREAK/logical-forms/train_spider.csv
The script $ROOT/qdmr2sparql/fix_qdmr_static.py
looks only at QDMRs and checks them according to simple rules.
Usage (save new dataset files to $BREAK/logical-forms-fixed
):
export QDMR_ROOT=`pwd`
export PYTHONPATH=$QDMR_ROOT:$QDMR_ROOT/qdmr2sparql:$PYTHONPATH
export BREAK=$QDMR_ROOT/data/break
export GRND_PATH=$BREAK/groundings
mkdir -p $BREAK/logical-forms-fixed
mkdir -p $GRND_PATH
python $QDMR_ROOT/qdmr2sparql/fix_qdmr_static.py --qdmr_path $BREAK/logical-forms --dev --output_path $BREAK/logical-forms-fixed/dev_spider.csv
python $QDMR_ROOT/qdmr2sparql/fix_qdmr_static.py --qdmr_path $BREAK/logical-forms --output_path $BREAK/logical-forms-fixed/train_spider.csv
The script $ROOT/qdmr2sparql/get_qdmr_grounding_from_sql.py
looks at both QDMR and SQL from SPIDER and computes grounding.
Usage:
python $QDMR_ROOT/qdmr2sparql/get_qdmr_grounding_from_sql.py --dev --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --output_path $GRND_PATH/grnd_partial_dev.json | tee $GRND_PATH/grnd_partial_dev_log.txt
python $QDMR_ROOT/qdmr2sparql/get_qdmr_grounding_from_sql.py --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --output_path $GRND_PATH/grnd_partial_train.json | tee $GRND_PATH/grnd_partial_train_log.txt
SQL queries are corrected in the following files:
- $QDMR_ROOT/data/spider/dev.json - for the dev subset
To reparse the SQL run this command:
PYTHONPATH=$QDMR_ROOT/spider:$PYTHONPATH python $QDMR_ROOT/qdmr2sparql/parse_raw_json.py
python $QDMR_ROOT/qdmr2sparql/fix_databases.py --spider_path $QDMR_ROOT/data/spider
The script $QDMR_ROOT/qdmr2sparql/get_complete_qdmr_groundings.py
looks at both QDMR and SQL from SPIDER and computes grounding.
Usage (uses partial groundings stored in $GRND_PATH
):
python $QDMR_ROOT/qdmr2sparql/get_complete_qdmr_groundings.py --dev --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --input_grounding $GRND_PATH/grnd_partial_dev.json --output_path $GRND_PATH/grnd_complete_dev.json | tee $GRND_PATH/grnd_complete_dev_log.txt
# runs < 1 min
python $QDMR_ROOT/qdmr2sparql/get_complete_qdmr_groundings.py --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --input_grounding $GRND_PATH/grnd_partial_train.json --output_path $GRND_PATH/grnd_complete_train.json | tee $GRND_PATH/grnd_complete_train_log.txt
The script $ROOT/qdmr2sparql/select_grounding_with_sparql.py
looks at both QDMR and SQL from SPIDER and computes grounding.
Usage (uses groundings stored in $GRND_PATH
):
python $QDMR_ROOT/qdmr2sparql/select_grounding_with_sparql.py --dev --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --input_grounding $GRND_PATH/grnd_complete_dev.json --output_path $GRND_PATH/grnd_list_positive_dev.json --output_path_all $GRND_PATH/grnd_list_all_dev.json | tee $GRND_PATH/grnd_list_dev_log.json
# this will run forever: need to run this in parallel
python $QDMR_ROOT/qdmr2sparql/select_grounding_with_sparql.py --qdmr_path $BREAK/logical-forms-fixed --spider_path $QDMR_ROOT/data/spider --input_grounding $GRND_PATH/grnd_complete_train.json --output_path $GRND_PATH/grnd_list_positive_train.json --output_path_all $GRND_PATH/grnd_list_all_train.json | tee $GRND_PATH/grnd_list_train_log.json
# to run this on the HSE cluster in parallel use these scripts:
PYTHONPATH=$QDMR_ROOT/utils/cluster:$PYTHONPATH python $QDMR_ROOT/utils/cluster/launcher_select_groundings_train.py --slurm --num-gpus 0 --num-cpus 1 --timeout 100
# merging results from all jobs:
python $QDMR_ROOT/utils/cluster/launcher_select_groundings_train_collect.py --output_path $GRND_PATH/grnd_list_positive_train.json --output_path_all $GRND_PATH/grnd_list_all_train.json
# Overall, read 4350 positives (63%) and 6921 examples
The script $ROOT/qdmr2sparql/fix_qdmr_static.py
looks only at QDMRs and checks them according to simple rules.
Usage (save new dataset files to $BREAK/logical-forms-fixed/logical-forms
):
export QDMR_ROOT=`pwd`
export PYTHONPATH=$QDMR_ROOT:$QDMR_ROOT/qdmr2sparql:$PYTHONPATH
mkdir -p $BREAK/logical-forms-fixed
python $QDMR_ROOT/qdmr2sparql/fix_qdmr_static.py --qdmr_path $BREAK/logical-forms --dev --full_break --output_path $BREAK/logical-forms-fixed/dev.csv
python $QDMR_ROOT/qdmr2sparql/fix_qdmr_static.py --qdmr_path $BREAK/logical-forms --full_break --output_path $BREAK/logical-forms-fixed/train.csv
The script $ROOT/qdmr2sparql/get_qdmr_grounding_from_break.py
looks at both QDMR and SQL from SPIDER and computes grounding.
Usage:
export GRND_PATH=$BREAK/groundings
mkdir -p $GRND_PATH
python $QDMR_ROOT/qdmr2sparql/get_qdmr_grounding_from_break.py --dev --full_break --qdmr_path $QDMR_ROOT/Break/break_dataset_static_fix --output_path $GRND_PATH/grnd_full_break_dev.json --output_path_all $GRND_PATH/grnd_full_with_errors_break_dev.json | tee $GRND_PATH/grnd_full_break_dev_log.txt
python $QDMR_ROOT/qdmr2sparql/get_qdmr_grounding_from_break.py --full_break --qdmr_path $QDMR_ROOT/Break/break_dataset_static_fix --output_path $GRND_PATH/grnd_full_break_train.json --output_path_all $GRND_PATH/grnd_full_with_errors_break_train.json | tee $GRND_PATH/grnd_full_break_train_log.txt