This repository contains the official implementation of CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay by Natasha Butt, Blazej Manczak, Auke Wiggers, Corrado Rainone, David Zhang, Michaël Defferrard, and Taco Cohen, published at ICML 2024.
Contact: [email protected]. Work completed while at QUVA Lab as part of the research collaboration between Qualcomm Technologies, Inc. (QTI) and the University of Amsterdam (UvA).
Large language models are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for language model self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at https://github.com/Qualcomm-AI-research/codeit.
Clone the repository.
git clone https://github.com/Qualcomm-AI-research/codeit.git
cd codeit
Build the Docker image from inside the checked-out repository. On Linux, this can be done using
docker build . -t codeit:latest
Once the image has built successfully, we can run a container that is based on it. On Linux, the command that does this and also
- mounts the current working directory into the container
- changes to the current working directory
- exposes all of the host's GPUs in the container
is
docker run --rm -it -v "${PWD}":"${PWD}" -w "${PWD}" --gpus=all codeit:latest /bin/bash
Note: we require Python 3.9 at least due to our use of the ast library.
Note: the stored data/training
and data/evaluation
directories were created by the below.
Download official ARC training and evaluation data into data/raw
:
bash download-data.sh
Preprocess ARC data:
python run/preprocess_tasks.py
Mutate ARC training tasks and save these as outputs/mutation/mutated_tasks_train_{i}.json
:
mkdir -p outputs/mutation/
python run/run_mutate.py hydra.run.dir=outputs/mutation/
Note: there will be many errors printed to the log as the mutation algorithm tests code by executing with timeout and printing errors. The mutation algorithm stops after baseline.num_samples
samples.
To evaluate mutated programs from the mutated tasks by executing each program on all ARC evaluation tasks, run:
python run/evaluate_mutations.py hydra.run.dir=outputs/mutation/ baseline.tasks_file=outputs/mutation/mutated_tasks_train_ baseline.parallel=True baseline.sig_alarm=True
mkdir -p outputs/codeit/
python run/run_codeit.py hydra.run.dir=outputs/codeit/
Note: by default, CodeIt will use the mutated tasks stored in data/mutated_tasks_train_*.json
.
To instead use the mutated tasks created by the above, set replay_buffer.mutated_train_tasks_file=outputs/mutation/mutated_tasks_train_
.
Note: to reduce memory requirements, lower data.dataloader.batch_size
or consider using a smaller model such as codet5_small
.
Outputs saved to outputs/codeit/
:
performance.csv
contains performance and number of programs/tasks in buffer for every meta-iterationtensorboard
directories contain scalar metrics and text logs for all programs with seen example performance of 1, along with their task demonstration performance and test performancelog_{i}.json
: policy samples at meta-iterationi
solutions_{i}.json
: solutions at meta-iterationi
last.ckpt.dir
: last model checkpointlog.out
: terminal output
A1, no ExIt:
first run the mutation baseline to obtain mutated programs evaluated on ARC evaluation tasks, then
mkdir -p outputs/a1_no_exit/
python run/run_codeit.py hydra.run.dir=outputs/a1_no_exit/ ablation.used=True exit.add_policy_samples=False
A2, no relabeling:
mkdir -p outputs/a2_no_relabeling/
python run/run_codeit.py hydra.run.dir=outputs/a2_no_relabeling/ ablation.sampling_and_filtering=True
A3, no priority:
mkdir -p outputs/a3_no_priority/
python run/run_codeit.py hydra.run.dir=outputs/a3_no_priority/ replay_buffer.performance_penalty=0
A4, no pretraining:
mkdir -p outputs/a4_no_pretraining/
python run/run_codeit.py hydra.run.dir=outputs/a4_no_pretraining/ model.random_weights=True
A5, one demo:
mkdir -p outputs/a5_one_demo_example/
python run/run_codeit.py hydra.run.dir=outputs/a5_one_demo_example/ data.dataloader.tokenizer.n_examples=1
A6, no mutation:
mkdir -p outputs/a6_no_mutation/
python run/run_codeit.py hydra.run.dir=outputs/a6_no_mutation/ replay_buffer.mutated_train_tasks_file=""
Use results.ipynb
to compute the mutation baseline performance, CodeIt cumulative performance, and CodeIt policy-only performance.
Note: if you ran CodeIt with the mutated tasks stored in data/
, the default, leave MUTATION_PATH=data/
.
Else, if you re-ran the mutation baseline and ran CodeIt with mutated tasks from this run (by setting replay_buffer.mutated_train_tasks_file=outputs/mutation/mutated_tasks_train_
), set MUTATION_PATH=outputs/mutation/
.
├─ codeit/
│ ├─ augment/
│ │ ├─ genetic.py # Mutation algorithm implementation
│ │ ├─ mutate_grid.py # Lower level functions for input mutation (not used for mutation baseline)
│ │ ├─ mutate_program.py # Lower level functions for program mutation
│ │ ├─ program_sampler.py # Random program sampler implementation (reuses code from mutation algorithm)
│ │ └─ type_inference.py # Code for type inference
│ ├─ dsl/ # Domain-specific language, based on https://github.com/michaelhodel/arc-dsl
│ │ ├─ arc_types.py # Type system
│ │ ├─ dsl.py # Primitive function definitions
│ │ ├─ primitives.py # Primitive constant definitions and function names
│ │ └─ solvers.py # Solutions to ARC training tasks
│ ├─ policy/
│ │ ├─ environment.py # Code for executing programs with timeout
│ │ ├─ inference.py # Code for policy generation and evaluation
│ │ └─ tokenize.py # Code for tokenizing examples
│ ├─ agent.py # Agent manages the replay buffer, samples the policy, adds to solutions
│ ├─ exit_data_module.py # Exit data module samples the replay buffer and collates dataset for dataloaders
│ ├─ replay_buffer.py # Stores tokenized training examples
│ ├─ task.py # Functionality related to tasks
│ ├─ callbacks.py # Contains callback for model saving
│ ├─ utils.py # General purpose functions
│ ├─ typing_custom.py # Contains custom typing functions
│ ├─ concept_arc.py # Functions for loading and processing concept arc dataset
│ └─ hf_model_module.py # Pytorch lightning module
├─ data/
│ ├─ evaluation/ # Processed ARC evaluation tasks
│ ├─ training/ # Processed ARC training tasks
│ ├─ codeit_arc_eval_solutions_100.json # Solutions after 100 meta-iterations for codeit run
│ ├─ mutation_performance.csv # Mutation baseline d1 performance
│ ├─ mutation_solutions_19200.json # Mutation baseline d1 solutions after 19,200 samples
│ ├─ mutated_tasks_train_9600.json # Mutation baseline d1 tasks after 9,600 samples
│ ├─ mutated_tasks_train_19200.json # Mutation baseline d1 tasks after 19,200 samples
│ └─ split_keys.json # Contains custom validation split for training tasks
├─ outputs/ # Experiment outputs, created by running the above commands
├─ run/
│ ├─ config/
│ │ ├─ models/ # Configs for various models
│ │ └─ base_config.yaml # All default parameters are stored here
│ ├─ preprocess_tasks.py # Script for preprocessing ARC tasks
│ ├─ run_codeit.py # Script for running codeit
│ ├─ evaluate_mutations.py # Script for evaluating mutated tasks
│ └─ run_mutate.py # Script for mutating programs
├─ Dockerfile
├─ download-data.sh # Script to download raw ARC data
├─ LICENSE
├─ README.md
├─ requirements.txt
├─ results.ipynb # Notebook for reporting performance of CodeIt and mutation baseline
└─ setup.py
The domain-specific language for solving ARC, implemented in codeit/dsl
, is largely based on Michael Hodel's implementation, with some minor updates for ease of program execution and mutation.
If you find our code useful, please cite:
@inproceedings{butt2024codeit,
title = {CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay},
author = {Butt, Natasha and Manczak, Blazej and Wiggers, Auke and Rainone, Corrado and Zhang, David and Defferrard, Micha{\"e}l and Cohen, Taco},
booktitle = {International Conference on Machine Learning},
year = {2024},
organization = {PMLR},
eprint = {2402.04858},
url = {https://arxiv.org/abs/2402.04858},
}