Improving Language Understanding from Screenshots

This repository contains the code, data, and models for the paper Improving Language Understanding from Screenshots. In this paper, we focus on improving the language understanding ability of "screenshot LMs" (models that process everything, including text, as visual input) and propose patch-and-text prediction (PTP), a novel pre-training objective for screenshot LMs.

Quick Links

Environment
Preparing the data
Reproducing our pre-trained models
Downloading our models
Fine-tuning PTP models
Bugs or questions?
Citation

Environment

First, install the latest PyTorch version that is compatible with your environment.

Then, install all the required packages by running:

pip install -r requirements.txt

We strongly recommend using exactly the same transformers and accelerate versions as ours for best reproducibility. Please check out the renderer readme to make sure that the renderer is correctly configured.

Preparing the data

For our encoder-decoder experiments and the train-from-scratch autoregressive screenshot LM experiments, we use Wikipedia+BookCorpus as the pre-training data. The pre-tokenized dataset is hosted on Hugging Face. You can download it by running:

git clone https://huggingface.co/datasets/princeton-nlp/ptp_data data

This folder contains four files:

wikibook_256_opt_tk_train.npy and wikibook_256_opt_tk_val.npy: Wiki+Book using OPT tokenizer, 256 tokens per example (for encoder-decoder).
wikibook_512_llama_tk_train.npy and wikibook_512_llama_tk_val.npy: Wiki+Book using the LLaMA tokenizer, 512 tokens per example (for train-from-scratch autoregressive).
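
As a quick sanity check after downloading, the files can be opened with NumPy. The sketch below assumes each file is a standard .npy array of token IDs with one example per row, which the file names suggest but this README does not state. load_tokenized is a hypothetical helper for illustration and is not part of this repository.

```python
import os

import numpy as np


def load_tokenized(path):
    """Open a .npy file as a read-only memory map instead of loading it into RAM."""
    # Assumption: the file is a plain (non-pickled) NumPy array of token IDs.
    return np.load(path, mmap_mode="r")


# Path follows the `git clone ... data` command above.
path = "data/wikibook_256_opt_tk_val.npy"
if os.path.exists(path):
    tokens = load_tokenized(path)
    # If the assumption holds, the second dimension should be 256 for this file.
    print(tokens.shape, tokens.dtype)
```

Memory mapping is used here only because the training files may be large; a plain np.load(path) works as well if the file fits in memory.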

To continue pre-training Sheared-LLaMA on screenshots, we use Sheared-LLaMA's pipeline for processing RedPajama data. Please follow their guidelines for processing the data. Our example config uses ./data/sheared-llama-rp/for_ft for continued pre-training and ./data/sheared-llama-rp/eval for evaluation.

Reproducing our pre-trained models

To reproduce our models, run the following command (requires 8 GPUs):

NUM_GPU=8 bash run_multiple_gpus.sh {CONFIG PATH}

There are three example configs:

run_configs/ptp.yaml: our main PTP model (encoder-decoder).
run_configs/screenshot-llama-380m.yaml: train-from-scratch autoregressive.
run_configs/screenshot-llama-1.3b-from-sheared-llama.yaml: continued pre-training from Sheared-LLaMA.

You can also run the single-GPU script run_single_gpu.sh for testing. If you are not using 8 GPUs, or your GPUs cannot fit our preset batch sizes, adjust the per-GPU batch size (per_device_train_batch_size) or the gradient accumulation steps (gradient_accumulation_steps) so that the hyperparameters, in particular the effective batch size, stay the same as ours.
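
The adjustment can be worked out with a little arithmetic. The sketch below assumes the usual convention that the effective batch size is the product of the number of GPUs, per_device_train_batch_size, and gradient_accumulation_steps. The numbers are hypothetical and are not taken from our configs, and both function names are illustrative only.

```python
def effective_batch_size(num_gpus, per_device_train_batch_size,
                         gradient_accumulation_steps):
    """Number of examples that contribute to one optimizer step."""
    return num_gpus * per_device_train_batch_size * gradient_accumulation_steps


def accumulation_steps_for(target_batch_size, num_gpus,
                           per_device_train_batch_size):
    """Gradient accumulation steps needed to reach target_batch_size."""
    per_forward = num_gpus * per_device_train_batch_size
    if target_batch_size % per_forward != 0:
        raise ValueError(
            "target batch size is not divisible by "
            "num_gpus * per_device_train_batch_size"
        )
    return target_batch_size // per_forward


# Hypothetical reference setup: 8 GPUs, 16 examples per GPU, no accumulation.
reference = effective_batch_size(8, 16, 1)
print(reference)  # 128

# The same effective batch size on 2 GPUs with 16 examples per GPU.
print(accumulation_steps_for(reference, 2, 16))  # 4
```

Replace the hypothetical values with the ones in the config you are running, then set gradient_accumulation_steps to the result.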

Downloading our models

We provide the following pre-trained models on Hugging Face:

princeton-nlp/ptp
princeton-nlp/screenshot-llama-380m
princeton-nlp/screenshot-llama-1.3b-from-sheared-llama

Fine-tuning PTP models

Coming soon!

Bugs or questions?

If you have any questions related to the paper, feel free to email Tianyu (tianyug@cs.princeton.edu). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!

Citation

Please cite our paper if you use PTP in your work:

@article{gao2024improving,
  title={Improving Language Understanding from Screenshots},
  author={Gao, Tianyu and Wang, Zirui and Bhaskar, Adithya and Chen, Danqi},
  journal={arXiv preprint arXiv:2402.14073},
  year={2024}
}
