Maya: Multimodal Multilingual LLM

A multimodal LLM supporting eight languages: English, Chinese, French, Spanish, Russian, Japanese, Arabic, and Hindi.

Contents

  • Install
  • Model Weights and Dataset
  • Train
  • Evaluation
  • Citation
  • Contributors
  • Acknowledgement

Install

The following steps were tested with CUDA 12.4.

  1. Clone this repository and navigate to the maya directory
git clone https://github.com/nahidalam/maya
cd maya
  2. Install the package
conda create -n maya python=3.10 -y
conda activate maya
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
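
To sanity-check the environment, you can try the imports below from inside the new conda environment. This is a minimal sketch; it assumes the editable install exposes the llava package (Maya builds on the LLaVA codebase), which may differ in your checkout.

conda activate maya
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # CUDA should report True
python -c "import llava"  # assumed package name from the LLaVA-based codebase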

Model Weights and Dataset

Model weights and datasets are available on HuggingFace.
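
As a hedged sketch, the weights and datasets can be pulled with the huggingface_hub CLI. The repository IDs below are placeholders, not official names; substitute the model and dataset IDs published on HuggingFace.

pip install -U "huggingface_hub[cli]"
MODEL_REPO=org/maya-model        # placeholder: replace with the released model repo ID
DATASET_REPO=org/maya-dataset    # placeholder: replace with the released dataset repo ID
huggingface-cli download "$MODEL_REPO" --local-dir /dev/data/maya_weights
huggingface-cli download "$DATASET_REPO" --repo-type dataset --local-dir /dev/data/maya_dataset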

Train

Pretraining

To pretrain the projection layer:

  • get the pretraining dataset from HuggingFace and place it in /dev/data/LLaVA_Pretrain
  • get the images with wget https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip and place them in /dev/data/images, as sketched below
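
A minimal sketch of this data preparation, assuming the pretraining dataset is the liuhaotian/LLaVA-Pretrain repository referenced by the image URL above:

mkdir -p /dev/data/LLaVA_Pretrain /dev/data/images
huggingface-cli download liuhaotian/LLaVA-Pretrain --repo-type dataset --local-dir /dev/data/LLaVA_Pretrain
wget https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip -P /dev/data
unzip /dev/data/images.zip -d /dev/data/images

With the data in place, launch pretraining: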
bash scripts/maya/pretrain_aya_siglip.sh

Instruction Tuning

Please download the annotations from MBZUAI/palo_multilingual_dataset and the images for the datasets shown in the directory structure below (COCO, GQA, OCR-VQA, TextVQA, Visual Genome).

After downloading all of them, organize the data as follows in /dev/data/instruction_tune_dataset/:

instruction_tune_dataset
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2

Put the palo_multilingual_dataset.json file at /dev/data/annotations/palo_multilingual_dataset.json.
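
The skeleton can be created up front with mkdir; the paths below simply mirror the tree above and the annotation path in the previous step.

mkdir -p /dev/data/instruction_tune_dataset/coco/train2017 \
         /dev/data/instruction_tune_dataset/gqa/images \
         /dev/data/instruction_tune_dataset/ocr_vqa/images \
         /dev/data/instruction_tune_dataset/textvqa/train_images \
         /dev/data/instruction_tune_dataset/vg/VG_100K \
         /dev/data/instruction_tune_dataset/vg/VG_100K_2
mkdir -p /dev/data/annotations   # holds palo_multilingual_dataset.json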

Make sure the pretrained projector checkpoint is stored at the path you pass to the scripts/maya/finetune_aya_siglip.sh script through the --pretrain_mm_mlp_adapter flag.
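
For example, you can locate the flag in the script and confirm your checkpoint exists before launching; the checkpoint path below is a hypothetical placeholder.

grep -n "pretrain_mm_mlp_adapter" scripts/maya/finetune_aya_siglip.sh
ls /dev/data/checkpoints/maya_pretrain/mm_projector.bin   # hypothetical projector checkpoint path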

Then run

bash scripts/maya/finetune_aya_siglip.sh

Evaluation

For multilingual evaluation using the PALO multilingual test dataset:

  • Download the PALO evaluation dataset: create the directory structure below if it doesn't exist, then clone the dataset into it.
    mkdir -p LLaVA/playground/data/eval
    cd LLaVA/playground/data/eval
    git clone https://huggingface.co/datasets/MBZUAI/multilingual-llava-bench-in-the-wild

  • Specifically, the test images can be found here.
  • Run the evaluation script
bash scripts/v1_5/eval/eval_all_languages.sh \
    "model_base" \
    "model_path" \
    "model_name" \
    "your-openai-api-key"

Citation

If you find Maya useful for your research and applications, please cite using this BibTeX:

@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model}, 
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112}, 
}

Contributors

Contributors are listed in no particular order.

Acknowledgement

  • This codebase is based on LLaVA. Thank you for the clear and easy-to-understand codebase.
  • This project would not be possible without the support of Cohere and their Aya-35B API grant. We are thankful to Sara Hooker, Madeline, Shivalika, Shristhi, and the entire Cohere for AI team for their support.
  • We thank Merve and the HuggingFace team for GPU support for the inference demo.