NeMo Framework is NVIDIA's GPU-accelerated, end-to-end training framework for large language models (LLMs), multi-modal models, and speech models. It enables seamless scaling of training workloads (both pretraining and post-training) from a single GPU to thousand-node clusters for both 🤗 Hugging Face/PyTorch and Megatron models, and includes a suite of libraries and recipe collections to help users train models end to end. The AutoModel library ("NeMo AutoModel") provides GPU-accelerated PyTorch training for 🤗 Hugging Face models on Day 0. Users can start training and fine-tuning models instantly, without conversion delays, and scale effortlessly with PyTorch-native parallelisms, optimized custom kernels, and memory-efficient recipes, all while preserving the original checkpoint format for seamless use across the Hugging Face ecosystem.
⚠️ Note: NeMo AutoModel is under active development. New features, improvements, and documentation updates are released regularly. We are working toward a stable release, so expect the interface to solidify over time. Your feedback and contributions are welcome, and we encourage you to follow along as new updates roll out.
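As a quick illustration of the Day-0 workflow, here is a minimal sketch that loads a Hugging Face Hub checkpoint through NeMo AutoModel. It assumes the package exposes a NeMoAutoModelForCausalLM drop-in (mirroring the NeMoAutoModelForImageTextToText class used in the example config further below); treat the exact import path as an assumption rather than documented API.

# Minimal sketch (assumed API): drop-in loading of a Hub checkpoint.
from transformers import AutoTokenizer
from nemo_automodel import NeMoAutoModelForCausalLM  # assumed import path

model_id = "meta-llama/Llama-3.2-1B"  # any causal LM from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = NeMoAutoModelForCausalLM.from_pretrained(model_id)
# The model is a regular PyTorch module: fine-tune it with the recipes below
# and save it back in the original Hugging Face checkpoint format.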
✅ Available now | 🚀 Coming in 25.09

- ✅ Hugging Face Integration - Works with 1-70B models (Qwen, Llama).
- ✅ Distributed Training - Fully Sharded Data Parallel (FSDP2) support.
- ✅ Environment Support - Support for SLURM and interactive training.
- ✅ Learning Algorithms - SFT (Supervised Fine-Tuning) and PEFT (Parameter-Efficient Fine-Tuning).
- ✅ Large Model Support - Native PyTorch support for models up to 70B parameters.
- ✅ Advanced Parallelism - PyTorch-native FSDP2, TP, CP, and SP for efficient training.
- ✅ Sequence Packing - Sequence packing in both DTensor and MCore for large training performance gains.
- ✅ DCP - Distributed Checkpointing support with SafeTensors output.
- ✅ HSDP - Hybrid Sharded Data Parallelism based on FSDP2.
- 🚀 Pipeline Support - Torch-native pipeline parallelism composable with FSDP2 and DTensor (3D parallelism).
- 🚀 Pre-training - Support for model pre-training, including DeepSeek-V3, GPT-OSS, and Qwen3 (Coder-480B-A35B, etc.).
- 🚀 Knowledge Distillation - Support for knowledge distillation with LLMs; VLM support will be added after 25.09.
NeMo AutoModel provides native support for a wide range of models available on the Hugging Face Hub, enabling efficient fine-tuning across domains. The table below lists the supported models and their available recipes.
To get started quickly, NeMo AutoModel provides a collection of ready-to-use recipes for common LLM and VLM fine-tuning tasks. Simply select the recipe that matches your model and training setup (e.g., single-GPU, multi-GPU, or multi-node).
| Domain | Model Family | Model ID | Recipes |
|---|---|---|---|
| LLM | LLaMA | meta-llama/Llama-3.2-1B | SFT, PEFT |
| | | meta-llama/Llama-3.2-3B-Instruct | SFT, PEFT |
| | | meta-llama/Llama-3.1-8B | FP8 |
| LLM | Mistral | mistralai/Mistral-7B-v0.1 | SFT, PEFT, FP8 |
| | | mistralai/Mistral-Nemo-Base-2407 | SFT, PEFT, FP8 |
| | | mistralai/Mixtral-8x7B-Instruct-v0.1 | PEFT |
| LLM | Qwen | Qwen/Qwen2.5-7B | SFT, PEFT, FP8 |
| | | Qwen/Qwen3-0.6B | SFT, PEFT |
| | | Qwen/QwQ-32B | SFT, PEFT |
| LLM | Gemma | google/gemma-3-270m | SFT, PEFT |
| | | google/gemma-2-9b-it | SFT, PEFT, FP8 |
| | | google/gemma-7b | SFT, PEFT |
| LLM | Phi | microsoft/phi-2 | SFT, PEFT |
| | | microsoft/Phi-3-mini-4k-instruct | SFT, PEFT |
| | | microsoft/phi-4 | SFT, PEFT, FP8 |
| LLM | Seed | ByteDance-Seed/Seed-Coder-8B-Instruct | SFT, PEFT, FP8 |
| | | ByteDance-Seed/Seed-OSS-36B-Instruct | SFT, PEFT |
| LLM | Baichuan | baichuan-inc/Baichuan2-7B-Chat | SFT, PEFT, FP8 |
| VLM | Gemma | google/gemma-3-4b-it | SFT, PEFT |
| | | google/gemma-3n-e4b-it | SFT, PEFT |
And more: check out more LLM and VLM examples! Any causal LM on the Hugging Face Hub can be used with the base recipe template.
To run a NeMo AutoModel recipe, you need a recipe script (e.g., LLM, VLM) and a YAML config file (e.g., LLM, VLM):
# Command invocation format:
uv run <recipe_script_path> --config <yaml_config_path>
# LLM example: multi-GPU with FSDP2
uv run torchrun --nproc-per-node=8 recipes/llm_finetune/finetune.py --config recipes/llm_finetune/llama/llama3_2_1b_hellaswag.yaml
# VLM example: single GPU fine-tuning (Gemma-3-VL) with LoRA
uv run recipes/vlm_finetune/finetune.py --config recipes/vlm_finetune/gemma3/gemma3_vl_3b_cord_v2_peft.yaml
- Day-0 Hugging Face Support: Instantly fine-tune any model from the Hugging Face Hub
- Lightning-Fast Performance: Custom CUDA kernels and memory optimizations deliver 2-5× speedups
- Large-Scale Distributed Training: Built-in FSDP2 and nvFSDP for seamless multi-node scaling
- Vision-Language Model Ready: Native support for VLMs (Qwen2-VL, Gemma-3-VL, etc.)
- Advanced PEFT Methods: LoRA and extensible PEFT system out of the box
- Seamless HF Ecosystem: Fine-tuned models work out of the box with Transformers pipelines, vLLM, etc.
- Robust Infrastructure: Distributed checkpointing with integrated logging and monitoring
- Optimized Recipes: Pre-built configurations for common models and datasets
- Flexible Configuration: YAML-based configuration system for reproducible experiments
- FP8 Precision: Native FP8 training & inference for higher throughput and lower memory use
- INT4 / INT8 Quantization: Turn-key quantization workflows for ultra-compact, low-memory training
NeMo AutoModel is offered both as a standard Python package installable via pip and as a ready-to-run NeMo Framework Docker container.
# We use `uv` for package management and environment isolation.
pip3 install uv
# If you cannot install at the system level, you can install for your user with
# pip3 install --user uv
Run every command with `uv run`. It auto-installs the virtual environment from the lock file and keeps it up to date, so you never need to activate a venv manually. Example: `uv run recipes/llm_finetune/finetune.py`. If you prefer to install NeMo AutoModel explicitly, follow the instructions below.
# Install the latest stable release from PyPI
# We first need to initialize the virtual environment using uv
uv venv
uv pip install nemo_automodel # or: uv pip install --upgrade nemo_automodel
# Install the latest NeMo AutoModel from the GitHub repo (best for development).
# We first need to initialize the virtual environment using uv
uv venv
# We can now install from source
uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git
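# Verify the installation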
uv run python -c "import nemo_automodel; print('✅ NeMo AutoModel ready')"
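# Distributed training: nvFSDP with 8-way data parallelism (tensor/context parallelism disabled)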
distributed:
_target_: nemo_automodel.distributed.nvfsdp.NVFSDPManager
dp_size: 8
tp_size: 1
cp_size: 1
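# PEFT: LoRA (rank 8, alpha 32) applied to all linear modules, with Triton kernels enabled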
peft:
peft_fn: nemo_automodel._peft.lora.apply_lora_to_linear_modules
match_all_linear: True
dim: 8
alpha: 32
use_triton: True
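# Model and processor: Qwen2.5-VL-3B-Instruct loaded through the NeMo AutoModel wrapper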
model:
_target_: nemo_automodel._transformers.NeMoAutoModelForImageTextToText.from_pretrained
pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct
processor:
_target_: transformers.AutoProcessor.from_pretrained
pretrained_model_name_or_path: Qwen/Qwen2.5-VL-3B-Instruct
min_pixels: 200704
max_pixels: 1003520
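# Checkpointing: write consolidated, HF-compatible safetensors to ./checkpoints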
checkpoint:
enabled: true
checkpoint_dir: ./checkpoints
save_consolidated: true # HF-compatible safetensors
model_save_format: safetensors
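Because save_consolidated writes standard safetensors, a fine-tuned checkpoint can be reused directly with Hugging Face Transformers. Below is a minimal sketch for a causal-LM fine-tune; the checkpoint path is a placeholder for whatever directory your run produces, and VLM checkpoints would use the matching Auto class instead.

# Minimal sketch: reload a consolidated checkpoint with plain Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "./checkpoints/<consolidated-output-dir>"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))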
NeMo-Automodel/
├── nemo_automodel/           # Core library
│   ├── _peft/                # PEFT implementations (LoRA)
│   ├── _transformers/        # HF model integrations
│   ├── checkpoint/           # Distributed checkpointing
│   ├── datasets/             # Dataset loaders
│   │   ├── llm/              # LLM datasets (HellaSwag, SQuAD, etc.)
│   │   └── vlm/              # VLM datasets (CORD-v2, rdr, etc.)
│   ├── distributed/          # FSDP2, nvFSDP, parallelization
│   ├── loss/                 # Optimized loss functions
│   └── training/             # Training recipes and utilities
├── recipes/                  # Ready-to-use training recipes
│   ├── llm/                  # LLM fine-tuning recipes
│   └── vlm/                  # VLM fine-tuning recipes
└── tests/                    # Comprehensive test suite
We welcome contributions! Please see our Contributing Guide for details.
NVIDIA NeMo AutoModel is licensed under the Apache License 2.0.
- Documentation: https://docs.nvidia.com/nemo-framework/user-guide/latest/automodel/index.html
- Hugging Face Hub: https://huggingface.co/models
- Issues: https://github.com/NVIDIA-NeMo/Automodel/issues
- Discussions: https://github.com/NVIDIA-NeMo/Automodel/discussions
Made with ❤️ by NVIDIA
Accelerating AI for everyone