NVlabs
diff --git a/README.md b/README.md
Lines changed: 14 additions & 0 deletions
diff --git a/asset/docs/inference_scaling/inference_scaling.md b/asset/docs/inference_scaling/inference_scaling.md
Lines changed: 59 additions & 0 deletions
diff --git a/asset/docs/inference_scaling/results.jpg b/asset/docs/inference_scaling/results.jpg
91.8 KB
diff --git a/asset/docs/inference_scaling/scaling_curve.jpg b/asset/docs/inference_scaling/scaling_curve.jpg
195 KB
diff --git a/asset/docs/model_zoo.md b/asset/docs/model_zoo.md
Lines changed: 3 additions & 3 deletions
diff --git a/scripts/infer_run_inference_geneval.sh b/scripts/infer_run_inference_geneval.sh
Lines changed: 29 additions & 5 deletions
diff --git a/scripts/inference_geneval.py b/scripts/inference_geneval.py
Lines changed: 35 additions & 17 deletions
diff --git a/scripts/inference_geneval_diffusers.py b/scripts/inference_geneval_diffusers.py
Lines changed: 12 additions & 5 deletions
@@ -40,6 +40,7 @@ As a result, Sana-0.6B is very competitive with modern giant diffusion models (e
 
 ## 🔥🔥 News
 
+- (🔥 New) \[2025/3/21\] 🚀Sana + Inference Scaling is released. [\[Guidance\]](asset/docs/inference_scaling/inference_scaling.md)
 - (🔥 New) \[2025/3/16\] 🔥**SANA-1.5 code & weights are released!** 🎉 Include: [DDP/FSDP](#3-train-with-tar-file) | [TAR file WebDataset](#3-train-with-tar-file) | [Multi-Scale](#3-train-with-tar-file) Training code and [Weights](asset/docs/model_zoo.md) | [HF](https://huggingface.co/collections/Efficient-Large-Model/sana-15-67d6803867cb21c230b780e4) are all released.
 - (🔥 New) \[2025/3/14\] 🏃SANA-Sprint is coming out!🎉 A new one/few-step generator of Sana. 0.1s per 1024px image on H100, 0.3s on RTX 4090. Find out more details: [\[Page\]](https://nvlabs.github.io/Sana/Sprint/) | [\[Arxiv\]](https://arxiv.org/abs/2503.09641). Code is coming very soon along with `diffusers`
 - (🔥 New) \[2025/2/10\] 🚀Sana + ControlNet is released. [\[Guidance\]](asset/docs/sana_controlnet.md) | [\[Model\]](asset/docs/model_zoo.md) | [\[Demo\]](https://nv-sana.mit.edu/ctrlnet/)
@@ -393,6 +394,19 @@ bash train_scripts/train.sh \
 
 Refer to [Toolkit Manual](asset/docs/metrics_toolkit.md).
 
+# 🚀 5. Inference Scaling
+
+We trained a specialized [NVILA-2B](https://huggingface.co/Efficient-Large-Model/NVILA-Lite-2B-Verifier) model to score images and named it VISA (VIla as SAna verifier). By selecting the top 4 images from 2,048 candidates, we raised the GenEval overall score of SD1.5 from 0.42 to 0.87 and that of SANA-1.5-4.8B v2 from 0.81 to 0.96.
+
+| Method                         | Overall | Single | Two  | Counting | Colors | Position | Color Attribution |
+|--------------------------------|---------|--------|------|----------|--------|----------|------------------|
+| SD1.5                          | 0.42    | 0.98   | 0.39 | 0.31     | 0.72   | 0.04     | 0.06             |
+| **+ Inference Scaling**        | **0.87** | **1.00** | **0.97** | **0.93** | **0.96** | **0.75** | **0.62** |
+| SANA-1.5 4.8B v2              | 0.81    | 0.99   | 0.86 | 0.86     | 0.84   | 0.59     | 0.65             |
+| **+ Inference Scaling**        | **0.96** | **1.00** | **1.00** | **0.97** | **0.94** | **0.96** | **0.87** |
+
+For details, refer to the [Inference Scaling Manual](asset/docs/inference_scaling/inference_scaling.md).
+
 # 💪To-Do List
 
 We will try our best to release
 
@@ -0,0 +1,59 @@
+## Inference Time Scaling for SANA-1.5
+
+![results](results.jpg)
+
+We trained a specialized [NVILA-2B](https://huggingface.co/Efficient-Large-Model/NVILA-Lite-2B-Verifier) model to score images and named it VISA (VIla as SAna verifier). By selecting the top 4 images from 2,048 candidates, we raised the GenEval overall score of SD1.5 from 0.42 to 0.87 and that of SANA-1.5-4.8B v2 from 0.81 to 0.96.
+
+![curve](scaling_curve.jpg)
+
+Even with a smaller number of candidates, such as 32, SANA-1.5-4.8B v2 scores over 90% on GenEval.
+
+### Environment Requirements
+
+Install the dependencies:
+
+```bash
+# Other transformers versions may also work, but we have not tested them.
+pip install transformers==4.46
+pip install git+https://github.com/bfshi/scaling_on_scales.git
+```
+
+### 1. Generate N images with a .pth file for the following selection
+
+```bash
+# download the checkpoint for the following generation
+huggingface-cli download Efficient-Large-Model/Sana_600M_512px --repo-type model --local-dir output/Sana_600M_512px --local-dir-use-symlinks False
+# 32 is a relatively small number for testing, but it can already push GenEval above 90% when verifying the SANA-1.5-4.8B v2 model. Set it to a larger number, such as 2048, to approach the upper limit.
+n_samples=32
+pick_number=4
+
+output_dir=output/geneval_generated_path
+# example
+bash scripts/infer_run_inference_geneval.sh \
+    configs/sana_config/512ms/Sana_600M_img512.yaml \
+    output/Sana_600M_512px/checkpoints/Sana_600M_512px_MultiLing.pth \
+    --img_nums_per_sample=$n_samples \
+    --output_dir=$output_dir
+```
+
+### 2. Use NVILA-Verifier to select from the generated images
+
+```bash
+bash tools/inference_scaling/nvila_sana_pick.sh \
+    $output_dir \
+    $n_samples \
+    $pick_number
+```
+
+### 3. Calculate the GenEval metric
+
+The final evaluation requires the GenEval environment. Installation instructions are available [here](../../../tools/metrics/geneval/geneval_env.md).
+
+```bash
+# activate geneval env
+conda activate geneval
+
+DIR_AFTER_PICK="output/nvila_pick/best_${pick_number}_of_${n_samples}/${output_dir}"
+
+bash tools/metrics/compute_geneval.sh $(dirname "$DIR_AFTER_PICK") $(basename "$DIR_AFTER_PICK")
+```
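The selection step in this manual amounts to best-of-N sampling: score all N candidates generated for a prompt, then keep the `pick_number` highest. A minimal Python sketch of that idea follows. The `pick_top_k` helper and the score values are hypothetical illustrations; this is not the logic of `nvila_sana_pick.sh`, whose internals are not part of this change.

```python
from typing import List, Sequence


def pick_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first."""
    if not 0 < k <= len(scores):
        raise ValueError(f"k must be in [1, {len(scores)}], got {k}")
    # Sort candidate indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


# Hypothetical verifier scores for 8 candidates generated from one prompt.
scores = [0.12, 0.91, 0.40, 0.88, 0.05, 0.97, 0.33, 0.60]
best = pick_top_k(scores, k=4)
print(best)  # [5, 1, 3, 7]
```

With `n_samples=32` and `pick_number=4` as in the manual, the same call would keep 4 of 32 candidates per prompt.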
@@ -19,16 +19,16 @@
 | Sana-1.6B-ControlNet | 1Kpx   | [Sana_1600M_1024px_BF16_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_ControlNet_HED) | Coming soon                                                                                                                                       | **bf16**/fp32 | Multi-Language |
 | Sana-0.6B-ControlNet | 1Kpx   | [Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED)             | Coming soon                                                                                                                                       | fp16/fp32     | -              |
 
----
+______________________________________________________________________
 
 ### SANA-1.5
 
 | Model        | Reso   | pth link                                                                                  | diffusers                                                              | Precision | Description    |
 |--------------|--------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------|----------------|
-| SANA1.5-4.8B | 1024px | [SANA1.5_4.8B_1024px](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px)   | [Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers]()(coming soon)   | bf16      | Multi-Language |
+| SANA1.5-4.8B | 1024px | [SANA1.5_4.8B_1024px](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px)   | [Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers](<>)(coming soon)   | bf16      | Multi-Language |
 
+______________________________________________________________________
 
----
 ## ❗ 2. Make sure to use correct precision(fp16/bf16/fp32) for training and inference.
 
 ### We provide two samples to use fp16 and bf16 weights, respectively.
 
@@ -10,6 +10,8 @@ default_step=20   # 14
 default_sample_nums=553
 default_sampling_algo="flow_dpm-solver"
 default_add_label=''
+default_img_nums_per_sample=4
+default_batch_size=1
 
 # parser
 config_file=$1
@@ -22,6 +24,18 @@ do
         step="${arg#*=}"
         shift
         ;;
+        --sample_nums=*)
+        sample_nums="${arg#*=}"
+        shift
+        ;;
+        --img_nums_per_sample=*)
+        img_nums_per_sample="${arg#*=}"
+        shift
+        ;;
+        --batch_size=*)
+        batch_size="${arg#*=}"
+        shift
+        ;;
         --sampling_algo=*)
         sampling_algo="${arg#*=}"
         shift
@@ -34,8 +48,8 @@ do
         model_paths="${arg#*=}"
         shift
         ;;
-        --sample_nums=*)
-        sample_nums="${arg#*=}"
+        --output_dir=*)
+        output_dir="${arg#*=}"
         shift
         ;;
         --cfg_scale=*)
@@ -63,21 +77,31 @@ samples_per_gpu=$((sample_nums / np))
 add_label=${add_label:-$default_add_label}
 ablation_key=${ablation_key:-''}
 ablation_selections=${ablation_selections:-''}
+img_nums_per_sample=${img_nums_per_sample:-$default_img_nums_per_sample}
+batch_size=${batch_size:-$default_batch_size}
+output_dir=${output_dir:-''}
+
 
 echo "Step: $step"
 echo "Sample numbers: $sample_nums"
+echo "Image numbers per sample: $img_nums_per_sample"
+echo "Batch size: $batch_size"
 echo "Sampling Algo: $sampling_algo"
 echo "CFG scale: $cfg_scale"
 echo "Add label: $add_label"
 echo "Exist time prefix: $exist_time_prefix"
 
 cmd_template="DPM_TQDM=True python scripts/inference_geneval.py --config={config_file} --model_path={model_path} \
-    --sampling_algo $sampling_algo --step $step --cfg_scale $cfg_scale --sample_nums $sample_nums \
-    --gpu_id {gpu_id} --start_index {start_index} --end_index {end_index}"
+    --sampling_algo $sampling_algo --step $step --cfg_scale $cfg_scale --sample_nums $sample_nums --n_samples $img_nums_per_sample \
+    --batch_size $batch_size --gpu_id {gpu_id} --start_index {start_index} --end_index {end_index}"
 if [ -n "${add_label}" ]; then
     cmd_template="${cmd_template} --add_label ${add_label}"
 fi
 
+if [ -n "${output_dir}" ]; then
+    cmd_template="${cmd_template} --output_dir ${output_dir}"
+fi
+
 if [ -n "${ablation_key}" ]; then
     cmd_template="${cmd_template} --ablation_key ${ablation_key} --ablation_selections "${ablation_selections}""
     echo "ablation_key: $ablation_key"
@@ -108,7 +132,7 @@ if [[ "$model_paths" == *.pth ]]; then
     cmd="${cmd//\{end_index\}/$end_index}"
 
     echo "Running on GPU $gpu_id: samples $start_index to $end_index"
-    echo $cmd
+    echo "cmd: $cmd"
     eval CUDA_VISIBLE_DEVICES=$gpu_id $cmd &
   done
   wait
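
The parser in this script relies on two Bash parameter expansions: `${arg#*=}` strips everything up to and including the first `=`, and `${var:-default}` falls back to a default when the variable is unset or empty. As a rough illustration in Python, the sketch below mirrors that behavior for three of the options handled here. The `parse_overrides` helper is hypothetical and not part of the repository.

```python
# Defaults taken from the script: default_sample_nums, default_img_nums_per_sample, default_batch_size.
DEFAULTS = {"sample_nums": "553", "img_nums_per_sample": "4", "batch_size": "1"}


def parse_overrides(argv):
    """Mimic `--key=value` parsing with `${var:-default}` style fallbacks."""
    values = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)  # like ${arg#*=}: split at the first '='
            if key in DEFAULTS:
                values[key] = value
    # An unset or empty value falls back to its default, as ${var:-default} does.
    return {k: values.get(k) or d for k, d in DEFAULTS.items()}


print(parse_overrides(["--img_nums_per_sample=32", "--batch_size="]))
# {'sample_nums': '553', 'img_nums_per_sample': '32', 'batch_size': '1'}
```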
 
@@ -30,6 +30,7 @@
 import torch
 from einops import rearrange
 from PIL import Image
+from termcolor import colored
 from torchvision.utils import _log_api_usage_once, make_grid, save_image
 from tqdm import tqdm
 
@@ -173,7 +174,6 @@ def visualize(sample_steps, cfg_scale, pag_scale):
         os.makedirs(sample_path, exist_ok=True)
 
         prompt = metadata["prompt"]
-        # print(f"Prompt ({index: >3}/{len(metadatas)}): '{prompt}'")
         with open(os.path.join(outpath, "metadata.jsonl"), "w") as fp:
             json.dump(metadata, fp)
 
@@ -347,7 +347,7 @@ def parse_args():
 class SanaInference(SanaConfig):
     config: str = ""
     dataset: str = "GenEval"
-    outdir: str = field(default="outputs", metadata={"help": "dir to write results to"})
+    output_dir: str = field(default=None, metadata={"help": "dir to write results to"})
     n_samples: int = field(default=4, metadata={"help": "number of samples"})
     batch_size: int = field(default=1, metadata={"help": "how many samples can be produced simultaneously"})
     skip_grid: bool = field(default=False, metadata={"help": "skip saving grid"})
@@ -394,8 +394,9 @@ class SanaInference(SanaConfig):
     device = "cuda" if torch.cuda.is_available() else "cpu"
     logger = get_root_logger()
 
-    n_rows = batch_size = args.n_samples
-    assert args.batch_size == 1, ValueError(f"{batch_size} > 1 is not available in GenEval")
+    batch_size = args.batch_size
+    n_rows = 4 if args.n_samples > 4 else args.n_samples
+    assert args.n_samples % args.batch_size == 0, ValueError(f"{args.n_samples} cannot be divided by {args.batch_size}")
 
     # only support fixed latent size currently
     latent_size = args.image_size // config.vae.vae_downsample_rate
@@ -448,12 +449,25 @@ class SanaInference(SanaConfig):
         if ("flow" not in args.model_path or args.sampling_algo == "flow_dpm-solver")
         else "flow_euler"
     )
+    logger.info(f"Sampler {args.sampling_algo}")
 
-    work_dir = (
-        f"/{os.path.join(*args.model_path.split('/')[:-2])}"
-        if args.model_path.startswith("/")
-        else os.path.join(*args.model_path.split("/")[:-2])
-    )
+    # save path
+    if args.output_dir is None:
+        work_dir = (
+            f"/{os.path.join(*args.model_path.split('/')[:-2])}"
+            if args.model_path.startswith("/")
+            else os.path.join(*args.model_path.split("/")[:-2])
+        )
+        img_save_dir = os.path.join(str(work_dir), "vis")
+
+        os.umask(0o000)
+        os.makedirs(img_save_dir, exist_ok=True)
+        logger.info(colored(f"Saving images at {img_save_dir}", "green"))
+    else:
+        work_dir = args.output_dir
+
+        os.umask(0o000)
+        os.makedirs(work_dir, exist_ok=True)
 
     # dataset
     metadatas = datasets.load_dataset(
@@ -465,15 +479,9 @@ class SanaInference(SanaConfig):
     match = re.search(r".*epoch_(\d+).*step_(\d+).*", args.model_path)
     epoch_name, step_name = match.groups() if match else ("unknown", "unknown")
 
-    img_save_dir = os.path.join(str(work_dir), "vis")
-    os.umask(0o000)
-    os.makedirs(img_save_dir, exist_ok=True)
-    logger.info(f"Sampler {args.sampling_algo}")
-
     def create_save_root(args, dataset, epoch_name, step_name, sample_steps, guidance_type):
         save_root = os.path.join(
             img_save_dir,
-            # f"{datetime.now().date() if args.exist_time_prefix == '' else args.exist_time_prefix}_"
             f"{dataset}_epoch{epoch_name}_step{step_name}_scale{args.cfg_scale}"
             f"_step{sample_steps}_size{args.image_size}_bs{batch_size}_samp{args.sampling_algo}"
             f"_seed{args.seed}_{str(weight_dtype).split('.')[-1]}",
@@ -487,6 +495,10 @@ def create_save_root(args, dataset, epoch_name, step_name, sample_steps, guidanc
             save_root += f"_{guidance_type}"
         if args.interval_guidance[0] != 0 and args.interval_guidance[1] != 1:
             save_root += f"_intervalguidance{args.interval_guidance[0]}{args.interval_guidance[1]}"
+        if not DATA_URL.endswith("evaluation_metadata.jsonl"):
+            save_root += f"_metadata{DATA_URL.split('/')[-1]}"
+        if args.n_samples != 4:
+            save_root += f"_nsample{args.n_samples}"
 
         save_root += f"_imgnums{args.sample_nums}" + args.add_label
         return save_root
@@ -505,7 +517,10 @@ def guidance_type_select(default_guidance_type, pag_scale, attn_type):
             sample_steps = args.step if args.step != -1 else sample_steps_dict[args.sampling_algo]
             guidance_type = guidance_type_select(guidance_type, args.pag_scale, config.model.attn_type)
 
-            save_root = create_save_root(args, args.dataset, epoch_name, step_name, sample_steps, guidance_type)
+            if args.output_dir is None:
+                save_root = create_save_root(args, args.dataset, epoch_name, step_name, sample_steps, guidance_type)
+            else:
+                save_root = args.output_dir
             os.makedirs(save_root, exist_ok=True)
             if args.if_save_dirname and args.gpu_id == 0:
                 # save at work_dir/metrics/tmp_xxx.txt for metrics testing
@@ -519,7 +534,10 @@ def guidance_type_select(default_guidance_type, pag_scale, attn_type):
         guidance_type = guidance_type_select(guidance_type, args.pag_scale, config.model.attn_type)
         logger.info(f"Inference with {weight_dtype}, guidance_type: {guidance_type}, flow_shift: {flow_shift}")
 
-        save_root = create_save_root(args, args.dataset, epoch_name, step_name, sample_steps, guidance_type)
+        if args.output_dir is None:
+            save_root = create_save_root(args, args.dataset, epoch_name, step_name, sample_steps, guidance_type)
+        else:
+            save_root = args.output_dir
         os.makedirs(save_root, exist_ok=True)
         if args.if_save_dirname and args.gpu_id == 0:
             os.makedirs(f"{work_dir}/metrics", exist_ok=True)
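
Two behaviors change in this file: the save location now prefers `--output_dir` over a path derived from the checkpoint, and `n_samples` only has to be divisible by `batch_size` instead of `batch_size` being fixed at 1. The sketch below restates both rules as standalone functions. The names `resolve_work_dir` and `plan_batches` are hypothetical, and the batch count assumes the sampler iterates in chunks of `batch_size`, which this diff does not show.

```python
import os


def resolve_work_dir(model_path, output_dir=None):
    """Use output_dir if given; otherwise drop the last two components of model_path."""
    if output_dir is not None:
        return output_dir
    joined = os.path.join(*model_path.split("/")[:-2])
    # Splitting an absolute path loses the leading slash, so restore it.
    return f"/{joined}" if model_path.startswith("/") else joined


def plan_batches(n_samples, batch_size):
    """Return (batches per prompt, grid rows) under the new divisibility rule."""
    if n_samples % batch_size != 0:
        raise ValueError(f"{n_samples} cannot be divided by {batch_size}")
    n_rows = min(n_samples, 4)  # same as: 4 if n_samples > 4 else n_samples
    return n_samples // batch_size, n_rows


print(resolve_work_dir("output/Sana_600M_512px/checkpoints/Sana_600M_512px_MultiLing.pth"))
# output/Sana_600M_512px
print(plan_batches(32, 8))  # (4, 4)
```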
 
@@ -151,6 +151,7 @@ def parse_args():
         help="skip saving grid",
     )
 
+    parser.add_argument("--work_dir", default=None, type=str)
     parser.add_argument("--sample_nums", default=553, type=int)
     parser.add_argument("--add_label", default="", type=str)
     parser.add_argument("--exist_time_prefix", default="", type=str)
@@ -193,12 +194,17 @@ def parse_args():
     logger.info(f"Eval {len(metadatas)} samples")
 
     # save path
-    work_dir = (
-        f"/{os.path.join(*args.model_path.split('/')[:-1])}"
-        if args.model_path.startswith("/")
-        else os.path.join(*args.model_path.split("/")[:-1])
-    )
+    if args.work_dir is None:
+        work_dir = (
+            f"/{os.path.join(*args.model_path.split('/')[:-1])}"
+            if args.model_path.startswith("/")
+            else os.path.join(*args.model_path.split("/")[:-1])
+        )
+    else:
+        work_dir = args.work_dir
+    args.work_dir = work_dir
     img_save_dir = os.path.join(str(work_dir), "vis")
+
     os.umask(0o000)
     os.makedirs(img_save_dir, exist_ok=True)
 
@@ -214,6 +220,7 @@ def parse_args():
 
     if args.if_save_dirname and args.gpu_id == 0:
         # save at work_dir/metrics/tmp_xxx.txt for metrics testing
+        os.makedirs(f"{work_dir}/metrics", exist_ok=True)
         with open(f"{work_dir}/metrics/tmp_geneval_{time.time()}.txt", "w") as f:
             print(f"save tmp file at {work_dir}/metrics/tmp_geneval_{time.time()}.txt")
             f.write(os.path.basename(save_root))
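
In the context lines above, `time.time()` is evaluated once for the file that is opened and again for the printed message, so the logged path can differ from the file actually written. One possible tidy-up builds the path once, sketched here with a hypothetical helper `write_tmp_dirname` that is not part of this change:

```python
import os
import time


def write_tmp_dirname(work_dir, save_root):
    """Write basename(save_root) to a timestamped file and return that file's path."""
    metrics_dir = os.path.join(work_dir, "metrics")
    os.makedirs(metrics_dir, exist_ok=True)
    # Compute the timestamped path once so the log message matches the file on disk.
    tmp_path = os.path.join(metrics_dir, f"tmp_geneval_{time.time()}.txt")
    with open(tmp_path, "w") as f:
        f.write(os.path.basename(save_root))
    print(f"save tmp file at {tmp_path}")
    return tmp_path
```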