Llama 3.3 70B Compilation Error on Trainium (trn1) with Batch Size 4 #1075

Open · wxnfifth5 opened this issue Dec 30, 2024 · 1 comment

@wxnfifth5
I am following the tutorial at https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/trn2-llama3.3-70b-tutorial.html#scenario-1-run-llama3-3-70b-on-trn2 to compile and run the Llama 3.3 70B model on trn1. Compilation succeeds with batch sizes 1 and 2, but it fails with batch size 4.

The compiler (neuronx-cc) terminates with the following error:

2024-12-27T04:26:59Z [F134] neuronx-cc terminated abnormally - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new

Environment:

  • AMI: Deep Learning AMI Neuron (Ubuntu 22.04)
  • Python virtual environment: /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/
  • Package versions:
    • libneuronxla: 2.1.681.0
    • neuronx-cc: 2.16.345.0+69131dd3
    • neuronx-distributed: 0.10.0
    • neuronx-distributed-inference: 0.1.0
    • torch-neuronx: 2.5.1.2.4.0
    • vllm: 0.1.dev2830+g22c56ee.neuron216 (from /home/ubuntu/upstreaming-to-vllm)
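
For reference, these versions can be re-checked from inside the tutorial virtual environment with something like the following (the grep filter is just illustrative):

source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
pip list | grep -iE 'neuron|torch|vllm'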

Command Configuration:

MODEL_PATH="meta-llama/Llama-3.3-70B-Instruct"
BATCH_SIZE=4
SEQ_LEN=2048
COMPILED_MODEL_PATH="./traced_model"
TP_DEGREE=32
LNC=1

# Environment variables
export NEURON_RT_EXEC_TIMEOUT=1200
export XLA_DENSE_GATHER_FACTOR=0
export NEURON_RT_INSPECT_ENABLE=0

# Command
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path $MODEL_PATH \
    --compiled-model-path $COMPILED_MODEL_PATH \
    --torch-dtype bfloat16 \
    --start_rank_id 0 \
    --local_ranks_size $TP_DEGREE \
    --tp-degree $TP_DEGREE \
    --batch-size $BATCH_SIZE \
    --seq-len $SEQ_LEN \
    --on-device-sampling \
    --top-k 1 \
    --do-sample \
    --fused-qkv \
    --sequence-parallel-enabled \
    --qkv-kernel-enabled \
    --attn-kernel-enabled \
    --mlp-kernel-enabled \
    --cc-pipeline-tiling-factor 1 \
    --pad-token-id 2 \
    --enable-bucketing \
    --logical-neuron-cores $LNC \
    --prompt "What is annapurna labs?"

Full error traceback:

Traceback (most recent call last):
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/inference_demo", line 8, in <module>
    sys.exit(main())
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py", line 486, in main
    run_inference(model_cls, args)
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/inference_demo.py", line 296, in run_inference
    model.compile(args.compiled_model_path, debug=args.hlo_debug)
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/neuronx_distributed_inference/models/application_base.py", line 145, in compile
    traced_model = self.get_builder(debug).trace(initialize_model_weights=False)
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/trace/model_builder.py", line 310, in trace
    key, bucket_rank, neff_artifacts = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/neuronx_distributed/trace/model_builder.py", line 254, in submit_compilation_job
    return key, bucket_rank, torch_neuronx.xla_impl.trace.generate_neff(*args)
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 506, in generate_neff
    neff_filename = hlo_compile(
  File "/opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 396, in hlo_compile
    raise RuntimeError(f"neuronx-cc failed with {status}")
RuntimeError: neuronx-cc failed with 70

The error occurs during the model compilation phase when the neuronx-cc compiler attempts to generate the NEFF (Neuron Executable File Format) file for the model with batch size 4. The same configuration works successfully with batch sizes 1 and 2.
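
To make the boundary easy to reproduce, the compile step can be swept across batch sizes. The sketch below assumes the variables from the configuration above (MODEL_PATH, SEQ_LEN, TP_DEGREE, LNC) are still set; run_compile is a hypothetical wrapper around the exact inference_demo invocation above, with a per-batch-size output directory so compiled artifacts are not mixed:

# Hypothetical wrapper around the inference_demo invocation above;
# only --batch-size and --compiled-model-path vary per run.
run_compile() {
    local bs="$1"
    inference_demo \
        --model-type llama \
        --task-type causal-lm \
        run \
        --model-path $MODEL_PATH \
        --compiled-model-path "./traced_model_bs${bs}" \
        --torch-dtype bfloat16 \
        --start_rank_id 0 \
        --local_ranks_size $TP_DEGREE \
        --tp-degree $TP_DEGREE \
        --batch-size "$bs" \
        --seq-len $SEQ_LEN \
        --on-device-sampling \
        --top-k 1 \
        --do-sample \
        --fused-qkv \
        --sequence-parallel-enabled \
        --qkv-kernel-enabled \
        --attn-kernel-enabled \
        --mlp-kernel-enabled \
        --cc-pipeline-tiling-factor 1 \
        --pad-token-id 2 \
        --enable-bucketing \
        --logical-neuron-cores $LNC \
        --prompt "What is annapurna labs?"
}

# Sweep: with this setup, 1 and 2 compile cleanly and 4 aborts with [F134].
for bs in 1 2 4; do
    if run_compile "$bs"; then
        echo "batch size $bs: compiled OK"
    else
        echo "batch size $bs: neuronx-cc failed"
    fi
done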

@jluntamazon (Contributor)

Hi @wxnfifth5,

We are actively working on improving batching support at the moment and will see whether we can resolve this issue as part of that work.
