
High CUDA Memory Usage in ONNX Runtime with Inconsistent Memory Release #2069

Open
2 of 4 tasks
niyathimariya opened this issue Oct 19, 2024 · 3 comments
Labels
question Further information is requested

Comments

niyathimariya commented Oct 19, 2024

System Info

Optimum version: 1.22.0
Platform: Linux (Ubuntu 22.04.4 LTS)
Python version: 3.12.2
ONNX Runtime Version: 1.19.2
CUDA Version: 12.1
CUDA Execution Provider: Yes (CUDA 12.1)

Who can help?

@JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

import torch
import onnxruntime as ort
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

def load_model(self, model_name):
    # Session options intended to curb ONNX Runtime GPU memory growth
    session_options = ort.SessionOptions()
    session_options.add_session_config_entry('cudnn_conv_use_max_workspace', '0')
    session_options.enable_mem_pattern = False
    session_options.arena_extend_strategy = "kSameAsRequested"
    session_options.gpu_mem_limit = 10 * 1024 * 1024 * 1024  # 10 GiB

    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, provider="CUDAExecutionProvider", session_options=session_options)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer, model

def inference(self, batch, doc_id='-1'):
    responses, status = '', False
    try:
        # Tokenize the batch and move the tensors to the GPU
        encodings = self.tokenizer(batch, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                encodings.input_ids,
                max_new_tokens=1024
            )
            responses = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            status = True
    except Exception as e:
        logger.error(f"Failed to do inference on LLM, error: {e}")

    # Frees cached PyTorch allocations only; ONNX Runtime manages its own arena
    torch.cuda.empty_cache()
    return status, responses
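A note on the options above: arena_extend_strategy, gpu_mem_limit and cudnn_conv_use_max_workspace are documented as CUDAExecutionProvider options rather than SessionOptions attributes, so they are presumably meant to be passed as provider options instead. A minimal sketch of that, assuming Optimum's provider_options argument to from_pretrained (gpu_mem_limit is in bytes):

# Sketch: passing the memory settings as CUDAExecutionProvider options
# (provider_options is assumed to be forwarded to the CUDA execution provider).
provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": 10 * 1024 * 1024 * 1024,  # 10 GiB, in bytes
    "cudnn_conv_use_max_workspace": "0",
}
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_name,
    provider="CUDAExecutionProvider",
    provider_options=provider_options,
)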

Expected behavior

I expect CUDA memory to be released when smaller inputs follow larger ones, so that memory usage drops back down for subsequent inputs instead of staying at the peak.
[Picture1: chart of GPU memory usage per sample number]

niyathimariya added the bug (Something isn't working) label on Oct 19, 2024
IlyasMoutawwakil (Member) commented Oct 21, 2024

Hi, the code you provided doesn't explain how you got the chart in your issue, and what is "sample number" in this case?

IlyasMoutawwakil added the question (Further information is requested) label and removed the bug (Something isn't working) label on Oct 21, 2024
niyathimariya (Author) commented Oct 21, 2024

Hi @IlyasMoutawwakil, the code I’ve provided shows how I’m loading the model and performing inference. I’ve also included a graph showing the GPU memory consumed as inference progresses. I recorded the GPU usage after each inference with the following code:

import subprocess

def get_used_memory_mib(pid):
    # Query per-process GPU memory usage (MiB) via nvidia-smi
    result = subprocess.run(['nvidia-smi', '--query-compute-apps=pid,gpu_name,used_memory', '--format=csv,noheader,nounits'],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

    if result.returncode != 0:
        print("Failed to run nvidia-smi:", result.stderr)
        return None
    gpu_processes = result.stdout.strip().split('\n')

    used_memory_mib = None
    for process in gpu_processes:
        process_info = process.split(', ')
        process_pid = process_info[0]

        # Keep the entry whose PID matches this process
        if process_pid == str(pid):
            used_memory_mib = int(process_info[2])
    return used_memory_mib
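The helper is then called after each inference with the worker's process id, for example:

import os

used_memory_mib = get_used_memory_mib(os.getpid())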

The graph demonstrates that the model, which I converted to ONNX using the save_pretrained() method, does not release memory when it encounters a shorter input sequence after processing a longer one, whereas the PyTorch model does release memory in such cases.

I've plotted another graph showing the input shape (batch size, sequence length) for each sample:
[Picture1: chart of GPU memory usage per sample, annotated with input shape (batch size, sequence length)]

IlyasMoutawwakil (Member) commented Oct 21, 2024

And I assume "sample number" is supposed to mean sequence length? Edit: okay, thanks, I see the updated graph.
Either way, this doesn't seem like an Optimum issue, but rather one on the ONNX Runtime side (the inference session), since that is the part that handles memory allocation and release.
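For reference, a minimal sketch of the ONNX Runtime knob that targets exactly this behaviour: the run option memory.enable_memory_arena_shrinkage asks a session to shrink its memory arena back down after a run, and it only takes effect when the provider's arena_extend_strategy is kSameAsRequested. This is plain onnxruntime, not an Optimum API; "model.onnx" and the input feed below are placeholders:

import numpy as np
import onnxruntime as ort

# Sketch: per-run GPU arena shrinkage at the onnxruntime level
provider_options = {"arena_extend_strategy": "kSameAsRequested"}
session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=[("CUDAExecutionProvider", provider_options), "CPUExecutionProvider"],
)

run_options = ort.RunOptions()
# Ask ORT to shrink the arena on GPU device 0 after this run completes.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

inputs = {"input_ids": np.ones((1, 16), dtype=np.int64)}  # placeholder feed
outputs = session.run(None, inputs, run_options)

Whether Optimum exposes run options through generate() I'm not sure; it may require dropping down to the underlying InferenceSession objects.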
