Correct example to use TensorRT? #1985

Open

sherlcok314159 opened this issue Aug 8, 2024 · 2 comments

sherlcok314159 commented Aug 8, 2024

System Info

optimum: 1.20.0
os: ubuntu 20.04 with RTX 2080TI
python: 3.10.14

Who can help?

@michaelbenayoun @JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I followed the doc here. Below is my code:

from transformers import AutoProcessor
from optimum.onnxruntime import ORTModelForVision2Seq

model = 'facebook/nougat-small'

# Export the model to ONNX and create the inference session with the TensorRT execution provider
ort_model = ORTModelForVision2Seq.from_pretrained(
    model,
    export=True,
    provider="TensorrtExecutionProvider",
)

# TensorRT should be the preferred provider, with CUDA and CPU as fallbacks
assert ort_model.providers == ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

processor = AutoProcessor.from_pretrained(model)
ort_model.save_pretrained('./nougat-small-trt')
processor.save_pretrained('./nougat-small-trt')

When running the code, the terminal output looks like this:

2024-08-08 16:31:02.881585368 [W:onnxruntime:Default, tensorrt_execution_provider.h:83 log] [2024-08-08 08:31:02 WARNING] onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped

I waited almost half an hour for the export to finish (RTX 2080 Ti). However, when I loaded the exported model with the code below, the whole conversion was repeated.

import os

import onnxruntime as ort
from optimum.onnxruntime import ORTModelForVision2Seq

model = './nougat-small-trt'  # directory where the exported model was saved above

session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.log_severity_level = 3

# Cache the built TensorRT engines so they can be reused on later loads
trt_engine_cache = './nougat-small-trt-cache'
os.makedirs(trt_engine_cache, exist_ok=True)
provider_options = {
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': trt_engine_cache,
}

ort_model = ORTModelForVision2Seq.from_pretrained(
    model,
    provider='TensorrtExecutionProvider',
    provider_options=provider_options,
    session_options=session_options,
)
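
After the first load with this configuration, one way to check whether any TensorRT engine files were actually written is to list the cache directory (a minimal check, assuming the same cache path as above):

import os

trt_engine_cache = './nougat-small-trt-cache'
# The TensorRT execution provider serializes built engines as .engine/.profile files in this directory
cached_files = sorted(os.listdir(trt_engine_cache))
print(f"{len(cached_files)} cached file(s):", cached_files)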

Therefore, I want to know whether Optimum supports TensorRT or not, or whether there is something wrong with the official doc for running TensorRT.

Expected behavior

When loading a model that has already been converted for TensorRT, Optimum should not repeat the conversion process.
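
As a possible workaround (not verified for this model), the engine cache could be enabled already on the very first from_pretrained call, so that the engine built during export can be picked up by later loads. A minimal sketch, assuming the same cache path as above:

from optimum.onnxruntime import ORTModelForVision2Seq

# Enable the TensorRT engine cache during the initial export, so a later
# from_pretrained with the same provider_options can reuse the serialized engine
# instead of rebuilding it (assumption: the path matches the one used at load time).
ort_model = ORTModelForVision2Seq.from_pretrained(
    "facebook/nougat-small",
    export=True,
    provider="TensorrtExecutionProvider",
    provider_options={
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "./nougat-small-trt-cache",
    },
)
ort_model.save_pretrained("./nougat-small-trt")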

sherlcok314159 added the bug label Aug 8, 2024
mht-sharma self-assigned this Aug 21, 2024
mht-sharma (Contributor) commented

Hi @sherlcok314159, thanks for the issue.

I tested the ORTModel with various models using the TensorRTExecutionProvider for caching, and it generally worked as expected.

However, for the facebook/nougat-small encoder, the engine cache doesn't seem to be effective enough to significantly reduce the load time, as the model appears to be recompiling.

I timed the inference of the encoder with and without the cache and noticed a 5-6 minute speedup with the engine cache. Despite this, the total load time remained around 30 minutes on an A100.

Given that, this looks like a TensorRT-specific issue. Could you please create a separate issue on the NVIDIA forum or in the ORT repository? Below is the code I used for testing:

from onnxruntime import SessionOptions, InferenceSession
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor
from huggingface_hub import hf_hub_download

# Set session options
sess_opt = SessionOptions()
sess_opt.log_severity_level = 0
# sess_opt.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

# Provider options for TensorRT
provider_options = {
    "trt_detailed_build_log": True,
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "nougat_onx_cache"
}

# Load the ONNX model
model_path = "nougat-small-onnx/encoder_model.onnx"

# Create an inference session with TensorRT execution provider
session = InferenceSession(model_path, sess_options=sess_opt, providers=[('TensorrtExecutionProvider', provider_options)])

# Download and process the image
filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
image = Image.open(filepath)
processor = AutoProcessor.from_pretrained("facebook/nougat-small")  # Replace with your processor model
pixel_values = processor(image, return_tensors="pt").pixel_values.numpy()

# Prepare input data
input_name = session.get_inputs()[0].name

# Run inference
outputs = session.run(None, {input_name: pixel_values})

# Process and print the output
print("Inference output:", outputs)

sherlcok314159 (Author) commented

I don't have time to follow up on this. Could you please refer this issue to the TensorRT team? And if there is a solution, please let me know. Thanks!
