
Building ppyoloe_crn_l with Paddle Inference 2.5 on Windows 10 fails with the following error — how can it be fixed? #519

Open
dict1234 opened this issue May 5, 2024 · 11 comments
dict1234 commented May 5, 2024

ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams

D:\000-AI\paddle\Deploy\2.5\Paddle-Inference-Demo-master\c++\gpu\ppyoloe_crn_l\build\Release>ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams
--- Running analysis [ir_graph_build_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0505 13:45:31.034536 5320 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [identity_scale_op_clean_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [constant_folding_pass]
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0505 13:45:31.640825 5320 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [vit_attention_fuse_pass]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_squeeze2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_reshape2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_flatten2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
--- Running IR pass [matmul_scale_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v3]
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
I0505 13:45:34.308128 5320 fuse_pass_base.cc:59] --- detected 9 subgraphs
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0505 13:45:34.541483 5320 fuse_pass_base.cc:59] --- detected 118 subgraphs
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [conv2d_fusion_layout_transfer_pass]
--- Running IR pass [transfer_layout_elim_pass]
--- Running IR pass [auto_mixed_precision_pass]
--- Running IR pass [inplace_op_var_pass]
--- Running analysis [save_optimized_model_pass]
W0505 13:45:34.565424 5320 save_optimized_model_pass.cc:28] save_optim_cache_model is turned off, skip save_optimized_model_pass
--- Running analysis [ir_params_sync_among_devices_pass]
I0505 13:45:34.566417 5320 ir_params_sync_among_devices_pass.cc:51] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : tmp_2 size: 26214400
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : batch_norm_2.tmp_2 size: 26214400
I0505 13:45:34.726987 5320 memory_optimize_pass.cc:222] Cluster name : image size: 4915200
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : sigmoid_2.tmp_0 size: 26214400
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : batch_norm_48.tmp_2 size: 1228800
I0505 13:45:34.727985 5320 memory_optimize_pass.cc:222] Cluster name : tmp_0 size: 13107200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : elementwise_add_0 size: 4915200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : tmp_7 size: 4915200
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : elementwise_add_16 size: 614400
I0505 13:45:34.728982 5320 memory_optimize_pass.cc:222] Cluster name : pool2d_5.tmp_0 size: 768
I0505 13:45:34.729979 5320 memory_optimize_pass.cc:222] Cluster name : scale_factor size: 8
I0505 13:45:34.729979 5320 memory_optimize_pass.cc:222] Cluster name : shape_2.tmp_0_slice_0 size: 4
--- Running analysis [ir_graph_to_program_pass]
I0505 13:45:34.970336 5320 analysis_predictor.cc:1660] ======= optimize end =======
I0505 13:45:34.971334 5320 naive_executor.cc:164] --- skip [feed], feed -> scale_factor
I0505 13:45:34.973328 5320 naive_executor.cc:164] --- skip [feed], feed -> image
I0505 13:45:34.984299 5320 naive_executor.cc:164] --- skip [gather_nd_0.tmp_0], fetch -> fetch
I0505 13:45:34.984299 5320 naive_executor.cc:164] --- skip [multiclass_nms3_0.tmp_2], fetch -> fetch
W0505 13:45:34.987293 5320 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.2, Runtime API Version: 11.8
W0505 13:45:34.991281 5320 gpu_resources.cc:149] device: 0, cuDNN Version: 8.6.


C++ Traceback (most recent call last):

Not support stack backtrace yet.


Error Message Summary:

InvalidArgumentError: The axis is expected to be in range of [-1, 1), but got 1
[Hint: Expected axis_value >= -rank && axis_value < rank == true, but received axis_value >= -rank && axis_value < rank:0 != true:1.] (at ..\paddle\phi\infermeta\unary.cc:3567)

@kangguangli

Hi, thanks for the report. The cause isn't clear from the error above. You can set the following flag:
export FLAGS_call_stack_level=2
This makes Paddle print the C++-side error stack, which gives us more information.

You can also follow https://www.paddlepaddle.org.cn/inference/master/guides/performance_tuning/precision_tracing.html to check whether the problem comes from a specific pass.

Finally, you can try switching to version 2.6 to see whether the problem still occurs.

If you try any of the above, please be sure to post the results here — it will help us analyze and locate the issue.
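
The flag is an environment variable read at startup, so it must be set in the shell before launching the binary. A sketch for both shells (note the Windows cmd form, used later in this thread, must have no spaces around `=`):

```shell
# Linux / macOS:
export FLAGS_call_stack_level=2
# Windows cmd equivalent (no spaces around '='):
#   set FLAGS_call_stack_level=2
```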

@dict1234
Author

dict1234 commented May 7, 2024

On Windows 10, I set it in cmd with: set FLAGS_call_stack_level=2 (cmd requires no spaces around '=').
Configuration:
paddle_inference 2.6
CUDA 11.8
cuDNN 8.6
TensorRT 8.5

Code:
// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include <chrono>
#include <iostream>
#include <memory>
#include <numeric>
#include <vector>

#include <gflags/gflags.h>
#include <glog/logging.h>

#include "paddle_inference_api.h"

using paddle_infer::Config;
using paddle_infer::CreatePredictor;
using paddle_infer::PrecisionType;
using paddle_infer::Predictor;

DEFINE_string(model_dir, "", "Directory of the inference model.");
DEFINE_string(model_file, "", "Path of the inference model file.");
DEFINE_string(params_file, "", "Path of the inference params file.");
DEFINE_string(run_mode, "paddle_gpu", "run_mode which can be: trt_fp32, trt_fp16 and trt_int8 and paddle_gpu");
DEFINE_int32(batch_size, 1, "Batch size.");
DEFINE_int32(gpu_id, 0, "GPU card ID num.");
DEFINE_int32(trt_min_subgraph_size, 3, "tensorrt min_subgraph_size");
DEFINE_int32(warmup, 50, "warmup");
DEFINE_int32(repeats, 1000, "repeats");
DEFINE_bool(use_dynamic_shape, false, "use trt dynaminc shape.");
DEFINE_bool(use_calib, true, "use trt int8 calibration.");
DEFINE_bool(use_collect_shape, false, "Collect trt shape information");
DEFINE_string(dynamic_shape_file, "", "trt shape information name");

using Time = decltype(std::chrono::high_resolution_clock::now());
Time time() { return std::chrono::high_resolution_clock::now(); };
double time_diff(Time t1, Time t2)
{
typedef std::chrono::microseconds ms;
auto diff = t2 - t1;
ms counter = std::chrono::duration_cast<ms>(diff);
return counter.count() / 1000.0;
}

std::shared_ptr<Predictor> InitPredictor()
{
Config config;
if (FLAGS_model_dir != "")
{
config.SetModel(FLAGS_model_dir);
}
else
{
config.SetModel(FLAGS_model_file, FLAGS_params_file);
}

config.EnableUseGpu(500, FLAGS_gpu_id);

if (FLAGS_run_mode == "trt_fp32")
{
	config.EnableTensorRtEngine((1 << 30) * FLAGS_batch_size,
		FLAGS_batch_size,
		FLAGS_trt_min_subgraph_size,
		PrecisionType::kFloat32,
		false,
		false);
}
else if (FLAGS_run_mode == "trt_fp16") 
{
	config.EnableTensorRtEngine((1 << 30) * FLAGS_batch_size,
		FLAGS_batch_size,
		FLAGS_trt_min_subgraph_size,
		PrecisionType::kHalf,
		false,
		false);
}
else if (FLAGS_run_mode == "trt_int8")
{
	config.EnableTensorRtEngine((1 << 30) * FLAGS_batch_size,
		FLAGS_batch_size,
		FLAGS_trt_min_subgraph_size,
		PrecisionType::kInt8,
		false,
		FLAGS_use_calib);
}

if (FLAGS_use_dynamic_shape && FLAGS_use_collect_shape)
{
	config.CollectShapeRangeInfo(FLAGS_dynamic_shape_file);
}
else if (FLAGS_use_dynamic_shape && !FLAGS_use_collect_shape)
{
	config.EnableTunedTensorRtDynamicShape(FLAGS_dynamic_shape_file);
}
// Open the memory optim.
config.EnableMemoryOptim();
config.SwitchIrOptim(true);
return CreatePredictor(config);

}

void run(Predictor* predictor, const std::vector<float>& input, const std::vector<int>& input_shape, std::vector<float>* out_data)
{
int input_num = std::accumulate(input_shape.begin(), input_shape.end(), 1, std::multiplies<int>());

auto input_names = predictor->GetInputNames();
auto output_names = predictor->GetOutputNames();
auto input_t = predictor->GetInputHandle(input_names[0]);
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run] got input handle...";
input_t->Reshape(input_shape);
input_t->CopyFromCpu(input.data());
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run]FLAGS_warmup...";
for (size_t i = 0; i < FLAGS_warmup; ++i)
{
	CHECK(predictor->Run());
}
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "[run] starting inference loop...";
auto st = time();
for (size_t i = 0; i < FLAGS_repeats; ++i)
{
	LOG(INFO) << "[run]..." << i;
	CHECK(predictor->Run());
	auto output_t = predictor->GetOutputHandle(output_names[0]);
	std::vector<int> output_shape = output_t->shape();
	int out_num = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<int>());
	out_data->resize(out_num);
	output_t->CopyToCpu(out_data->data());
	LOG(INFO) << "[run] out_data:" << out_data->size();
}
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "run avg time is " << time_diff(st, time()) / FLAGS_repeats << " ms";

}

int main(int argc, char* argv[])
{
google::ParseCommandLineFlags(&argc, &argv, true);
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "Initializing predictor...";
auto predictor = InitPredictor();
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "Setting up input shape...";
std::vector<int> input_shape = { FLAGS_batch_size, 3, 640, 640 };
std::vector<float> input_data(FLAGS_batch_size * 3 * 640 * 640);
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "Filling input data...";
for (size_t i = 0; i < input_data.size(); ++i) input_data[i] = (i % 255) * 0.1f;
std::vector<float> out_data;
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "Starting detection...";
run(predictor.get(), input_data, input_shape, &out_data);
LOG(INFO) << "[" << __FUNCTION__ << ":" << __LINE__ << "]" << "Detection finished..." << "out_data:" << out_data.size();

return 0;

}

Problem:
D:\000-AI\paddle\Deploy\2.6\Paddle-Inference-Demo-master\c++\gpu\ppyoloe_crn_l\build\Release>ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel --params_file ppyoloe_crn_l_300e_coco/model.pdiparams
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0507 08:33:49.911533 8644 ppyoloe_crn_l.cc:142] [main:142] Initializing predictor...
--- Running analysis [ir_graph_build_pass]
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0507 08:33:51.262887 8644 executor.cc:187] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running IR pass [map_op_to_another_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [delete_quant_dequant_linear_op_pass]
--- Running IR pass [delete_weight_dequant_linear_op_pass]
--- Running IR pass [constant_folding_pass]
I0507 08:33:51.661852 8644 fuse_pass_base.cc:59] --- detected 13 subgraphs
--- Running IR pass [silu_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
I0507 08:33:51.857331 8644 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [embedding_eltwise_layernorm_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v2]
--- Running IR pass [vit_attention_fuse_pass]
--- Running IR pass [fused_multi_transformer_encoder_pass]
--- Running IR pass [fused_multi_transformer_decoder_pass]
--- Running IR pass [fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_encoder_fuse_qkv_pass]
--- Running IR pass [multi_devices_fused_multi_transformer_decoder_fuse_qkv_pass]
--- Running IR pass [fuse_multi_transformer_layer_pass]
--- Running IR pass [gpu_cpu_squeeze2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_reshape2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_flatten2_matmul_fuse_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_mul_pass]
--- Running IR pass [gpu_cpu_map_matmul_v2_to_matmul_pass]
--- Running IR pass [matmul_scale_fuse_pass]
--- Running IR pass [multihead_matmul_fuse_pass_v3]
--- Running IR pass [gpu_cpu_map_matmul_to_mul_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [fc_elementwise_layernorm_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
I0507 08:33:54.501273 8644 fuse_pass_base.cc:59] --- detected 9 subgraphs
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [conv_elementwise_add_fuse_pass]
I0507 08:33:54.719719 8644 fuse_pass_base.cc:59] --- detected 118 subgraphs
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [fused_conv2d_add_act_layout_transfer_pass]
--- Running IR pass [transfer_layout_elim_pass]
I0507 08:33:54.738648 8644 transfer_layout_elim_pass.cc:346] move down 0 transfer_layout
I0507 08:33:54.738648 8644 transfer_layout_elim_pass.cc:347] eliminate 0 pair of transfer_layout
--- Running IR pass [auto_mixed_precision_pass]
--- Running IR pass [identity_op_clean_pass]
--- Running IR pass [inplace_op_var_pass]
I0507 08:33:54.753597 8644 fuse_pass_base.cc:59] --- detected 11 subgraphs
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0507 08:33:54.756589 8644 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0507 08:33:54.897214 8644 memory_optimize_pass.cc:118] The persistable params in main graph are : 199.204MB
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : tmp_2 size: 26214400
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : batch_norm_2.tmp_2 size: 26214400
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : tmp_68 size: 1228800
I0507 08:33:54.906189 8644 memory_optimize_pass.cc:246] Cluster name : image size: 4915200
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : sigmoid_2.tmp_0 size: 26214400
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : tmp_0 size: 13107200
I0507 08:33:54.907187 8644 memory_optimize_pass.cc:246] Cluster name : elementwise_add_0 size: 4915200
I0507 08:33:54.908183 8644 memory_optimize_pass.cc:246] Cluster name : tmp_7 size: 4915200
I0507 08:33:54.909183 8644 memory_optimize_pass.cc:246] Cluster name : scale_factor size: 8
I0507 08:33:54.910179 8644 memory_optimize_pass.cc:246] Cluster name : pool2d_1.tmp_0 size: 614400
I0507 08:33:54.918157 8644 memory_optimize_pass.cc:246] Cluster name : pool2d_5.tmp_0 size: 768
I0507 08:33:54.919155 8644 memory_optimize_pass.cc:246] Cluster name : shape_2.tmp_0_slice_0 size: 4
--- Running analysis [ir_graph_to_program_pass]
I0507 08:33:55.138567 8644 analysis_predictor.cc:1838] ======= optimize end =======
I0507 08:33:55.139565 8644 naive_executor.cc:200] --- skip [feed], feed -> scale_factor
I0507 08:33:55.139565 8644 naive_executor.cc:200] --- skip [feed], feed -> image
I0507 08:33:55.150537 8644 naive_executor.cc:200] --- skip [gather_nd_0.tmp_0], fetch -> fetch
I0507 08:33:55.150537 8644 naive_executor.cc:200] --- skip [multiclass_nms3_0.tmp_2], fetch -> fetch
I0507 08:33:55.151533 8644 ppyoloe_crn_l.cc:144] [main:144] Setting up input shape...
I0507 08:33:55.152530 8644 ppyoloe_crn_l.cc:147] [main:147] Filling input data...
I0507 08:33:55.154525 8644 ppyoloe_crn_l.cc:150] [main:150] Starting detection...
I0507 08:33:55.154525 8644 ppyoloe_crn_l.cc:115] [run:115][run] got input handle...
W0507 08:33:55.154525 8644 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.2, Runtime API Version: 11.8
W0507 08:33:55.157517 8644 gpu_resources.cc:164] device: 0, cuDNN Version: 8.6.
I0507 08:33:55.158514 8644 ppyoloe_crn_l.cc:118] [run:118][run]FLAGS_warmup...


C++ Traceback (most recent call last):

Not support stack backtrace yet.


Error Message Summary:

InvalidArgumentError: The axis is expected to be in range of [1, -1), but got 1
[Hint: Expected axis_value >= -rank && axis_value < rank == true, but received axis_value >= -rank && axis_value < rank:0 != true:1.] (at ..\paddle\phi\infermeta\unary.cc:3814)

@kangguangli

Hi, this now looks like an internal problem: one of the inputs is not being initialized. We'll submit a PR to fix it as soon as possible and follow up here once it's merged.

@kangguangli

@dict1234 #520 should fix this — please pull the latest code and try again.

@lizexu123
Contributor

This has been fixed.

@dict1234
Author

dict1234 commented May 8, 2024

Hi, I did some debugging yesterday as well. Checking the options one by one, I found it is a run_mode setting issue: the default is paddle_gpu, and switching to one of the trt_xxx modes makes it work.
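
Based on the flags defined in the demo source above, the workaround invocation would look like this (model paths copied from earlier in this thread; trt_fp16 is one of the accepted modes, ^ is cmd's line continuation):

```shell
ppyoloe_crn_l.exe --model_file ppyoloe_crn_l_300e_coco/model.pdmodel ^
  --params_file ppyoloe_crn_l_300e_coco/model.pdiparams ^
  --run_mode trt_fp16
```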

@dict1234
Author

dict1234 commented May 8, 2024

One more question: with TensorRT enabled, loading the model takes far too long. Is there any way to finish it within seconds, or at most tens of seconds?

@lizexu123
Contributor

lizexu123 commented May 8, 2024 via email

@kangguangli

Hi, I did some debugging yesterday as well. Checking the options one by one, I found it is a run_mode setting issue: the default is paddle_gpu, and switching to one of the trt_xxx modes makes it work.

In theory every run_mode should work. We have now fixed the problem under native GPU, so you can try native GPU again. The long TRT load time is probably related to TensorRT's graph optimization process; could you share the exact load time you are seeing?

@dict1234
Author

dict1234 commented May 8, 2024

W0508 15:06:45.928721 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0508 15:06:45.929718 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
W0508 15:06:45.929718 8180 op_compat_sensible_pass.cc:232] Check the Attr(axis) of Op(elementwise_add) in pass(conv_elementwise_add_fuse_pass) failed!
W0508 15:06:45.929718 8180 conv_elementwise_add_fuse_pass.cc:94] Pass in op compat failed.
I0508 15:06:45.944679 8180 fuse_pass_base.cc:59] --- detected 78 subgraphs
--- Running IR pass [remove_padding_recover_padding_pass]
--- Running IR pass [delete_remove_padding_recover_padding_pass]
--- Running IR pass [dense_fc_to_sparse_pass]
--- Running IR pass [dense_multihead_matmul_to_sparse_pass]
--- Running IR pass [tensorrt_subgraph_pass]
I0508 15:06:45.996539 8180 tensorrt_subgraph_pass.cc:302] --- detect a sub-graph with 387 nodes
I0508 15:06:46.060369 8180 tensorrt_subgraph_pass.cc:846] Prepare TRT engine (Optimize model structure, Select OP kernel etc). This process may cost a lot of time.
W0508 15:06:48.557268 8180 helper.h:127] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
W0508 15:06:48.557268 8180 helper.h:127] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
W0508 15:06:48.560261 8180 place.cc:161] The paddle::PlaceType::kCPU/kGPU is deprecated since version 2.3, and will be removed in version 2.4! Please use Tensor::is_cpu()/is_gpu() method to determine the type of place.
W0508 15:06:48.563252 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.570233 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.581204 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
W0508 15:06:48.591177 8180 helper.h:127] Tensor DataType is determined at build time for tensors not marked as input or output.
I0508 15:06:48.701881 8180 engine.cc:215] Run Paddle-TRT FP16 mode
W0508 15:09:45.615533 8180 helper.h:127] TensorRT encountered issues when converting weights between types and that could affect accuracy.
W0508 15:09:45.615533 8180 helper.h:127] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
W0508 15:09:45.616530 8180 helper.h:127] Check verbose logs for the list of affected weights.
W0508 15:09:45.616530 8180 helper.h:127] - 138 weights are affected by this issue: Detected subnormal FP16 values.
W0508 15:09:45.616530 8180 helper.h:127] - 63 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_elementwise_add_act_fuse_pass]
--- Running IR pass [conv_elementwise_add2_act_fuse_pass]
--- Running IR pass [transpose_flatten_concat_fuse_pass]
--- Running IR pass [auto_mixed_precision_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I0508 15:09:45.832952 8180 ir_params_sync_among_devices_pass.cc:53] Sync params from CPU to GPU
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [memory_optimize_pass]
I0508 15:09:45.852897 8180 memory_optimize_pass.cc:118] The persistable params in main graph are : 199.14MB
I0508 15:09:45.854892 8180 memory_optimize_pass.cc:246] Cluster name : multiclass_nms3_0.tmp_1 size: 4
--- Running analysis [ir_graph_to_program_pass]

The two key lines in the middle:
I0508 15:06:48.701881 8180 engine.cc:215] Run Paddle-TRT FP16 mode
W0508 15:09:45.615533 8180 helper.h:127] TensorRT encountered issues when converting weights between types and that could affect accuracy.

From 15:06:48 to 15:09:45 — nearly 3 minutes.

Also, about "native GPU": does that mean you have already updated the SDK package? Where can I download it? Could you share a link?

@kangguangli

kangguangli commented May 9, 2024

"Native GPU" refers to your original configuration, i.e. without TRT. The update is mainly to this repository: just pull the latest commit of this repo, or manually update c++/gpu/ppyoloe_crn_l/ppyoloe_crn_l.cc following #520.

As for the TRT load-time issue, I will pass it on to the relevant colleagues; it may not be fixable in the short term.
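
Not suggested in this thread, but one common mitigation is to cache the serialized TRT engine on disk so the multi-minute optimization only happens on the first run. A sketch, assuming the `use_static` argument of `EnableTensorRtEngine` and `Config::SetOptimCacheDir` behave as in recent Paddle Inference releases — treat it as a starting point, not a verified fix:

```cpp
// Sketch: cache the serialized TensorRT engine between runs.
// The first run still pays the ~3-minute build cost; subsequent runs
// deserialize the engine from the cache directory instead of rebuilding.
config.EnableTensorRtEngine((1 << 30) * FLAGS_batch_size,
                            FLAGS_batch_size,
                            FLAGS_trt_min_subgraph_size,
                            PrecisionType::kHalf,
                            /*use_static=*/true,   // serialize engine to disk
                            /*use_calib_mode=*/false);
config.SetOptimCacheDir("./trt_cache");  // hypothetical cache directory
```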
