
[Bug]: EfficientAd is slower than other models in anomalib #2150

Open
haimat opened this issue Jun 24, 2024 · 16 comments

Comments


haimat commented Jun 24, 2024

Describe the bug

I had the impression that the EfficientAd model would be among the fastest in anomalib in terms of prediction time. To verify this, I trained three models, Padim, Fastflow, and EfficientAd, all on the same training data at an image size of 512x512 pixels. Then I wrote a small script that loads these models, warms up the GPU, and runs prediction on 100 images. I measure only the model forward time, with no image loading and no pre- or post-processing.

With the models exported to ONNX I get these results (avg. model forward time over 100 images):

  • EfficientAd: 0.0116 sec.
  • Fastflow: 0.0053 sec.
  • Padim: 0.0036 sec.

So in other words: the EfficientAd model is the slowest of these three, and Padim the fastest - I thought it would be the other way round. Am I missing something, or is this a bug in anomalib?

Dataset

Other (please specify in the text field below)

Model

Other (please specify in the field below)

Steps to reproduce the behavior

I trained three models on the same dataset, then predicted 100 images with each of them and measured the avg. model forward / inference time, without pre- or post-processing.
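The timing loop looks roughly like this (a minimal sketch, assuming onnxruntime-gpu; "model.onnx" is a placeholder path matching the 512x512 export):

```python
# Sketch of a forward-only timing loop: warm up the GPU first, then
# average the session.run() time over 100 images (no pre-/post-processing).
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 512, 512).astype(np.float32)

for _ in range(10):  # GPU warm-up, excluded from the average
    session.run(None, {input_name: dummy})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: dummy})
print(f"avg model forward time: {(time.perf_counter() - start) / runs:.4f} sec")
```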

OS information

OS information:

  • OS: Ubuntu 22.04
  • Python version: Python 3.10.12
  • Anomalib version: 1.1.0
  • PyTorch version: 2.2
  • CUDA/cuDNN version: 12.2
  • GPU models and configuration: 4x Nvidia A6000
  • Any other relevant information: I am using a custom dataset

Expected behavior

I would expect the EfficientAd net to be considerably faster than the other models.

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

-

Logs

-

Code of Conduct

  • I agree to follow this project's Code of Conduct
alexriedel1 (Contributor) commented Jun 25, 2024

Hi,
can you show how you measure the timing?
In plain PyTorch on 256x256 images I get the following speed measurements on a GTX 1660 Ti:

  • Padim: 6.4ms
  • EfficientAD S: 64.1ms
  • Fastflow: 33ms

When measuring this implementation of EfficientAD, which claims to reach the timing stats from the paper, I get the same speed of 64ms per image on my GPU. This makes me think that the EfficientAD in anomalib isn't slower than it should be.

The authors of EfficientAD state that
For each method, we remove unnecessary parts for the timing, such as the computation of losses during inference, and use float16 precision for all networks. Switching from float32 to float16 for the inference of EfficientAD does not change the anomaly detection results for the 32 anomaly detection scenarios evaluated in this paper. In latency-critical applications, padding in the PDN architecture of EfficientAD can be disabled. This speeds up the forward pass of the PDN architecture by 80 µs without impairing the detection of anomalies. We time EfficientAD without padding and therefore report the anomaly detection results for this setting in the experimental results of this paper

So you should be sure to set padding=False and use half precision. Half precision especially matters on some kinds of GPUs.
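A minimal sketch of both settings (assuming the anomalib EfficientAd model exposes the padding argument, as recent versions do):

```python
# Sketch: disable PDN padding and run the inner torch model in float16.
import torch

from anomalib.models import EfficientAd

model = EfficientAd(padding=False)  # padding=False disables PDN padding
torch_model = model.model.half().cuda().eval()  # cast weights to float16

image = torch.rand(1, 3, 256, 256, device="cuda", dtype=torch.float16)
with torch.inference_mode():
    prediction = torch_model(image)
```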

alexriedel1 (Contributor) commented Jun 26, 2024

I was curious and ran some more experiments. Half precision really matters, for example on a T4 GPU.
"Anomalib EfficientAD" refers to the anomalib implementation, "nelson" refers to this implementation.


256 x 256 image size

  • Anomalib EfficientAD S full precision: 24ms
  • Anomalib EfficientAD S half precision: 8.9ms
  • nelson EfficientAD S full precision: 21.5ms
  • nelson EfficientAD S half precision: 7.4ms
  • Anomalib Fastflow Resnet18 full precision: 23.37ms
  • Anomalib Fastflow Resnet18 half precision: 25.23ms

512 x 512 image size

  • Anomalib EfficientAD S full precision: 161ms
  • Anomalib EfficientAD S half precision: 30ms
  • nelson EfficientAD S full precision: 153ms
  • nelson EfficientAD S half precision: 27ms
  • Anomalib Fastflow Resnet18 full precision: 24.9ms
  • Anomalib Fastflow Resnet18 half precision: 26.1ms
  • Anomalib Fastflow Resnet50 full precision: 108ms
  • Anomalib Fastflow Resnet50 half precision: 52ms

What I take from these results (and it isn't big news): half precision matters especially for convolution-heavy models, image size matters, and the choice of GPU matters. The EfficientAD authors might not have made a fair comparison between the models, because I have the feeling they didn't use half precision for all the other models they compare their inference speed against.
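For reference, per-image GPU latency can be measured fairly with CUDA events (a sketch in plain PyTorch; time_model is a hypothetical helper):

```python
# Sketch of a fair per-image GPU latency measurement in plain PyTorch.
# CUDA events + synchronize capture actual kernel execution time,
# not just the asynchronous launch overhead.
import torch

def time_model(model: torch.nn.Module, size: int = 256,
               half: bool = False, runs: int = 100) -> float:
    dtype = torch.float16 if half else torch.float32
    model = model.to(device="cuda", dtype=dtype).eval()
    x = torch.rand(1, 3, size, size, device="cuda", dtype=dtype)
    with torch.inference_mode():
        for _ in range(10):  # warm-up
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / runs  # milliseconds per image
```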

haimat (Author) commented Jun 26, 2024

@alexriedel1 Thanks for your response.
I will try to reproduce your experiments and get back to you here soon!

haimat (Author) commented Jul 4, 2024

@alexriedel1 Looking at your test results, even with half precision the EfficientAd model is still slower than the full-precision Fastflow model at 512x512. That is quite a surprise ...

I have exported my models to ONNX and then converted them to TensorRT on Nvidia, with half precision enabled for the TensorRT conversion.

How have you turned half-precision mode on or off?
Also, how can you select the ResNet type for the Fastflow model?

blaz-r (Contributor) commented Jul 18, 2024

I think half precision makes quite a big difference due to tensor cores that operate on FP16. I'm not even sure that keeping the data in fp32 guarantees it isn't turned into fp16 behind the scenes (either by PyTorch or CUDA) for the tensor cores. However, the speed really does seem to depend greatly on the image size, and with compute-heavy models the GPU plays quite a big role as well (the H100, for example, has significantly faster tensor cores that work with FP16).

The EfficientAD authors might not have made a fair comparison between the models because I have the feeling they didn't use half precision for all the others they compare their inference speed with.

I'm not sure how exactly they did that, but I think every model they used can be set to FP16, BUT some really don't benefit much from it (probably due to the tensor cores mentioned above, which mostly do MMA operations).

To answer the other two questions:
I think you can put the model into fp16 by simply using model.to(torch.float16). Most models just need that; for some, I think you also have to manually cast a few variables inside the model to float16.

For the Fastflow model, you can specify the ResNet backbone either in the config file or by passing the backbone name to the constructor:

```python
def __init__(
    self,
    backbone: str = "resnet18",
```

@watertianyi

(quoting alexriedel1's benchmark results above)
Where is half precision set? I set the input to half precision with return to_dtype(to_image(image), torch.float16, scale=True) and set the model to half precision with self.model = self.model.to(torch.float16) in TorchInferencer, but inference fails with: RuntimeError: Input type (float) and bias type (c10::Half) should be the same

samet-akcay (Contributor) commented Nov 27, 2024

You could use the precision arg in Engine or Lightning's Trainer:
https://lightning.ai/docs/pytorch/stable/common/trainer.html#precision
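For example (a sketch; Engine forwards trainer arguments to Lightning's Trainer, so the usual precision strings should apply):

```python
# Sketch: request 16-bit mixed precision via anomalib's Engine.
from anomalib.engine import Engine

engine = Engine(precision="16-mixed")  # forwarded to lightning.Trainer
```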

@watertianyi

@samet-akcay
Do you mean that half precision must be set during training, but not during inference?

@alexriedel1 (Contributor)

@samet-akcay Do you mean that half precision must be set during training, but not during inference?

Preferably both during training and inference.
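Regarding the RuntimeError above: the input has to be cast to the same dtype and device as the weights. A minimal sketch (with a stand-in torchvision model rather than the anomalib inferencer):

```python
# Sketch: "Input type (float) and bias type (c10::Half)" means the input
# stayed float32 while the weights became float16; cast both sides.
import torch
import torchvision

model = torchvision.models.resnet18().cuda().eval().to(torch.float16)
image = torch.rand(1, 3, 224, 224)  # float32 by default

image = image.to(device="cuda", dtype=torch.float16)  # match the weights
with torch.inference_mode():
    output = model(image)
```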

@watertianyi

@alexriedel1 @blaz-r @samet-akcay

wang-xinyu/tensorrtx#1605

There is a loss of accuracy when converting EfficientAD to TensorRT. If you have time, could you help me find out what differs from the Python version? I really can't find it.

blaz-r (Contributor) commented Dec 18, 2024

Unfortunately I don't have experience with TensorRT, so I can't help in this regard.

@alexriedel1 (Contributor)

(quoting @watertianyi's question above)

Without any further information it's hard to help. Did you follow this example exactly: https://github.com/wang-xinyu/tensorrtx/tree/master/efficient_ad ?

How large is your accuracy drop?
How do you measure accuracy?

@watertianyi

(quoting the exchange above)

The model was trained on my own dataset. The Python test reaches an accuracy of 80%, while the C++ test with the above code only reaches 62.5%.
Accuracy here is the number of OK images predicted OK plus the number of NG images predicted NG, divided by the total number of OK and NG test images.

@alexriedel1 (Contributor)

(quoting @watertianyi's answer above)

Have you checked whether all parameters (model weights, normalization values, threshold values) are equal between your TensorRT and PyTorch models?

@watertianyi

(quoting the exchange above)

After comparing the code: inference does not include the Dropout layer, since that layer is only active during training and does not affect inference. I have no way to check the model weights directly. So for now I feed in a single image and compare the TensorRT output of each layer against the Python version. The biggest discrepancy shows up in map_combined = 0.5 * map_st + 0.5 * map_stae, while no big changes were found in map_st and map_stae individually.

TensorRT (C++) anomaly_map:
cpu_output_data[batch * kOutputSize]:65536
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 16]:-0.0641369
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 26]:-0.051166
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 36]:-0.0460452
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 37]:-0.0457649
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 38]:-0.0461044
cpu_output_data[0 * (image_wh * image_wh) + 16 * image_wh + 39]:-0.0470637
cpu_output_data[0 * (image_wh * image_wh) + 85 * image_wh + 222]:0.00670829
cpu_output_data[0 * (image_wh * image_wh) + 86 * image_wh + 223]:0.0123574
cpu_output_data[0 * (image_wh * image_wh) + 134 * image_wh + 162]:0.18391
cpu_output_data[0 * (image_wh * image_wh) + 135 * image_wh + 94]:0.0204612

PyTorch anomaly_map:
anomaly_map: torch.Size([1, 1, 256, 256])
map_combined[0, 0, 16, 16]: tensor(-0.0506, device='cuda:0', grad_fn=)
map_combined[0, 0, 16, 26]: tensor(-0.0447, device='cuda:0', grad_fn=)
map_combined[0, 0, 16, 36]: tensor(-0.0418, device='cuda:0', grad_fn=)
map_combined[0, 0, 16, 37]: tensor(-0.0416, device='cuda:0', grad_fn=)
map_combined[0, 0, 16, 38]: tensor(-0.0417, device='cuda:0', grad_fn=)
map_combined[0, 0, 16, 39]: tensor(-0.0421, device='cuda:0', grad_fn=)
map_combined[0, 0, 85, 222]: tensor(0.2825, device='cuda:0', grad_fn=)
map_combined[0, 0, 86, 223]: tensor(0.2824, device='cuda:0', grad_fn=)
map_combined[0, 0, 134, 162]: tensor(0.0903, device='cuda:0', grad_fn=)
map_combined[0, 0, 135, 94]: tensor(-0.0003, device='cuda:0', grad_fn=)

normalizedMap_stae:
cpu_output_data[batch * kOutputSize]:65536
cpu_output_data[0]:-0.0355803
cpu_output_data[1]:-0.0359128
cpu_output_data[2]:-0.0350038
cpu_output_data[3]:-0.0348574
cpu_output_data[4]:-0.0347626
cpu_output_data[5]:-0.0347191
cpu_output_data[85]:0.55863
cpu_output_data[86]:0.552995
cpu_output_data[134]:-0.00249266
cpu_output_data[135]:-0.0208227

normalize map_stae: torch.Size([1, 1, 256, 256])
normalize map_stae16: tensor(-0.0356, device='cuda:0', grad_fn=)
normalize map_stae16: tensor(-0.0359, device='cuda:0', grad_fn=)
normalize map_stae16: tensor(-0.0350, device='cuda:0', grad_fn=)
normalize map_stae16: tensor(-0.0349, device='cuda:0', grad_fn=)
normalize map_stae16: tensor(-0.0348, device='cuda:0', grad_fn=)
normalize map_stae16: tensor(-0.0348, device='cuda:0', grad_fn=)
normalize map_stae85: tensor(0.5592, device='cuda:0', grad_fn=)
normalize map_stae86: tensor(0.5535, device='cuda:0', grad_fn=)
normalize map_stae134: tensor(-0.0026, device='cuda:0', grad_fn= )
normalize map_stae135: tensor(-0.0209, device='cuda:0', grad_fn=)

normalizedMap_st:
cpu_output_data[batch * kOutputSize]:65536
cpu_output_data[0]:-0.064137
cpu_output_data[1]:-0.051166
cpu_output_data[2]:-0.0460452
cpu_output_data[3]:-0.045765
cpu_output_data[4]:-0.0461045
cpu_output_data[5]:-0.0470638
cpu_output_data[85]:0.00670823
cpu_output_data[86]:0.0123574
cpu_output_data[134]:0.18391
cpu_output_data[135]:0.0204611

normalize map_st: tensor(-0.0656, device='cuda:0', grad_fn=)
normalize map_st: tensor(-0.0535, device='cuda:0', grad_fn=)
normalize map_st: tensor(-0.0486, device='cuda:0', grad_fn=)
normalize map_st16: tensor(-0.0483, device='cuda:0', grad_fn=)
normalize map_st16: tensor(-0.0485, device='cuda:0', grad_fn=)
normalize map_st16: tensor(-0.0494, device='cuda:0', grad_fn=)
normalize map_st85: tensor(0.0057, device='cuda:0', grad_fn=)
normalize map_st86: tensor(0.0113, device='cuda:0', grad_fn=)
normalize map_st134: tensor(0.1832, device='cuda:0', grad_fn=)
normalize map_st135: tensor(0.0202, device='cuda:0', grad_fn=)

map_combined = 0.5 * map_st + 0.5 * map_stae
C++ code:
```cpp
// map_combined = 0.5f * map_st + 0.5f * map_stae
float* scaleVal = reinterpret_cast<float*>(malloc(sizeof(float) * 1));
scaleVal[0] = 0.5f;
Weights scaleWeight{DataType::kFLOAT, scaleVal, 1};

// Scale each map by 0.5 (shift and power weights are left empty)
IScaleLayer* mergeMapLayer1 =
    network->addScale(input1, ScaleMode::kUNIFORM, Weights{}, scaleWeight, Weights{});
assert(mergeMapLayer1);

IScaleLayer* mergeMapLayer2 =
    network->addScale(input2, ScaleMode::kUNIFORM, Weights{}, scaleWeight, Weights{});
assert(mergeMapLayer2);

// Element-wise sum of the two scaled maps
IElementWiseLayer* mergedMapLayer = network->addElementWise(
    *mergeMapLayer1->getOutput(0), *mergeMapLayer2->getOutput(0), ElementWiseOperation::kSUM);
assert(mergedMapLayer);
```
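One way to localize the divergence is to dump both anomaly maps and compare them numerically (a sketch; the .npy file names are hypothetical dumps of the two outputs):

```python
# Sketch: compare the TensorRT and PyTorch anomaly maps element-wise to
# find the largest deviation (file names are placeholders).
import numpy as np

trt = np.load("tensorrt_map.npy").reshape(256, 256)
ref = np.load("pytorch_map.npy").reshape(256, 256)

diff = np.abs(trt - ref)
print("max abs diff:", diff.max())
print("worst pixel:", np.unravel_index(diff.argmax(), diff.shape))
print("allclose (atol=1e-3):", np.allclose(trt, ref, atol=1e-3))
```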

@alexriedel1 (Contributor)

The values in your map_st are not equal, and that difference would explain your accuracy drop, I think.
I don't know how to solve this, however.
