This repository provides an unofficial PyTorch implementation of Score-CAM [1]. Score-CAM is a CAM (class activation mapping) [2] based visual explanation method like Grad-CAM [3] and Grad-CAM++ [4], but unlike them it does not depend on gradients and can therefore provide stable visual explanations. This repository also contains additional functions, such as CSKIP; see the following sections for details.
The features of this implementation are:

- Versatile: The code of this repository is applicable to many types of neural networks, not only the models provided by the `torchvision` module but also custom CNN models.
- Portable: This repository is easily transplanted into user projects. At the moment, all you need to do is copy a single Python file into your project.
- Less dependent: The core module of this repository has few dependencies for easier transplantation into user projects. The current implementation depends only on the `numpy` and `torch` modules.
The core module of this repository, `scorecam`, requires only NumPy and PyTorch:

```sh
pip3 install numpy torch
```

Additionally, the example code `examples.py` requires OpenCV, Matplotlib, and Torchvision:

```sh
pip3 install opencv-python matplotlib torchvision
```
```python
import numpy as np
import cv2 as cv
import torchvision

# Import the ScoreCAM class.
from scorecam import ScoreCAM

# Load the NN model.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Load the input image.
image = cv.imread("resources/sample_image_01.jpg", cv.IMREAD_COLOR)
image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
image = cv.resize(image, (224, 224), interpolation=cv.INTER_CUBIC)

# Normalize the image.
IMAGENET1K_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET1K_STD = np.array([0.229, 0.224, 0.225])
x = (image / 255.0 - IMAGENET1K_MEAN) / IMAGENET1K_STD

# Create a Score-CAM instance.
scorecam = ScoreCAM(model, actmap="layer4")

# Compute the visual explanation.
# The argument 'coi' means 'class of interest', and the number 242
# is the index of the label 'boxer' (a breed of dog) in ImageNet.
L = scorecam.compute(x, coi=242)
print(L)
```
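To visualize the result, you can overlay the explanation map on the input image. The following is a minimal sketch, not part of the repository's API: it assumes `L` behaves like a 2D NumPy array, and the colormap and 0.5/0.5 blending weights are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Normalize the explanation map to [0, 1] for display.
# NOTE: this assumes L behaves like a 2D NumPy array.
heatmap = np.asarray(L, dtype=np.float32)
heatmap = cv.resize(heatmap, (224, 224))  # in case L is at the activation-map resolution
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)

# Blend the heatmap with the input image (the 0.5/0.5 weights are arbitrary).
colored = cv.applyColorMap(np.uint8(255 * heatmap), cv.COLORMAP_JET)
colored = cv.cvtColor(colored, cv.COLOR_BGR2RGB)
overlay = np.uint8(0.5 * image + 0.5 * colored)

plt.imshow(overlay)
plt.axis("off")
plt.show()
```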
If your CNN model is not a classification network, the class of interest does not make sense, and you need a custom scoring function for Score-CAM. In such a case, you can pass a Python function as the `coi` argument.
For example, imagine that your CNN is YOLO and outputs a tensor of shape (B, 5 + C, H, W), where B is the batch size, C is the number of classes, and (H, W) is the output resolution. If you want to analyze the detection result of class `c` at position `(h, w)`, the custom function can be written like the following:
```python
# Define a custom scoring function.
# Assuming the first five channels hold the box parameters,
# the score of class c is at channel index 5 + c.
coi_fn = lambda output: output[:, 5 + c, h, w]
L = scorecam.compute(x, coi=coi_fn)
```
Note that the following code

```python
L = scorecam.compute(x, coi=target_index)
```

is equivalent to

```python
# Define a scoring function.
coi_fn = lambda output: output[:, target_index]
L = scorecam.compute(x, coi=coi_fn)
```

where `target_index` is an integer.
Score-CAM is a CAM-based method for computing visual explanations for CNNs. Other CAM-based methods depend on the gradient of the CNN output, but Score-CAM does not. Score-CAM scores each channel of the activation map by the prediction result for a masked image, which is defined as the Hadamard product of the input image and that channel of the activation map. The figure below sketches the Score-CAM procedure.
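In code, the core loop can be sketched roughly as follows. This is an illustrative re-implementation for a classification network, not the code of the `scorecam` module; the function name, the way activations are obtained, and the simple softmax scoring (the paper additionally normalizes the channel scores) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def score_cam_sketch(model, x, activations, class_index):
    """Illustrative Score-CAM for a classifier.

    x           : input tensor of shape (1, 3, H, W)
    activations : activation map of shape (1, K, h, w) captured from a
                  hidden layer during a forward pass of x
    class_index : class of interest
    """
    _, K, _, _ = activations.shape
    H, W = x.shape[2:]
    # Upsample every channel to the input resolution.
    maps = F.interpolate(activations, size=(H, W), mode="bilinear",
                         align_corners=False)
    scores = torch.zeros(K)
    for k in range(K):
        m = maps[0, k]
        # Normalize the channel to [0, 1] so it acts as a soft mask.
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        # Mask the input by the Hadamard product and score the masked image.
        with torch.no_grad():
            logits = model(x * m)
        scores[k] = torch.softmax(logits, dim=1)[0, class_index]
    # The explanation is the ReLU of the score-weighted channel sum.
    cam = torch.relu((scores.view(1, K, 1, 1) * maps).sum(dim=1))
    return cam[0]
```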
One of the weak points of Score-CAM is inference speed: Score-CAM is normally much slower than other CAM-based methods like Grad-CAM or Grad-CAM++. This is a tradeoff for inference stability, but we can accelerate Score-CAM with a very simple trick. The cause of the long computation time is that Score-CAM requires many forward inferences to compute the output visual explanation. For example, if the activation map has 512 channels, then forward inference on 512 masked images is required.
However, we can easily imagine that only a very limited number of channels of the activation map contribute to the output visual explanation. This means that we can reduce the computation time of Score-CAM by omitting "unnecessary" channels from the calculation of the visual explanation. Although there may be many ways to measure the "necessity" of each channel of the activation map, in this repository we use the maximum value of each channel.
So the acceleration procedure is summarized as follows:

- Get the activation map from the input image,
- Sort the channels of the activation map by the maximum value of each channel,
- Keep only the top `K` channels and drop the other channels from the activation map,
- Compute the Score-CAM visual explanation using the reduced activation map,

where `K` is a hyperparameter. We call this method CSKIP (Channel SKIPping).
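The channel selection itself is small enough to sketch in a few lines. The following is illustrative only, not the code of the `scorecam` module:

```python
import torch

def select_channels(activations, k):
    """Keep the top-k channels of an activation map of shape (1, C, h, w),
    ranked by the maximum value of each channel."""
    # Maximum value of each channel as the "necessity" measure.
    channel_max = activations.amax(dim=(2, 3))[0]    # shape (C,)
    topk = torch.topk(channel_max, k=k).indices      # indices of the top-k channels
    return activations[:, topk]                      # shape (1, k, h, w)
```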
You can easily apply CSKIP by adding `cskip=True` and `cskip_out=K` to the arguments of the function `ScoreCAM.compute`, like the following:

```python
L = scorecam.compute(x, coi=242, cskip=True, cskip_out=16)
```
The hyperparameter `K` controls the acceleration ratio. If the activation map has 512 channels and the number of kept channels `K` is 16, then you can expect roughly 32x (= 512 / 16) faster inference.
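If you want to check the speed-up in your own environment, a simple timing loop is enough. The sketch below is not part of the repository; it reuses `scorecam` and `x` from the example above and uses only Python's standard `time.perf_counter`.

```python
import time

def benchmark(fn, n=10):
    """Return the average run time of fn over n calls, in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t_vanilla = benchmark(lambda: scorecam.compute(x, coi=242))
t_cskip = benchmark(lambda: scorecam.compute(x, coi=242, cskip=True, cskip_out=16))
print(f"vanilla: {t_vanilla:.3f} s/image, CSKIP: {t_cskip:.3f} s/image, "
      f"ratio: x {t_vanilla / t_cskip:.2f}")
```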
The following are the experimental results in the author's environment with the following settings:

- Pre-trained model: ResNet18 trained on ImageNet (IMAGENET1K_V1 weights)
- Layer to extract the activation map: `layer4`
- Input image: `resources/sample_image_01.jpg`
- Class of interest: 242 (boxer)
- Number of channels kept by CSKIP: 16
As you can see, the visualization result of Score-CAM with CSKIP is almost the same as the visualization without CSKIP, while the computation is much faster.
| Device     | Vanilla           | CSKIP             | Acceleration ratio |
|------------|-------------------|-------------------|--------------------|
| CPU        | 7.621 [sec/image] | 0.246 [sec/image] | x 30.98            |
| CUDA (GPU) | 0.851 [sec/image] | 0.028 [sec/image] | x 30.32            |
Note that the experiment environment is:
- CPU: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
- RAM: 32GB
- GPU: NVIDIA GeForce GTX 1660 Ti
As you may have guessed, excessive channel reduction in CSKIP degrades the visualization heatmap. The left figure below plots the channel reduction ratio (horizontal axis) against SSIM (structural similarity), a metric commonly used to measure the difference between two images (vertical axis). The dataset used for this result is 500 images randomly chosen from the validation data of ILSVRC 2012.

The right figure shows the relationship between the reduction ratio and the computation time measured on the CPU. As you can see, this relationship is almost linear.

Generally, two images are said to be similar if their SSIM is 0.95 or higher. Therefore, it seems safe to set the reduction ratio up to about 0.5, which cuts the computation time in half.
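If you want to reproduce this kind of comparison, SSIM is available in scikit-image (`pip3 install scikit-image`, an extra dependency not required by the core module). The following is a minimal sketch assuming both heatmaps behave like 2D NumPy arrays:

```python
from skimage.metrics import structural_similarity

# Heatmaps with and without CSKIP for the same input and class.
L_vanilla = scorecam.compute(x, coi=242)
L_cskip = scorecam.compute(x, coi=242, cskip=True, cskip_out=16)

a = np.asarray(L_vanilla, dtype=np.float64)
b = np.asarray(L_cskip, dtype=np.float64)

# SSIM needs an explicit data range for floating-point images.
data_range = max(a.max(), b.max()) - min(a.min(), b.min())
print(f"SSIM = {structural_similarity(a, b, data_range=data_range):.3f}")
```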
[1] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, "Score-CAM: Score-weighted visual explanations for convolutional neural networks", CVPR, 2020.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks", CVPR, 2015.
[3] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization", ICCV, 2017.
[4] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. Balasubramanian, "Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks", WACV, 2018.