This is how one can interpret YOLO v5 segmentation output!!! #12834
-
Great! Thank you!
-
@defend1234 Thanks! You are welcome!
-
It's very helpful to me. I was struggling with this when using YOLOv5. Thank you!
-
It's very helpful. Thank you!
-
@aravindchakravarti Thank you for the detailed explanation! It's really helpful! But I still can't figure out how to output masks with a higher resolution (like 640x640). Any advice? Thank you!
-
Hi All,

There are lots of tutorials online (and discussions in the YOLOv5 issues section) about the detection output and postprocessing of the YOLOv5 detection model. I had been looking for something similar for the segmentation model for a long time, but there is very little information online, or at least I didn't find much, until yesterday, when I understood that the YOLOv5 segmentation model is inspired by YOLACT. So, I am giving some inputs here, so that you know how YOLOv5 processes its segmentation output.

Before we go to the YOLOv5 segmentation model, let's see how YOLACT works, using a simple diagram from its paper. Basically, YOLACT produces two outputs.

**Prototype Generation**
These are prototype masks, generated by a fully convolutional network (FCN). The number of masks is determined by the last layer, which has `k` channels, so the number of masks generated = `k`.

**Mask Coefficients**
The masks generated above are just prototypes; we need to combine these channels. To do this, YOLACT also generates mask coefficients. YOLACT multiplies the prototypes with the mask coefficients to generate the final masks, as shown below. They are later cropped and thresholded.
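In the YOLACT paper, this combination is a single matrix multiplication followed by a sigmoid: for `n` surviving detections with coefficient matrix $C \in \mathbb{R}^{n \times k}$ and prototypes $P \in \mathbb{R}^{h \times w \times k}$, the assembled masks are

$$M = \sigma(P C^{\mathsf{T}})$$

Keeping this formula in mind makes the YOLOv5 code below much easier to follow.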
Now, let's come to YOLOv5 segmentation. I am using `zidane.jpg` for the illustration below. Let's look at the code below in `segment/predict.py`.
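This is the forward pass that produces both outputs (a sketch based on `segment/predict.py`; the exact keyword arguments may differ between YOLOv5 versions):

```python
# Forward pass in segment/predict.py: the segmentation head returns
# raw detections (including mask coefficients) plus the prototype masks.
pred, proto = model(im, augment=augment, visualize=visualize)[:2]
```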
Here the model is producing two outputs: one is `pred` and the other is `proto`. The output dimension of `pred` is `torch.Size([1, 25200, 117])`, and the output dimension of `proto` is `torch.Size([1, 32, 160, 160])`.
Now, for `pred`: `1` is the batch size, `25200` is the total number of predictions across all anchors (same as the detection network), and `117` is equal to `85 + 32`. Meaning, there are `80` class scores, `5` localization values (`x, y, w, h, conf`), and the last `32` are the mask coefficients.
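To make the `85 + 32` split concrete, here is how you could slice one batch of raw predictions (a minimal sketch; the `pred` below is just a stand-in tensor with the right shape):

```python
import torch

pred = torch.randn(1, 25200, 117)   # stand-in for the raw model output
boxes = pred[0, :, :5]              # x, y, w, h, objectness confidence
cls_scores = pred[0, :, 5:85]       # 80 class scores
mask_coeffs = pred[0, :, 85:]       # 32 mask coefficients per prediction
print(boxes.shape, cls_scores.shape, mask_coeffs.shape)
# torch.Size([25200, 5]) torch.Size([25200, 80]) torch.Size([25200, 32])
```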
And for `proto`: `1` is the batch size, and `32` is the number of prototype masks, each mask being `160x160` pixels. So, what do these 32 prototypes look like? If you visualize them, you can see that none of them, on its own, clearly identifies any of the objects (2 persons and 1 tie) in the image. That is where we need the help of the mask coefficients.
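A quick matplotlib sketch for looking at the prototypes yourself, assuming `proto` is the `[1, 32, 160, 160]` tensor from the model output above:

```python
import matplotlib.pyplot as plt

protos = proto[0].detach().cpu()        # [32, 160, 160]
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(protos[i], cmap="gray")   # each channel is one prototype mask
    ax.set_title(f"proto {i}", fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```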
We know that the model has produced `pred` with size `([1, 25200, 117])`, which contains the mask coefficients. Now this `pred` is passed to NMS. If we look at the dimension of `pred` after NMS, then for `zidane.jpg` it will be `torch.Size([3, 38])`.

Why `3`? Because we have 3 objects in the input image (2 persons and 1 tie).

Why `38`? Because YOLO segmentation also outputs bounding boxes, so this `38` is actually `6 + 32`: the first `6` values are `x, y, w, h, conf, class`, and the remaining `32` are the mask coefficients.
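In the YOLOv5 code this works because NMS is told, via `nm=32`, that the last 32 columns are mask coefficients rather than class scores, so they stay attached to the surviving boxes (sketched from `segment/predict.py`; exact arguments may differ between versions):

```python
from utils.general import non_max_suppression

# nm=32: treat the last 32 columns as mask coefficients, not class scores,
# so they are carried through NMS along with each surviving detection.
pred = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45, nm=32)
det = pred[0]        # detections for the first (and only) image
print(det.shape)     # torch.Size([3, 38]) for zidane.jpg
```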
To check: if you print `pred[0][0]`, you will get one such row of 38 values. Now, leave the first 6 values (because they correspond to the bounding box information), take the remaining 32 values, and multiply them with the 32 prototype channels. You will get the final MASK!!!! For `zidane.jpg`, one of the three resulting masks clearly picks out the tie!!
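Putting it together, this is exactly the YOLACT formula from earlier: multiply the `[n, 32]` coefficient matrix with the prototypes flattened to `[32, 160*160]`, apply a sigmoid, and threshold. A minimal sketch of that step (YOLOv5's own version is `process_mask` in `utils/segment/general.py`, which additionally crops each mask to its bounding box):

```python
import torch

def assemble_masks(proto, det, thresh=0.5):
    """proto: [32, 160, 160] prototypes (proto[0] from the model output).
    det: [n, 38] detections after NMS; columns 6: are mask coefficients."""
    c, mh, mw = proto.shape                    # 32, 160, 160
    coeffs = det[:, 6:]                        # [n, 32]
    masks = coeffs @ proto.view(c, -1)         # [n, 25600] linear combination
    masks = masks.sigmoid().view(-1, mh, mw)   # [n, 160, 160] soft masks
    return masks > thresh                      # binary masks
```

Since these masks live on the 160x160 prototype grid, upsampling them (e.g. with `torch.nn.functional.interpolate`) is how you get masks at the 640x640 input resolution; if I read the code correctly, that is what `process_mask` does when called with `upsample=True`.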
Hope this is helpful!!