This is how one can interpret YOLO v5 segmentation output!!! #12834
-
Great! Thank you!
-
@defend1234 Thanks! You are welcome!
-
It's very helpful to me. I was struggling with this when using YOLOv5. Thank you!
-
It's very helpful. Thank you!
-
@aravindchakravarti Thank you for the detailed explanation! It's really helpful! But I still can't figure out how to output masks with a higher resolution (like 640x640). Any advice? Thank you!
-
Hi All,

There are lots of tutorials online (and discussions in the YOLOv5 issues section) about the detection output and postprocessing of the YOLOv5 detection model. I had been looking for something similar for the segmentation model for a long time, but there is very little information online, or at least I didn't find much, until yesterday, when I understood that the YOLOv5 segmentation model is inspired by YOLACT. So, I am giving some inputs here, so that you know how YOLOv5 processes its segmentation output.

Before we go to the YOLOv5 segmentation model, let's see how YOLACT works, using a simple diagram from its paper. Basically, YOLACT produces two outputs.

**Prototype Generation**
These are prototype masks, generated by a fully convolutional network (FCN). The number of masks is determined by the last layer, which has `k` channels, so the number of masks generated = `k`.

**Mask Coefficients**
The masks generated above are just prototypes; we need to combine these channels. To do this, YOLACT also generates mask coefficients. YOLACT multiplies the prototypes with the mask coefficients to generate the final masks, as shown below. They are later cropped and thresholded.
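In the YOLACT paper, this combination is a single matrix multiplication followed by a sigmoid: for `n` surviving detections with coefficient matrix $C \in \mathbb{R}^{n \times k}$ and prototypes $P \in \mathbb{R}^{h \times w \times k}$, the assembled masks are

$$M = \sigma(P C^{\mathsf{T}})$$

Keeping this formula in mind makes the YOLOv5 code below much easier to follow.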
Now, let's come to YOLOv5 segmentation. I am using `zidane.jpg` for the illustration below. Let's look at the code below in `segment/predict.py`.
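This is the forward pass that produces both outputs (a sketch based on `segment/predict.py`; the exact keyword arguments may differ between YOLOv5 versions):

```python
# Forward pass in segment/predict.py: the segmentation head returns
# raw detections (including mask coefficients) plus the prototype masks.
pred, proto = model(im, augment=augment, visualize=visualize)[:2]
```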
Here the model is producing two outputs: one is `pred` and the other is `proto`. The output dimension of `pred` is `torch.Size([1, 25200, 117])`, and the output dimension of `proto` is `torch.Size([1, 32, 160, 160])`.
Now, for `pred`: `1` is the batch size, `25200` is the total number of predictions across all anchors (same as the detection network), and `117` is equal to `85 + 32`. Meaning, there are `80` class scores, `5` localization values (`x, y, w, h, conf`), and the last `32` are the mask coefficients.
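To make the `85 + 32` split concrete, here is how you could slice one batch of raw predictions (a minimal sketch; the `pred` below is just a stand-in tensor with the right shape):

```python
import torch

pred = torch.randn(1, 25200, 117)   # stand-in for the raw model output
boxes = pred[0, :, :5]              # x, y, w, h, objectness confidence
cls_scores = pred[0, :, 5:85]       # 80 class scores
mask_coeffs = pred[0, :, 85:]       # 32 mask coefficients per prediction
print(boxes.shape, cls_scores.shape, mask_coeffs.shape)
# torch.Size([25200, 5]) torch.Size([25200, 80]) torch.Size([25200, 32])
```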
And for `proto`: `1` is the batch size, and `32` is the number of prototype masks, each mask being `160x160` pixels. So, what do these 32 prototypes look like? If you visualize them, you can see that none of them, on its own, clearly identifies any of the objects (2 persons and 1 tie) in the image. That is where we need the help of the mask coefficients.
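A quick matplotlib sketch for looking at the prototypes yourself, assuming `proto` is the `[1, 32, 160, 160]` tensor from the model output above:

```python
import matplotlib.pyplot as plt

protos = proto[0].detach().cpu()        # [32, 160, 160]
fig, axes = plt.subplots(4, 8, figsize=(16, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(protos[i], cmap="gray")   # each channel is one prototype mask
    ax.set_title(f"proto {i}", fontsize=8)
    ax.axis("off")
plt.tight_layout()
plt.show()
```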
We know that the model has produced `pred` with size `([1, 25200, 117])`, which contains the mask coefficients. Now this `pred` is passed to NMS. If we look at the dimension of `pred` after NMS, then for `zidane.jpg` it will be `torch.Size([3, 38])`.

Why `3`? Because we have 3 objects in the input image (2 persons and 1 tie).

Why `38`? Because YOLO segmentation also outputs bounding boxes, so this `38` is actually `6 + 32`: the first `6` values are `x, y, w, h, conf, class`, and the remaining `32` are the mask coefficients.
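In the YOLOv5 code this works because NMS is told, via `nm=32`, that the last 32 columns are mask coefficients rather than class scores, so they stay attached to the surviving boxes (sketched from `segment/predict.py`; exact arguments may differ between versions):

```python
from utils.general import non_max_suppression

# nm=32: treat the last 32 columns as mask coefficients, not class scores,
# so they are carried through NMS along with each surviving detection.
pred = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45, nm=32)
det = pred[0]        # detections for the first (and only) image
print(det.shape)     # torch.Size([3, 38]) for zidane.jpg
```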
To check: if you print `pred[0][0]`, you will get one such row of 38 values. Now, leave the first 6 values (because they correspond to the bounding box information), take the remaining 32 values, and multiply them with the 32 prototype channels. You will get the final MASK!!!! For `zidane.jpg`, one of the three resulting masks clearly picks out the tie!!
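Putting it together, this is exactly the YOLACT formula from earlier: multiply the `[n, 32]` coefficient matrix with the prototypes flattened to `[32, 160*160]`, apply a sigmoid, and threshold. A minimal sketch of that step (YOLOv5's own version is `process_mask` in `utils/segment/general.py`, which additionally crops each mask to its bounding box):

```python
import torch

def assemble_masks(proto, det, thresh=0.5):
    """proto: [32, 160, 160] prototypes (proto[0] from the model output).
    det: [n, 38] detections after NMS; columns 6: are mask coefficients."""
    c, mh, mw = proto.shape                    # 32, 160, 160
    coeffs = det[:, 6:]                        # [n, 32]
    masks = coeffs @ proto.view(c, -1)         # [n, 25600] linear combination
    masks = masks.sigmoid().view(-1, mh, mw)   # [n, 160, 160] soft masks
    return masks > thresh                      # binary masks
```

Since these masks live on the 160x160 prototype grid, upsampling them (e.g. with `torch.nn.functional.interpolate`) is how you get masks at the 640x640 input resolution; if I read the code correctly, that is what `process_mask` does when called with `upsample=True`.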
Hope this is helpful!!