Best articles for quickly understanding the progress of the literature on semantic segmentation models
Pre-requisite Topics:
- Deconvolution Checkerboard Artifacts
- Dilated Convolution - Also called atrous convolution; used in the DilatedNet and DeepLab papers (see the sketch after this list)
- CRF (Conditional Random Fields) - Used as a post-processing step to refine the final segmentation masks, smoothing them with Gaussian kernels that take pixel intensities and pixel locations as input (a usage sketch appears under DeepLabv1 below)
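A minimal PyTorch sketch of dilated convolution (illustrative, not tied to any one paper): a 3x3 kernel with dilation rate 2 covers a 5x5 area with the same nine weights and, with padding equal to the dilation, keeps the output resolution unchanged.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

# Standard 3x3 convolution: 3x3 receptive field.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Atrous/dilated 3x3 convolution with rate 2: 5x5 receptive field,
# same parameter count; padding = dilation keeps the spatial size.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape)    # torch.Size([1, 64, 32, 32])
print(atrous(x).shape)  # torch.Size([1, 64, 32, 32])
```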
SEMANTIC SEGMENTATION
- FCN [2014] - Mean IOU 62.2 Pascal VOC
- Uses a fully convolutional network to predict a mask image
- Fuses information from shallower layers, since spatial detail is lost in the deeper layers
- FCN-32s, FCN-16s, and FCN-8s are three variants of FCN differing in fusion strategy (the number is the stride of the final upsampling); see the sketch below
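A rough sketch of the FCN-8s style fusion, assuming a VGG-like backbone; channel counts are placeholders and, for brevity, bilinear upsampling stands in for the paper's learned deconvolutions:

```python
import torch.nn as nn
import torch.nn.functional as F

class FCN8sHead(nn.Module):
    """Fuses coarse (stride-32) scores with pool4/pool3 features."""
    def __init__(self, n_classes, c3=256, c4=512, c5=4096):
        super().__init__()
        # 1x1 convs turn backbone features into per-class score maps.
        self.score5 = nn.Conv2d(c5, n_classes, 1)  # deepest features (stride 32)
        self.score4 = nn.Conv2d(c4, n_classes, 1)  # pool4 features (stride 16)
        self.score3 = nn.Conv2d(c3, n_classes, 1)  # pool3 features (stride 8)

    def forward(self, feat3, feat4, feat5):
        s = self.score5(feat5)
        # Upsample 2x and add the finer pool4 scores.
        s = F.interpolate(s, size=feat4.shape[-2:], mode='bilinear',
                          align_corners=False) + self.score4(feat4)
        # Upsample 2x again and add the pool3 scores.
        s = F.interpolate(s, size=feat3.shape[-2:], mode='bilinear',
                          align_corners=False) + self.score3(feat3)
        # Final 8x upsampling back to input resolution.
        return F.interpolate(s, scale_factor=8, mode='bilinear',
                             align_corners=False)
```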
- PSPNet [Dec 2016] - Mean IOU 82.6 Pascal VOC
- Focuses on the use of global context
- Uses ResNet with dilated convolutions for feature extraction (features at 1/8 of the input resolution)
- Pyramid pooling module:
- Sub-region average pooling (1x1, 2x2, 3x3, and 6x6 bins)
- Then a 1x1 convolution reduces the channel dimension (2048 -> 512 per branch)
- Bilinear upsampling and concatenation for context aggregation
- An auxiliary loss is used during training, as in Inception networks, to ease the training of deep networks
- Future scope: the decoder is simply 8x bilinear upsampling
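A condensed sketch of the pyramid pooling module, following the bullets above (bin sizes 1/2/3/6, 2048 -> 512 channels per branch); batch norm and ReLU are omitted for brevity, so this approximates rather than reproduces the official module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)  # 2048 -> 512 per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),      # sub-region average pooling
                nn.Conv2d(in_ch, out_ch, 1),  # 1x1 conv reduces channels
            ) for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        # Upsample each pooled map back to the feature size, concat with input.
        pooled = [F.interpolate(b(x), size=(h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        return torch.cat([x] + pooled, dim=1)  # 2048 + 4*512 = 4096 channels
```

In training code, the auxiliary loss mentioned above is typically added as `total_loss = main_loss + 0.4 * aux_loss`, where the auxiliary head sits on an intermediate ResNet stage (0.4 is the weight used in the paper).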
- MobileNetV2 [Jan 2018] Mean IOU 79.2 Pascal VOC
- DeepLab
- DeepLabv1 [2015 ICLR] Blog
- Atrous convolutions to enlarge the field of view - DeepLab-LargeFOV
- Uses VGG as feature extractor and bilinear interpolation for upsampling
- CRF is used as post processing step
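Roughly how dense-CRF refinement is usually wired up with the pydensecrf package; the kernel parameters below are the package's common example values, not the tuned settings from the DeepLab papers:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=5):
    """image: HxWx3 uint8 array; probs: CxHxW softmax output of the network."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))  # -log(p) unary potentials
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color prefer the same label.
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)  # refined label map
```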
- DeepLabv2 [2018 TPAMI] Mean IOU 79.7 Pascal VOC Blog
- Introduces atrous spatial pyramid pooling (ASPP) on top of DeepLabv1
- Uses ResNet as the backbone for feature extraction
- Multi-scale inputs are used, with CRF as a post-processing step
- DeepLabv3 [2017 Arxiv] Mean IOU 86.9 Pascal VOC Blog
- Beats PSPNet in performance
- Drops the CRF post-processing step, making the pipeline end-to-end
- Goes deeper using multi-grid atrous convolutions
- ASPP + image pooling (see the sketch below)
- Upsamples the logits instead of downsampling the ground truth, as DeepLabv2 did
- Uses ResNet-101 as the backbone
- At inference time, multi-scale and flipped inputs are used to improve performance
- DeepLab-JFT uses a model pretrained on ImageNet + the JFT-300M dataset
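A sketch of the ASPP + image pooling head at output stride 16 (atrous rates 6/12/18, as in the DeepLabv3 paper); batch norm, ReLU, and dropout are omitted, so treat this as an outline rather than the exact head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous 3x3
             for r in rates]
        )
        # Image pooling: global context via global average pooling.
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        # Broadcast the pooled global feature back to the feature-map size.
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```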
- DeepLabv3+ [2018 ECCV] Mean IOU 89.0 Pascal VOC Blog
- Atrous separable convolution is introduced: a depthwise atrous convolution followed by a 1x1 pointwise convolution (see the sketch below)
- Encoder-decoder structure with a much better decoder design for upsampling
- A modified Aligned Xception network is used as the feature extractor (deeper, with max-pooling layers replaced by strided depthwise separable convolutions)
- Multi-scale (MS) + flip (FL) + decoder (D) + separable conv (SC) + COCO + JFT pretraining achieves SOTA
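A minimal sketch of an atrous separable convolution, splitting the atrous 3x3 into a per-channel (depthwise) filter plus a 1x1 channel mixer:

```python
import torch.nn as nn

def atrous_separable_conv(in_ch, out_ch, rate):
    """3x3 atrous separable conv: depthwise atrous conv + 1x1 pointwise conv."""
    return nn.Sequential(
        # Depthwise: one 3x3 atrous filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=rate, dilation=rate,
                  groups=in_ch, bias=False),
        # Pointwise: 1x1 convolution mixes information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )
```

The split cuts the cost per output position roughly from 9·C_in·C_out to 9·C_in + C_in·C_out multiply-accumulates.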
- AutoDeepLab [2019 CVPR]
- Panoptic DeepLab [2020 CVPR]
- Searching for Efficient Multi-Scale Architectures for Dense Image Prediction [2018 NIPS] Mean IOU 87.9 Pascal VOC
- Architecture search is performed over a space that can express a wide range of architectures
- The search space is trimmed by keeping the backbone fixed
- A proxy task (running on low-resolution images) is used to rank candidates cheaply
- A smaller backbone (MobileNetV2) is used to keep the search tractable; the best found cell is later retrained with a bigger network (modified Xception)
- The search space consists of B! x 10^B configurations (for B = 5, ~10^11 configurations)
- Multi-scale context aggregation is built into the search: branch b_i takes its input from the previous branches' outputs, and all branch outputs are fused at the end (see the toy sketch below)
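A toy sketch of such a dense-prediction cell: a found architecture is encoded as a list of (input index, atrous rate) pairs, each branch applies its operation to one earlier tensor, and all branch outputs are concatenated. The genotype and operation set here are placeholders, not the architecture the paper actually found:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCBranch(nn.Module):
    """One branch: picks an earlier tensor and applies an atrous conv."""
    def __init__(self, ch, input_idx, rate):
        super().__init__()
        self.input_idx = input_idx  # which earlier output to consume
        self.op = nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate)

    def forward(self, tensors):
        return F.relu(self.op(tensors[self.input_idx]))

class DPCCell(nn.Module):
    """Toy cell: index 0 is the backbone feature, index i > 0 is branch i."""
    def __init__(self, ch, genotype=((0, 1), (0, 6), (1, 12), (2, 18), (3, 6))):
        super().__init__()
        self.branches = nn.ModuleList(
            DPCBranch(ch, idx, rate) for idx, rate in genotype)

    def forward(self, backbone_feat):
        tensors = [backbone_feat]
        for branch in self.branches:
            tensors.append(branch(tensors))
        return torch.cat(tensors[1:], dim=1)  # fuse all branch outputs
```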
- EfficientNet-L2+NAS+FPN [2020 Arxiv] Mean IOU 90.5 Pascal VOC
INSTANCE SEGMENTATION
- Mask R-CNN
- DeepMask
Special Thanks to Sik-Ho Tsang for all the wonderful explanations