
Image Segmentation



 Best articles to quickly understand the progress of the literature on semantic segmentation models 

Pre-requisite Topics:

SEMANTIC SEGMENTATION

  • FCN [2014] - Mean IOU 62.2 Pascal VOC 
    • Uses a fully convolutional network to predict a per-pixel segmentation mask. 
    • Fuses information from shallower layers, since spatial detail is lost in the deeper layers (see the fusion sketch below). 
    • FCN-32s, FCN-16s and FCN-8s are three variants of FCN that differ in how many shallow layers are fused into the prediction. 
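A minimal PyTorch sketch of this FCN-8s-style skip fusion. The module name, channel arguments, and the use of bilinear upsampling (the paper learns transposed-convolution upsampling) are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCN8sHead(nn.Module):
    """Fuses score maps from three backbone stages (strides 8, 16, 32), FCN-8s style."""
    def __init__(self, c3, c4, c5, num_classes):
        super().__init__()
        self.score3 = nn.Conv2d(c3, num_classes, 1)  # 1x1 "score" layer per stage
        self.score4 = nn.Conv2d(c4, num_classes, 1)
        self.score5 = nn.Conv2d(c5, num_classes, 1)

    def forward(self, f3, f4, f5, out_size):
        s5 = self.score5(f5)
        # upsample the deepest scores to stride 16 and fuse with the pool4 scores
        s4 = self.score4(f4) + F.interpolate(s5, size=f4.shape[-2:], mode="bilinear", align_corners=False)
        # upsample to stride 8 and fuse with the pool3 scores
        s3 = self.score3(f3) + F.interpolate(s4, size=f3.shape[-2:], mode="bilinear", align_corners=False)
        # final 8x upsampling back to the input resolution
        return F.interpolate(s3, size=out_size, mode="bilinear", align_corners=False)
```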
  • PSPNet [Dec 2016] - Mean IOU 82.6 Pascal VOC
    • Focuses on the use of global context
    • Uses a ResNet backbone with dilated convolutions for feature extraction (features at 1/8 of the input resolution)
    • Pyramid pooling module (see the sketch below):
      • Sub-region average pooling over 1x1, 2x2, 3x3 and 6x6 bins
      • A 1x1 convolution then reduces the channel dimension (2048 -> 512) 
      • Bilinear upsampling and concatenation for context aggregation
    • An auxiliary loss is used during training, as in Inception networks, to help train the deep network
    • Future scope: the decoder is simply 8x bilinear upsampling 
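A sketch of the pyramid pooling module as described above (bin sizes 1/2/3/6, channel reduction 2048 -> 512); treat it as a simplified reimplementation rather than the official code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    def __init__(self, in_channels=2048, reduced=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                          # sub-region average pooling
                nn.Conv2d(in_channels, reduced, 1, bias=False),   # 1x1 conv: 2048 -> 512
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        # bilinearly upsample each pooled branch back to the feature size, then concatenate with the input
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)  # 2048 + 4*512 = 4096 channels
```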
  • MobileNetV2 [Jan 2018] Mean IOU 79.2 Pascal VOC
  • DeepLab
    • DeepLabv1 [2015 ICLR] Blog
      • Atrous (dilated) convolutions to enlarge the field of view - DeepLab-LargeFOV (see the atrous convolution sketch below)
      • Uses VGG as the feature extractor and bilinear interpolation for upsampling 
      • A fully connected CRF is used as a post-processing step 
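The effect of atrous convolution can be seen in a couple of lines of PyTorch: a 3x3 kernel with dilation rate 12 covers a 25x25 window with the same nine weights per channel and without reducing resolution (the rate and tensor sizes here are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)

# standard 3x3 convolution: 3x3 receptive field per layer
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# atrous (dilated) 3x3 convolution with rate 12: 25x25 effective field of view,
# same parameter count, same output resolution
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=12, dilation=12)

print(conv(x).shape, atrous(x).shape)  # both keep the 65x65 spatial size
```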
    • DeepLabv2 [2018 TPAMI] Mean IOU 79.7 Pascal VOC Blog
      • Introduces atrous spatial pyramid pooling (ASPP) on top of DeepLabv1 (see the ASPP sketch below)
      • Uses ResNet as the backbone for feature extraction 
      • Multi-scale inputs are used, with a CRF as a post-processing step  
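A simplified sketch of DeepLabv2-style ASPP: parallel 3x3 atrous convolutions with different rates applied to the same feature map, with their score maps summed. The rates follow the paper's ASPP-L setting; the per-branch fc6/fc7-style layers of the original are omitted for brevity:

```python
import torch
import torch.nn as nn

class ASPPv2(nn.Module):
    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        # one atrous branch per rate; each sees a different effective field of view
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # fuse the multi-scale score maps by summation
        return sum(branch(x) for branch in self.branches)
```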
    • DeepLabv3 [2017 arXiv] Mean IOU 86.9 Pascal VOC Blog
      • Beats PSPNet in performance 
      • Drops the CRF post-processing step, making the pipeline end-to-end
      • Goes deeper using multi-grid atrous convolutions
      • ASPP + image pooling (see the sketch below)
      • Upsamples the logits instead of downsampling the ground truth as in DeepLabv2
      • Uses ResNet-101 as the backbone
      • At inference time, multi-scale and flipped inputs are used to improve performance
      • DeepLab-JFT uses a model pretrained on ImageNet + the JFT-300M dataset 
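A sketch of the DeepLabv3 head described above: a 1x1 convolution, three atrous convolutions, and a global image-pooling branch, concatenated and projected. The 256-channel width and rates (6, 12, 18) are the commonly used values for output stride 16, assumed here for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPv3(nn.Module):
    def __init__(self, in_channels, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        def block(k, r=1):
            pad = 0 if k == 1 else r
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([block(1)] + [block(3, r) for r in rates])
        # image pooling: global average pool -> 1x1 conv, upsampled back to the feature size
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode="bilinear", align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```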
    • DeepLabv3+ [2018 ECCV]  Mean IOU 89.0 Pascal VOC Blog
      • Atrous separable convolution is introduced (see the sketch below)
      • Encoder-decoder structure with a much better-designed decoder for upsampling
      • A modified Aligned Xception network is used as the feature extractor (it is deeper, and max-pooling layers are replaced by strided separable convolutions)
      • Multi-scale (MS) + flip (FL) + decoder (D) + separable conv (SC) + COCO + JFT pretraining achieves SOTA  
      • AutoDeepLab [2019 CVPR]
      • Panoptic DeepLab [2020 CVPR]
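A sketch of the atrous separable convolution building block: a depthwise atrous 3x3 convolution followed by a pointwise 1x1 convolution, which keeps the enlarged field of view of atrous convolution at roughly the cost of a depthwise separable one (the exact normalization/activation placement is an assumption):

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, rate=2):
        super().__init__()
        # depthwise atrous convolution: one 3x3 filter per input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=rate, dilation=rate,
                                   groups=in_channels, bias=False)
        # pointwise 1x1 convolution mixes the channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```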
  • Searching for Efficient Multi-Scale Architectures for Dense Image Prediction [2018 NIPS]  Mean IOU 87.9 Pascal VOC
    • Architecture search is performed over a space that can express a wide range of architectures 
      • The search space is reduced by keeping the backbone fixed 
      • A proxy task (running on low-resolution images) is used to cheaply rank candidate architectures
      • A smaller backbone (MobileNetV2) is used to keep the search cheap; the best-found architecture is later trained with a bigger backbone (modified Xception)
      • The search space consists of roughly B! x 10^B configurations (~10^11 configurations for B = 5)
    • Multi-scale context aggregation is built into the search: branch b_i takes previous branch outputs as input and fuses them (see the sketch below)
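A toy sketch of how one searched dense prediction cell could be instantiated: each branch applies an atrous convolution (standing in for the paper's richer operator space) to one of the previously produced tensors, and the branch outputs are fused by concatenation. The (input index, rate) pairs below are invented for illustration; the real configuration is whatever the search selects:

```python
import torch
import torch.nn as nn

class DensePredictionCellSketch(nn.Module):
    def __init__(self, channels=256, spec=((0, 1), (0, 6), (1, 3), (2, 12), (0, 18))):
        super().__init__()
        self.spec = spec
        self.ops = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for _, r in spec
        ])
        self.fuse = nn.Conv2d(channels * len(spec), channels, 1, bias=False)

    def forward(self, x):
        tensors = [x]                              # tensor 0 is the backbone feature map
        for (src, _), op in zip(self.spec, self.ops):
            tensors.append(op(tensors[src]))       # each branch reads an earlier tensor
        return self.fuse(torch.cat(tensors[1:], dim=1))   # fuse all branch outputs
```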
  • EfficientNet-L2+NAS+FPN [2020 Arxiv]  Mean IOU 90.5 Pascal VOC

INSTANCE SEGMENTATION

  • Mask R-CNN 
  • DeepMask

Special Thanks to Sik-Ho Tsang for all the wonderful explanations
