Best articles for quickly understanding the progress of the literature on semantic segmentation models
Pre-requisite Topics:
- Deconvolution Checkerboard Artifacts
- Dilated Convolution - Also called atrous convolution; used in the DilatedNet and DeepLab papers (see the sketch after this list)
- CRF (Conditional Random Fields) - Used as a post-processing step to refine the final segmentation masks, smoothing them with Gaussian kernels that take pixel intensities and pixel locations as input (a usage sketch appears under DeepLabv1 below)
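A minimal PyTorch sketch of dilated convolution (illustrative, not tied to any one paper): a 3x3 kernel with dilation rate 2 covers a 5x5 area with the same nine weights and, with padding equal to the dilation, keeps the output resolution unchanged.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

# Standard 3x3 convolution: 3x3 receptive field.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Atrous/dilated 3x3 convolution with rate 2: 5x5 receptive field,
# same parameter count; padding = dilation keeps the spatial size.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape)    # torch.Size([1, 64, 32, 32])
print(atrous(x).shape)  # torch.Size([1, 64, 32, 32])
```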
SEMANTIC SEGMENTATION
- FCN [2014] - Mean IOU 62.2 Pascal VOC
- Uses a fully convolutional network to predict a mask image
- Fuses information from shallower layers, since spatial detail is lost in the deeper layers
- FCN-32s, FCN-16s, and FCN-8s are three variants of FCN differing in fusion strategy (the number is the stride of the final upsampling); see the sketch below
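A rough sketch of the FCN-8s style fusion, assuming a VGG-like backbone; channel counts are placeholders and, for brevity, bilinear upsampling stands in for the paper's learned deconvolutions:

```python
import torch.nn as nn
import torch.nn.functional as F

class FCN8sHead(nn.Module):
    """Fuses coarse (stride-32) scores with pool4/pool3 features."""
    def __init__(self, n_classes, c3=256, c4=512, c5=4096):
        super().__init__()
        # 1x1 convs turn backbone features into per-class score maps.
        self.score5 = nn.Conv2d(c5, n_classes, 1)  # deepest features (stride 32)
        self.score4 = nn.Conv2d(c4, n_classes, 1)  # pool4 features (stride 16)
        self.score3 = nn.Conv2d(c3, n_classes, 1)  # pool3 features (stride 8)

    def forward(self, feat3, feat4, feat5):
        s = self.score5(feat5)
        # Upsample 2x and add the finer pool4 scores.
        s = F.interpolate(s, size=feat4.shape[-2:], mode='bilinear',
                          align_corners=False) + self.score4(feat4)
        # Upsample 2x again and add the pool3 scores.
        s = F.interpolate(s, size=feat3.shape[-2:], mode='bilinear',
                          align_corners=False) + self.score3(feat3)
        # Final 8x upsampling back to input resolution.
        return F.interpolate(s, scale_factor=8, mode='bilinear',
                             align_corners=False)
```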
- PSPNet [Dec 2016] - Mean IOU 82.6 Pascal VOC
- Focuses on the use of global context
- Uses ResNet with dilated convolutions for feature extraction (features at 1/8 of the input resolution)
- Pyramid pooling module:
- Sub-region average pooling (1x1, 2x2, 3x3, and 6x6 bins)
- Then a 1x1 convolution reduces the channel dimension (2048 -> 512 per branch)
- Bilinear upsampling and concatenation for context aggregation
- An auxiliary loss is used during training, as in Inception networks, to ease the training of deep networks
- Future scope: the decoder is simply 8x bilinear upsampling
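A condensed sketch of the pyramid pooling module, following the bullets above (bin sizes 1/2/3/6, 2048 -> 512 channels per branch); batch norm and ReLU are omitted for brevity, so this approximates rather than reproduces the official module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)  # 2048 -> 512 per branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),      # sub-region average pooling
                nn.Conv2d(in_ch, out_ch, 1),  # 1x1 conv reduces channels
            ) for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        # Upsample each pooled map back to the feature size, concat with input.
        pooled = [F.interpolate(b(x), size=(h, w), mode='bilinear',
                                align_corners=False) for b in self.branches]
        return torch.cat([x] + pooled, dim=1)  # 2048 + 4*512 = 4096 channels
```

In training code, the auxiliary loss mentioned above is typically added as `total_loss = main_loss + 0.4 * aux_loss`, where the auxiliary head sits on an intermediate ResNet stage (0.4 is the weight used in the paper).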
- MobileNetV2 [Jan 2018] Mean IOU 79.2 Pascal VOC
- DeepLab
- DeepLabv1 [2015 ICLR] Blog
- Atrous convolutions to enlarge the field of view - DeepLab-LargeFOV
- Uses VGG as feature extractor and bilinear interpolation for upsampling
- CRF is used as post processing step
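Roughly how dense-CRF refinement is usually wired up with the pydensecrf package; the kernel parameters below are the package's common example values, not the tuned settings from the DeepLab papers:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, probs, n_iters=5):
    """image: HxWx3 uint8 array; probs: CxHxW softmax output of the network."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))  # -log(p) unary potentials
    # Smoothness kernel: nearby pixels prefer the same label.
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: nearby pixels with similar color prefer the same label.
    d.addPairwiseBilateral(sxy=80, srgb=13,
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)  # refined label map
```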
- DeepLabv2 [2018 TPAMI] Mean IOU 79.7 Pascal VOC Blog
- Introduces atrous spatial pyramid pooling (ASPP) on top of DeepLabv1
- Uses ResNet as the backbone for feature extraction
- Multi-scale inputs are used, with CRF as a post-processing step
- DeepLabv3 [2017 Arxiv] Mean IOU 86.9 Pascal VOC Blog
- Beats PSPNet in performance
- Drops the CRF post-processing step, making the pipeline end-to-end
- Goes deeper using multi-grid atrous convolutions
- ASPP + image pooling (see the sketch below)
- Upsamples the logits instead of downsampling the ground truth, as DeepLabv2 did
- Uses ResNet-101 as the backbone
- At inference time, multi-scale and flipped inputs are used to improve performance
- DeepLab-JFT uses a model pretrained on ImageNet + the JFT-300M dataset
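A sketch of the ASPP + image pooling head at output stride 16 (atrous rates 6/12/18, as in the DeepLabv3 paper); batch norm, ReLU, and dropout are omitted, so treat this as an outline rather than the exact head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)  # atrous 3x3
             for r in rates]
        )
        # Image pooling: global context via global average pooling.
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        # Broadcast the pooled global feature back to the feature-map size.
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```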
- DeepLabv3+ [2018 ECCV] Mean IOU 89.0 Pascal VOC Blog
- Atrous separable convolution is introduced: a depthwise atrous convolution followed by a 1x1 pointwise convolution (see the sketch below)
- Encoder-decoder structure with a much better decoder design for upsampling
- A modified Aligned Xception network is used as the feature extractor (deeper, with max-pooling layers replaced by strided depthwise separable convolutions)
- Multi-scale (MS) + flip (FL) + decoder (D) + separable conv (SC) + COCO + JFT pretraining achieves SOTA
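A minimal sketch of an atrous separable convolution, splitting the atrous 3x3 into a per-channel (depthwise) filter plus a 1x1 channel mixer:

```python
import torch.nn as nn

def atrous_separable_conv(in_ch, out_ch, rate):
    """3x3 atrous separable conv: depthwise atrous conv + 1x1 pointwise conv."""
    return nn.Sequential(
        # Depthwise: one 3x3 atrous filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=rate, dilation=rate,
                  groups=in_ch, bias=False),
        # Pointwise: 1x1 convolution mixes information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )
```

The split cuts the cost per output position roughly from 9·C_in·C_out to 9·C_in + C_in·C_out multiply-accumulates.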
- AutoDeepLab [2019 CVPR]
- Panoptic DeepLab [2020 CVPR]
- Searching for Efficient Multi-Scale Architectures for Dense Image Prediction [2018 NIPS] Mean IOU 87.9 Pascal VOC
- Architecture search is performed over a space that can express a wide range of architectures
- The search space is trimmed by keeping the backbone fixed
- A proxy task (running on low-resolution images) is used to rank candidates cheaply
- A smaller backbone (MobileNetV2) is used to keep the search tractable; the best found cell is later retrained with a bigger network (modified Xception)
- The search space consists of B! x 10^B configurations (for B = 5, ~10^11 configurations)
- Multi-scale context aggregation is built into the search: branch b_i takes its input from the previous branches' outputs, and all branch outputs are fused at the end (see the toy sketch below)
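A toy sketch of such a dense-prediction cell: a found architecture is encoded as a list of (input index, atrous rate) pairs, each branch applies its operation to one earlier tensor, and all branch outputs are concatenated. The genotype and operation set here are placeholders, not the architecture the paper actually found:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCBranch(nn.Module):
    """One branch: picks an earlier tensor and applies an atrous conv."""
    def __init__(self, ch, input_idx, rate):
        super().__init__()
        self.input_idx = input_idx  # which earlier output to consume
        self.op = nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate)

    def forward(self, tensors):
        return F.relu(self.op(tensors[self.input_idx]))

class DPCCell(nn.Module):
    """Toy cell: index 0 is the backbone feature, index i > 0 is branch i."""
    def __init__(self, ch, genotype=((0, 1), (0, 6), (1, 12), (2, 18), (3, 6))):
        super().__init__()
        self.branches = nn.ModuleList(
            DPCBranch(ch, idx, rate) for idx, rate in genotype)

    def forward(self, backbone_feat):
        tensors = [backbone_feat]
        for branch in self.branches:
            tensors.append(branch(tensors))
        return torch.cat(tensors[1:], dim=1)  # fuse all branch outputs
```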
- EfficientNet-L2+NAS+FPN [2020 Arxiv] Mean IOU 90.5 Pascal VOC
INSTANCE SEGMENTATION
- Mask R-CNN
- DeepMask
Special Thanks to Sik-Ho Tsang for all the wonderful explanations