Scene Parsing
Problem
Segment and parse an image into regions, each associated with a semantic category.
Evaluation
Pixel-wise accuracy
the ratio of correctly predicted pixels to all pixels.
Class-wise IoU
the Intersection over Union (IoU) of predicted and ground-truth pixels, averaged over all semantic categories.
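The two metrics above can be sketched with a confusion matrix. This is a minimal NumPy illustration, not the official evaluation code of any benchmark listed here; the function names, the `ignore_label=255` convention for unlabeled pixels, and the toy label maps are assumptions made for the example.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, ignore_label=255):
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    valid = gt != ignore_label  # assumed convention: skip unlabeled pixels
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Ratio of correctly predicted pixels to all counted pixels."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """IoU per class, averaged over the classes that occur at all."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    present = union > 0  # avoid dividing by zero for absent classes
    return (tp[present] / union[present]).mean()

# Toy 2x3 label maps with three classes (hypothetical data).
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(gt, pred, num_classes=3)
print("pixel accuracy:", pixel_accuracy(cm))
print("mean IoU:", mean_iou(cm))
```

Accumulating one confusion matrix over the whole test set, rather than averaging per-image scores, keeps rare classes from being dominated by the images in which they happen to be absent.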
Dataset
Stanford Background
S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1–8, Sept 2009.
SIFT Flow
C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(12):2368–2382, Dec 2011.
PASCAL-Context
R. Mottaghi et al. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
ADE20K
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through ADE20K dataset. arXiv:1608.05442.
Dataset | Stanford Background | SIFT Flow | ADE20K |
---|---|---|---|
No. of images | 715 | 2688 | 25562 |
No. of train set | 572 | 2488 | 20210 |
No. of val set | 0 | 0 | 2000 |
No. of test set | 143 | 200 | 3352 |
No. of classes | 8 | 33 | 150 |
Samples of ADE20K
http://sceneparsing.csail.mit.edu/browse.php/?dirname=training/
Result
Stanford Background
Method | Pixel Acc. (%) | Class Acc. (%) | Avg. computing time per image (s) |
---|---|---|---|
Single-scale ConvNet | 66 | 56.5 | 0.35 (GPU) |
Augmented CNNs | 71.97 | 66.16 | - |
Superparsing | 77.5 | - | 10 to 300 |
Deep 2D LSTM (window 5x5) | 77.73 | 68.26 | 1.3 (CPU) |
Deep 2D LSTM (window 3x3) | 78.56 | 68.79 | 3.7 (CPU) |
Multi-scale ConvNet | 78.8 | 72.4 | 0.6 (CPU) |
RCNN2 (3 instances) | 80.2 | 69.9 | 10.7 (GPU) |
N-ReNet | 80.4 | 71.8 | 0.07 (GPU) |
Multi-CNN + rCPN Fast | 80.9 | 78.8 | 0.37 (GPU) |
multiscale net + CRF on gPb | 81.4 | 76.0 | 60.5 (CPU) |
Zoom-out | 82.1 | 77.3 | - |
HGDN | 82.41 | 72.98 | 0.02 (GPU) |
RCNN_NIPS2015 | 83.1 | 74.8 | 0.03 (GPU) |
SIFT Flow
Method | Pixel Acc. (%) | Class Acc. (%) | mean IU (%) | f.w. IU (%) | Avg. computing time per image (s) |
---|---|---|---|---|---|
Augmented CNNs | 49.39 | 44.54 | - | - | - |
Deep 2D LSTM (window 5x5) | 68.74 | 22.59 | - | - | 1.2 (CPU) |
Deep 2D LSTM (window 3x3) | 70.11 | 20.90 | - | - | 3.1 (CPU) |
RCNN2 (3 instances) | 77.7 | 29.8 | - | - | - |
multiscale net + cover1 | 72.3 | 50.8 | - | - | - |
multiscale net + cover2 | 78.5 | 29.6 | - | - | - |
RCNN (balanced) | 79.3 | 57.1 | - | - | 0.03 (GPU) |
HGDN | 79.68 | 51.26 | - | - | 0.03 (GPU) |
RCNN-large | 84.3 | 41.0 | - | - | 0.04 (GPU) |
FCN-16s | 85.2 | 51.7 | - | - | 0.175 (GPUs) |
VGG-conv5-DAG-RNN(8) | 85.3 | 55.7 | - | - | - |
FCN-8s | 85.9 | 53.9 | 41.2 | 77.2 | - |
patch CRF+CNN | 88.1 | 53.4 | - | - | - |
PASCAL-Context
Method | Pixel Acc. | Class Acc. | mean IU | f.w. IU |
---|---|---|---|---|
O2P | - | - | 18.1 | - |
CFM | - | - | 34.4 | - |
FCN-32s | 65.5 | 49.1 | 36.7 | 50.9 |
FCN-16s | 66.9 | 51.3 | 38.4 | 52.3 |
FCN-8s | 67.5 | 52.3 | 39.1 | 53.0 |
patch CRF+CNN | 71.5 | 53.9 | - | - |
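The Class Acc. and f.w. IU columns in the result tables are not defined in the Evaluation section. Assuming they follow the definitions used in the FCN paper (mean per-class accuracy and frequency-weighted IU), they can be sketched from a confusion matrix as below. The function names and the example matrix are hypothetical and serve only as illustration.

```python
import numpy as np

def class_accuracy(cm):
    """Mean over classes of (correct pixels of class i) / (ground-truth pixels of class i)."""
    cm = np.asarray(cm, dtype=np.float64)
    t = cm.sum(axis=1)  # ground-truth pixel count per class (row sums)
    present = t > 0     # skip classes with no ground-truth pixels
    return (np.diag(cm)[present] / t[present]).mean()

def frequency_weighted_iu(cm):
    """Per-class IU weighted by how often each class occurs in the ground truth."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)
    t = cm.sum(axis=1)
    union = t + cm.sum(axis=0) - tp
    present = t > 0
    return (t[present] * tp[present] / union[present]).sum() / cm.sum()

# Hypothetical 3-class confusion matrix: rows = ground truth, columns = prediction.
cm = np.array([[8, 2, 0],
               [1, 4, 0],
               [0, 1, 4]])
print("class accuracy:", class_accuracy(cm))
print("f.w. IU:", frequency_weighted_iu(cm))
```

Class accuracy treats every category equally, which is why methods trained with class balancing can score high on it while scoring lower on pixel accuracy; the frequency-weighted IU instead leans toward the dominant categories.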
Reference
Method | Year | Conference | Reference Paper |
---|---|---|---|
Superparsing | 2010 | ECCV | Superparsing: Scalable nonparametric image parsing with superpixels |
Single-scale ConvNet | 2013 | PAMI | Learning hierarchical features for scene labeling |
multiscale net | 2013 | PAMI | Learning hierarchical features for scene labeling |
Augmented CNNs | 2014 | BMVC | Contextually constrained deep networks for scene labeling |
RCNN2 (3 instances) | 2014 | ICML | Recurrent convolutional neural networks for scene labeling |
Multi-CNN + rCPN Fast | 2014 | NIPS | Recursive context propagation network for semantic scene labeling |
RCNN (balanced) | 2015 | NIPS | Convolutional Neural Networks with Intra-layer |
RCNN-large | 2015 | NIPS | Convolutional Neural Networks with Intra-layer |
Deep 2D LSTM | 2015 | CVPR | Scene Labeling with LSTM Recurrent Neural Networks |
Zoom-out | 2015 | CVPR | Feedforward semantic segmentation with zoom-out features |
FCN-16s | 2015 | CVPR | Fully convolutional networks for semantic segmentation |
N-ReNet | 2016 | - | Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation |
HGDN | 2016 | CVPR | Hierarchically Gated Deep Networks for Semantic Segmentation |
VGG-conv5-DAG-RNN(8) | 2016 | CVPR | DAG-Recurrent Neural Networks For Scene Labeling |
patch CRF+CNN | 2016 | CVPR | Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation |