1. Introduction
2. Our Approach
    1. Position-sensitive Score Map Parameterization
    2. Joint Mask Prediction and Classification
    3. An End-to-End Solution
3. Related Work
4. Experiments
    1. Ablation Study on PASCAL VOC
    2. Experiments on COCO
5. Conclusion

(CVPR 2017) Fully Convolutional Instance-aware Semantic Segmentation
Paper: https://arxiv.org/abs/1611.07709
Code: https://github.com/daijifeng001/TA-FCN

We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It performs instance mask prediction and classification jointly.

Introduction

Instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions.

A prevalent family of instance-aware semantic segmentation approaches achieves this by chaining different types of sub-networks in three stages. Such methods have several drawbacks:

  1. the ROI pooling step loses spatial details due to feature warping and resizing, which is nevertheless necessary to obtain a fixed-size representation (e.g., \(14 \times 14\) in [8]) for the fc layers.

  2. the fc layers over-parameterize the task, without using the regularization of local weight sharing.

  3. the per-ROI network computation in the last step is not shared among ROIs.

InstanceFCN (InstFCN) [5] has its own drawbacks:

  1. It is blind to semantic categories and requires a downstream network for category classification.

  2. The mask prediction and classification sub-tasks are separated and the solution is not end-to-end.

  3. It operates on square, fixed-size sliding windows (\(224 \times 224\) pixels) and relies on time-consuming image pyramid scanning to find instances at different scales.

We propose the first end-to-end fully convolutional approach for instance-aware semantic segmentation. The underlying convolutional representation and the score maps are fully shared for the mask prediction and classification sub-tasks, via a novel joint formulation with no extra parameters.

Figure 1. Illustration of our idea. (a) Conventional fully convolutional network (FCN) [29] for semantic segmentation. A single score map is used for each category, which is unaware of individual object instances. (b) InstanceFCN [5] for instance segment proposal, where \(3 \times 3\) position-sensitive score maps are used to encode relative position information. A downstream network is used for segment proposal classification. (c) Our fully convolutional instance-aware semantic segmentation method (FCIS), where position-sensitive inside/outside score maps are used to perform mask prediction and category classification jointly.

Our Approach

Position-sensitive Score Map Parameterization

In FCNs [29], a classifier is trained to predict each pixel’s likelihood score of “the pixel belongs to some object category”. It is translation invariant and unaware of individual object instances.

To introduce the translation-variant property, a fully convolutional solution was first proposed in [5] for instance mask proposal. It uses \(k^2\) position-sensitive score maps, in which each score represents the likelihood of “the pixel belongs to some object instance at a relative position”.
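
The relative-position idea can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `assemble_roi_scores`, the nested-list layout `score_maps[cell][y][x]`, and the integer ROI format `(x0, y0, w, h)` are all assumptions made for the example. The ROI is divided into \(k \times k\) cells, and each pixel reads its score from the score map assigned to the cell it falls in.

```python
def assemble_roi_scores(score_maps, roi, k):
    """Assemble per-pixel scores for one ROI from k*k position-sensitive maps.

    score_maps: list of k*k maps, each a 2D list indexed [y][x] over the image.
    roi: (x0, y0, w, h) in integer pixel coordinates (hypothetical format).
    """
    x0, y0, w, h = roi
    out = []
    for dy in range(h):
        cell_row = dy * k // h          # which of the k vertical cells
        row = []
        for dx in range(w):
            cell_col = dx * k // w      # which of the k horizontal cells
            cell = cell_row * k + cell_col
            # Copy the score from the map that belongs to this cell.
            row.append(score_maps[cell][y0 + dy][x0 + dx])
        out.append(row)
    return out


# Toy example: k = 2 gives 4 maps; map i is filled with the constant i,
# so the assembled result shows which map each pixel was read from.
k = 2
maps = [[[i] * 4 for _ in range(4)] for i in range(k * k)]
roi_scores = assemble_roi_scores(maps, (0, 0, 4, 4), k)
```

Because the same image pixel falls into different cells for different ROIs, it can read different scores, which is the translation-variant behaviour described above.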

Joint Mask Prediction and Classification

For the instance-aware semantic segmentation task, not only [5], but also many other state-of-the-art approaches, such as SDS [15], Hypercolumn [16], CFM [7], MNC [8], and MultiPathNet [42], share a similar structure: two subnetworks are used for mask prediction and classification sub-tasks, separately and sequentially.

We enhance the “position-sensitive score map” idea to perform the mask prediction and classification sub-tasks jointly and simultaneously.

Figure 2. Instance segmentation and classification results (of “person” category) of different ROIs. The score maps are shared by different ROIs and both sub-tasks. The red dot indicates one pixel having different semantics in different ROIs.

For each object category, two sets of \(k^2\) score maps are produced from the preceding convolutional layers. Each pixel has two scores in each cell. They represent the likelihoods of “the pixel belongs to some object instance at a relative position and is inside (or outside) the object boundary”.

For mask prediction, a softmax operation produces the per-pixel foreground probability (\(\in [0, 1]\)).

For mask classification, a max operation produces the per-pixel likelihood of “belonging to the object category”.

For a positive ROI, for each (inside, outside) score pair, one should be high and the other should be low, depending on whether the corresponding pixel is inside or outside the object boundary.

For a negative ROI, all scores should be low.
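
The two per-pixel operations can be sketched as below. This is a hedged illustration, not the paper's implementation: the function name `joint_mask_and_detection` and the nested-list inputs are assumptions, and the sketch stops at the per-pixel outputs because these notes do not describe how they are aggregated into an ROI-level category score.

```python
import math


def joint_mask_and_detection(inside, outside):
    """Fuse inside/outside scores for one ROI and one category, pixel by pixel.

    inside, outside: 2D lists of equal shape holding the ROI's assembled scores.
    Returns (foreground_prob, detection_likelihood) as 2D lists.
    """
    fg_prob, det = [], []
    for in_row, out_row in zip(inside, outside):
        fg_row, det_row = [], []
        for s_in, s_out in zip(in_row, out_row):
            m = max(s_in, s_out)
            # Two-way softmax over (inside, outside), shifted by m for stability.
            e_in, e_out = math.exp(s_in - m), math.exp(s_out - m)
            fg_row.append(e_in / (e_in + e_out))
            # Max over the pair: high if either score is high.
            det_row.append(m)
        fg_prob.append(fg_row)
        det.append(det_row)
    return fg_prob, det


# Positive ROI: first pixel is inside the object, second is outside it.
fg, det = joint_mask_and_detection([[4.0, -3.0]], [[-3.0, 4.0]])
# Negative ROI: both scores are low.
fg_neg, det_neg = joint_mask_and_detection([[-3.0]], [[-3.0]])
```

In the positive case both pixels get a high detection likelihood while only the first gets a foreground probability near 1. In the negative case the detection likelihood stays low, matching the three cases listed above.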

An End-to-End Solution

Figure 3. Overall architecture of FCIS. A region proposal network (RPN) [34] shares the convolutional feature maps with FCIS. The proposed regions of interest (ROIs) are applied on the score maps for joint instance mask prediction and classification. The learnable weight layers are fully convolutional and computed on the whole image. The per-ROI computation cost is negligible.

We adopt the ResNet model [18]. The last fully-connected layer for 1000-way classification is discarded. The resulting feature maps have 2048 channels. On top of them, a \(1 \times 1\) convolutional layer is added to reduce the dimension to 1024.
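
A \(1 \times 1\) convolution is a linear map applied independently at every spatial position, which is why it can change the channel count without touching spatial resolution. The sketch below shows this on a toy tensor. The function name `conv1x1`, the `[channel][y][x]` list layout, and the omission of a bias term are simplifications assumed for the example; the real layer maps 2048 channels to 1024.

```python
def conv1x1(features, weights):
    """Apply a 1x1 convolution (no bias).

    features: nested list indexed [c_in][y][x].
    weights:  nested list indexed [c_out][c_in].
    Returns a nested list indexed [c_out][y][x].
    """
    c_in = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [
        [
            [sum(w_out[c] * features[c][y][x] for c in range(c_in))
             for x in range(w)]
            for y in range(h)
        ]
        for w_out in weights
    ]


# Toy example: reduce 3 channels to 2 on a 1x2 feature map.
feats = [[[1, 2]], [[3, 4]], [[5, 6]]]
weights = [[1, 0, 0], [0, 1, 1]]
reduced = conv1x1(feats, weights)
```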

To reduce the feature stride and maintain the field of view, the “hole algorithm” [3, 29] (Algorithme à trous [30]) is applied.
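
To illustrate the idea behind the hole algorithm, here is a one-dimensional sketch of dilated (à trous) filtering. The function `dilated_conv1d` and the toy inputs are made up for illustration; the network itself applies the same idea to 2D convolutions. Spacing the filter taps apart widens the span each output covers without adding weights, which is how the field of view can be kept when a stride is removed.

```python
def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1D correlation of x with filter w, taps spaced `dilation` apart."""
    span = (len(w) - 1) * dilation + 1   # input samples covered by one output
    return [
        sum(w[j] * x[i + j * dilation] for j in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
dense = dilated_conv1d(x, w, dilation=1)    # each output spans 3 samples
atrous = dilated_conv1d(x, w, dilation=2)   # each output spans 5 samples
```

With the same three weights, the dilated filter compares samples four positions apart instead of two.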

We use a region proposal network (RPN) [34] to generate ROIs.

From the conv5 feature maps, \(2k^2 \times (C + 1)\) score maps are produced (C object categories, one background category, two sets of \(k^2\) score maps per category, \(k = 7\) by default in experiments) using a \(1 \times 1\) convolutional layer.

Bounding box (bbox) regression [13, 12] is used to refine the initial input ROIs. A sibling \(1 \times 1\) convolutional layer with \(4k^2\) channels is added on the conv5 feature maps to estimate the bounding box shift in location and size.
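
As a quick check of the channel counts in the two paragraphs above, the formulas can be evaluated for the default \(k = 7\). The helper names are invented for the example; the category counts (20 for PASCAL VOC, 80 for COCO) are those of the standard benchmarks.

```python
def num_score_map_channels(num_categories, k=7):
    """Channels of the score-map layer: 2 * k^2 * (C + 1)."""
    return 2 * k * k * (num_categories + 1)


def num_bbox_channels(k=7):
    """Channels of the sibling bbox regression layer: 4 * k^2."""
    return 4 * k * k


voc_channels = num_score_map_channels(20)    # 2 * 49 * 21
coco_channels = num_score_map_channels(80)   # 2 * 49 * 81
bbox_channels = num_bbox_channels()          # 4 * 49
```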

Related Work

Semantic Image Segmentation

Object Segment Proposal

Instance-aware Semantic Segmentation

FCNs for Object Detection

Experiments

Ablation Study on PASCAL VOC

| method | \(mAP^r\)@0.5 (%) | \(mAP^r\)@0.7 (%) |
|---|---|---|
| naive MNC | 59.1 | 36.0 |
| InstFCN + R-FCN | 62.7 | 41.5 |
| FCIS (translation invariant) | 52.5 | 38.5 |
| FCIS (separate score maps) | 63.9 | 49.7 |
| FCIS | 65.7 | 52.1 |

Table 1. Ablation study of (almost) fully convolutional methods on PASCAL VOC 2012 validation set.
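
The thresholds 0.5 and 0.7 in the table refer to the intersection-over-union (IoU) between a predicted mask and a ground-truth mask. A minimal sketch of that overlap measure follows, assuming binary masks stored as nested lists of 0/1. The function name `mask_iou` is chosen for the example, and this is not the evaluation code used in the paper.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no union; treat their overlap as 0 here.
    return inter / union if union else 0.0


pred = [[1, 1, 0],
        [1, 1, 0]]
gt = [[0, 1, 1],
      [0, 1, 1]]
iou = mask_iou(pred, gt)   # 2 overlapping pixels out of 6 covered pixels
```

A prediction with this overlap would count as a miss at both thresholds, since 2/6 is below 0.5.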

Experiments on COCO

| method | \(mAP^r\)@[0.5:0.95] (%) | \(mAP^r\)@0.5 (%) |
|---|---|---|
| FAIRCNN (2015) | 25.0 | 45.6 |
| MNC+++ (2015) | 28.4 | 51.6 |
| G-RMI (2016) | 33.8 | 56.9 |
| FCIS baseline | 29.2 | 49.5 |
| +multi-scale testing | 32.0 | 51.9 |
| +horizontal flip | 32.7 | 52.7 |
| +multi-scale training | 33.6 | 54.5 |
| +ensemble | 37.6 | 59.9 |

Conclusion

We present the first fully convolutional method for instance-aware semantic segmentation. It extends the existing FCN-based approaches and significantly pushes forward the state of the art in both accuracy and efficiency for the task. The high performance benefits from the highly integrated and efficient network architecture, especially the novel joint formulation.