FCIS
(CVPR 2017) Fully Convolutional Instance-aware Semantic Segmentation
Paper: https://arxiv.org/abs/1611.07709
Code: https://github.com/daijifeng001/TA-FCN
We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It performs instance mask prediction and classification jointly.
Introduction
Instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions.
A prevalent family of instance-aware semantic segmentation approaches achieves this by adopting different types of sub-networks in three stages. Such methods have several drawbacks:
the ROI pooling step loses spatial details due to feature warping and resizing, which is, however, necessary to obtain a fixed-size representation (e.g., \(14 \times 14\) in [8]) for the fc layers (see the sketch after this list).
the fc layers over-parameterize the task, without exploiting the regularization of local weight sharing.
the per-ROI network computation in the last step is not shared among ROIs.
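The fixed-size warping in the first drawback can be illustrated with a minimal sketch, here using torchvision's `roi_pool` as a stand-in (the pooling operator and tensor shapes are assumptions for illustration, not the exact implementation of [8]): every ROI, regardless of its size or aspect ratio, is collapsed to the same \(14 \times 14\) grid before the fc layers.

```python
# Minimal sketch: fixed-size ROI pooling discards spatial detail.
# Shapes and the use of torchvision.ops.roi_pool are illustrative assumptions.
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 64, 64)              # shared convolutional feature map
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 60.0],      # (batch_idx, x1, y1, x2, y2): tall, thin ROI
                     [0, 10.0, 10.0, 58.0, 34.0]])   # wide, short ROI
pooled = roi_pool(features, rois, output_size=(14, 14))
print(pooled.shape)                                  # torch.Size([2, 256, 14, 14]) for both ROIs
```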
InstFCN drawbacks:
It is blind to semantic categories and requires a downstream network for category classification.
The mask prediction and classification sub-tasks are separated and the solution is not end-to-end.
It operates on square, fixed-size sliding windows (\(224 \times 224\) pixels) and adopts a time-consuming image pyramid scanning to find instances at different scales.
We propose the first end-to-end fully convolutional approach for instance-aware semantic segmentation. The underlying convolutional representation and the score maps are fully shared by the mask prediction and classification sub-tasks, via a novel joint formulation with no extra parameters.
Figure 1. Illustration of our idea. (a) Conventional fully convolutional network (FCN) [29] for semantic segmentation. A single score map is used for each category, which is unaware of individual object instances. (b) InstanceFCN [5] for instance segment proposal, where \(3 \times 3\) position-sensitive score maps are used to encode relative position information. A downstream network is used for segment proposal classification. (c) Our fully convolutional instance-aware semantic segmentation method (FCIS), where position-sensitive inside/outside score maps are used to perform mask prediction and category classification jointly.
Our Approach
Position-sensitive Score Map Parameterization
In FCNs [29], a classifier is trained to predict each pixel’s likelihood score of “the pixel belongs to some object category”. It is translation invariant and unaware of individual object instances.
To introduce the translation-variant property, a fully convolutional solution was first proposed in [5] for instance mask proposals. Each score represents the likelihood that "the pixel belongs to some object instance at a relative position".
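A minimal sketch of this idea (PyTorch; the helper name and shapes are hypothetical): the ROI is divided into a \(k \times k\) grid of cells, and each cell copies its scores from the score map assigned to that relative position, so the same pixel can receive different scores in different ROIs.

```python
# Minimal sketch of assembling an instance mask from k x k position-sensitive
# score maps (k = 3 as in Figure 1(b)). Hypothetical helper, not the paper's code.
import torch

def assemble_position_sensitive(score_maps: torch.Tensor, roi, k: int = 3) -> torch.Tensor:
    """score_maps: (k*k, H, W) score maps at feature-map resolution.
    roi: (x0, y0, x1, y1) in score-map coordinates. Returns an (h, w) mask score."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = torch.empty(h, w)
    for gy in range(k):                      # relative position index (row)
        for gx in range(k):                  # relative position index (col)
            ys = slice(y0 + gy * h // k, y0 + (gy + 1) * h // k)
            xs = slice(x0 + gx * w // k, x0 + (gx + 1) * w // k)
            # pick the score map assigned to cell (gy, gx) and copy its values
            cell = score_maps[gy * k + gx, ys, xs]
            out[ys.start - y0:ys.stop - y0, xs.start - x0:xs.stop - x0] = cell
    return out
```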
Joint Mask Prediction and Classification
For the instance-aware semantic segmentation task, not only [5], but also many other state-of-the-art approaches, such as SDS [15], Hypercolumn [16], CFM [7], MNC [8], and MultiPathNet [42], share a similar structure: two subnetworks are used for mask prediction and classification sub-tasks, separately and sequentially.
We enhance the “position-sensitive score map” idea to perform the mask prediction and classification sub-tasks jointly and simultaneously.
Figure 2. Instance segmentation and classification results (of “person” category) of different ROIs. The score maps are shared by different ROIs and both sub-tasks. The red dot indicates one pixel having different semantics in different ROIs.
For each object category, two sets of \(k^2\) score maps are produced from the preceding convolutional layers. Each pixel has two scores in each cell. They represent the likelihoods that "the pixel belongs to some object instance at a relative position and is inside (or outside) the object boundary" (see the sketch after these points).
For mask prediction, a softmax operation produces the per-pixel foreground probability (\(\in [0, 1]\)).
For classification, a max operation produces the per-pixel likelihood of "belonging to the object category".
For a positive ROI, for each (inside, outside) score pair, one should be high and the other should be low, depending on whether the corresponding pixel is inside or outside the object boundary.
For a negative ROI, all scores should be low.
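A minimal sketch of the joint formulation for a single ROI (PyTorch; hypothetical function, and it assumes the \(k^2\) inside/outside maps have already been assembled into two per-category score tensors for that ROI, as in the sketch above). The per-pixel softmax over the (inside, outside) pair gives the mask probability; the per-pixel max, averaged over the ROI and followed by a softmax over categories, gives the classification score. The pooling order is our reading of the paper, not a verbatim excerpt.

```python
# Minimal sketch of joint mask prediction and classification for one ROI.
import torch

def joint_mask_and_classification(inside: torch.Tensor, outside: torch.Tensor):
    """inside, outside: (C+1, h, w) assembled inside/outside scores for one ROI.
    Returns per-category foreground probabilities and a (C+1,) classification score."""
    # Mask prediction: per-pixel softmax over the (inside, outside) score pair.
    fg_prob = torch.softmax(torch.stack([inside, outside], dim=0), dim=0)[0]   # (C+1, h, w)
    # Classification: per-pixel max over the pair, averaged over the ROI,
    # then a softmax across categories.
    per_pixel_detection = torch.max(inside, outside)                            # (C+1, h, w)
    cls_score = torch.softmax(per_pixel_detection.mean(dim=(1, 2)), dim=0)      # (C+1,)
    return fg_prob, cls_score
```

Note how the two sub-tasks read the same score maps: a high inside score and a low outside score mark a foreground pixel, both scores low mark background or a negative ROI, and no extra parameters are introduced.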
An End-to-End Solution
Figure 3. Overall architecture of FCIS. A region proposal network (RPN) [34] shares the convolutional feature maps with FCIS. The proposed region-of-interests (ROIs) are applied on the score maps for joint instance mask prediction and classification. The learnable weight layers are fully convolutional and computed on the whole image. The per-ROI computation cost is negligible.
We adopt the ResNet model [18]. The last fully-connected layer for 1000-way classification is discarded. The resulting feature maps have 2048 channels. On top of them, a \(1 \times 1\) convolutional layer is added to reduce the dimension to 1024.
To reduce the feature stride and maintain the field of view, the "hole algorithm" [3, 29] (algorithme à trous [30]) is applied.
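A minimal sketch of these backbone changes (PyTorch/torchvision, with ResNet-101 as a stand-in; using torchvision's `replace_stride_with_dilation` to play the role of the hole algorithm is our assumption about one way to reproduce it, not the paper's implementation):

```python
# Minimal sketch: drop the fc layer, dilate conv5 so the feature stride stays 16,
# and reduce the 2048-channel output to 1024 channels with a 1x1 convolution.
import torch
import torch.nn as nn
from torchvision.models import resnet101

# layer4 (conv5) keeps stride 16 and dilates its 3x3 convolutions instead of downsampling.
resnet = resnet101(replace_stride_with_dilation=[False, False, True])
backbone = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4,  # fc layer discarded
    nn.Conv2d(2048, 1024, kernel_size=1),                        # dimension reduction
)

feat = backbone(torch.randn(1, 3, 512, 512))
print(feat.shape)   # torch.Size([1, 1024, 32, 32]): feature stride 16
```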
We use region proposal network (RPN) [34] to generate ROIs.
From the conv5 feature maps, \(2k^2 \times (C + 1)\) score maps are produced (C object categories, one background category, two sets of \(k^2\) score maps per category, \(k = 7\) by default in experiments) using a \(1 \times 1\) convolutional layer.
Bounding box (bbox) regression [13, 12] is used to refine the initial input ROIs. A sibling \(1 \times 1\) convolutional layer with \(4k^2\) channels is added on the conv5 feature maps to estimate the bounding box shift in location and size.
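A minimal sketch of the two sibling \(1 \times 1\) convolutional heads (PyTorch; the module and argument names are hypothetical, and the category count is only an example):

```python
# Minimal sketch of the two 1x1 convolutional heads on the 1024-channel conv5 features:
# 2 * k^2 * (C+1) position-sensitive inside/outside score maps, and 4 * k^2 channels
# of class-agnostic bounding box regression maps.
import torch.nn as nn

class FCISHeads(nn.Module):
    def __init__(self, in_channels: int = 1024, num_classes: int = 80, k: int = 7):
        super().__init__()
        # Two sets of k^2 score maps (inside/outside) for each of the C+1 categories.
        self.score_maps = nn.Conv2d(in_channels, 2 * k * k * (num_classes + 1), kernel_size=1)
        # Sibling bbox regression head: k^2 position-sensitive maps of 4 box offsets.
        self.bbox_maps = nn.Conv2d(in_channels, 4 * k * k, kernel_size=1)

    def forward(self, conv5_features):
        return self.score_maps(conv5_features), self.bbox_maps(conv5_features)
```

Both heads are plain \(1 \times 1\) convolutions computed once on the whole image, so all learnable weights stay fully convolutional and the per-ROI cost is negligible, as stated in the Figure 3 caption.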
Related Work
Semantic Image Segmentation
Object Segment Proposal
Instance-aware Semantic Segmentation
FCNs for Object Detection
Experiments
Ablation Study on PASCAL VOC
| method | \(mAP^r\)@0.5 (%) | \(mAP^r\)@0.7 (%) |
|---|---|---|
| naive MNC | 59.1 | 36.0 |
| InstFCN + R-FCN | 62.7 | 41.5 |
| FCIS (translation invariant) | 52.5 | 38.5 |
| FCIS (separate score maps) | 63.9 | 49.7 |
| FCIS | 65.7 | 52.1 |
Table 1. Ablation study of (almost) fully convolutional methods on PASCAL VOC 2012 validation set.
Experiments on COCO
| method | \(mAP^r\)@[0.5:0.95] (%) | \(mAP^r\)@0.5 (%) |
|---|---|---|
| FAIRCNN (2015) | 25.0 | 45.6 |
| MNC+++ (2015) | 28.4 | 51.6 |
| G-RMI (2016) | 33.8 | 56.9 |
| FCIS baseline | 29.2 | 49.5 |
| +multi-scale testing | 32.0 | 51.9 |
| +horizontal flip | 32.7 | 52.7 |
| +multi-scale training | 33.6 | 54.5 |
| +ensemble | 37.6 | 59.9 |
Conclusion
We present the first fully convolutional method for instance-aware semantic segmentation. It extends the existing FCN-based approaches and significantly pushes forward the state-of-the-art in both accuracy and efficiency for the task. The high performance benefits from the highly integrated and efficient network architecture, especially a novel joint formulation.