1. Introduction
  2. Related Work
  3. Instance-sensitive FCNs for Segment Proposal
    1. From FCN to InstanceFCN
    2. Local Coherence
    3. Algorithm and Implementation
  4. Experiments
    1. Experiments on PASCAL VOC 2012
    2. Experiments on MS COCO
  5. Conclusion

(ECCV 2016) Instance-sensitive Fully Convolutional Networks
Paper: https://arxiv.org/abs/1603.08678

our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.

our method does not have any high-dimensional layer related to the mask resolution, but instead exploits image local coherence for estimating instances.

Introduction

we develop an end-to-end fully convolutional network that is capable of segmenting candidate instances.

our method computes a set of instance-sensitive score maps, where each pixel is a classifier of relative positions to an object instance.

With this set of score maps, we are able to generate an object instance segment in each sliding window by assembling the output from the score maps.

Unlike DeepMask, our method has no layer whose size is related to the mask size \(m^2\), and each pixel in our method is a low-dimensional classifier. This is made possible by exploiting local coherence [9] of natural images for generating per-window pixel-wise predictions.

Instance-sensitive FCNs for Segment Proposal

From FCN to InstanceFCN

the output is indeed reusable for most pixels, except for those where one object is conjunct the other

Instance-sensitive score maps

we propose an FCN where each output pixel is a classifier of relative positions of instances.

In our practice, we define the relative positions using a \(k \times k\) (e.g., \(k = 3\)) regular grid on a square sliding window. We call them instance-sensitive score maps.

Instance assembling module

We slide a window of resolution \(m \times m\) on the set of instance-sensitive score maps. In this sliding window, each \(\frac{m}{k} \times \frac{m}{k}\) sub-window directly copies values from the same sub-window in the corresponding score map. The k2 sub-windows are then put together (according to their relative positions) to assemble a new window of resolution \(m \times m\).

Local Coherence

for a pixel in a natural image, its prediction is most likely the same when evaluated in two neighboring windows.

This allows us to conserve a large number of parameters when the mask resolution \(m^2\) is high.

This not only reduces the computational cost of the mask prediction layers, but more importantly, reduces the number of parameters required for mask regression, leading to less risk of overfitting on small datasets

Algorithm and Implementation

Network architecture

use the VGG-16 network [22] pre-trained on ImageNet [23] as the feature extractor. We follow the practice in [24] to reduce the network stride and increase feature map resolution: the max pooling layer pool4 (between conv4 3 and conv5 1) is modified to have a stride of 1 instead of 2, and accordingly the filters in conv5 1 to conv5 3 are adjusted by the “hole algorithm” [24].

On top of the feature map, there are two fully convolutional branches, one for estimating segment instances and the other for scoring the instances.

Figure 4. Details of the InstanceFCN architecture. On the top is a fully convolutional branch for generating \(k^2\) instance-sensitive score maps, followed by the assembling module that outputs instances. On the bottom is a fully convolutional branch for predicting the objectness score of each window. The highly scored output instances are on the right. In this figure, the objectness map and the “all instances” map have been sub-sampled for the purpose of illustration.

Training

We adopt the image-centric strategy in [21,19].

The forward pass computes the set of instance-sensitive score maps and the objectness score map.

After that, a set of 256 sliding windows are randomly sampled [21,19], and the instances are only assembled from these 256 windows for computing the loss function.

The loss function is defined as:

\[\sum_i (\mathcal{L}(p_i, p_i^*) + \sum_j \mathcal{L}(S_{i,j}, S_{i,j}^*))\]

Here \(i\) is the index of a sampled window, \(p_i\) is the predicted objectness score of the instance in this window, and \(p_i^*\) is \(1\) if this window is a positive sample and \(0\) if a negative sample. \(S_i\) is the assembled segment instance in this window, \(S_i^*\) is the ground truth segment instance, and \(j\) is the pixel index in the window. \(\mathcal{L}\) is the logistic regression loss.

We follow the scale jittering in [26] for training: a training image is resized such that its shorter side is randomly sampled from \(600 \times 1.5^{ \{−4,−3,−2,−1,0,1\} }\) pixels.

Inference

A forward pass of the network is run on the input image, generating the instance-sensitive score maps and the objectness score map.

The assembling module then applies densely sliding windows on these maps to produce a segment instance at each position.

To handle multiple scales, we resize the shorter side of images to \(600 \times 1.5^{ \{−4,−3,−2,−1,0,1\} }\) pixels, and compute all instances at each scale.

For each output segment, we truncate the values to form a binary mask.

Then we adopt non-maximum suppression (NMS) to generate the final set of segment proposals.

Experiments

Experiments on PASCAL VOC 2012

Ablations on the number of relative positions \(k^2\)

Our method is not sensitive to \(k^2\), and can perform well even when \(k = 3\).

Ablation comparisons with the DeepMask scheme

Comparisons on Instance Semantic Segmentation

Experiments on MS COCO

Conclusion

We have presented InstanceFCN, a fully convolutional scheme for proposing segment instances.

A simple assembling module is then able to generate segment instances from these score maps.

Our network architecture handles instance segmentation without using any high-dimensional layers that depend on the mask resolution.