Co-CNN
(ICCV 2015) Human Parsing with Contextualized Convolutional Neural Network
(T-PAMI 2016) Human Parsing with Contextualized Convolutional Neural Network
Paper: http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Liang_Human_Parsing_With_ICCV_2015_paper.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7423822&queryText=human%20parsing%20with%20contextualized&newsearch=true
Project: http://hcp.sysu.edu.cn/deep-human-parsing/
Proposes the Contextualized Convolutional Neural Network (Co-CNN), which incorporates cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a CNN.
cross-layer context: a local-to-global-to-local structure (feature maps from earlier layers are summed into later layers, and the input image, after convolution, is added to the last feature map), combining the global semantic information and local fine details across different convolutional layers.
global image-level label prediction: all labels present in the segmentation are aggregated into a binary vector for multi-label classification.
within-super-pixel smoothing and cross-super-pixel neighborhood voting: smooth the local label predictions.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which well integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
The cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines the global semantic information and the local fine details across different convolutional layers.
The global image-level label prediction is used as an auxiliary objective in the intermediate layer of the Co-CNN.
The within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural sub-components of the Co-CNN to achieve local label consistency in both the training and testing processes.
Introduction
None of the previous methods performs dense prediction over raw image pixels in a fully end-to-end way.
Diverse contextual information and the mutual relationships among the key components of human parsing (i.e., semantic labels, spatial layouts and shape priors) should be well addressed when predicting pixel-wise labels.
The predicted label maps should be detail-preserving and of high resolution, in order to recognize or highlight very small regions (e.g., sunglasses or belts).
In this paper, we present a novel Contextualized Convolutional Neural Network (Co-CNN) that successfully addresses the above mentioned issues.
Figure 1. Our Co-CNN integrates the cross-layer context, global image-level context and local super-pixel contexts into a unified network. It consists of cross-layer combination, global image-level label prediction, within-super-pixel smoothing and cross-super-pixel neighborhood voting.
Given an input 150 × 100 image, we extract feature maps at four resolutions (i.e., 150 × 100, 75 × 50, 37 × 25 and 18 × 12). Then we gradually up-sample the feature maps and combine the corresponding early, fine layers (blue dashed lines) with the deep, coarse layers (blue circle with plus) at the same resolution to capture the cross-layer context.
An auxiliary objective (shown as “Squared loss on image-level labels”) is appended after the down-sampling stream to predict global image-level labels. These predicted probabilities are then aggregated into the subsequent layers after the up-sampling (green line) and used to re-weight the pixel-wise predictions (green circle with plus).
The within-super-pixel smoothing and cross-super-pixel neighborhood voting are performed based on the predicted confidence maps (orange planes) and the generated super-pixel over-segmentation map to produce the final parsing result.
Only down-sampling, up-sampling, and prediction layers are shown; intermediate convolution layers are omitted. For better viewing of all figures in this paper, please see the original zoomed-in color PDF file.
Related Work
Human Parsing
Semantic Segmentation with CNN
The Proposed Co-CNN Architecture
Local-to-global-to-local Hierarchy
Our basic local-to-global-to-local structure captures the cross-layer context. It simultaneously considers the local fine details and global structure information.
The early convolutional layers with high spatial resolutions (e.g., \(150 \times 100\)) often capture more local details, while those with low spatial resolutions (e.g., \(18 \times 12\)) capture more structural information with high-level semantics.
The feature maps up-sampled from the low resolutions and those from the high resolutions are then aggregated with the element-wise summation, shown as the blue circle with plus in Figure 1.
To capture more detailed local boundaries, the input image is further filtered with the \(5 \times 5\) convolutional filters and then aggregated into the later feature maps.
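To make the cross-layer combination concrete, here is a minimal PyTorch sketch of the local-to-global-to-local idea: deeper, coarser maps are up-sampled and summed element-wise with the earlier, finer maps at the same resolution, and the raw input is re-convolved with \(5 \times 5\) filters and added into the final maps. The module name, channel counts and layer depths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerCombination(nn.Module):
    """Illustrative sketch (not the paper's exact layers) of the
    local-to-global-to-local structure: two down-sampling stages, then
    up-sampling with element-wise summation of same-resolution maps,
    plus a re-convolved copy of the raw input added at the end."""

    def __init__(self, ch=64):
        super().__init__()
        self.down1 = nn.Conv2d(3, ch, 5, stride=2, padding=2)   # 150x100 -> 75x50
        self.down2 = nn.Conv2d(ch, ch, 5, stride=2, padding=2)  # 75x50 -> ~37x25
        self.up2 = nn.Conv2d(ch, ch, 5, padding=2)
        self.up1 = nn.Conv2d(ch, ch, 5, padding=2)
        self.input_conv = nn.Conv2d(3, ch, 5, padding=2)        # 5x5 filters on the raw input

    def forward(self, x):
        f1 = F.relu(self.down1(x))      # early, fine layer
        f2 = F.relu(self.down2(f1))     # deep, coarse layer
        u2 = F.interpolate(f2, size=f1.shape[2:], mode='bilinear', align_corners=False)
        g1 = F.relu(self.up2(u2)) + f1  # cross-layer element-wise summation
        u1 = F.interpolate(g1, size=x.shape[2:], mode='bilinear', align_corners=False)
        return F.relu(self.up1(u1)) + self.input_conv(x)  # add the re-convolved input
```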
Global Image-level Context
An auxiliary objective for multi-label prediction is used after the intermediate layers with spatial resolution of \(18 \times 12\), as shown in the pentagon in Figure 1.
Following the fully-connected layer, a C-way softmax, which produces a probability distribution over the C class labels, is appended.
A squared loss is used for the global image-level label prediction.
Suppose that for each image \(I\) in the training set, \(y = [y_1, y_2, \cdots, y_C]\) is the ground-truth multi-label vector, where \(y_c = 1\) \((c = 1, \cdots, C)\) if the image is annotated with class \(c\), and \(y_c = 0\) otherwise.
The predicted label probabilities are fed back into the network in two ways: by concatenating them with the intermediate convolutional layers (image label concatenation in Figure 1) and by element-wise summation with the label confidence maps (element-wise summation in Figure 1).
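As a rough sketch of how the auxiliary branch and the feedback paths could be wired up in PyTorch (assuming a single fully-connected layer at the \(18 \times 12\) resolution; `GlobalLabelBranch`, `squared_label_loss` and `tile_and_concat` are hypothetical names, and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class GlobalLabelBranch(nn.Module):
    """Hypothetical auxiliary branch attached at the 18x12 resolution:
    a fully-connected layer followed by a C-way softmax over the labels."""

    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_ch * 18 * 12, num_classes)

    def forward(self, feat_18x12):                 # (N, in_ch, 18, 12)
        logits = self.fc(feat_18x12.flatten(1))
        return torch.softmax(logits, dim=1)        # predicted label probabilities

def squared_label_loss(probs, y):
    """Squared loss between the predicted probabilities and the binary
    ground-truth multi-label vector y of length C."""
    return ((probs - y) ** 2).sum(dim=1).mean()

def tile_and_concat(feat, probs):
    """Image label concatenation: broadcast the C probabilities over the
    spatial grid and concatenate them with a later feature map."""
    n, c = probs.shape
    tiled = probs.view(n, c, 1, 1).expand(n, c, *feat.shape[2:])
    return torch.cat([feat, tiled], dim=1)
```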
Local Super-pixel Context
Using super-pixel guidance only at this later stage is advantageous: it avoids making premature decisions and thus learning unsatisfactory convolutional filters.
Within-super-pixel Smoothing
For each input image \(I\), we first compute an over-segmentation of \(I\) using the entropy rate based segmentation algorithm [17] and obtain 500 super-pixels per image. Given the \(C\) confidence maps \(\{x_c\}_1^C\) in the prediction layer, the within-super-pixel smoothing is performed on each map \(x_c\). Let \(s_{ij}\) denote the super-pixel covering the pixel at location \((i, j)\); the smoothed confidence maps \(\tilde{x}_c\) can then be computed by
\[\tilde{x}_{i,j,c} = \frac{1}{\lVert s_{ij} \rVert} \sum_{(i', j') \in s_{ij}} x_{i', j', c},\]
where \(\lVert s_{ij} \rVert\) denotes the number of pixels in the super-pixel \(s_{ij}\).
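A minimal NumPy sketch of this smoothing step, assuming the over-segmentation is given as an integer super-pixel label map (the helper name is hypothetical):

```python
import numpy as np

def within_superpixel_smoothing(conf, seg):
    """Replace each pixel's confidence by the mean over its super-pixel.
    conf: (H, W, C) confidence maps; seg: (H, W) integer super-pixel ids
    from the over-segmentation. Returns maps of the same shape."""
    H, W, C = conf.shape
    ids = seg.ravel()
    n_sp = ids.max() + 1
    sums = np.zeros((n_sp, C))
    np.add.at(sums, ids, conf.reshape(-1, C))            # per-super-pixel sums
    counts = np.bincount(ids, minlength=n_sp)[:, None]   # ||s_ij||: pixels per super-pixel
    means = sums / np.maximum(counts, 1)
    return means[ids].reshape(H, W, C)                   # broadcast means back to pixels
```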
Cross-super-pixel Neighborhood Voting
We can take larger neighboring regions into account for better inference, and exploit more statistical structure and correlations between different super-pixels.
For each super-pixel \(s\), we first compute a feature \(b_s\) as the concatenation of bag-of-words histograms over RGB, Lab and HOG descriptors. The cross-neighborhood voted response \(\bar{x}_s\) of super-pixel \(s\) is calculated by
\[\bar{x}_s = (1 - \alpha) \tilde{x}_s + \alpha \sum_{s' \in D_s} \frac{\exp (-\lVert b_s - b_{s'} \rVert ^2)}{\sum_{\hat{s} \in D_s} \exp (-\lVert b_s - b_{\hat{s}} \rVert ^2)} \tilde{x}_{s'},\]
where \(D_s\) denotes the set of super-pixels neighboring \(s\) and \(\alpha\) balances a super-pixel's own response against its neighbors' votes.
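A NumPy sketch of the voting equation above, assuming the confidences have already been smoothed and averaged per super-pixel and the neighborhood sets \(D_s\) are given; the function name and the value of \(\alpha\) are illustrative:

```python
import numpy as np

def cross_superpixel_voting(x_sp, feats, neighbors, alpha=0.3):
    """Blend each super-pixel's smoothed confidence with a similarity-weighted
    vote from its neighbors. x_sp: (S, C) per-super-pixel confidences;
    feats: (S, D) appearance features b_s (e.g., bag-of-words over RGB/Lab/HOG);
    neighbors: dict mapping s to the list of super-pixel ids in D_s."""
    out = x_sp.copy()
    for s, D_s in neighbors.items():
        if not D_s:
            continue
        d = feats[D_s] - feats[s]                 # b_s' - b_s for all s' in D_s
        w = np.exp(-np.sum(d * d, axis=1))        # exp(-||b_s - b_s'||^2)
        w /= w.sum()                              # normalize over the neighborhood
        out[s] = (1 - alpha) * x_sp[s] + alpha * (w[:, None] * x_sp[D_s]).sum(axis=0)
    return out
```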
Our within-super-pixel smoothing and cross-super-pixel neighborhood voting can be seen as two types of pooling methods, which are performed on the local responses within the irregular regions depicted by super-pixels.
Parameter details of Co-CNN
Experiments
Experimental Settings
Dataset
the large ATR dataset [15] and the small Fashionista dataset [31]
Implementation Details
We augment the training images with horizontal reflections, which improves the F-1 score by about 4%.
Given a test image, we use the human detection algorithm [10] to detect the human body.
To evaluate the performance, we re-scale the output pixel-wise prediction back to the size of the original ground-truth labeling.
We train the networks for roughly 90 epochs, which takes 4 to 5 days.
Our Co-CNN can process one 150 × 100 image in about 0.0015 seconds. After incorporating the super-pixel extraction [17], testing one image takes about 0.15 seconds.
Results and Comparisons
Discussion on Our Network
Local-to-Global-to-Local Hierarchy
Global Image-level Context
Local Super-pixel Contexts
Conclusions and Future Work
In this work, we proposed a novel Co-CNN architecture for the human parsing task, which integrates the cross-layer context, global image-level label context and local super-pixel contexts into a unified network.