Video Semantic Segmentation with
Distortion-Aware Feature Correction

Jiafan Zhuang      Zilei Wang      Bingke Wang

Abstract


Video semantic segmentation aims to generate an accurate semantic map for each frame of a video. For this task, performing per-frame image segmentation is generally impractical due to its high computational cost. To address this issue, many works reuse the features of previous frames via flow-based feature propagation, which exploits the content continuity of consecutive frames. However, optical flow estimation is inevitably inaccurate, and the propagated features are consequently distorted. In this paper, we propose a distortion-aware feature correction method that improves video segmentation performance at low computational cost. The core idea is to correct the features mainly on distorted regions using the current frame, while the propagated features are preserved elsewhere. As a result, a lightweight network suffices to achieve promising segmentation results. In particular, we propose to predict the distorted regions in the image space by exploiting the consistency of distortion patterns between images and features, so that high-cost feature extraction on the current frames can be avoided. We conduct extensive experiments on Cityscapes and CamVid, and the results show that our proposed method significantly outperforms previous methods and achieves state-of-the-art performance in both segmentation accuracy and speed.

Framework



The framework of our proposed approach is illustrated above, where semantic segmentation is performed on the features of each frame individually. Specifically, each video frame is treated as either a key frame or a non-key frame. For key frames, we directly perform image semantic segmentation with an off-the-shelf network, and the intermediate features are propagated to the subsequent non-key frames. In this work, features are propagated frame by frame: the feature of the current frame is obtained by warping that of the previous frame, using the predicted optical flow as guidance and bilinear interpolation as the warping operator. At the same time, we propagate the video frame itself with the same optical flow, yielding a propagated frame.

For non-key frames, we first feed the propagated frame and the current frame into our proposed distortion map network (DMNet) to predict a distortion map, which represents the distortion pattern of the propagated feature. We then use the current frame to rectify the propagated feature under the guidance of the predicted distortion map; this correction is performed in our proposed feature correction module (FCM). Finally, we conduct semantic segmentation on the corrected feature to obtain the segmentation result of the current frame, as sketched below.
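To make the non-key-frame pipeline concrete, here is a minimal PyTorch sketch. The module names (flownet, dmnet, shallow_encoder, seg_head) are placeholders rather than the paper's released code, and reading FCM as a distortion-weighted blend is our simplification of the correction step, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp x (N, C, H, W) with optical flow (N, 2, H, W) using
    bilinear interpolation, as in flow-based feature propagation."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # horizontal displacement
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # vertical displacement
    # Normalize sampling locations to [-1, 1] for grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

def segment_non_key(prev_feat, prev_frame, cur_frame,
                    flownet, dmnet, shallow_encoder, seg_head):
    """One non-key-frame step. For simplicity, features are assumed to
    share the frame's spatial resolution; in practice the flow and the
    distortion map would be resized to the feature resolution."""
    flow = flownet(cur_frame, prev_frame)       # optical flow as warping guidance
    warped_feat = flow_warp(prev_feat, flow)    # propagated feature
    warped_frame = flow_warp(prev_frame, flow)  # propagated frame
    # DMNet predicts a distortion map from the propagated and current frames.
    d = torch.sigmoid(dmnet(torch.cat([warped_frame, cur_frame], dim=1)))
    # FCM, read here as a distortion-weighted blend: distorted regions are
    # corrected from a lightweight encoding of the current frame, while the
    # propagated feature is preserved elsewhere.
    cur_feat = shallow_encoder(cur_frame)
    corrected = (1.0 - d) * warped_feat + d * cur_feat
    return seg_head(corrected), corrected  # corrected feature feeds the next frame
```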

Experimental Results

We report results on the Cityscapes val subset and the CamVid test subset. DFF and Accel50 are reimplemented with DeepLabv3+ and the same FlowNet as in our method. We also show qualitative comparisons, where T + i denotes the i-th frame after the key frame T. In addition, we measure the average computation cost by fixing the propagation distance to 5 for all methods; a lighter color in the colorbar indicates higher computation cost. Best viewed in color with zoom.
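For reference, the per-frame cost averaging can be written as below; the function name is ours, and the scheduling assumption (one key frame followed by four non-key frames under a propagation distance of 5) is our reading of the protocol.

```python
def average_cost(key_cost, non_key_cost, distance=5):
    """Average per-frame cost when one key frame is followed by
    (distance - 1) non-key frames."""
    return (key_cost + (distance - 1) * non_key_cost) / distance
```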

Cityscapes


CamVid



Citation

@article{zhuang2020video,
  title={Video Semantic Segmentation with Distortion-Aware Feature Correction},
  author={Zhuang, Jiafan and Wang, Zilei and Wang, Bingke},
  journal={arXiv preprint arXiv:2006.10380},
  year={2020}
}