Knowledge Adaptation for Efficient Semantic Segmentation

Motivation

  • The dilemma between segmentation accuracy and inference efficiency.
  • Existing methods that adopt knowledge distillation were designed for image-level classification and do not consider spatial context structures; moreover, the feature maps from the teacher and student networks usually have inconsistent context and mismatched features, making these methods ill-suited for semantic segmentation.
  • Due to their limited receptive fields, small models have difficulty capturing long-term dependencies and can be statistically brittle.

Contributions

  • Propose a new knowledge distillation method for semantic segmentation that translates the teacher's knowledge into compact representations through an auto-encoder.
  • Design an affinity distillation module that captures long-term dependencies within the feature maps.
  • Conduct experiments to validate the effectiveness of the proposed method.

Methods

  • Overall network framework
    The teacher network is frozen and outputs high-resolution feature maps, while the student network outputs smaller feature maps. Knowledge distillation is enforced only on the feature maps of the final convolution layers.
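    A minimal PyTorch sketch of this setup is shown below; `teacher` stands for any pre-trained segmentation model, and the function name is ours, not the paper's:

    ```python
    import torch.nn as nn

    def freeze_teacher(teacher: nn.Module) -> nn.Module:
        """Freeze the teacher so that only the student (plus the auto-encoder
        and adapter introduced below) receives gradient updates."""
        teacher.eval()                      # also fixes BatchNorm running statistics
        for p in teacher.parameters():
            p.requires_grad = False
        return teacher
    ```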
  • Knowledge translation and adaptation (pixel-wise distillation)
    Translating the knowledge in the feature maps generated by the teacher network is realized through an auto-encoder. The auto-encoder is composed of three convolution layers (encoder) and three deconvolution layers (decoder), each with a 3x3 kernel, padding 1, BN, and ReLU. The strides of the first convolution layer and the first deconvolution layer are set to 2. The auto-encoder is trained with a reconstruction loss:

        $L_{rec} = \|\Phi_t - D(E(\Phi_t))\|_2 + \alpha \|E(\Phi_t)\|_1$

    where $\Phi_t$ is the teacher feature map and $E$ and $D$ are the encoder and decoder. The L1 norm on the encoded representation produces sparse representations; $\alpha = 10^{-7}$.
    A feature adapter is used to avoid large feature differences and to adjust the channel numbers; it consists of three convolution layers (3x3 kernel, padding 1, BN, ReLU).
    The knowledge distillation loss for this part is:

        $L_{adapt} = \frac{1}{|I|} \sum_{i \in I} \left\| E(\Phi_t)_i - C_f(\Phi_s)_i \right\|_2$

    where $E$ is the encoder of the auto-encoder, $I$ is the set of indices of all student-teacher feature pairs over all spatial positions, $\Phi_s$ is the student feature map, and $C_f$ is the adapter.
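    A minimal PyTorch sketch of the auto-encoder, the adapter, and the two losses above. The channel widths (2048 teacher channels, 320 student channels, 512 compact channels) and the bilinear resizing step are our assumptions, not values from the paper:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_bn_relu(cin, cout, stride=1):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def deconv_bn_relu(cin, cout, stride=1):
        # output_padding = stride - 1 restores the exact spatial size for stride 2
        return nn.Sequential(
            nn.ConvTranspose2d(cin, cout, 3, stride=stride, padding=1,
                               output_padding=stride - 1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    class AutoEncoder(nn.Module):
        """Translates teacher features into a compact representation (E) and
        reconstructs them (D); the first layer of each part has stride 2."""
        def __init__(self, c_teacher=2048, c_compact=512):
            super().__init__()
            self.encoder = nn.Sequential(
                conv_bn_relu(c_teacher, c_compact, stride=2),
                conv_bn_relu(c_compact, c_compact),
                conv_bn_relu(c_compact, c_compact))
            self.decoder = nn.Sequential(
                deconv_bn_relu(c_compact, c_compact, stride=2),
                deconv_bn_relu(c_compact, c_compact),
                deconv_bn_relu(c_compact, c_teacher))

        def forward(self, phi_t):
            z = self.encoder(phi_t)
            return z, self.decoder(z)

    def reconstruction_loss(phi_t, autoencoder, alpha=1e-7):
        # L_rec = ||phi_t - D(E(phi_t))||_2 + alpha * ||E(phi_t)||_1
        z, rec = autoencoder(phi_t)
        return (phi_t - rec).norm(p=2) + alpha * z.norm(p=1)

    class Adapter(nn.Module):
        """C_f: maps student features to the compact space, matching channels."""
        def __init__(self, c_student=320, c_compact=512):
            super().__init__()
            self.layers = nn.Sequential(
                conv_bn_relu(c_student, c_compact),
                conv_bn_relu(c_compact, c_compact),
                conv_bn_relu(c_compact, c_compact))

        def forward(self, phi_s):
            return self.layers(phi_s)

    def adaptation_loss(phi_t, phi_s, autoencoder, adapter):
        with torch.no_grad():                 # no gradients through the teacher side
            z_t = autoencoder.encoder(phi_t)
        z_s = adapter(phi_s)
        if z_s.shape[-2:] != z_t.shape[-2:]:  # assumption: align spatial sizes
            z_s = F.interpolate(z_s, size=z_t.shape[-2:],
                                mode='bilinear', align_corners=False)
        # mean L2 distance over all positions i in I
        return (z_t - z_s).pow(2).sum(dim=1).sqrt().mean()
    ```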
  • Affinity distillation module (pair-wise distillation)
    The module computes interactions between any two positions in a feature map, with the expectation that pairs of pixels with the same label generate a high response and pairs with different labels a low response. The affinity between positions $i$ and $j$ is the normalized inner product of their feature vectors:

        $A_{ij} = \frac{f_i^{\top} f_j}{\|f_i\|_2 \, \|f_j\|_2}$

    An L2 loss is used to match the affinity matrices of the teacher and student models:

        $L_{aff} = \frac{1}{(h \times w)^2} \left\| A_t - A_s \right\|_2^2$

    where $h$ and $w$ are the spatial dimensions of the feature map.
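    A sketch of the affinity module under the cosine formulation above; equal spatial sizes for the teacher and student feature maps are assumed (resize one branch first if they differ):

    ```python
    import torch.nn.functional as F

    def affinity_matrix(feat):
        """Cosine affinity between every pair of spatial positions.
        feat: (B, C, H, W) -> (B, H*W, H*W)."""
        f = feat.flatten(2).transpose(1, 2)   # (B, HW, C)
        f = F.normalize(f, p=2, dim=2)        # unit length per position
        return f @ f.transpose(1, 2)          # (B, HW, HW)

    def affinity_loss(feat_t, feat_s):
        a_t = affinity_matrix(feat_t)
        a_s = affinity_matrix(feat_s)
        hw = a_t.shape[-1]
        # (1 / (h*w)^2) * ||A_t - A_s||_2^2, averaged over the batch
        return (a_t - a_s).pow(2).sum(dim=(1, 2)).div(hw * hw).mean()
    ```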
  • Training procedure
    The student is trained with the segmentation loss plus the two distillation terms:

        $L = L_{seg} + \beta \, L_{adapt} + \gamma \, L_{aff}$

    where $L_{seg}$ is the segmentation loss, $\beta = 50$, and $\gamma = 1$.
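    A sketch of one training step combining the three terms; it reuses adaptation_loss and affinity_loss from the sketches above, and the backbone/head split is a placeholder API rather than the paper's code:

    ```python
    import torch
    import torch.nn.functional as F

    def train_step(images, labels, teacher, student, autoencoder, adapter,
                   optimizer, beta=50.0, gamma=1.0):
        # L = L_seg + beta * L_adapt + gamma * L_aff  (beta = 50, gamma = 1)
        with torch.no_grad():
            phi_t = teacher.backbone(images)   # final teacher feature maps
        phi_s = student.backbone(images)       # final student feature maps
        logits = F.interpolate(student.head(phi_s), size=labels.shape[-2:],
                               mode='bilinear', align_corners=False)
        l_seg = F.cross_entropy(logits, labels, ignore_index=255)
        l_adapt = adaptation_loss(phi_t, phi_s, autoencoder, adapter)
        l_aff = affinity_loss(phi_t, phi_s)    # assumes matched spatial sizes
        loss = l_seg + beta * l_adapt + gamma * l_aff
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```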

Results

Teacher networks: Segmentation models based on ResNet-50 and Xception-41 with atrous convolution and ASPP.
Student networks: Segmentation models based on MobileNetV2 and its variants.

  • Pascal VOC 2012
  • Cityscapes
  • Pascal Context

Conclusions

  • The proposed method improved the student network's segmentation performance by about 2%.
  • The method introduces a knowledge distillation approach between features with different spatial resolutions (auto-encoder + reconstruction loss).
  • The two distillation terms can be categorized as pixel-wise distillation and pair-wise distillation.
  • Knowledge distillation was applied only to the final feature maps, which might limit the achievable improvement in segmentation accuracy.