Learning Cross-Modal Deep Representations for Robust Pedestrian Detection

Published at CVPR 2017


  • Exploiting multispectral data can make it easier to distinguish hard positive and negative samples in RGB images.
  • Most surveillance systems still employ traditional RGB sensors.
  • Few labeled multi-modal datasets are available for pedestrian detection.


  • A novel approach for learning and transferring cross-modal feature representations for pedestrian detection.
  • First work on the problem of pedestrian detection under adverse illumination conditions with CNNs.


  • The overall framework consists of an RRN (Region Reconstruction Network) and an MSDN (Multi-Scale Detection Network).
    • RRN: Reconstruct thermal data from RGB images.
      • Inputs: RGB image and the corresponding bounding box proposals generated by a pretrained generic pedestrian detector (ACF); the proposals contain both true and false positives.
      • Outputs: Reconstructed proposals (L2 loss).
        Initial proposal generation can ease the reconstruction of hard positive and negative samples.
    • MSDN: Pedestrian detection.
      • Inputs: Only RGB images and bounding box proposals from ACF.
      • Outputs: Detection results (location and classification results of the proposals).
    • Optimization procedure.
      • Train the RRN on multispectral data.
      • Train MSDN with RGB data (train Sub-Net A first, and then finetune the whole network).
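
The RRN objective described above (an L2 loss between reconstructed proposals and the thermal data) can be illustrated with a minimal sketch. It assumes each reconstructed patch already has the same size as the thermal crop of its proposal, and it averages a sum-of-squares loss over proposals. The helper names (`crop`, `l2_loss`, `reconstruction_loss`) and the normalization are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical types for this sketch: a single-channel image as nested lists,
# and a box as (x1, y1, x2, y2) in pixel coordinates with exclusive x2/y2.
Image = List[List[float]]
Box = Tuple[int, int, int, int]


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of `image`."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def l2_loss(pred: Image, target: Image) -> float:
    """Sum of squared differences between two patches of the same size."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def reconstruction_loss(
    recon_patches: Sequence[Image], thermal: Image, boxes: Sequence[Box]
) -> float:
    """Average L2 loss between reconstructed patches and thermal crops.

    Assumption: one reconstructed patch per proposal, in the same order as
    `boxes`, and each patch matches the size of its thermal crop.
    """
    if not boxes:
        return 0.0
    total = sum(
        l2_loss(patch, crop(thermal, box))
        for patch, box in zip(recon_patches, boxes)
    )
    return total / len(boxes)
```

Because the proposals come from ACF and include false positives, a loss of this shape forces the network to reconstruct thermal responses for both pedestrians and background regions, which is what the notes credit for easing the hard cases.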

Experiments and Results

  • KAIST dataset (both RGB and thermal data)
    • Effectiveness of the proposed cross-modality transfer CNN (CMT-CNN).
      CMT-CNN-SA: Only Sub-Net A is used.
      CMT-CNN-SA-SB (Random): Initialize Sub-Net B with random parameters.
      CMT-CNN-SA-SB (ImageNet): Initialize Sub-Net B with ImageNet pretraining.
      CMT-CNN: The proposed method.
      Evaluated by log average miss-rate (MR).
    • Effectiveness of the reconstruction subnet.
      Can successfully reconstruct the thermal data.
      Can clearly distinguish the true pedestrian proposals from the false positive samples.
    • Comparison to state-of-the-art methods.
  • Caltech dataset (only RGB images)
    • As no thermal data is available for the Caltech dataset, the proposed method initializes Sub-Net B with the RRN trained on the KAIST dataset.
    • Results
      The performance gain from the knowledge transfer was smaller than on KAIST, whose images have worse illumination.
      CMT-CNN-SA-SB (RGB-KAIST): Pretrain Sub-Net B on ImageNet, then further train it on KAIST RGB data.
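
The log-average miss rate used for evaluation above can be sketched as follows. This assumes the common Caltech-style protocol: sample the miss rate at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0, then take the geometric mean. The function name and the handling of reference points below the first curve point (miss rate taken as 1.0) are assumptions of this sketch, not details stated in the notes.

```python
import math
from typing import Sequence


def log_average_miss_rate(
    fppi: Sequence[float], miss_rate: Sequence[float], num_points: int = 9
) -> float:
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    `fppi` must be sorted in ascending order, with `miss_rate[i]` being the
    miss rate at `fppi[i]`.
    """
    # Reference FPPI values: 10^-2, ..., 10^0, evenly spaced in log space.
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    samples = []
    for ref in refs:
        # Use the curve point with the largest FPPI not exceeding `ref`.
        # If the curve starts above `ref`, fall back to a miss rate of 1.0.
        mr_at_ref = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr_at_ref = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        samples.append(max(mr_at_ref, 1e-10))

    return math.exp(sum(math.log(s) for s in samples) / len(samples))
```

Lower values are better; averaging in log space keeps one operating point from dominating the score.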


  • Reconstruction was employed as the supervision to get useful features from the thermal images instead of direct segmentation labels.
  • Multiscale detection was enabled through ROI pooling.
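
To make the ROI pooling remark concrete, here is a minimal single-channel sketch of ROI max pooling in the style of Fast R-CNN. The function name `roi_max_pool`, the nested-list feature map, and the exclusive box convention are assumptions for illustration; the paper's actual layer operates on multi-channel CNN feature maps.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), with exclusive x2/y2


def roi_max_pool(
    feature: FeatureMap, box: Box, out_h: int, out_w: int
) -> FeatureMap:
    """Max-pool the region `box` of `feature` into an out_h x out_w grid.

    The region is split into roughly equal bins (floor for the start index,
    ceil for the end index), so every bin covers at least one cell as long
    as the box is at least 1x1.
    """
    x1, y1, x2, y2 = box
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        y_start = y1 + math.floor(i * roi_h / out_h)
        y_end = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            x_start = x1 + math.floor(j * roi_w / out_w)
            x_end = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(
                max(
                    feature[y][x]
                    for y in range(y_start, y_end)
                    for x in range(x_start, x_end)
                )
            )
        pooled.append(row)
    return pooled


# Two proposals of different sizes yield outputs of the same 2x2 shape.
feature = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
large = roi_max_pool(feature, (0, 0, 4, 4), 2, 2)
small = roi_max_pool(feature, (1, 1, 4, 3), 2, 2)
```

Since every proposal is reduced to the same fixed-size grid regardless of its original extent, the layers that follow can classify pedestrians at different scales with shared weights.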