Eryck Zhou

A super simple BLOG for Artificial Intelligence.

YOLOv2

28 July 2020

Photo by Grace Brauteseth

Origin: YOLO9000: Better, Faster, Stronger

Improvements compared to YOLOv1

1. Add batch normalization (BN)

2. High-resolution classifier

a.  Train on ImageNet (224 x 224)
b.  Resize & Finetune on ImageNet (448 x 448)
c.  Finetune on dataset
d.  The detector ends with a 13 x 13 output grid (YOLOv1 used 7 x 7)

3. Use Anchors

4. Fine-Grained Features

a.  Lower-level (higher-resolution) features are concatenated directly with the higher-level features
b.  A new layer, reorg, is added for that purpose

5. Multi-Scale Training

  • Remove FC layers: the network can accept inputs of any size, which improves robustness.
  • Input sizes range over 320, 352, …, 608; a new size is chosen every 10 batches.
         [each side must satisfy size % 32 = 0, because the network downsamples by a factor of 32]
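The multi-scale schedule can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and it assumes the size is re-drawn every 10 batches, as the YOLO9000 paper describes.

```python
import random

DOWNSAMPLE = 32                # total stride of the network
MIN_SIZE, MAX_SIZE = 320, 608  # smallest and largest training resolution


def candidate_sizes(lo=MIN_SIZE, hi=MAX_SIZE, step=DOWNSAMPLE):
    """All square input sizes between lo and hi that are multiples of step."""
    return list(range(lo, hi + 1, step))


def size_for_batch(batch_idx, current_size, rng, interval=10):
    """Draw a new random input size every `interval` batches, else keep the current one."""
    if batch_idx % interval == 0:
        return rng.choice(candidate_sizes())
    return current_size


rng = random.Random(0)
size = 416
for batch_idx in range(30):
    size = size_for_batch(batch_idx, size, rng)
    grid = size // DOWNSAMPLE  # side of the output grid, e.g. 416 -> 13
```

Because every candidate is a multiple of 32, the output grid side is always an integer (10 for 320, 19 for 608).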

Anchor in YOLOv2

| Box Generation | # | Avg IOU |
|---|---|---|
| Cluster SSE | 5 | 58.7 |
| Cluster IOU | 5 | 61.0 |
| Anchor Boxes | 9 | 60.9 |
| Cluster IOU | 9 | 67.2 |

1. Anchor size and number

a.  Faster R-CNN: 9 anchors, picked by hand
b.  YOLOv2: 5 anchors, found by k-means [distance: d(bbox, cluster) = 1 − IOU(bbox, cluster)]

2. Anchors, ground-truth boxes & predicted boxes

Anchors: 0.57273, 0.677385, …, 9.77052, 9.16828

               [10 numbers: 5 pairs $(a_{w_{i}}, a_{h_{i}})$, $i = 0, \dots, 4$, in grid-cell units]
               anchors[0] = $a_{w_{0}}$ = $\frac{\text{anchor width in pixels}}{W} \times 13$
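Anchor sizes like these come out of k-means with the 1 − IOU distance. Below is a rough pure-Python sketch of that clustering. It is not the author's or Darknet's code: the deterministic initialization and the mean-based centroid update are assumptions made for the example, and the boxes are assumed to be (w, h) pairs already scaled to grid units.

```python
def iou_wh(box, centroid):
    """IOU of two (w, h) boxes aligned at the same top-left corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (w, h) pairs using the distance d = 1 - IOU(box, centroid)."""
    # Assumption: deterministic init from k boxes evenly spaced by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centroids = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid with the smallest 1 - IOU.
            best = min(range(k), key=lambda c: 1.0 - iou_wh(box, centroids[c]))
            groups[best].append(box)
        new_centroids = []
        for c, members in enumerate(groups):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centroids.append((w, h))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids


boxes = [(1, 1), (1.2, 0.9), (0.9, 1.1), (8, 8), (7.5, 8.5), (8.5, 7.5)]
anchors = sorted(kmeans_anchors(boxes, k=2))
```

Using 1 − IOU instead of Euclidean distance keeps large boxes from dominating the error, which is the reason the paper gives for this choice.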

Ground-truth box:
  1. original bbox: $[x_{o}, y_{o}, w_{o}, h_{o}]$ in pixels, with $x_{o}, w_{o} \in [0, W]$ and $y_{o}, h_{o} \in [0, H]$
  2. normalized bbox: $[x_{r}, y_{r}, w_{r}, h_{r}]\in[0, 1]$
    • $[x_{r}, y_{r}, w_{r}, h_{r}] = [x_{o}/W, y_{o}/H, w_{o}/W, h_{o}/H]$
  3. transfer to the 13 x 13 grid: $[x_{s}, y_{s}, w_{s}, h_{s}]\in[0, 13]$
    • $[x_{s}, y_{s}, w_{s}, h_{s}] = [x_{r}, y_{r}, w_{r}, h_{r}] \times 13$
    • save these values for later calculations
    • then express the center relative to its own grid cell (0~1)
  4. final box: $x_{f}, y_{f} \in [0, 1)$; $w_{f}, h_{f}$ are log ratios to the anchor size
    • $x_{f} = x_{s} - i$     // (i, j) = column and row of the grid cell containing the center: $i = \lfloor x_{s} \rfloor$, $j = \lfloor y_{s} \rfloor$
    • $y_{f} = y_{s} - j$
    • $w_{f} = \log(w_{s}/\text{anchors}[0])$
    • $h_{f} = \log(h_{s}/\text{anchors}[1])$
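The four steps above can be written as one small function. This is a sketch under assumptions: the name `encode_truth` is made up here, (x, y) is taken to be the box center in pixels, and the matched anchor is passed in already expressed in grid units.

```python
import math

GRID = 13  # side of the output grid


def encode_truth(box_px, img_w, img_h, anchor_wh, grid=GRID):
    """Encode a pixel-space box (center x, center y, w, h) as training targets.

    anchor_wh is the matched anchor's (w, h) in grid units.
    Returns (i, j, x_f, y_f, w_f, h_f).
    """
    x_o, y_o, w_o, h_o = box_px
    # Step 2: normalize to [0, 1]
    x_r, y_r, w_r, h_r = x_o / img_w, y_o / img_h, w_o / img_w, h_o / img_h
    # Step 3: scale to grid units, [0, 13]
    x_s, y_s, w_s, h_s = x_r * grid, y_r * grid, w_r * grid, h_r * grid
    # Grid cell (column i, row j) that contains the box center
    i = min(int(x_s), grid - 1)
    j = min(int(y_s), grid - 1)
    # Step 4: offset inside the cell, and log ratio of size to the anchor
    x_f, y_f = x_s - i, y_s - j
    w_f = math.log(w_s / anchor_wh[0])
    h_f = math.log(h_s / anchor_wh[1])
    return i, j, x_f, y_f, w_f, h_f


# A 64 x 128 px box centered at (208, 104) in a 416 x 416 image,
# matched to a hypothetical anchor of 2 x 4 grid cells.
target = encode_truth((208, 104, 64, 128), 416, 416, (2.0, 4.0))
```

In this example the center lands in cell (6, 3) with offsets (0.5, 0.25), and both log ratios are about 0 because the box has the same size as the anchor.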
Predicted Anchor:
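For the predicted box, the YOLO9000 paper decodes the raw outputs $(t_x, t_y, t_w, t_h, t_o)$ with a sigmoid on the center offsets and an exponential on the sizes: $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$. A minimal sketch, where the function name and argument layout are assumptions for illustration:

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_prediction(t, cell, anchor_wh):
    """Decode raw outputs into a box in grid units plus a confidence.

    t         = (t_x, t_y, t_w, t_h, t_o), the raw network outputs
    cell      = (c_x, c_y), top-left corner of the grid cell
    anchor_wh = (p_w, p_h), anchor size in grid units
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = anchor_wh
    b_x = sigmoid(t_x) + c_x   # center stays inside its cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # size is a scaled version of the anchor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h, sigmoid(t_o)


box = decode_prediction((0, 0, 0, 0, 0), cell=(6, 3), anchor_wh=(2.0, 4.0))
```

All-zero outputs decode to a box centered in the middle of the cell with exactly the anchor's size, which mirrors the ground-truth encoding.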

The Model Darknet-19

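Output of YOLOv2: each predicted box carries 25 values on VOC, indexed [0: 25]

whole output: $S^2$ * B * [x, y, w, h, $C_{0}, ..., C_{N}$], so each grid cell predicts B boxes (S = 13, B = 5). Reading $C_{0}$ as the objectness confidence and $C_{1}, ..., C_{N}$ as the N = 20 VOC class scores gives 4 + 1 + 20 = 25 values per box.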
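The sizes are easy to check with a little arithmetic. The sketch below assumes VOC (20 classes), 5 anchors, and a box-major channel layout within each cell; the layout and the function names are assumptions for illustration, not something the post specifies.

```python
def output_shape(grid=13, num_anchors=5, num_classes=20):
    """Shape of the detection output: grid x grid x (B * (4 + 1 + classes))."""
    values_per_box = 4 + 1 + num_classes  # x, y, w, h, confidence, class scores
    return (grid, grid, num_anchors * values_per_box)


def split_cell(cell_vector, num_anchors=5, num_classes=20):
    """Split one grid cell's channel vector into per-box predictions."""
    n = 4 + 1 + num_classes
    assert len(cell_vector) == num_anchors * n
    boxes = []
    for b in range(num_anchors):
        chunk = cell_vector[b * n:(b + 1) * n]
        boxes.append({"xywh": chunk[0:4], "conf": chunk[4], "classes": chunk[5:]})
    return boxes


shape = output_shape()                # (13, 13, 125)
boxes = split_cell(list(range(125)))  # dummy values just to show the slicing
```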

  • detection layer
    • 3 x (3 x 3 x 1024) Conv, i.e. three 3 x 3 conv layers with 1024 filters each
    • add a passthrough layer
    • 1*1 avg pooling
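The passthrough layer relies on reorg, which moves each 2 x 2 spatial neighbourhood into the channel dimension. In the paper this turns the 26 x 26 x 512 feature map into 13 x 13 x 2048 so it can be concatenated with the 13 x 13 features. Here is a pure-Python sketch of a plain space-to-depth operation on nested lists. The channel ordering is an assumption and may differ from Darknet's own reorg layer.

```python
def reorg(x, stride=2):
    """Space-to-depth: [C][H][W] -> [C * stride * stride][H // stride][W // stride]."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    assert height % stride == 0 and width % stride == 0
    out = []
    # Each (dy, dx) offset inside a stride x stride block becomes its own
    # group of output channels.
    for dy in range(stride):
        for dx in range(stride):
            for c in range(channels):
                out.append([[x[c][i + dy][j + dx]
                             for j in range(0, width, stride)]
                            for i in range(0, height, stride)])
    return out


# One 4 x 4 channel becomes four 2 x 2 channels.
x = [[[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]]
y = reorg(x)
```

No values are dropped: the spatial resolution is halved while the channel count grows by a factor of 4.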

Loss Function

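  • The width and height terms no longer use the square root (YOLOv1 did)
  • Confidence: the target of 1 is replaced by the IOU between the predicted box and the ground truth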

The path from YOLO to YOLOv2

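| | YOLO | | | | | | | | YOLOv2 |
|---|---|---|---|---|---|---|---|---|---|
| batch norm? | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| hi-res classifier? | | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| convolutional? | | | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| anchor boxes? | | | | ✓ | ✓ | | | | |
| new network? | | | | | ✓ | ✓ | ✓ | ✓ | ✓ |
| dimension priors? | | | | | | ✓ | ✓ | ✓ | ✓ |
| location prediction? | | | | | | ✓ | ✓ | ✓ | ✓ |
| passthrough? | | | | | | | ✓ | ✓ | ✓ |
| multi-scale? | | | | | | | | ✓ | ✓ |
| hi-res detector? | | | | | | | | | ✓ |
| VOC2007 mAP | 63.4 | 65.8 | 69.5 | 69.2 | 69.6 | 74.4 | 75.4 | 76.8 | 78.6 |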