YOLOv2
28 July 2020

Photo by Grace Brauteseth
Origin: YOLO9000: Better, Faster, Stronger
Improvements compared to YOLOv1
1. Add batch normalization (BN)
2. High-resolution classifier
a. Train on ImageNet (224 x 224)
b. Resize & Finetune on ImageNet (448 x 448)
c. Finetune on dataset
d. Final output is a 13 x 13 grid (7 x 7 grid before)
3. Use Anchors
4. Fine-Grained Features
a. Lower-level features are concatenated directly to higher-level features
b. A new layer, reorg, is added for that purpose
5. Multi-Scale Training
- Remove FC layers:
The network can accept inputs of any size, which improves model robustness.
- Input size is drawn from 320, 352, …, 608; a new size is chosen every 10 batches
[side length % 32 = 0, because the network downsamples by a factor of 32]
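The size schedule above can be sketched in a few lines. This is only an illustration, not the Darknet implementation: the function names, the fixed seed, and the 10-batch period (taken from the YOLO9000 paper) are assumptions made for the example.

```python
import random

STRIDE = 32  # total downsampling factor of the network
SIZES = list(range(320, 608 + 1, STRIDE))  # 320, 352, ..., 608


def grid_size(input_size):
    """Side length of the output grid for a square input of this size."""
    return input_size // STRIDE


def size_schedule(num_batches, period=10, seed=0):
    """Yield (batch_index, input_size), drawing a new size every `period` batches."""
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    for batch in range(num_batches):
        if batch > 0 and batch % period == 0:
            size = rng.choice(SIZES)
        yield batch, size


if __name__ == "__main__":
    for batch, size in size_schedule(30):
        if batch % 10 == 0:
            print(batch, size, grid_size(size))
```

With a 416 x 416 input, `grid_size(416)` gives the 13 x 13 grid mentioned above; the smallest and largest sizes give 10 x 10 and 19 x 19 grids.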
Anchor in YOLOv2
| Box Generation | # | Avg IOU |
| --- | --- | --- |
| Cluster SSE | 5 | 58.7 |
| Cluster IOU | 5 | 61.0 |
| Anchor Boxes | 9 | 60.9 |
| Cluster IOU | 9 | 67.2 |
1. Anchor size and number
a. Faster R-CNN: 9, chosen by hand
b. YOLOv2: 5, chosen by K-Means [distance: 1 − IOU(bbox, centroid)]
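A minimal sketch of K-Means with the 1 − IOU distance, assuming boxes are given as (width, height) pairs and compared as if they shared a corner. The function names, the mean-based centroid update, and the toy data are assumptions for illustration; real implementations differ in initialization and in how centroids are updated.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (w, h) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) boxes using distance = 1 - IoU(box, centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign the box to the centroid with the smallest 1 - IoU.
            best = min(range(k), key=lambda c: 1.0 - iou_wh(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


def avg_iou(boxes, centroids):
    """Average IoU between each box and its best-matching centroid."""
    return sum(max(iou_wh(b, c) for c in centroids) for b in boxes) / len(boxes)


if __name__ == "__main__":
    toy = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (4.0, 2.0), (4.2, 1.8), (3.8, 2.2)]
    anchors = kmeans_anchors(toy, k=2)
    print(anchors, avg_iou(toy, anchors))
```

Using 1 − IOU rather than Euclidean distance keeps large boxes from dominating the error, which is the motivation for the "Cluster IOU" rows in the table.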
2. Anchors, Truth BBoxes & Predicted bboxes
Anchors: 0.57273, 0.677385, …, 9.77052, 9.16828
[10 numbers: ($a_{w_{i}}, a_{h_{i}}$) * 5]
anchors[0] = $a_{w_{0}}$ = $\frac{a_{w_{0}}^{pixel}}{W}$ * 13 (the anchor width in pixels, rescaled to 13 x 13 grid units)
Truth BBox:
- original bbox: $\left[x_{o}, y_{o}, w_{o}, h_{o}\right]$, with $x_{o}, w_{o} \in [0, W]$ and $y_{o}, h_{o} \in [0, H]$
- normalize original bbox: $[x_{r}, y_{r}, w_{r}, h_{r}]\in[0, 1]$
- $[x_{r}, y_{r}, w_{r}, h_{r}] = [x_{o}/W, y_{o}/H, w_{o}/W, h_{o}/H]$
- Transfer to the 13 x 13 grid: $[x_{s}, y_{s}, w_{s}, h_{s}]\in[0, 13]$
- $\left[x_{s}, y_{s}, w_{s}, h_{s}\right]=\left[x_{r}, y_{r}, w_{r}, h_{r}\right] * 13$
- save these values for later calculations
- convert x and y to offsets in 0~1 relative to their grid cell
- final box: $[x_{f}, y_{f}, w_{f}, h_{f}]$, with $x_{f}, y_{f}\in[0, 1)$
- $x_{f} = x_{s} - i$ // (i, j) = index of the grid cell containing $(x_{s}, y_{s})$, i.e. $i = \lfloor x_{s} \rfloor$, $j = \lfloor y_{s} \rfloor$
- $y_{f} = y_{s} - j$
- $w_{f} = \log(w_{s}/anchors[0])$
- $h_{f} = \log(h_{s}/anchors[1])$
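The encoding steps above can be written as one small function. This is a sketch under stated assumptions: the box is given in pixels with (x, y) as its center, the anchor is already in grid units, and the function name and the clamping of boundary boxes are choices made for the example.

```python
import math


def encode_truth(box, image_size, anchor, grid=13):
    """Turn a pixel-space box into (cell index, regression targets).

    box:        (x_o, y_o, w_o, h_o) in pixels, (x_o, y_o) assumed to be the center
    image_size: (W, H) in pixels
    anchor:     (a_w, a_h) in grid units
    """
    x_o, y_o, w_o, h_o = box
    W, H = image_size
    # Normalize to [0, 1], then scale to the grid.
    x_s, y_s = x_o / W * grid, y_o / H * grid
    w_s, h_s = w_o / W * grid, h_o / H * grid
    # Grid cell containing the center (clamped so a box on the border stays inside).
    i = min(int(x_s), grid - 1)
    j = min(int(y_s), grid - 1)
    # Offsets inside the cell and log-ratios against the anchor.
    x_f, y_f = x_s - i, y_s - j
    w_f = math.log(w_s / anchor[0])
    h_f = math.log(h_s / anchor[1])
    return (i, j), (x_f, y_f, w_f, h_f)


if __name__ == "__main__":
    # First anchor pair from the list above, on a hypothetical 416 x 416 image.
    cell, target = encode_truth((208, 208, 104, 52), (416, 416), (0.57273, 0.677385))
    print(cell, target)
```

A box whose grid-scale size equals the anchor exactly gets width and height targets of 0, which is why the anchors act as priors.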
Predicted BBox:
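As a hedged sketch, a prediction can be mapped back to a box by inverting the truth encoding. The sigmoid on the x and y outputs follows the direct location prediction described in the YOLO9000 paper, not these notes; the function names and the normalized (0~1) return format are assumptions for illustration.

```python
import math


def sigmoid(t):
    """Logistic function, used to keep the center offset inside its cell."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_prediction(t, cell, anchor, grid=13):
    """Map raw outputs (t_x, t_y, t_w, t_h) to a normalized box.

    cell:   (i, j) index of the grid cell making the prediction
    anchor: (a_w, a_h) in grid units
    Returns (x, y, w, h) as fractions of the image width and height.
    """
    t_x, t_y, t_w, t_h = t
    i, j = cell
    b_x = (i + sigmoid(t_x)) / grid
    b_y = (j + sigmoid(t_y)) / grid
    b_w = anchor[0] * math.exp(t_w) / grid
    b_h = anchor[1] * math.exp(t_h) / grid
    return b_x, b_y, b_w, b_h


if __name__ == "__main__":
    # All-zero outputs give a box centered in the cell with the anchor's size.
    print(decode_prediction((0.0, 0.0, 0.0, 0.0), (6, 6), (3.25, 1.625)))
```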
The Model: Darknet-19
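Output of YOLOv2: [0: 25] per anchor (25 values on VOC: 4 box coordinates + 1 confidence + 20 class scores)
all grid cells: $S^2$ * B * [x, y, w, h, $C_{0}, ..., C_{N}$] (S x S grid cells, B anchors per cell; reading $C_{0}$ as the confidence and $C_{1}, ..., C_{N}$ as the class scores)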
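The output size follows from simple arithmetic. A small sketch, assuming S = 13, B = 5 anchors and 20 VOC classes; the function name is hypothetical.

```python
def output_shape(S=13, B=5, num_classes=20):
    """Return ((S, S, channels), total_values) for the detection output."""
    per_anchor = 4 + 1 + num_classes  # x, y, w, h + confidence + class scores
    channels = B * per_anchor
    return (S, S, channels), S * S * channels


if __name__ == "__main__":
    print(output_shape())                 # VOC, 20 classes
    print(output_shape(num_classes=80))   # e.g. an 80-class dataset
```

For VOC this gives 25 values per anchor and 125 channels per grid cell.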
- detection layer
- 3 x (3 x 3 x 1024) Conv [three 3 x 3 convolutional layers with 1024 filters each]
- add a passthrough layer
- 1 x 1 Conv producing the final output (the classifier's 1 x 1 Conv + global average pooling head is removed)
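The passthrough (reorg) layer is essentially a space-to-depth rearrangement: each 2 x 2 spatial block is moved into the channel dimension, so a higher-resolution map can be concatenated with the 13 x 13 map. The sketch below uses plain nested lists in [channel][row][column] layout; the channel ordering is an assumption and is not claimed to match Darknet's reorg layer.

```python
def reorg(feature, stride=2):
    """Space-to-depth: [C][H][W] -> [C * stride^2][H / stride][W / stride]."""
    C = len(feature)
    H = len(feature[0])
    W = len(feature[0][0])
    if H % stride or W % stride:
        raise ValueError("height and width must be divisible by stride")
    out = []
    for dy in range(stride):
        for dx in range(stride):
            for c in range(C):
                # Take every `stride`-th pixel, starting at offset (dy, dx).
                out.append([
                    [feature[c][y * stride + dy][x * stride + dx]
                     for x in range(W // stride)]
                    for y in range(H // stride)
                ])
    return out


if __name__ == "__main__":
    # One 4 x 4 channel holding 0..15 becomes four 2 x 2 channels.
    fmap = [[[y * 4 + x for x in range(4)] for y in range(4)]]
    for channel in reorg(fmap):
        print(channel)
```

In the YOLO9000 paper the same operation turns a 26 x 26 x 512 map into 13 x 13 x 2048 before concatenation; no values are discarded, only rearranged.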
Loss Function
- The width/height terms no longer use the square root
- Confidence target: the IoU between the predicted box and the ground truth, instead of 1
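One hedged reading of the first point, written out. The YOLO9000 paper does not spell out its loss, so the second line is an assumption based on the commonly used Darknet implementation, where the width and height errors are taken on the log-ratio targets $t_{w} = \log(w / a_{w})$ and $t_{h} = \log(h / a_{h})$ against the anchor $(a_{w}, a_{h})$, with $w$ and $h$ measured in grid units:

$$
\text{YOLOv1: } \lambda_{coord} \sum \left[ \left(\sqrt{w} - \sqrt{\hat{w}}\right)^{2} + \left(\sqrt{h} - \sqrt{\hat{h}}\right)^{2} \right]
$$

$$
\text{YOLOv2 (assumed): } \lambda_{coord} \sum \left[ \left(t_{w} - \hat{t}_{w}\right)^{2} + \left(t_{h} - \hat{t}_{h}\right)^{2} \right]
$$

Both forms damp the influence of large boxes; the logarithm takes over the role the square root played in YOLOv1.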
The path from YOLO to YOLOv2
| | YOLO | | | | | | | | YOLOv2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| batch norm? | | √ | √ | √ | √ | √ | √ | √ | √ |
| hi-res classifier? | | | √ | √ | √ | √ | √ | √ | √ |
| convolutional? | | | | √ | √ | √ | √ | √ | √ |
| anchor boxes? | | | | √ | √ | | | | |
| new network? | | | | | √ | √ | √ | √ | √ |
| dimension priors? | | | | | | √ | √ | √ | √ |
| location prediction? | | | | | | √ | √ | √ | √ |
| passthrough? | | | | | | | √ | √ | √ |
| multi-scale? | | | | | | | | √ | √ |
| hi-res detector? | | | | | | | | | √ |
| VOC2007 mAP | 63.4 | 65.8 | 69.5 | 69.2 | 69.6 | 74.4 | 75.4 | 76.8 | 78.6 |