YOLOv1

26 July 2020

Photo by Jess Barnett

Origin: You Only Look Once: Unified, Real-Time Object Detection

Abstract

As a regression problem to spatially separated bounding boxes and associated class probabilities
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
more localization errors but is less likely to predict false positives on background

The YOLO Detection System

Processing images with YOLO is simple and straightforward

resizes the input image to 448 × 448
runs a single convolutional network on the image
thresholds the resulting detections bythe model’s confidence

The Model

Procedure

It divides the image into an S × S grid [448 × 448 -> 7 x 7]
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.
Bounding Box: x, y, w, h(center)
Confidence: $Pr(object) \cdot IoU^{pred}_{truth}$
Final output tensor: S × S × (B ∗ 5 + C)

The Loss Function

$\begin{array}{c} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{i j}^{\text {obj }}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\ +\lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right] \\ +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ +\lambda_{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {noobj }}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ \quad+\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{\text {obj }} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{array}$

i: 0~($S^2$-1) [iterate each grid (0~48)]
j: 0~(B-1) [iterate each bbox (0~1)]
$1_{i j}^{\mathrm{obj}} \& 1_{i j}^{\mathrm{noobj}}:\left[\begin{array}{lllllll} 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}\right]\left[\begin{array}{lllllll} 1 & 1 & 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 & 1 & 1 \end{array}\right]$

For $1_{i j}^{\mathrm{obj}}$, we have B predictions in each cell, only the one with largest IoU shall be labeled as 1

Coordinate Loss

$\begin{array}{l} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\ \quad+\lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {obj }}\left[(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}})^{2}+(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}})^{2}\right] \end{array}$

x, y: predicated bbox center
w, h: predicated bbox width & height
$\hat{x}, \hat{y}$: labeled bbox center
$\hat{w}, \hat{h}$: labeled bbox width & height
$\sqrt{w}, \sqrt{h}$: Suppress the effect for larger bbox
$\lambda_{\text {coord }}$: 5. because there’s only 8 dimensions. Too less comparing to other losses weighted loss essentially.

Confidence Loss

$\begin{array}{c} +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\mathrm{obj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ +\lambda_{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{i j}^{\text {noobj }}\left(C_{i}-\hat{C}_{i}\right)^{2} \end{array}$

$\hat{C}_{i}$: confidence score [IoU] of predicted and ground truth
$C_{i}$: preidcted confidence score [IoU] generated from network

Note:

$\hat{C}_{i}$ is 0 or 1 integer
$\lambda_{\text {noobj }}$=0.5, because there’s so many non-object bboxes
Train: confidence = $Pr(object) \cdot IoU^{pred}_{truth}$
Test: individual box confidence predicton:
confidence = $Pr(cls_{i}obj)Pr(obj) \cdot IoU^{pred}_{truth}$

Classification loss

$+\sum_{i=0}^{S^{2}} 1_{i}^{\mathrm{obj}} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}$

Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.

▶ Don't forget to do NMS after generating bboxes.

The YOLOv1 Pros & Cons

Pros:

one stage, really fast

Cons:

Bad for crowed objects[1 cell 1 obj]
Bad for small objects
Bad for objects with new width-height ratio
No BN