As a regression problem to spatially separated bounding boxes and
associated class probabilities
A single neural network predicts bounding boxes and class probabilities directly from
full images in one evaluation
more localization errors but is less likely to predict
false positives on background
The YOLO Detection System
Processing images with YOLO is simple and straightforward
resizes the input image to 448 × 448
runs a single convolutional network on the image
thresholds the resulting detections bythe model’s confidence
The Model
Procedure
It divides the image into an S × S grid [448 × 448 -> 7 x 7] If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. Bounding Box: x, y, w, h(center) Confidence: $Pr(object) \cdot IoU^{pred}_{truth}$
Final output tensor: S × S × (B ∗ 5 + C)
The Loss Function
i: 0~($S^2$-1) [iterate each grid (0~48)] j: 0~(B-1) [iterate each bbox (0~1)]
For $1_{i j}^{\mathrm{obj}}$, we have B predictions in each cell, only the one with largest IoU shall be labeled as 1
Coordinate Loss
x, y: predicated bbox center
w, h: predicated bbox width & height
$\hat{x}, \hat{y}$: labeled bbox center
$\hat{w}, \hat{h}$: labeled bbox width & height
$\sqrt{w}, \sqrt{h}$: Suppress the effect for larger bbox
$\lambda_{\text {coord }}$: 5. because there’s only 8 dimensions. Too less comparing to other losses weighted loss essentially.
Confidence Loss
$\hat{C}_{i}$: confidence score [IoU] of predicted and ground truth
$C_{i}$: preidcted confidence score [IoU] generated from network
Note:
$\hat{C}_{i}$ is 0 or 1 integer
$\lambda_{\text {noobj }}$=0.5, because there’s so many non-object bboxes