Eryck Zhou

A super simple BLOG for Artifical Intelligence.

YOLOv1

26 July 2020

image

Photo by unsplash-logoJess Barnett

Origin: You Only Look Once: Unified, Real-Time Object Detection

Abstract

  • As a regression problem to spatially separated bounding boxes and associated class probabilities
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
  • more localization errors but is less likely to predict false positives on background

The YOLO Detection System

Figure 1 Processing images with YOLO is simple and straightforward

  1. resizes the input image to 448 × 448
  2. runs a single convolutional network on the image
  3. thresholds the resulting detections bythe model’s confidence

The Model

Procedure

Figure 2

  1. It divides the image into an S × S grid [448 × 448 -> 7 x 7]
        If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
  2. Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.
        Bounding Box: x, y, w, h(center)
        Confidence: $Pr(object) \cdot IoU^{pred}_{truth}$
  3. Final output tensor: S × S × (B ∗ 5 + C)

The Loss Function

        i: 0~($S^2$-1) [iterate each grid (0~48)]
        j: 0~(B-1) [iterate each bbox (0~1)]

        For $1_{i j}^{\mathrm{obj}}$, we have B predictions in each cell, only the one with largest IoU shall be labeled as 1

Coordinate Loss

  • x, y: predicated bbox center
  • w, h: predicated bbox width & height
  • $\hat{x}, \hat{y}$: labeled bbox center
  • $\hat{w}, \hat{h}$: labeled bbox width & height
  • $\sqrt{w}, \sqrt{h}$: Suppress the effect for larger bbox
  • $\lambda_{\text {coord }}$: 5. because there’s only 8 dimensions. Too less comparing to other losses weighted loss essentially.

Confidence Loss

  • $\hat{C}_{i}$: confidence score [IoU] of predicted and ground truth
  • $C_{i}$: preidcted confidence score [IoU] generated from network

Note:

  • $\hat{C}_{i}$ is 0 or 1 integer
  • $\lambda_{\text {noobj }}$=0.5, because there’s so many non-object bboxes
  • Train: confidence = $Pr(object) \cdot IoU^{pred}_{truth}$
  • Test: individual box confidence predicton:
         confidence = $Pr(cls_{i}obj)Pr(obj) \cdot IoU^{pred}_{truth}$

Classification loss

Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.

Don't forget to do NMS after generating bboxes.


The YOLOv1 Pros & Cons

Pros:

  • one stage, really fast

Cons:

  • Bad for crowed objects[1 cell 1 obj]
  • Bad for small objects
  • Bad for objects with new width-height ratio
  • No BN