[Paper Code] YOLO v1 (CVPR 2016) PyTorch Implementation (third-party GitHub)

0. Notes before the code walkthrough

For a summary of the paper itself, please see the link below.

https://aigong.tistory.com/437

[Paper Summary] YOLO v1 (CVPR 2016) "You Only Look Once: Unified, Real-Time Object Detection"


The official code is available at the link below.

https://pjreddie.com/darknet/yolo/

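Code walkthrough

YOLO v1 model

According to the paper, the model takes a 448x448 image as input and produces a final output of size 7x7x30. The output is 7x7 because the image is divided into an S*S grid of cells with S=7, and the depth is 30 because each cell predicts B=2 bounding boxes with 5 values each (x, y, w, h, confidence) plus 20 class probabilities: 2*5 + 20 = 30.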
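The output size follows directly from S, B, and C. A minimal sketch of that arithmetic (the helper name `yolo_output_shape` is only for illustration and does not come from the repository):

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Return (grid shape, flattened length) of the YOLO v1 prediction tensor."""
    per_cell = B * 5 + C  # each box has x, y, w, h, confidence; plus C class scores
    return (S, S, per_cell), S * S * per_cell


shape, flat = yolo_output_shape()
print(shape, flat)  # (7, 7, 30) 1470
```

The flattened length 1470 is the output size of the last linear layer in the model below.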

Section 2.2 (Training) of the paper states that the final layer uses a linear activation function and all other layers use leaky ReLU with slope 0.1. The end of Section 2.2 also states that dropout with rate 0.5 is applied after the first connected layer.
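For reference, leaky ReLU with slope 0.1 can be sketched in plain Python as below. This scalar helper is only an illustration; the model code uses `nn.LeakyReLU(0.1)`.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive inputs, a small slope for negative ones."""
    return x if x > 0 else negative_slope * x


print(leaky_relu(2.0))   # positive inputs pass through unchanged
print(leaky_relu(-2.0))  # negative inputs are scaled by 0.1
```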

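However, the network in the paper does not use BatchNorm, so this implementation adds BatchNorm after each convolution to gain the benefits of normalization.

A direct implementation looks as follows.

The full model has about 271.717 M parameters, which is very large: even with 6 GB of GPU RAM it could not run a batch size of 1. For that reason the output size of the first fully connected layer is reduced to 496, which made it possible to run the code with a batch size of 12 on 6 GB of GPU RAM. (This change is optional.)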
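To see why shrinking the hidden layer helps so much, the parameter count of the two linear layers can be worked out by hand. The sketch below assumes the head is `Linear(1024*S*S, hidden)` followed by `Linear(hidden, S*S*(C+B*5))`, as in the code that follows; the helper name `fc_head_params` is hypothetical.

```python
def fc_head_params(hidden, S=7, B=2, C=20, in_channels=1024):
    """Weights + biases of Linear(in_channels*S*S, hidden) -> Linear(hidden, S*S*(C+B*5))."""
    in_features = in_channels * S * S
    out_features = S * S * (C + B * 5)
    first = in_features * hidden + hidden          # weight matrix + bias
    second = hidden * out_features + out_features  # weight matrix + bias
    return first + second


paper_head = fc_head_params(4096)  # hidden size from the paper
small_head = fc_head_params(496)   # hidden size used in this implementation
print(f"{paper_head / 1e6:.1f} M vs {small_head / 1e6:.1f} M")
```

Under these assumptions the head alone accounts for roughly 211.5 M parameters with a hidden size of 4096 and roughly 25.6 M with 496, so most of the model's size sits in the first linear layer.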

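import torch
import torch.nn as nn

# Each tuple is (kernel_size, out_channels, stride, padding).
# "M" is a 2x2 max-pool with stride 2.
# A list is [conv_a, conv_b, number_of_repeats].
architecture_config = [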
    (7, 64, 2, 3),
    "M",
    (3, 192, 1, 1),
    "M",
    (1, 128, 1, 0),
    (3, 256, 1, 1),
    (1, 256, 1, 0),
    (3, 512, 1, 1),
    "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0),
    (3, 1024, 1, 1),
    "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1),
    (3, 1024, 2, 1),
    (3, 1024, 1, 1),
    (3, 1024, 1, 1),
]


class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, **kwargs):
        super(CNNBlock, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.batchnorm = nn.BatchNorm2d(out_channels)
        self.leakyrelu = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.leakyrelu(self.batchnorm(self.conv(x)))


class Yolov1(nn.Module):
    def __init__(self, in_channels=3, **kwargs):
        super(Yolov1, self).__init__()
        self.architecture = architecture_config
        self.in_channels = in_channels
        self.darknet = self._create_conv_layers(self.architecture)
        self.fcs = self._create_fcs(**kwargs)

    def forward(self, x):
        x = self.darknet(x)
        return self.fcs(torch.flatten(x, start_dim=1))

    def _create_conv_layers(self, architecture):
        layers = []
        in_channels = self.in_channels

        for x in architecture:
            if isinstance(x, tuple):
                layers += [
                    CNNBlock(
                        in_channels, x[1], kernel_size=x[0], stride=x[2], padding=x[3],
                    )
                ]
                in_channels = x[1]

            elif isinstance(x, str):
                layers += [nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))]

            elif isinstance(x, list):
                conv1 = x[0]
                conv2 = x[1]
                num_repeats = x[2]

                for _ in range(num_repeats):
                    layers += [
                        CNNBlock(
                            in_channels,
                            conv1[1],
                            kernel_size=conv1[0],
                            stride=conv1[2],
                            padding=conv1[3],
                        )
                    ]
                    layers += [
                        CNNBlock(
                            conv1[1],
                            conv2[1],
                            kernel_size=conv2[0],
                            stride=conv2[2],
                            padding=conv2[3],
                        )
                    ]
                    in_channels = conv2[1]

        return nn.Sequential(*layers)

    def _create_fcs(self, split_size, num_boxes, num_classes):
        S, B, C = split_size, num_boxes, num_classes

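        # In the original paper the head is
        # nn.Linear(1024*S*S, 4096),
        # nn.LeakyReLU(0.1),
        # nn.Linear(4096, S*S*(B*5+C))
        # with dropout (p=0.5) after the first linear layer.
        # Here the hidden size is reduced to 496 to save GPU memory
        # and dropout is effectively disabled (p=0.0).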
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 496),
            nn.Dropout(0.0),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),
        )
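As a sanity check on the configuration, the spatial size can be traced through the layers without building the network. The sketch below re-declares the same configuration and applies the standard convolution output-size formula floor((n + 2p - k) / s) + 1; the helper names `conv_out` and `trace_shapes` are hypothetical.

```python
# Same (kernel_size, out_channels, stride, padding) layout as architecture_config.
config = [
    (7, 64, 2, 3), "M",
    (3, 192, 1, 1), "M",
    (1, 128, 1, 0), (3, 256, 1, 1), (1, 256, 1, 0), (3, 512, 1, 1), "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4],
    (1, 512, 1, 0), (3, 1024, 1, 1), "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2],
    (3, 1024, 1, 1), (3, 1024, 2, 1), (3, 1024, 1, 1), (3, 1024, 1, 1),
]


def conv_out(n, kernel, stride, padding):
    """Output size of a conv / pool layer along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_shapes(config, size=448, channels=3):
    """Return (final spatial size, final channels, number of conv layers)."""
    n_convs = 0
    for item in config:
        if isinstance(item, tuple):
            blocks = [item]
        elif isinstance(item, list):
            blocks = [item[0], item[1]] * item[2]  # repeated conv pair
        else:  # "M": 2x2 max-pool with stride 2
            size = conv_out(size, 2, 2, 0)
            continue
        for kernel, out_channels, stride, padding in blocks:
            size = conv_out(size, kernel, stride, padding)
            channels = out_channels
            n_convs += 1
    return size, channels, n_convs


print(trace_shapes(config))  # (7, 1024, 24)
```

A 448x448 input ends up as a 7x7x1024 feature map after 24 convolutional layers, which is why the first linear layer takes 1024 * S * S inputs.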

YOLO v1 dataset

YOLO v1 uses PASCAL VOC 2007 and 2012, depending on the experiment.

PASCAL VOC 2007 can be downloaded from the link below.

The download is about 4.5 GB and grows slightly once extracted.

https://www.kaggle.com/datasets/734b7bcb7ef13a045cbdd007a3c19874c2586ed0b02b4afc86126e89d00af8d2

Official: http://host.robots.ox.ac.uk/pascal/VOC/

Extracting the archive yields the images, the label txt files, and the csv files that pair them.

Opening the first file among the labels shows one object per line.

The fields appear in this order: class index, x center, y center, w, h.
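A single label line can be parsed as sketched below. The example line is illustrative (the class index 0 is an assumption, the four numbers are taken from the 32.txt example later in this post), and `parse_label_line` is a hypothetical helper rather than part of the repository.

```python
def parse_label_line(line):
    """Split 'class x_center y_center w h' into an int class id and four floats."""
    fields = line.split()
    class_id = int(float(fields[0]))  # also accepts a class written as "0.0"
    x, y, w, h = (float(v) for v in fields[1:5])
    return class_id, x, y, w, h


# Illustrative line; the class index 0 is an assumption.
print(parse_label_line("0 0.479 0.464413 0.542 0.373665\n"))
```

All four box values are fractions of the image width or height, so they lie between 0 and 1.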

The image / label txt pairs for training are listed in three csv files that differ in the number of examples: 8examples.csv, 100examples.csv, and train.csv. Pick one of them according to your own compute resources.

import torch
import os
import pandas as pd
from PIL import Image


class VOCDataset(torch.utils.data.Dataset):
    def __init__(
        self, csv_file, img_dir, label_dir, S=7, B=2, C=20, transform=None,
    ):
        self.annotations = pd.read_csv(csv_file)
        self.img_dir = img_dir
        self.label_dir = label_dir
        self.transform = transform
        self.S = S
        self.B = B
        self.C = C

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        label_path = os.path.join(self.label_dir, self.annotations.iloc[index, 1])
        boxes = []
        with open(label_path) as f:
            for label in f.readlines():
                # Parse "class x y w h"; int(float(v)) also handles a value written as "11.0".
                class_label, x, y, width, height = [
                    float(v) if float(v) != int(float(v)) else int(float(v))
                    for v in label.split()
                ]

                boxes.append([class_label, x, y, width, height])

        img_path = os.path.join(self.img_dir, self.annotations.iloc[index, 0])
        image = Image.open(img_path)
        boxes = torch.tensor(boxes)

        if self.transform:
            # image = self.transform(image)
            image, boxes = self.transform(image, boxes)

        # Convert To Cells
        label_matrix = torch.zeros((self.S, self.S, self.C + 5 * self.B))
        for box in boxes:
            class_label, x, y, width, height = box.tolist()
            class_label = int(class_label)

            # i,j represents the cell row and cell column
            i, j = int(self.S * y), int(self.S * x)
            x_cell, y_cell = self.S * x - j, self.S * y - i

            # Width and height of the box relative to the cell size,
            # with width as the example:
            #   width_pixels = width * image_width
            #   cell_pixels  = image_width / S
            # so width_pixels / cell_pixels = width * S,
            # which gives the formulas below.
            width_cell, height_cell = (
                width * self.S,
                height * self.S,
            )

            # If no object already found for specific cell i,j
            # Note: This means we restrict to ONE object
            # per cell!
            if label_matrix[i, j, 20] == 0:
                # Set that there exists an object
                label_matrix[i, j, 20] = 1

                # Box coordinates
                box_coordinates = torch.tensor(
                    [x_cell, y_cell, width_cell, height_cell]
                )

                label_matrix[i, j, 21:25] = box_coordinates

                # Set one hot encoding for class_label
                label_matrix[i, j, class_label] = 1

        return image, label_matrix

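As shown, the constructor takes the chosen train csv file together with the image and label directory paths and stores them along with the initial hyperparameters (S, B, C). For each index, the matching txt file is read and every [class_label, x, y, width, height] entry is appended to the boxes list.

Likewise, the image matching the index is loaded with the PIL package.

Presumably a transform that converts the PIL image to a tensor is applied next, with resize, rotate, or flip added as needed.

After that, each entry in the boxes list is read in turn, its coordinates are rescaled relative to its grid cell, and the result is stored at the matching position in label_matrix.

Here i and j are the integers that identify which of the S*S grid cells contains the box center. Because x and y in the txt file lie in the range 0 to 1, the row is obtained from S*y and the column from S*x, each truncated to an integer.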
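The cell assignment can be reproduced outside the dataset class. The sketch below mirrors the formulas in `__getitem__`; `box_to_cell` is a hypothetical helper, and the input values are one aeroplane box from 32.txt.

```python
def box_to_cell(x, y, w, h, S=7):
    """Map an image-relative box to (row, column) and cell-relative coordinates."""
    i, j = int(S * y), int(S * x)          # row from y, column from x
    x_cell, y_cell = S * x - j, S * y - i  # offset of the center inside the cell
    w_cell, h_cell = w * S, h * S          # size in units of one cell
    return i, j, x_cell, y_cell, w_cell, h_cell


i, j, x_cell, y_cell, w_cell, h_cell = box_to_cell(0.479, 0.464413, 0.542, 0.373665)
print(i, j)  # 3 3
print(round(x_cell, 3), round(y_cell, 3), round(w_cell, 3), round(h_cell, 3))
```

Note that the width and height can exceed 1, because a box may span several cells.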

The table below shows the contents of 32.txt together with the derived cell indices and cell-relative sizes.

class	x	y	w	h	i = int(7y) (row)	j = int(7x) (column)	w_bb = 7w	h_bb = 7h
aeroplane	0.479	0.464413	0.542	0.373665	3	3	3.794	2.615658
aeroplane	0.33	0.375445	0.128	0.124555	2	2	0.896	0.871886
people	0.408	0.727758	0.036	0.174377	5	2	0.252	1.220641
people	0.07	0.759786	0.036	0.174377	5	0	0.252	1.220641

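Based on this, for each cell:

Indices 0 to 19 are the class probabilities, so the entry for the object's class is set to 1.

If an object exists, index 20 (the confidence score) is set to 1.

Indices 21 to 24 (the slice 21:25) hold the bounding box coordinates (x, y, w, h).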
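Put together, the target vector of one occupied cell can be sketched as a plain list. `encode_cell` is a hypothetical helper, and the class index 0 in the usage line is an assumption.

```python
def encode_cell(class_id, x_cell, y_cell, w_cell, h_cell, B=2, C=20):
    """Build the (C + 5*B)-long target vector for a cell that contains one object."""
    vec = [0.0] * (C + 5 * B)
    vec[class_id] = 1.0                                   # one-hot class, indices 0..C-1
    vec[C] = 1.0                                          # objectness / confidence, index 20
    vec[C + 1:C + 5] = [x_cell, y_cell, w_cell, h_cell]   # box, indices 21..24
    return vec


vec = encode_cell(0, 0.353, 0.251, 3.794, 2.616)
print(len(vec), vec[20], vec[21:25])
```

In the target, the slots of the second box (indices 25 to 29) stay at zero, since the dataset code stores only one ground-truth box per cell.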

YOLO v1 Loss

The loss compares the model output with the ground-truth bounding boxes in three respects: the coordinate error, the class error, and the confidence (probability) error.

import torch
import torch.nn as nn
from utils import intersection_over_union


class YoloLoss(nn.Module):
    """
    Calculate the loss for yolo (v1) model
    """

    def __init__(self, S=7, B=2, C=20):
        super(YoloLoss, self).__init__()
        self.mse = nn.MSELoss(reduction="sum")

        """
        S is split size of image (in paper 7),
        B is number of boxes (in paper 2),
        C is number of classes (in paper and VOC dataset is 20),
        """
        self.S = S
        self.B = B
        self.C = C

        # These are from Yolo paper, signifying how much we should
        # pay loss for no object (noobj) and the box coordinates (coord)
        self.lambda_noobj = 0.5
        self.lambda_coord = 5

    def forward(self, predictions, target):
        # predictions are shaped (BATCH_SIZE, S*S*(C+B*5)) when inputted
        predictions = predictions.reshape(-1, self.S, self.S, self.C + self.B * 5)

        # Calculate IoU for the two predicted bounding boxes with target bbox
        iou_b1 = intersection_over_union(predictions[..., 21:25], target[..., 21:25])
        iou_b2 = intersection_over_union(predictions[..., 26:30], target[..., 21:25])
        ious = torch.cat([iou_b1.unsqueeze(0), iou_b2.unsqueeze(0)], dim=0)

        # Take the box with highest IoU out of the two prediction
        # Note that bestbox will be indices of 0, 1 for which bbox was best
        iou_maxes, bestbox = torch.max(ious, dim=0)
        exists_box = target[..., 20].unsqueeze(3)  # in paper this is Iobj_i

        # ======================== #
        #   FOR BOX COORDINATES    #
        # ======================== #

        # Set boxes with no object in them to 0. We only take out one of the two 
        # predictions, which is the one with highest Iou calculated previously.
        box_predictions = exists_box * (
            (
                bestbox * predictions[..., 26:30]
                + (1 - bestbox) * predictions[..., 21:25]
            )
        )

        box_targets = exists_box * target[..., 21:25]

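        # Take sqrt of width, height of boxes so that the same absolute error
        # counts more for small boxes than for large ones. sign/abs keeps the
        # sqrt defined when the network predicts a negative width or height.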
        box_predictions[..., 2:4] = torch.sign(box_predictions[..., 2:4]) * torch.sqrt(
            torch.abs(box_predictions[..., 2:4] + 1e-6)
        )
        box_targets[..., 2:4] = torch.sqrt(box_targets[..., 2:4])

        box_loss = self.mse(
            torch.flatten(box_predictions, end_dim=-2),
            torch.flatten(box_targets, end_dim=-2),
        )

        # ==================== #
        #   FOR OBJECT LOSS    #
        # ==================== #

        # pred_box is the confidence score for the bbox with highest IoU
        pred_box = (
            bestbox * predictions[..., 25:26] + (1 - bestbox) * predictions[..., 20:21]
        )

        object_loss = self.mse(
            torch.flatten(exists_box * pred_box),
            torch.flatten(exists_box * target[..., 20:21]),
        )

        # ======================= #
        #   FOR NO OBJECT LOSS    #
        # ======================= #

        #max_no_obj = torch.max(predictions[..., 20:21], predictions[..., 25:26])
        #no_object_loss = self.mse(
        #    torch.flatten((1 - exists_box) * max_no_obj, start_dim=1),
        #    torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
        #)

        no_object_loss = self.mse(
            torch.flatten((1 - exists_box) * predictions[..., 20:21], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1),
        )

        no_object_loss += self.mse(
            torch.flatten((1 - exists_box) * predictions[..., 25:26], start_dim=1),
            torch.flatten((1 - exists_box) * target[..., 20:21], start_dim=1)
        )

        # ================== #
        #   FOR CLASS LOSS   #
        # ================== #

        class_loss = self.mse(
            torch.flatten(exists_box * predictions[..., :20], end_dim=-2,),
            torch.flatten(exists_box * target[..., :20], end_dim=-2,),
        )

        loss = (
            self.lambda_coord * box_loss  # first two rows in paper
            + object_loss  # third row in paper
            + self.lambda_noobj * no_object_loss  # fourth row
            + class_loss  # fifth row
        )

        return loss

The paper states that it uses sum-squared error, so MSELoss is created with reduction="sum".

The weights are set to $\lambda_{coord}=5$ and $\lambda_{noobj}=0.5$.
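For reference, the loss from the paper, which the "rows" in the code comments refer to, can be written as follows. This is a transcription in the paper's notation, where $\mathbb{1}_{ij}^{obj}$ marks the box predictor j in cell i that is responsible for the object and $\mathbb{1}_{i}^{obj}$ marks a cell that contains an object:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

In the code, rows 1 and 2 correspond to box_loss, row 3 to object_loss, row 4 to no_object_loss, and row 5 to class_loss.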

Because B is set to 2, each cell predicts two bounding boxes, and the one closest to the ground truth has to be selected first. To do this, the IoU with the ground truth is computed for box b1 and for box b2, and the box with the higher IoU is chosen.
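The selection step can be sketched for a single cell in plain Python. The repository's `intersection_over_union` in utils is not shown in this post, so `iou_midpoint` below is a stand-in written under the assumption that boxes are given as (x_center, y_center, width, height) with positive width and height; `best_box` is likewise hypothetical.

```python
def iou_midpoint(box_a, box_b, eps=1e-6):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax1, ax2 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay1, ay2 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx1, bx2 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by1, by2 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2

    # Overlap along each axis; clamped at 0 when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / (union + eps)  # eps avoids division by zero


def best_box(pred_1, pred_2, target):
    """Return (index of the predicted box with the higher IoU, that IoU)."""
    ious = [iou_midpoint(pred_1, target), iou_midpoint(pred_2, target)]
    best = max(range(2), key=lambda k: ious[k])
    return best, ious[best]


target = (0.5, 0.5, 0.4, 0.4)
print(best_box((0.5, 0.5, 0.2, 0.2), (0.55, 0.5, 0.4, 0.4), target))
```

Here the second prediction wins: it overlaps most of the target, while the first one covers only a quarter of it.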

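Next, exists_box indicates whether an object exists in the i-th cell. It is read from index 20 of the target, that is, the confidence score.

In the loss code, iou_maxes holds the maximum IoU for each cell and bestbox holds the index of the box that achieved it, which is either 0 or 1.

The loss proposed in the paper is then computed on this basis.

Frankly, the start_dim and end_dim arguments do not seem necessary. With reduction="sum" the result is a single-value tensor either way.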
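The point about flattening can be checked with a small plain-Python sketch: a sum-squared error gives the same total whether the values are summed row by row or flattened first. The nested lists stand in for tensors, and both helper names are hypothetical.

```python
def flatten(nested):
    """Flatten an arbitrarily nested list of numbers into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


def sum_squared_error(pred, target):
    """Sum-squared error over two flat lists of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


pred = [[0.5, 1.0], [0.25, 0.0]]    # stand-in for a (2, 2) tensor
target = [[1.0, 1.0], [0.0, 0.5]]

# Sum each row separately and add the rows together ...
row_wise = sum(sum_squared_error(p, t) for p, t in zip(pred, target))
# ... or flatten everything first: the total is the same.
flat = sum_squared_error(flatten(pred), flatten(target))
print(row_wise, flat)  # 0.5625 0.5625
```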

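YOLO v1 additional utility functions

Functions for NMS, IoU, and mAP are needed, along with a function that selects the most valuable bounding boxes among those predicted for all cells.

For the last function in particular, each cell yields 2 bounding boxes, so NMS has to be run over a total of 98 (=7*7*2) bounding boxes. You also have to decide which IoU threshold to use (between predicted boxes for NMS, and against the ground-truth bounding boxes when computing mAP) and which confidence score threshold to use. These functions are implemented separately in utils.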
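The utils code is not shown in this post, so the following is only a minimal sketch of class-wise NMS in plain Python, not the repository's implementation. It assumes each box is a list [class_id, confidence, x_center, y_center, width, height], and the two threshold defaults are arbitrary illustrative values.

```python
def iou_midpoint(box_a, box_b, eps=1e-6):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    inter_w = max(0.0, min(box_a[0] + box_a[2] / 2, box_b[0] + box_b[2] / 2)
                  - max(box_a[0] - box_a[2] / 2, box_b[0] - box_b[2] / 2))
    inter_h = max(0.0, min(box_a[1] + box_a[3] / 2, box_b[1] + box_b[3] / 2)
                  - max(box_a[1] - box_a[3] / 2, box_b[1] - box_b[3] / 2))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / (union + eps)


def non_max_suppression(boxes, iou_threshold=0.5, conf_threshold=0.4):
    """Keep the most confident box of each overlapping same-class group."""
    # 1) Drop boxes whose confidence is too low, then sort by confidence.
    candidates = sorted(
        (b for b in boxes if b[1] > conf_threshold),
        key=lambda b: b[1],
        reverse=True,
    )
    kept = []
    while candidates:
        # 2) Take the most confident remaining box.
        chosen = candidates.pop(0)
        kept.append(chosen)
        # 3) Remove same-class boxes that overlap it too much.
        candidates = [
            b for b in candidates
            if b[0] != chosen[0]
            or iou_midpoint(chosen[2:], b[2:]) < iou_threshold
        ]
    return kept


boxes = [
    [0, 0.9, 0.50, 0.5, 0.4, 0.4],
    [0, 0.8, 0.52, 0.5, 0.4, 0.4],  # overlaps the first box heavily: suppressed
    [0, 0.7, 0.10, 0.1, 0.1, 0.1],  # same class, no overlap: kept
    [1, 0.6, 0.50, 0.5, 0.4, 0.4],  # different class: kept
    [0, 0.2, 0.30, 0.3, 0.2, 0.2],  # below the confidence threshold: dropped
]
for box in non_max_suppression(boxes):
    print(box)
```

Raising iou_threshold keeps more overlapping boxes, while raising conf_threshold discards more low-confidence predictions before the overlap check.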