[Paper Summary] Deformable DETR (ICLR 2021 Oral) "Deformable DETR: Deformable Transformers for End-to-End Object Detection"

Paper information

Citations : 1,369 as of 2022.12.15 (Saturday)

Authors

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai

- SenseTime Research, University of Science and Technology of China, Chinese University of Hong Kong

Paper links

Official

https://openreview.net/forum?id=gZ9hCDWe6ke

Arxiv

https://arxiv.org/abs/2010.04159

Paper Summary

Abstract

0. Overview before the details

DETR has two problems.

1. It requires a long training schedule to converge.

2. It struggles to detect small objects.

Deformable DETR addresses both by using multi-scale feature maps and a deformable attention module.

See below for details.

1. Introduction

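Modern object detectors rely on many hand-crafted components such as anchor generation, rule-based training target assignment, and NMS post-processing, so they are not truly end-to-end. DETR removes these hand-crafted components and proposes a simple model that is fully end-to-end.

DETR combines a CNN with an encoder-decoder Transformer, and this versatile architecture achieves strong results.

However, DETR has two problems.

(1) It needs many more training epochs to converge.

It needs 500 epochs, which is about 10 to 20 times slower than Faster R-CNN.

(2) It performs poorly on small objects.

This is because DETR's Transformer does not use multi-scale features as conventional object detectors do. Attending over high-resolution feature maps would be too costly, since self-attention in the encoder scales quadratically with the number of pixels.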

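Deformable convolution is a powerful and efficient mechanism for attending to sparse spatial locations, so it sidesteps the problem that attention weights need a long training schedule before they learn to focus on meaningful locations.

This paper proposes Deformable DETR, which mitigates the slow convergence and high complexity of DETR.

It introduces a deformable attention module that attends to a small set of sampling locations, which acts as a pre-filter selecting the prominent key elements out of all feature map pixels. The module extends naturally to aggregating multi-scale features.

As shown in Figure 1, the attention modules that process feature maps in DETR's Transformer are replaced with deformable attention modules.

In addition, a simple and effective iterative bounding box refinement mechanism improves detection performance.

A two-stage Deformable DETR variant is also explored.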

2. Related Work (abridged)

Efficient Attention Mechanism

The Transformer involves both self-attention and cross-attention, and its high time and memory complexity is the main concern. Efforts to address this fall into three categories.

1) pre-defined sparse attention patterns on keys

2) learn data-dependent sparse attention

3) explore the low-rank property in self-attention

Multi-scale Feature Representation for Object Detection

One of the main difficulties in object detection is effectively representing objects of vastly different sizes.

Examples of this

3. Revisiting Transformers and DETR

Multi-Head Attention in Transformers
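
For reference, the multi-head attention that this section revisits can be written as below. This is a sketch following the formulation in the Deformable DETR paper: $z_q$ is the feature of query element $q$, $x_k$ is the feature of key element $k$ in the key set $\Omega_k$, $M$ is the number of attention heads, $W_m$ and $W'_m$ are learnable projection weights of the m-th head, and the attention weights $A_{mqk}$ are normalized so that $\sum_{k \in \Omega_k} A_{mqk} = 1$.

$$\text{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big]$$

Because the inner sum runs over every key in $\Omega_k$, the cost per query grows with the number of feature map pixels when the keys are image features.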

4. Method

4.1 Deformable Transformers for End-to-End Object Detection

Deformable Attention Module

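The core issue with applying Transformer attention to image feature maps is that it looks over all possible spatial locations. The deformable attention module is proposed to address this. As Figure 2 shows, it attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature map. Assigning only a small, fixed number of keys to each query mitigates DETR's convergence and feature spatial resolution problems.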

$x \in \mathbb{R}^{C \times H \times W}$ : input feature map

$q$ : index of a query element

$z_q$ : content feature of query $q$

$p_q$ : 2D reference point of query $q$

$m$ : index of the attention head

$k$ : index of the sampled key

$K$ : total number of sampled keys ($K \ll HW$)

$\Delta p_{mqk}$ : sampling offset of the k-th sampling point in the m-th attention head

$A_{mqk}$ : attention weight of the k-th sampling point in the m-th attention head
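
With the notation above, the deformable attention feature for one query can be written as follows. This follows the paper's formulation; $W_m$ and $W'_m$ are learnable projection weights of the m-th head (they are not in the list above), $M$ is the number of heads, and the attention weights satisfy $\sum_{k=1}^{K} A_{mqk} = 1$. Since $p_q + \Delta p_{mqk}$ is generally fractional, $x(\cdot)$ is evaluated by bilinear interpolation.

$$\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big]$$

Compared with ordinary multi-head attention, the inner sum runs over only $K$ sampled points instead of all $HW$ pixels.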

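As shown in the upper part of Figure 2, the query feature $z_q$ is fed to a linear projection that produces $3MK$ channels. The first $2MK$ channels encode the sampling offsets $\Delta p_{mqk}$, and the remaining $MK$ channels are passed through a softmax to obtain the attention weights $A_{mqk}$.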
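
The channel split and the sampling step described above can be sketched in plain Python. This is a minimal illustration and not the official implementation. It assumes a single-channel feature map stored as a list of rows, it omits the learnable projections, and the exact ordering of offsets inside the $2MK$ block is an assumption made only for this example. The helper names (`split_projection`, `bilinear_sample`, `deform_attn_head`) are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_projection(proj, num_heads, num_points):
    """Split a 3*M*K vector into per-head offsets and attention weights.

    Assumed layout (illustrative only): the first 2*M*K values are
    (dx, dy) pairs grouped by head, the last M*K values are logits
    grouped by head. Softmax is applied over the K points of each head.
    """
    m, k = num_heads, num_points
    assert len(proj) == 3 * m * k
    offsets, weights = [], []
    for h in range(m):
        base = 2 * k * h
        offsets.append([(proj[base + 2 * i], proj[base + 2 * i + 1])
                        for i in range(k)])
        start = 2 * m * k + k * h
        weights.append(softmax(proj[start:start + k]))
    return offsets, weights


def bilinear_sample(fmap, x, y):
    """Bilinearly sample a 2D map at fractional (x, y); zero outside the map."""
    height, width = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < width and 0 <= yi < height:
                w = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += w * fmap[yi][xi]
    return value


def deform_attn_head(fmap, ref_point, offsets, weights):
    """Weighted sum of the K values sampled around ref_point for one head."""
    px, py = ref_point
    return sum(a * bilinear_sample(fmap, px + dx, py + dy)
               for (dx, dy), a in zip(offsets, weights))
```

Each head touches only $K$ locations per query, so the work per query does not depend on the feature map size.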

Multi-scale Deformable Attention Module

Most modern object detection frameworks benefit from multi-scale feature maps, so the deformable attention module is extended to a multi-scale deformable attention module.

$\{ x^l \}^L_{l=1}$ : input multi-scale feature map

$\hat{p}_q \in [0,1]^2$ : normalized coordinates of the reference point for each query element

$m$ : attention head

$l$ : input feature level

$k$ : sampled keys

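$K$ : total number of sampled keys per feature level ($K \ll HW$)

$\Delta p_{mlqk}$ : sampling offset of the k-th sampling point in the l-th feature level and the m-th attention head

$A_{mlqk}$ : attention weight of the k-th sampling point in the l-th feature level and the m-th attention head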
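
With these symbols, the multi-scale module can be written as below, following the paper's formulation. $W_m$ and $W'_m$ are learnable projection weights of the m-th head, $M$ is the number of heads, and $\phi_l(\hat{p}_q)$ rescales the normalized coordinates $\hat{p}_q$ to the l-th level feature map (none of these are in the list above). The attention weights are normalized over all levels and points, $\sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} = 1$.

$$\text{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m \, x^l(\phi_l(\hat{p}_q) + \Delta p_{mlqk}) \Big]$$

Each head samples $LK$ points in total, $K$ from each of the $L$ feature levels. With $L = 1$ and $K = 1$ this reduces to sampling a single point per head.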

The proposed multi-scale deformable attention module can be regarded as an efficient variant of Transformer attention.

Deformable Transformer Encoder

The input and output of the encoder are multi-scale feature maps with the same resolutions. (Figure 4)

All multi-scale feature maps have 256 channels.

See Appendix A.2.

Deformable Transformer Decoder

The decoder has cross-attention and self-attention modules.

The query elements are the object queries.

In cross-attention, the object queries extract features from the feature maps, where the key elements are the output feature maps of the encoder. Each cross-attention module is replaced with a multi-scale deformable attention module, while the self-attention modules are left unchanged.

See Appendix A.3 for details.

Because the multi-scale deformable attention module extracts image features around the reference point, the detection head predicts the bounding box as relative offsets with respect to the reference point. This reduces the optimization difficulty and speeds up training convergence.
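
The relative-offset prediction can be sketched as follows. In the paper's formulation, the box center is obtained by adding the head output to the inverse sigmoid of the reference point and applying a sigmoid, while width and height are a plain sigmoid of the head output. The function names and the clamping constant `eps` are assumptions for this example, not taken from the official code.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 (eps is an assumed safeguard)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def predict_box(ref_point, head_output):
    """Turn raw head outputs into a normalized (cx, cy, w, h) box.

    ref_point: normalized reference point (x, y) in [0, 1].
    head_output: raw values (bx, by, bw, bh) from the detection head.
    The center is predicted as an offset from the reference point in
    logit space; width and height are predicted directly.
    """
    rx, ry = ref_point
    bx, by, bw, bh = head_output
    return (
        sigmoid(bx + inverse_sigmoid(rx)),
        sigmoid(by + inverse_sigmoid(ry)),
        sigmoid(bw),
        sigmoid(bh),
    )
```

When the head outputs zero for the center terms, the predicted center equals the reference point, so the head only has to learn a correction around it.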

Replacing the attention modules in DETR with deformable attention modules yields a detector that is more efficient and converges faster.

4.2 Additional improvements and variants for Deformable DETR

See Appendix A.4 for details.

Iterative Bounding Box Refinement

Inspired by iterative refinement in optical flow estimation, a simple and effective iterative bounding box refinement mechanism is introduced to improve detection performance. Each decoder layer refines the bounding boxes based on the predictions from the previous layer.
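
A minimal sketch of this refinement loop is shown below. It assumes that each decoder layer outputs a raw correction for (cx, cy, w, h) and that the correction is added to the previous box in logit space before a sigmoid, which is how the paper's appendix formulates it. The function names, the `eps` clamp, and the example values are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 (eps is an assumed safeguard)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine_box(prev_box, delta):
    """Apply one layer's raw correction to the previous normalized box."""
    return tuple(sigmoid(d + inverse_sigmoid(b))
                 for b, d in zip(prev_box, delta))


def iterative_refinement(initial_box, deltas_per_layer):
    """Refine a box layer by layer and return the box after each layer."""
    box = initial_box
    history = []
    for delta in deltas_per_layer:
        box = refine_box(box, delta)
        history.append(box)
    return history


# Example with made-up corrections from two hypothetical decoder layers.
boxes = iterative_refinement(
    (0.5, 0.5, 0.1, 0.1),
    [(0.2, -0.1, 0.3, 0.0), (0.05, 0.0, -0.1, 0.1)],
)
```

Each layer only has to predict a small correction to an already reasonable box instead of regressing the box from scratch.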

Two-Stage Deformable DETR

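Inspired by two-stage object detectors, this variant changes the object queries, which in the original DETR are unrelated to the current image. Region proposals are generated in a first stage and then fed into the decoder as object queries for further refinement.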

5. Experiments

Dataset : COCO 2017 dataset

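Implementation Details : a pre-trained ResNet-50 is the backbone

M=8 and K=4 for deformable attention

Other settings follow DETR, except that Focal Loss with a loss weight of 2 is used for bounding box classification

The number of object queries is increased from 100 to 300

Trained for 50 epochs

The learning rate is decayed by a factor of 0.1 at the 40th epoch

Adam optimizer

Base learning rate $2 \times 10^{-4}$

Evaluated on a single NVIDIA Tesla V100 GPU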
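
The learning rate schedule listed above (base rate $2 \times 10^{-4}$, multiplied by 0.1 at the 40th epoch, 50 epochs in total) can be sketched as a simple step function. The zero-based epoch indexing and the function name are assumptions for illustration. The source does not say whether the decay applies before or after the 40th epoch runs.

```python
def learning_rate(epoch, base_lr=2e-4, decay_epoch=40, gamma=0.1):
    """Step schedule: base_lr before decay_epoch, base_lr * gamma afterwards.

    epoch is assumed to be zero-based, so epochs 0..39 use the base rate
    and epochs 40..49 use the decayed rate.
    """
    return base_lr * gamma if epoch >= decay_epoch else base_lr


# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(e) for e in range(50)]
```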

5.1 Comparison with DETR

Deformable DETR achieves better AP than DETR with less training time and fewer epochs.

Faster convergence and higher AP.

5.2 Ablation study on Deformable Attention

Using multi-scale inputs is more accurate than using a single scale.

Increasing the number of sampling points yields higher AP.

5.3 Comparison with SOTA methods

Reference

Official GitHub

https://github.com/fundamentalvision/Deformable-DETR

Helpful YouTube video 1.

Blog 1

https://junha1125.github.io/blog/artificial-intelligence/2021-03-12-DeformableDETR/
