AI 공부 도전기 (AI Study Journal)

PR-281 "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" Review (2021 ICLR)


2022. 3. 8. 22:41

1. Reading the Citations & Abstract

Citations (as of 2022.02):

Authors

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby - Google Research, Brain Team

Abstract

2. Presentation Summary

https://youtu.be/D72_Cn-XV1g

Official paper link

https://openreview.net/forum?id=YicbFdNTTy


arXiv

https://arxiv.org/abs/2010.11929


Presentation Slide

https://drive.google.com/file/d/1YMV35XBQwDVEpv31hFmTR8gEFHL3GIUr/view


Contents

Transformers & Vision

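The Transformer originated in NLP tasks.

Various attempts have been made to apply the Transformer to computer vision.

(Explanations of the individual sections are omitted.)

The Transformer encoder is used in the same form as in NLP.

NLP takes a 1D sequence of tokens as input, whereas vision works with 2D images.

To handle this, each image is split into patches of size 16x16, so that the patches can be fed to the model as a sequence.

e.g., a 256x256 image -> 256 patches of size 16x16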
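The patch arithmetic above can be checked with a short sketch. This is only an illustration under stated assumptions, not the code from the paper or the official repository: it uses NumPy, the function name `image_to_patches` is hypothetical, and the 3 color channels are an assumption added for the example.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W, C) -> (rows, P, cols, P, C) -> (rows, cols, P, P, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# A 256x256 image with 3 channels (channel count is an assumption)
image = np.arange(256 * 256 * 3, dtype=np.float32).reshape(256, 256, 3)
patches = image_to_patches(image)
print(patches.shape)  # (256, 768): 256 patches, each 16 * 16 * 3 values
```

Each row of `patches` plays the role that a token plays in NLP: the 2D image becomes a 1D sequence of 256 flattened patches.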

ViT Experiment

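Datasets used for the ViT experiments

The most basic one is the ImageNet image dataset.

Several model variants are configured.

The proposed model achieves the best results on most of the datasets.

It outperforms BiT, the previous SOTA.

As the dataset grows, ViT's performance improves markedly.

Visualization of ViT's embedding filters and position embeddings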
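One common way to visualize position embeddings is to plot, for each patch position, its cosine similarity to every other position. The sketch below shows that computation only. It is not the authors' plotting code: the function name `position_similarity` is hypothetical, the 16x16 grid and 768-dimensional embeddings are assumptions, the class token is ignored, and random values stand in for learned embeddings, so the output shows no meaningful spatial structure.

```python
import numpy as np


def position_similarity(pos_embed: np.ndarray, grid: int) -> np.ndarray:
    """Cosine similarity between every pair of patch position embeddings.

    pos_embed has shape (grid * grid, D). The result has shape
    (grid, grid, grid, grid); result[i, j] is the similarity map of the
    patch at row i, column j against all grid positions.
    """
    unit = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
    sim = unit @ unit.T  # (grid*grid, grid*grid) cosine similarities
    return sim.reshape(grid, grid, grid, grid)


rng = np.random.default_rng(0)
grid, dim = 16, 768  # assumed sizes, for illustration only
pos_embed = rng.normal(size=(grid * grid, dim))  # stand-in for learned values
sim = position_similarity(pos_embed, grid)
print(sim.shape)
```

With embeddings taken from a trained model, each `sim[i, j]` map could be drawn as one small heatmap tile in a `grid` x `grid` figure.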

Further Study

Self-supervision & Detection & Segmentation

References

GitHub

https://github.com/google-research/vision_transformer


License: Attribution-NonCommercial-ShareAlike
