Patrick Esser, Robin Rombach, Björn Ommer - Heidelberg Collaboratory for Image Processing, IWR, Heidelberg University, Germany
CVPR 2021 Open Access Repository
Taming Transformers for High-Resolution Image Synthesis. Patrick Esser, Robin Rombach, Björn Ommer; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12873-12883
openaccess.thecvf.com
Arxiv
https://arxiv.org/abs/2012.09841
Taming Transformers for High-Resolution Image Synthesis
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
arxiv.org
PR-315: Taming Transformers for High-Resolution Image Synthesis
Lately there have been many attempts to apply the Transformer architecture everywhere, in language and vision alike; this week's presentation covers its use for high-resolution image synthesis, from CVPR 2021
www.slideshare.net
Image synthesis with Transformers alone has clear limitations.
CNNs need relatively less training data thanks to their inductive bias toward local interactions.
Transformers lack this bias, so they require large amounts of data, and their attention cost grows with sequence length.
Patch -> Reshape -> vectorization
Vectorization via a CNN
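The first route (patch flattening, as used in ViT-style models) can be sketched with plain array operations; all sizes below are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative sizes: one 256x256 RGB image, 16x16 patches.
B, C, H, W, P = 1, 3, 256, 256, 16
img = np.random.randn(B, C, H, W)

# Patch -> Reshape -> vectorization:
# split H and W into (H/P, P) and (W/P, P), bring the patch grid
# dimensions together, then flatten each patch into a C*P*P vector.
x = img.reshape(B, C, H // P, P, W // P, P)
x = x.transpose(0, 2, 4, 1, 3, 5)                     # (B, H/P, W/P, C, P, P)
tokens = x.reshape(B, (H // P) * (W // P), C * P * P)
print(tokens.shape)  # (1, 256, 768)
```

Each of the 256 rows is now one patch vector, ready to be treated as a sequence element.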
In language, the input is a sequence of discrete tokens.
A lookup table is used for the embedding.
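A minimal sketch of the lookup-table embedding, using a random table in place of a learned one (sizes illustrative):

```python
import numpy as np

# The lookup table (embedding matrix) maps each discrete token id
# to a d_model-dimensional vector; in practice it is learned.
vocab_size, d_model = 1000, 64
table = np.random.randn(vocab_size, d_model)

token_ids = np.array([5, 42, 7])  # a discrete input sequence
embedded = table[token_ids]       # embedding == simple row lookup
print(embedded.shape)  # (3, 64)
```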
The image input is vectorized by passing it through the encoder.
This yields $\hat{z}$ with height and width 16× smaller than the input image.
The codebook $\mathcal{Z}$, acting like a lookup table, holds N code vectors.
Quantization replaces each position of $\hat{z}$ with the codebook entry closest to it under the L2 distance, yielding $z_q$.
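The nearest-code selection can be sketched as follows; `M`, `N`, and `d` are illustrative, and the squared distance is expanded as $\|z\|^2 - 2\,z \cdot e + \|e\|^2$ to avoid materializing an (M, N, d) tensor:

```python
import numpy as np

# Illustrative sizes: M flattened encoder vectors, N codebook entries.
M, N, d = 16 * 16, 1024, 32
z_hat = np.random.randn(M, d)        # flattened encoder output
codebook = np.random.randn(N, d)     # the codebook Z

# Squared L2 distance between every z_hat row and every code vector.
d2 = ((z_hat ** 2).sum(1, keepdims=True)
      - 2.0 * z_hat @ codebook.T
      + (codebook ** 2).sum(1))
indices = d2.argmin(axis=1)          # index of the nearest code per position
z_q = codebook[indices]              # quantized representation z_q
print(z_q.shape)  # (256, 32)
```

In training, gradients are passed through this non-differentiable step with the stop-gradient (straight-through) trick mentioned below.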
vector quantization
The quantized $z_q$ is passed to the decoder G to generate the image.
The reconstruction loss is an L2 loss.
sg : stop-gradient operation
$\lambda$ : an adaptive weight, computed as the ratio of the gradients of the reconstruction loss and the GAN loss with respect to the decoder's last layer L.
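In the paper's notation, this adaptive weight is

$$\lambda = \frac{\nabla_{G_L}\!\left[\mathcal{L}_{\mathrm{rec}}\right]}{\nabla_{G_L}\!\left[\mathcal{L}_{\mathrm{GAN}}\right] + \delta}$$

where $\nabla_{G_L}[\cdot]$ denotes the gradient with respect to the last layer $L$ of the decoder, and $\delta$ is a small constant for numerical stability.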
Unconditional Generation vs Conditional Generation
Maximizing the log-likelihood is equivalent to minimizing the softmax (cross-entropy) logit loss.
This is how the model predicts the next code.
Attention computed over neighboring patches (a sliding window) predicts the next value.
Image synthesis is possible without the computation blowing up.
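The equivalence of log-likelihood maximization and the softmax cross-entropy loss can be shown on one next-code prediction step, with made-up logits for a toy 5-entry codebook:

```python
import numpy as np

# Toy logits for a codebook of 5 entries at one prediction step.
logits = np.array([2.0, 0.5, -1.0, 0.1, 0.3])
target = 0  # ground-truth next code index

# log-softmax: maximizing log p(target) == minimizing cross-entropy.
log_probs = logits - np.log(np.exp(logits).sum())
nll = -log_probs[target]
print(round(float(nll), 4))  # 0.4732
```

Summing this negative log-likelihood over all positions gives the standard autoregressive training objective.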
Conditions: class label, segmentation map, edge information, etc.
High-resolution image generation is also possible.
https://github.com/CompVis/taming-transformers
GitHub - CompVis/taming-transformers: Taming Transformers for High-Resolution Image Synthesis
github.com
https://compvis.github.io/taming-transformers/
Taming Transformers for High-Resolution Image Synthesis
compvis.github.io
State-of-the-Art Image Generative Models
I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. The overall development is summarized, and the future trends are spe…
arankomatsuzaki.wordpress.com