Patrick Esser, Robin Rombach, Björn Ommer - Heidelberg Collaboratory for Image Processing, IWR, Heidelberg University, Germany
CVPR 2021 Open Access Repository
Taming Transformers for High-Resolution Image Synthesis. Patrick Esser, Robin Rombach, Björn Ommer; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12873-12883
openaccess.thecvf.com
Arxiv
https://arxiv.org/abs/2012.09841
Taming Transformers for High-Resolution Image Synthesis
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images.
arxiv.org
PR-315: Taming Transformers for High-Resolution Image Synthesis
Lately there have been many attempts to apply the Transformer architecture everywhere, in language and vision alike; this week's presentation covers its use for high-resolution image synthesis, from CVPR 2021
www.slideshare.net
Image synthesis with Transformers alone has clear limitations.
CNNs need relatively less training data thanks to their inductive bias toward local interactions.
Transformers lack this bias, so they require large amounts of data, and their attention cost grows with sequence length.
Patch -> Reshape -> vectorization
Vectorization via a CNN
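The first route (patch flattening, as used in ViT-style models) can be sketched with plain array operations; all sizes below are illustrative, not taken from the paper:

```python
import numpy as np

# Illustrative sizes: one 256x256 RGB image, 16x16 patches.
B, C, H, W, P = 1, 3, 256, 256, 16
img = np.random.randn(B, C, H, W)

# Patch -> Reshape -> vectorization:
# split H and W into (H/P, P) and (W/P, P), bring the patch grid
# dimensions together, then flatten each patch into a C*P*P vector.
x = img.reshape(B, C, H // P, P, W // P, P)
x = x.transpose(0, 2, 4, 1, 3, 5)                     # (B, H/P, W/P, C, P, P)
tokens = x.reshape(B, (H // P) * (W // P), C * P * P)
print(tokens.shape)  # (1, 256, 768)
```

Each of the 256 rows is now one patch vector, ready to be treated as a sequence element.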
In language, the input is a sequence of discrete tokens.
A lookup table is used for the embedding.
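A minimal sketch of the lookup-table embedding, using a random table in place of a learned one (sizes illustrative):

```python
import numpy as np

# The lookup table (embedding matrix) maps each discrete token id
# to a d_model-dimensional vector; in practice it is learned.
vocab_size, d_model = 1000, 64
table = np.random.randn(vocab_size, d_model)

token_ids = np.array([5, 42, 7])  # a discrete input sequence
embedded = table[token_ids]       # embedding == simple row lookup
print(embedded.shape)  # (3, 64)
```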
The image input is vectorized by passing it through the encoder.
This yields $\hat{z}$ with height and width 16× smaller than the input image.
The codebook $\mathcal{Z}$, acting like a lookup table, holds N code vectors.
Quantization replaces each position of $\hat{z}$ with the codebook entry closest to it under the L2 distance, yielding $z_q$.
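The nearest-code selection can be sketched as follows; `M`, `N`, and `d` are illustrative, and the squared distance is expanded as $\|z\|^2 - 2\,z \cdot e + \|e\|^2$ to avoid materializing an (M, N, d) tensor:

```python
import numpy as np

# Illustrative sizes: M flattened encoder vectors, N codebook entries.
M, N, d = 16 * 16, 1024, 32
z_hat = np.random.randn(M, d)        # flattened encoder output
codebook = np.random.randn(N, d)     # the codebook Z

# Squared L2 distance between every z_hat row and every code vector.
d2 = ((z_hat ** 2).sum(1, keepdims=True)
      - 2.0 * z_hat @ codebook.T
      + (codebook ** 2).sum(1))
indices = d2.argmin(axis=1)          # index of the nearest code per position
z_q = codebook[indices]              # quantized representation z_q
print(z_q.shape)  # (256, 32)
```

In training, gradients are passed through this non-differentiable step with the stop-gradient (straight-through) trick mentioned below.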
vector quantization
The quantized $z_q$ is passed to the decoder G to generate the image.
The reconstruction loss is an L2 loss.
sg : stop-gradient operation
$\lambda$ : an adaptive weight, computed as the ratio of the gradients of the reconstruction loss and the GAN loss with respect to the decoder's last layer L.
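In the paper's notation, this adaptive weight is

$$\lambda = \frac{\nabla_{G_L}\!\left[\mathcal{L}_{\mathrm{rec}}\right]}{\nabla_{G_L}\!\left[\mathcal{L}_{\mathrm{GAN}}\right] + \delta}$$

where $\nabla_{G_L}[\cdot]$ denotes the gradient with respect to the last layer $L$ of the decoder, and $\delta$ is a small constant for numerical stability.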
Unconditional Generation vs Conditional Generation
Maximizing the log-likelihood is equivalent to minimizing the softmax (cross-entropy) logit loss.
This is how the model predicts the next code.
Attention computed over neighboring patches (a sliding window) predicts the next value.
Image synthesis is possible without the computation blowing up.
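The equivalence of log-likelihood maximization and the softmax cross-entropy loss can be shown on one next-code prediction step, with made-up logits for a toy 5-entry codebook:

```python
import numpy as np

# Toy logits for a codebook of 5 entries at one prediction step.
logits = np.array([2.0, 0.5, -1.0, 0.1, 0.3])
target = 0  # ground-truth next code index

# log-softmax: maximizing log p(target) == minimizing cross-entropy.
log_probs = logits - np.log(np.exp(logits).sum())
nll = -log_probs[target]
print(round(float(nll), 4))  # 0.4732
```

Summing this negative log-likelihood over all positions gives the standard autoregressive training objective.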
Conditions: class label, segmentation map, edge information, etc.
High-resolution image generation is also possible.
https://github.com/CompVis/taming-transformers
GitHub - CompVis/taming-transformers: Taming Transformers for High-Resolution Image Synthesis
github.com
https://compvis.github.io/taming-transformers/
Taming Transformers for High-Resolution Image Synthesis
compvis.github.io
State-of-the-Art Image Generative Models
I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. The overall development is summarized, and the future trends are spe…
arankomatsuzaki.wordpress.com