[Paper Summary] AnimateDiff (2023.07 arXiv) "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning"

Paper Information

Citations: 4 as of Saturday, 2023-10-28

Authors

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai

Shanghai AI Laboratory, The Chinese University of Hong Kong, Stanford University

Paper Links

arXiv

https://arxiv.org/abs/2307.04725

Project page

https://animatediff.github.io/

Official GitHub

https://github.com/guoyww/AnimateDiff

AnimateDiff for AUTOMATIC1111 Stable Diffusion WebUI

https://github.com/continue-revolution/sd-webui-animatediff

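Paper Summary

0. Overview

Proposes a new plug-in module (in the spirit of ControlNet) that adds video animation capability to an existing Stable Diffusion (SD) model.

Once the module has been trained on a general video dataset while attached to a base T2I model, it can be used for inference with many kinds of personalized models without further training.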

1. Introduction

The goal is to produce temporally consistent animation videos using personalized models such as Dreambooth and LoRA.

No additional model-specific training is required for each personalized model.

Experiments show that the generated results are temporally smooth.

2. Related works

Text-to-image diffusion models

GLIDE, DALLE-2, CLIP, Imagen, LDM, eDiff-I

Personalized text-to-image models

Textual inversion, Dreambooth, LoRA, Custom Diffusion

Personalized T2I animation

Tune-a-Video, Text2Video-Zero, Align-your-Latents

3. Methods

3.1 Preliminaries

General text-to-image generator

Discusses Stable Diffusion (SD).

A pretrained VQ-GAN or VQ-VAE is used as the autoencoder.

Covers diffusion as a Markov process.

The objective function is based on DDPM.

SD uses a U-Net with self-attention and cross-attention.

The text model is the CLIP ViT-L/14 text encoder.
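
The DDPM-based objective mentioned above is usually written in the following form for latent diffusion models. This is a sketch in standard notation rather than something copied from this summary: $\mathcal{E}$ is the image encoder with $z_0 = \mathcal{E}(x_0)$, $\tau_\theta$ is the text encoder, $y$ is the prompt, and $\bar{\alpha}_t$ is the cumulative noise schedule; all of these symbol names are assumptions.

$$
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L} = \mathbb{E}_{\mathcal{E}(x_0),\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t} \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \right]
$$

The network $\epsilon_\theta$ is trained to predict the noise that was added to the latent at step $t$, conditioned on the text embedding.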

Personalized image generation

Dreambooth, LoRA

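When the dataset is small, tuning the model without regularization tends to cause catastrophic forgetting or overfitting.

The paper explains the mechanism Dreambooth uses to prevent this.

Unlike Dreambooth, which has to train the model's full cross-attention weights, LoRA factors the weight update into two low-rank matrices; adjusting their rank makes parameter tuning lighter and more efficient.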
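
To make the efficiency argument concrete, here is a minimal sketch of the parameter counts involved. It assumes the usual LoRA formulation in which the update to a weight matrix is the product of two factors B and A; the layer size 768 x 768 and the function names are hypothetical and chosen only for illustration.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix W (d_out x d_in) is tuned."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain-Python matrix product, enough for a tiny illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical layer size, chosen only for illustration.
d_out, d_in = 768, 768
for rank in (4, 16, 64):
    ratio = lora_param_count(d_out, d_in, rank) / full_param_count(d_out, d_in)
    print(f"rank={rank:3d}: {ratio:.2%} of the full matrix")

# Tiny rank-1 update: delta_W = B A has the full 3 x 3 shape
# but is described by only 6 numbers.
B = [[1.0], [2.0], [3.0]]   # 3 x 1
A = [[0.5, 0.0, -1.0]]      # 1 x 3
delta_W = matmul(B, A)
```

Raising the rank gives the update more capacity at the cost of more trainable parameters, which is the trade-off the rank setting controls.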

3.2 Personalized Animation

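Animating a personalized image model would normally require tuning on a corresponding video collection.

The goal is to turn it into an animation generator at little or no training cost while preserving its domain knowledge and quality.

Approach 1 (naive): add temporal structure to the model and learn motion priors from a large-scale video dataset.

However, collecting enough personalized videos is costly, and training on limited data leads to knowledge loss.

Approach 2: train a generalizable motion modeling module once, then plug it into a personalized T2I model at inference time.

This removes the need to train for each personalized model (similar to ControlNet).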

3.3 Motion Modeling Module

Network Inflation

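Because SD operates on batches of images, the model has to be inflated to be compatible with the motion modeling module.

Input is a 5D video tensor: batch x channel x frames x height x width

This is adopted in a way similar to Video Diffusion Models.

Reshaping the frame axis into the batch axis turns the 2D convolution and attention layers into spatial-only pseudo-3D layers that process each frame independently.

The motion module, in contrast, operates across the frames within each batch to achieve motion smoothness and content consistency.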
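
The frame-to-batch reshape described above can be sketched with NumPy as follows. This is an illustrative sketch, not the official implementation: the function names are hypothetical, and `spatial_layer` is only a placeholder standing in for a pretrained 2D layer.

```python
import numpy as np


def frames_to_batch(video: np.ndarray) -> np.ndarray:
    """(b, c, f, h, w) -> (b*f, c, h, w): each frame becomes its own batch item."""
    b, c, f, h, w = video.shape
    return video.transpose(0, 2, 1, 3, 4).reshape(b * f, c, h, w)


def batch_to_frames(images: np.ndarray, num_frames: int) -> np.ndarray:
    """(b*f, c, h, w) -> (b, c, f, h, w): restore the frame axis."""
    bf, c, h, w = images.shape
    b = bf // num_frames
    return images.reshape(b, num_frames, c, h, w).transpose(0, 2, 1, 3, 4)


def spatial_layer(images: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a pretrained 2D layer applied per image."""
    return images * 2.0 + 1.0


rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 16, 8, 8))  # batch, channel, frames, height, width

flat = frames_to_batch(video)                # the image layers see 32 separate images
out = batch_to_frames(spatial_layer(flat), num_frames=16)
```

Because the image layers only ever see a 4D tensor, their pretrained weights can be reused unchanged; only the motion module needs to be aware of the frame axis.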

Module Design

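The goal is to design a motion modeling module that enables efficient information exchange across frames.

A vanilla temporal transformer is chosen. (Other designs were also tried, but this one proved adequate for modeling motion priors; the search for better motion modules is left for later.)

The temporal transformer consists of several self-attention blocks along the temporal axis.
The order of operations is as shown on the right of Figure 3 above.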

1) Reshape the batch x channel x frames x height x width feature map into
(batch x height x width) x frames x channel.
2) Project the feature map.
3) Pass it through several self-attention blocks.

This lets the module capture temporal dependencies between features at the same spatial location along the temporal axis.
To enlarge the receptive field, the module is inserted at every resolution level of the U-Net.

4) Sinusoidal position encoding is also added to the self-attention blocks, so the network knows the temporal position of the current frame.
5) The output projection layer of the temporal transformer is zero-initialized.
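
Steps 1) to 5) can be sketched as a single temporal self-attention layer in NumPy. This is a simplified illustration under several assumptions, not the official implementation: one attention head, one block, a residual connection around the module, and hypothetical weight names (`w_q`, `w_k`, `w_v`, `w_out`).

```python
import numpy as np


def sinusoidal_encoding(num_frames: int, dim: int) -> np.ndarray:
    """Sinusoidal position table of shape (num_frames, dim); dim must be even."""
    pos = np.arange(num_frames)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    table = np.zeros((num_frames, dim))
    table[:, 0::2] = np.sin(pos * freq)
    table[:, 1::2] = np.cos(pos * freq)
    return table


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(x, w_q, w_k, w_v, w_out):
    """x: (b, c, f, h, w). Attention is computed only along the frame axis."""
    b, c, f, h, w = x.shape
    # 1) (b, c, f, h, w) -> (b*h*w, f, c): each pixel location is a sequence of frames.
    tokens = x.transpose(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
    # 4) Add the temporal position of each frame.
    tokens = tokens + sinusoidal_encoding(f, c)
    # 2)-3) Project the features and attend across frames.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c))
    # 5) Output projection, then restore the original layout.
    out = (attn @ v) @ w_out
    out = out.reshape(b, h, w, f, c).transpose(0, 4, 3, 1, 2)
    return x + out  # residual connection (an assumption of this sketch)


rng = np.random.default_rng(0)
c = 8
x = rng.normal(size=(1, c, 16, 4, 4))
w_q, w_k, w_v = (rng.normal(size=(c, c)) for _ in range(3))
w_out = np.zeros((c, c))  # zero-initialized output projection
y = temporal_attention(x, w_q, w_k, w_v, w_out)
```

With the output projection set to zero and a residual connection in place, the module initially returns its input unchanged, so inserting it does not disturb the pretrained image model at the start of training.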

Training Objective

Sample video data $x_0^{1:N}$.

Encode it frame by frame into the latent code $z_0^{1:N}$ with the pretrained autoencoder.

forward diffusion schedule :