[논문 Summary] Tune-A-Video (2023 ICCV) "Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation"

논문 정보

Citation : 2024.01.14 일요일 기준 192회

저자

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

1) Show Lab, National University of Singapore

2) ARC Lab

3) Tencent PCG

4) School of Computing, National University of Singapore

논문 링크

Official

https://openaccess.thecvf.com/content/ICCV2023/papers/Wu_Tune-A-Video_One-Shot_Tuning_of_Image_Diffusion_Models_for_Text-to-Video_Generation_ICCV_2023_paper.pdf

Arxiv

https://arxiv.org/abs/2212.11565

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new

arxiv.org

공식 Github

https://github.com/showlab/Tune-A-Video

GitHub - showlab/Tune-A-Video: [ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation - GitHub - showlab/Tune-A-Video: [ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models...

github.com

https://tuneavideo.github.io/

Tune-A-Video

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation 1Show Lab,2National University of Singapore 3ARC Lab,4Tencent PCG 4 School of Computing, National University of Singapore --> "A puppy is eating a cheeseburger on the tabl

tuneavideo.github.io

논문 Summary

Abstract

Text-to-Video(T2V) generator를 위한 연구가 진행되지만 계산 비용이 많이 든다.

단 하나의 text-video pair를 통한 새로운T2V Setting인 One-Shot Video Tuning을 소개

이는 pretrained T2I model을 활용한다.

Tune-A-Video를 제안

1) tailored spatio-temporal attention mechanism

2) one-shot tuning strategy

3) Inference stage, DDIM inversion 사용

실험을 통해 우월성 입증.

0. 설명 시작 전 Overview

T2I U-Net 속 Transformer의 Attention layer에 대한 fine-tuning을 통해 T2V를 효율적으로 표현

personalization, style, text, pose Control 등에 대한 효율적 적용 가능.

Keywords

One-Shot Video Tuning

sparse spatio-temporal attention mechnism

DDIM inversion

1. Introduction

최근 T2I generation model의 spatial-only 확장을 통한 spatio-temporal domain 작업이 대부분이다.

매우 좋은 결과를 확인할 수 있음에도 계산비용적으로 비싸고 시간이 오래 소요된다.

T2I model이 인간과 같이 open-domain concept에 대한 지식을 가졌을 때 이런 질문이 떠오른다.

"can they infer other novel videos from a single video example, like humans?"

이에 One-Shot Video Tuning 제안.

오직 단일 text-video pair를 T2V generator에 학습시켜 필수적인 motion 정보에 대한 이해를 바탕으로 edited prompt로 새로운 영상 합성이 가능하게 한다.

이것의 핵심은 consistent object의 continuous motion 보존에 있다.

1) Motion

T2I generation

2) Consistent objects

spatio-temporal attention으로 확장하여 연속적인 frame 생성해도 consistent 객체와 장소를 나타내도 motion이 continuous하지 않다.(Figure 2 두 번째 row)

이는 T2I models에서의 self-attention layer는 pixel position보다 오직 spatial similarities에 의해서만 구동된다는 것을 의미

이에 간단하지만 효과적인 Tune-A-Video 제안

T2I의 spatio-temporal dimension으로 단순 확장

그러나 모든 attention 훈련은 제곱배의 계산량 증가와 사전 존재하는 지식을 위험에 빠뜨릴 수도 있고 새로운 concept video 생성에 방해를 받을지도 모른다.

이를 해결하기 위해 sparse spatio-temporal attention mechnism

1) 첫 번째와 사전 frame 활용

2) attention block에서 projection matrics update

이것으로 consistent object 유지는 가능하지만 continuous motion에는 부족

structure guidance를 찾기위해 DDIM inversion 사용

이를 통해 temporally coherent video에서 smooth movement 가능.

We introduce a new setting of One-Shot Video Tuning for T2V generation, which eliminates the burden of training with large-scale video datasets.
We present Tune-A-Video, which is the first framework for T2V generation using pretrained T2I models.
We propose efficient attention tuning and structural inversion that significant improve temporal consistency.
We demonstrate remarkable results of our method through extensive experiments.

2. Related work

Text-to-Image diffusion models

GLIDE, DALL-E 2, Imagen, VQ-diffusion, LDM

Text-to-Video generative models

GODIVA, NUWA, CogVideo, CogView2, VDM, Imagen Video, Make-A-Video, MagicVideo

Text-driven video editing

Text2Live, Dreamix, Gen-I

Generation from a single video

Single-Video GANs, HPVAE-GAN, Patch nearest-neightbor methods, SinFusion

3. Methods

$\mathcal{V}$ : m개의 frame을 가지는 Video = $ \{ v_i | i \in [1,m] \} $

$\mathcal{P}$ : video V를 표현하는 source prompt

$\mathcal{P^*}$ : edited text prompt

$\mathcal{V^*}$ : $mathcal{P^*}$으로 수정되어 새롭게 생성된 video

3.1 Preliminaries

DDPM

Markov Chain 기반 negative log-likelihood의 variational lower bound maximizing

참조 : DDPM https://aigong.tistory.com/589

[논문 Summary] DDPM (2020 NIPS) "Denoising diffusion probabilistic models"

[논문 Summary] DDPM (2020 NIPS) "Denoising diffusion probabilistic models" 목차 논문 정보 Citation : 2022.11.05 토요일 기준 660회 저자 Jonathan Ho, Ajay Jain, Pieter Abbeel UC Berkeley 논문 링크 Official https://proceedings.neurips.cc/p

aigong.tistory.com

LDM

Stable Diffusion 설명

autoencoder, DDPM 기반 text-to-image genration model 간략 설명

3.2 Network Inflation

T2I U-Net은 2D conv residual block과 transformer block들로 구성

transformer block은 spatial self-attention layer, cross-attention layer, feed-forward network(FFN)으로 구성

Spatial self-attention은 feature map에서 pixel location에서 유사한 관계성을 활용

Cross attention은 condition input과 pixel간 일치성을 고려

video frame $v_i$에 대해 latent representation $z_{v_i}$가 주어졌을 때, Spatial self-attention은 다음과 같이 표현.

저자들은 spatio-temporal domain으로 확장.

2D conv → pseudo 3D convolution layer (3x3 kernel -> 1x3x3 kernel)

transformer block에 temporal self-attention layer 추가.

spatio-temporal attention → spatio-temporal attention (ST-Attn)으로 확장.

단, 이대로는 너무 계산량이 많아지기에 causal attention mechanism의 sparse version 사용 제안

frame $z_{v_i}$와 이전 두 frame $z_{v_{i-1}}$ & $z_{v_1}$ 간 attention matrix 계산.

이를 통한 계산 복잡도 $O((mN)^2)$ → $O(2m(N)^2)$

projection matrix $W^Q, W^K, W^V$는 모두 space와 time에 대해 공유

3.3 Fine-Tuning and Inference

Model fine-tuning

spatio-temporal attention (ST-Attn)에서 Q만 update, K는 fix

Cross-attention에서도 Q만 update

효율적 update

Structure guidance via DDIM inversion

Attention layer Fine-tuning으로 spatial consistency 유지 가능하지만 pixel shift에 대한 제어가 어렵다.

이를 위해 source video로부터 structure guidance를 통합할 수 있도록 DDIM Inversion 사용

4. Applications of Tune-A-Video

Object editing

객체 replacing, adding, removing

Background change

change background

그러나 물체의 움직임 consistency 보존

Style transfer

비디오 스타일 변화

Personalized and controllable generation

Dreambooth, ControlNet, T2I-Adapter 사용 가능.

5. Experiments

5.1 implementation Details

LDM (Stable Diffusion)

32 uniform frames

512x512 resolution

500 steps

learning rate $3 \times 10^{-5}$

batch size 1

Inference, DDIM sampler with classificer-free guidance

NVIDIA A100

finetuning - 10 min

sampling - 1 min

5.2 Baseline Comparison

Datasets

DAVIS dataset - 42 videos

Baselines

1) CogVideo

2) Plug-and-Play

3) Text2LIVE

Qualitative results

CogVideo는 text에 대한 general concept 영상 생성(다양한 결과).

But video input으로 넣지 못함.

Plug-and-Play는 개별 frame edit 가능.

But temporal context 무시됨에 따라 frame consistency 부족

Text2LIVE temporal smooth video 생성 가능

But 정확한 edit prompt 표현이 어려움.

자신들의 방법은 구조적 정보 보존에 따라 temporally-coherent video 생성 가능.

Quantitative results

Frame consistency - 모든 프레임에 대한 CLIP image embeddings에 대해 average cosine similarity

Textual faithfulness - 모든 프레임과 edit prompt에 대한 average CLIP score

User Study도 진행.

5.3 Ablation Study

Spatio-temporal attention (ST-Att) mechanism - content discrepancies

DDIM inversion - consistent content

finetuning

3가지에 대해 진행.

6. Limitations and future work

다수의 물체나 occlusion이 있는 경우 어려움

Reference

Official Colab

https://colab.research.google.com/github/showlab/Tune-A-Video/blob/main/notebooks/Tune-A-Video.ipynb

Tune-A-Video.ipynb

Run, share, and edit Python notebooks

colab.research.google.com

저작자표시 비영리 동일조건

AI 공부 도전기