Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman
https://grail.cs.washington.edu/projects/dreampose/
https://arxiv.org/abs/2304.06025
DreamPose: proposes a diffusion-based method for generating animated fashion videos from an image.
Given an image and a human body pose sequence, it synthesizes both the person and the fabric motion into a video.
A pretrained text-to-image model (Stable Diffusion) is converted into a pose-and-image guided video synthesis model.
A new finetuning strategy on the UBC Fashion dataset is used for this.
The approach is evaluated and its effectiveness demonstrated.
DreamPose
Input: a single image + a sequence of poses
Contributions:
1) Architectural changes to pretrained Stable Diffusion
2) A two-stage fine-tuning method
Together, these yield a video generation model.
Fashion photography is ubiquitous, but it conveys limited information and makes the nuances of how a garment looks and moves hard to appreciate. Fashion videos provide far more information, yet are relatively scarce.
The authors propose DreamPose, which produces realistic, animated videos driven by a pose sequence; it is a diffusion-based video synthesis model built on Stable Diffusion.
A video is generated from a single image plus a pose sequence.
Of course, this is a very challenging task.
Existing models are text-conditioned; conditioning on signals such as motion instead allows finer-grained control.
The authors' image-and-pose conditioning scheme improves appearance fidelity and enables frame-to-frame consistency.
To achieve this, the authors summarize their contributions as:
(1) DreamPose: an image-and-pose conditioned diffusion method for still fashion image animation
(2) a simple, yet effective, pose conditioning approach that greatly improves temporal consistency across frames
(3) a split CLIP-VAE encoder that increases the output fidelity to the conditioning image
(4) a finetuning strategy that effectively balances image fidelity and generalization to new poses.
Diffusion models have shown impressive results in text-conditioned image synthesis, video synthesis, and 3D generation tasks.
However, training these models from scratch is computationally expensive and data intensive.
Latent Diffusion Models
our work leverages a pretrained Stable Diffusion model with subject-specific finetuning.
image animation refers to the task of generating a video from one or more input images
Some image animation methods use multiple separate networks (e.g. …), while others are end-to-end single-network approaches (e.g. …).
Pose-guided fashion image synthesis methods: GAN-based approaches.
However, these struggle with large pose changes, synthesizing occluded regions, and preserving garment style.
More recent approaches use attention-based mechanisms.
There are also diffusion-based methods for fashion image and video synthesis, e.g. DiffFashion and PIDM.
Many text-to-video diffusion models rely on adapting text-to-image diffusion models for video synthesis
However, their results struggle to match the realism of their image counterparts.
Some video models are instead trained from scratch,
but this requires expensive computational resources, huge training datasets, and extensive training time.
Tune-A-Video finetunes a text-to-image pretrained diffusion model for text-and-image conditioned video generation
Tune-A-Video's limitation: results exhibit textural flickering and structural inconsistencies.
While effective at controlling high-level details, text conditioning fails to provide rich, detailed information about the exact identity or pose of a person and garment.
Various attempts address this via image conditioning, e.g. DreamBooth, PIDM, and DreamPose.
The authors' approach: an image conditioning scheme that extracts image embeddings from a combination of CLIP and VAE encoders and injects them into the UNet's cross-attention layers, which yields smooth, temporally consistent motion.
Advantages of diffusion models and how they work.
Explanation of LDMs (overview of the process).
Explanation of classifier-free guidance.
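As a quick reminder (standard formulation, not quoted from the paper), classifier-free guidance mixes a conditional and an unconditional noise prediction with a guidance weight $s$:

$$\hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + s\,\big(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\big)$$

where $c$ is the conditioning signal (here, image and pose) and $\varnothing$ is the null conditioning learned via conditioning dropout during training.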
Our method aims to produce photorealistic animated videos from a single image and a pose sequence
fine-tune a pretrained Stable Diffusion model on a collection of fashion videos.
At inference time, we generate each frame independently
The DreamPose model is a pose- and image-conditioned image generation model that modifies and finetunes the original text-to-image Stable Diffusion model for the purpose of image animation.
objectives
(1) faithfulness (2) visual quality (3) temporal stability
DreamPose requires an image conditioning mechanism that captures the global structure, person identity, and fine-grained details of the garment, as well as a method to effectively condition the output image on target pose while also enabling temporal consistency between independently sampled output frames
4.2.1 Split CLIP-VAE Encoder
our network aims specifically to produce images which are not spatially aligned with the input image.
we implement image conditioning by replacing the CLIP text encoder with a custom conditioning adapter that combines the encoded information from pretrained CLIP image and VAE encoders.
Given that Stable Diffusion is conditioned on CLIP text embeddings and CLIP embeds images and text into a shared space, it seems natural to simply replace the CLIP text conditioning with an embedding derived from the conditioning image.
However, CLIP image embeddings alone are insufficient for capturing fine-grained details in the conditioning image.
additionally input the encoded latent embeddings from Stable Diffusion’s VAE.
add an adapter module A that combines the CLIP and VAE embeddings to produce one embedding
This adapter blends the two signals together and transforms the output into the shape expected by the UNet's cross-attention layers.
the weights corresponding to the VAE embeddings are set to zero, such that the network begins training with only the CLIP embeddings
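A minimal sketch of what such an adapter module could look like in PyTorch; the projection layers, the pooled CLIP embedding, and the output shape (77×768 tokens) are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CLIPVAEAdapter(nn.Module):
    """Hypothetical adapter that mixes a CLIP image embedding with a VAE latent
    and maps the result to the shape expected by the UNet cross-attention layers."""
    def __init__(self, clip_dim=768, vae_dim=4 * 64 * 64, ctx_len=77, ctx_dim=768):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, ctx_len * ctx_dim)
        self.vae_proj = nn.Linear(vae_dim, ctx_len * ctx_dim)
        self.ctx_len, self.ctx_dim = ctx_len, ctx_dim
        # Zero the VAE path so training starts from CLIP-only conditioning,
        # as described in the note above.
        nn.init.zeros_(self.vae_proj.weight)
        nn.init.zeros_(self.vae_proj.bias)

    def forward(self, clip_emb, vae_latent):
        # clip_emb: (B, clip_dim) pooled CLIP image embedding
        # vae_latent: (B, 4, 64, 64) VAE latent of the conditioning image
        vae_flat = vae_latent.flatten(start_dim=1)
        mixed = self.clip_proj(clip_emb) + self.vae_proj(vae_flat)
        return mixed.view(-1, self.ctx_len, self.ctx_dim)  # (B, 77, 768)
```

Zero-initializing the VAE projection reproduces the behavior described above: at the start of finetuning the conditioning is effectively CLIP-only, and the VAE signal is blended in as training progresses.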
4.2.2 Modified UNet
concatenate the noisy latents $\tilde{z}_i$ with a target pose representation $c_p$
we set $c_p$ to consist of five consecutive pose frames: $c_p = \{p_{i-2}, p_{i-1}, p_i, p_{i+1}, p_{i+2}\}$
The authors found that individual poses are prone to frame-to-frame jitter, but training the network with a set of consecutive poses increases overall motion smoothness and temporal consistency.
modify the UNet input layer to take in 10 extra input channels, initialized to zero, so the pretrained weights for the original input channels are kept unchanged.
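A sketch of that channel expansion using diffusers; the model id and the exact weight-copying scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Load a pretrained Stable Diffusion UNet (model id is illustrative).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

old_conv = unet.conv_in  # originally 4 latent input channels
new_conv = nn.Conv2d(
    old_conv.in_channels + 10,            # 4 latent + 10 pose channels
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    padding=old_conv.padding)

with torch.no_grad():
    new_conv.weight.zero_()
    # Keep the pretrained weights for the original latent channels;
    # the 10 new pose channels start at zero.
    new_conv.weight[:, :old_conv.in_channels] = old_conv.weight
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
unet.register_to_config(in_channels=old_conv.in_channels + 10)
```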
Initially, the weights are loaded from a pretrained text-to-image Stable Diffusion checkpoint, except for the CLIP image encoder, which is loaded from a separate pretrained checkpoint.
DreamPose is finetuned in two stages
The first phase fine-tunes the UNet and adapter module on the full training dataset in order to synthesize frames consistent with an input image and pose.
The second phase refines the base model by fine-tuning the UNet and adapter module, then the VAE decoder, … to create a subject-specific custom model used for inference.
sample-specific finetuning is essential to preserving the identity of the input image’s person and garment, as well as maintaining a consistent appearance across frames.
However, simply training on a single frame and pose pair quickly leads to artifacts
To prevent this, we augment the image-and-pose pair at each step, such as by adding random cropping.
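A rough sketch of such a paired augmentation; the crop range and the assumption that the pose map is cropped in lockstep with the image are mine:

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(image, pose_map, out_size=512, min_scale=0.9):
    """Apply the same random crop + resize to a CHW image tensor and its pose map,
    so subject-specific finetuning never sees the exact same crop twice."""
    scale = random.uniform(min_scale, 1.0)
    _, h, w = image.shape
    ch, cw = int(h * scale), int(w * scale)
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    image = TF.resized_crop(image, top, left, ch, cw, [out_size, out_size])
    pose_map = TF.resized_crop(pose_map, top, left, ch, cw, [out_size, out_size])
    return image, pose_map
```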
finetuning the VAE decoder is crucial for recovering sharper, more photorealistic details in the synthesized output frames
$c_I$ : image conditioning
$c_p$ : pose conditioning
dual classifier-free guidance prevents overfitting
two guidance weights : $s_I, s_p$
large $s_I$ : high appearance fidelity
large $s_p$ : ensure alignment to the input pose
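With the two guidance weights, the notes above point to a two-way composition in the style of InstructPix2Pix; a plausible form (my reconstruction of the notation, not copied from the paper) is:

$$\hat{\epsilon}_\theta(z_t, c_I, c_p) = \epsilon_\theta(z_t, \varnothing, \varnothing) + s_I\big(\epsilon_\theta(z_t, c_I, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing)\big) + s_p\big(\epsilon_\theta(z_t, c_I, c_p) - \epsilon_\theta(z_t, c_I, \varnothing)\big)$$

so $s_I$ controls how strongly the prediction is pulled toward the image conditioning, and $s_p$ how strongly it is pulled toward the pose on top of that.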
Trained at 512x512 resolution on two NVIDIA A100 GPUs.
Training
1) finetune the base model UNet on the full training dataset for a total of 5 epochs at a learning rate of 5e-6.
batch size of 16
Dropout scheme: null values replace the pose input 5% of the time, the input image 5% of the time, and both the pose and the image 5% of the time during training (see the sketch after this list).
2) finetune the UNet on a specific sample frame for another 500 steps with a learning rate of 1e-5 and no dropout.
3) finetune the VAE decoder only for 1500 steps with a learning rate of 5e-5.
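A minimal sketch of how the dropout scheme in step 1 could be implemented; the function name and the use of zero tensors as "null" conditioning are assumptions:

```python
import torch

def apply_conditioning_dropout(image_emb, pose_maps, p=0.05):
    """Hypothetical training-time dropout: null out the pose conditioning,
    the image conditioning, or both, each roughly 5% of the time."""
    r = torch.rand(())
    if r < p:                      # drop pose only
        pose_maps = torch.zeros_like(pose_maps)
    elif r < 2 * p:                # drop image only
        image_emb = torch.zeros_like(image_emb)
    elif r < 3 * p:                # drop both
        pose_maps = torch.zeros_like(pose_maps)
        image_emb = torch.zeros_like(image_emb)
    return image_emb, pose_maps
```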
Inference
use a PNDM sampler for 100 denoising steps
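A minimal denoising-loop sketch with the diffusers PNDM scheduler, reusing the modified `unet` from the earlier sketch; the placeholder pose and conditioning tensors, and the model id, are illustrative only (dual classifier-free guidance, which would need three UNet passes per step, is omitted):

```python
import torch
from diffusers import PNDMScheduler

# Scheduler loaded from the Stable Diffusion config (model id illustrative).
scheduler = PNDMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(100)

# Placeholders: `pose_maps` stands in for the 5-frame pose stack (10 channels)
# and `cond_emb` for the CLIP-VAE adapter output.
latents = torch.randn(1, 4, 64, 64)
pose_maps = torch.zeros(1, 10, 64, 64)
cond_emb = torch.zeros(1, 77, 768)

with torch.no_grad():
    for t in scheduler.timesteps:
        unet_in = torch.cat([latents, pose_maps], dim=1)   # 4 + 10 channels
        noise_pred = unet(unet_in, t, encoder_hidden_states=cond_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
# `latents` would then be decoded with the (finetuned) VAE decoder.
```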
UBC Fashion dataset
DwNet: Dense warp-based network for pose-guided human video generation, BMVC 2019
https://github.com/zpolina/dwnet
339 training and 100 test videos.
Each video has a frame rate of 30 frames/second and is about 12 seconds long.
For training, pairs of frames are randomly sampled from the training videos.
quantitatively & qualitatively
Motion Representations for Articulated Animation (MRAA) [39]
Thin-Plate Spline Motion Model (TPSMM) [53].
For comparison, the baselines are trained from scratch on the UBC Fashion dataset with the same scripts and number of epochs.
We run PIDM and our method with 100 denoising steps.
6.1.1 Quantitative Analysis
test all models on the UBC Fashion test set, consisting of 100 unique fashion videos, at 256px resolution
extract 50 frames for testing
the full DreamPose model quantitatively outperforms both methods on all four quantitative metrics
6.1.2 Qualitative Analysis
With MRAA and TPSMM, note that the person identity, fabric folds, and fine patterns are lost in new poses, whereas DreamPose accurately retains those details. Plus, during large pose changes, MRAA may produce disjointed limbs.
PIDM synthesizes realistic faces, but both the identity and the dress appearance vary frame-to-frame.
(1) $Ours_{CLIP}$: We use a pretrained CLIP image encoder, instead of our dual CLIP-VAE encoder
(2) $Ours_{No-VAE-FT}$ : We do subject-specific finetuning of the UNet only, not the VAE decoder
(3) $Ours_{1-pose}$ : We concatenate only one target pose, instead of 5 consecutive poses, to the noise
(4) $Ours_{full}$ : Our full model, including subject-specific VAE finetuning, the CLIP-VAE encoder, and 5-pose input.
Quantitative Analysis
100 predicted video frames selected from each of the 100 test videos of the UBC Fashion dataset
Qualitative Analysis
additional input images of a subject increase the quality and viewpoint consistency
Limitations: limbs disappearing, hallucinated dress features, and directional misalignment.
Possible improvements: some of these failures could be alleviated with improved pose estimation, a larger dataset, or a segmentation mask.
Future work (temporal consistency): achieving better temporal consistency on such patterns, ideally without subject-specific finetuning, is left to future work.
Future work (runtime): fine-tuning the model on a specific subject takes approximately 10 minutes for the UNet and 20 minutes for the VAE decoder, in addition to an 18-second per-frame rendering time.
https://github.com/johannakarras/DreamPose
https://grail.cs.washington.edu/projects/dreampose/