[Paper Summary] DiffEdit (2022.10 arXiv) "DiffEdit: Diffusion-based semantic image editing with mask guidance"

Paper Information

Citations: 32 as of Monday, 2023.04.24

Authors

Guillaume Couairon, Jakob Verbeek, Holger Schwenk (Meta AI), Matthieu Cord (Sorbonne Université, Valeo.ai)

Paper Links

Official

Not Yet

Arxiv

https://arxiv.org/abs/2210.11427

Paper Summary

Abstract

DiffEdit

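Proposes a method that uses text-conditioned diffusion models for semantic image editing, the task of editing an image based on a text query.

Existing editing methods perform inpainting based on a provided mask, whereas this method automatically generates a mask highlighting the regions to edit by contrasting the predictions of a diffusion model conditioned on different text prompts.

The automatically generated mask works in synergy with mask-based diffusion editing, and the method achieves strong quantitative results.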

1. Introduction

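Semantic image editing means modifying an input image according to a textual transformation query.

For example, to change the fruits in a given image to pears, the model should replace the fruits with pears while keeping everything else, such as the bowl and the background, as close to the input image as possible.

Text-conditional image generation is undergoing a major transformation.

Scaling up these models is the key to their success: state-of-the-art models are trained on large amounts of data, which requires substantial resources.

Just as prompt engineering has been adopted for applying LLMs to downstream tasks, the generative power of large generative models can be harnessed to solve semantic image editing, while avoiding both training a specialized architecture and costly instance-based optimization.

Diffusion models are actively used in image editing. In particular, various techniques such as CLIP guidance or inpainting with a user-given mask are used as guides, but as methods for semantic image editing they fall short in two important respects.

Inpainting discards information that should be used (e.g., for an animal: pose, color, etc.).
A mask is required.

The authors argue that language-guided editing offers a more intuitive way to edit that requires less effort.

A diffusion model can also be conditioned without a mask. However, the proposed model targets local edits, whereas mask-free conditioning methods modify the whole image. Moreover, adding noise causes important information to be lost.

Given a text query, DiffEdit automatically finds the regions of the input image that must be edited. Compared with simply adding noise, encoding the input image into latent space with a reverse denoising model integrates the edited region with the background better and produces more natural edits.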

2. Related Work

Semantic image editing

Image editing covers a wide range of tasks, from photo colorization and retouching to style transfer, inserting objects into images, image-to-image translation, inpainting, scene graph manipulation, and placing subjects into new contexts.

Among these, the paper focuses on semantic image editing, where an image is modified given a natural language instruction.

Image editing with diffusion models

Because of the nature of diffusion models, they are readily adopted for inpainting when a mask is given.

Prior work that can be drawn on includes DDIM, CLIP-guided diffusion, Blended Diffusion, DiffusionCLIP, and SDEdit.

3. DiffEdit Framework

3.1 Background: Diffusion Models, DDIM and Encoding

DDPM

Forward process: noise is injected over T timesteps until the image becomes Gaussian noise.

Reverse process: a neural network is trained by minimizing a denoising objective.

DDIM

Provides an equation for generating new images with a deterministic procedure (Eq. 2).

Rewriting this in its associated neural ODE (ordinary differential equation) form gives Eq. 3.

Uses fewer sampling steps during inference.
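
The deterministic DDIM update and its use for encoding can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `toy_eps_model` is a hypothetical stand-in for a trained noise estimator, and the `alphas` schedule is an arbitrary assumption. The same update formula is applied in both directions, toward more noise for encoding and toward less noise for decoding.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """One deterministic DDIM update from noise level alpha_from to alpha_to.

    alpha_* are cumulative signal coefficients; values near 1 mean little noise.
    """
    # Clean image implied by the current sample and the noise estimate.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Move the prediction to the target noise level with the same noise estimate.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_encode(x0, eps_model, alphas):
    """Step toward more noise (alphas decreasing) and return every latent."""
    latents = [x0]
    for t in range(len(alphas) - 1):
        x = latents[-1]
        latents.append(ddim_step(x, eps_model(x, t), alphas[t], alphas[t + 1]))
    return latents

def ddim_decode(x_r, eps_model, alphas):
    """Step toward less noise, starting from the most noised latent."""
    x = x_r
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

rng = np.random.default_rng(0)
fixed_noise = rng.standard_normal((4, 4))

def toy_eps_model(x, t):
    # Assumption: a constant noise estimate, so the round trip is exact.
    return fixed_noise

alphas = np.linspace(1.0, 0.5, 26)      # hypothetical schedule, 25 steps
x0 = rng.standard_normal((4, 4))
latents = ddim_encode(x0, toy_eps_model, alphas)
x0_rec = ddim_decode(latents[-1], toy_eps_model, alphas)
print(np.allclose(x0, x0_rec))
```

With a real network the noise estimate changes from step to step, so encoding followed by decoding recovers the input only approximately.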

3.2 Semantic Image Editing with DiffEdit

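Semantic image editing aims to modify only part of the image and leave the rest unchanged. However, it is not easy to identify the region from a text query alone, and unwanted parts are sometimes modified as well.

To avoid this, DiffEdit uses a text-conditioned diffusion model to infer a mask of the region that needs editing.

The inferred mask is then used together with DDIM encoding to guide the process so that edits outside the region of interest are minimized.

The method has three steps.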

Step 1: Computing editing mask

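A text-conditioned diffusion model produces different noise estimates when given different text conditionings.

Figure 2: a difference appears between the estimates for the query "zebra" and the reference text "horse", while the background is unchanged.

The mask is inferred from the difference between the noise estimates.

Gaussian noise with strength 50% is added to the input image.

Extreme values in the noise predictions are removed, and the spatial differences are averaged over a set of n input noises.

The result is rescaled to the range [0, 1] and binarized with a threshold of 0.5. As a result, the masked region tends to overshoot the area that needs editing.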
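
The mask computation described above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the function name, the array layout `(n, C, H, W)`, and the use of percentile clipping to "remove extreme values" are choices made for this sketch, and the noise estimates are synthetic rather than produced by a diffusion model.

```python
import numpy as np

def compute_edit_mask(eps_query, eps_ref, threshold=0.5, clip_percentile=99.0):
    """Binary edit mask from two sets of noise estimates.

    eps_query, eps_ref: arrays of shape (n, C, H, W), the noise estimates for
    n noised copies of the input, conditioned on the query / reference text.
    """
    # Per-pixel magnitude of the difference, averaged over channels.
    diff = np.abs(eps_query - eps_ref).mean(axis=1)           # (n, H, W)
    # Assumption: extreme values are removed by clipping at a percentile.
    diff = np.minimum(diff, np.percentile(diff, clip_percentile))
    # Average over the n noise samples to stabilise the estimate.
    diff = diff.mean(axis=0)                                   # (H, W)
    # Rescale to [0, 1], then binarize.
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff >= threshold).astype(np.float32)

# Synthetic example: the two estimates differ only in a 3x3 region.
rng = np.random.default_rng(0)
eps_ref = rng.standard_normal((10, 4, 8, 8))
eps_query = eps_ref.copy()
eps_query[:, :, 2:5, 2:5] += 2.0
mask = compute_edit_mask(eps_query, eps_ref)
print(mask)
```

In the actual method the two inputs would come from the same noised images passed through the diffusion model with the query and the reference text as conditioning.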

Step 2: Encoding

The input image $x_0$ is encoded into the implicit latent space at timestep r with the DDIM encoding function $E_r$.

No text input is used in this step.

Step 3: Decoding with mask guidance

After obtaining the latent $x_r$, decode with the diffusion model conditioned on the target text query Q.

The mask M is used to guide the diffusion process.

Outside the mask, the values are replaced with the latent $x_t$ inferred during DDIM encoding.
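
The mask-guided replacement can be sketched as below. This is a minimal illustration, assuming the encoded latents from Step 2 were stored for every timestep; `decode_step` is a hypothetical callable standing in for one query-conditioned denoising update, and the toy values are chosen only to make the effect visible.

```python
import numpy as np

def blend_with_mask(y_t, x_t, mask):
    """Keep the decoded latent y_t inside the mask and the encoded latent x_t outside."""
    return mask * y_t + (1.0 - mask) * x_t

def masked_decode(encoded_latents, decode_step, mask):
    """Decode from the last encoded latent with mask guidance at every step.

    encoded_latents[t] is the DDIM-encoded latent at step t (index 0 = input).
    decode_step(y, t) is a hypothetical update from step t to step t - 1.
    """
    y = encoded_latents[-1]
    for t in range(len(encoded_latents) - 1, 0, -1):
        y = decode_step(y, t)
        y = blend_with_mask(y, encoded_latents[t - 1], mask)
    return y

# Toy example: encoded latents are constant images, the "decoder" writes 7.0.
encoded_latents = [np.full((4, 4), float(t)) for t in range(4)]
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

def toy_decode_step(y, t):
    return np.full_like(y, 7.0)

edited = masked_decode(encoded_latents, toy_decode_step, mask)
print(edited)
```

In this toy run the masked region carries the decoder's output, while every pixel outside the mask ends at the value of the input, which is the behaviour the replacement is meant to enforce.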

The encoding ratio r controls the strength of the edit: a larger r allows stronger edits.

See Appendix A.5 for its effect.

3.3 Theoretical Analysis

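This section gives theoretical insight into why this approach yields better edits than adding random noise as in SDEdit.

DiffEdit uses DDIM decoding conditioned on the text query Q, yet there is still a strong bias toward staying close to the original image.

This is because the noise estimator network tends to produce similar estimates.

In other words, the edited image should remain within a small distance of the input image.

Eq. (4): SDEdit / Eq. (5): DiffEdit

The bound for DiffEdit is tighter.

The proof is in Appendix B.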

4. Experiments

4.1 Experimental Setup

Dataset

Experiments: ImageNet, images generated by Imagen, and the COCO dataset

Diffusion models

LDM is used in the experiments, at 512x512 resolution.

DDIM sampling with 50 steps

Editing an image takes about 10 seconds on a single Quadro GP100 GPU.

Classifier-free guidance is used.
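
For reference, classifier-free guidance combines an unconditional and a text-conditioned noise estimate by extrapolating from the former toward the latter. The sketch below shows only that combination; the arrays and the guidance scale are made-up values for illustration, not settings taken from the paper.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional estimate toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0, 2.0])   # hypothetical unconditional estimate
eps_cond = np.array([1.0, 1.0, 0.0])     # hypothetical text-conditioned estimate
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
print(guided)
```

A scale of 1 returns the conditional estimate unchanged, and larger scales push the result further in the direction suggested by the text.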

Comparison to other methods

Compared against SDEdit, FlexIT, and ILVR.

Evaluation

Two conflicting objectives must be satisfied: (1) matching the text query and (2) staying close to the input image.

Methods are compared based on the trade-off curve between the two objectives.

4.2 Experiments on ImageNet

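LPIPS: measures the perceptual distance to the input image.

CSFID: a class-conditional variant of FID that measures realism and consistency with respect to the transformation prompt.

Lower is better.

The stronger the edit, the lower the CSFID score.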

Ablation Study

(Left) Shows the performance difference with and without DDIM encoding; results are better with the encoding.

(Right) A low threshold (0.25) produces a large mask and leads to a worse trade-off.

A threshold that is too high (0.75) produces an overly restrictive mask.
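
The effect of the threshold on mask size can be illustrated with a small sketch. The difference map here is random data standing in for a rescaled noise-difference map, so the numbers say nothing about the paper's results; the point is only that a lower threshold marks a larger fraction of the image for editing.

```python
import numpy as np

def mask_coverage(diff_map, threshold):
    """Fraction of pixels that end up inside the binarized mask."""
    return float((diff_map >= threshold).mean())

rng = np.random.default_rng(0)
diff_map = rng.random((64, 64))   # stand-in for a difference map in [0, 1]

coverage = {th: mask_coverage(diff_map, th) for th in (0.25, 0.5, 0.75)}
for th, frac in coverage.items():
    print(f"threshold={th}: {frac:.2f} of pixels masked")
```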

Encode-Decode refers to DiffEdit without masking.

SDEdit uses neither masking nor encoding.

Without masking, unwanted background edits occur (column 2).

Without DDIM encoding, appearance information from the input (e.g., pose) is lost (last two columns).

4.3 Experiments on Images Generated by Imagen

Because assigning a single class label is difficult, FID and CLIP-Score are used instead of CSFID.

FID: measures image realism.

CLIP-Score: measures the alignment between the query and the resulting image.
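
CLIP-based alignment scores build on the cosine similarity between an image embedding and a text embedding. The sketch below computes only that similarity and assumes the embeddings are already available; the vectors are made-up values, not outputs of a CLIP model, and any scaling or clipping applied by a specific CLIP-Score definition is left out.

```python
import numpy as np

def cosine_alignment(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

image_emb = np.array([1.0, 2.0, 0.0])     # hypothetical image embedding
matching_text = np.array([2.0, 4.0, 0.0])  # same direction as the image
unrelated_text = np.array([0.0, 0.0, 3.0])  # orthogonal to the image

print(cosine_alignment(image_emb, matching_text))
print(cosine_alignment(image_emb, unrelated_text))
```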

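DiffEdit produces more accurate edits than the other methods.

Figure 8 shows how the mask differs depending on whether a reference text is used.

A reference text lets the mask ignore the parts of the image that are described by both the query and the reference text.

The parts that are described differently, in contrast, receive different noise estimates.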

4.4 Experiments on COCO

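5. Conclusion

DiffEdit introduces a new algorithm for semantic image editing based on diffusion models.

Given a textual query, it uses a diffusion model to generate a mask of the relevant region to edit and applies DDIM encoding. It achieves strong results in both quantitative and qualitative evaluation and produces high-quality edits.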

7. Ethics Statement

“must not distribute harmful, offensive, dehumanizing content or otherwise harmful representations of people or their environments, cultures, religions, etc. produced with the model weights”.

Code of ethics

Appendix

A.4 Visualisation of the impact of encoding ratio

A figure worth viewing alongside the ablation study in Section 4.2.

Ultimately, the final DiffEdit, which combines encoding with the auto-generated mask, works best.

A.5 Additional Visualization and Qualitative Results

A. Failure Case

Failure cases include inserting a new object, masking a region different from the expected one, and editing an unwanted location or object.

Reference

Official GitHub

Not Yet

Sub Github

https://github.com/huggingface/diffusers
