[Paper Summary] BlobGEN (ICML 2024) "Compositional Text-to-Image Generation with Dense Blob Representations"

Paper information

Citation : as of Saturday, 2024.06.01, cited __ times

Authors

Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

NVIDIA

Paper & GitHub links

Official

Arxiv

https://arxiv.org/abs/2405.08246

Official GitHub

No code released

https://blobgen-2d.github.io/


Paper Summary

0. Overview before the walkthrough

Controllable generation is still difficult for text-to-image models.

This paper introduces BlobGEN, a blob-grounded text-to-image diffusion model that controls fine details through a blob layout (dense blob representation).

	Caption: "Two astronauts in a grassy field with trees in the background." Blob 1: "The tree is green and leafy, with a lush and healthy appearance." Blob 2: "The grass is tall and lush, with a mix of green and brown colors." Blob 3: "The astronaut wears a white space suit and a large glass helmet, floating in the air with his body stretched." Blob 4: "The astronaut appears to be walking in a wooded area, surrounded by trees. The astronaut wears a white space suit and a large glass helmet."
	Caption: "A dog who is a doctor is talking to a cat who is the dog doctor's patient." Blob 1: "The wall is in light gray color with a landscape painting on it." Blob 2: "The floor is light brown and appears to be made of hardwood. It has a natural and warm appearance and is large and spacious." Blob 3: "The cat is in gray color with a fat and fluffy body. It is sitting on a bench, listening the instructions of the dog doctor." Blob 4: "The bench is in black color with a leather texture." Blob 5: "The desk is in light blue color. It appears to be a classic work desk." Blob 6: "The dog has a brown head and dresses a white doctor uniform. It is sitting on a chair, talking to the cat." Blob 7: "The chair is a tan-colored leather chair, featuring a stitched design."

Main contributions of the paper

- A way to build a blob-based layout from blob parameters and blob descriptions

- A masked cross-attention module

- Augmentation with an LLM-based in-context learning approach

- Strong results across a range of experiments, including editing.

Abstract

1. Introduction

Despite the progress in text-to-image generation, models still struggle to follow complex prompts: they tend to misunderstand the context or ignore keywords. Fine-grained controllability therefore remains an open problem.

Because a text prompt alone leaves things ambiguous, recent work tries to add a visual layout.

Typical visual layouts : bounding boxes, semantic maps, depths, other modalities

- Semantic maps and depths give fine-grained information but are not easy for a user to manipulate.

- Bounding boxes are user-friendly but give only relatively coarse information.

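The paper proposes a new type of layout called the "dense blob representation". The reason is that blobs provide fine-grained information while staying user-friendly, so they are easy to interpret and use.

Building on this, it proposes BlobGEN, a blob-grounded text-to-image diffusion model.

In particular, a new masked cross-attention module is introduced to disentangle the fusion between blob representations and visual features.

The paper seems to describe this as disentangling because keeping the blob representation and the visual features in separate roles gives finer control over image details. When the two are fused, it argues, the region belonging to each blob can be misinterpreted, which leads to a different generation result.

Also, like other visual-layout papers, it designs a new in-context learning approach that uses LLMs to generate the dense blob representation. It shows that the LLM's visual understanding and compositional reasoning can help the generation task.

Strong performance is reported on zero-shot generation, layout-guided controllability, local editing, and object repositioning.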

Contribution Summary:

Decompose a scene into dense blob representations as fine-grained visual primitives
Introduce BlobGEN with a new masked cross-attention module
Augment it with LLMs through a newly designed in-context learning approach
Demonstrate empirical superiority

2. Method

2.1. Image Decomposition into Blob Representations

A "dense blob representation" needs two components.

1) blob parameter : formulates a tilted ellipse to specify the object’s position, size and orientation

2) blob description : a rich text sentence that describes the object’s appearance, style, and visual attributes

The blob parameter consists of five values covering the blob's size, location, and orientation: $[c_x, c_y, a, b, \theta]$

$c_x, c_y$ : center coordinates of the ellipse

$a,b$ : radii of the semi-major and semi-minor axes

$\theta \in (-\pi , \pi ]$ : orientation angle

- Because the orientation can describe an object's direction or pose, it allows a finer description than a bounding box.
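To make the five parameters concrete, here is a minimal NumPy sketch that rasterizes a tilted ellipse into a binary mask. The function name `blob_mask`, the pixel-coordinate convention (y pointing down, $\theta$ measured in that frame), and the example values are my own assumptions, not taken from the paper.

```python
import numpy as np

def blob_mask(cx, cy, a, b, theta, height, width):
    """Binary mask of a tilted ellipse: 1 inside the blob, 0 outside.

    Hypothetical helper; coordinates are in pixels with y pointing down.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    # Shift pixel coordinates so the ellipse center is the origin.
    dx, dy = xs - cx, ys - cy
    # Rotate by -theta so the semi-major axis lies along the u axis.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    # Standard ellipse inequality in the rotated frame.
    return ((u / a) ** 2 + (v / b) ** 2 <= 1.0).astype(np.uint8)

# Example values are arbitrary.
mask = blob_mask(cx=32, cy=32, a=20, b=8, theta=np.pi / 4, height=64, width=64)
```

The mask area should be close to $\pi a b$, which is a quick sanity check on the parameterization.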

The blob description is a text sentence that describes the object's visual appearance.

The paper uses region-level synthetic captions extracted with a pretrained image captioner.

What the description covers

category name
the detailed visual features of the object, including appearance (e.g., color, texture, and material, etc.)
spatial relationship of sub-parts within the object region (e.g., “a wooden chair with brown legs and soft seat”).

Blob representations can capture even irregular objects, large objects, and the background.

2.2. Blob-grounded Text-to-Image Generation

A new cross-attention layer is introduced to incorporate blob grounding. Everything else is frozen, and only the new layers are trained.

Blob Embedding

blob parameter : $\tau = [c_x, c_y, a, b, \theta], \tilde{\tau} = [c_x, c_y, a, b, \sin \theta, \cos \theta]$

blob description : $s = [s_1, \cdots, s_L]$ ($L$ is the text sentence length)

blob parameter embedding : $e_{\tau} = \text{Fourier}([c_x, c_y, a, b, \sin \theta, \cos \theta])$

Passed through a Fourier feature encoding

blob description embedding : $e_s = f(s) = [e_{s_1}, e_{s_2}, \ldots, e_{s_L}]$

Passed through the CLIP text encoder $f$

final blob embedding (concat) : $e_{\tilde{s}_i} = [e_{s_i}; e_{\tau}]$

Working through the dimensions, the expression above is the blob embedding for a single blob in the image.
That is, each blob carries its parameter and its description together as a pair.
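The dimension bookkeeping for one blob can be sketched as below. This is only an illustration under my own assumptions: the number of Fourier frequencies (`num_freqs=8`), the frequency schedule, coordinates normalized to [0, 1], and a random array standing in for the CLIP text encoder output (77 tokens of width 768) are all placeholders, not values confirmed by the paper.

```python
import numpy as np

def fourier_features(x, num_freqs=8):
    """Map each scalar in x to sin/cos at num_freqs frequencies (assumed schedule)."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # (num_freqs,)
    angles = x[:, None] * freqs[None, :]               # (len(x), num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

def blob_embedding(tau, token_embeddings, num_freqs=8):
    """Concatenate the parameter embedding e_tau to every token embedding e_{s_i}."""
    cx, cy, a, b, theta = tau
    # tau_tilde replaces theta by (sin theta, cos theta), as in the text above.
    tau_tilde = [cx, cy, a, b, np.sin(theta), np.cos(theta)]
    e_tau = fourier_features(tau_tilde, num_freqs)     # (6 * 2 * num_freqs,)
    n_tokens = token_embeddings.shape[0]
    return np.concatenate(
        [token_embeddings, np.tile(e_tau, (n_tokens, 1))], axis=1
    )

rng = np.random.default_rng(0)
e_s = rng.normal(size=(77, 768))   # stand-in for the CLIP text encoder output
e_blob = blob_embedding((0.5, 0.4, 0.3, 0.1, np.pi / 6), e_s)
print(e_blob.shape)  # (77, 864) = 768 text dims + 96 Fourier dims
```

With N blobs, stacking these gives an array of shape (N, L, d), which is what the masked cross-attention consumes as keys and values after projection.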

Masked Cross-Attention

1) With standard cross-attention

Given N blob embeddings, applying standard cross-attention directly causes text leakage or entanglement.

Reason : letting every blob embedding attend to every feature pixel does not suit blob embeddings, which are meant to interact only with the information of their own local region, so the model is presumably confused.

2) With masked cross-attention

Each blob embedding attends only to its own region.

The blob ellipse mask is used: 1 inside the blob's region, 0 otherwise.

$q : hw \times d_g$

$k : (N \times L) \times d_g$

$v : (N \times L) \times d_g$

$k^T : d_g \times (N \times L)$

$a = qk^T : hw \times (N \times L)$

$\sigma(a)v : hw \times d_g$

Conclusion : the blob grounding process can be more modular and independent across different object regions, and the model can be more disentangled in generation.
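The shapes above can be checked with a small NumPy sketch of masked cross-attention. Two things here are my assumptions rather than details from the paper: the mask is applied by setting disallowed logits to $-\infty$ before the softmax (a common implementation choice), and pixels covered by no blob receive a zero output. The $1/\sqrt{d_g}$ scaling is the usual attention scaling, which the shape notes above omit.

```python
import numpy as np

def masked_cross_attention(q, k, v, masks):
    """q: (hw, d); k, v: (N, L, d); masks: (N, hw), 1 where blob n covers the pixel."""
    n_blobs, n_tokens, d = k.shape
    k_flat = k.reshape(n_blobs * n_tokens, d)
    v_flat = v.reshape(n_blobs * n_tokens, d)
    logits = q @ k_flat.T / np.sqrt(d)                            # (hw, N*L)
    # A pixel may only attend to the tokens of blobs that cover it.
    allowed = np.repeat(masks.T.astype(bool), n_tokens, axis=1)   # (hw, N*L)
    logits = np.where(allowed, logits, -np.inf)
    covered = allowed.any(axis=1, keepdims=True)
    # Subtract the row max for stability; uncovered rows use 0 to avoid inf - inf.
    row_max = np.where(covered, logits.max(axis=1, keepdims=True), 0.0)
    weights = np.exp(logits - row_max)
    denom = weights.sum(axis=1, keepdims=True)
    # Rows with no covering blob keep all-zero weights (assumed behavior).
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v_flat                                       # (hw, d)

rng = np.random.default_rng(0)
hw, N, L, d = 4, 2, 3, 8
q = rng.normal(size=(hw, d))
k = rng.normal(size=(N, L, d))
v = rng.normal(size=(N, L, d))
# Blob 0 covers pixels 0 and 1; blob 1 covers pixels 1 and 2; pixel 3 is uncovered.
masks = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0]])
out = masked_cross_attention(q, k, v, masks)
```

Changing the values of blob 1 leaves the output at pixel 0 untouched, which is the "modular and independent" behavior the conclusion describes.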

Other Design Choices

Gated Cross-Attention Module (learnable scalar)
Gated Self-Attention Module (learnable scalar)
Synthetic Global Captions (rather than original caption)
Denoising Score Matching Loss

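From Li et al. (2023), "GLIGEN: Open-set Grounded Text-to-Image Generation"

The gated way uses a learnable scalar parameter to regulate how much information flows through a network module. This improves training stability, makes the model less sensitive to large changes early in training, and lets more information pass through gradually.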
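A minimal sketch of such a gate, assuming the form used in GLIGEN as I understand it: a residual update scaled by $\tanh(\gamma)$ with the scalar $\gamma$ initialized to 0. Whether BlobGEN uses exactly this form is an assumption on my part, and `gated_residual` is a hypothetical name.

```python
import numpy as np

def gated_residual(x, attn_out, gamma):
    """Residual update scaled by a scalar gate tanh(gamma).

    With gamma = 0 the new module contributes nothing, so the frozen
    pretrained network is reproduced exactly at the start of training.
    """
    return x + np.tanh(gamma) * attn_out

x = np.ones((4, 8))                # features from the frozen backbone
attn_out = np.full((4, 8), 2.0)    # output of the newly added attention layer

start = gated_residual(x, attn_out, gamma=0.0)    # identical to x
later = gated_residual(x, attn_out, gamma=10.0)   # gate almost fully open
```

As $\gamma$ moves away from 0 during training, the new layer's contribution is blended in smoothly instead of disturbing the pretrained features all at once.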

2.3. LLMs for Blob Generation

Blob Parameter Generation

Blob parameters are written in CSS format, which helps the LLM understand the spatial meaning

Declaration style : "object {major-radius:?px; minor-radius: ?px; cx: ?px; cy:?px; angle: ?}".
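Reading such a declaration back into numbers could look like the sketch below. The parser `parse_blob_css` is hypothetical, the example line and its values are made up, and the unit of `angle` is not specified in the template, so the value is returned as-is.

```python
import re

def parse_blob_css(line):
    """Parse 'name {key: value; ...}' into (name, {key: float})."""
    name, body = re.match(r"\s*([^{]+?)\s*\{(.*)\}", line).groups()
    fields = {}
    for decl in body.split(";"):
        if ":" not in decl:
            continue  # skip empty fragments such as a trailing ';'
        key, value = decl.split(":", 1)
        # Keep the leading number and drop a unit suffix such as 'px'.
        number = re.match(r"\s*(-?\d+(?:\.\d+)?)", value).group(1)
        fields[key.strip()] = float(number)
    return name, fields

# Made-up example following the declaration template quoted above.
example = "dog {major-radius: 120px; minor-radius: 60px; cx: 256px; cy: 300px; angle: 15}"
name, fields = parse_blob_css(example)
print(name, fields)
```

The parsed fields map directly onto $[c_x, c_y, a, b, \theta]$ once the angle convention is fixed.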

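The paper uses GPT3.5-chat and GPT4

The last one among the top-k is selected.

How the last one of the k is chosen is not described. It may have been checked through a metric, a detailed evaluation, blending, or something similar.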

Blob Description Generation

The same as above, except that the CSS format is not used. Instead, the category name acts as a separator between blobs

"object {text sentence}"

Again, the last prompt is set from the top-k

LLaMA-13B is used here

3. Related Work

Text-to-Image Generation:

Compositional Image Generation:

Du et al. (2020) (Compositional visual generation with energy based models, NeurIPS)
Nie et al. (2021) (Controllable and compositional generation with latent-space energy-based models, NeurIPS)
Liu et al. (2022) (Compositional visual generation with composable diffusion models, ECCV)
Epstein et al. (2022) (Blobgan: Spatially disentangled scene representations, ECCV) - BlobGAN
Ruiz et al. (2023) (Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, CVPR) - DreamBooth
Kumari et al. (2023) (Multi-concept customization of text-to-image diffusion, CVPR)
Xiao et al. (2023) (Fastcomposer: Tuning-free multi-subject image generation with localized attention, arXiv) - FastComposer
Feng et al. (2022) (Training-free structured diffusion guidance for compositional text-to-image synthesis, ICLR)
Chefer et al. (2023) (Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, TOG) - Attend-and-Excite
Epstein et al. (2023) (Diffusion self-guidance for controllable image generation, arXiv)
Chen et al. (2023b) (Training-free layout control with cross-attention guidance, arXiv)
Phung et al. (2023) (Grounded text-to-image synthesis with attention refocusing, arXiv)
Yang et al. (2023) (Reco: Region-controlled text-to-image generation, CVPR) - Reco
Li et al. (2023) (GLIGEN: Open-set grounded text-to-image generation, CVPR) - GLIGEN
Zheng et al. (2023) (Layoutdiffusion: Controllable diffusion model for layout-to-image generation, CVPR) - LayoutDiffusion
Huang et al. (2023b) (Composer: Creative and controllable image synthesis with composable conditions, arXiv) - Composer
Feng et al. (2023b) (Ranni: Taming text-to-image diffusion for accurate instruction following, arXiv) - Ranni

LLM-augmented Image Generation

Wu et al. (2023a) (Visual chatgpt: Talking, drawing and editing with visual foundation models, arXiv) - Visual ChatGPT
Koh et al. (2023) (Generating images with multimodal language models, arXiv)
Sun et al. (2023) (Generative pretraining in multimodality, arXiv)
Feng et al. (2023a) (Layoutgpt: Compositional visual planning and generation with large language models, arXiv) - LayoutGPT
Lian et al. (2023) (Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models, arXiv) - LMD

4. Experiments

4.1. Blob-grounded Text-to-Image Generation

Data Preparation

Dataset : Common Crawl web index (filtered with the CLIP score), 1M randomly sampled image-text pairs

Resizing : 512×512

ODISE (Xu et al., 2023) is applied to extract blob representations (get instance segmentation maps)

- aiming to maximize the Intersection Over Union (IOU) between the blob ellipse and segmentation mask.

- Crop the local region of every object in the image

- LLaVA-1.5 (Liu et al., 2023a) is used to caption each blob (the segmentation map above seems to be fed in as well)

Each image has 12 blobs on average
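The ellipse-fitting step can be illustrated as follows. The notes above only record the objective (maximize IoU between ellipse and segmentation mask), not the optimizer, so this sketch swaps in a simple moment-matching fit and then measures the IoU it reaches. All function names are hypothetical, and the test mask is synthetic.

```python
import numpy as np

def ellipse_mask(cx, cy, a, b, theta, shape):
    """Boolean mask of a tilted ellipse; shape is (height, width)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

def fit_ellipse_by_moments(seg):
    """Moment-matching ellipse fit (an assumed stand-in for the paper's IoU fit)."""
    ys, xs = np.nonzero(seg)
    cx, cy = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs - cx, ys - cy]))
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    # For a uniformly filled ellipse, the variance along an axis is radius**2 / 4.
    b, a = 2.0 * np.sqrt(evals)
    major = evecs[:, 1]                          # direction of the semi-major axis
    theta = np.arctan2(major[1], major[0])
    return cx, cy, a, b, theta

def iou(m1, m2):
    """Intersection over union of two boolean masks."""
    return np.logical_and(m1, m2).sum() / np.logical_or(m1, m2).sum()

# Synthetic "segmentation mask": an ellipse with known parameters.
seg = ellipse_mask(40, 30, 18, 9, 0.5, (64, 80))
cx, cy, a, b, theta = fit_ellipse_by_moments(seg)
score = iou(seg, ellipse_mask(cx, cy, a, b, theta, seg.shape))
```

For irregular real masks the moment fit is only a starting point; a search over the five parameters that directly maximizes `iou` would be closer to the stated objective.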

Training Details

SD 1.4

512x512

400K steps

batch size 512

9 days on 64 NVIDIA A100 GPUs

AdamW

During training, the global caption is randomly dropped with 50% probability to strengthen the blob representation (?)

Evaluation Metrics

FID - comparison between 30k generated images and real images

mIOU - compares the langSAM-based segmentation map with the ellipse mask region

rCLIP_i - region-level CLIP Image similarity

rCLIP_t - region-level CLIP score (Image & text(caption))

4.1.2. ZERO-SHOT GENERATION ON MS-COCO

better mIOU and rCLIP_i / deteriorates the rCLIP_t score

- Reason : discrepancy arises from a misalignment between the consistency decoder and the CLIP text encoder

Editing

4.1.3. ABLATION STUDIES

4.2. LLMs for Blob Generation

Data Preparation

NSR-1K

Evaluation Metrics

Precision, Recall, Accuracy

4.2.2. NUMERICAL AND SPATIAL REASONING

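Overall assessment

Among layout-based methods, the paper seems to show clearly that blobs are more effective than the alternatives. However, the quality of the generated images turns out to be lower than expected, and it is a pity that there are no results from improving this or from combining the method with stronger models.

As LLMs get stronger and models that accept inputs from many domains beyond multi-modal keep appearing, there have been many recent attempts to adopt this kind of methodology, and it is clearly something to keep an eye on.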