[Paper Summary] LMD (TMLR 2024) "LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models"

Paper Information

Citations: 64 as of Sunday, June 9, 2024

Authors

Long Lian, Boyi Li, Adam Yala, Trevor Darrell

- UC Berkeley

Paper & GitHub Links

Official

https://llm-grounded-diffusion.github.io/

Arxiv

https://arxiv.org/abs/2305.13655

Official GitHub

https://github.com/TonyLianLong/LLM-groundedDiffusion

Paper Summary

Abstract

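0. Overview Before the Walkthrough

The model generates an LLM-based layout without any additional training and then uses that layout to generate the image.

The procedure has two stages.

Stage 1

The LLM generates a prompt, a negative prompt, and bounding box coordinates.

Stage 2

For each bounding box in the generated layout, the object is segmented using its cross-attention map.

The per-box masked latents collected during the reverse process are then used to generate the final image.

Effect

As a result, LMD outperforms prior models on four aspects: negation, numeracy, attribute binding, and spatial relationships.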

1. Introduction

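Problem

Text-to-image generation has rapidly improved in image quality, but it still struggles to produce images that follow the prompt accurately. In particular, it handles poorly a specific number of objects (numeracy), objects that should not be present (negation), spatial understanding (spatial reasoning), and the binding of attributes to objects (attribute binding).

A naive solution and its limitation

One remedy is to build a comprehensive dataset of complex captions and train on it, but constructing such a dataset and training on it consumes considerable time and resources.

The paper therefore proposes a training-free method, LLM-grounded Diffusion (LMD).

Effect

On the four problems that existing models struggle with (negation, numeracy, attribute binding, spatial relationships), LMD generates images noticeably better than the base model, SDXL.

(The method is described in detail in Section 3.)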

Contribution Summary

1. We propose a training-free two-stage generation pipeline that introduces LLMs to improve the prompt
understanding ability of text-to-image diffusion models.

2. We introduce layout-grounded Stable Diffusion, a novel controller that steers an off-the-shelf diffusion
model to generate images grounded on instance-level box layouts from the LLM.

3. LMD enables instruction-based scene specification and allows broader language support in the prompts.

4. We propose a benchmark to assess the prompt understanding ability of a text-to-image model and
demonstrate the superior performance of LMD over recent baselines.

2. Related work

Text-to-image diffusion models

(omitted)

LLMs for visual grounding

Title: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP-2)
- Authors: Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
- Affiliation: Salesforce Research
- Released: 2023.01
- Venue: arXiv

Title: Flamingo: A Visual Language Model for Few-Shot Learning (Flamingo)
- Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al.
- Affiliation: DeepMind
- Released: 2022.12
- Venue: NeurIPS 2022

Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (VisualChatGPT)
- Authors: Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan
- Affiliation: Microsoft Research Asia
- Released: 2023.03
- Venue: arXiv

Title: Generating Images with Multimodal Language Models (GILL)
- Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov
- Affiliation: Carnegie Mellon University
- Released: 2023.05
- Venue: arXiv

Title: LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
- Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang
- Affiliation: UC Santa Barbara, University of Washington, Google Research
- Released: 2023.05
- Venue: arXiv

Title: Training-free layout control with cross-attention guidance
- Authors: Minghao Chen, Iro Laina, Andrea Vedaldi
- Affiliation: University of Oxford
- Released: 2023.04
- Venue: arXiv

Spatially-conditioned image generation methods

Title: Semantic Image Synthesis with Spatially-Adaptive Normalization (SPADE)
- Authors: Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu
- Affiliation: NVIDIA
- Released: 2019.06
- Venue: CVPR 2019

Title: BlobGAN: Spatially Disentangled Scene Representations (BlobGAN)
- Authors: Dave Epstein, Taesung Park, Richard Zhang, Eli Shechtman, Alexei A. Efros
- Affiliation: UC Berkeley, NVIDIA, Adobe Research
- Released: 2022.10
- Venue: ECCV 2022

Title: Image Generation from Layout (Layout2Im)
- Authors: Bo Zhao, Lili Meng, Weidong Yin, Leonid Sigal
- Affiliation: University of British Columbia
- Released: 2019.06
- Venue: CVPR 2019

Title: Scene Graph Generation by Iterative Message Passing
- Authors: Danfei Xu, Yuke Zhu, Christopher B. Choy, Li Fei-Fei
- Affiliation: Stanford University
- Released: 2017.06
- Venue: CVPR 2017

Title: Image Generation from Scene Graphs
- Authors: Justin Johnson, Agrim Gupta, Li Fei-Fei
- Affiliation: Stanford University
- Released: 2018.06
- Venue: CVPR 2018

Title: Learning Canonical Representations for Scene Graph to Image Generation
- Authors: Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
- Affiliation: Tel Aviv University, UC Berkeley
- Released: 2020.08
- Venue: ECCV 2020

Title: Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)
- Authors: Lvmin Zhang, Maneesh Agrawala
- Affiliation: Stanford University
- Released: 2023.02
- Venue: arXiv

Title: SpaText: Spatio-Textual Representation for Controllable Image Generation
- Authors: Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, Xi Yin
- Affiliation: Google Research, University of Washington
- Released: 2023.06
- Venue: CVPR 2023

Title: LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation
- Authors: Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, Mu Li
- Affiliation: Shanghai AI Lab
- Released: 2023.02
- Venue: arXiv

Title: LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
- Authors: Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
- Affiliation: Zhejiang University, Tencent AI Lab
- Released: 2023.06
- Venue: CVPR 2023

Title: GLIGEN: Open-Set Grounded Text-to-Image Generation
- Authors: Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee
- Affiliation: Microsoft Research, UC Davis
- Released: 2023.01
- Venue: arXiv

Title: ReCO: Region-Controlled Text-to-Image Generation
- Authors: Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng
- Affiliation: Microsoft Research, University of Washington
- Released: 2023.06
- Venue: CVPR 2023

Title: Multidiffusion: Fusing Diffusion Paths for Controlled Image Generation
- Authors: Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel
- Affiliation: Weizmann Institute of Science
- Released: 2023.02
- Venue: arXiv

Title: Training-free Layout Control with Cross-Attention Guidance
- Authors: Minghao Chen, Iro Laina, Andrea Vedaldi
- Affiliation: University of Oxford
- Released: 2023.04
- Venue: arXiv

Title: BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
- Authors: Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou
- Affiliation: National University of Singapore, Tencent Jarvis Lab
- Released: 2023.07
- Venue: arXiv

Title: InstructPix2Pix: Learning to Follow Image Editing Instructions
- Authors: Tim Brooks, Aleksander Holynski, Alexei A. Efros
- Affiliation: UC Berkeley
- Released: 2023.02
- Venue: arXiv

Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Authors: Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan
- Affiliation: Microsoft Research Asia
- Released: 2023.03
- Venue: arXiv

Title: Visual Programming: Compositional Visual Reasoning Without Training
- Authors: Tanmay Gupta, Aniruddha Kembhavi
- Affiliation: Allen Institute for AI
- Released: 2023.06
- Venue: CVPR 2023

3. LLM-grounded Diffusion

Two stages:

Text-grounded layout generation (Section 3.1)

Layout-grounded image generation (Section 3.2)

3.1 LLM-based Layout Generation

Input : text prompt $y$

Layout Representation

The LMD layout representation has two components:

1) A caption and coordinates (x, y, width, height) for the bounding box of each object

2) A concise background caption and a negative prompt
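The two components above map naturally onto a small data structure. The following is a minimal sketch, not the authors' code: the class names `Box` and `Layout`, the 512-pixel canvas, the top-left origin for (x, y), and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

CANVAS_SIZE = 512  # assumed square canvas size in pixels (hypothetical default)


@dataclass
class Box:
    """One object: a caption plus (x, y, width, height), with (x, y) assumed top-left."""
    caption: str
    x: int
    y: int
    width: int
    height: int

    def xyxy(self) -> Tuple[int, int, int, int]:
        """Return the box as (left, top, right, bottom)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def fits_canvas(self, size: int = CANVAS_SIZE) -> bool:
        """True if the box has positive area and lies fully inside the canvas."""
        x0, y0, x1, y1 = self.xyxy()
        return 0 <= x0 < x1 <= size and 0 <= y0 < y1 <= size


@dataclass
class Layout:
    """Per-object boxes plus a background caption and a negative prompt."""
    boxes: List[Box]
    background_prompt: str
    negative_prompt: str = ""


# Hypothetical layout for a prompt such as "a gray cat and an orange dog on the grass".
example = Layout(
    boxes=[
        Box("a gray cat", 60, 240, 180, 160),
        Box("an orange dog", 280, 200, 200, 220),
    ],
    background_prompt="A realistic photo of a grassy yard",
)
```

A validity check such as `fits_canvas` is useful in practice because an LLM can return boxes that spill outside the canvas.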

Instructions

1) Task specification: tells the LLM what task to perform

2) Supporting details: supplies additional details about the task

These serve as default instructions intended for general-purpose use.

In-context Learning

Examples are provided to remove ambiguity.

LLM Completion

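After the examples, the LLM is asked to complete the layout for the given prompt, producing the captions and object specifications as instructed.

The paper gives concrete examples of this step in Appendix K.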
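The instruction, in-context example, and completion steps can be sketched as plain string assembly followed by parsing. This is only an illustration: the function names `build_prompt` and `parse_response` are hypothetical, and the `Objects:` / `Background prompt:` / `Negative prompt:` response format is an assumption about what the LLM is asked to return, so the paper's appendix should be consulted for the actual prompt.

```python
import ast
from typing import Dict, List, Tuple


def build_prompt(instructions: str, examples: List[Tuple[str, str]], user_prompt: str) -> str:
    """Concatenate instructions, in-context examples, and the caption to complete."""
    parts = [instructions.strip()]
    for caption, layout_text in examples:
        parts.append(f"Caption: {caption}\n{layout_text.strip()}")
    # The trailing "Objects: " cues the LLM to continue with the layout.
    parts.append(f"Caption: {user_prompt}\nObjects: ")
    return "\n\n".join(parts)


def parse_response(text: str) -> Dict[str, object]:
    """Parse the assumed three-line response format into a dictionary."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return {
        # literal_eval only accepts Python literals, so it does not execute code.
        "objects": ast.literal_eval(fields.get("objects") or "[]"),
        "background_prompt": fields.get("background prompt", ""),
        "negative_prompt": fields.get("negative prompt", ""),
    }


# Hypothetical LLM output used only to exercise the parser.
sample_response = (
    "Objects: [('a gray cat', [67, 243, 120, 126]), ('an orange dog', [265, 193, 190, 210])]\n"
    "Background prompt: A realistic image of a grassy area\n"
    "Negative prompt: "
)
parsed = parse_response(sample_response)
```

Keeping at least one in-context example in `examples` helps the response stay in a format the parser can read.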

3.2 Layout-grounded Stable Diffusion

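Stage 2: a controller for layout-grounded image generation

Prior work applied semantic guidance through region-wise denoising or attention manipulation, but this is limited because individual objects are hard to distinguish in latent space.

LMD instead uses the masked latents of each bounding box as priors so that the full image can be generated coherently.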

Per-box masked latents.

Goal of this step

For each object in the LLM-generated layout, extract its cross-attention map at every denoising iteration.

When extracting features for the object in a box, the energy function is built to strengthen attention inside the box and weaken it outside the box.
(A in Eq. 2 is the cross-attention map.)
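To make the inside/outside idea concrete, here is a minimal NumPy sketch of an energy of this kind for a single token's attention map. It is not the authors' implementation: the function names, the top-k averaging with `k=4`, and the weight `omega=4.0` are assumptions chosen for illustration.

```python
import numpy as np


def topk_mean(values: np.ndarray, k: int) -> float:
    """Mean of the k largest entries of an array (flattened)."""
    flat = np.sort(values.ravel())[::-1]
    return float(flat[:k].mean())


def box_energy(attn: np.ndarray, box_mask: np.ndarray, k: int = 4, omega: float = 4.0) -> float:
    """Low when attention is strong inside the box and weak outside it.

    attn:     (H, W) cross-attention map for one token.
    box_mask: (H, W) binary mask, 1 inside the bounding box and 0 outside.
    """
    inside = topk_mean(attn * box_mask, k)
    outside = topk_mean(attn * (1.0 - box_mask), k)
    return -inside + omega * outside


# Toy 4x4 example: the box covers the top-left 2x2 region.
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0

attn_inside = np.zeros((4, 4))
attn_inside[:2, :2] = 1.0   # attention concentrated inside the box

attn_outside = np.zeros((4, 4))
attn_outside[2:, 2:] = 1.0  # attention concentrated outside the box
```

In this toy case `box_energy(attn_inside, mask)` is lower than `box_energy(attn_outside, mask)`, which is the direction the optimization pushes toward.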

As in Eq. 3, the latent is updated before each denoising step to minimize the energy function.
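Written out, the energy and the update it drives take roughly the following form. This is a sketch in assumed notation ($b^{(i)}$ is the binary mask of box $i$, $u$ indexes spatial locations, $v$ indexes the tokens $V_i$ of the box caption, $\omega$ weights the outside term, $\eta$ is the step size); the paper should be consulted for the exact definitions of Eq. 2 and Eq. 3.

```latex
\begin{align}
E\bigl(A^{(i)}, i, v\bigr) &= -\operatorname{Topk}_u\bigl(A_{uv} \cdot b^{(i)}\bigr)
  + \omega \operatorname{Topk}_u\bigl(A_{uv} \cdot (1 - b^{(i)})\bigr) \\
z_t^{(i)} &\leftarrow z_t^{(i)} - \eta \, \nabla_{z_t^{(i)}} \sum_{v \in V_i} E\bigl(A^{(i)}, i, v\bigr)
\end{align}
```

The first term rewards attention inside the box, the second penalizes attention outside it, and the update is a plain gradient step on the latent rather than on any model weight, which is why no training is required.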

Example)

If the caption of the i-th box is "a gray cat", the text prompt used when denoising that box is "[background prompt] with a gray cat".
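The per-box prompt rule in this example is simple enough to express as a one-line helper. The function name `per_box_prompt` and the sample background prompt are hypothetical.

```python
def per_box_prompt(background_prompt: str, box_caption: str) -> str:
    """Combine the background prompt and a box caption as '<background> with <caption>'."""
    return f"{background_prompt} with {box_caption}"


# Hypothetical background prompt, used only for illustration.
prompt_for_cat = per_box_prompt("A realistic photo of a grassy yard", "a gray cat")
```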

Refined Mask

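A refined mask is extracted from the cross-attention map, either with SAM (optional) or with a simple threshold → the mask is then multiplied element-wise with the per-box latents to obtain masked latents.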
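The threshold variant can be sketched in a few lines of NumPy. This is an illustrative assumption rather than the paper's exact procedure: the min-max normalization, the default threshold of 0.5, and the names `threshold_mask` and `apply_mask` are all hypothetical, and the SAM path is not shown.

```python
import numpy as np


def threshold_mask(attn: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Binarize an (H, W) attention map after min-max normalization to [0, 1]."""
    lo, hi = float(attn.min()), float(attn.max())
    if hi <= lo:
        # A constant map carries no localization signal; return an empty mask.
        return np.zeros_like(attn, dtype=np.float32)
    normalized = (attn - lo) / (hi - lo)
    return (normalized >= threshold).astype(np.float32)


def apply_mask(mask: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Element-wise product of an (H, W) mask with (C, H, W) features via broadcasting."""
    return features * mask


attn_map = np.array([[0.1, 0.9],
                     [0.2, 0.8]])
mask = threshold_mask(attn_map)
masked = apply_mask(mask, np.ones((4, 2, 2), dtype=np.float32))
```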

Masked Latents as Priors for Instance-level Control

The latents extracted at each step for the object in each box are then used as guidance, including for the attention, when generating the full image.
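One way to picture how masked latents act as priors is as a paste operation: inside each object's mask, the full-image latent is replaced by that object's latent from the same step. The sketch below assumes this composition rule and uses a hypothetical function name, `compose_latents`; the paper's full algorithm also involves attention guidance that is not shown here.

```python
import numpy as np
from typing import Sequence


def compose_latents(
    z: np.ndarray,
    instance_latents: Sequence[np.ndarray],
    masks: Sequence[np.ndarray],
) -> np.ndarray:
    """Paste each per-box latent into the full latent inside its mask.

    z:                (C, H, W) latent of the full image at one denoising step.
    instance_latents: per-box latents, each (C, H, W), taken at the same step.
    masks:            per-box binary masks, each (H, W).
    """
    out = z.copy()
    for z_i, m_i in zip(instance_latents, masks):
        out = out * (1.0 - m_i) + z_i * m_i
    return out


# Toy example: one box whose mask covers a single latent position.
z_full = np.zeros((1, 2, 2))
z_box = np.ones((1, 2, 2))
m_box = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
composed = compose_latents(z_full, [z_box], [m_box])
```

Later boxes overwrite earlier ones where masks overlap, so the order of `instance_latents` matters in this sketch.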

Full algorithm (Appendix B)

Integration with Training-based Methods

LMD+ = Controller adaptor + GLIGEN

3.3 Additional Capabilities of LMD

Instruction-based scene specification

Because only the layout is updated, consistency is preserved and requests that involve spatial reasoning can be accommodated.

Successive requests can also be handled.

Supporting more languages

Prompts in a variety of languages are supported.

4. Evaluation

4.1 Qualitative Comparison

Setup

Base model : SDXL

For comparison with other models, LMD uses SD 1.5.

Comparing with other LLM-based image generators

4.2 Quantitative Evaluation

4.2.1 Proposed Benchmark

Four Tasks: Negation, Generative Numeracy, Attribute Binding, Spatial Reasoning

100 prompts and queries per task, for 400 prompts in total

Detection-based Evaluation

OWL-ViT, open-vocabulary object detector
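Detection-based scoring reduces to checking detector output against what the prompt asked for. The sketch below assumes detections arrive as (label, score) pairs and uses a hypothetical confidence cutoff of 0.5; it does not call OWL-ViT itself, and the pass criteria are illustrative rather than the benchmark's exact rules.

```python
from collections import Counter
from typing import Iterable, Tuple

Detection = Tuple[str, float]  # (label, confidence score), an assumed format


def count_objects(detections: Iterable[Detection], min_score: float = 0.5) -> Counter:
    """Count detections per label, ignoring those below the confidence cutoff."""
    return Counter(label for label, score in detections if score >= min_score)


def passes_numeracy(detections: Iterable[Detection], label: str, expected: int,
                    min_score: float = 0.5) -> bool:
    """True if exactly `expected` confident instances of `label` were detected."""
    return count_objects(detections, min_score)[label] == expected


def passes_negation(detections: Iterable[Detection], absent_label: str,
                    min_score: float = 0.5) -> bool:
    """True if no confident instance of `absent_label` was detected."""
    return count_objects(detections, min_score)[absent_label] == 0


# Hypothetical detector output for one generated image.
detections = [("cat", 0.9), ("cat", 0.8), ("dog", 0.3)]
```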

Results.

4.3 Ablation Study

Layout-to-Image Stage

Comparing with other layout-to-image methods.

Switching the base diffusion model without hyperparameter tuning. (Table 3)

Using SAM vs a simple attention threshold to obtain the per-box mask (Table 4)

Text-to-layout stage.

Ablating in-context examples (Table 5)

- GPT-4 generates good layouts even without examples, but without them the output may not follow the desired format, so at least one example is recommended

model types and the sizes of the LLMs (Appendix D, E)

4.4 T2I-CompBench (Table 6)

4.5 Evaluator-based Assessment

MOS format

10 text prompts are selected at random

Images are generated for each text prompt

Evaluators answer the following two questions

1. Question 1: Which image aligns better with the text prompt?
2. Question 2: Which image has a more natural and coherent foreground-background composition?

5. Discussions

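Ambiguous layouts:
- Problem: A layout generated by the LLM can be ambiguous to the diffusion model. For example, a layout intended for a top-down view may be interpreted as a side view.
- Solution: Prompting the LLM more explicitly or fine-tuning it so that the assumptions behind the layout (for example, the viewpoint) are stated clearly can mitigate this problem. This keeps the layouts produced by the LLM consistent with the diffusion model.

Inherited biases of the base model:
- Problem: The method inherits the biases of the base diffusion model and may therefore treat certain objects disproportionately.
- Solution: Additional data augmentation and bias-mitigation techniques can reduce the model's bias. Using more diverse data to improve generalization is another option.

Dependence on in-context examples:
- Problem: The LLM may generate better layouts for objects that are mentioned in the in-context examples.
- Solution: Providing diverse in-context examples helps the LLM generate consistent layouts for a wider range of objects. Continually training or fine-tuning the LLM can also improve its ability to lay out new objects.

Ease of deployment:
- Problem: To improve prompt understanding without calling an LLM at inference time, the pipeline would need to be distilled into a single-stage text-to-image diffusion model.
- Solution: Distilling into a single-stage model that does not use an LLM makes deployment easier. This also makes the model lighter and more suitable for real-time applications.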