[Paper Summary] InstructPix2Pix (CVPR 2023) "InstructPix2Pix: Learning to Follow Image Editing Instructions"

Paper Information

Citations: 34 as of Monday, May 1, 2023

Authors

Tim Brooks, Aleksander Holynski, Alexei A. Efros - University of California, Berkeley

Paper Links

Official

arXiv

https://arxiv.org/abs/2211.09800


Paper Summary

Abstract

https://instruct-pix2pix.timothybrooks.com/instruct-pix2pix.mp4

0. Overview Before the Walkthrough

Two pretrained models, a language model (GPT-3) and a text-to-image model (Stable Diffusion), are combined to build the training data, and InstructPix2Pix, a conditional diffusion model, is trained on that generated data. At inference time it generalizes to real images and user-written instructions.

No additional per-example fine-tuning or inversion is required, so editing is fast.

1. Introduction

The paper proposes a method for training a generative model that follows human-written instructions for image editing.

The paired training data is generated by combining two pretrained models: a language model (GPT-3) and a text-to-image model (Stable Diffusion).

A conditional diffusion model is then trained on the generated paired data.

Because the model performs the edit in a forward pass, it needs no additional example images and no per-example fine-tuning.

Because the model achieves zero-shot generalization to arbitrary real images and natural human-written instructions, it supports a wide variety of edits.

2. Prior work

Composing large pretrained models

The pretrained models GPT-3 and Stable Diffusion are composed to generate paired multi-modal training data.

Diffusion-based generative models

Diffusion models have been used for generation across a variety of modalities.

Generative models for image editing

The model takes a single image and an instruction describing how to modify it, and performs the edit.

The edit runs in the forward pass, with no user-drawn mask, no additional images, and no per-example inversion or fine-tuning.

Learning to follow instructions

Training data generation with generative models

The authors use two generative models (language and text-to-image) to generate the training data.

3. Methods

(1) Generate a paired training dataset

(2) Train an image editing diffusion model on the generated dataset

Although it is trained on generated images and editing instructions, the model generalizes to editing real images with arbitrary human-written instructions.

-> That is, when a person supplies an instruction and an image, the model produces an edited image that matches the instruction.
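
The two steps above can be pictured as a small data-flow sketch. This is only an illustration under stated assumptions, not the authors' code: `EditExample`, `build_dataset`, `write_edit`, and `render_pair` are hypothetical names, and the two toy functions stand in for GPT-3 and Stable Diffusion so that the snippet runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EditExample:
    """One supervised example: turn `before` into `after` by following `instruction`."""
    before: object      # input image (any image type; a string placeholder here)
    instruction: str    # edit instruction in natural language
    after: object       # edited image


def build_dataset(
    captions: List[str],
    write_edit: Callable[[str], Tuple[str, str]],
    render_pair: Callable[[str, str], Tuple[object, object]],
) -> List[EditExample]:
    """Step (1): build paired training data from captions alone."""
    examples = []
    for caption in captions:
        # The language model proposes an instruction and an edited caption.
        instruction, edited_caption = write_edit(caption)
        # The text-to-image model renders a before/after image pair.
        before, after = render_pair(caption, edited_caption)
        # Captions are dropped: the editing model only sees images and the instruction.
        examples.append(EditExample(before, instruction, after))
    return examples


# Toy stand-ins (assumptions) for GPT-3 and Stable Diffusion.
def toy_write_edit(caption: str) -> Tuple[str, str]:
    return "make it snowy", caption + " in the snow"


def toy_render_pair(caption: str, edited_caption: str) -> Tuple[str, str]:
    return f"<image: {caption}>", f"<image: {edited_caption}>"


dataset = build_dataset(["a photo of a cabin"], toy_write_edit, toy_render_pair)
print(dataset[0])
```

Step (2) would then train the editing model on these `(before, instruction, after)` examples; at inference time only a real image and a human-written instruction are needed.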

3.1 Generating a Multi-modal Training Dataset

3.1.1 Generating Instructions and Paired Captions

As shown in Figure 2 (a), GPT-3 is fine-tuned so that, given an input caption, it produces an edit instruction and an edited caption.

This requires three pieces of text: the input caption, the edit instruction, and the output caption (a supervised setup).

Accordingly, the training data consists of 700 input captions, each paired with a human-written instruction and output caption.

The GPT-3 Davinci model is fine-tuned for 1 epoch.

-> The fine-tuned model then generates creative yet sensible instructions and captions for a given input caption.
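
To make the supervised format concrete, here is a minimal sketch of how such triplets might be serialized for fine-tuning. The `prompt`/`completion` keys follow the JSONL layout of OpenAI's legacy fine-tuning format; the delimiters `PROMPT_END`, `FIELD_SEP`, and `STOP` and the helper names are assumptions for illustration, since the summary does not state the exact format. The example triplet is illustrative, in the style of the paper's Figure 2.

```python
import json
from typing import Dict, Iterable, Tuple

# Hypothetical delimiters (assumptions, not taken from the summary).
PROMPT_END = "\n##\n"   # marks the end of the input caption
FIELD_SEP = "\n%%\n"    # separates instruction from output caption
STOP = "\nEND"          # stop sequence that ends the completion

Triplet = Tuple[str, str, str]  # (input caption, instruction, output caption)


def to_record(input_caption: str, instruction: str, output_caption: str) -> Dict[str, str]:
    """Turn one human-written triplet into a prompt/completion record."""
    return {
        "prompt": input_caption + PROMPT_END,
        "completion": " " + instruction + FIELD_SEP + output_caption + STOP,
    }


def parse_completion(completion: str) -> Tuple[str, str]:
    """Recover (instruction, output caption) from generated completion text."""
    body = completion.split(STOP)[0].strip()
    instruction, output_caption = body.split(FIELD_SEP, 1)
    return instruction.strip(), output_caption.strip()


def to_jsonl(triplets: Iterable[Triplet]) -> str:
    """Serialize triplets as JSON Lines, one record per line."""
    return "\n".join(json.dumps(to_record(*t)) for t in triplets)


example = (
    "photograph of a girl riding a horse",
    "have her ride a dragon",
    "photograph of a girl riding a dragon",
)
print(to_jsonl([example]))
```

After fine-tuning, the same `parse_completion` helper could split each generated completion back into an instruction and an edited caption.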

Input captions are taken from the LAION-Aesthetics dataset, which is large and covers diverse content and a variety of mediums.

LAION's drawback of noisy data is mitigated through dataset filtering and classifier-free guidance.

This yields a corpus of 454,445 instruction-and-caption examples.
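
The summary does not spell out how the dataset filtering works. One plausible form, sketched below as an assumption, is to score each example with CLIP-style embeddings and keep only those that pass similarity thresholds. The embeddings here are plain lists of floats standing in for real CLIP features, and the function names and threshold defaults are placeholders for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def directional_similarity(img_before: Vector, img_after: Vector,
                           cap_before: Vector, cap_after: Vector) -> float:
    """Does the change between images point the same way as the change between captions?"""
    image_delta = [b - a for a, b in zip(img_before, img_after)]
    text_delta = [b - a for a, b in zip(cap_before, cap_after)]
    return cosine(image_delta, text_delta)


def keep_pair(img_before: Vector, img_after: Vector,
              cap_before: Vector, cap_after: Vector,
              min_image_image: float = 0.75,   # placeholder threshold
              min_image_text: float = 0.2,     # placeholder threshold
              min_directional: float = 0.2) -> bool:  # placeholder threshold
    """Keep an example only if every similarity check passes."""
    return (
        cosine(img_before, img_after) >= min_image_image
        and cosine(img_before, cap_before) >= min_image_text
        and cosine(img_after, cap_after) >= min_image_text
        and directional_similarity(img_before, img_after, cap_before, cap_after) >= min_directional
    )


# Toy 2-D embeddings: the edit moves image and caption in the same direction.
print(keep_pair([1.0, 0.0], [0.8, 0.6], [1.0, 0.0], [0.8, 0.6]))
```

The design idea is that noisy examples tend to fail at least one check: the two images differ too much, an image does not match its caption, or the visual change does not follow the textual change.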

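3.1.2 Generating Paired Images from Paired Captions

As shown in Figure 2 (b), Stable Diffusion is used to turn the input caption and the edited caption into a pair of images. Generating the two images independently does not guarantee that they stay consistent with each other, so Prompt-to-Prompt is used to preserve image consistency.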