[Paper Summary] MAE (Masked Autoencoder) (CVPR 2022) "Masked Autoencoders Are Scalable Vision Learners"

Paper Information

Citations: 2,950 as of Monday, October 2, 2023

Authors

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

Meta (FAIR)

Paper Links

Official

https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf

arXiv

https://arxiv.org/abs/2111.06377

Paper Summary

Abstract

0. Overview Before the Walkthrough

1. Introduction

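To cope with the shortage of labeled data, NLP has turned to self-supervised learning (SSL).

Examples include autoregressive language modeling, as in GPT, and masked autoencoding, as in BERT.

Masked autoencoding has also been explored in computer vision, but its progress lags behind NLP.

So what makes masked autoencoding different between vision and language?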

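Three perspectives

1) Architectural differences

Vision was long dominated by CNNs, which left an architectural gap: convolutions operate on regular grids, so it is not straightforward to integrate indicators such as mask tokens or positional embeddings. The recent introduction of ViT has largely closed this gap.

2) Differences in information density

Language is a highly semantic, information-dense signal, so predicting even a few masked words per sentence already requires sophisticated language understanding.

Images, in contrast, are signals with heavy spatial redundancy: a missing patch can often be recovered from its neighboring patches with little high-level understanding.

MAE therefore masks a very high portion of random patches, which reduces redundancy and yields a task that demands holistic understanding of the image.

3) The role of the autoencoder's decoder

In vision, the decoder reconstructs pixels, so its output is at a lower semantic level than common recognition tasks; the decoder design therefore plays a key role in determining the semantic level of the learned latent representations. In language, the decoder predicts missing words that carry rich semantic information, so it can be comparatively trivial (an MLP in BERT).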

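The authors propose a simple, effective, and scalable masked autoencoder (MAE) for visual representation learning.

The high masking ratio and the asymmetric design together reduce overall pre-training time by 3x or more and likewise reduce memory consumption, making it easy to scale MAE to large models.

It also achieves better results: a vanilla ViT-Huge reaches 87.8% top-1 accuracy when fine-tuned on ImageNet-1K, the best among methods that use only ImageNet-1K data.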

2. Related work

Masked Language modeling

BERT, GPT

Autoencoder

Denoising autoencoder (DAE)

Masked image encoding

Vision: DAE, Context Encoder, ViT, BEiT

NLP-inspired: Transformer-based methods such as iGPT

Self-supervised learning

pretext task, Contrastive Learning

3. Approach

A simple autoencoder architecture.

Unlike classical autoencoders, however, it adopts an asymmetric design.

the encoder to operate only on the partial, observed signal (without mask tokens)

a lightweight decoder that reconstructs the full signal from the latent representation and mask tokens.

Masking

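Following ViT, the image is divided into regular non-overlapping patches.

A subset of these patches is sampled at random without replacement, following a uniform distribution, and the remaining patches are masked (random sampling).

The high masking ratio largely eliminates redundancy and prevents the task from being solved by simply extrapolating from visible neighboring patches.

The uniform distribution prevents a potential center bias (more masked patches near the image center).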
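
As a rough illustration of the sampling step, here is a minimal standard-library sketch. The function name sample_visible_patches and the fixed seed are assumptions made for this example only; the 16x16 patch size matches the ViT-L/16 baseline used later in the post.

```python
import random

def sample_visible_patches(num_patches, mask_ratio, rng):
    """Pick visible patch indices uniformly at random, without replacement.

    Hypothetical helper for illustration; not taken from the official code.
    """
    num_visible = int(num_patches * (1 - mask_ratio))
    # random.Random.sample draws without replacement from the population.
    visible = sorted(rng.sample(range(num_patches), num_visible))
    visible_set = set(visible)
    masked = [i for i in range(num_patches) if i not in visible_set]
    return visible, masked

# A 224x224 image split into 16x16 patches gives a 14x14 grid (196 patches).
grid = 224 // 16
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
visible, masked = sample_visible_patches(grid * grid, mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))  # 49 147
```

Because every patch is equally likely to be kept, no region of the image is favored, which is the point of the uniform distribution.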

MAE encoder

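The encoder is a ViT.

It is applied only to the visible, unmasked patches.

As in a standard ViT, patches are embedded by a linear projection with added positional embeddings and then passed through a series of Transformer blocks.

The difference is that the encoder operates on only a small subset (e.g., 25%) of the full set of patches.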
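
To make the savings concrete, the arithmetic below counts encoder tokens for a 224x224 image with 16x16 patches. The helper name encoder_token_budget is hypothetical, and the quadratic estimate for self-attention is a rule of thumb about Transformers in general, not a figure reported in the paper.

```python
def encoder_token_budget(image_size, patch_size, mask_ratio):
    """Return patch count, visible count, and rough relative costs."""
    num_patches = (image_size // patch_size) ** 2
    num_visible = int(num_patches * (1 - mask_ratio))
    token_fraction = num_visible / num_patches
    # Self-attention compares every token with every other token, so its
    # cost grows roughly with the square of the sequence length.
    attention_fraction = token_fraction ** 2
    return num_patches, num_visible, token_fraction, attention_fraction

print(encoder_token_budget(224, 16, 0.75))  # (196, 49, 0.25, 0.0625)
```

Under these assumptions the encoder sees 49 of 196 tokens, which is why a very large encoder can be trained with a fraction of the compute and memory.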

MAE decoder

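The decoder takes the full set of tokens: the encoded visible patches and the mask tokens.

Positional embeddings are added to all tokens in this full set so that the mask tokens carry location information.

The MAE decoder is used only during pre-training, to perform the image reconstruction task.

Because the decoder is relatively lightweight, it consumes little compute and time.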

Reconstruction target

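The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch.

The loss function is the mean squared error (MSE) between the reconstructed and original images in pixel space, computed only on the masked patches.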
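
The loss can be sketched in plain Python as follows. This is a toy version over nested lists, assuming each patch is a flat list of pixel values and that a mask flag of 1 marks a patch hidden from the encoder; the name masked_mse is made up for this example.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each a flat list of pixel values.
    mask: list of 0/1 flags, where 1 means the patch was masked.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            # Per-patch MSE: average squared error over the patch's pixels.
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
            count += 1
    return total / count if count else 0.0

target = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
pred = [[9.0, 9.0], [2.0, 5.0], [3.0, 5.0]]
mask = [0, 1, 1]  # the first patch was visible, so its error is ignored
print(masked_mse(pred, target, mask))  # 1.25
```

The large error on the first (visible) patch does not affect the result, since only masked patches contribute.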

Simple implementation

The implementation is simple and needs no specialized sparse operations:

First we generate a token for every input patch (by linear projection with an added positional embedding).

Next we randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio.

After encoding, we append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets.

The decoder is applied to this full list (with positional embeddings added).
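
The shuffle and unshuffle steps above can be sketched with ordinary lists. This is a simplified stand-in, not the official PyTorch code: strings play the role of token vectors, the encoder is replaced by the identity, and the names random_masking, restore_full_sequence, and ids_restore are chosen for this example.

```python
import random

def random_masking(tokens, mask_ratio, rng):
    """Shuffle the tokens and keep only the first (1 - mask_ratio) portion."""
    n = len(tokens)
    len_keep = int(n * (1 - mask_ratio))
    ids_shuffle = list(range(n))
    rng.shuffle(ids_shuffle)
    # ids_restore[i] is the position of original token i in the shuffled list.
    ids_restore = [0] * n
    for shuffled_pos, original_idx in enumerate(ids_shuffle):
        ids_restore[original_idx] = shuffled_pos
    kept = [tokens[i] for i in ids_shuffle[:len_keep]]
    return kept, ids_restore

def restore_full_sequence(encoded, ids_restore, mask_token):
    """Append mask tokens, then invert the shuffle to restore patch order."""
    n = len(ids_restore)
    padded = encoded + [mask_token] * (n - len(encoded))
    return [padded[ids_restore[i]] for i in range(n)]

tokens = [f"patch{i}" for i in range(8)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
kept, ids_restore = random_masking(tokens, mask_ratio=0.75, rng=rng)
encoded = kept  # stand-in for the encoder output (identity in this sketch)
full = restore_full_sequence(encoded, ids_restore, "[MASK]")
print(full)
```

After unshuffling, every position holds either its own original token or a mask token, so the decoder input lines up with the reconstruction targets.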

4. ImageNet Experiments

a single 224x224 crop

Baseline: ViT-Large (ViT-L/16)

fine-tuning is only for 50 epochs (vs. 200 from scratch)

4.1 Main Properties

Masking ratio

The ratio of 75% is good for both linear probing and fine-tuning.

Decoder design

Table 1a : A sufficiently deep decoder is important for linear probing.

Table 1b : The default decoder uses 8 blocks and a width of 512-d.

Mask token

Table 1c : If the encoder uses mask tokens, it performs worse

by skipping the mask token in the encoder, we greatly reduce training computation.

Reconstruction target

Table 1d: Using pixels with normalization improves accuracy.
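
Normalizing the target means computing the mean and standard deviation of the pixels within each patch and using them to normalize that patch. A minimal sketch, assuming population variance and a small eps for numerical stability (both are choices made for this example, and normalize_patch is a hypothetical name):

```python
import math

def normalize_patch(patch, eps=1e-6):
    """Normalize one patch by its own pixel mean and standard deviation."""
    n = len(patch)
    mean = sum(patch) / n
    var = sum((x - mean) ** 2 for x in patch) / n  # population variance (assumed)
    return [(x - mean) / math.sqrt(var + eps) for x in patch]

normalized = normalize_patch([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

The normalized patch has approximately zero mean and unit variance, which removes per-patch brightness and contrast from the reconstruction target.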

Data augmentation

Table 1e : Our MAE works well using cropping-only augmentation, either fixed-size or random-size (both having random horizontal flipping). Adding color jittering degrades the results.

Mask sampling strategy

Training schedule

800-epoch pre-training

4.2 Comparison with Previous Results

Comparisons with self-supervised methods

Comparisons with supervised pre-training

4.3. Partial Fine-tuning

5. Transfer Learning Experiments

Reference

Official GitHub

https://github.com/facebookresearch/mae
