[Paper Summary] CharNet (ICCV 2019) "Convolutional Character Networks"

Paper Information

Citations: 125 as of Sunday, 2023.04.23

Authors

Linjie Xing, Zhi Tian, Weilin Huang, Matthew R. Scott

Paper Links

Official

https://openaccess.thecvf.com/content_ICCV_2019/html/Xing_Convolutional_Character_Networks_ICCV_2019_paper.html

arXiv

https://arxiv.org/abs/1910.07954

Paper Summary

Abstract

0. Overview Before the Explanation

1. Introduction

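Reading text from an image takes two steps:

text detection & text recognition

Text detection aims to predict a bounding box for each text instance, and it relies on object detection techniques.

Text recognition aims to recognize a sequence of character labels from a cropped image patch, typically by applying a recurrent model to features extracted by a CNN.

However, this two-step pipeline comes with several limitations.

1) Learning the two tasks independently leads to a sub-optimization problem, which makes it hard to fully exploit the nature of text.

2) Text recognition performance depends heavily on text detection performance.

Recent work has tried to build unified frameworks that perform text detection and recognition together, but these have limitations as well.

- It is difficult to feed patches obtained by RoI cropping or pooling into an RNN model.

The authors therefore argue that CNN-based character recognition is more efficient than an RNN-based sequential model.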

Contribution

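1) The paper proposes CharNet, a one-stage model that performs text detection and recognition simultaneously. Using the character as the basic unit overcomes the limitations of two-stage approaches.

2) An iterative character detection method makes it possible to train CharNet without additional char-level bounding box annotations.

3) CharNet achieves higher quantitative results than existing two-stage approaches.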

2. Related work

Text detection

Text recognition

End-to-end (E2E) text recognition

3. Convolutional Character Networks

3.1 Overview

The character branch performs character detection and recognition.

The text detection branch predicts a bounding box for each text instance in the image.

Training requires both instance-level and char-level bounding boxes.

Backbone

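ResNet-50 and Hourglass networks are used as backbones.

With ResNet-50, the feature maps with a 4x down-sampling ratio are used, which makes it possible to identify extremely small text instances.

Two Hourglass variants are used: Hourglass-88 and Hourglass-57.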

3.2 Character Branch

In RNN-based text recognition, char-level attention mechanisms gave higher performance than word-level optimization, and this inspired the present work.

Building on this observation, the authors introduce a new character branch.

The character branch uses the character as the basic unit for both detection and recognition.

It is a stack of several conv layers.

Its input feature map has 1/4 the resolution of the input image, and it consists of three sub-branches.

1) text instance segmentation 2) character detection 3) character recognition

1) text instance segmentation and 2) character detection each use 3 conv layers, with filter sizes 3x3, 3x3, 1x1.

3) character recognition uses 4 conv layers, with one additional 3x3 filter.

1) text instance segmentation

- 2 output channels (text or non-text probability at each spatial location)

2) character detection

- 5 parameters per character bounding box (distances to the top, bottom, left, and right sides, plus orientation)

3) character recognition

- 68 character classes (26 English letters, 10 digits, 32 special symbols)
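The channel counts listed above can be tabulated in a short sketch. The names `HEADS`, `ALPHABET`, and `head_output_shapes` are illustrative and not taken from the official code. Using lowercase letters and Python's `string.punctuation` as the 32 special symbols is only an assumption that happens to match the count, and the order of the filters in the recognition head is assumed as well.

```python
import string

# Hypothetical summary of the three sub-branches described above.
# "kernels" lists conv filter sizes in order; "out_channels" is the output depth.
HEADS = {
    "text_instance_segmentation": {"kernels": [3, 3, 1], "out_channels": 2},
    "character_detection": {"kernels": [3, 3, 1], "out_channels": 5},
    "character_recognition": {"kernels": [3, 3, 3, 1], "out_channels": 68},
}

# Assumed alphabet: 26 letters + 10 digits + 32 symbols = 68 classes.
# The paper's exact symbol set may differ from string.punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation


def head_output_shapes(image_h, image_w, stride=4):
    """Return (channels, height, width) per head for a given input size."""
    h, w = image_h // stride, image_w // stride
    return {name: (spec["out_channels"], h, w) for name, spec in HEADS.items()}
```

For example, `head_output_shapes(512, 768)` gives every head a 128 x 192 map, differing only in channel count.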

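All sub-branches produce outputs at the same resolution. Char-level bounding boxes are generated by keeping those with a confidence value above 0.95.

The label of each generated bounding box is the class with the maximum softmax score at the corresponding location of the character recognition output.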
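A minimal sketch of this decoding step, assuming the candidate boxes have already been collected into a list. The dictionary keys and the function name `decode_characters` are hypothetical, and any further filtering of overlapping boxes is left out.

```python
def decode_characters(candidates, alphabet, threshold=0.95):
    """Keep confident character boxes and label each by its top softmax class.

    candidates: list of dicts with hypothetical keys
      "score": confidence of the box,
      "box":   any box representation (passed through unchanged),
      "probs": softmax scores over the character classes.
    """
    results = []
    for cand in candidates:
        if cand["score"] <= threshold:
            continue  # drop low-confidence boxes
        probs = cand["probs"]
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax index
        results.append(
            {"box": cand["box"], "char": alphabet[best], "prob": probs[best]}
        )
    return results
```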

3.3 Text Detection Branch

When multiple text instances lie close together, or when text is curved and has multiple orientations, grouping characters directly becomes complicated and heuristic.

A text detector is therefore used, with minimal modifications, to detect curved text lines and multi-oriented words.

Multi-Orientation Text

The EAST detector is used with slight modifications.

It consists of two sub-branches: text instance segmentation, and instance-level bounding box regression trained with an IoU loss.
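As a rough illustration of an IoU loss for box regression, the sketch below uses the common -log(IoU) form for two axis-aligned boxes that are both described by distances from the same pixel. This particular form and the name `iou_loss` are assumptions; the summary does not state which variant the paper uses.

```python
import math


def iou_loss(pred, target, eps=1e-6):
    """-log(IoU) for two boxes given as (top, bottom, left, right) distances
    measured from the same pixel. Orientation is ignored in this sketch."""
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    # Both boxes contain the shared pixel, so the overlap on each side
    # is the smaller of the two distances.
    inter_h = min(pt, tt) + min(pb, tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter = inter_h * inter_w
    union = area_pred + area_target - inter
    return -math.log((inter + eps) / (union + eps))
```

The loss is zero when the two boxes coincide and grows as their overlap shrinks.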

output: 2 channels (text or non-text probability) + 5 channels (4 bounding box scalars + orientation angle)
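To make the 5-channel geometry concrete, here is a hedged sketch of turning one pixel's prediction into four box corners. It assumes the four scalars are distances from the pixel to the box sides along the box's own axes and that the angle rotates those axes around the pixel. Sign and ordering conventions differ between implementations, so `decode_rotated_box` is illustrative only.

```python
import math


def decode_rotated_box(x, y, top, bottom, left, right, angle):
    """Return four corners (top-left, top-right, bottom-right, bottom-left
    in the box's own frame) for a box described relative to pixel (x, y).

    Assumed convention: image coordinates with y pointing down, distances
    measured along the box axes, `angle` in radians.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Corner offsets in the box's own (unrotated) frame.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    corners = []
    for dx, dy in local:
        # Rotate the offset by `angle`, then translate to the pixel position.
        corners.append((x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a))
    return corners
```

With `angle = 0` this reduces to an ordinary axis-aligned box. If the prediction lives on a 1/4-resolution map, the resulting coordinates would still need to be scaled back to image size.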

Curved Text

TextField is used with slight modifications.

3.4 Iterative Character Detection

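The model needs char-level and word-level bounding boxes together with character labels.

However, char-level bounding boxes are hard and expensive to obtain, and are often not available at all.

The authors therefore use iterative character detection, which relies on synthetic data to learn to locate characters.

It is a weakly-supervised approach.

However, as Table 1 shows, training directly on synthetic images and then running inference on real-world images is difficult.

In general, a text detector generalizes relatively better than a text recognizer.

The detector's ability to find characters, learned from the char-level annotations of synthetic data, is therefore transferred gradually to real-world images.

Assumption

The character bounding boxes in a text instance are considered "correct" if their number exactly matches the number of character labels in the provided instance-level transcript.

Only the count is checked, because also checking whether the characters themselves match did not help performance.

Over several iterations the generalization ability improves gradually, so more and more correct char-level bounding boxes are generated.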
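The count-matching rule and the iteration loop can be sketched as follows. Everything here is a simplified assumption: `train` and `detect` are placeholder callables supplied by the caller, the dictionary keys are hypothetical, and whether synthetic data is mixed back in during later steps is not specified in this summary.

```python
def select_correct_instances(instances, detect_chars):
    """Keep instances whose detected box count equals the transcript length.

    instances:    list of dicts with hypothetical keys "image" and "transcript".
    detect_chars: callable returning a list of character boxes for an image.
    """
    selected = []
    for inst in instances:
        boxes = detect_chars(inst["image"])
        if len(boxes) == len(inst["transcript"]):
            # Count matches: treat these boxes as char-level pseudo ground truth.
            selected.append({**inst, "char_boxes": boxes})
    return selected


def iterative_character_detection(model, synthetic_data, real_data,
                                  train, detect, num_steps=4):
    """Alternate between pseudo-labeling real data and retraining.

    train(model, data) -> model and detect(model, image) -> boxes are
    placeholders for real training and inference routines.
    """
    model = train(model, synthetic_data)  # start from synthetic char-level labels
    for _ in range(num_steps):
        pseudo = select_correct_instances(
            real_data, lambda image: detect(model, image)
        )
        model = train(model, pseudo)  # retrain on instances judged "correct"
    return model
```

As the detector improves, more real instances pass the count check, so each step trains on a larger pseudo-labeled set.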

4. Experiments, Results and Comparison

Benchmark datasets: ICDAR 2015, Total-Text, ICDAR MLT 2017

4.1 Implementation Details

Training runs for 4 iterative steps.

Synthetic data: training on Synth800k for 5 epochs.

4 images per GPU, with a mini-batch of 32 images.

learning rate = 0.0002

The learning rate decays according to a formula (see the paper).

Simple augmentation is applied.
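The settings above can be collected in a small config object. The class and field names are illustrative, and the learning rate decay is left out on purpose because its formula is not given here. The implied GPU count assumes no gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed above; field names are illustrative."""
    iterative_steps: int = 4
    synth_pretrain_epochs: int = 5
    images_per_gpu: int = 4
    batch_size: int = 32
    base_lr: float = 2e-4

    @property
    def num_gpus(self) -> int:
        # Implied device count, assuming no gradient accumulation.
        return self.batch_size // self.images_per_gpu
```

Under that assumption, a mini-batch of 32 with 4 images per GPU corresponds to 8 GPUs.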

4.2 On Iterative Character Detection

4.3 Results on Text Detection

4.4 Results on End-to-End Text Recognition

Reference

Official GitHub

https://github.com/msight-tech/research-charnet

License: Attribution, NonCommercial, ShareAlike

아이공의 AI 공부 도전기