PR-120 "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design" Review (2018 ECCV)

1. Citations & Abstract

Citations: 1836 as of 2022.01.07

Authors

Ningning Ma - Megvii Inc, Tsinghua University

Xiangyu Zhang - Megvii Inc

Hai-Tao Zheng - Tsinghua University

Jian Sun - Megvii Inc

Abstract

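Currently, neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs. However, the direct metric, such as speed, also depends on other factors such as memory access cost and platform characteristics. This work therefore proposes evaluating the direct metric on the target platform, beyond only considering FLOPs. Based on a series of controlled experiments, it derives several practical guidelines for efficient network design. Accordingly, a new architecture called ShuffleNet V2 is presented. Comprehensive ablation experiments verify that the model is state-of-the-art in terms of the speed and accuracy tradeoff.

My reading: the paper proposes efficient network design by evaluating the direct metric itself rather than a proxy.

2. Presentation Summary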

https://youtu.be/lrU6uXiJ_9Y

Official paper links

https://openaccess.thecvf.com/content_ECCV_2018/html/Ningning_Light-weight_CNN_Architecture_ECCV_2018_paper.html

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, Jian Sun; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116-131

https://arxiv.org/abs/1807.11164

Presentation Slide

https://www.slideshare.net/JinwonLee9/pr120-shufflenet-v2-practical-guidelines-for-efficient-cnn-architecture-design

This is the 120th presentation of Season 2 of the Tensorflow-KR paper reading group, reviewing ShuffleNet V2.

Reference video (ShuffleNet v1): https://youtu.be/pNuBdj53Hbc

In MobileNet, which uses depthwise separable convolution, the 1x1 convolutions account for most of the computation.
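The dominance of the 1x1 convolution can be checked with a quick multiply-add count. The sketch below is illustrative only: the function name `depthwise_separable_flops` and the layer shape (14x14 feature map, 256 channels, 3x3 depthwise kernel) are assumptions chosen for the example, not values from the talk.

```python
def depthwise_separable_flops(h, w, c_in, c_out, k=3):
    """Multiply-add counts for one depthwise separable convolution.

    Assumes stride 1 and 'same' padding, so the output is still h x w.
    (Hypothetical helper for illustration.)
    """
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise, pointwise


# Assumed example shape: 14x14 feature map, 256 -> 256 channels.
dw, pw = depthwise_separable_flops(h=14, w=14, c_in=256, c_out=256)
share = pw / (dw + pw)
print(f"depthwise: {dw}, pointwise: {pw}, pointwise share: {share:.1%}")
```

With a 3x3 depthwise kernel the ratio is c_out : 9, so the pointwise share grows with the channel count.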

Main Ideas of ShuffleNet

1) Use Depthwise separable convolution

2) Grouped convolution on 1x1 convolution layers - pointwise group convolution

3) Channel shuffle operation after pointwise group convolution

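The channels are split into two groups that are computed separately, and the results are then concatenated (as in AlexNet).

As shown in (a), learning only within the same channel group does not work well.

(b) Channels are shuffled between the group convolutions.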

(b), (c) ShuffleNet
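The channel shuffle step is commonly written as a reshape, transpose, reshape sequence. The following is a minimal NumPy sketch under the assumption of an (N, C, H, W) layout; the function name `channel_shuffle` is illustrative and not taken from the authors' code.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave the channels of an (N, C, H, W) array across `groups` groups."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    # Split C into (groups, channels_per_group), swap the two axes, flatten back.
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)


# Six channels labelled 0..5 in two groups: [0, 1, 2] and [3, 4, 5].
x = np.arange(6).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(x, groups=2)
print(shuffled.reshape(-1).tolist())  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each group of the next group convolution receives channels from every group of the previous one.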

Reduced computation
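The saving from pointwise group convolution can be estimated by counting multiply-adds: with g groups, each group connects only c_in/g inputs to c_out/g outputs, so the cost falls by a factor of g. A small sketch, where the helper name and the 28x28x240 shape are assumptions for illustration:

```python
def pointwise_group_flops(h, w, c_in, c_out, groups=1):
    """Multiply-adds of a 1x1 group convolution on an h x w feature map.

    (Hypothetical helper for illustration.)
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    per_group = (c_in // groups) * (c_out // groups)
    return h * w * per_group * groups


# Assumed example shape: 28x28 feature map, 240 -> 240 channels.
base = pointwise_group_flops(28, 28, 240, 240, groups=1)
for g in (1, 2, 3, 4, 8):
    flops = pointwise_group_flops(28, 28, 240, 240, groups=g)
    print(f"g={g}: {flops} multiply-adds ({base // flops}x fewer than g=1)")
```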

ShuffleNet V2

Motivation

1) Currently, the neural network architecture design is mostly guided by the indirect metric of computation complexity, i.e., FLOPs.

2) However, the direct metric, e.g., speed, also depends on the other factors such as memory access cost and platform characteristics.

3) Indirect metric is usually not equivalent to the direct metric that we really care about, such as speed or latency.

4) For example, MobileNet V2 is much faster than NASNET-A but they have comparable FLOPs.

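An indirect metric such as FLOPs is not proportional to a direct metric such as speed.

Relying on FLOPs alone therefore causes problems in practice.

(d) Networks with similar FLOPs show large differences in speed (the point made above).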

Two Reasons of the Discrepancy

1) First, several important factors that have considerable affection on speed are not taken into account by FLOPs. One such factor is memory access cost (MAC).
If work is not delivered to the compute units quickly enough, they sit idle.

2) Second, operations with the same FLOPs could have different running time, depending on the platform.

Hardware Platform for Experiments

1) GPU - A single NVIDIA GeForce GTX 1080 Ti is used. The convolution library is CUDNN 7.0

2) ARM - A Qualcomm Snapdragon 810 is used. We use a highly-optimized Neon-based implementation. A single thread is used for evaluation.

Analysis of the Runtime Performance

The FLOPs metric only accounts for the convolution part.

The paper then provides guidelines for designing networks that are fast in practice.

Guide 1. Equal channel width minimizes memory access cost (MAC).

MAC has a lower bound given by FLOPs. It reaches the lower bound when the numbers of input and output channels are equal.
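For a 1x1 convolution on an h x w feature map with c1 input and c2 output channels, the paper's analysis gives FLOPs B = h·w·c1·c2 and MAC = h·w·(c1 + c2) + c1·c2, which by the AM-GM inequality satisfies MAC ≥ 2·sqrt(h·w·B) + B/(h·w). The sketch below checks this numerically; the function names and the 56x56 shape are assumptions for illustration, and the count ignores caching effects on real hardware.

```python
import math


def conv1x1_mac(h, w, c1, c2):
    """Memory accesses of a 1x1 conv: input + output feature maps + weights."""
    return h * w * (c1 + c2) + c1 * c2


def mac_lower_bound(h, w, flops):
    """Lower bound on MAC for a fixed FLOPs budget B = h*w*c1*c2."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


h = w = 56  # assumed feature-map size
# Channel pairs with the same product c1 * c2, hence identical FLOPs.
pairs = [(128, 128), (64, 256), (32, 512), (16, 1024)]
for c1, c2 in pairs:
    flops = h * w * c1 * c2
    mac = conv1x1_mac(h, w, c1, c2)
    bound = mac_lower_bound(h, w, flops)
    print(f"c1:c2 = {c1}:{c2}  MAC = {mac}  lower bound = {bound:.0f}")
```

Only the balanced 128:128 case meets the bound; the more unbalanced the ratio, the larger the MAC at equal FLOPs.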

Guide 2. Excessive group convolution increases MAC.

For a fixed FLOPs budget, the larger the group number, the larger the MAC.

Therefore, we suggest that the group number should be carefully chosen based on the target platform and task. It is unwise to use a large group number simply because this may enable using more channels, because the benefit of accuracy increase can easily be outweighed by the rapidly increasing computational cost.
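For a 1x1 group convolution the paper's count is FLOPs B = h·w·c1·c2/g and MAC = h·w·(c1 + c2) + c1·c2/g. If the output width is increased with g so that B stays fixed, the weight term stays constant but the feature-map term grows. A small sketch, with the helper name and the 28x28x128 shape as assumptions for illustration:

```python
def group_conv1x1_mac(h, w, c1, c2, g):
    """Memory accesses of a 1x1 group conv: feature maps plus grouped weights."""
    return h * w * (c1 + c2) + (c1 * c2) // g


h = w = 28     # assumed feature-map size
c1 = 128       # assumed input channels, kept fixed
base_c2 = 128  # assumed output channels at g = 1

results = []
for g in (1, 2, 4, 8):
    c2 = base_c2 * g                # widen the output so FLOPs stay constant
    flops = h * w * c1 * c2 // g
    mac = group_conv1x1_mac(h, w, c1, c2, g)
    results.append((g, flops, mac))
    print(f"g={g}: c2={c2}  FLOPs={flops}  MAC={mac}")
```

The FLOPs column is identical in every row while MAC rises with g, which is the effect Guide 2 warns about.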

Guide 3. Network fragmentation reduces degree of parallelism.

Fragmentation: a structure that splits into multiple branches, as in an Inception module.

Though such fragmented structure has been shown beneficial for accuracy, it could decrease efficiency because it is unfriendly for devices with strong parallel computing powers like GPU. It also introduces extra overheads such as kernel launching and synchronization.

The number of batches processed per second increases when there are fewer fragments.

Guide 4. Element-wise operations are non-negligible.

Here, the element-wise operators include ReLU, AddTensor, AddBias, etc. They have small FLOPs but relatively heavy MAC.

Element-wise operators incur heavy MAC.

We observe around 20% speedup is obtained on both GPU and ARM, after ReLU and shortcut are removed.

Removing them made the network faster.

With both ReLU and the shortcut removed, more batches could be processed per second, i.e., the network ran faster.
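One way to see why element-wise operators are costly despite their small FLOPs is to compare operations per memory access. The sketch below is a rough estimate under stated assumptions: ReLU is counted as one operation per element with one read and one write, the 1x1 convolution uses the MAC count h·w·(c1 + c2) + c1·c2, and caching and operator fusion are ignored. The function names and the 28x28x128 shape are illustrative.

```python
def relu_cost(h, w, c):
    """(ops, memory accesses) for ReLU: one op per element, one read + one write."""
    elements = h * w * c
    return elements, 2 * elements


def conv1x1_cost(h, w, c1, c2):
    """(multiply-adds, memory accesses) for a 1x1 convolution."""
    flops = h * w * c1 * c2
    mac = h * w * (c1 + c2) + c1 * c2
    return flops, mac


# Assumed example shape: 28x28 feature map with 128 channels.
relu_ops, relu_mac = relu_cost(28, 28, 128)
conv_ops, conv_mac = conv1x1_cost(28, 28, 128, 128)
print(f"ReLU:     {relu_ops / relu_mac:.2f} ops per memory access")
print(f"1x1 conv: {conv_ops / conv_mac:.2f} ops per memory access")
```

Under this estimate ReLU does far less work per memory access than the convolution, so its runtime is dominated by memory traffic that FLOPs do not capture.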

Conclusion and Discussions

Based on the above guidelines and empirical studies, we conclude that an efficient network architecture should
1) use "balanced" convolutions (equal channel width)
2) be aware of the cost of using group convolution
3) reduce the degree of fragmentation
4) reduce element-wise operations.