
[Paper review] ZFNet: Visualizing and Understanding Convolutional Networks

date
Dec 23, 2023
tags
paper
DeepLearning
updatedAt
Sep 6, 2024 03:30 PM
  • When many layers are stacked in a CNN, it is hard to tell why the network performs well, how the hyperparameters should be adjusted, or whether the resulting architecture is optimal
    • → AlexNet took more than a week to train on 2 GPUs, so waiting a week for every hyperparameter change is not practical
      → A way to understand CNNs better was needed, and the paper addresses this with a visualization technique
  • To understand how a CNN works, we need to know what the feature activity looks like at intermediate layers.
    • → However, the behavior of intermediate layers is hard to observe directly
      → So what if the values produced at an intermediate layer were mapped back to the input image size, so the intermediate result could be inspected?
      → Running the CNN computation in reverse from an intermediate layer should make that layer's output visible to the eye.

Differences from prior work (AlexNet)

  1. Trained on a single GPU: AlexNet split the model across 2 GPUs, with inter-GPU computation in some layers and intra-GPU computation in the rest, whereas ZFNet is trained on one GPU only
  2. Uses a smaller kernel (filter) size and a smaller stride

Introduction

  • The rise of models such as CNNs
      1. Large amounts of data
      2. Powerful GPUs
      3. Regularization techniques such as Dropout
      Thanks to these, models like AlexNet achieve strong results on image classification, but it is not clearly understood how they reach that performance
  • The paper therefore proposes visualizing the network with a Deconvolutional Network (Deconvnet); ZFNet is the model that improves on AlexNet using what the visualization revealed
  • Earlier approaches stopped at visualizing roughly the first layer's feature maps, so it was not possible to see clearly how the network behaves as the layers get deeper
    • → The paper shows that, with a Deconvnet, feature maps produced in deep layers can be visualized as well
      → It visualizes which input pattern gives rise to a given feature map

Approach

  • A standard fully supervised convnet (AlexNet) architecture is used
    • → A model that takes a 2D RGB image and outputs a probability for each class
      1. Convolve the output of the previous layer
      2. Pass through ReLU
      3. Max pooling & local normalization
    • Cross-entropy loss is used, with an SGD optimizer; parameters are updated by backpropagation
    • → The goal is to project the activities from the repeated steps 1~3 back into pixel space, and a deconvolutional network is used for this.
      → The deconvnet uses the same filtering and pooling components, but no learning takes place in it
      → To examine a given convnet activation, all other activations in that layer are set to zero, and the resulting feature maps are passed as input to the deconvnet
      → The convnet steps are repeated in reverse order (unpool, rectify, filter) to carry out the reconstruction
notion image
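The "zero out everything except one activation" step above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's code: the function name `isolate_strongest_activation`, the (channels, height, width) layout, and the choice to keep the single strongest value of one channel are all assumptions made for illustration.

```python
import numpy as np

def isolate_strongest_activation(feature_maps, channel):
    """Keep only the strongest activation of one channel; zero everything else.

    feature_maps: array of shape (channels, height, width) (assumed layout).
    """
    isolated = np.zeros_like(feature_maps)
    fmap = feature_maps[channel]
    # Locate the strongest response inside the chosen feature map.
    row, col = np.unravel_index(np.argmax(fmap), fmap.shape)
    isolated[channel, row, col] = fmap[row, col]
    return isolated

rng = np.random.default_rng(0)
maps = rng.random((3, 4, 4))              # stand-in for one layer's feature maps
isolated = isolate_strongest_activation(maps, channel=1)
print(np.count_nonzero(isolated))         # a single surviving activation
```

The array returned here is what would then be fed into the deconvnet as its input.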
  • Unpooling
    • Max pooling cannot be inverted exactly (everything except the max value is discarded)
    • So, for a 2*2 window for example, the position of the max value, such as (1,2), is stored and reused when unpooling
      • → These positions are stored in variables called switches
        → The drawback is that the influence of the weaker stimuli, everything other than the max (the strong stimulus), cannot be recovered
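The switch idea can be made concrete with a small NumPy sketch. It assumes non-overlapping 2*2 windows to match the example above (the real network's pooling configuration may differ), and the function names `max_pool_with_switches` and `unpool` are hypothetical.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling that also records where each max was (the switches)."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size), dtype=x.dtype)
    switches = np.zeros((h // size, w // size), dtype=np.int64)
    for i in range(h // size):
        for j in range(w // size):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = np.argmax(window)   # flat index inside the window
            pooled[i, j] = window.max()
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Put each pooled value back at its recorded position; all other cells stay 0."""
    out = np.zeros((pooled.shape[0] * size, pooled.shape[1] * size), dtype=pooled.dtype)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out

x = np.array([[1., 5., 2., 0.],
              [3., 4., 7., 1.],
              [0., 2., 1., 1.],
              [6., 1., 3., 8.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
print(pooled)     # maxima of the four windows: 5, 7, 6, 8
print(restored)   # the same four values at their original positions, zeros elsewhere
```

Comparing `restored` with `x` shows the drawback mentioned above: the twelve non-max values are gone and come back as zeros.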
  • Rectification
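    • The unpooled values are passed through ReLU once more
      • → It is not clear to me why ReLU is applied when only max values are being brought back anyway… (the paper's stated reason is that convnet feature maps are always positive after ReLU, so the reconstruction at each layer is kept positive as well)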
  • Filtering
    • The filters used in the convnet are transposed and reused
    • The original strided convolution over the input image can be written as the expression below
      • notion image
    • The deconvolution runs the computation above in reverse: the sparse matrix C is transposed and multiplied with the output to map back toward the input
      • notion image
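The matrix view of filtering can be checked on a tiny 1-D case. This is a sketch under assumptions: a 1-D signal instead of an image, a made-up kernel, stride 2, and a hypothetical helper `conv_matrix` that builds the sparse matrix C row by row.

```python
import numpy as np

def conv_matrix(kernel, input_len, stride=1):
    """Build the sparse matrix C so that C @ x is a strided 1-D filtering of x."""
    k = len(kernel)
    out_len = (input_len - k) // stride + 1
    C = np.zeros((out_len, input_len))
    for row in range(out_len):
        start = row * stride
        C[row, start:start + k] = kernel   # one shifted copy of the kernel per output
    return C

kernel = np.array([1., 2., 3.])
x = np.array([1., 0., 2., 1., 3.])

C = conv_matrix(kernel, len(x), stride=2)
y = C @ x          # forward pass: 5 inputs -> 2 outputs
back = C.T @ y     # "deconv" pass: 2 outputs -> 5 values in input space

print(y)           # values 7 and 13
print(back)        # values 7, 14, 34, 26, 39
```

`back` has the shape of the input but not its values, which is consistent with the point that this projection is not an exact inverse of the convolution.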
  • Note that the deconv pass is not learning; it is a plain computation that recovers only the strong activations
    • → It only gives a rough picture of which part of the input excited the feature map produced by a conv layer through its kernels; it is never a 100% reconstruction

Training details

  • Most of the setup follows AlexNet, but the computation in layers 3, 4, 5 that relied on 2 GPUs is different (ZFNet runs on 1 GPU, so the sparse connections in those layers become dense connections)
  • Trained on the ImageNet 2012 training set
    • → 1.3 million images, 1000 classes
  • Images are resized to 256; images with a different aspect ratio are center-cropped to 256 * 256
    • → Then 224*224 crops are taken (center + corners), with horizontal flips
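The center + corner crops with horizontal flips can be sketched as below. This is an illustrative NumPy version, assuming a (height, width, channels) array that has already been resized and center-cropped to 256 * 256; `ten_crop` is a hypothetical helper name, and mean subtraction or any other preprocessing is left out.

```python
import numpy as np

def ten_crop(image, crop=224):
    """Return the 4 corner crops and the center crop, each with its horizontal flip.

    image: array of shape (height, width, channels), e.g. 256 x 256 x 3.
    """
    h, w = image.shape[:2]
    top_c, left_c = (h - crop) // 2, (w - crop) // 2
    origins = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               (top_c, left_c)]
    crops = []
    for top, left in origins:
        patch = image[top:top + crop, left:left + crop]
        crops.append(patch)
        crops.append(patch[:, ::-1])   # horizontal flip of the same patch
    return np.stack(crops)

image = np.arange(256 * 256 * 3, dtype=np.float32).reshape(256, 256, 3)
crops = ten_crop(image)
print(crops.shape)   # (10, 224, 224, 3)
```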
  • SGD optimizer with a mini-batch size of 128
  • Learning rate 0.01, momentum 0.9, dropout with ratio 0.5 in the fully connected layers
    • → Weights are initialized to 0.01, biases to 0
  • Trained for 12 days on a single GTX 580 GPU, stopping at 70 epochs
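To make the optimizer settings above concrete, here is one momentum-SGD update written out in NumPy. The update form (classical momentum) is an assumption, since the notes only list the hyperparameters; `sgd_momentum_step` is a hypothetical helper, and the toy objective stands in for the mini-batch loss.

```python
import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.01, momentum=0.9):
    """One classical-momentum SGD update (assumed form, not taken from the paper)."""
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

# Toy objective f(w) = 0.5 * w^2, whose gradient is simply w.
w = np.array([1.0])
v = np.zeros_like(w)          # velocity starts at 0
history = []
for _ in range(2):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
    history.append(float(w[0]))
print(history)                # roughly 0.99, then 0.9711
```

In real training `grad` would be the gradient averaged over a mini-batch of 128 images rather than this closed-form toy gradient.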

Convnet visualization

notion image
  • Layers 1 and 2 mainly capture simple information such as edges and color; layer 3 captures more complex textures; layer 4 shows class-specific features (parts of objects or entities); layer 5 shows whole objects with pose variation (images that include changes in position or pose)
  • During training, low-level features such as edges became visible soon after training started, whereas high-level features such as those in layers 4 and 5 only became visible after roughly 40~50 epochs
    • → This shows that training has to run long enough before every layer of the model gives meaningful results
notion image
  • The results confirm invariance, one of the characteristic properties of CNNs
    • → Vertical translation, scaling (enlargement), and rotation are applied to the images
      → Column 1 is the input images, column 2 is the amount of change under the transformation, column 3 is the response of layer 7, column 4 is the probability of the true label at the output
      → Relatively speaking, the output is affected much more by rotation

4.1 Architecture Selection

  • Visualizing layers 1 and 2 of AlexNet revealed problems
    • → Layer 1 holds a mix of extremely high- and low-frequency information and covers the mid frequencies poorly (dead features also appear)
      → Layer 2 shows aliasing artifacts, caused by the large stride of 4 used in the layer 1 convolution
      ⇒ To fix these, the layer 1 kernel_size is reduced from 11 to 7 and the stride from 4 to 2
      ⇒ This also improved the actual classification performance
notion image
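The effect of the smaller kernel and stride on the first-layer output can be checked with the usual output-size formula. The 224 input size comes from the training details above; the padding values are assumptions, picked so that the results match the commonly quoted 55 and 110 feature-map sizes, and `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# AlexNet-style first layer: 11 x 11 kernel, stride 4 (padding=2 is an assumption).
print(conv_output_size(224, kernel_size=11, stride=4, padding=2))   # 55

# ZFNet-style first layer: 7 x 7 kernel, stride 2 (padding=1 is an assumption).
print(conv_output_size(224, kernel_size=7, stride=2, padding=1))    # 110
```

Halving the stride roughly doubles the spatial resolution kept after layer 1, which fits the motivation of reducing aliasing in the following layer.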

4.2 Occlusion Sensitivity

notion image
  • A natural question when looking at the classification process: does the model actually locate the object in the image, or does it rely on context from the surrounding image?
    • → So parts of the object were occluded before asking the model to classify, and the probability of the correct class was seen to drop
      → The conclusion is that the model is indeed sensitive to where the object is
      → It also confirms that visualizing ZFNet's filters and partially modifying the AlexNet architecture was meaningful
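The occlusion experiment can be sketched as a sliding gray square plus a probability readout. This is a minimal sketch under assumptions: `occlusion_map` is a hypothetical helper, the patch size, stride, and fill value are made up, and `toy_true_prob` is a stand-in for a trained classifier's true-class probability, not a real model.

```python
import numpy as np

def occlusion_map(image, predict_true_prob, patch=8, stride=8, fill=0.5):
    """Slide a gray square over the image and record the true-class probability."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch] = fill   # cover one region
            heatmap[i, j] = predict_true_prob(occluded)
    return heatmap

# Hypothetical stand-in for a classifier: its "probability" depends only on
# the brightness of the top-left 8 x 8 region, where the toy "object" sits.
def toy_true_prob(image):
    return float(image[:8, :8].mean())

image = np.zeros((32, 32))
image[:8, :8] = 1.0
heatmap = occlusion_map(image, toy_true_prob)
print(heatmap.shape)      # (4, 4)
print(heatmap.argmin())   # 0: the probability drops most when the object is covered
```

With a real model, low values in `heatmap` mark the regions the classifier depends on, which is how the experiment separates object evidence from background context.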