๐Ÿ“‘Paper Review

[paper review] CLIP : Learning Transferable Visual Models From Natural Language Supervision

date
Sep 23, 2023
slug
clip
author
status
Public
tags
paper
DeepLearning
summary
type
Post
thumbnail
์บก์ฒ˜.PNG
category
๐Ÿ“‘Paper Review
updatedAt
Sep 6, 2024 03:20 PM
ย 

Introduction

์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ text-to-text Transformer๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์–ธ์–ด ๋ชจ๋ธ๋“ค์€ ๋Œ€์šฉ๋Ÿ‰์˜ ํฌ๋กค๋ง ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์ „ ํ•™์Šต์— ํ™œ์šฉํ•˜์—ฌ ๋น„์•ฝ์ ์ธ ๋ฐœ์ „์„ ๊ฐ€์ ธ์™”์Šต๋‹ˆ๋‹ค.
๋ฐ˜๋ฉด, ์ปดํ“จํ„ฐ ๋น„์ „์—์„œ๋Š” ์•„์ง imagenet๊ณผ ๊ฐ™์€ crowd-labeled dataset์„ ํ™œ์šฉํ•œ ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ๋“ค์ด ์ฃผ๋ฅผ ์ด๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
๋ณธ ๋…ผ๋ฌธ์€ ์ปดํ“จํ„ฐ ๋น„์ „ ๋ถ„์•ผ์—์„œ๋„ ๋Œ€์šฉ๋Ÿ‰ ํฌ๋กค๋ง ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์‚ฌ์ „ ํ•™์Šต์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก image,text multi-modal ๊ตฌ์กฐ์ธ CLIP ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
ย 
๋ณธ ๋…ผ๋ฌธ์˜ contribution์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • ๋Œ€์šฉ๋Ÿ‰์˜ image-text crawling dataset์„ ์‚ฌ์šฉํ•˜์—ฌ, multi-modal model ํ•™์Šต, Zero-shot์— ๊ฐ•ํ•จ.
  • crowd labeling(gold-label) ์—†์ด, raw text๋ฅผ ์ธ์ฝ”๋”ฉํ•˜์—ฌ ์‚ฌ์šฉ.
  • Contrastive learning์„ ํ™œ์šฉํ•˜์—ฌ ํšจ์œจ์ ์ธ ํ•™์Šต.

Proposed approach

ย 
  • contrastive pre-training
<๊ทธ๋ฆผ 1>
<๊ทธ๋ฆผ 1>
ํ•™์Šต ๊ณผ์ •์€ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค.
์ด๋ฏธ์ง€์™€ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ raw text pair๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ๊ฐ์˜ ์ธ์ฝ”๋”๋ฅผ Contrastive learning์„ ํ†ตํ•ด ํ•™์Šต ์‹œํ‚ต๋‹ˆ๋‹ค. ์ธ์ฝ”๋”๋Š” ์•„๋ž˜์˜ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
text encoder : c-bow or Transformer
Image encoder : ResNet or Vision Transformer
<๊ทธ๋ฆผ 2>
<๊ทธ๋ฆผ 2>
์•„์‹œ๋‹ค์‹œํ”ผ, Contrastive learning์€ positive set์— ๋Œ€ํ•ด์„œ๋Š” ์„œ๋กœ cosine ์œ ์‚ฌ๋„๋ฅผ ๋†’์ด๊ณ ,
negative set์— ๋Œ€ํ•ด์„œ๋Š” ์„œ๋กœ cosine ์œ ์‚ฌ๋„๋ฅผ ๋‚ฎ์ถ”๋Š” ํ˜•ํƒœ๋กœ ํ•™์Šต์ด ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.
์ด ๋•Œ, symmetric loss๋ฅผ ์œ„ํ•ด axis 1, axis 0์—์„œ ๋ชจ๋‘ loss๊ฐ’์„ ๊ณ„์‚ฐํ•ด์ค๋‹ˆ๋‹ค.
<๊ทธ๋ฆผ 2>์˜ pseudocode๋ฅผ ํ†ตํ•ด์„œ๋„ ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ย 
๋งŒ์•ฝ N๊ฐœ์˜ (image,text) pair๊ฐ€ batch๋กœ ๋“ค์–ด์˜จ๋‹ค๋ฉด, NxN์˜ ๊ฐ€๋Šฅํ•œ prediction pair ์ค‘, ๊ฐœ์˜ positive pair์™€ ๊ฐœ์˜ negative pair๊ฐ€ ๋ฐœ์ƒํ• ๊ฒƒ ์ž…๋‹ˆ๋‹ค. <๊ทธ๋ฆผ 1>์„ ์ฐธ๊ณ ํ•˜์„ธ์š”~!
ย 
Contrastive learning์˜ ํšจ์œจ์„ฑ์€ ์•„๋ž˜์˜ # of trained images(x) , Zero-shot acc(y) ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด์„œ๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
notion image

Experiments

Zero-shot task์—์„œ ๋งค์šฐ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ฐ€์ ธ์™”์Šต๋‹ˆ๋‹ค.
notion image
๊ธฐ์กด supervised baseline์„ ์ผ๋ถ€ benchmark์—์„œ ๋›ฐ์–ด๋„˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.
notion image
ย