[paper review] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

date
Aug 25, 2023
slug
BERT
author
status
Public
tags
paper
DeepLearning
summary
type
Post
thumbnail
์บก์ฒ˜.PNG
category
updatedAt
Sep 6, 2024 02:42 PM

Introduction

์‚ฌ์ „ ํ›ˆ๋ จ๋œ ์–ธ์–ดํ‘œํ˜„์„ down stream task์— ์ ์šฉ์‹œํ‚ค ์œ„ํ•ด์„œ๋Š” ๋‘๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค.
  1. feature-based : task-specificํ•œ ๊ตฌ์กฐ (์˜ˆ) ELMo
  1. fine-tuning: task-specificํ•œ parameter๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ณ  down-stream task์— ๋ชจ๋“  pre-trained parameter๋ฅผ fine-tuningํ•˜๋Š” ๊ธฐ๋ฒ• (์˜ˆ) GPT
๋‘๊ฐ€์ง€ ๋ฐฉ์‹ ๋ชจ๋‘ unidirectional(๋‹จ๋ฐฉํ–ฅ) ์–ธ์–ด๋ชจ๋ธ์—์„œ ์‚ฌ์ „ ํ›ˆ๋ จ์‹œ์— ๋˜‘๊ฐ™์€ ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ ๊ฐ€์ง„๋‹ค.
ํ˜„์žฌ๊นŒ์ง€ ๋ฐœํ‘œ๋œ pre-trained ์–ธ์–ด๋ชจ๋ธ์€ ์ฃผ๋กœ unidirectionalํ•˜์˜€๋‹ค. (openAI GPT)
๊ทธ๋Ÿฌ๋‚˜ ๋‹จ๋ฐฉํ–ฅ ๋ชจ๋ธ์€ ๋ฌธ์žฅ๋‹จ์œ„ task์— ๋Œ€ํ•ด ์™„์ „ํžˆ optimalํ•˜์ง€๋Š” ์•Š์œผ๋ฉฐ, ์–‘๋ฐฉํ–ฅ์˜ ๋งฅ๋ฝ์„ ๊ณ ๋ คํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•œ token๋‹จ์œ„ task(QA)์—์„œ๋„ ์˜จ์ „ํ•˜์ง€ ์•Š๋‹ค.
๋”ฐ๋ผ์„œ ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” BERT(ํŠธ๋žœ์Šคํฌ๋จธ๋ฅผ ์ด์šฉํ•œ ์–‘๋ฐฉํ–ฅ ์ธ์ฝ”๋” representation)์„ ์†Œ๊ฐœํ•œ๋‹ค. BERT์—์„œ๋Š” MLM(๋งˆ์Šคํฌ ์–ธ์–ด๋ชจ๋ธ)์„ ์‚ฌ์šฉํ•ด ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๋‹จ๋ฐฉํ–ฅ์„ฑ์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•œ๋‹ค.
MLM์—์„œ๋Š” ๋žœ๋คํ•˜๊ฒŒ ์ธํ’‹์˜ ํ† ํฐ๋“ค์„ maskํ•˜๊ณ  ๋ชฉ์ ํ•จ์ˆ˜๋Š” masked๋œ ๋‹จ์–ด์˜ ์›๋ž˜ ๋‹จ์–ด์˜ id๋ฅผ ๋งฅ๋ฝ์—๋งŒ ์˜์กดํ•˜์—ฌ ์˜ˆ์ธกํ•˜๋„๋ก ์„ค๊ณ„๋œ๋‹ค.
left-to-right ์–ธ์–ด๋ชจ๋ธ์— ๋Œ€ํ•œ ์‚ฌ์ „ํ›ˆ๋ จ๊ณผ๋Š” ๋‹ฌ๋ฆฌ MLM obejctive๋Š” ์™ผ์ชฝ๊ณผ ์˜ค๋ฅธ์ชฝ ์–‘๋ฐฉํ–ฅ์˜ ๋งฅ๋ฝ์„ ์œตํ•ฉ๋˜๋„๋ก ํ•œ๋‹ค.
๋˜ํ•œ ์ถ”๊ฐ€์ ์œผ๋กœ MLM์—์„œ๋Š” NSP(next sentence prediction, ๋‹ค์Œ ๋ฌธ์žฅ ์˜ˆ์ธก)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” ๋‘๊ฐœ์˜ ๋ฌธ์žฅ pair์— ๋Œ€ํ•œ representation์„ ์‚ฌ์ „ํ›ˆ๋ จํ•œ๋‹ค.

Main contributions

  1. MLM์„ ์ด์šฉํ•ด deep bidirectional representation์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ–ˆ๋‹ค.
  1. ๋ณต์žกํ•˜๊ณ  ๋ฌด๊ฑฐ์šด ์—”์ง€๋‹ˆ์–ด๋ง์ด ํ•„์š”ํ•œ task specificํ•œ ๋ฐฉ๋ฒ•๋ก ๋ณด๋‹ค๋„ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒˆ๋‹ค.
  1. 11๊ฐœ์˜ nlp task์—์„œ SOTA๋ฅผ ๋‹ฌ์„ฑํ–ˆ๋‹ค.

Related works

  • unsupervised feature-based approaches
  • unsupervised fine-tuning approaches
  • transfer learning from supervised data

BERT

[Figure: BERT's overall pre-training and fine-tuning procedure]
bert ๋ชจ๋ธ์—๋Š” ์œ„์™€ ๊ฐ™์ด pre-training, fine-tuning๋ผ๋Š” ๋‘๊ฐœ์˜ step์ด ์žˆ๋‹ค.
๋‘ ์Šคํ…์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ตฌ์กฐ๋Š” ๋งˆ์ง€๋ง‰ output layer์„ ์ œ์™ธํ•˜๊ณ ๋Š” ๋‹ค ๋™์ผํ•˜๋‹ค.
unlabled data๋กœ ์‚ฌ์ „ํ•™์Šต์ด ๋œ ๊ฐ€์ค‘์น˜๋ฅผ ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์œผ๋กœ ์žก๊ณ  labeled ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด fine tuning์„ ํ•˜๊ฒŒ ๋˜๋ฉฐ task๊ฐ€ ๋‹ฌ๋ผ๋„ ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์€ ๋‹ค ๋™์ผํ•˜๊ฒŒ ์‚ฌ์ „ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜๋กœ ์‹œ์ž‘๋œ๋‹ค.
CLS ํ† ํฐ์€ special token์œผ๋กœ, ๋ชจ๋“  ์ธํ’‹์•ž์— ์ถ”๊ฐ€๋˜๋ฉฐ SEP ํ† ํฐ์€ ๋ฌธ์žฅ(๋˜๋Š” ๋ฌธ์„œ)์„ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ํ† ํฐ์ด๋‹ค.
bert๋Š” down stream task์˜ ์ข…๋ฅ˜์™€ ์ƒ๊ด€์—†์ด ๋ชจ๋‘ ๊ตฌ์กฐ๊ฐ€ ๋™์ผํ•˜๋‹ค๋Š” ํŠน์ดํ•œ ์ ์ด ์žˆ๋‹ค.
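To make the "same body, different output layer" idea concrete, here is a minimal PyTorch sketch (not the paper's code; the class name, hidden size 768, and the two-label task are illustrative) of a task-specific classification head sitting on the final hidden state of the CLS token:
```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Task-specific output layer: one linear layer applied to the final
    hidden state of the [CLS] token. Only this part changes per task; the
    encoder body below it starts from the same pre-trained weights."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        cls_hidden = sequence_output[:, 0]  # [CLS] always sits at position 0
        return self.classifier(cls_hidden)

head = ClassificationHead(hidden_size=768, num_labels=2)
body_output = torch.randn(4, 128, 768)  # stand-in for the shared encoder output
logits = head(body_output)              # shape (4, 2)
```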
bert๋Š” transformer์˜ ์ธ์ฝ”๋”๋ฅผ ์—ฌ๋Ÿฌ๊ฐœ ์Œ“์•„ ๋งŒ๋“  ๊ตฌ์กฐ์ด๋‹ค.
(transforemer ์ธ์ฝ”๋”์˜ sublayer๋Š” ํฌ๊ฒŒ ๋‘๊ฐœ๋กœ, multihead self attention๊ณผ position wise FC layer๊ฐ€ ์žˆ๋‹ค)
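As a rough sketch, the stack can be approximated with PyTorch's built-in encoder modules (details such as layer-norm placement differ from the original implementation; the sizes below are BERT-base's 12 layers, hidden size 768, 12 heads, feed-forward size 3072):
```python
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + position-wise feed-forward.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,           # hidden size H
    nhead=12,              # attention heads A
    dim_feedforward=3072,  # position-wise FC inner size (4H)
    activation="gelu",     # BERT uses GELU rather than ReLU
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # BERT-base: L=12

embeddings = torch.randn(2, 128, 768)  # (batch, seq_len, hidden)
hidden_states = encoder(embeddings)    # same shape, now contextualized
```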

Input/output format

To support a variety of downstream tasks without modification, the input, whether one sentence or two, must be represented as a single token sequence.
(In this paper, a "sequence" is the contiguous chunk fed to BERT as input, which may be a single sentence or two sentences packed together. Also, a "sentence" should be understood as an arbitrary span of contiguous text rather than a linguistic sentence.
In BERT-family papers, "sequence" and "sentence" mean different things, and knowing each meaning is important.)
Tokenization uses the WordPiece tokenizer, a variant of BPE, with a 30,000-token vocabulary.
A special token called CLS is placed at the front of every input; its final hidden state serves as the aggregate sequence representation used for classification tasks.
Two mechanisms distinguish the two sentences within one sequence: a SEP token is placed between the two sentences,
and a dedicated embedding layer marks whether each token belongs to sentence 1 or sentence 2 (see the sketch below).
[Figure: BERT input representation (sum of token, segment, and position embeddings)]
์œ„ ๊ทธ๋ฆผ์€ bert์˜ ์ธํ’‹ representation์ด๋‹ค.
ํฌ์ง€์…˜ ์ž„๋ฒ ๋”ฉ, ์–ด๋–ค sentence์— ์†ํ•˜๋Š”์ง€๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ์„ธ๊ทธ๋จผํŠธ ์ž„๋ฒ ๋”ฉ, ๊ทธ๋ฆฌ๊ณ  ๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ์ž„๋ฒ ๋”ฉ์„ ํ•ฉ์ณ ๋งŒ๋“ ๋‹ค.
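A minimal PyTorch sketch of this sum, with BERT-base-like sizes (30,000-token vocabulary, 512 maximum positions, hidden size 768); note that BERT's position embeddings are learned, not sinusoidal:
```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768
token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)        # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)  # learned position embeddings

input_ids   = torch.tensor([[101, 2054, 2003, 102]])  # toy ids, shape (1, 4)
segment_ids = torch.tensor([[0, 0, 0, 0]])
positions   = torch.arange(4).unsqueeze(0)            # 0 .. seq_len-1

# Element-wise sum of the three embeddings, shape (1, 4, 768);
# this is what the encoder stack consumes.
input_repr = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)
```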

pre-training

BERT is pre-trained with two objectives: the MLM and NSP.

Masked LM (MLM)
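As described in the Introduction, the MLM randomly masks input tokens and trains the model to predict their original ids. Per the paper, 15% of the WordPiece tokens in each sequence are selected for prediction; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, which softens the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it never does). A minimal sketch of this corruption rule, with a made-up toy vocabulary:
```python
import random

MASK = "[MASK]"
VOCAB = ["the", "man", "went", "to", "store", "he", "bought", "milk"]  # toy vocab

def mask_tokens(tokens, mask_prob=0.15):
    """Apply BERT's 80/10/10 corruption rule; returns the corrupted
    tokens and the positions the model must predict."""
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if random.random() >= mask_prob:
            continue
        targets.append(i)                        # loss is computed only here
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = MASK                  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: replace with random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

print(mask_tokens(["the", "man", "went", "to", "the", "store"]))
```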

Next Sentence Prediction (NSP)
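Per the paper, NSP is a binarized task: when choosing sentences A and B for each pre-training example, B is the actual next sentence after A 50% of the time (labeled IsNext) and a random sentence from the corpus the other 50% (labeled NotNext); the final hidden state of the CLS token is used for this classification. A minimal sketch of the pair construction (the helper name make_nsp_pair is made up):
```python
import random

def make_nsp_pair(doc, idx, corpus):
    """Build one NSP example: 50% of the time B is the actual next
    sentence (IsNext), otherwise a random sentence from the corpus (NotNext)."""
    sent_a = doc[idx]
    if random.random() < 0.5 and idx + 1 < len(doc):
        return sent_a, doc[idx + 1], "IsNext"
    return sent_a, random.choice(random.choice(corpus)), "NotNext"

corpus = [["the man went to the store", "he bought a gallon of milk"],
          ["penguins are flightless birds", "they cannot fly"]]
print(make_nsp_pair(corpus[0], 0, corpus))
```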


Experiment

GLUE
[Table: GLUE test results]