📑Paper Review

[paper review] Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

date
Nov 18, 2022
slug
Informer
author
status
Public
tags
DeepLearning
paper
summary
type
Post
thumbnail
캡처.PNG
category
📑Paper Review
updatedAt
Sep 6, 2024 03:16 PM

Introduction

๋ณธ ๋…ผ๋ฌธ์€ transformer ๊ธฐ๋ฐ˜์˜ long sequence time series forecasting(LSTF๋ผ๊ณ  ์ •์˜) ๋ฌธ์ œ์— ๋Œ€ํ•œ ๊ฐœ์„  ๋ฐฉ์•ˆ์„ ์ œ์‹œํ•จ.
notion image
The figure above shows that as the length of the sequence to be predicted grows, MSE loss rises sharply and the number of predictions that can be inferred per second drops just as sharply.
Transformers show comparatively strong long-range alignment ability, but they are not efficient at operations on long sequence inputs and outputs.
The transformer's self-attention operation has quadratic O(L^2) complexity in the input length L, and since a transformer stacks J layers, the total memory complexity becomes O(J * L^2).
In addition, inference performs step-by-step (autoregressive) decoding, which degrades speed.
๋ณธ ๋…ผ๋ฌธ์€ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋“ค์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ
  1. ์˜ ์‹œ๊ฐ„๋ณต์žก๋„, ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ProbSparse self-attention mechanism ์ œ์•ˆ
  1. self-attention distilling operation ์ œ์•ˆ, feature ์ถ”์ถœ์— ํ•„์š”ํ•œ stacking layer์˜ ๊ณต๊ฐ„ ๋ณต์žก๋„ ๋กœ ๊ฐ์†Œ
  1. ํ•˜๋‚˜์˜ forward step๋งŒ์œผ๋กœ long sequence output์„ ์–ป์„ ์ˆ˜ ์žˆ๋Š” generative style decoder ์ œ์•ˆ
notion image
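On the third point: rather than decoding one step at a time, the paper feeds the decoder a start token taken from the input sequence, concatenated with zero placeholders standing in for the target positions, and predicts all of them in one forward pass:

```latex
% Decoder input: L_token known steps as the start token,
% L_y zero placeholders for the positions to be predicted
X_{de} = \mathrm{Concat}(X_{token}, X_{0}) \in \mathbb{R}^{(L_{token} + L_y) \times d_{model}}
```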

Methodology

Efficient Self-attention Mechanism

notion image
The attention mechanism of the vanilla transformer can be re-expressed as follows.
A kernel function k(q_i, k_j), the exponential of the scaled query-key dot product, is used.
The softmax probability p(k_j | q_i) for a given query q_i is its attention score on key j normalized by the sum over all L_K keys, so the i-th query's attention output can be re-expressed as the expectation of v_j under this probability.
This still requires quadratic dot-product computation and O(L_Q * L_K) memory.
๋ณธ ๋…ผ๋ฌธ์€ performance์— ์œ ์˜๋ฏธํ•œ ์˜ํ–ฅ ๋ผ์น˜์ง€ ์•Š๋Š” ํ™•๋ฅ ๊ฐ’๋“ค ์ œ์™ธํ•˜๋Š” selective ์ „๋žต ์‚ฌ์šฉ
ํŠน์ • query๊ฐ€ ๋‹ค๋ฅธ key๋“ค๊ณผ ์ƒํ˜ธ์ž‘์šฉ์ด ํ™œ๋ฐœํ•˜์ง€ ์•Š๋‹ค๋ฉด uniformํ•œ ํ˜•ํƒœ์˜ ํ™•๋ฅ  ๋ถ„ํฌ ๊ฐ€์งˆ ๊ฒƒ
๋”ฐ๋ผ์„œ ์˜ ๋ถ„ํฌ๊ฐ€ uniform ๋ถ„ํฌ ์™€ ์œ ์‚ฌํ•˜๋ฉด ์ƒ๋Œ€์ €์œผ๋กœ ๋ถˆํ•„์š”ํ•œ query๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์Œ
์ƒ์ˆ˜ํ•ญ ์ œ๊ฑฐํ•œ ํ›„ ๋ฒˆ์งธ query์— ๋Œ€ํ•œ sparsity ์ธก์ • ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹ค์Œ์„ ์ œ์‹œ
notion image
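In LaTeX form, the measure in the image above, together with the cheaper max-mean approximation the paper actually computes on a random subset of keys:

```latex
% Sparsity measure: log-sum-exp of the scaled scores minus their mean
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{q_i k_j^{\top} / \sqrt{d}}
          - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}

% Max-mean approximation used in practice
\bar{M}(q_i, K) = \max_{j} \left\{ \frac{q_i k_j^{\top}}{\sqrt{d}} \right\}
               - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
```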
The higher this value, the more meaningful the query.
ProbSparse self-attention selects only the meaningful queries by this measure and computes attention for them alone.

Encoder: Allowing for Processing Longer Sequential Inputs under the Memory Usage Limitation

notion image

Self-attention Distilling

After ProbSparse attention, distilling is performed via convolution and max-pooling.
notion image
Through this, the next layer's input is distilled to half the length of the previous layer's input dimension.
This reduces the memory usage to O((2 - ε)L log L).
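A minimal sketch of one distilling block under those assumptions (kernel sizes follow the paper's official code, but treat the details as illustrative):

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    # One distilling step: X_{j+1} = MaxPool(ELU(Conv1d(X_j))),
    # halving the sequence length between encoder layers.
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, length)
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)           # (batch, seq_len // 2, d_model)

# e.g. torch.randn(32, 96, 512) -> (32, 48, 512)
```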

Experiment

notion image
Experimental results on univariate long sequence forecasting.
Overall, Informer shows quite strong performance.
notion image
Informer also performs well on multivariate forecasting, but the gap is not as significant as in the univariate case.