[paper review] TextRank: Bringing Order into Text

date: Oct 10, 2023
slug: textrank
status: Public
tags: paper, DeepLearning
type: Post
thumbnail: 캡처.PNG
category: 📑 Paper Review
updatedAt: Sep 7, 2024 03:08 AM

Introduction

The TextRank algorithm can be seen as a variation of PageRank, covered in an earlier post, adapted to text data.
The paper proposes two TextRank methods, one for keyword extraction and one for sentence extraction.

Proposed approach

PageRank by default builds an unweighted, directed graph and computes the importance of each vertex.
TextRank extends this idea so that importance can also be computed on weighted graphs and undirected graphs.
This reflects the nature of text data, whose units are related to one another in more complex ways than web pages are.
As a result, the TextRank graph can be freely built as directed or undirected, and as weighted or unweighted.
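
Accordingly, the importance $WS(V_i)$ of a vertex $V_i$ in a weighted graph is computed as the following weighted sum:

$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)$$

Here $d$ is the damping factor inherited from PageRank, $In(V_i)$ is the set of vertices pointing to $V_i$, $Out(V_j)$ is the set of vertices that $V_j$ points to, and $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$.

The edge weights between vertices are computed differently for keyword extraction and for sentence extraction.
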
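As a rough illustration, the weighted sum above can be sketched in a few lines of Python: each vertex receives a share of the scores of the vertices pointing to it, in proportion to edge weight. The function name `textrank_scores`, the edge-list input format, the damping factor of 0.85, and the tolerance and iteration limit are all assumptions made for this example, and edge weights are assumed to be positive.

```python
from collections import defaultdict


def textrank_scores(edges, d=0.85, tol=1e-4, max_iter=100):
    """Iterate the weighted score update until the largest change is below tol.

    edges: iterable of (source, target, weight) triples of a directed graph.
    For an undirected graph, list every edge in both directions.
    """
    out_weight = defaultdict(float)  # total outgoing weight of each vertex
    incoming = defaultdict(list)     # target -> [(source, weight), ...]
    vertices = set()
    for src, dst, w in edges:
        out_weight[src] += w
        incoming[dst].append((src, w))
        vertices.update((src, dst))
    if not vertices:
        return {}

    scores = {v: 1.0 for v in vertices}  # arbitrary starting values
    for _ in range(max_iter):
        new_scores = {
            v: (1 - d) + d * sum(w / out_weight[src] * scores[src]
                                 for src, w in incoming[v])
            for v in vertices
        }
        delta = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Undirected toy graph a - b - c, with the b - c edge weighted more heavily.
undirected = [("a", "b", 1.0), ("b", "c", 2.0)]
edges = undirected + [(v, u, w) for u, v, w in undirected]
scores = textrank_scores(edges)
print(sorted(scores, key=scores.get, reverse=True))  # most to least important
```

Passing an explicit edge list keeps the same function usable for directed and undirected graphs, and an unweighted graph is simply the case where every weight is 1.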
To apply TextRank to text data, the same four steps are followed regardless of the kind of text unit added to the graph.
  1. Define the text units that best fit the task and add them to the graph as vertices.
  2. Define the relations between text units and add them as edges between vertices. As noted above, the edges may be directed or undirected, and weighted or unweighted.
  3. Iterate the ranking algorithm until it converges.
  4. Finally, sort the vertices by importance and use the result for ranking/selection.

TextRank for Keyword Extraction

When TextRank is used for keyword extraction, the final output for a given text is a set of words or phrases.
The vertices are therefore sequences of one or more lexical units (1- to n-grams), and meaningful edges must be defined between them so that importance can be judged for keyword extraction.
The paper defines the edges of the TextRank graph through the co-occurrence of lexical units.
Two vertices are connected if their lexical units co-occur within a window of N words.
Filtering can also be applied so that only words meeting a given condition are added as vertices. For example, the graph could be built from nouns and verbs only.
In the paper's experiments to find the best filter combination, adding only nouns and adjectives to the graph gave the best results.

In summary, the TextRank keyword extraction algorithm runs in an unsupervised fashion as follows.
  1. Tokenize the text and apply POS tagging for filtering.
  2. Add the vertices that pass the filter to the graph, and add edges based on co-occurrence.
  3. Set the initial importance of every vertex to 1 and iterate the algorithm until it converges.
  4. Sort the final importance scores and take the top-N vertices as the keywords of the text.
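The four steps above can be sketched as a small Python function. This is only a minimal illustration under several assumptions: the input is already tokenized and POS-tagged (the tag names `NOUN` and `ADJ` are hypothetical and depend on the tagger), the co-occurrence window is measured over the original token positions, the graph is undirected and unweighted, and the function name `extract_keywords` and its default parameters are made up for this example.

```python
from collections import defaultdict


def extract_keywords(tagged_tokens, window=2, keep_tags=("NOUN", "ADJ"),
                     top_n=3, d=0.85, tol=1e-4, max_iter=100):
    """Rank words on an undirected, unweighted co-occurrence graph.

    tagged_tokens: list of (word, pos_tag) pairs in text order (step 1 is
    assumed to have been done already by a tokenizer and POS tagger).
    """
    # Step 2: vertices are the filtered words, edges link co-occurring words.
    neighbors = defaultdict(set)
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag not in keep_tags:
            continue
        neighbors[word]  # register the vertex even if it stays isolated
        # Look at the tokens up to window - 1 positions ahead.
        for other, other_tag in tagged_tokens[i + 1:i + window]:
            if other_tag in keep_tags and other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Step 3: start every score at 1 and iterate until the scores settle.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(max_iter):
        new_scores = {
            w: (1 - d) + d * sum(scores[n] / len(neighbors[n])
                                 for n in neighbors[w])
            for w in neighbors
        }
        converged = all(abs(new_scores[w] - scores[w]) < tol for w in neighbors)
        scores = new_scores
        if converged:
            break

    # Step 4: the top-N vertices become the keywords.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


tokens = [("linear", "ADJ"), ("constraints", "NOUN"), ("and", "CCONJ"),
          ("linear", "ADJ"), ("equations", "NOUN"), ("are", "AUX"),
          ("solved", "VERB")]
print(extract_keywords(tokens, top_n=1))
```

In this toy input, "linear" is adjacent to two different filtered words, so it collects score from both neighbours and ranks first.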
To keep the graph from growing too large, the paper adds only individual words as vertices rather than n-grams.
A post-processing step that merges individual keywords can still turn them into multi-word keywords.
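One way to picture this post-processing is the sketch below, which assumes that merging means collapsing keywords that appear next to each other in the original text into a single phrase. The function name `merge_adjacent_keywords` and the choice to keep single-word runs as they are belong to this example, not to the post.

```python
def merge_adjacent_keywords(tokens, keywords):
    """Collapse runs of adjacent keywords in the text into multi-word phrases.

    tokens: list of words in text order; keywords: the extracted single words.
    """
    keywords = set(keywords)
    phrases, current = [], []
    for token in tokens + [None]:  # the None sentinel flushes the final run
        if token in keywords:
            current.append(token)
        elif current:
            phrase = " ".join(current)
            if phrase not in phrases:  # keep each phrase once, in first-seen order
                phrases.append(phrase)
            current = []
    return phrases


tokens = ["linear", "constraints", "and", "linear", "diophantine", "equations"]
keywords = ["linear", "constraints", "diophantine", "equations"]
print(merge_adjacent_keywords(tokens, keywords))
```

Because the graph itself still contains only single words, this step adds multi-word keywords without increasing the graph size.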

TextRank for Sentence Extraction

The TextRank algorithm can also be applied to a graph whose vertices are sentences.
For sentences, however, the co-occurrence relation used for words is not applicable.
Instead, the edge between two sentences $S_i$ and $S_j$ is defined from the number of words that appear in both:

$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \land w_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$

TextRank for sentence extraction thus amounts to assigning an importance score to each sentence.
In summary, sentence extraction proceeds as follows.
  1. Assign an index to each sentence.
  2. Add the sentences to the graph as vertices, and define the edges between vertices using the similarity formula above.
  3. Initialize the importance of each vertex and iterate the algorithm until it converges.
  4. Finally, sort the sentences and use the highest-ranked ones to build a summary of the document.
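The sentence extraction steps can be sketched as follows. The edge weight is the number of shared words divided by the sum of the log lengths of the two sentences. Everything else is an assumption for illustration: the names `sentence_similarity` and `summarize`, whitespace tokenization with lowercasing, the damping factor and tolerance, and returning the selected sentences in document order.

```python
import math
from itertools import combinations


def sentence_similarity(words_i, words_j):
    """Shared-word count normalised by the log lengths of both sentences."""
    if not words_i or not words_j:
        return 0.0
    denom = math.log(len(words_i)) + math.log(len(words_j))
    if denom == 0:  # both sentences contain a single word
        return 0.0
    return len(set(words_i) & set(words_j)) / denom


def summarize(sentences, top_n=2, d=0.85, tol=1e-4, max_iter=100):
    """Return the top_n sentences by weighted score, in document order."""
    words = [s.lower().split() for s in sentences]  # naive tokenization
    n = len(sentences)
    # Step 2: a symmetric weight matrix stands in for the undirected graph.
    weight = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weight[i][j] = weight[j][i] = sentence_similarity(words[i], words[j])
    out_sum = [sum(row) for row in weight]

    # Step 3: initialise the scores and iterate until they settle.
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = [
            (1 - d) + d * sum(weight[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if weight[j][i] > 0)
            for i in range(n)
        ]
        converged = all(abs(a - b) < tol for a, b in zip(new_scores, scores))
        scores = new_scores
        if converged:
            break

    # Step 4: keep the highest-ranked sentences.
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


sentences = ["the cat sat on the mat",
             "the cat ate the fish",
             "stock markets fell sharply today"]
print(summarize(sentences, top_n=2))
```

In this toy input the third sentence shares no words with the others, so it receives no score from its neighbours and is left out of the two-sentence summary.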

Experiments

The paper's evaluation of TextRank on keyword extraction shows the following.
  1. TextRank achieved the best F-1 score among the methods compared at the time.
  2. Increasing the window size hurt performance.
  3. Under otherwise identical settings, models using an undirected graph performed better than models using a directed graph.
  4. Using nouns and adjectives as vertices gave the best performance.

Conclusion

TextRank yields both the similarity between sentences and their importance ranking at the same time.
Obtaining reference summaries for supervised learning is very labor-intensive, so TextRank also has the advantage of being an unsupervised method.