
[Paper Review] A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering

date: Mar 2, 2023
slug: dmmm
status: Public
tags: paper, DeepLearning
type: Post
category: Paper Review
updatedAt: Sep 6, 2024 01:52 PM

Introduction

Short text clustering suffers from a sparsity problem: most words occur only once in each short text, so term-weighting schemes such as TF-IDF are not effective. Representing short texts in a vector space model also means working with sparse, high-dimensional vectors.

The paper proposes a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM).
It also proposes the Movie Group Process, an analogy that explains GSDMM by treating each document as a student and each word as a movie the student has seen.
Short text clustering can then be viewed as the problem of clustering students into groups that share similar interests.
The paper's contributions are as follows.
1) It is the first attempt to apply the Dirichlet Multinomial Mixture (DMM) model to short text clustering, which makes it possible to handle the sparse and high-dimensional nature of short texts.
2) It proposes a collapsed Gibbs sampling algorithm for DMM (GSDMM). The algorithm keeps a good balance between the completeness and homogeneity of the clustering result, can infer the number of clusters automatically, and converges quickly.
3) It proposes the Movie Group Process (MGP) to aid understanding of GSDMM.

APPROACH

Movie Group Process

GSDMM is explained through the analogy of a movie discussion class, in which students who have seen similar movies should be placed in the same group. Each student has written a list of the movies they have watched. The goal is a clustering in which students in the same group have similar movie lists and students in different groups have different ones.
The input is D students (documents), each represented by a movie list (the words of the document). Let V be the total number of distinct movies (words) across all students. Because short texts are sparse, the number of words L in each short text is small, while V is very large.
(Clustering methods such as K-means represent each document as a V-dimensional vector. Even though a document contains few words, it is still stored as a vector of length V, which incurs high time and space complexity and runs into the high-dimensionality problem.)
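To make the sparsity concrete, here is a minimal Python sketch that builds dense bag-of-words count vectors and reports what fraction of each V-dimensional vector is non-zero. The toy corpus and the helper name `to_count_vector` are illustrative assumptions and do not come from the paper.

```python
# Toy corpus: hypothetical short texts, chosen only for illustration.
docs = [
    "apple releases new phone",
    "stock market falls sharply",
    "new phone sales rise",
    "market reacts to apple news",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # the V distinct words
index = {w: i for i, w in enumerate(vocab)}


def to_count_vector(tokens):
    """Dense V-dimensional bag-of-words vector, as a vector space model stores it."""
    vec = [0] * len(vocab)
    for w in tokens:
        vec[index[w]] += 1
    return vec


vectors = [to_count_vector(t) for t in tokenized]
# Fraction of non-zero entries per document vector.
density = [sum(1 for x in v if x) / len(vocab) for v in vectors]

print("V =", len(vocab))
print("non-zero fraction per document:", [round(x, 2) for x in density])
```

Even in this tiny example most entries of every vector are zero. With a realistic vocabulary the non-zero fraction shrinks much further, because L stays small while V keeps growing.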
In a large restaurant, the students are first seated at random among K tables, and then each student in turn chooses a table again according to the following rules.
1) Choose a table with more students.
2) Choose a table whose students share a similar movie list.
As this process repeats, some tables grow and others gradually shrink, and the students at each table come to share similar movie lists.
Rule 1 leads to high completeness in the clustering result. Completeness means that all members of a ground-truth group are assigned to the same cluster. Because rule 1 makes popular tables even more popular, it raises the chance that students from the same ground-truth group end up in the same cluster.
Rule 2 leads to high homogeneity in the clustering result. Homogeneity means that each cluster contains only members of a single ground-truth group.
The MGP is equivalent to the collapsed Gibbs sampling algorithm of GSDMM.
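Completeness and homogeneity can be made measurable. The sketch below assumes the entropy-based definitions from Rosenberg and Hirschberg's V-measure, which is one common formalization; this review does not state which definition is used, so treat the choice as an assumption. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """H(A) in nats for a sequence of labels."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(A | B) in nats for two equally long label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[y]) for (_, y), c in joint.items())


def homogeneity(truth, pred):
    """1.0 when every cluster contains members of a single ground-truth group."""
    h = entropy(truth)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(truth, pred) / h


def completeness(truth, pred):
    """1.0 when all members of each ground-truth group share one cluster."""
    h = entropy(pred)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(pred, truth) / h


truth = [0, 0, 1, 1]
one_big_table = [0, 0, 0, 0]   # everyone at the same table
one_per_table = [0, 1, 2, 3]   # every student alone at a table
print(homogeneity(truth, one_big_table), completeness(truth, one_big_table))
print(homogeneity(truth, one_per_table), completeness(truth, one_per_table))
```

The two extremes show why both rules are needed: putting everyone at one table is perfectly complete but not homogeneous, while seating every student alone is perfectly homogeneous but has low completeness.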

Dirichlet Multinomial Mixture

[Figure: graphical model of DMM]
DMM is a probabilistic generative model for documents.
To generate a document $d$, DMM first selects a mixture component (cluster) $k$ according to the mixture weights $p(z = k)$. The document $d$ is then generated by the selected mixture component from the distribution $p(d \mid z = k)$.
The likelihood of document $d$ is therefore
$$p(d) = \sum_{k=1}^{K} p(d \mid z = k)\, p(z = k)$$
The question now is how to define $p(d \mid z = k)$ and $p(z = k)$. DMM assumes that the words in a document are generated independently of one another and that the probability of a word does not depend on its position in the document. The probability that cluster $k$ generates document $d$ is therefore
$$p(d \mid z = k) = \prod_{w \in d} p(w \mid z = k)$$
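Each mixture component is assumed to be a multinomial distribution over words,
$$p(w \mid z = k) = p(w \mid z = k, \Phi) = \phi_{k,w}, \qquad \sum_{w=1}^{V} \phi_{k,w} = 1$$
and the prior of each mixture component is assumed to be a Dirichlet distribution:
$$p(\Phi \mid \vec{\beta}) = \mathrm{Dir}(\vec{\phi}_k \mid \vec{\beta})$$
The mixture weights are likewise assumed to be drawn from a multinomial distribution,
$$p(z = k) = p(z = k \mid \Theta) = \theta_k, \qquad \sum_{k=1}^{K} \theta_k = 1$$
whose prior is also a Dirichlet distribution:
$$p(\Theta \mid \vec{\alpha}) = \mathrm{Dir}(\vec{\theta} \mid \vec{\alpha})$$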
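The generative story can be sketched in a few lines of Python. This is only an illustration under stated assumptions: symmetric Dirichlet priors with scalar `alpha` and `beta`, a fixed document length `doc_len` (DMM itself does not specify a length model), and the hypothetical helper names `sample_dirichlet` and `generate_corpus`. A Dirichlet draw is obtained by normalizing independent Gamma variates.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(num_docs, K, vocab, doc_len, alpha, beta, seed=0):
    """Sample (cluster, words) pairs following the DMM generative process."""
    rng = random.Random(seed)
    theta = sample_dirichlet(alpha, K, rng)  # mixture weights, theta ~ Dir(alpha)
    # One word distribution per cluster, phi_k ~ Dir(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(K)]
    corpus = []
    for _ in range(num_docs):
        # Pick one cluster for the whole document: p(z = k) = theta_k.
        z = rng.choices(range(K), weights=theta)[0]
        # Words are drawn independently from that cluster: p(w | z = k) = phi_{k,w}.
        words = rng.choices(vocab, weights=phi[z], k=doc_len)
        corpus.append((z, words))
    return corpus


for z, words in generate_corpus(5, K=3, vocab=["a", "b", "c", "d"],
                                doc_len=6, alpha=1.0, beta=0.5):
    print(z, words)
```

The key difference from a topic model such as LDA is visible in the loop: the cluster is sampled once per document, not once per word, which suits short texts that usually cover a single topic.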

Gibbs Sampling for DMM

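Given the input documents and the hyperparameters $\alpha$ and $\beta$, the task is inference. Because the cluster assignments $\vec{z}$ are a sufficient statistic for $\Theta$ and $\Phi$, this is equivalent to inferring
$$p(\vec{z} \mid \vec{d}, \alpha, \beta)$$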
[Figure: pseudocode of the GSDMM algorithm]
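The first for loop of the algorithm is the initialization step: each document is assigned to a cluster at random according to a uniform multinomial distribution, $\mathrm{Multinomial}(1/K)$, and the corresponding counts are updated.
The second for loop is the Gibbs sampling step. Each document $d$ is reassigned to a cluster sampled according to the probability
$$p(z_d = z \mid \vec{z}_{\neg d}, \vec{d}) \propto \frac{m_{z,\neg d} + \alpha}{D - 1 + K\alpha} \cdot \frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left(n_{z,\neg d}^{w} + \beta + j - 1\right)}{\prod_{i=1}^{N_d} \left(n_{z,\neg d} + V\beta + i - 1\right)}$$
and the corresponding counts are updated. Here $m_z$ is the number of documents in cluster $z$, $n_z$ is the number of words in cluster $z$, $n_z^w$ is the number of occurrences of word $w$ in cluster $z$, $N_d$ and $N_d^w$ are the same counts for document $d$, and $\neg d$ means that document $d$ is excluded. This is repeated until convergence or for a fixed number of iterations.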
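The two loops can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function name `gsdmm`, its signature, and the helper `move` are assumptions, and the probabilities are multiplied directly instead of in log space, which is only safe for short documents.

```python
import random
from collections import Counter


def gsdmm(docs, K, alpha, beta, n_iters, seed=0):
    """Collapsed Gibbs sampling for DMM; docs is a list of token lists."""
    rng = random.Random(seed)
    D = len(docs)
    V = len({w for doc in docs for w in doc})
    m = [0] * K                          # m_z: number of documents in cluster z
    n = [0] * K                          # n_z: number of words in cluster z
    nw = [Counter() for _ in range(K)]   # n_z^w: occurrences of word w in cluster z
    z = [0] * D

    def move(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from cluster k."""
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w in docs[d]:
            nw[k][w] += sign

    # First loop: uniform random initialization of cluster assignments.
    for d in range(D):
        z[d] = rng.randrange(K)
        move(d, z[d], +1)

    # Second loop: resample each document's cluster from its conditional.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(d, z[d], -1)  # counts now exclude document d
            weights = []
            for k in range(K):
                # Cluster-size factor (rule 1 of the MGP).
                p = (m[k] + alpha) / (D - 1 + K * alpha)
                # Word-similarity factor (rule 2 of the MGP).
                for w, count in Counter(doc).items():
                    for j in range(count):
                        p *= nw[k][w] + beta + j
                for i in range(len(doc)):
                    p /= n[k] + V * beta + i
                weights.append(p)
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z


docs = [["a", "b"], ["a", "b", "a"], ["x", "y"], ["y", "x", "x"]]
labels = gsdmm(docs, K=3, alpha=0.1, beta=0.1, n_iters=10)
print(labels, "non-empty clusters:", len(set(labels)))
```

K is only an upper bound here: clusters that lose all their documents stay empty, so the number of non-empty clusters, `len(set(labels))`, serves as the inferred cluster count.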

EXPERIMENTAL STUDY

Comparison of clustering models

The performance of GSDMM is compared with that of other clustering methods.
[Figure: performance comparison of the clustering methods]
GSDMM is found to achieve the best performance.