Topic Modeling #2: Key Terms

yul_S2 2022. 12. 20. 11:16

Key terms

(1) Extracting per-topic term probabilities (beta)

Beta (β): the probability that a term appears in a given topic.

Beta shows which key terms are most likely to appear in each topic.

#1 Extract beta

tidytext::tidy()

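library(dplyr)     # %>%, count(), filter(), arrange()
library(tidytext)  # tidy(), reorder_within()
term_topic <- tidy(lda_model, matrix = "beta")  # lda_model: the previously fitted LDA model
term_topic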

# # A tibble: 49,256 x 3
#    topic term            beta
#    <int> <chr>          <dbl>
#  1     1 한국       0.0000363
#  2     2 한국       0.000388
#  3     3 한국       0.0000363
#  4     4 한국       0.0000346
#  5     5 한국       0.0563
#  6     6 한국       0.0000340
#  7     7 한국       0.0000355
#  8     8 한국       0.0111
#  9     1 방탄소년단 0.0000363
# 10     2 방탄소년단 0.0000353
# # ... with 49,246 more rows
# # i Use `print(n = ...)` to see more rows

#2 Inspect beta

#2-1 Number of terms per topic

term_topic %>%
  count(topic)

# # A tibble: 8 x 2
#   topic     n
#   <int> <int>
# 1     1  6157
# 2     2  6157
# 3     3  6157
# 4     4  6157
# 5     5  6157
# 6     6  6157
# 7     7  6157
# 8     8  6157

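The model was built from 6,157 terms, so each topic has 6,157 rows.

#2-2 Sum of beta for topic 1

Because beta values are probabilities, the betas within a single topic sum to 1.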

term_topic %>%
  filter(topic == 1) %>%
  summarise(sum_beta = sum(beta))

# # A tibble: 1 x 1
#   sum_beta
#      <dbl>
# 1        1
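
As a minimal sketch of why a topic's beta values sum to 1, the toy example below turns made-up term counts into per-topic probabilities by dividing each row by its total. The names `toy_counts` and `toy_beta` and all of the numbers are hypothetical, and this is not how LDA estimates beta; it only illustrates that each topic is a probability distribution over the terms.

```r
# Hypothetical term counts for 2 topics and 4 terms (illustrative values only)
toy_counts <- matrix(c(8, 1, 1, 0,
                       1, 2, 3, 4),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(topic = c("1", "2"),
                                     term  = c("a", "b", "c", "d")))

# Divide each row by its total so that every topic becomes
# a probability distribution over the terms
toy_beta <- toy_counts / rowSums(toy_counts)

toy_beta
rowSums(toy_beta)  # every topic's probabilities add up to 1
```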

#3 Per-topic probabilities of a specific term

Filtering for a specific term shows which topics it is most likely to appear in.

term_topic %>%
  filter(term == "작품")

# # A tibble: 8 x 3
#   topic term       beta
#   <int> <chr>     <dbl>
# 1     1 작품  0.0000363
# 2     2 작품  0.00991
# 3     3 작품  0.0000363
# 4     4 작품  0.00211
# 5     5 작품  0.0185
# 6     6 작품  0.000714
# 7     7 작품  0.000744
# 8     8 작품  0.0000356

#4 Terms with high beta in a specific topic

term_topic %>%
  filter(topic == 5) %>%
  arrange(-beta)

# # A tibble: 6,157 x 3
#    topic term        beta
#    <int> <chr>      <dbl>
#  1     5 한국     0.0563
#  2     5 세계     0.0477
#  3     5 봉감독님 0.0278
#  4     5 한국영화 0.0278
#  5     5 최고     0.0261
#  6     5 감사     0.0220
#  7     5 작품     0.0185
#  8     5 대박     0.0155
#  9     5 문화     0.0137
# 10     5 훌륭     0.00792
# # ... with 6,147 more rows

#5 Key terms across all topics

topicmodels::terms()

terms(lda_model, 20) %>%
  data.frame()

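#   Topic.1   Topic.2 Topic.3    Topic.4  Topic.5 Topic.6 Topic.7 Topic.8
# 1    역사      대박    조국 블랙리스트     한국    수상    사람    좌파
# 2  감독상      진심    자랑     박근혜     세계    우리    배우    호감
# 3    스카      국민  문재인     송강호 봉감독님    생각    정치  빨갱이
# 4    미국      감동    가족       정권 한국영화    오늘    나라    외국
# 5    인정  우리나라    경사 자유한국당     최고  시상식    소름    한국
# ... (rows 6 to 20 not shown)

Usage: terms(x, ...), where x is the fitted model. The call above passes 20 as the number of top terms to return per topic.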

(2) Visualizing key terms by topic

#1. Extract the terms with the highest beta in each topic

top_term_topic <- term_topic %>%
  group_by(topic) %>%
  slice_max(beta, n = 10)

top_term_topic

# # A tibble: 83 x 3
# # Groups:   topic [8]
#    topic term        beta
#    <int> <chr>      <dbl>
#  1     1 역사     0.0356
#  2     1 감독상   0.0331
#  3     1 스카     0.0287
#  4     1 미국     0.0157
#  5     1 인정     0.0157
#  6     1 각본상   0.0120
#  7     1 우리나라 0.0117
#  8     1 감격     0.0109
#  9     1 영화제   0.0102
# 10     1 정도     0.00839
# # ... with 73 more rows

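If term probabilities are tied within a topic, more than 10 terms can be returned for that topic. That is why the result above has 83 rows rather than 8 x 10 = 80.

To exclude ties, use slice_max(beta, n = 10, with_ties = FALSE).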
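
The tie behavior can be checked on a small table. This is a sketch under stated assumptions: `toy_terms`, `top_with_ties`, `top_no_ties`, and the beta values are hypothetical and chosen so that topic 1 has a tie at the cutoff.

```r
library(dplyr)

# Hypothetical beta values; the two 0.2 entries in topic 1 are tied
toy_terms <- tibble(
  topic = c(1, 1, 1, 1, 2, 2, 2),
  term  = c("a", "b", "c", "d", "a", "b", "c"),
  beta  = c(0.5, 0.2, 0.2, 0.1, 0.6, 0.3, 0.1)
)

# Default (with_ties = TRUE): every row tied at the cutoff is kept
top_with_ties <- toy_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 2)

# with_ties = FALSE: exactly n rows per topic
top_no_ties <- toy_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 2, with_ties = FALSE)

nrow(top_with_ties)  # 5: three rows for topic 1, two for topic 2
nrow(top_no_ties)    # 4: two rows per topic
```

Note that with_ties = FALSE keeps only one of the tied rows, so which tied term survives depends on row order.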

#2 Draw a bar chart

install.packages("scales")  # only needed once
library(scales)             # number_format()
library(ggplot2)

ggplot(top_term_topic,
       aes(x = reorder_within(term, beta, topic),
           y = beta,
           fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(n.breaks = 4,
                     labels = number_format(accuracy = .01)) +
  labs(x = NULL) +
  theme(text = element_text(family = "nanumgothic"))

scale_y_continuous(n.breaks = 4): aim for roughly 4 axis breaks.

labels = number_format(accuracy = .01): round the break labels to two decimal places (the nearest 0.01). Requires scales to be loaded.
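
To see what the label formatter does on its own, the sketch below applies it to a few values outside of ggplot. `beta_values` and `label_fun` are hypothetical names, and the numbers are only examples of the kind of beta values plotted above.

```r
library(scales)

# Example beta values of the kind shown on the y axis
beta_values <- c(0.0563, 0.0477, 0.0278, 0.00792)

# number_format() returns a labelling function;
# accuracy = .01 rounds each value to the nearest 0.01
label_fun <- number_format(accuracy = .01)

label_fun(beta_values)
# "0.06" "0.05" "0.03" "0.01"
```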