Comparative Analysis #1

59doit

Text Mining

yul_S2 2022. 12. 18. 09:42

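Comparative analysis

A method of comparing several texts to find out how they differ.

It applies word-frequency analysis to examine differences in the words each text uses most often.

(1) Combining the texts

To compare texts, first combine the separate texts into a single dataset.

#1 Loading the data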

library(dplyr)

# Load President Moon Jae-in's speech

raw_moon <- readLines("C:/speech_moon.txt", encoding = "UTF-8")
moon <- raw_moon %>%
  as_tibble() %>%
  mutate(president = "moon")

# Load President Park Geun-hye's speech

raw_park <- readLines("C:/speech_park.txt", encoding = "UTF-8")
park <- raw_park %>%
  as_tibble() %>%
  mutate(president = "park")

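#2 Combining the data

Bind the two datasets in the row (vertical) direction.
Reorder the variables with select() so the output is easier to read.
The upper part holds President Moon Jae-in's speech and the lower part former President Park Geun-hye's speech.

bind_speeches <- bind_rows(moon, park) %>%
  select(president, value)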

head(bind_speeches)
## # A tibble: 6 x 2
## president value
## <chr> <chr>
## 1 moon "정권교체 하겠습니다!…
## 2 moon " 정치교체 하겠습니…
## 3 moon " 시대교체 하겠습니…
## 4 moon " "
## 5 moon " ‘불비불명(不飛不…
## 6 moon ""

tail(bind_speeches)
## # A tibble: 6 x 2
## president value
## <chr> <chr>
## 1 park "국민들이 꿈으로만 가…
## 2 park ""
## 3 park "감사합니다."
## 4 park ""
## 5 park "2012년 7월 10…
## 6 park "새누리당 예비후보 박…
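The row-binding step can be tried without the speech files. The following is a minimal sketch on made-up data: the objects moon_toy, park_toy and bind_toy and their contents are hypothetical stand-ins, not the author's files.

```r
library(dplyr)

# Hypothetical toy lines standing in for the two speech files
moon_toy <- tibble(value = c("line a", "line b"), president = "moon")
park_toy <- tibble(value = c("line c"), president = "park")

# Stack the rows, then put the group label first for readability
bind_toy <- bind_rows(moon_toy, park_toy) %>%
  select(president, value)

nrow(bind_toy)    # number of rows in moon_toy plus rows in park_toy
names(bind_toy)   # president comes first after select()
```

Because each tibble carries its own president label before binding, every row can still be traced back to its source after the two are stacked.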

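(2) Computing word frequencies by group

#1 Preprocessing and tokenization

Remove every character that is not Hangul, then collapse runs of consecutive spaces.
bind_speeches is a tibble, so the cleaning is done inside mutate().

library(stringr)
speeches <- bind_speeches %>%
  mutate(value = str_replace_all(value, "[^가-힣]", " "),
         value = str_squish(value))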

speeches

## # A tibble: 213 x 2
## president value
## <chr> <chr>
## 1 moon "정권교체 하겠습니다"
## 2 moon "정치교체 하겠습니다"
## 3 moon "시대교체 하겠습니다"
## 4 moon ""
## 5 moon "불비불명 이라는 고사가 있습니다 남쪽 언덕 나뭇가지에 앉아 년 …
## 6 moon ""
## 7 moon "그 동안 정치와 거리를 둬 왔습니다 그러나 암울한 시대가 저를 …
## 8 moon ""
## 9 moon ""
## 10 moon "우리나라 대통령 이 되겠습니다"
## # … with 203 more rows
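To see what the two cleaning calls do on their own, here is a small sketch applied to a single string. The sample string raw_line is an assumption modeled on the fifth line of the output above, not the exact text of the file.

```r
library(stringr)

# Hypothetical sample line containing quotes, Hanja, a digit and punctuation
raw_line <- "  ‘불비불명(不飛不鳴)’이라는 고사가 있습니다. 3년!  "

# Step 1: replace every character outside the Hangul syllable range with a space
step1 <- str_replace_all(raw_line, "[^가-힣]", " ")

# Step 2: trim the ends and collapse repeated spaces into one
clean <- str_squish(step1)
clean
# → "불비불명 이라는 고사가 있습니다 년"
```

Replacing with a space rather than an empty string keeps neighboring words from being glued together; str_squish() then removes the extra spaces that this leaves behind.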

library(tidytext)
library(KoNLP)
speeches <- speeches %>%
  unnest_tokens(input = value,
                output = word,
                token = extractNoun)

speeches

## # A tibble: 2,997 x 2
## president word
## <chr> <chr>
## 1 moon "정권교체"
## 2 moon "하겠습니"
## 3 moon "정치"
## 4 moon "교체"
## 5 moon "하겠습니"
## 6 moon "시대"
## 7 moon "교체"
## 8 moon "하겠습니"
## 9 moon ""
## 10 moon "불비불명"
## # … with 2,987 more rows
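unnest_tokens() accepts a function as its token argument, which is how extractNoun is plugged in above. The sketch below shows the same mechanism without KoNLP: split_on_space is a hypothetical whitespace splitter used only as a stand-in, so it does not extract nouns the way extractNoun does.

```r
library(dplyr)
library(tidytext)

# Assumption: a plain whitespace splitter stands in for KoNLP::extractNoun.
# A tokenizer function must return a list with one character vector per input row.
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

# Hypothetical two-row input in the same shape as speeches
toy <- tibble(president = c("moon", "park"),
              value = c("정권교체 하겠습니다", "국민 행복 시대"))

tokens <- toy %>%
  unnest_tokens(input = value,
                output = word,
                token = split_on_space)
tokens
```

Each input row is expanded into one row per token, and the president label is repeated alongside every token.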

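(3) Computing word frequencies by subgroup - count()

frequency <- speeches %>%
  count(president, word) %>%     # frequency per speech and word
  filter(str_count(word) > 1)    # keep words of two or more characters

head(frequency)

count() sorts the rows by the input variables in alphabetical (for Korean, 가나다) order.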
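The sorting behavior of count() is easy to check on a tiny example. This is a sketch on made-up tokens; toy_tokens and freq_toy are hypothetical names.

```r
library(dplyr)

# Hypothetical tokens, deliberately out of order
toy_tokens <- tibble(
  president = c("park", "moon", "moon", "park", "moon"),
  word      = c("economy", "people", "jobs", "economy", "people")
)

# One row per (president, word) pair, with the count in column n
freq_toy <- toy_tokens %>%
  count(president, word)
freq_toy
```

The result is ordered by president first and then by word, not by n, which is why a separate step is needed to pull out the most frequent words.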

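(4) Extracting frequently used words

dplyr::slice_max(): extracts the n rows with the largest values, sorted in descending order

slice_min(): extracts the n rows with the smallest values

top10 <- frequency %>%
  group_by(president) %>%    # split by president
  slice_max(n, n = 10)       # take the top 10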
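As a quick illustration of the two functions, the sketch below uses a made-up frequency table (freq_toy is hypothetical) with no tied values.

```r
library(dplyr)

# Hypothetical frequency table without ties
freq_toy <- tibble(word = c("a", "b", "c", "d"),
                   n    = c(5L, 9L, 2L, 7L))

top2    <- freq_toy %>% slice_max(n, n = 2)   # largest first: b (9), d (7)
bottom2 <- freq_toy %>% slice_min(n, n = 2)   # smallest first: c (2), a (5)

top2
bottom2
```

In slice_max(n, n = 2) the first n is the column to rank by and the second is the number of rows to keep.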

top10

#2 Handling ties in word frequency

top10

## # A tibble: 23 x 3
## # Groups: president [2]
## president word n
## <chr> <chr> <int>
## 1 moon 국민 21
## 2 moon 일자리 21
## 3 moon 나라 19
## 4 moon 우리 17
## 5 moon 경제 15
## 6 moon 사회 14
## 7 moon 성장 13
## 8 moon 대통령 12
## 9 moon 정치 12
## 10 moon 하게 12
## # … with 13 more rows

Ten words were requested from each of the two speeches, yet the result has 23 rows instead of 20.
This is because every row tied on word frequency was extracted.

top10 %>% filter(president == "park")

## # A tibble: 12 x 3
## # Groups: president [1]
## president word n
## <chr> <chr> <int>
## 1 park 국민 72
## 2 park 행복 23
## 3 park 여러분 20
## 4 park 정부 17
## 5 park 경제 15
## 6 park 신뢰 11
## 7 park 국가 10
## 8 park 우리 10
## 9 park 교육 9
## 10 park 사람 9
## 11 park 사회 9
## 12 park 일자리 9

"교육" (education), "사람" (people), "사회" (society) and "일자리" (jobs) are tied on frequency, so all of them are extracted and the number of rows grows.

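(5) Extracting words while excluding ties

slice_max(with_ties = FALSE): when values are tied, rows are taken in the order they appear in the original data, so exactly n rows are returned

top10 <- frequency %>%
  group_by(president) %>%
  slice_max(n, n = 10, with_ties = FALSE)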

top10

## # A tibble: 20 x 3
## # Groups: president [2]
## president word n
## <chr> <chr> <int>
## 1 moon 국민 21
## 2 moon 일자리 21
## 3 moon 나라 19
## 4 moon 우리 17
## 5 moon 경제 15
## 6 moon 사회 14
## 7 moon 성장 13
## 8 moon 대통령 12
## 9 moon 정치 12
## 10 moon 하게 12
## 11 park 국민 72
## 12 park 행복 23
## 13 park 여러분 20
## 14 park 정부 17
## 15 park 경제 15
## 16 park 신뢰 11
## 17 park 국가 10
## 18 park 우리 10
## 19 park 교육 9
## 20 park 사람 9
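The effect of with_ties can be reproduced on a small table. This is a sketch on made-up counts; freq_toy and the English words in it are hypothetical.

```r
library(dplyr)

# Hypothetical table: one clear maximum followed by four tied values
freq_toy <- tibble(word = c("education", "people", "society", "jobs", "trust"),
                   n    = c(9L, 9L, 9L, 9L, 11L))

# Default (with_ties = TRUE): all rows tied for the last place are kept
with_ties_rows <- freq_toy %>% slice_max(n, n = 2)

# with_ties = FALSE: exactly 2 rows; the tie is broken by original row order
no_ties_rows <- freq_toy %>% slice_max(n, n = 2, with_ties = FALSE)

nrow(with_ties_rows)   # 5: "trust" plus the four words tied at 9
no_ties_rows$word      # "trust" "education"
```

Which tied word survives depends only on row order, so the cut is arbitrary with respect to the data; that is acceptable for a plot but worth remembering when reading the result.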

(6) Drawing bar charts

#1 Drawing a separate plot for each category of a variable - facet_wrap()

library(ggplot2)
ggplot(top10, aes(x = reorder(word, n),
                  y = n,
                  fill = president)) +
  geom_col() +
  coord_flip() +
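  facet_wrap(~ president)

Put the variable that splits the plots after ~.

#2 Setting the y-axis for each plot

If a word on the axis occurs in only one category, the other panel shows that word on its axis with no bar.

scales: controls whether the panels share axes or each panel gets its own
# "fixed": shared axes (default)
# "free_y": a separate y-axis for each category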

ggplot(top10, aes(x = reorder(word, n),
                  y = n,
                  fill = president)) +
  geom_col() +
  coord_flip() +
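  facet_wrap(~ president,          # one plot per president
             scales = "free_y")    # do not share the y-axis

#3 Drawing the bar chart without a specific word

Remove "국민" ("the people") so that the overall pattern of word frequencies shows up clearly.

top10 <- frequency %>%
  filter(word != "국민") %>%
  group_by(president) %>%
  slice_max(n, n = 10, with_ties = FALSE)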

ggplot(top10, aes(x = reorder(word, n),
                  y = n,
                  fill = president)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ president, scales = "free_y")

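In former President Park Geun-hye's speech, the frequency of "국민" is so high that the differences among the other words do not show up.

#4 Ordering the axis

Even though reorder() was used for the x-axis, the bars are not perfectly sorted by frequency within each panel.
This is because the x-axis order for every category was determined from the overall frequency.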

ggplot(top10, aes(x = reorder_within(word, n, president),
                  y = n,
                  fill = president)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ president, scales = "free_y")

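# tidytext::reorder_within(): computes the axis order separately for each category of a variable
# x: the axis variable
# by: the sorting criterion
# within: the variable that splits the plots

#5 Removing the category name from the labels

tidytext::scale_x_reordered(): removes the category name appended after each word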
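The difference between the two orderings can be inspected directly from the factor levels, without drawing a plot. The sketch below uses made-up words and counts; all object names are hypothetical.

```r
library(tidytext)

# Hypothetical top words: "economy" appears for both presidents
words     <- c("economy", "jobs", "economy", "trust")
n         <- c(15, 21, 14, 11)
president <- c("moon", "moon", "park", "park")

# reorder(): a single position per word, based on its mean n over all rows
x_global <- reorder(words, n)
levels(x_global)

# reorder_within(): a separate level per word-and-president pair,
# built by pasting the category after the word with the default separator "___"
x_within <- reorder_within(words, n, president)
levels(x_within)
```

With reorder() the word "economy" gets a single rank shared by both panels, whereas reorder_within() ranks it once per president, which is what lets each panel be sorted on its own.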

ggplot(top10, aes(x = reorder_within(word, n, president),
                  y = n,
                  fill = president)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ president, scales = "free_y") +
  scale_x_reordered() +
  labs(x = NULL) +                                        # remove the x-axis title
  theme(text = element_text(family = "nanumgothic"))      # font
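To see why the label cleanup is needed, it helps to look at the raw levels that reorder_within() produces. In the sketch below the suffix is stripped with base R's sub(); this is an assumed approximation of the cleanup for illustration, not a copy of what scale_x_reordered() does internally.

```r
library(tidytext)

# Hypothetical case: the same word ranked separately for two presidents
x_within <- reorder_within(c("economy", "economy"), c(15, 14), c("moon", "park"))

labels_raw <- levels(x_within)
labels_raw     # each label carries the "___<president>" suffix

# Assumption: drop everything from the default separator "___" onward
labels_clean <- sub("___.+$", "", labels_raw)
labels_clean   # "economy" "economy"
```

The suffix is what keeps the two "economy" entries distinct while the axis is being ordered, so it can only be removed at display time, which is the job of the scale.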
