Morphological Analysis


yul_S2 2022. 12. 17. 10:43


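The problem with whitespace-based tokenization

Predicates that carry little meaning on their own, such as '합니다' ("do") and '있습니다' ("there is"), end up as the most frequently extracted tokens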
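A related side effect of splitting on whitespace is that particles and endings stay attached to the words they follow. The following is a minimal base-R sketch of that effect, using the sample sentence from later in this post; the variable names are chosen here for illustration.

```r
# Split a Korean sentence on whitespace using base R only.
sentence <- "대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다."
tokens <- strsplit(sentence, "\\s+")[[1]]
tokens

# The noun "국민" never appears as a token of its own ...
"국민" %in% tokens

# ... because it is glued to different particles in two separate tokens.
tokens[grepl("국민", tokens)]
```

Because "국민에게" and "국민으로부터" count as two different tokens, a frequency table built this way undercounts the noun they share.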

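Morphological analysis

The task of splitting a sentence into morphemes and classifying them by part of speech, such as noun, verb, or adjective

Nouns in particular are used to grasp what a sentence is about

Morpheme

The smallest unit of language that carries meaning

Split any further, it becomes characters with no meaning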

Installing KoNLP, the Korean morphological analysis package

#1 Install Java and the rJava package

install.packages("multilinguer")
library(multilinguer)
install_jdk()

#2 Install the packages KoNLP depends on

install.packages(c("stringr", "hash", "tau", "Sejong", "RSQLite", "devtools"), type = "binary")

#3 Install the KoNLP package

install.packages("remotes")
remotes::install_github("haven-jeon/KoNLP", upgrade = "never", INSTALL_opts = c("--no-multiarch"))
library(KoNLP)

#4 Set the morpheme dictionary

useNIADic()

NIA dictionary: a morpheme dictionary made up of roughly 1.2 million words

Tokenizing with the morphological analyzer: extracting nouns

#1

library(KoNLP)
library(dplyr)
text <- tibble(value = c("대한민국은 민주공화국이다.",
                         "대한민국의 주권은 국민에게 있고, 모든 권력은 국민으로부터 나온다."))
text

#2

extractNoun(text$value)

extractNoun(): returns the nouns extracted from each sentence as a list

#3 Extract nouns with unnest_tokens()

library(tidytext)
text %>%
  unnest_tokens(input = value,
                output = word,
                token = extractNoun)

The nouns are returned as a tibble
Do not put quotation marks around extractNoun when passing it to the token parameter
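The reason for leaving out the quotation marks is that token accepts either the name of a built-in tokenizer as a string (such as "words") or a function object. Below is a small sketch of the second case. split_on_space is a hypothetical stand-in written for this example, not part of KoNLP; it only mimics the shape of what extractNoun() returns, namely a list with one character vector per input element.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in tokenizer (NOT part of KoNLP): takes a character
# vector and returns a list holding one character vector of tokens per element.
split_on_space <- function(x) strsplit(x, " ", fixed = TRUE)

toy <- tibble(value = c("대한민국은 민주공화국이다",
                        "모든 권력은 국민으로부터 나온다"))

# Pass the function object itself, so no quotation marks around its name.
toy_words <- toy %>%
  unnest_tokens(input = value, output = word, token = split_on_space)
toy_words
```

Writing token = "split_on_space" instead would make unnest_tokens() look for a built-in tokenizer of that name, which does not exist.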

#4 Tokenize on whitespace

text %>% unnest_tokens(input = value, output = word, token = "words")

#5 Tokenize into nouns

text %>% unnest_tokens(input = value, output = word, token = extractNoun)

Extracting nouns from a speech


#1 Load the file and preprocess it

raw_moon <- readLines("speech_moon.txt", encoding = "UTF-8")

library(stringr)
library(textclean)
moon <- raw_moon %>%
  str_replace_all("[^가-힣]", " ") %>%   # keep Hangul syllables only
  str_squish() %>%                       # collapse repeated whitespace
  as_tibble()                            # convert to a tibble
moon
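To see what the two cleaning steps do on their own, here is a small sketch applied to a made-up string (the sample text and variable names are illustrative, not taken from the speech file).

```r
library(stringr)

# Made-up sample containing Latin letters, digits, punctuation and Hangul.
raw <- "Hello,   대한민국 2022! 국민의 나라."

# Step 1: replace every character that is not a Hangul syllable with a space.
hangul_only <- str_replace_all(raw, "[^가-힣]", " ")

# Step 2: trim the ends and collapse runs of spaces into a single space.
cleaned <- str_squish(hangul_only)
cleaned
```

Replacing with a space rather than an empty string keeps neighboring words from being fused together; str_squish() then removes the extra spaces this leaves behind.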

#2 Tokenize into nouns

word_noun <- moon %>%
  unnest_tokens(input = value, output = word, token = extractNoun)
word_noun

Counting noun frequencies

#1 Count word frequencies

word_noun <- word_noun %>%
  count(word, sort = T) %>%        # count each word and sort in descending order
  filter(str_count(word) > 1)      # keep only words of two or more characters
word_noun
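The count-then-filter step can be tried without KoNLP on a hand-made token column. The tokens below are invented for illustration and are not actual extractNoun() output.

```r
library(dplyr)
library(stringr)

# Hand-made tokens for illustration only (NOT real extractNoun() output).
toy_tokens <- tibble(word = c("국민", "일자리", "국민", "수", "것", "국민", "일자리"))

toy_freq <- toy_tokens %>%
  count(word, sort = TRUE) %>%     # one row per word, most frequent first
  filter(str_count(word) > 1)      # str_count() with no pattern counts characters
toy_freq
```

Single-character tokens such as "수" and "것" are dropped, which is the purpose of the filter: one-syllable nouns are rarely informative about the topic.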

#2 Whitespace-based tokens

moon %>%
  unnest_tokens(input = value, output = word, token = "words") %>%
  count(word, sort = T) %>%
  filter(str_count(word) > 1)

#3 Noun tokens

moon %>%
  unnest_tokens(input = value, output = word, token = extractNoun) %>%
  count(word, sort = T) %>%
  filter(str_count(word) > 1)

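#4 Draw a bar chart

top20 <- word_noun %>% head(20)   # keep the 20 most frequent words

library(ggplot2)
library(showtext)
font_add_google(name = "Nanum Gothic", family = "nanumgothic")   # register the font used in theme() below
showtext_auto()

ggplot(top20, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  geom_text(aes(label = n), hjust = -0.3) +
  labs(x = NULL) +
  theme(text = element_text(family = "nanumgothic"))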

#5 Draw a word cloud

library(showtext)
font_add_google(name = "Black Han Sans", family = "blackhansans")
showtext_auto()

library(ggwordcloud)
ggplot(word_noun, aes(label = word, size = n, col = n)) +
  geom_text_wordcloud(seed = 1234, family = "blackhansans") +
  scale_radius(limits = c(3, NA), range = c(3, 15)) +
  scale_color_gradient(low = "#66aaf2", high = "#004EA1") +
  theme_minimal()

Checking specific words

#1 Tokenize by sentence

sentences_moon <- raw_moon %>%
  str_squish() %>%
  as_tibble() %>%
  unnest_tokens(input = value, output = sentence, token = "sentences")
sentences_moon

When tokenizing into sentences, the period marks the sentence boundary, so do not strip special characters beforehand
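The effect of stripping punctuation too early can be checked on a two-sentence sample. This is a sketch under the assumption that the "sentences" tokenizer relies on sentence-final punctuation to find boundaries; the sample text and variable names are illustrative.

```r
library(dplyr)
library(stringr)
library(tidytext)

toy <- tibble(value = "대한민국은 민주공화국이다. 모든 권력은 국민으로부터 나온다.")

# Punctuation kept: the tokenizer can see where each sentence ends.
kept <- toy %>%
  unnest_tokens(input = value, output = sentence, token = "sentences")

# Punctuation stripped first: the boundary information is gone.
stripped <- toy %>%
  mutate(value = str_squish(str_replace_all(value, "[^가-힣]", " "))) %>%
  unnest_tokens(input = value, output = sentence, token = "sentences")

nrow(kept)
nrow(stripped)
```

Comparing the two row counts shows why raw_moon, not the cleaned moon, is the input for sentence tokenization.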

#2 Extract sentences that contain a specific word

sentences_moon %>% filter(str_detect(sentence, "국민"))

Checking whether a word is present: str_detect()

Returns TRUE if the word occurs in the sentence and FALSE otherwise
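A short sketch of how str_detect() behaves on a character vector, and how its logical result is used for filtering. The three sentences are invented for illustration and do not come from the speech.

```r
library(stringr)

# Made-up sentences for illustration only.
toy_sentences <- c("국민 여러분 감사합니다",
                   "일자리를 만들겠습니다",
                   "국민의 삶을 바꾸겠습니다")

# One TRUE/FALSE per sentence.
has_word <- str_detect(toy_sentences, "국민")
has_word

# Using the logical vector as an index keeps only the matching sentences,
# which is what filter() does row by row inside a pipeline.
toy_sentences[has_word]
```

Note that the match is on substrings, so "국민의" counts as containing "국민".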

#3 Extract sentences that contain a different word

sentences_moon %>% filter(str_detect(sentence, "일자리"))

cf.) When the text is long, a tibble prints only as much as fits the width of the Console.
To print the full content: %>% data.frame()
To print it left-aligned: %>% print.data.frame(right = F)