Comparative Analysis Exercises

59doit

yul_S2 2022. 12. 19. 14:03

Attachment: speeches_presidents.csv (0.05 MB)

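Q1. Solve the following problems using speeches_presidents.csv, which contains the presidential candidacy declaration speeches of past presidents.

1 Load speeches_presidents.csv, extract the speeches of former presidents Lee Myung-bak (이명박) and Roh Moo-hyun (노무현), and preprocess them so they are suitable for analysis.

1-1) Load the data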

library(tidyverse)  # loads readr, dplyr, stringr, tidyr, and ggplot2
raw_speeches <- read_csv("C:/speeches_presidents.csv")
raw_speeches

1-2) Preprocess the data

speeches <- raw_speeches %>%
  filter(president %in% c("이명박", "노무현")) %>%  # keep the two speeches to compare
  mutate(value = str_replace_all(value, "[^가-힣]", " "),
         value = str_squish(value))
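To see what the two preprocessing calls do, here is a minimal sketch on a made-up string (the sample text is hypothetical and is not taken from the CSV). It assumes only that the stringr package is installed.

```r
library(stringr)

# Hypothetical sample text, not taken from speeches_presidents.csv
raw <- "국민 여러분!  2022년, Hello   대한민국."

# Replace every character outside the Hangul syllable range with a space
step1 <- str_replace_all(raw, "[^가-힣]", " ")

# Trim both ends and collapse runs of whitespace into a single space
clean <- str_squish(step1)
clean
# "국민 여러분 년 대한민국"
```

Note that digits, punctuation, and Latin letters all disappear, so a fragment such as "년" can be left behind on its own. The later filter on word length is what removes one-character leftovers like this.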

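2 Extract the nouns from the speeches, then compute the word frequencies for each speech.

2-1) Tokenize by extracting nouns

library(tidytext); library(KoNLP)  # unnest_tokens(), extractNoun()
speeches <- speeches %>%
  unnest_tokens(input = value, output = word, token = extractNoun)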

2-2) Word frequency

frequency <- speeches %>%
  count(president, word) %>%
  filter(str_count(word) > 1)  # drop one-character words
frequency

★3 Use the log odds ratio to extract the 10 relatively most important words from each of the two speeches.

3-1) Convert long form to wide form

frequency_wide <- frequency %>%
  pivot_wider(names_from = president, values_from = n, values_fill = list(n = 0))
frequency_wide

names_from = : the column whose values become the new column names

values_from = : the column whose values fill the new columns

values_fill = list(n = 0) : fills missing values with 0
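The three arguments are easier to follow on a tiny table. The sketch below uses hypothetical counts and hypothetical document labels "A" and "B" in place of the president names; it assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# Hypothetical counts: "경제" appears only in document A
long <- tribble(
  ~doc, ~word,  ~n,
  "A",  "국민", 3,
  "B",  "국민", 5,
  "A",  "경제", 2
)

# One row per word, one column per document; absent pairs become 0
wide <- pivot_wider(long,
                    names_from  = doc,
                    values_from = n,
                    values_fill = list(n = 0))
wide
```

Without values_fill, the cell for "경제" in column B would be NA, and any later arithmetic on that column would propagate the NA.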

3-2) Compute the log odds ratio

frequency_wide <- frequency_wide %>%
  mutate(log_odds_ratio = log(((이명박 + 1) / (sum(이명박 + 1))) /
                              ((노무현 + 1) / (sum(노무현 + 1)))))
frequency_wide
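As a rough check of the formula, the following sketch applies the same expression to a hypothetical three-word table with columns a and b standing in for the two presidents. The data and the column names are made up for illustration; only dplyr and tibble are assumed.

```r
library(dplyr)
library(tibble)

# Hypothetical word counts for two documents, a and b
toy <- tibble(word = c("x", "y", "z"),
              a = c(9, 1, 0),
              b = c(1, 1, 8))

toy <- toy %>%
  # Add 1 to every count so a zero count never produces log(0) or a division by 0
  mutate(log_odds_ratio = log(((a + 1) / sum(a + 1)) /
                              ((b + 1) / sum(b + 1))),
         # Positive values lean toward a; zero and negative values go to b
         side = ifelse(log_odds_ratio > 0, "a", "b"))
toy
```

Here "x" gets log(5), about 1.61, "z" gets log(1/9), about -2.20, and "y" is used at the same rate in both documents, so its log odds ratio is exactly 0 and it falls into the "b" group under the `> 0` rule.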

3-3) Extract the relatively important words

top10 <- frequency_wide %>%
  group_by(president = ifelse(log_odds_ratio > 0, "lee", "roh")) %>%
  slice_max(abs(log_odds_ratio), n = 10, with_ties = FALSE)
top10

# Words with a log odds ratio above 0 are labeled "lee"; all others are labeled "roh"

4 Create a bar chart showing the relatively important words in the two speeches.

library(ggplot2)
ggplot(top10, aes(x = reorder(word, log_odds_ratio),
                  y = log_odds_ratio,
                  fill = president)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL)

Attachment: inaugural_address.csv (0.05 MB)

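Q2. Solve the following problems using inaugural_address.csv, which contains the inaugural addresses of past presidents.

1 Load inaugural_address.csv, preprocess it so it is suitable for analysis, and extract the nouns from the speeches.

1-1) Load the data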

add_speeches <- read_csv("C:/inaugural_address.csv")
add_speeches

1-2) Preprocess

speeches <- add_speeches %>%
  mutate(value = str_replace_all(value, "[^가-힣]", " "),
         value = str_squish(value))

speeches

1-3) Tokenize by noun

speeches <- speeches %>% unnest_tokens(input = value,
                                       output = word,
                                       token = extractNoun)

speeches

# # A tibble: 4,099 x 2
#    president word
#    <chr>     <chr>
#  1 문재인    국민
#  2 문재인    말씀
#  3 문재인    존경
#  4 문재인    사랑
#  5 문재인    하
#  6 문재인    국민
#  7 문재인    여러분
#  8 문재인    감사
#  9 문재인    국민
# 10 문재인    여러분
# # ... with 4,089 more rows

2 Use TF-IDF to extract the 10 relatively most important words from each speech.

2-1) Compute the word frequencies

frequency <- speeches %>%
  count(president, word) %>%
  filter(str_count(word) > 1)  # drop one-character words

frequency

2-2) Compute TF-IDF

frequency <- frequency %>%
  bind_tf_idf(term = word, document = president, n = n) %>%
  arrange(-tf_idf)

frequency

term = word : the column holding the words

document = president : the column that identifies each text

n = n : the column holding the word counts

2-3) Extract the relatively important words

top10 <- frequency %>%
  group_by(president) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE)

top10
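To make the TF-IDF score less of a black box, here is a minimal sketch that computes it by hand with dplyr on hypothetical counts. It assumes the usual definition, which to my understanding is also what bind_tf_idf() uses: tf is a word's share of its document's tokens, and idf is the natural log of the number of documents divided by the number of documents containing the word. The table, document labels, and words are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical counts: "common" occurs in both documents, the others in one each
counts <- tibble(doc  = c("d1", "d1", "d2", "d2"),
                 word = c("common", "rare1", "common", "rare2"),
                 n    = c(3, 1, 2, 2))

n_docs <- n_distinct(counts$doc)

manual <- counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                      # share of the document's tokens
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>%  # natural log
  ungroup() %>%
  mutate(tf_idf = tf * idf)
manual
```

A word that appears in every document gets idf = log(1) = 0, so "common" scores 0 no matter how often it is used. This is why words shared by all the inaugural addresses drop out of the per-president top 10.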

3. Create a bar chart showing the relatively important words in each speech.

ggplot(top10, aes(x = reorder_within(word, tf_idf, president),
                  y = tf_idf, fill = president)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~president, scales = "free", ncol = 2) +
  scale_x_reordered() +
  labs(x = NULL)

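facet_wrap(~president, scales = "free", ncol = 2)

The ggplot2 function that splits a plot into panels (facets).

The variable passed to facet_wrap() (column K in the table below) should be categorical (a factor or other discrete variable).

ncol : fixes the number of columns

scales determines whether the subplots share the same x- and y-axis scales ("fixed") or each get their own ("free").

To compare all panels under the same conditions, keep the default, scales = "fixed".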

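Argument	Example	Meaning
vars()	vars(columnK)	Column used to split the plot into facets (column K should be a categorical/factor variable)
nrow	nrow = number	Number of rows of subplots in the faceted plot
ncol	ncol = number	Number of columns of subplots in the faceted plot
labeller	labeller = "label_both"	Whether the variable name (column K, ...) is shown in the subplot labels
as.table	as.table = TRUE	TRUE / FALSE; order in which the subplots are laid out
scales	scales = "fixed"	free / fixed / free_x / free_y (default: fixed); whether the subplots share x- and y-axis scales
strip.position	strip.position = "top"	top / bottom / left / right; position of each subplot's label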
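The effect of scales is easiest to see with two groups on very different ranges. The sketch below builds the same chart twice on hypothetical data; the data frame and its column names are made up for illustration, and only ggplot2 is assumed.

```r
library(ggplot2)

# Hypothetical data: two groups whose values differ by two orders of magnitude
df <- data.frame(group = rep(c("g1", "g2"), each = 3),
                 item  = rep(c("a", "b", "c"), times = 2),
                 value = c(1, 2, 3, 100, 200, 300))

base <- ggplot(df, aes(x = item, y = value)) +
  geom_col()

# Default scales = "fixed": both panels share one y-axis, so g1's bars look tiny
p_fixed <- base + facet_wrap(vars(group), ncol = 2)

# scales = "free_y": each panel gets its own y-axis range
p_free <- base + facet_wrap(vars(group), ncol = 2, scales = "free_y")
```

Printing p_fixed makes the magnitudes comparable across panels, while p_free makes the pattern within each panel readable. The TF-IDF chart uses "free" because each president has a different set of words on the axis.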
