[ R ] Bayesian

Statistics-Based Data Analysis

yul_S2 2022. 12. 2. 11:25

Bayesian

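A Bayesian probability model starts from a 'prior probability' built on subjective inference and updates it into a 'posterior probability' using additional observations, on the view that this reduces uncertainty.

Bayes' theorem is the rule for computing the posterior probability. Picking the value that maximizes the posterior is called MAP (Maximum a Posteriori) estimation, which is why Bayesian inference is sometimes described as a MAP problem.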
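
To make the prior-to-posterior update concrete, here is a minimal sketch in base R. The prior and likelihood values are made up for illustration and do not come from heart.csv; the variable names are also just placeholders.

```r
# Hypothetical numbers for illustration only (not taken from heart.csv)
prior      <- c(disease = 0.10, healthy = 0.90)   # P(H): belief before seeing data
likelihood <- c(disease = 0.80, healthy = 0.20)   # P(positive test | H)

# Bayes' theorem: P(H | data) = P(data | H) * P(H) / P(data)
evidence  <- sum(likelihood * prior)              # P(data), the normalizing constant
posterior <- likelihood * prior / evidence

print(round(posterior, 3))

# MAP estimate: the hypothesis with the largest posterior probability
map_estimate <- names(posterior)[which.max(posterior)]
print(map_estimate)
```

With these numbers the probability of disease rises from the 0.10 prior to roughly 0.31 after the observation, yet the MAP estimate is still "healthy" because its posterior remains larger.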

Data file: heart.csv (0.02MB)

Example

#1 Install packages

install.packages("e1071")
install.packages("caret")

#2 Load the data

data <- read.csv("C:/heart.csv", header = TRUE)

head(data)
# X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca       Thal AHD
# 1 1  63   1      typical    145  233   1       2   150     0     2.3     3  0      fixed  No
# 2 2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2  3     normal Yes
# 3 3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2  2 reversable Yes
# 4 4  37   1   nonanginal    130  250   0       0   187     0     3.5     3  0     normal  No
# 5 5  41   0   nontypical    130  204   0       2   172     0     1.4     1  0     normal  No
# 6 6  56   1   nontypical    120  236   0       0   178     0     0.8     1  0     normal  No

str(data)
# 'data.frame': 303 obs. of  15 variables:
# $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
# $ Age      : int  63 67 67 37 41 56 62 57 63 53 ...
# $ Sex      : int  1 1 1 1 0 1 0 0 1 1 ...
# $ ChestPain: chr  "typical" "asymptomatic" "asymptomatic" "nonanginal" ...
# $ RestBP   : int  145 160 120 130 130 120 140 120 130 140 ...
# $ Chol     : int  233 286 229 250 204 236 268 354 254 203 ...
# $ Fbs      : int  1 0 0 0 0 0 0 0 0 1 ...
# $ RestECG  : int  2 2 2 0 2 0 2 0 2 2 ...
# $ MaxHR    : int  150 108 129 187 172 178 160 163 147 155 ...
# $ ExAng    : int  0 1 1 0 0 0 0 1 0 1 ...
# $ Oldpeak  : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
# $ Slope    : int  3 2 2 3 1 1 3 1 2 3 ...
# $ Ca       : int  0 3 2 0 0 0 2 0 1 0 ...
# $ Thal     : chr  "fixed" "normal" "reversable" "normal" ...
# $ AHD      : chr  "No" "Yes" "Yes" "No" ...

#3 Split the data

library(caret)

set.seed(1234)
tr_data <- createDataPartition(y=data$AHD, p=0.7, list=FALSE)

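# Similar in purpose to tr_data <- sample(1:nrow(data), nrow(data)*0.7), but the result is not the same:
# createDataPartition samples within each level of y, so the No/Yes ratio is kept in both sets

About the arguments of createDataPartition(y=data$AHD, p=0.7, list=FALSE):

# y=data$AHD : AHD (chr "No" "Yes" "Yes" "No" ...) is the dependent variable
# p=0.7 : 70% of the rows go to the training data, giving a 7:3 split between training and test data

To split the data set into 70% training data and 30% validation (test) data, createDataPartition is used to extract the indices of the rows that will serve as training data.

# list = FALSE : the index used in tr <- data[tr_data,] must be a vector, matrix, or array, so list = FALSE is set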
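
The difference between a plain random sample and a class-wise (stratified) sample can be sketched in base R as follows. The label vector `y` is a toy stand-in for data$AHD, and the code only illustrates the idea of sampling within each class; it is not the caret implementation.

```r
# Toy labels standing in for data$AHD (hypothetical, not the real file)
set.seed(1234)
y <- rep(c("No", "Yes"), times = c(60, 40))

# Simple random sampling: the class ratio in the sample can drift
simple_idx <- sample(seq_along(y), size = round(0.7 * length(y)))

# Stratified sampling: draw 70% separately within each class
strat_idx <- sort(unlist(lapply(split(seq_along(y), y), function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
})))

print(table(y[simple_idx]))   # ratio depends on the random draw
print(table(y[strat_idx]))    # 70% of each class by construction
```

Both index vectors select 70 rows, but only the stratified one is guaranteed to keep the 60:40 class ratio of the full data.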

With tr_data <- createDataPartition(y=data$AHD, p=0.7, list=FALSE), checking the class of tr_data gives

class(tr_data) # [1] "matrix" "array"

so tr <- data[tr_data,] and te <- data[-tr_data,] below work as written.

If you instead run

tr_data <- createDataPartition(y=data$AHD, p=0.7, list=TRUE)

then class(tr_data) is "list" (class(tr_data) # [1] "list"), and the rows have to be selected with tr <- data[unlist(tr_data),] and te <- data[-unlist(tr_data),].
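
The two index shapes can be tried out without caret. In the sketch below, `df`, `idx_matrix`, and `idx_list` are hand-written, hypothetical stand-ins that mimic the data and the two return shapes of createDataPartition.

```r
# Toy data frame and hand-written indices (hypothetical stand-ins for
# data and for the output of createDataPartition)
df <- data.frame(id = 1:6, label = c("No", "Yes", "No", "Yes", "No", "Yes"))

idx_matrix <- matrix(c(1L, 3L, 4L, 6L), ncol = 1)   # shape returned with list = FALSE
idx_list   <- list(Resample1 = c(1L, 3L, 4L, 6L))   # shape returned with list = TRUE

train_a <- df[idx_matrix, ]         # a one-column matrix works as a row index
train_b <- df[unlist(idx_list), ]   # a list has to be flattened first
test_a  <- df[-idx_matrix, ]        # a negative index keeps the remaining rows

print(all(train_a$id == train_b$id))   # both forms select the same rows
print(test_a$id)
```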

#4 Create the training and test data and inspect them

tr <- data[tr_data,]    # create the training data
te <- data[-tr_data,]    # create the test data from the rows not in tr_data

tr
# X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca       Thal AHD
# 1   1  63   1      typical    145  233   1       2   150     0     2.3     3  0      fixed  No
# 3   3  67   1 asymptomatic    120  229   0       2   129     1     2.6     2  2 reversable Yes
# 5   5  41   0   nontypical    130  204   0       2   172     0     1.4     1  0     normal  No
# 6   6  56   1   nontypical    120  236   0       0   178     0     0.8     1  0     normal  No
# 7   7  62   0 asymptomatic    140  268   0       2   160     0     3.6     3  2     normal Yes
# 9   9  63   1 asymptomatic    130  254   0       2   147     0     1.4     2  1 reversable Yes

te
# X Age Sex    ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca       Thal AHD
# 2     2  67   1 asymptomatic    160  286   0       2   108     1     1.5     2  3     normal Yes
# 4     4  37   1   nonanginal    130  250   0       0   187     0     3.5     3  0     normal  No
# 8     8  57   0 asymptomatic    120  354   0       0   163     1     0.6     1  0     normal  No
# 14   14  44   1   nontypical    120  263   0       0   173     0     0.0     1  0 reversable  No
# 16   16  57   1   nonanginal    150  168   0       0   174     0     1.6     1  0     normal  No

#5 naiveBayes

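library(e1071)    # provides naiveBayes(); installed in #1 but not loaded until now

# Note: AHD ~ . uses every other column as a predictor, including the row index X
Bayes <- naiveBayes(AHD~. ,data=tr)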
Bayes
# Naive Bayes Classifier for Discrete Predictors
#
# Call:
# naiveBayes.default(x = X, y = Y, laplace = laplace)
# ..........

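naiveBayes(training data, training labels, laplace value) => builds the model

Arguments of the naiveBayes() function in the e1071 package
x	predictors (numeric matrix, or data frame of categorical and/or numeric variables)
y	dependent variable (class vector)
formula	model formula
data	training data, given either as a raw data frame or as a contingency table
laplace	Laplace smoothing value (default=0, meaning no smoothing)
na.action	how NA (missing) values are handled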
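
What the laplace argument is for can be shown with a small base R sketch of additive (Laplace) smoothing. The data are made up, `smooth_cond_prob` is a hypothetical helper written for this illustration, and the code shows the general technique rather than the e1071 source.

```r
# Toy training data (hypothetical): one categorical predictor and a class label
chest_pain <- factor(
  c("typical", "typical", "nonanginal", "asymptomatic", "asymptomatic", "asymptomatic"),
  levels = c("asymptomatic", "nonanginal", "typical")
)
ahd <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))

counts <- table(ahd, chest_pain)   # class-by-category counts

# Additive smoothing of P(category | class): add `laplace` to every cell
smooth_cond_prob <- function(counts, laplace = 0) {
  (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))
}

p0 <- smooth_cond_prob(counts, laplace = 0)   # unseen combinations get probability 0
p1 <- smooth_cond_prob(counts, laplace = 1)   # every category gets a nonzero probability
print(p0)
print(p1)
```

Without smoothing, a category never seen with a class gets probability 0, which zeroes out the whole product for that class at prediction time; a positive laplace value avoids that.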

#6 Prediction

predicted <- predict(Bayes, te, type="class")
table(predicted, te$AHD)
# predicted No Yes
# No 41 12
# Yes 8 29

predict(fitted model, test data) => returns the predicted values
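
How a naive Bayes classifier arrives at a predicted class can be sketched by hand in base R. The training rows below are invented (only the column names are borrowed from heart.csv), and the sketch covers categorical predictors only; numeric predictors would need a density instead of a frequency table.

```r
# Toy training data (hypothetical values; column names borrowed from heart.csv)
train <- data.frame(
  Sex   = factor(c("M", "M", "F", "F", "M", "M", "F", "M")),
  ExAng = factor(c("yes", "no", "no", "no", "yes", "yes", "yes", "no")),
  AHD   = factor(c("Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"))
)

prior <- prop.table(table(train$AHD))                            # P(class)
p_sex <- prop.table(table(train$AHD, train$Sex), margin = 1)     # P(Sex | class)
p_ang <- prop.table(table(train$AHD, train$ExAng), margin = 1)   # P(ExAng | class)

# Score one new case: prior times per-feature likelihoods, then normalize
new_case  <- list(Sex = "M", ExAng = "yes")
score     <- as.numeric(prior) * p_sex[, new_case$Sex] * p_ang[, new_case$ExAng]
posterior <- setNames(as.numeric(score / sum(score)), levels(train$AHD))

print(round(posterior, 3))
predicted_class <- names(posterior)[which.max(posterior)]
print(predicted_class)
```

The "naive" part is the multiplication of per-feature probabilities, which treats the predictors as independent given the class. In e1071, predict() with type="raw" returns posterior probabilities of this kind instead of the class label.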

#7 Check the confusion matrix of predicted values against the dependent variable

str(predicted)
# Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 2 1 2 ...
str(te$AHD)
# chr [1:90] "Yes" "No" "No" "No" "No" "No" "No" "No" "No" "Yes" "No"

AHD <- as.factor(te$AHD) # convert AHD to a factor
confusionMatrix(predicted, AHD)

(Image: confusionMatrix output)
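
The headline numbers of a confusion matrix can be recomputed by hand from the table(predicted, te$AHD) counts shown in #6. The sketch below uses base R only and assumes "No" is the positive class, since caret uses the first factor level as positive for two-class problems unless told otherwise.

```r
# Counts copied from the table(predicted, te$AHD) output in step #6
cm <- matrix(c(41, 8, 12, 29), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"), actual = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all test cases

# Assumption: "No" (the first factor level) is treated as the positive class
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # correctly predicted No / all actual No
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # correctly predicted Yes / all actual Yes

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```

For these counts the accuracy is 70/90, about 0.778, with a sensitivity of 41/49 and a specificity of 29/41.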
