[ R ] Ensemble #2 - Random Forest Example

yul_S2 2022. 11. 30. 14:57

3. Random Forest

The randomForest() function

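formula: model formula of the form y ~ x, giving the response and the explanatory variables
data: the data set used to build the model
ntree: number of trees to grow, each on a bootstrap sample drawn with replacement
mtry: number of variables randomly sampled as split candidates at each node
na.action: function that specifies how missing values (NA) are handled
importance: whether variable importance should be computed while the classification model is built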

# 1 Install the package and load the data set

install.packages("randomForest")
library(randomForest)
data(iris)

# 2 Build a random forest model

model <- randomForest(Species ~ ., data = iris)
model
# Call:
#   randomForest(formula = Species ~ ., data = iris)
# Type of random forest: classification
# Number of trees: 500
# No. of variables tried at each split: 2
#
# OOB estimate of  error rate: 4.67%
# Confusion matrix:
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          4        46        0.08

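‘Number of trees: 500’: 500 trees (not 500 forests) were grown, each on a bootstrap sample drawn with replacement from the training data.
‘No. of variables tried at each split: 2’: at each split, 2 randomly selected variables were considered as candidates for dividing the node into child nodes.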
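
The value 2 reported above matches the default that the randomForest documentation gives for classification: the floor of the square root of the number of predictors. A minimal base-R check, where `p` and `default_mtry` are illustrative names that do not appear in the original post:

```r
# Number of predictors in iris: every column except the response, Species
p <- ncol(iris) - 1

# Documented default mtry for classification: floor(sqrt(p))
default_mtry <- floor(sqrt(p))
default_mtry
# 2
```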

# 3 Tune the parameters: 300 trees, 4 variables per split

model2 <- randomForest(Species ~ ., data = iris,
                       ntree = 300, mtry = 4, na.action = na.omit)

model2

# Call:
#   randomForest(formula = Species ~ ., data = iris, ntree = 300, mtry = 4, na.action = na.omit)
# Type of random forest: classification
# Number of trees: 300
# No. of variables tried at each split: 4
#
# OOB estimate of  error rate: 4.67%
# Confusion matrix:
#            setosa versicolor virginica class.error
# setosa         50          0         0        0.00
# versicolor      0         47         3        0.06
# virginica       0          4        46        0.08

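na.action: specifies how NA values are handled. Here it is set to na.omit, which drops rows that contain NA. iris has no missing values, so the setting has no effect in this example.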
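
To see what na.omit actually does, here is a small base-R sketch. `iris_na` is a hypothetical copy of the data with two missing cells added purely for illustration; it is not part of the original example.

```r
# iris ships without missing values, so na.omit has nothing to drop
sum(is.na(iris))
# 0

# Hypothetical copy with two missing cells, for illustration only
iris_na <- iris
iris_na[1, "Sepal.Length"] <- NA
iris_na[2, "Petal.Width"] <- NA

# na.omit() removes every row that contains at least one NA
nrow(na.omit(iris_na))
# 148
```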

# 4 Build a random forest model with variable importance

4-1) Build the model with importance measures enabled

model3 <- randomForest(Species ~ ., data = iris,
                       importance = TRUE, na.action = na.omit)

4-2) View variable importance

importance(model3)
#                 setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length  7.130689   8.493780  8.714849            11.800593        10.380579
# Sepal.Width   5.201695   1.255848  4.709957             5.964347         2.557567
# Petal.Length 22.294499  32.131832 29.261426            34.683212        43.259832
# Petal.Width  22.290376  33.125590 30.609805            34.360385        43.050580

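importance() function: reports which input variables mattered most while the classification model was built

MeanDecreaseAccuracy: how much each variable contributes to classification accuracy, measured as the mean drop in accuracy when that variable's values are permuted in the out-of-bag data. Larger values mean a more important variable.

MeanDecreaseGini: how much each variable contributes to reducing node impurity (uncertainty), measured with the Gini index. Larger values mean a more important variable.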
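
Node impurity can be made concrete with a few lines of base R. This is only a sketch of the Gini impurity measure itself, not of how randomForest computes its importance score internally; `gini` and `p_root` are hypothetical names introduced here.

```r
# Gini impurity of a node, given its class proportions (hypothetical helper)
gini <- function(p) 1 - sum(p^2)

# Root node of iris: three classes with 50 rows each
p_root <- as.numeric(table(iris$Species)) / nrow(iris)
gini(p_root)
# 0.6666667

# A pure node that holds a single class has impurity 0
gini(c(1, 0, 0))
# 0
```

A split is useful when the child nodes it produces are purer, that is, have lower impurity, than the parent node.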

4-3) Visualize variable importance

varImpPlot(model3)

Entropy: a measure of uncertainty

x1 <- 0.5; x2 <- 0.5
e1 <- -x1 * log2(x1) - x2 * log2(x2)
e1
# 1

x1 <- 0.7; x2 <- 0.3
e2 <- -x1 * log2(x1) - x2 * log2(x2)
e2
# 0.8812909

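Lower entropy means lower uncertainty.

When uncertainty is lower, the classes are more cleanly separated, so classification accuracy can generally be expected to improve.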
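
The two-class calculations above generalize to any number of classes. A minimal sketch, assuming a hypothetical helper named `entropy` that takes a vector of class probabilities:

```r
# Shannon entropy in bits for a vector of class probabilities (hypothetical helper)
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))     # 1
entropy(c(0.7, 0.3))     # 0.8812909
entropy(c(1, 0))         # 0: a pure node carries no uncertainty
```

The first two calls reproduce e1 and e2, and the last shows the limiting case of a node that contains a single class.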

Finding the best parameters (ntree, mtry)

# 1 Create the candidate parameter values

ntree <- c(400, 500, 600)
mtry <- c(2:4)
param <- data.frame(n = ntree, m = mtry)

param
# n m
# 1 400 2
# 2 500 3
# 3 600 4

# 2 Build the models with a nested for() loop

for(i in param$n) {
  cat('ntree =', i, '\n')
  for(j in param$m) {
    cat('mtry =', j, '\n')
    model_iris <- randomForest(Species ~ ., data = iris,
                               ntree = i, mtry = j, na.action = na.omit)
    print(model_iris)
  }
}

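Nine models are produced. Compare their out-of-bag (OOB) error rates (the 'OOB estimate of error rate' line in each printout) to choose the best number of trees and number of variables.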
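
Reading nine printouts by eye is error-prone, so the comparison can also be collected into a single table. The sketch below is one possible way to do it, not the original author's method: `grid` and `oob_error` are hypothetical names, the seed is arbitrary, and the OOB rate is read from the model's `err.rate` matrix (last row, column "OOB").

```r
library(randomForest)

set.seed(123)  # arbitrary seed, used only so the run is repeatable

# All 9 combinations of the candidate values
grid <- expand.grid(ntree = c(400, 500, 600), mtry = 2:4)

# Final OOB error rate of each model: row ntree of err.rate, column "OOB"
grid$oob_error <- mapply(function(n, m) {
  fit <- randomForest(Species ~ ., data = iris, ntree = n, mtry = m)
  fit$err.rate[n, "OOB"]
}, grid$ntree, grid$mtry)

# Smallest OOB error first
grid[order(grid$oob_error), ]
```

Because each forest is built from random bootstrap samples, the ranking can change from run to run when the differences between models are small.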