Posts in the 'Tools' category (page 14)


Tools/Keras

Keras - Training an MLP (Multi-Layer Perceptron) on MNIST Data

2017. 1. 20. 03:36

/**

Keras Mnist Learning Using Multi Layer Perceptron

Author: 3개월

Date: 2017.1.20

Code highlighter: http://markup.su/highlighter/

1. Reading the data

import cPickle, gzip, numpy

# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()

Download mnist.pkl.gz from

https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/data/mnist.pkl.gz

and place it in the same folder as the code.

cPickle.load(f) returns the data already split into train_set, valid_set, and test_set.

# train_set is a tuple whose elements are two ndarrays
print train_set
print type(train_set)
print len(train_set)

# a tuple can be unpacked into two names like this
X_train, y_train = train_set
X_test, y_test = test_set

print type(X_train)
print len(X_train)
print len(y_train)
print X_train
print y_train

As shown above, train_set can be split into X_train and y_train.

train_set is a tuple with two elements (X, y), and a tuple can be unpacked with this syntax.
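As a minimal, self-contained sketch of the same unpacking syntax (written for Python 3, with a made-up two-element tuple standing in for train_set; the values are illustrative only and are not MNIST data):

```python
# A stand-in for train_set: a tuple of (inputs, labels).
# The values are made up for illustration; the real train_set holds ndarrays.
toy_train_set = ([[0.0, 0.5], [1.0, 0.25]], [3, 7])

# Tuple unpacking assigns each element of the tuple to its own name.
X_toy, y_toy = toy_train_set

print(len(toy_train_set))  # number of elements in the tuple
print(X_toy)
print(y_toy)
```

Unpacking fails with a ValueError if the number of names on the left does not match the number of elements in the tuple.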

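2. Implementing the Multi-Layer Perceptron

Before using a CNN, we train on the dataset with a Multi-Layer Perceptron. By the universal approximation theorem, an MLP with enough hidden units can in principle approximate any continuous function, so in theory it can handle image classification and many other problems. (In practice the limit is usually the amount of data.)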

# multi-layer perceptron
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils

Import the packages.

seed = 7
numpy.random.seed(seed)

Set the numpy random seed so that repeated runs give the same result.

print type(X_train)
print type(y_train)

Both are ndarrays. X_train is a 2D array of shape (50000, 784), and y_train is a 1D array of length 50000. The 784 comes from each MNIST image being a 28*28 = 784 pixel grayscale handwritten digit, flattened into a single row. The 50000 is the number of training examples.

# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255

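A grayscale pixel normally takes a value from 0 to 255, so dividing by 255 normalizes it to the range 0-1. It is worth checking X_train.max() first: if the loaded pickle already stores floats in the range 0-1, this division should be skipped, because otherwise the inputs become very small.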
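The normalization step can be sketched without numpy as follows. This is only an illustration in Python 3: scale_to_unit is a hypothetical helper (not part of Keras or numpy) that divides by 255 only when the values still look like raw 0-255 intensities.

```python
def scale_to_unit(pixels):
    """Return pixel values scaled to [0, 1].

    Hypothetical helper: divides by 255 only when the data still looks
    like raw 0-255 intensities, so already-normalized data is left alone.
    """
    if max(pixels) > 1.0:
        return [p / 255.0 for p in pixels]
    return list(pixels)

raw = [0, 128, 255]          # raw 8-bit intensities
already = [0.0, 0.5, 1.0]    # data that is already normalized

print(scale_to_unit(raw))
print(scale_to_unit(already))
```

With ndarrays the same check would be a comparison on X_train.max() before dividing.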

# one hot encode outputs
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)

# the number of classes is the number of columns of y_test: 10
num_classes = y_test.shape[1]
num_pixels = X_train.shape[1]

The y values are one-hot encoded as shown above.

The variables num_classes and num_pixels, which are used when building the MLP models below, are also defined here.
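To make the encoding concrete, here is a rough pure-Python sketch of what one-hot encoding does. The one_hot function is a hypothetical stand-in written for this illustration, not the Keras implementation; it assumes labels are integers starting at 0 and, when no class count is given, uses the largest label plus one.

```python
def one_hot(labels, num_classes=None):
    """Convert integer class labels into one-hot rows (lists of 0.0/1.0)."""
    if num_classes is None:
        num_classes = max(labels) + 1  # assumes classes are 0..max(labels)
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0               # mark the position of the true class
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1])
print(encoded)

# The row length plays the role of y_test.shape[1] in the code above.
num_classes_toy = len(encoded[0])
print(num_classes_toy)
```

For MNIST the labels run from 0 to 9, so each row has 10 entries, which is where num_classes = 10 comes from.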

# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=num_pixels, init='normal', activation='relu'))
    model.add(Dense(num_classes, init='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

def good_model():
    # create model
    model = Sequential()
    model.add(Dense(400, input_dim=num_pixels, init='normal', activation='relu'))
    model.add(Dense(100, init='normal', activation='relu'))
    model.add(Dense(num_classes, init='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

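Two MLPs are defined above.

baseline_model is a network with one hidden layer, and good_model is a network with two hidden layers.

In Keras the input layer is omitted; instead, the network is built by passing "input_dim=<number of input-layer nodes>" to the first hidden layer.
A common rule of thumb is that more hidden layers with fewer nodes per layer tend to overfit less and generalize better, although this is not guaranteed for every problem.

We will train and test both networks on the data.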
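One way to compare the size of the two networks is to count their trainable parameters. The sketch below (Python 3, no Keras required) assumes plain fully connected layers with one bias per node, as in the Dense layers above; dense_param_count is a helper name introduced here for illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the node counts from input to output,
    e.g. [784, 400, 100, 10].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + one bias per output node
    return total

baseline = dense_param_count([784, 784, 10])     # baseline_model
good = dense_param_count([784, 400, 100, 10])    # good_model
print(baseline)  # 623290
print(good)      # 355110
```

So the deeper good_model actually has fewer parameters than the single-hidden-layer baseline_model.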

3. Training & Test

MLP with one hidden layer

# build the model
model = baseline_model()
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=200, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Baseline Error: %.2f%%" % (100-scores[1]*100))

This code trains and tests the first model, baseline_model.

The test set is used in place of a validation set.

When run, it goes through 10 epochs and prints the training accuracy and loss along with the validation accuracy and loss.

The error is based on the validation accuracy: (1-0.9232)*100 = 7.68%.
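The error figure is simply the complement of the accuracy, expressed in percent. A small Python 3 sketch of that arithmetic, using the accuracy value quoted in the text (error_percent is a name introduced here for illustration):

```python
def error_percent(accuracy):
    """Error rate in percent, given accuracy as a fraction in [0, 1]."""
    return 100 - accuracy * 100

validation_accuracy = 0.9232  # value quoted in the text
print("Baseline Error: %.2f%%" % error_percent(validation_accuracy))
```

This mirrors the expression 100-scores[1]*100 in the training code, where scores[1] holds the accuracy returned by model.evaluate.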

MLP with two hidden layers (a deeper model)

# build the model
model2 = good_model()
# Fit the model
model2.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=10, batch_size=200, verbose=2)
# Final evaluation of the model
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Good Model Error: %.2f%%" % (100-scores[1]*100))

This trains and tests the second model, good_model.

The error is 7.54%, so its accuracy is slightly higher than that of the first network.

4. Conclusion

We trained and tested an MLP on the MNIST data. For small grayscale images like MNIST, an MLP gives fairly good accuracy. However, as images get larger and the number of classes grows, a network designed for image processing such as a CNN (Convolutional Neural Network) is likely to be much more efficient. It is also important to choose the number of layers and the number of nodes per layer appropriately, and these sizes are usually found by trial and error. With enough experience building and training networks, you can develop some intuition for how many layers and nodes to choose.


Tools/Python

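Python - Logistic Regression 2

2016. 12. 29. 11:11

I practiced logistic regression on the Titanic data from the data analysis competition site Kaggle. The data lets us analyze how sex, age, and passenger class affected a passenger's survival.

The question is: "How, and how strongly, did sex (Sex), age (Age), passenger class (Pclass), and fare (Fare) affect survival?"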

=============================================================================================

 
# 2015. 8. 5 
 
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
 
df = pd.read_csv("c:/train2.csv") # read file
 
print df.head()
 
cols_to_keep = ['Survived', 'Age', 'Fare'] # non-categorical columns
 
# create dummy columns for the categorical columns
dummy_Pclass = pd.get_dummies(df['Pclass'], prefix='Pclass')
dummy_Sex = pd.get_dummies(df['Sex'], prefix='Sex')
 
# join the dummies onto the data
data = df[cols_to_keep].join(dummy_Pclass.loc[:,'Pclass_2':]) # join from Pclass_2 onward, so Pclass_1 is the reference level
data = data.join(dummy_Sex.loc[:,'Sex_male':]) # join only Sex_male
 
data['intercept'] = 1.0
 
# check the data so far
print data.head()
 
 
# the output looks like this
#   Survived  Age     Fare  Pclass_2  Pclass_3  Sex_male  intercept
#0         0   22   7.2500         0         1         1          1
#1         1   38  71.2833         0         0         0          1
#2         1   26   7.9250         0         1         0          1
#3         1   35  53.1000         0         0         0          1
#4         0   35   8.0500         0         1         1          1
 
 
# logistic regression
train_cols = data.columns[1:] # train_cols are the explanatory variables
logit = sm.Logit(data['Survived'], data[train_cols]) # Survived is the response variable
 
# fit the model
result = logit.fit() 
 
print result.summary() # print the analysis results
 
#==============================================================================
#                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
#------------------------------------------------------------------------------
#Age           -0.0330      0.007     -4.457      0.000        -0.048    -0.019
#Fare           0.0007      0.002      0.340      0.734        -0.003     0.005
#Pclass_2      -1.0809      0.286     -3.778      0.000        -1.642    -0.520
#Pclass_3      -2.2794      0.280     -8.142      0.000        -2.828    -1.731
#Sex_male      -2.6049      0.188    -13.881      0.000        -2.973    -2.237
#intercept      3.4772      0.418      8.318      0.000         2.658     4.297
#==============================================================================
 
# odds ratios only
print np.exp(result.params) # print the odds ratios
 
#Age           0.967515
#Fare          1.000714
#Pclass_2      0.339281
#Pclass_3      0.102351
#Sex_male      0.073911
#intercept    32.367967
 
data["predict"] = result.predict(data[train_cols])
print data.head()
 
# final result (predict is the survival probability)
#   Survived  Age     Fare  Pclass_2  Pclass_3  Sex_male  intercept   predict
#0         0   22   7.2500         0         1         1          1  0.106363
#1         1   38  71.2833         0         0         0          1  0.906625
#2         1   26   7.9250         0         1         0          1  0.585365
#3         1   35  53.1000         0         0         0          1  0.913663
#4         0   35   8.0500         0         1         1          1  0.071945

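Summary of the results

Looking only at the signs of the coef values (partial regression coefficients), the probability of survival was lower for older passengers, for men rather than women, for 2nd and 3rd class rather than 1st class, and for lower fares (although the Fare coefficient is not statistically significant, p = 0.734). Judging by the absolute values of the coefficients, sex had the largest effect on survival.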
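The predict column can be reproduced by hand from the coef table: logistic regression takes the weighted sum of the explanatory variables and passes it through the logistic (sigmoid) function. The sketch below is written for Python 3 with only the standard library; it uses the rounded coefficients from the summary and the first data row, so the result agrees with the table only to about three decimal places. predict_probability is a name introduced here for illustration.

```python
import math

# Coefficients copied (rounded) from the summary table.
coef = {
    "Age": -0.0330, "Fare": 0.0007, "Pclass_2": -1.0809,
    "Pclass_3": -2.2794, "Sex_male": -2.6049, "intercept": 3.4772,
}

# First row of the data: a 22-year-old male in 3rd class.
row = {"Age": 22, "Fare": 7.25, "Pclass_2": 0,
       "Pclass_3": 1, "Sex_male": 1, "intercept": 1}

def predict_probability(coef, row):
    """Logistic regression prediction: sigmoid of the linear predictor."""
    z = sum(coef[name] * row[name] for name in coef)
    return 1.0 / (1.0 + math.exp(-z))

p = predict_probability(coef, row)
print(round(p, 3))  # close to the 0.106363 reported for row 0
```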

Terminology

Odds Ratio

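An odds ratio is a ratio of odds. Odds are the ratio between two mutually exclusive outcomes that together cover all cases, such as success/failure. For example, 109 of the 577 male passengers survived. In this case Odds = P(survived)/P(died) = (109/577)/(468/577) = 0.19/0.81 = 0.23

Of the 314 female passengers, 233 survived. In this case Odds = P(survived)/P(died) = (233/314)/(81/314) = 2.88

So the Odds Ratio (male vs. female) = 0.23/2.88 = about 0.08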
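The same arithmetic as a short Python 3 sketch, using the passenger counts quoted above (the odds helper is introduced here for illustration). Note that this is the unadjusted odds ratio computed from raw counts; the model's Sex_male odds ratio (0.0739) additionally controls for age, fare, and class, so the two values need not match exactly.

```python
def odds(successes, total):
    """Odds = P(success) / P(failure) = successes / failures."""
    failures = total - successes
    return successes / failures

male_odds = odds(109, 577)     # 109 survivors vs. 468 deaths
female_odds = odds(233, 314)   # 233 survivors vs. 81 deaths
odds_ratio = male_odds / female_odds

print(round(male_odds, 2), round(female_odds, 2), round(odds_ratio, 2))
```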

[Source] Python logistic regression 2 | Author: 3개월


Tools/Python

Python - Logistic Regression

2016. 12. 29. 11:10

http://blog.yhathq.com/posts/logistic-regression-and-python.html


What we want to know: how do GPA, GRE, and the prestige of the undergraduate institution (prestige) affect admission to graduate school?

---------------------------------------------------------------------------------------------------------------------------

This post summarizes the blog post linked above, which explains everything in more detail.

Environment: python 2.7, eclipse pydev

1. Reading the data

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
 
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
 
print df.head()
 
df.columns = ["admit", "gre", "gpa", "prestige"] # rename the columns of df
print df.columns
 

2. Summarizing the data

print df.describe() # shows count, mean, standard deviation, min, 25%/50%/75% quartiles, and max
 
#             admit         gre         gpa   prestige
# count  400.000000  400.000000  400.000000  400.00000
# mean     0.317500  587.700000    3.389900    2.48500
# std      0.466087  115.516536    0.380567    0.94446
# min      0.000000  220.000000    2.260000    1.00000
# 25%      0.000000  520.000000    3.130000    2.00000
# 50%      0.000000  580.000000    3.395000    2.00000
# 75%      1.000000  660.000000    3.670000    3.00000
# max      1.000000  800.000000    4.000000    4.00000
 
print df.std() # print the standard deviation
 
# admit      0.466087
# gre      115.516536
# gpa        0.380567
# prestige   0.944460
 
print pd.crosstab(df['admit'], df['prestige'], rownames=['admit'])
 
# prestige   1   2   3   4
# admit                   
# 0         28  97  93  55
# 1         33  54  28  12
 
df.hist()
pl.show() # pl.show() is needed to display the plot; it draws a histogram for every column
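Note that std is the standard deviation, not the variance. As a quick check, the admit column can be rebuilt from the summary numbers alone: 400 rows with mean 0.3175 means 127 ones and 273 zeros (which matches the admit=1 row of the crosstab: 33+54+28+12 = 127). The following Python 3 sketch uses only the standard library and assumes the sample (n-1) definition of standard deviation.

```python
import statistics

# Rebuild the admit column from its summary: 127 admitted, 273 not admitted.
admit = [1.0] * 127 + [0.0] * 273

print(statistics.mean(admit))
print(round(statistics.stdev(admit), 6))      # sample standard deviation
print(round(statistics.variance(admit), 6))   # sample variance = stdev ** 2
```

The standard deviation printed here agrees with the 0.466087 in the describe() output, while the variance is a different (smaller) number.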

3. Converting to dummy variables

dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
print dummy_ranks.head()
 
#    prestige_1  prestige_2  prestige_3  prestige_4
# 0           0           0           1           0
# 1           0           0           1           0
# 2           1           0           0           0
# 3           0           0           0           1
# 4           0           0           0           1
 
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
print data.head()
#    admit  gre   gpa  prestige_2  prestige_3  prestige_4
# 0      0  380  3.61           0           1           0
# 1      1  660  3.67           0           1           0
# 2      1  800  4.00           0           0           0
# 3      1  640  3.19           0           0           1
# 4      0  520  2.93           0           0           1
 
data['intercept'] = 1.0
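To show what the dummy-variable step does, here is a rough pure-Python sketch (Python 3). dummy_columns is a hypothetical helper written for this illustration, not the pandas implementation; the list of levels is passed in explicitly, and the first level is dropped so that it acts as the reference category, mirroring the join from prestige_2 onward.

```python
def dummy_columns(values, prefix, levels, drop_first=True):
    """Build 0/1 indicator columns for a categorical variable.

    Returns a dict mapping column name -> list of 0/1 values. With
    drop_first=True the first level is omitted and serves as the
    reference category.
    """
    kept = list(levels)[1:] if drop_first else list(levels)
    return {
        "%s_%s" % (prefix, level): [1 if v == level else 0 for v in values]
        for level in kept
    }

prestige = [3, 3, 1, 4, 4]  # first five rows of the prestige column
columns = dummy_columns(prestige, "prestige", levels=[1, 2, 3, 4])
for name in sorted(columns):
    print(name, columns[name])
```

Dropping one level avoids perfect collinearity with the intercept column: if all four dummies were kept, they would always sum to 1, just like intercept.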
 

4. Running the regression

train_cols = data.columns[1:]
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
print result.summary()
 
 

              Logit Regression Results                           
          ==============================================================================
          Dep. Variable:                  admit   No. Observations:                  400
          Model:                          Logit   Df Residuals:                      394
          Method:                           MLE   Df Model:                            5
          Date:                Sun, 03 Mar 2013   Pseudo R-squ.:                 0.08292
          Time:                        12:34:59   Log-Likelihood:                -229.26
          converged:                       True   LL-Null:                       -249.99
                                                  LLR p-value:                 7.578e-08
          ==============================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
          ------------------------------------------------------------------------------
          gre            0.0023      0.001      2.070      0.038         0.000     0.004
          gpa            0.8040      0.332      2.423      0.015         0.154     1.454
          prestige_2    -0.6754      0.316     -2.134      0.033        -1.296    -0.055
          prestige_3    -1.3402      0.345     -3.881      0.000        -2.017    -0.663
          prestige_4    -1.5515      0.418     -3.713      0.000        -2.370    -0.733
          intercept     -3.9900      1.140     -3.500      0.000        -6.224    -1.756
          ==============================================================================

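Look at the coef column: gre: 0.0023, gpa: 0.8040, prestige_2: -0.6754, and so on.

If a coef (partial regression coefficient) is positive, then the larger that column's value, the higher the probability that the response variable is TRUE, i.e. that admit=1.

Conversely, if a coef is negative, then the larger that column's value, the higher the probability that the response variable is FALSE, i.e. that admit=0.

So a higher GRE or GPA raises the probability of admission, while belonging to prestige_2, prestige_3, or prestige_4 (rather than the reference level, prestige_1) lowers it.

This effect grows stronger as prestige gets lower (from prestige_2 to prestige_4).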
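To see the effect of the prestige coefficients in terms of probabilities, the sketch below (Python 3, standard library only) plugs the rounded coefficients from the summary into the logistic function. The applicant (GRE 660, GPA 3.67) is a hypothetical example evaluated at each prestige level, and admit_probability is a name introduced here for illustration.

```python
import math

# Coefficients copied (rounded) from the regression summary.
coef = {"gre": 0.0023, "gpa": 0.8040, "prestige_2": -0.6754,
        "prestige_3": -1.3402, "prestige_4": -1.5515, "intercept": -3.9900}

def admit_probability(gre, gpa, prestige):
    """Predicted P(admit=1) for one applicant; prestige is 1, 2, 3, or 4."""
    z = coef["intercept"] + coef["gre"] * gre + coef["gpa"] * gpa
    if prestige > 1:  # prestige_1 is the reference level and adds nothing
        z += coef["prestige_%d" % prestige]
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical applicant with the same scores at each prestige level.
for level in (1, 2, 3, 4):
    print(level, round(admit_probability(660, 3.67, level), 3))
```

With identical scores, the predicted probability falls steadily from prestige 1 to prestige 4, which is the pattern the negative coefficients describe.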

5. odds ratio

print np.exp(result.params)
# gre           1.002267
# gpa           2.234545
# prestige_2    0.508931
# prestige_3    0.261792
# prestige_4    0.211938
# intercept     0.018500
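Each odds ratio is the exponential of the corresponding coefficient: a one-unit increase in a variable multiplies the odds of admission by exp(coef). A minimal Python 3 sketch using the rounded coefficients from the summary (so the last digits differ slightly from the output above):

```python
import math

# Coefficients copied (rounded) from the regression summary.
coef = {"gre": 0.0023, "gpa": 0.8040, "prestige_2": -0.6754,
        "prestige_3": -1.3402, "prestige_4": -1.5515, "intercept": -3.9900}

# Odds ratio for each variable: exp(coef).
odds_ratios = {name: math.exp(value) for name, value in coef.items()}
for name in coef:
    print("%-12s %.4f" % (name, odds_ratios[name]))
```

For example, the value of about 2.23 for gpa means that each additional GPA point roughly doubles the odds of admission, while values below 1 (the prestige dummies) reduce the odds relative to prestige_1.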


Tools/Keras

Installing Keras (with Theano Backend)

2016. 11. 9. 00:36

Installing Keras with the Theano Backend

This is how to install Keras with Theano as the backend.

As listed below, install gcc, then the libraries that let Python use it,

and then install Theano and Keras in that order.

Install TDM GCC x64.
Install Anaconda x64.
Open the Anaconda prompt
Run conda update conda
Run conda update --all
Run conda install mingw libpython
Install the latest version of Theano, pip install git+git://github.com/Theano/Theano.git
Run pip install git+git://github.com/fchollet/keras.git

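The installation is this involved because Keras sits on top of Theano, and Theano in turn generates and compiles C/C++ code for its computations.

If you use Keras without gcc installed, Theano falls back to its Python implementations as the backend,

and this is markedly slower than when gcc is installed.

So when using Keras with Theano, you should make sure gcc is installed.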
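Before installing Theano, it may help to confirm that a compiler is actually visible on the PATH. The following is a small Python 3 sketch using only the standard library; find_compiler is a hypothetical helper, and the assumption that looking for g++ or gcc on the PATH is a sufficient check is mine, not something stated in the installation steps.

```python
import shutil

def find_compiler(names=("g++", "gcc")):
    """Return the path of the first compiler found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None

compiler = find_compiler()
if compiler:
    print("C/C++ compiler found at:", compiler)
else:
    print("No gcc/g++ on PATH; Theano would likely fall back to slower Python code.")
```

On Windows this should be run from the same prompt (for example the Anaconda prompt) that will later be used to run Keras, since the PATH can differ between shells.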

Source

http://stackoverflow.com/questions/34097988/how-do-i-install-keras-and-theano-in-anaconda-python-2-7-on-windows


Tools/Keras

Getting Started with Deep Neural Networks in Keras

2016. 11. 7. 00:57

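Keras is a deep learning framework that runs on top of Theano or Tensorflow.

It seems to make building neural network models easier than using Tensorflow directly.

I found some sites that are useful for trying Keras for the first time, so I am sharing them here.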

Tutorial List

Fully Connected Network

- Binary Classification

http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

http://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/

- Multi-class Classification

http://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

- Regression

http://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/

Drop Out

http://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

Cross Validation

http://machinelearningmastery.com/evaluate-performance-deep-learning-models-keras/

Applied Deep Learning in Python Mini-Course

http://machinelearningmastery.com/applied-deep-learning-in-python-mini-course/

MNIST - MLP + CNN

http://machinelearningmastery.com/handwritten-digit-recognition-using-convolutional-neural-networks-python-keras/

CIFAR-10 CNN

http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/
