Intermediate R - Data Structures in Depth (Data structure)

2019. 4. 9. 09:36

Data structures in R

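Programs need to store data of various types in variables.
Memory is allocated for a variable according to the type of the data it holds.
R's basic data type (base object) is the vector (atomic vector and list), and the following widely used data structures are built on top of it.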
- Atomic vector
- List
- Matrix
- Array
- Factors
- Data frame

Every data type starts from the vector

Matrix and Array are base objects built on atomic vectors.
Factor and Data frame are S3 classes built on an atomic vector and a list, respectively.

Two types of Vector

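Vectors come in two kinds: atomic vectors and lists.
The elements of an atomic vector are homogeneous (all of the same type), whereas the elements of a list can be heterogeneous.
An atomic vector is always a flat, one-dimensional structure, whereas a list can be nested (a list can contain other lists).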
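
The distinction can be checked directly with base R predicates. The following is a minimal sketch; the values `av` and `li` are made up for illustration.

```r
# Hypothetical example values (not from the text above)
av <- c(1.5, 2, 3)            # atomic vector: every element is a double
li <- list(1.5, "b", TRUE)    # list: each element keeps its own type

typeof(av)                    # "double"
typeof(li)                    # "list"
is.atomic(av)                 # TRUE
is.list(li)                   # TRUE

# The element types inside the list differ from one another
elem_types <- sapply(li, typeof)
elem_types                    # "double" "character" "logical"
```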

Atomic vector

There are four common types of atomic vector: logical, integer, double, and character.
raw and complex also exist but are rarely used.

lgl_var <- c(TRUE, FALSE)
int_var <- c(1L, 6L, 10L)
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")
str(lgl_var)

##  logi [1:2] TRUE FALSE

str(int_var)

##  int [1:3] 1 6 10

str(dbl_var)

##  num [1:3] 1 2.5 4.5

str(chr_var)

##  chr [1:2] "these are" "some strings"

Coercion

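If values of different types are put into one atomic vector, they are forcibly converted to a single common type (coercion).
The coercion order, from least to most flexible, is logical, integer, double, character.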

c(1, "a")

## [1] "1" "a"

c(TRUE, 1)

## [1] 1 1

c(1, 1.1)

## [1] 1.0 1.1

c(3.0, TRUE)

## [1] 3 1

c(c(1,2,3), 1)

## [1] 1 2 3 1

c(list(c(1,2,3)), "a")

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "a"
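
To see which type wins in each case, `typeof` can be applied to the combined result. This is a small sketch that uses only base R; the particular values are arbitrary.

```r
# The resulting type is always the most flexible one among the inputs
t1 <- typeof(c(TRUE, 1L))       # "integer"
t2 <- typeof(c(1L, 1.5))        # "double"
t3 <- typeof(c(1.5, "a"))       # "character"
t4 <- typeof(c(list(1), "a"))   # "list" (a list absorbs everything)

# Logical values count as 0/1 in a numeric context
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2

# Explicit coercion with the as.* family
ints <- as.integer(c("1", "2"))       # 1 2
```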

List

An element of a list can be of any type, including another list.

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9

x <- list(list(list(list())))
str(x)

## List of 1
##  $ :List of 1
##   ..$ :List of 1
##   .. ..$ : list()
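
Because list elements can have different types, it helps to distinguish the two ways of extracting them. As a brief sketch (the variable names `sub` and `el` are only illustrative): `[` returns a shorter list, while `[[` returns the element itself.

```r
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))

sub <- x[1]     # still a list, of length 1
el  <- x[[1]]   # the integer vector stored in the first slot

typeof(sub)     # "list"
typeof(el)      # "integer"
length(el)      # 3

# [[ can be chained with ordinary vector indexing
third_second <- x[[3]][2]   # FALSE
```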

Exercises (Vector, list)

What are the six types of atomic vector, and how do they differ from a list?

logical, integer, double, character, complex, raw.
homogeneous vs heterogeneous
1d vs nested structure

What is the result of the code below?

c(1, FALSE)

## [1] 1 0

c("a", 1)

## [1] "a" "1"

c(list(1), "a")

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"

c(TRUE, 1L)

## [1] 1 1

What does unlist, which is used to convert a list into a vector, do?

It applies coercion, removes the nested structure, and converts the list into a 1d structure.

a <- list(c(1,2,3,4,5), c(1,2,3))
a

## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] 1 2 3

unlist(a)

## [1] 1 2 3 4 5 1 2 3
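
The coercion step becomes visible when the list is nested and mixes types. The following sketch uses a made-up list named `nested`.

```r
# A nested list holding an integer, a double and a character value
nested <- list(1L, list(2.5, list("a")))

flat <- unlist(nested)
flat            # "1" "2.5" "a"
typeof(flat)    # "character": everything is coerced to the most flexible type
length(flat)    # 3: the nesting is gone
```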

Attributes

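Every object can have attributes.
The attr function sets or retrieves an individual attribute.
Attributes matter because:
- Attributes are sometimes used to define the structure of the data.
- Many functions change their behavior depending on attributes.
- Attributes are one way of implementing objects in R.
- Knowing them can suggest new solution strategies and shorten the time needed to solve a problem.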

How to create attributes

The attr function

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

## [1] "This is a vector"

The structure function
- Returns a new object with the attribute added.

structure(1:10, my_attribute = "This is a vector")

##  [1]  1  2  3  4  5  6  7  8  9 10
## attr(,"my_attribute")
## [1] "This is a vector"

Retrieving attributes
The attributes function returns all attributes as a list.

attributes(y)

## $my_attribute
## [1] "This is a vector"

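Three important attributes

The use of the attributes below is fixed by convention in R's base functions.
- Names : a character vector that assigns a name to each element.
- Dimensions : used by matrices and arrays.
- Class : used by the S3 object system.
Most attributes are lost when a vector is modified, but these three are generally preserved.
Many functions rely on these attributes to implement their behavior.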
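
One way to observe the difference between ordinary attributes and the special ones is to subset a vector. This is a minimal sketch: the vector `v` and the attribute name `my_attribute` are illustrative, and subsetting is only one of several operations that drop custom attributes.

```r
v <- structure(1:3, names = c("a", "b", "c"), my_attribute = "note")
names(attributes(v))      # both "names" and "my_attribute" are present

s <- v[1:2]               # subsetting keeps names ...
names(s)                  # "a" "b"
attr(s, "my_attribute")   # ... but the custom attribute is gone (NULL)
```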

Names

A names attribute can be created in two ways: 1) with the names function, or 2) by specifying names when the vector is created.
The names function sets or retrieves the names of an object.
Using the names function

v <- c(1, 2, 3)
names(v) <- c('a')
names(v)

## [1] "a" NA  NA

Specifying names when creating the vector

y <- c(a = 1, 2, 3) 
names(y)

## [1] "a" ""  ""

Names can also be retrieved through the attr function

attr(y, "names")

## [1] "a" ""  ""
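
Names are mostly useful for subsetting. The sketch below uses a hypothetical vector `scores` to show selecting by name and removing the names again.

```r
scores <- c(math = 90, art = 75, music = 82)

one <- scores["art"]      # keeps the name: art 75
val <- scores[["art"]]    # drops the name: 75

bare <- unname(scores)    # a copy without the names attribute
names(bare)               # NULL
```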

Factors

A factor is an S3 object built on top of an integer vector.
A factor can be viewed as a vector that may contain only predefined values.
Factors are used to store categorical variables.
A factor has two important attributes:
- class : marks the object as a factor, so that it behaves differently from a plain integer vector.
- levels : defines the set of predefined (allowed) values.

library(pryr)
x <- factor(c("a", "b", "b", "a"))
x

## [1] a b b a
## Levels: a b

typeof(x) # the base type of a factor is integer.

## [1] "integer"

otype(x)  # a factor is an S3 object.

## [1] "S3"

class(x)

## [1] "factor"

str(x)

##  Factor w/ 2 levels "a","b": 1 2 2 1

# getAnywhere(str.default) # str() on a factor is handled by the default method, which checks is.factor in an if statement.

levels(x) # levels() returns the levels attribute of the object.

## [1] "a" "b"

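A factor is an extended version of an integer vector.

# In R versions before 4.1.0, combining factors with c() coerces them to an integer vector (the output below).
# Since R 4.1.0, c() on factors returns a factor whose levels are the union of the input levels.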
c(factor(c("a", "b", "c")), factor("b"))

## [1] 1 2 3 1
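
The integer codes behind a factor can also be inspected explicitly, and this does not depend on the R version. The following sketch shows the conversions and a common pitfall with numeric-looking levels; the factor `f` is a made-up example.

```r
x <- factor(c("a", "b", "b", "a"))

codes  <- as.integer(x)      # 1 2 2 1 (positions in levels(x))
labels <- as.character(x)    # "a" "b" "b" "a"
same   <- levels(x)[codes]   # looking the codes up reproduces the labels

# Pitfall: converting a factor of number-like strings returns the codes,
# not the numbers; go through as.character first
f <- factor(c("10", "20", "10"))
wrong <- as.integer(f)                  # 1 2 1
right <- as.numeric(as.character(f))    # 10 20 10
```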

Because the values of a categorical variable are limited, a factor lets you define the levels in advance.

sex_char <- c("m", "m", "m") 
sex_factor <- factor(sex_char, levels = c("m", "f")) 
table(sex_char)

## sex_char
## m 
## 3
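
For comparison, tabulating the factor version also reports levels that never occur. The sketch below repeats the two definitions so that it runs on its own, and adds an assignment of an undefined value as an illustration.

```r
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

counts <- table(sex_factor)
counts                     # m: 3, f: 0 (the unused level is still counted)

# Assigning a value that is not a level produces NA (R also emits a warning)
suppressWarnings(sex_factor[1] <- "x")
is.na(sex_factor[1])       # TRUE
```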

Exercise (Attributes, Names, Factor)

Create a vector with a "test" attribute

a <- 1:5 
attr(a, "test") <- "my attribute" 
a

## [1] 1 2 3 4 5
## attr(,"test")
## [1] "my attribute"

a <- structure(1:5, test = "my attribute") 
a

## [1] 1 2 3 4 5
## attr(,"test")
## [1] "my attribute"

Create a "comment" attribute and print the vector

b <- structure(1:5, comment = "my attribute")
b # print.default does not display the comment attribute

## [1] 1 2 3 4 5

# ?attributes 
# Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.

Matrices and arrays

Adding a dim attribute to an atomic vector turns it into an array.
A matrix is a special case of an array: a 2-dimensional array.
In other words, an array is an atomic vector with dimensions assigned to it.

# Matrix 

# Create a 2x3 matrix
a <- matrix(1:6, ncol = 3, nrow = 2)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Create a 2x3x2 array
b <- array(1:12, c(2, 3, 2)) 
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

# An array can also be created by adding a dim attribute.
d <- 1:6
dim(d) <- c(3, 2)
d

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

e <- 1:6
attr(e, "dim") <- c(3,2)
e

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
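
Since a matrix is just a vector with a dim attribute, the same element can be reached by row/column position or by its position in the underlying vector, which is filled column by column. A minimal sketch (the names `m` and `mr` are illustrative):

```r
m <- 1:6
dim(m) <- c(3, 2)

by_pos <- m[2, 2]   # row 2, column 2 -> 5
by_idx <- m[5]      # same element: (2 - 1) * nrow(m) + 2 = 5

nrow(m)             # 3
ncol(m)             # 2

# matrix() fills column-wise by default; byrow = TRUE fills row-wise
mr <- matrix(1:6, nrow = 3, byrow = TRUE)
mr[2, 2]            # 4
```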

Properties of matrix and array

library(pryr)

# The base type of a matrix or array is the type of its underlying atomic vector (integer here).
typeof(a)

## [1] "integer"

typeof(b)

## [1] "integer"

typeof(d)

## [1] "integer"

# matrix and array are base objects.
otype(a)

## [1] "base"

otype(b)

## [1] "base"

otype(d)

## [1] "base"

# A matrix or array is an atomic vector with a dim attribute added.
attributes(a)

## $dim
## [1] 2 3

attributes(b)

## $dim
## [1] 2 3 2

attributes(d)

## $dim
## [1] 3 2

Character arrays can also be created

a <- array(c("a", "b", "c", "d", "e", "f"), c(2,3))
a

##      [,1] [,2] [,3]
## [1,] "a"  "c"  "e" 
## [2,] "b"  "d"  "f"

rownames and colnames set the dimnames attribute of a matrix

rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a

##   a   b   c  
## A "a" "c" "e"
## B "b" "d" "f"

attributes(a)

## $dim
## [1] 2 3
## 
## $dimnames
## $dimnames[[1]]
## [1] "A" "B"
## 
## $dimnames[[2]]
## [1] "a" "b" "c"

length returns the total number of elements.
Because an array is built on an atomic vector, it reports the number of elements of that vector.

length(a)

## [1] 6

dimnames sets names along each dimension of an array

dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b

## , , A
## 
##     a b c
## one 1 3 5
## two 2 4 6
## 
## , , B
## 
##     a  b  c
## one 7  9 11
## two 8 10 12
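
Once dimnames are set, elements and slices can be selected by name instead of by position. The sketch below rebuilds the same array so that it runs on its own.

```r
b <- array(1:12, c(2, 3, 2))
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))

single <- b["one", "b", "B"]   # 9
column <- b[, "a", "A"]        # named vector: one = 1, two = 2
slice  <- b["one", , ]         # 3 x 2 matrix (rows a, b, c; columns A, B)
dim(slice)                     # 3 2
```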

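Dataframe

The data frame is the most commonly used data structure for data analysis in R.
A data frame is an S3 class built on top of a list.
- 2-dimensional structure
- Typically, each element (column) is a vector, and all columns have equal length
A data frame shares the properties of both a matrix and a list.

Creating a data frame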

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df

##   x y
## 1 1 a
## 2 2 b
## 3 3 c

Properties of a data frame

library(pryr)
otype(df) # a data frame is an S3 class built on top of a list.

## [1] "S3"

typeof(df) # the base type of a data frame is list.

## [1] "list"

class(df) # the S3 class name of a data frame is data.frame.

## [1] "data.frame"

dim(df) # a data frame has matrix-like properties.

## [1] 3 2
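
The dual nature of a data frame can be seen by applying list-style and matrix-style operations to the same object. A short sketch, with `stringsAsFactors = FALSE` set explicitly so that the result does not depend on the R version:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# List-like behavior: the elements are the columns
length(df)        # 2
names(df)         # "x" "y"
col_x <- df[["x"]]   # 1 2 3

# Matrix-like behavior: rows and columns can be indexed together
nrow(df)          # 3
ncol(df)          # 2
cell <- df[2, "y"]   # "b"
```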

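In R versions before 4.0.0, data.frame automatically converts character vectors to factors; since R 4.0.0 the default is stringsAsFactors = FALSE.

# To disable the conversion explicitly, set stringsAsFactors = FALSE.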
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
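
Setting the argument explicitly in both directions gives the same result on any R version. The names `df_chr` and `df_fct` below are illustrative.

```r
df_chr <- data.frame(y = c("a", "b", "c"), stringsAsFactors = FALSE)
df_fct <- data.frame(y = c("a", "b", "c"), stringsAsFactors = TRUE)

class(df_chr$y)    # "character"
class(df_fct$y)    # "factor"
levels(df_fct$y)   # "a" "b" "c"
```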

Combining data frames

cbind(df, data.frame(z = 3:1))

##   x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1

rbind(df, data.frame(x = 10, y = "z"))

##    x y
## 1  1 a
## 2  2 b
## 3  3 c
## 4 10 z
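
When rows are combined, rbind matches data frame columns by name rather than by position. A small sketch with made-up data frames `top` and `bottom`:

```r
top    <- data.frame(x = 1, y = "a", stringsAsFactors = FALSE)
bottom <- data.frame(y = "b", x = 2, stringsAsFactors = FALSE)  # columns in a different order

combined <- rbind(top, bottom)
names(combined)   # "x" "y" (the order of the first data frame)
combined$x        # 1 2
combined$y        # "a" "b"

# Column names that do not match cause an error
res <- try(rbind(top, data.frame(x = 3, z = "c")), silent = TRUE)
inherits(res, "try-error")   # TRUE
```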

Special columns

A data frame column can itself be a list. This is rarely used in practice, but it is conceptually possible.

df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df

##   x          y
## 1 1       1, 2
## 2 2    1, 2, 3
## 3 3 1, 2, 3, 4

To pass a list as a column when creating a data frame, wrap it in the I function.

dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
dfl

##   x          y
## 1 1       1, 2
## 2 2    1, 2, 3
## 3 3 1, 2, 3, 4

dfl$y[[1]]

## [1] 1 2

A matrix can also be added as a column of a data frame.

dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
dfm

##   x y.1 y.2 y.3
## 1 1   1   4   7
## 2 2   2   5   8
## 3 3   3   6   9

dfm[2, "y"]

##      [,1] [,2] [,3]
## [1,]    2    5    8

Exercise (Data frame)

What attributes does a data frame have?

df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
attributes(df)

## $names
## [1] "x" "y"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3
