Matched Data and Conditional Logistic Regression


2018. 8. 13. 17:40

An Example of Matched Data

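Matching is a technique, often used when assembling a dataset, that makes the distribution of confounding variables the same in cases and controls. It is known as a way to control confounding. In this post we take a very brief look at conditional logistic regression, which is very commonly used to analyze matched data. First, load the dataset we will use in R. The outcome variable is d, a binary variable, so in principle the association could be analyzed with logistic regression.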

library(Epi)

data(bdendo)

	set	d	gall	hyp	ob	est	dur	non	duration	age	cest	agegrp	age3
1	1.00	1.00	No	No	Yes	Yes	4	Yes	96.00	74.00	3	70-74	65-74
2	1.00	0.00	No	No		No	0	No	0.00	75.00	0	70-74	65-74
3	1.00	0.00	No	No		No	0	No	0.00	74.00	0	70-74	65-74
4	1.00	0.00	No	No		No	0	No	0.00	74.00	0	70-74	65-74
5	1.00	0.00	No	No	Yes	Yes	3	Yes	48.00	75.00	1	70-74	65-74
6	2.00	1.00	No	No	No	Yes	4	Yes	96.00	67.00	3	65-69	65-74
7	2.00	0.00	No	No	No	Yes	1	No	5.00	67.00	3	65-69	65-74
8	2.00	0.00	No	Yes	Yes	No	0	Yes	0.00	67.00	0	65-69	65-74
9	2.00	0.00	No	No	No	Yes	3	No	53.00	67.00	2	65-69	65-74
10	2.00	0.00	No	No	No	Yes	2	Yes	45.00	68.00	2	65-69	65-74
11	3.00	1.00	No	Yes	Yes	Yes	1	Yes	9.00	76.00	1	75-79	75+
12	3.00	0.00	No	Yes	Yes	Yes	4	Yes	96.00	76.00	2	75-79	75+
13	3.00	0.00	No	Yes	No	Yes	1	Yes	3.00	76.00	1	75-79	75+
14	3.00	0.00	No	Yes	Yes	Yes	2	Yes	15.00	76.00	2	75-79	75+
15	3.00	0.00	No	No	No	Yes	2	Yes	36.00	77.00	1	75-79	75+

A description of the dataset is available by running help(bdendo); the relevant part is reproduced below for reference.

Format

This data frame contains the following columns:

set: Case-control set: a numeric vector

d: Case or control: a numeric vector (1=case, 0=control)

gall: Gall bladder disease: a factor with levels No Yes.

hyp: Hypertension: a factor with levels No Yes.

ob: Obesity: a factor with levels No Yes.

est: A factor with levels No Yes.

dur: Duration of conjugated oestrogen therapy: an ordered factor with levels 0 < 1 < 2 < 3 < 4.

non: Use of non oestrogen drugs: a factor with levels No Yes.

duration: Months of oestrogen therapy: a numeric vector.

age: A numeric vector.

cest: Conjugated oestrogen dose: an ordered factor with levels 0 < 1 < 2 < 3.

agegrp: A factor with levels 55-59 60-64 65-69 70-74 75-79 80-84

age3: a factor with levels <64 65-74 75+
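The column types listed above can be checked directly in R. The following is a small sketch, assuming the Epi package is installed; it confirms that dur and cest are ordered factors, which matters because R encodes ordered factors with polynomial contrasts by default, and that is where coefficient labels such as .L, .Q and .C come from.

```r
library(Epi)
data(bdendo)

# Overview of column types
str(bdendo)

# dur and cest are documented as ordered factors
is.ordered(bdendo$dur)
is.ordered(bdendo$cest)

# Default contrasts R uses for a 4-level ordered factor such as cest:
# the columns are labelled .L (linear), .Q (quadratic) and .C (cubic)
contr.poly(4)
```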

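However, fitting an ordinary (unconditional) logistic regression to matched data produces biased estimates, because the matching itself imposes structure on the sampled data. For example, with time matching, observations from a similar period are sampled together to form one stratum, and the dataset is built from such strata. The period then influences the estimates unless it is accounted for. We should therefore use conditional logistic regression, a variant of logistic regression that conditions on the matched set. (The mathematical details are omitted here.)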
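Although the derivation is skipped in this post, the quantity being maximized is short enough to write down. The following is the standard conditional likelihood for the case of exactly one case per matched set; the notation (S sets, x_{s0} for the covariates of the case in set s, R_s for all members of set s including the case) is introduced here for illustration and does not come from the Epi documentation.

```latex
L(\beta) \;=\; \prod_{s=1}^{S}
  \frac{\exp\!\left(x_{s0}^{\top}\beta\right)}
       {\sum_{j \in R_s} \exp\!\left(x_{sj}^{\top}\beta\right)}
```

Each factor compares the case only with the members of its own set, so any set-specific intercept (for example, the effect of the matched period) cancels from the numerator and denominator and never has to be estimated.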

## Analysis

res.clogistic <- clogistic(d ~ cest + dur, strata = set, data = bdendo)

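Running a conditional logistic regression (CLR) in R is simple: use the clogistic function from the Epi package. Put the covariates you want to analyze after the ~, and pass the variable that identifies the matched set to strata. In this dataset, the variable set holds the stratum information. As the rows shown earlier suggest, each stratum consists of 5 rows: one case and four controls.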
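The structure of the matched sets can be verified before fitting. This is a minimal sketch, assuming the Epi package is installed; the names set_sizes and cases_per_set are introduced here only for illustration.

```r
library(Epi)
data(bdendo)

# Number of rows in each matched set, then the distribution of those sizes
set_sizes <- table(bdendo$set)
table(set_sizes)

# Number of cases in each matched set (d is coded 1 = case, 0 = control)
cases_per_set <- tapply(bdendo$d, bdendo$set, sum)
table(cases_per_set)
```

If every set has the same size and exactly one case, the two summary tables each contain a single entry.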

Printing the fitted object res.clogistic gives the following output.

res.clogistic

Call:
clogistic(formula = d ~ cest + dur, strata = set, data = bdendo)

         coef exp(coef) se(coef)      z    p
cest.L  0.240     1.271    2.276  0.105 0.92
cest.Q  0.890     2.435    1.812  0.491 0.62
cest.C  0.113     1.120    0.891  0.127 0.90
dur.L   1.965     7.134    2.222  0.884 0.38
dur.Q  -0.716     0.489    1.858 -0.385 0.70
dur.C   0.136     1.146    1.168  0.117 0.91
dur^4      NA        NA    0.000     NA   NA

Likelihood ratio test=35.3 on 6 df, p=3.8e-06, n=254

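The output shows the estimates for cest and dur, the variables whose association with d we wanted to examine. Because both are ordered factors, R fits them with polynomial contrasts, which is why the rows are labelled .L (linear), .Q (quadratic), .C (cubic) and ^4. The NA in the dur^4 row means that this term could not be estimated.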
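As a cross-check, the same model can be fitted with clogit from the survival package, and compared with an ordinary logistic regression that ignores the matching. This is a sketch rather than part of the original analysis: it assumes both Epi and survival are installed, and the object names fit_cond, fit_uncond and comparison are hypothetical. The conditional estimates should be essentially the same as those from clogistic, while the unconditional ones may differ.

```r
library(Epi)       # provides the bdendo data
library(survival)  # provides clogit() and strata()

data(bdendo)

# Conditional fit: the matched set enters the formula through strata()
fit_cond <- clogit(d ~ cest + dur + strata(set), data = bdendo)

# Ordinary (unconditional) logistic regression that ignores the matching
fit_uncond <- glm(d ~ cest + dur, family = binomial, data = bdendo)

# Log odds ratios side by side; the glm intercept is dropped because
# the conditional model has no intercept
comparison <- cbind(
  conditional   = coef(fit_cond),
  unconditional = coef(fit_uncond)[-1]
)
round(comparison, 3)

# Odds ratios with 95% Wald confidence intervals from the conditional fit
round(exp(cbind(OR = coef(fit_cond), confint(fit_cond))), 3)
```

Terms that cannot be estimated appear as NA in both fits, just as in the clogistic output.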


Deepplay interested in data analytics and ML modeling
