Stata: psestimate, selecting matching variables for propensity score matching (PSM)

對對子不錯 2019-06-03


Author: Ding Hai (丁海), Huazhong University of Science and Technology

Propensity score matching (PSM) has been applied in many fields. Although PSM cannot fully resolve endogeneity, it can substantially reduce the bias caused by self-selection. In the earlier literature, Becker & Ichino (2002, Stata Journal, 2(4): 358-377) describe the PSM procedure in detail, and Stata offers several commands for PSM analysis, such as pscore, psmatch2, treatrew (Stata Journal, 14(3): 541-561), gpscore (Stata Journal, 8(3): 354-373), and kmatch.

net describe st0328, from(http://www./software/sj14-1)

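The balancing assumption

In PSM, the treatment variable is regressed on the control variables with a logit model to obtain each unit's propensity score. The control-group unit whose propensity score is closest to that of a treated unit becomes its matched counterpart. Matching in this way reduces the systematic differences between treated and control units as far as possible and thereby reduces estimation bias. Before running any further estimation on the matched sample, such as PSM-DID, the balancing assumption for the covariates must also be tested: after matching, the covariate means should no longer differ significantly between the treated and control groups. If no significant differences remain, further model estimation is supported.

Before the balance test, we first run the matching with psmatch2. The treatment variable is train, the covariates are age, educ, and black, the outcome variable is re78, and we use one-to-one nearest-neighbor matching: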

 use ldw_exper.dta,clear
 psmatch2 train age educ black, out(re78) logit ate neighbor(1) common caliper(.05) ties
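The first stage that psmatch2 performs can be sketched by hand as a logit followed by a prediction. This is a minimal illustration rather than psmatch2's internal code; ps_manual is a hypothetical variable name, and the snippet assumes it is run right after the psmatch2 call above, which stores its own score in _pscore.

```stata
* Run after the psmatch2 call above
* Logit of the treatment dummy on the matching covariates
logit train age educ black
* Predicted probability of treatment = propensity score (ps_manual is a hypothetical name)
predict double ps_manual, pr
* Compare with the score stored by psmatch2
summarize ps_manual _pscore
```

If the two variables have the same summary statistics, the matching is indeed based on this logit specification.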

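After matching, we need to check whether the matched sample satisfies the balancing assumption, that is, whether the matching covariates no longer differ significantly between the treated and control groups. The pstest command performs this check: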

pstest age educ black hisp married , t(train)

The balance test results are as follows:

------------------------------------------------------------------------------
                        |       Mean               |     t-test    |  V(T)/
Variable                | Treated Control    %bias |    t    p>|t| |  V(C)
------------------------+--------------------------+---------------+----------
age                     | 25.527   24.714     11.4 |   1.19  0.234 |  1.24
educ                    | 10.291   10.401     -6.0 |  -0.59  0.557 |  1.60*
black                   | .84066   .87363     -8.9 |  -0.90  0.370 |     .
hisp                    | .06044   .09066    -10.9 |  -1.09  0.277 |     .
married                 | .18681    .1522      9.2 |   0.88  0.380 |     .
------------------------------------------------------------------------------

According to the t-tests, none of the five covariates differs significantly between the treated and control groups (the smallest p-value is 0.234).
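The table above appears to report the matched sample only. As a hedged sketch, pstest's both option can be added to show the unmatched and matched samples side by side, which makes it easier to see how much matching improved balance. It assumes the psmatch2 call above has just been run in the same session.

```stata
* Run directly after the psmatch2 call above
* both: report balance before (unmatched) and after (matched) matching
pstest age educ black hisp married, t(train) both
```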

So how should the matching covariates be chosen before a PSM analysis so that the propensity score model fits as well as possible? The psestimate command introduced here compares the likelihoods of alternative models to help select the first-order (linear) and second-order (squared and interaction) covariate terms that give the best fit.

The psestimate command implements the algorithm proposed by Imbens and Rubin (2015) for estimating the propensity score. Its main purpose is to select the linear and quadratic functions of the covariates to include in the propensity score model.

1. Installing the command and loading the example data

Run the first line below in the Stata command window to download psestimate, the second line to download the example dataset nswre74.dta (LaLonde, 1986), and the third line to load the data.

ssc install psestimate, replace    // install the command
net get psestimate                 // download the command's ancillary data to the current working directory
use "nswre74.dta", clear           // load the example data

2. Syntax

The syntax of the command is:

 psestimate depvar [indepvars] [if] [in] [, options]
 options:
      totry(indepvars)     
      notry(varlist)       
      nolin               
      noquad              
      clinear(real)       
      cquadratic(real)    
      iterate(#)          
      genpscore(newvar)   
      genlor(newvar)

The main arguments and options are:

depvar: required; the treatment variable (e.g. treat), a dummy indicating whether a unit received the treatment
indepvars: optional; covariates that are always included in the base model
totry(indepvars): optional; the list of candidate covariates to choose from, default is all
notry(varlist): optional; covariates to exclude from the candidates, default is none
nolin: optional; skip the selection of first-order terms
noquad: optional; skip the selection of second-order terms
clinear(real): optional; threshold for the likelihood ratio test of first-order terms, default is 1
cquadratic(real): optional; threshold for the likelihood ratio test of second-order terms, default is 2.71
iterate(#): optional; maximum number of iterations in each logit estimation, default is 16000
genpscore(newvar): optional; name of a new variable that stores the estimated propensity score
genlor(newvar): optional; name of a new variable that stores the log odds ratio
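As a hedged illustration of the two threshold options, the sketch below makes the selection stricter than the defaults. It assumes the likelihood ratio statistic of a single added term is compared against a chi-squared distribution with one degree of freedom, so 2.71 and 3.84 correspond roughly to the 10% and 5% critical values. The specific thresholds are an illustrative choice, not a recommendation from the command's author.

```stata
use "nswre74.dta", clear
* Critical values of chi-squared(1) at the 10% and 5% levels, for reference
display invchi2tail(1, 0.10)
display invchi2tail(1, 0.05)
* Stricter selection: a linear term needs LR >= 2.71, a quadratic term LR >= 3.84
psestimate treat ed, totry(age black hisp nodeg) clinear(2.71) cquadratic(3.84)
```

Higher thresholds generally admit fewer terms, which gives a more parsimonious propensity score model at the cost of a possibly worse fit.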

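3. Using the command

3.1 Basic usage

Using the dataset nswre74.dta provided by the author of psestimate, this section briefly shows how to use the command to select the first- and second-order covariate terms that best fit the treatment variable (treat).

Here we fix the education variable ed as a covariate of the base model, so it is always included. We then specify age, black, hisp, and nodeg as the candidate covariates. The code is: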

use "nswre74.dta", clear
psestimate treat ed, totry(age black hisp nodeg)

The output is:

Selecting first order covariates... (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
...s..s..
Selected first order covariates are: nodeg hisp
Selecting second order covariates... (21)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....s.....
Selected second order covariates are: c.nodeg#c.ed
Final model is: ed nodeg hisp c.nodeg#c.ed

From this output, the first-order covariates to use in the propensity score model are nodeg and hisp, and the second-order term is c.nodeg#c.ed. Together with the base covariate ed, psestimate therefore suggests four terms for the matching: ed, nodeg, hisp, and c.nodeg#c.ed.
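One way to check what the final model means in practice is to let psestimate store its propensity score and compare it with a logit fitted by hand on the reported final model. This is a sketch under the assumption that nswre74.dta is in the working directory; ps_auto and ps_manual are hypothetical variable names.

```stata
use "nswre74.dta", clear
* Let psestimate store the fitted propensity score (ps_auto is a hypothetical name)
psestimate treat ed, totry(age black hisp nodeg) genpscore(ps_auto)
* Refit the reported final model by hand and predict the score
logit treat ed nodeg hisp c.nodeg#c.ed
predict double ps_manual, pr
* If the final model is as reported, the two scores should agree closely
summarize ps_auto ps_manual
correlate ps_auto ps_manual
```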

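3.2 Speeding up the computation

psestimate can take a long time to run, and selecting the first-order terms is usually faster than selecting the second-order terms. To speed things up, we can first add the noquad option so that only the first-order terms are screened. Once they are chosen, we put them into the base model as explanatory variables and add the nolin option to skip the first-order step and screen only the second-order terms. The steps are as follows.

First, add the noquad option to screen only the first-order terms: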

use "nswre74.dta", clear
psestimate treat ed, totry(age black hisp nodeg) noquad

The first-order selection result is:

Selecting first order covariates... (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
...s..s..
Selected first order covariates are: nodeg hisp
Final model is: ed nodeg hisp

Then put the selected ed, nodeg, and hisp into the base model as explanatory variables and add the nolin option so that only the second-order terms are screened:

psestimate treat ed nodeg hisp , totry(age black hisp nodeg) nolin

The second-order selection result is:

Selecting second order covariates... (21)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....s.....
Selected second order covariates are: c.nodeg#c.ed
Final model is: ed nodeg hisp  c.nodeg#c.ed
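Whether the two-step approach actually saves time on a given dataset can be checked with Stata's timer command. The sketch below is only a way to measure the two routes; it hard-codes nodeg and hisp in the second step because those were the first-order terms selected above, and it makes no claim about which route will be faster.

```stata
use "nswre74.dta", clear
timer clear
* Route 1: one call that selects first- and second-order terms
timer on 1
psestimate treat ed, totry(age black hisp nodeg)
timer off 1
* Route 2: first-order terms only, then second-order terms only
timer on 2
psestimate treat ed, totry(age black hisp nodeg) noquad
psestimate treat ed nodeg hisp, totry(age black hisp nodeg) nolin
timer off 2
* Elapsed seconds for each route
timer list
```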

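4. The core idea of psestimate

4.1 Selecting the first-order terms

In the first step, the program starts from the base model (logit treat ed) and loops over the four variables given in totry(), namely age, black, hisp, and nodeg, adding one at a time. This gives four model fits: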

logit treat ed age
logit treat ed black
logit treat ed hisp
logit treat ed nodeg

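After each fit, the program compares the new model with the base model through a likelihood ratio test. The covariate from the model with the largest log-likelihood (LL) among the four is added to the base model, unless none of the four likelihood ratio statistics reaches the threshold set in clinear(real), in which case no further covariate is added. Suppose nodeg is chosen here. The base model then becomes logit treat ed nodeg, and in the second step the program estimates the following models: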

logit treat ed nodeg age
logit treat ed nodeg black
logit treat ed nodeg hisp

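The selection rule in this step is the same as in the first. In general, with k candidate covariates, the program fits at most k(k+1)/2 logit models when selecting the first-order terms. With 4 candidates in this example, that is 4 × 5 / 2 = 10 fits (the number shown in parentheses below). The first-order terms selected by the command are: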

Selecting first order covariates... (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
...s..s..
Selected first order covariates are: nodeg hisp
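The first round of this search can be sketched by hand. The loop below assumes that the quantity compared with clinear() is the usual likelihood ratio statistic 2(LL_candidate - LL_base); it is an illustration of the idea, not the source code of psestimate, and ll0 is a hypothetical scalar name.

```stata
use "nswre74.dta", clear
* Base model: log-likelihood with ed only
quietly logit treat ed
scalar ll0 = e(ll)
* Add each candidate in turn and report its likelihood ratio statistic
foreach v of varlist age black hisp nodeg {
    quietly logit treat ed `v'
    display "`v'" _col(10) "LR = " %7.3f 2*(e(ll) - ll0)
}
```

The candidate with the largest statistic would enter the model if that statistic is at least the clinear() threshold (1 by default), and the loop would then be repeated with the enlarged base model and the remaining candidates.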

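4.2 Selecting the second-order terms

Second-order terms come in two kinds: squares of covariates and pairwise interactions between covariates.
If only one covariate a was selected at first order, the only second-order term to test is a^2. If two first-order covariates a and b were selected, three second-order terms must be tested: a^2, b^2, and ab. In this example the first-order covariates are ed, nodeg, and hisp, so there are six second-order candidates: ed^2, nodeg^2, hisp^2, c.ed#c.nodeg, c.ed#c.hisp, and c.nodeg#c.hisp. They are screened in the same way as the first-order terms, so at most 6 × 7 / 2 = 21 models are fitted (the number shown in parentheses below). The result is: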

Selecting second order covariates... (21)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....s.....
Selected second order covariates are: c.nodeg#c.ed
Final model is: ed nodeg hisp c.nodeg#c.ed
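The two counts shown in parentheses in the output, 10 and 21, can be derived as follows. This is a worked reading of the numbers under the assumption that one term is added in every round, which is the worst case. The symbols k, m, q and N are notation introduced here.

```latex
\begin{align*}
% First-order step: k candidates, then k-1, ..., then 1
N_{\text{lin}}(k) &= k + (k-1) + \dots + 1 = \frac{k(k+1)}{2},
  & N_{\text{lin}}(4) &= 10 \\
% Second-order candidates from m linear terms: m squares plus all pairwise interactions
q(m) &= m + \binom{m}{2} = \frac{m(m+1)}{2},
  & q(3) &= 6 \\
% Second-order step: the same stepwise count applied to q candidates
N_{\text{quad}}(q) &= \frac{q(q+1)}{2},
  & N_{\text{quad}}(6) &= 21
\end{align*}
```

The search stops early when no remaining candidate passes its threshold, so the number of models actually fitted can be smaller.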

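4.3 Flowchart

A flowchart gives a more intuitive picture of how psestimate screens the first- and second-order terms. To keep the example simple, there are two candidate covariates, a and b, and the log-likelihoods of the models are assumed to be ordered as LL1 > LL2 > clinear() > LL3 and LL4 > LL5 > LL6 > cquadratic() > LL7 > LL8.

[Figure: flowchart of the psestimate selection procedure]

5. A complete PSM workflow

5.1 psestimate: selecting the first- and second-order matching terms

Step one: use psestimate to select the matching variables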

use "nswre74.dta", clear
psestimate treat ed, totry(age black hisp nodeg)

The selected matching variables are:

Selecting first order covariates... (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
...s..s..
Selected first order covariates are: nodeg hisp
Selecting second order covariates... (21)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....s.....
Selected second order covariates are: c.nodeg#c.ed
Final model is: ed nodeg hisp c.nodeg#c.ed

The final matching variables are ed, nodeg, hisp, and c.nodeg#c.ed.

5.2 psmatch2: matching on the selected variables

Run the matching with the variables selected above:

psmatch2 treat ed nodeg hisp c.nodeg#c.ed, logit ate neighbor(1) common caliper(.05) ties

The result is:

Logistic regression                             Number of obs     =        445
                                                LR chi2(4)        =      17.03
                                                Prob > chi2       =     0.0019
Log likelihood = -293.58317                     Pseudo R2         =     0.0282

------------------------------------------------------------------------------
       treat |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          ed |   .5093428   .3298117     1.54   0.123    -.1370762    1.155762
       nodeg |   6.506319   4.112404     1.58   0.114    -1.553845    14.56648
        hisp |  -.5954105   .3754841    -1.59   0.113    -1.331346    .1405248
             |
c.nodeg#c.ed |  -.6068825   .3375387    -1.80   0.072    -1.268446    .0546813
             |
       _cons |  -6.021438    4.05441    -1.49   0.138    -13.96794    1.925059
------------------------------------------------------------------------------

5.3 pstest: testing the balancing assumption

pstest ed nodeg hisp c.nodeg#c.ed, t(treat)

The result is:

------------------------------------------------------------------------------
                        |       Mean               |     t-test    |  V(T)/
Variable                | Treated Control    %bias |    t    p>|t| |  V(C)
------------------------+--------------------------+---------------+----------
ed                      |  10.29   10.464     -9.6 |  -0.91  0.363 |  1.28
nodeg                   | .71585   .69399      5.3 |   0.46  0.648 |     .
hisp                    | .06011   .06011     -0.0 |  -0.00  1.000 |     .
c.nodeg#c.ed            | 6.7814    6.694      2.1 |   0.18  0.854 |  0.96
------------------------------------------------------------------------------

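After matching, none of the matching variables differs significantly between the treated and control groups, so the balancing assumption is satisfied.

5.4 psgraph: plotting the propensity score distributions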

psgraph

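The result is:

[Figure: psgraph output, propensity score distributions of the treated and control groups]

The plot also shows that the propensity score distributions of the treated and control groups are roughly balanced.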

References

Dehejia, Rajeev H., and Sadek Wahba. 1999. "Causal Effects in Nonexperimental Studies." Journal of the American Statistical Association 94 (448): 1053-1062.
Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. New York: Cambridge University Press.
Imbens, Guido W. 2015. "Matching Methods in Practice: Three Examples." Journal of Human Resources 50 (2): 373-419.
LaLonde, Robert J. 1986. "Evaluating the Econometric Evaluations of Training Programs with Experimental Data." The American Economic Review 76 (4): 604-620.
