Friday, May 27, 2011

sas缺失值missing data详解

有数据的地方就有缺失值,正确管理缺失值,对我们写出清晰明了的代码非常有帮助。本文对SAS中的缺失值作一个详细的介绍。

  
1 SAS的缺失值
SAS的缺失值分为两类,一类是数值型的缺失值,用(.)表示,另一类是字符型的缺失值,用(’’)或者(’ ‘)表示。例:
data miss1;
  input charmiss $ 1 nummiss 3;
cards;
A 1
 
  3
D 4
;
proc print;
run;
结果:
Obs    charmiss    nummiss
       A           1
       B           .
                   3
       D           4

除了上述的表示方式之外,我们还可以用特殊字符A-Z或_来表示(如.Z等),这在调查问卷等很有用,比如A表示不回答,N表示不知道,R表示未回复,_表示拒绝等。这样子使得缺失值有不同的含义,例:
data miss2;
  input charmiss $ 1 nummiss 3-4;
    missing n a r _;
cards;
A -1
A .
B r
  3
D 0
  a
E 6
F n
G _
;
proc print;
run;
结果:
Obs    charmiss    nummiss

       A           -1
       A            .
       B            R
                    3
       D            0
                    A
       E            6
       F            N
       G            _
除此这外,我们还可以对不同的缺失缺赋予不同的格式。例:
proc format;
  value spec .='Missing'
             ._='Illegible'
      .R='Refused'
      .N='Not Done'
      .A='Absent';
run;

proc print data=miss2;
var charmiss nummiss;
 format nummiss spec.;
run;
结果:
Obs    charmiss     nummiss

       A               -1
       A        Missing
       B        Refused
                        3
       D                0
                Absent
       E                6
       F        Not Done
       G        Illegible

2 缺失值的引用
我们先查看一下不同的缺失值的排序方式:
proc sort data=miss2 out=miss3;
by nummiss;
run;
proc print data=miss3;
run;
结果:
Obs    charmiss    nummiss
       G            _
       A            .
                    A
       F            N
       B            R
       A           -1
       D            0
                    3
       E            6
我们从上面的结果可以看到,从小到大的排序结果为:_ . A N R然后是数字。

缺失值的引用1:查看数据集中nummiss非.的数据
proc sort data=miss2(where=(nummiss ne .)) out=miss3;
by nummiss;
run;
proc print data=miss3;
run;
结果:
Obs    charmiss    nummiss
       G            _
                    A
       F            N
       B            R
       A           -1
       D            0
                    3
       E            6

缺失值的引用2:查看数据集中nummiss非空的数据
proc sort data=miss2(where=(nummiss not in (. ._ .r .a .n))) out=miss3;
by nummiss;
run;
proc print data=miss3;
run;
结果:
Obs    charmiss    nummiss
       A           -1
       D            0
                    3
       E            6

缺失值的引用3:查看数据集中nummiss非.Z的数据
proc sort data=miss2(where=(nummiss le .z))  out=miss3;
by nummiss;
run;
结果:
Obs    charmiss    nummiss
       G           _
       A           .
                   A
       F           N
       B           R

缺失值的引用4:查看数据集中nummiss比.Z大的数据
proc sort data=miss2(where=(nummiss gt .z))  out=miss3;
by nummiss;
run;
结果:
Obs    charmiss    nummiss

       A           -1
       D            0
                    3
       E            6

缺失值的引用5:查看数据集中charmiss非空的数据
proc sort data=miss2(where=(charmiss ne ''))  out=miss3;
by nummiss;
run;
结果:
Obs    charmiss    nummiss
       G            _
       A            .
       F            N
       B            R
       A           -1
       D            0
       E            6

缺失值的引用6:查看数据集中nummiss非空数据
proc sort data=miss2(where=(nummiss is not missing)) out=miss3;
by nummiss;
run;
或者:
proc sort data=miss2(where=(nummiss is not null)) out=miss3;
by nummiss;
结果:
Obs    charmiss    nummiss
       A           -1
       D            0
                    3
       E            6

3 MISSING和NMISS函数介绍
MISSING:可用于字符型和数字型变量,当变量为空时,返回1,当变量不为空时,返回0。特定的缺失字符如A N R _也为空处理。例如:
proc sort data=miss2(where=(missing (nummiss)))  out=miss3;
by nummiss;
run;
结果:
Obs    charmiss    nummiss
       G           _
       A           .
                   A
       F           N
       B           R
这里,当nummiss为空时,missing (nummiss)返回1,数据输出到miss3中。当nummiss不为空时,missing (nummiss)返回0,数据不输出。

NMISS:只用于数字型变量,返回一组变量的值中缺失值个数。例:
先建立一个数据集:
data test;
  n=_n_;
 input score1 - score4;
cards;
   1   .      3
   .   3    2   2
   2   2    2   2
   1   .    .   .
   .   .    .   .
   3   3    2   1
   4   .    1   1
;
run;
data countmiss;
set test;
 miss_c=nmiss(of score1-score4);
run;
proc print;
run;
结果:
Obs    n    score1    score2    score3    score4    miss_c
    1       1         .         .         3         2
    2       .         3         2         2         1
    3       2         2         2         2         0
    4       1                 .         .         3
    5       .         .         .         .         4
    6       3         3         2         1         0
    7       4         .         1         1         1
这里miss_c记录了每一行有多少个变量的值为缺失值。
类似的,我们也可以用missing函数来实现
data countmiss;
  set test;
by n;
miss_c=sum(missing(score1),missing(score2),missing(score3),missing(score4));
run;

我们还可以衍生更多的变量:
data test2;
  set countmiss;
by n;
 if miss_c=0 then total=sum(of score1-score4);
  else if miss_c lt 4 then flag=1;
   else if miss_c=4 then flag=2;
run;
结果:
Obs    n    score1    score2    score3    score4    miss_c    total    flag
    1       1         .         .         3         2        .        1
    2       .         3         2         2         1               1
    3       2         2         2         2         0        8        .
    4       1         .         .         .         3        .        1
    5       .         .         .         .         4        .        2
    6               3         2         1         0        9        .
    7       4         .         1         1         1        .        1

4 merge和update缺失值数据
UPDATE:用新数据集的数据更新主表数据
MERGE:将两个数据集合并为一个数据集。
先建立两个数据集MISSDT1和MISSDT2
data MISSDT1;
 input visit labdate $20.;
cards;
     01JAN2006
     02JAN2006
     03JAN2006
     04JAN2006
     05JAN2006
;
run;
data MISSDT2;
 input visit labdate $20.;
cards;
     01FEB2006
     
     
     04FEB2006
     05FEB2006
;
run;
用merge:
data merged;
 merge missdt1 
       missdt2;
by visit;
run;
结果:
Obs    visit     labdate
      1      01FEB2006
      2              .
      3              .
      4      04FEB2006
      5      05FEB2006
用UPDATE
data updated;
 update missdt1 
        missdt2;
 by visit;
run;
结果:
Obs    visit     labdate
      1      01FEB2006   
      2      02JAN2006   
      3      03JAN2006
      4      04FEB2006   
      5      05FEB2006

如果加上选项UPDATEMODE=< MISSINGCHECK|NOMISSINGCHECK >
data updated;
 update missdt1 
        missdt2 UPDATEMODE=  NOMISSINGCHECK;
 by visit ;
run;
得到的结果跟merge一样。

5 SAS中缺失值处理
利用Array填补缺失值0
%macro missing(data);
data &data;
set &data;
array TEMP _numeric_;
 do over TEMP;
 if missing(TEMP) then TEMP=0;
 end;
run;
data &data;
set &data;
array TEMP _character_;
 do over TEMP;
 if missing(TEMP) then TEMP=0;
 end;
run;
%mend;

%missing(cx);

删除缺失值
如果出现缺失值,就删掉该条记录
data AnalysisData;
   set RawData;
   array score {5} score1-score5;
   do Treament=1 to 5;
     if missing(score[Treament]) then delete;
     end;  
   run;

如果出现大于N(这里n=2)个的缺失值,就删掉该记录。
data AnaylsisData;
    set Rawdata;
    if nmiss(of score1-score5)>2 then delete; 
    run;

6 缺失值处理要注意的地方
缺失值小于0,即当变量X为缺失值时,X<0为true。
对缺失值与其它值进行+或-运算时,结果为缺失值。如果想返回非缺失值,就得用sum函数,其返回非缺失值的数据之和。
很多过程步(SUMMARY, TABULATE, FREQ, CALENDAR等)都忽略缺失值,除非你加入nomissing选项
有些过程步会分别处理缺失值数据,例如SUMMARY将分别得到非缺失的数据个数和缺失值的数据个数。
例:
data _null_ ;                                              
a=. ;                                                     
b=0 ;                                                    
c=-7 ;                                                   
d=99 ;                                                   
add=a+b+c+d ;                                             
put '缺失值与非缺失值相加,结果为缺失值 ' add= ; 
sum=sum(a,b,c,d) ;                                        
put '缺失值与非缺失值用sum函数,结果为非缺失值之和:' sum= ;      
summiss=sum(.,a) ;                                        
put '缺失值求sum函数,结果为缺失值 ' summiss= ;           
sumzero=sum(0,.,a) ;                                      
put '0和缺失值求sum函数,结果为0 ' sumzero= ;          
*查看缺失值与0的大小关系;
if a<0 then                                              
put '缺失值小于0' ;                          
else if a>0 then                                          
put '缺失值大于0' ;                       
run ;      
日志的输出结果:
缺失值与非缺失值相加,结果为缺失值 add=.
缺失值与非缺失值用sum函数,结果为非缺失值之和:sum=92
缺失值求sum函数,结果为缺失值 summiss=.
0和缺失值求sum函数,结果为0 sumzero=0
缺失值小于0

主要参考文献:
MISSING! - Understanding and Making the Most of Missing Data
http://www2.sas.com/proceedings/sugi31/025-31.pdf
Tools of Miss-Calculation: Managing Missing Values with SAS
http://www2.sas.com/proceedings/forum2008/082-2008.pdf
Chapter 1 A Collection of Useful Tips
http://www.sas.com/service/doc/pubcat/chaps/55513.pdf

0 comments:

 
Copyright 2010 NiuNiu's Warehouse. Powered by Blogger
Blogger Templates created by DeluxeTemplates.net | Blogger Styles | Balance Transfer Credit Cards
Wordpress by Wpthemescreator
Blogger Showcase