NiuNiu's Warehouse: 数据清理data Cleaning技术大全及SAS实现

转载请注明出处：http://blog.sina.com.cn/s/blog_5d3b177c0100esmx.html

1 简介
数据清理是数据准备一个很重要的环节，什么是数据清理呢？数据清理
Is for techies 技术人员的事
Is just coding 只是写代码
Is boring 很无聊
Consumes up to 80 % of the project要花掉项目80%的时间
Was not in the focus of data mining literature so far在数据挖掘中数据清理相关的文章不是很多
Is something that SAS can excellently do SAS可以很好地搞定
Is vital to the quality of the project 是项目质量的一个重要步骤

首先说明一下，由于没搞到本书的数据，所以就用其它的书《Predictive Modeling Using Logistic Regressio》的数据进行程序调试。

2 字符型数据清理
2.1 观察数据集
2.1.1 首先可以观察一下数据集中，所有字符型变量的数据情况：
proc freq data=pmlr.Develop(drop=branch);
tables _character_ / nocum nopercent;
run;
关键词_character_表示所有的字会型变量，其它的_numeric_和_all_已经讲解过。Drop和keep选项可以剔除或保留选择的变量。

2.1.2 观测了变量的值的情况后，我们就可能会发现一些错误值，这时可以将这些异常值输出出来（输出到日志中，或输出到数据集中，或外部文件）：
data _null_;
set pmlr.Develop;
file print; ***send output to the output window;
***check Res;
if Res not in ('R' 'S' 'U') then put Res= ;
***check Branch;
if verify(trim(Branch),'B0123456789') and not missing(Branch) and substr(Branch,1,1) ne trim('B')
then put Branch= ;
run;
这里，如果RES变量的值不是R、S、U，则为错误的值。

2.1.3 通过format来查看错误值个数
这里我们假设Res中的U是错误的值，我们可以用以下操作来实现错误值个数的观测
proc format;
value $Res 'R','S' = 'Valid'
other = 'Miscoded';
run;
proc freq data=pmlr.Develop;
format Res $Res.;
tables Res/ nocum nopercent missing;
run;

2.2 几个重要的函数
Verify： SAS的verify函数在数据处理和data clean的过程中十分有用，verify函数的第一个参数是源字符串，后续参数都是待查找字符，如果源字符串中包含的都是待查找字符，verify就返回0，否则，返回不包含字符在源字符串中的位置。由此可见，我们可以利用verify函数对字符串作非正常字符的检测，从而达到clean data的目的。
verify的语法:verify(source, chkstring1[, chkstring2, … , chkstringn]) （参考文献：http://blog.chinaunix.net/u/6776/showart_1073354.html）
TRIM(s) 返回去掉字符串s的尾随空格的结果。
MISSING：可用于字符型和数字型变量，当变量为空时，返回1，当变量不为空时，返回0。这个前面的博文已有讲解。
NOTDIGIT(character_value) 在字符串中搜索非数字字符，若找到则返回其第一次出现的位置，在这里，notdigit(character_value)和verify(character_value,'0123456789')实现的功能相同。

3数值型数据清理
3.1 极值异常值
3.1.1 三个重要的过程步：
means过程步：得到缺失值和非缺失值个数、最大值、最小值等统计值。
proc means data=pmlr.Develop n nmiss min max maxdec=3;
var DDABal DepAmt;
run;

tabulate过程步：得到缺失值和非缺失值个数、均值、最大值、最小值等统计值
proc tabulate data=pmlr.Develop format=7.3;
var DDABal DepAmt;
tables DDABal DepAmt,
n*f=7.0 nmiss*f=7.0 mean min max / rtspace=18;
keylabel
n = 'Number'
nmiss = 'Missing'
mean = 'Mean'
min = 'Lowest'
max = 'Highest';
run;

univariate过程步：UNIVARIATE过程除了可以提供MEANS和SUMMARY所提供了基本统计数外，还提供位置特征数（如Med中位数，Mode众数）和偏度系数（Skewness）、峰度系数(Kurtosis)这些变异数。此外它还可通过FREQ选项统计变量次数及频率，通过PLOT选项给出茎叶图(Stem Leaf)和正态概率密度图(Normal Probability Plot)，通过NORMAL选项进行变数正态性检验（给出W：Normal值）。
proc univariate data=pmlr.Develop plot;
var DDABal DepAmt;
run;

除了上述过程步外，还有很多过程步也可以完成相似的功能，大家可以去查阅相关文献。

3.2 最大（小）N条数据
前面的方法可以得到最大最小值，但是当我们要取最大的N条数据，或最小的N条数据时，可以用下面的选项：
univariate过程步：用nextrobs或nextrvals选项，可以得到最大的和最小的N条数据
proc univariate data=pmlr.Develop nextrobs=10; ** nextrvals=10;
var DDABal;
run;

3.3 得到分位数数据
3.3.1 用univariate过程步得到前N%的数据：
proc univariate data=pmlr.Develop noprint;
var DDABal;
output out=tmp pctlpts=10 90 pctlpre = L_; **得到10%和90%分位数;
run;
data hilo;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set tmp; **将得到10%和90%分位数数据集与原数据集合并,生成新数据集hilo;
if DDABal le L_10 and not missing(DDABal) then do;
Range = 'Low '; **如果小于10%分位数的值,则标记为low并输出;
output;
end;
else if DDABal ge L_90 then do;
Range = 'High'; **如果大于90%分位数的值,则标记为high并输出;
output;
end;
run;
将上面的代码参数化一下，就可以得到一个宏，功能为输出前N%的数据
%macro hilowper(Dsn=,
Var=,
Percent=,
Idvar= );

%let Up_per = %(100 - &Percent);

proc univariate data=&Dsn noprint;
var &Var;
id &Idvar;
output out=tmp pctlpts=&Percent &Up_per pctlpre = L_;
run;

data hilo;
set &Dsn(keep=&Idvar &Var);
if _n_ = 1 then set tmp;
if &Var le L_&percent and not missing(&Var) then do;
range = 'Low';
output;
end;
else if &Var ge L_&Up_per then do;
range = 'High';
output;
end;
run;

proc sort data=hilo;
by &Var;
run;

title "Low and High Values for Variables";
proc print data=hilo;
id &Idvar;
var Range &Var;
run;

proc datasets library=work nolist;
delete tmp hilo;
run;
quit;
%mend hilowper;

3.3.2 用rank过程步得到前N%的数据：
Rank过程步详见（http://blog.sina.com.cn/s/blog_5d6632e70100ddqe.html），下面介绍一下宏，实现上例的功能：
%macro top_bottom_nPercent
(Dsn=,
Var=,
Percent=,
Idvar=);
%let Bottom = %(&Percent - 1);
%let Top = %(100 - &Percent);

proc format;
value rnk 0 - &Bottom = 'Low'
&Top - 99 = 'High';
run;

proc rank data=&Dsn(keep=&Var &Idvar)
out=new(where=(&Var is not missing))
groups=100;
var &Var;
ranks Range;
run;

proc sort data=new(where=(Range le &Bottom or
Range ge &Top));
by &Var;
run;

***Produce the report;
proc print data=new;
title "Upper and Lower &Percent.% Values for %upcase(&Var)";
id &Idvar;
var Range &Var;
format Range rnk.;
run;

proc datasets library=work nolist;
delete new;
run;
quit;
%mend top_bottom_nPercent;

3.4 得到最大或最小N个数据
实现方法：
最小N个数据：排序后直接用obs=N即可得到
最大N个数据：排序，然后得到样本总数Num，减去N-1，再用firstobs=(Num-N+1)得到
proc sort data=pmlr.Develop(keep=DDABal DepAmt) out=tmp;
by DepAmt;
run;
data _null_;
set tmp nobs=Num_obs;
call symputx('Num',Num_obs);
stop;
run;
%let High = %(&Num - 9);
title "Ten Highest and Ten Lowest Values for HR";
data _null_;
set tmp(obs=10)
tmp(firstobs=&High) ;
file print;
if _n_ le 10 then do;
if _n_ = 1 then put / "Ten Lowest Values" ;
put "Patno = " DepAmt @15 "Value = " DDABal;
end;
else if _n_ ge 11 then do;
if _n_ = 11 then put / "Ten Highest Values" ;
put "Patno = " DepAmt @15 "Value = " DDABal;
end;
run;
用宏实现：
%macro highlow(Dsn=,
Var=,
Idvar=,
n= );
proc sort data=&Dsn(keep=&Idvar &Var
where=(&Var is not missing)) out=tmp;
by &Var;
run;
data _null_;
set tmp nobs=Num_obs;
call symput('Num',Num_obs);
stop;
run;

%let High = %(&Num - &n + 1);
title "&n Highest and Lowest Values for &Var";
data _null_;
set tmp(obs=&n)
tmp(firstobs=&High) ;
file print;
if _n_ le &n then do;
if _n_ = 1 then put / "&n Lowest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
else if _n_ ge %(&n + 1) then do;
if _n_ = %(&n + 1) then put / "&n Highest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
run;
proc datasets library=work nolist;
delete tmp;
run;
quit;
%mend highlow;

3.5 异常值错误值重复值缺失值
对于错误值，处理字符型变量用到的方法大多也可以用到数值型变量上，如上面的if then ,以及format，这里就不作多的讲解。
缺失值和重复值请见前面的博文：
缺失值：sas缺失值missing data详解 http://blog.sina.com.cn/s/blog_5d3b177c0100e6lm.html
重复值：除SAS数据集中的重复值方法汇总 http://blog.sina.com.cn/s/blog_5d3b177c0100bblp.html
下面说一下异常值的处理方法
3.5.1 用均值标准差的方法
proc means data=pmlr.Develop noprint;
var DDABal;
output out=means(drop=_type_ _freq_)
mean=M_DDABal
std=S_DDABal;
run;
data _null_;
file print;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set means;
if DDABal lt M_DDABal - 2*S_DDABal and not missing(DDABal) or
DDABal gt M_DDABal + 2*S_DDABal then put DDABal=;
run;

3.5.2 用分位数去极值，再用均值标准差
proc rank data=pmlr.Develop(keep=DepAmt) out=tmp groups=5;
var DepAmt;
ranks R_DepAmt;
run;
proc means data=tmp noprint;
where R_DepAmt not in (0,4); ***只要中间60%的数据来计算均值和标准差;
var DepAmt;
output out=means(drop=_type_ _freq_)
mean=M_DepAmt
std=S_DepAmt;
run;
%let N_sd = 2;
%let Mult = 2.12; title "Outliers Based on Trimmed Statistics";
data _null_;
file print;
set pmlr.Develop;
if _n_ = 1 then set means;
if DepAmt lt M_DepAmt – &N_sd*S_DepAmt*&Mult and not missing(DepAmt) or
DepAmt gt M_DepAmt + &N_sd*S_DepAmt*&Mult then put Patno= HR=;
run;

3.5.3 基于分位数的异常值处理
基本箱线图的四分位数异常值处理的原理：在Q3＋1.5IQR（四分位距）和Q1－1.5IQR处画两条与中位线一样的线段，这两条线段为异常值截断点，称其为内限；在F＋3IQR和F－3IQR处画两条线段，称其为外限。处于内限以外位置的点表示的数据都是异常值，其中在内限与外限之间的异常值为温和的异常值（mild outliers），在外限以外的为极端的异常值(extreme outliers)。参考文献如下：http://baike.baidu.com/view/1326550.htm?func=retitle
%macro interquartile
(
var=,
n_iqr=2 );

title "Outliers Based on &N_iqr Interquartile Ranges";

proc means data=&dsn noprint;
var &var;
output out=tmp
q1=Lower
q3=Upper
qrange=Iqr;
run;
data _null_;
set &dsn(keep=&Idvar &Var);
file print;
if _n_ = 1 then set tmp;
if &Var le Lower - &N_iqr*Iqr and not missing(&Var) or
&Var ge Upper + &N_iqr*Iqr then
put &Idvar= &Var=;
run;

proc datasets library=work;
delete tmp;
run;
quit;
%mend interquartile;

4 日期型数据
日期型数据可以延用字符型或数值型数据的处理方法，这里不作介绍。

5 多个数据集的处理
在SAS中处理多个数据集的情况太多，这里不可能一一列出，因此，本文仅讲一点内容，抛砖引玉：
这里讲的例子是两个数据集合并，以及关键词IN(Creates a variable that indicates whether the data set contributed data to the current observation)。
首先创建两个数据集:
data one;
input Patno x y;
datalines;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z;
datalines;
1 56
3 67
4 88
5 98
13 99
;
run;
注意：Merge数据集前一定要将数据集都排序：
proc sort data=one;
by Patno;
run;
proc sort data=two;
by Patno;
run;
然后在日志中列出数据集的数据分布情况
data _null_;
file print;
merge one(in=Inone)
two(in=Intwo) end=Last;
by Patno;
if not Inone then do;
put "ID " Patno "is not in data set one";
n + 1;
end;
else if not Intwo then do;
put "ID " Patno "is not in data set two";
n + 1;
end;
if Last and n eq 0 then
put "All ID's match in both files";
run;
IN是个很好用的关键词，灵活运用可以得到很多想要的结果，例如我们要实现SQL中的right join 或left join的功能：
data missing;
merge one
two(in=Intwo); **right join功能;
by Patno;
if Intwo;
run;
或者SQL中的not in功能：
data missing;
merge one
two(in=Intwo);
by Patno;
if not Intwo; **not in功能;
run;

6 数据集间的比较
data one;
input Patno z y;
datalines;
1 56 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z;
datalines;
1 56 79
2 56 .
3 11 99
5 98 87
12 13 14
;
run;
proc compare base=one compare=two brief;
id Patno;
run;

对于上述绝大多数操作，我们同样可以用SQL过程步实现，请大家参考以前的博文：
SAS中的SQL语句完全教程之一：SQL简介与基本查询功能
http://blog.sina.com.cn/s/blog_5d3b177c0100cksl.html
SAS中的SQL语句完全教程之二：数据合并与建表、建视图
http://blog.sina.com.cn/s/blog_5d3b177c0100cm1t.html
SAS中的SQL语句完全教程之三：SQL过程步的其它特征
http://blog.sina.com.cn/s/blog_5d3b177c0100cn8v.html

参考文献：
《Codys Data Cleaning Techniques Using SAS》

NiuNiu's Warehouse

Pages

Saturday, May 28, 2011

数据清理data Cleaning技术大全及SAS实现

0 comments:

Search

About Me

Blog Archive

Music

Total Pageviews