Saturday, May 28, 2011

Character variable concatenation functions

The CAT* functions and CALL routines concatenate character strings.
These functions and CALL routines differ in how they treat leading
and trailing blanks in character arguments, and in whether separator
strings are inserted between the concatenated strings:

 * CAT does not remove leading or trailing blanks, and does not
   insert separators.

 * CATT trims trailing blanks, but not leading blanks, and does not
   insert separators.

 * CATS strips both leading and trailing blanks, and does not
   insert separators.

 * CATX strips both leading and trailing blanks, and inserts
   separators. The first argument to CATX specifies the separator.

All of the CAT* functions and CALL routines strip both leading and
trailing blanks from numeric arguments after formatting the numeric
value with the BEST. format.

CAT can be used only as a function. CATT, CATS, and CATX can be
used as either functions or CALL routines. When used as functions,
CAT, CATT, CATS, and CATX return strings with a length of up to:

 * 200 characters in WHERE clauses and/or PROC SQL.

 * 32767 characters in other parts of the DATA step.

 * 65534 characters when called from the macro processor.

When CATT and CATS are used as CALL routines, the result is returned
in the first argument. When CATX is used as a CALL routine, the result
is returned in the second argument (since the first argument is the
separator). The argument that contains the result must be a variable;
the following arguments are appended to the value of this variable.
If this variable is not long enough to contain the entire result, then:

 * a warning is printed that the result was truncated,

 * a note is printed showing the location of the function call
   and telling which argument caused the truncation

 * in the DATA step, _ERROR_ is set to one.

When CATT, CATS, or CATX are used as CALL routines, the result is
never assigned a blank value due to truncation. Note that this behavior
differs from that of the CAT, CATS, CATT, and CATX functions, as
described below.

If CAT, CATS, CATT, or CATX are used as functions, the result may
either be assigned directly to a variable, or returned in a temporary
buffer that the user normally is unaware of. There is no simple way
to determine which of these alternatives will occur except to try it
and see. If a temporary buffer is used, then the length of the buffer
depends on the calling environment, and the value in the buffer may
subsequently be truncated after the CAT, CATS, CATT, or CATX function
has finished, in which case there will be no message about truncation.
However, if the variable or buffer to which the result is assigned is
not long enough to contain the concatenation of all the arguments, then:

 * the result is changed to a blank value in the DATA step, SQL, and
   possibly other calling environments

 * a warning is printed that the result was either truncated
   or set to a blank value, depending on the calling environment

 * a note is printed showing the location of the function call
   and telling which argument caused the truncation

 * in the DATA step, _ERROR_ is set to one.

If truncation occurs after the function has finished, the result
is not set to blank, no messages are printed, and _ERROR_ is not
set to one.

The results of the CAT* functions are typically equivalent to those
produced by certain combinations of the concatenation operator (||)
and the TRIM and LEFT functions. However, the CAT* functions are
faster, more convenient, and can be used with variable lists in
calling environments that support variable lists. The following
table shows equivalents of the CAT* functions using || and other
functions, assuming that X1, X2, X3, and X4 are character variables:

Function          Equivalent || code
------------------------------------

CAT(OF X1-X4)     X1||X2||X3||X4

CATT(OF X1-X4)    TRIM(X1)||TRIM(X2)||TRIM(X3)||TRIM(X4)

CATS(OF X1-X4)    TRIM(LEFT(X1))||TRIM(LEFT(X2))||TRIM(LEFT(X3))||
                  TRIM(LEFT(X4))

CATX(SP,OF X1-X4) TRIM(LEFT(X1))||SP||TRIM(LEFT(X2))||SP||
                  TRIM(LEFT(X3))||SP||TRIM(LEFT(X4))
                  (where SP is a separator such as a blank or comma)

The following table shows statements that are equivalent to the
use of CATT, CATS, and CATX as CALL routines:

CALL Routine            Equivalent statement
--------------------------------------------

CALL CATT(OF X1-X4);    X1=TRIM(X1)||TRIM(X2)||TRIM(X3)||TRIM(X4);

CALL CATS(OF X1-X4);    X1=TRIM(LEFT(X1))||TRIM(LEFT(X2))||
                        TRIM(LEFT(X3))||TRIM(LEFT(X4));

CALL CATX(SP,OF X1-X4); X1=TRIM(LEFT(X1))||SP||TRIM(LEFT(X2))||SP||
                        TRIM(LEFT(X3))||SP||TRIM(LEFT(X4)); (where SP
                        is a separator such as a blank or comma)

CATX differs slightly from the || code shown above, in that CATX omits
the corresponding separator if any of the arguments to be concatenated
is completely blank. For example, CATX("+","X"," ","Z"," ") produces
"X+Z".


Examples:

In this DATA step, the CALL CATS routine prints messages because the
variable A has a length of only 4, which is not long enough to hold
the concatenation of two strings with a nonblank length of 3:

   data _null_;
      length a b $4;
      a = 'abc';
      b = 'xyz';
      call cats(a,b);
      put a= b=;
   run;

In this DATA step, truncation is done silently after the CATS function
finishes:

   data _null_;
      length a b $4;
      a = 'abc';
      b = 'xyz';
      a = cats(a,b);
      put a= b=;
   run;

In this DATA step, truncation occurs inside the CATS function,
appropriate messages are issued, and the result is set to a blank
value:

   data _null_;
      length a b c $4;
      a = 'abc';
      b = 'xyz';
      c = cats(a,b);
      put a= b= c=;
   run;

In this DATA step, truncation is done silently:

   data _null_;
      length a b c $4;
      a = 'abc';
      b = 'xyz';
      c = trim(cats(a,b));
      put a= b= c=;
   run;
Reference: ftp://ftp.sas.com/pub/neural/functions/cat.txt

To my warehouse...



I know I am always out of date. And I know a blog is something out of date, since everybody now seems to play with Twitter. But still, I went to a lot of trouble to create you.


I spent more than 13 hours creating you. At one point I even decided to give up, because the code for creating a table of contents kept working against me. However, after trying MSN Spaces and Facebook, I still preferred you for your clean and tidy layout, so I endured the pain of building the table by hand. Unbelievable. I always have great fun trying Google's products. However, from time to time I do find some shortcomings of the products unbearable. Maybe it is due to a culture difference?


The main purpose of creating you is just as your name indicates: I need a warehouse. I need a warehouse to store all the interesting things and useful links I find online. And I just wonder whether you can bring some fun to me. Hope you can :)

The Complete Guide to SQL in SAS, Part 1: Introduction to SQL and Basic Query Features

This series is based mainly on "SQL Processing with the SAS System (Course Notes)". I downloaded the book somewhere online but forget where, so I cannot provide a download link; if you need it, send me an email and I will mail it out periodically. Finally, a disclaimer: all materials are for study only and must not be used for commercial purposes; you bear the consequences otherwise.

Please credit the source when reposting: http://blog.sina.com.cn/s/blog_5d3b177c0100cksl.html

1 Introduction to PROC SQL
1.1 PROC SQL can do the following:
query SAS data sets, generate reports from SAS data sets, combine data sets in various ways, create or delete SAS data sets, views, and indexes, and update existing data sets. It lets the SAS System use SQL statements, and for many tasks it can be used in place of the SAS DATA step. Note that PROC SQL is not meant to replace the DATA step, nor is it a customized reporting tool; it is a query tool for data processing.

1.2 Characteristics of PROC SQL
PROC SQL does not need to be restarted for each query; each statement is processed individually. Query results are printed without PROC PRINT, and data can be ordered without PROC SORT. No RUN statement is needed; use QUIT to end PROC SQL.

1.3 PROC SQL statements
SELECT: query data from tables and print the selected columns
ALTER: add, delete, or modify the columns of a table
CREATE: create a table
DELETE: delete rows from a table
DESCRIBE: list the attributes of a table
DROP: delete tables, views, or indexes
INSERT: insert rows into a table
RESET: reset PROC SQL options without restarting the procedure
UPDATE: modify column values in existing rows of a data set


2 Basic query features

2.1 Basic syntax of the SELECT statement
SELECT object-item <, ...object-item>
FROM from-list
<WHERE sql-expression>
<GROUP BY group-by-item <, ...group-by-item>>
<HAVING sql-expression>
<ORDER BY order-by-item <, ...order-by-item>>;
Here SELECT specifies the columns to select,
FROM specifies the tables to query,
WHERE gives the condition for subsetting the rows,
GROUP BY groups the rows for summarization,
HAVING subsets the groups produced by GROUP BY,
ORDER BY sorts the result rows.

2.2 Features of the SELECT statement
It can select rows that satisfy conditions, group data, sort data, apply formats to data, and query at most 32 tables at a time. Note also that in the SAS System, table names and variable names are generally limited to 32 characters, while librefs, filerefs, format names, and the like are limited to 8 characters.

2.3 The VALIDATE keyword
The VALIDATE keyword applies only to SELECT statements. It tests the syntax of a query without running it, checks whether the column names are valid, and prints an error message for an invalid query. Example:
1 proc sql;
2 validate
3 select Region, Product,Sales
4 from sashelp.shoes
5 where Region = 'Africa';
NOTE: PROC SQL statement has valid syntax.
6 quit;
The NOEXEC option can also be used to test syntax. Example:
7 proc sql noexec;
8 select Region, Product,Sales
9 from sashelp.shoes
10 where Region = 'Africa';
NOTE: Statement not executed due to NOEXEC option.
11 quit;
Here the log says the statement was not executed and reports no error, so the statement has no syntax error. But if we add a column that is not in sashelp.shoes, an error appears:
12 proc sql noexec;
13 select Region, Product,Sales,test
14 from sashelp.shoes
15 where Region = 'Africa';
ERROR: The following columns were not found in the contributing tables: test.
16 quit;

2.4 Selecting columns
We can select specific columns as in 2.3, or use * to select all columns. Example:
proc sql;
select *
from sashelp.shoes;
quit;
The FEEDBACK option shows exactly which columns were selected:
17 proc sql feedback;
18 select *
19 from sashelp.shoes;
NOTE: Statement transforms to:
select SHOES.Region, SHOES.Product, SHOES.Subsidiary, SHOES.Stores, SHOES.Sales, SHOES.Inventory, SHOES.Returns
from SASHELP.SHOES;
20 quit;
Here we can see that 7 columns were selected from sashelp.shoes.

2.5 Eliminating duplicate values
The DISTINCT keyword eliminates duplicate values. For example, to list each region name only once:
proc sql;
select distinct Region
from sashelp.shoes;
quit;

2.6 Subsetting with WHERE
2.6.1 Comparison operators
The comparison operators used in WHERE expressions:
LT <   less than
GT >   greater than
EQ =   equal to
LE <=  less than or equal to
GE >=  greater than or equal to
NE ^=  not equal to
For example, to query all rows with Sales greater than 100000:
proc sql;
select *
from sashelp.shoes
where sales>100000;
quit;

2.6.2 IN: the expression is true if the value equals any value in the list. For example, to select all rows whose Region is Africa or Eastern Europe:
proc sql;
select *
from sashelp.shoes
where Region in ('Africa','Eastern Europe');
quit;

2.6.3 Logical operators
OR  |  or
AND &  and
NOT ^  not
For example, to select the rows where Region is Africa or Eastern Europe and Sales exceed 100000:
proc sql;
select *
from sashelp.shoes
where Region in ('Africa','Eastern Europe') and sales>100000;
quit;

2.6.4 CONTAINS or ?: tests whether a column contains a given substring.
For example, to select the rows where Region contains 'Afr':
proc sql;
select *
from sashelp.shoes
where Region ? 'Afr';
quit;

2.6.5 IS NULL or IS MISSING: tests whether a column value is missing.
For example, to find the rows where Region is missing:
proc sql;
select *
from sashelp.shoes
where Region is missing;
quit;
Note that the WHERE clause can also be written in equivalent ways: if Region were a numeric variable, we could use Region=. instead; since Region is a character variable, we can use Region=' '.
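As a sketch, the character-variable form of the test (on the same sashelp.shoes table) returns the same rows as IS MISSING:

```sas
proc sql;
   select *
   from sashelp.shoes
   where Region = ' ';   /* equivalent to: where Region is missing */
quit;
```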

2.6.6 BETWEEN ... AND: selects values within a range.
For example, to select all rows with Sales between 100000 and 200000:
proc sql;
select *
from sashelp.shoes
where sales between 100000 and 200000;
quit;

2.6.7 LIKE: pattern matching.
For example, to select all regions whose names start with A:
proc sql;
select *
from sashelp.shoes
where Region like 'A%';
quit;
Note the two wildcards: '%' matches any number of characters, while '_' matches exactly one character.
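A small sketch of the single-character wildcard (again on sashelp.shoes): 'A_rica' matches 'Africa', which has exactly one character between 'A' and 'rica':

```sas
proc sql;
   select distinct Region
   from sashelp.shoes
   where Region like 'A_rica';   /* '_' matches exactly one character */
quit;
```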

2.6.8 =*: sounds-like matching
Since sashelp.shoes has no suitable data, here is the example from the book:
where lastname =* 'smith' may return smith, smythe, and so on.
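A runnable sketch, assuming the sashelp.class sample table shipped with SAS (this table is not part of the original example):

```sas
proc sql;
   select Name
   from sashelp.class
   where Name =* 'Jon';   /* sounds-like: should match John */
quit;
```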

2.7 Expressions
New columns can be computed from existing ones; the AS keyword names the new column. For example:
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes;
quit;
The result gains a column salesperstores, the average sales per store for each region and product. Note that expressions in PROC SQL can use all SAS functions except LAG and DIF.

A computed result can also be used to subset the query, but remember to use the CALCULATED keyword. For example, to find the rows where average sales per store exceed 5000:
Method 1:
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes
where Sales/Stores>5000;
quit;
Method 2:
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes
where calculated salesperstores>5000;
quit;

2.8 Presenting query results
2.8.1 Sorting with ORDER BY
The default sort order is ascending; the DESC keyword sorts in descending order. For example, to sort the data by Sales in descending order:
proc sql;
select *
from sashelp.shoes
order by Sales DESC;
quit;
Note that any number of columns can be used for sorting, including computed expressions (no CALCULATED needed here), though they are usually among the selected columns.

2.8.2 LABEL and FORMAT
LABEL: changes the column heading shown in the output
FORMAT: changes how the column's values are displayed
For example, to change the label and format of salesperstores:
proc sql;
select Region, Product,Sales,Stores,
Sales/Stores as salesperstores
label='sales per stores'
format=dollar12.2
from sashelp.shoes;
quit;

2.9 Common SQL summary functions
MEAN or AVG: mean
COUNT or N or FREQ: number of nonmissing values
MAX: maximum
MIN: minimum
NMISS: number of missing values
STD: standard deviation
SUM: sum
VAR: variance

2.9.1 Sums with SUM
proc sql;
select Region, Product,Sales,Stores,
sum(Sales,Inventory,Returns) as total
from sashelp.shoes;
quit;

2.9.2 Means with AVG
proc sql;
select Region, Product,Sales,Stores,
avg(Sales) as salesavg
from sashelp.shoes;
quit;

2.9.3 Group means with GROUP BY
proc sql;
select Region,
avg(Sales) as salesavg
from sashelp.shoes
group by Region;
quit;

2.9.4 Counting with COUNT
proc sql;
select Region,count(*) as count
from sashelp.shoes
group by Region;
quit;

2.9.5 Subsetting groups with HAVING
proc sql;
select Region,count(*) as count
from sashelp.shoes
group by Region
having count(*)>50;
quit;
The other summary functions work similarly; use them a few times and they will become familiar.

2.10 Subqueries
2.10.1 Find the regions whose average Sales exceed the overall average Sales:
proc sql;
select Region,
avg(Sales) as salesavg
from sashelp.shoes
group by Region
having avg(Sales)>
(select avg(Sales) from sashelp.shoes);
quit;

2.10.2 The ANY and ALL keywords
>ANY(20,30,40) effectively means >20
>ALL(20,30,40) effectively means >40
For example, to select the United States rows whose Sales are smaller than every Sales value for Africa:
proc sql;
select Region,Sales
from sashelp.shoes
where Region='United States'
and Sales < all (select Sales from sashelp.shoes where Region='Africa');
quit;

2.10.4 EXISTS and NOT EXISTS
proc sql;
select *
from sashelp.shoes
where exists
(select * from sashelp.orsales);
quit;

A Complete Guide to Data Cleaning Techniques and Their SAS Implementation

Please credit the source when reposting: http://blog.sina.com.cn/s/blog_5d3b177c0100esmx.html

1 Introduction
Data cleaning is a very important part of data preparation. What is data cleaning? Data cleaning:
Is for techies
Is just coding
Is boring
Consumes up to 80% of the project
Was not in the focus of data mining literature so far
Is something that SAS can excellently do
Is vital to the quality of the project

A note first: since I could not obtain the data for this book, I debugged the programs with the data from another book, "Predictive Modeling Using Logistic Regression".

2 Cleaning character data
2.1 Examining the data set
2.1.1 First, look at the values of all character variables in the data set:
proc freq data=pmlr.Develop(drop=branch);
tables _character_ / nocum nopercent;
run;
The keyword _character_ refers to all character variables; _numeric_ and _all_ have been covered before. The DROP and KEEP options exclude or keep selected variables.

2.1.2 After examining the variable values, we may spot some invalid values, which we can then write out (to the log, to a data set, or to an external file):
data _null_;
set pmlr.Develop;
file print; ***send output to the output window;
***check Res;
if Res not in ('R' 'S' 'U') then put Res= ;
***check Branch;
if verify(trim(Branch),'B0123456789') and not missing(Branch) and substr(Branch,1,1) ne trim('B')
then put Branch= ;
run;
Here, any value of Res other than R, S, or U is invalid.

2.1.3 Counting invalid values with a format
Suppose the value U of Res is invalid. We can count the invalid values as follows:
proc format;
value $Res 'R','S' = 'Valid'
other = 'Miscoded';
run;
proc freq data=pmlr.Develop;
format Res $Res.;
tables Res/ nocum nopercent missing;
run;

2.2 Several important functions
VERIFY: very useful in data processing and data cleaning. Its first argument is the source string and the remaining arguments are the characters to check for. If the source string contains only those characters, VERIFY returns 0; otherwise it returns the position of the first character in the source string that is not among them. We can therefore use VERIFY to detect unexpected characters in a string and so clean the data.
Syntax: verify(source, chkstring1[, chkstring2, ... , chkstringn]) (reference: http://blog.chinaunix.net/u/6776/showart_1073354.html)
TRIM(s): returns s with its trailing blanks removed.
MISSING: works on both character and numeric variables; returns 1 when the value is missing and 0 otherwise. Covered in an earlier post.
NOTDIGIT(character_value): searches the string for a non-digit character and returns the position of the first one found; notdigit(character_value) does the same thing as verify(character_value,'0123456789').
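A small DATA step sketch of these functions on made-up strings (the values are illustrative):

```sas
data _null_;
   p1 = verify('B1234', 'B0123456789');   /* 0: only allowed characters         */
   p2 = verify('B12X4', 'B0123456789');   /* 4: position of the first 'X'       */
   d  = notdigit('123a5');                /* 4: position of the first non-digit */
   m  = missing('   ');                   /* 1: a blank string is missing       */
   put p1= p2= d= m=;
run;
```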


3 Cleaning numeric data
3.1 Extreme values and outliers
3.1.1 Three important procedures:
PROC MEANS: gives the numbers of missing and nonmissing values, the maximum, the minimum, and other statistics.
proc means data=pmlr.Develop n nmiss min max maxdec=3;
var DDABal DepAmt;
run;
var DDABal DepAmt;
run;

PROC TABULATE: gives the numbers of missing and nonmissing values, the mean, the maximum, the minimum, and other statistics
proc tabulate data=pmlr.Develop format=7.3;
var DDABal DepAmt;
tables DDABal DepAmt,
n*f=7.0 nmiss*f=7.0 mean min max / rtspace=18;
keylabel
n = 'Number'
nmiss = 'Missing'
mean = 'Mean'
min = 'Lowest'
max = 'Highest';
run;

PROC UNIVARIATE: in addition to the basic statistics provided by MEANS and SUMMARY, UNIVARIATE provides location statistics (such as the median and the mode) and the skewness and kurtosis coefficients. It can also tabulate frequencies with the FREQ option, produce a stem-and-leaf plot and a normal probability plot with the PLOT option, and test normality with the NORMAL option (reporting the W statistic).
proc univariate data=pmlr.Develop plot;
var DDABal DepAmt;
run;

Besides the procedures above, many others offer similar functionality; consult the documentation.

3.2 The N largest (or smallest) observations
The methods above give the maximum and the minimum; to list the N largest or N smallest observations, use the following option:
PROC UNIVARIATE: the NEXTROBS= or NEXTRVALS= option lists the N largest and N smallest observations
proc univariate data=pmlr.Develop nextrobs=10; ** nextrvals=10;
var DDABal;
run;

3.3 Percentile-based subsets
3.3.1 Using PROC UNIVARIATE to list the top and bottom N% of the data:
proc univariate data=pmlr.Develop noprint;
var DDABal;
output out=tmp pctlpts=10 90 pctlpre = L_; **get the 10th and 90th percentiles;
run;
data hilo;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set tmp; **merge the percentile values with the original data to build data set hilo;
if DDABal le L_10 and not missing(DDABal) then do;
Range = 'Low '; **flag values at or below the 10th percentile and output;
output;
end;
else if DDABal ge L_90 then do;
Range = 'High'; **flag values at or above the 90th percentile and output;
output;
end;
run;
Parameterizing the code above yields a macro that lists the top and bottom N% of the data:
%macro hilowper(Dsn=,
Var=,
Percent=,
Idvar= );

%let Up_per = %eval(100 - &Percent);

proc univariate data=&Dsn noprint;
var &Var;
id &Idvar;
output out=tmp pctlpts=&Percent &Up_per pctlpre = L_;
run;

data hilo;
set &Dsn(keep=&Idvar &Var);
if _n_ = 1 then set tmp;
if &Var le L_&percent and not missing(&Var) then do;
range = 'Low';
output;
end;
else if &Var ge L_&Up_per then do;
range = 'High';
output;
end;
run;

proc sort data=hilo;
by &Var;
run;

title "Low and High Values for Variables";
proc print data=hilo;
id &Idvar;
var Range &Var;
run;

proc datasets library=work nolist;
delete tmp hilo;
run;
quit;
%mend hilowper;

3.3.2 Using PROC RANK to get the top and bottom N% of the data:
PROC RANK is covered in detail at http://blog.sina.com.cn/s/blog_5d6632e70100ddqe.html. The macro below implements the same function as the example above:
%macro top_bottom_nPercent
(Dsn=,
Var=,
Percent=,
Idvar=);
%let Bottom = %eval(&Percent - 1);
%let Top = %eval(100 - &Percent);

proc format;
value rnk 0 - &Bottom = 'Low'
&Top - 99 = 'High';
run;

proc rank data=&Dsn(keep=&Var &Idvar)
out=new(where=(&Var is not missing))
groups=100;
var &Var;
ranks Range;
run;

proc sort data=new(where=(Range le &Bottom or
Range ge &Top));
by &Var;
run;

***Produce the report;
proc print data=new;
title "Upper and Lower &Percent.% Values for %upcase(&Var)";
id &Idvar;
var Range &Var;
format Range rnk.;
run;

proc datasets library=work nolist;
delete new;
run;
quit;
%mend top_bottom_nPercent;

3.4 The N largest or smallest observations
Approach:
Smallest N: sort, then use obs=N
Largest N: sort, get the total number of observations Num, then use firstobs=(Num - N + 1)
proc sort data=pmlr.Develop(keep=DDABal DepAmt) out=tmp;
by DepAmt;
run;
data _null_;
set tmp nobs=Num_obs;
call symputx('Num',Num_obs);
stop;
run;
%let High = %eval(&Num - 9);
title "Ten Highest and Ten Lowest Values for DepAmt";
data _null_;
set tmp(obs=10)
tmp(firstobs=&High) ;
file print;
if _n_ le 10 then do;
if _n_ = 1 then put / "Ten Lowest Values" ;
put "DDABal = " DDABal @15 "Value = " DepAmt;
end;
else if _n_ ge 11 then do;
if _n_ = 11 then put / "Ten Highest Values" ;
put "DDABal = " DDABal @15 "Value = " DepAmt;
end;
run;
Implemented as a macro:
%macro highlow(Dsn=,
Var=,
Idvar=,
n= );
proc sort data=&Dsn(keep=&Idvar &Var
where=(&Var is not missing)) out=tmp;
by &Var;
run;
data _null_;
set tmp nobs=Num_obs;
call symput('Num',Num_obs);
stop;
run;

%let High = %eval(&Num - &n + 1);
title "&n Highest and Lowest Values for &Var";
data _null_;
set tmp(obs=&n)
tmp(firstobs=&High) ;
file print;
if _n_ le &n then do;
if _n_ = 1 then put / "&n Lowest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
else if _n_ ge %eval(&n + 1) then do;
if _n_ = %eval(&n + 1) then put / "&n Highest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
run;
proc datasets library=work nolist;
delete tmp;
run;
quit;
%mend highlow;

3.5 Outliers, invalid values, duplicates, and missing values
For invalid values, most of the methods used for character variables also work for numeric variables, such as IF-THEN and FORMAT above, so they are not repeated here.
For missing values and duplicates, see the earlier posts:
Missing values: SAS missing data explained http://blog.sina.com.cn/s/blog_5d3b177c0100e6lm.html
Duplicates: methods for removing duplicate values from SAS data sets http://blog.sina.com.cn/s/blog_5d3b177c0100bblp.html
Below are some methods for handling outliers.
3.5.1 Using the mean and standard deviation
proc means data=pmlr.Develop noprint;
var DDABal;
output out=means(drop=_type_ _freq_)
mean=M_DDABal
std=S_DDABal;
run;
data _null_;
file print;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set means;
if DDABal lt M_DDABal - 2*S_DDABal and not missing(DDABal) or
DDABal gt M_DDABal + 2*S_DDABal then put DDABal=;
run;

3.5.2 Trimming by percentiles, then using the mean and standard deviation
proc rank data=pmlr.Develop(keep=DepAmt) out=tmp groups=5;
var DepAmt;
ranks R_DepAmt;
run;
proc means data=tmp noprint;
where R_DepAmt not in (0,4); ***use only the middle 60% of the data to compute the mean and standard deviation;
var DepAmt;
output out=means(drop=_type_ _freq_)
mean=M_DepAmt
std=S_DepAmt;
run;
%let N_sd = 2;
%let Mult = 2.12; title "Outliers Based on Trimmed Statistics";
data _null_;
file print;
set pmlr.Develop;
if _n_ = 1 then set means;
if DepAmt lt M_DepAmt - &N_sd*S_DepAmt*&Mult and not missing(DepAmt) or
DepAmt gt M_DepAmt + &N_sd*S_DepAmt*&Mult then put DepAmt=;
run;

3.5.3 Percentile-based outlier detection
The principle of boxplot (quartile) outlier detection: draw two cutoff lines, parallel to the median line, at Q3+1.5*IQR and Q1-1.5*IQR (IQR is the interquartile range); these are called the inner fences. Draw two more lines at Q3+3*IQR and Q1-3*IQR, called the outer fences. Points beyond the inner fences are outliers: those between the inner and outer fences are mild outliers, and those beyond the outer fences are extreme outliers. Reference: http://baike.baidu.com/view/1326550.htm?func=retitle
%macro interquartile
(dsn=,
var=,
Idvar=,
n_iqr=2 );

title "Outliers Based on &N_iqr Interquartile Ranges";

proc means data=&dsn noprint;
var &var;
output out=tmp
q1=Lower
q3=Upper
qrange=Iqr;
run;
data _null_;
set &dsn(keep=&Idvar &Var);
file print;
if _n_ = 1 then set tmp;
if &Var le Lower - &N_iqr*Iqr and not missing(&Var) or
&Var ge Upper + &N_iqr*Iqr then
put &Idvar= &Var=;
run;

proc datasets library=work;
delete tmp;
run;
quit;
%mend interquartile;

4 Date data
Date values can be handled with the methods already shown for character or numeric data, so they are not covered separately here.

5 Working with multiple data sets
There are far too many multi-data-set situations in SAS to list them all, so this post gives just one example as a starting point:
merging two data sets and using the IN= option (creates a variable that indicates whether the data set contributed data to the current observation).
First, create two data sets:
data one;
input Patno x y;
datalines;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z;
datalines;
1 56
3 67
4 88
5 98
13 99
;
run;
Note: always sort both data sets before merging:
proc sort data=one;
by Patno;
run;
proc sort data=two;
by Patno;
run;
Then report which IDs fail to match between the two data sets:
data _null_;
file print;
merge one(in=Inone)
two(in=Intwo) end=Last;
by Patno;
if not Inone then do;
put "ID " Patno "is not in data set one";
n + 1;
end;
else if not Intwo then do;
put "ID " Patno "is not in data set two";
n + 1;
end;
if Last and n eq 0 then
put "All ID's match in both files";
run;
IN= is a very handy option; used flexibly, it can produce many useful results. For example, to get the effect of SQL's RIGHT JOIN or LEFT JOIN:
data missing;
merge one
two(in=Intwo); **right join;
by Patno;
if Intwo;
run;
or the effect of SQL's NOT IN:
data missing;
merge one
two(in=Intwo);
by Patno;
if not Intwo; **not in;
run;

6 Comparing data sets
data one;
input Patno z y;
datalines;
1 56 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z y;
datalines;
1 56 79
2 56 .
3 11 99
5 98 87
12 13 14
;
run;
proc compare base=one compare=two brief;
id Patno;
run;

Most of the operations above can also be done with PROC SQL; see the earlier posts:
The Complete Guide to SQL in SAS, Part 1: Introduction to SQL and Basic Query Features
http://blog.sina.com.cn/s/blog_5d3b177c0100cksl.html
The Complete Guide to SQL in SAS, Part 2: Combining Data and Creating Tables and Views
http://blog.sina.com.cn/s/blog_5d3b177c0100cm1t.html
The Complete Guide to SQL in SAS, Part 3: Other Features of PROC SQL
http://blog.sina.com.cn/s/blog_5d3b177c0100cn8v.html

References:
Cody's Data Cleaning Techniques Using SAS

Ten ways to build a wrong scoring model

Some ways to build a wrong scoring model are listed below. The author doesn't take any guarantee if your modeling team is using one of these and still getting a correct model.

1) Overfit the model to the sample. Overfitting can be checked by drawing a fresh random sample, applying the scoring equation, and comparing predicted conversion rates with actual conversion rates. An overfit model fails to rank-order: deciles with lower average probability may show as many or more conversions than deciles with higher probability scores.

2) Choose non-random samples for building and validating the scoring equation. See overfitting above.

3) Use multicollinearity (http://en.wikipedia.org/wiki/Multicollinearity) without business judgment to remove variables that may make business sense. Usually happens a few years after you studied, and then forgot, multicollinearity.

If you don't know the difference between multicollinearity and heteroskedasticity (http://en.wikipedia.org/wiki/Heteroskedasticity), this could be the real deal breaker for you.

4) Use legacy code for scoring, usually with stepwise forward and backward regression. Happens usually on Fridays and when in a hurry to make models.

5) Ignore the signs or magnitudes of parameter estimates (that is, the weight of each variable in the equation).

6) Not knowing the difference between Type 1 and Type 2 error, especially when rejecting variables based on p-values. (If you don't know what a p-value is, you may kindly stop reading and click the YouTube video in the right margin.)

7) Excessive zeal in removing variables. Why ? Ask yourself this question every time you are removing a variable.

8) Use the wrong causal event (like mailings for loans) to predict the future with the scoring model (for mailings of deposit accounts), or use the right causal event in the wrong environment (rapid decline or rise of sales due to factors not present in the model, like a competitor entering or going out of business, oil prices, credit shocks, sob sob sigh).

9) Over fitting

10) Learning about creating models from blogs and not reading and refreshing your old statistics textbooks


Reference:http://blog.sina.com.cn/s/blog_5d3b177c0100h55m.html
 