NiuNiu's Warehouse: May 2011

Tuesday, May 31, 2011

Children's Day...

Tomorrow is International Children's Day. After coming to Canada, I found that China had so many holidays. Once I asked some fellow students and they said they did not celebrate holidays such as Children's Day. And Pro.K said when he was a kid, he celebrated Children's Day on April 4. It seems that the June 1st International Children's Day is not that popular. Anyway, I like all kinds of holidays.

When I was a child, I always dreamed to have an age of two digits. After I got a two-digit age, I dreamed to be 12, a round of years in Chinese culture. After being 12, I dreamed to go to university and live independently. I think I stopped hoping to grow up in my university years. Time always goes so slowly when you hope to grow up. However, it goes so fast when you want to keep as young as possible. How ironic it is!

Today was a pretty bad day, since I begun to write my term paper. Although I say I begun to write it, I still have not written a single word until now. To be honest, it should not be that hard. I replicate the paper of Pro.C, my supervisor, and use the same data and results of his project. And my nice supervisor told me how to write it in great details. However, I still hate it very much to tell the story. I really want to dig a hole, hide myself in it and never come out.NEVER COME OUT!!!

I believe that pursuing a PhD degree is just like having a battle with disappointment. Recently I constantly feel myself like a total failure. I never have such strong feelings before. My professional knowledge is super poor, my computer skill is poor, I don't know how to find a research topic, I don't know how to tell a story and so on. Moreover, I don't know how to do with my personal life. Everything that is so normal in others' eyes becomes so complicated in my eyes. To get married or not to get married, to work in Canada or go back home and so on and so on. As soon as I think about all of these issues, or even one of these issues, I just get terrible headache. Confucius says that people should settle down at the age of 30. Can I find the answers to all the questions above within 3 years? I hope so. Sigh...

Tomorrow is another day. I should give myself some courage to keep going. I should be OK...

Anyway, tomorrow will be Children's Day. It is a good excuse to be happy and laugh:)

Sunday, May 29, 2011

Answer...

Recently I have a lot of sleep. And it is more than waking up naturally. Anyway, I think I am one of the most sleepy persons I've ever known. No class and supervisor is pretty pretty nice. Sandy jiejie said that I had too much sleep and it was bad for health. I think perhaps it is true...

Usually I will lie on bed for some time, think about things making me confused and daydream. When I was a kid, I also liked sleeping late. I have to say my parents are not tough ones and they spoil me a little. When I woke up, usually I would call for my mom to get to know whether there was anybody home. "Mom...Mom..." If mom was at home, she would show up and answer me. Later, I chose to study at a faraway university, because I thought the university's landscape was beautiful and it would be great to live independently. The first part proved to be right while the latter part proved to be not that right. I missed my parents. When I took vacations at home, I still called for my mom when I woke up. As before, my mom would show up and answer me if she was at home. However, my feeling was quite different. The purpose of my call was not to check if anybody home. Instead I expected the joy of getting my parents' answer. Pure joy of getting an answer. The feeling was great. Happiness is simple...

Later I had a boyfriend. As I woke up, I also called his name. Before that call, I am sure I was waken for several times, because he believes that breakfast is important...Usually he would answer me and come in to drag the lazy person up. The answer was also sweet. Nevertheless, it was also very precious, since we lived in different cities...

Now I live in a different country by myself. Nobody will answer me when I wake up. And of course, I will not call at all now...This morning I missed the answer so much that I felt pretty bad. Maybe it is the pain of growing up and being an adult...

Gregory Colbert

The other day, I saw a picture of an elephant and a little monk. The picture touched my heart.However, I forgot to save the picture. Today I happened to see a similar style picture online and get to know the photographer's name. Interestingly, he is a Canadian. His name is Gregory Colbert.

The picture touched me very much is this one.

I believe many people may find this picture touching. But which part of the picture touches me? I cannot tell. The little monk looks a bit weak and pathetic. And elephant is one of my favorite animals. But still there is something else. The harmony between people and animal? The intelligence of animal? The calm from religion? Hard to say...I also find some pictures of the same series. I have not read a lot about the photographer and his experience, since his website is a bit hard to follow. And it does not give a description of every picture. But it seems these photos were taken in Myanmar.

These photos give people a tranquil feeling. Maybe it is due to religion. Having been in Canada for almost a year, I have learned more about religion. I can still remember the experience of interrupting my fellow Iranian student's pray by accident. I was so scared, because when I opened the door, he looked so serious. I run to another Iranian fellow, told him that I made a big mistake and asked for help. Luckily, the thing was not as serious as I thought. The pray fellow told me that he was praying so he was not able to answer the door when I knocked the door and he was not able to talk to me when I entered. His smile made me feel relaxed...

Later I talked about religion with some of my fellows. When I told them I had no religion, they were quite surprised. And it made me surprised. I always believe that my belief of life is to be a good-hearted person. But what is a good-hearted person? Can religion really bring tranquil to life? Can it make people free from confusion? I think I cannot give the answer for my life. But I think it is really lucky for those people who have a religion...

Saturday, May 28, 2011

Character variable combination function

The CAT* functions and CALL routines concatenate character strings.
These functions and CALL routines differ in how they treat leading
and trailing blanks in character arguments, and in whether separator
strings are inserted between the concatenated strings:

 * CAT does not remove leading or trailing blanks, and does not
   insert separators.

 * CATT trims trailing blanks, but not leading blanks, and does not
   insert separators.

 * CATS strips both leading and trailing blanks, and does not
   insert separators.

 * CATX strips both leading and trailing blanks, and inserts
   separators. The first argument to CATX specifies the separator.

All of the CAT* functions and CALL routines strip both leading and
trailing blanks from numeric arguments after formatting the numeric
value with the BEST. format.

CAT can be used only as a function. CATT, CATS, and CATX can be
used as either functions or CALL routines. When used as functions,
CAT, CATT, CATS, and CATX return strings with a length of up to:

 * 200 characters in WHERE clauses and/or PROC SQL.

 * 32767 characters in other parts of the DATA step.

 * 65534 characters when called from the macro processor.

When CATT and CATS are used as CALL routines, the result is returned
in the first argument. When CATX is used as a CALL routine, the result
is returned in the second argument (since the first argument is the
separator). The argument that contains the result must be a variable;
the following arguments are appended to the value of this variable.
If this variable is not long enough to contain the entire result, then:

 * a warning is printed that the result was truncated,

 * a note is printed showing the location of the function call
   and telling which argument caused the truncation

 * in the DATA step, _ERROR_ is set to one.

When CATT, CATS, or CATX are used as CALL routines, the result is
never assigned a blank value due to truncation. Note that this behavior
differs from that of the CAT, CATS, CATT, and CATX functions, as
described below.

If CAT, CATS, CATT, or CATX are used as functions, the result may
either be assigned directly to a variable, or returned in a temporary
buffer that the user normally is unaware of. There is no simple way
to determine which of these alternatives will occur except to try it
and see. If a temporary buffer is used, then the length of the buffer
depends on the calling environment, and the value in the buffer may
subsequently be truncated after the CAT, CATS, CATT, or CATX function
has finished, in which case there will be no message about truncation.
However, if the variable or buffer to which the result is assigned is
not long enough to contain the concatenation of all the arguments, then:

 * the result is changed to a blank value in the DATA step, SQL, and
   possibly other calling environments

 * a warning is printed that the result was either truncated
   or set to a blank value, depending on the calling environment

 * a note is printed showing the location of the function call
   and telling which argument caused the truncation

 * in the DATA step, _ERROR_ is set to one.

If truncation occurs after the function has finished, the result
is not set to blank, no messages are printed, and _ERROR_ is not
set to one.

The results of the CAT* functions are typically equivalent to those
produced by certain combinations of the concatenation operator (||)
and the TRIM and LEFT functions. However, the CAT* functions are
faster, more convenient, and can be used with variable lists in
calling environments that support variable lists. The following
table shows equivalents of the CAT* functions using || and other
functions, assuming that X1, X2, X3, and X4 are character variables:

Function          Equivalent || code
------------------------------------

CAT(OF X1-X4)     X1||X2||X3||X4

CATT(OF X1-X4)    TRIM(X1)||TRIM(X2)||TRIM(X3)||TRIM(X4)

CATS(OF X1-X4)    TRIM(LEFT(X1))||TRIM(LEFT(X2))||TRIM(LEFT(X3))||
                  TRIM(LEFT(X4))

CATX(SP,OF X1-X4) TRIM(LEFT(X1))||SP||TRIM(LEFT(X2))||SP||
                  TRIM(LEFT(X3))||SP||TRIM(LEFT(X4))
                  (where SP is a separator such as a blank or comma)

The following table shows statements that are equivalent to the
use of CATT, CATS, and CATX as CALL routines:

CALL Routine            Equivalent statement
--------------------------------------------

CALL CATT(OF X1-X4);    X1=TRIM(X1)||TRIM(X2)||TRIM(X3)||TRIM(X4);

CALL CATS(OF X1-X4);    X1=TRIM(LEFT(X1))||TRIM(LEFT(X2))||
                        TRIM(LEFT(X3))||TRIM(LEFT(X4));

CALL CATX(SP,OF X1-X4); X1=TRIM(LEFT(X1))||SP||TRIM(LEFT(X2))||SP||
                        TRIM(LEFT(X3))||SP||TRIM(LEFT(X4)); (where SP
                        is a separator such as a blank or comma)

CATX differs slightly from the || code shown above, in the CATX omits
the corresponding separator if any of the arguments to be concatenated
is completely blank. For example, CATX("+","X"," ","Z"," ") produces
"X+Z".


Examples:

In this DATA step, the CALL CATS routine prints messages because the
variable A has a length of only 4, which is not long enough to hold
the concatentation of two strings with a nonblank length of 3:

   data _null_;
      length a b $4;
      a = 'abc';
      b = 'xyz';
      call cats(a,b);
      put a= b=;
   run;

In this DATA step, truncation is done silently after the CATS function
finishes:

   data _null_;
      length a b $4;
      a = 'abc';
      b = 'xyz';
      a = cats(a,b);
      put a= b=;
   run;

In this DATA step, truncation occurs inside the CATS function,
appropriate messages are issued, and the result is set to a blank
value:

   data _null_;
      length a b c $4;
      a = 'abc';
      b = 'xyz';
      c = cats(a,b);
      put a= b= c=;
   run;

In this DATA step, truncation is done silently:

   data _null_;
      length a b c $4;
      a = 'abc';
      b = 'xyz';
      c = trim(cats(a,b));
      put a= b= c=;
   run;

Reference: ftp://ftp.sas.com/pub/neural/functions/cat.txt

To my warehouse...

I know I am always out of date. And I know blog is something out of date, since everybody now seems to play with twitter. But still I have made a lot of troubles to create you.

I spent more than 13 hours to create you. At one time, I even decided to give up, because the codes of creating a table of content always went against me. However, after trying MSN space and Facebook, I still preferred you due to your clean and tidy layout. And I stood the pain of creating the table by hand. Unbelievable. I always get great fun of trying Google's products. However, from time to time, I do find that some shortcomings of the products make me feel unbearable. Maybe it is due to culture difference?

The main purpose to create you is just as your name indicates, I need a warehouse. I need a warehouse to store all the interesting things and useful links I find online. And I just wonder whether you can bring some fun to me. Hope you can:)

SAS中的SQL语句完全教程之一：SQL简介与基本查询功能

本系列全部内容主要以《SQL Processing with the SAS System (Course Notes)》为主进行讲解，本书是在网上下载下来的，但忘了是在哪个网上下的，故不能提供下载链接了，需要的话可以发邮件向我索取，我定期邮给大家，最后声明一下所有资料仅用于学习，不得用于商业目的，否则后果自负。

转载请注明出处：http://blog.sina.com.cn/s/blog_5d3b177c0100cksl.html

1 SQL过程步介绍
1.1 SQL过程步可以实现下列功能：
查询SAS数据集、从SAS数据集中生成报表、以不同方式实现数据集合并、创建或删除SAS数据集、视图、索引等、更新已存在的数据集、使得SAS系统可以使用SQL语句、可以和SAS的数据步进行替换使用。注意，SQL过程步并不是用来代替SAS数据步，也不是一个客户化的报表工具，而是数据处理用到的查询工具。

1.2 SQL过程步的特征
SQL过程步并不需要对每一个查询进行重复、每条语句都是单独处理、不需要print过程步就能打印出查询结果、也不用sort过程步进行排序、不需要run、要quit来结束SQL过程步

1.3 SQL过程步语句
SELECT：查询数据表中的数据
ALTER：增加、删除或修改数据表的列
CREATE：创建一个数据表
DELETE：删除数据表中的列
DESCRIBE：列出数据表的属性
DROP：删除数据表、视图或索引
INSERT：对数据表插入数据
RESET：没用过，不知道什么意思
SELECT：选择列进行打印
UPDATE：对已存在的数据集的列的值进行修改

2 SQL基本查询功能

2.1 SELECT语句基本语法介绍
SELECT object-item <, ...object-item>
FROM from-list

>

>;
这里SELECT：指定被选择的列
FROM：指定被查询的表名
WHERE：子数据集的条件
GROUP BY：将数据集通过group进行分类
HAVING：根据GROUP BY的变量得到数据子集
ORDER BY：对数据集进行排序

2.2 SELECT语句的特征
选择满足条件的数据、数据分组、对数据进行排序、对数据指定格式、一次最多查询32个表。这里还要提到的就是，在SAS系统中，对于表名和变量名一般不超过32个字符，对于库名，文件引用名，格式等不能超过8个字符

2.3 Validate关键字
Validate关键字只存在于select语句中、可以在不运行查询的情况下测试语句的语法、检查列名是否合法、对于不正确的查询将打印其消息。例：
1 proc sql;
2 validate
3 select Region, Product,Sales
4 from sashelp.shoes
5 where Region = 'Africa';
NOTE: PROC SQL 语句有有效语法。
6 quit;
此外，我们还可以用noexec选项也可以用来进行语法测试。例：
7 proc sql noexec;
8 select Region, Product,Sales
9 from sashelp.shoes
10 where Region = 'Africa';
NOTE: 由于 NOEXEC 选项，未执行语句。
11 quit;
这里提示未执行，未提示错误，说明该语句没有语法错误。但是如果加入一个sashelp.shoes表里没有字段，这里就会出现错误，例：
12 proc sql noexec;
13 select Region, Product,Sales,test
14 from sashelp.shoes
15 where Region = 'Africa';
ERROR: 以下这些列在起作用的表中没有找到: test.
16 quit;

2.4 查询列
我们可以像2.3那样查询指定列，也可以用*来查询所有列。例：
proc sql;
select *
from sashelp.shoes;
quit;
这里我们可以用feedback选项来查看到底我们选择了哪些列：
17 proc sql feedback;
18 select *
19 from sashelp.shoes;
NOTE: Statement transforms to:
select SHOES.Region, SHOES.Product, SHOES.Subsidiary, SHOES.Stores, SHOES.Sales, SHOES.Inventory, SHOES.Returns
from SASHELP.SHOES;
20 quit;
这时，我们可以看到从sashelp.shoes表中选择了8个列

2.5 消除重复值
我们可以用distinct选项来消除重复值。例如，我们要得到没有重复的所有地区的名称：
proc sql;
select distinct Region
from sashelp.shoes
quit;

2.6 where子集查询
2.6.1 比较运算符
先列出where语句用到的比较运算符：
LT < 小于 GT > 大于
EQ = 等于
LE <= 小于或等于 GE >= 大于或等于
NE ^= 不等于
例如，我们要查询sales大于100000的所有数据：
proc sql;
select *
from sashelp.shoes
where sales>100000;
quit;

2.6.2 in：只要满足in里的任意一个值，表达式即为真，例如，我们要选择Region在Africa和Eastern Europe的所有数据：
proc sql;
select *
from sashelp.shoes
where Region in ('Africa','Eastern Europe');
quit;

2.6.3 逻辑运算符
OR | 或
AND & 是
NOT ^ 非
例如，选择Region在Africa和Eastern Europe，且销售额大于100000的所有数据：
proc sql;
select *
from sashelp.shoes
where Region in ('Africa','Eastern Europe') and sales>100000;
quit;

2.6.4 CONTAINS或?：判断某列是否包含指定字符串
例如，选择列Region包含’Afr’的数据：
proc sql;
select *
from sashelp.shoes
where Region ? 'Afr';
quit;

2.6.5 IS NULL或IS MISSING：判断某列数据是否为空
例如，如果找出Region为空的数据：
proc sql;
select *
from sashelp.shoes
where Region is missing;
quit;
注意，这里我们还可以用以下表达式对where语句进行替换。如果region为数值型变量，则可以用region=.，如果region为字符型变量，则可以用region= ‘’进行替换。

2.6.6 Between and：选择某一区间的数据
例如选择sales大于100000，但小于200000的所有数据：
proc sql;
select *
from sashelp.shoes
where sales between 100000 and 200000;
quit;

2.6.7 like：判断是否能匹配某些字符
例如，选择以region以A开头的所有地区
proc sql;
select *
from sashelp.shoes
where Region like 'A%';
quit;
这里注意有两类通配符，‘%’可以通配任意个任意字符，‘_’只能通配一个任意字符

2.6.8 =*：类似匹配
这里由于sashelp.shoes里没有符合要求的数据，所有就用书上的例子说明一下吧：
Where lastname=* ‘smith’，出来的结果可能是：smith，smythe等

2.7 表达式
我们可以通过已有的列进行计算来得到新的列，这时用关键词as来给新的列赋列名，例如：
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes
quit;
这时结果就会多一列salesperstores，用来得到该地区该产品每个商店的平均销售量。这里要注意的是，在创建表达式时，我们还可以在SQL里用到SAS中的除LAG和DIFF之外的所有函数。

这里我们还可以用表达式计算出来的结果来进行子集查询，但一定要记住用calculated关键词。例如我们要找出商店平均销售量大于5000的数据：
方法一：
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes
where Sales/Stores>5000;
quit;
方法二：
proc sql;
select Region, Product,Sales,Stores,Sales/Stores as salesperstores
from sashelp.shoes
where calculated salesperstores>5000;
quit;

2.8 查询结果展示
2.8.1 order by数据排序
默认的排序方式是升序，我们可以用DESC关键词来进行降序排列。例如以sales降序排列数据：
proc sql;
select *
from sashelp.shoes
order by Sales DESC;
quit;
这里提示一下，我们可以用任意多列进行排序，包括表达式结果（不用calculated），但最好是选择的列。

2.8.2 LABEL与FORMAT
LABEL：改变输出变量名的内容
FORMAT：改变列的值的输出方式
例如，改变salesperstores的label和format
proc sql;
select Region, Product,Sales,Stores,
Sales/Stores as salesperstores
label='sales per stores'
format=dollar12.2
from sashelp.shoes;
quit;

2.9 处理SQL常用函数
MEAN或AVG：均值
COUNT或N或FREQ：非缺失值个数
MAX：最大值
MIN：最小值
NMISS：缺失值个数
STD：标准差
SUM：求和
VAR：方差

2.9.1 求和sum
proc sql;
select Region, Product,Sales,Stores,
sum(Sales,Inventory,Returns) as total
from sashelp.shoes;
quit;

2.9.2 求均值avg
proc sql;
select Region, Product,Sales,Stores,
avg(Sales) as salesavg
from sashelp.shoes;
quit;

2.9.3 分组求均值group by
proc sql;
select Region,
avg(Sales) as salesavg
from sashelp.shoes
group by Region;
quit;

2.9.4 计数count
proc sql;
select Region,count(*) as count
from sashelp.shoes
group by Region;
quit;

2.9.5 HAVING数据子集
proc sql;
select Region,count(*) as count
from sashelp.shoes
group by Region
having count(*)>50;
quit;
其它的就不多作介绍了，多用用就熟悉了

2.10子查询
2.10.1 找出regions平均sales大于全部平均sales的region
proc sql;
select Region,
avg(Sales) as salesavg
from sashelp.shoes
group by Region
having avg(Sales)>
(select avg(Sales) from sashelp.shoes);
quit;

2.10.2 ANY关键词介绍
>ANY(20,30,40) 最终效果：>20
ALL (20,30,40) 最终效果：>40
例如，选择出region为united state的sales小于所有region为africa的sales的数据
proc sql;
select Region,Sales
from sashelp.shoes
where Region='United States'
and Sales (select Sales from sashelp.shoes where Region='Africa');
quit;

2.10.4 EXISTS与NOT EXISTS
proc sql;
select *
from sashelp.shoes
where exists
(select * from sashelp.orsales);
quit;

数据清理data Cleaning技术大全及SAS实现

转载请注明出处：http://blog.sina.com.cn/s/blog_5d3b177c0100esmx.html

1 简介
数据清理是数据准备一个很重要的环节，什么是数据清理呢？数据清理
Is for techies 技术人员的事
Is just coding 只是写代码
Is boring 很无聊
Consumes up to 80 % of the project要花掉项目80%的时间
Was not in the focus of data mining literature so far在数据挖掘中数据清理相关的文章不是很多
Is something that SAS can excellently do SAS可以很好地搞定
Is vital to the quality of the project 是项目质量的一个重要步骤

首先说明一下，由于没搞到本书的数据，所以就用其它的书《Predictive Modeling Using Logistic Regressio》的数据进行程序调试。

2 字符型数据清理
2.1 观察数据集
2.1.1 首先可以观察一下数据集中，所有字符型变量的数据情况：
proc freq data=pmlr.Develop(drop=branch);
tables _character_ / nocum nopercent;
run;
关键词_character_表示所有的字会型变量，其它的_numeric_和_all_已经讲解过。Drop和keep选项可以剔除或保留选择的变量。

2.1.2 观测了变量的值的情况后，我们就可能会发现一些错误值，这时可以将这些异常值输出出来（输出到日志中，或输出到数据集中，或外部文件）：
data _null_;
set pmlr.Develop;
file print; ***send output to the output window;
***check Res;
if Res not in ('R' 'S' 'U') then put Res= ;
***check Branch;
if verify(trim(Branch),'B0123456789') and not missing(Branch) and substr(Branch,1,1) ne trim('B')
then put Branch= ;
run;
这里，如果RES变量的值不是R、S、U，则为错误的值。

2.1.3 通过format来查看错误值个数
这里我们假设Res中的U是错误的值，我们可以用以下操作来实现错误值个数的观测
proc format;
value $Res 'R','S' = 'Valid'
other = 'Miscoded';
run;
proc freq data=pmlr.Develop;
format Res $Res.;
tables Res/ nocum nopercent missing;
run;

2.2 几个重要的函数
Verify： SAS的verify函数在数据处理和data clean的过程中十分有用，verify函数的第一个参数是源字符串，后续参数都是待查找字符，如果源字符串中包含的都是待查找字符，verify就返回0，否则，返回不包含字符在源字符串中的位置。由此可见，我们可以利用verify函数对字符串作非正常字符的检测，从而达到clean data的目的。
verify的语法:verify(source, chkstring1[, chkstring2, … , chkstringn]) （参考文献：http://blog.chinaunix.net/u/6776/showart_1073354.html）
TRIM(s) 返回去掉字符串s的尾随空格的结果。
MISSING：可用于字符型和数字型变量，当变量为空时，返回1，当变量不为空时，返回0。这个前面的博文已有讲解。
NOTDIGIT(character_value) 在字符串中搜索非数字字符，若找到则返回其第一次出现的位置，在这里，notdigit(character_value)和verify(character_value,'0123456789')实现的功能相同。

3数值型数据清理
3.1 极值异常值
3.1.1 三个重要的过程步：
means过程步：得到缺失值和非缺失值个数、最大值、最小值等统计值。
proc means data=pmlr.Develop n nmiss min max maxdec=3;
var DDABal DepAmt;
run;

tabulate过程步：得到缺失值和非缺失值个数、均值、最大值、最小值等统计值
proc tabulate data=pmlr.Develop format=7.3;
var DDABal DepAmt;
tables DDABal DepAmt,
n*f=7.0 nmiss*f=7.0 mean min max / rtspace=18;
keylabel
n = 'Number'
nmiss = 'Missing'
mean = 'Mean'
min = 'Lowest'
max = 'Highest';
run;

univariate过程步：UNIVARIATE过程除了可以提供MEANS和SUMMARY所提供了基本统计数外，还提供位置特征数（如Med中位数，Mode众数）和偏度系数（Skewness）、峰度系数(Kurtosis)这些变异数。此外它还可通过FREQ选项统计变量次数及频率，通过PLOT选项给出茎叶图(Stem Leaf)和正态概率密度图(Normal Probability Plot)，通过NORMAL选项进行变数正态性检验（给出W：Normal值）。
proc univariate data=pmlr.Develop plot;
var DDABal DepAmt;
run;

除了上述过程步外，还有很多过程步也可以完成相似的功能，大家可以去查阅相关文献。

3.2 最大（小）N条数据
前面的方法可以得到最大最小值，但是当我们要取最大的N条数据，或最小的N条数据时，可以用下面的选项：
univariate过程步：用nextrobs或nextrvals选项，可以得到最大的和最小的N条数据
proc univariate data=pmlr.Develop nextrobs=10; ** nextrvals=10;
var DDABal;
run;

3.3 得到分位数数据
3.3.1 用univariate过程步得到前N%的数据：
proc univariate data=pmlr.Develop noprint;
var DDABal;
output out=tmp pctlpts=10 90 pctlpre = L_; **得到10%和90%分位数;
run;
data hilo;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set tmp; **将得到10%和90%分位数数据集与原数据集合并,生成新数据集hilo;
if DDABal le L_10 and not missing(DDABal) then do;
Range = 'Low '; **如果小于10%分位数的值,则标记为low并输出;
output;
end;
else if DDABal ge L_90 then do;
Range = 'High'; **如果大于90%分位数的值,则标记为high并输出;
output;
end;
run;
将上面的代码参数化一下，就可以得到一个宏，功能为输出前N%的数据
%macro hilowper(Dsn=,
Var=,
Percent=,
Idvar= );

%let Up_per = %(100 - &Percent);

proc univariate data=&Dsn noprint;
var &Var;
id &Idvar;
output out=tmp pctlpts=&Percent &Up_per pctlpre = L_;
run;

data hilo;
set &Dsn(keep=&Idvar &Var);
if _n_ = 1 then set tmp;
if &Var le L_&percent and not missing(&Var) then do;
range = 'Low';
output;
end;
else if &Var ge L_&Up_per then do;
range = 'High';
output;
end;
run;

proc sort data=hilo;
by &Var;
run;

title "Low and High Values for Variables";
proc print data=hilo;
id &Idvar;
var Range &Var;
run;

proc datasets library=work nolist;
delete tmp hilo;
run;
quit;
%mend hilowper;

3.3.2 用rank过程步得到前N%的数据：
Rank过程步详见（http://blog.sina.com.cn/s/blog_5d6632e70100ddqe.html），下面介绍一下宏，实现上例的功能：
%macro top_bottom_nPercent
(Dsn=,
Var=,
Percent=,
Idvar=);
%let Bottom = %(&Percent - 1);
%let Top = %(100 - &Percent);

proc format;
value rnk 0 - &Bottom = 'Low'
&Top - 99 = 'High';
run;

proc rank data=&Dsn(keep=&Var &Idvar)
out=new(where=(&Var is not missing))
groups=100;
var &Var;
ranks Range;
run;

proc sort data=new(where=(Range le &Bottom or
Range ge &Top));
by &Var;
run;

***Produce the report;
proc print data=new;
title "Upper and Lower &Percent.% Values for %upcase(&Var)";
id &Idvar;
var Range &Var;
format Range rnk.;
run;

proc datasets library=work nolist;
delete new;
run;
quit;
%mend top_bottom_nPercent;

3.4 得到最大或最小N个数据
实现方法：
最小N个数据：排序后直接用obs=N即可得到
最大N个数据：排序，然后得到样本总数Num，减去N-1，再用firstobs=(Num-N+1)得到
proc sort data=pmlr.Develop(keep=DDABal DepAmt) out=tmp;
by DepAmt;
run;
data _null_;
set tmp nobs=Num_obs;
call symputx('Num',Num_obs);
stop;
run;
%let High = %(&Num - 9);
title "Ten Highest and Ten Lowest Values for HR";
data _null_;
set tmp(obs=10)
tmp(firstobs=&High) ;
file print;
if _n_ le 10 then do;
if _n_ = 1 then put / "Ten Lowest Values" ;
put "Patno = " DepAmt @15 "Value = " DDABal;
end;
else if _n_ ge 11 then do;
if _n_ = 11 then put / "Ten Highest Values" ;
put "Patno = " DepAmt @15 "Value = " DDABal;
end;
run;
用宏实现：
%macro highlow(Dsn=,
Var=,
Idvar=,
n= );
proc sort data=&Dsn(keep=&Idvar &Var
where=(&Var is not missing)) out=tmp;
by &Var;
run;
data _null_;
set tmp nobs=Num_obs;
call symput('Num',Num_obs);
stop;
run;

%let High = %(&Num - &n + 1);
title "&n Highest and Lowest Values for &Var";
data _null_;
set tmp(obs=&n)
tmp(firstobs=&High) ;
file print;
if _n_ le &n then do;
if _n_ = 1 then put / "&n Lowest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
else if _n_ ge %(&n + 1) then do;
if _n_ = %(&n + 1) then put / "&n Highest Values" ;
put "&Idvar = " &Idvar @15 "Value = " &Var;
end;
run;
proc datasets library=work nolist;
delete tmp;
run;
quit;
%mend highlow;

3.5 异常值错误值重复值缺失值
对于错误值，处理字符型变量用到的方法大多也可以用到数值型变量上，如上面的if then ,以及format，这里就不作多的讲解。
缺失值和重复值请见前面的博文：
缺失值：sas缺失值missing data详解 http://blog.sina.com.cn/s/blog_5d3b177c0100e6lm.html
重复值：除SAS数据集中的重复值方法汇总 http://blog.sina.com.cn/s/blog_5d3b177c0100bblp.html
下面说一下异常值的处理方法
3.5.1 用均值标准差的方法
proc means data=pmlr.Develop noprint;
var DDABal;
output out=means(drop=_type_ _freq_)
mean=M_DDABal
std=S_DDABal;
run;
data _null_;
file print;
set pmlr.Develop(keep=DDABal);
if _n_ = 1 then set means;
if DDABal lt M_DDABal - 2*S_DDABal and not missing(DDABal) or
DDABal gt M_DDABal + 2*S_DDABal then put DDABal=;
run;

3.5.2 用分位数去极值，再用均值标准差
proc rank data=pmlr.Develop(keep=DepAmt) out=tmp groups=5;
var DepAmt;
ranks R_DepAmt;
run;
proc means data=tmp noprint;
where R_DepAmt not in (0,4); ***只要中间60%的数据来计算均值和标准差;
var DepAmt;
output out=means(drop=_type_ _freq_)
mean=M_DepAmt
std=S_DepAmt;
run;
%let N_sd = 2;
%let Mult = 2.12; title "Outliers Based on Trimmed Statistics";
data _null_;
file print;
set pmlr.Develop;
if _n_ = 1 then set means;
if DepAmt lt M_DepAmt – &N_sd*S_DepAmt*&Mult and not missing(DepAmt) or
DepAmt gt M_DepAmt + &N_sd*S_DepAmt*&Mult then put Patno= HR=;
run;

3.5.3 基于分位数的异常值处理
基本箱线图的四分位数异常值处理的原理：在Q3＋1.5IQR（四分位距）和Q1－1.5IQR处画两条与中位线一样的线段，这两条线段为异常值截断点，称其为内限；在F＋3IQR和F－3IQR处画两条线段，称其为外限。处于内限以外位置的点表示的数据都是异常值，其中在内限与外限之间的异常值为温和的异常值（mild outliers），在外限以外的为极端的异常值(extreme outliers)。参考文献如下：http://baike.baidu.com/view/1326550.htm?func=retitle
%macro interquartile
(
var=,
n_iqr=2 );

title "Outliers Based on &N_iqr Interquartile Ranges";

proc means data=&dsn noprint;
var &var;
output out=tmp
q1=Lower
q3=Upper
qrange=Iqr;
run;
data _null_;
set &dsn(keep=&Idvar &Var);
file print;
if _n_ = 1 then set tmp;
if &Var le Lower - &N_iqr*Iqr and not missing(&Var) or
&Var ge Upper + &N_iqr*Iqr then
put &Idvar= &Var=;
run;

proc datasets library=work;
delete tmp;
run;
quit;
%mend interquartile;

4 日期型数据
日期型数据可以延用字符型或数值型数据的处理方法，这里不作介绍。

5 多个数据集的处理
在SAS中处理多个数据集的情况太多，这里不可能一一列出，因此，本文仅讲一点内容，抛砖引玉：
这里讲的例子是两个数据集合并，以及关键词IN(Creates a variable that indicates whether the data set contributed data to the current observation)。
首先创建两个数据集:
data one;
input Patno x y;
datalines;
1 69 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z;
datalines;
1 56
3 67
4 88
5 98
13 99
;
run;
注意：Merge数据集前一定要将数据集都排序：
proc sort data=one;
by Patno;
run;
proc sort data=two;
by Patno;
run;
然后在日志中列出数据集的数据分布情况
data _null_;
file print;
merge one(in=Inone)
two(in=Intwo) end=Last;
by Patno;
if not Inone then do;
put "ID " Patno "is not in data set one";
n + 1;
end;
else if not Intwo then do;
put "ID " Patno "is not in data set two";
n + 1;
end;
if Last and n eq 0 then
put "All ID's match in both files";
run;
IN是个很好用的关键词，灵活运用可以得到很多想要的结果，例如我们要实现SQL中的right join 或left join的功能：
data missing;
merge one
two(in=Intwo); **right join功能;
by Patno;
if Intwo;
run;
或者SQL中的not in功能：
data missing;
merge one
two(in=Intwo);
by Patno;
if not Intwo; **not in功能;
run;

6 数据集间的比较
data one;
input Patno z y;
datalines;
1 56 79
2 56 .
3 66 99
5 98 87
12 13 14
;
run;
data two;
input Patno z;
datalines;
1 56 79
2 56 .
3 11 99
5 98 87
12 13 14
;
run;
proc compare base=one compare=two brief;
id Patno;
run;

对于上述绝大多数操作，我们同样可以用SQL过程步实现，请大家参考以前的博文：
SAS中的SQL语句完全教程之一：SQL简介与基本查询功能
http://blog.sina.com.cn/s/blog_5d3b177c0100cksl.html
SAS中的SQL语句完全教程之二：数据合并与建表、建视图
http://blog.sina.com.cn/s/blog_5d3b177c0100cm1t.html
SAS中的SQL语句完全教程之三：SQL过程步的其它特征
http://blog.sina.com.cn/s/blog_5d3b177c0100cn8v.html

参考文献：
《Codys Data Cleaning Techniques Using SAS》

Ten ways to build a wrong scoring model

Some ways to build a wrong scoring model are below- The author doesn’t take any guarantee if your modeling team is using one of these and still getting a correct model.

1) Over fit the model to the sample. This over fitting can be checked by taking a random sample again and fitting the scoring equation and compared predicted conversion rates versus actual conversion rates. The over fit model does not rank order deciles with lower average probability may show equal or more conversions than deciles with higher probability scores.

2) Choose non random samples for building and validating the scoring equation. Read over fitting above.

3) Use Multicollinearity (http://en.wikipedia.org/wiki/Multicollinearity ) without business judgment to remove variables which may make business sense.Usually happens a few years after you studied and forgot Multicollinearity.

If you don't know the difference between Multicollinearity , Heteroskedasticity http://en.wikipedia.org/wiki/Heteroskedasticity this could be the real deal breaker for you

4) Using legacy codes for running scoring usually with step wise forward and backward regression .Happens usually on Fridays and when in a hurry to make models.

5) Ignoring signs or magnitude of parameter estimates ( that's the output or the weightage of the variable in the equation).

6) Not knowing the difference between Type 1 and Type 2 error especially when rejecting variables based on P value. ( Not knowing P value means you may kindly stop reading and click the You Tube video in the right margin )

7) Excessive zeal in removing variables. Why ? Ask yourself this question every time you are removing a variable.

Using the wrong causal event (like mailings for loans) for predicting the future with scoring model (for mailings of deposit accounts) . or using the right causal event in the wrong environment ( rapid decline/rise of sales due to factors not present in model like competitor entry/going out of business ,oil prices, credit shocks sob sob sigh)

9) Over fitting

10) Learning about creating models from blogs and not reading and refreshing your old statistics textbooks

Reference:http://blog.sina.com.cn/s/blog_5d3b177c0100h55m.html

Friday, May 27, 2011

SAS和蒙特卡罗模拟_随机数

转载自：http://Johnthu.spaces.live.com/

一、为什么选择SAS做蒙特卡罗模拟？

为什么要用SAS做蒙卡？首先，对我来说，我只会用SAS，而且打算用SAS完成我所有的工作。当然，其他一些通用的理由有(Fan, etc.,2002)：

蒙卡是个计算密集的活，而SAS Base、SAS Macro、SAS/IML强大而灵活的编程能力能满足这一要求；
做蒙卡时要用到大量的统计/数学技术，而SAS就内置了大量的统计/数学函数(在 SAS/Stat和SAS/ETS)；用Fortran或C++当然也是非常好的主意，只是他们缺少内置的统计函数，代码就要冗长复杂很多。

二、什么是蒙卡？一个启发性例子

好，开始，什么是蒙卡？了解它背景知识的最好办法当然是wiki-Monte_Carlo_method。蒙特卡罗是位于摩洛哥的一家赌场，二战时，美国Los Alamos国家实验室把它作为核裂变计算机模拟的代码名称。作为模拟方法，蒙卡以前就叫统计抽样(statistical sampling)，我们感兴趣的结果因为输入变量的不确定而不可知，但如果能依概率产生输入变量的样本，我们就可以估计到结果变量的分布。跟蒙卡对应的，还有一种模拟技术叫系统模拟，包括排队、库存等模型，这些模型都跟随时间推移而出现的事件序列有关。下面举个蒙卡的例子，来自Evans, etc.( 2001)的超级简化版。

假设一家企业，利润是其需求量的函数，需求是随机变量。为了简化讨论，假定利润就是需求的两倍。这里输入变量就是不可控的需求，结果变量就是我们感兴趣的利润。假设需求以相同的概率取10、20、30、40、50、60这六种情况。在这样的简化下，我们就可以投一枚均匀的骰子来产生需求的样本，如果点数为1，对应得需求就是10，点数为2，需求就是20，以下类推。这样，我们的模拟过程就是：

投骰子；
根据骰子的点数确定需求量；
根据需求量，求利润。

掷10次骰子，假设我们的模拟结果如下：

重复次数	骰子点数	需求量	利润
1	5	50	100
2	3	30	60
3	3	30	60
4	6	60	120
5	1	10	20
6	3	30	60
7	4	40	40
8	5	50	100
9	2	20	20
10	5	50	50

这样通过模拟需求的样本，我们对结果利润的分布也就有所知晓，比如平均利润可以算出就是63。蒙卡一个重要的步骤就是生成随机数，这里我们是用投骰子来完成。

三、又一个例子：利用蒙特卡罗模拟方法求圆周率∏(pi)

再举个很有名的例子，就是估计圆周率∏的值，来自Ross(2006)。这个试验的思路正好可以帮我们温习一下几何概型的概念。我们知道概率的古典概型，就是把求概率的问题转化为计数：样本空间由n个样本点组成，事件A由k个样本点组成，则事件A的概率就是k/n。考虑到概率和面积在测度上具有某种共性，几何概型的基本想法是把事件跟几何区域相对应，用面积来计算概率，其要点是：

认为样本空间Ω是平面上的某个区域，其面积记为υ(Ω)；
向区域Ω随机投掷一点，该点落入Ω内任何部分区域内的可能性只与这部分区域的面积成比例，而与这部分区域的位置和形状无关；
设事件A是Ω的某个区域，面积为υ(A)，则向区域Ω上随机投掷一点，该点落在区域A的可能性称为事件A的概率，P(A)=υ(A)/υ(Ω)。

扯远了，回到用蒙卡估计圆周率∏的实验思路。假设样本空间是一个边长为2面积为4的正方形，我们感兴趣的事件是正方形内的一个半径为1面积为∏的圆，所以向正方形内随机投掷一点，落在圆里面的概率为∏/4。实验的思路如下：

1）生成随机数——生成n个均匀落在正方形内的点；
2）对落在正方形内的n个点，数一数正好落在圆里面的点的个数，假设为k（另外n-k个点就落在圆外面的正方形区域内）。
3）k/n就可以大致认为是圆的面积与正方形的面积之比，另其等于∏/4，就可以求出圆周率∏的估计值。

又，Forcode提供了利用电子表格Excel求解以上问题的详细过程，有兴趣可以看看。另外，这里也有一个详细的说明，用Mathematica实现。

四、随机数很重要（以及下期预告）

在第一个例子中，我强调了骰子要是均匀的。那里骰子就是我们的随机数生成器，骰子的均匀程度就是我们随机数的“随机”成分。在上面的10次试验中，投出了3次点数“3”和3次点数“5”，然后点数“1”、“2”、“4”、“6”各出现一次——因为只是重复了10次，这样我们对骰子的均匀程度不好评估，但如果重复无数次——假如是十万次，如果还出现类似比例的结果，那么就有理由怀疑这粒骰子的均匀程度了。如果骰子灌了铅，比如投出点数“6”的可能性从六分之一上升到6分之三，那么根据这粒骰子来做模拟，我们就有高估需求以及利润的危险。

下次就讲讲随机数的生成原理，当然结合SAS来讲，或者也会提一下Excel。

五、其他软件包

在商业领域，基于Excel的插件比较兴盛，比如Crystal Ball应用就较为普遍，有试用版可以下载。其他竞争产品有@Risk、Risk Solver等。现在据说Risk Simulator也开始兴盛起来，大有后来居上之势。他的开发者Johnathan Mun还在Wiley Finance系列有一本书，Modeling Risk: Applying Monte Carlo Simulation, Real Options Analysis, Forecasting, and Optimization Techniques。

S语言（R或者S-Plus）由于其面对对象的特性，加之丰富的内置函数和诸多用户提供的库，使得R或者S-Plus也是蒙卡研究的得力工具——或者比SAS更有前景。这个我不熟，略之。

Random Numbers

最简单、最基本、最重要的随机变量是在[0,1]上均匀分布的随机变量。一般地，我们把[0,1]上均匀分布随机变量的抽样值称为随机数，其他分布随机变量的抽样都是借助于随机数来实现的。以下谈的都是所谓“伪随机数”(Pseudo Random Numbers)。产生随机数，可以通过物理方法取得（很久很久以前，兰德公司就曾以随机脉冲源做信息源，利用电子旋转轮来产生随机数表），但当今最为普遍的乃是在计算机上利用数学方法产生随机数。这种随机数根据特定的迭代公式计算出来，初值确定后，序列就可以预测出来，所以不能算是真正的随机数（就成为“伪随机数”）。不过，在应用中，只要产生的伪随机数列能通过一系列统计检验，就可以把它们当成“真”随机数来用。

现在大多软件包内置的随机数产生程序，都是使用同余法(Congruential Random Numbers Generators)。“同余”是数论中的概念。

0.预备知识：同余

捡回小学一年级的东西："4/2=2"读作“4除以2等于2”，或者，“2除4等于2”。还有求模的符号mod(number,divisor)，其中，number是被除数（在上式中，为4），divisor是除数（上式中的2）。这样的约定对SAS和Excel都通用，如mod(4,13)=4，mod(13,4)=1。

现在我们可以开讲“同余”了。设m是正整数，用m去除整数a、b，得到的余数相同，则称“a与b关于模m同余”。上面的定义可以读写成，对整数a、b和正整数m，若mod(a,m)=mod(b,m)，则称“a与b关于模m有相同的余数（同余）”，记做a≡b(mod m)（这就是同余式）。举个例子，mod(13,4)=1，mod(1,4)=1，则读成13和1关于模4同余，记做13≡1(mod 4)。当然，同余具有对称性，上式还可以写成1=13(mod 4)。

a≡b(mod m)的一个充要条件是a=b+mt，t是整数，比如13=1+3*4。a=b+mt可以写成(a-b)/m=t，即m能整除(a-b)。

这些就够了。更多基础性的介绍，可以参考《同余（数据基础）》。

1.乘同余法

同余法是一大类方法的统称，包括加同余法、乘同余法等。因为这些方法中的迭代公式都可以写成上面我们见过的同余式形式，故统称同余法。常用的就是下面的乘同余法(Multiplicative Congruential Generator.)。符号不好敲，做些约定，如R(i)就是R加一个下标i。

乘同余法随机数生成器的同余形式如下：R(i+1)=a*R(i) (mod m)。这个迭代式可以写成更直观的形式，R(i+1)=mod[a*R(i),m]，其中初值R(0)称为随机数种子。因为mod(x,m)总是等于0到m-1的一个整数，所以最后把R(i+1)这个随机序列都除以m，就可以得到在[0,1]上均匀分布的随机数。下面用电子表格演示一遍，假设随机数种子R(0)=1，a=4，m=13：

	A	B	C	D	E
1	i	R(i)	a*R(i)	mod[a*R(i),m]	mod[a*R(i),m]/m
2	0	1	=B2*4	=mod(C2,13)	=D2/13
3	1	=D2	=B3*4	=mod(C3,13)	=D3/13
4	2	=D3	=B4*4	=mod(C4,13)	=D4/13

上面显示的是公式（Excel中，公式与计算结果的转换，用快捷键ctrl+~实现），结果看起来是这样的，E列就是我们想要的在[0,1]上均匀分布的随机数：

	A	B	C	D	E
1	i	R(i)	a*R(i)	mod[a*R(i),m]	mod[a*R(i),m]/m
2	0	1	4	4	0.307692308
3	1	4	16	3	0.230769231
4	2	3	12	12	0.923076923

上表中，a*R(1)=16，R(2)=3，m=13，一个同余式就是R(2)=a*R(1) (mod m)，3=16(mod 13)，或者说，16-3能够被13整除——同余法就是这意思了。说“乘同余法”，要点在于a*R(i)是乘法形式。在SAS系统中，a=397,204,094，m = 2^31-1=2147483647（一个素数），随机种子R(0)可以取1到m-1之间的任何整数。

2.伪随机数的检验

现在内置于各大软件包的随机数生成器都经过了彻底的检验，我们当然可以安心地使用这个黑箱。或则我们也可以多了解些。一个好的“伪”随机数列，应该看起来就跟从[0,1]上均匀分布的随机数列中随机抽取出来的一样。两个比较直观的检验有：

均匀性检验——所有的数是否均匀地分布在[0,1]区间；
独立性（或不相关）检验——序列中的数不存在序列相关，每个数都跟其前后出现的数独立无关。一个例子是如0.1、0.2、0.3、0.4、0.5，这里每一个相继的数都正好比它前面的一个数大0.1，这样这个数列就似乎有了某种“格局”。

具体的检验方法就略之了，更多请参见徐钟济（1985）。

3.生成其他概率分布的随机数

前面提到过，我们把[0,1]上均匀分布随机变量的抽样值称为随机数，其他分布随机变量的抽样都是借助于随机数来实现的。对其他连续型分布，（累积）分布函数为F(x)，r是在[0,1]上均匀分布的随机变量，另r=F(x)，解出的x就是该连续型分布的随机抽样。由于x可以写成r的反函数，该变换就称作“逆变换”(Inverse Transformation method)。对离散型分布，思路类似，不过繁琐些，具体可以见徐钟济（1985）、Ross（2006）和Evans等（2001）。大多数相关软件包都会提供各种分布的随机数生成函数，知道有这回事就可以。最重要的，是我们陈述一类问题时，知道需要用哪种概率分布来描述Evans等（2001）：

常用的连续分布

均匀分布(Uniform Distribution)描述在某最小值和最大值之间所有结果等可能的随机变量的特性。
正态分布(Normal Distribution)是对称的，具有中位数等于均值的性质，就是我们熟悉的钟形形状。各种类型的误差常常是正态分布的。最后，作为中心极限定理的推论，大量具有任意分布的随机变量的平均数也呈正态分布。
三角形分布(Triangular Distribution)由三个参数来定义，最大值、最小值和最可能值。临近最可能值的结果比那些位于端点的结果有更大的出现机会。
指数分布(Exponential Distribution)常用来构建在时间上随机重现的事件的模型，如顾客到达服务系统的时间间隔、机器元件失效前的工作时间等。它的主要性质是“无记忆性”(Memoryless) ，即当前时间对未来结果没有影响。例如，不论机器原先已经运转了多长时间，它继续运转直至出现故障的时间间隔总有同样的分布。
对数正态分布(Lognormal Distribution)：若随机变量x的自然对数是正态的，则x就服从对数正态分布。对数正态分布的常见例子是股票价格。

常用的离散分布

贝努利分布(Bernoulli Distribution)描述的是只有两个以常数概率出现的可能结果的随机变量的特性；
二项分布(Binomial Distribution)给出每次试验成功概率为p的n次独立重复贝努利实验的模型；
泊松分布(Poisson Distribution)用于建立某种度量单位内发生次数的模型的一种离散分布，如某时段内某事件发生的次数。

其他有用的分布如伽玛分布(Gamma Distribution)、威布尔分布(Weibull Distribution)、贝塔分布(Beta Distribution)略之。

4.下期预告

上次落了些东西。说到蒙卡与C++，在数量金融领域，不能不提的一本书就是Mark Joshi的C++ Design Patterns and Derivatives Pricing (Cambridge Uni. Press, 2004)，面试中一定要说自己读过的，这样人家以为你至少会用C++做一个蒙特卡罗模拟。这个系列只讲SAS，下次先讲如何用SAS生成随机数，然后就是具体的项目。

SAS随机数函数及CALL子程序

**************************************************************************************************************************

《SAS和蒙特卡罗模拟（1）：开篇》——简介，通过例子建立起蒙卡的直观概念；参考软件包及书目
《SAS和蒙特卡罗模拟（2）：随机数基础》——随机数入门；参考书目

**************************************************************************************************************************************************

一、SAS随机数函数和CALL子程序

SAS系统产生随机数，两种方式，利用SAS函数(Functions)或者CALL子程序(CALL Routines)，它们的语法格式是（具体的区别容后讨论）：

方式代码说明

函数 var=name(seed,<arg>) var为存储随机数列的变量，name为特定的分布函数形式，seed为随机数种子，<arg>为特定分布要求的参数（可选）

CALL子程序 call name(seed,<arg>,var) 同上,记得seed=0, ±1,±2, , ± (2**31-2)

SAS可用的随机数函数列表如下（可以参见SAS Help and Documentation-SAS Products-Base SAS-SAS Language Dictionary-Functions and CALL Routines-Functions and CALL Routines by Category）：

分布 SAS函数说明

二项分布(Binomial) ranBin(seed,n,p) n为独立实验的次数，p为成功概率

指数分布(Exponential) ranExp(seed)

正态分布(Normal) ranNor(seed),or normal(seed) ranNor和normal是同质的，但normal没有相对应的CALL子程序

泊松分布(Poisson) ranPoi(seed,m) m为均值

均匀分布(Uniform) ranUni(seed),or uniform(seed) ranUni和uniform是同质的，但uniform没有相对应的CALL子程序

柯西分布(Cauchy) ranCau(seed)

伽玛分布(Gamma) ranGam(seed,a) a>0为形状参数

由分布律表格决定的离散分布(tabled probability distribution) ranTbl(seed,p1,p2,...pn) ∑p=1

三角分布(Triangular) ranTri(seed,h) h为斜边（最可能值）

上面的随机函数，除了normal和uniform，都可以由直接相应的CALL子程序调用。

二、SAS随机数函数和CALL子程序：比较

用SAS随机数函数同时创建的多个随机数变量其实都属于同一个随机数列。这话费解，一个例子先，创建两个随机数变量，各包含3个记录，其中x1的种子为123，x2的种子为456：

data ranuni(drop=i);
    retain seed1 123 seed2 456;
    do i=1 to 3;
        x1=ranuni(seed1);
        x2=ranuni(seed2);
        output;
    end;
run;
proc print data=ranuni;run;

结果为：

Obs    seed1    seed2       x1         x2
1      123      456     0.75040    0.32091
2      123      456     0.17839    0.90603
3      123      456     0.35712    0.22111

好像没什么异样。我们把上面的x1增加为6个记录：

data ranuni2(drop=i);
    retain seed1 123;
    do i=1 to 6;
        x1=ranuni(seed1);
        output;
    end;
run;
proc print data=ranuni2;run;

结果如下，把上下用红色标出的数字对照看一看：

Obs    seed1       x1
1      123     0.750402      123     0.32091
3      123     0.178394      123     0.906035      123     0.35712
6      123    0.22111

什么意思？在第一段代码中，x2的种子根本不起作用，把x2的记录安插到x1里，看起来就是用种子123产生的随机数列加长了而已。x2的第一个值并不是由种子456产生的，而是产生第一个x1后的下一个值，x1、x2其实属于同一个随机数列，尽管x2的种子被指定为456，而x1的被指定为123。现在就可以重复上面的一句话：用SAS随机数函数同时创建的多个随机数变量其实都属于同一个随机数列。

用CALL子程序就能够同时产生独立的随机数列。

data ranuni3(drop=i);
    retain seed3 123 seed4 456;
    do i=1 to 3;
        call ranuni(seed3,x3);
        call ranuni(seed4,x4);
        output;
    end;
run;
proc print data=ranuni3;run;

结果如下：

Obs       seed3        seed4         x3         x4
1     1611463328    736440516   0.75040    0.34293
2      689153326    774069794    0.32091    0.36045
3      383088854    686944750    0.17839    0.31988

以上x3就是x1。x1和x3的初始种子都是123，但x3那个结果显示的种子是当前种子值。要在SAS随机数函数语句中显示随机种子的当前值，可以由以下代码给出：

data ranuni4(drop=i);
    retain seed1 123;
    do i=1 to 3;
        x1=ranuni(seed1);
        seed=x1*(2**31-1);
        output;
    end;
run;
proc print data=ranuni4;
    var seed1 seed x1;
run;

结果如下，可以跟上面由CALL子程序得出的结果对照：

Obs    seed1       seed          x1
1      123     1611463328   0.75040
2      123      689153326    0.32091
3      123      383088854    0.17839

---------参考资料---------

Xitao Fan, etc..Monte Carlo Studies: A Guide for Quantitative Researchers. SAS Institute Inc.,2002
朱世武《SAS编程技术与金融数据处理》，北京：清华大学出版社，2003

NiuNiu's Warehouse

Pages