You can do these exercises in several ways:
%include spidaeda(filename);where
filename
is the name of the file after the
path shown below. E.g., %include spidaeda(pontdchk);
.
There are three data sets. In SAS statements, refer to the data in a PROC step or DATA step by the following names:
SAS name | Number of Cases |
Number of Variables |
Description |
---|---|---|---|
slid.lp9394nm | 27854 | 1176 | Complete person file (minus all-missing) |
slid.pontario | 7425 | 48 | Ontario subset (wages and related variables) |
slid.pquebec | 5224 | 48 | Quebec subset (wages and related variables) |
Here, we will work with the Ontario and Quebec subset files, to keep the time for analysis small. The variables in these subsets and their statistics for the Ontario subset are shown here from the result of
%missing(data=slid.pontario, id=pupid26c, drop=);
options fmtsearch=(slid); *-- Look for interesting variables, summarize their properties; %datachk(data=slid.pontario, id=pupid26c, var=cmphw28c ttwgs28c ttinc42c eage26c yrsch18c hlevg18c yrxft11c);
cmphw28c ttwgs28c ttinc42c
are possible response variables.
Run one or more of the following in SAS and pick one variable to pursue.
*-- Examine transformations for response variables; title 'SLID: cmphw28c, Comp. hourly wage, 1994'; %symbox(data=slid.pontario, var=cmphw28c, powers=-1 -0.5 0 0.5 1); title 'SLID: Total wages and salaries (TTWGS28C)'; %symbox(data=slid.pontario, var=ttwgs28c, powers=-0.5 0 0.5 1); title 'SLID: Total money income (TTINC42C)'; %symbox(data=slid.pontario, var=ttinc42c, powers=-0.5 0 0.5 1);
if xxxx > 0 then
parts suppress SAS errors/warnings
for negative or missing values.]
title; data pontario; set slid.pontario; if loghwage > 0 then loghwage = log10(cmphw28c); if ttwgs28c > 0 then sqrtwage = sqrt(ttwgs28c); if ttinc42c > 0 then sqrtinc = sqrt(ttinc42c); label loghwage = 'log(Hourly wage)' sqrtwage = 'sqrt(Total wages and salaries)' sqrtinc = 'sqrt(Total money income)' eage26c = 'Age in 1994'; run;This step creates a temporary copy of the data set, including the new, transformed variables. From here on, we use the new, temporary file, PONTARIO
In the statements below,
substitute the name of your chosen transformed reponse
(e.g., SQRTWAGE) measure for
@RESP@
.
Use the %LOWESS macro to fit a smoothed curve relating your @RESP@*
to Age (EAGE26C) and Years of Schooling (YRSCH18C)
%lowess(data=pontario, y=@RESP@, x=eage26c, hsym=1, interp=rl); %lowess(data=pontario, y=@RESP@, x=yrsch18c, hsym=1, interp=rl);
@RESP@
measure
by sex (SEX21, in the SLID)
Again, substitute the name of your chosen transformed reponse
(e.g., SQRTWAGE) measure for
@RESP@
in the statements below.
proc sort data=pontario; by sex21; proc boxplot data=pontario; plot @RESP@ * sex21 /boxstyle=schematicidfar notches; id pupid26c; run;If the notches do not overlap, the difference in (median) wage measure is significant at the 95% level (ignoring other factors).
Since we have already transformed the response (total wages), we will consider the second possibility here.
- Unbend the data: Transform the data to make the relationship more nearly linear. [Question: look back at the LOWESS curve relating SQRTWAGE and EAGE26C. By Tukey's Arrow Rule, which way would you move on the latter of powers to transform EAGE26C?]
- Bend the model: Fit a non-linear model, or add polynomial terms in X.
Fit the model below, which includes both linear and quadratic terms in Age and Years of Schooling. We also include Sex and Visible Minority Status as additional predictors.
Note the use of "|" notation in the MODEL statement: A|A
is short for A A*A
.
proc glm data=pontario; class sex21; model sqrtwage = sex21 eage26c|eage26c yrsch18c|yrsch18c vismn15 / solution; output out=stats r=residual p=fitted; run;Not bad, in terms of overall fit --- the R2 is 0.40. Note that the linear term in YRSCH18C is not significant by the Type III tests; however, most analysts would retain lower-order terms whenever a high-order term is significant.
PROC RSREG fits a response surface model, generating all quadratic
and cross-product terms.
For illustration, we list VISMN15 first
on the MODEL statement below, and use the option COVAR=1
to suppress cross-products (interactions) with Visible Minority Status.
Note that SEX21 is binary, so it need not be treated as a CLASS variable.
proc rsreg data=pontario; model sqrtwage = vismn15 sex21 eage26c yrsch18c / covar=1; id pupid26c; run;The cross-product terms collectively are highly significant, yet the R2 increases only to 0.41, so the increase in model complexity may not be worthwhile. [Question: Can you think of some interpretation for the significant interaction of Age * Sex on Wages?]
First, let's make a normal probability plot:
%nqplot(data=stats, var=residual);
What about heteroscedisticity (constant variance around the regression surface)?
Here we can divide the observations into groups based on fitted value and test whether the variability of SQRTWAGE varies systematically with the predicted wage. (In the Regression Diagnostics course you will learn about other, better methods for detecting non-constant variance.)
We use the SPRDPLOT macro to create a Spread vs. Level plot.
proc rank data=stats out=grouped groups=10; var fitted; ranks decile; %sprdplot(data=grouped, var=sqrtwage, class=decile);