All the commands below are stored in the file R:\psych\spida\courses\eda\psidex1.sas. You can include them directly into SAS by typing in the Editor window:
%include spidaeda(psidex1);or, you can copy/paste from this window into the SAS Editor window.
First of all, we copy the data to a new dataset called PSID
,
located in the temporary work folder and is retrieved from the original dataset called PSID.PSID in the R drive.
In addition, we give a title to all subsequent output.
title 'PSIDEX1: Brief look at the PSID'; *-- Make a working copy of the PSID data; data psid; set psid.psid; run;
All the phrases after *--
are comments and SAS will not execute them. You can use comments to
write some notes and remind yourself of what you intend to do in a step-by-step approach.
Now, take a look at the whole dataset and see what is in there. In addition, we use PROC MEANS to obtain the mean scores of all the numerical variables. The other variables in the dataset are categorical and they should not be calculated to get the means.
*-- See whats there; proc contents data=psid; run; proc means data=psid maxdec=3; var exp wks ed lwage; run;
The quantitative variables are work experience (EXP), weeks worked (WKS), education (ED), and log (wage) (LWAGE). LWAGE is the criterion variable. The DATACHK macro shows brief graphical statistics for all the variables.
*-- Graphical and numerical overview; %datachk(data=psid, var=exp wks ed lwage); *-- WKS has a strongly negatively skewed distribution; %symbox(data=psid, var=wks, powers=3 4 5 6 7);You will see that the variable WKS is strongly negatively skewed. The SYMBOX macro helps to find which power transformation will minimize the skewness. There's a reason here why weeks of work is negatively skewed (why?), so we don't pursue this here. Who would like a model using
WKS**7
anyway?
*-- Compare lwage for dichotomous variables; %splot(data=psid, var=lwage, class=fem south); run;
' | '
between the predictor variables, interactions between them are also investigated.
proc glm data=psid; class fem south; model lwage = fem | south | union; run;
*-- Examine means for pairs of Fem, South, Union; %intplot(data=psid, response=lwage, class=Fem South Union);
The main quantitative variables are ED and EXP. How do they relate to LWAGE? The LOWESS macro shows the relationship between a pair of variables chosen. On the Y-axis, we plot the criterion variable.
*-- Relations between quantitative variables; %lowess(data=psid, y=lwage, x=ed, hsym=1, interp=rl); %lowess(data=psid, y=lwage, x=exp, hsym=1, interp=rl);
The SCATMAT macro draws a high-resolution scatterplot matrix for all pairs of variables defined in the VAR list.
%scatmat(data=psid, var=lwage ed exp, interp=rl, anno=lowess, hsym=1, symbols=+);
*-- Fit a model; proc glm data=psid; class fem south union; model lwage = fem | south | union t ed exp fem*ed; output out=results r=resid p=predict; run;
The INFLGLIM macro produces various influence plots for a generalized linear model fit by PROC GENMOD. Here we use dist=normal
to fit an ordinary linear model.
Each influence plot is a bubble plot of a measure of residual
against a measure of leverage (Hat value, here),
with the bubble size proportional to a measure of influence (usually, BUBBLE=COOKD).
*-- Fit this model with GENMOD, get influence plots; %inflglim(data=psid, resp=lwage, class=fem south union, model= fem | south | union t ed exp fem*ed, id=id, dist=normal, lsize=1, print=no);You will see that several observations are unusual in terms of the predictors (large Hat values) and several have large residuals, but none have both to make them influential.
We haven't considered the changes over time (T) here. However, this initial look at the PSID data has shown some important variables to be incorporated into any model.