A Brief Look at the PSID

In the following exercise, we will take a brief look at the PSID (Panel Study of Income Dynamics) which is smaller than the SLID and NLSCY. PSID is a longitudinal data set, measuring income over 7 years. We ignore the time variable (T) in most of this exercise.

All the commands below are stored in the file R:\psych\spida\courses\eda\psidex1.sas. You can include them directly into SAS by typing in the Editor window:

%include spidaeda(psidex1);

or, you can copy/paste from this window into the SAS Editor window.

First of all, we copy the data to a new dataset called PSID, located in the temporary work folder and is retrieved from the original dataset called PSID.PSID in the R drive. In addition, we give a title to all subsequent output.

    title 'PSIDEX1: Brief look at the PSID'; 
    *-- Make a working copy of the PSID data;
    data psid;
       set psid.psid;
    run;

All the phrases after *-- are comments and SAS will not execute them. You can use comments to write some notes and remind yourself of what you intend to do in a step-by-step approach.

Summary statistics

Now, take a look at the whole dataset and see what is in there. In addition, we use PROC MEANS to obtain the mean scores of all the numerical variables. The other variables in the dataset are categorical and they should not be calculated to get the means.

    *-- See whats there;
    proc contents data=psid;
    run; 

    proc means data=psid maxdec=3;
    var exp wks ed lwage;
    run;

Graphical summary statistics

The quantitative variables are work experience (EXP), weeks worked (WKS), education (ED), and log (wage) (LWAGE). LWAGE is the criterion variable. The DATACHK macro shows brief graphical statistics for all the variables.

    *-- Graphical and numerical overview;
    %datachk(data=psid, var=exp wks ed lwage); 

    *-- WKS has a strongly negatively skewed distribution;
    %symbox(data=psid, var=wks, powers=3 4 5 6 7);

You will see that the variable WKS is strongly negatively skewed. The SYMBOX macro helps to find which power transformation will minimize the skewness. There's a reason here why weeks of work is negatively skewed (why?), so we don't pursue this here. Who would like a model using WKS**7 anyway?

Graphical comparisons for dichotomous variables

There are also several dichotomous varaibles, including female (FEM), south (SOUTH) and union (UNION). The SPLOT macro shows comparisons between boxplots of LWAGE for the two levels in a dichotomous variable. Are there any significant difference between the levels of FEM and SOUTH in terms of LWAGE?

    *-- Compare lwage for dichotomous variables;
    %splot(data=psid, var=lwage, class=fem south);
    run;

Fitting a model

PROC GLM is a combined model of ANOVA and REGRESSION. In this example, FEM and SOUTH are considered as the class variable and the model indicates the relationship between all the dichotomous variables. With the ' | ' between the predictor variables, interactions between them are also investigated.

   proc glm data=psid;
   class fem south;
   model lwage = fem | south | union;
   run;

Graphical comparisons of variables

These effects can be seen better in a plot. The INTPLOT macro shows the main effects and two-way interaction effects of a N-way design in the format of a scatterplot matrix.

    *-- Examine means for pairs of Fem, South, Union;
    %intplot(data=psid, response=lwage, class=Fem South Union);

The main quantitative variables are ED and EXP. How do they relate to LWAGE? The LOWESS macro shows the relationship between a pair of variables chosen. On the Y-axis, we plot the criterion variable.

    
*-- Relations between quantitative variables;
    %lowess(data=psid, y=lwage, x=ed, hsym=1, interp=rl);
    %lowess(data=psid, y=lwage, x=exp, hsym=1, interp=rl);

The SCATMAT macro draws a high-resolution scatterplot matrix for all pairs of variables defined in the VAR list.

   %scatmat(data=psid, var=lwage ed exp, interp=rl,
      anno=lowess, hsym=1, symbols=+);

Fitting an improved models

Now, we add some more variables into the GLM.

   *-- Fit a model;
   proc glm data=psid;
      class fem south union;
      model lwage = fem | south | union  t  ed exp fem*ed;
      output out=results r=resid p=predict;
   run;

The INFLGLIM macro produces various influence plots for a generalized linear model fit by PROC GENMOD. Here we use dist=normal to fit an ordinary linear model. Each influence plot is a bubble plot of a measure of residual against a measure of leverage (Hat value, here), with the bubble size proportional to a measure of influence (usually, BUBBLE=COOKD).

   *-- Fit this model with GENMOD, get influence plots;
   %inflglim(data=psid, resp=lwage,
       class=fem south union,
       model= fem | south | union  t ed exp fem*ed,
      id=id, dist=normal, lsize=1, print=no);

You will see that several observations are unusual in terms of the predictors (large Hat values) and several have large residuals, but none have both to make them influential.

We haven't considered the changes over time (T) here. However, this initial look at the PSID data has shown some important variables to be incorporated into any model.