The Wages of SLID

Here are a few exercises to try, examining variables related to wages and salaries in the SLID Person File.

You can do these exercises in several ways:

Locating the data

The datasets are stored as SAS system files in the directory (folder) known to SAS as SLID in the Hebb Lab. This library name is automatically allocated when SAS starts.

There are three data sets. In SAS statements, refer to the data in a PROC step or DATA step by the following names:

SAS name         Number of cases   Number of variables   Description
slid.lp9394nm    27854             1176                  Complete person file (minus all-missing)
slid.pontario    7425              48                    Ontario subset (wages and related variables)
slid.pquebec     5224              48                    Quebec subset (wages and related variables)
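
Before starting, it can help to confirm that the library is allocated and to see which variables a subset contains. The following is a minimal sketch using base SAS, assuming the SLID library name is available as described above:

```sas
*-- List variable names, types and labels for the Ontario subset;
proc contents data=slid.pontario varnum;
   run;
```

The VARNUM option lists the variables in their order in the data set rather than alphabetically.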

To keep analysis time short, we will work with the Ontario and Quebec subset files. The variables in these subsets, with their statistics for the Ontario subset, are listed by running

%missing(data=slid.pontario, id=pupid26c, drop=);

First steps [File: R:\psych\spida\courses\eda\pontdchk.sas]

First, let's look at some of the possibly interesting variables, getting a quick overview with the %datachk macro.
options fmtsearch=(slid);
*-- Look for interesting variables, summarize their properties;
%datachk(data=slid.pontario, id=pupid26c, 
    var=cmphw28c ttwgs28c ttinc42c eage26c yrsch18c hlevg18c yrxft11c);
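
If the %datachk macro is not available, a rough overview of the same variables can be obtained with base SAS. This is only a sketch: it assumes all seven variables are numeric, and it does not reproduce the output of %datachk.

```sas
*-- Base SAS overview of the same variables (not the %datachk output);
proc means data=slid.pontario n nmiss mean std min median max;
   var cmphw28c ttwgs28c ttinc42c eage26c yrsch18c hlevg18c yrxft11c;
   run;
```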

Choosing a Response variable and re-expression [File: R:\psych\spida\courses\eda\pontsym.sas]

The first three variables cmphw28c ttwgs28c ttinc42c are possible response variables.

Run one or more of the following in SAS and pick one variable to pursue.

*-- Examine transformations for response variables;
title 'SLID: cmphw28c, Comp. hourly wage, 1994';
%symbox(data=slid.pontario, var=cmphw28c, powers=-1 -0.5 0 0.5 1);

title 'SLID: Total wages and salaries (TTWGS28C)';
%symbox(data=slid.pontario, var=ttwgs28c, powers=-0.5 0 0.5 1);

title 'SLID: Total money income (TTINC42C)';
%symbox(data=slid.pontario, var=ttinc42c, powers=-0.5 0 0.5 1);

Transforming the response [File: R:\psych\spida\courses\eda\ponttrans.sas]

In SAS, you create transformed variables in a data step, using functions such as log10, sqrt, etc. Feel free to change the lines below, depending on what variable you select, and the power you find most appropriate. [The if xxxx > 0 then parts suppress SAS errors/warnings for negative or missing values.]

title;
data pontario;
   set slid.pontario;
   if cmphw28c > 0 then loghwage = log10(cmphw28c);
   if ttwgs28c > 0 then sqrtwage = sqrt(ttwgs28c);
   if ttinc42c > 0 then sqrtinc = sqrt(ttinc42c);
   label
      loghwage = 'log(Hourly wage)' 
      sqrtwage = 'sqrt(Total wages and salaries)'
      sqrtinc = 'sqrt(Total money income)'
      eage26c = 'Age in 1994';
  run;
This step creates a temporary copy of the data set, including the new, transformed variables. From here on, we use the new, temporary file, PONTARIO.
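
One quick way to check whether a re-expression has helped is to compare skewness before and after. The sketch below does this for the two square-root variables; which variables to compare is a choice made here for illustration, and others can be substituted.

```sas
*-- Compare skewness before and after re-expression;
proc means data=pontario n nmiss skewness;
   var ttwgs28c sqrtwage ttinc42c sqrtinc;
   run;
```

A skewness closer to 0 for the transformed variable suggests a more symmetric distribution.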

Relation between the response and Age, Yrs of Schooling

Both Age (EAGE26C) and Years of Schooling (YRSCH18C) are potential predictors of your response measure.

In the statements below, substitute the name of your chosen transformed response measure (e.g., SQRTWAGE) for @RESP@.

Use the %LOWESS macro to fit a smoothed curve relating your @RESP@ to Age (EAGE26C) and to Years of Schooling (YRSCH18C).

%lowess(data=pontario, y=@RESP@, x=eage26c, hsym=1, interp=rl);

%lowess(data=pontario, y=@RESP@, x=yrsch18c, hsym=1, interp=rl);
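
A similar smooth can be fit with PROC LOESS in SAS/STAT, if the macro is not at hand. This is a sketch rather than an equivalent of %LOWESS: it uses SQRTWAGE as an example response, and the smoothing parameter of 0.5 is an arbitrary starting value.

```sas
*-- Local regression of the response on Age (SMOOTH= value is an assumption);
proc loess data=pontario;
   model sqrtwage = eage26c / smooth=0.5;
   run;
```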

Men vs. Women

Is there an overall gender gap in wages (ignoring other factors)? Let's make a boxplot of your chosen @RESP@ measure by sex (SEX21, in the SLID).

Again, substitute the name of your chosen transformed response measure (e.g., SQRTWAGE) for @RESP@ in the statements below.

proc sort data=pontario;
   by sex21;

proc boxplot data=pontario;
   plot @RESP@ * sex21 /boxstyle=schematicidfar notches;
   id pupid26c;
   run;
If the notches do not overlap, the two medians differ significantly at approximately the 5% level (ignoring other factors).
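
The visual comparison of notches can be backed up with a formal nonparametric test. The sketch below uses the Wilcoxon rank-sum test with SQRTWAGE as an example response; note that this tests for a shift in location between the two groups, which is related to, but not identical to, the comparison of medians shown by the notches.

```sas
*-- Wilcoxon rank-sum test comparing the two sexes;
proc npar1way data=pontario wilcoxon;
   class sex21;
   var sqrtwage;
   run;
```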

Fitting a model [File: R:\psych\spida\courses\eda\pontfit.sas]

The plots of SQRTWAGE vs. Age and Yrs of Schooling both showed a substantial curvature. In such cases, there are two possibilities:
Since we have already transformed the response (total wages), we will consider the second possibility here.

Fit the model below, which includes both linear and quadratic terms in Age and Years of Schooling. We also include Sex and Visible Minority Status as additional predictors.

Note the use of "|" notation in the MODEL statement: A|A is short for A A*A.

proc glm data=pontario;
   class sex21;
   model sqrtwage = sex21 eage26c|eage26c yrsch18c|yrsch18c vismn15 / solution;
   output out=stats  r=residual p=fitted;
   run;
Not bad, in terms of overall fit: the R-squared is 0.40. Note that the linear term in YRSCH18C is not significant by the Type III tests; however, most analysts would retain lower-order terms whenever a higher-order term is significant.
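
To make the bar notation concrete, here is a sketch of the same model with the quadratic terms written out explicitly. It should reproduce the fit above; the OUTPUT statement is omitted for brevity.

```sas
*-- Same model, with A|A expanded to A A*A;
proc glm data=pontario;
   class sex21;
   model sqrtwage = sex21 eage26c eage26c*eage26c
                    yrsch18c yrsch18c*yrsch18c vismn15 / solution;
   run;
```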

A more complex model?

Perhaps there are important interactions among these predictors. We can use the following 'trick' to test for all possible pairwise interactions among a set of quantitative predictors:

PROC RSREG fits a response surface model, generating all quadratic and cross-product terms. For illustration, we list VISMN15 first on the MODEL statement below, and use the option COVAR=1 to suppress cross-products (interactions) with Visible Minority Status.

Note that SEX21 is binary, so it need not be treated as a CLASS variable.

proc rsreg data=pontario;
   model sqrtwage = vismn15 sex21 eage26c yrsch18c / covar=1;
   id pupid26c;
   run;
The cross-product terms collectively are highly significant, yet the R-squared increases only to 0.41, so the increase in model complexity may not be worthwhile. [Question: Can you think of some interpretation for the significant interaction of Age * Sex on Wages?]
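
To look at the Age * Sex interaction on its own, one option is to add just that term to the earlier PROC GLM model. This is a sketch of one possible follow-up, not part of the original exercise files:

```sas
*-- Add only the Sex * Age interaction to the earlier quadratic model;
proc glm data=pontario;
   class sex21;
   model sqrtwage = sex21 eage26c|eage26c yrsch18c|yrsch18c vismn15
                    sex21*eage26c / solution;
   run;
```

With SEX21 as a CLASS variable, the interaction estimate gives the difference between the sexes in the linear Age slope.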

Simple model diagnostics [File: R:\psych\spida\courses\eda\pontresid.sas]

Here, we will examine the simple residuals from the PROC GLM step. (In the Regression Diagnostics course you will learn about other, better diagnostic measures and visual displays.)

First, let's make a normal probability plot:

%nqplot(data=stats, var=residual);
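
A comparable plot can be produced with PROC UNIVARIATE in base SAS. This is a sketch of an alternative, not the output of %nqplot; the reference line uses the mean and standard deviation estimated from the residuals.

```sas
*-- Normal Q-Q plot of the GLM residuals, reference line from estimated mean and SD;
proc univariate data=stats noprint;
   var residual;
   qqplot residual / normal(mu=est sigma=est);
   run;
```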

What about heteroscedasticity (non-constant variance around the regression surface)?

Here we can divide the observations into groups based on fitted value and test whether the variability of SQRTWAGE varies systematically with the predicted wage. (In the Regression Diagnostics course you will learn about other, better methods for detecting non-constant variance.)

We use the SPRDPLOT macro to create a Spread vs. Level plot.

proc rank data=stats out=grouped groups=10;
   var fitted;
   ranks decile;

%sprdplot(data=grouped, var=sqrtwage, class=decile);
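
The quantities behind a spread vs. level plot can also be computed by hand. The sketch below summarizes each decile group by its median (level) and interquartile range (spread) and takes logs of both; the data set name SPREAD and the variable names MED, IQR, LOGMED and LOGIQR are introduced here for illustration and are not produced by the macro.

```sas
*-- Median and interquartile range of SQRTWAGE within each decile;
proc means data=grouped noprint nway;
   class decile;
   var sqrtwage;
   output out=spread median=med qrange=iqr;
   run;

*-- Log scale, as used in a spread vs. level plot;
data spread;
   set spread;
   if med > 0 and iqr > 0 then do;
      logmed = log10(med);
      logiqr = log10(iqr);
      end;
   run;

proc print data=spread;
   var decile med iqr logmed logiqr;
   run;
```

If LOGIQR rises or falls steadily with LOGMED across the deciles, the spread depends on the level, which suggests non-constant variance.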