Working Experience in the SLID

by Gigi Luk

This study investigates how an individual's duration of working experience is related to several predictor variables. The analysis is restricted to Ontario. Most of the variables measure an individual's personal characteristics (age, sex, wages and salaries, and total years of schooling). One variable, the number of FY/FT workers in the family, examines the impact of family members' labour activity on an individual's working experience.

As in the previous exercise, you can type the commands, copy and paste them from the web page, or simply open the file from the File menu.

Locating the data

The SAS data set SLID.PWORKEX1 was created in the SLID library, containing a subset of the variables relating to work experience and the 7425 observations from Ontario.

The statements below were used to create SLID.PWORKEX1 from the original SLID person file, SLID.LP9394nm. Note that we used a subsetting IF statement to restrict the data to respondents from Ontario (for the purposes of this workshop).

You do not need to run this step -- the dataset SLID.PWORKEX1 was already created for you.


data slid.pworkex1;                  
   set slid.lp9394nm       (keep=
      pupid26c        /* Random person ID 1994 */
      elgw26c         /* Ext longitudinal weight */

      yrxft11c        /* Years of work experience 1994 */
      eage26c         /* Ext person's age 1994 */
      sex21           /* Sex */
      regre25c        /* Region 1994 */
      yrsch18c        /* Total yrs of schooling 1994 */
      nfyft27c        /* FY/FT workers in family 1994 */
      ttwgs28c        /* Wages and salaries all job 1994 */
      );
   if regre25c=3;     /* retain Ontario subjects only */
   id=put(pupid26c,8.);   /* make a character ID variable */
run;
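
If you would like to confirm that the subsetting IF worked as intended, a quick frequency table of the region variable is one way to check. This is only a sketch and not part of pworkex2.sas; it should show a single region code (3) covering all 7425 observations.

proc freq data=slid.pworkex1;
   tables regre25c;     /* expect one row: region 3 (Ontario) */
run;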

Examining the variables [File: R:\psych\spida\courses\eda\pworkex2.sas]

The file pworkex2.sas includes all the rest of the SAS commands that you are going to use in this exercise. The notes below explain the purpose of each step.

Variables at a glance

Take a look at the names of the variables. The POSITION option lists them in the order in which they are stored in the data set.

Title 'SLID: Working Experience 1994';    /* Add a title */
proc contents data=slid.pworkex1 position;
run;

Summary statistics

Before analyzing the data, let's take a look at brief summary statistics for the variables. In the VAR statement, you can either use the short form (eage26c -- yrsch18c), as below, or type out every variable that you are interested in. The short form selects all variables stored between the two named ones, so make sure the two names are given in their positional order in the variable list.

proc means data=slid.pworkex1 n min max mean std skew maxdec=2;
   var eage26c -- yrsch18c;
   run;
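
For comparison, here is a sketch of the typed-out alternative. The particular variables listed are just an illustrative choice from the data set, not a list taken from pworkex2.sas; unlike the short form, an explicit list can be given in any order.

proc means data=slid.pworkex1 n min max mean std skew maxdec=2;
   var yrxft11c eage26c yrsch18c ttwgs28c nfyft27c;   /* explicit list, any order */
run;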

Graphical summary statistics

The DATACHK macro shows brief graphical statistics for all the variables.

%datachk(data=slid.pworkex1, var=eage26c -- yrsch18c, ls=90);
run;

Transformation of response variable

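The SYMBOX macro helps you choose a power transformation of the response variable that minimizes skewness. It draws a box plot for each power listed; choose the power whose box plot looks most symmetric. By the usual ladder-of-powers convention, power 0 stands for the log transformation.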

%symbox(data=slid.pworkex1, var=yrxft11c, powers=0 0.5 1 1.5);
run;

What power seems to make the distribution most symmetric? After choosing a suitable power, you need to create a new data set that includes the transformed response variable.

We use SQRTWEX, the square root of Years of Work Experience.

data pworkex2;
   set slid.pworkex1;
   sqrtwex = sqrt(yrxft11c);
   label sqrtwex = 'sqrt(Working experience 94)';
   run;
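
As a quick numerical check on the choice of power, you could compare the skewness of the original and transformed responses. This is a sketch using only the variables defined above; it is not part of pworkex2.sas.

proc means data=pworkex2 n mean std skew maxdec=2;
   var yrxft11c sqrtwex;     /* skewness nearer 0 means a more symmetric distribution */
run;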

The BOXGLM macro can also identify a power transformation for the response, this time chosen to minimize the MSE of a specified model.

%boxglm(data=slid.pworkex1,
        resp=yrxft11c,
        model=eage26c sex21 yrsch18c ttwgs28c nfyft27c,
        id=pupid26c,
        gplot=RMSE EFFECT,
        pplot=RMSE EFFECT,
        lopower=-1, add=0.5);
run;

The response variable, yrxft11c (years of work experience), contains some zero values, for which the log and negative-power transformations are undefined. We therefore use ADD=0.5 to add 0.5 to each response before transforming. (The SYMBOX macro simply ignores non-positive values.)
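
To see how many observations are affected, you could count the zero responses directly. This is a sketch, not part of pworkex2.sas; the N reported is the number of respondents with zero years of work experience.

proc means data=slid.pworkex1 n;
   where yrxft11c = 0;       /* keep only the zero responses */
   var yrxft11c;
run;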

How do these results compare with what we saw from the SYMBOX macro?

Relation between the response and age, wage and yrs of schooling

The LOWESS macro plots the response variable against one predictor variable at a time. In each plot, the straight line is the linear regression line and the curve is the lowess smooth. Note that the original data set, with the untransformed response, is used here.

Try one or more of the following:

%lowess(data=slid.pworkex1, y=yrxft11c, x=eage26c,  hsym=0.5, interp=rl);
%lowess(data=slid.pworkex1, y=yrxft11c, x=ttwgs28c, hsym=0.5, interp=rl);
%lowess(data=slid.pworkex1, y=yrxft11c, x=yrsch18c, hsym=0.5, interp=rl);
run;

Do the relations seem reasonably linear?

Fitting a model

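PROC GLM fits a model that combines ANOVA and regression terms. SEX21 is treated as a classification variable (it appears in the CLASS statement), whereas the other predictors are quantitative regressors. The bar notation, as in eage26c|eage26c, expands to a linear and a squared term (eage26c and eage26c*eage26c), so age, wages and schooling each enter the model as quadratics. The OUTPUT statement saves the results in the data set pworkex3, with the predicted values named predict1 and the residuals named resid1.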

proc glm data=pworkex2;
   class sex21;
   model sqrtwex = eage26c|eage26c sex21 
                   ttwgs28c|ttwgs28c
                   yrsch18c|yrsch18c
                   nfyft27c / solution;
   output out=pworkex3 predicted=predict1 residual=resid1;
run;
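
One common use of the output data set is a plot of residuals against predicted values, to look for curvature or non-constant variance. The step below is a sketch, not part of pworkex2.sas, and assumes SAS/GRAPH is available.

proc gplot data=pworkex3;
   plot resid1 * predict1 / vref=0;    /* reference line at residual = 0 */
run; quit;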

Correlation matrix

PROC CORR prints a numeric matrix of the correlations among all the listed variables, including the response variable. You may also find some interesting correlations among the predictor variables themselves.

proc corr data=pworkex2 outp=corr;
   var sqrtwex eage26c sex21 ttwgs28c yrsch18c nfyft27c;
   run;

The OUTP=CORR option produces an output data set containing the correlation matrix. The CORRGRAM macro gives a visual summary of a correlation matrix, with variables permuted to put similar variables together.

goptions hsize=6in vsize=6in;
%corrgram(data=corr, var=sqrtwex eage26c sex21 ttwgs28c yrsch18c nfyft27c);
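
To see the numbers behind the corrgram, you could print the correlation rows of the OUTP= data set. This is a sketch, not part of pworkex2.sas. The data set also holds rows for the means, standard deviations and counts, which the WHERE statement filters out.

proc print data=corr;
   where _type_ = 'CORR';    /* keep only the correlation rows */
run;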