Michael Friendly

- 3.1 Regression model
- 3.2 Residual Plots
- 3.3 Leverage and Influence diagnostics
- 3.4 Partial residual plots
- 3.5 Box-Cox Transformations

y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + b_{3} x_{3} + ... + b_{p} x_{p} + e    (6)

In matrix form, for a sample of *n* observations, this is

y = X b + e    (7)

In addition to the assumption that model (6) correctly
specifies the relation between * y * and the * x _{i} *,
the usual statistical inference requires the following assumptions
on the residual errors:

- Normality
- The residuals have a normal distribution (with mean zero).
- Homogeneity of variance
- The variance of the residuals, *var( e_{i} )*, is a constant value, *s^{2}*.
- Independence
- The residuals associated with distinct observations are uncorrelated.

The usual least squares estimators can be expressed in matrix notation as:

**Regression coefficients**

b hat = ( X^{T} X )^{-1} X^{T} y

**Fitted values**

y hat = X b hat = X ( X^{T} X )^{-1} X^{T} y = H y

where *H* is the *( n x n )* matrix called the **hat matrix** (because it puts the "hat" on *y*).

**Residuals**

e = y - y hat = ( I - H ) y
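These matrix formulas are easy to check numerically. The sketch below is not part of the original notes; it assumes numpy and uses invented data, computing *b hat*, the hat matrix, fitted values, and residuals directly from the expressions above.

```python
import numpy as np

# Made-up data: n = 5 observations, intercept column plus p = 2 predictors
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0],
              [1.0, 5.0, 7.0]])
y = np.array([2.0, 3.0, 6.0, 7.0, 11.0])

# b hat = (X^T X)^{-1} X^T y  (solve is preferred to an explicit inverse)
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, so that y hat = H y
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# Residuals e = (I - H) y
e = (np.eye(len(y)) - H) @ y
```

Because *H* is a projection matrix, it is idempotent ( *H H = H* ) and its trace equals the number of parameters, *p + 1*.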

Plots of the residuals can help to answer the following questions:

- Has the model been correctly specified? Are there important explanatory variables which have been omitted?
- Are the assumptions of the model satisfied?

A better way to plot residuals to assess changes in spread is
to plot the absolute value of the residual, * | e_{i} | *,
against the fitted value, * y hat_{i} *.

One idea is to make a diagnostic plot of
* log_{10} | e_{i}^{*} | *
against * log_{10} y hat_{i} *.

- leverage
- The "horizontal" distance of the *x*-value from the mean, *x bar*. The further from the mean, the more leverage an observation has.
- y-discrepancy
- The vertical distance between *y_{i}* and a regression line that ignores *y_{i}*.

Influence = Leverage × y-Discrepancy

so, to be influential, an observation must have both substantial leverage and a large y-discrepancy.

In general, it can be shown that

h_{i} = 1/n + ( x_{i} - x bar )^{2} / SUM_{k} ( x_{k} - x bar )^{2}

0 <= h_{i} <= 1   and   SUM_{i=1}^{n} h_{i} = p + 1

so the average value of h_{i} is ( p + 1 ) / n.
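As a numerical check (a sketch with invented data, numpy assumed), the leverages from the simple-regression formula match the diagonal of the hat matrix, stay within [0, 1], sum to *p + 1*, and are largest for the point farthest from *x bar*:

```python
import numpy as np

# Made-up x values for a simple regression; 10.0 is far from the mean
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

# h_i = 1/n + (x_i - x bar)^2 / SUM_k (x_k - x bar)^2
d = x - x.mean()
h = 1.0 / n + d**2 / (d**2).sum()

# The same values appear on the diagonal of the hat matrix
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
```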

- the variance of the residuals varies with leverage, *var( e_{i} ) = s^{2} ( 1 - h_{i} )*, and
- outliers on *y* tend to pull the regression line towards themselves.

To correct these difficulties, we commonly use what is called the *studentized residual*,

RSTUDENT ≡ e_{i}^{*} = e_{i} / ( s_{(-i)} sqrt( 1 - h_{i} ) )    (9)

where *s_{(-i)}* is the standard error of estimate computed with observation *i* deleted.
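Equation (9) can be computed literally, by refitting the regression with each observation deleted in turn. This sketch (invented data with one planted outlier; numpy assumed) does exactly that:

```python
import numpy as np

# Made-up data: the last y value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 4.1, 4.9, 9.0])
n = len(x)
X = np.column_stack([np.ones(n), x])
k = X.shape[1]                      # number of parameters, p + 1

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y

# Externally studentized residuals, eq. (9): each s_(-i) comes from a
# regression that omits observation i
t = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    bi = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    ri = y[keep] - X[keep] @ bi
    s_i = np.sqrt(ri @ ri / (n - 1 - k))
    t[i] = e[i] / (s_i * np.sqrt(1 - h[i]))
```

In practice the refits are unnecessary: the same values follow from the internally studentized residuals by a standard identity, which the deletion computation reproduces exactly.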

Two influence measures based on these ideas are commonly used.
Both are functions of the product of leverage and *y*-discrepancy.

- DFFITS
- a scaled measure of the change in the predicted value, *y hat_{i}*, when observation *i* is omitted,

  DFFITS_{i} = ( y hat_{i} - y hat_{(-i)} ) / ( s_{(-i)} sqrt( h_{i} ) ) = ( e_{i} / s_{(-i)} ) × sqrt( h_{i} ) / ( 1 - h_{i} )    (10)

  Belsley et al. (1980) suggest using *| DFFITS_{i} | > 2 sqrt( (p+1)/n )* as a criterion for declaring an influential outlier.
- COOK'S D
- a measure of change in the parameter estimates, *b hat*, when observation *i* is deleted. Let *d_{i} = b hat - b hat_{(-i)}* be the difference in the least squares estimates with and without observation *i*. Then, Cook's D is

  COOKD_{i} = d_{i}^{T} ( X^{T} X ) d_{i} / ( ( p + 1 ) s^{2} ) = ( e_{i}^{2} / ( (p+1) s^{2} ) ) × h_{i} / ( 1 - h_{i} )^{2}    (11)

  Cook's D is approximately *DFFITS^{2} / p*, so a cutoff of *4 / n* might be used to detect influential outliers.
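Both measures can be computed by brute-force deletion and checked against the closed forms in (10) and (11). A sketch (invented data, numpy assumed):

```python
import numpy as np

# Made-up data: n = 20 observations, intercept plus p = 2 predictors
rng = np.random.default_rng(0)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)
H = X @ np.linalg.inv(XtX) @ X.T
h = np.diag(H)
e = y - H @ y
s2 = e @ e / (n - p - 1)

dffits = np.empty(n)
cookd = np.empty(n)
s_minus = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    r_i = y[keep] - X[keep] @ b_i
    s_minus[i] = np.sqrt(r_i @ r_i / (n - 1 - (p + 1)))
    # eq. (10): scaled change in the prediction at point i
    dffits[i] = (X[i] @ b - X[i] @ b_i) / (s_minus[i] * np.sqrt(h[i]))
    # eq. (11): scaled change in the whole coefficient vector
    d = b - b_i
    cookd[i] = d @ XtX @ d / ((p + 1) * s2)
```

The deletion results agree exactly with the one-pass formulas, which is why statistical packages never actually refit *n* regressions.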

- INCOME
- The proportion of males in a given occupational category reporting income of $3500 or more in the 1950 US Census.
- EDUC
- The proportion of males in each occupation with at least a high school education in that census.
- PRESTIGE
- A survey of almost 3000 people by the National Opinion Research Center obtained ratings on a 5-point scale of the "general standing" of someone engaged in each occupation. Duncan's prestige measure was the percent of people giving a rating of "good" or "excellent".

The results of a regression analysis predicting prestige from income and education are shown below. From these results, it would appear that

- the goodness of fit of the model is very good ( *R^{2} = .83* ).
- income and education have approximately equal weights in predicting occupational prestige.

    Duncan Occupational Prestige Data

    DEP VARIABLE: PRESTIGE   Prestige

    ROOT MSE    13.36903     R-SQUARE   0.8282
    DEP MEAN    47.68889     ADJ R-SQ   0.8200

                        PARAMETER ESTIMATES

                  PARAMETER      STANDARD    T FOR H0:
    VARIABLE DF    ESTIMATE         ERROR  PARAMETER=0  PROB > |T|
    INTERCEP  1   -6.06466292  4.27194117       -1.420      0.1631
    INCOME    1    0.59873282  0.11966735        5.003      0.0001
    EDUC      1    0.54583391  0.09825264        5.555      0.0001

However, plots of the influence measures for each case indicate that several observations have a great deal of influence on this conclusion. One way to understand the separate effects of discrepancy and leverage is to make a plot of RSTUDENT against HATVALUE (leverage).

The plot below identifies individual observations which are
either high leverage (* h_{i} > 2 (p+1)/n = .133 *) or
have large studentized residuals (|RSTUDENT| > 2.02, the
.975 critical value of *t* with *n - p - 2 = 41* df).

[Printer plot: studentized residual (RSTUDENT) against leverage (H), for h = .02 to .28. Labelled points: Minister, Contractor, RR Engineer, RR Conductor, Reporter.]

- large residual, low leverage
- Reporter, Contractor
- small residual, high leverage
- RR Engineer
- large residual, high leverage
- Minister, RR Conductor

    %include inflplot;
    %inflplot(data=duncan, y=Prestige, x=Income Educ, id=Job,
              bsize=14, color=red);

The partial regression plot for variable * x_{k} * is
defined as follows. Let

X[k] = [ 1 , x_{1} , x_{2} , ... , x_{k-1} , x_{k+1} , ... , x_{p} ]

denote the model matrix with x_{k} omitted, and let

y_{k}^{*} = y - y hat_{X[k]}

be the residuals of *y* regressed on the remaining predictors. Similarly, let

x_{k}^{*} = x_{k} - x hat_{X[k]}

be the residuals of x_{k} regressed on the remaining predictors. The partial regression plot is the plot of y_{k}^{*} against x_{k}^{*}. It has these properties:

- The slope of *y_{k}^{*}* on *x_{k}^{*}* is equal to *b_{k}*, the estimate of the (partial) regression coefficient in the full model. (The simple scatter plot of *y* against *x_{k}* does not necessarily show the correct partial slope.)
- The residuals from the regression line in this plot are identically the residuals for *y* in the full model.
- The simple correlation between *y_{k}^{*}* and *x_{k}^{*}* is equal to the partial correlation between *y* and *x_{k}* with the other *x* variables partialled out or controlled. Visually, this correlation indicates how good the estimate of *b_{k}* is.
- When the relation between *y* and *x_{k}* is nonlinear, the partial regression plot shows both the linear and nonlinear components of the relation, controlling for the other predictors. This helps determine if a term in *x_{k}^{2}* or some other function of *x_{k}* needs to be added to the model.
- The values *x_{ik}^{*2}* are proportional to the partial leverage added to *h_{i}* by the addition of *x_{k}* to the regression. High leverage observations on *x_{k}* are those with extreme values of *x_{k}^{*}*.

From these plots, it was decided to eliminate these three cases and repeat the regression analysis. The regression results are shown below.

    Duncan Occupational Prestige Data

    DEP VARIABLE: PRESTIGE   Prestige

    ROOT MSE    11.49227     R-SQUARE   0.8762
    DEP MEAN    46.52381     ADJ R-SQ   0.8699

                        PARAMETER ESTIMATES

                  PARAMETER      STANDARD    T FOR H0:
    VARIABLE DF    ESTIMATE         ERROR  PARAMETER=0  PROB > |T|
    INTERCEP  1   -6.31736312  3.67962246       -1.717      0.0939
    INCOME    1    0.93066285  0.15374847        6.053      0.0001
    EDUC      1    0.28464102  0.12136228        2.345      0.0242

The prediction of occupational prestige has improved somewhat ( *R^{2} = .88* ), and the estimated effect of income is now considerably greater than that of education.

The PARTIAL option of PROC REG produces partial regression plots, one for each predictor given in the MODEL statement (plus one for the intercept). In these plots, points are identified by the value of the (character or numeric) variable given in the ID statement.

For the Duncan occupational prestige data, the partial regression plots were produced by the statements,

    proc reg data=duncan;
       model prestige=income educ / influence partial r;
       output out=diagnose p=yhat r=residual rstudent=rstudent
              h=hatvalue cookd=cookd dffits=dffits;
       id case;

The plot for INCOME in Figure 5 shows that three observations (6, 16 and 27) have great influence on the regression coefficient for income. Without these three observations, the partial effect (slope) of income on prestige would be a good deal greater.

[Printer plot: partial regression residuals of Prestige against those of Income; observations 6, 16, and 27 stand apart.]

Figure 5: Partial regression plot for income

    %macro PARTIAL(
        data = _LAST_,   /* name of input data set              */
        yvar =,          /* name of dependent variable          */
        xvar =,          /* list of independent variables       */
        id =,            /* ID variable                         */
        label=INFL,      /* label ALL, NONE, or INFLuential obs */
        out =,           /* output data set: partial residuals  */
        htext=1.5,       /* height of text labels in the plots  */
        plots=&xvar,     /* which partial plots to produce      */
        gout=gseg,       /* name of graphic catalog             */
        name=PARTIAL);   /* name of graphic catalog entries     */

An example is shown in Figure 6. This plot, with all points labelled with CASE number, is produced by the statements,

    %include partial;
    %partial(data=duncan, yvar=prestige, xvar=educ income,
             id=case, label=ALL);

The Box-Cox method considers the family of power transformations

y^{( l )} = { ( y^{l} - 1 ) / l ,   l != 0
            { log y ,               l = 0        (12)

together with the general linear model, y^{( l )} = X b + e.

Box and Cox proposed a maximum likelihood procedure to
estimate the power (* l hat *), the regression
coefficients (* b hat *), and the error variance simultaneously
by maximizing the likelihood,

L ( l , b , s_{e}^{2} ) = Prob ( data | model )

This gives a statistical test of whether a transformation is necessary as the test of * l = 1 *, using the likelihood ratio statistic

G^{2} = - 2 log [ L ( b , s_{e}^{2} | l = 1 ) / L ( b , s_{e}^{2} | l = l hat ) ]  ~  chi^{2}_{(1)}

It turns out that * log L ( l ) * is a decreasing function of * sqrt MSE *, the root mean square error in the regression using * y^{( l )} * as the criterion, so maximizing log L is equivalent to finding the value of * l * which minimizes the MSE.

Hence, a simple (but computationally intensive) method is to fit the model * y^{( l )} = X b + e * for a range of values of * l *, and choose the value * l hat * which minimizes the MSE. An approximate 95% confidence interval for * l * consists of those values for which

log L ( l ) > log L ( l hat ) - ( 3.84 ) / 2

where 3.84 is the .95 percentile of the chi square distribution with 1 df. As an added benefit, one can plot the values of the partial *t*-statistics for each predictor against * l *, to see how the evidence for each predictor changes with the transformation.
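The grid-search idea can be sketched in a few lines of Python (numpy assumed; the data are invented, with a log-linear relation so that * l hat * should land near 0; the *( l - 1 ) SUM log y* Jacobian term is included so likelihoods for different powers are comparable):

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform of eq. (12); lam near 0 is treated as the log."""
    if abs(lam) < 1e-8:
        return np.log(y)
    return (y**lam - 1.0) / lam

# Made-up positive response whose log is linear in x
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=40)
y = np.exp(0.3 * x + rng.normal(scale=0.2, size=40))
X = np.column_stack([np.ones_like(x), x])
n = len(y)

lambdas = np.linspace(-2.0, 2.0, 81)
loglik = np.empty_like(lambdas)
for j, lam in enumerate(lambdas):
    z = boxcox(y, lam)
    e = z - X @ np.linalg.solve(X.T @ X, X.T @ z)
    # Profile log likelihood: -n/2 log(SSE/n) plus the Jacobian term
    loglik[j] = -n / 2.0 * np.log(e @ e / n) + (lam - 1.0) * np.log(y).sum()

lam_hat = lambdas[loglik.argmax()]
# Approximate 95% interval: all l with log L(l) > log L(l hat) - 3.84/2
ci = lambdas[loglik > loglik.max() - 3.84 / 2.0]
```

Plotting `loglik` against `lambdas` gives the usual profile likelihood curve, with the interval read off where the curve drops 1.92 below its maximum.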

For the auto data, the previous plots were constructed with the following statements:

    %include data(auto);
    %include boxcox;   *-- unnecessary in PsychLab;
    %boxcox(data=auto, resp=MPG, model=Weight Displa Gratio,
            id=model, out=coxout,
            gplot=RMSE EFFECT INFL, pplot=RMSE EFFECT INFL,
            lopower=-2.5, hipower=2.5, conf=.99);

The OUT= parameter specifies the name of the dataset to contain the transformed response.
