Michael Friendly

One idea is to add an elliptical confidence region (*
data ellipse *) around the mean to highlight the joint
relationship. With data for several groups this helps show the
extent to which the relationship is the same for all groups.
(Better than individual regression lines.)

- The 50% ellipse is analogous to the central box in the boxplot.
- The correlation for the total sample may be very different from the within-group correlations.
- The size of each ellipse shows within-group variances on each variable.

( xwhere_{-}x bar )^{T}S^{size}+2 -1 ( x_{-}x bar ) <= chi_{2}^{2}( 1 - a ) ' ' . (3)

%macro CONTOUR( data=_LAST_, /* input data set */ x=, /* X variable */ y=, /* Y variable */ group=, /* Group variable */ pvalue= .5, /* probability value */ out=CONTOUR, /* output data set */ colors=RED GREEN BLUE BLACK);For example, a set of 50% data ellipses for Weight vs. Price of automobiles grouped by region of origin is produced by these statements:

%include contour; %contour(data=auto, x=price, y=weight, group=origin);

One set of ideas involves dividing the data into roughly equal-sized strips according to their ordered x-values. For each strip, we can plot:

- the median and hinge trace lines, e.g., a plot of
*median ( y | x )*vs.*x*. (Example: Draft lottery data) - parallel boxplots, i.e., boxplot of all
*y*values in the strip positioned at the middle*x*-value. (Example: Baseball salaries vs. batting average)

The plot below shows the lottery numbers assigned to each birthday (Day of Year) in the 1970 US Draft Lottery carried out by the Selective Service to determine the order in which elligible men would be drafted into the US Army. It was supposed to be completely random, and looks that way from the scatter plot

SCATTER DRAFT 366 - ** * * 2 * ** * L - *** * * **2 ** 2 2* * * O - *** * * 2 * ** * * *2 * * * T -*** * * * 2* * ** * * * 2* T - * 2*2** * * ** * ** ** ** E 275 o * 2 **** * * 2 ** * * ** * R -* *2* * *2 * * * * * * * ** Y - ** * * * * * *******2 * * -*** **** * **2 ** * * * * * N - 2 * * * ** * * * * 2* * 2 * U 184 * * ** * ** * * 2 *** ** ** M -* * 2 * * * ** * * * 2** 2 B - * * 2 * ** * * **2 ** * 3 E - * ** * 2 * * * * 3 ** 2 * R - * * * ** *** *2* * * * * * * 92 + * **** * * * ** * * * * *2* - * * 2 * * * ** * ** * *2* ** - *2 * * * ** * 2 2 * * * * * - * * 2 *** * * * * * 2 * *2 - * * * ** * **** * * * * * ** * 1 - * * 2* ** * * * -------------------+------------------+------------------- 1 123 244 366 DAY OF YEARThe next slide shows the trace line plot of the medians (M), upper (H), and lower (L) hinges of the lottery numbers against day of the year. It is clear, even with fairly coarse resolution, that the lottery numbers decrease towards the end of the year, indicating a defect in the method used to select these "random numbers".

342 + H H L - H HH HHH H O - HH HHH H HHH HH H H T -H H H HHHHHHHHHH HHH HHHHH HH T -H HHHHH HHHH HH H HHHHH HH H H E 268 + HHHHH H H HH H HH HHH H R - HHHH M HHH H HHHHHH HHHH H Y - HH MMMMMMMM MM HH H HHHH HH - MMM MMMMMMMM MM MMM H H H H H N -M MMM MMMM MMMMMM H M HHHHHHHHH U 194 +MM M L MM MM MMM MMMMM HH H M - MMMM LL MMMMM MM MMMMMM MM H H B - M L L M MMMMM MMMM M MM E - L LLLL MM MM L MMM R -L LLLL LLLLL L LLLLL MMMMMMM 121 + L LLLL L L L LL LL LL LL MM M - L L LLLLL LLLL L LL LL LLL LL - LLLL LLLLL LLLLL LLL LL LLL LLLLLL LL - LL LL LLL LLLLLL LL L LL LL LLLLL LLL - L L LLL LLL L LLLLL L 47 + L LL LL L +-------------------+-------------------+-------------------+ 21 129 237 346 DAY OF YEAR

Another generally useful technique is * robust, locally
weighted regression smoothing *, often called

The robust part works as follows: Once the fitted values, *
y hat _{i} *, have been found, the residuals,

In practice, lowess depends on the choice of a smoothing
parameter, * f , 0 < f <= 1 *, the fraction of the
data points to be considered in the calculation of * y hat _{i}
*. Choosing

**Step 1: Choose window around**, select the*x**r*observations closest to*x*. Call these_{i}*x*. The corresponding_{i1}, x_{i2}, ... , x_{ir}*y*values are*y*. The_{i1}, y_{i2}, ... , y_{ir}**window half width***x*is the distance to the furthest observation.:_{i}h

_{i}= max | x_{i}- x_{ij}|**Step 2: Find weights**

For each*x*in the window of_{ij}*x*, find the weight,_{i}w

where_{ij}= W left "[" x_{ij}- x_{i}over h_{i}right "]"*W ( bullet )*is the**tricube**W ( z ) = left "{" lpile 0 above ( 1 - | z |

^{3})^{3}lpile for | z | >= 1 above for | z | < 1 right " "**Step 3: Fit weighted linear regression**

Use weighted least squares to find coefficients,*a*,_{i}*b*, to minimize_{i}*S to r < w*in_{ij}e sub ij^{2}>y

_{ij}= a_{i}+ b_{i}x_{ij}+ e_{ij}**Step 4: Find fitted value, yhat i**

*y hat*, using the slope and intercept found by weighted least squares. All this gives one smoothed value,_{i}= a_{i}+ b_{i}x_{i}*y hat*for_{i}*x*!_{i}**Step 5: Down-weight outliers**

Calculate residuals,*e*. From these, calculate the_{i}= y_{i}- y hat_{i}**robustness**delta

where_{i}= B left "[" e_{i}over 6 Mdn ( | e | ) right "]"*B ( bullet )*is the**bisquare**B ( z ) = left "{" lpile 0 above ( 1 - z

^{2})^{2}lpile for | z | >= 1 above for | z | < 1 right " "**Step 6: Repeat**

Repeat steps 1 - 4, but use the compound weights,*delta*in the individual regressions, finding new fitted values,_{j}w_{ij}*y hat*. The effect is that outliers receive small weights, even when they are quite near_{i}*x*. Mostly, one robustness step is sufficient; once or twice I've seen cases where three iterations were better than two, however._{i}

The figure below shows the lowess smoothed fit for the draft lottery data. The steady decline in Lottery Number over the last two-thirds of the year is now quite clear. (The plot is over-dramatic, and perhaps misleading, since the vertical scale covers a smaller range than the plots of the raw data and trace lines.)

SMOOTH ~ DRAFT LOWESS }366 SCATTER SMOOTH 213 - 26666663 L o *666666674 366 O -46665 66 T - *65 T - *66 E 189 66* R - 563 Y - 3664 - 366663 N - 3666* U 165 + 56* M - 55 B - 26* E - 54 R - 26* 141 + 54 - 26 - 62 - 44 - 26 117 - 4 -------------------+------------------+------------------- 1 123 244 366 DAY OF YEAR

The LOWESS macro reads the input data set into PROC IML,
calculates the smoothed * y hat * values, and creates an
output data set containing the original and smoothed data. The
macro takes the following parameters:

/*---------------------------------------------------------------* * LOWESS SAS Locally weighted robust scatterplot smoothing * *---------------------------------------------------------------* %macro LOWESS( data=_LAST_, /* name of input data set */ out=SMOOTH, /* name of output data set */ x = X, /* name of independent variable */ y = Y, /* name of Y variable to be smoothed */ id=, /* optional row ID variable */ f = .50, /* lowess window width */ p = 1, /* 1=linear fit, 2=quadratic */ iter=2, /* total number of iterations */ plot=NO, /* draw the plot? */ gplot=NO, /* draw the plot? */ pplot=NO, /* draw a printer plot? */ symbol=circle, htext=1.5, hsym=1.5, name=LOWESS); /* name for graphic catalog entry */e.g.,

%include lowess; %lowess( data=auto, x=weight, y=mpg, f=.3, gplot=YES );

y = f ( x ) + residual = fit + residual

All other things equal, we prefer a "simple" * f
( x ) * like a linear function, to a more complex one. If a
scatterplot of * y * against * x * appears substantially
non-linear, there are two choices:

- Bend the model: Try fitting a quadratic, cubic, or other
polynomial in
*x*instead of a linear model. - Unbend the data: Transform either
*y --> y*, or^{'}*x --> x*(or both), so that the relation between the transformed variables is more nearly linear,^{'}y

^{'}= a + b x^{'}+ residual

Once again, the ladder of powers provides a scheme for
understanding the effect of various power transformations. Visual
examination of the scatterplot of the raw data, possiblly enhanced
with a lowess smoothed curve, can suggest the direction to move on
the ladder for transforming * x *, or * y *, or both.

As a preliminary example, consider the simple case where

y = size -2 1 over 5 xfor^{2}

xor^{T}= x^{2}--> y = size -2 1 over 5 x^{T}

yThe data and the transformed values are:^{T}= sqrt y --> y = sqrt size -2 1 over 5 x^{T}

X Y X} Y.9 0 0 0 0 1 0.2 1 0.4472 2 0.8 4 0.8944 3 1.8 9 1.342 4 3.2 16 1.789 5 5 25 2.236These values are plotted below:

SCAT X, Y SCAT XP, Y 5- f 5- f - - - - - - - f - f - - - f - f - - - f - f - - 0f----f-------------------- 0ff------------------------ 0 X 5 0 X' 25 SCAT X , YP 2.5- - f - - f - - f - f - - f - 0 f------------------------- 0 X 5

The arrow points in the direction to move along the ladder of
powers for * x * or for * y *. That is, if the arrow
points toward smaller values of * x * (or * y *), move
down the ladder of powers, toward * sqrt x * or * log ( x )
* (or * sqrt y *, * log ( y ) *). The greater the
curvature, the further from 1 (raw data) should be the power.

x --> x^{p}

y --> y^{q}

would have, you would have to transform the data and examine the scatterplot of the re-expressed values.

Since the power transformations are order-preserving, we can try different power transformations on a few well chosen points, to see what effect they would have on all the data.

Tukey (1977) suggests finding three * summary points
* which are the median

If the relation is linear, these three points will fall on a
line, or equivalently, the slope of the line from * (x , y )
sub L * to * (x , y ) _{M} * will be the same as the
slope of the line from

Y- - - Y- - - H - - - - - - - H - - - - - - M - - - M - - - L - - - L - - - - - - - - 0+------------------------- 0+------------------------- 0 X 0 XOn the other hand, if the relation is not linear, the lines connecting the pairs of points will have different slopes.

- Sort the
*( x , y )*pairs in increasing order by*x*-value. - Divide the points into thirds, based on ordered
*x*-values. If*n / 3*is not an integer, put the*[ n / 3 ]*points with the smallest*x*-values in the Low group, the*[ n / 3 ]*points with the largest*x*-values in the High group, and the remaining points in the Middle group.This division is modified when equal

*x*-values straddle the dividing line: they are all kept in the same third. Also, neither end portion should cover more than half the range of the*x*-values, which can happen if the distribution of*x*is highly skewed. - Then
*x*is the median of the_{L}*x*s in the L group;*y*is the median_{L}*y*. The summary points*(x*and_{M}, y_{M})*(x*are defined similarly._{H}, y_{H})

This section uses data on literacy rates and gross national
product for 22 nations to show the details of the calculations.
(The plot indicates the relation is highly nonlinear. The arrow
rule indicates we should transform * x * to lower powers or
* y * to higher powers.)

NEPAL 45 5 BURMA 57 47.5 UGANDA 64 27.5 S. VIETNAM 76 17.5 SCAT GNP THAILAND 96 68 100- f f f f HAITI 105 10.5 - f INDONESIA 131 17.5 - f S. KOREA 144 77 L - f GHANA 172 22.5 I - f PERU 179 47.5 T - f f EL SALVADOR 219 39.4 E - f BR.GUIANA 235 74 R - f HONG KONG 272 57.5 A -f f f PANAMA 329 65.7 C - f LEBANON 362 47.5 Y - SINGAPORE 400 50 -f ARGENTINA 490 86.4 -fff ICELAND 572 98.5 - f CZECHOSLOVAK 680 97.5 -f FRANCE 943 96.4 0------------------------------------- NEW ZEALAND 1310 98.5 0 GNP 2000 CANADA 1947 97.5Since the distribution of GNP is so skewed, the range rule suggests allocating only two points to the upper third. The summary points are found to be:

2 RSTAB GNP SUMPTS: SPLIT IS 7 13 2 SUMMARY POINTS L 76 17.5 M 329 65.7 H 1629 98

r = mAs shown earlier, a linear relation implies_{HM}over m_{ML}= ( y_{H}- y_{M}) / ( x_{H}- x_{M}) over ( y_{M}- y_{L}) / ( x_{M}- x_{L}) (4)

For the GNP data, this ratio is quite far from 1:

1 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*1] [Y*1] 1629 98 0.02485 329 65.7 0.1304 0.1905 76 17.5

Since we are using medians, and the ladder of powers is order-preserving (as long as all values are positive), we can transform the summary points to various powers, and the ratio of slopes for the transformed summary points will tell what effect that transformation will have on the whole data set.

If we transform * x --> x ^{p} *, and

rSince we know we must go down the scale of powers for^{<}(p,q) = m_{HM}^{<}(p,q) over m_{ML}^{<}(p,q) = ( y_{H}^{q}- y_{M}^{q}) / ( x_{H}^{p}- x_{M}^{p}) over ( y_{M}^{q}- y_{L}^{q}) / ( x_{M}^{p}- x_{L}^{p}) (5)

0 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*0] [Y*1] 3.212 98 46.49 2.517 65.7 0.6138 75.74 1.881 17.5 H.5 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*H0.5] [Y*1] H0.02478 98 1064 H0.05513 65.7 1.315 809 H0.1147 17.5 H.33 1 SLOPES S SUMMARY HALF SLOPE POINTS SLOPES RATIO [X*H0.33] [Y*1] H0.08711 98 533.3 H0.1477 65.7 1.016 524.9 H0.2395 17.5

RATIOS OF SLOPES : YE [H2 ] [H1 ] [ H.5] [LOG ] [ .5] [RAW ] [ 2 ] X$ [H2 ] .778 2.213 3.575 5.590 8.459 12.394 24.385 [H1 ] .175 .499 .806 1.261 1.908 2.796 5.500 [ H.5] .083 .235 .379 .593 .898 1.315 2.588 [LOG ] .039 .110 .177 .277 .419 .614 1.208 [ .5] .018 .051 .082 .128 .194 .284 .559 [RAW ] .008 .023 .038 .059 .089 .130 .257 [ 2 ] .002 .005 .008 .012 .018 .027 .053

The figure below shows the data and scatterplot when GNP is transformed to the reciprocal cube root.

NEG. CUBE ROOT OF X IS CLOSEST TO BEING STRAIGHT GNP3 ~ GNP GNP3[;1]~GNP[;1] POWER H.333 GNP3 TRANSFORMED X H0.2815 5 H0.2602 47.5 H0.2503 27.5 SCAT GNP3 H0.2364 17.5 100 ff f f - H0.2187 68 f - H0.2123 10.5 f - H0.1972 17.5 L f - H0.1911 77 I f - H0.1801 22.5 T f f - H0.1777 47.5 E f - H0.1662 39.4 R f - H0.1623 74 A f f f - H0.1546 57.5 C f - H0.1451 65.7 Y - H0.1406 47.5 f - H0.136 50 f f f - H0.1271 86.4 f - H0.1207 98.5 f - H0.114 97.5 0------------------------------------ H0.1022 96.4 H0.3 GNP*H1 3 H0.05 H0.09161 98.5 H0.08029 97.5The plot shows no signs of nonlinearity. However, some find "reciprocal cube root" of GNP hard to interpret, and might prefer a slightly curved relationship with

The plot below shows data generated from a mixture of three bivariate normal distributions. It is hard to see that there are three regions of high density.

PLOT OF Y*X LEGEND: A = 1 OBS, B = 2 OBS, ETC. Y | AA A | A BA 50 + A A A A | B A ABACD ABB CAB | A AACBCBAA BA | AA B BAAACACA B | A AA B ABAABA A A A 40 + AA A B D | A AA B | A CA AA | A B BBC BBA AB A | A B BAB E AAAA AA A 30 + A A ADAAC A A A A | A A AB A BA BAAAA | A A AB A A A A A | A A AAA AA | A A A AAA 20 + A AABAAA A A A | AA A BACBB A | A ABA ABB DAA BAA B A | AAA AAB CACAAB A | B AA AF AAB B A 10 + AA AB B | A B | B | | 0 + --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-- -3 0 3 6 9 12 15 18 21 24 27 30 33 36 39 X

To jitter, add a uniformly distributed random quantity, scaled
so that it breaks up the overlap, but does not corrupt the data.
Let * u _{i} = u [ -1, 1 ] * be uniformly distributed from
-1 to 1. Then, to jitter

xwhere^{T}_{i}= x_{i}+ s u_{i}

Then, in the scatter plot of * y * (or * y ^{T}*)
vs.

Previous section Next section.