Michael Friendly

- 4.1 Scatterplot matrix
- 4.2 Glyphs, Stars & Profiles
- 4.3 Biplot
- 4.4 Assessing multivariate normality
- 4.5 Detecting multivariate outliers
- 4.6 Canonical discriminant plots and MANOVA

The next slide shows a scatterplot matrix of four measurements on three species of Iris flowers, the length and width of the sepal, and length and width of the petal. Iris species is reflected in the color and shape of the plotting symbol (which unfortunately do not remain visually distinct when photoreduced).

```sas
%macro SCATMAT(
   data=_LAST_,     /* data set to be plotted           */
   var=_NUMERIC_,   /* variables to be plotted - can be */
                    /* a list or X1-X4 or VARA--VARB    */
   group=,          /* grouping variable (plot symbol)  */
   symbols=%str(- + : $ = X _ Y),
   colors=BLACK RED GREEN BLUE BROWN YELLOW ORANGE PURPLE,
   gcat=GSEG);      /* graphic catalog for plot matrix  */
```

The scatterplot matrix for the Iris data is produced by these statements:

```sas
%include scatmat;
%scatmat(data=iris,
         var=SEPALLEN SEPALWID PETALLEN PETALWID,
         group=spec_no);
```

Star plots differ from glyph plots in that all variables are used to construct the plotted star figure; there is no separation into foreground and background variables. Instead, the star-shaped figures are usually arranged in a rectangular array on the page.

The main use of star plots is in detecting groups of observations that cluster together: similar observations appear as stars of similar shape.

If the variables are measured on different scales, there are two ways the data are typically displayed, neither very satisfactory:

- Plot the raw data profiles. Variables with a large range tend to dominate the display.
- Plot the standard score profiles, i.e., change each variable to *z* scores, $z_{ij} = (x_{ij} - \bar{x}_j) / s_j$. Unless the observations have relatively flat standard score profiles, the result will be hard to see.
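The column standardization behind the second option can be sketched in a few lines of NumPy (the data values here are illustrative, not the crimes data discussed later):

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are two
# variables on very different scales (illustrative values only).
X = np.array([
    [900.0, 12.5],
    [650.0,  8.0],
    [480.0, 15.0],
    [720.0, 10.5],
])

# z_ij = (x_ij - xbar_j) / s_j : center and scale each column
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After standardization every variable has mean 0 and sd 1,
# so no single variable dominates the profile display.
print(z.mean(axis=0), z.std(axis=0, ddof=1))
```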

If the data values are denoted $y_{ij}$, we first allow each variable to be transformed to standard scores,

$$y^{\star}_{ij} = (y_{ij} - \bar{y}_j) / s_j ,$$

where $\bar{y}_j$ and $s_j$ are the mean and standard deviation of variable $j$.

Say the variables are positioned on the horizontal scale at
positions $x_j$, the values we wish to find. Then if
each case has a linear profile, that means that

$$y^{\star}_{ij} = a_i + b_i x_j ,$$

with intercept $a_i$ and slope $b_i$ for case $i$.
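Given a set of variable positions, the intercepts and slopes for each case can be found by ordinary least squares. The sketch below assumes the positions $x_j$ are already known (in the actual method they are themselves estimated to make the profiles as straight as possible); the profile values are illustrative, not the crimes data:

```python
import numpy as np

# Standardized profile values ystar[i, j] for 3 cases, 4 variables
# (illustrative numbers only).
ystar = np.array([
    [-1.2, -0.4,  0.3,  1.1],
    [ 0.9,  0.4, -0.2, -1.0],
    [ 0.1,  0.0,  0.2, -0.1],
])

# Assumed variable positions x_j on the horizontal axis.
x = np.array([0.0, 1.0, 2.0, 3.0])

# Fit ystar_ij ~ a_i + b_i * x_j for each case i by least squares.
design = np.column_stack([np.ones_like(x), x])   # columns [1, x_j]
coef, *_ = np.linalg.lstsq(design, ystar.T, rcond=None)
a, b = coef                                       # intercepts a_i, slopes b_i

# b_i > 0 : profile rises left to right; b_i < 0 : it falls.
print(b)
```

A case with a large positive slope is relatively high on the variables placed at the right; a negative slope means the reverse, which is exactly how the crimes display below is read.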

The linear profiles display of the crimes data is shown in the next slide. Note that property crimes are positioned on the left, crimes of violence on the right. Cities with a positive slope (Dallas, Chicago) have a greater prevalence of violent crime than property crime; for cities with a negative slope (Los Angeles, Denver, Honolulu) the reverse is true.

Overall crime rate for each city is represented by the intercept of each line: cities at the top (Los Angeles, New York, New Orleans) having the most crimes and cities at the bottom having the least.

The biplot is based on the idea that any data matrix, $\mathbf{Y}_{(n \times p)}$, can be represented approximately as the product of a two-column matrix, $\mathbf{A}_{(n \times 2)}$, and a two-row matrix, $\mathbf{B}^{T}_{(2 \times p)}$,

$$\mathbf{Y} = \mathbf{A}\,\mathbf{B}^{T} \tag{13}$$

so that any score $y_{ij}$ is given by the inner product of the row for its observation and the row for its variable:

$$y_{ij} = \mathbf{a}_i^{T}\,\mathbf{b}_j = a_{i1} b_{1j} + a_{i2} b_{2j} \tag{14}$$

Thus, the rows of $\mathbf{A}$ give coordinates for the observations, and the rows of $\mathbf{B}$ give coordinates for the variables.

The approximation used in the biplot is like that in *
principal components analysis *: the two biplot dimensions
account for the greatest possible variance of the original data
matrix. In the biplot display,

- the observations are plotted as points. The configuration of points is essentially the same as scores on the first two principal components.
- the variables are plotted as vectors from the origin. The angles between the vectors represent the correlations among the variables.

```
OBS  TRIBE     SCHOOL  POVERTY  ECONOMIC
 1   SHOSHONE    10.3     29.0      9.08
 2   APACHE       8.9     46.8     10.02
 3   SIOUX       10.2     46.3     10.75
 4   NAVAJOS      5.4     60.2      9.26
 5   HOPIS       11.3     44.7     11.25
```

For the biplot, the data is usually first expressed in terms of deviations from the mean of each variable.

```
Deviations from Means Y

           SCHOOL  POVERTY  ECONOMIC
SHOSHONE     1.08    -16.4    -0.992
APACHE      -0.32      1.4    -0.052
SIOUX        0.98      0.9     0.678
NAVAJOS     -3.82     14.8    -0.812
HOPIS        2.08     -0.7     1.178
```

The row and column dimensions are found from the singular value decomposition of this matrix.

```
Biplot coordinates

                  Dim1     Dim2
OBS SHOSHONE   -3.4579  -0.9040
OBS APACHE      0.3024  -0.0619
OBS SIOUX       0.1567   0.6842
OBS NAVAJOS     3.2110  -0.9216
OBS HOPIS      -0.2122   1.2034
VAR SCHOOL     -0.7305   1.5996
VAR POVERTY     4.6791   0.2435
VAR ECONOMIC    0.0296   0.9841
```

Algebraically, this expresses each score as the product of the observation row and the variable row:

Shoshone-school $= (-3.46)(-0.73) + (-0.904)(1.60) = 1.08$

Geometrically, this corresponds to dropping a perpendicular line from the Shoshone point to the School vector in the biplot.
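A minimal NumPy sketch of the factorization behind equations (13)-(14), applied to the deviations matrix for the tribes data (the symmetric split of the singular values mirrors a SYM-type factorization; signs of the computed coordinates may be reflected relative to the printed ones):

```python
import numpy as np

# Deviations-from-means matrix Y for the tribes example
# (rows: Shoshone, Apache, Sioux, Navajos, Hopis;
#  cols: SCHOOL, POVERTY, ECONOMIC).
Y = np.array([
    [ 1.08, -16.4, -0.992],
    [-0.32,   1.4, -0.052],
    [ 0.98,   0.9,  0.678],
    [-3.82,  14.8, -0.812],
    [ 2.08,  -0.7,  1.178],
])

# Singular value decomposition Y = U diag(s) Vt; split the singular
# values evenly between row and column coordinates.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
A = U[:, :2] * np.sqrt(s[:2])   # observation coordinates (n x 2)
B = Vt[:2].T * np.sqrt(s[:2])   # variable coordinates (p x 2)

# Equation (13): the rank-2 product A B' reproduces Y almost exactly
# here, because the third singular value is tiny for these data.
print(np.abs(Y - A @ B.T).max())
```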

[Printer biplot of the tribes data: Dimension 1 (horizontal, roughly −4 to 6) vs. Dimension 2 (vertical); observation points SHOSHONE, APACHE, SIOUX, NAVAJOS, HOPIS and variable vectors school, poverty, economic.]

As a substantive example of the biplot technique, we consider data on the rates of various crimes (per 100,000 population) in the U.S. states.

The horizontal dimension in the plot is interpreted as *
overall crime rate*. The states at the right (Nevada,
California, NY) are high in crime, those at the left (North Dakota,
South Dakota) are low. The vertical dimension is a contrast
between * property crime vs. violent crime*. The angles
between the variable vectors represent the correlations of the
variables.

```sas
%macro BIPLOT(
   data=_LAST_,    /* Data set for biplot     */
   var=_NUMERIC_,  /* Variables for biplot    */
   id=ID,          /* Observation ID variable */
   dim=2,          /* Number of dimensions    */
   factype=SYM,    /* factor type: GH|SYM|JK  */
   scale=1,        /* Scale factor for vars   */
   out=BIPLOT,     /* Biplot coordinates      */
   anno=BIANNO,    /* Annotate labels         */
   std=MEAN,       /* Standardize: NO|MEAN|STD*/
   pplot=YES);     /* Produce printer plot?   */
```

The number of biplot dimensions is specified by the DIM= parameter and the type of biplot factorization by the FACTYPE= value. The BIPLOT macro constructs two output data sets, identified by the parameters OUT= and ANNO=. The macro will produce a printed plot (if PPLOT=YES), but leaves it to the user to construct a PROC GPLOT plot, since the scaling of the axes should be specified to achieve an appropriate geometry in the plot.

The squared Mahalanobis distance of observation $\mathbf{x}_i$ from the centroid is

$$d_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^{T}\,\mathbf{S}^{-1}\,(\mathbf{x}_i - \bar{\mathbf{x}}) \tag{15}$$

where $\bar{\mathbf{x}}$ is the mean vector and $\mathbf{S}$ is the sample variance-covariance matrix. It is the multivariate analog of the squared standard score,

$$z_i^2 = \frac{(x_i - \bar{x})^2}{s^2}$$

With $p$ variables, $d_i^2$ is distributed approximately as $\chi^2$ with $p$ degrees of freedom, so a plot of the ordered $d_i^2$ values against quantiles of the $\chi^2_p$ distribution should be approximately linear for multivariate normal data.
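A short NumPy sketch of the distance computation in equation (15), on simulated multivariate-normal data (not the auto data used later):

```python
import numpy as np

# Simulated multivariate-normal sample (illustrative only).
rng = np.random.default_rng(0)
p, n = 3, 200
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Equation (15): d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar)
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)            # sample covariance (ddof=1)
diff = X - xbar
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)

# A useful check: with the sample mean and covariance, the d_i^2
# always sum to exactly (n - 1) * p.
print(round(d2.sum(), 6))              # (n - 1) * p = 597.0

# For the chi-square plot, sort d2 and plot it against quantiles of
# chi-square(p), e.g. scipy.stats.chi2.ppf((rank - 0.5) / n, df=p);
# approximate linearity supports multivariate normality.
```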

Unfortunately, like all classical (e.g., least squares)
techniques, the $\chi^2$ plot for multivariate normality
is not resistant to the effects of outliers. A few discrepant
observations not only affect the mean vector, but also inflate the
variance-covariance matrix. Thus, the effect of the few wild
observations is spread through all the $d^2$ values.
Moreover, this tends to decrease the range of the remaining $d^2$ values, making the outliers harder to detect.

One reasonably general solution is to use a *
multivariate trimming * procedure (Gnanadesikan &
Kettenring, 1972; Gnanadesikan, 1977), to calculate squared
distances which are not affected by potential outliers.

This is an iterative process where, on each iteration, some
proportion of the observations with the largest $d^2$
values are temporarily set aside, and the trimmed mean,
$\bar{\mathbf{x}}_{(-)}$, and trimmed variance-covariance matrix,
$\mathbf{S}_{(-)}$, are computed from the remaining observations.
The squared distances are then recalculated as

$$d_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}}_{(-)})^{T}\,\mathbf{S}_{(-)}^{-1}\,(\mathbf{x}_i - \bar{\mathbf{x}}_{(-)}) \tag{16}$$

The effect of trimming is that observations with large distances do not contribute to the calculations for the remaining observations.
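The iteration can be sketched as follows. This is a simplified NumPy version of the idea, not the OUTLIER macro itself: the function name, the fixed trimming fraction, and the simulated data are all illustrative (the macro instead sets aside observations whose $d^2$ exceeds a $\chi^2$ probability cutoff):

```python
import numpy as np

def trimmed_distances(X, passes=2, trim_frac=0.1):
    """Multivariate trimming sketch (after Gnanadesikan & Kettenring,
    1972): on each pass, set aside the observations with the largest
    d^2 and recompute the mean and covariance from the rest."""
    n = len(X)
    keep = np.ones(n, dtype=bool)
    for _ in range(passes):
        xbar = X[keep].mean(axis=0)
        S = np.cov(X[keep], rowvar=False)
        diff = X - xbar
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        # temporarily set aside the trim_frac largest distances
        keep = d2 <= np.quantile(d2, 1 - trim_frac)
    return d2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[0] += 8.0                      # plant one gross outlier
d2 = trimmed_distances(X)
print(d2[0], d2[1:].max())       # the planted outlier stands out
```

Because the second pass uses statistics computed without the outlier, its distance is no longer masked by an inflated covariance matrix.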

```
Observations trimmed in calculating Mahalanobis distance

OBS  MODEL             PASS  CASE      DSQ       PROB
  1  AMC PACER            1     2  22.7827  0.0189645
  2  CAD. SEVILLE         1    14  23.8780  0.0132575
  3  CHEV. CHEVETTE       1    15  23.5344  0.0148453
  4  VW RABBIT DIESEL     1    66  25.3503  0.0080987
  5  VW DASHER            1    68  36.3782  0.0001464
  6  AMC PACER            2     2  34.4366  0.0003068
  7  CAD. SEVILLE         2    14  42.1712  0.0000151
  8  CHEV. CHEVETTE       2    15  36.7623  0.0001263
  9  PLYM. CHAMP          2    52  20.9623  0.0337638
 10  VW RABBIT DIESEL     2    66  44.2961  0.0000064
 11  VW DASHER            2    68  78.5944  0.0000000
```

```sas
%macro OUTLIER(
   data=_LAST_,    /* Data set to analyze          */
   var=_NUMERIC_,  /* input variables              */
   id=,            /* ID variable for observations */
   out=CHIPLOT,    /* Output dataset for plotting  */
   pvalue=.1,      /* Prob < pvalue -> weight=0    */
   passes=2,       /* Number of passes             */
   print=YES);     /* Print OUT= data set?         */
```

By default, two passes (PASSES=) are made through the iterative procedure. A printer plot of DSQ * EXPECTED is produced automatically, and you can use PROC GPLOT afterward with the OUT= data set to give a high-resolution plot with labelled outliers, as shown in the example below.

The outlier procedure is run on the auto data with these statements:

```sas
%include OUTLIER;    * get the macro;
%include AUTO;       * get the data;
data AUTO;
   set AUTO;
   repair = mean(of REP77 REP78);
   if repair=. then delete;
%OUTLIER(data=AUTO,
         var=PRICE MPG REPAIR HROOM RSEAT TRUNK WEIGHT
             LENGTH TURN DISPLA GRATIO,
         pvalue=.05, id=MODEL, out=CHIPLOT, print=NO);
```

The plot is produced from the OUT= data set as shown below:

```sas
data labels;
   set chiplot;
   if prob < .05;          * only label outliers;
   xsys='2'; ysys='2';
   y = dsq;
   function = 'LABEL';
   text = model;
   size = 1.4;
```

```sas
proc gplot data=chiplot;
   plot dsq      * expected = 1
        expected * expected = 2
        / overlay anno=labels vaxis=axis1 haxis=axis2;
   symbol1 f=special v=K h=1.5 i=none c=black;
   symbol2 v=none i=join c=red;
   label dsq     ='Squared Distance'
         expected='Chi-square quantile';
   axis1 label=(a=90 r=0 h=1.5 f=duplex) value=(h=1.3);
   axis2 order=(0 to 30 by 10) label=(h=1.5 f=duplex) value=(h=1.3);
   title h=1.5 'Outlier plot for Auto data';
```

For $p$ variables, $\mathbf{x}^{T} = (x_1, \ldots, x_p)$, canonical discriminant analysis finds sets of weights defining linear combinations of the variables (the canonical variables) which

- are all uncorrelated
- each have the maximum possible univariate *F* statistic for differences among the groups

- Groups widely separated on a given dimension are discriminated well by that linear combination of the variables.
- The contribution of each variable to each canonical dimension can be shown by plotting the variable weights as vectors in the plot.
- A variable aligned with one of the canonical dimensions contributes highly to discrimination among the groups on that canonical variable.
- The length of each variable vector is proportional to its contribution to discrimination through the two canonical variables.
- Angles between the vectors represent the correlations between the variables, in a way similar to the biplot.
- The mean for each group on the canonical variables can also be shown, together with confidence circles.

Canonical discriminant analysis is performed with PROC CANDISC.
Since there are 3 groups, $s = g - 1 = 2$ dimensions
are sufficient to display all possible discrimination among groups,
and the printed output indicates that both dimensions are highly
significant.
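The computation behind PROC CANDISC can be sketched as the eigenvalue problem for $\mathbf{W}^{-1}\mathbf{B}$, where $\mathbf{W}$ is the pooled within-group and $\mathbf{B}$ the between-group SSCP matrix. The NumPy sketch below uses simulated three-group data, not the diabetes data:

```python
import numpy as np

# Three simulated groups in p = 3 variables (illustrative only).
rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, size=(30, 3))
          for m in ([0, 0, 0], [3, 0, 0], [6, 1, 0])]

grand = np.vstack(groups).mean(axis=0)
# Pooled within-group SSCP matrix W.
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
# Between-group SSCP matrix B.
B = sum(len(g) * np.outer(g.mean(axis=0) - grand,
                          g.mean(axis=0) - grand) for g in groups)

# Eigenvalues of W^{-1} B; with g = 3 groups at most
# s = g - 1 = 2 are nonzero.
eigvals = np.sort(np.real(np.linalg.eigvals(np.linalg.inv(W) @ B)))[::-1]
print(np.round(eigvals, 4))
```

The eigenvectors for the two nonzero eigenvalues give the canonical weights; scores on these two linear combinations are what the canonical discriminant plot displays.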

```
Canonical Discriminant Analysis

                     Adjusted     Approx      Squared
      Canonical      Canonical    Standard    Canonical
      Correlation    Correlation  Error       Correlation
  1    0.909389       0.906426    0.014418     0.826988
  2    0.625998       0.618252    0.050677     0.391874

  Test of H0: The canonical correlations in the
  current row and all that follow are zero

      Likelihood Ratio  Approx F  Num DF  Den DF  Pr > F
  1     0.10521315       57.4891      10     276  0.0001
  2     0.60812612       22.3928       4     139  0.0001
```

The canonical discriminant plot below shows that the three groups are ordered on CAN1 from Normal to Overt Diabetic with Chemical Diabetics in the middle. The two glucose measures and the measure SSPG of insulin resistance are most influential in separating the groups along the first canonical dimension. The Chemical Diabetics are apparently higher than the other groups on these three variables. Among the remaining variables, INSTEST is most influential, and contributes mainly to the second canonical dimension. The wide separation of the 99% confidence circles portrays the strength of group differences on the canonical dimensions and the variable vectors help to interpret the meaning of these dimensions.

```sas
title 'Canonical Discriminant analysis plot: Diabetes data';
%include canplot;    *-- unnecessary in PsychLab;
%include diabetes;   *-- %include data(diabetes) in PsychLab;
%canplot(data=diabetes, class=group,
         var=relwt glufast glutest instest sspg,
         scale=3.5);
```

The SCALE parameter sets the scaling of the variable vectors relative to the points in the plot.
