The next slide shows a scatterplot matrix of four measurements on three species of Iris flowers: sepal length, sepal width, petal length, and petal width. Iris species is reflected in the color and shape of the plotting symbol (which, unfortunately, do not remain visually distinct when photoreduced).
%macro SCATMAT(
data =_LAST_, /* data set to be plotted */
var =_NUMERIC_, /* variables to be plotted - can be */
/* a list or X1-X4 or VARA--VARB */
group=, /* grouping variable (plot symbol) */
symbols=%str(- + : $ = X _ Y),
colors=BLACK RED GREEN BLUE BROWN YELLOW ORANGE PURPLE,
gcat=GSEG); /* graphic catalog for plot matrix */
The scatterplot matrix for the Iris data is produced by these
statements:
%include scatmat;
%scatmat(data=iris,
var=SEPALLEN SEPALWID PETALLEN PETALWID,
group=spec_no);
Star plots differ from glyph plots in that all variables are used to construct the plotted star figure; there is no separation into foreground and background variables. Instead, the star-shaped figures are usually arranged in a rectangular array on the page.
The main use of star plots is in detecting groups of observations that cluster together: such observations appear as stars with similar shape.
If the variables are measured on different scales, there are two ways the data are typically displayed, neither very satisfactory:
If the data values are denoted y_ij, each variable is first transformed to standard scores,

    y*_ij = (y_ij - ybar_j) / s_j

where ybar_j is the mean of variable j and s_j is its standard deviation. This allows us to equate variables with very different ranges, such as the crime variables.
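The transformation can be sketched in a few lines of NumPy (an illustration only; the document's own computations are done in SAS, and the data values here are hypothetical):

```python
import numpy as np

# Two hypothetical variables on very different scales
Y = np.array([[2.0,  400.0],
              [4.0,  800.0],
              [6.0, 1200.0]])

# Standard scores: subtract each column mean, divide by its standard deviation
Ystar = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
# Each column of Ystar now has mean 0 and standard deviation 1
```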
Suppose the variables are positioned on the horizontal scale at positions x_j, the values we wish to find. If each case has a linear profile, then

    y*_ij = a_i + b_i x_j

with intercept a_i and slope b_i. Hartigan (1975) shows that the solution can be expressed in terms of the first two eigenvectors of the correlation matrix C.
The linear profiles display of the crimes data is shown in the next slide. Note that property crimes are positioned on the left, crimes of violence on the right. Cities with a positive slope (Dallas, Chicago) have a greater prevalence of violent crime than property crime; for cities with a negative slope (Los Angeles, Denver, Honolulu) the reverse is true.
Overall crime rate for each city is represented by the intercept of each line: cities at the top (Los Angeles, New York, New Orleans) having the most crimes and cities at the bottom having the least.
The biplot is based on the idea that any data matrix, Y (n x p), can be represented approximately as the product of a two-column matrix, A (n x 2), and a two-row matrix, B' (2 x p),

    Y = A B'                                      (13)

so that any score y_ij of Y is approximately the inner product of row i of A and column j of B',

    y_ij = a_i' b_j = a_i1 b_1j + a_i2 b_2j       (14)

Thus, the rows of A represent the observations in a two-dimensional space, and the columns of B' represent the variables in the same space.
The approximation used in the biplot is like that in principal components analysis: the two biplot dimensions account for the greatest possible variance of the original data matrix. In the biplot display, observations are shown as points and variables as vectors from the origin.
OBS TRIBE SCHOOL POVERTY ECONOMIC
1 SHOSHONE 10.3 29.0 9.08
2 APACHE 8.9 46.8 10.02
3 SIOUX 10.2 46.3 10.75
4 NAVAJOS 5.4 60.2 9.26
5 HOPIS 11.3 44.7 11.25
For the biplot, the data are usually first expressed as deviations from the mean of each variable.
Deviations from Means
Y SCHOOL POVERTY ECONOMIC
SHOSHONE 1.08 -16.4 -0.992
APACHE -0.32 1.4 -0.052
SIOUX 0.98 0.9 0.678
NAVAJOS -3.82 14.8 -0.812
HOPIS 2.08 -0.7 1.178
The row and column dimensions are found from the singular
value decomposition of this matrix. (In this case, the
fit is perfect, since there are only two linearly independent
variables.)
Biplot coordinates
Dim1 Dim2
OBS SHOSHONE -3.4579 -0.9040
OBS APACHE 0.3024 -0.0619
OBS SIOUX 0.1567 0.6842
OBS NAVAJOS 3.2110 -0.9216
OBS HOPIS -0.2122 1.2034
VAR SCHOOL -0.7305 1.5996
VAR POVERTY 4.6791 0.2435
VAR ECONOMIC 0.0296 0.9841
Algebraically, this expresses each score as the product of the
observation row and the variable row:
Shoshone-school = -3.46*-0.73 + (-.904)*1.6
= 1.08
Geometrically, this corresponds to dropping a perpendicular line
from the Shoshone point to the School vector in the biplot.
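The factorization itself can be sketched with a singular value decomposition. The following NumPy fragment (an illustration, not the BIPLOT macro) uses the deviation matrix above and the symmetric split, in which each side receives the square root of the singular values; since the data have only two linearly independent variables, the rank-2 reconstruction is exact:

```python
import numpy as np

# Deviations from column means (SCHOOL, POVERTY, ECONOMIC)
Y = np.array([
    [ 1.08, -16.4, -0.992],   # SHOSHONE
    [-0.32,   1.4, -0.052],   # APACHE
    [ 0.98,   0.9,  0.678],   # SIOUX
    [-3.82,  14.8, -0.812],   # NAVAJOS
    [ 2.08,  -0.7,  1.178],   # HOPIS
])

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
A = U[:, :2] * np.sqrt(d[:2])     # observation coordinates (n x 2)
B = Vt[:2].T * np.sqrt(d[:2])     # variable coordinates    (p x 2)

# Rank-2 reconstruction recovers Y exactly here
print(np.allclose(A @ B.T, Y))    # prints True
```

Up to reflection of the axes (the signs of singular vectors are arbitrary), these coordinates should agree with the biplot coordinates printed above, assuming the symmetric (SYM) factorization was used.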
D 2 +
i |
m |
e |
n | school
s 1.5 +
i |
o |
n | HOPIS
|
2 1 + economic
|
|
| SIOUX
|
0.5 +
|
|
| poverty
|
0 + +
| APACHE
|
|
|
-0.5 +
|
|
|
|SHOSHONE NAVAJOS
-1 +
|
|
|
|
-1.5 +
+---------+---------+---------+---------+---------+
-4 -2 0 2 4 6
Dimension 1
As a substantive example of the biplot technique, we consider data on the rates of various crimes (per 100,000 population) in the U.S. states.
The horizontal dimension in the plot is interpreted as overall crime rate. The states at the right (Nevada, California, New York) are high in crime; those at the left (North Dakota, South Dakota) are low. The vertical dimension is a contrast between property and violent crime. The angles between the variable vectors represent the correlations of the variables.
%macro BIPLOT(
data=_LAST_, /* Data set for biplot */
var =_NUMERIC_, /* Variables for biplot */
id =ID, /* Observation ID variable */
dim =2, /* Number of dimensions */
factype=SYM, /* Factor type: GH|SYM|JK */
scale=1, /* Scale factor for vars */
out =BIPLOT, /* Biplot coordinates */
anno=BIANNO, /* Annotate labels */
std=MEAN, /* Standardize: NO|MEAN|STD */
pplot=YES); /* Produce printer plot? */
The number of biplot dimensions is specified by the DIM= parameter and the type of biplot factorization by the FACTYPE= value. The BIPLOT macro constructs two output data sets, identified by the parameters OUT= and ANNO=. The macro will produce a printed plot (if PPLOT=YES), but leaves it to the user to construct a PROC GPLOT plot, since the scaling of the axes should be specified to achieve an appropriate geometry in the plot.
    d_i^2 = (x_i - xbar)' S^{-1} (x_i - xbar)     (15)

where S is the p x p sample variance-covariance matrix. This is the multivariate analog of the square of the standard score for a single variable,

    z_i^2 = (x_i - xbar)^2 / s^2

With p variables, d_i^2 is distributed approximately as chi-square with p degrees of freedom for large samples from the multivariate normal distribution. Therefore, a Q-Q plot of the ordered distance values, d_(i)^2, against the corresponding quantiles of the chi-square(p) distribution should yield a straight line through the origin for multivariate normal data.
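The coordinates of such a chi-square Q-Q plot can be sketched as follows (a NumPy/SciPy illustration with simulated data, not the SAS macro):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Squared Mahalanobis distances, as in equation (15)
xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - xbar
d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)

# Q-Q plot coordinates: ordered d2 against chi-square(p) quantiles
q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
d2_sorted = np.sort(d2)
# For multivariate normal data the pairs (q, d2_sorted) fall near the line y = x
```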
Unfortunately, like all classical (e.g., least squares) techniques, the chi-square plot for multivariate normality is not resistant to the effects of outliers. A few discrepant observations not only affect the mean vector, but also inflate the variance-covariance matrix. Thus, the effect of a few wild observations is spread through all the d^2 values; moreover, this tends to decrease the range of the d^2 values, making it harder to detect extreme ones.
One reasonably general solution is to use a multivariate trimming procedure (Gnanadesikan & Kettenring, 1972; Gnanadesikan, 1977), to calculate squared distances which are not affected by potential outliers.
This is an iterative process in which, on each iteration, some proportion of the observations with the largest d^2 values is temporarily set aside, and the trimmed mean, xbar_(-), and trimmed variance-covariance matrix, S_(-), are computed from the remaining observations. Then new d^2 values are computed from

    d_i^2 = (x_i - xbar_(-))' S_(-)^{-1} (x_i - xbar_(-))     (16)

The effect of trimming is that observations with large distances do not contribute to the calculations for the remaining observations.
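The trimming iteration can be sketched as follows (a NumPy/SciPy illustration of the idea, not the OUTLIER macro itself; the pvalue and passes arguments mirror the macro's parameters):

```python
import numpy as np
from scipy import stats

def trimmed_distances(X, pvalue=0.1, passes=2):
    """Squared Mahalanobis distances after multivariate trimming:
    observations beyond the chi-square(p) critical value are set aside
    before the mean and covariance are recomputed on the next pass."""
    n, p = X.shape
    keep = np.ones(n, dtype=bool)
    crit = stats.chi2.ppf(1 - pvalue, df=p)
    for _ in range(passes):
        kept = X[keep]
        xbar = kept.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(kept, rowvar=False))
        diff = X - xbar
        d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)
        keep = d2 <= crit          # trim large distances for the next pass
    return d2

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[0] = [8.0, 8.0, 8.0]             # plant one gross outlier
d2 = trimmed_distances(X)          # the outlier's distance stays extreme
```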
Observations trimmed in calculating Mahalanobis distance
OBS MODEL PASS CASE DSQ PROB
1 AMC PACER 1 2 22.7827 0.0189645
2 CAD. SEVILLE 1 14 23.8780 0.0132575
3 CHEV. CHEVETTE 1 15 23.5344 0.0148453
4 VW RABBIT DIESEL 1 66 25.3503 0.0080987
5 VW DASHER 1 68 36.3782 0.0001464
6 AMC PACER 2 2 34.4366 0.0003068
7 CAD. SEVILLE 2 14 42.1712 0.0000151
8 CHEV. CHEVETTE 2 15 36.7623 0.0001263
9 PLYM. CHAMP 2 52 20.9623 0.0337638
10 VW RABBIT DIESEL 2 66 44.2961 0.0000064
11 VW DASHER 2 68 78.5944 0.0000000
%macro OUTLIER(
data=_LAST_, /* Data set to analyze */
var=_NUMERIC_, /* input variables */
id=, /* ID variable for observations */
out=CHIPLOT, /* Output dataset for plotting */
pvalue=.1, /* Prob < pvalue -> weight=0 */
passes=2, /* Number of passes */
print=YES); /* Print OUT= data set? */
By default, two passes (PASSES=2) are made through the iterative
procedure. A printer plot of DSQ * EXPECTED is produced
automatically, and you can use PROC GPLOT afterward with the OUT=
data set to give a high-resolution plot with labelled outliers, as
shown in the example below.
The outlier procedure is run on the auto data with these statements:
%include OUTLIER;    * get the macro;
%include AUTO;       * get the data;
data AUTO;
   set AUTO;
   repair = mean(of REP77 REP78);
   if repair=. then delete;
%OUTLIER(data=AUTO,
   var=PRICE MPG REPAIR HROOM RSEAT TRUNK WEIGHT
       LENGTH TURN DISPLA GRATIO,
   pvalue=.05, id=MODEL, out=CHIPLOT, print=NO);
The plot is produced from the OUT= data set as shown below:
data labels;
   set chiplot;
   if prob < .05;        * only label outliers;
   xsys='2'; ysys='2';
   x = expected;         * horizontal label position (required by Annotate);
   y = dsq;
   function = 'LABEL';
   text = model;
   size = 1.4;
proc gplot data=chiplot ;
plot dsq * expected = 1
expected * expected = 2
/ overlay anno=labels vaxis=axis1 haxis=axis2;
symbol1 f=special v=K h=1.5 i=none c=black;
symbol2 v=none i=join c=red;
label dsq ='Squared Distance'
expected='Chi-square quantile';
axis1 label=(a=90 r=0 h=1.5 f=duplex) value=(h=1.3);
axis2 order=(0 to 30 by 10) label=(h=1.5 f=duplex) value=(h=1.3);
title h=1.5 'Outlier plot for Auto data';
For p variables, x' = (x_1, ..., x_p), a set of weights, c = (c_1, ..., c_p), is found so as to maximize the F statistic for the derived canonical variable, CAN1 = sum c_i x_i. That is, the canonical weights indicate the linear combination of the variables which best discriminates among groups, in the sense of having the greatest univariate F. When there are more than two groups, it is possible to find additional linear combinations, CAN2, CAN3, ..., CANs (where s is the smaller of p and g - 1), which successively maximize the univariate F among linear combinations uncorrelated with those already found.
Canonical discriminant analysis is performed with PROC CANDISC. Since there are three groups, s = g - 1 = 2 dimensions are sufficient to display all possible discrimination among groups, and the printed output indicates that both dimensions are highly significant.
Canonical Discriminant Analysis
Adjusted Approx Squared
Canonical Canonical Standard Canonical
Correlation Correlation Error Correlation
1 0.909389 0.906426 0.014418 0.826988
2 0.625998 0.618252 0.050677 0.391874
Test of H0: The canonical correlations in the
current row and all that follow are zero
Likelihood
Ratio Approx F Num DF Den DF Pr > F
1 0.10521315 57.4891 10 276 0.0001
2 0.60812612 22.3928 4 139 0.0001
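The computation underlying PROC CANDISC can be sketched as a generalized eigenproblem: the canonical weights are eigenvectors of W^{-1} B, where B and W are the between- and within-group sums-of-squares-and-cross-products matrices. The following NumPy fragment is an illustration with synthetic data, not the SAS procedure:

```python
import numpy as np

def candisc(X, groups):
    """Eigenvalues and weights of W^{-1} B (between vs. within SSCP)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        d = Xg.mean(axis=0) - xbar
        B += len(Xg) * np.outer(d, d)          # between-group SSCP
        C = Xg - Xg.mean(axis=0)
        W += C.T @ C                           # within-group SSCP
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-evals.real)
    s = min(p, len(np.unique(groups)) - 1)     # number of nonzero dimensions
    return evals.real[order[:s]], evecs.real[:, order[:s]]

# Three synthetic groups in four variables, separated along one direction
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in (0.0, 2.0, 4.0)])
g = np.repeat([0, 1, 2], 30)
evals, weights = candisc(X, g)   # s = min(4, 3 - 1) = 2 dimensions
```

Each eigenvalue lambda relates to a squared canonical correlation by rho^2 = lambda / (1 + lambda).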
The canonical discriminant plot below shows that the three groups are ordered on CAN1 from Normal to Overt Diabetic with Chemical Diabetics in the middle. The two glucose measures and the measure SSPG of insulin resistance are most influential in separating the groups along the first canonical dimension. The Chemical Diabetics are apparently higher than the other groups on these three variables. Among the remaining variables, INSTEST is most influential, and contributes mainly to the second canonical dimension. The wide separation of the 99% confidence circles portrays the strength of group differences on the canonical dimensions and the variable vectors help to interpret the meaning of these dimensions.
title 'Canonical Discriminant analysis plot: Diabetes data';
%include canplot; *-- unnecessary in PsychLab;
%include diabetes; *-- %include data(diabetes) in PsychLab;
%canplot(
data=diabetes,
class=group,
var=relwt glufast glutest instest sspg,
scale=3.5);
The SCALE parameter sets the scaling of the variable vectors
relative to the points in the plot.