Categorical Data Analysis with Graphics
Michael Friendly

Part 5: Correspondence analysis

Correspondence analysis is an exploratory technique, related to principal components analysis, which finds a multidimensional representation of the association between the row and column categories of a two-way contingency table. It finds scores for the row and column categories on a small number of dimensions that account for the greatest proportion of the chi² for association between the row and column categories, just as principal components account for maximum variance. For graphical display, two or three dimensions are typically used to give a reduced-rank approximation to the data.

For a two-way table, the scores for the row categories, $x_{im}$, and the column categories, $y_{jm}$, on dimensions $m = 1, \dots, M$ are derived from a singular value decomposition of the residuals from independence, expressed as $d_{ij} / \sqrt{n}$, so as to account for the largest proportion of the chi² in a small number of dimensions. This decomposition may be expressed as

\[ \frac{d_{ij}}{\sqrt{n}} = \sum_{m=1}^{M} \lambda_m \, x_{im} \, y_{jm} \tag{10} \]

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$ and $M = \min(I-1, J-1)$. In $M$ dimensions, the decomposition (10) is exact. A rank-$d$ approximation in $d$ dimensions is obtained from the first $d$ terms on the right side of (10), and the proportion of chi² accounted for by this approximation is

\[ n \sum_{m=1}^{d} \lambda_m^2 \Big/ \chi^2 \]
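
The reason this ratio is a proportion can be sketched by squaring and summing both sides of (10). The step below assumes that the $d_{ij}$ are the Pearson residuals, so that their squares sum to chi², and that the scores for different dimensions are orthonormal, as in a standard singular value decomposition; neither assumption is spelled out in this part.

```latex
\[
\sum_{i}\sum_{j} \left( \frac{d_{ij}}{\sqrt{n}} \right)^{2}
  = \frac{\chi^2}{n}
  = \sum_{m=1}^{M} \lambda_m^2
\qquad\Longrightarrow\qquad
\chi^2 = n \sum_{m=1}^{M} \lambda_m^2
\]
```

Under these assumptions, keeping only the first $d$ terms retains $n \sum_{m=1}^{d} \lambda_m^2$ of the total chi², which is the numerator of the proportion above.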

Thus, correspondence analysis is designed to show how the data deviate from expectation when the row and column variables are independent, as in the association plot and mosaic display. However, the association plot and mosaic display depict every cell in the table, and for large tables it may be difficult to see patterns. Correspondence analysis shows only row and column categories in the two (or three) dimensions which account for the greatest proportion of deviation from independence.

PROC CORRESP

In SAS Version 6, correspondence analysis is performed using PROC CORRESP in SAS/STAT. PROC CORRESP can read two kinds of input:

a two-way contingency table
raw category responses on two or more classification variables
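
The second form of input can be sketched as follows, using the hair and eye color frequencies from the example later in this part, arranged with one observation per cell. The data set names HAIREYE and COORD2 and the variable names HAIR and COUNT are chosen here for illustration only. The sketch assumes the TABLES and WEIGHT statements of PROC CORRESP, with the row variable listed before the comma and the column variable after it.

```sas
/* Sketch only: one observation per eye-hair combination */
data haireye;
   input eye $ hair $ count @@;
   cards;
Brown Black 68  Brown Brown 119  Brown Red 26  Brown Blond  7
Blue  Black 20  Blue  Brown  84  Blue  Red 17  Blue  Blond 94
Hazel Black 15  Hazel Brown  54  Hazel Red 14  Hazel Blond 10
Green Black  5  Green Brown  29  Green Red 14  Green Blond 16
;
proc corresp data=haireye out=coord2 short;
   tables eye, hair;    /* row variable, then column variable */
   weight count;        /* cell frequency for each combination */
run;
```

With this form the procedure builds the contingency table itself, so the categories may be printed in a different order than in a table entered directly.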

An OUT= data set from PROC CORRESP contains the row and column coordinates, which can be plotted with PROC PLOT or PROC GPLOT. The procedure has many options for scaling row and column coordinates, and for printing various statistics which aid interpretation. Only the basic use of the procedure is illustrated here.

Example: Hair and Eye Color

The program below reads the hair and eye color data into the data set COLORS, and calls the CORRESP procedure. This example illustrates the use of PROC PLOT and the Annotate facility with PROC GPLOT to produce a labeled display of the correspondence analysis solution. To input a contingency table in the CORRESP step, the hair colors (columns) are specified as the variables in the VAR statement, and the eye colors (rows) are indicated as the ID variable.

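* Hair colors are the columns; eye color identifies the rows;
data colors;
   input BLACK BROWN RED BLOND   EYE $;
   cards;
           68  119   26    7     Brown
           20   84   17   94     Blue
           15   54   14   10     Hazel
            5   29   14   16     Green
;
* OUT= saves the row and column coordinates in the data set COORD;
proc corresp data=colors out=coord short;
    var black brown red blond;
    id eye;
run;
proc print data=coord;
run;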

The printed output from the CORRESP procedure is shown below. The section labeled "Inertia, ... " indicates that over 98% of the chi² for association is accounted for by two dimensions, with most of that attributed to the first dimension.

+--------------------------------------------------------------------+
|                                                                    |
|                 The Correspondence Analysis Procedure              |
|                                                                    |
|                                                                    |
|                  Inertia and Chi-Square Decomposition              |
|                                                                    |
|    Singular  Principal Chi-                                        |
|    Values    Inertias  Squares Percents   18   36   54   72   90   |
|                                        ----+----+----+----+----+---|
|    0.45692   0.20877   123.593  89.37% *************************   |
|    0.14909   0.02223    13.158   9.51% ***                         |
|    0.05097   0.00260     1.538   1.11%                             |
|              -------   -------                                     |
|              0.23360    138.29 (Degrees of Freedom = 9)            |
|                                                                    |
|                                                                    |
|                            Row Coordinates                         |
|                                                                    |
|                                  Dim1          Dim2                |
|                                                                    |
|                   Brown      -.492158      -.088322                |
|                   Blue       0.547414      -.082954                |
|                   Hazel      -.212597      0.167391                |
|                   Green      0.161753      0.339040                |
|                                                                    |
|                                                                    |
|                           Column Coordinates                       |
|                                                                    |
|                                  Dim1          Dim2                |
|                                                                    |
|                   BLACK      -.504562      -.214820                |
|                   BROWN      -.148253      0.032666                |
|                   RED        -.129523      0.319642                |
|                   BLOND      0.835348      -.069579                |
|                                                                    |
+--------------------------------------------------------------------+
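
As a check on the "Percents" column, the proportion of chi² accounted for by two dimensions can be worked out from the printed singular values, taking n = 592 as the total of the hair and eye color table. This is only an arithmetic illustration of the formula given earlier, using the rounded values printed above.

```latex
\[
\frac{n \left( \lambda_1^2 + \lambda_2^2 \right)}{\chi^2}
  = \frac{592 \left( 0.45692^2 + 0.14909^2 \right)}{138.29}
  \approx \frac{136.75}{138.29}
  \approx 0.989
\]
```

This agrees, up to rounding, with the sum of the first two printed percentages, 89.37% + 9.51%.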

The singular values, $\lambda_m$, in Eqn. (10) are also the (canonical) correlations between the optimally scaled categories. Thus, if the DIM1 scores for hair color and eye color are assigned to the 592 observations in the table, the correlation of these variables would be 0.4569. The DIM2 scores give a second, orthogonal scaling of these two categorical variables, whose correlation would be 0.1491.
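
This interpretation can be checked directly. The following is a minimal sketch, assuming the COLORS data set from the program above. The data set SCORED and the variables COUNT, EYEDIM1, HAIRDIM1 and J are names introduced here for illustration, and the Dim1 coordinates are copied by hand from the printed output.

```sas
/* Sketch only: attach the printed Dim1 scores to each cell of the table */
data scored;
   set colors;
   array f{4} black brown red blond;
   /* Dim1 column coordinates, in the order BLACK BROWN RED BLOND */
   array hs{4} _temporary_ (-0.504562 -0.148253 -0.129523 0.835348);
   /* Dim1 row coordinate for the eye color of this row */
   select (eye);
      when ('Brown') eyedim1 = -0.492158;
      when ('Blue')  eyedim1 =  0.547414;
      when ('Hazel') eyedim1 = -0.212597;
      when ('Green') eyedim1 =  0.161753;
      otherwise      eyedim1 = .;
   end;
   /* write one observation per cell, carrying its frequency */
   do j = 1 to 4;
      count    = f{j};
      hairdim1 = hs{j};
      output;
   end;
   keep eye count eyedim1 hairdim1;
run;
/* FREQ makes each cell count as COUNT observations */
proc corr data=scored;
   var eyedim1 hairdim1;
   freq count;
run;
```

If the statement above holds, the correlation reported by PROC CORR should be close to the first singular value. Because a correlation does not depend on the scale of either variable, the particular scaling of the coordinates should not matter for this check.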

A plot of the row and column points can be constructed from the OUT= data set COORD requested in the PROC CORRESP step. The variables of interest in this example are shown below. Note that row and column points are distinguished by the variable _TYPE_.

+-------------------------------------------------------------------+
|                                                                   |
|    OBS    _TYPE_      EYE       DIM1        DIM2                  |
|                                                                   |
|     1     INERTIA               .           .                     |
|     2     OBS        Brown    -0.49216    -0.08832                |
|     3     OBS        Blue      0.54741    -0.08295                |
|     4     OBS        Hazel    -0.21260     0.16739                |
|     5     OBS        Green     0.16175     0.33904                |
|     6     VAR        BLACK    -0.50456    -0.21482                |
|     7     VAR        BROWN    -0.14825     0.03267                |
|     8     VAR        RED      -0.12952     0.31964                |
|     9     VAR        BLOND     0.83535    -0.06958                |
|                                                                   |
+-------------------------------------------------------------------+

The interpretation of the correspondence analysis results is facilitated by a labeled plot of the row and column points. As of Version 6.08, points can be labeled in PROC PLOT. The following statements produce a labeled plot. The plot should be scaled so that the number of data units per inch is the same for both dimensions; otherwise, distances in the plot would not be represented accurately. In PROC PLOT, this is done with the VTOH option, which specifies the aspect ratio (vertical to horizontal) of your printer.

* Label each point with the value of EYE; equate the axis units with VTOH;
proc plot data=coord vtoh=2;
   plot dim2 * dim1 = '*' $ eye / box haxis=by .1 vaxis=by .1;
run;

                 Plot of DIM2*DIM1$EYE.  Symbol used is '*'.
     -+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
DIM2 |                                                                       |
     |                                                                       |
 0.4 +                                                                       +
     |                                 * Green                               |
 0.3 +                   * RED                                               +
     |                                                                       |
 0.2 +                                                                       +
     |              * Hazel                                                  |
 0.1 +                                                                       +
     |                  * BROWN                                              |
 0.0 +                                                                       +
     |                                                             BLOND *   |
-0.1 +* Brown                                             * Blue             +
     |                                                                       |
-0.2 +* BLACK                                                                +
     |                                                                       |
-0.3 +                                                                       +
     |                                                                       |
     -+----+----+----+----+----+----+----+----+----+----+----+----+----+----+-
    -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
                                       DIM1

A labeled high-resolution display of the correspondence analysis solution (Figure 21) is constructed with PROC GPLOT, using a DATA step to produce an Annotate data set LABEL from the COORD data set. In the PROC GPLOT step, the axes are equated with the AXIS statements: AXIS1 specifies a length and a range that are both twice those in the AXIS2 statement, so that the ratio of data units to plot units is the same in both dimensions.

data label;
   set coord;
   xsys='2'; ysys='2';
   x = dim1; y = dim2;
   text = eye;
   size = 1.3;
   function='LABEL';
   if _type_='VAR' then color='RED  '; else color='BLUE';
proc gplot data=coord;
   plot dim2 * dim1
        / anno=label frame
          href=0 vref=0 lvref=3 lhref=3
          vaxis=axis2 haxis=axis1
          vminor=1 hminor=1;
   axis1 length=6 in  order=(-1. to 1. by .5)
         label=(h=1.5          'Dimension 1');
   axis2 length=3 in  order=(-.5 to .5 by .5)
         label=(h=1.5 a=90 r=0 'Dimension 2');
   symbol v=none;
run;

Figure 21: Correspondence analysis solution for Hair color, Eye color data

Multi-way tables

A three- or higher-way table can be analyzed by correspondence analysis in several ways. One approach is called "stacking". A three-way table, of size I x J x K can be sliced into I two-way tables, each J x K . If the slices are concatenated vertically, the result is one two-way table, of size ( I x J ) x K . In effect, the first two variables are treated as a single composite variable, which represents the main effects and interaction between the original variables that were combined. Van der Heijden and de Leeuw (1985) discuss this use of correspondence analysis for multi-way tables and show how each way of slicing and stacking a contingency table corresponds to the analysis of a specified log-linear model.

In particular, for the three-way table that is reshaped as a table of size ( I x J ) x K , the correspondence analysis solution analyzes residuals from the log-linear model [AB] [C]. That is, for such a table, the I x J rows represent the joint combinations of variables A and B. The expected frequencies under independence for this table are

\[ e_{[ij]k} = \frac{f_{[ij]+} \, f_{[+]k}}{n} = \frac{f_{ij+} \, f_{++k}}{n} \]

which are the ML estimates of expected frequencies for the log-linear model [AB] [C]. The chi² that is decomposed is the Pearson chi² for this log-linear model. When the table is stacked as I x ( J x K ) or J x ( I x K ) , correspondence analysis decomposes the residuals from the log-linear models [A] [BC] and [B] [AC], respectively. Van der Heijden and de Leeuw (1985) show how a generalized form of correspondence analysis can be interpreted as decomposing the difference between two specific log-linear models.
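
As a worked illustration of the expected frequencies for the stacked table, take one cell from the suicide data in the example that follows: males aged 10-20 who used poison. The totals below were obtained by summing the entries of that table (row total 3787 for this sex-age group, column total 17565 for POISON, and 48961 for all cells) and are not printed in this part.

```latex
\[
e_{[ij]k} = \frac{f_{[ij]+} \, f_{[+]k}}{n}
          = \frac{3787 \times 17565}{48961}
          \approx 1358.6
\]
```

The observed frequency for this cell is 1160, so the cell falls below its expected value when method is taken to be independent of the sex-age combinations.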

Example: Suicide Rates

To illustrate the use of correspondence analysis for three-way tables, we use data on suicide rates in West Germany, classified by age, sex, and method of suicide used. The data, from Heuer (1979, Table 1), have been discussed by van der Heijden and de Leeuw (1985) and others.

+-------------------------------------------------------------------+
|                                                                   |
|  Sex  Age    POISON     GAS    HANG   DROWN     GUN    JUMP       |
|                                                                   |
|   M  10-20     1160     335    1524      67     512     189       |
|   M  25-35     2823     883    2751     213     852     366       |
|   M  40-50     2465     625    3936     247     875     244       |
|   M  55-65     1531     201    3581     207     477     273       |
|   M  70-90      938      45    2948     212     229     268       |
|                                                                   |
|   F  10-20      921      40     212      30      25     131       |
|   F  25-35     1672     113     575     139      64     276       |
|   F  40-50     2224      91    1481     354      52     327       |
|   F  55-65     2283      45    2014     679      29     388       |
|   F  70-90     1548      29    1355     501       3     383       |
|                                                                   |
+-------------------------------------------------------------------+

The table below shows the results of all possible hierarchical log-linear models for the suicide data. It is apparent that none of these models has an acceptable fit to the data. However, given the enormous sample size (n = 48,961, the total of the table above), even relatively small departures from the expected frequencies under any model would appear significant.

 Model              df        L.R. G²         chi²

 [M] [A] [S]        49       10119.60         9908.24

 [M] [AS]           45        8632.0          8371.3
 [A] [MS]           44        4719.0          4387.7
 [S] [MA]           29        7029.2          6485.5

 [MS] [AS]          40        3231.5          3030.5
 [MA] [AS]          25        5541.6          5135.0
 [MA] [MS]          24        1628.6          1592.4

 [MA] [MS] [AS]     20         242.0           237.0

Correspondence analysis applied to the [AS] by [M] table helps to show the nature of the association between method of suicide and the joint age-sex combinations, and decomposes the chi² = 8371 for the log-linear model [AS] [M]. To carry out the analysis with the data as shown above, the variables age and sex are combined into a single variable, sexage.

proc corresp data=suicide;
   var poison gas hang drown gun jump;
   id sexage;
run;
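
The DATA step that creates the SUICIDE data set is not shown. One way it might be written, placed before the PROC CORRESP step, is sketched below. The variable names SEX and AGE, the length of SEXAGE, and the form of its values are assumptions made here for illustration; the frequencies are those of the table above.

```sas
/* Sketch only: read the table and combine sex and age into SEXAGE */
data suicide;
   input sex $ age $ poison gas hang drown gun jump;
   length sexage $ 8;
   sexage = trim(sex) || ' ' || age;   /* for example, 'M 10-20' */
   cards;
M 10-20 1160 335 1524  67 512 189
M 25-35 2823 883 2751 213 852 366
M 40-50 2465 625 3936 247 875 244
M 55-65 1531 201 3581 207 477 273
M 70-90  938  45 2948 212 229 268
F 10-20  921  40  212  30  25 131
F 25-35 1672 113  575 139  64 276
F 40-50 2224  91 1481 354  52 327
F 55-65 2283  45 2014 679  29 388
F 70-90 1548  29 1355 501   3 383
;
run;
```

Each of the ten observations then serves as one row of the [AS] by [M] table, with SEXAGE as its label.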

The results show that over 93% of the association is accounted for by two dimensions.

+-------------------------------------------------------------------+
|                                                                   |
|                Inertia and Chi-Square Decomposition               |
|                                                                   |
|  Singular  Principal Chi-                                         |
|  Values    Inertias  Squares Percents   12   24   36   48   60    |
|                                      ----+----+----+----+----+--- |
|  0.32138   0.10328   5056.91  60.41% *************************    |
|  0.23736   0.05634   2758.41  32.95% **************               |
|  0.09378   0.00879    430.55   5.14% **                           |
|  0.04171   0.00174     85.17   1.02%                              |
|  0.02867   0.00082     40.24   0.48%                              |
|            -------   -------                                      |
|            0.17098   8371.28 (Degrees of Freedom = 45)            |
|                                                                   |
+-------------------------------------------------------------------+

The plot of the scores for the rows (sex-age combinations) and columns (methods) shows residuals from the log-linear model [AS] [M]. Thus, it shows the two-way associations of sex x method, age x method, and the three-way association, sex x age x method which are set to zero in the model [AS] [M]. The possible association between sex and age is not shown in this plot.

Dimension 1 in the plot separates males and females, indicating a strong difference between the suicide profiles of the two sexes. The second dimension is mostly ordered by age, with younger groups at the top and older groups at the bottom. Note also that the positions of the age groups are approximately parallel for the two sexes. Such a pattern indicates that sex and age do not interact in this analysis. The relation between the age-sex groups and methods of suicide can be interpreted in terms of similar distance and direction from the origin, which represents the marginal row and column profiles. Young males are more likely to commit suicide by gas or a gun and older males by hanging, while young females are more likely to ingest some toxic agent and older females to die by jumping or drowning.

Figure 22: Two-dimensional correspondence analysis solution for the [SA] [M] multiple table

Figure 23: Mosaic display for sex and age. The frequency of suicide shows opposite trends with age for males and females.

Figure 24: Mosaic display showing deviations from model [SA] [M]. The methods have been reordered according to their positions on Dimension 1 of the correspondence analysis solution for the [SA] [M] table.
