User's Guide for MOSAICS

Version 3.5
Michael Friendly
Psychology Department
York University

1. Introduction
2. Installation Guide
- How to obtain MOSAICS
- Installing MOSAICS
3. Using MOSAICS

4. Macro interface
5. Examples
6. Implementation
References

1. Introduction

The mosaic display, proposed by Hartigan & Kleiner (1981), represents the counts in a contingency table directly by tiles whose area is proportional to the cell frequency. This display generalizes readily to n-way tables. Friendly (1991, 1992, 1994) extended the use of the mosaic display as a graphical tool for fitting log-linear models. The enhanced mosaic uses color and shading of the tiles to reflect the sign and magnitude of the residual from a specified log-linear model. Friendly also shows how the understanding of patterns of association can be enhanced by reordering the rows and columns to make the pattern more coherent. Refer to Friendly (1991, 1992, 1994) for details of the method and examples of its use in fitting log-linear models.

There is also:

An online web application, with several sets of sample data. You can submit your own data through a form or an uploaded data file.
A brief tutorial introduction to mosaic displays.
A PDF version (mosaics.pdf). Note: This HTML version is no longer maintained. The PDF version is more up to date.

This report describes MOSAICS, a collection of SAS/IML programs and macros for producing mosaic displays. The programs have the following features:

It produces graphical displays of an n-way contingency table of any size. Experience shows that tables of up to 5 or 6 dimensions can be usefully explored. The main limitation is in the resolution of the display with large, complex tables.
The order of variables in the mosaic is specified by the user. Different orderings of the variables can show different aspects of the data.
For an unordered factor, the order of its levels can be determined to enhance understanding of the pattern of association. This ordering can be found from a correspondence analysis of the residuals from a model of independence.
The program can produce sequential displays of the marginal subtables, [A], [AB], [ABC], and so forth, up to the full n-way table, where A, B, C, ... refer to the table variables in the order entered.
For each display, the program fits a log-linear model and depicts the residuals from the model by the color and shading of tiles in the mosaic.
The program can automatically construct and fit a wide set of baseline models of independence or partial independence among the table variables. A shorthand keyword is used to specify many models of interest. Alternatively, the user can specify and fit any log-linear model which can be estimated by iterative proportional fitting.
The program can perform a correspondence analysis on marginal subtables to suggest a reordering of the levels of each variable.
Models and tables with structural zeros are accommodated naturally.
A contingency table can be read from a SAS data set or entered in SAS/IML as a table of frequencies together with variable names and factor level values. A collection of sample contingency tables in this format is supplied.
A SAS macro, mac/mosaic.sas, provides a more easily used interface to the SAS/IML modules.
Other SAS/IML modules extend the idea of mosaic displays to mosaic matrices (mosmat.sas), both marginal and conditional, and partial mosaic plots (mospart.sas). Partial mosaics are included in the mac/mosaic.sas macro; mosaic matrices have their own macro (mac/mosmat.sas).

2. Installation Guide

How to obtain MOSAICS

The program, mosaics.sas, and examples of its use are available from this site in two identical archives: mosaics.tar.gz and mosaics.zip.

Installing MOSAICS

MOSAICS.SAS consists of a collection of SAS/IML modules which are designed to be called from another program in a proc iml step. Because the program is large, the modules are most conveniently stored in compiled form in a SAS/IML storage catalog, called MOSAIC.MOSAIC. To install the program in this way,

Copy the files MOSAICS.SAS and MOSAICM.SAS to a directory (say, '~/sasuser/mosaics/' or 'c:\sasuser\mosaics\').

Edit the libname and filename statements to correspond to this directory. On a Unix system, these might be,

*-- Change the path in the following filename statement to point to
    the installed location of mosaics.sas;
filename mosaics  '~/sasuser/mosaics/';
*--- Change the path in the libname to point to where the compiled
    modules will be stored, ordinarily the same directory;
libname  mosaic   '~/sasuser/mosaics/';

On Windows,

filename mosaics  'c:\sasuser\mosaics\';
libname  mosaic   'c:\sasuser\mosaics\';

You may wish to change some of the program default values (in the module globals in MOSAICS.SAS), particularly the font= value, which is set to font='hwpsl009' (Helvetica for the PS driver) in the distribution copy.
Run the MOSAICM.SAS program, with the command,
```
sas mosaicm
```
Optionally, install the sample data sets (see "Sample data sets") by running sas mosdata.
These steps need only be done once.

In applications, the modules are loaded into the SAS/IML workspace with the load or %include statement, as follows,

libname mosaic '~/sasuser/mosaics';
proc iml;
  reset storage=mosaic.mosaic;
  load module=_all_;

On most platforms, a libname statement is needed to specify the location of the MOSAIC library in the operating system file structure. Note: This requires that you have Read/Write access to the MOSAIC library, even if the MOSAIC modules are only loaded. See "Public Use" below for a solution.

Alternatively, it is possible to store and use the program in source form. This avoids the need to maintain and access the SAS/IML catalog, but means that the program is compiled each time it is run. To use the program in this way, simply access the program with a %include statement:

filename mosaics 'path/to/mosaics.sas';
proc iml;
  %include mosaics;

On some platforms you may need to add a path specification to the %include statement or use a filename statement to specify the location of the MOSAICS.SAS file in the operating system file structure.

Public Use

On most platforms, SAS/IML requires (by default) that the user have Read/Write access to the library accessed by the load command. Therefore, if the MOSAIC modules are stored in compiled form and are to be accessed publicly (on a network), users must specify access=readonly on the libname statement:

libname mosaic '~/sasuser/mosaics' access=readonly;

You can place this statement in the system-wide autoexec.sas file.

Alternatively, copy the MOSAICS.SAS file to any public (readable) directory, and instruct users to load it using the %include statement, as described above.

3. Using MOSAICS

You can use MOSAICS either through a SAS/IML step or through the mosaic macro. The macro is easier to use, but IML is somewhat more flexible.

If you are using IML, the contingency table can either be defined directly with IML statements, or input from a SAS data set. The macro reads data from a SAS data set.

Input parameters

The frequency table analyzed is specified in the run mosaic statement. A great many options, all of which have default values, are specified by global variables in the proc iml step. Hence, the program is typically used as follows:

proc iml symsize=256;
  reset storage=mosaic.mosaic;
  load module=_all_;
  *-- specify data parameters;
  levels = { ... };   *-- variable levels;
  table  = { ... };   *-- contingency table;
  vnames = { ... };   *-- variable names;
    ...
 
  *-- specify non-default global inputs;
  fittype='USER';
  config = { 1  1,
             2  3 };
 
  run mosaic(levels, table, vnames, lnames, plots, title);

The n-way contingency table to be analyzed is specified by the table parameter; the names of the dimension (factor) variables and the names of the values that the dimension variables take on are specified in the vnames and lnames parameters, respectively, as described below.

In situations where the contingency table and factor variables are available in a SAS dataset, the table, levels, and lnames matrices may be constructed with the readtab module, described in SAS Dataset Input. The parameters for the run mosaic statement are:

Parameter

Description

levels

is a vector which specifies the number of variables and the dimensions of the contingency table. If levels is n x 1, then the table has n dimensions, and the number of levels of variable i is levels[i]. The order of the variables in levels is the order they are entered into the mosaic display.

table

is a matrix or vector giving the frequency, $f_{ij\ldots}$, of observations in each cell of the table. The table variables are arranged in accordance with the conventions of the SAS/IML IPF and MARG functions, so the first variable varies most rapidly across the columns of table and the last variable varies most slowly down the rows.

In addition, table must conform to levels as follows. If table has I rows and J columns, the product of all entries in levels must be IJ. Moreover, J must equal the product of the first k entries of levels, for some k. That is, the columns must correspond to the combinations of levels of the first k factors.
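The two conformity conditions can be checked before calling mosaic. The following is a minimal sketch, not part of MOSAICS: the names ncell, totalok and colsok are made up for illustration, the table holds placeholder frequencies, and it assumes a SAS/IML release that provides the CUPROD function.

```sas
proc iml;
   levels = {4 4 2};               *-- hypothetical 4 x 4 x 2 table;
   table  = j(2, 16, 1);           *-- placeholder frequencies, 2 rows by 16 columns;

   ncell   = nrow(table) * ncol(table);
   totalok = (ncell = levels[#]);                *-- IJ equals the product of all levels;
   colsok  = any(cuprod(levels) = ncol(table));  *-- J is the product of the first k levels;
   print ncell totalok colsok;
quit;
```

Both totalok and colsok should be 1 for a table that conforms to levels. Here J = 16 is the product of the first two entries of levels, so the columns correspond to the combinations of the first two factors.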

vnames

is a 1 x n character vector of variable (factor) names, in an order corresponding to levels.

lnames

is a character matrix of labels for the variable levels, one row for each variable. The number of columns is the maximum value in levels. When the numbers of levels are unequal, the rows for smaller factors must be padded with blank entries.

plots

is a vector containing any of the integers 1 to n which specifies the list of marginal tables to be plotted. If plots contains the value i the marginal subtable for variables 1 to i will be displayed. For a 3-way table, plots={1 2 3} displays each sequential plot, showing the [A], [AB] and [ABC] marginal tables; while plots=3 displays only the final 3-way [ABC] mosaic.

title

is a character string or vector of strings containing title(s) for the plots. If title is a single character string, it is used as the title for all plots. Otherwise, title may be a vector of up to max(plots) strings, and title[i] is used as the title for the plot produced when plots contains the value i. If the number of strings is less than max(plots), the last string is used for all remaining plots.

Moreover, if the title for a given plot contains the string &MODEL (upper case), that string is replaced by the symbolic model description. Similarly, the string &G2 (or &X2) is replaced by the LR (Pearson) chisquare value and df for the current model, in the form 'G2 (df) = value'. Enclose such titles in single quotes; otherwise the SAS macro processor will complain about an 'Apparent symbolic reference'. For example, the specifications,

plots = 2:3;
fittype='JOINT';
title = { '',
          'Hair-color Eye-color Data  Model (H)(E)',
          'Hair-color Eye-color Data  Model (HE)(S)'};

produce two plots with titles from title[2] and title[3] (1). Equivalent results (using substitution) are produced with the single title,

title = 'Hair-color Eye-color Data  Model &MODEL';

------------------------
(1) SAS/GRAPH fonts do not produce brackets, [ ] and braces, { }. Use parentheses instead in model symbolic formulae.
------------------------

Global input variables

The global variables below allow many of the details of the model fitting and mosaic display to be altered. Since they all have default values, it is only necessary to specify those you wish to change. All character-valued variables are case-insensitive.

colors

is a character vector of one or two elements specifying the colors used for positive and negative residuals. The default is {BLUE RED}. For a monochrome display, specify colors='BLACK' and use two distinct fill patterns for the fill type, such as filltype={M0 M45} or filltype={GRAY M45}.

config

is a numeric or character matrix specifying which marginal totals to fit when fittype='USER' is also specified. config is ignored for all other fit types. Each column specifies a high-order marginal in the model, either by the names of the variables or by their indices, according to their order in vnames. For example, the log-linear model [AB][AC][BC] for a three-way table is specified by the 2 by 3 matrix,

 config = { 1  1  2,
            2  3  3};

or, equivalently, by variable names (here assuming the variables are named A, B and C),

 config = { A  A  B,
            B  C  C};

The same model can be specified more easily row-wise, and then transposed:

 config = t( {1 2, 1 3, 2 3} );

devtype {GF | LR | FT}

is a character string which specifies the type of deviations (residuals) to be represented by shading. devtype='GF' is the default.

GF: calculates components of the Pearson goodness of fit chisquare, based on $\hat{m}_{ij}$, the estimated expected frequency under the model.
LR: calculates components of the likelihood ratio (deviance) chisquare.
FT: calculates Freeman-Tukey residuals.
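
For reference, the standard textbook forms of these three residuals are given below, with $f_{ij}$ the observed frequency and $\hat{m}_{ij}$ the fitted frequency. These are the usual definitions and are an assumption here; the expressions coded in MOSAICS (for example, its handling of zero cells) may differ in detail.

$$
\begin{aligned}
\text{GF:}\quad d_{ij} &= \frac{f_{ij} - \hat{m}_{ij}}{\sqrt{\hat{m}_{ij}}} \\
\text{LR:}\quad d_{ij} &= \operatorname{sign}(f_{ij} - \hat{m}_{ij}) \sqrt{2\left[ f_{ij}\log(f_{ij}/\hat{m}_{ij}) - (f_{ij} - \hat{m}_{ij}) \right]} \\
\text{FT:}\quad d_{ij} &= \sqrt{f_{ij}} + \sqrt{f_{ij}+1} - \sqrt{4\hat{m}_{ij}+1}
\end{aligned}
$$

The sum of squared GF residuals is the Pearson chisquare for the model, and the sum of squared LR residuals is the likelihood ratio chisquare.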

fittype {JOINT | MUTUAL | CONDIT | PARTIAL | MARKOV | USER}

is a character string which specifies the type of sequential log-linear models to fit. fittype='JOINT' is the default. For two-way tables (or two-way margins of larger tables), all fittypes fit the independence model.

JOINTk: specifies sequential models of joint independence, [A][B], [AB][C], [ABC][D], ... These models specify that the last variable in a given plot is independent of all previous variables jointly.
Optionally, the keyword JOINT may be followed by a digit, k, to specify which of the n ordered variables is independent of the rest jointly.
MUTUAL: specifies sequential models of mutual independence, [A][B], [A][B][C], [A][B][C][D], ...
CONDITk: specifies sequential models of conditional independence which hypothesize that all previous variables are independent, given the last, i.e., [A][B], [AC][BC], [AD][BD][CD], ... For the 3-way model, A and B are hypothesized to be conditionally independent, given C; for the 4-way model, A, B, and C are conditionally independent, given D.
Optionally, the keyword CONDIT may be followed by a digit, k, to specify which of the n ordered variables is conditioned upon.
PARTIAL: specifies sequential models of partial independence of the first pair of variables, conditioning on all remaining variables one at a time: [A][B], [AC][BC], [ACD][BCD], ... For the 3-way model, A and B are hypothesized to be conditionally independent, given C; for the 4-way model, A and B are conditionally independent, given C and D.
MARKOVk: specifies a sequential series of Markov chain models fit to the table, whose dimensions are assumed to represent discrete ordered time points, such as lags in a sequential analysis. The keyword MARKOV can optionally be followed by a digit to specify the order of the Markov chain, e.g., fittype='MARKOV2'; specifies a second-order Markov chain. First-order is assumed if not specified. Such models assume that the table dimensions are ordered in time, e.g., Lag0, Lag1, Lag2, ...
MARKOV (or MARKOV1) fits the models [A][B], [AB][BC], [AB][BC][CD], ..., where the categories at each lag are associated only with those at the previous lag. MARKOV2 fits the models [A][B], [A][B][C], [ABC][BCD], [ABC][BCD][CDE], ....
USER: If fittype='USER', specify the hypothesized model in the global matrix config. The models for plots of marginal tables are based on reducing the hypothesized configuration, eliminating all variables not participating in the current plot.
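
As an illustration of how the keyword models relate to USER configurations, the conditional independence model [AC][BC] for a three-way table can also be written out explicitly. This is a sketch of a fragment to place inside a proc iml step before the run mosaic statement; that it reproduces the fittype='CONDIT' fit for the three-way plot is an assumption based on the model descriptions above.

```sas
*-- Fragment for use inside a proc iml step, before run mosaic;
*-- Conditional independence of A and B given C, written as a USER model;
fittype = 'USER';
config  = t( {1 3, 2 3} );   *-- the columns of config are the margins [AC] and [BC];
```

Writing the model row-wise and transposing keeps each fitted margin on one line, which is easier to read when the margins have different numbers of variables.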

filltype {M45 | LR | M0 | GRAY | HLS}

is a character vector of one or two elements which specifies the type of fill pattern to use for shading. filltype[1] is used for positive residuals; filltype[2], if present, is used for negative residuals. If only one value is specified, a complementary value for negative residuals is generated internally. filltype={HLS HLS} is the default.

M45: uses SAS/GRAPH patterns MdN135 and MdN45, with hatching at 135° and 45°. d is the density value determined from the residual and the shade parameter.
LR: uses SAS/GRAPH patterns Ld and Rd.
M0: uses SAS/GRAPH patterns MdN0 and MdN90, with hatching at 0° and 90°.
GRAYstep: uses solid greyscale fill with the patterns GRAYnn, starting from GRAYF0 for density=1 and increasing darkness by step for each successive density level. The default for step is 16, so 'GRAY' gives GRAYF0, GRAYE0, GRAYD0, and so forth.
HLS: uses solid, color-varying fill based on the HLS color scheme. The colors are selected attempting to vary the lightness in approximately equal steps. For this option, the values of colors must be selected from the following hue names: RED GREEN BLUE MAGENTA CYAN YELLOW.

cellfill {NONE | SIGN | SIZE | DEV}

Provides the ability to display a symbol in the cell representing the coded value of large residuals. This is particularly useful for black and white output, where it is difficult to portray both sign and magnitude distinctly.

NONE: Nothing (default)
SIGN: Draws + or - symbols in the cell, whose number corresponds to the shading density.
SIZE: Draws + or - symbols in the cell, whose size corresponds to the shading density.
DEV: Writes the value of the standardized residual in the cell.

htext

is a numeric value which specifies the height of text labels, in character cells. The default is htext=1.3. The program attempts to avoid overlap of category labels, but this cannot always be achieved. Adjust htext (or make the labels shorter) if they collide.

legend {H | V | NONE}

Orientation of legend for shading of residual values in mosaic tiles. 'V' specifies a vertical legend at the right of the display; 'H' specifies a horizontal legend beneath the display. Default: 'NONE'.

order {NONE | [ DEV | JOINT ] | [ ROW | COL ] }

Specifies whether and how to perform a correspondence analysis to assist in reordering the levels of each factor variable as it is entered into the mosaic display. Not performed if order='NONE'. Otherwise, order may be a character vector containing either 'DEV' or 'JOINT' to specify that the CA is performed on residuals from the model for the current subtable (DEV) or on residuals from the model of joint independence for this subtable (JOINT). In addition, order may contain either 'ROW' or 'COL' or both to specify which dimensions of the current subtable are considered for reordering. The usual options for this reordering are

order = {JOINT COL};

At present this analysis merely produces printed output which suggests an ordering, but does not actually reorder the table or the mosaic display.

shade

is a vector of up to 5 values of $|d_{ij}|$, which specify the boundaries between shading levels. If shade={2 4} (the default), then the shading density number d is:

 d    range of |d_ij|
 0    0 <= |d_ij| < 2
 1    2 <= |d_ij| < 4
 2    4 <= |d_ij|

Standardized deviations are often referred to a standard Gaussian distribution; under the assumption that the model fits, these values roughly correspond to two-tailed probabilities p < .05 and p < .0001 that a given value of $|d_{ij}|$ exceeds 2 or 4, respectively. To suppress all shading, set shade to a number larger than any residual.
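
The probabilities quoted for the default boundaries, and the way a residual maps to a density number, can be checked with a few lines of SAS/IML. This is an illustrative sketch only: the residual values in dev are made up, and the loop mirrors the table above rather than the program's own code.

```sas
proc iml;
   shade = {2 4};                       *-- default shading boundaries;
   p = 2 * (1 - probnorm(shade));       *-- two-tailed normal probabilities;
   print shade, p;

   dev = {-4.5 1.2 2.7};                *-- hypothetical standardized residuals;
   density = j(1, ncol(dev), 0);
   do k = 1 to ncol(shade);
      density = density + (abs(dev) >= shade[k]);   *-- count the boundaries reached;
   end;
   print dev, density;
quit;
```

The probabilities come out at about .0455 and .00006, consistent with the rough bounds p < .05 and p < .0001. The three residuals receive density numbers 2, 0 and 1.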

space

is a vector of two values which specify the x, y percent of the plotting area reserved for spacing between the tiles of the mosaic. The default value is 10 times the number of variables allocated to each of the vertical and horizontal directions in the plot.

split

is a character vector consisting of the letters V and H which specifies the directions in which the variables divide the unit square of the mosaic display. If split={H V} (the default), the mosaic alternates between horizontal and vertical splitting. If the number of elements in split is less than the maximum number in plots, the elements in split are reused cyclically.

verbose {NONE | FIT | BOX}

is a character vector of one or more words which controls verbose or detailed output. If verbose contains 'FIT', additional details of the fitting process (fitted frequencies, marginal proportions) are printed. If verbose contains 'BOX', additional details of the drawing process (tile dimensions, label placement) are printed.

vlabels

is an integer from 0 to the number of variables in the table. It specifies that variable names (in addition to level names) are to be used to label the first vlabels variables. The default is vlabels=2, meaning variable names are used in plots of the first two variables only.

zeros

is a matrix of the same size and shape as the input table containing entries of 0 or 1, where 0 indicates that the corresponding value in table is to be ignored or treated as missing or a structural zero.

Zero entries cause the corresponding cell frequency to be fitted exactly; one degree of freedom is subtracted for each such zero. The corresponding tile in the mosaic display is outlined in black.

If an entry in any marginal subtable in the order [A], [AB], [ABC] ... corresponds to an all-zero margin, that cell is treated similarly as a structural zero in the model for the corresponding subtable. Note, however, that tables with zero margins may not always have estimable models.

If the table contains zero frequencies which should be treated as structural zeros, assign the zeros matrix like this:

zeros = table > 0;

For a square table, to fit a model of quasi-independence ignoring the diagonal entries, assign the zeros matrix like this (assuming a 4 x 4 table):

zeros = J(4,4) - I(4);

There is one caveat imposed by this use of global variables: The mosaic module should not be called from an IML module with its own arguments, since this would cause all variables defined within that module to be inaccessible as global variables. The mosaic module may be called either in immediate mode, as in the examples in the next section, or from an IML module defined without arguments.
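
A wrapper that respects this caveat might look like the following sketch. The module name showmutual is hypothetical, and the fragment assumes that the MOSAIC modules are already loaded and that levels, table, vnames, lnames, plots and title exist in the workspace.

```sas
*-- Fragment for use inside a proc iml step, after load module=_all_;
start showmutual;             *-- hypothetical module, defined without arguments;
   fittype = 'MUTUAL';        *-- shared with the main workspace, so mosaic can see it;
   run mosaic(levels, table, vnames, lnames, plots, title);
finish;
run showmutual;
```

Because the module has no argument list, its variables are the same as those of the main workspace. Adding an argument list, as in start showmutual(plots);, would make fittype local to the module and hide it from mosaic.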

Graphic options

MOSAICS assumes that the vertical and horizontal dimensions of the plot are equal, so you should include a goptions statement specifying equal values for hsize and vsize if the default values for your device are unequal. For example,

goptions hsize=7 in vsize=7 in;

The program uses the colors blue and red to draw the tiles corresponding to positive and negative residuals. You can specify the IML global colors variable to change these assignments if you wish. (Or, change the default values in the globals module.)

The program cannot access global fonts assigned with the GOPTIONS FTEXT= and HTEXT= options. Instead, you may specify a desired font with the IML global font and htext variables. For some output devices (e.g., PostScript), specifying a hardware font (e.g., font = 'hwpsl009'; for Helvetica) can yield an enormous reduction in the size of the generated graphic output files.

EPS Output

Some output devices, such as Encapsulated PostScript (and GIF), require that each figure be written to a separate output file. MOSAICS contains a gskip module which handles this automatically for EPS output.

It uses three global SAS macro variables:

DEVTYPE: Device type: Use %let devtype=eps; for EPS output.
DISPLAY: Display option: Use %let display=ON; for ordinary use. Setting DISPLAY=OFF suppresses graphic output (for all devices).
FIG: Figure number: Initialize to 1 with %let fig=1;

Listed below is a macro, EPS, which I use to initialize graphics options for EPS output.

%global fig gsasfile devtype;
%macro eps;
   %let devtype = EPS;
   %let fig=1;
   %let gsasfile=grfout.eps;
   %put gsasfile is: "&gsasfile";
   filename gsasfile  "&gsasfile";

   goptions horigin=.5in vorigin=.5in; *-- override, for BBfix;
   goptions device=PSLEPSFC gaccess=gsasfile
       gend='0A'x  gepilog='showpage' '0A'x   /* only for 6.07 */
       gsflen=80 gsfmode=replace;
%mend;

Multiple calls

The mosaic module may be called repeatedly in one proc iml step. However, global variables which are set in one call remain in force. To restore these values to their default setting, use the SAS/IML free statement. For example, to revert to the default fit type of joint independence, use the statement,

free fittype;

before the next run mosaic statement.

SAS Dataset Input

A contingency table and its index (factor) variables may be read into SAS/IML, in the format required for MOSAICS, using the readtab module, as shown in the following example. The factors in the 2 x 3 x 2 table are gender, occup, and heart. The data set heart has 12 observations, one per cell.

* Sex, Occupation and heart disease [Karger, 1980]; 
data heart;
   input gender $ occup $  @;
   heart='Disease';  input freq @;  output;
   heart='No Dis';   input freq @;  output;
cards;
Male   Unempl     254    759
Female Unempl     431  10283
Male   WhiteCol   158   3155
Female WhiteCol    52   3082
Male   BlueCol     87   2829
Female BlueCol     16    416
;
proc sort data=heart;
   by heart occup gender;

proc iml;
   title  = 'Sex, Occupation, and Heart Disease'; 
   reset storage=mosaic.mosaic;
   load module=_all_;

   vnames = {'Gender' 'Occup' 'Heart' };
   run readtab('heart', 'freq', vnames, table, levels, lnames);

   plots = 2:ncol(levels);
   run mosaic(levels, table, vnames, lnames, plots, title);

The readtab routine reads the index (factor) variables from the input dataset (heart), and determines the order of the factor variables according to which variable is actually varying most rapidly in the input dataset. The variable names vector (vnames) can be given in any order; it is reordered to correspond to the order of observations in the input dataset.

Note that if you sort the dataset as in the example above, character-valued index variables are arranged in alphabetical order. For example, the levels of occup are arranged in the order BlueCol, Unempl, WhiteCol, which may or may not be what you want. The PROC SORT step can be omitted, in which case the levels are ordered according to their order in the input dataset.

You can also use the DESCENDING option in the PROC SORT step to reverse the order of the levels of a given factor. For example, to reverse the levels of the gender variable, use

proc sort data=heart;
   by heart occup descending gender;


Fitting specialized models

For square tables, or tables with ordered factors, a wide variety of specialized models are available which cannot be specified as any IPF configuration for a hierarchical log-linear model. However, many of these models can be fit simply using the matrix operations and functions available in SAS/IML. For example, the model of symmetry for a square table has expected frequencies $\hat{m}_{ij} = ( f_{ij} + f_{ji} ) / 2$. The fitted frequencies and residuals can be calculated in SAS/IML as

   fit = (f + f`)/2;
   dev = (f - fit)/sqrt(fit);

where f is a square table of observed frequencies. MOSAICS includes an additional program, mosaicd.sas, designed for situations such as this, where the fitted values and residuals are calculated externally. The mosaicd module is called instead of mosaic. The residuals are supplied as a dev parameter (which replaces the plots parameter of mosaic). The following example uses mosaicd to fit a model of symmetry to a $4 \times 4$ table of women classified by visual acuity ratings of their left and right eyes.

proc iml;
   dim = { 4 4 };
   /* Unaided distant vision data, Bishop et al., p. 284 */
      /*    Left eye grade */
   f = {1520   266   124    66,
         234  1512   432    78,
         117   362  1772   205,
          36    82   179   492 };
   title  = {'Unaided distant vision: Symmetry'};
   vnames = {'Right Eye','Left Eye'};
   lnames = { 'High' '2' '3' 'Low',
              'High' '2' '3' 'Low'};
   reset storage=mosaic.mosaic;
   load module=_all_;
   %include '~/sasuser/mosaics/mosaicd.sas';
   fit = (f + f`)/2;
   dev = (f - fit)/sqrt(fit);
   run mosaicd(dim, f, vnames, lnames, dev, title);

The sample program, moseye.sas, included in the distribution archives, illustrates how models of quasi-independence and quasi-symmetry can also be fit with MOSAICS.
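
For the symmetry model fitted to the vision data above, an overall test can be computed from the same residuals with ordinary SAS/IML functions. This sketch is independent of MOSAICS; the names chisq, df and prob are illustrative. The degrees of freedom are taken as one per off-diagonal pair, the standard count for symmetry in a square table.

```sas
proc iml;
   f = {1520   266   124    66,
         234  1512   432    78,
         117   362  1772   205,
          36    82   179   492 };
   fit   = (f + f`)/2;                 *-- fitted frequencies under symmetry;
   dev   = (f - fit)/sqrt(fit);        *-- Pearson residuals, zero on the diagonal;
   chisq = ssq(dev);                   *-- Pearson chisquare: sum of squared residuals;
   df    = nrow(f)*(nrow(f) - 1)/2;    *-- one df per off-diagonal pair;
   prob  = 1 - probchi(chisq, df);
   print chisq df prob;
quit;
```

For these data the Pearson chisquare is about 19.1 on 6 df. The diagonal cells contribute nothing, since each is fitted exactly under symmetry.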

4. Macro interface

[This material has not yet been included in this HTML version.]

5. Examples

Example 1

The program below shows the use of MOSAICS to produce a set of different mosaic displays for a 4 x 4 x 2 table of 592 people classified by hair color, eye color and sex.

The module haireye creates the variables table, levels, vnames, lnames, and title. Since the variables are to be entered into the mosaic in the order hair color, eye color, and sex, the table variable is created as a 2 x 16 matrix with hair color varying most rapidly across the columns and sex varying down the two rows. Note that the lnames variable is a 3 x 4 matrix, and the last row contains two blank values. The statement run haireye; creates these variables in the SAS/IML workspace.

The first run mosaic statement produces two plots, whose tiles show the [HairEye] marginal table and the full three-way table. Since fittype is not specified, the model [HairEye][Sex], in which Sex is independent of hair color and eye color jointly, is fit to the three-way table. split={V H} specifies that the first division of the mosaic is in the vertical direction. The printed output produced from this run is shown in Figure 1.

The second run mosaic statement fits the same models, but reorders the eye colors in the table to better display the pattern of association between hair color and eye color in the two-way table. It is also necessary to rearrange the eye color labels in row 2 of lnames. (This reordering is based on a correspondence analysis of residuals in the two-way table, described by Friendly (1994), which was carried out separately.) Note that the global variables split and htext specified before the first call continue to be used here. The plots produced from this call are shown in Figure 2 and Figure 3.

The third run mosaic statement plots only the three-way display, showing residuals from the model in which hair color, eye color and sex are mutually independent. This plot is shown in Figure 4.

goptions vsize=7in hsize=7in ;   *-- square plot environment;
 
proc iml;
start haireye;
   *-- Hair color, eye color data;
  table = {
  /* ----brown---   -----blue-----   ----hazel---   ---green--- */
   32  53  10  3   11  50  10  30   10  25  7  5   3  15  7  8,   /* M */
   36  66  16  4    9  34   7  64    5  29  7  5   2  14  7  8 }; /* F */
 
  levels= { 4 4 2 };
  vnames = {'Hair' 'Eye' 'Sex' };    /* Variable names */
  lnames = {                         /* Category names */
           'Black' 'Brown' 'Red' 'Blond',    /* hair color */
           'Brown' 'Blue' 'Hazel' 'Green',   /* eye color  */
           'Male' 'Female' ' '  ' ' };       /* sex        */
  title  = 'Hair color - Eye color data';
  finish;
 
  run haireye;
   reset storage=mosaic.mosaic;
   load module=_all_;
   *-- Fit models of joint independence (fittype='JOINT');
   plots = 2:3;
   split={V H};
   htext=1.6;
   run mosaic(levels, table, vnames, lnames, plots, title);
 
   *-- reorder eye colors (brown, hazel, green, blue);
   table  = table[,((1:4) || (9:16) || (5:8))];
   lnames[2,] = lnames[2,{1 3 4 2}];
   plots=2:3;
   run mosaic(levels, table, vnames, lnames, plots, title);
 
   plots=3;
   fittype='MUTUAL';
   run mosaic(levels, table, vnames, lnames, plots, title);
quit;

+-------------------------------------------------------------------+
|                                                                   |
|              +-------------------------------------------+        |
|              |  Generalized Mosaic Display, Version 2.9  |        |
|              +-------------------------------------------+        |
|                                                                   |
|                       TITLE                                       |
|                       Hair color - Eye color data                 |
|                                                                   |
|             VNAMES     LEVELS    LNAMES                           |
|             Hair            4    Black  Brown  Red    Blond       |
|             Eye             4    Brown  Hazel  Green  Blue        |
|             Sex             2    Male   Female                    |
|                                                                   |
|                              Global options                       |
|                                                                   |
|               FITTYPE  DEVTYPE  FILLTYPE  SPLIT  SHADE            |
|               JOINT    GF       M45       V H       2    4        |
|                                                                   |
|                          Factor:         1 Hair                   |
|                                                                   |
|                             Marginal totals                       |
|                                                                   |
|              MARGIN     Black     Brown       Red     Blond       |
|                                                                   |
|                           108       286        71       127       |
|                                                                   |
|                          Factor:         2 Eye                    |
|                                                                   |
|                             Marginal totals                       |
|                                                                   |
|              MARGIN     Brown     Hazel     Green      Blue       |
|                                                                   |
|              Black         68        15         5        20       |
|              Brown        119        54        29        84       |
|              Red           26        14        14        17       |
|              Blond          7        10        16        94       |
|                                                                   |
|                                                                   |
|             MODEL              DF   CHISQ               PROB      |
|             {Hair}{Eye}         9   G.F.    138.290   0.0000      |
|                                     L.R.    146.444   0.0000      |
|                                                                   |
|                     Standardized Pearson deviations               |
|                                                                   |
|                          Brown    Hazel    Green     Blue         |
|                                                                   |
|                Black      4.40    -0.48    -1.95    -3.07         |
|                Brown      1.23     1.35    -0.35    -1.95         |
|                Red       -0.07     0.85     2.28    -1.73         |
|                Blond     -5.85    -2.23     0.61     7.05         |
|                                                                   |
|                          Factor:         3 Sex                    |
|                                                                   |
|                             Marginal totals                       |
|                                                                   |
|                     MARGIN            Male    Female              |
|                                                                   |
|                     Black Brown         32        36              |
|                     Black Hazel         10         5              |
|                     Black Green          3         2              |
|                     Black Blue          11         9              |
|                     Brown Brown         38        81              |
|                     Brown Hazel         25        29              |
|                     Brown Green         15        14              |
|                     Brown Blue          50        34              |
|                     Red   Brown         10        16              |
|                     Red   Hazel          7         7              |
|                     Red   Green          7         7              |
|                     Red   Blue          10         7              |
|                     Blond Brown          3         4              |
|                     Blond Hazel          5         5              |
|                     Blond Green          8         8              |
|                     Blond Blue          30        64              |
|                                                                   |
|                                                                   |
|           MODEL                  DF   CHISQ               PROB    |
|           [Hair,Eye][Sex]        15   G.F.     28.993   0.0161    |
|                                       L.R.     29.350   0.0145    |
|                                                                   |
|                     Standardized Pearson deviations               |
|                                                                   |
|                                       Male   Female               |
|                                                                   |
|                      Black Brown      0.30    -0.27               |
|                      Black Hazel      1.28    -1.15               |
|                      Black Green      0.52    -0.46               |
|                      Black Blue       0.70    -0.63               |
|                      Brown Brown     -2.07     1.86               |
|                      Brown Hazel      0.19    -0.17               |
|                      Brown Green      0.57    -0.52               |
|                      Brown Blue       2.05    -1.84               |
|                      Red   Brown     -0.47     0.42               |
|                      Red   Hazel      0.30    -0.27               |
|                      Red   Green      0.30    -0.27               |
|                      Red   Blue       0.88    -0.79               |
|                      Blond Brown     -0.07     0.06               |
|                      Blond Hazel      0.26    -0.23               |
|                      Blond Green      0.32    -0.29               |
|                      Blond Blue      -1.84     1.65               |
|                                                                   |
+-------------------------------------------------------------------+

Figure 1: Printed output for hair color, eye color data, run 1

[Fig. 2] Figure 2: Two-way mosaic for hair color and eye color. Positive deviations from independence have solid outlines and are shaded blue. Negative deviations have dashed outlines and are shaded red. The two levels of shading density correspond to standardized deviations greater than 2 and 4 in absolute value.

[Fig. 3] Figure 3: Mosaic display for hair color, eye color, and sex. The categories of sex are crossed with those of hair color, but only the first occurrence is labeled. Residuals from the model [HE] [S] are shown by shading.

[Fig. 4] Figure 4: Mosaic display for hair color, eye color, and sex, showing residuals from the model of complete independence, [H] [E] [S]. (This figure was created in a separate run, using the LEGEND option.)
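The two-way statistics in Figure 1 can be checked outside of MOSAICS. The following is a minimal sketch, not code taken from MOSAICS.SAS, and it assumes only base SAS/IML operators and functions. It starts from the Hair by Eye marginal totals printed in Figure 1 and computes the expected frequencies under independence, the standardized Pearson deviations, and the two chi-square statistics. The names m, e, d, gf, and lr are chosen here for illustration only.

```sas
proc iml;
   *-- Hair (rows) by Eye (columns) marginal totals, copied from Figure 1;
   m = { 68  15   5  20,
        119  54  29  84,
         26  14  14  17,
          7  10  16  94 };
   r  = m[,+];                        *-- row totals, a 4 x 1 vector;
   c  = m[+,];                        *-- column totals, a 1 x 4 vector;
   n  = sum(m);                       *-- grand total;
   e  = r * c / n;                    *-- expected frequencies under independence;
   d  = (m - e) / sqrt(e);            *-- standardized Pearson deviations;
   gf = ssq(d);                       *-- Pearson goodness-of-fit chi-square;
   lr = 2 # sum( m # log(m / e) );    *-- likelihood-ratio chi-square;
   df = (nrow(m) - 1) # (ncol(m) - 1);
   print d[format=7.2], gf lr df;
quit;
```

Up to rounding, d, gf, and lr should agree with the deviations and the G.F. and L.R. values shown for the model {Hair}{Eye} in Figure 1.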

Example 2

This example illustrates input of data from a SAS data set and the use of proc sort to rearrange the variables in a table to the order desired in the mosaic displays.

The data is a $2^4$ table classified by Gender, reported Pre-marital sex, Extra-marital sex and Marital Status, read in by the DATA step marital below. Note that the variable marital varies most rapidly and the variable gender varies most slowly in the observations in the data set. The desired order of the variables in the mosaic is Gender, Pre, Extra, and Marital. In the table variable in SAS/IML, the first variable, Gender, must vary most rapidly. This is accomplished by sorting the observations with the variables listed in the reverse order on the by statement in the proc sort step.

data marital;
   input gender $ pre $ extra $ @;
   marital='Divorced';  input freq @;  output;
   marital='Married';   input freq @;  output;
cards;
Women  Yes  Yes   17   4
Women  Yes  No    54  25
Women  No   Yes   36   4
Women  No   No   214 322
Men    Yes  Yes   28  11
Men    Yes  No    60  42
Men    No   Yes   17   4
Men    No   No    68 130
;
proc sort data=marital;
   by marital extra pre gender;

In the proc iml step, the statement use marital; accesses the data set. The variable freq from the data set is read into the IML table variable, a 16 x 1 matrix. Note that the levels of the character variables gender, pre, and extra are sorted alphabetically, so the category labels in lnames must appear in this order.

proc iml;
   use marital;
   read all var{freq} into table;
   levels = { 2 2 2 2 };
   vnames = {'Gender' 'Pre' 'Extra' 'Marital'};
   lnames = {'Men      '  'Women     ',
             'Pre Sex: No'  'Yes',
             'Extra Sex: No'   'Yes',
             'Divorced'   'Married' };
   title  = 'Pre/Extramarital Sex and Marital Status';
 
   reset storage=mosaic.mosaic;
   load module=_all_;
   split = {V H};
   htext=1.6;
   plots = 2:4;
   run mosaic(levels, table, vnames, lnames, plots, title);
 
   plots = 4;
   fittype='USER';
   title ='Model (GPE, PM, EM)';
   config = { 1  2  3,
              2  4  4,
              3  0  0};
   run mosaic(levels, table, vnames, lnames, plots, title);

The first run mosaic statement produces plots of the 2-way to 4-way tables, fitting models of joint independence. The second run mosaic statement produces a plot of the 4-way table, fitting the model [GPE] [PM] [EM] specified by the config variable together with fittype='USER'. This model treats G, P, and E as explanatory, and M as a response. This is equivalent to the logit model with main effects of premarital sex and extramarital sex on marital status.
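As the example suggests, config can be read one column at a time: each column lists the variable numbers in one term of the model, padded with zeros. The short sketch below is an illustration only, not part of MOSAICS. It prints the variable names in each column of the config matrix used above; the names v and term are hypothetical.

```sas
proc iml;
   vnames = {'Gender' 'Pre' 'Extra' 'Marital'};
   config = { 1  2  3,
              2  4  4,
              3  0  0};
   do j = 1 to ncol(config);
      v = config[,j];               *-- variable numbers in column j;
      v = v[loc(v > 0)];            *-- drop the zero padding;
      term = vnames[v];             *-- names of the variables in term j;
      print j term;
   end;
quit;
```

The three columns correspond to the terms [GPE], [PM], and [EM].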

Using the readtab routine, this example can be simplified as follows. The routine constructs the table, levels, and lnames variables. (But note that the values of the Pre and Extra variables are both simply 'Yes' or 'No'.)

proc iml;
   *-- load the MOSAICS modules first, so that readtab is available;
   reset storage=mosaic.mosaic;
   load module=_all_;
 
   vnames = {'Gender' 'Pre' 'Extra' 'Marital'};
   run readtab('marital', 'freq', vnames, table, levels, lnames);
   title  = 'Pre/Extramarital Sex and Marital Status';
 
   split = {V H};
   htext=1.6;
   plots = 2:4;
   run mosaic(levels, table, vnames, lnames, plots, title);
   ...

Example 3

This example shows the use of SAS/IML itself to reorder the variables in a contingency table for the mosaic display. It uses the same data as in the previous example.

The variables in a contingency table are reordered by the MARG function (which calculates marginal totals) when the model specified by the config parameter is the saturated model, with the variables listed in the desired order. For example, for the four-way table of the previous example, the configuration {4,3,2,1} gives the same order of the variables as that created by the proc sort step.

MOSAICS.SAS includes an IML module reorder (shown partly below) which will reorder the variables in any table. It also rearranges the values in the levels, vnames, and lnames variables in the same order.

start reorder(dim, table, vnames, lnames, order);
   *-- reorder the dimensions of an n-way table;
   if nrow(order) =1 then order=order`;
   run marg(loc,newtab,dim,table,order);
   table = newtab;
   dim = dim[order,];
   vnames = vnames[order,];
   lnames = lnames[order,];
   finish;

The data table is defined, listing the observations in the same order as in the DATA step marital shown in Example 2. Note that vnames and lnames conform to this order. After the call to reorder, the variables table, levels, vnames, and lnames have been rearranged so that Gender is the first variable in the mosaic, and Marital status is last.

proc iml;
  *-- define the data variables;
  table={ 17   4 ,  /* Women  Yes  Yes  */
          54  25 ,  /* Women  Yes  No   */
          36   4 ,  /* Women  No   Yes  */
         214 322 ,  /* Women  No   No   */
          28  11 ,  /* Men    Yes  Yes  */
          60  42 ,  /* Men    Yes  No   */
          17   4 ,  /* Men    No   Yes  */
          68 130 }; /* Men    No   No   */
   levels = { 2 2 2 2 };
   vnames = {'Marital' 'Extra' 'Pre' 'Gender'};
   lnames = {'Divorced'   'Married',
             'Extra Sex: Yes' 'No',
             'Pre Sex: Yes'   'No',
             'Women    '      'Men' };
   title  = 'Pre/Extramarital Sex and Marital Status';
 
   reset storage=mosaic.mosaic;
   load module=_all_;
 
   order = { 4,3,2,1};
   run reorder(levels, table, vnames, lnames, order);
   split = {V H};
   plots = 2:4;
   run mosaic(levels, table, vnames, lnames, plots, title);
quit;

Sample data sets

A variety of contingency tables are supplied with the MOSAICS distribution in the file mosdata.sas. These are listed in the table below, with the variable names and dimensions given in their order as in vnames.

Mosaics sample data sets
Module name	Ways	Title	Variable names (dimensions)
`abortion`	3	Abortion opinion data	Sex (2) x Status (2) x Support Abortion (2)
`bartlett`	3	Bartlett data	Alive? (2) x Time (2) x Length (2)
`berkeley`	3	Berkeley Admissions Data	Admit (2) x Gender (2) x Dept (6)
`cancer`	3	Breast Cancer Patients	Survival (2) x Grade (2) x Center (2)
`cesarean`	4	Risk factors for infection in cesarean births	Infection (3) x Risk? (2) x Antibiotics (2) x Planned (2)
`detergen`	4	Detergent preference data	Temperature (2) x M-User? (2) x Preference (2) x Water softness (3)
`dyke`	5	Sources of knowledge of cancer	Knowledge (2) x Reading (2) x Radio (2) x Lectures (2) x Newspaper (2)
`employ`	3	Employment Status Data	EmployStatus (2) x Layoff (2) x LengthEmploy (6)
`gilby`	2	Clothing and intelligence rating of children	Dullness (6) x Clothing (4)
`haireye`	3	Hair color - Eye color data	Hair (4) x Eye (4) x Sex (2)
`heckman`	5	Labour force participation of married women 1967-1971	1971 (2) x 1970 (2) x 1969 (2) x 1968 (2) x 1967 (2)
`hoyt`	4	Minnesota High School Graduates	Status (4) x Rank (3) x Occupation (7) x Sex (2)
`marital`	4	Pre/Extramarital Sex and Marital Status	Marital (2) x Extra (2) x Pre (2) x Gender (2)
`mobility`	2	Social Mobility data	Son's Occupation (5) x Father's Occupation (5)
`suicide`	3	Suicide data	Sex (2) x Age (5) x Method (6)
`titanic`	4	Survival on the Titanic	Class (4) x Sex (2) x Age (2) x Survived (2)
`victims`	2	Repeat Victimization Data	First Victimization (8) x Second Victimization (8)

Each data set is stored as a SAS/IML module containing definitions for the variables title, dim, vnames, lnames, and table used in the run mosaic statement. Note that the variable dim corresponds to levels in the arguments to mosaic. See the module haireye in Example 1.

The program mosdata.sas is set up so that running it will create a SAS/IML storage catalog MOSDATA in the MOSAIC library. Once this has been done, any data set may be obtained by loading the module from MOSAIC.MOSDATA and running it. For example, the previous example could be done using the module marital, as shown below.

 
proc iml;
   reset storage=mosaic.mosdata;
   load module=marital;
   run marital;
	
   reset storage=mosaic.mosaic;
   load module=_all_;
 
   ord = { 4,3,2,1};
   run reorder(dim, table, vnames, lnames, ord);
   split = {V H};
   plots = 2:4;
   run mosaic(dim, table, vnames, lnames, plots, title);
quit;

6. Implementation

This section describes the algorithm for the construction of mosaic displays and provides some notes on the structure of the program.

Algorithm

The process is a naturally recursive one which can be implemented easily in a language which supports recursion and multi-dimensional arrays, such as APL. Wang (1985) describes a FORTRAN implementation of mosaic displays which simulates multi-dimensional arrays by subscripting a vector. The following algorithm, which uses two-dimensional arrays, is much simpler.

1. Denote the number of levels of the n variables by $l_1, \ldots, l_n$, and let $L_s = \prod_{i=1}^{s} l_i$. At step $s = 0$, start with one tile, a square of size 100 x 100, and let $L_0 = 1$.
2. The tiles in the mosaic are represented by an array B of four columns (called boxes in the program). Columns 1 and 2 give the (x, y) location of the lower left corner of the tile; columns 3 and 4 give the horizontal and vertical lengths of the tile. At step 0, B = { 0 0 100 100 }. There is one row for each tile. The following steps are repeated for each variable, $s = 1, \ldots, n$:
3. For variable s, find the marginal frequencies of variables 1, ..., s, a vector of length $L_s$, with the levels of variable s varying most rapidly.
4. Reshape this vector row-wise to a matrix $M = \{ m_{gh} \}$ of $L_{s-1}$ rows and $l_s$ columns. (The array M is called margin in the program. See the arrays labelled "Marginal totals" in Figure 1.) The rows of M correspond to the tiles of the previous variables at step s - 1.
5. Each old tile is then divided vertically (if s is odd) or horizontally (s even) into $l_s$ tiles, with the width (s odd) or height (s even) of each tile proportional to $m_{gh} / m_{g+}$.
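The division of the tiles described above can be sketched in a few lines of SAS/IML. This is a simplified illustration, not the divide1 module from MOSAICS.SAS: the module name tilesplit and its arguments are hypothetical, the spacing between tiles is ignored, and every row of M is assumed to have a nonzero total. The margins used are those for Hair and Eye printed in Figure 1.

```sas
proc iml;
   *-- Hypothetical helper: divide each tile in B according to one row of M;
   start tilesplit(B, M, vertical);
      k = ncol(M);
      T = j(nrow(M) # k, 4, 0);          *-- one row per new tile;
      do g = 1 to nrow(M);
         p    = M[g,] / sum(M[g,]);      *-- proportions m_gh / m_g+ ;
         off  = cusum(p) - p;            *-- start of each new tile within the old one;
         rows = (g-1) # k + (1:k);       *-- rows of T that hold the children of tile g;
         if vertical then do;            *-- split the width and keep the height;
            T[rows,1] = B[g,1] + B[g,3] # off`;
            T[rows,2] = B[g,2];
            T[rows,3] = B[g,3] # p`;
            T[rows,4] = B[g,4];
         end;
         else do;                        *-- split the height and keep the width;
            T[rows,1] = B[g,1];
            T[rows,2] = B[g,2] + B[g,4] # off`;
            T[rows,3] = B[g,3];
            T[rows,4] = B[g,4] # p`;
         end;
      end;
      return (T);
   finish;

   B  = {0 0 100 100};                   *-- step 0 with one 100 x 100 tile;
   M1 = {108 286 71 127};                *-- Hair totals, 1 x 4;
   M2 = { 68  15   5  20,                *-- Hair by Eye totals, 4 x 4;
         119  54  29  84,
          26  14  14  17,
           7  10  16  94 };
   B = tilesplit(B, M1, 1);              *-- s = 1 is odd so divide vertically;
   B = tilesplit(B, M2, 0);              *-- s = 2 is even so divide horizontally;
   cn = {'x' 'y' 'width' 'height'};
   print B[format=8.2 colname=cn];
quit;
```

The result has one row per tile, in the same order as the rows of the margin for the next variable, so the module could be applied again for a third variable.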

This computational scheme has several desirable properties:

At any stage the division of the tiles for the current variable is in proportion to the entries in each row of M divided by the row totals.
We can draw the tiles representing the marginal frequencies at any stage, not just the final stage as Hartigan and Kleiner do.
Fitting the model of joint independence of the current variable with all previous variables jointly is equivalent to testing independence of the rows and columns of the matrix M. For example, for a three-way table, the expected frequencies under the model [AB] [C] can be expressed in terms of the $IJ \times K$ matrix $M$ as $m_{(ij)+} \, m_{+k} / m_{++}$.

Spacing

This procedure gives a mosaic of $L_n = l_1 \times l_2 \times \cdots \times l_n$ tiles with no spacing, in which cells with small frequencies are difficult to see. Following Hartigan and Kleiner, the tiles are separated, with larger spacings at the earlier subdivisions, to help preserve the visual impact of small counts. For a four-way table with vertical splitting on variables 1 and 3, the divisions of the first variable are spaced proportionally to $1 / (l_1 - 1)$; divisions between levels of the third variable are spaced proportionally to $1 / (l_1 l_3 - 1)$.

This spacing of the tiles is accomplished by constructing an unspaced mosaic in a reduced area (determined by the space parameter), then expanding to include the necessary spacing.

Program structure

MOSAICS.SAS consists of 22 SAS/IML modules (subroutines and functions). The calling structure of the modules is shown in Figure 5.

+-------------------------------------------------------------------+
|                                                                   |
|   mosaic    *-- check inputs, assign default values;              |
|   |                                                               |
|   |-- divide   *-- fit models and draw the mosaic display;        |
|       |                                                           |
|       |--reduce   *-- find reduced model for factors 1:f;         |
|       |                                                           |
|       |--mfit     *-- fits a specified model;                     |
|       |                                                           |
|       |--chisq    *-- calculate chisquares;                       |
|       |                                                           |
|       |--df       *-- calculate degrees of freedom;               |
|       |   |--terms    *-- find all terms in a loglinear model;    |
|       |       |--vars_in    *-- find variables in a term;         |
|       |                                                           |
|       |--modname  *-- expand config into string for model label;  |
|       |                                                           |
|       |--divide1  *-- divide the mosaic for the next variable;    |
|       |                                                           |
|       |--space    *-- space the tiles in the current display;     |
|       |                                                           |
|       |--labels   *-- calculate label placements;                 |
|       |                                                           |
|       |--gboxes   *-- draw the current display;                   |
|          |--fillbox   *-- custom shading;                         |
|          |--glegend   *-- draw legend;                            |
|                                                                   |
|   readtab   *-- read input frequencies, level names;              |
|   |--readlab   *-- read level names, reorder input                |
|                                                                   |
|   reorder   *-- reorder the dimensions of an n-way table;         |
+-------------------------------------------------------------------+

Figure 5: Calling structure of the modules in MOSAICS.SAS

The top-level module, mosaic, simply validates the input parameters, assigns default values for global variables, and calls the module divide. The steps in the algorithm described above are carried out by divide; the calculation of the new tiles in step 5 is performed in divide1.

Changes

Version 3.5

Fixed conflict between the global variable DEVTYPE and the macro variable used for graphics device control.
Changed circle blanking used for CELLFILL to white/black text, depending on shading density.
Added control of threshold for CELLFILL. You can now say CELLFILL = DEV 1.0 and all absolute residuals > 1.0 will have their values written in the tiles.
Added calculation and display of adjusted residuals ($= d / \sqrt{1-h}$).
The default font now depends on the device driver, making it easier to get PS/EPS output in Windows.
Added NAME global variable for graph names in the graphics catalog.
Fixed a bug in the calculation of adjusted residuals.
Added CELLFILL='FREQ' to display cell frequency in the tiles.
Added ABBREV global to abbreviate variable names in models and titles.

Version 3.4

Added vlabels global variable to control the number of variables for which variable names are used in the display. Also, fuzz now sets the line style to solid.
Global variables are now set in a separate module to make changing defaults easier.
In reorder module, you can now specify the variable names in the new order, rather than indices. The config configuration may also be specified using variable names.
Added code for models of joint independence and conditional independence in which any variable may be specified as the jointly independent or conditioning one.

Version 3.3

Added a GSKIP module, for EPS output to separately named graphics files. Requires a global macro variable, &DEVTYPE = EPS.

Version 3.2

Added zeros= global input matrix to handle structural zeros.
Added ability to display chisquare value in the mosaic title for each plot, by using '&G2' in the title string.
Changed default values to filltype={HLS HLS}, colors={BLUE RED} since this is what I always use now, except for monochrome output.

Version 3.1

Added readtab routine for easier input from a SAS dataset.
Added devtype='FT' to calculate and display Freeman-Tukey residuals.
Character values of global input variables no longer need be entered in upper case.

Version 3.0

Added ability to fit a sequence of Markov models (fittype='MARKOV';) for lag sequential data.
Fit the equiprobability model for the display of the first variable.

Version 2.9

Installation simplified by creating a separate file, MOSAICM.SAS, to install IML modules.
Filltypes changed to allow separate coding for positive and negative residuals, and to provide grayscale shading levels.
Added ability (cellfill) to print a symbol in the cell symbolizing the value of the residual.

References

Friendly, M. (1991). Mosaic displays for multi-way contingency tables. York Univ.: Dept. of Psychology Reports, 1991, No. 195.
Friendly, M. (1992). Mosaic displays for loglinear models. Proceedings of the Statistical Graphics Section, American Statistical Association, 61-68.
Friendly, M. (1994). Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89, 190-200.
Friendly, M. (1998). Extending Mosaic Displays: Marginal, Partial, and Conditional Views of Categorical Data. Paper presented at the Workshop on "Data Visualization in Statistics", July 6-10, 1998, held at Drew University. [Published in JCGS, 1999, 8:373-395.]
Hartigan, J. A., and Kleiner, B. (1981). Mosaics for contingency tables. In W. F. Eddy (Ed.), Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, 268-273. New York: Springer-Verlag.
Wang, C. M. (1985). Applications and computing of mosaics. Computational Statistics & Data Analysis, 3, 89-97.