Michael Friendly

Table 1: Hair-color eye-color data

                        Hair Color
   Eye Color   BLACK   BROWN     RED   BLOND | Total
   Brown          68     119      26       7 |   220
   Blue           20      84      17      94 |   215
   Hazel          15      54      14      10 |    93
   Green           5      29      14      16 |    64
   ------------------------------------------+------
   Total         108     286      71     127 |   592

For any two-way table, the expected frequencies under
independence can be represented by rectangles whose widths are
proportional to the total frequency in each column, $f_{+j}$,
and whose heights are proportional to the total frequency in
each row, $f_{i+}$; the area of each rectangle is then
proportional to the expected frequency, $e_{ij} = f_{i+} f_{+j} / f_{++}$.
Figure 7 shows the expected frequencies for the hair and eye color data.
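As a minimal sketch of the calculation behind these rectangles, the following Python snippet computes $e_{ij} = f_{i+} f_{+j} / f_{++}$ for the counts in Table 1. The function and variable names are illustrative assumptions, not part of the original analysis.

```python
# Hair-color by eye-color counts from Table 1
# (rows: eye color, columns: hair color).
hair = ["Black", "Brown", "Red", "Blond"]
eye = ["Brown", "Blue", "Hazel", "Green"]
observed = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def expected_frequencies(table):
    """Return e_ij = f_i+ * f_+j / f_++ for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)

# Print one row of expected frequencies per eye color.
print(" " * 6, " ".join(f"{h:>7s}" for h in hair))
for name, row in zip(eye, expected):
    print(f"{name:6s}", " ".join(f"{e:7.1f}" for e in row))
```

Each rectangle in the display has width proportional to a column total and height proportional to a row total, so its area is proportional to the corresponding entry of `expected`.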

Figure 7: Expected frequencies under
independence.

Riedwyl and Schüpbach (1983, 1994) proposed a sieve
diagram (later called a parquet diagram) based on this principle.

Figure 8: Sieve diagram for hair-eye data.

Figure 9 shows data on visual acuity in a large sample of women (n = 7477). The diagonal cells show the obvious: people tend to have the same visual acuity in both eyes, and there is a strong lack of independence. The off-diagonal cells show a more subtle pattern that suggests symmetry and a diagonals model.

Figure 10 shows the frequencies with which draft-age men with birthdays in various months were assigned priority values for induction into the US Army in the 1972 draft lottery. The assignment was supposed to be random, but the figure shows a greater tendency for those born in the latter months of the year to be assigned smaller priority values.

Figure 9: Vision classification data for 7477 women

Figure 10: Data from the US Draft Lottery

For a two-way contingency table, the signed contribution to
Pearson $\chi^2$ for cell $(i, j)$ is

$$ d_{ij} = \frac{f_{ij} - e_{ij}}{\sqrt{e_{ij}}} = \text{std. residual}, \qquad \chi^2 = \sum_{i} \sum_{j} d_{ij}^2 $$

In the association plot, each cell is shown by a rectangle with:

- (signed) height $\sim d_{ij}$
- width $= \sqrt{e_{ij}}$

The rectangles for each row in the table are positioned relative to a baseline representing independence ($d_{ij} = 0$). The area of each rectangle is proportional to the raw residual, $f_{ij} - e_{ij}$:

                 sqrt(e_ij)
          +---------------------+
          |                     |
          | area = f_ij - e_ij  |   d_ij = (f_ij - e_ij) / sqrt(e_ij)
          |                     |
          +---------------------+
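The quantities that define each rectangle can be sketched in a few lines of Python. This is an illustrative computation on the hair-color, eye-color counts of Table 1; the function name `pearson_residuals` is an assumption made for the example.

```python
import math

# Hair-color by eye-color counts from Table 1
# (rows: eye color, columns: hair color).
observed = [
    [68, 119, 26, 7],
    [20, 84, 17, 94],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
]


def pearson_residuals(table):
    """Return (d, e): standardized residuals d_ij and expected counts e_ij."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    e = [[r * c / n for c in col_totals] for r in row_totals]
    d = [
        [(f - e_ij) / math.sqrt(e_ij) for f, e_ij in zip(f_row, e_row)]
        for f_row, e_row in zip(table, e)
    ]
    return d, e


d, expected = pearson_residuals(observed)

# Pearson chi-square is the sum of squared residuals (about 138.3 here).
chi2 = sum(x * x for row in d for x in row)
print(f"chi-square = {chi2:.2f}")

# Height times width recovers the raw residual f_ij - e_ij for each cell.
area = [
    [d_ij * math.sqrt(e_ij) for d_ij, e_ij in zip(d_row, e_row)]
    for d_row, e_row in zip(d, expected)
]
```

Cells with a positive `d` value rise above the baseline and cells with a negative value fall below it.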

Figure 11: Association plot for hair-color, eye-color

- Strength of agreement vs. strength of association: Observers' ratings can be strongly associated without strong agreement.
- Marginal homogeneity: If observers tend to use the categories with different frequencies, this will affect measures of agreement.

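Cohen's kappa compares the observed agreement, $P_o = \sum_i p_{ii}$, with the agreement expected by chance when the ratings are independent, $P_c = \sum_i p_{i+} \, p_{+i}$:

$$ \kappa = \frac{P_o - P_c}{1 - P_c} \qquad (4) $$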

- For perfect agreement, $\kappa = 1$.
- Minimum $\kappa < 0$, and the lower bound depends on the marginal totals.
- Unweighted $\kappa$ only counts strict agreement (same category assigned by both observers). A weighted version of $\kappa$ is used when one wishes to allow for partial agreement. For example, exact agreements might be given full weight, and one-category differences given weight 1/2. (This makes sense only when the categories are ordered, as in severity of diagnosis.)

Sex is fun for me and my partner (a) Never or occasionally, (b) fairly often, (c) very often, (d) almost always.

                    |-------- Wife's Rating -------|
   Husband's        Never   Fairly   Very    Almost
   Rating           fun     often    often   always    SUM
   Never fun          7       7        2       3        19
   Fairly often       2       8        3       7        20
   Very often         1       5        4       9        19
   Almost always      2       8        9      14        33
   SUM               12      28       18      33        91

Unweighted kappa gives the following results:

   Observed and Expected Agreement (under independence)
   Observed agreement             0.3626
   Expected agreement             0.2680
   Cohen's Kappa (Std. Error)     0.1293 (0.1343)

Two commonly used patterns of weights are integer (equally spaced) weights and Fleiss-Cohen weights:

   Integer Weights             Fleiss-Cohen Weights
   1    2/3  1/3  0            1    8/9  5/9  0
   2/3  1    2/3  1/3          8/9  1    8/9  5/9
   1/3  2/3  1    2/3          5/9  8/9  1    8/9
   0    1/3  2/3  1            0    5/9  8/9  1

These weights give a somewhat higher assessment of agreement (perhaps too high).

                      Obs     Exp               Std      Lower     Upper
                      Agree   Agree   Kappa     Error    95%       95%
   Unweighted         0.363   0.268   0.1293    0.134    -0.1339   0.3926
   Integer Weights    0.635   0.560   0.1701    0.065     0.0423   0.2978
   Fleiss-Cohen Wts   0.814   0.722   0.3320    0.125     0.0861   0.5780
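The observed agreement, expected agreement and kappa values can be checked with a short Python sketch. It assumes weights of the form $1 - |i - j|/(k - 1)$ for the integer pattern and $1 - (i - j)^2/(k - 1)^2$ for the Fleiss-Cohen pattern, which reproduce the weight matrices shown above; the function names are illustrative. The sketch reproduces the Unweighted and Fleiss-Cohen rows of the summary table. For the integer weights it gives a kappa of about 0.237, which agrees with the Weighted Kappa line of the PROC FREQ output rather than with the Integer Weights row of the table.

```python
# Husband (rows) by wife (columns) ratings, as in the table above.
table = [
    [7, 7, 2, 3],
    [2, 8, 3, 7],
    [1, 5, 4, 9],
    [2, 8, 9, 14],
]


def exact(i, j, k):
    """Unweighted kappa: only exact agreement counts."""
    return 1.0 if i == j else 0.0


def integer(i, j, k):
    """Equally spaced weights: 1, 2/3, 1/3, 0 for k = 4."""
    return 1 - abs(i - j) / (k - 1)


def fleiss_cohen(i, j, k):
    """Inverse-square weights: 1, 8/9, 5/9, 0 for k = 4."""
    return 1 - (i - j) ** 2 / (k - 1) ** 2


def weighted_kappa(counts, weight):
    """Return (observed agreement, expected agreement, kappa)."""
    k = len(counts)
    n = sum(sum(row) for row in counts)
    rows = [sum(row) for row in counts]
    cols = [sum(col) for col in zip(*counts)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(weight(i, j, k) * counts[i][j] for i, j in cells) / n
    p_exp = sum(weight(i, j, k) * rows[i] * cols[j] for i, j in cells) / n**2
    return p_obs, p_exp, (p_obs - p_exp) / (1 - p_exp)


results = {
    "Unweighted": weighted_kappa(table, exact),
    "Integer": weighted_kappa(table, integer),
    "Fleiss-Cohen": weighted_kappa(table, fleiss_cohen),
}
for name, (p_obs, p_exp, kappa) in results.items():
    print(f"{name:13s} obs={p_obs:.3f} exp={p_exp:.3f} kappa={kappa:.4f}")
```

Standard errors and confidence limits are not computed here; they require the asymptotic variance formulas used by the SAS procedures.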

   title 'Kappa for Agreement';
   data fun;
      label husband = 'Husband rating'
            wife    = 'Wife Rating';
      do husband = 1 to 4;
      do wife    = 1 to 4;
         input count @@;
         output;
         end; end;
   cards;
    7  7  2  3
    2  8  3  7
    1  5  4  9
    2  8  9 14
   ;
   proc freq;
      weight count;
      tables husband * wife / noprint agree;
   run;

This produces the following output:

+-------------------------------------------------------------------+
|                                                                   |
|                        Kappa for Agreement                        |
|              STATISTICS FOR TABLE OF HUSBAND BY WIFE              |
|                                                                   |
|                         Test of Symmetry                          |
|                         ----------------                          |
|            Statistic = 3.878    DF = 6    Prob = 0.693            |
|                                                                   |
|                        Kappa Coefficients                         |
|      Statistic          Value     ASE     95% Confidence Bounds   |
|      ------------------------------------------------------       |
|      Simple Kappa       0.129    0.069     -0.005     0.264       |
|      Weighted Kappa     0.237    0.078      0.084     0.391       |
|                                                                   |
|      Sample Size = 91                                             |
|                                                                   |
+-------------------------------------------------------------------+

The agreement chart is constructed as an $n \times n$ square,
where $n$ is the total sample size. Black squares, each of
size $n_{ii} \times n_{ii}$, show observed agreement. These
are positioned within larger rectangles, each of size
$n_{i+} \times n_{+i}$. Each large rectangle shows the maximum possible
agreement, given the marginal totals. Thus, a visual impression of
the strength of agreement is

$$ B_N = \frac{\text{area of dark squares}}{\text{area of rectangles}} = \frac{\sum_{i=1}^{k} n_{ii}^2}{\sum_{i=1}^{k} n_{i+} \, n_{+i}} \qquad (5) $$

Figure 12: Agreement chart for husbands'
and wives' sexual fun. The $B_N$ measure is the ratio of
the areas of the dark squares to their enclosing rectangles, counting
only exact agreement. $B_N = 0.146$ for these data.
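The value of $B_N$ quoted in the caption can be checked directly from the husband-wife table. The sketch below assumes the ratio of summed squared diagonal counts to summed products of the corresponding marginal totals; the function name `agreement_b` is hypothetical.

```python
# Husband (rows) by wife (columns) ratings.
table = [
    [7, 7, 2, 3],
    [2, 8, 3, 7],
    [1, 5, 4, 9],
    [2, 8, 9, 14],
]


def agreement_b(counts):
    """Ratio of the dark-square areas to the enclosing-rectangle areas."""
    k = len(counts)
    rows = [sum(row) for row in counts]
    cols = [sum(col) for col in zip(*counts)]
    dark = sum(counts[i][i] ** 2 for i in range(k))    # n_ii x n_ii squares
    enclosing = sum(rows[i] * cols[i] for i in range(k))  # n_i+ x n_+i rectangles
    return dark / enclosing


b_n = agreement_b(table)
print(f"B_N = {b_n:.3f}")
```

Here the dark squares have total area 325 and the enclosing rectangles 2219, so the ratio is about 0.146.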

$$
\begin{matrix}
 & & n_{i-b,i} & & \\
 & & \vdots & & \\
n_{i,i-b} & \cdots & n_{ii} & \cdots & n_{i,i+b} \\
 & & \vdots & & \\
 & & n_{i+b,i} & &
\end{matrix}
\qquad\qquad
\begin{matrix}
 & & w_2 & & \\
 & & w_1 & & \\
w_2 & w_1 & 1 & w_1 & w_2 \\
 & & w_1 & & \\
 & & w_2 & &
\end{matrix}
$$

This is incorporated in the agreement chart by successively lighter shaded rectangles whose size is proportional to the sum of the cell frequencies, denoted $A_{bi}$.

$$ B_N^w = \frac{\text{weighted sum of areas of agreement}}{\text{area of rectangles}} = 1 - \frac{\sum_{i}^{k} \left[ n_{i+} n_{+i} - n_{ii}^2 - \sum_{b=1}^{q} w_b A_{bi} \right]}{\sum_{i}^{k} n_{i+} \, n_{+i}} $$

where $w_b$ is the weight for $A_{bi}$, the shaded area $b$ steps from the main diagonal, and $q$ is the largest number of steps of partial agreement considered.

Figure 13: Weighted agreement chart. The
$B_N^w$ measure is the ratio of the areas of the dark
squares to their enclosing rectangles, weighting cells one step
removed from exact agreement with $w_1 = 8/9 = 0.889$.
$B_N^w = 0.628$ for these data.

   New Orleans     |------- Winnipeg Neurologist ------|
   Neurologist     Certain  Probable  Possible  Doubtful   SUM
   Certain MS         5        3         0         0         8
   Probable MS        3       11         4         0        18
   Possible MS        2       13         3         4        22
   Doubtful MS        1        2         4        14        21
   SUM               11       29        11        18        69

Figure 14: Weighted agreement chart.

   title "Classification of Multiple Sclerosis: Marginal Homogeneity";
   proc format;
      value diagnos 1='Certain ' 2='Probable' 3='Possible' 4='Doubtful';
   data ms;
      format win_diag no_diag diagnos.;
      do win_diag = 1 to 4;
      do no_diag  = 1 to 4;
         input count @@;
         if count=0 then count=1e-10;   /* replace zero counts by a tiny positive value */
         output;
         end; end;
   cards;
    5  3  0  0
    3 11  4  0
    2 13  3  4
    1  2  4 14
   ;

In this analysis the diagnostic categories for the two neurologists are repeated measures, since each patient is rated twice. To test whether the marginal frequencies of ratings are the same, we specify `response marginals`:

   title "Classification of Multiple Sclerosis: Marginal Homogeneity";
   proc catmod data=ms;
      weight count;
      response marginals;
      model win_diag * no_diag = _response_ / oneway;
      repeated neuro 2 / _response_= neuro;

The test of marginal homogeneity is the test of NEURO in this model:

+-------------------------------------------------------------------+
|                                                                   |
|                    ANALYSIS-OF-VARIANCE TABLE                     |
|                                                                   |
|        Source                   DF   Chi-Square      Prob         |
|        --------------------------------------------------         |
|        INTERCEPT                 3       222.62    0.0000         |
|        NEURO                     3        10.54    0.0145         |
|                                                                   |
|        RESIDUAL                  0            .         .         |
|                                                                   |
+-------------------------------------------------------------------+

Because the diagnostic categories are ordered, we can actually
obtain a more powerful test by assigning scores to the diagnostic
categories and testing whether the mean scores are the same for both
neurologists. To do this, we specify `response means`.

   title2 'Testing means';
   proc catmod data=ms order=data;
      weight count;
      response means;
      model win_diag * no_diag = _response_;
      repeated neuro 2 / _response_= neuro;

+-------------------------------------------------------------------+
|                                                                   |
|                    ANALYSIS-OF-VARIANCE TABLE                     |
|                                                                   |
|        Source                   DF   Chi-Square      Prob         |
|        --------------------------------------------------         |
|        INTERCEPT                 1       570.61    0.0000         |
|        NEURO                     1         7.97    0.0048         |
|                                                                   |
|        RESIDUAL                  0            .         .         |
|                                                                   |
+-------------------------------------------------------------------+
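The NEURO chi-square for the mean scores can be approximated by hand. The sketch below is a direct Wald computation on the paired score differences, not the weighted least squares machinery that PROC CATMOD uses, and it assumes integer scores 1 to 4 for the categories Certain to Doubtful; the function name is illustrative. For this table it gives a value of about 7.97, in line with the output above.

```python
# New Orleans (rows) by Winnipeg (columns) diagnoses, as in the MS table.
ms = [
    [5, 3, 0, 0],
    [3, 11, 4, 0],
    [2, 13, 3, 4],
    [1, 2, 4, 14],
]


def mean_score_test(counts, scores=None):
    """Wald chi-square (1 df) for equal mean scores of two paired raters."""
    k = len(counts)
    if scores is None:
        scores = list(range(1, k + 1))  # assumed integer scores 1..k
    n = sum(sum(row) for row in counts)
    cells = [(i, j) for i in range(k) for j in range(k)]
    # Each patient contributes the difference between the two scores.
    sum_d = sum(counts[i][j] * (scores[i] - scores[j]) for i, j in cells)
    sum_d2 = sum(counts[i][j] * (scores[i] - scores[j]) ** 2 for i, j in cells)
    mean_d = sum_d / n
    # Multinomial variance of the mean difference (divisor n, not n - 1).
    var_mean_d = (sum_d2 / n - mean_d**2) / n
    return mean_d, mean_d**2 / var_mean_d


mean_diff, chi2 = mean_score_test(ms)
print(f"mean difference = {mean_diff:.3f}, chi-square = {chi2:.2f}")
```

The mean difference is 20/69, or about 0.29 score units: the New Orleans neurologist's ratings fall, on average, further toward the Doubtful end of the scale.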

Figure 15 shows aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and gender. At issue is whether the data show evidence of sex bias in admission practices (Bickel et al., 1975). The figure shows the cell frequencies numerically, but margins for both sex and admission are equated in the display. For these data the sample odds ratio, Odds(Admit | Male) / Odds(Admit | Female), is 1.84, indicating that the odds of admission for males in this sample are almost twice those for females. The four-fold display shows this imbalance clearly.

Figure 15: Four-fold display for Berkeley admissions. The area of each shaded quadrant shows the frequency, standardized to equate the margins for sex and admission. Circular arcs show the limits of a 99% confidence interval for the odds ratio.

The 99% confidence intervals in Figure 15 do not overlap, indicating a significant association between sex and admission. The width of the confidence rings gives a visual indication of the precision of the data.

The admissions data shown in Figure 15 were obtained from six departments, so to determine the source of the apparent sex bias in favor of males, we make a new plot, Figure 16, stratified by department.

Surprisingly, Figure 16 shows that,
for five of the six departments, the odds of admission are
approximately the same for men and women applicants. Department
A appears to differ from the others, with the odds of admission for
women approximately 2.86 ($= (313/19) / (512/89)$) times those for
men. This appearance is confirmed by the confidence rings,
which in Figure 16 are joint 99% intervals
for $\theta_c, \; c = 1, \ldots, k$.
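The odds ratios quoted for these data can be reproduced with a short Python sketch. The Department A counts are those appearing in the expression above (women: 89 admitted, 19 rejected; men: 512 admitted, 313 rejected). The aggregate counts are not given in the text; they are assumed here from the widely distributed version of the Berkeley data. The intervals are individual 99% Wald intervals on the log odds scale, an assumption made for simplicity, so they are not the joint intervals drawn as confidence rings in Figure 16.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.99):
    """Odds ratio (a/b)/(c/d) with a Wald interval built on the log scale."""
    theta = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * se_log
    return theta, theta * math.exp(-half_width), theta * math.exp(half_width)


# Department A, women relative to men (counts quoted in the text).
or_a, lo_a, hi_a = odds_ratio_ci(89, 19, 512, 313)

# Aggregate over departments, men relative to women.
# ASSUMPTION: counts taken from the standard Berkeley dataset, not the text.
or_all, lo_all, hi_all = odds_ratio_ci(1198, 1493, 557, 1278)

print(f"Dept A (F/M):    {or_a:.2f}  99% CI ({lo_a:.2f}, {hi_a:.2f})")
print(f"Aggregate (M/F): {or_all:.2f}  99% CI ({lo_all:.2f}, {hi_all:.2f})")
```

Both intervals exclude 1, but they point in opposite directions: the aggregate table favors men, while Department A favors women.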

Figure 16: Fourfold display of Berkeley admissions, by department. In each panel the confidence rings for adjacent quadrants overlap if the odds ratio for admission and sex does not differ significantly from 1. The data in each panel have been standardized as in Figure 15.

(This result, which contradicts the display for the aggregate data in Figure 15, is a classic example of Simpson's paradox. The resolution of this contradiction can be found in the large differences in admission rates among departments as we shall see later.)
