Michael Friendly

The table below shows two such data sets:

- von Bortkiewicz's (1898) data on death of soldiers in the Prussian army from kicks by horses and mules. The data pertain to 10 army corps, each observed over 20 years. In 109 corps-years, no deaths occurred; 65 corps-years had one death, etc. (Figure 1)
- Mosteller & Wallace's (1964) data on the occurrence of the word *may* in 262 blocks of text (each about 200 words long) from issues of the Federalist Papers known to be written by James Madison. In 156 blocks, the word *may* did not occur; it occurred once in 63 blocks, etc. (Figure 2)

| $k$ | Deaths by horsekick, $n_k$ | Occurrences of 'may', $n_k$ |
|-----|----------------------------|-----------------------------|
| 0   | 109                        | 156                         |
| 1   | 65                         | 63                          |
| 2   | 22                         | 29                          |
| 3   | 3                          | 8                           |
| 4   | 1                          | 4                           |
| 5   |                            | 1                           |
| 6   |                            | 1                           |
| $N$ | 200                        | 262                         |

Figure 1: von Bortkiewicz's data

Figure 2: Mosteller & Wallace data

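$$\chi^2 = \sum_{k=1}^{K} \frac{(n_k - N \hat{p}_k)^2}{N \hat{p}_k} \sim \chi^2(K-1)$$

where $n_k$ is the observed frequency in cell $k$, $N = \sum_k n_k$ is the total count, and $\hat{p}_k$ is the estimated probability of cell $k$ under the hypothesized distribution.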

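For example, for the Poisson distribution, the probability function is

$$\Pr(X = k) = p_k = \frac{e^{-\lambda} \lambda^k}{k!} \tag{1}$$

The maximum likelihood estimator of the parameter $\lambda$ is the sample mean,

$$\hat{\lambda} = \frac{\sum_k k \, n_k}{N}$$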

For the horsekick data, the mean is 122/200 = 0.610, and
calculation of Poisson probabilities (PHAT), expected frequencies,
and contributions to $\chi^2$ are shown below.

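| k     | nk  | p     | phat    | exp     | chisq   |
|-------|-----|-------|---------|---------|---------|
| 0     | 109 | 0.545 | 0.54335 | 108.670 | 0.00100 |
| 1     | 65  | 0.325 | 0.33144 | 66.289  | 0.02506 |
| 2     | 22  | 0.110 | 0.10109 | 20.218  | 0.15705 |
| 3     | 3   | 0.015 | 0.02056 | 4.111   | 0.30025 |
| 4     | 1   | 0.005 | 0.00313 | 0.627   | 0.22201 |
| Total | 200 |       |         | 199.915 | 0.70537 |

In this case, $\chi^2 = 0.705 \sim \chi^2(4)$, so there is no evidence of lack of fit.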
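The calculation in this table can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `poisson_gof` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
import math

# Horsekick frequencies from the table: k deaths -> number of corps-years.
counts = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

def poisson_gof(counts):
    """Return (lambda_hat, rows, chisq) for a Poisson fit to grouped counts."""
    N = sum(counts.values())
    # Maximum likelihood estimate of lambda: the sample mean.
    lam = sum(k * n for k, n in counts.items()) / N
    rows = []
    chisq = 0.0
    for k, n in sorted(counts.items()):
        phat = math.exp(-lam) * lam**k / math.factorial(k)  # Poisson probability
        expected = N * phat                                  # expected frequency
        contrib = (n - expected) ** 2 / expected             # chi-square contribution
        rows.append((k, n, n / N, phat, expected, contrib))
        chisq += contrib
    return lam, rows, chisq

lam, rows, chisq = poisson_gof(counts)
for k, n, p, phat, expected, contrib in rows:
    print(f"{k} {n:4d} {p:.3f} {phat:.5f} {expected:8.3f} {contrib:.5f}")
print(f"lambda_hat = {lam:.3f}, chi-square = {chisq:.5f}")
```

The printed columns correspond to k, nk, p, phat, exp, and chisq in the table.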

(von Bortkiewicz's data is collapsed over 20 years and 10 army corps,
and the Poisson model assumes that the probability of a death is constant
for all years and corps. This assumption can be tested in the raw data,
by fitting a Poisson model, `deaths = year corps`.
The effects of both year and corps are significant, indicating that
the homogeneity assumption is not met.)

Assume that, for some fixed $\lambda$, each observed frequency
$n_k$ equals the expected frequency, $m_k = N p_k$. Then, setting
$n_k = N p_k = N e^{-\lambda} \lambda^k / k!$ and taking logs of both
sides gives

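$$\log(n_k) = \log N - \lambda + k \log \lambda - \log k!$$

which can be rearranged to

$$\log\left(\frac{k! \, n_k}{N}\right) = -\lambda + (\log \lambda) \, k \tag{2}$$

The left side of (2) is called the count metameter, $\phi(n_k)$. If the distribution is Poisson, a plot of $\phi(n_k)$ against $k$ should therefore give a straight line with

- intercept = $-\lambda$
- slope = $\log \lambda$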

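- **Resistance**: a single discrepant value of $n_k$ affects only the point at value $k$.
- **Comparison standard**: an approximate confidence interval can be found for each point.
- **Influence**: the effect of each point on the estimate of the parameter of the distribution ($\lambda$ in the Poisson).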

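| k | nk  | Y = $\phi(n_k)$ | CI center | CI width | Confidence Int lower | Confidence Int upper |
|---|-----|-----------------|-----------|----------|----------------------|----------------------|
| 0 | 109 | -0.607          | -0.617    | 0.130    | -0.748               | -0.487               |
| 1 | 65  | -1.124          | -1.138    | 0.207    | -1.345               | -0.931               |
| 2 | 22  | -1.514          | -1.549    | 0.417    | -1.966               | -1.132               |
| 3 | 3   | -2.408          | -2.666    | 1.318    | -3.984               | -1.348               |
| 4 | 1   | -2.120          | -3.120    | 2.689    | -5.809               | -0.432               |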

Figure 3: Poissonness plots for two discrete distributions. The Horse Kicks data fits the Poisson distribution reasonably well, but the May data does not.
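The count metameter is straightforward to compute directly. The following Python sketch evaluates $\phi(n_k) = \log(k! \, n_k / N)$ for the horsekick data and reads $\lambda$ back off a fitted line. The function names `poissonness` and `ols_line` are illustrative, and the unweighted least-squares fit is an assumption, since the text does not say how the line in the plot is fitted.

```python
import math

def poissonness(counts):
    """Return [(k, phi_k)] with phi_k = log(k! * n_k / N), the count metameter."""
    N = sum(counts.values())
    return [(k, math.log(math.factorial(k) * n / N))
            for k, n in sorted(counts.items()) if n > 0]

def ols_line(points):
    """Unweighted least-squares (intercept, slope) through (x, y) points."""
    m = len(points)
    xbar = sum(x for x, _ in points) / m
    ybar = sum(y for _, y in points) / m
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

horsekicks = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}
points = poissonness(horsekicks)
intercept, slope = ols_line(points)

# Under the Poisson model: slope = log(lambda), intercept = -lambda.
lam_from_slope = math.exp(slope)
lam_from_intercept = -intercept

for k, phi in points:
    print(f"{k} {phi:7.3f}")
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The metameter values match the Y column of the table. With this unweighted fit, the slope is about -0.43 and the intercept about -0.69, giving rough estimates of $\lambda$ near 0.65 and 0.69, in the neighborhood of the sample mean 0.61.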

- A different formula for the count metameter, $\phi(n_k)$, is required for each discrete distribution.
- Systematic deviation from a linear relationship does not indicate which distribution provides a better fit.

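A linear relationship of the form

$$\frac{k \, p_k}{p_{k-1}} = a + b \, k \tag{3}$$

holds for each of the Poisson, binomial, negative binomial, and logarithmic series distributions. The slope, $b$, in (3) is zero for the Poisson, negative for the binomial, and positive for the negative binomial and logarithmic series distributions; the latter two are distinguished by the sign of the intercept, $a$.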

Thus, a plot of $k \, n_k / n_{k-1}$ against $k$, if linear, is suggested as a means to determine which
distribution to apply.
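To see why a flat plot points to the Poisson, substitute the Poisson probabilities $p_k = e^{-\lambda} \lambda^k / k!$ for the frequencies in the plotted ratio:

$$\frac{k \, p_k}{p_{k-1}} = k \cdot \frac{e^{-\lambda} \lambda^k / k!}{e^{-\lambda} \lambda^{k-1} / (k-1)!} = k \cdot \frac{\lambda}{k} = \lambda$$

The ratio does not depend on $k$, so the line has slope 0 and intercept $\lambda$.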

| Slope (b) | Intercept (a) | Distribution (parameter) | Parameter estimate                |
|-----------|---------------|--------------------------|-----------------------------------|
| 0         | +             | Poisson (lambda)         | lambda = a                        |
| -         | +             | Binomial (n, p)          | p = b/(b-1)                       |
| +         | +             | Neg. binom (n, p)        | p = 1 - b                         |
| +         | -             | Log. series (theta)      | theta = b, theta = - a            |
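The sign rules in this table translate into a small lookup. The sketch below is one hypothetical way to code it; the function name `ord_diagnose` and the cutoff `tol` for treating a slope as zero are assumptions, not part of the table.

```python
def ord_diagnose(slope, intercept, tol=0.1):
    """Map an Ord-plot line (slope b, intercept a) to a distribution.

    `tol` is an assumed cutoff for treating the slope as zero.
    Returns (distribution, parameter name, parameter estimate).
    """
    a, b = intercept, slope
    if abs(b) <= tol:
        if a > 0:
            return ("Poisson", "lambda", a)
    elif b < 0 and a > 0:
        return ("binomial", "p", b / (b - 1))
    elif b > 0 and a > 0:
        return ("negative binomial", "p", 1 - b)
    elif b > 0 and a < 0:
        return ("logarithmic series", "theta", b)
    # Sign pattern not covered by the table.
    return ("undetermined", None, None)

print(ord_diagnose(-0.034, 0.656))  # slope near zero, positive intercept
```

A pure sign rule like this is only a first pass: when the intercept is close to zero, its sign is unreliable and the choice between the negative binomial and the logarithmic series needs a closer look.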

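Ord Plot: Deaths by Horsekicks

| $k$ | $n_k$ | $n_{k-1}$ | $w_k$   | $y$    |
|-----|-------|-----------|---------|--------|
| 0   | 109   | .         | 10.3923 | .      |
| 1   | 65    | 109       | 8.0000  | 0.5963 |
| 2   | 22    | 65        | 4.5826  | 0.6769 |
| 3   | 3     | 22        | 1.4142  | 0.4091 |
| 4   | 1     | 3         | 0.0000  | 1.3333 |

Weighted LS: slope = -0.034, intercept = 0.656

For the word frequency data, the slope is positive, so either the negative binomial or the logarithmic series is possible. The intercept is essentially zero, which is ambiguous. However, the logarithmic series requires $b \approx -a$, which does not hold here, so the negative binomial is the more plausible choice.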

Instances of 'may' in Federalist papers

| $k$ | $n_k$ | $n_{k-1}$ | $w_k$   | $y$    |
|-----|-------|-----------|---------|--------|
| 0   | 156   | .         | 12.4499 | .      |
| 1   | 63    | 156       | 7.8740  | 0.4038 |
| 2   | 29    | 63        | 5.2915  | 0.9206 |
| 3   | 8     | 29        | 2.6458  | 0.8276 |
| 4   | 4     | 8         | 1.7321  | 2.0000 |
| 5   | 1     | 4         | 0.0000  | 1.2500 |
| 6   | 1     | 1         | 0.0000  | 6.0000 |

Weighted LS: slope = 0.424, intercept = -0.023

Plots of data fitting several different discrete distributions are shown in Figure 4.
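The two Ord tables can be reproduced with a short Python sketch. Two details are inferred rather than stated in the text: the weights are taken as $w_k = \sqrt{n_k - 1}$, which matches the tabulated $w_k$ column, and the line is fitted by minimizing $\sum_k w_k (y_k - a - b k)^2$, which reproduces the printed slopes and intercepts. The function names are illustrative.

```python
import math

def ord_points(counts):
    """Return [(k, y_k, w_k)] with y_k = k*n_k/n_{k-1}, w_k = sqrt(n_k - 1)."""
    pts = []
    for k in sorted(counts):
        prev = counts.get(k - 1, 0)
        if prev > 0:  # y_k is undefined for the first k
            y = k * counts[k] / prev
            w = math.sqrt(max(counts[k] - 1, 0))  # weight inferred from the table
            pts.append((k, y, w))
    return pts

def wls_line(pts):
    """Weighted least squares (intercept, slope), weights applied linearly."""
    sw = sum(w for _, _, w in pts)
    xbar = sum(w * k for k, _, w in pts) / sw
    ybar = sum(w * y for _, y, w in pts) / sw
    sxx = sum(w * (k - xbar) ** 2 for k, _, w in pts)
    sxy = sum(w * (k - xbar) * (y - ybar) for k, y, w in pts)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

horsekicks = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}
may = {0: 156, 1: 63, 2: 29, 3: 8, 4: 4, 5: 1, 6: 1}

for name, data in [("horsekicks", horsekicks), ("may", may)]:
    a, b = wls_line(ord_points(data))
    print(f"{name}: slope = {b:.3f}, intercept = {a:.3f}")
```

Points with $n_k = 1$ receive zero weight, so the erratic ratios in the sparse upper tail do not affect the fitted line.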

- The Ord plot lacks resistance, since a single discrepant frequency affects the points for $k$ and $k + 1$.
- The sampling variance of $k \, n_k / n_{k-1}$ fluctuates widely. The use of weights $w_k$ helps, but is purely a heuristic device.

Figure 4: Ord plots for four discrete distributions. The OLS line is drawn in black, the WLS line in red. The slope and intercept of the WLS line are used to automatically diagnose the type of the distribution.
