Michael Friendly

- 1.1 Stem and leaf plots
- 1.2 Letter value summaries
- 1.3 Boxplots
- 1.4 Transformations for symmetry
- 1.5 Dot charts
- 1.6 Normal probability plots

- Examine the data and decide how many digits to retain. Then, truncate the data values to the right of the last digit (faster than rounding).
- For each data value (usually) the last digit becomes the
**leaf****stem**data trunc stem | leaf 400.0 --> 400 40 | 0 86.3 --> 86 8 | 6 59.6 --> 59 5 | 9

Using the tens' digit for the leaf:data trunc stem | leaf 400.0 --> 40 4 | 0 86.3 --> 8 0 | 8 59.6 --> 5 0 | 5

- Looking over the data, make a list of the stem values, like
this (assuming the leaf represents the 10's digit):
0 | 1 | 2 | 3 | 4 | 5 | 6 |

- For each data value, write down its leaf on the appropriate
stem. Usually, an extra step is made in which the leaves
are sorted in ascending order on each stem.
0-0111111111111111112222222222222333344444455555555666666677777888 1-000001222222333445555666778889 2-0015 3-0 4-0 5- 6-5

- Finally, it is useful to add a column of
**depth**UNIT: 10 LPS: 1 64) 0-0111111111111111112222222222222333344444455555555666666677777888 37 1-000001222222333445555666778889 7 2-0015 3 3-0 2 4-0 1 5- 1 6-5

- If some values are very high or very low, it may be more
convenient to write them on separate lines, labelled
"LOW" or "HIGH" rather than including
many blank lines. (The rule used to designate high and low
values will be explained later.)
UNIT: 10 LPS: 5 18 0-011111111111111111 35 0-22222222222223333 49 0-44444455555555 12) 0-666666677777 40 0-888 37 1-000001 31 1-222222333 22 1-445555 16 1-66677 11 1-8889 7 2-001 4 2- 4 2-5 HIGH: 300 400 650

The display above uses five lines for each stem (LPS=5), so each line corresponds to a class interval of 20 in a histogram. Note that leaves of 0 & 1 go on one line, 2 & 3 on the next, etc.Two lines per stem is another possibility:

UNIT: 10 LPS: 2 41 0-01111111111111111122222222222223333444444 23) 0-55555555666666677777888 37 1-00000122222233344 20 1-5555666778889 7 2-001 4 2-5 3 3-0 2 3- 2 4-0 1 4- 1 5- 1 5- 1 6- 1 6-5

- where is the central or typical value?
- how much spread?
- single or multiple modes in the data?
- symmetry or skewness?
- unusual observations?
- coarseness of data?

PROC UNIVARIATE in SAS constructs stem and leaf displays for each variable when you use the PLOT option. If the infant mortality rates are the variable IMR in the data set NATION, these lines give (among other things) a stem and leaf display:

proc univariate data=nations plot; var imr;Figure 1 shows the stem and leaf portion of the PROC UNIVARIATE output for the infant mortality data.

VARIABLE=IMR Infant Mortality Rate STEM LEAF # BOXPLOT 6 5 1 * 6 5 5 4 4 0 1 0 3 3 0 1 0 2 6 1 | 2 002 3 | 1 555566666778899 15 | 1 000011222233344 15 +-----+ 0 55555556666666667778888889 26 *--+--* 0 11111111112222222222222233333333344444 38 +-----+ ----+----+----+----+----+----+----+--- MULTIPLY STEM.LEAF BY 10**+02Figure 1: PROC UNIVARIATE Stem and leaf display for infant mortality data

depth of median = d ( M ) = n + 1 / 2where

The median is a * letter value *, marked
"M", defined as the middle observation in a distribution.
Other letter values are the

The hinges are also defined in terms of their depth:

depth of hinges = d ( H ) = [ d ( M ) ] + 1 / 2where

Other letter values, called * eighths * (E),

d ( M ) = n + 1 / 2

d ( H ) = [ d ( M ) ] + 1 / 2

d ( E ) = [ d ( H ) ] + 1 / 2

d ( D ) = [ d ( E ) ] + 1 / 2This process can be continued until we obtain the

UNIT: 10 LPS: 5 18 0|011111111111111111 E= 16.9 35 0|22222222222223333 H=20 -> 25.7 49 0|44444455555555 12) 0|666666677777 D(M)= 101+1/2 = 51 M=60 -> 60.6 40 0|888 37 1|000001 31 1|222222333 D(H)= [51]+1/2= 26 H=120->129.4 22 1|445555 16 1|66677 D(E)= [26]+1/2= 13.5 E=(162.5+170)/2 11 1|8889 = 166.25 7 2|001 D(D)= [13.5]+1/2=7 D=200 4 2| 4 2|5 HIGH: 300 400 650

The batch can be summarized by a * letter-value
display *:

LETRVALS IMR DEPTH LOWER UPPER SPREAD MIDPT M 51 - 60.6 60.6 - 60.6 H 26 - 26.2 129.4 - 103.2 77.8 E 13H - 16.9 166.3 - 149.4 91.6 D 7 - 12.8 200 - 187.2 106.4 1 1 - 9.6 650 - 640.4 329.8

For each pair of letter values, the average of the two, called
a * midsummary *, contributes information about the
central tendency of the distribution. The difference, called a

Note that:

- skewness is reflected in a trend (increasing or decreasing) in the midsummaries.
- if the midsummaries are increasingly from the middle toward the tails, the distribution is positively skewed, as they appear for the infant mortality data. Decreasing midsummaries imply negative skewness.
- if the distribution is roughly symmetric, then the
**spreads***N ( 0 , 1 )*with*mu = 0*,*s*, the expected spread values are:^{2}= 1letter normal spread ------ ------------- H 1.349 E 2.301 D 3.068 C 3.726

- the center
- the spread of the middle of the data (interquartile range)
- the behavior of the tails
- outliers

- Lay off a scale that accomodates the extremes of the batch.
- Draw a box from the lower hinge to the upper hinge, with a line
within the box marking the location of the median. (The
length of the box = H-spr = IQR =
*H*)._{U}- H_{L} - the
**inner fences**H

_{L}- 1.5 x H-sprH

_{U}+ 1.5 x H-spr **Outer fences***3 x H-spr*beyond the hinges. Note, that for a Gaussian distribution,*1.5 H-spr approx 2 s*, and*3 H-spr approx 4 s*.- Any data value beyond the inner fences is termed an
**outside value****far outside** - On either side of the central box, draw a
**whisker line****adjacent value**

For the infant mortality data, the calculation of fences is (from the letter value display)

inner fences: 25.7 - 1.5 * 103.7 = -129.85 129.4 + 1.5 * 103.7 = 284.95 (outside: 300, 400) outer fences: 25.7 - 3 * 103.7 = -285.4 129.4 + 3 * 103.7 = 440.5 (far out: 650-Saudi Arabia) adjacent: 259Here is the boxplot, showing two outside, and one far-out values:

BOXPLOT IMR 9.60 223.07 436.53 650.00 +-------------------+-------------------+-------------------+ +--------+ <-| | |-----------> o o X +--------+ LIBYA AFGHAN. S. ARABIANote that the skewness of the data is apparent in (a) the location of the median relative to the hinges, (b) the relative lengths of the whisker lines, and (c) the outside values on the high end.

The PLOT option of PROC UNIVARIATE also gives a small boxplot. When you have a grouping variable, you can produce full-page, side-by-side boxplots for each group on the printer with PROC UNIVARIATE.

The countries in the NATIONS data set are classified by REGION. Adding the statement BY REGION to the previous example gives side-by-side boxplots. (The data must first be sorted by REGION.)

proc univariate data=nations plot; var imr; by region;

Univariate Procedure Schematic Plots Variable=IMR Infant Mortality Rate | 700 + | | * | 600 + | | | 500 + | | | 400 + 0 | | | 300 + 0 | | | | | 200 + | | | * +-----+ | | *--+--* | | | | +-----+ 100 + +-----+ | + | | +-----+ | | | | *--+--* | 0 *-----* | | *--+--* +-----+ 0 + +-----+ | ------------+-----------+-----------+-----------+----------- REGION Americas Africa Europe Asia/OceFigure 2: Boxplots for IMR by region. The side-by-side display allows visually comparing the groups in central value, spread, shape, and unusual values.

Figure 3 shows boxplots of Infant mortality rate by region for the NATIONS data. This plot is drawn by the statements below. The option I=BOXT draws horizontal lines at the tips of the whisker lines.

title h=1.5 'Boxplot with I=BOX'; symbol i=boxt mode=include /* V6.06 */ v=+ color=black h=1.2; proc gplot data=nations ; plot imr * region / frame vaxis=axis1 haxis=axis2 vminor=1 hminor=0; axis1 label=(h=1.5 a=90 r=0) value=(h=1.3); axis2 label=(h=1.5) value=(h=1.3) offset=(8 pct);Figure 3: Simple boxplot with I=BOX option

The notches to give approximate 95% comparison intervals are calculated as:

Median ± 1.58 ( H-spr / sqrt n )

- one or more CLASS variables, character or numeric.
- lines connecting the group medians.
- notches showing 95% confidence intervals for difference in medians.
- outside observations can be labelled with the value of an ID variable.
- box widths can be constant for all groups, or made proportional to a function of the sample size in each group.
- user-supplied annotations can be added to the graph.
- produces an output data set containing boxplot statistics.

/* Description of Parameters: */ %macro BOXPLOT( /* ------------------------- */ data=_LAST_, /* Input dataset */ class=, /* Grouping variable(s) */ var=, /* Ordinate variable */ id=, /* Observation ID variable */ width=.5, /* Box width as proportion of maximum */ notch=0, /* =0|1, 1=draw notched boxes */ connect=0, /* =0 or line style to connect medians*/ f=0.5, /* Notch depth, fraction of halfwidth */ fn=1, /* Box width proportional to &FN */ varfmt=, /* Format for ordinate variable */ classfmt=, /* Format for class variable(s) */ varlab=, /* Label for ordinate variable */ classlab=, /* Label for class variable(s) */ yorder=, /* Tick marks, range for ordinate */ anno=, /* Addition to ANNOTATE set */ out=boxstat, /* Output data set: quartiles, etc. */ name=BOXPLOT /* Name for graphic catalog entry */ );The figure below shows notched boxplots for the IMR data produced using the BOXPLOT macro. The width of each box is made proportional to

%include boxplot; title1 h=1.5 'Infant Mortality Rate by Region'; %boxplot(data=nations, class=region, var=imr,id=nation, notch=1, fn=sqrt(n), classfmt=region., varlab=Infant Mortality Rate, classlab=Region);

- making a distribution more symmetric
- equalizing variability (spreads) before comparing level across several batches
- making the relationship between two variables linear.

A particularly useful family of transformations is what Tukey
calls the * ladder of powers *, where a variable

tAlthough the family of transformations given by (1) is defined for all powers_{p}( x ) = left { lpile x^{p}-1 over p , above log_{10}x , lpile p != 0 above p = 0 right "" . (1)

+---------------------------------------------+ | Power Transformation Re-expression | |---------------------------------------------| | 3 Cube xThe important properties of this family of transformations are:^{3}| | | | 2 Square x^{2}| | | | 1 NONE (Raw) x | | | | 1/2 Square root sqrt x | | | | 0 Log log10 x | | | | -1/2 Reciprocal root -1/sqrt x | | | | -1 Reciprocal -1 / x | +---------------------------------------------+

- They preserve the order of data values. Larger data values on the original scale will be larger on the transformed scale. (That's why negative powers have their sign reversed.)
- They change the spacing of the data values. Powers
*p < 1*, such as*sqrt x*and*log x*compress values in the upper tail of the distribution relative to low values; powers*p > 1*, such as*x*, have the opposite effect, expanding the spacing of values in the upper end relative to the lower end.^{2}log X X X**2 0 1 1 1 { } 9 } 99 1 10 100 1 { } 90 } 9900 2 100 10000 1 { }900 } 990000 3 1000 1000000

- The effect on the shape of the distribution changes
systematically with
*p*. If*sqrt x*pulls in the upper tail,*log x*will do so more strongly, and negative powers will be stronger still.

Since the power transformations preserve the order of the data values, we can find a suitable transformation by looking at the letter values and their midsummaries transformed to various powers. In a symmetric distribution, the midsummaries do not systematically change as we go out toward the extremes.

LETRVALS IMR * .5 DEPTH LOWER UPPER SPREAD MIDPT M 51 - 7.8 7.8 - 7.8 H 26 - 5.1 11.4 - 6.3 8.2 E 13H - 4.1 12.9 - 8.8 8.5 D 7 - 3.6 14.1 - 10.6 8.9 1 1 - 3.1 25.5 - 22.4 14.3 LETRVALS 10zIMR DEPTH LOWER UPPER SPREAD MIDPT M 51 - 1.8 1.8 - 1.8 H 26 - 1.4 2.1 - .7 1.8 E 13H - 1.2 2.2 - 1 1.7 D 7 - 1.1 2.3 - 1.2 1.7 1 1 - 1 2.8 - 1.8 1.9The boxplots for

BOXPLOT IMR * .5 3.10 10.56 18.03 25.50 +-------------------+-------------------+-------------------+ +----------------+ ----| | |---------------------- +----------------+ BOXPLOT 10zIMR .98 1.59 2.20 2.81 +-------------------+-------------------+-------------------+ +----------------------+ -------------| | |---------------------- +----------------------+

Graphical methods can also be used to find a power
transformation for symmetry. The simplest, useful plots are based
on the idea that in a symmetric distribution, the distances of
points at the lower end to the median should match the distances of
corresponding points in the upper end to the median. In general,
if the distribution were perfectly symmetric the observations at
depth * i * should satisfy,

Lower distance to median = Upper distance to median

fwd 1.25i M - x_{(i)}= x_{(n+1-i)}- M

for

mid identical ( xThis plot is based on the idea that in a symmetric distribution, each mid value should equal the median,_{(n+1-i)}+ x_{(i)}) / 2 vs. x_{(n+1-i)}- x_{(i)}identical spread .

( x_{(n+1-i)}+ x_{(i)}) / 2 = M .

Since the spreads increase as we go from the center out to the tails, this plot depicts how the pattern of mid values changes from the middle of the distribution to the extremes. Because the plot is untilted (slope = 0) when the distribution is symmetric, expansion of the vertical scale allows us to see systematic departures from flatness far more clearly.

xas the vertical coordinate, against a squared measure of spread,_{(i)}+ x_{(n+1-i)}over 2 - M

Loweras the vertical coordinate. If this graph is approximately linear, with a slope,^{2}+ Upper^{2}over 4 M = ( M - x_{(i)})^{2}+ ( x_{(n+1-i)}- M )^{2}over 4 M

proc sort data=analyze out=sortup; by X; proc sort data=analyze out=sortdn; by descending X; data symplot; merge sortup(rename=(X=frombot)) /* frombot = x(i) */ sortdn(rename=(X=fromtop)); /* fromtop = x(n+1-i) */ mid = (fromtop + frombot) / 2; spread = fromtop - frombot;These plots are all implemented in the SYMPLOT macro, which takes the following parameters:

%macro SYMPLOT( data=_LAST_, /* data to be analyzed */ var=, /* variable to be plotted */ plot=MIDSPR, /* Type of plot(s): NONE, or any of */ /* UPLO, MIDSPR, MIDZSQ, or POWER */ trim=0, /* # or % of extreme obs. to be trimmed */ symbol=dot, /* plotting symbol */ out=symplot, /* output data set */ name=SYMPLOT); /* name for graphic catalog entry */For example, the symmetry power plot of the IMR data is produced by these statements:

%include symplot; title h=1.6 'IMR data: Symmetry transformation plot'; %symplot(data=nations,var=IMR,plot=POWER,trim=5 pct);Trimming 5% of the observations in each tail ensures that extreme observations do not influence the choice of the power transformation.

The * dot chart *, developed by Cleveland (1984)
is quite effective for this purpose, as long as the number of
observations is not too large. If the observations are sorted by
the response variable, the dot chart portrays the

Simple dotplots can be produced on a printer with the TIMEPLOT procedure. The observations are arranged on the vertical axis in their order in the data set, so the data set should be sorted in descending order of the response variable. For example, the statements below produce the printer plot shown in Figure 4. The options REF=0 and JOINREF produce the dashed line connecting the plotting symbol to the vertical axis.

%include data(duncan); proc sort data=duncan; by descending prestige; proc timeplot data=duncan; plot prestige = '*' / ref=0 joinref; id job;

Duncan Occupational Prestige Data JOB Prestige min max 0 97 *---------------------------------------------* Physician 97 ||-------------------------------------------*| Professor 93 ||------------------------------------------* | Banker 92 ||------------------------------------------* | Architect 90 ||-----------------------------------------* | Chemist 90 ||-----------------------------------------* | Dentist 90 ||-----------------------------------------* | Lawyer 89 ||----------------------------------------* | Civil Eng. 88 ||----------------------------------------* | Minister 87 ||---------------------------------------* | Pilot 83 ||--------------------------------------* | Accountant 82 ||-------------------------------------* | Factory Owner 81 ||-------------------------------------* | Author 76 ||----------------------------------* | Contractor 76 ||----------------------------------* | PS Teacher 73 ||---------------------------------* | RR Engineer 67 ||------------------------------* | Welfare Wrkr. 59 ||--------------------------* | Undertaker 57 ||-------------------------* | Machinist 57 ||-------------------------* | Electrician 53 ||------------------------* | Reporter 52 ||-----------------------* | Store Manager 45 ||--------------------* | Insur. Agent 41 ||------------------* | Policeman 41 ||------------------* | Bookkeeper 39 ||-----------------* | RR Conductor 38 ||-----------------* | Mail carrier 34 ||---------------* | Carpenter 33 ||--------------* | Plumber 29 ||------------* | Auto repair 26 ||-----------* | Machine opr. 24 ||----------* | Barber 20 ||--------* | Motorman 19 ||--------* | Store clerk 16 ||------* | Cook 16 ||------* | Coal miner 15 ||------* | Truck driver 13 ||-----* | Watchman 11 ||----* | Gas stn attn 10 ||----* | Taxi driver 10 ||----* | Waiter 10 ||----* | Janitor 8 ||---* | Bartender 7 ||--* | Soda clerk 6 ||--* | Shoe-shiner 3 ||* | *---------------------------------------------*Figure 4: Dot chart of Occupational Prestige from PROC TIMEPLOT

%macro dotplot( data=_LAST_, /* input data set */ xvar=, /* horizontal variable (response) */ xorder=, /* plotting range of response */ xref=, /* reference lines for response variable */ yvar=, /* vertical variable (observation label) */ ysortby=&xvar, /* how to sort observations */ ylabel=, /* label for y variable */ group=, /* vertical grouping variable */ gpfmt=, /* format for printing group variable */ /* value (include the . at the end) */ connect=DOT, /* draw lines to ZERO, DOT, AXIS, or NONE */ dline=2, /* style of horizontal lines */ dcolor=BLACK, /* color of horizontal lines */ errbar=, /* variable giving length of error bar */ /* for each observation */ name=DOTPLOT); /* Name for graphic catalog entry */The high-resolution dot chart of occupational prestige is produced by these statements:

%include DUNCAN ; * get the data; %include DOTPLOT; * get the macro; %dotplot( data=duncan, yvar=job, xvar=prestige, connect=DOT);

- they use arbitrary bins (class intervals)
- they lose resolution in the tails (where differences are likely)
- the standard for comparison is a curve

* Quantile comparison plots * or

G HAT ( x ) = number ( X <= x ) over n = estimated Prob ( X <= x )with the cumulative probability distribution (CDF) of the reference distribution,

F ( z ) = Prob ( Z <= z )

A general, graphical approach is to plot the ordered data
values, * x _{(i)} *, against the expected values of those
observations if they followed the particular reference
distribution. When the data follows the reference distribution,
the points in such a plot will follow a straight line. Departures
from the line shows

- Arrange the observations in ascending order, giving
*x sub (1) , x*_{(2)}, ... , x_{(i)}, ... , x_{(n)} - For each
*i*, calculate the proportion of observations*p*below_{i}*x*,_{(i)}p

(Other definitions are used also; this one splits each observation, assigning half of it above, and half below_{i}= i - 1/2 over n*x*)_{(i)} - Find the corresponding quantile of the reference distribution,
z

_{i}= F^{-1}( p_{i}) - Plot the
*z*against the ordered observation_{i}*x*on the vertical axis._{(i)}

- If the observations come from distribution
*F*(the Gaussian, say), then the plot will be approximately linear with an intercept = 0 and slope = 1. - If the distributions are identical except for location (mean), then the intercept will be non-zero; if they are identical except for spread, the slope will be different from 1.
- A
**comparison line****robust**expected = Median ( x ) + ( H-spr / 1.349 ) z

_{i}

- Postive (negative) skewed: Both tails above (below) the comparison line
- Heavy tailed: Lower tail below, upper tail above the comparison line

One way to do this is to calculate an estimated standard error,
* s Hat ( z _{i} ) *, of the ordinate

s Hat ( zwhere_{i}) = s hat over f ( z_{i}) sqrt p_{i}( 1 - p_{i}) over n (2)

To produce your own plot with SAS/GRAPH software, you can sort
the data and calculate the normal quantiles in a DATA step with the
PROBIT function as shown below. This is the most general method,
since you can easily adjust the calculation of * p _{i} *,
use a different

**- Normal Q-Q plot using PROBIT --; proc sort data=nations; by imr; proc univariate noprint data=nations; var imr; output out=stats n=nobs median=median qrange=hspr mean=mean std=std; data quantile; set nations; if _n_=1 then set stats; if imr ^= . ; *-- remove missing data!; i + 1; p = (i - .5) / nobs; z = probit(p); normal = mean + z * std; robust = median + z * hspr/1.349;

proc gplot data=quantile; plot imr * z = 1 normal* z = 2 robust* z = 3 / overlay frame; symbol1 i=none h=1.1 v=- color=black; symbol2 i=join l=20 v=none color=blue; symbol3 i=join l=1 v=none color=red;

%macro nqplot ( data=_LAST_, /* input data set */ var=x, /* variable to be plotted */ out=nqplot, /* output data set */ mu=MEDIAN, /* est of mean of normal distribution: */ /* MEAN, MEDIAN or literal value */ s=HSPR, /* est of std deviation of normal: */ /* STD, HSPR, or literal value */ stderr=YES, /* plot std errors around curves? */ detrend=YES, /* plot detrended version? */ lh=1.5, /* height for axis labels */ anno=, /* name of input annotate data set */ name=NQPLOT, /* name of graphic catalog entries */ gout=); /* name of graphic catalog */Only the name of the data set and the variable to be plotted need be supplied. The normal Q-Q plot of the IMR data would be produced with this statement:

%nqplot(data=nations,var=imr);

Previous section Next section.