Exploratory and Graphical Methods of Data Analysis
Michael Friendly

Part 1: Examining Univariate Distributions

1.1 Stem and leaf plots

The stem and leaf display is a form of histogram that retains the actual data values while showing the shape of their distribution. Tukey (1977) invented the stem and leaf display as a quick means of writing down a set of data values, producing a data display from which summary values can be easily calculated.

How to make a stem and leaf display

Examine the data and decide how many digits to retain. Then, truncate the data values to the right of the last digit (faster than rounding).
For each data value (usually) the last digit becomes the leaf and all but the last digit become the stem . For the Infant Mortality data, using the units' place for the leaf, some of the values look like this:
```
 data      trunc    stem | leaf
400.0   -->  400     40  | 0
 86.3   -->   86      8  | 6
 59.6   -->   59      5  | 9
```
Using the tens' digit for the leaf:
```
 data      trunc    stem | leaf
400.0   -->  40       4  | 0
 86.3   -->   8       0  | 8
 59.6   -->   5       0  | 5
```
Looking over the data, make a list of the stem values, like this (assuming the leaf represents the 10's digit):
```
    0 |
    1 |
    2 |
    3 |
    4 |
    5 |
    6 |
```
For each data value, write down its leaf on the appropriate stem. Usually, an extra step is made in which the leaves are sorted in ascending order on each stem.
```
 0-0111111111111111112222222222222333344444455555555666666677777888
 1-000001222222333445555666778889
 2-0015
 3-0
 4-0
 5-
 6-5
```
Finally, it is useful to add a column of depth values, which indicate how many observations are on each stem and on stems toward the closer end of the batch. The depth values make it easy to find the median, interquartile range and other summary values just by counting in from the closer end of the batch, as will be shown below.
```
 UNIT: 10    LPS: 1

  64) 0-0111111111111111112222222222222333344444455555555666666677777888
  37  1-000001222222333445555666778889
   7  2-0015
   3  3-0
   2  4-0
   1  5-
   1  6-5
```
If some values are very high or very low, it may be more convenient to write them on separate lines, labelled "LOW" or "HIGH" rather than including many blank lines. (The rule used to designate high and low values will be explained later.)
```
 UNIT: 10    LPS: 5

  18  0-011111111111111111
  35  0-22222222222223333
  49  0-44444455555555
  12) 0-666666677777
  40  0-888
  37  1-000001
  31  1-222222333
  22  1-445555
  16  1-66677
  11  1-8889
   7  2-001
   4  2-
   4  2-5

 HIGH: 300 400 650
```
The display above uses five lines for each stem (LPS=5), so each line corresponds to a class interval of 20 in a histogram. Note that leaves of 0 & 1 go on one line, 2 & 3 on the next, etc.
Two lines per stem is another possibility:
```
 UNIT: 10    LPS: 2

  41  0-01111111111111111122222222222223333444444
  23) 0-55555555666666677777888
  37  1-00000122222233344
  20  1-5555666778889
   7  2-001
   4  2-5
   3  3-0
   2  3-
   2  4-0
   1  4-
   1  5-
   1  5-
   1  6-
   1  6-5
```

Things to look for in stem and leaf displays

where is the central or typical value?
how much spread?
single or multiple modes in the data?
symmetry or skewness?
unusual observations?
coarseness of data?

Constructing stem and leaf displays with SAS

PROC UNIVARIATE in SAS constructs stem and leaf displays for each variable when you use the PLOT option. If the infant mortality rates are the variable IMR in the data set NATION, these lines give (among other things) a stem and leaf display:

proc univariate data=nations plot;
   var imr;

Figure 1 shows the stem and leaf portion of the PROC UNIVARIATE output for the infant mortality data.




 VARIABLE=IMR            Infant Mortality Rate

 STEM LEAF                                       #                       BOXPLOT
    6 5                                          1                          *
    6
    5
    5
    4
    4 0                                          1                          0
    3
    3 0                                          1                          0
    2 6                                          1                          |
    2 002                                        3                          |
    1 555566666778899                           15                          |
    1 000011222233344                           15                       +-----+
    0 55555556666666667778888889                26                       *--+--*
    0 11111111112222222222222233333333344444    38                       +-----+
      ----+----+----+----+----+----+----+---
   MULTIPLY STEM.LEAF BY 10**+02

Figure 1: PROC UNIVARIATE Stem and leaf display for infant mortality data

1.2 Letter value summaries

The median, the middle value in the batch, can be defined as the observation whose depth is

depth of median = d ( M ) = n + 1 / 2

where n is the number of observations. If n is even, so d ( M ) is a half integer (e.g., 61.5) we average the two adjacent values (at depth 61 and 62) to get the median, M .

The median is a letter value , marked "M", defined as the middle observation in a distribution. Other letter values are the hinges , marked "H", which are located at the middle of each of the upper and lower halves of the ordered values. (They are essentially the same as the quartiles , but the definitions differ slightly.)

The hinges are also defined in terms of their depth:

depth of hinges = d ( H ) = [ d ( M ) ] + 1 / 2

where [ ] means to drop the fractional part, if any.

Other letter values, called eighths (E), sixteenths (D), etc. are defined similarly as the observation half way, in number of observations, between the previous letter value and the nearer extreme.

d ( M ) = n + 1 / 2

d ( H ) = [ d ( M ) ] + 1 / 2

d ( E ) = [ d ( H ) ] + 1 / 2

d ( D ) = [ d ( E ) ] + 1 / 2

This process can be continued until we obtain the extremes at depth 1. The calculations can be done directly from the stem and leaf display, but you should take the original data values (before truncation) as the letter values.


 UNIT: 10    LPS: 5

  18  0|011111111111111111                            E=      16.9
  35  0|22222222222223333                             H=20 -> 25.7
  49  0|44444455555555
  12) 0|666666677777            D(M)= 101+1/2 = 51    M=60 -> 60.6
  40  0|888
  37  1|000001
  31  1|222222333               D(H)= [51]+1/2= 26    H=120->129.4
  22  1|445555
  16  1|66677                   D(E)= [26]+1/2= 13.5  E=(162.5+170)/2
  11  1|8889                                           = 166.25
   7  2|001                     D(D)= [13.5]+1/2=7    D=200
   4  2|
   4  2|5

 HIGH: 300 400 650

The batch can be summarized by a letter-value display :


       LETRVALS IMR
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -   60.6    60.6 -            60.6
 H 26  -   26.2   129.4 -   103.2    77.8
 E 13H -   16.9   166.3 -   149.4    91.6
 D  7  -   12.8   200   -   187.2   106.4
 1  1  -    9.6   650   -   640.4   329.8

For each pair of letter values, the average of the two, called a midsummary , contributes information about the central tendency of the distribution. The difference, called a spread , indicates the variability of the data.

Note that:

skewness is reflected in a trend (increasing or decreasing) in the midsummaries.
if the midsummaries are increasingly from the middle toward the tails, the distribution is positively skewed, as they appear for the infant mortality data. Decreasing midsummaries imply negative skewness.
if the distribution is roughly symmetric, then the spreads can be used to compare the shape to that of a normal distribution. For a standard, unit-normal distribution, N ( 0 , 1 ) with mu = 0 , s ² = 1 , the expected spread values are:
```
   letter    normal spread
   ------    -------------
     H          1.349
     E          2.301
     D          3.068
     C          3.726
```

1.3 Boxplots

Boxplots provide a schematic graphical summary of important features of a distribution, including:

the center
the spread of the middle of the data (interquartile range)
the behavior of the tails
outliers

The boxplot can be easily drawn from the information in the letter-value display:

Lay off a scale that accomodates the extremes of the batch.
Draw a box from the lower hinge to the upper hinge, with a line within the box marking the location of the median. (The length of the box = H-spr = IQR = H _U - H _L ).
the inner fences , used to identify outliers, are defined as:
H _L - 1.5 x H-spr

H _U + 1.5 x H-spr
Outer fences are defined by similar intervals extending 3 x H-spr beyond the hinges. Note, that for a Gaussian distribution, 1.5 H-spr approx 2 s , and 3 H-spr approx 4 s .
Any data value beyond the inner fences is termed an outside value , and plotted individually. Data values beyond the outer fences are termed far outside , and plotted individually with a darker or bigger plotting symbol.
On either side of the central box, draw a whisker line out to the most extreme value on that side (called the adjacent value ) that is still inside the inner fence.

For the infant mortality data, the calculation of fences is (from the letter value display)

inner fences:  25.7 - 1.5 * 103.7 = -129.85
              129.4 + 1.5 * 103.7 =  284.95  (outside: 300, 400)

outer fences:  25.7 -  3  * 103.7 = -285.4
              129.4 +  3  * 103.7 =  440.5   (far out: 650-Saudi Arabia)

adjacent:      259

Here is the boxplot, showing two outside, and one far-out values:


      BOXPLOT IMR
    9.60              223.07              436.53              650.00
     +-------------------+-------------------+-------------------+
       +--------+
     <-|  |     |----------->  o          o                     X
       +--------+
                             LIBYA      AFGHAN.              S. ARABIA

Note that the skewness of the data is apparent in (a) the location of the median relative to the hinges, (b) the relative lengths of the whisker lines, and (c) the outside values on the high end.

Constructing boxplots with SAS

The PLOT option of PROC UNIVARIATE also gives a small boxplot. When you have a grouping variable, you can produce full-page, side-by-side boxplots for each group on the printer with PROC UNIVARIATE.

The countries in the NATIONS data set are classified by REGION. Adding the statement BY REGION to the previous example gives side-by-side boxplots. (The data must first be sorted by REGION.)

proc univariate data=nations plot;
   var imr;
   by region;



                             Univariate Procedure
                               Schematic Plots

 Variable=IMR           Infant Mortality Rate

             |
         700 +
             |
             |                                                *
             |
         600 +
             |
             |
             |
         500 +
             |
             |
             |
         400 +                                                0
             |
             |
             |
         300 +                        0
             |
             |                        |
             |                        |
         200 +                        |                       |
             |            *        +-----+                    |
             |                     *--+--*                    |
             |                     |     |                 +-----+
         100 +                     +-----+                 |  +  |
             |         +-----+        |                    |     |
             |         *--+--*        |           0        *-----*
             |            |                    *--+--*     +-----+
           0 +                                 +-----+        |
              ------------+-----------+-----------+-----------+-----------
      REGION          Americas      Africa      Europe    Asia/Oce

Figure 2: Boxplots for IMR by region. The side-by-side display allows visually comparing the groups in central value, spread, shape, and unusual values.

Boxplots with PROC GPLOT

Simple boxplots can be drawn using PROC GPLOT in Version 6 of the SAS System using the I=BOX interpolation options of the SYMBOL statement. By default, the whisker lines extend ± 1.5 IQR beyond the quartiles.

Figure 3 shows boxplots of Infant mortality rate by region for the NATIONS data. This plot is drawn by the statements below. The option I=BOXT draws horizontal lines at the tips of the whisker lines.

title h=1.5 'Boxplot with I=BOX';
symbol i=boxt mode=include             /* V6.06 */
       v=+ color=black h=1.2;
proc gplot data=nations ;
   plot imr * region
        / frame vaxis=axis1 haxis=axis2
          vminor=1 hminor=0;
   axis1 label=(h=1.5 a=90 r=0) value=(h=1.3);
   axis2 label=(h=1.5)          value=(h=1.3) offset=(8 pct);

Figure 3: Simple boxplot with I=BOX option

Notched boxplots

"Notches" can be drawn in each box to show approximate confidence intervals around the medians. If the intervals for two groups do not overlap, the group medians can be considered different.

The notches to give approximate 95% comparison intervals are calculated as:

Median ± 1.58 ( H-spr / sqrt n )

BOXPLOT macro

The BOXPLOT macro provides the following features:

one or more CLASS variables, character or numeric.
lines connecting the group medians.
notches showing 95% confidence intervals for difference in medians.
outside observations can be labelled with the value of an ID variable.
box widths can be constant for all groups, or made proportional to a function of the sample size in each group.
user-supplied annotations can be added to the graph.
produces an output data set containing boxplot statistics.

Using the BOXPLOT macro

The parameters for the macro are shown below. Only the CLASS and VAR parameters must be supplied. If an ID variable is specified, outside observations (beyond the adjacent values) are labelled on the graph.

                               /* Description of Parameters:         */
%macro BOXPLOT(                /* -------------------------          */
       data=_LAST_,            /* Input dataset                      */
       class=,                 /* Grouping variable(s)               */
       var=,                   /* Ordinate variable                  */
       id=,                    /* Observation ID variable            */
       width=.5,               /* Box width as proportion of maximum */
       notch=0,                /* =0|1, 1=draw notched boxes         */
       connect=0,              /* =0 or line style to connect medians*/
       f=0.5,                  /* Notch depth, fraction of halfwidth */
       fn=1,                   /* Box width proportional to &FN      */
       varfmt=,                /* Format for ordinate variable       */
       classfmt=,              /* Format for class variable(s)       */
       varlab=,                /* Label for ordinate variable        */
       classlab=,              /* Label for class variable(s)        */
       yorder=,                /* Tick marks, range for ordinate     */
       anno=,                  /* Addition to ANNOTATE set           */
       out=boxstat,            /* Output data set: quartiles, etc.   */
       name=BOXPLOT            /* Name for graphic catalog entry     */
       );

The figure below shows notched boxplots for the IMR data produced using the BOXPLOT macro. The width of each box is made proportional to sqrt n .

%include boxplot;
title1 h=1.5 'Infant Mortality Rate by Region';
%boxplot(data=nations, class=region, var=imr,id=nation,
         notch=1, fn=sqrt(n), classfmt=region.,
         varlab=Infant Mortality Rate, classlab=Region);

1.4 Transformations for symmetry

Transformations have several uses in data analysis, including:

making a distribution more symmetric
equalizing variability (spreads) before comparing level across several batches
making the relationship between two variables linear.

These goals often coincide, in that a transformation that achieves one goal will often help for another.

A particularly useful family of transformations is what Tukey calls the ladder of powers , where a variable x is transformed to t _p ( x ) by a power p according to

t _p ( x ) = left { lpile x ^p -1 over p , above log ₁₀ x , lpile p != 0 above p = 0 right "" . (1)

Although the family of transformations given by (1) is defined for all powers p , it is common to use only the simple integer and half-integer powers listed in Table 1. (In some situations, we may add the cube root, p = 1 / 3 to this list.) Table 1: Simple transformations in the ladder of powers

       +---------------------------------------------+
       | Power     Transformation      Re-expression |
       |---------------------------------------------|
       |   3       Cube                 x ³      |
       |                                             |
       |   2       Square               x ²      |
       |                                             |
       |   1       NONE (Raw)           x            |
       |                                             |
       |  1/2      Square root          sqrt x       |
       |                                             |
       |   0       Log                  log10 x      |
       |                                             |
       | -1/2      Reciprocal root     -1/sqrt x     |
       |                                             |
       |  -1       Reciprocal          -1 / x        |
       +---------------------------------------------+

The important properties of this family of transformations are:

They preserve the order of data values. Larger data values on the original scale will be larger on the transformed scale. (That's why negative powers have their sign reversed.)

They change the spacing of the data values. Powers p < 1 , such as sqrt x and log x compress values in the upper tail of the distribution relative to low values; powers p > 1 , such as x ² , have the opposite effect, expanding the spacing of values in the upper end relative to the lower end.

     log X        X                X**2
       0           1                1
 1  {                }  9             }     99
       1          10              100
 1  {                } 90             }   9900
       2         100            10000
 1  {                }900             } 990000
       3        1000          1000000

The effect on the shape of the distribution changes systematically with p . If sqrt x pulls in the upper tail, log x will do so more strongly, and negative powers will be stronger still.

Note: for power transformations to be effective, the ratio of the largest to the smallest data value must be fairly big.

Since the power transformations preserve the order of the data values, we can find a suitable transformation by looking at the letter values and their midsummaries transformed to various powers. In a symmetric distribution, the midsummaries do not systematically change as we go out toward the extremes.


       LETRVALS IMR * .5
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -    7.8     7.8 -             7.8
 H 26  -    5.1    11.4 -     6.3     8.2
 E 13H -    4.1    12.9 -     8.8     8.5
 D  7  -    3.6    14.1 -    10.6     8.9
 1  1  -    3.1    25.5 -    22.4    14.3

       LETRVALS 10zIMR
  DEPTH   LOWER   UPPER    SPREAD   MIDPT
 M 51  -    1.8     1.8 -             1.8
 H 26  -    1.4     2.1 -      .7     1.8
 E 13H -    1.2     2.2 -     1       1.7
 D  7  -    1.1     2.3 -     1.2     1.7
 1  1  -    1       2.8 -     1.8     1.9

The boxplots for sqrt IMR and log IMR show the effect of these transformations on the shape of the data. The distribution becomes more symmetric as the power decreases.


      BOXPLOT IMR * .5
    3.10               10.56               18.03               25.50
     +-------------------+-------------------+-------------------+
          +----------------+
     ----|       |        |----------------------              
          +----------------+

      BOXPLOT 10zIMR
     .98                1.59                2.20                2.81
     +-------------------+-------------------+-------------------+
                   +----------------------+
     -------------|           |          |----------------------
                   +----------------------+

Plots for assessing symmetry

Graphical methods can also be used to find a power transformation for symmetry. The simplest, useful plots are based on the idea that in a symmetric distribution, the distances of points at the lower end to the median should match the distances of corresponding points in the upper end to the median. In general, if the distribution were perfectly symmetric the observations at depth i should satisfy,

Lower distance to median = Upper distance to median

fwd 1.25i M - x _(i) = x _(n+1-i) - M

for i = 1 to n / 2 if n is even, or i = 1 to ( n+1) / 2 if n is odd. In the Upper vs. Lower plot , these points should plot as a straight line with slope = 1 in a symmetric distribution. For skewed distributions, the points will tend to rise above the line (positive skew) or fall below (negative skew).

Untilting: Mid - Spread plot

In the Upper vs. Lower plot we judge departure from symmetry by divergence from the line y = x . A simple modification of this plot changes the coordinates so that the reference line for symmetry becomes horizontal, and it is always easier to judge departure from a flat line than a tilted one. In this plot, called the mid vs. spread plot, we plot

mid identical ( x _(n+1-i) + x _(i) ) / 2 vs. x _(n+1-i) - x _(i) identical spread .

This plot is based on the idea that in a symmetric distribution, each mid value should equal the median,

( x _(n+1-i) + x _(i) ) / 2 = M .

Since the spreads increase as we go from the center out to the tails, this plot depicts how the pattern of mid values changes from the middle of the distribution to the extremes. Because the plot is untilted (slope = 0) when the distribution is symmetric, expansion of the vertical scale allows us to see systematic departures from flatness far more clearly.

Mid vs. z² plot

In fact, the process of choosing a transformation from the ladder of powers can be made even easier, with a variation of the Mid - Spread plot suggested by Emerson & Stoto (1982). In this display, called the Mid vs. z² plot , we plot the centered mid value,

x _(i) + x _(n+1-i) over 2 - M

as the vertical coordinate, against a squared measure of spread,

Lower ² + Upper ² over 4 M = ( M - x _(i) ) ² + ( x _(n+1-i) - M ) ² over 4 M

as the vertical coordinate. If this graph is approximately linear, with a slope, b , then p = 1 - b is the indicated power for a transformation to approximate symmetry.

SYMPLOT macro

The essential idea for all these plots is to fold the distribution of scores around the middle value. To do this, sort the scores twice: once in ascending order, then in descending order, and then merge the two sets. This operation pairs ( x ₍₁₎ with x _(n) ), ( x ₍₂₎ with x _(n-1) )

proc sort data=analyze out=sortup;
   by X;
proc sort data=analyze out=sortdn;
   by descending X;

data symplot;
   merge sortup(rename=(X=frombot))      /* frombot = x(i)     */
         sortdn(rename=(X=fromtop));     /* fromtop = x(n+1-i) */

   mid = (fromtop + frombot) / 2;
   spread = fromtop - frombot;

These plots are all implemented in the SYMPLOT macro, which takes the following parameters:

 %macro SYMPLOT(
       data=_LAST_,     /* data to be analyzed                   */
       var=,            /* variable to be plotted                */
       plot=MIDSPR,     /* Type of plot(s): NONE, or any of      */
                        /* UPLO, MIDSPR, MIDZSQ, or POWER        */
       trim=0,          /* # or % of extreme obs. to be trimmed  */
       symbol=dot,      /* plotting symbol                       */
       out=symplot,     /* output data set                       */
       name=SYMPLOT);   /* name for graphic catalog entry        */

For example, the symmetry power plot of the IMR data is produced by these statements:

%include symplot;
title h=1.6 'IMR data: Symmetry transformation plot';
%symplot(data=nations,var=IMR,plot=POWER,trim=5 pct);

Trimming 5% of the observations in each tail ensures that extreme observations do not influence the choice of the power transformation.

1.5 Dot charts

We often need to display values of a quantitative variable where each value has an identifying label associated with it. To interpret the display--and understand which values are high and which are low--we need to plot the labels along with the data values.

The dot chart , developed by Cleveland (1984) is quite effective for this purpose, as long as the number of observations is not too large. If the observations are sorted by the response variable, the dot chart portrays the empirical cumulative distribution of the sample.

Simple dotplots can be produced on a printer with the TIMEPLOT procedure. The observations are arranged on the vertical axis in their order in the data set, so the data set should be sorted in descending order of the response variable. For example, the statements below produce the printer plot shown in Figure 4. The options REF=0 and JOINREF produce the dashed line connecting the plotting symbol to the vertical axis.

%include data(duncan);
proc sort data=duncan;
   by descending prestige;
proc timeplot data=duncan;
   plot prestige = '*' / ref=0 joinref;
   id job;



                       Duncan Occupational Prestige Data

JOB                 Prestige      min                                        max
                                  0                                           97
                                 *---------------------------------------------*
Physician                 97     ||-------------------------------------------*|
Professor                 93     ||------------------------------------------* |
Banker                    92     ||------------------------------------------* |
Architect                 90     ||-----------------------------------------*  |
Chemist                   90     ||-----------------------------------------*  |
Dentist                   90     ||-----------------------------------------*  |
Lawyer                    89     ||----------------------------------------*   |
Civil Eng.                88     ||----------------------------------------*   |
Minister                  87     ||---------------------------------------*    |
Pilot                     83     ||--------------------------------------*     |
Accountant                82     ||-------------------------------------*      |
Factory Owner             81     ||-------------------------------------*      |
Author                    76     ||----------------------------------*         |
Contractor                76     ||----------------------------------*         |
PS Teacher                73     ||---------------------------------*          |
RR Engineer               67     ||------------------------------*             |
Welfare Wrkr.             59     ||--------------------------*                 |
Undertaker                57     ||-------------------------*                  |
Machinist                 57     ||-------------------------*                  |
Electrician               53     ||------------------------*                   |
Reporter                  52     ||-----------------------*                    |
Store Manager             45     ||--------------------*                       |
Insur. Agent              41     ||------------------*                         |
Policeman                 41     ||------------------*                         |
Bookkeeper                39     ||-----------------*                          |
RR Conductor              38     ||-----------------*                          |
Mail carrier              34     ||---------------*                            |
Carpenter                 33     ||--------------*                             |
Plumber                   29     ||------------*                               |
Auto repair               26     ||-----------*                                |
Machine opr.              24     ||----------*                                 |
Barber                    20     ||--------*                                   |
Motorman                  19     ||--------*                                   |
Store clerk               16     ||------*                                     |
Cook                      16     ||------*                                     |
Coal miner                15     ||------*                                     |
Truck driver              13     ||-----*                                      |
Watchman                  11     ||----*                                       |
Gas stn attn              10     ||----*                                       |
Taxi driver               10     ||----*                                       |
Waiter                    10     ||----*                                       |
Janitor                    8     ||---*                                        |
Bartender                  7     ||--*                                         |
Soda clerk                 6     ||--*                                         |
Shoe-shiner                3     ||*                                           |
                                 *---------------------------------------------*

Figure 4: Dot chart of Occupational Prestige from PROC TIMEPLOT

DOTPLOT macro

A SAS macro, DOTPLOT, prepares grouped or ungrouped dot charts, with an optional error bar for each observation. The dotted lines can be omitted or drawn in several ways. The macro takes the following arguments (with default values shown after the = sign):

%macro dotplot(
       data=_LAST_,          /* input data set                         */
       xvar=,                /* horizontal variable (response)         */
       xorder=,              /* plotting range of response             */
       xref=,                /* reference lines for response variable  */
       yvar=,                /* vertical variable (observation label)  */
       ysortby=&xvar,        /* how to sort observations               */
       ylabel=,              /* label for y variable                   */
       group=,               /* vertical grouping variable             */
       gpfmt=,               /* format for printing group variable     */
                             /* value (include the . at the end)       */
       connect=DOT,          /* draw lines to ZERO, DOT, AXIS, or NONE */
       dline=2,              /* style of horizontal lines              */
       dcolor=BLACK,         /* color of horizontal lines              */
       errbar=,              /* variable giving length of error bar    */
                             /* for each observation                   */
       name=DOTPLOT);        /* Name for graphic catalog entry         */

The high-resolution dot chart of occupational prestige is produced by these statements:

%include DUNCAN ;   * get the data;
%include DOTPLOT;   * get the macro;
%dotplot( data=duncan,
          yvar=job, xvar=prestige,
          connect=DOT);

1.6 Normal probability plots

We frequently want to compare an observed distribution of a variable to some theoretical distribution (e.g., the normal or Gaussian distribution). Ordinary histograms, and stem and leaf displays are not particularly good for this because

they use arbitrary bins (class intervals)
they lose resolution in the tails (where differences are likely)
the standard for comparison is a curve

Quantile comparison plots or Q - Q plots compare the empirical cumulative distribution of the data

G HAT ( x ) = number ( X <= x ) over n = estimated Prob ( X <= x )

with the cumulative probability distribution (CDF) of the reference distribution,

F ( z ) = Prob ( Z <= z )

A general, graphical approach is to plot the ordered data values, x _(i) , against the expected values of those observations if they followed the particular reference distribution. When the data follows the reference distribution, the points in such a plot will follow a straight line. Departures from the line shows how the data differ from the assumed distribution.

Constructing a Q - Q plot

Arrange the observations in ascending order, giving x sub (1) , x ₍₂₎ , ... , x _(i) , ... , x _(n)
For each i , calculate the proportion of observations p _i below x _(i) ,
p _i = i - 1/2 over n
(Other definitions are used also; this one splits each observation, assigning half of it above, and half below x _(i) )
Find the corresponding quantile of the reference distribution,
z _i = F ^-1 ( p _i )
Plot the z _i against the ordered observation x _(i) on the vertical axis.

Notes

If the observations come from distribution F (the Gaussian, say), then the plot will be approximately linear with an intercept = 0 and slope = 1.
If the distributions are identical except for location (mean), then the intercept will be non-zero; if they are identical except for spread, the slope will be different from 1.
A comparison line , representing the expected, theoretical curve, can be drawn on the plot. For the Normal Q-Q plot, a robust comparison line can be drawn, using the median for the intercept, and the H-spr / 1.349 for the slope. Then,
expected = Median ( x ) + ( H-spr / 1.349 ) z _i

Patterns of deviation for Normal Q-Q plots

Postive (negative) skewed: Both tails above (below) the comparison line
Heavy tailed: Lower tail below, upper tail above the comparison line

Detrending

It is easier to see the pattern of departure of the data from the comparison line if we plot the difference between the data values and the comparison line on the vertical axis.

Confidence bands

Points on the Q-Q are not equally variable--we should expect observations in the tails to vary most--and it can be quite helpful to display variability information on the plot.

One way to do this is to calculate an estimated standard error, s Hat ( z _i ) , of the ordinate z _i and plot curves showing the interval z _i ± 2 s Hat ( z sub i ) to give approximate 95% confidence intervals. Chambers et al. (1983) give formulas for estimating the standard errors for most of the common comparison distributions. For the normal Q-Q plot, the standard error is estimated by

s Hat ( z _i ) = s hat over f ( z _i ) sqrt p _i ( 1 - p _i ) over n (2)

where f ( z _i ) is the normal density function, and s hat estimates the slope of the comparison line, the standard deviation of x . Chambers et al. (1983) stress using a robust estimate of s hat , such as the measure based on the hinge-spread, so that the standard errors are not distorted by outliers or straggling tails.

Constructing Normal Q-Q plots with SAS

Normal Q-Q plots can be drawn in the SAS System in several ways. The simplest (and least versatile) plot is given by the PLOT option in PROC UNIVARIATE. The normal quantiles can be calculated in a DATA step with the PROBIT function, and these values can be plotted against the data quantiles with PROC PLOT or PROC GPLOT.

To produce your own plot with SAS/GRAPH software, you can sort the data and calculate the normal quantiles in a DATA step with the PROBIT function as shown below. This is the most general method, since you can easily adjust the calculation of p _i , use a different F ^-1 function for Q-Q plots based on other distributions, and add calculations for standard errors, as described above.

**- Normal Q-Q plot using PROBIT --;
proc sort data=nations;
   by imr;
proc univariate noprint data=nations;
   var imr;
   output out=stats n=nobs median=median qrange=hspr
                           mean=mean std=std;
data quantile;
   set nations;
   if _n_=1 then set stats;
   if imr ^= . ;                       *-- remove missing data!;
   i + 1;
   p = (i - .5) / nobs;
   z = probit(p);
   normal = mean   + z * std;
   robust = median + z * hspr/1.349;

proc gplot data=quantile;
   plot imr   * z = 1
        normal* z = 2
        robust* z = 3 / overlay frame;
   symbol1 i=none h=1.1 v=- color=black;
   symbol2 i=join l=20 v=none color=blue;
   symbol3 i=join l=1   v=none color=red;

NQPLOT macro

A SAS macro program, NQPLOT, prepares normal Q-Q plots in various forms. By default, the program plots both the standard and detrended versions together with standard error curves. You can select the classical or robust estimates for the comparison line with the MU= and SIGMA= parameters. The macro program uses the following syntax:

%macro nqplot (
        data=_LAST_,    /* input data set                      */
        var=x,          /* variable to be plotted              */
        out=nqplot,     /* output data set                     */
        mu=MEDIAN,      /* est of mean of normal distribution: */
                        /*  MEAN, MEDIAN or literal value      */
        s=HSPR,     /* est of std deviation of normal:     */
                        /*  STD, HSPR, or literal value        */
        stderr=YES,     /* plot std errors around curves?      */
        detrend=YES,    /* plot detrended version?             */
        lh=1.5,         /* height for axis labels              */
        anno=,          /* name of input annotate data set     */
        name=NQPLOT,    /* name of graphic catalog entries     */
        gout=);         /* name of graphic catalog             */

Only the name of the data set and the variable to be plotted need be supplied. The normal Q-Q plot of the IMR data would be produced with this statement:

%nqplot(data=nations,var=imr);

Previous section

Next section.