Exploratory and Graphical Methods of Data Analysis
Michael Friendly

Part 2: Examining Relationships

  • 2.1 Enhanced Scatterplots
  • 2.2 Smoothing scatterplots
  • 2.3 Transformations for linearity
  • 2.4 Plotting discrete data
  • 2.1 Enhanced Scatterplots

    Scatterplots can be enhanced by adding information to help you interpret the display or understand aspects of the data that are not shown directly.

    One idea is to add an elliptical confidence region ( data ellipse ) around the mean to highlight the joint relationship. With data for several groups this helps show the extent to which the relationship is the same for all groups. (Better than individual regression lines.)

    For bivariate normal observations, x i = ( x i , y sub i ) , the elliptical region containing ( 1 - a ) of the data is given by the values x satisfying:
    ( x - x bar )T S size +2 -1 ( x - x bar ) <= chi 2 2 ( 1 - a ) ' ' . (3)
    where x bar = ( x bar , y bar ) are the sample means, S ( 2 × 2 ) is the covariance matrix of (x , y ) , and chi 2 2 ( 1 - a ) is the (1 - a ) percentage point of the chi² distribution with two degrees of freedom.

    CONTOUR macro

    The macro CONTOUR uses PROC IML to calculate points on the boundary of the ellipse for each group in the input data set.
    %macro CONTOUR(
           data=_LAST_,              /* input data set    */
           x=,                       /* X variable        */
           y=,                       /* Y variable        */
           group=,                   /* Group variable    */
           pvalue= .5,               /* probability value */
           out=CONTOUR,              /* output data set   */
           colors=RED GREEN BLUE BLACK);
    
    For example, a set of 50% data ellipses for Weight vs. Price of automobiles grouped by region of origin is produced by these statements:
    %include contour;
    %contour(data=auto, x=price, y=weight, group=origin);
    

    2.2 Smoothing scatterplots

    Our eyes are often quite effective in detecting patterns in scatterplots that could not be captured by any quantitative measures. Sometimes, however, the pattern may be too weak, or the distributions of the variables too uneven, to be able to clearly see the relation between Y and X. In these cases, it is useful to plot some smooth curve(s) or other display which shows how Y changes with X.

    Trace lines and boxplots

    One set of ideas involves dividing the data into roughly equal-sized strips according to their ordered x-values. For each strip, we can plot:

    The plot below shows the lottery numbers assigned to each birthday (Day of Year) in the 1970 US Draft Lottery carried out by the Selective Service to determine the order in which elligible men would be drafted into the US Army. It was supposed to be completely random, and looks that way from the scatter plot

    
       SCATTER DRAFT
    
        366  -        **    *      *  2 * **               *
     L       -    ***     * *  **2       ** 2       2*      *   *
     O       - ***  *    * 2    * ** *  *    *2    *  *              *
     T       -***        *    * *     2*        *  **      *     *  *  2*
     T       -     *  2*2**       * * **     *  **  **       **
     E  275  o   *      2   ****  *  *   2  **   *          *    **  *
     R       -*           *2* * *2   * *      *  *     * * *  **
     Y       -  **     *   *   *          *          * *******2     *   *
             -***  ****  *  **2       **   *   *      *      *  *
     N       - 2    *  *   *      **    *   *    * *     2* *  2   *
     U  184     *  * **          * **   *   * 2       ***    **   **
     M       -*    *      2 *                   * *  ** * *  *       2** 2
     B       -   * * 2   *    **   *           *  **2     ** *     3
     E       -  *        **    *    2    * *  *            *      3 **  2 *
     R       -    *     *            * ** ***    *2* *  * *   *    *  *
         92  + *  ****          *       *   *  **           *  * *   *  *2*
             -    *  *       2   * * *   **     *      **    *  *2*     **
             -   *2    *        *   *     ** *    2   2  *       *  * *  *
             -          * *  2    *** *       *  * * *          2 *   *2
             -  *     *        *       **  * **** *   *   * *   *     **  *
          1  -       *           *                     2*     **    * *   *
              -------------------+------------------+-------------------
              1                 123                 244                 366
                                                               DAY OF YEAR
    
    The next slide shows the trace line plot of the medians (M), upper (H), and lower (L) hinges of the lottery numbers against day of the year. It is clear, even with fairly coarse resolution, that the lottery numbers decrease towards the end of the year, indicating a defect in the method used to select these "random numbers".
    
    
         342  +                    H     H
     L        -         H          HH   HHH H
     O        -        HH    HHH H HHH HH H  H
     T        -H     H H HHHHHHHHHH  HHH   HHHHH    HH
     T        -H   HHHHH  HHHH   HH    H    HHHHH  HH   H        H
     E   268  + HHHHH H                       H    HH H HH   HHH H
     R        - HHHH    M                       HHH H HHHHHH HHHH H
     Y        -  HH    MMMMMMMM    MM             HH  H   HHHH    HH
              -      MMM  MMMMMMMM MM  MMM        H        H       H  H  H
     N        -M    MMM         MMMM MMMMMM                H     M HHHHHHHHH
     U   194  +MM  M    L        MM   MM   MMM               MMMMM     HH  H
     M        - MMMM   LL                   MMMMM   MM  MMMMMM   MM     H  H
     B        -   M      L   L                M MMMMM MMMM M      MM
     E        -        L LLLL                     MM  MM       L   MMM
     R        -L    LLLL  LLLLL     L                        LLLLL   MMMMMMM
         121  + L  LLLL      L L L LL   LL                   LL LL     MM  M
              - L  L            LLLLL  LLLL L       LL  LL LLL  LL
              - LLLL           LLLLL LLLLL LLL LL  LLL  LLLLLL    LL
              -  LL              LL  LLL   LLLLLL  LL L LL LL     LLLLL  LLL
              -   L                           L  LLL  LLL  L         LLLLL L
          47  +                             L         LL               LL  L
               +-------------------+-------------------+-------------------+
              21                 129                 237                 346
                                                                DAY OF YEAR
    

    Lowess smoothing

    Another generally useful technique is robust, locally weighted regression smoothing , often called lowess . The procedure finds a smoothed fitted value, y hat i , for each x i by fitting a weighted regression to the points in the neighborhood of x i . The points closest to x i receive the greatest weight. This is the "locally weighted regression" part of the procedure, and the weights are called neighborhood weights .

    The robust part works as follows: Once the fitted values, y hat i , have been found, the residuals, r i = y sub i - y hat i are used to determine a new set of weights ( robustness weights ) so that points which have large residuals are down-weighted, and the locally weighted regression is repeated.

    In practice, lowess depends on the choice of a smoothing parameter, f , 0 < f <= 1 , the fraction of the data points to be considered in the calculation of y hat i . Choosing f = .5 means that only the r = [ .5 n ] points closest to x i have non-zero weights. Increasing the value of f makes the fitted curve smoother, decreasing f lets the curve follow the data more closely.

    Choosing f

    curve, but may result in oversmoothing. f = .5 is often a good choice, but values in the range .33 to .67 may be tried.

    The figure below shows the lowess smoothed fit for the draft lottery data. The steady decline in Lottery Number over the last two-thirds of the year is now quite clear. (The plot is over-dramatic, and perhaps misleading, since the vertical scale covers a smaller range than the plots of the raw data and trace lines.)

    
           SMOOTH ~ DRAFT LOWESS }366
           SCATTER SMOOTH
    
        213  -             26666663
     L       o    *666666674      366
     O       -46665                  66
     T       -                        *65
     T       -                          *66
     E  189                               66*
     R       -                               563
     Y       -                                 3664
             -                                    366663
     N       -                                         3666*
     U  165  +                                             56*
     M       -                                               55
     B       -                                                26*
     E       -                                                  54
     R       -                                                   26*
        141  +                                                     54
             -                                                      26
             -                                                        62
             -                                                         44
             -                                                          26
        117  -                                                            4
              -------------------+------------------+-------------------
              1                 123                 244                 366
                                                               DAY OF YEAR
    

    LOWESS macro

    The lowess procedure involves a series of weighted regressions, one for each data point. The process can be carried out easily with the matrix operations of PROC IML.

    The LOWESS macro reads the input data set into PROC IML, calculates the smoothed y hat values, and creates an output data set containing the original and smoothed data. The macro takes the following parameters:

     /*---------------------------------------------------------------*
      * LOWESS SAS Locally weighted robust scatterplot smoothing      *
      *---------------------------------------------------------------*
    %macro LOWESS(
       data=_LAST_,    /* name of input data set            */
       out=SMOOTH,     /* name of output data set           */
       x = X,          /* name of independent variable      */
       y = Y,          /* name of Y variable to be smoothed */
       id=,            /* optional row ID variable          */
       f = .50,        /* lowess window width               */
       p = 1,          /* 1=linear fit, 2=quadratic         */
       iter=2,         /* total number of iterations        */
       plot=NO,        /* draw the plot?                    */
       gplot=NO,       /* draw the plot?                    */
       pplot=NO,       /* draw a printer plot?              */
       symbol=circle,
       htext=1.5,
       hsym=1.5,
       name=LOWESS);   /* name for graphic catalog entry    */
    
    e.g.,
    %include lowess;
    %lowess( data=auto, x=weight, y=mpg, f=.3, gplot=YES );
    

    2.3 Transformations for linearity

    If we think of y as a response (or "dependent") variable, and x as a factor (or "independent") variable, we might like to fit y as a function of x
    y = f ( x ) + residual = fit + residual

    All other things equal, we prefer a "simple" f ( x ) like a linear function, to a more complex one. If a scatterplot of y against x appears substantially non-linear, there are two choices:

    Once again, the ladder of powers provides a scheme for understanding the effect of various power transformations. Visual examination of the scatterplot of the raw data, possiblly enhanced with a lowess smoothed curve, can suggest the direction to move on the ladder for transforming x , or y , or both.

    As a preliminary example, consider the simple case where

    y = size -2 1 over 5 x 2
    for x = 0, 1, 2, 3, 4, 5. Here, the relation can be straightened by transforming x :
    xT= x 2 --> y = size -2 1 over 5 xT
    or y :
    yT= sqrt y --> y = sqrt size -2 1 over 5 xT
    The data and the transformed values are:
    X      Y      X}     Y.9
    0     0        0     0
    1     0.2      1     0.4472
    2     0.8      4     0.8944
    3     1.8      9     1.342
    4     3.2     16     1.789
    5     5       25     2.236
    
    These values are plotted below:
    
           SCAT X, Y                         SCAT XP, Y
     5-                        f       5-                        f
      -                                 -
      -                                 -
      -                                 -
      -                   f             -               f
      -                                 -
      -              f                  -        f
      -                                 -
      -         f                       -   f
      -                                 -
     0f----f--------------------       0ff------------------------
      0           X            5         0           X'         25
    
           SCAT X , YP
     2.5-
        -                        f
        -
        -                   f
        -
        -              f
        -         f
        -
        -    f
        -
     0  f-------------------------
        0           X            5
    

    Tukey's arrow rule

    More generally, a power of x or of y can be selected by examining the curvature in the scatter plot of y vs. x and drawing an arrow from the inside to the outside of the "bulge".

    The arrow points in the direction to move along the ladder of powers for x or for y . That is, if the arrow points toward smaller values of x (or y ), move down the ladder of powers, toward sqrt x or log ( x ) (or sqrt y , log ( y ) ). The greater the curvature, the further from 1 (raw data) should be the power.

    Summary points

    The arrow rule indicates the direction to move but does not indicate, except roughly, what the power should be. To see what effect a given transformation,
    x --> x p
    y --> y q

    would have, you would have to transform the data and examine the scatterplot of the re-expressed values.

    Since the power transformations are order-preserving, we can try different power transformations on a few well chosen points, to see what effect they would have on all the data.

    Tukey (1977) suggests finding three summary points which are the median ( x , y ) values in three equal-sized strips based on the ordered x -values. Use subscripts L , M , H to denote the ( x , y ) values for the Low, Middle, and High strip.

    If the relation is linear, these three points will fall on a line, or equivalently, the slope of the line from (x , y ) sub L to (x , y ) M will be the same as the slope of the line from (x , y ) M to (x , y ) sub H .

    
    
     Y-        -       -               Y-        -       -    H
      -        -       -                -        -       -
      -                    H            -
      -                                 -
      -                                 -
      -            M                    -
      -                                 -            M
      -                                 -
      -    L   -       -                -    L   -       -
      -        -       -                -        -       -
     0+-------------------------       0+-------------------------
      0           X                      0           X
    
    
    On the other hand, if the relation is not linear, the lines connecting the pairs of points will have different slopes.

    Finding summary points

    1. Sort the ( x , y ) pairs in increasing order by x -value.
    2. Divide the points into thirds, based on ordered x -values. If n / 3 is not an integer, put the [ n / 3 ] points with the smallest x -values in the Low group, the [ n / 3 ] points with the largest x -values in the High group, and the remaining points in the Middle group.

      This division is modified when equal x -values straddle the dividing line: they are all kept in the same third. Also, neither end portion should cover more than half the range of the x -values, which can happen if the distribution of x is highly skewed.

    3. Then x L is the median of the x s in the L group; y L is the median y . The summary points (x M , y M ) and (x H , y H ) are defined similarly.

    This section uses data on literacy rates and gross national product for 22 nations to show the details of the calculations. (The plot indicates the relation is highly nonlinear. The arrow rule indicates we should transform x to lower powers or y to higher powers.)

    
     NEPAL            45  5
     BURMA            57 47.5
     UGANDA           64 27.5
     S. VIETNAM       76 17.5            SCAT GNP
     THAILAND         96 68        100-         f f           f          f
     HAITI           105 10.5         -                f
     INDONESIA       131 17.5         -        f
     S. KOREA        144 77       L   -  f
     GHANA           172 22.5     I   -   f
     PERU            179 47.5     T   - f   f
     EL SALVADOR     219 39.4     E   -    f
     BR.GUIANA       235 74       R   -      f
     HONG KONG       272 57.5     A   -f f   f
     PANAMA          329 65.7     C   -   f
     LEBANON         362 47.5     Y   -
     SINGAPORE       400 50           -f
     ARGENTINA       490 86.4         -fff
     ICELAND         572 98.5         - f
     CZECHOSLOVAK    680 97.5         -f
     FRANCE          943 96.4        0-------------------------------------
     NEW ZEALAND    1310 98.5            0           GNP               2000
     CANADA         1947 97.5
    
    Since the distribution of GNP is so skewed, the range rule suggests allocating only two points to the upper third. The summary points are found to be:
    
             2 RSTAB GNP
       SUMPTS: SPLIT IS 7 13 2
       SUMMARY POINTS
       L     76 17.5
       M    329 65.7
       H   1629 98
    

    Ratio of slopes

    The curvature of the data can be measured by the ratio of slopes :
    r = m HM over m ML = ( y H - y M ) / ( x H - x M ) over ( y M - y L ) / ( x M - x L ) (4)
    As shown earlier, a linear relation implies r approx 1 (or log r approx 0 ).

    For the GNP data, this ratio is quite far from 1:

    
          1 1 SLOPES S
    SUMMARY    HALF    SLOPE
    POINTS     SLOPES  RATIO
    [X*1] [Y*1]
    1629 98
               0.02485
     329 65.7          0.1304
               0.1905
      76 17.5
    

    Since we are using medians, and the ladder of powers is order-preserving (as long as all values are positive), we can transform the summary points to various powers, and the ratio of slopes for the transformed summary points will tell what effect that transformation will have on the whole data set.

    If we transform x --> x p , and y --> y q , then the ratio of slopes for that pair of transformations is

    r < (p,q) = m HM < (p,q) over m ML < (p,q) = ( y H q - y M q ) / ( x H p - x M p ) over ( y M q - y L q ) / ( x M p - x L p ) (5)
    Since we know we must go down the scale of powers for x , we try log x , then - 1 / sqrt x . The first is not far enough, but the second is too far. p = - 1/3 is an in-between power, and seems to work quite well.
    
          0 1 SLOPES S
    SUMMARY     HALF   SLOPE
    POINTS      SLOPES RATIO
    [X*0] [Y*1]
    3.212 98
                46.49
    2.517 65.7         0.6138
                75.74
    1.881 17.5
    
          H.5 1 SLOPES S
    SUMMARY        HALF   SLOPE
    POINTS         SLOPES RATIO
    [X*H0.5] [Y*1]
    H0.02478 98
                   1064
    H0.05513 65.7         1.315
                    809
    H0.1147  17.5
    
          H.33 1 SLOPES S
    SUMMARY        HALF   SLOPE
    POINTS         SLOPES RATIO
    [X*H0.33] [Y*1]
    H0.08711 98
                   533.3
    H0.1477  65.7         1.016
                   524.9
    H0.2395  17.5
    

    Ratio of slopes table

    This processs can be automated, and the results organized in a ratio of slopes table which shows the slope ratio for all combinations of some set of powers of x and of y . Note how roughly equal values tend to run along diagonals.
    
       RATIOS OF SLOPES  :
          YE  [H2  ] [H1  ] [ H.5] [LOG ] [  .5] [RAW ] [ 2  ]
        X$
       [H2  ]   .778  2.213  3.575  5.590  8.459 12.394 24.385
       [H1  ]   .175   .499   .806  1.261  1.908  2.796  5.500
       [ H.5]   .083   .235   .379   .593   .898  1.315  2.588
       [LOG ]   .039   .110   .177   .277   .419   .614  1.208
       [  .5]   .018   .051   .082   .128   .194   .284   .559
       [RAW ]   .008   .023   .038   .059   .089   .130   .257
       [ 2  ]   .002   .005   .008   .012   .018   .027   .053
    

    The figure below shows the data and scatterplot when GNP is transformed to the reciprocal cube root.

             NEG. CUBE ROOT OF X IS CLOSEST TO BEING STRAIGHT
             GNP3 ~ GNP
             GNP3[;1]~GNP[;1] POWER H.333
    
             GNP3  TRANSFORMED X
       H0.2815   5
       H0.2602  47.5
       H0.2503  27.5               SCAT GNP3
       H0.2364  17.5         100                         ff  f f   -
       H0.2187  68                                          f      -
       H0.2123  10.5                                    f          -
       H0.1972  17.5         L                 f                   -
       H0.1911  77           I                     f               -
       H0.1801  22.5         T             f          f            -
       H0.1777  47.5         E                      f              -
       H0.1662  39.4         R                         f           -
       H0.1623  74           A        f          f    f            -
       H0.1546  57.5         C                     f               -
       H0.1451  65.7         Y                                     -
       H0.1406  47.5                   f                           -
       H0.136   50                       f    f  f                 -
       H0.1271  86.4                        f                      -
       H0.1207  98.5               f                               -
       H0.114   97.5           0------------------------------------
       H0.1022  96.4            H0.3         GNP*H1 3          H0.05
       H0.09161 98.5
       H0.08029 97.5
    
    The plot shows no signs of nonlinearity. However, some find "reciprocal cube root" of GNP hard to interpret, and might prefer a slightly curved relationship with log GNP .

    2.4 Plotting discrete data

    When the x or y values are highly discrete, or when there is a lot of data, it may be difficult to see patterns in scatterplots because of overplotting --multiple points occur at the same plot locations. Standard printer plots often use the characters A, B, C, etc. to represent 1, 2, 3, ... observations at the same plot location.

    The plot below shows data generated from a mixture of three bivariate normal distributions. It is hard to see that there are three regions of high density.

    
           PLOT OF Y*X    LEGEND: A = 1 OBS, B = 2 OBS, ETC.
    
     Y |                                                  AA  A
       |                                               A    BA
    50 +                                          A A   A        A
       |                                      B   A ABACD ABB CAB
       |                                       A   AACBCBAA BA
       |                                    AA B  BAAACACA B
       |                                 A AA B ABAABA A A  A
    40 +                                AA  A B   D
       |                            A        AA B
       |                                    A CA  AA
       |                       A   B    BBC BBA  AB A
       |                       A  B BAB E AAAA  AA   A
    30 +                    A A    ADAAC  A A   A A
       |                   A  A AB   A BA BAAAA
       |                      A   A AB A A A A A
       |                  A A AAA AA
       |                 A     A  A    AAA
    20 +                A AABAAA  A   A  A
       |             AA A     BACBB A
       |        A ABA ABB   DAA BAA B A
       |       AAA  AAB CACAAB  A
       |    B      AA  AF AAB B A
    10 +    AA      AB      B
       |          A      B
       |                B
       |
       |
     0 +
       --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+--
        -3   0   3   6   9  12  15  18  21  24  27  30  33  36  39
    
                                    X
    

    Sunflower plots

    One idea for such data is a sunflower plot . Here, the data are binned into (x , y ) cells, and the frequency of points in each cell is represented with a sunflower symbol. The sunflower symbol depicts the number of observations by the number of radial "petals" from a central point.

    Jittering

    Another idea is to add a small amount of random noise to each point to reduce overplotting. This technique, called jittering , is particularly useful when the data is discrete.

    To jitter, add a uniformly distributed random quantity, scaled so that it breaks up the overlap, but does not corrupt the data. Let u i = u [ -1, 1 ] be uniformly distributed from -1 to 1. Then, to jitter x , calculate

    xTi = x i + s u i
    where s is the scale factor, which might be some small fraction (e.g., .02) of the range of the data, or 1/2 the rounding interval if the data have been rounded. If the y variable is also discrete, the same process can be applied to jitter y --> yT.

    Then, in the scatter plot of y (or yT) vs. xT it should be easier to see the density of points in different regions.


    [Previous] Previous section [Next] Next section.
    © 1995 Michael Friendly
    Email<Friendly@yorku.ca>