- Bivariate Data: Graphical Display
The scatterplot is the basic tool for graphically
displaying bivariate quantitative data.
Example:
Some investors think that the performance of the stock market in
January is
a good predictor of its performance for the entire year.
To see if this is true, consider the following data on
Standard & Poor's 500 stock index (found in SASDATA.SANDP).
| Year | Percent January Gain | Percent 12 Month Gain |
|------|----------------------|-----------------------|
| 1985 | 7.4  | 26.3 |
| 1986 | 0.2  | 14.6 |
| 1987 | 13.2 | 2.0  |
| 1988 | 4.0  | 12.4 |
| 1989 | 7.1  | 27.3 |
| 1990 | -6.9 | -6.6 |
| 1991 | 4.2  | 26.3 |
| 1992 | -2.0 | 4.5  |
| 1993 | 0.7  | 7.1  |
| 1994 | 3.3  | -1.5 |
Figure 1 is a scatterplot of the
percent gain in the S&P index over the year (vertical axis)
versus the percent gain in January (horizontal axis).
Figure 1:
Percent gain in Standard and Poor's
index over the year (vertical axis) versus the percent gain in January
(horizontal axis).
![Figure 1 (lect7f1.ps): scatterplot of 12 month percent gain versus January percent gain](img1.gif)
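To make the plot concrete, here is a minimal sketch in Python/matplotlib (my choice of tool; the original figures were produced in SAS) using the data from the table above:

```python
import matplotlib.pyplot as plt

# S&P 500 data, 1985-1994, from the table above
jangain = [7.4, 0.2, 13.2, 4.0, 7.1, -6.9, 4.2, -2.0, 0.7, 3.3]
yeargain = [26.3, 14.6, 2.0, 12.4, 27.3, -6.6, 26.3, 4.5, 7.1, -1.5]

plt.scatter(jangain, yeargain)
plt.xlabel("Percent gain in January")     # predictor on the horizontal axis
plt.ylabel("Percent gain over the year")  # response on the vertical axis
plt.show()
```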
- How to analyze a scatterplot
The scatterplot of the S&P data can illustrate the general analysis
of scatterplots. You should look for:
- Association. Is there a pattern in the scatterplot?
- Type of Association. If there is association, is it linear or nonlinear?
- Direction of Association. Is the association positive or negative?
For the S&P data, there is association. It shows up as a generally positive relation (a larger % gain in January is generally associated with a larger % yearly gain). It is hard to tell whether the association is linear, since the spread of the data increases with larger January % gain. This is due primarily to the 1987 datum in the lower right corner of the plot, and to some extent the 1994 datum. Eliminate those two points, and the association is strongly linear and positive, as Figure 2 shows.
Figure 2:
Percent gain in Standard and Poor's
index over the year (vertical axis) versus the percent gain in January
(horizontal axis): 1987 and 1994 removed.
![Figure 2 (lect7f2.ps): scatterplot with 1987 and 1994 removed](img2.gif)
There is some justification for considering the 1987 datum
atypical. That was the year of the October stock market crash. The
1994 datum is a mystery to me.
- Data Smoothers
Data smoothers can help identify and simplify patterns in large sets
of bivariate data. You have already met one data smoother: the moving
average. Figure 3 shows how a median trace reveals a
downward trend, indicative of non-randomness, in the 1970
draft lottery data.
Figure 3:
Draft lottery data; median trace
![Figure 3 (lect7f9.ps): median trace for the draft lottery data](img3.gif)
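The notes don't show the smoothing computation, but the idea behind a median trace is simple: divide the horizontal axis into bins and plot the median response within each bin against the bin midpoint. A minimal Python sketch (the function name and bin count are my choices):

```python
import numpy as np

def median_trace(x, y, n_bins=10):
    """Median trace smoother: median of y within equal-width x bins."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    edges[-1] += 1e-9                      # include the right endpoint
    mids, meds = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x < hi)
        if in_bin.any():                   # skip empty bins
            mids.append((lo + hi) / 2)
            meds.append(np.median(y[in_bin]))
    return np.array(mids), np.array(meds)
```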
- Pearson Correlation
Suppose $n$ measurements, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, are taken on the variables X and Y. Then the Pearson correlation between X and Y computed from these data is

$$r = \frac{1}{n-1} \sum_{i=1}^{n} X_i^* Y_i^*,$$

where

$$X_i^* = \frac{X_i - \bar{X}}{S_X}, \qquad Y_i^* = \frac{Y_i - \bar{Y}}{S_Y}$$

are the standardized data.
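As a quick check of the formula (in Python rather than the notes' SAS), applying it to the S&P data from the table reproduces the value r = 0.4295 quoted later:

```python
import math

def pearson_r(x, y):
    """Average product of the standardized data, as in the formula above."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

jangain = [7.4, 0.2, 13.2, 4.0, 7.1, -6.9, 4.2, -2.0, 0.7, 3.3]
yeargain = [26.3, 14.6, 2.0, 12.4, 27.3, -6.6, 26.3, 4.5, 7.1, -1.5]
print(round(pearson_r(jangain, yeargain), 4))  # 0.4295
```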
The scatterplots of standardized variables in Figure 4
illustrate what Pearson correlation measures.
Figure 4:
Six plots of standardized X and Y
data
![Figure 4 (lect7f10.ps): six plots of standardized X and Y data](img7.gif)
- Good Things to Know About Pearson Correlation
- Pearson correlation is always between -1 and
1. Values near 1 signify strong positive linear
association. Values near -1 signify strong negative linear
association. Values near 0 signify weak linear association.
- Correlation between X and Y is the same as the
correlation between Y and X.
- Correlation can never by itself adequately summarize a set of bivariate data. Only when used in conjunction with $\bar{X}$, $\bar{Y}$, $S_X$, and $S_Y$ and a scatterplot can an adequate summary be obtained.
- The meaningfulness of a correlation can only
be judged with respect to the sample size.
- A Confidence Interval for the Population Correlation, $\rho$
If n is the sample size,

$$\frac{(r - \rho)\sqrt{n-2}}{\sqrt{1 - r^2}}$$

has approximately a $t_{n-2}$ distribution. We can use this fact to obtain a confidence interval for $\rho$.
- Example:
Back to the S&P data: the SAS macro CORR gives a 95% confidence interval for $\rho$ of (-0.2775, 0.8345). As this interval contains 0, it indicates no significant linear association between JANGAIN and YEARGAIN.
If we remove the 1987 and 1994 data, a different story emerges. Then the Pearson correlation is r = 0.9360, and a 95% confidence interval for $\rho$ is (0.6780, 0.9880). Since this interval consists entirely of positive numbers, we conclude that $\rho$ is positive, and we estimate its value to be between 0.6780 and 0.9880.
QUESTION: What is $\rho$ here? Does this make sense?
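The internals of the CORR macro aren't shown here. For illustration, here is a common alternative approach, Fisher's z transformation (my assumption, not necessarily what the macro uses); for the full data set it gives limits close to the quoted interval:

```python
import math

def corr_ci(r, n, z_crit=1.96):
    # Fisher z transform: atanh(r) is approximately normal with
    # standard error 1 / sqrt(n - 3); transform back with tanh.
    z = math.atanh(r)
    half = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)

print(corr_ci(0.4295, 10))  # roughly (-0.27, 0.83)
```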
- Simple Linear Regression
The SLR model attempts to quantify the relationship between a single predictor variable Z and a response variable Y. This reasonably flexible yet simple model has the form

$$Y = \beta_0 + \beta_1 X(Z) + \epsilon,$$

where $\epsilon$ is a random error term, and X(Z) is a function of Z, such as $Z^2$ or $\sqrt{Z}$. By looking at different functions X, we are not confined to linear relationships, but can also model nonlinear ones. The function X is called the regressor.
Often, we omit specifying the dependence of the regressor X on the predictor Z, and just write the model as

$$Y = \beta_0 + \beta_1 X + \epsilon.$$
We want to fit the model to a set of data $(X_1, Y_1), \ldots, (X_n, Y_n)$. As with the C+E model, two options are least absolute errors, which finds values $b_0$ and $b_1$ to minimize

$$\sum_{i=1}^{n} |Y_i - (b_0 + b_1 X_i)|,$$

or least squares, which finds values $b_0$ and $b_1$ to minimize

$$\sum_{i=1}^{n} (Y_i - (b_0 + b_1 X_i))^2.$$

We'll concentrate on least squares. Using calculus, we find the least squares estimators of $\beta_0$ and $\beta_1$ to be

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

and

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}.$$
- Example:
For the S&P data, we would like to fit a model that can predict
YEARGAIN (the response) as a function of JANGAIN (the
predictor). Since a scatterplot reveals no obvious nonlinearity, we
will take the regressor to equal the predictor.
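Before looking at the SAS output, here is a minimal Python sketch of the least squares formulas above (my illustration, not the notes' method). Run on the data with 1987 and 1994 removed, it reproduces the coefficients quoted in the later example:

```python
def least_squares(x, y):
    """Least squares intercept and slope from the formulas above."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# S&P data with 1987 and 1994 removed
jan_red = [7.4, 0.2, 4.0, 7.1, -6.9, 4.2, -2.0, 0.7]
year_red = [26.3, 14.6, 12.4, 27.3, -6.6, 26.3, 4.5, 7.1]
b0, b1 = least_squares(jan_red, year_red)
print(round(b0, 4), round(b1, 4))  # 9.6462 2.3626
```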
Figures 5 and 6 show the relevant SAS/INSIGHT output for the regression of YEARGAIN on JANGAIN, using all the data and with the years 1987 and 1994 removed, respectively.
Figure 5:
SAS/INSIGHT output of regression
of YEARGAIN on JANGAIN: all data
![Figure 5 (lect7f3.ps): SAS/INSIGHT regression output, all data](img23.gif)
Figure 6:
SAS/INSIGHT output of regression
of YEARGAIN on JANGAIN: 1987 and 1994 removed
![Figure 6 (lect7f4.ps): SAS/INSIGHT regression output, 1987 and 1994 removed](img24.gif)
- Residuals, Predicted and Fitted Values
- The predicted value of Y at X is

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X.$$

- For $X = X_i$, one of the values in the data set, the predicted value is called a fitted value and is written

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i.$$

- The residuals, $e_i$, are the differences between the observed and fitted values for each data value:

$$e_i = Y_i - \hat{Y}_i.$$
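Continuing the sketch above (reusing its hypothetical least_squares helper and the reduced S&P data), fitted values and residuals fall out directly:

```python
b0, b1 = least_squares(jan_red, year_red)
fitted = [b0 + b1 * xi for xi in jan_red]                  # Y-hat_i at each X_i
residuals = [yi - fi for yi, fi in zip(year_red, fitted)]  # e_i = Y_i - Y-hat_i
```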
- Tools to Assess the Quality of the Fit
- Residuals. Residuals should exhibit no patterns when plotted versus the $X_i$, the fitted values $\hat{Y}_i$, or other variables, such as time order. Studentized residuals should be plotted on a normal quantile plot.
- Coefficient of Determination. The coefficient of determination, $r^2$, is a measure of (take your pick):
- How much of the variation in the response is ``explained'' by the predictor.
- How much of the variation in the response is reduced by knowing the predictor.
The notation $r^2$ comes from the fact that the coefficient of determination is the square of the Pearson correlation.
Check out the quality of the two fits for the S&P data in Figures 5 and 6.
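For example, the reduced fit has r = 0.9360, so $r^2 = 0.9360^2 \approx 0.876$: roughly 88% of the variation in YEARGAIN is explained by JANGAIN. For the full data set, $r = 0.4295$ gives $r^2 \approx 0.18$.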
- Model Interpretation
- The Fitted Slope. The fitted slope may be
interpreted in a couple of ways:
- As the estimated change in the mean response per unit increase in the regressor. This is another way of saying it is the derivative of the fitted response with respect to the regressor:

$$\frac{d\hat{Y}}{dX} = \hat{\beta}_1.$$

- In terms of the estimated change in the mean response per unit increase in the predictor. In this formulation, if the regressor X is a differentiable function of the predictor Z,

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X(Z),$$

so

$$\frac{d\hat{Y}}{dZ} = \hat{\beta}_1 \frac{dX}{dZ}.$$
- The Fitted Intercept. The fitted
intercept is the estimate of the response when the predictor equals 0,
provided this makes sense.
- The Mean Square Error. The mean square error, or MSE, is an estimator of the variance of the error terms $\epsilon$ in the simple linear regression model. Its formula is

$$MSE = \frac{\sum_{i=1}^{n} e_i^2}{n-2}.$$

It measures the ``average prediction error'' when using the regression.
- Example:
Consider the S&P data with the 1987 and 1994 observations omitted. The fitted model is

$$\widehat{YEARGAIN} = 9.6462 + 2.3626 \cdot JANGAIN.$$
- The Fitted Slope. The fitted slope, 2.3626, is interpreted as the estimated change in YEARGAIN per unit increase in JANGAIN.
- The Fitted Intercept. The fitted
intercept, 9.6462, is the estimated YEARGAIN if JANGAIN equals 0.
- The Mean Square Error. The MSE, 21.59,
estimates the variance of the random errors.
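As an arithmetic check (my computation, not part of the original notes): the eight residuals from this fit have sum of squares of about 129.5, and $129.5/(8-2) \approx 21.6$, agreeing with the reported MSE.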
- Classical Inference for the SLR Model
- The Relation Between Correlation and Regression
If the standardized responses and predictors are

$$Y^* = \frac{Y - \bar{Y}}{S_Y}$$

and

$$X^* = \frac{X - \bar{X}}{S_X},$$

then the regression equation fitted by least squares can be written as

$$\hat{Y}^* = r X^*,$$

where $X^*$ is any value of a predictor variable standardized as described above.
The Regression Effect refers to the phenomenon of the standardized predicted value being closer to 0 than the standardized predictor. Equivalently, the predicted value lies fewer Y standard deviations from the response mean than the predictor value lies X standard deviations from the predictor mean.
For the S&P data r = 0.4295, so for a January gain $z$ standard deviations ($S_X$) from $\bar{X}$, the regression equation estimates a gain for the year of

$$0.4295\,z$$

standard deviations ($S_Y$) from $\bar{Y}$.
With 1987 and 1994 removed, the estimate is

$$0.9360\,z,$$

which reflects the stronger relation.
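A quick numerical illustration of this fact: standardize both variables, and the least squares slope is exactly r (a Python sketch; the names are mine):

```python
import math

def standardize(v):
    """Convert values to standard units: (value - mean) / SD."""
    n = len(v)
    vbar = sum(v) / n
    s = math.sqrt(sum((vi - vbar) ** 2 for vi in v) / (n - 1))
    return [(vi - vbar) / s for vi in v]

jan_std = standardize([7.4, 0.2, 13.2, 4.0, 7.1, -6.9, 4.2, -2.0, 0.7, 3.3])
year_std = standardize([26.3, 14.6, 2.0, 12.4, 27.3, -6.6, 26.3, 4.5, 7.1, -1.5])

# Least squares slope on standardized data (the means are 0, so the
# intercept vanishes and the slope reduces to r itself)
slope = (sum(x * y for x, y in zip(jan_std, year_std))
         / sum(x ** 2 for x in jan_std))
print(round(slope, 4))  # 0.4295 = r for the full S&P data
```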
- The Relationship Between Two Categorical Variables
Analysis of categorical data is based on counts, proportions
or percentages of data that fall into the various categories defined
by the variables.
Some tools used to analyze bivariate categorical data are:
- Mosaic Plots.
- Two-Way Tables.
- Example:
A survey on academic dishonesty was conducted among WPI students in
1993 and again in 1996. One question asked students to respond to the
statement ``Under some circumstances academic dishonesty is
justified.'' Possible responses were ``Strongly agree'', ``Agree'',
``Disagree'' and ``Strongly disagree''.
Table 1 contains the information for the 1993 data.
Table 1:
Table relating gender and
response to ``Under some circumstances academic dishonesty is
justified,'' 1993 survey.
1993 Survey: Dishonesty Sometimes Justified (cell entries: Frequency, Percent, Row Pct., Col. Pct.)

| Gender | Statistic | Agree | Disagree | Strongly Disagree | Total |
|--------|-----------|-------|----------|-------------------|-------|
| Female | Frequency | 16    | 18       | 13                | 47    |
|        | Percent   | 8.94  | 10.06    | 7.26              | 26.26 |
|        | Row Pct.  | 34.04 | 38.30    | 27.66             |       |
|        | Col. Pct. | 34.04 | 20.22    | 30.23             |       |
| Male   | Frequency | 31    | 71       | 30                | 132   |
|        | Percent   | 17.32 | 39.66    | 16.76             | 73.74 |
|        | Row Pct.  | 23.48 | 53.79    | 22.73             |       |
|        | Col. Pct. | 65.96 | 79.78    | 69.77             |       |
| Total  | Frequency | 47    | 89       | 43                | 179   |
|        | Percent   | 26.26 | 49.72    | 24.02             | 100.00 |
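As a sketch of how the table's percentages can be reproduced outside SAS (pandas is my choice here, not the tool used in the notes):

```python
import pandas as pd

counts = pd.DataFrame(
    {"Agree": [16, 31], "Disagree": [18, 71], "Strongly Disagree": [13, 30]},
    index=["Female", "Male"],
)
total = counts.values.sum()                              # 179 respondents
percent = 100 * counts / total                           # cell percents
row_pct = 100 * counts.div(counts.sum(axis=1), axis=0)   # within each gender
col_pct = 100 * counts.div(counts.sum(axis=0), axis=1)   # within each response
print(row_pct.round(2))  # e.g. Female/Agree = 34.04, matching Table 1
```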
Figures 9
and 10 show two mosaic plots relating the responses to
gender for the 1993 data.
Figure 9:
Mosaic plot relating gender and
response to ``Under some circumstances academic dishonesty is
justified,'' 1993 survey.
![Figure 9 (lect7f7.ps): mosaic plot of response by gender](img64.gif)
Figure 10:
Another mosaic plot relating gender and
response to ``Under some circumstances academic dishonesty is
justified,'' 1993 survey.
![Figure 10 (lect7f8.ps): another mosaic plot of response by gender](img65.gif)
- Association is NOT Cause and Effect
Two variables may be associated due to a number of reasons, such as:
- 1. X could cause Y.
- 2. Y could cause X.
- 3. X and Y could cause each other.
- 4. X and Y could be caused by a third
(lurking) variable Z.
- 5. X and Y could be related by chance.
- 6. Bad (or good) luck.
- The Issue of Stationarity
- When assessing the stationarity of a process in
terms of bivariate measurements X and Y, always consider
the evolution of the relationship between X and Y, as well
as the individual distribution of the X and Y values, over
time or order.
- Suppose we have a model relating a measurement from a process to time or order. If, as more data are taken, the pattern relating the measurement to time or order remains the same, we say that the process is stationary relative to the model.