The scatterplot is the basic tool for graphically displaying bivariate quantitative data.
Some investors think that the performance of the stock market in January is a good predictor of its performance for the entire year. To see if this is true, consider the following data on Standard & Poor's 500 stock index (found in SASDATA.SANDP).
The scatterplot of the S&P data can illustrate the general analysis of scatterplots. You should look for: whether there is association; its direction (positive or negative); its form (linear or nonlinear); its strength; and any unusual observations, such as outliers.
For the S&P data, there is association. This shows up as a general
positive relation (larger % gain in January is generally associated
with larger % yearly gain). It is hard to tell if the association is
linear, since the spread of the data increases with larger January
% gain. This is due primarily to the 1987 datum in the lower right
corner of the plot, and to some extent the 1994 datum. Eliminate those two
points, and the association is strongly linear and positive, as
Figure 2 shows.
There is some justification for considering the 1987 datum atypical. That was the year of the October stock market crash. The 1994 datum is a mystery to me.
Data smoothers can help identify and simplify patterns in large sets
of bivariate data. You have already met one data smoother: the moving
average. Figure 3 shows how a median trace reveals a
downward trend, indicative of non-randomness, in the 1970
draft lottery data.
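The two smoothers mentioned above can be sketched in a few lines. This is a minimal illustration, not the computation used for the figures: the data below are made up, and the windowing scheme (equal-width windows of x for the median trace) is one common choice among several.

```python
# A sketch of two simple data smoothers: a moving average and a median
# trace (the median of y within successive windows of x). The data
# below are made up for illustration; they are not the draft-lottery
# or S&P values.

def moving_average(y, window=3):
    """Mean of each run of `window` consecutive values."""
    return [sum(y[i:i + window]) / window for i in range(len(y) - window + 1)]

def _median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_trace(x, y, n_windows=4):
    """Median of y within n_windows equal-width windows spanning x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_windows
    trace = []
    for k in range(n_windows):
        left = lo + k * width
        right = left + width
        in_win = [yi for xi, yi in zip(x, y)
                  if left <= xi < right or (k == n_windows - 1 and xi == hi)]
        trace.append(_median(in_win))
    return trace

# Made-up series with an obvious jump in level; both smoothers track it.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 3, 4, 10, 11, 12, 13]
smooth = moving_average(y)                 # 6 smoothed values
trace = median_trace(x, y, n_windows=2)    # [2.5, 11.5]
```

The median trace is more resistant than the moving average: a single wild value shifts a window's median little, which is why it works well on noisy data like the lottery numbers.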
Suppose $n$ measurements $(x_1,y_1),\ldots,(x_n,y_n)$ are taken on the variables $X$ and $Y$. Then the Pearson correlation between $X$ and $Y$ computed from these data is
\[
r=\frac{1}{n-1}\sum_{i=1}^{n}x_i^{*}y_i^{*},
\]
where
\[
x_i^{*}=\frac{x_i-\bar{x}}{s_X},\qquad y_i^{*}=\frac{y_i-\bar{y}}{s_Y}
\]
are the standardized data.
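As a quick check of the definition, the computation can be sketched directly, standardizing each variable and averaging the products. The data in the examples are made up, not the S&P values.

```python
# Pearson correlation computed as the average product of the
# standardized data. For a perfectly linear made-up series the result
# must be exactly 1 (or -1), which makes a convenient sanity check.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    x_std = [(xi - xbar) / s_x for xi in x]   # standardized x
    y_std = [(yi - ybar) / s_y for yi in y]   # standardized y
    return sum(a * b for a, b in zip(x_std, y_std)) / (n - 1)
```

For example, `pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` is 1 up to rounding, and reversing the second list gives -1.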
The scatterplots of standardized variables in Figure 4
illustrate what Pearson correlation measures.
If $n$ is the sample size,
\[
t=\frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
\]
has approximately a $t_{n-2}$ distribution. We can use this fact to obtain a confidence interval for $\rho$.
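A sketch of the computation follows. The $t$ statistic is the one above; for the confidence interval, the sketch uses Fisher's $z$ transformation with a normal quantile, which is a standard recipe, but it is an assumption that the CORR macro computes its interval this way.

```python
# t statistic for testing rho = 0, plus an approximate confidence
# interval for rho via Fisher's z transformation. The Fisher method is
# a standard choice; whether the CORR macro uses exactly this recipe
# is an assumption.
from math import sqrt, atanh, tanh

def t_stat(r, n):
    """t = r*sqrt(n-2)/sqrt(1-r^2), approximately t_{n-2} when rho = 0."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def rho_ci(r, n, z=1.96):
    """Approximate 95% CI for rho (z = 1.96 is the normal quantile)."""
    zr = atanh(r)              # Fisher transform of r
    half = z / sqrt(n - 3)     # approximate SE of the transform
    return tanh(zr - half), tanh(zr + half)
```

For example, with $r=0.5$ and $n=20$, `t_stat(0.5, 20)` is about 2.45, and `rho_ci(0.5, 20)` is an interval straddling 0.5.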
Back to the S&P data: the SAS macro CORR gives a 95% confidence
interval for $\rho$ of
(-0.2775, 0.8345). Since this interval contains 0, it indicates no significant linear association between JANGAIN and YEARGAIN.
If we remove the 1987 and 1994 data, a different story emerges. Then the Pearson correlation is $r=0.9360$, and a 95% confidence interval for $\rho$ is (0.6780, 0.9880). Since this interval consists entirely of positive numbers, we conclude that $\rho$ is positive and we estimate its value to lie between 0.6780 and 0.9880.
QUESTION: What is $\rho$ here? Does this make sense?
The SLR model attempts to quantify the relationship between a single predictor variable $Z$ and a response variable $Y$. This reasonably flexible yet simple model has the form
\[
Y=\beta_0+\beta_1 X(Z)+\epsilon,
\]
where $\epsilon$ is a random error term, and $X(Z)$ is a function of $Z$, such as $Z^2$. By looking at different functions $X$, we are not confined to linear relationships, but can also model nonlinear ones. The function $X$ is called the regressor. Often, we omit specifying the dependence of the regressor $X$ on the predictor $Z$, and just write the model as
\[
Y=\beta_0+\beta_1 X+\epsilon.
\]
We want to fit the model to a set of data $(x_1,y_1),\ldots,(x_n,y_n)$. As with the C+E model, two options are least absolute errors, which finds values $b_0$ and $b_1$ to minimize
\[
\sum_{i=1}^{n}\lvert y_i-(b_0+b_1x_i)\rvert,
\]
or least squares, which finds values $b_0$ and $b_1$ to minimize
\[
\sum_{i=1}^{n}\bigl(y_i-(b_0+b_1x_i)\bigr)^{2}.
\]
We'll concentrate on least squares. Using calculus, we find the least squares estimators of $\beta_0$ and $\beta_1$ to be
\[
\hat{\beta}_1=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}},\qquad
\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}.
\]
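The closed-form least squares estimators are easy to compute directly. This is a minimal sketch on made-up data lying exactly on a line, so the fit must recover the line's coefficients.

```python
# Least squares estimates of beta0 and beta1 from the closed-form
# estimators: b1 = Sxy/Sxx, b0 = ybar - b1*xbar. Checked on made-up
# data lying exactly on y = 1 + 2x.
def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope estimate
    b0 = ybar - b1 * xbar     # intercept estimate
    return b0, b1

# Points on y = 1 + 2x, so the fit should return b0 = 1, b1 = 2.
b0, b1 = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```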
The relevant SAS/INSIGHT output for the regression of YEARGAIN on JANGAIN looks like this:
The root mean squared error measures the ``average prediction error'' when using the regression.
Level $L$ confidence intervals for $\beta_0$ and $\beta_1$ are
\[
\hat{\beta}_0\pm t_{n-2,(1+L)/2}\,\hat{\sigma}_{\hat{\beta}_0}
\qquad\text{and}\qquad
\hat{\beta}_1\pm t_{n-2,(1+L)/2}\,\hat{\sigma}_{\hat{\beta}_1}.
\]
A similar computation gives a 95% confidence interval for $\beta_1$.
The mean response at $X=x_0$ is
\[
\mu(x_0)=\beta_0+\beta_1x_0.
\]
The point estimator of $\mu(x_0)$ is
\[
\hat{Y}(x_0)=\hat{\beta}_0+\hat{\beta}_1x_0.
\]
A level $L$ confidence interval for $\mu(x_0)$ is
\[
\hat{Y}(x_0)\pm t_{n-2,(1+L)/2}\,\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}}.
\]
A level $L$ prediction interval for a future observation at $X=x_0$ is
\[
\hat{Y}(x_0)\pm t_{n-2,(1+L)/2}\,\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}}.
\]
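Both intervals can be sketched from scratch. The caller supplies the $t$ quantile $t_{n-2,(1+L)/2}$, since this sketch avoids external libraries; the data and the quantile value in the example are made up for illustration, not the S&P numbers.

```python
# Confidence interval for the mean response and prediction interval
# for a new observation at x0. The only difference is the extra "1 +"
# under the square root for the prediction interval, which is why the
# prediction band is always the wider of the two.
from math import sqrt

def intervals_at(x0, x, y, t_crit):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    fit = b0 + b1 * x0                       # point estimate of the mean response
    mse = sum((yi - (b0 + b1 * xi)) ** 2
              for xi, yi in zip(x, y)) / (n - 2)
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    se_mean = sqrt(mse * leverage)           # SE for the mean response
    se_pred = sqrt(mse * (1 + leverage))     # SE for a new observation
    ci = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return fit, ci, pi

# Made-up noisy data; t_crit = 3.182 is roughly t_{3, 0.975}.
fit, ci, pi = intervals_at(3.0, [1.0, 2.0, 3.0, 4.0, 5.0],
                           [2.1, 3.9, 6.2, 7.8, 10.1], 3.182)
```

The prediction interval always contains the confidence interval, because a new observation carries the error of the fitted line plus its own random error.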
Figure 7 shows the regression fit (straight line), the location of the level 0.95 confidence intervals for the mean response at all values of JANGAIN (the inner pair of curves; these curves are also called a confidence band) and the location of the level 0.95 prediction interval for a new observation at all values of JANGAIN (the outer pair of curves; these curves are also called a prediction band). Figure 8 shows the same for the data with years 1987 and 1994 removed.
The macro REGPRED will compute confidence intervals for a mean response and prediction intervals for future observations for each data value and for other user-chosen X values.
The SAS macro REGPRED was run on the reduced S&P data, and estimation of the mean response and prediction of a new observation at the value JANGAIN=5 were requested. Both the estimated mean response and the predicted new observation equal $9.65+(2.36)(5)=21.46$ (the coefficients shown are rounded). The macro computes a 95% confidence interval for the mean response at JANGAIN=5 as (16.56, 26.36), and a 95% prediction interval for a new observation at JANGAIN=5 as (9.08, 33.84).
If the standardized responses and predictors are
\[
Y^{*}=\frac{Y-\bar{y}}{s_Y},\qquad X^{*}=\frac{X-\bar{x}}{s_X},
\]
then the regression equation fitted by least squares can be written as
\[
\hat{Y}^{*}=rX^{*},
\]
where $X^{*}$ is any value of a predictor variable standardized as described above.
The Regression Effect refers to the phenomenon of the standardized predicted value being closer to 0 than the standardized predictor. Equivalently, the unstandardized predicted value lies fewer $Y$ standard deviations from the response mean than the predictor value lies $X$ standard deviations from the predictor mean.
For the S&P data $r=0.4295$, so for a January gain $c$ standard deviations ($s_X$) from $\bar{x}$, the regression equation estimates a gain for the year of
\[
0.4295\,c
\]
standard deviations ($s_Y$) from $\bar{y}$. With 1987 and 1994 removed, the estimate is
\[
0.9360\,c
\]
standard deviations from $\bar{y}$, which reflects the stronger relation.
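The regression effect is just arithmetic in standardized units, and can be verified in a couple of lines. The value of $c$ below is made up; the two correlations are the ones quoted in the text.

```python
# The regression effect in standardized units: the predicted response
# lies r times as many standard deviations from its mean as the
# predictor does, so it is pulled toward the mean whenever |r| < 1.
c = 2.0                       # predictor, in SDs from its mean (made up)
pred_full = 0.4295 * c        # all years
pred_reduced = 0.9360 * c     # 1987 and 1994 removed
```

With the stronger correlation the prediction is pulled toward the mean less; only $r=\pm 1$ would leave it the full $c$ standard deviations out.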
Analysis of categorical data is based on counts, proportions or percentages of data that fall into the various categories defined by the variables.
Some tools used to analyze bivariate categorical data are two-way (contingency) tables of the counts, proportions, or percentages, and graphical displays of those tables.
A survey on academic dishonesty was conducted among WPI students in 1993 and again in 1996. One question asked students to respond to the statement ``Under some circumstances academic dishonesty is justified.'' Possible responses were ``Strongly agree'', ``Agree'', ``Disagree'' and ``Strongly disagree''. Table 1 contains the information for the 1993 data.
Two variables may be associated for a number of reasons, such as: