• Multivariable Visualization

Tools:

• Scatterplot Array
• Rotating 3-D Plots
Let's try these out. Each of the data sets sasdata.eg10_2a, sasdata.eg10_2b, sasdata.eg10_2c and sasdata.eg10_2d contains data generated by one of four models shown on the next page. Using only the display of the data set itself and a scatterplot array, you are to tell which data set was generated by which model.

The models are:

Now use the rotating 3-D plot to view the data. Does this change your guesses?

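• The MLR Model

$$Y = \beta_0 + \beta_1 X_1(Z_1,\ldots,Z_p) + \cdots + \beta_q X_q(Z_1,\ldots,Z_p) + \epsilon,$$

where the Zs are the predictor variables and $\epsilon$ is a random error. Examples are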

We will write these models generically as

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_q X_q + \epsilon.$$

• Fitting the MLR Model

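As we did for the SLR model, we use least squares to fit the MLR model. This means finding estimators of the model parameters $\beta_0, \beta_1, \ldots, \beta_q$ and $\sigma^2$. The LSEs of the $\beta$s are those values of $b_0, b_1, \ldots, b_q$, denoted $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_q$, which minimize

$$SSE(b_0, b_1, \ldots, b_q) = \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_{i1} + \cdots + b_q X_{iq}) \right]^2.$$

The fitted values are

$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_q X_{iq}$$

and the residuals are

$$e_i = Y_i - \hat{Y}_i.$$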

Let's see what happens when we fit models to sasdata.eg10_2a and sasdata.eg10_2c.
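Since the sasdata sets are not reproduced in these notes, here is a minimal sketch of a least squares fit on made-up data. It solves the normal equations $(X'X)b = X'Y$ with plain Gaussian elimination, using the standard library only. The function names `solve_linear_system` and `fit_mlr` and the data are illustrative assumptions, not part of the course software.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least squares fit of y on the regressors in rows (one tuple per observation)."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(design[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return beta, fitted, residuals


# Made-up data generated exactly from Y = 1 + 2*X1 - 3*X2 (no random error),
# so the fit should recover these coefficients and leave residuals near zero.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
beta, fitted, residuals = fit_mlr(rows, y)
print([round(b, 6) for b in beta])
```

In practice a statistics package or a linear algebra library would do this step; the sketch only shows what "minimizing the sum of squared residuals" amounts to computationally.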

• Assessing Model Fit

Residuals and studentized residuals are the primary tools to analyze model fit. We look for outliers and other deviations from model assumptions. Let's look at the residuals from some fits to sasdata.eg10_2c.

• Interpretation of the Fitted Model

The intercept $\hat\beta_0$ has the interpretation "expected response when the $X_i$ all equal 0". The coefficient $\hat\beta_i$ is interpreted as the change in expected response per unit change in $X_i$ when the other $X$s are held fixed (if that is possible).

Otherwise, we can interpret the model using multivariate calculus: the change in expected response per unit change in $Z_i$ (with the other predictors held fixed) is

$$\frac{\partial \hat{Y}}{\partial Z_i}.$$

So, for example, if the fitted model is

• Theory-Based Modeling

Two ways of building models:

• Empirical modeling
• Theoretical modeling
• Comparison of Fitted Models

• Residual analysis
• Principle of parsimony (simplicity of description)
• Coefficient of multiple determination, $R^2$, and its adjusted cousin, $R_a^2$.
• ANOVA

Idea:

• Total variation in the response (about its mean) is measured by

$$SSTO = \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

This is the variation or uncertainty of prediction if no predictor variables are used.
• SSTO can be broken down into two pieces: SSR, the regression sum of squares, and SSE, the error sum of squares, so that SSTO=SSR+SSE.
• $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is the total sum of the squared residuals. It measures the variation of the response unaccounted for by the fitted model or the uncertainty of predicting the response using the fitted model.
• $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ is the variability explained by the fitted model or the reduction in uncertainty of prediction due to using the fitted model.
• Degrees of Freedom

The degrees of freedom for a SS is the number of independent pieces of data making up the SS. For SSTO, SSE and SSR the degrees of freedom are n-1, n-q-1 and q, respectively. These add just as the SSs do. A SS divided by its degrees of freedom is called a Mean Square.

• The ANOVA Table

This is a table which summarizes the SSs, degrees of freedom and mean squares.

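Analysis of Variance

| Source  | DF    | SS   | MS  | F Stat    | Prob > F |
|---------|-------|------|-----|-----------|----------|
| Model   | q     | SSR  | MSR | F=MSR/MSE | p-value  |
| Error   | n-q-1 | SSE  | MSE |           |          |
| C Total | n-1   | SSTO |     |           |          |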
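The entries of the ANOVA table can be computed directly from the responses and the fitted values. The following is a small sketch on made-up data with q = 2 regressors. The function name `anova_table` and the data are illustrative assumptions. The adjusted coefficient of determination is computed with the usual definition $R_a^2 = 1 - MSE/(SSTO/(n-1))$.

```python
def anova_table(y, fitted, q):
    """Sums of squares, degrees of freedom, mean squares, F, R^2 and adjusted R^2.

    Assumes `fitted` comes from a least squares fit with an intercept and
    q regressors; only then does SSTO = SSR + SSE hold.
    """
    n = len(y)
    y_bar = sum(y) / n
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    msr = ssr / q
    mse = sse / (n - q - 1)
    return {
        "SSR": ssr, "SSE": sse, "SSTO": ssto,
        "df": (q, n - q - 1, n - 1),  # model, error, corrected total
        "MSR": msr, "MSE": mse, "F": msr / mse,
        "R2": ssr / ssto,
        "R2_adj": 1 - mse / (ssto / (n - 1)),
    }


# Made-up data. The two regressors are centered and orthogonal, so the least
# squares estimates can be checked by hand: intercept = mean(y) = 4,
# slope1 = sum(x1*y)/sum(x1^2) = 2.5, slope2 = sum(x2*y)/sum(x2^2) = 1.
x1 = [-1, 1, -1, 1, 0, 0]
x2 = [-1, -1, 1, 1, 0, 0]
y = [1, 5, 2, 8, 4, 4]
fitted = [4 + 2.5 * a + 1.0 * b for a, b in zip(x1, x2)]

table = anova_table(y, fitted, q=2)
for name, value in table.items():
    print(name, value)
```

For these data SSTO = 30 splits into SSR = 29 and SSE = 1, with degrees of freedom 2, 3 and 5.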

• Inference for the MLR Model: The F Test
• The Hypotheses:
 $H_0: \beta_1 = \beta_2 = \cdots = \beta_q = 0$
 $H_a$: Not $H_0$
• The Test Statistic: F=MSR/MSE
• The P-Value: $P(F_{q,n-q-1} > F^*)$, where $F_{q,n-q-1}$ is a random variable from an $F_{q,n-q-1}$ distribution and $F^*$ is the observed value of the test statistic.
• T Tests for Individual Predictors
• The Hypotheses:
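 $H_0: \beta_i = 0$
 $H_a: \beta_i \neq 0$
• The Test Statistic: $t^* = \hat\beta_i / \hat\sigma(\hat\beta_i)$, where $\hat\sigma(\hat\beta_i)$ is the estimated standard error of $\hat\beta_i$.
• The P-Value: $P(|t_{n-q-1}| > |t^*|)$, where $t_{n-q-1}$ is a random variable from a $t_{n-q-1}$ distribution and $t^*$ is the observed value of the test statistic.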
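As a rough sketch of the t test for a single coefficient, the code below forms the ratio of an estimate to its standard error and computes the two-sided p-value by numerically integrating the t density with Simpson's rule. The standard library has no t distribution, so the integration is a stand-in for a statistics package. The function names and the example numbers are made up for illustration.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_star, df, steps=2000):
    """P(|t_df| > |t*|), via Simpson's rule on the density over [0, |t*|].

    `steps` must be even. math.gamma overflows for very large df, so this
    sketch is only meant for small to moderate degrees of freedom.
    """
    b = abs(t_star)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central = 2 * total * h / 3  # P(-|t*| < t < |t*|), using symmetry
    return 1 - central


def t_test(estimate, std_error, n, q):
    """Test statistic and two-sided p-value for H0: beta_i = 0."""
    t_star = estimate / std_error
    return t_star, two_sided_p_value(t_star, n - q - 1)


# Hypothetical numbers: estimate 2.5 with standard error 0.5,
# from a fit with n = 13 observations and q = 2 regressors.
t_star, p_value = t_test(2.5, 0.5, n=13, q=2)
print(t_star, p_value)
```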
• Summary of Intervals for MLR Model
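• Confidence Interval for Model Coefficients: A level L confidence interval for $\beta_i$ is

$$\hat\beta_i \pm \hat\sigma(\hat\beta_i)\, t_{n-q-1,(1+L)/2},$$

where $t_{n-q-1,(1+L)/2}$ is the $(1+L)/2$ quantile of the $t_{n-q-1}$ distribution.

• Confidence Interval for Mean Response: A level L confidence interval for the mean response at predictor values $X_{10}, X_{20}, \ldots, X_{q0}$ is

$$\hat{Y}_0 \pm \hat\sigma(\hat{Y}_0)\, t_{n-q-1,(1+L)/2},$$

where

$$\hat{Y}_0 = \hat\beta_0 + \hat\beta_1 X_{10} + \cdots + \hat\beta_q X_{q0}$$

and $\hat\sigma(\hat{Y}_0)$ is the estimated standard error of the response.
• Prediction Interval for a Future Observation:

A level L prediction interval for a new response at predictor values $X_{10}, X_{20}, \ldots, X_{q0}$ is

$$\hat{Y}_{new} \pm \hat\sigma(Y_{new} - \hat{Y}_{new})\, t_{n-q-1,(1+L)/2},$$

where

$$\hat{Y}_{new} = \hat\beta_0 + \hat\beta_1 X_{10} + \cdots + \hat\beta_q X_{q0}$$

and

$$\hat\sigma(Y_{new} - \hat{Y}_{new}) = \sqrt{MSE + \hat\sigma^2(\hat{Y}_{new})}.$$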
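To make the difference between the two kinds of interval concrete, here is a short sketch. It assumes the usual form "estimate plus or minus standard error times a t quantile", with the prediction standard error taken as $\sqrt{MSE + \hat\sigma^2(\hat{Y})}$. The t quantile is passed in as a number read from a table, and all numerical inputs are made up.

```python
import math


def confidence_interval(estimate, std_error, t_crit):
    """Interval estimate +/- std_error * t_crit."""
    half_width = std_error * t_crit
    return estimate - half_width, estimate + half_width


def prediction_interval(y_hat_new, se_fit, mse, t_crit):
    """Interval for a new response: adds the error variance (MSE) to the
    squared standard error of the fitted value before taking the root."""
    se_pred = math.sqrt(mse + se_fit ** 2)
    return confidence_interval(y_hat_new, se_pred, t_crit)


# Hypothetical inputs: fitted value 10.0 with standard error 0.8, MSE = 2.25,
# and n - q - 1 = 3 error degrees of freedom. For L = 0.95 the tabled
# quantile t_{3, 0.975} is about 3.182.
t_crit = 3.182
ci = confidence_interval(10.0, 0.8, t_crit)
pi = prediction_interval(10.0, 0.8, 2.25, t_crit)
print("mean response:", ci)
print("new response: ", pi)
```

The prediction interval is always the wider of the two, because a new observation carries its own random error in addition to the uncertainty in the fitted mean.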

• Multicollinearity

Multicollinearity is correlation among the predictors.

• Consequences
o Large sampling variability for $\hat\beta_i$
o Questionable interpretation of $\hat\beta_i$ as change in expected response per unit change in $X_i$.
• Detection: $R_i^2$, the coefficient of multiple determination obtained from regressing $X_i$ on the other $X$s, is a measure of how highly correlated $X_i$ is with the other $X$s. This leads to two related measures of multicollinearity.
o Tolerance: $TOL_i = 1 - R_i^2$. Small $TOL_i$ indicates $X_i$ is highly correlated with the other $X$s. We should begin getting concerned if $TOL_i < 0.1$.
o VIF: VIF stands for variance inflation factor. $VIF_i = 1/TOL_i$. Large $VIF_i$ indicates $X_i$ is highly correlated with the other $X$s. We should begin getting concerned if $VIF_i > 10$.
• Remedial Measures
o Center the $X_i$ (or sometimes the $Z_i$)
o Drop offending $X_i$
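A small sketch of tolerance and VIF, and of why centering can help. It assumes a model with only two regressors, in which case $R_i^2$ is simply the squared correlation between them. The example regressors are $Z$ and $Z^2$ for a made-up predictor $Z = 1, \ldots, 10$, first raw and then with $Z$ centered at its mean. The function names are illustrative.

```python
def squared_correlation(u, v):
    """Squared sample correlation between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv * suv / (suu * svv)


def tolerance_and_vif(u, v):
    """TOL = 1 - R^2 and VIF = 1/TOL for a model with exactly two regressors."""
    tol = 1 - squared_correlation(u, v)
    return tol, 1 / tol


# Made-up predictor values.
z = [float(k) for k in range(1, 11)]

# Raw regressors Z and Z^2 are strongly correlated.
raw_tol, raw_vif = tolerance_and_vif(z, [v ** 2 for v in z])

# Centering Z before squaring removes the correlation for these symmetric data.
z_mean = sum(z) / len(z)
zc = [v - z_mean for v in z]
cen_tol, cen_vif = tolerance_and_vif(zc, [v ** 2 for v in zc])

print("raw:     ", raw_tol, raw_vif)
print("centered:", cen_tol, cen_vif)
```

With the raw regressors the tolerance falls below 0.1 and the VIF exceeds 10, so both rules of thumb flag a problem; after centering, the VIF drops to 1. With more than two regressors, $R_i^2$ has to come from an actual regression of $X_i$ on the others.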
• Empirical Model Building

Selection of variables in empirical model building is an important task. We consider only one of many possible methods: backward elimination, which consists of starting with all possible $X_i$ in the model and eliminating the non-significant ones one at a time, until we are satisfied with the remaining model.
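The backward elimination loop can be sketched as follows. This is only an illustration under stated assumptions: "non-significant" is approximated by the rule $|t| < 2$ rather than by a p-value, the stopping rule is purely mechanical (in practice judgment enters), and the data, the cutoff and all function names are made up. The standard error of each coefficient is taken as $\sqrt{MSE \cdot [(X'X)^{-1}]_{ii}}$.

```python
import math


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(columns, y):
    """Fit y on the named regressor columns; return {name: t statistic}."""
    names = list(columns)
    n = len(y)
    design = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    p = len(names) + 1  # q regressors plus the intercept
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    mse = sse / (n - p)  # n - q - 1 error degrees of freedom
    return {name: beta[k + 1] / math.sqrt(mse * inv[k + 1][k + 1])
            for k, name in enumerate(names)}


def backward_elimination(columns, y, t_cutoff=2.0):
    """Drop the regressor with the smallest |t| until all have |t| >= t_cutoff."""
    remaining = dict(columns)
    while remaining:
        t = t_statistics(remaining, y)
        weakest = min(t, key=lambda name: abs(t[name]))
        if abs(t[weakest]) >= t_cutoff:
            break
        del remaining[weakest]
    return list(remaining)


# Made-up data: the response depends on x1 only; x2 is unrelated to it.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, -1, 1, 1, -1, -1, 1]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
y = [2 + 3 * a + e for a, e in zip(x1, noise)]

kept = backward_elimination({"x1": x1, "x2": x2}, y)
print(kept)
```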