
**Figure 1:** lect9f1.ps

**Figure 2:** lect9f2.ps

**Figure 3:** lect7f1.eps

**Figure 4:** lect7f2.eps

A survey on academic dishonesty was conducted among WPI students in 1993 and again in 1996. One question asked students to respond to the statement ``Under some circumstances academic dishonesty is justified.'' Possible responses were ``Strongly agree'', ``Agree'', ``Disagree'' and ``Strongly disagree''. The 1993 results are shown in the following 2x3 table (gender by response) and in the mosaic plots in Figures 5 and 6:

1993 Survey: Dishonesty Sometimes Justified
Cell entries: frequency, percent of total, row percent, column percent.

Gender		Agree	Disagree	Strongly Disagree	Total
Male	Frequency	31	71	30	132
	Percent	17.32	39.66	16.76	73.74
	Row Pct.	23.48	53.79	22.73
	Col Pct.	65.96	79.78	69.77
Female	Frequency	16	18	13	47
	Percent	8.94	10.06	7.26	26.26
	Row Pct.	34.04	38.30	27.66
	Col Pct.	34.04	20.22	30.23
Total	Frequency	47	89	43	179
	Percent	26.26	49.72	24.02	100.00
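The four entries in each cell can be recomputed from the frequencies alone. The following Python sketch is only an illustration and is not part of the original survey analysis; the names `counts`, `pct` and `cell_summary` are assumptions made for this example.

```python
# Frequencies from the 1993 survey table
# (rows: Male, Female; columns: Agree, Disagree, Strongly Disagree).
counts = {
    "Male": [31, 71, 30],
    "Female": [16, 18, 13],
}
responses = ["Agree", "Disagree", "Strongly Disagree"]

n = sum(sum(row) for row in counts.values())              # grand total
row_totals = {g: sum(row) for g, row in counts.items()}   # totals by gender
col_totals = [sum(row[j] for row in counts.values())
              for j in range(len(responses))]             # totals by response


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100.0 * part / whole, 2)


# For every cell: frequency, percent of total, row percent, column percent.
cell_summary = {}
for gender, row in counts.items():
    for j, freq in enumerate(row):
        cell_summary[(gender, responses[j])] = {
            "frequency": freq,
            "percent": pct(freq, n),
            "row_pct": pct(freq, row_totals[gender]),
            "col_pct": pct(freq, col_totals[j]),
        }

for key, entry in cell_summary.items():
    print(key, entry)
```

The row percentages describe the response distribution within each gender, while the column percentages describe the gender split within each response.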

**Figure 5:** Mosaic plot of the 1993 survey results (honesty1.ps)

**Figure 6:** Mosaic plot of the 1993 survey results (honesty2.ps)

Inference for Categorical Data with Two Categories

Methods for comparing two proportions can be used (estimation from chapter 5 and hypothesis tests from chapter 6).

Inference for Categorical Data with More Than Two Categories: One-Way Tables

Suppose the categorical variable has c categories, and that the population proportion in category i is p_i. To test

$\begin{displaymath} \begin{array}{lrcl} H_0: & p_i & = & p_i^{(0)},~ i=1,2, \ldots, c \\ H_a: & p_i & \neq & p_i^{(0)} \mbox{ for at least one } i \end{array} \end{displaymath}$

for pre-specified values $p_i^{(0)}, i=1,2, \ldots, c,$ use the Pearson $\chi^2$ statistic

$\begin{displaymath} X^2 = \displaystyle \sum_{i=1}^c \frac{(Y_i-np_i^{(0)})^2}{np_i^{(0)}},\end{displaymath}$

where Y_i is the observed frequency in category i, and n is the total number of observations.

Note that for each category the Pearson statistic computes (observed-expected)²/expected and sums over all categories.

Under H₀, $X^2$ is approximately distributed as $\chi^2_{c-1}$ when n is large. Therefore, if $x^{2*}$ is the observed value of X², the p-value of the test is $P(\chi^2_{c-1}\geq x^{2*})$.

Example:
Historically, the distribution of weights of ``5 pound'' dumbbells produced by one manufacturer has been normal with mean 5.01 pounds and standard deviation 0.15 pound. It can easily be shown that approximately 20% of the area under a normal curve lies within $\pm 0.25$ standard deviations of the mean, 20% lies between 0.25 and 0.84 standard deviations above the mean, 20% lies between 0.25 and 0.84 standard deviations below the mean, 20% lies more than 0.84 standard deviations above the mean, and the remaining 20% lies more than 0.84 standard deviations below the mean.

This means that the boundaries that break the N(5.01,0.15²) density into five subregions, each with area 0.2, are 4.884, 4.9725, 5.0475 and 5.136.
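The boundaries can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are assumptions made for this illustration. Because the text works with z-values rounded to 0.25 and 0.84, the exact quantiles differ from the stated boundaries in the fourth decimal place.

```python
from statistics import NormalDist

# Historical weight distribution: N(5.01, 0.15^2).
historical = NormalDist(mu=5.01, sigma=0.15)

# Exact cut points that split the density into five regions of area 0.2.
exact_bounds = [historical.inv_cdf(p) for p in (0.2, 0.4, 0.6, 0.8)]

# Cut points obtained from the rounded z-values used in the text.
rounded_bounds = [5.01 + z * 0.15 for z in (-0.84, -0.25, 0.25, 0.84)]

# Area within +/- 0.25 standard deviations of the mean (close to 20%).
std_normal = NormalDist()
central_area = std_normal.cdf(0.25) - std_normal.cdf(-0.25)

print([round(b, 4) for b in exact_bounds])
print([round(b, 4) for b in rounded_bounds])
print(round(central_area, 4))
```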

A sample of 100 dumbbells from a new production lot shows that 25 lie below 4.884, 23 between 4.884 and 4.9725, 21 between 4.9725 and 5.0475, 18 between 5.0475 and 5.136, and 13 above 5.136. Is this good evidence that the new production lot does not follow the historical weight distribution?

Solution:
We will perform a $\chi^2$ test. Let p_i be the proportion of dumbbells in the production lot with weights in subinterval i, where subinterval 1 is $(-\infty,4.884]$ , subinterval 2 is (4.884,4.9725], and so on. If the production lot follows the historical weight distribution, all p_i equal 0.2. This gives our hypotheses:

$\begin{displaymath} \begin{array}{lrcl} H_0: & p_i & = & 0.2,~ i=1,2, \ldots, 5, \\ H_a: & p_i & \neq & 0.2 \mbox{ for at least one } i,\; i=1,2,\ldots, 5. \end{array} \end{displaymath}$

Since $np_i^{(0)}=20$ for each i, the test statistic is

$\begin{displaymath} x^{2*} = \frac{(25-20)^2}{20} + \ldots + \frac{(13-20)^2}{20}=4.4\end{displaymath}$

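The p-value is $P(\chi^2_4\geq 4.4)=0.3546$, so we cannot reject H₀: the sample does not provide good evidence that the new production lot departs from the historical weight distribution.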
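The arithmetic of this example can be reproduced with a few lines of Python. This is a minimal sketch, not the method used to produce the original notes: the function names are hypothetical, and the upper-tail formula used here is the closed form for the chi-square distribution that holds only when the degrees of freedom are even (as they are here, with 4).

```python
import math


def pearson_statistic(observed, null_proportions):
    """Sum of (observed - expected)^2 / expected over all categories."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p)
               for y, p in zip(observed, null_proportions))


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom >= x), valid for even df only.

    Uses the closed form exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


observed = [25, 23, 21, 18, 13]      # counts in the five subintervals
null_proportions = [0.2] * 5         # H0: every p_i equals 0.2

x2 = pearson_statistic(observed, null_proportions)
df = len(observed) - 1
p_value = chi2_upper_tail_even_df(x2, df)

print(round(x2, 2), df, round(p_value, 4))   # about 4.4, 4 and 0.3546
```

For odd degrees of freedom a statistics library would be needed instead of this closed form.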

Inference for Categorical Data with More Than Two Categories: Two-Way Tables

Suppose a population is partitioned into rc categories, determined by r levels of variable 1 and c levels of variable 2. The population proportion for level i of variable 1 and level j of variable 2 is p_ij. These can be displayed in the following $r\times c$ table:

	Column
Row	1	2	...	c	Marginals
1	*p₁₁*	*p₁₂*	...	p_1c	$p_{1\cdot}$
2	*p₂₁*	*p₂₂*	...	p_2c	$p_{2\cdot}$
.	.	.		.	.
.	.	.		.	.
.	.	.		.	.
r	p_r1	p_r2	...	p_rc	$p_{r\cdot}$
Marginals	$p_{\cdot 1}$	$p_{\cdot 2}$	...	$p_{\cdot c}$	1

We want to test

H₀:	row and column variables are independent
H_a:	row and column variables are not independent.

To do so, we select a random sample of size n from the population. Suppose the table of observed frequencies is

	Column
Row	1	2	...	c	Totals
1	*Y₁₁*	*Y₁₂*	...	Y_1c	$Y_{1\cdot}$
2	*Y₂₁*	*Y₂₂*	...	Y_2c	$Y_{2\cdot}$
.	.	.		.	.
.	.	.		.	.
.	.	.		.	.
r	Y_r1	Y_r2	...	Y_rc	$Y_{r\cdot}$
Totals	$Y_{\cdot 1}$	$Y_{\cdot 2}$	...	$Y_{\cdot c}$	n

Under H₀ the expected cell frequencies are given by

$\begin{displaymath} \mbox{expected value} = \frac{ \mbox{row total} \times \mbox{column total} }{\mbox{sample size}}. \end{displaymath}$

To measure the deviations of the observed frequencies from the expected frequencies under the assumption of independence, we construct the Pearson $\chi^2$ statistic

$\begin{displaymath} X^2 = \displaystyle \sum_{i=1}^r \sum_{j=1}^{c} \frac{(Y_{ij}-n\hat{p}_{i\cdot}\hat{p}_{\cdot j})^2}{n\hat{p}_{i\cdot}\hat{p}_{\cdot j}}, \end{displaymath}$

where $\hat{p}_{i\cdot}=Y_{i\cdot}/n$ and $\hat{p}_{\cdot j}=Y_{\cdot j}/n$ .

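Under H₀, $X^2$ is approximately distributed as $\chi^2_{(r-1)(c-1)}$, so if $x^{2*}$ is the observed value of $X^2$, the p-value of the test is $P(\chi^2_{(r-1)(c-1)}\geq x^{2*})$.

Note that for the test to be valid, we require that $np_{i\cdot}p_{\cdot j}\geq 5$ for every cell. In practice this is checked using the estimated expected frequencies $n\hat{p}_{i\cdot}\hat{p}_{\cdot j}$.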
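As an illustration of the expected-frequency formula and the validity check, the following Python sketch applies them to the 1993 survey frequencies given earlier. Applying the check to that table is an assumption made here for illustration only; the variable names are likewise hypothetical.

```python
# Observed frequencies from the 1993 survey table
# (rows: Male, Female; columns: Agree, Disagree, Strongly Disagree).
observed = [
    [31, 71, 30],
    [16, 18, 13],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected value under independence: row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Degrees of freedom for the test of independence: (r - 1)(c - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Validity check: every estimated expected frequency should be at least 5.
all_cells_ok = all(e >= 5 for row in expected for e in row)

for row in expected:
    print([round(e, 2) for e in row])
print("df =", df, "| all expected counts >= 5:", all_cells_ok)
```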

Example: A polling firm surveyed 269 American adults concerning how leisure time is spent in the home. One question asked them to select which of five leisure activities they were most likely to partake in on a weeknight. The results are broken down by age group in the following table, in which the cell entries are frequency, expected frequency under the hypothesis of independence of age and activity, and Pearson residual.

Age		Activity
Group	Watch TV	Read	Listen to Radio	Listen to Stereo	Play Computer Game	Totals
18-25	21	3	9	10	19	62
	(19.13)	(9.22)	(11.52)	(11.75)	(10.37)
	(0.43)	(-2.05)	(-0.74)	(-0.51)	(2.68)
26-35	17	5	6	8	13	49
	(15.12)	(7.29)	(9.11)	(9.29)	(8.20)
	(0.48)	(-0.85)	(-1.03)	(-0.42)	(1.68)
36-50	14	8	8	12	9	51
	(15.74)	(7.58)	(9.48)	(9.67)	(8.53)
	(-0.44)	(0.15)	(-0.48)	(0.75)	(0.16)
51-65	18	10	11	10	3	52
	(16.04)	(7.73)	(9.67)	(9.86)	(8.70)
	(0.49)	(0.82)	(0.43)	(0.04)	(-1.93)
Over 65	13	14	16	11	1	55
	(16.97)	(8.18)	(10.22)	(10.43)	(9.20)
	(-0.96)	(2.04)	(1.81)	(0.18)	(-2.70)
Totals	83	40	50	51	45	269

As an example of the computation of table entries, consider the entries in the (1,2) cell (age: 18-25, activity: Read), in which the observed frequency is 3. The marginal number in the 18-25 bracket is 62, while the marginal number in the Read bracket is 40, so $\hat{p}_{1\cdot}=62/269$ and $\hat{p}_{\cdot 2}=40/269$, and the expected number in the (1,2) cell is 269(62/269)(40/269)=9.22. The Pearson residual is $(3-9.22)/\sqrt{9.22}=-2.05$. The value of the chi-square statistic is 38.91, which is computed as the sum of the squares of the Pearson residuals. Comparing this with the chi-square distribution with (5-1)(5-1)=16 degrees of freedom, we get a p-value of 0.0011, which is strong evidence that age group and preferred activity are not independent.
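The table entries, the statistic and the p-value can be recomputed from the observed frequencies. The Python sketch below is an illustration under stated assumptions, not the software used for the original table: the function name is hypothetical, and the upper-tail probability uses a closed form that is valid only for even degrees of freedom (16 here).

```python
import math

# Observed frequencies: rows are age groups (18-25, 26-35, 36-50, 51-65,
# Over 65); columns are Watch TV, Read, Listen to Radio, Listen to Stereo,
# Play Computer Game.
observed = [
    [21, 3, 9, 10, 19],
    [17, 5, 6, 8, 13],
    [14, 8, 8, 12, 9],
    [18, 10, 11, 10, 3],
    [13, 14, 16, 11, 1],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence, then the Pearson residual
# (observed - expected) / sqrt(expected) for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]
residuals = [[(y - e) / math.sqrt(e) for y, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(observed, expected)]

# The chi-square statistic is the sum of the squared Pearson residuals.
x2 = sum(res ** 2 for row in residuals for res in row)
df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom >= x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


p_value = chi2_upper_tail_even_df(x2, df)

print(round(expected[0][1], 2), round(residuals[0][1], 2))
print(round(x2, 2), df, round(p_value, 4))
```

Cells with large residuals in absolute value, such as the computer-game column for the youngest and oldest age groups, contribute most to the statistic.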

Association is NOT Cause and Effect

Two variables may be associated due to a number of reasons, such as:

1. X could cause Y.
2. Y could cause X.
3. X and Y could cause each other.
4. X and Y could be caused by a third (lurking) variable Z.
5. X and Y could be related by chance.
6. Bad (or good) luck.

The Issue of Stationarity

When assessing the stationarity of a process in terms of bivariate measurements X and Y, always consider the evolution of the relationship between X and Y, as well as the individual distribution of the X and Y values, over time or order.

Suppose we have a model relating a measurement from a process to time or order. If, as more data are taken, the pattern relating the measurement to time or order remains the same, we say that the process is stationary relative to the model.

The translation was initiated by Joseph D Petruccelli on 11/4/1999


Year	Percent January Gain	Percent 12 Month Gain
1985	7.4	26.3
1986	0.2	14.6
1987	13.2	2.0
1988	4.0	12.4
1989	7.1	27.3
1990	-6.9	-6.6
1991	4.2	26.3
1992	-2.0	4.5
1993	0.7	7.1
1994	3.3	-1.5