Mass Communication Research
Chi-square Lab



Go to the assignments page in Course Info and download the clean300.sav SPSS data file to your computer. (If you've forgotten how to download a file, review the Descriptive Statistics 1 Lab Notes.) Open the file in SPSS. It contains data from the Spring 2000 survey. Today we'll work with the Chi-square nonparametric statistic, which is covered in further detail in a Web note.

The Chi-Square Goodness of Fit Test

First, let's find out if the sample conforms to our expectations. One way of doing this is to examine the distributions of variables where we already know what the distribution should be. For example, when Com. 300 students conducted their surveys in the fall, they were instructed to interview an equal number of males and females.

To find out if that happened, let's run frequencies on gender, which is Variable Q28 in the clean300.sav data. We covered running frequencies in the first lab on descriptive statistics, so we won't go over it here. (You may want to review that lab, however.)

When we run frequencies on gender, we get output that looks like the following:

Gender

                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   male            230      50.9            50.9                 50.9
        female          222      49.1            49.1                100.0
        Total           452     100.0           100.0

Obviously, this is very close to a 50-50 split by gender, but is it close enough to be attributed to the random chance we would expect because we are dealing with a sample? This amounts to testing the hypothesis: The proportion of males and females is not equal, which is a two-tailed hypothesis. Let's run a Chi-square Goodness of Fit test to test the hypothesis. To do this with our data set:

  1. Go to Analyze
  2. Scroll down to "Non-Parametric Tests"
  3. Choose "Chi-square"

You should see a window like the one below.

Choosing "Chi-square" opens a new window. When the window opens, scroll down the variable list and highlight gender (q28). Then click the arrow to move it into the "Test Variables" box. In the "Expected Values" box, make certain that "All categories equal" is selected.
 

Click OK. You should get output like that shown below.

Frequencies

Gender

          Observed N   Expected N   Residual
male             230        226.0        4.0
female           222        226.0       -4.0
Total            452


Test Statistics

                Gender
Chi-Square(a)     .142
df                   1
Asymp. Sig.       .707
a 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 226.0.

This table shows that if we expect the categories of gender to be equal, there will be 226 males and 226 females. The residuals column indicates that there are four more males than expected and four fewer females. Thus, the expected and observed values are very close. But are they close enough to be attributed to chance? Let's look at the next table.

This table shows a significance level of .707, which is much higher than our "critical" level of .05. Therefore we fail to reject the null hypothesis. That is, we conclude that the difference in the proportion of males and females can be attributed to chance.
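To see where the numbers in the Test Statistics table come from, the calculation can be sketched by hand. This is only an illustrative Python sketch, not what SPSS runs internally; the variable names are made up for the example, and the closed-form p-value used here holds only when df = 1.

```python
import math

# Observed counts from the Frequencies output for gender.
observed = {"male": 230, "female": 222}

# "All categories equal": every category is expected to hold the same share.
total = sum(observed.values())
expected = total / len(observed)

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi_square = sum((n - expected) ** 2 / expected for n in observed.values())
df = len(observed) - 1

# For df = 1 only, the upper-tail probability has the closed form
# erfc(sqrt(chi_square / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}")
```

The printed values should agree with the Chi-Square and Asymp. Sig. rows of the SPSS output after rounding.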

Let's look at another variable where we might know what the distribution should look like before we gather the data. Consider Class Rank, variable Q1 in our data. When you run frequencies on Class Rank you should get a table like the one below.

class rank

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Freshman          119      26.3            26.3                 26.3
        Sophomore         131      29.0            29.0                 55.3
        Junior             85      18.8            18.8                 74.1
        Senior            117      25.9            25.9                100.0
        Total             452     100.0           100.0

Because this looks a bit odd, we decided to call the UT Office of Institutional Research and ask about the actual distribution of students by class rank. Let's assume the Institutional Research Office says that the student body is 33% Freshmen, 28% Sophomores, 22% Juniors, and 17% Seniors.

Using these figures we can calculate expected values for our sample of 452. That is, we would expect about 150 Freshmen (33% of 452), 127 Sophomores (28% of 452), 99 Juniors (22% of 452), and 76 Seniors (17% of 452). Armed with these figures, we can do a Chi-square Goodness of Fit Test to see if the sample characteristics are in line with the known population figures. This tests the hypothesis that the frequencies are different in the population and the sample.
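The arithmetic behind the expected values can be sketched as follows. This is an illustrative Python sketch with made-up variable names; the percentages are the assumed Institutional Research figures from the text. Note that the exact products are not whole numbers, and the whole-number figures used in this lab each sit within one student of the exact value while still summing to 452.

```python
# Population shares assumed in the text for the UT student body.
population_share = {"Freshman": 0.33, "Sophomore": 0.28, "Junior": 0.22, "Senior": 0.17}
sample_size = 452

# Exact expected counts: population share times sample size.
exact_expected = {rank: share * sample_size for rank, share in population_share.items()}

# Whole-number expected values entered into SPSS in this lab.
lab_expected = {"Freshman": 150, "Sophomore": 127, "Junior": 99, "Senior": 76}

for rank, value in exact_expected.items():
    print(f"{rank:<10} exact = {value:7.2f}   lab value = {lab_expected[rank]}")
```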

To do this with our data set:

  1. Go to Analyze
  2. Scroll down to "Non-Parametric Tests"
  3. Choose "Chi-square"

When the window opens, scroll down the variable list and highlight Class Rank (Q1). Then click the arrow to move it into the "Test Variables" box. Then click on the word "Values" in the "Expected Values" box. This is where we'll insert the expected values we calculated above.

  1. Enter 150 for Freshmen in the top box of the Test Values and click on the "Add" button. This will put the value 150 in the lower box.
  2. Enter 127 for Sophomores and click "Add."
  3. Enter 99 for Juniors and click "Add."
  4. Enter 76 for Seniors and click "Add."

When you've finished entering the expected values, click OK. You should get output like that shown below.

Chi-Square Test

Frequencies

class rank

             Observed N   Expected N   Residual
Freshman            119        150.0      -31.0
Sophomore           131        127.0        4.0
Junior               85         99.0      -14.0
Senior              117         76.0       41.0
Total               452

Test Statistics

                class rank
Chi-Square(a)       30.631
df                       3
Asymp. Sig.           .000
a 0 cells (.0%) have expected frequencies less than 5. The minimum expected cell frequency is 76.0.

In the top box, the first column shows the observed frequencies from the data, which are the same as those we got from the Frequencies procedure. In the second column are the expected frequencies, which we entered. In the third column are the "residuals," which is statistician talk for the difference between the observed and expected values.

From the residuals column we see that there are 31 fewer Freshmen than expected, 4 more Sophomores than expected, 14 fewer Juniors, and 41 more Seniors. These are large differences, but are they too large to be attributed to chance?

To answer that question we look in the bottom box to find the significance level, which is reported as .000 (that is, smaller than .0005). Clearly, this is less than the critical level of .05, so we reject the null hypothesis that the differences between the expected values and the observed values are attributable to chance. We accept the alternate hypothesis that the sample frequencies are different from the population frequencies.
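The same hand calculation works when the expected values are unequal. The sketch below is illustrative Python with made-up variable names, not SPSS's own procedure; the closed-form p-value shown applies only when df = 3.

```python
import math

# Observed counts from the data and the expected counts entered in SPSS.
observed = {"Freshman": 119, "Sophomore": 131, "Junior": 85, "Senior": 117}
expected = {"Freshman": 150, "Sophomore": 127, "Junior": 99, "Senior": 76}

# Residual = observed - expected for each class rank.
residuals = {rank: observed[rank] - expected[rank] for rank in observed}

# Pearson chi-square: sum of residual^2 / expected.
chi_square = sum(residuals[rank] ** 2 / expected[rank] for rank in observed)
df = len(observed) - 1

# For df = 3 only, the upper-tail probability has this closed form.
half = chi_square / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_square / math.pi) * math.exp(-half)

print(residuals)
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.6f}")
```

The p-value is far smaller than .0005, which is why SPSS displays it as .000.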

The Chi-square Test for Independence

The Chi-square test for Independence operates by comparing observed frequencies to expected frequencies. Most of the time we don't know ahead of time what the expected frequencies will be, as we did in the examples above. Instead, the expected values must be calculated from the data that we have gathered. To see how this is done, let's test the hypothesis:

Students who pay for their own college education are less likely to favor increased tuition than are students who receive grants and scholarships.

The independent variable in this case is source of funding for a college education, which is indicated in Question 22. Seven options for primary source of funding are offered, plus a "don't know/no answer" code.

22. What provides the largest share of funding for your education?

  1. Income from work
  2. parent(s)
  3. other family member(s)
  4. Grants
  5. Loans
  6. Scholarships
  7. other
  8. DON'T KNOW/NO ANSWER [DO NOT READ]

Obviously, this variable needs to be recoded if its categories are to indicate simply whether students' college funds come from personal sources or from grants and scholarships. To set the variable up appropriately for the stated hypothesis, we need to create a new variable with responses 1, 2, 3, and 5 in one category, which indicates that the funding comes from personal sources (or sources that may have to be paid back). Responses 4 and 6 are put in another category, which indicates that the student's primary source of college funding is grants or scholarships. (If you don't remember how to collapse categories, refer to Descriptive Statistics Lab 1.)
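The logic of the recode can be sketched outside SPSS as a simple lookup. This Python sketch is only an illustration: the names RECODE_Q22 and recode_funds are hypothetical, and treating responses 7 and 8 as missing is an assumption, since the text does not say how those responses are handled.

```python
# Category values for the new two-category funding variable.
PERSONAL = 1.0   # work, parents, other family, loans
GRANTS = 2.0     # grants, scholarships

# Hypothetical lookup from Question 22 codes to the new categories.
RECODE_Q22 = {1: PERSONAL, 2: PERSONAL, 3: PERSONAL, 5: PERSONAL,
              4: GRANTS, 6: GRANTS}


def recode_funds(q22_code):
    """Return the new category, or None (missing) for any other code.

    Assumption: codes 7 ("other") and 8 (don't know/no answer) are
    treated as missing because they fit neither category.
    """
    return RECODE_Q22.get(q22_code)


sample_answers = [1, 4, 7, 2, 6, 8, 5]  # made-up responses for illustration
print([recode_funds(code) for code in sample_answers])
```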

The dependent variable is willingness to pay a tuition increase, which is indicated in question 8.

8. Would you be willing to pay higher tuition to help meet President Gilley's objectives for the university?

  1. Yes
  2. No
  3. DON'T KNOW/NO ANSWER [DO NOT READ]

How to get the Chi-square statistic

To do a Chi-square, go to Analyze, then Descriptive Statistics, then Crosstabs. This will open a window similar to the one below.

You will see boxes for rows and columns. You want the independent variable to be the column and the dependent variable to be the row. In this case, source of funds will be the column and willingness to pay will be the row. (HINT: The variables have been placed correctly in the window above. Putting the variables in the right places is important for interpreting the output correctly.)

After these are in place, click "statistics." A window like the one below will open.

All you want to do is click "Chi-square" which is in the upper left-hand corner of the window. Click continue and you will be taken back to the crosstabs window. Click the "cells" button (next to statistics), and a window like the one below will open.

In the cell window you want to check "observed" and "expected" counts in the upper left-hand corner and in the lower left-hand corner check column percentages. Then click continue. This will take you back to the crosstabs window. Click OK and you will get crosstabs with a Chi-square value. You should get output like that displayed below.

Case Processing Summary

                              Cases
        Valid             Missing             Total
      N    Percent      N    Percent      N    Percent
willingness to pay higher tution to meet objectives * FUNDS
    419      92.7%     33       7.3%    452     100.0%



willingness to pay higher tution to meet objectives * FUNDS Crosstabulation

                                    FUNDS
                                1.00      2.00     Total
willingness to pay higher tution to meet objectives
  yes     Count                  121        35       156
          % within FUNDS       35.5%     44.9%     37.2%
  no      Count                  220        43       263
          % within FUNDS       64.5%     55.1%     62.8%
  Total   Count                  341        78       419
          % within FUNDS      100.0%    100.0%    100.0%

How to read the output?

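The first thing you'll notice in the output is the case-processing summary. The "N" columns tell us that 419 cases are valid and 33 (7.3%) are missing on at least one of the two variables, so the crosstabulation is based on 419 students rather than all 452. Underneath is a 2x2 crosstabulation table. Footnote b of the Chi-Square Tests table reports that no cells have expected counts of less than five, which matters because the Chi-square test becomes unreliable when expected counts are small. Missing values shrink the cell counts, which is one reason to keep them to a minimum.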

The table rows (think dependent variable) provide information on willingness to pay higher tuition. The table columns provide information on source of funds. The crosstab shows where these variables overlap.

We're interested in comparing the percentages of each group who answered that they are willing to pay higher tuition. Of 341 students who are self-funded, 35.5% say they're willing to pay higher tuition. Of 78 students who get grants or scholarships, 44.9% say they're willing to pay higher tuition. Though we can see the percentages are different, are they significantly different?
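The "% within FUNDS" figures are column percentages: each count divided by its column total. A minimal Python sketch of that arithmetic follows; the variable names are made up for the illustration, and the counts are taken from the crosstabulation above.

```python
# Counts from the crosstabulation: willingness to pay by funding source.
# Column 1.00 = personal sources, column 2.00 = grants or scholarships.
counts = {
    "yes": {"1.00": 121, "2.00": 35},
    "no": {"1.00": 220, "2.00": 43},
}

# Column totals: number of students in each funding group.
column_totals = {
    col: sum(counts[row][col] for row in counts) for col in ("1.00", "2.00")
}

# Column percentage = cell count / column total * 100.
column_percent = {
    row: {col: 100 * counts[row][col] / column_totals[col] for col in column_totals}
    for row in counts
}

for row, cells in column_percent.items():
    print(row, {col: f"{pct:.1f}%" for col, pct in cells.items()})
```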

Chi-Square Tests

                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              2.394(b)    1      .122
Continuity Correction(a)        2.009       1      .156
Likelihood Ratio                2.352       1      .125
Fisher's Exact Test                                               .153         .079
Linear-by-Linear Association    2.388       1      .122
N of Valid Cases                419



a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 29.04.

Look at the Chi-Square Tests table. The significance level of the Pearson Chi-square tells us whether the difference between the groups is statistically significant. We are using the critical level of .05. The Pearson Chi-square shows a significance of .122 for a two-tailed hypothesis. Because ours is a one-tailed hypothesis and the observed difference is in the predicted direction, we divide that by 2, which equals .061. That's still greater than .05, so we fail to reject the null hypothesis that the proportions in the two groups are not different.
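For the test of independence, the expected counts come from the table's own row and column totals. The following Python sketch illustrates the calculation for this 2x2 table; it is not SPSS's internal procedure, the variable names are made up, and the closed-form p-value applies only when df = 1.

```python
import math

# Observed counts: rows = willingness (yes, no),
# columns = funding source (1.00 personal, 2.00 grants/scholarships).
observed = [[121, 35],
            [220, 43]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for a cell = row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# df = (rows - 1) * (columns - 1) = 1, so the df = 1 closed form applies.
p_two_sided = math.erfc(math.sqrt(chi_square / 2))

# Halving is only appropriate when the observed difference is in the
# direction the one-tailed hypothesis predicted.
p_one_sided = p_two_sided / 2

min_expected = min(min(row) for row in expected)
print(f"minimum expected count = {min_expected:.2f}")
print(f"chi-square = {chi_square:.3f}, two-sided p = {p_two_sided:.3f}, "
      f"halved p = {p_one_sided:.3f}")
```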

How to interpret the output?

To interpret the Chi-square, compare the column percentages across rows. Therefore, we'd say "The percentage of self-funded students willing to pay higher tuition is 35.5%, compared to 44.9% of students on grants and scholarships. However, the difference in percentages between the two groups is not large enough to be statistically significant."

EXAMPLE 2

Let's try another example. Our hypothesis is:

"The more a student thinks that President Gilley's ideas will improve the value of a UT education, the more likely they'd be willing to pay higher tuition to meet those objectives."

We'll need to use question 8, which is reproduced above, and question 7:

7. How much do you think President Gilley's ideas will improve the value of your education at UT. Would those ideas:

  1. Improve the value of your education a great deal
  2. Improve the value of your education some
  3. Not improve the value of your education
  4. Decrease the value of your education
  5. NO OPINION/DON'T KNOW/NO ANSWER

In this case variable 7 is the independent variable and variable 8 is the dependent variable.

Once again, to do a Chi-square go to Analyze, Descriptive Statistics, Crosstabs. In this case willingness to pay higher tuition (question 8) will be the dependent variable and perceived improvement in the value of a UT education (question 7) will be the independent variable. REMEMBER: the dependent variable goes into rows and the independent variable goes into columns. Choose the same options as we chose last time. Be sure to choose column percentages rather than row percentages.

Your output should have a 2x4 crosstab table like the one shown below.

Case Processing Summary

Cases
Valid Missing Total
N Percent N Percent N Percent
willingness to pay higher tution to meet objectives * improve value of UT education 405 89.6% 47 10.4% 452 100.0%



willingness to pay higher tution to meet objectives * improve value of UT education Crosstabulation

                                                      improve value of UT education
                                                  improve      improve   not       decrease
                                                  great deal   some      improve   value      Total
willingness to pay higher tution to meet objectives
  yes     Count                                       49          95        13         0        157
          % within improve value of UT education   62.0%       38.6%     17.3%       .0%      38.8%
  no      Count                                       30         151        62         5        248
          % within improve value of UT education   38.0%       61.4%     82.7%    100.0%      61.2%
  Total   Count                                       79         246        75         5        405
          % within improve value of UT education  100.0%      100.0%    100.0%    100.0%     100.0%

Chi-Square Tests

                                Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square              35.686(a)    3   .000
Likelihood Ratio                38.586       3   .000
Linear-by-Linear Association    35.493       1   .000
N of Valid Cases                405

a 2 cells (25.0%) have expected count less than 5. The minimum expected count is 1.94.

When you look at the SPSS output you can see there are 79 students who thought President Gilley's ideas would improve the value of their education "a great deal," 62% of whom said they would support a tuition increase. There were 246 students who said the ideas would improve the value of their education "some," 38.6% of whom said they would support a tuition increase; 75 who said the ideas would "not improve" the value, 17.3% of whom said they would support a tuition increase; and 5 students who said the ideas would decrease the value of their education, none of whom said they would support a tuition increase.

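Clearly this pattern is in line with the hypothesis. Further, we see that the significance level of the Pearson Chi-square (35.686) is reported as .000, which is clearly less than .05. Therefore, we reject the null hypothesis and conclude that the hypothesis is supported. Note, however, that footnote a reports 2 cells (25.0%) with expected counts of less than 5, so this result should be read with some caution.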
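The calculation for this larger table follows the same steps as before, and it also lets us count the cells with small expected values. The Python sketch below is an illustration with made-up variable names, not SPSS's own routine. It assumes the "yes" count in the "decrease value" column is 0, which follows from the row total of 157, and the closed-form p-value applies only when df = 3.

```python
import math

# Observed counts: rows = willingness (yes, no); columns = improve a great
# deal, improve some, not improve, decrease value.
observed = [[49, 95, 13, 0],
            [30, 151, 62, 5]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for a cell = row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability for df = 3 (closed form valid only for df = 3).
half = chi_square / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_square / math.pi) * math.exp(-half)

# Cells whose expected count falls below 5, as reported in the footnote.
small_cells = sum(1 for row in expected for e in row if e < 5)
min_expected = min(min(row) for row in expected)

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.8f}")
print(f"{small_cells} cells below 5, minimum expected count = {min_expected:.2f}")
```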

You should study these examples until you're sure that you can run SPSS to generate crosstabulations and calculate Chi Square. Also, make sure that you can interpret the output. You could also practice these things by making up your own hypotheses and testing them.

If you don't understand something in this Web note, please e-mail Dr. Sitton.


©M. Mark Miller & Ronald W. Sitton 2009
Revised 092811 — http://www.uamont.edu/FacultyWeb/sitton/crz/mrea/chi-squarelab.html