Mass Communication Research
Descriptive Statistics Lab 2
Measures of Central Tendency and Dispersion,
Recoding Missing Data


To begin this lab, download the Fall 300 data set (NOV300.SAV - 452 cases) from the assignments page of CourseInfo. If you don't remember how to download, you can examine last week's lab to refresh your memory. You'll also need to open the codesheet from the Fall 2000 survey in a separate browser window to see the actual questions asked.

We have discussed why you do research, and some types of research in print and broadcast media. We've also discussed downloading SPSS data files from CourseInfo, running frequencies and recoding data to collapse percentage categories.

Today we'll further examine descriptive statistics to organize and interpret data that has been collected. We'll focus on measures of central tendency (i.e. mean, median, mode), measures of dispersion (i.e. standard deviation, variance, and range), and handling missing data.

Description of Sample

The first thing we want to do is to determine what our sample looks like. Let's look at the gender of those participating in the survey (Q28), the GPA of the sample (Q25), and how they rate their overall experience at UT (Q2). That should give us a general idea about the respondents.

Descriptive Stats — Descriptives

There are two ways that SPSS will provide descriptive statistics. The first is to go under Analyze, then Descriptive Statistics and then Descriptives. You should see a display like the one in the figure below.

When you open this window, you should see a display that looks like the one below.

Getting descriptive statistics this way does not give us a frequency table, a median or a mode, but it can provide some useful data. Send the three variables you want to analyze over to the right-hand column. You do this by highlighting the variable names in the left-hand column and clicking on the arrow between the two columns. Then click "Options." You will see a window something like the one below.

Click on the boxes for the mean, standard deviation, range, minimum and maximum. Make sure to unclick the other boxes if they're marked. Then click on "Continue." That will bring you back to the previous window. After you click on "OK" there, you should see output like that shown below.

Descriptive Statistics

                              N     Range   Minimum   Maximum   Mean      Std. Deviation
q28 Gender                    452   1.00    1.00      2.00      1.4912    .5005
q25 GPA                       452   99.00   .00       99.00     19.9072   36.4455
q2 rate overall experience    452   4.00    1.00      5.00      2.4137    .7617
Valid N (listwise)            452




Notice first that we have 452 cases for each requested variable. The second column, Range, gives the difference between the highest and lowest values recorded for each variable. The row labeled "Valid N (listwise)" gives the number of cases that have valid values on every variable in the list.

From this data we can possibly infer the level of measurement for the variables. Notice the range for gender is 1.00, indicating that it is probably a dichotomous variable. Of course, had we decided to code gender "1" for males and "6" for females, the range would have been "5," but it's unlikely that we'd code things that way.

What's the level of measurement for these variables? We would have to turn to the codesheet to see the question wording and coding to be sure. However, the range for the GPA variable (99.00) indicates we may have to recode missing data since 99 is the code for "don't know/no answer".

Now let's look at the mean, which you'll remember is the sum of the scores divided by the number of scores. To understand what the average signifies, we'll need to refer to the questions in the codesheet: for gender, 1=male and 2=female; for rating overall experience, 1=excellent, 2=very good, 3=good, 4=poor, 5=very poor and 6=don't know/no answer, which we code as 99 (what's the level of measurement for this variable?); and for GPA, any answer is accepted (what's the level of measurement for this variable?).

The mean gives us the average of the scores. Note for gender, the mean=1.49. (NOTICE: I rounded to two decimal places). What does this tell us? Actually nothing. It's basically saying the mean is one and a half persons. This is just one example of why the mean is useless for nominal data. The correct measure of central tendency for nominal data is the mode.

The best way to describe a nominal level variable with a few categories is to provide the frequency distribution, which is not provided in the descriptives menu. The frequency distribution is provided through the frequencies menu item, which indicates 50.9 percent male, 49.1 percent female. That's why we don't see the mode reported very often even though it's technically the best measure of central tendency for nominal level measures.
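To see why the mean depends on an arbitrary coding choice while the mode does not, here is a minimal Python sketch (Python is not part of this lab; it is used only for illustration). It rebuilds the gender answers from the 230 male and 222 female respondents that the frequencies procedure reports, then compares the survey's 1/2 coding with the hypothetical 1/6 coding mentioned earlier. The function name summarize is invented for this example.

```python
from collections import Counter
from statistics import mean

# Counts reported by the frequencies procedure for Q28.
N_MALE, N_FEMALE = 230, 222

def summarize(codes):
    """Return (mean, mode) for a list of numeric category codes."""
    most_common_code, _count = Counter(codes).most_common(1)[0]
    return mean(codes), most_common_code

# Coding used in the survey: 1 = male, 2 = female.
survey_codes = [1] * N_MALE + [2] * N_FEMALE
# Hypothetical alternative coding: 1 = male, 6 = female.
alt_codes = [1] * N_MALE + [6] * N_FEMALE

survey_mean, survey_mode = summarize(survey_codes)
alt_mean, alt_mode = summarize(alt_codes)

print(f"1/2 coding: mean = {survey_mean:.2f}, mode = {survey_mode}")
print(f"1/6 coding: mean = {alt_mean:.2f}, mode = {alt_mode}")
```

The mean moves when the labels change even though nobody's answer changed, while the mode still points to the same category (code 1, male).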

For the rating overall experience, the mean is 2.41. This tells us the average response was between "very good" and "good", which indicates the average student has a favorable experience at UT. For GPA, however, the mean is 19.91, which we know is impossible because 4.0 is the highest anyone could have. This absurd average is an indication that missing-data codes have skewed the results for GPA.

Next, examine the standard deviation, which indicates how far the scores typically fall from the mean. That is, a large standard deviation means that the scores are spread widely and a small standard deviation means they are clustered tightly around the mean.

If we have what statisticians call a normal distribution for our variable, approximately 68 percent of all cases will fall within one standard deviation above and below the mean. Approximately 95 percent of all cases fall within two standard deviations above and below the mean. Almost all of the cases (about 99.7 percent) fall within three standard deviations.
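If you want to check those shares yourself, the short Python sketch below computes them from the normal curve using the standard library's error function. The helper name share_within is made up for this example, and the result applies only to a true normal distribution, which survey answers rarely follow exactly.

```python
import math

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.1f} percent")
```

The three printed values are roughly 68, 95 and 99.7 percent.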

For gender, the standard deviation is 0.50, which makes perfect sense. If we add 0.50 to the mean (1.49) we get 1.99 within the first standard deviation, or basically 2 which would indicate a female. If we subtract 0.50, we get .99 within the first standard deviation, or basically 1 which would indicate a male.

Now look at rating overall experience. If we add its standard deviation (0.76) to the mean (2.41) we get 3.17 within the first standard deviation, which indicates answers in the good to poor range. If we subtract its standard deviation, we get 1.65 within the first standard deviation, which indicates answers in the excellent to very good range. Apparently there are no missing cases in this variable, but there are in GPA. The first standard deviation (36.45) would produce a negative number via subtraction (-16.54) and a number for which we had no code (56.36) via addition. Thus we know we'll need to recode this variable.
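The additions and subtractions above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the rounded means and standard deviations from the Descriptives output; the function name one_sd_band is invented for this example.

```python
# Mean and standard deviation from the Descriptives output,
# rounded to two decimals as in the text.
reported = {
    "q28 Gender": (1.49, 0.50),
    "q2 rate overall experience": (2.41, 0.76),
    "q25 GPA": (19.91, 36.45),
}

def one_sd_band(mean, sd):
    """Return (low, high) for one standard deviation around the mean."""
    return round(mean - sd, 2), round(mean + sd, 2)

for name, (mean, sd) in reported.items():
    low, high = one_sd_band(mean, sd)
    print(f"{name}: {low:.2f} to {high:.2f}")
```

A band that runs below zero or far above 4.0 for GPA is the same warning sign discussed above: something other than real GPAs is in the data.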

Finally, the variance (the standard deviation squared) gives us an indication of how the scores spread about the mean. Remember, a small variance indicates that most of the scores in the distribution lie fairly close to the mean; a large variance represents widely scattered scores. Gender (0.25) and rating overall experience (0.58) have small variances, but GPA has a large variance (1328.27), another indication that we'll need to recode.
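Because the variance is simply the standard deviation squared, the three variances quoted above can be recovered from the standard deviations in the Descriptives output. A minimal Python sketch of that check:

```python
# Standard deviations from the Descriptives output.
std_devs = {
    "q28 Gender": 0.5005,
    "q2 rate overall experience": 0.7617,
    "q25 GPA": 36.4455,
}

# Variance is the square of the standard deviation.
variances = {name: sd ** 2 for name, sd in std_devs.items()}

for name, var in variances.items():
    print(f"{name}: variance = {var:.2f}")
```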

Before we end our discussion of the SPSS descriptives procedure, let's talk about one of its handy features. You may have noticed that in the descriptives "Options" window, you could control the order that variables are listed in your output. Usually, this is set to "variable list," which means that the output will list variables in the order that they occur in the data.

Other options allow you to specify other orders such as alphabetical by variable name, or in ascending or descending order by variable mean. Try the ascending or descending order for the satisfaction variables (3a, 3b, 3c, etc.) and see what happens. It will provide an easy way to find out what things students are most satisfied and least satisfied with and provide a table that might be handy for inclusion in your reports. You can do this on your own.

Descriptive Stats — Frequencies

You should recall from the last lab note how to run frequencies. You go to "Analyze" on the SPSS menu, scroll down to "Descriptive Statistics" and over to "Frequencies." You should see a display like the one below.

When you click on "frequencies," you should see a display like the one below.

This menu item will allow you to examine a frequency table, median and mode in addition to the mean and measures of dispersion. Send the Gender variable (Q28) to the right-hand column. Make sure the box that says "Display frequency tables" is checked. Click on "Statistics." You will see a display like the one below.

Click the following boxes: Mean, Median, Mode, Standard Deviation, Variance and Range. Click continue, then click OK. You will receive output for the information you requested in a table like the one below.

Statistics
q28 Gender
N     Valid        452
      Missing      0
Mean               1.4912
Median             1.0000
Mode               1.00
Std. Deviation     .5005
Variance           .2505
Range              1.00

q28 Gender

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   male     230         50.9      50.9            50.9
        female   222         49.1      49.1            100.0
        Total    452         100.0     100.0

You will see your descriptive statistics and the frequency table. You should have a mean of 1.49, a median and mode of 1.00, a standard deviation of 0.50, variance of 0.25 and range of 1.00. Does the mean tell us anything in this instance (nominal data)? The median indicates the middle score if all the scores were written in a long-hand frequency distribution. The frequency table indicates 230 males (50.9 percent) and 222 females (49.1 percent) participated in the study.
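As a cross-check on the SPSS output, the same statistics can be rebuilt from the frequency table alone. The Python sketch below (an illustration only, not part of the lab) expands the 230 and 222 counts back into 452 individual answers and uses the standard library's statistics module. It assumes, consistent with the figures above, that the standard deviation and variance use the sample formula that divides by n - 1.

```python
import statistics

# Rebuild the 452 individual answers from the frequency table:
# 230 respondents coded 1 (male) and 222 coded 2 (female).
answers = [1] * 230 + [2] * 222

summary = {
    "mean": statistics.mean(answers),
    "median": statistics.median(answers),
    "mode": statistics.mode(answers),
    "std_dev": statistics.stdev(answers),      # sample formula, divides by n - 1
    "variance": statistics.variance(answers),  # sample formula, divides by n - 1
    "range": max(answers) - min(answers),
}

for label, value in summary.items():
    print(f"{label}: {value:.4f}")
```

The median comes out as 1 because more than half of the ordered answers (230 of 452) are coded 1.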

SPSS will do whatever you tell it to do. But sometimes it is not important to get all descriptive statistics for all data. The frequency tables are important because they enable us to look at data and make sure there are no errors in it.

Now let's examine ratings of overall experience (Q2). See if you can do this without having figures to look at. Move the variable to the right-hand column. Notice the box that says "Display frequency tables" is still checked. Click on "Statistics." Notice that the statistics we previously chose are still checked. This should stay the same until you close the program. Even so, it's always a good idea to check it to be sure. Now click "Continue," then click "OK." You will receive output for the information you requested.

You should have a mean of 2.41, a median and mode of 2.00, a standard deviation of 0.76, variance of 0.58 and range of 4.00. Does the mode tell us anything in this instance (interval data)? Yes, it indicates that the response given most often was "very good" when rating the overall UT experience. However, the mean is the preferred measure of central tendency with interval data as it tells us the average response was between "good" and "very good," tending toward the latter response.
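The mean, mode and spread for Q2 can also be worked out directly from its frequency table, without listing every answer. The Python sketch below uses the counts SPSS reports for this question (47, 197, 185, 20 and 3 across the five ratings) and weights each answer code by its count. The variable names are invented for this illustration, and the variance uses the sample formula (dividing by n - 1).

```python
import math

# Frequency table for Q2: answer code -> (label, number of respondents).
q2_table = {
    1: ("excellent", 47),
    2: ("very good", 197),
    3: ("good", 185),
    4: ("poor", 20),
    5: ("very poor", 3),
}

n = sum(count for _label, count in q2_table.values())
# Weighted mean: each code counts once per respondent who chose it.
mean = sum(code * count for code, (_label, count) in q2_table.items()) / n
# Sample variance from the weighted squared deviations.
variance = sum(
    count * (code - mean) ** 2 for code, (_label, count) in q2_table.items()
) / (n - 1)
std_dev = math.sqrt(variance)
# The mode is the code with the largest count.
mode = max(q2_table, key=lambda code: q2_table[code][1])

for code, (label, count) in q2_table.items():
    print(f"{code} {label:<10} {count:>4} {100 * count / n:5.1f}%")
print(f"n = {n}, mean = {mean:.2f}, mode = {mode}, "
      f"SD = {std_dev:.2f}, variance = {variance:.2f}")
```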

The frequency table indicates 47 people indicated their experience at UT was excellent (10.4 percent), 197 found it very good (43.6 percent), 185 found it good (40.9 percent), 20 found it poor (4.4 percent) and three people found it very poor (0.7 percent).

Finally, let's run descriptive statistics for GPA. Go to Analyze, Descriptive Statistics and then Frequencies. Move Q25 over to the right-hand column. Then click "Statistics" for a precautionary check. Click "Continue" and this will take you back to the original pop-up window. Make sure you have your frequency table box checked. Click "OK" and you will get your results.

Once again, you will see your descriptive statistics and the frequency table. You should have a mean of 19.91, a median of 3.28, a mode of 99, a standard deviation of 36.45, variance of 1328.27 and range of 99.00. The frequency table gives us a distribution of responses.

We can tell now that we have a problem. First, the mode indicates the single most common answer was "don't know/no answer", which would indicate many are uncomfortable discussing their GPA. The frequency table shows a low response of 0.00 (most likely a freshman who didn't know, but gave an answer anyway), a response of 12.80 (most likely a coding error) and 79 responses of "don't know/no answer", which represents 17.5 percent of our sample. (NOTICE: the valid percent was reported. WHY?)

Recoding Missing Data

Sometimes we code the "don't know/no answer" category as "99". Not only does this make for an ugly table (see previous example); it also might affect the tests that we may run, especially when discussing inferential statistics. In instances such as this, we need to tell SPSS to treat these cases as missing data.

That's not hard to do. Click on "Transform" on the menu bar at the top of the SPSS window, and then click on "Recode." You'll see a window like the one below.

You can either recode into the "same variable" or create a "different variable." This is probably the only time we'll recode into the same variable. We aren't changing the level of measurement of the question as happened when recoding to collapse categories. We want to keep the same categories yet get rid of those responses where people didn't care or didn't want to answer. When reporting these results, we'd note they are based on a subsample.

Choose "same variable" and see if you can figure out how to do things. Put the variable we want to recode (Q25) into the box labeled "numeric variables" and click on the "Old Values and New Values" button. In the new box that appears, put 99 (the value we want to recode) into the box under "Old Value" and click on "System Missing" under "New Value." Then click on the "Add" button. We also want to take care of the coding error, so put 12.80 into the box under "Old Value," click on "System Missing" under "New Value" and click on "Add" again. When you finish you should see a window like the one below.

We could repeat this process if we wanted to recode other values. Because we don't, just click on "Continue." When you get back to the recode window, click on "OK."
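The logic of this recode can be sketched outside SPSS as well. The Python example below is only an illustration: the GPA answers in it are invented (they are not the NOV300.SAV data), and the names MISSING_CODES, recode_missing and valid_mean are made up for the example. It turns the two codes we treated as missing (99 and 12.80) into None and then averages only the valid answers, which is what SPSS does once a value is system-missing.

```python
import statistics

# Codes to treat as missing: "don't know/no answer" and the suspected coding error.
MISSING_CODES = {99, 12.80}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace missing-data codes with None, leaving valid answers unchanged."""
    return [None if v in missing_codes else v for v in values]

def valid_mean(values):
    """Mean of the non-missing values only."""
    valid = [v for v in values if v is not None]
    return statistics.mean(valid)

# Hypothetical answers, invented for illustration (not the NOV300.SAV data).
gpa_raw = [3.0, 3.5, 99, 2.5, 99, 12.80, 4.0, 0.0, 3.0]

gpa_recoded = recode_missing(gpa_raw)
n_valid = sum(v is not None for v in gpa_recoded)
n_missing = len(gpa_recoded) - n_valid

print(f"raw mean: {statistics.mean(gpa_raw):.2f}")
print(f"recoded mean: {valid_mean(gpa_recoded):.2f} "
      f"({n_valid} valid, {n_missing} missing)")
```

Note that the answer 0.0 is kept, because it is a valid response rather than a missing-data code.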

Now we can rerun frequencies, a good idea when recoding data as it's easy to make a mistake. This will also allow us to finish checking the data since we were unable to get a clear reading on the GPA variable previously.

Go to Analyze, Descriptive Statistics and then Frequencies. Move Q25 over to the right-hand column. Then click "Statistics" for a precautionary check. Click "Continue" and this will take you back to the original pop-up window. Make sure you have your frequency table box checked. Click "OK" and you will get your results.

Notice first that we now have 372 valid cases and 80 cases listed as missing data. This corresponds to the answers of "don't know/no answer" that we previously saw in the frequency table, and also the coding error of 12.80. Now let's look at the descriptive statistics and the frequency table. You should have a mean of 3.13, a median of 3.10, a mode of 3.00, a standard deviation of 0.50, variance of 0.25 and range of 4.00. This is more like it as GPA is usually measured on a 4-point scale! The frequency table gives us a distribution of responses.

A few things to note: we still have the response 0.00, but this is OK as it's a valid response. The mean tells us the average student is a B student, and the mode tells us that a GPA of 3.00 (a B average) was reported more often than any other answer.

If you don't understand something in this Web note, please e-mail Dr. Sitton.


© M. Mark Miller & Ronald W. Sitton 2009
Revised 092811 — http://www.uamont.edu/FacultyWeb/sitton/crz/mrea/dstats2.html