Mass Communication Research
Descriptive Statistics Lab 2
Measures of Central Tendency and Dispersion, Recoding Missing Data
To begin this
lab, download the Fall 300 data set (NOV300.SAV - 452 cases) from the
assignments page of CourseInfo. If you don't remember how to download,
you can examine last week's lab to
refresh your memory. You'll also need to open the codesheet from the
Fall 2000 survey in a separate browser window to see the actual
questions asked.
We have discussed why you do research, and some
types of research in print and broadcast media. We've also discussed downloading
SPSS data files from CourseInfo, running frequencies
and recoding
data to collapse percentage categories.
Today we'll further
examine descriptive
statistics to organize and interpret data that has been collected.
We'll focus on measures of
central tendency (i.e. mean, median, mode), measures of
dispersion (i.e. standard deviation, variance, and range), and
handling missing data.
Description of Sample
The first thing we want to do
is to determine what our sample looks like. Let's look at the
gender of those participating in the survey (Q28), the GPA of the sample
(Q25), and how they rate their overall experience at UT (Q2). That should
give us a general idea about the respondents.
Descriptive Stats Descriptives
There are two ways that SPSS will
provide descriptive statistics. The first is to go under Analyze, then
Descriptive Statistics and then Descriptives. You should see a display
like the one in the figure below.

When you open this window, you should see a display that looks like the one below.

Getting descriptive
statistics this way does not give us a frequency table,
median or mode, but it can provide some useful data. Send the three
variables you want to analyze over to the right-hand column. You do this
by highlighting the variable names in the left-hand column and clicking on the
arrow between the two columns. Then click "Options." You will see a window
something like the one below.

Click on the boxes for the mean, standard deviation, range, minimum and maximum.
Make sure to unclick the other boxes if they're marked. Then click on "Continue."
That will bring you back to the previous window. After you click on "OK" there,
you should see output like that shown below.
Descriptive Statistics
| | N | Range | Minimum | Maximum | Mean | Std. Deviation |
| q28 Gender | 452 | 1.00 | 1.00 | 2.00 | 1.4912 | .5005 |
| q25 GPA | 452 | 99.00 | .00 | 99.00 | 19.9072 | 36.4455 |
| q2 rate overall experience | 452 | 4.00 | 1.00 | 5.00 | 2.4137 | .7617 |
| Valid N (listwise) | 452 | | | | | |
Notice first that we
have 452 cases for each requested variable. The second column, range,
gives the difference between the largest and smallest values observed for
each variable. The row that says "Valid N (listwise)" indicates the number
of cases that have valid values on every requested variable. Because nothing
has been coded as missing yet, that number is also 452.
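To see what a listwise count means, here is a minimal Python sketch. The four cases are made up for illustration (they are not rows from NOV300.SAV), and None stands in for a missing answer. The listwise count only includes cases that are valid on every listed variable at once.

```python
# Hypothetical mini data set (NOT the real survey data); None = missing answer.
cases = [
    {"q28": 1, "q25": 3.2, "q2": 2},
    {"q28": 2, "q25": None, "q2": 3},
    {"q28": 1, "q25": 2.8, "q2": None},
    {"q28": 2, "q25": 3.6, "q2": 1},
]
variables = ["q28", "q25", "q2"]

# Valid N per variable: cases with a non-missing value on that one variable.
valid_n = {v: sum(case[v] is not None for case in cases) for v in variables}

# Valid N (listwise): cases with non-missing values on ALL listed variables.
listwise_n = sum(all(case[v] is not None for v in variables) for case in cases)

print(valid_n)
print(listwise_n)
```

In this toy example two different cases each lack one answer, so only two cases survive the listwise count.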
From this data we can possibly
infer the level of measurement for the variables. Notice the range for
gender is 1.00, indicating that it is probably a dichotomous variable. Of
course, had we decided to code gender "1" for males and "6" for females, the
range would have been "5," but it's unlikely that we'd code things that way.
What's the level of
measurement for these variables? We
would have to turn to the codesheet to see
the question wording and coding to be sure.
However, the range for the GPA variable (99.00) indicates we may have to
recode missing data since 99 is the code for "don't know/no answer".
Now let's look at the mean,
which you'll remember is the sum of the scores divided by the number
of scores. To understand what
the average signifies, we'll need to refer to the questions in the codesheet: for
gender, 1=male and 2=female; for rating overall experience, 1=excellent,
2=very good, 3=good, 4=poor, 5=very poor and 6=don't know/no answer,
which we code as 99 (what's the level of
measurement for this variable?); and for GPA, any answer is accepted
(what's the level of
measurement for this variable?).
The mean gives us the
average of the scores. Note for gender, the mean=1.49. (NOTICE: I rounded
to two decimal places.) What does this tell us? Actually
nothing. It's basically saying the mean is one and a half persons. This
is just one example of why the mean is useless for nominal data. The
correct measure of central tendency for nominal data is the mode.
The best way to describe a nominal level variable with a few categories
is to provide the frequency distribution, which is not provided in the
descriptives menu. The frequency distribution is provided through the
frequencies menu item, which
indicates 50.9 percent male, 49.1 percent female. That's why we don't
see the mode reported very often even though it's technically the best
measure of central tendency for nominal level measures.
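If you'd like to check the gender figures outside SPSS, here is a short Python sketch. It rebuilds the 452 gender codes from the counts reported in this lab's frequency table (230 males coded 1, 222 females coded 2). The variable name gender is just a label chosen for this example.

```python
from statistics import mean, mode

# Rebuild the gender codes from the reported counts (1 = male, 2 = female).
gender = [1] * 230 + [2] * 222

gender_mean = round(mean(gender), 4)  # arithmetic mean of the codes
gender_mode = mode(gender)            # most frequent code

print(gender_mean)
print(gender_mode)
```

The mean matches the SPSS output, but only the mode (1, male) names an actual category.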
For the rating overall experience, the mean is 2.41. This tells us the average
response was between "very good" and "good", which indicates the average
student has a favorable experience at UT. For GPA, however, the mean is
19.91, which we know is impossible because 4.0 is the highest anyone could have.
This absurd average is an indication that missing-data codes (the 99s) have skewed
the results for GPA.
Next, examine the standard
deviation, which indicates how far the scores are from the mean.
That is, a large standard deviation means that the scores are spread widely and
a small standard deviation means they are clustered tightly around the mean.
If we have what statisticians call a normal distribution for our variable,
approximately 68 percent of all cases will be within
one standard deviation above and below the mean. Approximately 95 percent
of all cases fall within two standard deviations above
and below the mean, and almost all of the cases fall within three
standard deviations.
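The shares for a normal distribution can be checked with Python's standard library. This is only a sketch of the textbook rule for a perfectly normal variable. Real survey variables rarely follow it exactly.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Share of cases within k standard deviations above and below the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```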
For gender, the standard deviation is 0.50,
which makes perfect sense. If we add 0.50 to the mean (1.49) we get 1.99
within the first standard deviation, or basically 2 which would indicate
a female. If we subtract 0.50, we get .99 within the first standard
deviation, or basically 1 which would indicate a male.
Now look
at rating overall experience. If we add its standard deviation (0.76) to
the mean (2.41) we get 3.17 within the first standard deviation, which
indicates answers in the good to poor range. If we subtract its standard
deviation, we get 1.65 within the first standard deviation, which
indicates answers in the excellent to very good range. Apparently there
are no missing cases in this variable, but there are in GPA. The first
standard deviation (36.45) would produce a negative number via
subtraction (-16.54) and a number for which we had no code (56.36) via
addition. Thus we know we'll need to recode this
variable.
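The adding and subtracting in the last two paragraphs can be written as a small Python sketch. It uses the rounded means and standard deviations from the Descriptives output. The function name one_sd_interval is made up for this example.

```python
# Rounded means and standard deviations from the Descriptives output.
reported = {
    "q28 Gender": (1.49, 0.50),
    "q2 rate overall experience": (2.41, 0.76),
    "q25 GPA": (19.91, 36.45),
}

def one_sd_interval(mean, sd):
    """Return (mean - sd, mean + sd), rounded to two decimals."""
    return round(mean - sd, 2), round(mean + sd, 2)

intervals = {name: one_sd_interval(m, sd) for name, (m, sd) in reported.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: {low} to {high}")
```

Only the GPA interval runs far outside the values a respondent could really give, which is the warning sign discussed above.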
Finally, the variance (the square of the standard deviation)
gives us an indication of how the scores spread about the mean.
Remember, a small variance indicates that most of the scores in the
distribution lie fairly close to the mean; a large variance represents
widely scattered scores. Gender (0.25) and rating overall experience
(0.58) have small variances, but GPA has a large variance (1328.27),
another indication that we'll need to recode.
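Because the variance is the square of the standard deviation, you can reproduce these three variances from the standard deviations in the Descriptives output. A minimal Python sketch:

```python
# Standard deviations as printed in the Descriptives output.
std_devs = {
    "q28 Gender": 0.5005,
    "q2 rate overall experience": 0.7617,
    "q25 GPA": 36.4455,
}

# Variance = standard deviation squared, rounded to two decimals.
variances = {name: round(sd ** 2, 2) for name, sd in std_devs.items()}

for name, var in variances.items():
    print(f"{name}: {var}")
```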
Before we end our discussion of the SPSS descriptives procedure, let's talk
about one of its handy features. You may have noticed that in the descriptives
"Options" window, you could control the order that variables are listed in
your output. Usually, this is set to "variable list," which means that the output
will list variables in the order that they occur in the data.
Other options allow you to specify other orders such as alphabetical by variable name,
or in ascending or descending order by variable mean. Try the ascending or descending
order for the satisfaction variables (3a, 3b, 3c, etc.) and see what happens. It will
provide an easy way to find out what things students are most satisfied and least
satisfied with and provide a table that might be handy for inclusion in your reports.
You can do this on your own.
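The idea behind the ascending and descending display orders is just a sort on the means. Here is a rough Python sketch. The variable names and the means are invented for illustration, so run the procedure in SPSS to get the real values.

```python
# Hypothetical means for three satisfaction items (NOT real survey results).
item_means = {"q3a": 2.8, "q3b": 1.9, "q3c": 3.4}

# Ascending means: lowest mean first.
ascending = sorted(item_means.items(), key=lambda pair: pair[1])

# Descending means: highest mean first.
descending = sorted(item_means.items(), key=lambda pair: pair[1], reverse=True)

for name, value in ascending:
    print(f"{name}: {value}")
```

Check the codesheet to see whether a low code means more or less satisfaction before reading the order.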
Descriptive Stats Frequencies
You should recall from the last lab note how to run frequencies. You go to "Analyze"
on the SPSS menu, scroll down to "Descriptive Statistics" and over to "Frequencies." You should
see a display like the one below.

When you click on "frequencies," you should see a display like the one below.

This menu item will
allow you to examine a frequency table, median and mode in addition
to the mean and measures of dispersion. Send the Gender variable
(Q28) to the right-hand column. Make sure the box that says "Display
frequency tables" is checked. Click on statistics. You will see a display like
the one below.

Click the following boxes: Mean, Median, Mode, Standard
Deviation, Variance and Range. Click continue, then click OK. You will
receive output for the information you requested in a table like the
one below.
Statistics
q28 Gender
| N | Valid | 452 |
| | Missing | 0 |
| Mean | | 1.4912 |
| Median | | 1.0000 |
| Mode | | 1.00 |
| Std. Deviation | | .5005 |
| Variance | | .2505 |
| Range | | 1.00 |

q28 Gender
| | | Frequency | Percent | Valid Percent | Cumulative Percent |
| Valid | male | 230 | 50.9 | 50.9 | 50.9 |
| | female | 222 | 49.1 | 49.1 | 100.0 |
| | Total | 452 | 100.0 | 100.0 | |
You will see your descriptive statistics and the frequency
table. You should have a mean of 1.49, a median and mode of 1.00, a
standard deviation of 0.50, variance of 0.25 and range of 1.00. Does the
mean
tell us anything in this instance (nominal data)? The median
indicates the middle score when all the scores are listed in order
from lowest to highest. The frequency table indicates 230 males (50.9
percent) and 222 females (49.1 percent) participated in the study.
SPSS will do whatever you tell it to do. But sometimes it is not
important to get all descriptive statistics for all data. The frequency
tables are important because they enable us to look at data and make
sure there are no errors in it.
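The percent and cumulative percent columns of a frequency table are easy to rebuild by hand. Here is a minimal Python sketch using the gender counts from the SPSS output. The layout of rows is simply a choice for this example.

```python
# Counts from the q28 Gender frequency table.
counts = {"male": 230, "female": 222}
total = sum(counts.values())

rows = []
cumulative = 0.0
for label, n in counts.items():
    percent = 100 * n / total  # share of all cases in this category
    cumulative += percent      # running total of the percents
    rows.append((label, n, round(percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```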
Now let's examine ratings of overall experience (Q2). See if you can do this
without having figures to look at. Move the variable to the right-hand column.
Notice the box that says "Display frequency tables" is still checked.
Click on statistics. Notice that the statistics we previously chose are
still there. This should stay the same until you close the program. Even
so, it's always a good idea to check it to be sure. Now click continue,
then click OK. You will receive output for the information you
requested.
You should have a mean of 2.41, a median and mode of
2.00, a standard deviation of 0.76, variance of 0.58 and range of 4.00.
Does the mode tell
us anything in this instance (interval data)? Yes, it indicates that the
response given most often was "very good" when rating the overall UT
experience. However, the mean is the preferred measure of central
tendency with interval data as it tells us the average response was
between "good" and "very good," tending toward the latter
response.
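You can confirm these Q2 figures from the frequency counts reported in this lab (47 excellent, 197 very good, 185 good, 20 poor, 3 very poor). The sketch below uses Python's statistics module. Its stdev and variance functions divide by n - 1, which is why they agree with the SPSS values here.

```python
from statistics import mean, median, mode, stdev, variance

# Q2 counts: 1=excellent, 2=very good, 3=good, 4=poor, 5=very poor.
counts = {1: 47, 2: 197, 3: 185, 4: 20, 5: 3}

# Expand the counts back into one code per respondent.
ratings = [code for code, n in counts.items() for _ in range(n)]

summary = {
    "n": len(ratings),
    "mean": round(mean(ratings), 2),
    "median": median(ratings),
    "mode": mode(ratings),
    "std_dev": round(stdev(ratings), 2),       # sample standard deviation
    "variance": round(variance(ratings), 2),   # sample variance
    "range": max(ratings) - min(ratings),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```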
The frequency table shows that 47 people rated their
experience at UT as excellent (10.4 percent), 197 found it very good
(43.6 percent), 185 found it good (40.9 percent), 20 found it poor (4.4
percent) and three people found it very poor (0.7 percent).
Finally, let's run descriptive statistics for GPA. Go to Analyze, Descriptive
Statistics and then Frequencies. Move Q25 over to the right-hand column.
Then click Statistics for a precautionary check. Click Continue and this will
take you back to the original pop-up window. Make sure you have your
frequency table box checked. Click OK and you will get your results.
Once again, you will see your descriptive statistics and the
frequency table. You should have a mean of 19.91, a median of 3.28, a
mode of 99, a standard deviation of 36.45, variance of 1328.27 and range
of 99.00. The frequency table gives us a distribution of responses.
We can tell now that we have a problem. First, the mode
indicates the single most common response was "don't know/no answer", which
may indicate many are uncomfortable discussing their GPA. The frequency
table shows a low response of 0.00 (most likely a freshman who didn't
know, but gave an answer anyway), a response of 12.80 (most likely a
coding error) and 79 responses of "don't know/no answer", which
represent 17.5 percent of our sample (NOTICE: the valid percent was
reported. WHY?)
Recoding Missing Data
Sometimes we code the "don't know/no answer" category as "99".
Not only does this make for an ugly table (see previous example); it
also might affect the tests that we may run, especially
when discussing inferential statistics. In instances such as this, we
need to tell SPSS to treat these cases as missing data.
That's
not hard to do. Click on "Transform" on the toolbar at the top of the
SPSS window, and then click on "Recode." You'll see a window like the one
below.

You can either recode into the
"same variable" or create a "different variable." This is probably the
only time we'll recode into the same variable. We aren't changing the
level of measurement of the question as happened when recoding to
collapse categories. We want to keep the same categories yet get rid
of those responses where people didn't care or didn't want to answer.
When reporting these results, we'd note they are based on a subsample.
Choose "same variable" and see if you can figure out how to do things.
Put the variable
we want to recode (Q25) into the box labeled "numeric variables" and
click on the "Old Values and New Values" button. In the new box that
appears, put 99 (the value we want to recode) into the box under "Old
Value" and click on "System Missing" under "New Value." Then
click on the "Add" button. We also want to take care of the coding
error, so put 12.80 into the box under "Old Value," click on "System
Missing" under "New Value" and click on "Add" again. When you finish you should see a
window like the one below.

We could repeat this process if we
wanted to recode other values. Because we don't, just click on
"Continue." When you get back to the recode window, click on "OK."
Now we can rerun frequencies, a good idea when recoding data as
it's easy to make a mistake. This will also allow us to finish checking
the data since we were unable to get a clear reading on the GPA variable
previously.
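The logic of the recode can be sketched in a few lines of Python. This is not how SPSS does it internally. The GPA answers below are invented for illustration, None stands in for system missing, and the function name recode_missing is made up for this example.

```python
from statistics import mean

# Codes to treat as missing: 99 (don't know/no answer) and the 12.80 coding error.
MISSING_CODES = {99, 12.80}

def recode_missing(values, missing_codes=MISSING_CODES):
    """Replace every missing code with None; leave other values unchanged."""
    return [None if v in missing_codes else v for v in values]

# Hypothetical GPA answers (NOT the real survey data).
raw_gpa = [3.2, 99, 2.8, 12.80, 3.6, 0.0, 99, 3.0]

recoded = recode_missing(raw_gpa)
valid = [v for v in recoded if v is not None]

print("valid:", len(valid), "missing:", recoded.count(None))
print("mean before recoding:", mean(raw_gpa))
print("mean after recoding:", mean(valid))
```

As with the real data, the 0.00 answer stays in because it is a valid response, and the mean drops back into the 0 to 4 range once the missing codes are set aside.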
Go to Analyze, Descriptive Statistics and then
Frequencies. Move Q25 over to the right-hand column. Then click
Statistics for a precautionary check. Click Continue and this will take you
back to the original pop-up window. Make sure you have your frequency
table box checked. Click OK and you will get your results.
Notice first that we now have 372 valid cases and 80 cases
listed as missing data. This corresponds to the answers of "don't
know/no answer" that we previously saw in the frequency table, and also
the coding error of 12.80. Now let's look at the descriptive statistics
and the frequency table. You should have a mean of 3.13, a median of
3.10, a mode of 3.00, a standard deviation of 0.50, variance of 0.25 and
range of 4.00. This is more like it as GPA is usually measured on a
4-point scale! The frequency table gives us a distribution of
responses.
A few things to note: we still have the response
0.00, but this is OK as it's a valid response. The mean tells us the
average student is a B student, and the mode tells us that 3.00 (a B
average) was reported more often than any other answer.
If you don't
understand something in this Web note, please e-mail
Dr. Sitton.
© M. Mark Miller & Ronald W. Sitton 2009
Revised 092811
http://www.uamont.edu/FacultyWeb/sitton/crz/mrea/dstats2.html