Fall:Lesson 3: Describing Quantitative Data: Shape, Center & Spread
1 Lesson Outcomes
By the end of this lesson, you should be able to:
 Open an existing data file in a spreadsheet program
 Open a new data file in a spreadsheet program
 Enter and edit data in a spreadsheet program
 Save a data file in a spreadsheet program
 Obtain summary statistics in a spreadsheet program
 Illustrate quantitative data using a histogram
 Identify the sample mean as a point estimate of the population mean
 Calculate the mean of a data set by hand
 Calculate the mean of a data set in a spreadsheet program
 Calculate the median of a data set by hand
 Calculate the median of a data set in a spreadsheet program
 Interpret the mean of a data set
 Interpret the median of a data set
 Distinguish between a parameter and a statistic
 Compute the sample standard deviation by hand
 Use a spreadsheet program to compute the sample standard deviation of a data set
 Find the variance given the standard deviation
 Use a spreadsheet program to compute percentiles (including the quartiles)
 Obtain the five-number summary for a given data set
 Identify the five-number summary, given a boxplot
 Create a boxplot by hand
 Create a side-by-side boxplot by hand
 Interpret data presented graphically in a histogram
 Discuss the shape, center, and spread of data illustrated on a histogram
 Identify whether the mean or median will be larger for a right- or left-skewed distribution
 Compare the center and spread of two data sets based on summary statistics or a graphical display
 Explain the standard deviation in your own words
 Interpret percentiles and quartiles
 Interpret the five-number summary
 Interpret a boxplot
2 Review of the Five Steps of the Statistical Process
We will use the five steps in the Statistical Process throughout the course. Recall the five steps (and the mnemonic "Daniel Can Discern More Truth") before you begin this lesson.
Step 1:  Daniel  Design the study 
Step 2:  Can  Collect data 
Step 3:  Discern  Describe the data 
Step 4:  More  Make inferences 
Step 5:  Truth  Take action 
3 Shape of a Distribution
Cost to Treat Tuberculosis in India
Step 1: Design the study.
Tuberculosis (TB) is the most deadly bacterial disease in the world. In 2009, nine million new cases of tuberculosis were diagnosed, leading to almost 2 million deaths worldwide. Currently, the principal vaccine used to prevent tuberculosis is Bacille Calmette Guerin (BCG). Unfortunately, BCG is only moderately effective at preventing tuberculosis. Historically, India has had a high number of tuberculosis cases. The Indian Government wants to reduce the prevalence of this disease.
In this activity, you will compare the “average” costs of treating a person who contracts tuberculosis to the costs of preventing a case of tuberculosis in India.
Step 2: Collect data.
Health Care records of tuberculosis patients in India were surveyed to estimate the cost to treat patients with tuberculosis. The following data are representative of the total costs (in US dollars) incurred by society in the treatment of 10 randomly selected tuberculosis patients in India.
These costs include health care treatment, time missed from work, and in some cases utility lost due to death.
Step 3: Describe the data.
3.1 Visualizing Quantitative Data: Histograms
The following data are representative of the total costs (in US dollars) incurred by society in the treatment of 10 randomly selected tuberculosis patients in India.
To help us visualize these data, we will create a graph called a histogram. To make a histogram, we will divide the number line from 0 to 35,000 into seven equal parts. We will then count the number of data points in each of these intervals:
Interval  Number of Observations 

At least 0 and less than 5,000  2 
At least 5,000 and less than 10,000  1 
At least 10,000 and less than 15,000  3 
At least 15,000 and less than 20,000  2 
At least 20,000 and less than 25,000  1 
At least 25,000 and less than 30,000  0 
At least 30,000 and less than 35,000  1 
For each of these intervals, we draw a bar on the histogram. The width of the bars is determined by the width of the interval (5000 in this example). The height of the bars is equal to the number of observations that fall in each interval.
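The counting step can be sketched in Python. This is only an illustration of the idea, not a description of how Excel or SPSS builds the graph. It assumes the ten costs are the values listed in the mean calculation later in this lesson, and the names `costs`, `bin_width`, and `counts` are chosen just for this example.

```python
# Ten tuberculosis treatment costs (US dollars), as listed in this lesson.
costs = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]

bin_width = 5000   # width of each interval
num_bins = 7       # seven intervals covering 0 up to 35,000

# counts[k] holds the number of observations that are
# at least k * bin_width and less than (k + 1) * bin_width.
counts = [0] * num_bins
for cost in costs:
    counts[cost // bin_width] += 1

for k, count in enumerate(counts):
    low = k * bin_width
    high = low + bin_width
    print(f"At least {low:,} and less than {high:,}: {count}")
```

Each entry of `counts` is the height of one bar, so the printed lines should match the table of intervals above.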
[Figure: histograms of the tuberculosis treatment cost data, one created in Excel and one created in SPSS]

 Instructions for Creating a Histogram in Excel:
 To make a histogram in Excel, you can use the Excel file QuantitativeDescriptiveStatistics.xls. Enter the 10 data values in the first column (on the left side) of the tool, and the histogram will be generated automatically. You can practice creating a histogram in Excel using the data provided on the 10 patients.
 Using 8 intervals, the tuberculosis cost data can be represented by the following histogram.
 This histogram differs slightly from the first example. The width of the intervals is smaller and the starting value for the first interval is different. However, both graphs illustrate the same data.
After summarizing the data numerically and graphically, we are ready to make inferences about the population.
Step 4: Make inferences.
The historical total mean cost to society to treat a case of tuberculosis in India is known to be $13,800. To assess if the cost has increased, the null and alternative hypotheses are:
 $H_0$: The mean cost is equal to $13,800.
 $H_a$: The mean cost is greater than $13,800.
Using the symbol for the population mean, we can rewrite the null and alternative hypotheses more simply as:
 $H_0$: $\mu = \$13,800$
 $H_a$: $\mu > \$13,800$
Recall that the $P$-value is the probability that the data would be as extreme or more extreme than we observed, assuming the null hypothesis is true. A small $P$-value indicates that there is a lot of evidence against the null hypothesis. If the $P$-value is low, say, less than 0.05, we reject the null hypothesis.
It can be shown that the $P$-value is 0.460. (Later in the course, you will have the tools to do this computation.)
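The decision rule itself is a single comparison. The sketch below assumes the 0.05 cutoff mentioned above; the function name `decide` and the two $P$-values passed to it are hypothetical and are not taken from the tuberculosis data.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule: reject the null hypothesis when the P-value is below alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Two made-up P-values, one on each side of the cutoff.
print(decide(0.01))
print(decide(0.30))
```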
Answer the following question: 

5. Based on the $P$-value, would you reject or fail to reject the null hypothesis? Explain how you made this determination.

Step 5: Take action.
After making inferences, you take action. The motivation for conducting a study like this is usually to see if there is inflation in the costs.
Answer the following question: 

6. Given that you failed to reject the null hypothesis, do you think the Government of India needs to take any special action to stop the increase in the cost to treat tuberculosis?

One benefit of using a histogram is that it allows you to visualize the distribution of the data. A histogram illustrates the overall shape of the distribution of the data. The height of each bar shows how many observations fall in that range.
We will describe the shape of the distribution of a data set using the following basic categories: symmetric, bell-shaped, skewed right, and skewed left. Additionally, we can label the shape of a distribution as uniform, unimodal, bimodal, or multimodal.
A distribution is symmetric if the left and right sides of the distribution appear to be roughly mirror images of each other. A special symmetric distribution is a bell-shaped distribution. When data follow a bell-shaped distribution, the histogram looks like a bell. Bell-shaped distributions play an important role in Statistics and will play a role in most of the future lessons.
A distribution is right-skewed if a histogram of the distribution shows a long right tail. This can occur if there are some very large outliers. A distribution is left-skewed if a histogram shows that it has a long tail to the left.
If a distribution has only one peak, it is said to be unimodal. The three distributions illustrated above are all unimodal distributions. Some people might argue that there are several peaks in the GPA data, so it should not be considered unimodal. Even though there are jagged bumps in the histogram, it is important to visualize the overall shape in the data. When interpreting a histogram, it can be helpful to blur your eyes and imagine the overall shape—after smoothing out the bumps. If the overall trend indicates that there is more than one bump, then we do not consider the distribution to be unimodal.
Some distributions have no distinct peak; others have more than one peak. When there is no distinct peak, and the histogram shows a relatively flat shape, we might say the data follow a uniform distribution. If there are two distinct peaks, a distribution is called bimodal. If there are more than two peaks, we refer to the distribution as multimodal.
4 Center of a Distribution
Step 3: Describe the data.
Sometimes people talk about the "typical" BYU-Idaho student or the “average” waiting time for a bus. But what does it mean for something or someone to be "average"? How can we quantify what it means to be “typical” or “average”? In the example below, we will explore one way to define what "average" means.
When we talk about the "typical" or "average" value, we are essentially describing the center of a population. If we want to estimate the "average" costs to treat a tuberculosis patient, there are several ways we can do it.
4.1 Measuring the Center of a Distribution
4.1.1 Mean
The sample mean or sample arithmetic mean is the most common tool to estimate the center of a distribution. It is referred to simply as the mean. It is computed by adding up the observed data and dividing by the number of observations in the data set.
In Statistics, important ideas are given a name. Very important ideas are given a symbol. The sample mean has both a name (mean) and a symbol ($\bar x$).
You may have heard people refer to the sample mean as the “average.” The word “average” refers to any number that is used to estimate the center of a distribution. The mean, median and mode are all examples of averages. To avoid confusion in this course, the word “average” will not be used to describe the mean.
Answer the following question: 

1. Practice finding the mean, $\bar x$, for the tuberculosis treatment costs of the 10 patients in India by simplifying the following:
$$\bar x=\frac{15100 + 19000 + 4800 + 6500 + 14900 + 600 + 23500 + 11500 + 12900 + 32200}{10}=$$
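If you would like to check your arithmetic, the same calculation can be sketched in Python: add up the observations and divide by how many there are. The variable names are chosen only for this example.

```python
# Ten tuberculosis treatment costs (US dollars) from the expression above.
costs = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]

total = sum(costs)        # numerator: sum of the observations
n = len(costs)            # denominator: number of observations
mean_cost = total / n     # the sample mean, x-bar

print(f"Sum = {total:,}; n = {n}; mean = {mean_cost:,.0f}")
```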

4.1.2 Median
The median is the “middle” value in a sorted data set. Half of the observations in the data set are below the median and half are above the median. To find the median, you:
 Sort the values from smallest to largest.
 Do one of the following:
   If there is an odd number of values, the median is the middle value in the sorted list.
   If there is an even number of values, the median is the mean of the two middle values in the sorted list.
Answer the following questions: 

2. Practice finding the median of the tuberculosis treatment costs for the 10 patients in India. First, sort the data from smallest to largest.
The middle two numbers are 12900 and 14900. The mean of these two numbers is: $$\text{Median } = \frac{12900 + 14900}{2} = 13900$$ The median cost to treat the ten TB patients in India is $13,900. 
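The sort-then-pick-the-middle procedure can be written as a short Python function. This is a sketch of the steps described above; the function name `median` and the variable names are chosen for this example, and the function assumes the list is not empty.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # step 1: sort from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: the single middle value
        return ordered[middle]
    # even count: the mean of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

costs = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]
median_cost = median(costs)

print(sorted(costs))
print(median_cost)
```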
4.1.3 Mode
The most frequently occurring value is called the mode. Sometimes there is more than one mode. For example, in the data set
$${1,~~2, ~~2, ~~2, ~~3, ~~4, ~~4, ~~5, ~~5, ~~5, ~~6}$$
the modes are 2 and 5. Both of these values occur three times, which is more times than any other value.
If no number occurs more than once in the data set, we say that there is no mode. For the data set representing the costs to treat tuberculosis in India, none of the values is repeated. So, there is no mode for these data.
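The convention used here (possibly several modes, and no mode when nothing repeats) can be sketched in Python with `collections.Counter`. The function name `modes` is chosen for this example, and it assumes a non-empty list. Note that Python's built-in `statistics.multimode` follows a different convention: when no value repeats, it returns every value rather than an empty list.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or an empty list if no value repeats."""
    counts = Counter(values)              # how many times each value occurs
    highest = max(counts.values())
    if highest == 1:
        return []                         # lesson convention: no mode
    return sorted(v for v, c in counts.items() if c == highest)

example = [1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6]
costs = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]

print(modes(example))   # the two modes of the example data set
print(modes(costs))     # empty list: the cost data have no mode
```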
Answer the following question: 

4. For a particular data set, which of the following can occur?

 To calculate the mean, median, and mode in Excel, do the following:
 If you do not know how to use Excel to find the mean and median, the spreadsheet QuantitativeDescriptiveStatistics.xls has been created to make these calculations for you.
 Download and open the Excel spreadsheet.
 Copy and paste your data into the blue column found on the left side of the page.
 The mean, median, mode, and other statistics will be located in the "Numerical Descriptive Statistics" table found in the center of the worksheet. A histogram will be shown above this table.
4.2 Additional Issues and Ideas
Rounding
As a general rule, when rounding your results, round to three decimal places unless otherwise specified.
Parameters and Statistics
We only have data on the cost to treat ten randomly selected tuberculosis patients. This represents a random sample from the population. The sample obtained by the researchers depends on random chance. If the study was repeated and a new sample of ten patients was randomly drawn from all cases of tuberculosis in India, would we observe the same data values? Certainly not!
However, if we draw a random sample, we would expect the mean of the new sample to be somewhat close to the mean for our original sample. The sample mean, $\bar x$, is an estimate of the true mean of the population.
One of the primary purposes of collecting and analyzing data is to estimate the true mean of a population. Usually, we do not know what the true mean is, and we estimate it with the sample mean.
The sample mean is an example of a statistic. A statistic is a number that describes a sample. The true (usually unknown) population mean is an example of a parameter. A parameter is any number that describes a population.
An easy way to distinguish between a parameter and a statistic is to note the repetition in the first letters:
 Population Parameter – True (usually unknown) value describing a population
 Sample Statistic – Estimate of the population parameter obtained from a sample
In the example above, the sample mean $\bar x$ = $14,100 is a statistic. Over the last few years, the total mean cost to treat tuberculosis in India has been $13,800. This is considered a parameter.
Different symbols are used to distinguish between the sample mean (a statistic) and the population mean (a parameter). The symbol for the sample mean is $\bar x$. The symbol for the population mean is $\mu$.
Perspective
The mean cost to treat the ten tuberculosis patients in the sample was $\bar x$ = $14,100. This number gives us some useful information. However, if this was all we were given, we would not be able to distinguish the data above from a situation where the cost for each of the ten patients was exactly $14,100. Notice that if the cost for each patient was $14,100, the mean would be:
$$\bar x=\frac{14100 + 14100 + 14100 + 14100 + 14100 + 14100 + 14100 + 14100 + 14100 + 14100}{10} =14,100$$
Even though measures of center are important, we need to consider the shape, center and spread of a distribution of data. When evaluating data, it is sometimes tempting to compute a mean but to avoid creating a histogram. This can lead to errant decisions based on a misunderstanding or incorrect transcription of data. If there is a transcription error in the data, it is sometimes easiest to detect it as an outlier in a histogram.
5 Spread of a Distribution
You have studied two important characteristics of a distribution: the shape and the center. In this section, you will discover ways to summarize the spread of a distribution of data. The spread of a distribution of data describes how far the observations tend to be from each other. There are many ways to describe the spread of a distribution.
5.1 Standard Deviation and Variance
This activity introduces two measures of spread: the standard deviation and the variance.
Diving Elephant Seals
Researchers Jessica Meir and Paul Ponganis studied the characteristics of the northern elephant seal, Mirounga angustirostris (Meir, 2010). These seals have the ability to dive deep into the ocean. Researchers placed a thermistor (thermometer) and a backpack recorder on 13 different seals.
Data were collected over several days for each seal. The results are given in the file DivingElephantSeals.
The following table summarizes some of the measurements collected on the seals:
Seal's Name  Number of Dives  Mass (kg)  Length of Deployment (Days at Sea)  Thermistor Location  Representative Temperature ($^\circ \text{C}$) 

Chick  267  238  3.5  Extradural  36.70 
Starburst  33  162  3  Extradural  37.19 
Patty  81  191  1  Extradural  37.85 
Bodil  192  160  21  Hepatic Sinus  37.91 
Roberta  480  148  15  Hepatic Sinus  37.25 
Larry  218  158  9  Hepatic Sinus  38.98 
Per  160  163  1.2  Hepatic Sinus  38.16 
Sir Richard  312  226  2  Arterial (femoral)  39.32 
Jerry  132  180  1.4  Arterial (brachial)  39.70 
Sammy  70  211  1  Arterial (brachial)  39.71 
Knut  242  288  2  Arterial (brachial)  35.77 
Jonesie  401  261  3.5  Arterial (brachial)  38.47 
Butler  621  151  8  Arterial (brachial)  39.74 
Mass
The data file DivingElephantSeals contains the mass in kilograms (kg) of the seals. How do we describe the distribution of the masses of the seals?
Use the data in the column labeled "Mass (kg)" to create a histogram and calculate the descriptive statistics.
Your histogram should look like this (a title and axis label have been added here):
[Figure: histograms of Mass (kg) of the Diving Elephant Seals, one created in Excel and one created in SPSS]
The descriptive statistics are summarized as follows:
[Figure: descriptive statistics output for Mass (kg) of the Diving Elephant Seals, one from Excel and one from SPSS]
Some of these statistics, such as the mean and median, should be familiar to you by now. In this section, you will learn about the standard deviation and the variance. Later in this reading assignment we will discuss the minimum, $Q_1$, median, $Q_3$, and the maximum.
The mean mass of the 13 seals is $ \bar x = 195.2$ kg. (Note that this value was rounded with one more decimal place of accuracy compared to the raw data.) The mean is a measure of the center of the distribution.
There is a fairly large difference in the masses of the different seals. The smallest seal has a mass of 148 kg. The largest has a mass of 288 kg.
The standard deviation is a measure of the spread in the distribution. If the standard deviation is relatively small, then the data tend to be close together. If the standard deviation is relatively large, the data tend to be more spread out.
The standard deviation of the seals' body mass is 45.8 kg. This number contains information from all the seals. If the seals' masses had been more diverse, the standard deviation would be larger. If the seals were more uniform in their masses, then the standard deviation would have been smaller. If all the seals somehow had the same mass, then the standard deviation would be zero.
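These claims can be checked with Python's `statistics` module, whose `stdev` function computes the sample standard deviation. The masses are taken from the "Mass (kg)" column of the table above; the list `identical` is a hypothetical data set in which every seal has the same mass.

```python
import statistics

# Mass (kg) of the 13 seals, in the order they appear in the table.
masses = [238, 162, 191, 160, 148, 158, 163, 226, 180, 211, 288, 261, 151]

mean_mass = statistics.mean(masses)
sd_mass = statistics.stdev(masses)      # sample standard deviation
print(f"mean = {mean_mass:.1f} kg, standard deviation = {sd_mass:.1f} kg")

# Hypothetical comparison: 13 seals that all share one mass have no spread.
identical = [195] * 13
sd_identical = statistics.stdev(identical)
print(f"standard deviation of identical masses = {sd_identical}")
```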
We are working with a sample. To be explicit, we call 45.8 kg the sample standard deviation. The symbol for the sample standard deviation is $s$. This is a statistic. The parameter representing the population standard deviation is $ \sigma $ (pronounced /SIGma/). In practice, we rarely know the value of the population standard deviation, so we use the sample standard deviation $ s $ as an approximation for the unknown population standard deviation $ \sigma $.
At this point, you probably do not have much intuition regarding the standard deviation. We will use this statistic frequently. By the end of the semester, you can expect to become very comfortable with this idea. For now, all you need to know is that if two variables are measured on the same scale, the variable with values that are further apart will have the larger standard deviation.
Man Vs. Seal
Roger Johnson compiled a collection of measurements on 252 men. The mean mass of the men was $ \bar x = 81.2 $ kg. The standard deviation of the weights was $ s = 13.3 $ kg.
Both the mean and the standard deviation of the masses are smaller for the men than for the seals. This tells us that the seals are generally more massive than the men and that their masses vary more. We compare these two distributions in the following histogram:
Weights of Men Compared to Weights of Seals
The mean mass of the men is less than the mean mass of the seals. We can see this, because the bulk of the data in the histogram for the men's masses is to the left of the seals'. You will also note that the masses of the seals are more spread out than the masses of the men.
Men  Seals  

Mean (kg)  81.2  195.2 
Standard Deviation (kg)  13.3  45.8 
Sample Size  252  13 
Length of Deployment (Days at Sea)
The variable "Length of Deployment (Days at Sea)" indicates the number of days researchers collected data from each subject. The next two questions ask you to use the file DivingElephantSeals to create a histogram and compute the summary statistics for the "Length of Deployment (Days at Sea)."
Answer the following questions:  

7. Using Excel or SPSS, create a histogram for the "Length of Deployment (Days at Sea)" data.
8. Find the missing value in the following table:

Representative Temperature
Consider the last column in the table of data. The "representative temperature" is the mean body temperature of the seal right before a new dive begins. The researchers were very interested in this information. One of the key things they studied was the change in the body temperature of the seals as they dove. They wanted to know if a decrease in their body temperature allows seals to dive for longer periods of time. It was important for them to establish a baseline temperature for each seal.
Answer the following questions:  

9. Create a histogram to illustrate the "Representative Temperature" data.

The standard deviation of the representative temperatures is $ s = 1.262\, ^\circ \text{C} $. When we consider body temperature measurements, this is a fairly small amount. Note that in this case, the standard deviation ($ s = 1.262\, ^\circ \text{C} $) is much smaller than the mean ($ \bar x = 38.212\, ^\circ \text{C} $).
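As a check on these two values, here is a short Python sketch that uses the temperatures from the last column of the seal table. The final line, which expresses $s$ as a percentage of the mean, is added only to illustrate what "much smaller than the mean" looks like numerically.

```python
import statistics

# Representative temperature (degrees C) of the 13 seals, in table order.
temps = [36.70, 37.19, 37.85, 37.91, 37.25, 38.98, 38.16,
         39.32, 39.70, 39.71, 35.77, 38.47, 39.74]

mean_temp = statistics.mean(temps)
sd_temp = statistics.stdev(temps)       # sample standard deviation

print(f"mean = {mean_temp:.3f}, s = {sd_temp:.3f}")
print(f"s is about {sd_temp / mean_temp:.1%} of the mean")
```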
5.1.1 Calculating the Standard Deviation by Hand
How is the standard deviation computed? Where does this "magic" number come from? How does one number include the information about the spread of all the points?
Bird Flu Fever
Avian Influenza A H5N1, commonly called the bird flu, is a deadly illness that is currently only passed to humans from infected birds. This illness is particularly dangerous because at some point it is likely to mutate to allow human-to-human transmission. Health officials worldwide are preparing for the possibility of a bird flu pandemic.
Dr. K. Y. Yuen led a team of researchers who reported the body temperatures of people admitted to Chinese hospitals with confirmed cases of Avian Influenza. Their research team collected data on the body temperature at the time that people with the bird flu were admitted to the hospital. In the article, they reported on two groups of people, those with relatively uncomplicated cases of the bird flu and those with severe cases.
The table below presents the data representative of the body temperatures for the two groups of bird flu patients:
Relatively Uncomplicated Cases  38.1  38.3  38.4  39.5  39.7 
Severe Cases  39.1  39.5  38.9  39.2  39.9  39.7  39.0 
We will use these data to investigate some measures of the spread in a data set.
Answer the following questions: 

11. Draw the number line below and illustrate the relatively uncomplicated cases by marking an “x” for each point. The first point has been plotted for you.

Think about the points you marked in the question above. On your sketch of the number line, draw a vertical line at 38.8 degrees, the sample mean. Now, draw horizontal lines from the mean to each of your $\times$'s. These horizontal line segments represent the spread of the data about the mean. Your plot should look something like this:
The length of each of the line segments represents how far each observation is from the mean. If the data are close together, these lines will be fairly short. If the distribution has a large spread, the line segments will be longer. The standard deviation is a measure of how long these lines are, as a whole. It is a little tedious to compute the standard deviation by hand. However, the process is very instructive. As you work through the following steps, please remember the goal is to find a measure of the spread in a data set. We want one number that describes how spread out the data are.
The deviation of an observation from the mean is the directed distance from the observation to the mean. In other words, deviations are the lengths of the line segments you drew in the previous set of questions.
$$ \begin{array}{lcl} \text{Deviation} & = & \text{Value} - \text{Mean} \\ \text{Deviation} & = & x - \bar x \end{array} $$
If the observed value is greater than the mean, the deviation is positive. If the value is less than the mean, the deviation is negative.
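A small Python sketch can make the sign convention concrete. It uses the five temperatures for the relatively uncomplicated cases; the variable names are chosen for this example, and the deviations are rounded to one decimal place only to hide floating-point noise.

```python
# Body temperatures (degrees C) for the relatively uncomplicated cases.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

mean_temp = sum(temps) / len(temps)

# Deviation = value - mean; negative below the mean, positive above it.
deviations = [round(x - mean_temp, 1) for x in temps]

for x, d in zip(temps, deviations):
    side = "below" if d < 0 else "above"
    print(f"{x} is {abs(d)} degrees {side} the mean: deviation = {d}")
```

The deviations always add up to zero (apart from rounding), which is why simply averaging them cannot measure the spread.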
The standard deviation is a complicated sort of average of the deviations. Making a table like the one below will help you keep track of your calculations. Please participate fully in this exercise. Writing your answers at each step and to developing a table as instructed will greatly enhance the learning experience. By following these steps, you will be able to compute the standard deviation by hand.
Step 01: The first step in computing the standard deviation by hand is to create a table, like the following. Enter the observed data in the first column.
Column 1  Column 2 
Observation ($x$)  Deviation from the Mean ($x-\bar x$) 
$38.1$  $38.1-38.8=-0.7$ 
$38.3$  
$38.4$  
$39.5$  
$39.7$  
$\bar x = 38.8$  
Step 02: The second column of the table contains the deviations from the mean. Complete column 2 of the table above.
Column 1  Column 2 
Observation ($x$)  Deviation from the Mean ($x-\bar x$) 
$38.1$  $38.1-38.8=-0.7$ 
$38.3$  $38.3-38.8=-0.5$ 
$38.4$  $38.4-38.8=-0.4$ 
$39.5$  $39.5-38.8=0.7$ 
$39.7$  $39.7-38.8=0.9$ 
$\bar x = 38.8$  
Answer the following questions: 

14. How could we use this table to find the "typical" distance from each point to the mean? Think carefully about this, and then write down your answer before continuing.
Please do not go on to Step 03 until you have finished this exploration. 
"Piled Higher and Deeper" by Jorge Cham
Step 03: Add a third column to your table. To get the values in this column, square the deviations from the mean that you found in Column 2.
Column 1  Column 2  Column 3 
Observation $x$  Deviation from the Mean $x-\bar x$  Squared Deviation from the Mean $\left(x-\bar x\right)^2$ 
$38.1$  $38.1-38.8=-0.7$  
$38.3$  $38.3-38.8=-0.5$  
$38.4$  $38.4-38.8=-0.4$  
$39.5$  $39.5-38.8=0.7$  
$39.7$  $39.7-38.8=0.9$  
$\bar x = 38.8$  Sum $=0$  
Column 1  Column 2  Column 3 
Observation $x$  Deviation from the Mean $x-\bar x$  Squared Deviation from the Mean $\left(x-\bar x\right)^2$ 
$38.1$  $38.1-38.8=-0.7$  $(-0.7)^2=0.49$ 
$38.3$  $38.3-38.8=-0.5$  $(-0.5)^2=0.25$ 
$38.4$  $38.4-38.8=-0.4$  $(-0.4)^2=0.16$ 
$39.5$  $39.5-38.8=0.7$  $(0.7)^2=0.49$ 
$39.7$  $39.7-38.8=0.9$  $(0.9)^2=0.81$ 
$\bar x = 38.8$  Sum $=0$  
Step 04: Now, add up the squared deviations from the mean.
Column 1  Column 2  Column 3 
Observation $x$  Deviation from the Mean $x-\bar x$  Squared Deviation from the Mean $\left(x-\bar x\right)^2$ 
$38.1$  $38.1-38.8=-0.7$  $(-0.7)^2=0.49$ 
$38.3$  $38.3-38.8=-0.5$  $(-0.5)^2=0.25$ 
$38.4$  $38.4-38.8=-0.4$  $(-0.4)^2=0.16$ 
$39.5$  $39.5-38.8=0.7$  $(0.7)^2=0.49$ 
$39.7$  $39.7-38.8=0.9$  $(0.9)^2=0.81$ 
$\bar x = 38.8$  Sum $=0$  Sum $=2.20$ 
The sum of the squared deviations is 2.20.
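Steps 03 and 04 can be mirrored in a few lines of Python. This sketch builds the third column and adds it up; the variable names are chosen for this example.

```python
# Body temperatures (degrees C) for the relatively uncomplicated cases.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]
mean_temp = sum(temps) / len(temps)

# Column 3: square each deviation so that negatives no longer cancel positives.
squared_deviations = [(x - mean_temp) ** 2 for x in temps]

# Step 04: add up Column 3.
sum_sq = sum(squared_deviations)

for x, sq in zip(temps, squared_deviations):
    print(f"{x}: squared deviation = {sq:.2f}")
print(f"Sum of squared deviations = {sum_sq:.2f}")
```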
Answer the following questions:  

16. Suppose that the researchers had collected body temperature data on 500 bird flu patients instead of 5. What would happen to the sum of the squared deviations, if the distribution of the data is the same for the 500 patients as the 5 patients?
17. What could we do to make sure the sample size does not inflate our estimate of the spread of the data?
Please do not go on until you have finished this exercise. 
Step 05: Divide the sum of the squared deviations by $n - 1$. Write this value at the bottom of Column 3 of your table.
The number you computed in Step 05 is called the sample variance. It is a measure of the spread in a data set. It has very nice theoretical properties. The variance plays an important role in Statistics. We denote the sample variance by the symbol $s^2$.
It can be shown that the sample variance is an unbiased estimator of the true population variance (which is denoted $\sigma^2$). This means that if you calculate the sample variances of all possible samples (for a given value of $n$) and compute their mean, the result is $\sigma^2$.
 The sum of the squared deviations is the sum of the values in Column 3. This sum equals 2.20. We divide the sum of Column 3 ($2.20$) by $n-1=5-1=4$ to get the sample variance, $s^2$:
$$ s^2=\frac{\text{sum}}{n-1}=\frac{2.20}{5-1}=0.55 $$
This is the sample variance.
Column 1  Column 2  Column 3 
Observation $x$  Deviation from the Mean $x-\bar x$  Squared Deviation from the Mean $\left(x-\bar x\right)^2$ 
$38.1$  $38.1-38.8=-0.7$  $(-0.7)^2=0.49$ 
$38.3$  $38.3-38.8=-0.5$  $(-0.5)^2=0.25$ 
$38.4$  $38.4-38.8=-0.4$  $(-0.4)^2=0.16$ 
$39.5$  $39.5-38.8=0.7$  $(0.7)^2=0.49$ 
$39.7$  $39.7-38.8=0.9$  $(0.9)^2=0.81$ 
$\bar x = 38.8$  Sum $=0$  Sum $=2.20$ 
    Variance: $\displaystyle{s^2=\frac{\text{sum}}{n-1}=\frac{2.20}{5-1}=0.55}$ 
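The division in Step 05 can be checked in Python. The sketch below repeats the hand calculation and then compares it with `statistics.variance`, which also divides by $n - 1$. The variable names are chosen for this example.

```python
import statistics

# Body temperatures (degrees C) for the relatively uncomplicated cases.
temps = [38.1, 38.3, 38.4, 39.5, 39.7]

n = len(temps)
mean_temp = sum(temps) / n
sum_sq = sum((x - mean_temp) ** 2 for x in temps)

# Step 05: divide the sum of squared deviations by n - 1, not by n.
sample_variance = sum_sq / (n - 1)
print(f"By hand:              s^2 = {sample_variance:.2f}")

# Cross-check against the standard library.
print(f"statistics.variance:  s^2 = {statistics.variance(temps):.2f}")
```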
Answer the following questions: 

18. The temperature data for the bird flu patients are in degrees Centigrade. What are the units of the variance?

Step 06: Take the square root of the sample variance to get the sample standard deviation.
The sample standard deviation is defined as the square root of the sample variance.
$$ \text{Sample Standard Deviation} = s = \sqrt{ s^2 } = \sqrt{\strut\text{Sample Variance}} $$
The standard deviation has the same units as the original observations. We use the standard deviation heavily in statistics.
The sample standard deviation ($s$) is an estimate of the true population standard deviation ($\sigma$).
Answer the following questions:  

20. What is the sample standard deviation, $s$, of the temperatures of the five patients with relatively uncomplicated cases of the bird flu?
$$ s^2=\frac{\text{sum}}{n-1}=\frac{2.20}{5-1}=0.55 $$

5.1.2 Summary
Standard Deviation
The standard deviation is one number that describes the spread in a set of data. If the data points are close together, the standard deviation will be smaller than if they are spread out.
At this point, it may be difficult to understand the meaning and usefulness of the standard deviation. For now, it is enough for you to recognize the following points:
 The standard deviation is a measure of how spread out the data are.
 If the standard deviation is large, then the data are very spread out.
 If the standard deviation is zero, then all the values are identical; there is no spread in the data.
 The standard deviation cannot be negative.
Variance
The variance is the square of the standard deviation. The sample variance is denoted by the symbol $s^2$. The sample standard deviation for the GPAs in the histogram above is $ s = 0.6634 $. So, the sample variance for this data set is $s^2 = 0.6634^2 = 0.4401 $.
The standard deviation and variance are two commonly used measures of the spread in a data set. Why is there more than one measure of the spread? The standard deviation and the variance each have their own pros and cons.
The variance has excellent theoretical properties. It is an unbiased estimator of the true population variance. That means that if many, many samples of $n$ observations were drawn, the variances computed for all the samples would be centered nicely around the true population variance, $\sigma^2$. Because of these benefits, the variance is regularly used in higher-level statistics applications. One drawback of the variance is that the units for the variance are the square of the units for the original data. In the bird flu example, the body temperatures were measured in degrees Centigrade. So, the variance will have units of degrees Centigrade squared $(^\circ \text{C})^2$. What does degrees Centigrade squared mean? How do you interpret this? It doesn't make any sense. This is one of the major drawbacks of the sample variance.
Because we take the square root of the variance to get the standard deviation, the standard deviation is in the same units as the original data. This is a great advantage, and is one of the reasons that the standard deviation is commonly used to describe the spread of data.
Neither the standard deviation nor the variance is resistant to outliers. This means that when there are outliers in the data set, the standard deviation and the variance become artificially large. It is worth noting that the mean is also not resistant. When there are outliers, the mean will be "pulled" in the direction of the outliers.
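To see what "not resistant" means in practice, the sketch below compares the mean, median, and standard deviation of a small data set before and after one outlier is added. The numbers are hypothetical and were made up for this illustration.

```python
import statistics

# Hypothetical data: five typical values, then the same values plus one outlier.
typical = [10, 12, 13, 14, 16]
with_outlier = typical + [100]

for label, data in [("typical", typical), ("with outlier", with_outlier)]:
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "s =", round(statistics.stdev(data), 2),
    )
```

The mean and the standard deviation change a great deal when the outlier is added, while the median barely moves.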
The mean and standard deviation are used to describe the center and spread when the distribution of the data is symmetric and bell-shaped. If the data are not symmetric and bell-shaped, we typically use the five-number summary (discussed below) to describe the spread, because this summary is resistant.
Review of Parameters and Statistics
We have now learned some statistics that can be used to estimate population parameters. For example, we use $ \bar x $ to estimate the population mean $ \mu $. The sample statistic $s$ estimates the true population standard deviation $\sigma$. The following table summarizes what we have done so far:
Measure  Sample Statistic  Population Parameter  

Mean  $ \bar x $  $ \mu $ 
Standard Deviation  $ s $  $ \sigma $ 
Variance  $ s^2 $  $ \sigma^2 $ 
$ \vdots $  $ \vdots $  $ \vdots $ 
Unless otherwise specified, we will always use Excel or SPSS to find the sample variance and sample mean. In each case, the sample statistic estimates the population parameter. The ellipses $ \vdots $ in this table hint that we will add rows in the future.
Optional Reading: Formulas for $s$ and $s^2$
Formulas
For those who like formulas, the equations for the sample variance and sample standard deviation are given here.
Sample variance:
$$\displaystyle{ s^2=\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1} } $$
Sample standard deviation:
$$\displaystyle{ s=\sqrt{s^2}=\sqrt{\frac{\sum\limits_{i=1}^n (x_i-\bar x)^2}{n-1}} } $$
where $x_i$ is the $i^{th}$ observed data value, and $i=1, 2, \ldots, n$.
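For readers who want to trace the formulas step by step, here is a minimal Python sketch. The function names `sample_variance` and `sample_std_dev` and the data values are hypothetical, introduced only for this illustration. Each function squares the deviations from the sample mean, adds them up, and divides by $n-1$.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                                   # sample mean
    squared_deviations = [(x - x_bar) ** 2 for x in values]   # (x_i - x_bar)^2
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 4, 5, 10]  # hypothetical observations

print(sample_variance(data), statistics.variance(data))
print(sample_std_dev(data), statistics.stdev(data))
```

The built-in `statistics.variance` and `statistics.stdev` functions use the same $n-1$ divisor, so each pair of printed values should agree.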
Unless otherwise specified, we will always use Excel or SPSS to find the sample variance and sample mean.
Why do we divide by $n-1$?
When computing the standard deviation or the variance, we are finding a value that describes the spread of data values. It is a measure of how far the data are from the mean. Since we do not know the true mean ($\mu$), we use the sample mean ($\bar x$) to estimate it. Typically, the data will be closer to $\bar x$ than to $\mu$, since $\bar x$ was computed using the data. To compensate for this, we divide by $n-1$ rather than $n$ when we find the "average" of the squared deviations from the mean. It turns out that subtracting 1 from $n$ inflates this average by the precise amount needed to compensate for the use of $\bar x$ as an estimate for $\mu$. As a result, the sample variance ($s^2$) is a good estimator of the population variance ($\sigma^2$).
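A small simulation can make this argument concrete. The sketch below is an illustration under stated assumptions, not part of the course software: it builds a hypothetical population, repeatedly draws samples of size 5 with replacement, and averages the results of dividing the sum of squared deviations by $n$ and by $n-1$.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20,000 values; its variance plays the role of sigma^2.
population = [random.gauss(100, 15) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma_squared = statistics.pvariance(population, mu)

n = 5
divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(20_000):
    sample = random.choices(population, k=n)   # simple sampling with replacement
    x_bar = statistics.fmean(sample)
    sum_sq = sum((x - x_bar) ** 2 for x in sample)
    divide_by_n.append(sum_sq / n)
    divide_by_n_minus_1.append(sum_sq / (n - 1))

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print("population variance:      ", round(sigma_squared, 1))
print("average when dividing by n:  ", round(avg_n, 1))
print("average when dividing by n-1:", round(avg_n_minus_1, 1))
```

In runs like this one, the average obtained by dividing by $n$ tends to fall noticeably below the population variance, while the average obtained by dividing by $n-1$ lands close to it.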
6 Tools to Describe the Data
Recall the five steps of the Statistical Process (and the mnemonic "Daniel Can Discern More Truth").
Step 1:  Daniel  Design the study 
Step 2:  Can  Collect data 
Step 3:  Discern  Describe the data 
Step 4:  More  Make inferences 
Step 5:  Truth  Take action 
Step 3 of this process is "Describe the data." The following information on percentiles, quartiles, five-number summaries, and boxplots will help you learn common ways to describe data.
Wrong Site/Wrong Patient Lawsuits
For symmetric, bell-shaped data, the mean and standard deviation provide a good description of the center and spread of the distribution. The mean and standard deviation are not sufficient to describe a distribution that is skewed or has outliers. An outlier is any observation that is very far from the others. The mean is “pulled” in the direction of the outlier. Also, the standard deviation is inflated by points that are very far from the mean.
Percentiles can be used to describe the center and spread of any distribution and are particularly useful when the distribution is skewed or has outliers. To explore this issue, you will use software to calculate percentiles of data on costs incurred by hospitals due to certain lawsuits. The lawsuits in question were about surgeries performed on the wrong patient, or on the right patient but the wrong part of the patient's body (the wrong site).
Now, you have probably had some experience with percentiles in the past—especially when you received a score on a standardized test such as the ACT. Even though percentiles are commonly used, they are generally misunderstood. Before examining the wrong site/wrong patient data, let's review percentiles. Even if you think you understand percentiles, please study this section carefully.
6.1 Percentiles and Quartiles
Imagine a very long street with houses on one side. The houses increase in value from left to right. At the left end of the street is a small cardboard box with a leaky roof. Next door is a slightly larger cardboard box that does not leak. The houses eventually get larger and more valuable. The rightmost house on the street is a huge mansion.
Answer the following question: 

21. There are 100 homes with increasing property values. How many fences are needed to separate the 100 properties?

The home values are representative of data. If we have a list of data, sorted in increasing order, and we want to divide it into 100 equal groups, we only need 99 dividers (like fences) to divide up the data. The first divider is as large or larger than 1% of the data. The second divider is as large or larger than 2% of the data, and so on. The last divider, the $99^{th}$, is the value that is as large or larger than 99% of the data. These “dividers” are called percentiles. A percentile is a number such that a specified percentage of the data are at or below this number. For example, the $99^{th}$ percentile is a number such that 99% of the data are at or below this value. As another example, half (50%) of the data lie at or below the $50^{th}$ percentile. The word “percent” means “$\div 100$.” This can help you remember that the percentiles divide the data into 100 equal groups.
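The definition "a specified percentage of the data are at or below this number" can be checked directly by counting. In the sketch below, the helper `percent_at_or_below` and the list of 100 home values are hypothetical, made up to mirror the street of houses described above.

```python
def percent_at_or_below(values, cutoff):
    """Percentage of the values that are less than or equal to cutoff."""
    count = sum(1 for v in values if v <= cutoff)
    return 100 * count / len(values)


# Hypothetical street of 100 homes valued at 1, 2, ..., 100 (thousands of dollars).
home_values = list(range(1, 101))

for cutoff in (25, 50, 99):
    print(cutoff, percent_at_or_below(home_values, cutoff))
```

For this evenly spaced list, the value 25 has 25% of the homes at or below it, so it behaves like the 25th percentile; real data are rarely this tidy, which is one reason software packages can report slightly different percentiles.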
Quartiles are special percentiles. The word “quartile” is from the Latin quartus, which means "fourth." The quartiles divide the data into four equal groups. The quartiles correspond to specific percentiles. The first quartile, $Q_1$, is the $25^{th}$ percentile. The second quartile, $Q_2$, is the same as the $50^{th}$ percentile or the median. The third quartile, $Q_3$, is equivalent to the $75^{th}$ percentile.
Answer the following questions: 

22. How many quartiles are there?

 To calculate percentiles and quartiles in Excel, do the following:
 Open the data file you are using. For this example, open the file WrongSiteWrongPatient.xlsx
 Copy and paste the desired data into Column A of the spreadsheet QuantitativeDescriptiveStatistics.xls.
 The percentiles and quartiles are listed below the Numerical Descriptive Statistics. Scroll to find the percentile you need.
 You may notice that some of the values for percentiles given in Excel are different from those given in SPSS. This is due to the different ways in which Excel and SPSS calculate percentiles. Since you are using Excel, be sure to use the percentiles calculated in Excel.
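The point that different programs can report different percentiles is easy to reproduce. Python's `statistics.quantiles` function offers two interpolation rules, named `"inclusive"` and `"exclusive"`. The sketch below applies both to the same hypothetical data (the values are invented, not the wrong-site data). It is meant only to show that reasonable rules can disagree, not to state which rule Excel or SPSS applies.

```python
import statistics

# Hypothetical costs in thousands of dollars, already sorted.
data = [0, 0, 5, 12, 20, 31, 45, 60, 90, 250]

# n=4 asks for the three cut points that split the data into four groups.
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")

print("inclusive rule:", q_inclusive)
print("exclusive rule:", q_exclusive)
```

Both rules give the same median for this data set, but the first and third quartiles differ. This is why you should report the values produced by the software designated for your class.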
The first quartile ($Q_1$), or $25^{th}$ percentile, of the wrong-site data (calculated in Excel) is $29,496. (This result is illustrated in the figure below.) This means that 25 percent of the time hospitals lost a wrong-site lawsuit, they had to pay $29,496 or less. The $25^{th}$ percentile can be written symbolically as $P_{25}$ = $29,496. Other percentiles can be written the same way. The $99^{th}$ percentile can be written as $P_{99}$.
1st percentile  0 
2nd percentile  0 
3rd percentile  0 
...  ... 
24th percentile  28633.4 
25th percentile  29496 
26th percentile  31067 
Answer the following questions:  

23. What is the $13^{th}$ percentile of the wrong site data?

6.2 The Five-Number Summary
Another way to summarize data is with the five-number summary. The five-number summary consists of the minimum, the first quartile, the second quartile (or median), the third quartile, and the maximum. The values in the five-number summary are always presented in this order. Since the order of the numbers in the five-number summary is fixed, it is not necessary to label each of the values individually.
Statistical packages can give different results for some computations. This is because there are several reasonable ways to define certain quantities, such as the quartiles. As such, you may find that some of the values that are given in the software you use are different than what the other software may give. Please be sure to use the values that you calculate using the software designated for your particular class (Excel for 221 and SPSS for 222/223).
 To find the values for a five-number summary in Excel, do the following:
 Copy and paste the desired data into Column A of the spreadsheet QuantitativeDescriptiveStatistics.xls.
 The minimum, maximum, and quartiles are listed in the Numerical Descriptive Statistics area.
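Outside of a spreadsheet, the same five values can be collected with a few lines of Python. This is a sketch under assumptions: the function name `five_number_summary` and the data are hypothetical, and the quartiles use the `"inclusive"` rule of `statistics.quantiles`, which may not match the rule used by your class software.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), always in this order."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical costs in thousands of dollars (not the wrong-site data).
costs = [0, 0, 5, 12, 20, 31, 45, 60, 90, 250]

print(five_number_summary(costs))
```

Because the order is fixed, the result is returned as a plain tuple with no labels.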
Answer the following questions:  

30. Give the five-number summary for the Wrong Site data.

Caution
As a caution, some students mistakenly include the mean in the five-number summary. The third value in the five-number summary is the median.
6.3 Boxplots
A boxplot is a graphical representation of the five-number summary. Unlike the mean or standard deviation, a boxplot is resistant to outliers. That means that it won't be "pulled" one way or the other by extraordinarily large or small values in the data, as a mean will be, for instance. We will illustrate the process of making a boxplot using the wrong-site data.
Follow the steps below to learn how to draw a boxplot.
 Step 01: To draw a boxplot, start with a number line.
 Step 02: Draw a vertical line segment above each of the quartiles.
 Step 03: Connect the tops and bottoms of the line segments, making a box.
 Step 04: Make a smaller mark above the values corresponding to the minimum and the maximum.
 Step 05: Draw a line from the left side of the box to the minimum, and draw another line from the right side of the box to the maximum.
 Step 06: These last two lines look like whiskers, so this is sometimes called a box-and-whisker plot.
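The steps above can be mimicked with a rough text drawing. The following Python sketch is only an illustration: the function `ascii_boxplot`, its `width` parameter, and the data are hypothetical, the quartiles use the `"inclusive"` rule of `statistics.quantiles`, and the code assumes the data are not all identical.

```python
import statistics


def ascii_boxplot(values, width=50):
    """Draw a one-line text boxplot from the five-number summary."""
    low, high = min(values), max(values)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

    def column(x):
        # Step 01: place a data value on a number line that is `width` characters long.
        return round((x - low) / (high - low) * (width - 1))

    line = [" "] * width
    for c in range(column(low), column(high) + 1):
        line[c] = "-"              # Step 05: whiskers reach from the minimum to the maximum
    for c in range(column(q1), column(q3) + 1):
        line[c] = "="              # Step 03: the box spans Q1 to Q3
    line[column(low)] = "|"        # Step 04: marks at the minimum and the maximum
    line[column(high)] = "|"
    line[column(q1)] = "["         # Step 02: marks at the quartiles
    line[column(q3)] = "]"
    line[column(median)] = "M"
    return "".join(line)


# Hypothetical costs in thousands of dollars (not the wrong-site data).
costs = [0, 0, 5, 12, 20, 31, 45, 60, 90, 250]
plot = ascii_boxplot(costs)
print(plot)
```

For right-skewed data like this, the box sits near the left end and the right whisker is much longer than the left one.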
7 Summary
 A histogram allows us to visually interpret data. Histograms can be left-skewed, right-skewed, or symmetrical and bell-shaped.
 The mean, median, and mode are measures of the center of a distribution. The mean is the most common measure of center, and is computed by adding up the observed data and dividing by the number of observations in the data set.
 The standard deviation is a number that describes how spread out the data are. A larger standard deviation means the data are more spread out than data with a smaller standard deviation.
 A parameter is a true (but usually unknown) number that describes a population. A statistic is an estimate of a parameter obtained from a sample.
 Quartiles/percentiles, Five-Number Summaries, and Boxplots are tools that help us understand data. The five-number summary of a data set contains the minimum value, the first quartile, the median, the third quartile, and the maximum value. A boxplot is a graphical representation of the five-number summary.