Lesson 21: Describing Bivariate Data: Scatterplots, Correlation, & Covariance

From BYU-I Statistics Text
Jump to: navigation, search

These optional videos discuss the contents of this lesson.



 

1 Lesson Outcomes

By the end of this lesson, you should be able to:

  • Create a scatterplot of bivariate data.
  • Interpret the overall pattern in a scatter plot to assess linearity and direction.
  • Calculate the correlation coefficient.
  • Interpret the correlation coefficient, $r$, as a measure of strength and direction of a linear relationship between two variables.


2 How Confident are You?

Think about a time when you walked into a exam, having prepared carefully, knowing that you would do well. On the other hand, have you ever entered an exam feeling unprepared? How have your exam scores compared to your confidence?

Testtaking-exam.jpg

Shane Goodwin and other researchers examined this question. They studied factors that affect a student's confidence on a multiple-choice mathematics exam. A group of n = 139 students in an Intermediate Algebra course (MATH 101) at BYU-Idaho participated in the study. In addition to marking their test question responses, they evaluated their confidence for each answer on a scale of 1 to 6. The confidence rating scale is summarized in the following table:

Confidence Rating
1 Random guess (no clue)
2 Very unsure
3 Somewhat unsure
4 Somewhat sure
5 Very sure
6 Certain (absolutely sure)

Confidence ratings were not relayed to the instructor, and they did not affect the grade on the exam.

For each student, the mean confidence rating was computed. This mean confidence rating and their score on the exam (out of 100 points) are given in the file MathSelfEfficacy.

Previously, we have been dealing with one response variable at a time. Now, we have two quantitative measurements on each unit (participant). We call these data bivariate data, since there are two (bi-) variables that we are considering simultaneously.

In the past, we have summarized quantitative data by computing summary statistics. Here are a couple of statistics computed from these data:

  • The mean score on the test was 74.7 points.
  • The mean confidence rating was 4.4.

These statistics do not provide information about the connection between the students' scores on the exam and their confidence. If a student feels very confident, what do the data tell us about their test score? We need a new tool to help us relate the values of two quantitative observations. When we have two quantitative measurements on a unit, we have bivariate data.

2.1 Describing Bivariate Data

2.1.1 Describing Form: Scatterplots

The following scatterplot illustrates the data from the Goodwin study.

SelfEfficacy-Scatterplot-SPSS.png

Each point in the plot represents both the confidence and score of one student. The points are plotted on the X-Y coordinate plane. The position on the horizontal (X) axis represents the student's confidence rating. The height of the point or the value on the Y-axis, represents the student's score on the exam. (We will explore how to create a scatterplot in the next example. For now, focus on understanding the interpretation of the graph.)

The cloud of data illustrated in this scatterplot help us visualize the relationship between the student's confidence rating and their score on the exam. Notice that the points tend to be higher as you move to the right. Students who have a high confidence rating (points further to the right) tend to have higher exam scores (higher vertical position). Similarly, students with lower confidence typically have lower exam scores. Notice that as a students’ confidence increases, their exam score tends to increase. We call this a positive association or a positive correlation.

Notice that there is variability in the responses. Consider the students who have a mean confidence rating of 5.0. The points above this number represent those students who reported a mean confidence value of about 5. There is variability in the exam scores of these students. They range from about 75 to approximately 100.

In the scatterplot, we see a cloud of data. Even though there is a considerable amount of variability in the data, the points tend to follow a line. If you squint with your eyes, you might imagine that the data look like a fat hot dog.

Scatter-HotDog.png

When the points in a scatterplot follow a straight pattern, we say that there is a linear relationship in the data. Data are considered linearly related if the points in the scatterplot follow a straight line. The points do not have to be aligned tightly to represent a linear relationship. They can be in the shape of a long skinny cucumber or a short, fat cucumber. Both broad and narrow clouds of data can be considered linear.

2.1.2 How to make a scatterplot

Estuarine Crocodile Data

Data for estuarine, or saltwater, crocodiles is given in the file EstuarineCrocodile. We will illustrate the relationship between the head length of the crocodiles and their body lengths by creating a scatterplot.



Excel Instructions
To make a scatterplot in Excel:
  • Open the QuantitativeInferentialProcedures.xls file.
  • Put the variable that you want on the x-axis (head length) in the "Explanatory Variable" column.
  • Put the variable that you want on the y-axis (body length) in the "Response Variable" column. This is the value you want to predict.
  • The points in the scatterplot will update with your data.
Here is the scatterplot for the estuarine crocodile data. The head length in centimeters is on the horizontal (x) axis and the body length in centimeters is on the vertical (y) axis.
EstuarineCrocodile-Scatterplot-Excel.png



Notice that there is a strong positive linear relationship between the head lengths and the body lengths of the crocodiles. That fact will be important in the next lesson.

We want to be able to describe the relationship between the variables. The first thing we look for is the shape or form observed in the scatterplot. Is it linear or nonlinear?

In many cases, the points a scatterplot do not follow a straight line. If the data form a curved shape, e.g. a banana shape, we say that there is a nonlinear relationship in the data. The methods presented in this course do not directly apply to nonlinear data. As a professional, you may encounter nonlinear data. Math 325 Intermediate Statistical Methods includes ways to handle nonlinear relationships. If you cannot take additional statistics courses, you should consult a statistician if you want to analyze the relationships observed in nonlinear bivariate data.

In addition to the shape or the form of the data observed in the scatterplot, we need to be able to describe the direction and strength of a linear relationship in data. We use the correlation coefficient to quantify the direction and strength of the relationship. These ideas are discussed below.

2.1.3 Describing Direction: Scatterplots and the Correlation Coefficient

We say that the direction of data in a scatterplot is positive or there is a positive association between two variables when an increase in one variable tends to lead to an increase in the other variable. We observed a positive association in Goodwin’s confidence data.

The correlation coefficient is a number that is used to measure the direction and strength of the linear association between two variables. The direction is either positive, negative, or neither. The strength can be described as weak, moderate, or strong.

Correlation coefficients are always between $-1$ and $1$.

We will use software to compute the correlation coefficient. For the Goodwin data, the correlation coefficient is: $$ r = 0.728 $$ We use the symbol $ r $ to represent the correlation coefficient. In this reading, we will explore the correlation coefficient, including its properties and interpretation.

When a positive association exists in the data, the correlation coefficient will be positive. As an example, a positive association was observed in Goodwin’s data and $ r = 0.728 > 0 $.

There are many examples of positive associations. It has been demonstrated that a student’s level of motivation is positively associated with academic success . Students who are highly motivated tend to do better academically. As another example, there is a positive association between the height of a person and their weight. If someone's height increases, we would expect that their weight would typically increase as well.

When an increase in one variable is associated with a decrease in the other variable, we say that there is a negative association between the two variables. Several studies have demonstrated that there is a negative association between the amount of time spent playing video games play and academic performance. Students who spend a lot of time on video games tend to do worse in school than their peers who do not spend much time gaming.

2.1.4 Describing Strength: The Correlation Coefficient

We also describe the relationship between two variables as weak, moderate, or strong, depending on how close the relationship between the variables is. The strength of the linear relationship is also described in the correlation coefficient.

The correlation coefficient is always between $ -1 $ and $ 1 $. If there is a strong positive association, the correlation coefficient will be close to $1$. If the correlation coefficient is positive but relatively close to 0, we say there is a weak positive association in the data.

Similarly, if the correlation coefficient is close to $-1$, we say there is a strong negative association. A weak negative association results in a correlation coefficient that is negative but close to 0.

We will not establish cut-off values to determine when a correlation goes from being weak to moderate or from moderate to strong. This depends upon the application and is very subjective.

Several scatterplots have been created, and the correlation coefficient summarizing the relationship between the two variables is presented. Study these graphs to see if you can infer some of the properties of the correlation coefficient.


Figure 1 Figure 2 Figure 3
CorrR-100.png CorrR-96.png CorrR-80.png
$r = -1.00$ $r = -0.96$ $r = -0.80$



Figure 4 Figure 5 Figure 6
CorrR-50.png CorrR00.png CorrR50.png
$r = -0.50$ $r=0$ $r = 0.50$



Figure 7 Figure 8 Figure 9
CorrR80.png CorrR96.png CorrR100.png
$r = 0.80$ $r = 0.96$ $r = 1.00$


Answer the following questions:
1. If $r$ is positive, and $X$ increases, what do we expect will happen with $Y$? Does this represent a positive or negative association?
When $r$ is positive, if $X$ increases then $Y$ will also increase. This represents a positive association.


2. If $r$ is negative, and $X$ increases, what do we expect will happen with $Y$? Does this represent a positive or negative association?
When $r$ is negative, if $X$ increases then $Y$ will decrease. This represents a negative association.


3. If $r \approx 0$, what can we conclude about the strength of the relationship between $X$ and $Y$?
We can conclude that there is no relationship between $X$ and $Y$ or that it there is a very weak relationship.


4. If $r \approx 1$, what can we conclude about the strength of the relationship between $X$ and $Y$?
We can conclude that the relationship between $X$ and $Y$ is very strong and positive.


5. If $r \approx -1$, what can we conclude about the strength of the relationship between $X$ and $Y$?
We can conclude that the relationship between $X$ and $Y$ is very strong. The reason it is negative is because it is negatively associated, not because it is a weak association.


6. If $r = 1$, what can we conclude about the strength of the relationship between $X$ and $Y$?
If $r = 1$, there is a perfect positive linear relationship between $X$ and $Y$. All the points are in an upward-sloping line.


7. If $r = -1$, what can we conclude about the strength of the relationship between $X$ and $Y$?
If $r = -1$, there is a perfect negative linear relationship between $X$ and $Y$. All the points are in a line that slopes downward.

 


Phd1641.png

2.2 More on the Correlation Coefficient

2.2.1 Calculating the Correlation Coefficient


Excel Instructions
The correlation coefficient can easily be calculated in Excel.
Using any cell in an excel page you can calculate the Correlation coefficient. The command is = CORREL (array 1, array2). For array 1 highlight the explanatory variable (x) and for array 2 highlight the response variable (y). Here is a screen shot example of finding the correlation coefficient of the Old Faithful data.
OldFaithfulCorrelation-Excel.png
An easier way to do this is to use the file QuantitativeInferentialProcedures.xls. After entering the data in the first two columns, the correlation coefficient is given as "r" in cell E8 of that spreadsheet.



2.2.2 Effect of Outliers

An outlier is any point that is very far from the others. Each of the following scatterplots shows data where there is one outlier present. Notice how one point can influence the correlation coefficient. Imagining that the outlier was removed from each of the following plots, estimate the correlation coefficient in your mind. Compare that value to the specified correlation coefficient with the outlier included.


Figure 10 Figure 11 Figure 12
CorrOutliersR-37.png CorrOutliersR62.png CorrOutliersR88.png
$r = -0.37$ $r = +0.62$ $r = +0.88$
Answer the following question:
8. What can you say about the effect of outliers on the correlation coefficient?
Outliers can dramatically affect the value of the correlation coefficient. It is possible for one outlier to change a correlation coefficient from a strong positive value to a negative value. It is possible for one outlier to change uncorrelated data to a data set that has a very high correlation.

 


2.2.3 Nonlinear Relationships

The following figures illustrate possible situations where the relationship between two variables does not follow a straight line. There can be a very strong relationship between the variables and still not have a strong correlation. Conversely, you can have a correlation coefficient that is close to zero, even though there is a perfect nonlinear association between the data.


Figure 13 Figure 14
CorrNonlinStrongRelWeakCorr.png CorrNonlinStrongRelStrongCorr.png
$r = +0.01$ $r = +0.86$



Figure 15 Figure 16
CorrNonlinModRelWeakCorr.png CorrNonlinModRelStrongCorr.png
$r = +0.04$ $r = +0.74$
Answer the following questions:
9. Out of Figures 13-16, which scatterplots show a very strong nonlinear relationship?
All four figures show distinct nonlinear relations. Figures 13 and 14 show very strong nonlinear relationships.


10. Out of Figures 13-16, which scatterplots illustrate data with the highest correlation coefficient?
Figures 14 and 16 both have relatively high correlation coefficient, $ r = 0.86 $.


11. How well does the correlation coefficient measure the strength of the nonlinear relationship between two variables?
The correlation coefficient does not measure the strength of nonlinear relationships. It only measures the strength of the linear association.

 


2.2.4 Properties of the Correlation Coefficient

We will summarize the properties of the correlation coefficient, $r$:

  • $r$ is a number between $-1$ and $1$
  • positive values of $r$ imply a positive linear relationship between the two variables
  • negative values of $r$ imply a negative linear relationship between the two variables
  • values of $r$ close to zero suggest there is a weak correlation between the two variables
  • if $r$ is close to $1$, it is evidence of a strong positive linear relationship between the two variables
  • if $r$ is close to $-1$, there is evident of a strong negative linear relationship between the two variables
  • if $r$ equals $1$ or $-1$, then there is a perfect linear relationship between the two variables (the points are all in a line)
  • the correlation of $X$ and $Y$ is the same as the correlation between $Y$ and $X$ (i.e.there is no distinction between explanatory and response variables.)
  • the correlation coefficient measures the strength of the linear relationship between two variables; it does not give the strength of a nonlinear relationship, no matter how strong
  • the correlation coefficient is affected by outliers

2.2.5 Sample Statistic and Population Parameter

The correlation coefficient, $r$, is a sample statistic. It is sometimes called the sample correlation coefficient. The value of $ r $ is computed using data. It is an estimate of the population correlation coefficient, which we will denote as $ \rho $. Usually, we do not know $ \rho $.

3 Predicting Old Faithful

Eruption of the Old Faithful Geyser
(Photo credit: William S. Keller, U.S. National Park Service)

A geyser is a hot spring that periodically erupts a mixture of hot water and steam. Old Faithful in Yellowstone National Park is the world's most famous geyser. This geyser earned its name from the predictability of the waiting time between its eruptions. At the Old Faithful Visitors Center, there is a sign predicting when the next eruption will occur. Rangers observing the behavior of the geyser maintain this sign for the convenience of park visitors.

The amount of time between eruptions (wait time) is random. However, it can be predicted, give-or-take a reasonable error bound.

Researchers observed 272 eruptions of this geyser . The researchers recorded the duration of each eruption (in minutes) and the waiting time until the next eruption (in minutes.) The data are given in the file OldFaithful.

Answer the following questions:
12. Suppose the U.S.Park Service has asked you to improve the way they predict eruptions of Old Faithful. What analyses would you conduct to better predict the wait time until the next eruption?
Answers will vary.
A natural way to predict the waiting times until the next eruption is to compute the mean and median wait times. We can compute the standard deviation of the waiting times. We can even make a 95% confidence interval for the true mean waiting times.


13. Conduct the analysis you proposed in Question 1. What is your prediction for the waiting time between eruptions?
The mean waiting times for the $n=272$ observations in the data set is $\bar x = 70.9$ minutes. A 95% confidence interval for the true mean waiting time is: $(69.3,72.5)$ minutes. We are 95% confidence the true mean waiting time between eruptions is between $69.3$ and $72.5$ minutes. This is a true statement, but does it provide a good prediction? To assess this, we create a histogram of the waiting times:

OldFaithfulHistogram.png

Since this distribution is bimodal, the mean of the waiting times does not provide the most useful prediction tool.


14. How would you describe the shape of the distribution of the waiting times?
The shape of the distribution is bimodal. This can be seen in the histogram computed as part of the solution to question 2.

 


There appear to be two peaks in the histogram representing the waiting times. This bimodal distribution is curious. Why should there be a bimodal distribution in the duration of the eruptions? Could there be a relationship between the length of an eruption and the waiting time until the next eruption?

This can be explored using a scatterplot.

For the Old Faithful data, we plot the eruption duration on the $X$-axis and the waiting time before the next eruption on the $Y$-axis of a scatterplot.

OldFaithfulEruptions-Scatterplot-Excel.png

Each point in the plot represents both the actual eruption time and the wait time until the next eruption of Old Faithful. The position of the point on the horizontal (X) axis represents the duration of the eruption, and the height of the point on the vertical (Y) axis represents the wait time for the next eruption. This helps us to visualize the relationship between the wait time and the duration of the eruption. Notice that when the wait time increases, so does the eruption duration.

Answer the following questions:
15. What do you observe in the scatterplot? Are there any features that draw your attention?
The scatter-plot shows that there are two groups of data points and that the points are going up and to the right, showing that they are positively associated.


16. Based on the scatterplot, does $\bar x = 70.9$ minutes seem like a good estimate of the mean waiting time between eruptions?
70.9 minutes is right in the middle of the two groups of data, so it might be a good "average", but it looks like it is very rare for the waiting time to actually be 70.9 minutes.

 


4 Covariance

The English idiom, "Don't put all your eggs in one basket" counsels you to avoid investing all your time or money in one thing. It is better to spread your resources across many opportunities to avoid losing everything. This advice is very appropriate in the financial markets.

When considering a portfolio of stocks and bonds, it is wise to diversify the investments. If one company or one sector of the economy declines, a diversified portfolio involving many different stocks and bonds can help minimize losses. It is important for investors to understand the risk associated with a particular portfolio. We can use the variability of the investments as a measure of the risk of a portfolio.

In a financial market, investments do not act independently; they vary together. For example, if the value of Apple Computers' stock drops, Microsoft's stock is also likely to decrease in value. When making investment decisions, it is important to take into account the interrelationships among the investments.

The covariance is a measure of how two variables (such as stock returns) vary together. The covariance of two variables, $ X $ and $ Y $, is calculated by multiplying the following three items together:

  • the correlation coefficient of $ X $ and $ Y $
  • the standard deviation of $ X $, and
  • the standard deviation of $ Y $

We compute the covariance for a data set using the formula: $$ s_{xy} = r \cdot s_x \cdot s_y $$

4.1 Example: Math Self Efficacy

As an example, we will compute the covariance for Goodwin’s data. The correlation coefficient was determined to be $ r = 0.728 $. Using the file MathSelfEfficacy, we compute the standard deviation of the mean confidence rating ($X$) to be $ s_x = 0.939 $ and the standard deviation of the test scores ($Y$) to be $ s_y = 16.37$. Applying the equation for the covariance of a collection of data, we get $$ s_{xy} = r \cdot s_x \cdot s_y = 0.728 \cdot 0.939 \cdot 16.37 = 11.19 $$

The covariance of $ X $ and $ Y $ will be positive if increasing values of $ X $ correspond to increasing values of $ Y $. This is exactly what happened in Goodwin’s data. On the other hand, the covariance will be negative if an increase in $ X $ tends to correspond with a decrease in $ Y $.

Answer the following question:
17. Is it possible for the covariance of $ X $ and $ Y $ to be negative when the correlation coefficient is positive? If so, how could this occur?
It is impossible for the covariance of $ X $ and $ Y $ to be negative when the correlation coefficient is positive. The covariance is the correlation coefficient multiplied by the standard deviations. Since the standard deviation of any variable can not be negative, the sign of the correlation coefficient will be the same as the sign of the covariance.

 


If an investor wants to purchase two stocks, and they want to reduce their risk, they can choose two stocks with a negative covariance. If the return on one stock decreases, the return on the other stock would tend to increase. This diversification can help protect the investor when market conditions change.

4.2 Example: Microsoft Versus Apple

Suppose you are considering investing in two stocks. The annual percent change in the price of Microsoft stock has a mean of 33.48% with a standard deviation of 49.31%. For Apple, the mean is 42.03% and the standard deviation is 80.84%. The correlation of these variables is 0.395. Find the covariance of these two stocks.

$$ s_{xy} = r \cdot s_x \cdot s_y = 0.395 \cdot 49.31 \cdot 80.84 = 1574.56 $$

This number is difficult to interpret directly, but it gives an estimate of the joint variability in the prices of these stocks.

4.3 Sample Statistic and Population Parameter

The sample correlation coefficient, $ r $, is an estimate of the unknown population correlation coefficient, $ \rho $. In a simiilar manner, we will consider the sample covariance and population covariance.

The sample covariance of the variables $ X $ and $ Y $ is denoted by the symbol $ s_{xy} $. Like all sample statistics, it is computed directly from specific data. If we define the sample standard deviation of $ X $ to be $ s_x $ and the sample standard deviation of $ Y $ to be $ s_y $, then we can write the sample covariance of $ X $ and $ Y $ as: $$ s_{xy} = r \cdot s_x \cdot s_y $$

Using similar notation for the population standard deviation of the random variables $ X $ and $ Y $, we can write the population covariance as: $$ \sigma_{xy} = \rho \cdot \sigma_x \cdot \sigma_y $$

The sample statistic $ s_{xy} $ is an estimator of the population parameter $ \sigma_{xy} $. We usually do not know the value of $ \sigma_{xy} $. Sometimes, we denote the population covariance as $ Cov(X,Y) $ instead of $ \sigma_{xy} $.

5 Summary

Remember...
  • Creating scatterplots of bivariate data allows us to visualize the data by helping us understand its shape (linear or nonlinear), direction (positive, negative, or neither), and strength (strong, moderate, or weak).
  • The correlation coefficient ($r$) is a number between $-1$ and $1$ that tells us the direction and strength of the linear association between two variables. A positive $r$ corresponds to a positive association while a negative $r$ corresponds to a negative association. A value of $r$ closer to $-1$ or $1$ indicates a stronger association than a value of $r$ closer to zero.
  • The covariance is a measure of how two variables vary together. The formula for the covariance is $s_{xy}=r \cdot s_x \cdot s_y$.


6 Navigation

Previous Reading:
Lesson 20:
Review for Exam 3
                   This Reading:
Lesson 21:
Describing Bivariate Data: Scatterplots, Correlation, & Covariance
                   Next Reading:
Lesson 22:
Simple Linear Regression