4. Around correlation

4.1 Informal Definitions

Today, we will talk about the correlation of things. You have definitely heard about it before, at least when reading or hearing claims like "these two things are surprisingly correlated with each other". Some of you may even know the definition of "correlation" between two random variables, but we won't need it today. It is too mathematically involved for this introductory course anyway.

For now, let's go with an intuitive understanding of this word, namely 'some measure of the extent to which two or more variables behave in relation to each other'. In particular, although a correlation is a real number (between $-1$ and $1$), we will not talk about how to assign this number to a pair of variables. But there is a way, so for two things like, say, "coffee consumption" and "sleepiness", we can say something like "their correlation is $-0.34$" (just in case: this number is made up, but it is true that it is negative).

We will focus on the correlation between just two things (or "two variables", to say it more mathematically). Therefore you will see quite a few 2D plots, and we expect you to understand them a bit.

For example, there is a special kind of plot called a scatter plot, which uses dots to represent values of two different numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point. Here is an example:

On the scatter plot above, we have information about 16 people, who are represented by dots. To be more precise, we know the weight of each of the 16 people and how long, on average, they run per week. For example, one of them weighs around 78 kilograms and runs on average 3 hours a week.

These plots help visualise the relation between the two variables. Judging from a scatter plot, it is often easy to say whether "two variables look correlated" or "two variables seem uncorrelated". Check out the plots below:

To be a bit more precise, people tend to separate pairs into three categories: positively correlated, negatively correlated and uncorrelated. The first one usually means "the bigger the first variable is, the bigger the second variable is" (or that the correlation is between $0$ and $1$), the second one usually means "the bigger the first variable is, the smaller the second variable is" (or that the correlation is between $-1$ and $0$), and the third one means 'not the first two cases'. For example, the "Running VS Body Weight" plot looks like a negative correlation (especially if you ignore the one dot in the top-right corner), and the two plots right above show 'positive correlation' and 'no correlation'.

4.2 Exercises

Exercise 1.

The local ice cream shop keeps track of how much ice cream they sell versus the temperature on that day. Here are their figures for the last 12 days:

Given the information, what can you say about the relationship between the two variables?

Explanation and comments

It is difficult to say much without a scatter plot. So your first idea should be: make one. The scatter plot for the information given looks like this (note that the x-axis starts from 10 and not from zero):

This looks like a clear positive correlation (in fact, this relationship looks almost perfectly linear...). This positive correlation does make sense by the way: the warmer it is outside, the more likely someone will buy an ice cream to "cool down". Plus, it seems reasonable that fewer people want to eat cold things in low temperatures.

Exercise 2.

Have a look at the pictures below. The first one is about the number of COVID cases (total number per city), and the second is the number of traffic fatalities (total number per city as well). These are real plots, just so you know.
Anyway, they look very similar, right? So do these pictures suggest that the more COVID cases there are, the more likely you are to end up in a traffic accident? Or that the less attentive you are on the road, the more likely you are to get COVID? Or what do you think?

Explanation and comments

In reality, neither of the two suggested conclusions is correct. There is something else at play here... This "something else" is a third variable that both of the above clearly depend on. The name of this third variable is "density of the population":

So the correlation we see is simply due to the facts that "the denser the population is, the easier it is for a virus like COVID to spread" and "the more people there are, the more cars there are, so the more accidents there are". This is one of many examples showing that correlation does not imply causation.

Exercise 3.

If we collect yearly data for the total number of measles cases in the U.S. and for the marriage rate, we will find that the two variables are highly correlated (see the picture below). What do you think about this?

Explanation and comments

It would be pretty interesting if one actually caused the other... But in fact, the two variables are independent. Modern medicine is simply causing measles cases to drop, and fewer people are getting married each year for various reasons. The former is rather obvious, but why the latter is happening is interesting. There are many potential reasons. What do you think?

4.3 Money-related Remark

Money is not the most important thing in life, but it is good to know that something you know or are about to learn can help you make your wallet bigger :)
Continuing with the story after the first homework, let us share some snapshots from a presentation from that internship (the one that was paying big bucks):

Of course, you are not expected to perfectly understand these plots at this point. Maybe you have guessed that 'corr' in the top-right corners of the snapshots stands for 'correlation'... Moreover, the labels for the $x$- and $y$-axes are erased here. This is because they are somewhat secret information o_O

Whatever those labels were, they helped save that company millions of dollars on some trades they were doing. Without going into the details, it turned out that there was a strong enough correlation between the future price of a certain thing and some variables. Those variables were easy to calculate, which helped predict the price movements better.

Conclusions and next steps

Hopefully, all of the above gives extra motivation for doing probability and statistics. One thing to note here: despite being used a lot in everyday speech, "correlation" is actually not a trivial concept in probability theory. So we will only be able to get to it much later. We have worked with it a little, though, just by using our intuitive understanding of the word.

We are going to change the topic of discussion in the next lesson. It will be about "Artificial Intelligence". Hot topic nowadays, surely you can feel it :)
