For easier and less intimidating viewing, it is better to use a bigger display, unless you're watching a video or just quickly recalling something.
We will continue talking about the same set of mathematical terms as last time, i.e. Variance, Covariance, Correlation and Linear Regression, but from a more formal, university-like point of view. This is how most educational institutions approach them: they rarely spend time on preliminary discussions or on interpreting the results from a real-world-data point of view. That is what we did last time, and that is what you should keep in mind while reading through the material below.
There won't be lots of general remarks & comments today; we will dive straight into the maths. First, let's quickly recall the very basics of the discrete probability setup, something that we assume you already know:
Definition [Probability Space (from the Discrete Setup)]
A probability space (in the discrete setup) is a pair $(\Omega, \mathbb{P})$, where $\Omega = \{ \omega_1, \omega_2, \ldots \}$ is the set of all simple events and $\mathbb{P}$ is a function defined on $\Omega$, say $\mathbb{P}(\omega_i) = p_i \geq 0$, where $p_1 + p_2 + p_3 + \ldots = 1$.
The definition above is a more formal way of saying that there is a bunch of simple events (a finite or infinite number of them) with some probabilities assigned to them. Note that even if there are infinitely many simple events, there are countably many of them, i.e. we can enumerate them! This is not always the case, e.g. we cannot enumerate all of the points of the interval $[0,1]$ — this is precisely where the conversation switches to the "continuous setup". We will talk a bit more about this later; let's now define one of the key concepts in probability theory:
Definition [Random Variables (from the Discrete Setup)]
Assuming there is a probability space $(\Omega, \mathbb{P})$, we can define a random variable as a function from $\Omega$ to $\mathbb{R}$. We will usually use capital letters $X, Y, \ldots$ or lowercase Greek letters $\delta, \xi, \ldots$ to denote random variables.
The definition of a random variable in the continuous setup looks almost exactly the same. But we do not need it for now, and so technically we have only defined a discrete random variable.
Note that there is nothing random about random variables, even though the name suggests otherwise :) Any random variable is, formally speaking, a function $\Omega \to \mathbb{R}$. That's it. It just has a real-world meaning that is related to randomness. For example, the usual words "rolling a fair die and seeing number 1, 2, ..., or 6" are formally interpreted (using our fancy notation and definitions) as "Given the probability space $\Omega = \{\text{roll } 1, \text{roll } 2, \ldots, \text{roll } 6\}$ with $\mathbb{P}(\text{roll }k) = \frac{1}{6}$, we can define a random variable $X: \Omega \to \{1, 2, \ldots, 6 \} \subset \mathbb{R}$ such that $X(\text{roll }k) = k$".
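To make this formalism concrete, here is a minimal Python sketch of the fair-die probability space and a random variable on it (the names `omega`, `P` and `X` are our own choices, not anything standard):

```python
from fractions import Fraction

# The probability space (Omega, P) for a fair die:
# six simple events, each with probability 1/6.
omega = [f"roll {k}" for k in range(1, 7)]
P = {w: Fraction(1, 6) for w in omega}

# A random variable is just a function Omega -> R.
def X(w):
    return int(w.split()[1])  # "roll k" |-> k

assert sum(P.values()) == 1    # probabilities sum to 1
print([X(w) for w in omega])   # [1, 2, 3, 4, 5, 6]
```

Once we have this formality, which clarifies the mathematical framework we are working in, we can get to defining important concepts like the one below: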
Definition [Mathematical Expectation (from the Discrete Setup)]
Assume there is a probability space $(\Omega, \mathbb{P})$ with probability distribution $\{ p_1, p_2, \ldots \}$. Suppose that there are finitely many simple events, so we have $\{ p_1, p_2, \ldots, p_n \}$ as the probability distribution (we call such a probability space finite). Then, for any random variable $X: \Omega \to \mathbb{R}$ we can define \[ \mathbb{E}[X] = p_1 X(\omega_1) + p_2 X(\omega_2) + \ldots + p_n X(\omega_n) \] and we call it the mathematical expectation of the random variable $X$.
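The definition translates into code line by line. A minimal sketch, reusing the fair-die example (the helper `expectation` is ours, written purely to mirror the formula):

```python
from fractions import Fraction

# Identify the simple event "roll k" with the number k for brevity.
omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in omega}   # fair die: p_i = 1/6
X = lambda w: w                          # X(roll k) = k

def expectation(X, omega, P):
    # E[X] = p_1 * X(w_1) + ... + p_n * X(w_n)
    return sum(P[w] * X(w) for w in omega)

print(expectation(X, omega, P))          # 7/2, i.e. 3.5
```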
We could also define the mathematical expectation even if there are infinitely many simple events: there is no issue in writing down $\mathbb{E}[X] = p_1 X(\omega_1) + p_2 X(\omega_2) + \ldots $ (provided there are countably many simple events, which is what the word "discrete" in the "discrete probability model" ensures), nor in rearranging the terms any way you like without running into paradoxes, as long as \[\lim_{n \to \infty} \left( p_1 |X(\omega_1)| + p_2 |X(\omega_2)| + \ldots + p_n |X(\omega_n)| \right) \] is finite. This is important in the sense that unless that limit is finite (i.e. the sum in the definition of $\mathbb{E}[X]$ is absolutely convergent — you might have heard of this term from Analysis or Calculus) we cannot define the mathematical expectation of $X$.
That last bit of the definition of the mathematical expectation is too calculus-like, so we will not focus on it, and none of our use cases today will violate that absolute convergence property.
Also, we are merely recalling the three basic definitions: we assume you have met all of them before and, in particular, that you have dealt with the mathematical expectation. E.g. knowledge of the linearity of expectation: \[ \mathbb{E}[a \cdot X + b \cdot Y] = a \cdot \mathbb{E}[X] + b \cdot \mathbb{E}[Y] \] is assumed.
All of the above was hopefully not new to you. We will now revisit the terms defined in the previous lesson and define them as they should be in formal probability theory. Although we stated the terminology for the discrete case, these definitions are the same in the continuous case. Therefore, there will be no "from the Discrete Setup" in brackets after their names.
Definition [Variance]
Given a probability space, and given a random variable $X$ on it, we can define its variance as \[ \text{Var}(X) = \mathbb{E}[ \, (X - \mathbb{E}[X])^2 \, ] \] Note that for this definition to work (and exist in the first place), we must assume that the mathematical expectations of both $X$ and $X^2$ are defined.
So, if $X$ takes values $x_1, \ldots, x_n, \ldots$ (could be infinitely many, could be finitely many) with probabilities $p_1, \ldots, p_n, \ldots$, then the variance is equal to $ \sum_{i=1}^{\infty} p_i (x_i - \mu)^2 $ where $\mu = \mathbb{E}[X] = \sum_{i=1}^{\infty} p_i x_i$.
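In code, the formula above is a one-liner. A quick sketch for the fair die, whose variance works out to $35/12$:

```python
from fractions import Fraction

# Var(X) = sum_i p_i * (x_i - mu)^2  with  mu = sum_i p_i * x_i.
xs = range(1, 7)                 # values of a fair-die roll
ps = [Fraction(1, 6)] * 6        # their probabilities

mu = sum(p * x for p, x in zip(ps, xs))
var = sum(p * (x - mu) ** 2 for p, x in zip(ps, xs))
print(mu, var)                   # 7/2 and 35/12
```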
As last time, the variance is a measure of how crazy the random variable is. If the word "crazy" is not to your liking, you could also say that variance is a measure of "how spread out the variable is", "the degree to which the values of a random variable differ from the expected value" or "how much it deviates from its mean".
Definition [Covariance]
Given a probability space, and given two random variables $X$ and $Y$ with finite (i.e. existing) mathematical expectations, we can define the covariance between them as \[ \text{cov}(X, Y) = \mathbb{E}[ \, (X - \mathbb{E}[X])(Y-\mathbb{E}[Y]) \, ]\] By the way, note that $\text{Var}(X)$ could be defined as $\text{cov}(X,X)$. Also, in one of the exercises below you will show that $ \text{cov}(X, Y) = \mathbb{E}[X \cdot Y] - \mathbb{E}[X] \cdot \mathbb{E}[Y]$.
However, one could note that these definitions do not look exactly the same as what we defined last time. For example, there is no $\frac{1}{n}$... But why would it be here? The variance and covariance we just defined are for random variables, not for a bunch of data points! This is a conceptual difference between today and last time: today we work with mathematical formalism rather than life-and-data-driven mathematical definitions. From a mathematical point of view, the variance we just defined is an operator: it takes in a function (which we call a "random variable") and outputs a number. It happens to be a useful operator that can be utilised to analyse data and derive educated conclusions. It is also, of course, closely related to the variance defined last time. In fact, if there is a sequence $x = [x_1, \ldots, x_n]$ of numbers, and if we define $X^u$ to be a random variable that takes the values $x_1, \ldots, x_n$ with equal probabilities, then \[ \text{Var}(X^u) = \text{Var}(x) \] Similar claims apply to the covariance.
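This claim is easy to check numerically. A small sketch (the data values are made up, and `numpy` is used only for convenience):

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0, 11.0])   # a sequence of numbers, as last time
n = len(x)

# Variance of the data, i.e. the 1/n definition from the previous lesson.
var_data = np.mean((x - x.mean()) ** 2)

# Variance of the random variable X^u taking each x_i with probability 1/n.
p = np.full(n, 1.0 / n)
mu = np.sum(p * x)
var_rv = np.sum(p * (x - mu) ** 2)

assert np.isclose(var_data, var_rv)         # Var(X^u) == Var(x)
```

Finally: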
Definition [Correlation]
Given a probability space, and given two non-constant random variables $X$ and $Y$ from it with finite (i.e. existing) mathematical expectations and variances, we can define the correlation between them as \[ \text{corr}(X, Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{Var}(X) \cdot \text{Var}(Y)}} \]
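As a quick illustration of the definition, here is a sketch computing the correlation for a toy joint distribution (the distribution is our own choice, and we use the $\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$ identity mentioned above):

```python
import numpy as np

# pmf[i, j] = P(X = xs[i], Y = ys[j]); a made-up joint distribution.
xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])
pmf = np.array([[0.4, 0.1],
                [0.1, 0.4]])                 # entries sum to 1

px, py = pmf.sum(axis=1), pmf.sum(axis=0)    # marginals of X and Y
EX, EY = np.sum(px * xs), np.sum(py * ys)
EXY = np.sum(pmf * np.outer(xs, ys))         # E[X * Y]

cov = EXY - EX * EY                          # cov(X, Y)
var_x = np.sum(px * xs**2) - EX**2
var_y = np.sum(py * ys**2) - EY**2
print(cov / np.sqrt(var_x * var_y))          # 0.6 here, inside [-1, 1]
```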
Exercise 1.
Let $X$ and $Y$ be two random variables. Prove that their correlation belongs to the $[-1,1]$ interval.
Explanation and comments
There is a version of the Cauchy-Schwarz inequality for random variables (there is a version for every inner product space, actually), which states: "For any random variables whose variances and mathematical expectations are defined, it is true that \[ (\mathbb{E}[X \cdot Y])^2 \leq \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2] \]" It is true in both the discrete and the continuous setups, even if $X$ and $Y$ take infinitely many values. The proofs are all very similar, but sometimes infinite summations pop up (hence you need to make sure you can properly formalise your logic) or integrals (in the case of the continuous setup).
Those details in the proofs of the Cauchy-Schwarz inequality do not matter for this course at all. What matters more is that this exercise is almost equivalent to this version of the Cauchy-Schwarz inequality. To see this clearly, note that replacing a random variable $X$ with $X - c$ for some fixed constant $c$ changes neither $\text{Var}(X)$ nor the correlation between $X$ and any other random variable (this is a part of the exercise below, and it is rather straightforward). Therefore, if we shift $X$ by the constant $\mathbb{E}[X]$, and if we shift $Y$ by the constant $\mathbb{E}[Y]$, the correlation between $X$ and $Y$ won't change. But then it becomes equal to $\frac{\mathbb{E}[X \cdot Y] }{\sqrt{ \mathbb{E}[X^2] \cdot \mathbb{E}[Y^2]}} $, which is almost equivalent to the Cauchy-Schwarz inequality. It is "almost equivalent" only because the Cauchy-Schwarz inequality holds for any $X$ and $Y$, even constant ones, whereas for the correlation to make any sense we need to assume that $X$ and $Y$ are non-constant.
This is exactly the same exercise as in the previous lesson. Also, as was pointed out last time, the conclusion of this exercise implies that correlation is nice in the sense that it assigns a number from the fixed interval $[-1,1]$ that tells us how "related" those random variables are. This fixed interval is completely independent of the nature of the random variables.
Exercise 2.
Let $X$ and $Y$ be two random variables. Prove the following:
Explanation and comments
All of these are about performing algebraic manipulations, nothing too difficult.
This exercise is almost a copy of what was done in the previous lesson as well. The proofs are very similar too. The fact that we can copy-paste basic properties like this shows that the concepts defined last time and the concepts defined today are very closely related. Thus, their names are the same. The main differences between them are contextual.
Before we move on to the next subtopic, let's solve two exercises to gain more confidence with the definitions. The first one is actually pretty much the same as a problem from the last problem set, while the second one will be important for our future discussion.
Exercise 3.
If the correlation between random variables $X$ and $Y$ is 0.9 and the correlation between random variables $Y$ and $Z$ is 0.9, what can the correlation between random variables $X$ and $Z$ be?
Exercise 4.
Let $X$ and $Y$ be two non-constant random variables for which both the mathematical expectation and the variance exist. Then there exist unique real numbers $a$ and $b$ such that \[ Y = (aX + b) + \epsilon \] where $\epsilon$ is some random variable satisfying $\mathbb{E}[\epsilon] = 0$ and $\text{cov}(\epsilon, X) = 0$.
Explanation and comments
Many people find this problem difficult. In particular, the $\epsilon$ confuses some individuals: this is not a given random variable! The problem basically states that there are unique $a$ and $b$ for which such an $\epsilon$ exists. For it to exist, two conditions have to be satisfied: \[
\left\{
\begin{array}{l}
0 = \mathbb{E}[\epsilon] = \mathbb{E}[Y - aX - b] \\
0 = \text{cov}(\epsilon, X) = \text{cov}(Y - aX - b, X)
\end{array}
\right. \] i.e. this problem is about finding constants $a$ and $b$ such that two conditions on $Y - aX - b$ are satisfied. If you got to this point, the remaining job is about algebraic manipulations. Let's do it:
Expanding via the linearity of expectation and the bilinearity of covariance (a constant has zero covariance with any random variable), we get \[
\left\{
\begin{array}{l}
0 = \mathbb{E}[Y] - a \mathbb{E}[X] - b \\
0 = \text{cov}(X,Y) - a \cdot \text{Var}(X)
\end{array}
\right. \] which is a system of two linear equations in $a$ and $b$ (once again, the rest of the terms are given). We can easily solve it (e.g. $a$ can be found directly from the second equation) to obtain \[ a = \frac{\text{cov}(X,Y)}{\text{Var}(X)} \qquad b = \mathbb{E}[Y] - \frac{\text{cov}(X,Y)}{\text{Var}(X)} \cdot \mathbb{E}[X] \] Formally speaking, we have only shown that if such an $\epsilon$ exists then $a$ and $b$ must be equal to what is written above. However, it is easy to reverse the arguments to show that if $a$ and $b$ are equal to those formulas, then taking $\epsilon = Y - aX - b$ satisfies the required conditions. So we are fully done with this exercise: those $a$ and $b$ exist and they are unique.
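Here is a numerical sanity check of these formulas — a sketch under our own assumptions: we simulate a noisy linear relation and treat the paired samples as equally likely realisations, in the spirit of the $X^u$ trick from earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 1.0 + rng.normal(size=1000)      # a noisy linear relation

# a = cov(X, Y) / Var(X),  b = E[Y] - a * E[X]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
a = cov_xy / np.mean((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

eps = y - (a * x + b)                          # the error term epsilon
print(a, b)                                    # roughly 2 and 1
print(eps.mean())                              # ~0: E[eps] = 0
print(np.mean((eps - eps.mean()) * (x - x.mean())))  # ~0: cov(eps, X) = 0
```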
Note that the formulas for $a$ and $b$ we got here are exactly the same as in the last exercise from the previous classwork! This is not quite a coincidence, even though the setup of the exercise above is not the same as the setup of the exercise from the previous classwork. While the exercise above looks like a problem about covariances and expected values, it is in fact tightly related to Linear Regression!
Let's look at this exercise again: we basically want to approximate $Y$ using $aX + b$ (i.e. a linear function of $X$) in such a way that the error term (that $\epsilon$) is on average 0 and is uncorrelated with $X$. Both of these conditions on the error make sense: if the error term is non-zero on average, we can adjust $b$ so that our linear regression is "better", while if there is non-zero correlation between $\epsilon$ and $X$ (which implies non-zero covariance), it means we are not using the full extent of the relation between $X$ and $Y$. It turns out that these two conditions on the error term, $\epsilon$, are both reasonable and fully define the coefficients $a$ and $b$.
Last time we had the following setup: there was a bunch of data, a set of points with coordinates $(x_i, y_i)$, with $y_i$ corresponding to $x_i$ in some way (some life-related relation, like being the GPAs, i.e. "grade point averages", of the same person). The task was to find a way to explain the $y_i$ in terms of the $x_i$. If we can answer this question well, then we would probably also be able to predict the value of a new $y_j$ using the value of the corresponding $x_j$ only — e.g. we could then potentially reasonably predict the university GPA of a person given his/her high school GPA. It will almost certainly not be a perfect prediction, but it should be better than a random guess.
One of the ways to approach that task, and the way we considered last time, is to fit "the best" line to the set of points and call this line "our prediction model". The adjective "best" has to be clarified, which gave rise to the Quadratic Loss Function: it translated the problem of finding "the best line" into finding the values of the two parameters $a$ and $b$ (by which the model can be characterised) that minimise the quadratic loss function $L(a,b)$ for this problem. Once those values were found, we could say that if a new $x_j$ arrives for which we do not know the $y_j$, we could use the value of $a \cdot x_j + b$ as a prediction. E.g. in the example with people's grades, we could use \[ a \cdot \text{high school gpa} + b \] as a prediction for the university GPA.
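As a reminder of how that looked in practice, here is a minimal sketch with made-up GPA-like numbers (both the data and the perturbation check are ours, purely for illustration):

```python
import numpy as np

# Quadratic loss from last lesson: L(a, b) = sum_i (y_i - a*x_i - b)^2.
def L(a, b, x, y):
    return np.sum((y - a * x - b) ** 2)

x = np.array([2.8, 3.1, 3.4, 3.6, 3.9])   # high school gpas (made up)
y = np.array([2.9, 3.0, 3.5, 3.4, 3.8])   # university gpas (made up)

# Closed-form minimisers of L(a, b).
a = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Nearby (a, b) values should never give a smaller loss.
best = L(a, b, x, y)
assert all(L(a + da, b + db, x, y) >= best
           for da in (-0.01, 0.01) for db in (-0.01, 0.01))
print(a * 3.5 + b)                          # prediction for a new gpa of 3.5
```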
But let's look at that prediction formula again. When thinking of how to predict the "university gpa", we think of it as something random. Moreover, before the new data point arrived, the "high school gpa" could also be thought of as random. Thus, we could think about both of them as random variables while their true values are unknown! From a mathematical perspective, this is a new (relative to what we did last time) way of thinking.
This new way of thinking and new formalism could lead to answering the question of finding "the best linear model" somewhat differently. One of the new ways to answer this question is described in the previous exercise: we impose two reasonable conditions on the error term, and then find $a$ and $b$ that would satisfy those conditions. It turned out they are unique and the formulas for them are nice, in particular they resemble what we did in the previous classwork.
Alternatively, we could approach the question of finding "the best linear model" with random variables floating around as follows: try to estimate the probability distributions of $X$ and $Y$ (using common sense, data and maybe the Bayes formula), and then solve the following problem:
(From the problem set)
Let $X$ and $Y$ be given non-constant random variables for which both the mathematical expectation and the variance exist. Find the values of the constants $c$ and $d$ (in terms of $X$ and $Y$) that minimise \[ \mathbb{E}[ \, (Y - cX - d)^2 \, ] \] Compare these values to the formulas for $a$ and $b$ from the previous exercise.
The resemblance between the formulas for $a$ and $b$ from today and the $a$ and $b$ from last time should now be even more motivated. In fact, if you let $Y$ be a random variable that takes the values $y_1, \ldots, y_n$ with equal probabilities, and if you let $X$ be a random variable that takes the values $x_1, \ldots, x_n$ with equal probabilities, then the previous exercise is precisely what the Least Squares Estimation (LSE) solution that we talked about last time is about!
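You can check this equivalence numerically: the probabilistic formulas for $a$ and $b$ coincide with what a standard least-squares fit returns (a sketch; the simulated data is our own choice, and we use numpy's `polyfit` as the LSE reference):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x + 2.0 + rng.normal(size=200)

# Formulas from Exercise 4, applied to the equal-probability variables X^u, Y^u.
a = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# Least Squares Estimation, as done last time (degree-1 polynomial fit).
a_lse, b_lse = np.polyfit(x, y, deg=1)

assert np.isclose(a, a_lse) and np.isclose(b, b_lse)
```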
All of the definitions from today are exactly the same in all of the books, universities, online resources, etc. So if you google "covariance of two random variables" you will find the exact same formulas as above, the same properties and so on. However, the same does not apply to the definitions we gave last time, and at the end of that classwork there was a remark that "we will get to it next lesson". And the time has come :)
Let's temporarily forget about the previous lesson completely; the definitions from today will stay exactly as they are, no dependencies have been broken. Note that today we only worked with random variables, and, for example, we defined the "variance of a random variable". This is completely fine, but now ... here comes the "life".
In real-world settings we do not have pre-specified random variables that produce, let's say, daily prices of Apple stock. There is no website you could go to that would tell you the probability distribution of the random variable $X_{apple}$ officially representing the company. Instead, we have the data. Only the data. For example, we have $x_1, x_2, \ldots, x_n$ as the prices of Apple stock 1 day ago, 2 days ago, ..., $n$ days ago. We say "let's consider a random variable $X$ representing the Apple stock prices", and so those $x_1, \ldots, x_n$ are viewed as realisations of that $X$. In the same way we say something like "Let $T$ be a random variable representing what we roll on a fair die, and so those rolls we just got, 2, 3, 2, 6, 1, 4 and 3, are realisations of $T$". Both $X$ and $T$ are merely models that we decided to consider for making predictions and analysing our results. Maybe it is a good idea to do things this way, maybe it is not. But over the course of more than 100 years we have seen that it works fine and can be super helpful.
Anyway, since there is no official website telling us the exact distribution of $X$, nor of $Y$ (a random variable representing daily Amazon prices, as an example), how do we find, say, the correlation between $X$ and $Y$? It seems impossible! We only have $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$, a bunch of maths books and a brain. So what do we do? This is where the Law of Large Numbers, or just common sense, comes in.
Since the $x_i$ are realisations of $X$, it feels right and intuitive that $\mathbb{E}[X]$ should be approximately $\frac{1}{n} (x_1 + \ldots + x_n)$. In fact, the Law of Large Numbers is precisely about making this claim formal! But we do not have to stop here: why not use similar logic to approximate $\text{Var}(X)$ using those $x_1, \ldots, x_n$? We know that $\text{Var}(X) = \mathbb{E}[ \, (X - \mathbb{E}[X])^2 \, ]$, and we can already approximate the $\mathbb{E}[X]$ inside the formula with $\overline{x} = \frac{1}{n} (x_1 + \ldots + x_n)$. We still have $X$ inside to deal with, but why not use the same intuition we just used and average out the possible realisations of $ (X - \overline{x})^2$ to approximate the variance? We can! It will then tell us that $\text{Var}(X)$ should be approximately \[ \frac{1}{n} \left( (x_1 - \overline{x})^2 + \ldots + (x_n - \overline{x})^2 \right) \] Voilà, we got the definition of the variance of a sequence of numbers from the previous lesson! However, "it makes sense to average out results to approximate the variance" is not a formal proof of anything. Perhaps this formula is not the best way to approximate $\text{Var}(X)$ given the realisations $x_1, \ldots, x_n$ of $X$. Maybe there is a "better" formula for this purpose?
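Before answering, here is the $\frac{1}{n}$ recipe in action — a sketch where we secretly know the truth: the samples are drawn from a distribution with $\mathbb{E}[X] = 5$ and $\text{Var}(X) = 4$ (our own choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)   # realisations of X

x_bar = x.mean()                       # approximates E[X]
var_hat = np.mean((x - x_bar) ** 2)    # the 1/n formula from above
print(x_bar, var_hat)                  # close to 5 and 4
```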
As you might have guessed, this is exactly where the "sample variance" that you might stumble upon when googling things related to this course comes in. It is usually defined as follows: \[ \text{Sample Variance} = \frac{1}{n-1} \left( (x_1 - \overline{x})^2 + \ldots + (x_n - \overline{x})^2 \right) \] i.e. the $n$ is replaced with $n-1$. While this is almost exactly the same as the above (and exactly the same in the limit as $n \to \infty$), it is not equal to it. There are reasons why this approximation is better, and so, relatively often, without providing any further context or details, people simply call this other formula (with $\frac{1}{n-1}$ instead of $\frac{1}{n}$) the definition of variance for a bunch of numbers. You must be careful about this even when coding in Python: the default is not always clear! The same applies to covariance, correlation and standard deviation (which was not mentioned in this class so far, so don't worry about it yet).
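For instance, numpy's `var` defaults to the $\frac{1}{n}$ formula while the standard library's `statistics.variance` defaults to $\frac{1}{n-1}$, so two "variance" calls on the same data can disagree:

```python
import statistics
import numpy as np

x = [1.0, 2.0, 3.0, 4.0]

print(np.var(x))                 # 1.25   -> 1/n formula (default ddof=0)
print(np.var(x, ddof=1))         # ~1.667 -> 1/(n-1) sample variance
print(statistics.pvariance(x))   # 1.25   -> stdlib "population" variance
print(statistics.variance(x))    # ~1.667 -> stdlib default is 1/(n-1)
```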
It might look like we are drifting away from the course objective, which is about getting into trading or quant research as well as learning a bit more about it. But we are not: