Unfortunately, securing an offer isn't as easy as 'if you can solve most of the problems from the previous lesson, you'll get an offer.' I don't mean to say solving most of those problems is easy, but it's nonetheless surprising how much closer you can get to an offer by being good only at basic probability and critical thinking! Therefore, please make sure you're comfortable with the standard topics such as combinatorics, conditional probability, Bayes' theorem, mathematical expectation and its applications, and the basics of the continuous probability setup. This is a must.
However, the requirements often extend beyond these basics. For instance, a decent number of companies will ask questions to test deeper understanding of probability theory, such as 'find the $N$th moment of a standard normal distribution' or 'bound the correlation between two random variables given some condition'. It is understandable that you might not be prepared for such specific inquiries right now (that's fine!). The good news is that this primarily involves theoretical knowledge, which means that if you dedicate time to "read and practice", you will become well-prepared! This is in fact less challenging than developing great mathematical intuition and analytical thinking, which can take a very long time. Thus, once more, you're closer to being ready for a high-paying job than you might think.
For the sake of completeness, here is the list of topics for the problems & questions that could come up at an interview or in an online assessment (both can be part of the application process). Once again, this is for a junior position or an internship (since this course is focused on inexperienced people trying to "get in") as a quant researcher or as a trader. Just to have it all in one place, together with a few useful references:
Time for maths. This subsection's name may sound like a narrow and unimportant topic, however it is a building block for the theory that most quants and traders use, especially during an internship or at a junior position. This theory includes the topic of Linear Regression, whose importance has been stressed above. In particular, it means that this subsection's problems are a part of the preparation for the interviews. In fact, we will get to doing a few questions from real & recent interviews again!
Unlike most educational institutions, we will approach those two unattractive terms in the subsection's name from a different angle: fewer formal definitions with lots of $\mathbb{E}[X]$ floating around, and more "real life and data" kinds of discussion. We will even play a little trading game soon!
So, the word "correlation" comes up in many conversations, and it always means some measure of the extent to which two or more variables behave in relation to each other. We don't even have to perfectly formalise it to understand its potential use: if I know that today's stock price of Microsoft is strongly correlated with yesterday's stock price of Nvidia, I can try to make use of this information to predict the future moves of Microsoft's price. Even if it is not "always true" that "if Nvidia dropped in price, then Microsoft will also drop tomorrow", but merely "true more often than not", it is already a non-trivial amount of information that could be used to make money, provided you trade a lot (so you can apply the Law of Large Numbers).
Spotting those helpful correlations is not easy. Let's do a quick exercise and try to spot a correlation (potentially useful for predicting price moves) on the plot below:
The key observation here is that almost every time the price goes up, it goes up again in the next hour, and if the price drops, it drops in the next hour as well. This is not totally "random" behaviour if you think about it. If it were totally random, then right after a rise we would expect a drop around 50% of the time. I.e. the next price change is positively correlated with the last price change. This phenomenon even has a name: it is called positive autocorrelation (guess what kind of behaviour is called negative autocorrelation). It is a fancy, but self-explanatory, term. You could also call it a "trend".
Exercise 1.
By making use of the positive autocorrelation phenomenon, find a way (= an algorithm) to almost never have less than $\$1050$ in "total worth" by the end of the trading game below. Ideally, also try to sometimes get more than $\$1150$ by the end. It should be clear what is going on, but here are the details just in case:
Explanation and comments
Once again, intuitively this phenomenon means that if the price went up last time, it is more likely to go up again. On the other hand, if the price just dropped, it is more likely to drop again. With this in mind, let's stick to the following strategy:
This is not a $100\%$ winning strategy, but we are doing probability theory, i.e. the theory of chance :) By the way, we can further modify this strategy: we can do "Sell All" and then always "Do Nothing" once our total worth is above $\$1050$. This will be an almost-always (~90% of the time) working strategy to finish with more than $\$1050$ at the end. It will not, however, ensure that we sometimes get to more than $\$1200$. For this, we do need to stick to the strategy above until the end, at least every once in a while, but then we increase the risk of finishing with less than $\$1050$. Thus there is some trade-off between risk and expected final total worth.
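To make the intuition a bit more tangible, here is a minimal Python sketch of such a trend-following strategy. It is an assumption-heavy toy model, not the actual game above: the starting cash of $\$1000$, the starting price of $\$100$, the $\pm 1$ hourly moves and the 70% chance of repeating the previous move are all made up for illustration.

```python
import random

def simulate_trend_following(n_steps=50, start_cash=1000.0, start_price=100.0,
                             p_continue=0.7, seed=0):
    """Toy simulation of a trend-following strategy on a positively
    autocorrelated price series (all parameters are illustrative assumptions)."""
    rng = random.Random(seed)
    price, cash, shares = start_price, start_cash, 0
    last_move = rng.choice([-1, 1])

    for _ in range(n_steps):
        if last_move > 0:
            # the price just went up, so we expect another up-move: buy what we can
            n = int(cash // price)
            shares += n
            cash -= n * price
        else:
            # the price just dropped, so we expect another drop: sell everything
            cash += shares * price
            shares = 0

        # positive autocorrelation: the next move repeats the last one
        # with probability p_continue, otherwise it flips
        move = last_move if rng.random() < p_continue else -last_move
        price += move
        last_move = move

    return cash + shares * price  # final total worth

final_worths = [simulate_trend_following(seed=s) for s in range(1000)]
print("fraction of runs finishing above $1050:",
      sum(w > 1050 for w in final_worths) / len(final_worths))
```

Playing with `p_continue` shows the point of the whole discussion: the closer it is to $0.5$ (no autocorrelation), the less this strategy helps.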
The exercise above is cute and insightful, but we still have not quantified our intuition at all! We have not formalised anything in this subsection really. For example, what exactly is a "weak correlation"? How to quantify any of this?
Let's answer those questions. Just like before, we will be engaging with a real-world problem involving two entities that generate somewhat random values. These could be, for example, the minute-by-minute prices of certain stocks or the monthly totals of specific virus cases globally. The precise nature of these values is not important to us, mathematicians; rather, we will treat them as two finite sequences of numbers: $x_1, ..., x_n$ and $y_1, ..., y_n$ (the equal length is important, of course). The question is: given these two sequences of numbers, how do we nicely assign a number that measures the extent to which these two strings of values are related to each other?
Definition [(Pearson) Correlation Coefficient]
The correlation between two non-constant sequences of numbers $x=[x_1, ..., x_n]$ and $y=[y_1, ..., y_n]$, more officially known as the Pearson Correlation Coefficient (or sample correlation coefficient as well), is defined as \[ \text{corr}(x,y) = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{ \sqrt{ \left( \sum_{i=1}^n (x_i - \overline{x})^2 \right) \cdot \left( \sum_{i=1}^n (y_i - \overline{y})^2 \right)}} \] where $\overline{x}$ is the average of the $x_i$-s and the $\overline{y}$ is the average of the $y_i$-s.
(Those $x$ and $y$ are merely the names for those sequences, not random variables or constants or something else).
The formula right above is almost objectively damn-ugly. But I nonetheless recommend you stare at it for a while to think if it makes any sense to you. Moreover, please prove the following convenient property of it:
Exercise 2.
Show that the Pearson Correlation Coefficient always belongs to the interval $[-1, 1]$. When does it equal $1$ or $-1$?
Explanation and comments
It is a direct implication of the Cauchy-Schwarz inequality applied to the numbers $x_i - \overline{x}$ and $y_i - \overline{y}$. The correlation being equal to $-1$ or $1$ is equivalent to the equality case in that inequality, which happens only when the sequences of deviations are proportional, i.e. when $y_i = ax_i + b$ for some constants $a$ and $b$, or $x_i = ay_i + b$ for some constants $a$ and $b$.
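For completeness, here is the inequality spelled out in the notation of the definition (just a sketch of the argument, the details are left to you): \[ \left| \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \right| \le \sqrt{\sum_{i=1}^n (x_i - \overline{x})^2} \cdot \sqrt{\sum_{i=1}^n (y_i - \overline{y})^2}, \] which is exactly the Cauchy-Schwarz inequality applied to the vectors $(x_1 - \overline{x}, ..., x_n - \overline{x})$ and $(y_1 - \overline{y}, ..., y_n - \overline{y})$; dividing both sides by the right-hand side (non-zero since the sequences are non-constant) gives $|\text{corr}(x,y)| \le 1$.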
We finally have a measure of how related two things are! It is even "nice" in the sense that it is independent of the nature of those sequences of numbers: we can always assign a number $\in [-1, 1]$ telling us how "related" those sequences (and thus the entities producing those random numbers) are. We are yet to understand better why it makes sense to say that this weird Pearson Correlation can be thought of as "a measure of relation", but we are already getting somewhere.
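If you prefer code to formulas, here is a short Python sketch that computes the coefficient directly from the definition and checks it against `numpy.corrcoef` (the two example sequences below are made up).

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation coefficient, computed straight from the definition."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# made-up example sequences
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

print(pearson_corr(x, y))        # close to 1: y is roughly 2*x
print(np.corrcoef(x, y)[0, 1])   # numpy's built-in gives the same number
```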
To be honest, you will rarely see the formal definition of a correlation before the two definitions that are about to appear, which are variance and covariance. However, it is the word "correlation" that is often used in speech, and so it feels more intuitive to start with it. Anyway:
Definition [Variance of a sequence (set) of numbers]
The variance of a sequence of numbers $x = [ x_1, x_2, ..., x_n ]$ is the average of the squared distances from each term to the mean, i.e it is equal to \[ \text{Var}(x) = \frac{1}{n} \cdot \sum_{i=1}^n \left( x_i - \overline{x} \right)^2 \] where $\overline{x} = \left( 1/n \cdot \sum_{i=1}^nx_i \right)$, i.e again the average of the $x_i$-s.
(The $x$ is merely the name for that sequence, not a random variable or a constant or something else).
Remark: Technically, we do not need to care that it is a sequence (an ordered list) for the definition of the variance to make perfect sense. It could be a set of numbers, and it would be exactly the same thing. I am merely sticking to the words "sequence" and "list" because we are discussing all of this in the context of a correlation between two things, where it is important that $x_i$ corresponds to $y_i$. This is a pretty obvious remark, but still, please do not get confused.
As the name suggests, variance gives us a numerical measure of how scattered a data set is. In simpler words, it measures how "crazy" the data set is. Just stare at the formula and you will see why this is true; it luckily makes perfect intuitive sense. Next, let's define:
Definition [Covariance between two sequences]
The covariance between two sequences $x=[x_1, ..., x_n]$ and $y=[y_1, ..., y_n]$, once again the order here matters, is defined as follows: \[ \text{cov}(x,y) = \frac{1}{n} \cdot \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \] where $\overline{x}$ is the average of the $x_i$-s, and the $\overline{y}$ is the average of the $y_i$-s as before.
(Those $x$ and $y$ are merely the names for those sequences, not random variables or constants or something else).
It is more difficult to find the right words for the meaning of this one. One thing one could say is that covariance indicates the direction of the linear relationship between two things. However, you cannot judge from its magnitude how strong that relationship is. To be able to talk about the "strength" of it, we need a baseline of some form, something to compare it to. This is where correlation comes in: \[ \text{corr}(x, y) = \frac{\text{cov}(x,y)}{\sqrt{\text{Var}(x) \cdot \text{Var}(y)}} \in [-1, 1] \]
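As a quick sanity check (on made-up sequences), here is how the three definitions fit together in code: the correlation computed as $\text{cov}(x,y)/\sqrt{\text{Var}(x)\cdot\text{Var}(y)}$ coincides with the Pearson formula from before.

```python
import numpy as np

def var(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).mean()           # 1/n * sum of squared deviations

def cov(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).mean()

# made-up example sequences
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
y = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]

corr = cov(x, y) / np.sqrt(var(x) * var(y))
print(corr)                        # the 1/n factors cancel, so this equals...
print(np.corrcoef(x, y)[0, 1])     # ...the Pearson correlation coefficient
```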
Let's finish off this subsection with a few standard properties of the recently defined terms.
Exercise 3.
Let $x=[x_1, ..., x_n]$ and $y=[y_1, ..., y_n]$ be two sequences with averages $\overline{x}$ and $\overline{y}$. Prove the following
Explanation and comments
click on blur to reveal
All of these properties are easy to show; they are merely about rearranging a bunch of stuff:
We will now move on to working with the mathematical tool that seems to be used by most trading companies to make money. It is genuinely surprising how effective and popular it is despite its simplicity: only around one in five people I talked to about their trading or quant internships told me that they went significantly beyond (mathematically speaking) just using Linear Regression in different contexts! This mathematical tool also happens to be a popular interview question.
Unlike what was done with Variance, Correlation & Covariance, we will not be talking about the motivation behind Linear Regression, nor will we tiptoe around it too much. We will jump straight into using it, building some theory around it and solving problems on it. It would be fair to talk about prediction models in general before jumping "straight into action", and in particular to show you that even in a rather simple scenario, the Linear Regression Model might not be "the best" or even "nearly the best". For this discussion, please check out the second half of this classwork from the introductory course.
So, the Linear Regression Model is merely one of many possible models that help generate predictions. And even though the quote above, "...went significantly beyond just using Linear Regression...", might suggest it is not complicated, that is not the right conclusion. It is not a simple thing in general! There are a lot of interesting facts about it, as well as a few caveats, especially once you go multi-dimensional.
Luckily for you, we will not be going multi-dimensional today, meaning we will be working with the Simple Linear Regression Model: there is one variable, call it $Y$, that is our target variable, and there is just one feature, call it $X$, that is our predictor, and our ultimate goal is to find the "best" (or "as good as possible") constants $a$ and $b$ such that $aX + b$ predicts $Y$. Here $X$ and $Y$ are both random variables.
We need to be more specific and formal here, in particular about the word "best". One way to go about all this is the following: let's collect some data (or it might be given already) with realisations of $Y$ and $X$, say $(x_1, y_1), ..., (x_n, y_n)$. A standard example: the price of a house corresponds to $Y$, while its distance to the city centre corresponds to $X$; then those $(x_i, y_i)$ are the pairs of numbers carrying this information about some $n$ houses. Define
Definition [Quadratic Loss Function]
If $y_1, y_2, ..., y_n$ are the true values (that we observed), and $f(x_1), f(x_2), ..., f(x_n)$ are our predictions (where $f$ is our model), then the quadratic loss function $L$ is defined as \[ L(f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \]
and then say that we will pick $a$ and $b$ in such a way that the loss function for the model $f(x) = ax + b$ is the minimal possible (over all possible choices of $a$ and $b$). This is called Least Squares Estimation (LSE) because it gives rise to the least value of the sum of squared errors. So it makes sense, and you have probably seen all this before.
The picture above shows a solution to the LSE in the problem where there are $100$ students, each having a pair of grades (high school GPA, university GPA), and where our ultimate goal is to find the best linear prediction for the "university GPA", our target variable, given just one feature, which is "high school GPA". The orange line in the picture is that solution, i.e. the "line that fits the data best". For this particular case it turned out that $a=0.68$ and $b=1.07$, so our model is $f(x) = 0.68 x + 1.07$, i.e. \[\text{prediction} = 0.68 \cdot \text{(high school GPA)} + 1.07 \]
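To illustrate, here is a sketch of how such a fit could be computed in Python. The GPA data below is synthetic (the real 100 pairs behind the picture are not available here), so the fitted numbers will only come out roughly close to the $a=0.68$ and $b=1.07$ quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the (high school GPA, university GPA) pairs
high_school_gpa = rng.uniform(2.5, 4.0, size=100)
university_gpa = 0.68 * high_school_gpa + 1.07 + rng.normal(0, 0.2, size=100)

# least squares fit of: university_gpa ~ a * high_school_gpa + b
a, b = np.polyfit(high_school_gpa, university_gpa, deg=1)
print(f"prediction = {a:.2f} * (high school GPA) + {b:.2f}")
```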
Even if this is somewhat new to you, it should make sense. Of course, one can ask questions like "why a linear model?" or "why squared errors, why not just absolute values?" – but we simply picked such a model with such a way of measuring the "bestness of the model" through that particular loss function. Neither of these choices has to be "the best"; they are merely choices that make intuitive sense.
Finally, we are about to get to something that you might not have seen or realised before. Since this classwork is getting too long already, we will actually solve just one more exercise here, and move on to the next step, the Problem Set. You are left with digesting all of the information (the vast majority of which is likely to be not new), and with solving a few interview-level problems.
Exercise 4.
To sum up: there is data $(x_1, y_1), ..., (x_n, y_n)$, and we would like to pick constants $a$ and $b$, so that the loss function \[ L(f) = L(a, b) = \frac{1}{n} \sum_{i=1}^{n} (y_i - ax_i - b)^2 \] is minimised. While this "makes sense" for our purposes of predicting something, a maths-minded person can view it as an algebra problem – and so this exercise is about solving it. Well, at least find the value of $a$ in terms of $x_i, y_i$.
Explanation and comments
It is not a difficult exercise if you know calculus: simply set the partial derivatives to 0 and find the values of $a$ and $b$ from the equations you get. You don't have to use calculus, by the way; there are other ways to solve it, including one that is merely about algebraic manipulations – the $L(a,b)$ is quadratic in $a$ and $b$ after all, it is not a sophisticated function. So, one way or another (but please do it!), you can find that
\[ a = \frac{\sum_{i=1}^n y_i (x_i - \overline{x})}{\sum_{i=1}^n (x_i - \overline{x})^2} \qquad b = ...\]
where $\overline{x}$ is the mean of the $x_i$-s. The "..." for $b$ means that it is left for you to actually calculate :)
What is more interesting here is that we can rewrite the formula for $a$ as \[ a = \frac{1/n \cdot \sum_{i=1}^n (y_i - \overline{y}) (x_i - \overline{x})}{1/n \cdot \sum_{i=1}^n (x_i - \overline{x})^2}\] by adding $0 = -\overline{y} \cdot 0 = -\overline{y} \cdot \sum_{i=1}^n(x_i - \overline{x})$ to the numerator, and then multiplying both the numerator and the denominator by $\frac{1}{n}$. This new formula for $a$ should resemble something... Indeed, using the definitions from above, we can further rewrite \[ a = \frac{\text{cov}(x,y)}{\text{Var}(x)} \] which is nicely interpretable! It suggests that the slope of the linear regression measures how related $y$ and $x$ are, i.e. how much of $y$ can be explained using $x$. It also quantifies the degree to which $y$ can be predicted from $x$ based on the historical data.
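You can also verify this identity numerically; the sketch below (on made-up noisy linear data) compares the explicit least-squares formula for $a$ with $\text{cov}(x,y)/\text{Var}(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(size=200)   # made-up noisy linear relationship

# slope from the explicit least-squares formula
a_lse = (y * (x - x.mean())).sum() / ((x - x.mean()) ** 2).sum()

# slope as cov(x, y) / Var(x), with the 1/n convention used in this lesson
a_cov = ((x - x.mean()) * (y - y.mean())).mean() / ((x - x.mean()) ** 2).mean()

print(a_lse, a_cov)   # the two numbers coincide (up to floating point error)
```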
This course adopts a different approach for introducing terms like "correlation", "variance" and "covariance": instead of introducing them as operators on random variables (like universities do), we defined them in the context of having data. Now, the main remark here is that if you try to google these things, you might see something like \[ \text{sample variance} = \frac{1}{n-1} \cdot \sum_{i=1}^n \left( x_i - \overline{x} \right)^2 \] which is slightly different from what we did! If $n$ is large it will give almost exactly the same result, but it is technically different. So a reasonable question can pop up in your head: who is lying? Quanta or those internet resources?
The answer is "no one". The truth is that the word "variance", just like the words "covariance" and "correlation", means different things depending on the context. Unfortunately, many online resources, and even people, are too lazy to clarify that context. Without going in too deep, we simply ask you to bear with us until the end of the next lesson, where we will define those terms in a probabilistic setting (and those definitions are the same everywhere), and then comment a bit more on the $\frac{1}{n}$ vs $\frac{1}{n-1}$ situation.
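For what it's worth, this ambiguity is visible directly in numpy: `np.var` uses the $\frac{1}{n}$ convention from this lesson by default, and passing `ddof=1` switches it to the $\frac{1}{n-1}$ "sample variance" you may see elsewhere.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])

print(np.var(x))           # 1/n convention (ddof=0, numpy's default)
print(np.var(x, ddof=1))   # 1/(n-1) convention, often called "sample variance"
```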
The Linear Regression Model is a superstar in a way: both the interviews and the trading/quant jobs themselves like it. We have just started talking about this model, the simple 2D version of it to be precise. But as you can already see from the last exercise, the coefficients in the simple linear regression are not meaningless. Well, you definitely know that at least one of them makes sense given the basic intuition and knowledge about correlation and variance! Hint: the other coefficient is also insightful... Being able to provide motivation for, or talk about the meaning of, your result is also a skill that can be tested at an interview, by the way.
Moreover, we will use a simple linear regression for creating a trading algorithm that has actually helped save millions of dollars for a trading firm on a particular kind of trades. For real! However, we will get to it later, in lesson 4. For now, please master the concepts by solving the problems from today's Problem Set. As for lesson 3, i.e. the next one, we will go deeper into the maths theory and redefine the terms we just talked about. Once again, strong intuition around them, as well as mathematical formalism and the ability to apply it, are vitally important for passing the interviews! So don't try to avoid it.