PA-4 / Lesson 5


ClassworkProblem SetExtra

5. Conditional Probability

5.1 Basic Definitions

Two fair dice are rolled one after another. As the dice are fair, we assume a uniform probability model, where each of the 36 possible outcomes has a probability of $\tfrac{1}{36}$. Let $A$ be the event "both rolled numbers are even". There are exactly 9 outcomes that satisfy $A$, so the probability of $A$ is $\tfrac{9}{36} = \tfrac{1}{4}$.

Now, suppose you have the opportunity to see the result on the first die before rolling the second one, and define $B$ as the event "the number on the first die is even". Clearly, if $B$ does not occur (i.e., the first die shows an odd number), then $A$ cannot occur. Conversely, if $B$ does occur, the probability that $A$ will also occur (i.e., the second die shows an even number) is $\tfrac{1}{2}$. Thus, the probability of $A$ happening seems to depend on whether $B$ occurs...

However, this is not the case. The crucial thing to consider here is how the question is framed: are we interested in the probability of event $A$ alone, or are we considering "the probability of $A$ given $B$"? These are technically different questions (both are valid and could be of interest to someone), and so no wonder they have different numerical answers. Speaking more generally, the scenario where we are interested in the probability of something given something else is quite common. Therefore, let's introduce a few definitions to formalize discussions of this nature.

Let $(\Omega, \mathbb{P})$ represent a probability space, and let $\{ p_1, p_2, \dots \}$ denote the probability distribution (which we pick ourselves when defining the model we are working in). Remember that any event $A$ is thought of as a collection of several simple events, essentially forming a set. The probability of $A$, denoted $\mathbb{P}(A)$, is defined as the sum of the probabilities of these simple events. This interpretation extends to understanding the event "$A$ and $B$" as an event that consists of all simple events found in both $A$ and $B$. This intersection is denoted as $A \cap B$ or simply "$A$ and $B$". Therefore, $\mathbb{P}(A \cap B)$ represents the probability of the event "$A$ and $B$". Building on this, we can define:

Definition [Conditional Probability]

In the given probability space $(\Omega, \mathbb{P})$, consider any two events $A$ and $B$, such that $\mathbb{P}(B) \neq 0$. We define the probability of $A$ given $B$ as \[ \mathbb{P}( A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \] This $\mathbb{P}(A \mid B)$ is known as conditional probability. Although one might argue that all probabilities are conditional (simply set $B = \Omega$), the term "conditional" specifically highlights that we are considering the probability of an event $A$ given some additional information, represented by the event $B$.
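To make the definition less abstract, here is a minimal Monte Carlo sketch (in Python; not part of the original lesson) for the two-dice example above. The ratio of counts is exactly the $\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$ from the definition:

```python
import random

trials = 100_000
count_B = count_AB = 0
for _ in range(trials):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 % 2 == 0:          # event B: the first die is even
        count_B += 1
        if d2 % 2 == 0:      # together with B this gives A: both even
            count_AB += 1

# P(A | B) = P(A and B) / P(B), estimated by the ratio of the counts
print("P(A and B) ~", count_AB / trials)   # exact value: 9/36 = 0.25
print("P(B)       ~", count_B / trials)    # exact value: 1/2
print("P(A | B)   ~", count_AB / count_B)  # exact value: 1/2
```

With this many trials the estimates land close to the exact values $\tfrac{1}{4}$, $\tfrac{1}{2}$ and $\tfrac{1}{2}$.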

Exercise 1.

There is a box with $a$ green balls and $b$ orange balls. Two balls are drawn from this box, one after another, without replacement. Let

  • $B$ be the event "first ball picked is green".
  • $A$ be the event "second ball picked is green".

What is $\mathbb{P}(A \mid B)$?

Explanation and comments


Although the introduction above did consider the case where event $B$ does not occur, that case is of no interest at all when calculating $\mathbb{P}(A \mid B)$. That discussion was there only to highlight how $B$ influences $A$.
To address the problem's question, we need to determine $\mathbb{P}(A \cap B)$ and $\mathbb{P}(B)$:
1) As for $\mathbb{P}(A \cap B)$: This is the probability that both the first and second balls are green. Assuming a probability model where each ordered pair of balls is equally likely (a reasonable assumption for this problem), this probability is given by $\tfrac{a(a-1)}{(a+b)(a+b-1)}$.
2) As for $\mathbb{P}(B)$: Sticking to the same probability space as above, note that among all of the $(a+b)(a+b-1)$ ordered pairs of balls, $a \cdot (a+b-1)$ have the first ball green. Hence this probability is $\tfrac{a \cdot (a+b-1)}{(a+b)(a+b-1)} = \tfrac{a}{a+b}$.
Thus the final answer is \[ \frac{ (a(a-1)) / ((a+b)(a+b-1))}{(a (a+b-1)) / ((a+b)(a+b-1))} = \frac{a-1}{a+b-1}\]
This is the formal and correct way to answer this question. The fact that this answer is kind of obvious is a related, but different story... Indeed, note that "$A$ given $B$" describes the situation when there are $a+b-1$ balls left in the box, among which $a-1$ are green, so of course the answer should be $\tfrac{a-1}{a+b-1}$. However, given how we defined things, the long-ish calculation above is how one should answer this question to obtain a formal, clean solution.
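As a sanity check, one can also simulate Exercise 1 directly. A small sketch (the values $a = 5$, $b = 3$ are arbitrary, picked just for illustration):

```python
import random

def draw_two(a, b):
    """Draw two balls without replacement from a box
    with a green and b orange balls."""
    box = ["green"] * a + ["orange"] * b
    random.shuffle(box)
    return box[0], box[1]

a, b, trials = 5, 3, 200_000
count_B = count_AB = 0
for _ in range(trials):
    first, second = draw_two(a, b)
    if first == "green":           # event B occurred
        count_B += 1
        if second == "green":      # event A occurred as well
            count_AB += 1

print("P(A | B) ~", count_AB / count_B)     # simulated estimate
print("exact    =", (a - 1) / (a + b - 1))  # (a-1)/(a+b-1) = 4/7
```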

In the example above, $A$ is dependent on $B$ in the sense that the outcome of $B$ influences the probability of $A$ happening. On the other hand, when rolling two fair dice, the result on one is independent of the result on the other, in the sense that the probability of the first result being 1, 2, 3, 4, 5 or 6 is completely unaffected by what the result on the other die will be (or even by whether we throw it in the first place). These two situations with pairs of events feel different, but how exactly do we formalise this word "dependent"? What if the situation is very complex and it is not straightforward whether one event influences another? For example, having cancer and drinking more Coca-Cola: are these events independent or not quite? This discussion leads to another crucial definition in probability theory:

Definition [Independent Events]

Let $(\Omega, \mathbb{P})$ be a probability space and let $A$ and $B$ be any two events from it. We say that $A$ and $B$ are independent if and only if \[ \mathbb{P}( A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B) \] One could say that $A$ and $B$ are independent if $\mathbb{P}(A \mid B) = \mathbb{P}(A)$, which is even more intuitive; however, this works fine only if $\mathbb{P}(B) \neq 0$, because $\mathbb{P}(A \mid B)$ is otherwise undefined.

Note that the definition says nothing about the nature of $A$, $B$ or the probability space itself. It says that if the probability of "$A$ and $B$", i.e. $\mathbb{P}(A \cap B)$, happens to be equal to the product $\mathbb{P}(A) \cdot \mathbb{P}(B)$, then $A$ and $B$ are independent. NOT vice versa, like most people think!
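This point is easy to demonstrate mechanically: independence is verified by checking the defining equation, nothing else. A small sketch on the two-dice space, using exact arithmetic via `Fraction`:

```python
from itertools import product
from fractions import Fraction

# Uniform space for two fair dice: 36 equally likely outcomes.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    """Probability of an event (a set of outcomes) in this uniform space."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}   # "first die is even"
B = {w for w in omega if w[1] % 2 == 0}   # "second die is even"

# Check independence by the defining equation and nothing else:
print(P(A & B) == P(A) * P(B))   # True, so A and B are independent
```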

This definitional point is ultra important, and it gets violated (or not properly explained) in almost every single school book. The word "independent" used in everyday speech can of course mean a totally different thing, i.e. not what it formally means in probability theory. Just like with the definition of the word "probability" itself, you should not be scared by these formalities. Probability theory, with all of its definitions and theorems, is an attempt to formalise and quantify uncertainties. Using these formalities, fancy theorems and sometimes counter-intuitive results, we can derive much better conclusions about what happens (or is going to happen) in real life. It is a model for some processes in our universe. This model is not perfect, but, just like all of mathematics, it needs to be perfectly formal and clear. Thus, we cannot let "independent events" be defined as "some two events that feel like they do not affect one another".

Since it is expected that you have met these basic definitions before, or that you have enough experience with good-quality maths to process them immediately, we will not be spending much more time on this. For more on this topic, please visit the PA-2 course. We will give just one more example to stress the difference between the formal definition above and the way the word "independent" is used in real life.

Suppose there is a group of students who just sat their maths exam, and assume that 70% of them passed it. However, the administration did not quite believe these results, so it decided to run an independent maths test for these same students one more time. Only 50% of the students passed this new test. What is the probability that a randomly chosen student passed both tests?

Well, if we were to use our theories and definitions directly, without thinking, we would say that this probability is $0.7 \cdot 0.5 = 0.35$. However, this does not make any sense: good results on a maths test are not quite the same as winning a lottery. It is clear that a great student is much more likely to pass both tests than most others. Thus, despite the word "independent" being used in the problem statement, claiming that "a particular student passed the first test" and "a particular student passed the second test" are two independent events is absurd.
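If you want to see this failure of independence numerically, here is a hypothetical two-group model (all numbers below are made up purely for illustration; nothing here comes from the problem statement):

```python
# Hypothetical model: 40% "strong" students pass any test with prob 0.9,
# 60% "weak" students pass any test with prob 0.3 (made-up numbers).
w_strong, w_weak = 0.4, 0.6
p_strong, p_weak = 0.9, 0.3

p_pass_one  = w_strong * p_strong + w_weak * p_weak        # 0.54 per test
p_pass_both = w_strong * p_strong**2 + w_weak * p_weak**2  # 0.378

print(p_pass_both)      # 0.378
print(p_pass_one ** 2)  # 0.2916 -- not equal, so the events are dependent
```

Within each group the two tests are independent, yet overall $\mathbb{P}(\text{pass both}) \neq \mathbb{P}(\text{pass first}) \cdot \mathbb{P}(\text{pass second})$, exactly because strong students are more likely to pass both.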

5.2 Motivation

This course is created for those with at least some previous exposure to probability theory, or for those comfortable enough with formal mathematics and high-school formulas. Just in case you are struggling to see or remember why the formula above for conditional probability is natural, let us briefly consider a much simpler problem.

Exercise 2.

There is a school where 15% of the students speak French and 20% of the students speak Italian; 5% of the students speak both. What fraction of the Italian speakers also speaks French?

Explanation and comments


Let the total number of students in the school be $x$. Then there are $0.15x$ students who speak French, $0.2x$ students who speak Italian and $0.05x$ students who speak both. So, there are $0.05x$ students among the $0.2x$ Italian speakers who also speak French, and thus the fraction of the Italian speakers who also speak French is $\tfrac{0.05x}{0.2x} = 0.25 = 25\%$.

Even though the problem and the solution do not contain the word "probability", it is easy to interpret this question in a probabilistic setting. Simply replace all of the percentages and fractions with the word "probability". Thus, basically, what we were asked is: what is the probability that an Italian speaker also speaks French? The answer was derived to be $0.25$.

Hopefully, this exercise motivates the formula a bit more.
By the way, in a uniform probability space with $n$ simple events, if $A \cap B$ consists of $k$ simple events and $B$ consists of $m$ simple events, then the formula gives $\mathbb{P}(A \mid B) = \tfrac{k/n}{m/n} = k/m$, which makes perfect sense: conditioning on $B$ can be viewed as switching to a world where there are only $m$ simple events (all equally likely again), and where we are interested in what portion of those simple events are part of $A$, because this is exactly what tells us what $\mathbb{P}(A \mid B)$ is.
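Here is a tiny numeric check of this $k/m$ shortcut on Exercise 2, assuming a uniform model over a hypothetical $n = 100$ students (any $n$ works, since it cancels out):

```python
n = 100   # hypothetical school size; it cancels out
m = 20    # simple events in B: students who speak Italian
k = 5     # simple events in A ∩ B: students who speak both

p_B, p_AB = m / n, k / n
print(p_AB / p_B)   # 0.25 -- the definition of P(A | B)
print(k / m)        # 0.25 -- the k/m shortcut, same number
```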

5.3 Law of Total Probability

There are two key formulas in the topic of conditional probability that show up all the time; they are easy to prove, but they look somewhat scary when formally written down. The "Law of Total Probability" is one of them; the other is Bayes' formula, a simple yet foundational result for the entire field of statistics called "Bayesian Statistics". We will start with the first one.

Theorem [Law of Total Probability]

As always, let $(\Omega, \mathbb{P})$ be a probability space and let $A$ be an event from it. Consider a partition of $\Omega$ into events $B_1, ..., B_k$ of non-zero probability: the word partition means that no two of the events intersect and their union is the whole of $\Omega$. Then it is true that
\begin{align*}
\mathbb{P}(A) &= \mathbb{P}(A \cap B_1) + \mathbb{P}(A \cap B_2) + ... + \mathbb{P}(A \cap B_k) \\
&= \mathbb{P}(A \mid B_1) \cdot \mathbb{P}(B_1) + \mathbb{P}(A \mid B_2) \cdot \mathbb{P}(B_2) + ... + \mathbb{P}(A \mid B_k) \cdot \mathbb{P}(B_k)
\end{align*}

Any theorem requires a proof, but in this case the proof is pretty much written in the statement itself. Indeed, the first equality is clear: since every simple event $\omega_i \in A$ belongs to exactly one of the $B_i$, its probability contributes exactly once to the left-hand side and exactly once to the right-hand side. The second equality is just an application of $\mathbb{P}( A \mid B_i) = \frac{\mathbb{P}(A \cap B_i)}{\mathbb{P}(B_i)}$.
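The statement translates into a one-line function. A minimal sketch (the function and argument names are ours, not the lesson's):

```python
def total_probability(cond_probs, partition_probs):
    """Law of Total Probability:
    P(A) = sum over i of P(A | B_i) * P(B_i), where B_1, ..., B_k
    partition Omega (so their probabilities must sum to 1)."""
    assert abs(sum(partition_probs) - 1.0) < 1e-9, "not a partition"
    return sum(ab * b for ab, b in zip(cond_probs, partition_probs))
```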

Exercise 3.

Items are produced by two workers separately. The first one made 2/3 of all the items, the second one made 1/3. The chance that an item from the first worker is broken is 0.01, while for the second one this chance is 0.1. What is the probability that a randomly chosen item is broken?

Explanation and comments


Suppose there are $n$ items produced in total. Consider a uniform probability model with $n$ simple events corresponding to choosing one of the $n$ items. To be honest, this $n$ does not even play a role in the calculations below; it was introduced just to make things more formal :)

Let $A$ be the event that a randomly chosen item (out of the $n$ there are) is broken. Define $B_1$ and $B_2$ to be the events that a randomly chosen item was made by worker 1 and worker 2 respectively. Then $\mathbb{P}(B_1) = 2/3$, while $\mathbb{P}(B_2) = 1/3$. Moreover, the 0.01 from the problem statement is exactly $\mathbb{P}(A \mid B_1)$, while the 0.1 from the problem statement is exactly $\mathbb{P}(A \mid B_2)$. Thus, using the Law of Total Probability: \[ \mathbb{P}(A) = \mathbb{P}(A \mid B_1) \cdot \mathbb{P}(B_1) + \mathbb{P}(A \mid B_2) \cdot \mathbb{P}(B_2) = 0.01 \cdot 2/3 + 0.1 \cdot 1/3 = 0.12 / 3 = 0.04 = 4 \% \]
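The same computation in code, using the `total_probability` sketch from above, plus a Monte Carlo cross-check of the same two-worker model:

```python
import random

print(total_probability([0.01, 0.1], [2/3, 1/3]))  # 0.04

# Monte Carlo cross-check: sample a worker, then sample defectiveness.
trials, broken = 200_000, 0
for _ in range(trials):
    made_by_first = random.random() < 2/3          # worker 1 with prob 2/3
    p_defect = 0.01 if made_by_first else 0.1
    if random.random() < p_defect:
        broken += 1
print(broken / trials)                             # roughly 0.04
```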

5.4 Bayes' Formula

One of the reasons the Law of Total Probability turns out to be helpful is that it breaks a non-obvious probability into more manageable pieces. Bayes' formula plays a similar role: it expresses $\mathbb{P}(B \mid A)$, which turns out to be of interest quite often, in terms of easier (or simply given) quantities.

Theorem [Bayes' Formula]

Let $(\Omega, \mathbb{P})$ be a probability space and let $A$ be an event from it. Consider a partition of $\Omega$ into events $B_1, ..., B_k$ of non-zero probability. Then the following holds: \[ \mathbb{P}(B_i \mid A) = \frac{\mathbb{P}(A \mid B_i) \cdot \mathbb{P}(B_i)}{\mathbb{P}(A \mid B_1) \cdot \mathbb{P}(B_1) + \mathbb{P}(A \mid B_2) \cdot \mathbb{P}(B_2) + ... + \mathbb{P}(A \mid B_k) \cdot \mathbb{P}(B_k)} \]

In the simplest case, when the partition has just two events, i.e. the events $B$ and $B^c$ (this is a way to denote the complement of the event $B$, i.e. $\Omega \setminus B$), the formula above becomes \[ \mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \mid B)\mathbb{P}(B)}{\mathbb{P}(A \mid B)\mathbb{P}(B) + \mathbb{P}(A \mid B^c)\mathbb{P}(B^c)}\]
The idea of this formula is to show how to update a set of probabilities $\mathbb{P}(B_1), ..., \mathbb{P}(B_k)$, which represent someone's beliefs about something, when new information (the event $A$) arrives. This is why on the left-hand side you see $\mathbb{P}(B_i \mid A)$.

However, from the algebraic standpoint, this formula is quite obvious. By the Law of Total Probability, the denominator on the right-hand side is equal to $\mathbb{P}(A)$. Plugging this into Bayes' formula turns it into $\mathbb{P}(B_i \mid A) = \frac{\mathbb{P}(A \mid B_i)\mathbb{P}(B_i)}{\mathbb{P}(A)}$, which is equivalent to $\mathbb{P}(B_i \mid A) \cdot \mathbb{P}(A) = \mathbb{P}(A \mid B_i) \cdot \mathbb{P}(B_i)$, which is true since both sides are equal to $\mathbb{P}(B_i \cap A)$.
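Written as code, Bayes' formula is just the Law of Total Probability sitting in the denominator. A minimal sketch (again, the names are ours):

```python
def bayes_posterior(likelihoods, priors):
    """Return P(B_i | A) for every i, given the likelihoods P(A | B_i)
    and the priors P(B_i) for a partition B_1, ..., B_k."""
    p_a = sum(l * p for l, p in zip(likelihoods, priors))  # P(A), by LTP
    return [l * p / p_a for l, p in zip(likelihoods, priors)]
```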

While the formula may be easy to prove, people often find it difficult to apply. You will get to practise more in the Problem Set (and Extra, if you have time and if you are brave enough); for now, let's just do one standard Wikipedia exercise on it:

Exercise 4.

A factory produces items using three machines — A, B, and C — which account for 20%, 30%, and 50% of its output respectively. Of the items produced by machine A, 5% are defective; similarly, 3% of machine B's items and 1% of machine C's are defective. If a randomly selected item is defective, what is the probability it was produced by machine C?

Explanation and comments


Let $X_i$ denote the event that a randomly chosen item was made by the $i$-th machine (for $i = A, B, C$). Let $Y$ denote the event that a randomly chosen item is defective. Then, we are given the following information: \[ \mathbb{P}(X_A) = 0.2, \quad \mathbb{P}(X_B) = 0.3, \quad \mathbb{P}(X_C) = 0.5 \] If the item was made by machine A, then the probability that it is defective is 0.05; that is, $\mathbb{P}(Y \mid X_A) = 0.05$. Overall, we have \[\mathbb{P}(Y \mid X_A) = 0.05, \quad \mathbb{P}(Y \mid X_B) = 0.03, \quad \mathbb{P}(Y \mid X_C) = 0.01\] To answer the original question, we first find $\mathbb{P}(Y)$. That can be done in the following way: \[\mathbb{P}(Y) = \sum_i \mathbb{P}(Y \mid X_i)\mathbb{P}(X_i) = (0.05)(0.2) + (0.03)(0.3) + (0.01)(0.5) = 0.024\] Hence, 2.4% of the total output is defective. We are given that $Y$ has occurred, and we want to calculate the conditional probability of $X_C$. By Bayes' formula, \[ \mathbb{P}(X_C \mid Y) = \frac{\mathbb{P}(Y \mid X_C)\mathbb{P}(X_C)}{\mathbb{P}(Y)} = \frac{0.01 \cdot 0.50}{0.024} = \frac{5}{24} \] Given that the item is defective, the probability that it was made by machine C is $\frac{5}{24}$, and so this is the final answer. Note that although machine C produces half of the total output, it produces a much smaller fraction of the defective items.
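The same answer via the `bayes_posterior` sketch from above:

```python
posteriors = bayes_posterior(likelihoods=[0.05, 0.03, 0.01],
                             priors=[0.2, 0.3, 0.5])
print(posteriors)        # [0.4166..., 0.375, 0.2083...]
print(posteriors[2])     # 0.2083... = 5/24, machine C's share of defects
print(sum(posteriors))   # 1.0 -- posteriors over a partition sum to 1
```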

Conclusions and next steps

Today's lesson contains a few definitions and results of paramount importance for the whole of probability theory. The Law of Total Probability is used so often, and is so intuitive, that it is rarely even explicitly referenced. As for Bayes' formula: despite its simplicity, it will almost surely come up during an interview at a finance company, or you will rely on it heavily when doing Machine Learning.

The next step is to do a problem set, as always. Note that this is the last problem set of the course, as there will be none after the last lesson. Good luck!