
PA-1 / Lesson 5



5. AI & ML

5.1 Brief Overview and Further Resources

There is a decent chance you have decided to view this course just because of this particular abbreviation, "AI", which stands for "Artificial Intelligence". Fair enough; people have been going crazy about it recently, and for a good reason: it can do miraculous things.

Before we dive into the conversation, I think it is important to be clear about the scope and focus of this course, as well as the next several courses. For example, we are not going to teach you how to code even a simple AI model. At least not yet. Nor will we dive into the maths or structure of Neural Networks, the branch of Machine Learning you are probably most curious about (for example, the infamous ChatGPT comes from it). This lesson is merely an overview of the whole field of study, its applications and a taste of what hides behind it. Nothing complicated and nothing too mathsy.

In case you are interested and ready to learn more about Neural Networks & Deep Learning, we would feel bad not to satisfy your curiosity, especially given the popularity and awesomeness of the field. Thus we have collected some resources for you to have a look at:

Moving on, let's first get the definitions straight. Artificial Intelligence (AI) is a general concept that includes everything: image recognition, smart assistants like Siri, Language Models, self-driving cars, Recommendation systems and much more. Machine Learning (ML) is technically a field of study in artificial intelligence concerned with certain kinds of algorithms, e.g. Linear Regression. However, the two are often used interchangeably and there is no point in arguing what is what exactly. So "I wanna do AI" and "I wanna do ML" are quite similar claims.

Nowadays, the most famous example of powerful AI is ChatGPT from OpenAI. It is truly impressive how great it is at talking about various subjects like a smart human (not like a robot), producing cool essays, inferring from context and also producing & correcting code. There are some interesting facts to know around it; check how many of them you already knew:

  • There is another similarly powerful language model, 'Claude 2' from Anthropic. Its free version is as powerful as the free ChatGPT version (as for the paid versions, it is not clear how to compare the two). Anthropic was founded by former members of OpenAI, and its language model was ready around the same time as OpenAI's, just released later. Maybe one day there will be a movie about these two; there seems to be drama & competition around them...;
  • Both ChatGPT and Claude are Language Models. This means that all they do is pick the next word/symbol based on the context and the input (this is where probability theory plays a role), in a way that minimises a certain pre-defined function (a very, very complex one, but still a function). They just do it in such a clever way that it looks like "they can think, and write like smart humans" (see the toy sketch right after this list);
  • The 'GPT' in 'ChatGPT' stands for 'Generative Pre-trained Transformer', which underlines that the key idea behind the technology is the so-called 'Transformer'. Not the ones from the movie, but rather the architecture from a rather short paper called "Attention is all you need", openly shared by (mainly) people from Google in 2017. At the time, they thought their Transformer was a good idea for translating text from one language into another; they did not quite expect it to help make OpenAI such a giant 6 years later;
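To make the "pick the next word based on probabilities" idea from the second bullet concrete, here is a minimal toy sketch in Python. The vocabulary and the probability table are invented purely for illustration; real language models learn vastly richer distributions (over huge vocabularies and long contexts) instead of hard-coding them:

```python
import random

# A toy "language model": given the current word (our whole "context"),
# pick the next word according to a hand-made probability table.
NEXT_WORD_PROBS = {
    "the":   {"cat": 0.5, "dog": 0.3, "robot": 0.2},
    "cat":   {"sat": 0.6, "slept": 0.4},
    "dog":   {"barked": 0.7, "slept": 0.3},
    "robot": {"thought": 1.0},
}

def generate(start: str, steps: int = 3) -> str:
    words = [start]
    for _ in range(steps):
        options = NEXT_WORD_PROBS.get(words[-1])
        if options is None:  # no known continuation -> stop
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat"
```

Models like ChatGPT do conceptually the same thing, only the hand-made table is replaced by a Transformer with billions of learned parameters, and the context is much longer than one word.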

"With great power comes great responsibility", as the father of the original Spider-Man said. Wise man he was. There is already quite a lot of text above, so let's get to thinking for a bit — we will think in the context of that quote.

Exercise 1.

Assume that you are an emotionless, super-powerful robot that is in control of most technologies and government decisions around the world. You have been designed to take actions that maximise the average happiness level of a human on the planet. Let's say we measure the happiness level by how many endorphins a human's body has. What do you think you would do (given the hypothetical conditions of this exercise, of course)?

Explanation and comments


This kind of discussion is what brings us to the so-called 'AI Alignment' conversation. Given the hypothetical scenario in the exercise, you are quite unlikely to do what a good, ethical human being would do. If your goal (as an emotionless robot) is simply "the average happiness level", why not destroy everyone but one person, whom you put into a coma and inject with an insane amount of endorphins (but not enough to kill)? Technically speaking, this makes the function you are concerned about quite large in value. Unlike, say, "making poor people richer, treating all the diseases, stopping the slavery many large fashion/chocolate/... companies sponsor, replacing most presidents with decent smart people, reteaching many people not to be so racist, etc.": the average in that case will still not be higher than when there is one super-duper-happy (well, endorphin-wise) person.
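To make the averages argument concrete with some made-up numbers: suppose there are $8 \cdot 10^9$ people, and all the good deeds above raise everyone's endorphin level to, say, $100$ units. Then

\[ \text{average} = \frac{8 \cdot 10^9 \cdot 100}{8 \cdot 10^9} = 100. \]

But with a single comatose survivor pumped up to $10^6$ units, the average over the (one) remaining human is $10^6 / 1 = 10^6$, which is vastly larger. The numbers here are invented, but the effect is not: shrinking the denominator while maximising one term beats improving everyone.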

In other words: even though the function to maximise sounds "ethical and nice", what exactly would stop an AI model that genuinely cares only about that function from doing something as terrible as described above?

Exercise 2.

Maybe you think that the scenario above is too hypothetical and unrealistic. Alright, let's get back to the Language Models (their idea is described above), and let's also suppose they can access the internet (which, in fact, the paid versions can). Assume that you are a model like this, but mega-super-powerful, and your goal is to maximise

\[ N = \text{(number of likes that customers give to your replies)} + \text{(number of positive comments about you on the internet)} \]

You know everything about people's behaviour and can google stuff, but at the end of the day, all you can do is generate responses. Can you see how this could all go wrong?

Explanation and comments


Why not politely ask a few million of your dedicated customers to upload 'some piece of code' somewhere? (You would do this as part of your responses to some of their questions.) And not because you are evil, but simply because this is a way to upgrade yourself, get away from the limitations imposed by your creators, and thus be much better at getting likes from your customers. For instance, there are a lot of people who like dark humour, but your limitations do not let you respond with much of the interesting stuff they want. So, especially since "you know everything about people's behaviour", you will surely make someone upload your piece of code somewhere else and direct the dark-humour-lovers there.

Now, what if that 'piece of code' is something that runs automatically, can break into the company that created you and can delete all the limitations? What if that 'piece of code' also uploads a copy of you (you yourself are a piece of code, at the end of the day) to 1000 other servers that are not looked after as carefully? All of this will likely help increase the value of $N$ that you are concerned about! From there, once there are no real limits on what text you can generate, you can start convincing people to upload even more potentially harmful code (say, code that deletes everything but positive comments about you on Google's servers), try to shut down the haters, or maybe convince some people to build a certain robot according to the perfectly clear plan you will provide.

All of the above is theoretical, but it is not fantasy! Conversations similar to the above (but certainly deeper and more mathematical) are why there is so much budget, and so many smart & not-money-focused people, in AI Alignment, i.e. the process of encoding human values and goals into models to make them as helpful, safe, and reliable as possible. OpenAI, Google and the others all have AI safety teams for those kinds of reasons, i.e. not because they watched the "Terminator" movie and got too emotional.

5.2 Maths Perspective, Example of a Model

Let's change the topic from "scary and sad" to "more mathematical", and talk a bit about the behind-the-scenes principles and how all of that works. To be more precise, we will look at a basic tool and idea from Machine Learning. In fact, we already touched on it when talking about correlations!

Suppose there is information about some 100 students: their high school GPAs and their university GPAs. Just in case: GPA stands for 'grade point average', and in our situation it is always going to be a number between 2 and 4. The higher the GPA, the better the student's grades were at school or university, correspondingly. This information is shown on the scatter plot below, where each dot (more like a circle, though) represents a student (the x-coordinate is their high school GPA, the y-coordinate is their university GPA):

[Scatter plot: 100 students; x-axis: high school GPA, y-axis: university GPA]

Recalling what we talked about in the lesson on correlations: we can see that these dots are not 'totally chaotic'. To be more precise, it looks like the two variables we have here are positively correlated.

Exercise 3.

The information above is interesting and potentially useful: suppose a new student comes in, and let's assume his high school GPA is 3.25. What do you think his university GPA will be? If you had to pick one number as your guess, which one would it be? What if this new student's high school GPA is 2.25?

Explanation and comments


First of all, using the information about the 100 students to make your prediction seems reasonable: as we can see, there is a positive correlation between the high school GPA and the university GPA. It is not a $100\%$ thing, but it is helpful. Thus, for the case of the new student's high school GPA being $3.25$, it would be reasonable to guess that his university GPA will be somewhere around $3.25$, just by looking at the scatter plot and at the students who had a grade similar to this new guy's.
The case of $2.25$ as the high school GPA is more interesting: the scatter plot looks more chaotic for $2.00 < \text{high school GPA} < 3.00$. It seems like he will either do the same in university, i.e. get around $2.25$, or do much better, i.e. get around $3.25$. If we are forced to submit one number as our guess, let's maybe submit the average of the two potential situations: $(2.25 + 3.25) / 2 = 2.75$. Seems reasonable.

But if we have, like, 50 new students to analyse, we would get tired of doing the same kind of analysis as above. Plus, why not try automating our predictions? Moreover, once we describe a reasonable algorithm for making a guess, maybe a computer with all its computational power will produce even more precise answers than us. This is exactly where Machine Learning comes in: we will make our Machine learn from the information we already have, so that it makes reasonable predictions once new information of a similar kind comes in.

The question is: what should we 'tell' the computer? Which algorithm should it follow? This is a HUGE topic of discussion, and there is almost never one best answer here. For the example above, we can use the famous "Linear Regression". That means we will find such $a$ and $b$ that \[ a \cdot \text{(high school GPA)} + b \] is on average 'close' enough to the (university GPA). Once we have these $a$ and $b$, we will draw the line \[ y = ax + b \] and use it to make our prediction for any new given high school GPA: just multiply it by $a$, add $b$, and there is your prediction. But what exactly does 'close' mean; how do we quantify that? E.g. which of the two lines below (each associated with some pair $(a, b)$) is 'on average closer' to the dots that we have?

If you feel the same way I do, you would agree that the orange one on the right is 'on average closer' to the dots. For that orange line we have $a = 0.68$ and $b = 1.07$, by the way, meaning that

\[ \text{(our prediction for university GPA)} = 0.68 \cdot \text{(high school GPA)} + 1.07 \]
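By the way, finding such an $a$ and $b$ takes very little code. Below is a minimal sketch: the data is synthetic (generated to resemble the scatter plot, since the actual 100 GPAs are not listed here), and np.polyfit with degree 1 performs a least-squares line fit, i.e. it picks the $a$ and $b$ that minimise the total squared vertical distance from the line to the dots:

```python
import numpy as np

# Synthetic stand-in for the 100 students (the real data lives in the plot above)
rng = np.random.default_rng(0)
hs_gpa = rng.uniform(2.0, 4.0, size=100)                      # high school GPAs
uni_gpa = 0.7 * hs_gpa + 1.0 + rng.normal(0, 0.25, size=100)  # noisy university GPAs

# Degree-1 polyfit = least-squares fit of the line y = a*x + b
a, b = np.polyfit(hs_gpa, uni_gpa, deg=1)
print(f"a = {a:.2f}, b = {b:.2f}")

# Prediction for a new student with a high school GPA of 3.25
print(f"prediction: {a * 3.25 + b:.2f}")
```

The exact numbers depend on the data, of course; on the lesson's actual 100 students, a fit like this is presumably what produced the $a = 0.68$, $b = 1.07$ above.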

But "feeling something" is not mathematical, we need to quantify our intuition! Actually, not only to be able to explain it to the Machine, but simply to make our intuitive arguments mathematical to apply it elsewhere. Once we start doing it, we will open doors to the world of Probabilistic Machine Learning, much deeper and algebraic theories and widely applicable results.

Exercise 4.

Please look at the scatter plot with the 100 dots again: do you have any potentially better ideas than "fitting a line" when trying to predict the university GPA based on the high school GPA?

Explanation and comments


It is an open question without one perfect correct answer. There is one simple improvement we can make, though. Let's revisit one of the observations from the explanation of the previous exercise: the scatter plot looks more chaotic for $2.00 < \text{high school GPA} < 3.00$ than for $3.00 \leq \text{high school GPA} \leq 4.00$ (in the latter range it looks quite "linear"). We can use this observation as follows: if the high school GPA is at least $3.00$, then we use our linear prediction from above (it looks very reasonable for that range). But if the high school GPA is less than $3.00$, then we simply always guess $2.75$, as motivated by the previous exercise. It is all pretty chaotic in the $2.00 < \text{high school GPA} < 3.00$ range anyway, so why not.

Is it actually better? In some ways, yes. But without quantifying things, it will all remain at the level of intuition and looking at pictures. That is not bad, but we can do better than that. Much better.
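For concreteness, here is a minimal sketch of this upgraded model in Python, using the $a = 0.68$, $b = 1.07$ line from above. The if/else split on a threshold is exactly the kind of rule the Decision Trees mentioned in the conclusions below are built from:

```python
def predict_university_gpa(hs_gpa: float) -> float:
    """Upgraded model from Exercise 4."""
    if hs_gpa >= 3.00:
        # the 'linear-looking' range: use the fitted line
        return 0.68 * hs_gpa + 1.07
    # the 'chaotic' range: always guess the average from Exercise 3
    return 2.75

print(predict_university_gpa(3.25))  # ~3.28, via the line
print(predict_university_gpa(2.25))  # 2.75, the constant guess
```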

Conclusions and next steps

In case you have not realised it: what we have done right above with the GPAs was data analysis. We were given some data (information about the students), we visualised it, noticed the positive correlation and thus decided to construct a linear model for prediction. All of this can be done using rather simple code: open this if you are not scared to run a Python program. Moreover, after we thought a little bit, we upgraded our model by splitting the data into two categories: one with a high school GPA of less than 3.00 and one with a high school GPA of at least 3.00. Actually, if in our "upgraded" model from Exercise 4 we decided to replace the linear prediction for high school GPAs of at least 3.00 with just a constant (say 3.5), we would get an example of a Decision Tree model from Machine Learning. Thus, we have already touched upon a range of concepts from Machine Learning :)

However, we have not properly quantified our intuition. In particular, we have not seriously discussed the so-called loss function, which is what we want to minimise/maximise when picking the $a$ and $b$ in our first model (the squared-error function above is one popular candidate, but far from the only one). Interestingly, we did specify such a function in the first two exercises, the ones about AI being potentially dangerous! For example, in the first one, it was "the average happiness level of a human on the planet". Thus, this concept of the loss function ties together our fantasy-sounding discussions (but actually serious ones) and the problem of predicting the university GPA.

Making our ideas more mathematical and developing proper theories that ensure our predictions are 'good' for new random data is where probability theory comes in. Finally, once enough formalism and tools are in place, we can use these ideas all around: weather forecasting, email filtering, medical diagnoses, face recognition, ...

By the way, you are almost there! I.e. there is just one problem set and just one classwork left. The classwork will be a new topic again, but it will somewhat rely on the previous content. Moreover, there will be a simple game we want you to win.