
3. Lying with statistics

3.1 Introduction

Statistics is the branch of science dedicated to collecting and interpreting large amounts of data. It relies on probability theory. It is also exactly the field that supplies humanity with a lot of junk, while being incredibly useful for pushing science forward.

Media, magazines, and armchair experts all "love" scientific studies. They "love" them so much that they spread and discuss them A LOT. Listen to enough of this and you may come to believe something weird, e.g. that everything we eat both cures and causes cancer. This is not an exaggeration:

[Figure]

So, "everything causes cancer". Is this really true? No. And this is not the type of conclusion you want to draw from science, especially given that it is wrong. To be fair, there is not much you can conclude from the picture above. Ideally, we should know what kind of "studies" these are (i.e. go through them one by one or have a nice, honest summary) and what group of people (including how many) they cover.

This is what all of us should be careful about: conclusions. But we are not. The media in particular messes them up, mainly because it wants catchy titles, which often turns a non-revolutionary study into an incredible-sounding headline. Sometimes they go beyond any limits of comprehension...

[Figure]

As a rule of thumb: try not to hold a lot of "slogan-opinions". Sadly, the truth is often more complicated than a quote like "Try everything" or "Medicine heals". Plus, obviously, be careful with your sources and keep your critical thinking turned on, especially because far from all the tricks (that the media or certain people use) are mathematical. Some are purely "visual", and some are not even tricks but plain lying. We will look at some of those as well, but note that the number of ways to lie with statistics is huge, so listing and remembering them all is impossible. Luckily, you do not need to remember them as long as, once again, you think critically.

3.2 Exercises

Exercise 1.

Based on his research at work, a psychiatrist concluded that $87\%$ of the people in the world are neurotic. What do you think about this $87\%$?
(You do not have to think anything, of course, but whenever we ask such questions, we want you to process the information and give us your reasoned feedback.)

Explanation and comments


That "research at work" is unfair, in the sense that neurotic people are clearly more likely to visit a psychiatrist. In other words, this research does not tell us much about people all over the world, but rather about people who go to psychiatrists, which is a special group (who are, of course, more neurotic on average: that is largely why they go to such doctors).

This is an incredibly popular way to mess up statistics, called bias (more precisely, selection bias: the group that was studied does not represent the group the conclusion is about). In everyday language, the word bias means "an inclination or prejudice for or against one person or group". Another somewhat extreme example is an owner of two orange cats, both crazy, who concludes that "all orange cats are crazy". Another common example is people claiming "everyone likes/does/prefers/..." by arguing "all my friends and their friends are like that". More often than not, most of your friends come from the same kind of community: all middle-class, all foreign students at a good university, all STEM, ... Thus people's opinions on most topics like that are biased. Often extremely biased.
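To see how strong this effect can be, here is a minimal sketch of the psychiatrist example. All three input numbers are made up for illustration (they are not from any real study): we assume that only $10\%$ of the whole population is neurotic, that a neurotic person visits a psychiatrist with probability $30\%$, and that a non-neurotic person does so with probability $0.5\%$.

```python
def share_neurotic_among_patients(base_rate, visit_if_neurotic, visit_if_not):
    """Fraction of a psychiatrist's patients who are neurotic.

    base_rate:         share of neurotic people in the whole population (assumed)
    visit_if_neurotic: chance that a neurotic person visits a psychiatrist (assumed)
    visit_if_not:      chance that a non-neurotic person visits one (assumed)
    """
    neurotic_patients = base_rate * visit_if_neurotic
    other_patients = (1 - base_rate) * visit_if_not
    return neurotic_patients / (neurotic_patients + other_patients)


# Hypothetical inputs, chosen only to illustrate the effect.
share = share_neurotic_among_patients(0.10, 0.30, 0.005)
print(f"Neurotic share in the whole population: {0.10:.0%}")
print(f"Neurotic share among the patients:      {share:.0%}")
```

Under these assumptions the psychiatrist would see a share close to his $87\%$ at work, even though only $10\%$ of the population is neurotic. The number he measured describes his patients, not the world.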

Exercise 2.

Mr. Anas brought a piece of paper from his presentation where he was showing that the company's total amount of money is growing a lot over the years. What do you think about this piece of information?
(This exercise has a graph in it and we hope you already worked a bit with them. Otherwise, feel free to skip this one.)

[Figure: line graph of the company's total amount of money over the years, with no label on the vertical axis]

Explanation and comments


This picture with a line going up (i.e. this graph) does not really show how big the change is. For people who know about graphs: this plot lacks a y-label. E.g. what if each little square represents a change of 1 penny? Then all we know is that the company's total amount of money increased by less than a dollar over the course of 8 years. This is not much at all... For those who have heard about "inflation": this is actually a very bad result for any company.

Exercise 3.

There is information about the imaginary country of Hairland that people without moustaches usually earn twice as much as people with moustaches. Which of the two pictures below would you use to present such information and why?

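[Figure: two pictures, left and right, each comparing the money bags of a man with a moustache and a man without one]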

Explanation and comments


If you want to be fair, then you must use the picture on the left. The picture on the right is deceiving. The reason is that in the picture on the right, the bag of the guy without a moustache is both twice as tall and twice as wide. This means that his bag covers 4 times the area (this is why it looks so much bigger), not 2 times as it should. This gives an exaggerated impression of the difference.
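The arithmetic behind this trick is tiny, so here is a minimal sketch of it. The function name `visual_ratio` is ours, and we assume that the eye judges a flat picture by its area.

```python
def visual_ratio(height_factor, width_factor):
    """How many times larger a picture looks (by area) after scaling it."""
    return height_factor * width_factor


# Fair picture: only the height is doubled, so the bag looks 2 times larger.
fair = visual_ratio(2, 1)

# Deceiving picture: height and width are both doubled.
deceiving = visual_ratio(2, 2)

print(f"Fair picture:      {fair} times larger")
print(f"Deceiving picture: {deceiving} times larger")
```

In general, scaling both sides by a factor $k$ multiplies the area by $k^2$, so the exaggeration grows quickly as the true ratio grows.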

This is a visual trick. News and media love using pictures to dramatize something, and you will surely see a lot of such examples in the future. This trick is quite clever, because technically the newspaper does not lie: the information it presents is true, and only the picture is deceiving. I.e. no one can seriously accuse the newspaper of lying in this case, but most readers will look at the picture and will almost surely get the wrong impression of what is going on.

Exercise 4.

"Four times more fatalities occur on the highways at 7 p.m. (i.e. in the evening) than at 7 a.m. (i.e. in the morning)." Does this information show that driving in the morning is safer than driving in the evening?

Explanation and comments


No, it doesn't. This is simply because there might be many more cars on the road in the evening (which is true, by the way). So if, say, there are 10 times more cars at 7 p.m. than at 7 a.m., then the number of fatalities per car is actually lower in the evening, and it would be more reasonable to conclude that driving in the evening is safer.
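Here is a minimal sketch of that comparison. The factor of 4 comes from the exercise and the factor of 10 is the "say" number from the explanation, not real traffic data. The function name is ours.

```python
def relative_risk_per_car(fatality_ratio, traffic_ratio):
    """Evening risk per car divided by morning risk per car.

    fatality_ratio: evening fatalities / morning fatalities
    traffic_ratio:  evening number of cars / morning number of cars
    """
    return fatality_ratio / traffic_ratio


# 4 times more fatalities, but (hypothetically) 10 times more cars.
risk = relative_risk_per_car(4, 10)
print(f"Evening risk per car is {risk:.1f} times the morning risk")
```

A value below 1 means that each individual car is safer in the evening, even though the total number of fatalities is higher. What matters is the rate, not the raw count.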

Many people use this trick in conversations. It is called "if you want to prove something, demonstrate something else and pretend that they are the same thing".

3.3 Interesting Observation about Studies

The tricks described above are indeed quite common, so please watch out. However, there is also a huge issue in the use of statistical analysis that is mathematically deeper than just a "visual trick" or "bias". We won't be able to go through all the details of the problem, but we will be able to touch on the topic.

Suppose there is a field with 1000 hypotheses to test (claims of the sort "Does aspirin help with cancer?" that we check by running a survey or something similar) and suppose that 100 of them are actually true. The exact numbers do not matter much (only that 100 is $10\%$ of $1000$); they are just convenient for the calculations and for making the point. Now, we don't know which of the hypotheses are true, so we will run 1000 research projects, where in each of them we will run some surveys and hopefully find out which of the hypotheses are true and which are not.

However, we will sometimes reach wrong conclusions. No such research is perfect, and there is always some good or bad luck involved in surveys. Studies are often designed so that, if the hypothesis is true, there is an $80\%$ chance that we will find out it is true; and if the hypothesis is in fact wrong, there is still a $5\%$ chance that we will obtain results showing that it is true (some readers may recognise the usual p-value threshold here). Thus we will find $0.8 \cdot 100 = 80$ of the true hypotheses (and miss $20$ of them) and think that $0.05 \cdot (1000-100) = 45$ of the wrong hypotheses are in fact true (all the others we will correctly show to be wrong). All $80+45=125$ of them will be published as "cool findings".
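The counts above can be reproduced with a few lines of code. This is only a sketch of the expected values under the stated $80\%$ and $5\%$; the function and variable names are ours, and a real field would of course show random variation around these numbers.

```python
def expected_findings(n_hypotheses, n_true, detection_rate, false_positive_rate):
    """Expected outcome of testing every hypothesis once.

    detection_rate:      chance that a true hypothesis is found to be true
    false_positive_rate: chance that a wrong hypothesis looks true anyway
    """
    true_found = detection_rate * n_true
    true_missed = n_true - true_found
    false_alarms = false_positive_rate * (n_hypotheses - n_true)
    return true_found, true_missed, false_alarms


true_found, true_missed, false_alarms = expected_findings(1000, 100, 0.8, 0.05)
print(f"True hypotheses found:       {true_found:.0f}")
print(f"True hypotheses missed:      {true_missed:.0f}")
print(f"Wrong hypotheses 'confirmed': {false_alarms:.0f}")
print(f"Total 'cool findings':       {true_found + false_alarms:.0f}")
```

Note that the 45 false alarms come from the large pool of 900 wrong hypotheses: even a small $5\%$ error rate adds up when most of the things being tested are not true.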

[Figure]

Finally, note that people rarely publish studies showing that "some hypothesis is not true". Who would read an article saying "eating 10 blueberries a day does not make your brain larger" or "nothing up with frill-necked lizards"? No one, really. But let's say 20 such articles are published anyway (the ones that sound the most interesting). Thus, altogether, we will publish $125+20=145$ studies and 45 of them will be wrong. This makes up $\frac{45}{145} \approx 0.31 = 31\%$ of all the published studies.

[Figure]

Nearly a third of all published results are wrong! Of course, this is just an approximation, and in particular it assumes those $80\%$ and $5\%$ to be accurate (in some disciplines, the requirements are stricter). Plus, it does not mean that all scientific journals, including famous ones, are like that: I would still trust most of the articles from, say, Nature. However, when it comes to all publications everywhere, the estimate that we got turns out to be reasonable...
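To get a feeling for how much the estimate depends on those two percentages, here is a rough sketch that repeats the calculation with other values. The alternative values ($1\%$ and $50\%$) are made up for comparison, the function name is ours, and, as in the text, we assume that the 20 published "negative" articles are all correct.

```python
def wrong_share(detection_rate, false_positive_rate,
                n_hypotheses=1000, n_true=100, negative_published=20):
    """Expected share of published studies whose conclusion is wrong."""
    true_found = detection_rate * n_true
    false_alarms = false_positive_rate * (n_hypotheses - n_true)
    published = true_found + false_alarms + negative_published
    return false_alarms / published


# First pair: the values from the text. The other two are hypothetical.
for detection, false_pos in [(0.8, 0.05), (0.8, 0.01), (0.5, 0.05)]:
    share = wrong_share(detection, false_pos)
    print(f"detection {detection:.0%}, false positives {false_pos:.0%}: "
          f"{share:.0%} of published studies are wrong")
```

A stricter false positive rate lowers the share of wrong publications, while studies that detect true effects less often raise it.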

Conclusions, and next steps

There is a famous quote, "There are three kinds of lies: Lies, Damned Lies, and Statistics", which was popularized by Mark Twain (among others). Just like most slogans, you need to be careful when interpreting it: statistics is not pure junk, but there are quite a number of tricks and ways to deceive to be aware of when reading a "statistically proven" result. We have looked through some of them above.

After the problem set, we will move on to talking about "correlation". It is related to scientific studies a lot since many of them claim things like "there is a strong correlation between X and Y".