SP-1 / Lesson 2

Extra №2

It could be argued that a significant part of your higher-level education comes from the times when you encounter an interesting problem that you barely know how to approach, or perhaps do not even fully understand. In these situations, despite the lack of guidance, you end up learning a lot, provided you are interested enough to do well.
With this in mind, today's Extra involves just one project (credit to Emily Zitek for it). To excel at it, you'll need to write Python code, use your understanding of linear regression models (and potentially dig in a bit deeper), and do a bit of Googling about data analysis programming tasks. It's possible that you haven't done any Python programming before, or perhaps you've used it, but not specifically for data analysis tasks. Again, this is the point. This time.
To be fair, if you have done data analysis in Python before (or in a similar language; I keep writing "Python" since this is what I recommend you use here), it won't take long to get to something reasonable. For everyone else: this is your chance to research, try & fail, talk to ChatGPT, browse Kaggle and, finally, improve your coding skills. In particular, get to know the pandas, Matplotlib and SciPy libraries, whatever those things are... o_O

Project Description

As you know, when deciding whether to admit an applicant, colleges take lots of factors into consideration: grades, sports, activities, leadership positions, awards, teacher recommendations, and test scores. Using SAT scores as a basis for deciding whether to admit a student has created some controversy. Among other things, people question whether the SATs are fair and whether they predict college performance. This is what you are going to research.

There is data on the grades and scores of 100 students in .csv format; you can download it here. A brief description of the columns in the data file is below:

Variable    Description
high_GPA    High school grade point average
math_SAT    Math SAT score
verb_SAT    Verbal SAT score
comp_GPA    Computer science grade point average
univ_GPA    Overall university grade point average
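If you have never loaded a .csv file in Python, here is a minimal sketch of how it might look with pandas. Note the assumptions: the three rows below are made-up placeholders that only mimic the column layout described above (they are not the real data), and the helper name load_scores is hypothetical. Pass the path of the file you downloaded instead of the placeholder.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: three made-up rows that only
# mimic the column layout from the description. This is NOT the real dataset.
SAMPLE_CSV = io.StringIO(
    "high_GPA,math_SAT,verb_SAT,comp_GPA,univ_GPA\n"
    "3.5,640,590,3.2,3.3\n"
    "2.9,550,610,2.7,2.8\n"
    "3.8,700,650,3.7,3.6\n"
)

EXPECTED_COLUMNS = ["high_GPA", "math_SAT", "verb_SAT", "comp_GPA", "univ_GPA"]


def load_scores(source):
    """Read the CSV and check that every expected column is present."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Replace SAMPLE_CSV with the path to the file you downloaded.
df = load_scores(SAMPLE_CSV)
print(df.dtypes)      # column types as pandas inferred them
print(df.describe())  # count, mean, std, min, max per column
```

Looking at the types and the summary statistics first is a cheap way to spot a badly parsed column before you start computing anything.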

Below are the questions you need to answer. Some are very specific (to get you started), and some are more general (to leave some room for research). You do not have to focus solely on these questions: you are welcome to branch out!

  • What would you expect the correlation between math and verbal SAT scores to be? Now calculate this correlation.
  • What is the correlation between the students' overall university GPAs and their computer science GPAs?
  • How would you answer "Are the high school and college GPAs related?"
  • Find the regression line for predicting the overall university GPA from both the math SAT score and the verbal SAT score. What is the R-squared of the model? So how would you answer the question "Can the math and verbal SAT scores be used to predict college GPA?"
  • Try to build a model that predicts the overall university GPA using high school scores as well as you can. How would you measure the "goodness" of your model? Would you call it "really good"?
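To give you an idea of the mechanics behind the questions above, here is a rough sketch of computing a correlation and fitting a two-predictor regression with its R-squared. It is one possible approach, not the expected solution. Assumptions: the data below is synthetic (random numbers that only borrow the column names, so the numbers mean nothing), the helper fit_linear is a hypothetical name, and the fit uses plain least squares from NumPy rather than a dedicated statistics library.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data (NOT the course dataset): random numbers that
# only borrow the column names, so the sketch runs without the real file.
rng = np.random.default_rng(0)
n = 100
math_sat = rng.normal(600, 60, n)
verb_sat = 0.5 * math_sat + rng.normal(300, 50, n)
univ_gpa = 1.0 + 0.002 * math_sat + 0.001 * verb_sat + rng.normal(0, 0.3, n)
df = pd.DataFrame(
    {"math_SAT": math_sat, "verb_SAT": verb_sat, "univ_GPA": univ_gpa}
)

# Pearson correlation between two columns.
r = df["math_SAT"].corr(df["verb_SAT"])


def fit_linear(df, predictors, target):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    # Design matrix: a column of ones for the intercept, then one column
    # per predictor.
    X = np.column_stack(
        [np.ones(len(df))] + [df[p].to_numpy() for p in predictors]
    )
    y = df[target].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = float(residuals @ residuals)         # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())   # total variation
    return coef, 1.0 - ss_res / ss_tot


coef, r_squared = fit_linear(df, ["math_SAT", "verb_SAT"], "univ_GPA")
print(f"correlation(math_SAT, verb_SAT) = {r:.3f}")
print(f"intercept and slopes = {coef}")
print(f"R^2 = {r_squared:.3f}")
```

Keep in mind that an R-squared computed on the same rows used for fitting tends to be optimistic. Setting some rows aside and measuring the error on those is one common way to judge how good a model really is.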

If you do want us to check your work, please do it in Google Colab: it is an easily shareable Jupyter Notebook environment. You can do data analysis, write Python code and add comments (even using LaTeX), all in one place. In particular, you can produce a great report with working pieces of code right there. No special setup needed.