Lab 4: College Admissions Data

from the show Community -- 6 seasons and a movie!
a typical college study group

You all applied to colleges. Here you are at NOT college. This lab will investigate college data to draw basic insights about admissions, cost, and test scores from a range of institutions! We're just in week 4 of learning to program, but you'll see that we can still do quite a bit!

In today's lab, you will perform ... Data Processing, Data Visualization.

Step 0: Folder and System Setup

Create a new folder lab04 inside your SD211 folder for this lab. Today you will write a single Python program called college.py.

This lab uses the Plotly visualization library. You need to install that into your conda environment. It requires both Plotly and a support library called Pandas:

conda activate sd211
conda install plotly pandas

Step 1: Files and Data Filters

In this initial part, you'll read a file of college data in a format called comma-separated-values (CSV). This is a very common format for datasets, and Python has several libraries that read it for you. A popular one is called Pandas, but since we're still learning to program, we will practice our new skills without these libraries for now. You will then be free to use them later in the semester once you master the basics.

A file in CSV format starts with a header line that lists the "names" of each value that you will find on the rest of the lines in the file. Think of it as a spreadsheet format. Our college file has dozens of column names that look like this:

instnm,FIPS state code (HD2020),State abbreviation (HD2020),Level of institution (HD2020),Highest level of offering (HD2020),Carnegie Classification 2018: Basic (HD2020),Admission test scores (ADM2020),Applicants total (ADM2020),Applicants men (ADM2020),Applicants women (ADM2020),Admissions men (ADM2020),Admissions total (ADM2020),Admissions women (ADM2020),Enrolled total (ADM2020), ...

That's just some of them. A typical line in the file is a little easier to see. Here is the line for USNA:

United States Naval Academy,24,MD,1,5,21,1,15699,11121,4578,996,1426,430,1194,830,364,1194,830,364,,,,71,709,60,610,710,620,740,28,33,26,35,26,32,842,1186,2,1184,1067,90,1080,91,0,1184,0,1080,91,0,0,,

As you can see, commas separate the values (CSV!), so your task is to read a line and split it based on the commas. This returns a List of values. We just learned about lists in lecture, right?

Download the full college CSV file here

Step 2: Basic Filtering

Write your first program, college.py, so that it opens the CSV file, loops over the lines, and prints out the college names with their male/female enrollment ratio.

There are a few constant variables you need to do this. You need to know the index of the institution name, and the indices of its Enrolled men and women counts. Let's do something simple and hacky to start. Create constant variables with hard-coded values. Put this at the top of your program:

# Constant variables. Index of each column in the CSV file.
NAME = 0
TOTAL_MEN = 14
TOTAL_WOMEN = 15

These are the indices of each column we need. We'll do something more sophisticated later to find these indices, but for now, this is fine.

Now write the program. Open the file, read each line, split it, and grab the information you need. Calculate the average male/female ratio, and print out all the institutions

REQUIRED:

Your output should match this:

Abilene Christian University	0.5732600732600732
Adelphi University	0.48878923766816146
Adrian College	1.1869918699186992
Alabama A & M University	0.6842105263157895
...
...
...
Yale University	1.0789473684210527
Yeshiva University	1.1056603773584905
York College of Pennsylvania	0.8780487804878049
Youngstown State University	0.8181818181818182
AVERAGE M/F RATIO: 0.845842509574169

Step 3: Make a Scatterplot

Copy college.py to scatter.py.

We're doing data science! It's time to make a visualization so that humans can better "feel" the data, not just see a dump of numbers with an average. This part will display a scatterplot that links school "quality" to male/female ratios. I don't know if there is a correlation, and neither do you. It sounds kind of strange to think there would be one, so let's see what the data shows!

Let's estimate "quality" by the SAT scores of enrolled students. We'll sum their math SAT and verbal SAT to get one "quality score". This obviously isn't a great measure, but it provides some sort of signal, at least. Compare SAT sums to your m/f ratios by visualizing the two values on a 2d graph. Python has a scatterplot library that is very easy to use. Take a look at these 3 lines:

# IMPORT PLOTLY LIBRARY TOP OF PROGRAM
import plotly.express as px
import plotly.graph_objects as go

...

fig = go.Figure()      
fig.add_traces(go.Scatter(x=XDATA, y=YDATA, text=NAMES, mode='markers'))
fig.show()

You provide it XDATA and YDATA lists. Each list should of course be the same size, and the values should be aligned with each other. Make the XDATA variable your m/f ratios, and make YDATA your SAT scores.

The scatterplot also takes a third optional list, NAMES. These are the institution names we want to show up with each dot on the plot. Create that list too.

In order to do this, we need more constant variables to find the MSAT and VSAT in your input. Add those to your variable declarations with the correct indices. You want the "SAT Math 25th percentile" column and the "SAT Evidence-Based Reading and Writing 25th" column.

To summarize: your goal in this part is to generate a scatterplot using the two lines of code above. It requires three lists of values, so create those 3 lists first!

QUESTIONS:
Do you see a trend in the scatterplot? Do the SAT scores and M/F ratios appear to be correlated in any way?
Who are the outliers? Why might that be?

Step 4: Tuition information

Copy scatter.py to tuition.py

This final step will compare something people care about more: tuition. Does tuition costs and quality of students have a relation? Further, let's stop looking at all institutions, and compare like-minded ones. We'll focus on the 4-year colleges only.

In this step, you will create another scatterplot along the two dimensions of tuition and SAT scores, limiting your visualization to just 4-year colleges.

This is getting more involved now with two more columns (type of institution, and tuition). It's also not a great idea to hard-code indices like we've been doing because the input data format might change the order of its columns, and that will break your program! Here's another option that instead reads the first line of the file and finds the indices automatically:

with open("filtered.csv") as fh:
      header = fh.readline()
      parts = header.split(',')
      
      NAME = parts.index('instnm')
      CARNEGIE = parts.index('Carnegie Classification 2018: Basic (HD2020)')
      TOTAL_ENROLL = parts.index('Enrolled total (ADM2020)')
      TOTAL_MEN = parts.index('Enrolled full time men (ADM2020)')
      TOTAL_WOMEN = parts.index('Enrolled full time women (ADM2020)')            
      MSAT = parts.index('SAT Math 25th percentile score (ADM2020)')
      VSAT = parts.index('SAT Evidence-Based Reading and Writing 25th percentile score (ADM2020)')
      TUITION = parts.index('Total price for in-state students living on campus 2020-21 (DRVIC2020)')

Replace your constant variables with the above code. Do you understand how this code? It finds the index of each column name, so you can use TUITION later to find that column in any college's CSV line. For instance, if "Total price" is the 18th column, this sets TUITION=17. It will work on any order.

Now you are all set with a bunch of indices. Complete the requirements for this step.

REQUIRED:

Step 5: Visualize the Types of Institutions

This step further modifies your tuition.py program.

You have a graph of colleges with what looks like a trend, but this data is inter-mixing different types of universities and colleges. Let's visually separate them to try and better understand what we're looking at. There are three main types of institutions in this data using what is called the "Carnegie Classification":

  1. Universities with BS+MS+PhD programs (15 <= CARNEGIE <= 17)
  2. Universities just BS+MS programs (18 <= CARNEGIE <= 20)
  3. 4-year BS Colleges (21 <= CARNEGIE <= 22)
  4. Misc. other

Your task is to generate the same scatterplot as before, but now each dot is colored according to its type above. Here is how you add color to a scatterplot:

fig.add_traces(go.Scatter(x=XDATA, y=YDATA, text=NAMES, mode='markers',
                          marker=dict(color=COLORS, colorscale='rainbow', size=10)))

The COLORS variable is a 4th list you'll need: a list of colors. The Plotly scatterplot is smart about your input, and will generate a different color for each integer you give it -- so I suggest just assigning 0 to the first, 1 to the second type, 2 the third, and 3 the Misc. category. Your COLORS list is thus a list of 0-4 integers for each institution.

REQUIRED: you should generate the same scatterplot as Part 4, but now with different colored dots.

QUESTIONS: - Find an example of a "best value" 4-year College, and a "worst value" 4-year College (ignore the $0 ones). - If you were looking for a type of institution that tends to be a good value for its quality of students, how would you describe those institutions? What do they tend to be? - What do you find interesting about this view of the data? Describe one conclusion you might draw from it.

What to turn in

Having fun? Great!

Create a document in Word and insert the 3 scatterplots from Parts 3, 4, and 5. Answer the questions from Parts 3 and 5 in this document as well. Then save as a PDF called answers.pdf.

Use the command-line to submit your files:

submit -c=sd211 -p=lab04 college.py scatter.py tuition.py answers.pdf

...or if you're in the lab not your laptop, you can visit the submit website and upload the four files: college.py, scatter.py, tuition.py, answers.pdf