TDM 10200: Project 5 — 2023
Motivation: Once we have some data analysis working in Python, we often want to wrap it into a function. Dr. Ward usually tests anything that he writes (usually 5 times) to make sure it works before wrapping it into a function. Once we are sure that our analysis works, wrapping it into a function usually makes it easier to use.
Context: Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools.
Scope: functions
Dataset(s)
The following questions will use the following dataset(s):
/anvil/projects/tdm/data/craigslist/vehicles.csv
/anvil/projects/tdm/data/flights/subset/
/anvil/projects/tdm/data/death_records/DeathRecords.csv
If it helps, you also have a longer article available here. It is a very detailed article going through many things that you can do with functions in Python. In particular, the section on Argument Passing might be helpful.
Questions
ONE
Read in the dataset from Project 4, /anvil/projects/tdm/data/craigslist/vehicles.csv, and name it cars.
- Write a function called mycarcount that takes two parameters: cars as a data frame, and year as an integer, and outputs the number of cars from that year. (Alternatively, you can just use 1 argument, the year, as a parameter, and then read through the cars data frame inside the function. Either way is OK.)
- Run the function for each of the years from Project 4, Question 4, namely, for the years 2011, 1989, 1997. Make sure that your answers agree with the results from that earlier project.
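As a minimal sketch of the two-parameter version, assuming the data frame has a numeric column named year (as the question implies), the function could look something like the following. The small toy_cars data frame is invented for illustration, so that the sketch runs without access to the real file.

```python
import pandas as pd

def mycarcount(cars, year):
    """Return the number of rows in cars whose year column equals year."""
    # comparing the column to year gives a Series of True/False values;
    # summing that Series counts the True values
    return int((cars['year'] == year).sum())

# toy data, invented for illustration only (not the real Craigslist data);
# the None stands in for a listing with a missing year
toy_cars = pd.DataFrame({'year': [2011, 1989, 2011, 1997, 2011, None]})

print(mycarcount(toy_cars, 2011))  # prints 3
print(mycarcount(toy_cars, 1989))  # prints 1
```

With the real data, you would first read the file with pd.read_csv into cars, and then call mycarcount(cars, 2011) in the same way.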
You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame called |
- Code used to answer the question.
- Result of code.
TWO
- Run the function mycarcount for each year in the data set. (Of course, be sure to only run it once for each year!)
- Now make sure that the results agree, if you compare with the value_counts() from the year column.
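One possible way to organize this comparison is sketched below, on an invented toy data frame. A one-line version of mycarcount is included so that the sketch runs on its own; the names toy_cars, my_counts, official, and all_agree are made up for this illustration.

```python
import pandas as pd

def mycarcount(cars, year):
    # count the rows whose year column equals the requested year
    return int((cars['year'] == year).sum())

# toy data, invented for illustration only; None is a missing year
toy_cars = pd.DataFrame({'year': [2011.0, 1989.0, 2011.0, 1997.0, 2011.0, None]})

# unique() lists each year only once; dropna() skips the missing years
years = toy_cars['year'].dropna().unique()

# call the function exactly once per distinct year
my_counts = {year: mycarcount(toy_cars, year) for year in years}

# value_counts() also ignores missing values by default
official = toy_cars['year'].value_counts()

# check that the two methods give the same count for every year
all_agree = all(my_counts[year] == official.loc[year] for year in years)
print(all_agree)  # prints True
```

Each call to mycarcount scans the whole year column again, so on the full data set this loop does one pass per distinct year, while value_counts() does the counting in a single pass.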
It will take a long time to run |
- Code used to answer the question.
- Result of code.
THREE
Use the csv data sets from this directory (the same as from Project 3):
/anvil/projects/tdm/data/flights/subset/
- Write a function that takes two parameters: myorigin as a string with three characters, and year as an integer, and outputs the number of flights that depart during that year, from the Origin airport indicated in myorigin.
- Test your function for a few years and airports of your choice. You can choose! Do your results look reasonable, i.e., do the airports in the big cities have lots of flights, compared to airports in smaller cities?
- Run the function for each of the years from 1987 to 2008, checking how many flights depart from IND in each year. Make sure that you use the method from the end of Project 3, Question 5.
For this question, you should not read the full data frame all at once, but instead, you should read just a few lines at a time, using the |
It will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years.
Helpful Hint
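import pandas as pd

total_count = 0
# read the file 10000 rows at a time, instead of all at once
for df in pd.read_csv(putthefilenamehere, chunksize=10000):
    # check each row of this chunk, and count the ones that match myorigin
    for index, row in df.iterrows():
        if row['Origin'] == myorigin:
            total_count += 1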
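The hint above can be wrapped into a function along the following lines. This is only a sketch under some assumptions: the function name count_departures and the datadir parameter are hypothetical, and the file naming (one file per year, such as 1987.csv) is an assumption that should be checked against the actual directory. The sketch also swaps the row-by-row loop from the hint for a comparison on the whole chunk, which counts the same rows; if you need to follow the method from Project 3 exactly, keep the loop from the hint instead. A tiny made-up file is created in a temporary directory so that the sketch runs anywhere.

```python
import os
import tempfile
import pandas as pd

def count_departures(myorigin, year, datadir):
    """Count the flights in one year's file whose Origin equals myorigin."""
    # ASSUMPTION: one file per year, named like 1987.csv (check the directory)
    filename = os.path.join(datadir, f"{year}.csv")
    total_count = 0
    # usecols loads only the Origin column; chunksize reads 10000 rows at a time
    with pd.read_csv(filename, usecols=['Origin'], chunksize=10000) as reader:
        for df in reader:
            # count the matching rows in this chunk and add them to the total
            total_count += int((df['Origin'] == myorigin).sum())
    return total_count

# build a tiny made-up file (not real flight data) to try the function
with tempfile.TemporaryDirectory() as toy_dir:
    toy = pd.DataFrame({'Origin': ['IND', 'ORD', 'IND', 'LAX'],
                        'Dest':   ['ORD', 'IND', 'LAX', 'IND']})
    toy.to_csv(os.path.join(toy_dir, '1987.csv'), index=False)
    ind_count = count_departures('IND', 1987, toy_dir)
    lax_count = count_departures('LAX', 1987, toy_dir)

print(ind_count, lax_count)  # prints 2 1
```

On Anvil, datadir would be /anvil/projects/tdm/data/flights/subset/ and the function would be called once for each year of interest.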
- Code used to answer the question.
- Result of code.
FOUR
Extend your function from Question 3 as follows:
- Modify your function so that it takes three parameters: myorigin and mydest as strings that each have three characters, and year as an integer, and outputs the number of flights that depart during that year, from the Origin airport indicated in myorigin, and arrive at the Dest airport indicated in mydest.
- Test your function for a few years and pairs of airports (origin and destination airports) of your choice. Do the results look reasonable, e.g., if you compare popular flight paths, versus unpopular flight paths?
- Run the function for each of the years from 1987 to 2008, checking how many flights depart from IND and arrive at ORD in each year.
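A sketch of the three-parameter version, and of running it over several years, might look like the following. The function name count_route, the datadir parameter, and the one-file-per-year naming (for example 1987.csv) are assumptions made for this illustration, and the two small files are invented so that the sketch runs anywhere.

```python
import os
import tempfile
import pandas as pd

def count_route(myorigin, mydest, year, datadir):
    """Count the flights in one year's file from myorigin to mydest."""
    # ASSUMPTION: one file per year, named like 1987.csv (check the directory)
    filename = os.path.join(datadir, f"{year}.csv")
    total_count = 0
    with pd.read_csv(filename, usecols=['Origin', 'Dest'], chunksize=10000) as reader:
        for df in reader:
            # & combines the two conditions row by row, so a row is counted
            # only when both the origin and the destination match
            matches = (df['Origin'] == myorigin) & (df['Dest'] == mydest)
            total_count += int(matches.sum())
    return total_count

# made-up flights for two years (not real flight data)
toy_flights = {1987: [('IND', 'ORD'), ('IND', 'LAX'), ('ORD', 'IND')],
               1988: [('IND', 'ORD'), ('IND', 'ORD')]}

with tempfile.TemporaryDirectory() as toy_dir:
    for year, rows in toy_flights.items():
        toy = pd.DataFrame(rows, columns=['Origin', 'Dest'])
        toy.to_csv(os.path.join(toy_dir, f"{year}.csv"), index=False)
    # run the function once for each year, and store the results by year
    counts_by_year = {year: count_route('IND', 'ORD', year, toy_dir)
                      for year in range(1987, 1989)}

print(counts_by_year)  # prints {1987: 1, 1988: 2}
```

For the real data, range(1987, 2009) covers the years from 1987 to 2008, because range stops just before its second argument.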
Again, for this question, you should not read the full data frame all at once, but instead, you should read just a few lines at a time, using the |
Again, it will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years.
- Code used to answer the question.
- Result of code.
FIVE
Use the csv data set for the DeathRecords from Project 2:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
- Write a function that takes two parameters: Sex (which will be F or M) and MaritalStatus (D or M or S or U or W), and outputs the number of people with the indicated Sex and MaritalStatus in the data set. (If you look at an earlier version of this question, in which we asked about the year of death, well, everyone in the data set died in 2014, so you do not need to worry about the year of death.)
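As a rough sketch of the version that reads the entire data set into a data frame first, the function could look like the one below. It assumes that the columns are named Sex and MaritalStatus, matching the parameter names in the question. The function name count_people is hypothetical, and the sketch passes the data frame as an extra first argument so that it can run on the invented toy_deaths data; a strictly two-parameter version could refer to a data frame read in beforehand.

```python
import pandas as pd

def count_people(deaths, sex, marital_status):
    """Count the rows of deaths with the given Sex and MaritalStatus."""
    # a row is counted only when both conditions hold
    matches = (deaths['Sex'] == sex) & (deaths['MaritalStatus'] == marital_status)
    return int(matches.sum())

# toy data, invented for illustration only (not the real death records)
toy_deaths = pd.DataFrame({'Sex':           ['F', 'M', 'F', 'F', 'M'],
                           'MaritalStatus': ['W', 'M', 'W', 'S', 'D']})

print(count_people(toy_deaths, 'F', 'W'))  # prints 2
print(count_people(toy_deaths, 'M', 'U'))  # prints 0
```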
You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame, or you can read just a few lines at a time, using the |
- Code used to answer the question.
- Result of the code.
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.