STAT 39000: Project 7 — Fall 2020
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming. However, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
Dataset:
The following questions will use the dataset found in Scholar:
/class/datamine/data/flights/subset/YYYY.csv
An example of the data for the year 1987 can be found here.
Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.
Questions
Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like
Question 1
Write a line of code that prints a list of the unique values in the DayOfWeek column. Write a line of code that prints a list of the unique values in the DayOfMonth column. Write a line of code that prints a list of the unique values in the Month column. Use the 1987.csv dataset. Are the results what you expected?
- 3 lines of code used to get a list of unique values for the chosen columns.
- 1-2 sentences explaining whether or not the results are what you expected.
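As a rough illustration of the general technique, the sketch below pulls one column out of a small made-up CSV and lists its unique values. The sample data and the column position (2) are assumptions for illustration only; the real files have their own header and column order, which you should check first.

```shell
# Hypothetical three-column sample; NOT the real flights data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Year,Month,DayOfWeek
1987,10,3
1987,10,4
1987,11,3
1987,12,4
EOF

# cut selects comma-delimited field 2; sort -u keeps one copy of each value.
# The header value ("Month") appears in the result because the header row
# is treated like any other line.
unique_months=$(cut -d, -f2 "$sample" | sort -u)
echo "$unique_months"
rm -f "$sample"
```

Seeing the header name among the unique values is expected with this approach, and spotting anything else unexpected is the point of the sanity check.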
Question 2
Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv. Were there any rows without 29 columns?
Checking built-in variables for
- Line of code used to solve the problem.
- 1-2 sentences explaining whether or not there were any rows without 29 columns.
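One possible way to count fields per line is awk's built-in NF variable. The sketch below runs on a made-up sample in which every line is supposed to have 3 fields; the sample data and the expected count of 3 are assumptions for illustration, not the real dataset.

```shell
# Hypothetical sample: lines 3 and 4 have the wrong number of fields.
sample=$(mktemp)
printf '%s\n' 'a,b,c' '1,2,3' '4,5' '6,7,8,9' > "$sample"

# -F, sets the field separator to a comma. NF is the number of fields on the
# current line and NR is the current line number.
bad_lines=$(awk -F, 'NF != 3 { print NR ": " $0 }' "$sample")
echo "$bad_lines"
rm -f "$sample"
```

Note that splitting on a bare comma would miscount any line with a quoted field that itself contains a comma, so this check is only as good as that assumption about the file.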
Question 3
Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.
- The content of your bash script (starting with "#!/bin/bash") in a code chunk.
- The results of running your bash script from year 1987 to 2008.
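The looping pattern can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function name check_columns, its argument order, and the demo files (which are expected to have 2 columns, not 29) are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check_columns BEGIN END DIR EXPECTED
# Prints every line in DIR/YEAR.csv whose field count differs from EXPECTED.
check_columns() {
  local begin=$1 end=$2 dir=$3 expected=$4
  local year=$begin
  while [ "$year" -le "$end" ]; do
    # FILENAME and FNR show which file and line each bad row came from.
    awk -F, -v n="$expected" 'NF != n { print FILENAME ":" FNR ": " $0 }' "$dir/$year.csv"
    year=$((year + 1))
  done
}

# Demo on made-up files that should each have 2 columns.
demo_dir=$(mktemp -d)
printf '%s\n' 'a,b' '1,2'   > "$demo_dir/2001.csv"
printf '%s\n' 'a,b' '1,2,3' > "$demo_dir/2002.csv"
printf '%s\n' 'a,b' '5'     > "$demo_dir/2003.csv"

report=$(check_columns 2001 2003 "$demo_dir" 2)
echo "$report"
rm -rf "$demo_dir"
```

In a standalone script, the begin and end years would typically come from the positional parameters $1 and $2 rather than from a function call.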
Question 4
awk is a really good tool to quickly get some data and manipulate it a little bit. The column Distance contains the distances of the flights in miles. Use awk to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.
Below is some example output:
Miles: 12345 Kilometers: 19867.35168
- The code used to solve the problem.
- The results of running the code.
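A running total with a unit conversion might look like the sketch below. The two-column sample and the position of Distance (field 2) are assumptions for illustration; in the real files Distance sits in a different column.

```shell
# Hypothetical sample; NOT the real flights data.
sample=$(mktemp)
printf '%s\n' 'Origin,Distance' 'IND,100' 'ORD,250' 'LAX,150' > "$sample"

# NR > 1 skips the header row. The END block runs once, after the last line.
# printf is used so large totals are not shown in scientific notation.
totals=$(awk -F, 'NR > 1 { miles += $2 }
  END { printf "Miles: %.0f\nKilometers: %.3f\n", miles, miles * 1.609344 }' "$sample")
echo "$totals"
rm -f "$sample"
```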
Question 5
Use awk to calculate the total number of DepDelay minutes, grouped according to DayOfWeek. Use 2007.csv.
Below is some example output:
DayOfWeek: 0
1: 1234567
2: 1234567
3: 1234567
4: 1234567
5: 1234567
6: 1234567
7: 1234567
1 is Monday.
- The code used to solve the problem.
- The output from running the code.
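Grouped sums in awk are usually built with an associative array keyed on the grouping column. The sketch below shows the idea on a made-up two-column sample; the data and the column positions are assumptions for illustration only.

```shell
# Hypothetical sample; NOT the real flights data.
sample=$(mktemp)
printf '%s\n' 'DayOfWeek,DepDelay' '1,10' '2,5' '1,-3' '2,NA' '3,7' > "$sample"

# total[] is an associative array keyed by the value in field 1.
# Non-numeric values such as "NA" count as 0 in awk arithmetic.
# The order of "for (day in total)" is unspecified, so the output is sorted.
by_day=$(awk -F, 'NR > 1 { total[$1] += $2 }
  END { for (day in total) print day ": " total[day] }' "$sample" | sort -n)
echo "$by_day"
rm -f "$sample"
```

Without the NR > 1 guard, the header row would contribute its own group, with the column name as the key and a total of 0.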
Question 6
It wouldn’t be fair to compare the total DepDelay minutes by DayOfWeek, as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay minutes per flight for each DayOfWeek. Use 2007.csv.
Below is some example output:
DayOfWeek: 0
1: 1.234567
2: 1.234567
3: 1.234567
4: 1.234567
5: 1.234567
6: 1.234567
7: 1.234567
- The code used to solve the problem.
- The output from running the code.
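To turn grouped totals into grouped averages, a second array can track how many rows went into each group. The sketch below is one way to do that on a made-up sample. Skipping rows whose delay is "NA" is an assumption made here for illustration; whether missing values should be excluded or counted as zero is a choice you would need to justify.

```shell
# Hypothetical sample; NOT the real flights data.
sample=$(mktemp)
printf '%s\n' 'DayOfWeek,DepDelay' '1,10' '1,20' '2,5' '2,NA' '3,7' > "$sample"

# total[] accumulates minutes and count[] accumulates rows, both keyed by day.
# Rows with "NA" in field 2 are skipped (an assumption, see the note above).
avg_by_day=$(awk -F, 'NR > 1 && $2 != "NA" { total[$1] += $2; count[$1]++ }
  END { for (day in total) printf "%s: %.2f\n", day, total[day] / count[day] }' "$sample" | sort -n)
echo "$avg_by_day"
rm -f "$sample"
```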
Question 7
Anyone who has flown knows how frustrating it can be waiting for takeoff, or deboarding the aircraft. These roughly translate to TaxiOut and TaxiIn respectively. If you were to fly into or out of IND, what is your expected total taxi time? Use 2007.csv.
Taxi times are in minutes.
- The code used to solve the problem.
- The output from running the code.