Exploring nycflights13 using basic data manipulation with dplyr (2024)

Dino Hadjiyannis

2022-08-29

Prerequisites:

In this exploratory analysis I’ll be using data from nycflights13, a package that contains information about all flights that departed from NYC (i.e. EWR, JFK and LGA) to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013. The package is available to anyone from CRAN at https://cran.r-project.org/.

To install, in the R console type install.packages("nycflights13")

This package provides the following data tables.

flights: all flights that departed from NYC in 2013.
weather: hourly meteorological data for each airport.
planes: construction information about each plane.
airports: airport names and locations.
airlines: translation between two letter carrier codes and names.

To load the dataset type:

library(nycflights13)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()

I’ll be using nycflights13::flights, a data frame that contains 336,776 flights that departed from New York City in 2013.

First, I’ll explore the dataset’s structure:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

n_distinct(flights$carrier)

## [1] 16

n_distinct(flights$origin)

## [1] 3

n_distinct(flights$dest)

## [1] 105

# Number of flights with a missing departure time (treated here as cancelled)
sum(is.na(flights$dep_time))

## [1] 8255

From basic inspection we can see that 16 different carriers fly out of the NYC airports, that NYC has 3 different airports, and that flights from these airports serve 105 different destinations. 8,255 flights have a missing (NA) departure time, which I treat as cancelled departures.
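The same four counts can also be collected in one summarise() call rather than four separate ones. This is only a minimal sketch, assuming the dplyr and nycflights13 packages are installed; the output column names (n_carriers, n_origins, n_dests, n_missing_dep) are illustrative names of my own and are not part of the dataset.

```r
library(dplyr)
library(nycflights13)

# Collect the basic counts in a single one-row tibble.
# The output column names below are illustrative, not part of the dataset.
overview <- flights %>%
  summarise(
    n_carriers    = n_distinct(carrier),
    n_origins     = n_distinct(origin),
    n_dests       = n_distinct(dest),
    n_missing_dep = sum(is.na(dep_time))
  )
overview
```

Keeping the counts in one tibble makes it easier to reuse them later, for example when reporting them in text.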

Now let’s examine the relationship between the distance and the average arrival delay for each destination.

new_flights <- flights %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
new_flights

Let’s explore the output visually:

ggplot(new_flights, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 0.5) +
  geom_smooth() +
  geom_jitter()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).

Average delays tend to be higher for short to medium distances. Long-distance flights tend to have smaller average delays.
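Destinations with only a handful of flights can pull the smoothed curve around, so it may be worth checking the pattern again after dropping them. The following is a sketch under assumptions, not part of the original analysis: the cutoff of 20 flights is an arbitrary threshold, delays_by_dest is a name of my own, and the dplyr, ggplot2 and nycflights13 packages are assumed to be installed.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Per-destination summary, keeping only destinations with more than
# 20 flights (assumed cutoff) and a defined average arrival delay.
delays_by_dest <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist  = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, !is.na(delay))

# Same plot as before, without the duplicate jittered layer.
p <- ggplot(delays_by_dest, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 0.5) +
  geom_smooth(se = FALSE)
p
```

Filtering out the undefined average also avoids the "Removed 1 rows" warnings, because the destination with no recorded arrival delays never reaches the plot.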

Now let’s look at planes with the highest average delays.

(I opted to avoid aggregating data with missing values)

# Remove flights with a missing arrival or departure delay.
not_cancelled <- flights %>%
  filter(!is.na(arr_delay), !is.na(dep_delay))
not_cancelled

planes_with_delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarize(
    count = n(),
    avg_delay = mean(arr_delay)
  )
planes_with_delays

I use geom_freqpoly() to display the counts with lines instead of bars, although geom_histogram() would work just as well for this particular plot.

(It is much easier to understand overlapping lines than bars.)

ggplot(planes_with_delays, aes(avg_delay)) + geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
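The message above appears because stat_bin() falls back to 30 bins when no bin width is given. A minimal sketch of setting the width explicitly follows; the value of 10 minutes is an arbitrary choice of mine rather than something from the original analysis, and the block rebuilds planes_with_delays so that it runs on its own (assuming dplyr, ggplot2 and nycflights13 are installed).

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Rebuild the per-plane summary so this block is self-contained.
planes_with_delays <- flights %>%
  filter(!is.na(arr_delay), !is.na(dep_delay)) %>%
  group_by(tailnum) %>%
  summarise(count = n(), avg_delay = mean(arr_delay))

# binwidth = 10 groups average delays into 10-minute bins (assumed value).
p <- ggplot(planes_with_delays, aes(avg_delay)) +
  geom_freqpoly(binwidth = 10)
p
```

Trying a few different widths is a quick way to check that the shape of the distribution is not an artifact of the binning.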

Let’s find the planes with the highest average delay:

max_delay <- planes_with_delays %>%
  arrange(desc(avg_delay))
max_delay

Some planes have an average delay of 300 minutes (5 hours). Let’s explore even further.

ggplot(planes_with_delays, aes(count, avg_delay)) + geom_point()

When a plane has only a small number of flights, there is much greater variation in its average delay.

Let’s filter out the extreme variation.

# The cutoff of 20 flights is an educated guess.
planes_with_delays %>%
  filter(count > 20) %>%
  ggplot(aes(count, avg_delay)) +
  geom_point()

As we can see, the variation decreases as the sample size increases. I believe the Law of Large Numbers explains this: an average taken over more flights sits closer to the true mean, so its standard error goes down as the sample size increases.
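The shrinking spread can be illustrated with a small simulation in base R. This is a sketch under stated assumptions, not a result from the flights data: the population is a made-up, skewed set of delays, and the names population and spread_of_mean are hypothetical helpers of my own.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Hypothetical population of delays in minutes (NOT the flights data).
population <- rexp(100000, rate = 1 / 15) - 5

# Standard deviation of the sample mean across repeated samples of size n.
spread_of_mean <- function(n, reps = 1000) {
  means <- replicate(reps, mean(sample(population, n)))
  sd(means)
}

sample_sizes <- c(1, 5, 20, 100, 500)
spread <- sapply(sample_sizes, spread_of_mean)

# One row per sample size; sd_of_mean should fall as n grows.
data.frame(n = sample_sizes, sd_of_mean = spread)
```

The spread of the sample mean falls roughly in proportion to one over the square root of n, which matches the funnel shape in the scatter plot of count against avg_delay.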
