Time-Series Data Analysis of Flight Delays at US Airports, January-June 2020 (2021-US-EPO-925)
Aishwarya Krishna Prasad, Student, Singapore Management University
Ruiyun Yan, Student, Singapore Management University
Linli Zhong, Student, Singapore Management University
Prof Kam Tin Seong, Singapore Management University
Flights can be delayed for several reasons, such as air system issues, weather, airline delays, and security issues. Interestingly, the most frequent reason for a flight delay is not weather but air system issues. The Federal Aviation Administration (FAA) usually considers a flight to be delayed when it arrives or departs 15 minutes or more later than its scheduled time. Flight delays are inconvenient for both airlines and customers. This paper applies dynamic time warping (DTW) techniques to 54 airports in the US, with the aim of clustering airports with similar delay patterns over time. In addition, the paper builds some explanatory models to explain the similarity between different airports or distances. The analysis first filters the top 15% busiest American airports, calculates the departure delay rate for each airport, and then uses DTW to cluster these airports based on the similarity of their departure delays. Next, the similarities and differences between clusters are identified. This analysis will help inform passengers and airport officials about departure delays at 54 American airports from January to June 2020.
Speaker | Transcript |
ZHONG, LINLI _ | Okay, let's get started. Hi, everyone. This is our poster on time series data analysis of flight delays at US airports from January 2020 to June 2020. We are students at Singapore Management University. I'm Linli. |
YAN Ruiyun | I'm Ruiyun. |
Aishwarya KRISHNA PRASAD | And I am Aishwarya Krishna Prasad. Now let's quickly dive into the introduction of our project. Over to you, Linli. |
ZHONG, LINLI _ | Thank you, Ash. |
On the left hand side, we can see a line chart. It shows the annual passenger traffic at the top 10 busiest US airports, and | |
from the graph, we can see that the number of passengers at each airport experienced a sharp drop. This is because passenger traffic responded to the spread of COVID-19 in 2020. | |
For our analysis, we would like to discover the delay similarities among the top 15% of airports in America, in terms of delay features and geographic location. | |
The techniques we use are time series analysis, dynamic time warping (DTW), and exploratory data analysis (EDA). | |
Time series analysis and DTW are employed to find the similarities between the clusters, based on the departure delays. EDA is used to draw the geographic map. Okay, let's go back to the data set. | |
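To make the DTW step concrete, here is a minimal Python sketch of the classic dynamic time warping distance between two delay-rate series. It is only an illustration: the presenters used the DTW nodes in SAS Enterprise Miner, and the function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric series.

    Assumption for this sketch: the local cost is the absolute difference
    between two points, and no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two series that share the same shape but are shifted or stretched in time come out as close under this measure, which is why DTW suits comparing delay patterns across airports.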
Thank you, this is the data set. | |
Our data set comes from the United States Department of Transportation, and | |
on the left hand side is the process of our data preparation. We first imported the CSV file into JMP Pro 16.0. | |
Then we removed the columns and values which are not really useful for our analysis. | |
After that, for the data transformation, we summarized the data for airports from different cities, and then we selected the | |
top 54 airports, based on the total number of flights at each airport, and calculated the delay rate. | |
After the data preparation, we saved the file in SAS format and imported it into SAS Enterprise Miner 14.1 for further analysis, | |
namely the DTW analysis and time series analysis. After the DTW process, JMP Pro 16.0 was used again to find the similarities of the different clusters and to draw geographic maps. | |
That is the introduction to the data set. Let's welcome my partner to tell you more about our analysis. | |
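The data preparation described above (keep the busiest airports, then compute a delay rate) can be sketched in Python as follows. This is a hedged illustration, not the presenters' JMP workflow: the function name `delay_rates`, the tuple layout of the input, and the choice of a per-date rate are all assumptions, while the 15-minute threshold is the FAA definition quoted in the abstract.

```python
from collections import defaultdict

DELAY_THRESHOLD_MIN = 15  # FAA convention: delayed if 15 or more minutes late


def delay_rates(flights, top_n):
    """Compute per-date departure delay rates for the busiest airports.

    flights: iterable of (airport, date, dep_delay_minutes) tuples
             (a hypothetical layout chosen for this sketch).
    top_n:   how many airports to keep, ranked by total number of flights.
    Returns {airport: {date: share of flights delayed on that date}}.
    """
    totals = defaultdict(int)
    # daily[airport][date] = [delayed_count, flight_count]
    daily = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for airport, date, dep_delay in flights:
        totals[airport] += 1
        counts = daily[airport][date]
        counts[1] += 1
        if dep_delay >= DELAY_THRESHOLD_MIN:
            counts[0] += 1
    # Rank airports by total flights and keep the top_n busiest.
    busiest = sorted(totals, key=totals.get, reverse=True)[:top_n]
    return {
        airport: {
            date: delayed / total
            for date, (delayed, total) in sorted(daily[airport].items())
        }
        for airport in busiest
    }
```

Calling `delay_rates(flights, 54)` on the full flight table would give one delay-rate series per airport, which is the kind of input a DTW comparison needs.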
Aishwarya KRISHNA PRASAD | Thank you, Linli. Now let's dive into the time series and cluster analysis. We did the time series and cluster analysis using SAS Enterprise Miner. |
This graph is one of the outputs that we obtained using the DTW nodes in SAS Enterprise Miner. On the x axis, you can see the months from January 2020 to June...to July 2020. | |
On the y axis, we can see the percentage of delayed flights among the flights included in our data set. | |
Now we can see that there are sharp spikes in early February and in late June, which seem to be strongly correlated with the holiday periods in the USA. | |
But other than these two major spikes, we can also see a steady decrease in the number of flight delays in general. | |
We then performed time series clustering based on hierarchical clustering, and the constellation plot of the same, produced using SAS, can be observed over here. We felt that seven is the optimal number of clusters for our analysis. | |
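As a rough picture of what hierarchical clustering does with pairwise distances between airports (for instance DTW distances), here is a small Python sketch. The linkage rule used by the SAS node is not stated in the talk, so average linkage is an assumption here, and `agglomerative_clusters` is a hypothetical helper name.

```python
def agglomerative_clusters(dist, k):
    """Bottom-up hierarchical clustering on a precomputed distance matrix.

    dist: symmetric n x n list of lists of pairwise distances.
    k:    number of clusters to stop at.
    Assumption for this sketch: average linkage between clusters.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Mean distance over all cross-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        # Find the closest pair of clusters.
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)
```

With 54 airports, `dist` would be a 54 x 54 matrix and `k=7` would mirror the seven clusters chosen in the talk.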
Now, these are the clusters that are formed by using the TS similarity node of SAS Enterprise Miner, so let's quickly take | |
Cluster1 as an example. Cluster1 contains mostly international airports in the US. | |
Some of these airports are the Denver international airport, the Kansas City international airport, the Washington international airport, | |
just to name a few. The delays at these airports are pretty large, as you can see, and this can be attributed to the fact that they are located in cities that are frequented by tourists. | |
Similarly, the remaining clusters are formed from airports whose flight delays behave similarly. | |
Now the clusters that were generated in the previous step were then fed into JMP Pro, and using the Graph Builder functionality, | |
we were able to build these graphs. This graph contains the causes of delays in each of the clusters. Here, we can clearly see | |
which cause of delay is more prominent in each cluster. For example, as you can see for Cluster1, | |
the late aircraft delay, that is, the delay caused by the previous flight to the current flight, is more prominent compared to the rest. And the same follows for the rest of the clusters. | |
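The per-cluster comparison of delay causes can also be expressed as a simple aggregation. The Python sketch below is illustrative only: the talk does not say how prominence was measured in Graph Builder, so ranking causes by total delay minutes is an assumption, as are the function name `dominant_cause_by_cluster` and the input layout.

```python
from collections import defaultdict


def dominant_cause_by_cluster(records):
    """Find the delay cause with the most total delay minutes per cluster.

    records: iterable of (cluster, cause, delay_minutes) tuples
             (a hypothetical layout chosen for this sketch).
    Returns {cluster: cause with the largest summed delay minutes}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for cluster, cause, minutes in records:
        totals[cluster][cause] += minutes
    # Pick, for each cluster, the cause whose total is largest.
    return {
        cluster: max(causes, key=causes.get)
        for cluster, causes in totals.items()
    }
```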
But if you look at this cluster, although this visualization in SAS is pretty intuitive, | |
we felt that for a data set with a large number of points, or a larger number of airports in our case, it would be quite difficult to analyze. So I'm just calling upon my peer Ruiyun to present another approach to analyze the clusters. Over to you, Ruiyun. | |
YAN Ruiyun | Okay, geographic location is another part that we focused on. The clusters were formed in SAS, and then we used the Graph Builder feature in JMP Pro 16.0 to generate this map to |
show where the different airports are located by cluster. Airports in the western and central US are included only in Cluster 1 and Cluster 3. | |
These two clusters are not concentrated in a specific region. | |
Cluster 2, Cluster 4, Cluster 5 and Cluster 6 each show an aggregation of airports in a specific region. | |
Airports from Cluster 2 are mainly concentrated in the eastern United States, while Cluster 5 and Cluster 6 are more likely to contain the airports of some | |
tourist destinations, such as Houston, Phoenix, Baltimore, and Honolulu, which are the largest cities of Texas, Arizona, Maryland and Hawaii. | |
Even more to the point, Phoenix Sky Harbor International Airport is the backbone of national airlines and Southwest Airlines. It is | |
one of the key transportation hubs in the southwestern United States. In addition, Cluster 7 is a particular case, as it has just one airport, San Juan airport in Puerto Rico. | |
We surmise that this is because of its special geographical location: any flight departing from San Juan airport has a long distance to travel. | |
That's all for the geographical analysis, and now my partner Aishwarya will give us the conclusion. | |
Aishwarya KRISHNA PRASAD | So in conclusion, we tried exploiting the ease of use of the DTW nodes in SAS Enterprise Miner and also the sophisticated visualization and preprocessing techniques in JMP Pro 16.0 to perform our time series analysis of the flight data. |
We performed dynamic time warping clustering for 54 airports, and these airports were grouped into seven clusters based on their delay patterns between January and June 2020. | |
We observed that carrier delay is the main reason for delay in most clusters, while late aircraft delay is not far behind as a major cause of delay in most of the clusters. | |
As part of the future work, one can include the COVID data points to improve this analysis further and also discuss the correlation between the delay and the cancellation rate of flights. | |
Thank you so much for listening to us. I hope you liked our presentation. |