5.2 Flight analysis¶
This lesson aims to introduce students to data analysis and data management using Python Pandas library.
Data¶
We will work with three CSV files that are available at https://geo.fsv.cvut.cz/courses/155isdp/data/05/.
The first one called airports.csv contains information about airports in the US as the name of airport, the airport code given by International Air Transport Association (IATA) and the information about airport position (city, state, country, coordinates in WGS84 coordinate system (EPSG:4326).
The second file stored as airlines.csv contains information about airlines operating in the US as the name of the airline and the airline code given by IATA.
The last file called flights.csv contains lots of information about flights. Particularly interesting are origin airport, destination airport, arrival time, arrival delay, diverted and cancelled. As the file was originally very big for storing on school S: disk, it was made smaller. Now it contains only flights that happened in one particular day (1st January 2015).
Intro¶
In our case we have data in csv. Hovever, using Pandas library, we can process data in many other formats as TXT, JSON, XML, SQL and others.
How big data can we process? What about really big data?
< 5GB ... Pandas
< 30GB ... Pandas chunks
< 200 GB ... DASK dataframes (https://www.dask.org/)
> 200 GB ... PySpark = Python API for running applications parallelly on the distributed Apache Spark cluster (https://sparkbyexamples.com/pyspark-tutorial/#what-is-pyspark)
Our data are less than 5GB so Pandas is just fine.
1. Data loading and first steps¶
import pandas as pd
airports=pd.read_csv('/mnt/repository/155ISDP/05/airports.csv')
airlines=pd.read_csv('/mnt/repository/155ISDP/05/airlines.csv')
flights=pd.read_csv('/mnt/repository/155ISDP/05/flights.csv', low_memory=False)
# Show small data portions
airports.head()
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
0 | ABE | Lehigh Valley International Airport | Allentown | PA | USA | 40.65236 | -75.44040 |
1 | ABI | Abilene Regional Airport | Abilene | TX | USA | 32.41132 | -99.68190 |
2 | ABQ | Albuquerque International Sunport | Albuquerque | NM | USA | 35.04022 | -106.60919 |
3 | ABR | Aberdeen Regional Airport | Aberdeen | SD | USA | 45.44906 | -98.42183 |
4 | ABY | Southwest Georgia Regional Airport | Albany | GA | USA | 31.53552 | -84.19447 |
airports.tail()
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
317 | WRG | Wrangell Airport | Wrangell | AK | USA | 56.48433 | -132.36982 |
318 | WYS | Westerly State Airport | West Yellowstone | MT | USA | 44.68840 | -111.11764 |
319 | XNA | Northwest Arkansas Regional Airport | Fayetteville/Springdale/Rogers | AR | USA | 36.28187 | -94.30681 |
320 | YAK | Yakutat Airport | Yakutat | AK | USA | 59.50336 | -139.66023 |
321 | YUM | Yuma International Airport | Yuma | AZ | USA | 32.65658 | -114.60597 |
airlines.head()
IATA_CODE | AIRLINE | |
---|---|---|
0 | UA | United Air Lines Inc. |
1 | AA | American Airlines Inc. |
2 | US | US Airways Inc. |
3 | F9 | Frontier Airlines Inc. |
4 | B6 | JetBlue Airways |
flights.head(10)
Unnamed: 0 | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 4 | AS | 98 | N407AS | ANC | SEA | 5 | ... | 408.0 | -22.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1 | 1 | 1 | 4 | AA | 2336 | N3KUAA | LAX | PBI | 10 | ... | 741.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | 1 | 1 | 4 | US | 840 | N171US | SFO | CLT | 20 | ... | 811.0 | 5.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | 1 | 1 | 4 | AA | 258 | N3HYAA | LAX | MIA | 20 | ... | 756.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4 | 1 | 1 | 4 | AS | 135 | N527AS | SEA | ANC | 25 | ... | 259.0 | -21.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 5 | 1 | 1 | 4 | DL | 806 | N3730B | SFO | MSP | 25 | ... | 610.0 | 8.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 6 | 1 | 1 | 4 | NK | 612 | N635NK | LAS | MSP | 25 | ... | 509.0 | -17.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 7 | 1 | 1 | 4 | US | 2013 | N584UW | LAX | CLT | 30 | ... | 753.0 | -10.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 8 | 1 | 1 | 4 | AA | 1112 | N3LAAA | SFO | DFW | 30 | ... | 532.0 | -13.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 9 | 1 | 1 | 4 | DL | 1173 | N826DN | LAS | ATL | 30 | ... | 656.0 | -15.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 31 columns
# Show columns
flights.columns
Index(['Unnamed: 0', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], dtype='object')
# Number of columns
airlines.shape
(14, 2)
airports.shape
(322, 7)
flights.shape
(13950, 31)
# Data size in megabytes (MB)
flights.__sizeof__()/1024/1024
6.4234418869018555
import sys
sys.getsizeof(flights)/1024/1024 # adds some additional byte overhead
6.423457145690918
Q1: What is the size of airports object in kilobytes (KB)?
# Number of unique values for each column
for col in flights.columns:
print(col,"\t=>\t",len(flights[col].unique()))
Unnamed: 0 => 13950 MONTH => 1 DAY => 1 DAY_OF_WEEK => 1 AIRLINE => 14 FLIGHT_NUMBER => 5010 TAIL_NUMBER => 3753 ORIGIN_AIRPORT => 307 DESTINATION_AIRPORT => 306 SCHEDULED_DEPARTURE => 1057 DEPARTURE_TIME => 1192 DEPARTURE_DELAY => 246 TAXI_OUT => 78 WHEELS_OFF => 1193 SCHEDULED_TIME => 412 ELAPSED_TIME => 426 AIR_TIME => 411 DISTANCE => 1188 WHEELS_ON => 1282 TAXI_IN => 63 SCHEDULED_ARRIVAL => 1215 ARRIVAL_TIME => 1276 ARRIVAL_DELAY => 262 DIVERTED => 2 CANCELLED => 2 CANCELLATION_REASON => 4 AIR_SYSTEM_DELAY => 87 SECURITY_DELAY => 13 AIRLINE_DELAY => 166 LATE_AIRCRAFT_DELAY => 156 WEATHER_DELAY => 65
for col in airports.columns:
print(col,"\t=>\t",len(airports[col].unique()))
IATA_CODE => 322 AIRPORT => 322 CITY => 308 STATE => 54 COUNTRY => 1 LATITUDE => 320 LONGITUDE => 320
Q2: In how many unique states are airports spread out?
# Data edits
flights.drop(columns=["Unnamed: 0", "MONTH", "DAY", "DAY_OF_WEEK"], inplace=True)
flights.columns
Index(['AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], dtype='object')
flights.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | 408.0 | -22.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | 741.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | 811.0 | 5.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | 756.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | 259.0 | -21.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 27 columns
print(flights["AIRLINE"].unique())
['AS' 'AA' 'US' 'DL' 'NK' 'UA' 'HA' 'B6' 'OO' 'EV' 'MQ' 'F9' 'WN' 'VX']
2. Data filtering¶
Which flights were diverted or canceled?
diverted = flights[flights['DIVERTED'] == 1]
diverted
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
724 | DL | 716 | N944DL | OMA | ATL | 645 | 637.0 | -8.0 | 27.0 | 704.0 | ... | 1451.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1455 | OO | 5237 | N712SK | MKE | IAH | 745 | 742.0 | -3.0 | 15.0 | 757.0 | ... | 1505.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2218 | WN | 1966 | N685SW | ATL | JAX | 845 | 922.0 | 37.0 | 9.0 | 931.0 | ... | 1219.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2526 | EV | 4555 | N12552 | IAH | HRL | 902 | 859.0 | -3.0 | 33.0 | 932.0 | ... | 1319.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2866 | WN | 1081 | N214WN | MDW | OKC | 930 | 926.0 | -4.0 | 9.0 | 935.0 | ... | 1343.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4225 | AA | 1279 | N541AA | OMA | DFW | 1100 | 1058.0 | -2.0 | 12.0 | 1110.0 | ... | 1838.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4394 | OO | 4506 | N613SK | SLC | SUN | 1110 | 1108.0 | -2.0 | 33.0 | 1141.0 | ... | 1353.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4574 | EV | 6171 | N17159 | DEN | GUC | 1122 | 1120.0 | -2.0 | 14.0 | 1134.0 | ... | 1635.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4636 | EV | 4169 | N11547 | IAH | ABQ | 1125 | 1122.0 | -3.0 | 24.0 | 1146.0 | ... | 1630.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4639 | EV | 4654 | N29515 | IAH | HRL | 1125 | 1206.0 | 41.0 | 11.0 | 1217.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
6422 | UA | 399 | N456UA | SAN | IAD | 1317 | 1504.0 | 107.0 | 15.0 | 1519.0 | ... | 119.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7281 | EV | 4654 | N14543 | CRP | HRL | 1410 | 1736.0 | 206.0 | 9.0 | 1745.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7508 | OO | 5488 | N791SK | IAH | ASE | 1427 | 1436.0 | 9.0 | 19.0 | 1455.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9204 | WN | 2324 | N8655D | LAS | MCO | 1615 | 1758.0 | 103.0 | 11.0 | 1809.0 | ... | 241.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9277 | US | 735 | N657AW | PHL | LAS | 1620 | 1617.0 | -3.0 | 44.0 | 1701.0 | ... | 2107.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
10147 | OO | 5165 | N932SW | DEN | BZN | 1715 | 1744.0 | 29.0 | 25.0 | 1809.0 | ... | 2206.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11120 | OO | 5203 | N774SK | LAX | ASE | 1816 | 1840.0 | 24.0 | 22.0 | 1902.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11427 | F9 | 369 | N905FR | PHF | DEN | 1840 | 1830.0 | -10.0 | 11.0 | 1841.0 | ... | 2231.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11888 | WN | 1004 | N8315C | MSY | DEN | 1915 | 1911.0 | -4.0 | 8.0 | 1919.0 | ... | 2322.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
12113 | EV | 4333 | N11191 | EWR | TUL | 1931 | 2052.0 | 81.0 | 18.0 | 2110.0 | ... | 36.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
20 rows × 27 columns
len(diverted)
20
cancelled = flights[flights['CANCELLED'] == 1]
cancelled
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | AS | 136 | N431AS | ANC | SEA | 135 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
42 | AA | 2459 | N3BDAA | PHX | DFW | 200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
68 | OO | 5254 | N746SK | MAF | IAH | 510 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
82 | MQ | 2859 | N660MQ | SGF | DFW | 525 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
90 | OO | 5460 | N583SW | RDD | SFO | 530 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13589 | EV | 4173 | N17196 | DEN | ABQ | 2200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | C | NaN | NaN | NaN | NaN | NaN |
13594 | OO | 5340 | N958SW | DEN | MAF | 2200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
13602 | OO | 6210 | N910SW | DEN | PSP | 2201 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | C | NaN | NaN | NaN | NaN | NaN |
13925 | UA | 470 | NaN | HNL | LAX | 2358 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
13949 | OO | 6419 | N117SY | SFO | SAT | 2359 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
466 rows × 27 columns
len(cancelled)
466
Q3: Are there flights that were both - diverted AND cancelled?
div_canc = flights[(flights['DIVERTED'] == 1) & (flights['CANCELLED'] == 1)]
div_canc
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY |
---|
0 rows × 27 columns
Select all flights where Origin airport is Wrangell Airport.
airport = airports[(airports['AIRPORT'] == "Wrangell Airport")]
airport
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
317 | WRG | Wrangell Airport | Wrangell | AK | USA | 56.48433 | -132.36982 |
iata_code = airport.at[317, 'IATA_CODE']
iata_code
'WRG'
flights_wrg_orig = flights[(flights['ORIGIN_AIRPORT'] == iata_code)]
flights_wrg_orig
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4231 | AS | 65 | N713AS | WRG | PSG | 1100 | 1048.0 | -12.0 | 5.0 | 1053.0 | ... | 1112.0 | -14.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9454 | AS | 64 | N799AS | WRG | KTN | 1630 | 1613.0 | -17.0 | 4.0 | 1617.0 | ... | 1702.0 | 0.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 27 columns
Q4: Select all flights where Destination airport is Yakutat Airport.
airport = airports[(airports['AIRPORT'] == "Aberdeen Regional Airport")]
airport
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
3 | ABR | Aberdeen Regional Airport | Aberdeen | SD | USA | 45.44906 | -98.42183 |
iata_code = airport.at[3, 'IATA_CODE']
iata_code
'ABR'
flights_wys_dest = flights[(flights['DESTINATION_AIRPORT'] == iata_code)]
flights_wys_dest
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4904 | OO | 7363 | N429SW | MSP | ABR | 1140 | 1135.0 | -5.0 | 15.0 | 1150.0 | ... | 1255.0 | -8.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9034 | OO | 7405 | N779CA | MSP | ABR | 1600 | 1557.0 | -3.0 | 25.0 | 1622.0 | ... | 1715.0 | -11.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 27 columns
3. Data ordering and grouping¶
Order flights according to arrival delay.
flights.groupby('AIRLINE')[['ARRIVAL_DELAY']].sum().sort_values('ARRIVAL_DELAY',ascending=False)
ARRIVAL_DELAY | |
---|---|
AIRLINE | |
AA | 21845.0 |
OO | 17953.0 |
WN | 15889.0 |
MQ | 9773.0 |
UA | 8163.0 |
F9 | 4405.0 |
EV | 3672.0 |
NK | 1992.0 |
HA | 1410.0 |
B6 | 895.0 |
US | 195.0 |
VX | -608.0 |
AS | -1623.0 |
DL | -11895.0 |
mean_delay = flights.groupby('AIRLINE')[['ARRIVAL_DELAY']].mean().sort_values('ARRIVAL_DELAY',ascending=False)
mean_delay
ARRIVAL_DELAY | |
---|---|
AIRLINE | |
F9 | 16.749049 |
AA | 15.841189 |
MQ | 13.941512 |
OO | 12.580939 |
NK | 7.191336 |
HA | 6.746411 |
UA | 6.432624 |
WN | 5.779920 |
EV | 2.923567 |
B6 | 1.182299 |
US | 0.189689 |
AS | -3.697039 |
VX | -4.000000 |
DL | -7.634788 |
Which airline has the greater arrival delay according to sum / according to mean?
greatest_delay = airlines[airlines['IATA_CODE'].isin(["AA", "OO", "MQ", "F9", "WN"])]
greatest_delay
IATA_CODE | AIRLINE | |
---|---|---|
1 | AA | American Airlines Inc. |
3 | F9 | Frontier Airlines Inc. |
5 | OO | Skywest Airlines Inc. |
8 | WN | Southwest Airlines Co. |
12 | MQ | American Eagle Airlines Inc. |
Which airline has the biggest number of occured flights?
ax = mean_delay.plot(kind = 'bar')
ax.set_ylabel('Mean arrival delay [mins]')
ax.set_xlabel('Airline IATA code')
Text(0.5, 0, 'Airline IATA code')
Sum arrival delay for particular airline, tail_number, origin airport and destination airport.
group=['AIRLINE','TAIL_NUMBER','ORIGIN_AIRPORT','DESTINATION_AIRPORT']
what=['ARRIVAL_DELAY']
delay_grouped =flights.groupby(group)[what].sum()
delay_grouped.head()
ARRIVAL_DELAY | ||||
---|---|---|---|---|
AIRLINE | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | |
AA | N001AA | DFW | MEM | -6.0 |
MEM | DFW | -9.0 | ||
N002AA | ORD | DFW | 26.0 | |
N003AA | DFW | ATL | 112.0 | |
HDN | 78.0 |
4. Data merging¶
Add coordinates of origin and destination airports to flights dataframe.
# Add origin airport info
# We are using left join (https://www.w3schools.com/sql/sql_join_left.asp) but the result after inner join is the same in that case
flights_orig_airport = pd.merge(flights,airports,how='left',left_on='ORIGIN_AIRPORT',right_on='IATA_CODE')
flights_orig_airport.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | NaN | NaN | NaN | ANC | Ted Stevens Anchorage International Airport | Anchorage | AK | USA | 61.17432 | -149.99619 |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | NaN | NaN | NaN | LAX | Los Angeles International Airport | Los Angeles | CA | USA | 33.94254 | -118.40807 |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | NaN | NaN | NaN | SFO | San Francisco International Airport | San Francisco | CA | USA | 37.61900 | -122.37484 |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | NaN | NaN | NaN | LAX | Los Angeles International Airport | Los Angeles | CA | USA | 33.94254 | -118.40807 |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | NaN | NaN | NaN | SEA | Seattle-Tacoma International Airport | Seattle | WA | USA | 47.44898 | -122.30931 |
5 rows × 34 columns
final = pd.merge(flights_orig_airport, airports,how='left',left_on='DESTINATION_AIRPORT',right_on='IATA_CODE',suffixes=['_FROM','_TO'])
final.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | COUNTRY_FROM | LATITUDE_FROM | LONGITUDE_FROM | IATA_CODE_TO | AIRPORT_TO | CITY_TO | STATE_TO | COUNTRY_TO | LATITUDE_TO | LONGITUDE_TO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | USA | 61.17432 | -149.99619 | SEA | Seattle-Tacoma International Airport | Seattle | WA | USA | 47.44898 | -122.30931 |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | USA | 33.94254 | -118.40807 | PBI | Palm Beach International Airport | West Palm Beach | FL | USA | 26.68316 | -80.09559 |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | USA | 37.61900 | -122.37484 | CLT | Charlotte Douglas International Airport | Charlotte | NC | USA | 35.21401 | -80.94313 |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | USA | 33.94254 | -118.40807 | MIA | Miami International Airport | Miami | FL | USA | 25.79325 | -80.29056 |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | USA | 47.44898 | -122.30931 | ANC | Ted Stevens Anchorage International Airport | Anchorage | AK | USA | 61.17432 | -149.99619 |
5 rows × 41 columns
final.columns
Index(['AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY', 'IATA_CODE_FROM', 'AIRPORT_FROM', 'CITY_FROM', 'STATE_FROM', 'COUNTRY_FROM', 'LATITUDE_FROM', 'LONGITUDE_FROM', 'IATA_CODE_TO', 'AIRPORT_TO', 'CITY_TO', 'STATE_TO', 'COUNTRY_TO', 'LATITUDE_TO', 'LONGITUDE_TO'], dtype='object')
final.describe()
FLIGHT_NUMBER | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | SCHEDULED_TIME | ELAPSED_TIME | AIR_TIME | DISTANCE | ... | CANCELLED | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | LATITUDE_FROM | LONGITUDE_FROM | LATITUDE_TO | LONGITUDE_TO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 13950.000000 | 13950.000000 | 13490.000000 | 13490.000000 | 13487.000000 | 13487.000000 | 13950.000000 | 13464.000000 | 13464.000000 | 13950.000000 | ... | 13950.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 13942.000000 | 13942.000000 | 13941.000000 | 13941.000000 |
mean | 2132.832903 | 1371.355412 | 1384.750482 | 9.610897 | 15.192185 | 1404.201898 | 146.253835 | 143.411096 | 120.778149 | 861.954552 | ... | 0.033405 | 9.845999 | 0.093070 | 19.108805 | 18.202544 | 1.489120 | 36.183917 | -96.206237 | 36.205137 | -96.120561 |
std | 1828.054396 | 457.929399 | 469.241484 | 32.286361 | 7.606390 | 471.950535 | 77.850137 | 77.583432 | 76.086711 | 624.169097 | ... | 0.179698 | 15.062108 | 2.288043 | 43.554639 | 33.921487 | 10.648063 | 6.105642 | 18.462665 | 6.105763 | 18.470415 |
min | 1.000000 | 5.000000 | 1.000000 | -27.000000 | 2.000000 | 1.000000 | 23.000000 | 23.000000 | 14.000000 | 31.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.483450 | -176.646030 | 13.483450 | -176.646030 |
25% | 688.000000 | 1010.000000 | 1017.000000 | -4.000000 | 11.000000 | 1031.000000 | 88.000000 | 85.000000 | 63.000000 | 399.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 32.895950 | -111.977770 | 32.895950 | -111.977770 |
50% | 1548.000000 | 1351.000000 | 1405.500000 | -1.000000 | 13.000000 | 1417.000000 | 130.000000 | 127.000000 | 104.000000 | 718.000000 | ... | 0.000000 | 3.000000 | 0.000000 | 7.000000 | 3.000000 | 0.000000 | 36.080360 | -93.216920 | 36.080360 | -93.216920 |
75% | 3158.000000 | 1735.000000 | 1748.000000 | 11.000000 | 17.000000 | 1800.000000 | 180.000000 | 177.000000 | 153.000000 | 1101.250000 | ... | 0.000000 | 16.000000 | 0.000000 | 22.000000 | 23.000000 | 0.000000 | 40.692500 | -81.442470 | 40.692500 | -81.316030 |
max | 7438.000000 | 2359.000000 | 2359.000000 | 1190.000000 | 113.000000 | 2400.000000 | 718.000000 | 711.000000 | 669.000000 | 4983.000000 | ... | 1.000000 | 169.000000 | 107.000000 | 1169.000000 | 467.000000 | 213.000000 | 71.285450 | -64.798560 | 71.285450 | -64.798560 |
8 rows × 26 columns
5. Visualization using Folium library¶
final.LATITUDE_FROM.mean()
36.18391741285325
final.LONGITUDE_FROM.mean()
-96.20623731889255
center=[36.18,-96.21]
import folium
colors= [
'red',
'blue',
'gray',
'darkred',
'lightred',
'orange',
'beige',
'green',
'darkgreen',
'lightgreen',
'darkblue',
'lightblue',
'purple',
'darkpurple',
'pink',
'cadetblue',
'lightgray',
'black'
]
my_map = folium.Map(location=center, zoom_start=4)
for i, row in final.head(15).iterrows():
folium.Marker([row.LATITUDE_FROM, row.LONGITUDE_FROM],
icon=folium.Icon(color=colors[i]),
tooltip=row.AIRLINE,popup="origin airport: {}, origin city: {}".format(row.AIRPORT_FROM, row.CITY_FROM)).add_to(my_map)
folium.Marker([row.LATITUDE_TO, row.LONGITUDE_TO],
icon=folium.Icon(color=colors[i]),
tooltip=row.AIRLINE,popup="destination airport: {}, destination city: {}".format(row.AIRPORT_TO, row.CITY_TO)).add_to(my_map)
my_map
my_map = folium.Map(location=center, zoom_start=4)
for i, row in final.head(15).iterrows():
folium.CircleMarker([row.LATITUDE_FROM, row.LONGITUDE_FROM],
s=1,alpha=1,edgecolor=None, color=colors[i],
tooltip=row.AIRLINE,popup="origin airport: {}, origin city: {}".format(row.AIRPORT_FROM, row.CITY_FROM)).add_to(my_map)
folium.CircleMarker([row.LATITUDE_TO, row.LONGITUDE_TO],
s=1,alpha=1,edgecolor=None, color=colors[i],
tooltip=row.AIRLINE,popup="destination airport: {}, destination city: {}".format(row.AIRPORT_TO, row.CITY_TO)).add_to(my_map)
line = [[row.LATITUDE_FROM, row.LONGITUDE_FROM], [row.LATITUDE_TO, row.LONGITUDE_TO]]
folium.PolyLine(line, color=colors[i], weight=2.5, opacity=1).add_to(my_map)
my_map
6. Saving resulting table to SQLITE database¶
import sqlite3
# Connects or creates SQL database
conn = sqlite3.connect("/home/linduska/eramus_vyuka/155isdp/flights.db")
conn.commit()
# Create a table called flights in the database:
final.to_sql('flights', conn, if_exists='replace')
conn.commit()
conn.close() #This is the one you need
conn.close()
Checking the results
import sqlite3
# Read sqlite query results into a pandas DataFrame
conn = sqlite3.connect("/home/linduska/eramus_vyuka/155isdp/flights.db")
check = pd.read_sql_query("SELECT * from flights limit 10", conn)
# Verify that result of SQL query is stored in the dataframe
print(check.head(5))
conn.close()
index AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT \ 0 0 AS 98 N407AS ANC 1 1 AA 2336 N3KUAA LAX 2 2 US 840 N171US SFO 3 3 AA 258 N3HYAA LAX 4 4 AS 135 N527AS SEA DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY \ 0 SEA 5 2354.0 -11.0 1 PBI 10 2.0 -8.0 2 CLT 20 18.0 -2.0 3 MIA 20 15.0 -5.0 4 ANC 25 24.0 -1.0 TAXI_OUT ... COUNTRY_FROM LATITUDE_FROM LONGITUDE_FROM IATA_CODE_TO \ 0 21.0 ... USA 61.17432 -149.99619 SEA 1 12.0 ... USA 33.94254 -118.40807 PBI 2 16.0 ... USA 37.61900 -122.37484 CLT 3 15.0 ... USA 33.94254 -118.40807 MIA 4 11.0 ... USA 47.44898 -122.30931 ANC AIRPORT_TO CITY_TO STATE_TO \ 0 Seattle-Tacoma International Airport Seattle WA 1 Palm Beach International Airport West Palm Beach FL 2 Charlotte Douglas International Airport Charlotte NC 3 Miami International Airport Miami FL 4 Ted Stevens Anchorage International Airport Anchorage AK COUNTRY_TO LATITUDE_TO LONGITUDE_TO 0 USA 47.44898 -122.30931 1 USA 26.68316 -80.09559 2 USA 35.21401 -80.94313 3 USA 25.79325 -80.29056 4 USA 61.17432 -149.99619 [5 rows x 42 columns]