Analýza letů¶
V tomto cvičení si studenti osvojí základy analýzy a správy dat pomocí knihovny Pandas.
Data¶
Zdroj dat: https://www.kaggle.com/datasets/usdot/flight-delays
První soubor airports.csv obsahuje informace o letištích v U.S.A. Konrétně jde o název letiště, kód letiště dle International Air Transport Association (IATA) a informace o umístění letiště (město, stát, souřadnice v souřadnicovém systému WGS-84 (EPSG:4326).
Druhý soubor airlines.csv obsahuje informace o aerolinkách operujících v U.S.A. Konkrétně jde o název aerolinky a kódu dle IATA.
Poslední soubor flights.csv obsahuje řadu informace o letech. Konkrétně jde o výchozím a cílovém letišti, čas odletu, zpoždění, o případném odklonu či zrušení letu. Původní soubor byl příliš veliký. Pro naše účely byly použity pouze lety, které se uskutečnily dne 1.1.2015.
Úvod¶
Naše dat jsou uložena v formátu CSV (Comma Separated Values). Nicméně knihovna Pandas uložňuje zpracovat data i v jiných formátech jako je např. TXT, JSON, XML, SQL a další.
Jak "velká" data lze zpracovat?
< 5GB ... Pandas
< 30GB ... Pandas chunks
< 200 GB ... DASK dataframes (https://www.dask.org/)
> 200 GB ... PySpark = Python API pro spouštění aplikaci paralelně na distribuovaném Apache Spark clusteru (https://sparkbyexamples.com/pyspark-tutorial/#what-is-pyspark)
1. Načtení dat a první kroky¶
import pandas as pd
airports=pd.read_csv('airports.csv')
airlines=pd.read_csv('airlines.csv')
flights=pd.read_csv('flights.csv')
Zobrazíme úvodní část načtených dat
airports.head()
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
0 | ABE | Lehigh Valley International Airport | Allentown | PA | USA | 40.65236 | -75.44040 |
1 | ABI | Abilene Regional Airport | Abilene | TX | USA | 32.41132 | -99.68190 |
2 | ABQ | Albuquerque International Sunport | Albuquerque | NM | USA | 35.04022 | -106.60919 |
3 | ABR | Aberdeen Regional Airport | Aberdeen | SD | USA | 45.44906 | -98.42183 |
4 | ABY | Southwest Georgia Regional Airport | Albany | GA | USA | 31.53552 | -84.19447 |
airports.tail()
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
317 | WRG | Wrangell Airport | Wrangell | AK | USA | 56.48433 | -132.36982 |
318 | WYS | Westerly State Airport | West Yellowstone | MT | USA | 44.68840 | -111.11764 |
319 | XNA | Northwest Arkansas Regional Airport | Fayetteville/Springdale/Rogers | AR | USA | 36.28187 | -94.30681 |
320 | YAK | Yakutat Airport | Yakutat | AK | USA | 59.50336 | -139.66023 |
321 | YUM | Yuma International Airport | Yuma | AZ | USA | 32.65658 | -114.60597 |
airlines.head()
IATA_CODE | AIRLINE | |
---|---|---|
0 | UA | United Air Lines Inc. |
1 | AA | American Airlines Inc. |
2 | US | US Airways Inc. |
3 | F9 | Frontier Airlines Inc. |
4 | B6 | JetBlue Airways |
flights.head(10)
Unnamed: 0 | MONTH | DAY | DAY_OF_WEEK | AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 4 | AS | 98 | N407AS | ANC | SEA | 5 | ... | 408.0 | -22.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1 | 1 | 1 | 4 | AA | 2336 | N3KUAA | LAX | PBI | 10 | ... | 741.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | 1 | 1 | 4 | US | 840 | N171US | SFO | CLT | 20 | ... | 811.0 | 5.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | 1 | 1 | 4 | AA | 258 | N3HYAA | LAX | MIA | 20 | ... | 756.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 4 | 1 | 1 | 4 | AS | 135 | N527AS | SEA | ANC | 25 | ... | 259.0 | -21.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 5 | 1 | 1 | 4 | DL | 806 | N3730B | SFO | MSP | 25 | ... | 610.0 | 8.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 6 | 1 | 1 | 4 | NK | 612 | N635NK | LAS | MSP | 25 | ... | 509.0 | -17.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 7 | 1 | 1 | 4 | US | 2013 | N584UW | LAX | CLT | 30 | ... | 753.0 | -10.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 8 | 1 | 1 | 4 | AA | 1112 | N3LAAA | SFO | DFW | 30 | ... | 532.0 | -13.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 9 | 1 | 1 | 4 | DL | 1173 | N826DN | LAS | ATL | 30 | ... | 656.0 | -15.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 31 columns
Seznam sloupců:
flights.columns
Index(['Unnamed: 0', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], dtype='object')
Rozměr datové matice:
airlines.shape
(14, 2)
airports.shape
(322, 7)
flights.shape
(13950, 31)
Velikost dat v MB:
flights.__sizeof__()/1024/1024
6.423445701599121
O1: Jaká je velikost souboru letišť v kilobytech (KB)? Porovnej s velikostí souboru na disku.
Počet unikátních hodnot v jednotlivých sloupcích:
for col in flights.columns:
print(col, "\t=>\t", len(flights[col].unique()))
Unnamed: 0 => 13950 MONTH => 1 DAY => 1 DAY_OF_WEEK => 1 AIRLINE => 14 FLIGHT_NUMBER => 5010 TAIL_NUMBER => 3753 ORIGIN_AIRPORT => 307 DESTINATION_AIRPORT => 306 SCHEDULED_DEPARTURE => 1057 DEPARTURE_TIME => 1192 DEPARTURE_DELAY => 246 TAXI_OUT => 78 WHEELS_OFF => 1193 SCHEDULED_TIME => 412 ELAPSED_TIME => 426 AIR_TIME => 411 DISTANCE => 1188 WHEELS_ON => 1282 TAXI_IN => 63 SCHEDULED_ARRIVAL => 1215 ARRIVAL_TIME => 1276 ARRIVAL_DELAY => 262 DIVERTED => 2 CANCELLED => 2 CANCELLATION_REASON => 4 AIR_SYSTEM_DELAY => 87 SECURITY_DELAY => 13 AIRLINE_DELAY => 166 LATE_AIRCRAFT_DELAY => 156 WEATHER_DELAY => 65
for col in airports.columns:
print(col, "\t=>\t", len(airports[col].unique()))
IATA_CODE => 322 AIRPORT => 322 CITY => 308 STATE => 54 COUNTRY => 1 LATITUDE => 320 LONGITUDE => 320
O2: V kolika státech se nachází alespoň jedno letiště?
Editace dat:
flights.drop(columns=["Unnamed: 0", "MONTH", "DAY", "DAY_OF_WEEK"], inplace=True)
flights.columns
Index(['AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'], dtype='object')
flights.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | 408.0 | -22.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | 741.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | 811.0 | 5.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | 756.0 | -9.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | 259.0 | -21.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 27 columns
print(flights["AIRLINE"].unique())
['AS' 'AA' 'US' 'DL' 'NK' 'UA' 'HA' 'B6' 'OO' 'EV' 'MQ' 'F9' 'WN' 'VX']
2. Filtrace dat¶
Kolik letů bylo odkloněno či zrušeno?
diverted = flights[flights['DIVERTED'] == 1]
diverted
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
724 | DL | 716 | N944DL | OMA | ATL | 645 | 637.0 | -8.0 | 27.0 | 704.0 | ... | 1451.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
1455 | OO | 5237 | N712SK | MKE | IAH | 745 | 742.0 | -3.0 | 15.0 | 757.0 | ... | 1505.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2218 | WN | 1966 | N685SW | ATL | JAX | 845 | 922.0 | 37.0 | 9.0 | 931.0 | ... | 1219.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2526 | EV | 4555 | N12552 | IAH | HRL | 902 | 859.0 | -3.0 | 33.0 | 932.0 | ... | 1319.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2866 | WN | 1081 | N214WN | MDW | OKC | 930 | 926.0 | -4.0 | 9.0 | 935.0 | ... | 1343.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4225 | AA | 1279 | N541AA | OMA | DFW | 1100 | 1058.0 | -2.0 | 12.0 | 1110.0 | ... | 1838.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4394 | OO | 4506 | N613SK | SLC | SUN | 1110 | 1108.0 | -2.0 | 33.0 | 1141.0 | ... | 1353.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4574 | EV | 6171 | N17159 | DEN | GUC | 1122 | 1120.0 | -2.0 | 14.0 | 1134.0 | ... | 1635.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4636 | EV | 4169 | N11547 | IAH | ABQ | 1125 | 1122.0 | -3.0 | 24.0 | 1146.0 | ... | 1630.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
4639 | EV | 4654 | N29515 | IAH | HRL | 1125 | 1206.0 | 41.0 | 11.0 | 1217.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
6422 | UA | 399 | N456UA | SAN | IAD | 1317 | 1504.0 | 107.0 | 15.0 | 1519.0 | ... | 119.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7281 | EV | 4654 | N14543 | CRP | HRL | 1410 | 1736.0 | 206.0 | 9.0 | 1745.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
7508 | OO | 5488 | N791SK | IAH | ASE | 1427 | 1436.0 | 9.0 | 19.0 | 1455.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9204 | WN | 2324 | N8655D | LAS | MCO | 1615 | 1758.0 | 103.0 | 11.0 | 1809.0 | ... | 241.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9277 | US | 735 | N657AW | PHL | LAS | 1620 | 1617.0 | -3.0 | 44.0 | 1701.0 | ... | 2107.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
10147 | OO | 5165 | N932SW | DEN | BZN | 1715 | 1744.0 | 29.0 | 25.0 | 1809.0 | ... | 2206.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11120 | OO | 5203 | N774SK | LAX | ASE | 1816 | 1840.0 | 24.0 | 22.0 | 1902.0 | ... | NaN | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11427 | F9 | 369 | N905FR | PHF | DEN | 1840 | 1830.0 | -10.0 | 11.0 | 1841.0 | ... | 2231.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
11888 | WN | 1004 | N8315C | MSY | DEN | 1915 | 1911.0 | -4.0 | 8.0 | 1919.0 | ... | 2322.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
12113 | EV | 4333 | N11191 | EWR | TUL | 1931 | 2052.0 | 81.0 | 18.0 | 2110.0 | ... | 36.0 | NaN | 1 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
20 rows × 27 columns
len(diverted)
20
cancelled = flights[flights['CANCELLED'] == 1]
cancelled
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | AS | 136 | N431AS | ANC | SEA | 135 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
42 | AA | 2459 | N3BDAA | PHX | DFW | 200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
68 | OO | 5254 | N746SK | MAF | IAH | 510 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
82 | MQ | 2859 | N660MQ | SGF | DFW | 525 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
90 | OO | 5460 | N583SW | RDD | SFO | 530 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13589 | EV | 4173 | N17196 | DEN | ABQ | 2200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | C | NaN | NaN | NaN | NaN | NaN |
13594 | OO | 5340 | N958SW | DEN | MAF | 2200 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
13602 | OO | 6210 | N910SW | DEN | PSP | 2201 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | C | NaN | NaN | NaN | NaN | NaN |
13925 | UA | 470 | NaN | HNL | LAX | 2358 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | A | NaN | NaN | NaN | NaN | NaN |
13949 | OO | 6419 | N117SY | SFO | SAT | 2359 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 0 | 1 | B | NaN | NaN | NaN | NaN | NaN |
466 rows × 27 columns
len(cancelled)
466
O3: Existují lety, které byly zároveň odkloněny a zrušeny?
Vyberte všechny lety, které byly odlétaly z Wrangell Airport.
airport = airports[(airports['AIRPORT'] == "Wrangell Airport")]
airport
IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|
317 | WRG | Wrangell Airport | Wrangell | AK | USA | 56.48433 | -132.36982 |
iata_code = airport.iloc[0]['IATA_CODE']
iata_code
'WRG'
flights_wrg_orig = flights[flights['ORIGIN_AIRPORT'] == iata_code]
flights_wrg_orig
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | ARRIVAL_TIME | ARRIVAL_DELAY | DIVERTED | CANCELLED | CANCELLATION_REASON | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4231 | AS | 65 | N713AS | WRG | PSG | 1100 | 1048.0 | -12.0 | 5.0 | 1053.0 | ... | 1112.0 | -14.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
9454 | AS | 64 | N799AS | WRG | KTN | 1630 | 1613.0 | -17.0 | 4.0 | 1617.0 | ... | 1702.0 | 0.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 27 columns
O4: Vyberte všechny lety, které mířily na Aberdeen Regional Airport.
3. Řazení a agregace dat¶
sum_canceled = flights.groupby('AIRLINE')[['CANCELLED']].sum().sort_values('CANCELLED')
sum_canceled = sum_canceled[sum_canceled > 0].dropna()
ax = sum_canceled.plot(kind="bar")
ax.legend(["Zrušené lety"])
Seřaďte lety podle zpoždění příletu.
sum_delay = flights.groupby('AIRLINE', as_index=False)[['ARRIVAL_DELAY']].sum().sort_values('ARRIVAL_DELAY', ascending=False)
sum_delay
AIRLINE | ARRIVAL_DELAY | |
---|---|---|
0 | AA | 21845.0 |
9 | OO | 17953.0 |
13 | WN | 15889.0 |
7 | MQ | 9773.0 |
10 | UA | 8163.0 |
5 | F9 | 4405.0 |
4 | EV | 3672.0 |
8 | NK | 1992.0 |
6 | HA | 1410.0 |
2 | B6 | 895.0 |
11 | US | 195.0 |
12 | VX | -608.0 |
1 | AS | -1623.0 |
3 | DL | -11895.0 |
mean_delay = flights.groupby('AIRLINE')[['ARRIVAL_DELAY']].mean().sort_values('ARRIVAL_DELAY', ascending=False)
mean_delay
ARRIVAL_DELAY | |
---|---|
AIRLINE | |
F9 | 16.749049 |
AA | 15.841189 |
MQ | 13.941512 |
OO | 12.580939 |
NK | 7.191336 |
HA | 6.746411 |
UA | 6.432624 |
WN | 5.779920 |
EV | 2.923567 |
B6 | 1.182299 |
US | 0.189689 |
AS | -3.697039 |
VX | -4.000000 |
DL | -7.634788 |
Které aerolinie májí největší míru zpoždění příletu podle součtu / průměrné hodnoty?
first_five = sum_delay.head(5)["AIRLINE"].tolist()
greatest_delay = airlines[airlines['IATA_CODE'].isin(first_five)]
greatest_delay
IATA_CODE | AIRLINE | |
---|---|---|
0 | UA | United Air Lines Inc. |
1 | AA | American Airlines Inc. |
5 | OO | Skywest Airlines Inc. |
8 | WN | Southwest Airlines Co. |
12 | MQ | American Eagle Airlines Inc. |
Která linka má největší počet letů?
ax = mean_delay.plot(kind = 'bar')
ax.set_ylabel('Průměrná hodnota zpoždění letu [min]')
ax.set_xlabel('IATA kód linky')
ax.legend(["Zpoždění letu"])
Jaký je součet zpoždění pro danou linku, číslo letadla, výchozí a cílové letiště.
group=['AIRLINE','TAIL_NUMBER','ORIGIN_AIRPORT','DESTINATION_AIRPORT']
what=['ARRIVAL_DELAY']
delay_grouped =flights.groupby(group)[what].sum()
delay_grouped.head()
ARRIVAL_DELAY | ||||
---|---|---|---|---|
AIRLINE | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | |
AA | N001AA | DFW | MEM | -6.0 |
MEM | DFW | -9.0 | ||
N002AA | ORD | DFW | 26.0 | |
N003AA | DFW | ATL | 112.0 | |
HDN | 78.0 |
4. Spojovaní dat¶
Přidejme souřadnice výchozího a cílového letiště k letům. V příkladu níže použijeme left join.
flights_orig_airport = pd.merge(flights,airports,how='left',left_on='ORIGIN_AIRPORT',right_on='IATA_CODE')
flights_orig_airport.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | IATA_CODE | AIRPORT | CITY | STATE | COUNTRY | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | NaN | NaN | NaN | ANC | Ted Stevens Anchorage International Airport | Anchorage | AK | USA | 61.17432 | -149.99619 |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | NaN | NaN | NaN | LAX | Los Angeles International Airport | Los Angeles | CA | USA | 33.94254 | -118.40807 |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | NaN | NaN | NaN | SFO | San Francisco International Airport | San Francisco | CA | USA | 37.61900 | -122.37484 |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | NaN | NaN | NaN | LAX | Los Angeles International Airport | Los Angeles | CA | USA | 33.94254 | -118.40807 |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | NaN | NaN | NaN | SEA | Seattle-Tacoma International Airport | Seattle | WA | USA | 47.44898 | -122.30931 |
5 rows × 34 columns
final = pd.merge(flights_orig_airport, airports,how='left',left_on='DESTINATION_AIRPORT',right_on='IATA_CODE',suffixes=['_FROM','_TO'])
final.head(5)
AIRLINE | FLIGHT_NUMBER | TAIL_NUMBER | ORIGIN_AIRPORT | DESTINATION_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | ... | COUNTRY_FROM | LATITUDE_FROM | LONGITUDE_FROM | IATA_CODE_TO | AIRPORT_TO | CITY_TO | STATE_TO | COUNTRY_TO | LATITUDE_TO | LONGITUDE_TO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AS | 98 | N407AS | ANC | SEA | 5 | 2354.0 | -11.0 | 21.0 | 15.0 | ... | USA | 61.17432 | -149.99619 | SEA | Seattle-Tacoma International Airport | Seattle | WA | USA | 47.44898 | -122.30931 |
1 | AA | 2336 | N3KUAA | LAX | PBI | 10 | 2.0 | -8.0 | 12.0 | 14.0 | ... | USA | 33.94254 | -118.40807 | PBI | Palm Beach International Airport | West Palm Beach | FL | USA | 26.68316 | -80.09559 |
2 | US | 840 | N171US | SFO | CLT | 20 | 18.0 | -2.0 | 16.0 | 34.0 | ... | USA | 37.61900 | -122.37484 | CLT | Charlotte Douglas International Airport | Charlotte | NC | USA | 35.21401 | -80.94313 |
3 | AA | 258 | N3HYAA | LAX | MIA | 20 | 15.0 | -5.0 | 15.0 | 30.0 | ... | USA | 33.94254 | -118.40807 | MIA | Miami International Airport | Miami | FL | USA | 25.79325 | -80.29056 |
4 | AS | 135 | N527AS | SEA | ANC | 25 | 24.0 | -1.0 | 11.0 | 35.0 | ... | USA | 47.44898 | -122.30931 | ANC | Ted Stevens Anchorage International Airport | Anchorage | AK | USA | 61.17432 | -149.99619 |
5 rows × 41 columns
final.columns
Index(['AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY', 'IATA_CODE_FROM', 'AIRPORT_FROM', 'CITY_FROM', 'STATE_FROM', 'COUNTRY_FROM', 'LATITUDE_FROM', 'LONGITUDE_FROM', 'IATA_CODE_TO', 'AIRPORT_TO', 'CITY_TO', 'STATE_TO', 'COUNTRY_TO', 'LATITUDE_TO', 'LONGITUDE_TO'], dtype='object')
final.describe()
FLIGHT_NUMBER | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | TAXI_OUT | WHEELS_OFF | SCHEDULED_TIME | ELAPSED_TIME | AIR_TIME | DISTANCE | ... | CANCELLED | AIR_SYSTEM_DELAY | SECURITY_DELAY | AIRLINE_DELAY | LATE_AIRCRAFT_DELAY | WEATHER_DELAY | LATITUDE_FROM | LONGITUDE_FROM | LATITUDE_TO | LONGITUDE_TO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 13950.000000 | 13950.000000 | 13490.000000 | 13490.000000 | 13487.000000 | 13487.000000 | 13950.000000 | 13464.000000 | 13464.000000 | 13950.000000 | ... | 13950.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 2987.000000 | 13942.000000 | 13942.000000 | 13941.000000 | 13941.000000 |
mean | 2132.832903 | 1371.355412 | 1384.750482 | 9.610897 | 15.192185 | 1404.201898 | 146.253835 | 143.411096 | 120.778149 | 861.954552 | ... | 0.033405 | 9.845999 | 0.093070 | 19.108805 | 18.202544 | 1.489120 | 36.183917 | -96.206237 | 36.205137 | -96.120561 |
std | 1828.054396 | 457.929399 | 469.241484 | 32.286361 | 7.606390 | 471.950535 | 77.850137 | 77.583432 | 76.086711 | 624.169097 | ... | 0.179698 | 15.062108 | 2.288043 | 43.554639 | 33.921487 | 10.648063 | 6.105642 | 18.462665 | 6.105763 | 18.470415 |
min | 1.000000 | 5.000000 | 1.000000 | -27.000000 | 2.000000 | 1.000000 | 23.000000 | 23.000000 | 14.000000 | 31.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 13.483450 | -176.646030 | 13.483450 | -176.646030 |
25% | 688.000000 | 1010.000000 | 1017.000000 | -4.000000 | 11.000000 | 1031.000000 | 88.000000 | 85.000000 | 63.000000 | 399.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 32.895950 | -111.977770 | 32.895950 | -111.977770 |
50% | 1548.000000 | 1351.000000 | 1405.500000 | -1.000000 | 13.000000 | 1417.000000 | 130.000000 | 127.000000 | 104.000000 | 718.000000 | ... | 0.000000 | 3.000000 | 0.000000 | 7.000000 | 3.000000 | 0.000000 | 36.080360 | -93.216920 | 36.080360 | -93.216920 |
75% | 3158.000000 | 1735.000000 | 1748.000000 | 11.000000 | 17.000000 | 1800.000000 | 180.000000 | 177.000000 | 153.000000 | 1101.250000 | ... | 0.000000 | 16.000000 | 0.000000 | 22.000000 | 23.000000 | 0.000000 | 40.692500 | -81.442470 | 40.692500 | -81.316030 |
max | 7438.000000 | 2359.000000 | 2359.000000 | 1190.000000 | 113.000000 | 2400.000000 | 718.000000 | 711.000000 | 669.000000 | 4983.000000 | ... | 1.000000 | 169.000000 | 107.000000 | 1169.000000 | 467.000000 | 213.000000 | 71.285450 | -64.798560 | 71.285450 | -64.798560 |
8 rows × 26 columns
5. Uložení výsledné tabulky do SQLITE databáze¶
Vytvoříme SQLite databázi a připojení k ní:
# Connects or creates SQL database
import sqlite3
db = "/home/martin/Downloads/flights.db"
conn = sqlite3.connect(db)
conn.commit()
Dále vytvoříme tabulku letů:
final.to_sql('flights', conn, if_exists='replace')
conn.commit()
conn.close()
Zkontrolujeme výsledek:
import sqlite3
# Read sqlite query results into a pandas DataFrame
conn = sqlite3.connect(db)
check = pd.read_sql_query("SELECT * from flights limit 10", conn)
# Verify that result of SQL query is stored in the dataframe
print(check.head(5))
conn.close()
index AIRLINE FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT \ 0 0 AS 98 N407AS ANC 1 1 AA 2336 N3KUAA LAX 2 2 US 840 N171US SFO 3 3 AA 258 N3HYAA LAX 4 4 AS 135 N527AS SEA DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY \ 0 SEA 5 2354.0 -11.0 1 PBI 10 2.0 -8.0 2 CLT 20 18.0 -2.0 3 MIA 20 15.0 -5.0 4 ANC 25 24.0 -1.0 TAXI_OUT ... COUNTRY_FROM LATITUDE_FROM LONGITUDE_FROM IATA_CODE_TO \ 0 21.0 ... USA 61.17432 -149.99619 SEA 1 12.0 ... USA 33.94254 -118.40807 PBI 2 16.0 ... USA 37.61900 -122.37484 CLT 3 15.0 ... USA 33.94254 -118.40807 MIA 4 11.0 ... USA 47.44898 -122.30931 ANC AIRPORT_TO CITY_TO STATE_TO \ 0 Seattle-Tacoma International Airport Seattle WA 1 Palm Beach International Airport West Palm Beach FL 2 Charlotte Douglas International Airport Charlotte NC 3 Miami International Airport Miami FL 4 Ted Stevens Anchorage International Airport Anchorage AK COUNTRY_TO LATITUDE_TO LONGITUDE_TO 0 USA 47.44898 -122.30931 1 USA 26.68316 -80.09559 2 USA 35.21401 -80.94313 3 USA 25.79325 -80.29056 4 USA 61.17432 -149.99619 [5 rows x 42 columns]
6. Vizualizace pomocí knihovny Folium¶
final.LATITUDE_FROM.mean()
36.18391741285325
final.LONGITUDE_FROM.mean()
-96.20623731889255
stred=[final.LATITUDE_FROM.mean(), final.LONGITUDE_FROM.mean()]
import folium
colors= [
'red',
'blue',
'gray',
'darkred',
'lightred',
'orange',
'beige',
'green',
'darkgreen',
'lightgreen',
'darkblue',
'lightblue',
'purple',
'darkpurple',
'pink',
'cadetblue',
'lightgray',
'black'
]
mapa = folium.Map(location=stred, zoom_start=4)
for i, row in final.head(15).iterrows():
folium.Marker([row.LATITUDE_FROM, row.LONGITUDE_FROM],
icon=folium.Icon(color=colors[i]),
tooltip=row.AIRLINE,popup="origin airport: {}, origin city: {}".format(row.AIRPORT_FROM, row.CITY_FROM)).add_to(mapa)
folium.Marker([row.LATITUDE_TO, row.LONGITUDE_TO],
icon=folium.Icon(color=colors[i]),
tooltip=row.AIRLINE,popup="destination airport: {}, destination city: {}".format(row.AIRPORT_TO, row.CITY_TO)).add_to(mapa)
mapa
mapa = folium.Map(location=stred, zoom_start=4)
for i, row in final.head(15).iterrows():
folium.CircleMarker([row.LATITUDE_FROM, row.LONGITUDE_FROM],
s=1,alpha=1,edgecolor=None, color=colors[i],
tooltip=row.AIRLINE,popup="origin airport: {}, origin city: {}".format(row.AIRPORT_FROM, row.CITY_FROM)).add_to(mapa)
folium.CircleMarker([row.LATITUDE_TO, row.LONGITUDE_TO],
s=1,alpha=1,edgecolor=None, color=colors[i],
tooltip=row.AIRLINE,popup="destination airport: {}, destination city: {}".format(row.AIRPORT_TO, row.CITY_TO)).add_to(mapa)
line = [[row.LATITUDE_FROM, row.LONGITUDE_FROM], [row.LATITUDE_TO, row.LONGITUDE_TO]]
folium.PolyLine(line, color=colors[i], weight=2.5, opacity=1).add_to(mapa)
mapa