8.2 Flight analysis in SQL¶
This lesson aims to introduce students to data analysis and data management using SQL.
Data¶
We will work with three CSV files introduced in Lesson 5. Full dataset is available at https://geo.fsv.cvut.cz/courses/155isdp/data/08/.
We will import all three files into SQLite database using DB Browser for SQLite. First create a new database (skip enter new table definition). A CSV file can be imported into database using Import -> Table from CSV file.
Note
Don't forget to write changes into database.
Warm-up exercise¶
select * from airports limit 1;
select * from airports where state = 'PA';
select * from airports where airport like '%Regional%';
select (
(select count(*) from airports where airport like '%Regional%') /
(select cast(count(*) as float) from airports)) * 100;
select * from flights limit 1;
select count(*) from flights WHERE origin_airport = 'ANC';
select * from flights WHERE origin_airport = 'ANC' ORDER BY DESTINATION_AIRPORT;
select month,count(*) from flights group by month;
select month,count(*) as count from flights group by month having origin_airport = 'ANC' order by count desc;
Indexing data¶
Compare query plan with/without index created on columns origin_airport and month.
explain query plan select count(*) from flights WHERE origin_airport = 'ANC';
explain query plan select month,count(*) as count from flights group by month having origin_airport = 'ANC' order by count desc;
Indeces are created by CREATE INDEX SQL statement:
create index flights_origin_airport on flights (ORIGIN_AIRPORT);
create index flights_month on flights (month);
Investigate:
Note that an index may be created on multiple columns:
Views¶
View is created by CREATE VIEW statement.
create view flights_anc as select * from flights where origin_airport = 'ANC';
select * from flights_anc;
Views query plan is affected by indeces. Compare query plan below with and without index on column destination_airport.
create view flights_anc_all as select * from flights where origin_airport = 'ANC' or destination_airport = 'ANC';
explain query plan select * from flights_anc_all;
Joins¶
select flight_number, day, month, year, a.airline from flights as f
join airlines as a on f.airline = a.iata_code;
Compare performance with indeces defined:
create index airlines_code on airlines (iata_code);
create index flights_airline on flights (airline);
create index airports_code on airports (iata_code);
select flight_number, day, month, year, a.airline,
ao.airport as origin_airport, ao.city as origin_city, ao.state as origin_state,
ad.airport as destination_airport, ad.city as destination_city, ad.state as destination_state
from flights as f
join airlines as a on f.airline = a.iata_code
join airports as ao on f.origin_airport = ao.iata_code
join airports as ad on f.destination_airport = ad.iata_code;
Join Types¶
SQL defines three major types of joins:
- INNER JOIN (or JOIN)
- OUTER JOIN (LEFT, RIGHT, FULL)
- CROSS JOIN
Source: https://www.geeksforgeeks.org/python-sqlite-join-clause/
Let's create view with flights on 1/1/2015 from Ted Stevens Anchorage International Airport to Seattle-Tacoma International Airport:
create view flights_anc_sea_1_1_2015 as select * from flights
where day = 1 and month = 1 and year = 2015 and origin_airport = 'ANC' and destination_airport = 'SEA';
JOIN statement:
select flight_number, day, month, year, a.airline from flights_anc_sea_1_1_2015 as f
join airlines as a on f.airline = a.iata_code;
The statement gives the same result as INNER JOIN:
select flight_number, day, month, year, a.airline from flights_anc_sea_1_1_2015 as f
inner join airlines as a on f.airline = a.iata_code;
Compare with LEFT/RIGHT/FULL (OUTER) JOIN:
select flight_number, day, month, year, a.airline from flights_anc_sea_1_1_2015 as f
left join airlines as a on f.airline = a.iata_code;
select flight_number, day, month, year, a.airline from flights_anc_sea_1_1_2015 as f
right join airlines as a on f.airline = a.iata_code;
select flight_number, day, month, year, a.airline from flights_anc_sea_1_1_2015 as f
full join airlines as a on f.airline = a.iata_code;
Note
RIGHT/FULL OUTER JOIN is supported by newer SQLite version. In older version RIGHT OUTER JOIN may be replaced by LEFT OUTER JOIN and tables switched:
select flight_number, day, month, year, a.airline from airlines as a
left join flights_anc_sea_1_1_2015 as f on f.airline = a.iata_code;
FULL OUTER JOIN may be replaced by combination of LEFT OUTER JOIN and UNION:
Delete Alaska Airlines record and try to perform INNER and OUTER JOINs. Compare results with previous run.
CROSS JOIN
CROSS JOIN is a very special thing that is not used very often because it allows you to combine attributes without considering their relations. Eg. you have 5 airlines and you have 5 airports and you want all their combinations. CROSS JOIN is a cartesian product. The result is therefore 25 rows = 5 x 5.
with five_airports as (select airport from airports limit 5)
select airport, airline from five_airports as f
cross join (select airline from airlines a limit 5)
order by airport, airline;
Primary and Foreign keys¶
Note
SQLite3 doesn't support adding primary and foreign keys by ALTER TABLE statement.
CREATE TABLE "airlines_pk" (
"IATA_CODE" TEXT PRIMARY KEY,
"AIRLINE" TEXT
);
INSERT INTO airlines_pk select * from airlines;
CREATE TABLE "airports_pk" (
"IATA_CODE" TEXT PRIMARY KEY,
"AIRPORT" TEXT,
"CITY" TEXT,
"STATE" TEXT,
"COUNTRY" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL
);
INSERT INTO airports_pk select * from airports;
CREATE TABLE "flights_pk" (
"YEAR" INTEGER,
"MONTH" INTEGER,
"DAY" INTEGER,
"DAY_OF_WEEK" INTEGER,
"AIRLINE" TEXT,
"FLIGHT_NUMBER" INTEGER,
"TAIL_NUMBER" TEXT,
"ORIGIN_AIRPORT" TEXT,
"DESTINATION_AIRPORT" TEXT,
"SCHEDULED_DEPARTURE" INTEGER,
"DEPARTURE_TIME" INTEGER,
"DEPARTURE_DELAY" INTEGER,
"TAXI_OUT" INTEGER,
"WHEELS_OFF" INTEGER,
"SCHEDULED_TIME" INTEGER,
"ELAPSED_TIME" INTEGER,
"AIR_TIME" INTEGER,
"DISTANCE" INTEGER,
"WHEELS_ON" INTEGER,
"TAXI_IN" INTEGER,
"SCHEDULED_ARRIVAL" INTEGER,
"ARRIVAL_TIME" INTEGER,
"ARRIVAL_DELAY" INTEGER,
"DIVERTED" INTEGER,
"CANCELLED" INTEGER,
"CANCELLATION_REASON" TEXT,
"AIR_SYSTEM_DELAY" TEXT,
"SECURITY_DELAY" TEXT,
"AIRLINE_DELAY" TEXT,
"LATE_AIRCRAFT_DELAY" TEXT,
"WEATHER_DELAY" TEXT,
FOREIGN KEY (airline) REFERENCES airlines_pk (IATA_CODE)
FOREIGN KEY (origin_airport) REFERENCES airports_pk (IATA_CODE)
FOREIGN KEY (destination_airport) REFERENCES airports_pk (IATA_CODE)
);
PRAGMA foreign_keys = OFF;
INSERT INTO flights_pk select * from flights;
PRAGMA foreign_keys = ON;
Foreign keys improves consistency check: