8.1 Introduction into databases, SQL¶

Fundamental terms¶

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS) which is a software to store, organize, manage, and retrieve data. Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database.

Relational Databases (RDB) use a relational model that organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. RDB commonly uses Structured Query Language (SQL) for operations like creating, reading, updating, and deleting (CRUD) data.

NoSQL Databases does not use relational table for storing data, instead, they often use documents similar to JSON format. They are suitable for storing big data.

Types of relational databases:¶

File databases are embedded databases which means they are server-less. The example is SQLite database and its special extension for spatial data called SpatiaLite. You can work with SQLite through the DB Browser for SQLite.

Client-server databases are based on a client-server model which requires a DB server to set up and run over the network. Examples are MySQL, Microsoft SQL, PostreSQL, Oracle DB.

Today we will talk about file database solutions. They are suitable for data where we do not need to access data simultaneously by more people.

File database pros:

We can easily transfer our database to anyone at any time.
Anyone can then easily load it, because the SQLite database system is supported by a wide range of programming libraries (e.g. Sqlite3 module in Python).

Limitations:

Not possible to access data simultaneously by more people

Data preparations¶

Europian land territory grid and base statistics

1) Download the grid with basic statistics information for your country at eurostat. Choose the cell resolution of 20 km and as the format the CSV.

2) Download CSV with information about NUTS units at nuts_units.

3) Establish a new SQLite database in the DB Browser for SQLite - call it 'eurostat'.

4) Import the Csv files into that database.

Simple DDL operations (Data Definition Language)¶

1) Explore the table grid_10km in DB Browser for SQLite

select * from grid_10km limit 10;

2) Of what type the attributes in the table are? We can use the PRAGMA statement table_info. The PRAGMA statement is used to query the SQLite library for internal information.

PRAGMA table_info(grid_10km);

The info about attributes can be found at gisco_grid:

4) Select only particular columns.

select  grd_id,
        nuts2021_0,
        nuts2021_1,
        nuts2021_2,
        nuts2021_3,
        TOT_P_2006
        from grid_10km;

5) Get information about the country name.

select  g.grd_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.TOT_P_2006,
        n.country
        from grid_10km g 
        join nuts n ON (g.nuts2021_0 = n.nuts_code);

5) Select grids and total population in these grids only applicable for your country (population is filled only for nuts units of level 3).

select  g.grd_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.TOT_P_2006,
        n.country
        from grid_10km g 
        join nuts n ON (g.nuts2021_0 = n.nuts_code)
        where g.cntr_id = 'CZ';

5) Select grids only applicable for CZ, DE and ES.

select  g.grd_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.TOT_P_2006,
        n.country
        from grid_10km g 
        join nuts n ON (g.nuts2021_0 = n.nuts_code)
        where cntr_id in ('CZ', 'DE', 'ES');

6) Which grid id in which NUTS unit has the maximum distance to the coast?

select  g.grd_id, 
        g.cntr_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.dist_coast,
        n.country
        from grid_10km g
        join nuts n ON (g.nuts2021_0 = n.nuts_code)
        where g.nuts2021_0 is not null
        order by g.dist_coast desc
        limit 1;

We can do it also differently using WHERE clause and MAX function...

select  g.grd_id, 
        g.cntr_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.dist_coast,
        n.country
        from grid_10km g
        join nuts n ON (g.nuts2021_0 = n.nuts_code)
        where g.dist_coast in (select max(dist_coast) from grid_10km where NUTS2021_0 is not null)

7) Select grid ids where the minimum distance to the coast is minimal or maximal (a result should have two records):

select  g.grd_id, 
        g.cntr_id,
        g.nuts2021_0,
        g.nuts2021_1,
        g.nuts2021_2,
        g.nuts2021_3,
        g.dist_coast,
        n.country
        from grid_10km g
        join nuts n ON (g.nuts2021_0 = n.nuts_code)
        where g.dist_coast in (select min(dist_coast) from grid_10km WHERE NUTS2021_0 is not null)
        or g.dist_coast in (select max(dist_coast) from grid_10km WHERE NUTS2021_0 is not null);

8) Count total population numbers in grids for all provided years (population is filled only for nuts units of level 3 so we can just sum it up):

select  sum(tot_p_2006) t_2006,
        sum(tot_p_2011) t_2011,
        sum(tot_p_2018) t_2018,
        sum(tot_p_2021) t_2021
from grid_10km where cntr_id = "CZ";

Simple DML operations (Data Manipulation Language)¶

1) Find out the DML command for creating table eurostat_grid.

SELECT sql 
FROM sqlite_schema 
WHERE name = 'grid_10km';

The command looks as follows:

CREATE TABLE "grid_10km" (
    "DIST_BORD" REAL,
    "TOT_P_2018"    INTEGER,
    "TOT_P_2006"    INTEGER,
    "GRD_ID"    TEXT,
    "TOT_P_2011"    INTEGER,
    "TOT_P_2021"    INTEGER,
    "Y_LLC" INTEGER,
    "CNTR_ID"   TEXT,
    "NUTS2016_3"    TEXT,
    "NUTS2016_2"    TEXT,
    "NUTS2016_1"    TEXT,
    "NUTS2016_0"    TEXT,
    "LAND_PC"   INTEGER,
    "X_LLC" INTEGER,
    "NUTS2021_3"    TEXT,
    "NUTS2021_2"    TEXT,
    "DIST_COAST"    INTEGER,
    "NUTS2021_1"    TEXT,
    "NUTS2021_0"    TEXT
)

2) Create partial table called nuts_country containing grid info only for your country:

CREATE TABLE grid_10km_cz AS
  SELECT g.grd_id,
         g.nuts2021_3,
         n.nuts_3,
         g.tot_p_2006,
         g.tot_p_2011,
         g.tot_p_2018,
         g.tot_p_2021
  from grid_10km g
  left join nuts n ON (g.nuts2021_3 = n.nuts_code)
  where g.cntr_id = "CZ";

select * from grid_10km_cz;

3) Look at tables in our SQLITE database. We should have three tables - nuts and grid_10km and grid_10km_cz.

SELECT 
    name
FROM 
    sqlite_schema
WHERE 
    type ='table'

4) Select the the greatest population for each grid id.

select g.*, max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
from grid_10km_cz g;

5) Select the year with the greatest population for each grid id.

select sub.*,
    CASE greatest_population
         WHEN TOT_P_2006 THEN '2006'
         WHEN TOT_P_2011 THEN '2011'
         WHEN TOT_P_2018 THEN '2018'
         WHEN TOT_P_2021 THEN '2021'
    END year_with_greatest_population
from
(select g.*, max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
from grid_10km_cz g) sub;

6) Add a new column for writing the year when the population was the biggest.

ALTER TABLE grid_10km_cz
  ADD year_with_greatest_population TEXT ;

PRAGMA table_info(grid_10km_cz);

7) Update the newly created table by the year where the population in the grid was the greatest.

UPDATE grid_10km_cz
SET year_with_greatest_population = greatest_pop.year_with_greatest_population
FROM (
    SELECT grd_id, CASE greatest_population
        WHEN TOT_P_2006 THEN '2006'
        WHEN TOT_P_2011 THEN '2011'
        WHEN TOT_P_2018 THEN '2018'
        WHEN TOT_P_2021 THEN '2021'
        END year_with_greatest_population
    FROM
        (SELECT grd_id, TOT_P_2006, TOT_P_2011, TOT_P_2018, TOT_P_2021, max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
         FROM grid_10km_cz) gr
) greatest_pop
WHERE grid_10km_cz.grd_id = greatest_pop.grd_id;

8) Drop all values and delete the table.

The difference between drop and delete from:

DELETE FROM grid_cz;

DROP table grid_cz;