Skip to content

8.1 Introduction into databases, SQL

Fundamental terms

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS) which is a software to store, organize, manage, and retrieve data. Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database.

Relational Databases (RDB) use a relational model that organizes data into one or more tables (or "relations") of columns and rows, with a unique key identifying each row. RDB commonly uses Structured Query Language (SQL) for operations like creating, reading, updating, and deleting (CRUD) data.

NoSQL Databases does not use relational table for storing data, instead, they often use documents similar to JSON format. They are suitable for storing big data.

Types of relational databases

  • File databases are embedded databases which means they are server-less. The example is SQLite database and its special extension for spatial data called SpatiaLite. You can work with SQLite through the DB Browser for SQLite.

  • Client-server databases are based on a client-server model which requires a DB server to set up and run over the network. Examples are MySQL, Microsoft SQL, PostreSQL, Oracle DB.

File databases

This lesson is dedicated to database solutions. They are suitable for situations where we do not require data to be queried by multiple users at the same time.

File database pros:

  • We can easily transfer our database to anyone at any time.

  • Anyone can then easily load it, because the SQLite database system is supported by a wide range of programming libraries (e.g. sqlite3 module in Python).

Limitations:

  • Not possible to access data simultaneously by more users
  • They are not designed for larger data volumes

Data preparations

European land territory grid and base statistics

  1. Download the grid with basic statistics information for your country at Eurostat. Choose the cell resolution of 10 km and as the format the CSV.

  2. Download CSV with information about NUTS units at nuts_units.

  3. Establish a new SQLite database in the in DB Browser for SQLite - call it 'eurostat.db' and skip enter new table definition.

  4. Import the CSV files into that database (Import > Table from CSV file)

  5. Write changes

Simple DML operations (Data Manipulation Language)

DML stands for Data Manipulation Language, and it refers to a subset of SQL commands that are used to manipulate data within database tables. DML operations allow users to interact with the data, such as retrieving, inserting, updating, or deleting records.

  • select - retrieves data from one or more tables.
  • insert - adds new rows of data into a table.
  • update - modifies existing data in a table.
  • delete - removes existing data from a table.

Examples:

select count(*) from table;
select * from table;
select column1, columns2 from table;
select * from where column1 = 'value';
select * from where column1 like 'v%';
select * from table order by column1 desc;

Join tables

JOIN is used to combine rows from two or more tables based on a related column between them:

select * from table_a join table_b on table_a.col1 = table_b.col2;

SQL defines three major types of joins:

  • INNER JOIN (or JOIN)
  • OUTER JOIN (LEFT, RIGHT, FULL)
  • CROSS JOIN

Join types Source: https://www.geeksforgeeks.org/python-sqlite-join-clause/

Exercises

  • Explore the table grid_10km

    select * from grid_10km limit 10;
    select count(*) from grid_10km;
    

  • Of what type the attributes in the table are? We can use the PRAGMA statement table_info. The PRAGMA statement is used to query the SQLite library for internal information.

    PRAGMA table_info(grid_10km);
    

Note

The info about attributes can be found at gisco_grid: Info about grid attributes

  • Select only particular columns.

    select  grd_id,
            nuts2021_0,
            nuts2021_1,
            nuts2021_2,
            nuts2021_3,
            TOT_P_2006
    from grid_10km;
    

  • Count total population numbers in grids for all provided years (population is filled only for nuts units of level 3 so we can just sum it up):

    select  sum(tot_p_2006) t_2006,
            sum(tot_p_2011) t_2011,
            sum(tot_p_2018) t_2018,
            sum(tot_p_2021) t_2021
    from grid_10km where cntr_id = 'CZ';
    

Task

select distinct cntr_id from grid_10km order by cntr_id;

Compare with:

select  sum(tot_p_2006) t_2006,
    sum(tot_p_2011) t_2011,
    sum(tot_p_2018) t_2018,
    sum(tot_p_2021) t_2021
from grid_10km where cntr_id like '%CZ%';
  • Which grid id in which NUTS unit has the maximum distance to the coast?
    select  grd_id, 
            cntr_id,
            nuts2021_0,
            nuts2021_1,
            nuts2021_2,
            nuts2021_3,
            dist_coast
    from grid_10km
    where nuts2021_0 is not null
    order by dist_coast desc
    limit 1;
    

Task

We can do it also differently using WHERE clause and MAX function...

select  grd_id, 
    cntr_id,
    nuts2021_0,
    nuts2021_1,
    nuts2021_2,
    nuts2021_3,
    dist_coast
from grid_10km
where dist_coast = (select max(dist_coast) from grid_10km where nuts2021_0 is not null)

  • Select grid ids where the minimum distance to the coast is minimal or maximal (a result should have two records):

    select  grd_id, 
            cntr_id,
            nuts2021_0,
            nuts2021_1,
            nuts2021_2,
            nuts2021_3,
            dist_coast
    from grid_10km
    where dist_coast = (select min(dist_coast) from grid_10km WHERE NUTS2021_0 is not null)
    or dist_coast in (select max(dist_coast) from grid_10km WHERE NUTS2021_0 is not null);
    

  • Get information about the country name.

    select  g.grd_id,
            g.nuts2021_0,
            g.nuts2021_1,
            g.nuts2021_2,
            g.nuts2021_3,
            g.TOT_P_2006,
            n.country
    from grid_10km g 
    join nuts n ON (g.nuts2021_0 = n.nuts_code);
    

  • Select grids and total population in these grids only applicable for your country (population is filled only for nuts units of level 3).

    select  g.grd_id,
            g.nuts2021_0,
            g.nuts2021_1,
            g.nuts2021_2,
            g.nuts2021_3,
            g.TOT_P_2006,
            n.country
    from grid_10km g 
    join nuts n ON (g.nuts2021_0 = n.nuts_code)
    where g.cntr_id = 'CZ';
    

  • Select grids only applicable for CZ, DE and ES.

    select  g.grd_id,
            g.nuts2021_0,
            g.nuts2021_1,
            g.nuts2021_2,
            g.nuts2021_3,
            g.TOT_P_2006,
            n.country
    from grid_10km g 
    join nuts n ON (g.nuts2021_0 = n.nuts_code)
    where cntr_id in ('CZ', 'DE', 'ES');
    

Simple DDL operations (Data Definition Language)

DDL stands for Data Definition Language, which is a subset of SQL used to define and manage database structures like tables, schemas, and indexes.

  • create - create a new table, database, index, or view.
  • alter - modify an existing database object like a table (add/remove columns, change data types, etc.).
  • drop - deletes an entire table, view, or database permanently.
  • truncate - removes all records from a table

Exercises

  • Find out the DDL command for creating table grid_10km.
    select sql 
    from sqlite_schema 
    where name = 'grid_10km';
    

The command looks as follows:

CREATE TABLE "grid_10km" (
    "DIST_BORD" REAL,
    "TOT_P_2018"    INTEGER,
    "TOT_P_2006"    INTEGER,
    "GRD_ID"    TEXT,
    "TOT_P_2011"    INTEGER,
    "TOT_P_2021"    INTEGER,
    "Y_LLC" INTEGER,
    "CNTR_ID"   TEXT,
    "NUTS2016_3"    TEXT,
    "NUTS2016_2"    TEXT,
    "NUTS2016_1"    TEXT,
    "NUTS2016_0"    TEXT,
    "LAND_PC"   INTEGER,
    "X_LLC" INTEGER,
    "NUTS2021_3"    TEXT,
    "NUTS2021_2"    TEXT,
    "DIST_COAST"    INTEGER,
    "NUTS2021_1"    TEXT,
    "NUTS2021_0"    TEXT
)

  • Create partial table called nuts_country containing grid info only for your country:
create table grid_10km_cz as
  select g.grd_id,
         g.nuts2021_3,
         n.nuts_3,
         g.tot_p_2006,
         g.tot_p_2011,
         g.tot_p_2018,
         g.tot_p_2021
  from grid_10km g
  left join nuts n on (g.nuts2021_3 = n.nuts_code)
  where g.cntr_id = "CZ";
select * from grid_10km_cz;
  • Look at tables in our SQLITE database. We should have three tables - nuts and grid_10km and grid_10km_cz.
select name
from sqlite_schema
where type ='table';
  • Select the the greatest population for each grid id.
select *,
       max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
from grid_10km_cz;
  • Select the year with the greatest population for each grid id.
select sub.*,
    case greatest_population
         when TOT_P_2006 then '2006'
         when TOT_P_2011 then '2011'
         when TOT_P_2018 then '2018'
         when TOT_P_2021 then '2021'
    end year_with_greatest_population
from (
    select *, max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
    from grid_10km_cz)
sub;
  • Add a new column for writing the year when the population was the biggest.
alter table grid_10km_cz
  add year_with_greatest_population text;
PRAGMA table_info(grid_10km_cz);
  • Update the newly created table by the year where the population in the grid was the greatest.
update grid_10km_cz
set year_with_greatest_population = greatest_pop.year_with_greatest_population
from (
    select grd_id, CASE greatest_population
        when TOT_P_2006 THEN '2006'
        when TOT_P_2011 THEN '2011'
        when TOT_P_2018 THEN '2018'
        when TOT_P_2021 THEN '2021'
        end year_with_greatest_population
    from (
        select grd_id, TOT_P_2006, TOT_P_2011, TOT_P_2018, TOT_P_2021, 
               max(tot_p_2006, tot_p_2011, tot_p_2018, tot_p_2021) greatest_population
        from grid_10km_cz) gr
) greatest_pop
where grid_10km_cz.grd_id = greatest_pop.grd_id;
select * from grid_10km_cz;
  • Drop all values and delete the table.

The difference between drop and delete from:

delete from grid_10km_cz;
drop table grid_10km_cz;