02. Data preprocessing -- linear algebra in NumPy, visualization in Matplotlib

an important step in the workflow of a machine learning project
transformation of raw data into a suitable form and format
even the best machine learning model cannot cope with poorly preprocessed data

Data preprocessing methods:

handling missing values;
feature scaling;
label encoding;
noise reduction;
dimensionality reduction;
transformation;
generating new features/data.
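Two of the items above, handling missing values and label encoding, can be sketched in a few lines of NumPy. This is only a minimal illustration on made-up toy data: the arrays `features` and `labels` are hypothetical, and mean imputation is just one of several possible strategies.

```python
import numpy as np

# Hypothetical toy feature matrix with one missing value (NaN).
features = np.array([[1.0, 10.0],
                     [2.0, np.nan],
                     [3.0, 30.0]])

# Mean imputation: replace each NaN with the mean of its column,
# computed only over the values that are present.
column_means = np.nanmean(features, axis=0)
imputed = np.where(np.isnan(features), column_means, features)

# Label encoding: map each distinct string label to an integer code.
# np.unique sorts the classes, so the codes follow alphabetical order.
labels = np.array(['cat', 'dog', 'cat', 'bird'])
classes, codes = np.unique(labels, return_inverse=True)

print(imputed)
print(classes, codes)
```

The missing entry is filled with 20.0 (the mean of 10 and 30), and `classes[codes]` reproduces the original labels.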

Data

The data we will work with are randomly generated:

In [1]:

import numpy as np

min_value = 0
max_value = 11  # the upper bound is exclusive, so the values fall in 0..10
size = 5

r1 = np.random.randint(min_value, max_value, size=size)
print('1D array:\n', r1, '\n')
r2 = np.random.randint(min_value, max_value, size=(size, size))
print('2D array:\n', r2)

1D array:
 [3 3 5 9 0] 

2D array:
 [[ 8 10  5  7  1]
 [ 8  3  7  4  8]
 [ 7  8  7  8  0]
 [ 3 10 10  8  8]
 [ 7  4  3  5 10]]

Feature scaling -- Data normalization -- min-max scaling

Disadvantages:

The original range is lost (it often has to be stored separately)
Outliers affect the normalized dataset as well
Not all data necessarily have to be normalized, let alone in the same way
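The first two disadvantages can be illustrated with a short sketch. The helper names `min_max_scale` and `min_max_invert` are hypothetical and the data are made up; the point is only that the original minimum and maximum must be kept if the scaling is to be undone, and that a single outlier compresses all other values.

```python
import numpy as np

def min_max_scale(data):
    """Scale to [0, 1] and also return the original range for later inversion."""
    data_min, data_max = data.min(), data.max()
    return (data - data_min) / (data_max - data_min), data_min, data_max

def min_max_invert(scaled, data_min, data_max):
    """Map values from [0, 1] back to the original range."""
    return scaled * (data_max - data_min) + data_min

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scaled, lo, hi = min_max_scale(values)
restored = min_max_invert(scaled, lo, hi)  # possible only because lo and hi were kept

# A single outlier squeezes all remaining values into a narrow band near 0.
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
scaled_outlier, _, _ = min_max_scale(with_outlier)

print(scaled)
print(scaled_outlier)
```

Without the outlier the values spread evenly over [0, 1]; with it, the first four values all end up below 0.01.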

Functions we will use

In [2]:

def normalize_globally(data, new_min=0, new_max=1):
    """Normalize an array to [new_min, new_max] (by default [0, 1])."""
    return new_min + (data - data.min()) * (new_max - new_min) / (data.max() - data.min())


def normalize_2d_columnwise(data, new_min=0, new_max=1):
    """Normalize a two-dimensional array column by column to [new_min, new_max] (by default [0, 1])."""
    return new_min + (data - data.min(axis=0)) * (new_max - new_min) / (data.max(axis=0) - data.min(axis=0))


def normalize_2d_rowwise(data, new_min=0, new_max=1):
    """Normalize a two-dimensional array row by row to [new_min, new_max] (by default [0, 1])."""
    row_min = data.min(axis=1).reshape(-1, 1)  # reshape turns the 1D vector of row minima into a column vector
    row_max = data.max(axis=1).reshape(-1, 1)  # reshape turns the 1D vector of row maxima into a column vector
    return new_min + (data - row_min) * (new_max - new_min) / (row_max - row_min)
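The functions above divide by `max - min`, which is zero for a constant column or row and then produces NaN. A possible guard is sketched below; the name `normalize_2d_columnwise_safe` and the choice to map constant columns to `new_min` are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np

def normalize_2d_columnwise_safe(data, new_min=0, new_max=1):
    """Column-wise min-max scaling that maps constant columns to new_min."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Avoid 0/0: where the range is zero, divide by 1 instead
    # (the numerator is zero there anyway, so the result is new_min).
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return new_min + (data - col_min) * (new_max - new_min) / safe_range

# The second column is constant.
sample = np.array([[1, 5],
                   [3, 5],
                   [5, 5]])
result = normalize_2d_columnwise_safe(sample)
print(result)
```

The first column is scaled to 0, 0.5, 1 as usual, while the constant column becomes all zeros instead of NaN.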

Application

In [3]:

normalize_globally(r1)

Out[3]:

array([0.33333333, 0.33333333, 0.55555556, 1.        , 0.        ])

In [4]:

normalize_2d_columnwise(r2)

Out[4]:

array([[1.        , 1.        , 0.28571429, 0.75      , 0.1       ],
       [1.        , 0.        , 0.57142857, 0.        , 0.8       ],
       [0.8       , 0.71428571, 0.57142857, 1.        , 0.        ],
       [0.        , 1.        , 1.        , 1.        , 0.8       ],
       [0.8       , 0.14285714, 0.        , 0.25      , 1.        ]])

In [5]:

normalize_2d_rowwise(r2)

Out[5]:

array([[0.77777778, 1.        , 0.44444444, 0.66666667, 0.        ],
       [1.        , 0.        , 0.8       , 0.2       , 1.        ],
       [0.875     , 1.        , 0.875     , 1.        , 0.        ],
       [0.        , 1.        , 1.        , 0.71428571, 0.71428571],
       [0.57142857, 0.14285714, 0.        , 0.28571429, 1.        ]])

In [6]:

normalize_globally(r2)

Out[6]:

array([[0.8, 1. , 0.5, 0.7, 0.1],
       [0.8, 0.3, 0.7, 0.4, 0.8],
       [0.7, 0.8, 0.7, 0.8, 0. ],
       [0.3, 1. , 1. , 0.8, 0.8],
       [0.7, 0.4, 0.3, 0.5, 1. ]])

Feature scaling -- Data standardization -- standardized z-score

Functions we will use

In [7]:

def z_score_standardize_manual(data, axis=None):
    """Standardize an array using the z-score (sample variance, i.e. division by size - 1)."""
    if axis is None:
        size = data.size
    else:
        size = data.shape[axis]

    # keepdims=True keeps the reduced axis, so broadcasting is also correct for axis=1
    mean = data.sum(axis=axis, keepdims=True) / size
    sum_body = (data - mean) ** 2  # for complex numbers, use the absolute value of the difference instead
    var = sum_body.sum(axis=axis, keepdims=True) / (size - 1)

    return (data - mean) / np.sqrt(var)

In [8]:

def z_score_standardize_np(data, axis=None, ddof=0):
    """Standardize an array using the z-score computed with NumPy's mean and std."""
    # keepdims=True keeps the reduced axis, so broadcasting is also correct for axis=1
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, ddof=ddof, keepdims=True)
    return (data - mean) / std

Application

In [9]:

z_score_standardize_manual(r2)

Out[9]:

array([[ 0.58019029,  1.28773943, -0.48113341,  0.22641572, -1.89623169],
       [ 0.58019029, -1.18868255,  0.22641572, -0.83490798,  0.58019029],
       [ 0.22641572,  0.58019029,  0.22641572,  0.58019029, -2.25000626],
       [-1.18868255,  1.28773943,  1.28773943,  0.58019029,  0.58019029],
       [ 0.22641572, -0.83490798, -1.18868255, -0.48113341,  1.28773943]])

In [10]:

z_score_standardize_np(r2)

Out[10]:

array([[ 0.59215424,  1.31429355, -0.49105473,  0.23108458, -1.93533336],
       [ 0.59215424, -1.21319405,  0.23108458, -0.85212439,  0.59215424],
       [ 0.23108458,  0.59215424,  0.23108458,  0.59215424, -2.29640302],
       [-1.21319405,  1.31429355,  1.31429355,  0.59215424,  0.59215424],
       [ 0.23108458, -0.85212439, -1.21319405, -0.49105473,  1.31429355]])

In [11]:

z_score_standardize_np(r2, ddof=1)

Out[11]:

array([[ 0.58019029,  1.28773943, -0.48113341,  0.22641572, -1.89623169],
       [ 0.58019029, -1.18868255,  0.22641572, -0.83490798,  0.58019029],
       [ 0.22641572,  0.58019029,  0.22641572,  0.58019029, -2.25000626],
       [-1.18868255,  1.28773943,  1.28773943,  0.58019029,  0.58019029],
       [ 0.22641572, -0.83490798, -1.18868255, -0.48113341,  1.28773943]])
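The outputs for `ddof=0` and `ddof=1` differ only by a constant factor. The following sketch, which re-enters the `r2` values printed earlier so that it runs on its own, checks that relationship: with n = 25 values, the population version equals the sample version multiplied by sqrt(n / (n - 1)).

```python
import numpy as np

# The 2D array printed at the beginning of the notebook.
r2 = np.array([[8, 10, 5, 7, 1],
               [8, 3, 7, 4, 8],
               [7, 8, 7, 8, 0],
               [3, 10, 10, 8, 8],
               [7, 4, 3, 5, 10]], dtype=float)

n = r2.size
z_population = (r2 - r2.mean()) / r2.std(ddof=0)  # divides the variance by n
z_sample = (r2 - r2.mean()) / r2.std(ddof=1)      # divides the variance by n - 1

# Both versions have zero mean; they differ only by a constant factor.
factor = np.sqrt(n / (n - 1))
print(np.allclose(z_population, z_sample * factor))
print(z_population.mean(), z_population.std(ddof=0), z_sample.std(ddof=1))
```

For large n the factor approaches 1, so the choice of `ddof` matters mostly for small datasets such as this one.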

Dimensionality reduction -- Principal component analysis

We use PCA in order to:

avoid excessively high-dimensional data;
reduce noise;
visualize the data;
reduce collinearity;
reduce memory and disk space requirements;
improve model accuracy;
reduce the risk of overfitting.

Functions we will use

In [12]:

def pca(data):
    # standardization (covariance is sensitive to large differences in scale)
    standardized_data = z_score_standardize_manual(data, 0)  # axis=0, because the individual dimensions may have different ranges

    n_samples, n_features = data.shape

    # covariance matrix (one row and one column per feature)
    covariance_matrix = np.zeros((n_features, n_features))
    for i in range(n_features):
        # mean of column i
        mean_i = np.sum(standardized_data[:, i]) / n_samples
        for j in range(n_features):
            # mean of column j
            mean_j = np.sum(standardized_data[:, j]) / n_samples
            # sample covariance between columns i and j
            covariance_matrix[i, j] = np.sum((standardized_data[:, i] - mean_i) * (standardized_data[:, j] - mean_j)) / (n_samples - 1)

    # eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

    # order of the principal components, from the largest eigenvalue to the smallest
    order_of_importance = np.argsort(eigenvalues)[::-1]

    # sort the eigenvalues and eigenvectors accordingly
    sorted_eigenvalues = eigenvalues[order_of_importance]
    sorted_eigenvectors = eigenvectors[:, order_of_importance]

    # project the standardized data onto the first k principal components
    k = 2  # number of principal components to keep
    reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

    # share of the total variance explained by each component
    explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues)

    # share explained by the k retained components (computed for inspection, not returned)
    total_explained_variance = np.sum(explained_variance[:k])

    return reduced_data

In [13]:

def pca_np(data):
    # standardization (covariance is sensitive to large differences in scale)
    standardized_data = z_score_standardize_np(data, 0, ddof=1)  # axis=0, because the individual dimensions may have different ranges

    # covariance matrix (rowvar=False: columns are the variables)
    covariance_matrix = np.cov(standardized_data, rowvar=False)

    # eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

    # order of the principal components, from the largest eigenvalue to the smallest
    order_of_importance = np.argsort(eigenvalues)[::-1]

    # sort the eigenvalues and eigenvectors accordingly
    sorted_eigenvalues = eigenvalues[order_of_importance]
    sorted_eigenvectors = eigenvectors[:, order_of_importance]

    # project the standardized data onto the first k principal components
    k = 2  # number of principal components to keep
    reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

    # share of the total variance explained by each component
    explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues)

    # share explained by the k retained components (computed for inspection, not returned)
    total_explained_variance = np.sum(explained_variance[:k])

    return reduced_data
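As a cross-check, the same projection can be obtained from a singular value decomposition of the standardized data. The sketch below is not the notebook's method: it swaps in `np.linalg.eigh` (suited to symmetric matrices) for the eigendecomposition route and compares it with `np.linalg.svd`. The `r2` values are re-entered from the output printed earlier so that the block runs on its own. Because eigenvectors are defined only up to sign, the two projections are compared by absolute value.

```python
import numpy as np

# The 2D array printed at the beginning of the notebook.
r2 = np.array([[8, 10, 5, 7, 1],
               [8, 3, 7, 4, 8],
               [7, 8, 7, 8, 0],
               [3, 10, 10, 8, 8],
               [7, 4, 3, 5, 10]], dtype=float)

# Column-wise z-score with the sample standard deviation (ddof=1).
standardized = (r2 - r2.mean(axis=0)) / r2.std(axis=0, ddof=1)
n_samples = standardized.shape[0]
k = 2

# Route 1: eigendecomposition of the covariance matrix.
covariance = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # eigenvalues in ascending order
order = np.argsort(eigenvalues)[::-1]
proj_eig = standardized @ eigenvectors[:, order[:k]]

# Route 2: singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(standardized, full_matrices=False)
proj_svd = U[:, :k] * S[:k]                   # equals standardized @ Vt[:k].T
eigenvalues_from_svd = S ** 2 / (n_samples - 1)

print(np.allclose(eigenvalues[order], eigenvalues_from_svd))
print(np.allclose(np.abs(proj_eig), np.abs(proj_svd)))
```

The SVD route never forms the covariance matrix explicitly, which is one reason it is often preferred numerically.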

Application

In [14]:

res = pca(r2)

import matplotlib.pyplot as plt
plt.scatter(res[:, 0], res[:, 1])

Out[14]:

<matplotlib.collections.PathCollection at 0x7f7e03737190>

[Figure: scatter plot of the data projected onto the first two principal components]

In [15]:

pca_np(r2)

Out[15]:

array([[ 0.45052063,  1.47632635],
       [-1.71301458, -0.53878129],
       [ 0.99908723,  1.02892263],
       [ 2.07040518, -1.59112741],
       [-1.80699845, -0.37534028]])
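Both PCA functions compute the explained variance but return only the reduced data. A minimal sketch of how that quantity could guide the choice of k follows. The helper `explained_variance_ratio` and the 90 % threshold are assumptions made for this illustration, and `np.linalg.eigvalsh` is used here in place of `np.linalg.eig` because only the eigenvalues of a symmetric matrix are needed.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of the total variance carried by each principal component (descending)."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    eigenvalues = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))
    # eigvalsh returns ascending order; reverse it and clip tiny negative round-off.
    eigenvalues = np.clip(eigenvalues[::-1], 0.0, None)
    return eigenvalues / eigenvalues.sum()

# The 2D array printed at the beginning of the notebook.
r2 = np.array([[8, 10, 5, 7, 1],
               [8, 3, 7, 4, 8],
               [7, 8, 7, 8, 0],
               [3, 10, 10, 8, 8],
               [7, 4, 3, 5, 10]], dtype=float)

ratio = explained_variance_ratio(r2)
cumulative = np.cumsum(ratio)

# Smallest k whose components together explain at least 90 % of the variance.
threshold = 0.9
k = int(np.searchsorted(cumulative, threshold) + 1)

print(ratio)
print(cumulative)
print(k)
```

Fixing k = 2, as the functions above do, is convenient for plotting; a threshold on the cumulative ratio is a common alternative when the goal is compression rather than visualization.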