07: Designing and Solving an ML Project¶
Predicting forest fires in Montesinho park (PT)¶
Forest fires are a serious environmental problem, causing economic and ecological damage while endangering human lives. Fast detection is a key element in controlling such events. One way to achieve it is to use automatic tools based on local sensors, such as those at meteorological stations. Meteorological conditions (e.g. temperature, wind) are known to influence forest fires, and several fire indexes, such as the forest fire weather index (FWI), make use of such data. In this work we explore a machine learning (ML) approach to predicting the area burned by forest fires.
Source:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
Problem¶
- regression with multiple input variables
- the data do not arrive continuously -> batch learning
- FFMC - fine fuel moisture code
- DMC - duff moisture code
- DC - drought code
- ISI - initial spread index
- BUI - buildup index
- FWI - forest fire weather index
For more information see [Cortez and Morais, 2007] or Fire Weather Maps Canada.
Environment setup¶
# imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
# fix the pseudo-random generator seed
np.random.seed(23)
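A side note: `np.random.seed` sets global state; newer NumPy code tends to use a local `Generator` instead. A minimal sketch (not used in the rest of this notebook):

```python
import numpy as np

# a local Generator avoids the global state of np.random.seed
rng = np.random.default_rng(23)
sample = rng.permutation(10)  # reproducible permutation of 0..9
```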
Loading the dataset¶
The data can be downloaded here.
# read the data
fires = pd.read_csv('data/07/forestfires.csv')
# take a look at the data
fires.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |
- X - X coordinate within Montesinho park: 1 to 9
- Y - Y coordinate within Montesinho park: 2 to 9
- month - month of the year in English: "jan" to "dec"
- day - day of the week in English: "mon" to "sun"
- FFMC: 18.7 to 96.20
- DMC: 1.1 to 291.3
- DC: 7.9 to 860.6
- ISI: 0.0 to 56.10
- temp - temperature in degrees Celsius: 2.2 to 33.30
- RH - relative humidity in %: 15.0 to 100
- wind - wind speed in km/h: 0.40 to 9.40
- rain - rain in mm/m2: 0.0 to 6.4
- area - burned area in ha: 0.00 to 1090.84 (this output variable is heavily skewed toward 0, so it may make sense to model it with a logarithmic transform).
For more information see [Cortez and Morais, 2007].
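The note on the skewed `area` variable can be made concrete: a log transform compresses the heavy right tail. A small sketch on synthetic values (not the real dataset), using `log1p` so that zero areas stay well defined:

```python
import numpy as np

# synthetic sample of burned areas (most values are 0, one is huge)
area = np.array([0.0, 0.0, 1.61, 56.04, 1090.84])
area_log = np.log1p(area)       # log(1 + x), defined at x == 0
area_back = np.expm1(area_log)  # inverse transform, for reporting predictions in ha
```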
# number of measurements with burned area greater than 0 ha
print(len(fires))
print(len(fires[fires['area'] > 0]))
517 270
# sklearn provides train_test_split() - writing our own function serves only to practice and understand the underlying algorithm
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(fires, 0.2)
print(len(train_set), "train +", len(test_set), "test")
414 train + 103 test
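The custom split above has a caveat: a new run (or a grown dataset) reshuffles rows between train and test. A common remedy, sketched here under the assumption that a stable identifier column exists, is to hash each identifier, so a row's assignment never changes:

```python
from zlib import crc32

import numpy as np
import pandas as pd

def split_train_test_by_id(data, test_ratio, id_column):
    # a row goes to the test set iff the hash of its id falls into
    # the lowest test_ratio fraction of the 32-bit hash range
    ids = data[id_column]
    in_test = ids.apply(lambda i: crc32(np.int64(i)) < test_ratio * 2**32)
    return data.loc[~in_test], data.loc[in_test]

df = pd.DataFrame({'id': range(100), 'v': range(100)})
train, test = split_train_test_by_id(df, 0.2, 'id')
```

Because the assignment depends only on the id, rerunning the split always yields the same partition.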
test_set.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 2 | 4 | sep | sat | 93.4 | 145.4 | 721.4 | 8.1 | 28.6 | 27 | 2.2 | 0.0 | 1.61 |
337 | 6 | 3 | sep | mon | 91.6 | 108.4 | 764.0 | 6.2 | 23.0 | 34 | 2.2 | 0.0 | 56.04 |
161 | 6 | 4 | aug | thu | 95.2 | 131.7 | 578.8 | 10.4 | 20.3 | 41 | 4.0 | 0.0 | 1.90 |
442 | 6 | 5 | apr | mon | 87.9 | 24.9 | 41.6 | 3.7 | 10.9 | 64 | 3.1 | 0.0 | 3.35 |
392 | 1 | 3 | sep | sun | 91.0 | 276.3 | 825.1 | 7.1 | 21.9 | 43 | 4.0 | 0.0 | 70.76 |
Splitting the dataset with sklearn¶
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(fires, test_size=0.2, random_state=23)
print(len(train_set), "train +", len(test_set), "test")
413 train + 104 test
test_set.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 2 | 4 | sep | sat | 93.4 | 145.4 | 721.4 | 8.1 | 28.6 | 27 | 2.2 | 0.0 | 1.61 |
337 | 6 | 3 | sep | mon | 91.6 | 108.4 | 764.0 | 6.2 | 23.0 | 34 | 2.2 | 0.0 | 56.04 |
161 | 6 | 4 | aug | thu | 95.2 | 131.7 | 578.8 | 10.4 | 20.3 | 41 | 4.0 | 0.0 | 1.90 |
442 | 6 | 5 | apr | mon | 87.9 | 24.9 | 41.6 | 3.7 | 10.9 | 64 | 3.1 | 0.0 | 3.35 |
392 | 1 | 3 | sep | sun | 91.0 | 276.3 | 825.1 | 7.1 | 21.9 | 43 | 4.0 | 0.0 | 70.76 |
Stratified sampling by class using StratifiedShuffleSplit¶
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=23)
# the input data should not contain 'area'
attributes_sel = [
'X', 'Y', 'month', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain'
]
X = np.array(fires[attributes_sel])
# y holds the categorical classes (1 == fire, 0 == no fire)
y = np.array((fires['area'] > 0))
print(X.shape)
print(y.shape)
(517, 11) (517,)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index)
    print("\nTEST:", test_index)
TRAIN: [124 20 197 484 175 236 435 289 398 54 303 486 391 87 463 293 432 208 56 34 348 448 1 151 133 311 136 137 256 297 266 161 498 190 160 11 214 245 429 174 209 374 47 352 19 513 150 331 276 260 335 146 85 338 349 186 102 505 452 200 105 446 308 316 241 126 370 453 194 14 507 76 112 482 421 292 138 401 362 363 346 140 430 199 188 17 447 96 465 93 273 501 320 172 201 315 360 106 319 471 171 426 408 414 466 251 499 3 347 500 213 462 454 337 420 433 339 198 488 10 332 478 101 473 69 134 490 144 165 192 173 272 496 322 400 380 225 516 424 295 364 55 268 417 207 485 265 224 358 405 228 114 324 226 442 66 167 369 183 33 184 494 286 313 5 350 217 390 244 223 464 195 18 367 135 240 104 13 147 29 239 253 60 84 449 283 444 302 220 353 368 127 73 4 232 120 145 68 259 246 152 88 258 170 113 81 59 44 409 437 310 211 277 402 280 376 479 67 382 235 2 72 389 92 189 159 460 510 515 132 71 377 45 457 288 181 42 384 341 70 74 326 89 222 294 502 336 238 328 270 8 477 22 318 162 28 204 227 99 23 440 411 287 343 458 508 191 176 97 248 394 212 111 52 361 216 30 418 46 242 404 480 476 237 139 62 506 257 82 0 131 50 373 107 115 9 355 143 243 64 119 344 24 459 38 267 385 375 215 425 231 16 32 323 300 196 51 386 109 403 193 164 312 261 234 512 431 330 168 306 356 179 58 26 345 249 177 511 334 321 37 78 365 379 514 333 445 381 503 438 354 7 415 80 262 182 130 100 416 474 481 250 309 285 509 428 455 475 110 314 75 301 392 269 218 98 443 163 61 366 94 427 305 247 493 95 108 233 117 43 229 122 27 489 450 397 271 86 298 274 153 169 128 359 156 77 487 121 142 63 299 155 53] TEST: [383 35 141 263 264 399 39 388 340 419 439 396 291 6 180 166 49 472 371 357 25 15 65 461 12 158 57 118 187 495 351 255 230 329 468 483 304 497 91 423 40 116 470 154 254 157 491 206 210 504 21 185 275 221 395 202 407 284 451 83 325 406 422 252 378 372 456 434 327 413 178 203 296 123 79 317 149 342 31 90 281 219 148 393 103 307 412 492 467 282 441 205 387 469 36 129 410 125 290 278 48 279 41 436]
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
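To verify that `StratifiedShuffleSplit` really preserves the fire/no-fire ratio in both parts, we can compare class proportions. A self-contained sketch with synthetic labels matching the dataset's 270-of-517 positive rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# synthetic binary labels with the dataset's rough positive rate
y = np.array([1] * 270 + [0] * 247)
X = np.zeros((len(y), 1))  # features are irrelevant for this ratio check

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=23)
train_idx, test_idx = next(sss.split(X, y))
train_ratio = y[train_idx].mean()
test_ratio = y[test_idx].mean()
```

Both ratios stay within a fraction of a percent of the overall 270/517 rate.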
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_figwidth(15)
ax1.set_title('Full data')
ax1.hist(fires['DMC'])
X_train_df = pd.DataFrame(X_train, columns=attributes_sel)
ax2.set_title('Train data')
ax2.hist(X_train_df['DMC'])
X_test_df = pd.DataFrame(X_test, columns=attributes_sel)
ax3.set_title('Test data')
ax3.hist(X_test_df['DMC'])
bins = np.linspace(0, 500, 5)
print(bins)
[ 0. 125. 250. 375. 500.]
Continuous stratified sampling based on y¶
len(fires.index)
517
# create bins of the data according to y
y = np.array(fires['area'])
bins = [0, 1, 10]
y_binned = np.digitize(y, bins)
y_binned
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 2, 1, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 3, 3, 2, 1, 1, 1, 3, 2, 2, 2, 1, 1, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, 2, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 3, 2, 3, 3, 3, 3, 1, 3, 1, 2, 3, 3, 1, 1, 3, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 3, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3, 3, 2, 2, 2, 2, 3, 2, 1, 2, 1, 3, 2, 2, 3, 3, 1, 1, 1, 1, 3, 2, 1, 2, 3, 3, 3, 1, 1, 1, 2, 3, 2, 1, 1, 1, 2, 1, 1, 2, 3, 3, 1, 1])
# reuse the familiar train_test_split function, now with the stratify argument defined
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y_binned
)
X_train
array([[1, 2, 'aug', ..., 47, 0.9, 0.0], [8, 6, 'aug', ..., 41, 3.6, 0.0], [1, 4, 'sep', ..., 28, 4.0, 0.0], ..., [6, 4, 'feb', ..., 77, 5.4, 0.0], [7, 3, 'oct', ..., 27, 4.0, 0.0], [7, 4, 'sep', ..., 77, 4.0, 0.0]], dtype=object)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_figwidth(15)
ax1.set_title('Full data')
ax1.hist(fires['DMC'])
X_train_df = pd.DataFrame(X_train, columns=attributes_sel)
ax2.set_title('Train data')
ax2.hist(X_train_df['DMC'])
X_test_df = pd.DataFrame(X_test, columns=attributes_sel)
ax3.set_title('Test data')
ax3.hist(X_test_df['DMC'])
Visualization¶
plt.scatter(
    fires['X'], fires['Y'],
    c=fires['area'], s=fires['area'], cmap="jet",  # color and size reflect the burned area
    alpha=0.5  # multiple fires may have occurred at the same location
)
# the Y coordinate runs from top to bottom, see [Cortez and Morais, 2007]
plt.gca().invert_yaxis()
plt.colorbar(label="area")
plt.show()
# correlation matrix
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[28], line 2 1 # korelacni matice ----> 2 corr_matrix = fires.corr() 3 corr_matrix["area"].sort_values(ascending=False) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/frame.py:11049, in DataFrame.corr(self, method, min_periods, numeric_only) 11047 cols = data.columns 11048 idx = cols.copy() > 11049 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False) 11051 if method == "pearson": 11052 correl = libalgos.nancorr(mat, minp=min_periods) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/frame.py:1993, in DataFrame.to_numpy(self, dtype, copy, na_value) 1991 if dtype is not None: 1992 dtype = np.dtype(dtype) -> 1993 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value) 1994 if result.dtype is not dtype: 1995 result = np.asarray(result, dtype=dtype) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/internals/managers.py:1694, in BlockManager.as_array(self, dtype, copy, na_value) 1692 arr.flags.writeable = False 1693 else: -> 1694 arr = self._interleave(dtype=dtype, na_value=na_value) 1695 # The underlying data was copied within _interleave, so no need 1696 # to further copy if copy=True or setting na_value 1698 if na_value is lib.no_default: File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/internals/managers.py:1753, in BlockManager._interleave(self, dtype, na_value) 1751 else: 1752 arr = blk.get_values(dtype) -> 1753 result[rl.indexer] = arr 1754 itemmask[rl.indexer] = 1 1756 if not itemmask.all(): ValueError: could not convert string to float: 'mar'
# the 'day' and 'month' columns must be converted to numbers
days = [
'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'
]
months = [
'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'
]
for i, day in enumerate(days):
    fires.loc[fires['day'] == day, 'day'] = i
for i, month in enumerate(months):
    fires.loc[fires['month'] == month, 'month'] = i
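The loops above work, but the same ordinal encoding is usually written with `map` and a lookup dictionary; a small sketch on synthetic rows (note that ordinal codes impose an artificial order on `day`, so one-hot encoding is a common alternative):

```python
import pandas as pd

days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
df = pd.DataFrame({'day': ['fri', 'tue', 'sat']})

# map each label to its position in the reference list
df['day'] = df['day'].map({d: i for i, d in enumerate(days)})
```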
# correlation matrix
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)
area 1.000000 temp 0.097844 DMC 0.072994 X 0.063385 month 0.056496 DC 0.049383 Y 0.044873 FFMC 0.040122 day 0.023226 wind 0.012317 ISI 0.008258 rain -0.007366 RH -0.075519 Name: area, dtype: float64
from pandas.plotting import scatter_matrix
attributes = ["temp", "DMC", "DC", "area"]
scatter_matrix(fires[attributes], figsize=(12, 8), alpha=0.5)
array([[<Axes: xlabel='temp', ylabel='temp'>, <Axes: xlabel='DMC', ylabel='temp'>, <Axes: xlabel='DC', ylabel='temp'>, <Axes: xlabel='area', ylabel='temp'>], [<Axes: xlabel='temp', ylabel='DMC'>, <Axes: xlabel='DMC', ylabel='DMC'>, <Axes: xlabel='DC', ylabel='DMC'>, <Axes: xlabel='area', ylabel='DMC'>], [<Axes: xlabel='temp', ylabel='DC'>, <Axes: xlabel='DMC', ylabel='DC'>, <Axes: xlabel='DC', ylabel='DC'>, <Axes: xlabel='area', ylabel='DC'>], [<Axes: xlabel='temp', ylabel='area'>, <Axes: xlabel='DMC', ylabel='area'>, <Axes: xlabel='DC', ylabel='area'>, <Axes: xlabel='area', ylabel='area'>]], dtype=object)
Data preparation¶
# split the data into features and target values
fires_features = fires.drop("area", axis=1)
fires_area = fires["area"].copy()
fires_area.head()
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: area, dtype: float64
# fires_features.drop("month", inplace=True, axis=1)
# fires_features.drop("day", inplace=True, axis=1)
fires_features.head(10)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 |
5 | 8 | 6 | 7 | 6 | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 |
6 | 8 | 6 | 7 | 0 | 92.3 | 88.9 | 495.6 | 8.5 | 24.1 | 27 | 3.1 | 0.0 |
7 | 8 | 6 | 7 | 0 | 91.5 | 145.4 | 608.2 | 10.7 | 8.0 | 86 | 2.2 | 0.0 |
8 | 8 | 6 | 8 | 1 | 91.0 | 129.5 | 692.6 | 7.0 | 13.1 | 63 | 5.4 | 0.0 |
9 | 7 | 5 | 8 | 5 | 92.5 | 88.0 | 698.6 | 7.1 | 22.8 | 40 | 4.0 | 0.0 |
# simulate missing values - just to make things more interesting...
# the syntax used here is outdated (see the warnings) - it is shown only as an example you may still encounter; prefer the syntax used above
fires_features['temp'][4] = np.nan
fires_features['temp'][104] = np.nan
fires_features['temp'][240] = np.nan
/tmp/ipykernel_7376/2677343666.py:2: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][4] = np.nan /tmp/ipykernel_7376/2677343666.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][4] = np.nan /tmp/ipykernel_7376/2677343666.py:3: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. 
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][104] = np.nan /tmp/ipykernel_7376/2677343666.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][104] = np.nan /tmp/ipykernel_7376/2677343666.py:4: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][240] = np.nan /tmp/ipykernel_7376/2677343666.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][240] = np.nan
# which rows contain a NaN value?
sample_incomplete_rows = fires_features[
fires_features.isnull().any(axis=1)
]
sample_incomplete_rows
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
104 | 2 | 4 | 0 | 5 | 82.1 | 3.7 | 9.3 | 2.9 | NaN | 78 | 3.1 | 0.0 |
240 | 6 | 3 | 3 | 2 | 88.0 | 17.2 | 43.5 | 3.8 | NaN | 51 | 2.7 | 0.0 |
# how would we get rid of them?
sample_incomplete_rows.dropna(subset=["temp"])
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain |
---|
Filling missing values with the median¶
# the syntax used here is outdated (see the warnings) - it is shown only as an example you may still encounter; prefer the syntax used above
median = fires_features["temp"].median()
sample_incomplete_rows["temp"].fillna(median, inplace=True)
sample_incomplete_rows
/tmp/ipykernel_7376/2230859196.py:2: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. sample_incomplete_rows["temp"].fillna(median, inplace=True) /tmp/ipykernel_7376/2230859196.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy sample_incomplete_rows["temp"].fillna(median, inplace=True)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 19.3 | 99 | 1.8 | 0.0 |
104 | 2 | 4 | 0 | 5 | 82.1 | 3.7 | 9.3 | 2.9 | 19.3 | 78 | 3.1 | 0.0 |
240 | 6 | 3 | 3 | 2 | 88.0 | 17.2 | 43.5 | 3.8 | 19.3 | 51 | 2.7 | 0.0 |
# what were the original values?
print(fires['temp'][4])
print(fires['temp'][104])
print(fires['temp'][240])
11.4 5.3 15.2
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median", missing_values=np.nan)
fires_features.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
# fit the imputer and build a DataFrame from its output
fires_features_imputed = pd.DataFrame(
imputer.fit_transform(fires_features), columns=fires_features.columns
)
fires_features_imputed.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 5.0 | 2.0 | 4.0 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51.0 | 6.7 | 0.0 |
1 | 7.0 | 4.0 | 9.0 | 1.0 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33.0 | 0.9 | 0.0 |
2 | 7.0 | 4.0 | 9.0 | 5.0 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33.0 | 1.3 | 0.0 |
3 | 8.0 | 6.0 | 2.0 | 4.0 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97.0 | 4.0 | 0.2 |
4 | 8.0 | 6.0 | 2.0 | 6.0 | 89.3 | 51.3 | 102.2 | 9.6 | 19.3 | 99.0 | 1.8 | 0.0 |
rh_index = fires_features.columns.get_loc('RH')
wind_index = fires_features.columns.get_loc('wind')
# function for creating new features
def add_extra_features(X):
    RH_per_wind = X[:, rh_index] / X[:, wind_index]
    return np.c_[X, RH_per_wind]
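Before wiring `add_extra_features` into a pipeline, it is worth sanity-checking it on a tiny array. The sketch below is self-contained, so it fixes the RH and wind column positions itself rather than reading them from the DataFrame:

```python
import numpy as np

rh_index, wind_index = 0, 1  # assumed column positions for this sketch

def add_extra_features(X):
    # append the RH/wind ratio as a new column
    RH_per_wind = X[:, rh_index] / X[:, wind_index]
    return np.c_[X, RH_per_wind]

X = np.array([[50.0, 5.0],
              [30.0, 3.0]])
X_ext = add_extra_features(X)
```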
The whole procedure in summary¶
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median", missing_values=np.nan)),
('attribs_adder', FunctionTransformer(add_extra_features)),
('std_scaler', StandardScaler()),
])
# fires_features
fires_features_tr = pd.DataFrame(
num_pipeline.fit_transform(fires_features),
columns=fires_features.columns.append(pd.Index(['RHw']))
)
fires_features.head(305)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
300 | 6 | 5 | 5 | 0 | 90.4 | 93.3 | 298.1 | 7.5 | 20.7 | 25 | 4.9 | 0.0 |
301 | 6 | 5 | 5 | 0 | 90.4 | 93.3 | 298.1 | 7.5 | 19.1 | 39 | 5.4 | 0.0 |
302 | 3 | 6 | 5 | 4 | 91.1 | 94.1 | 232.1 | 7.1 | 19.2 | 38 | 4.5 | 0.0 |
303 | 3 | 6 | 5 | 4 | 91.1 | 94.1 | 232.1 | 7.1 | 19.2 | 38 | 4.5 | 0.0 |
304 | 6 | 5 | 4 | 5 | 85.1 | 28.0 | 113.8 | 3.5 | 11.3 | 94 | 4.9 | 0.0 |
305 rows × 12 columns
fires_features_tr.head(305)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | RHw | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.008313 | 0.569860 | -1.968443 | 0.357721 | -0.805959 | -1.323326 | -1.830477 | -0.860946 | -1.865037 | 0.411724 | 1.498614 | -0.073268 | -0.575544 |
1 | 1.008313 | -0.244001 | 1.110120 | -1.090909 | -0.008102 | -1.179541 | 0.488891 | -0.509688 | -0.163148 | -0.692456 | -1.741756 | -0.073268 | 1.993477 |
2 | 1.008313 | -0.244001 | 1.110120 | 0.840597 | -0.008102 | -1.049822 | 0.560715 | -0.509688 | -0.753599 | -0.692456 | -1.518282 | -0.073268 | 0.995917 |
3 | 1.440925 | 1.383722 | -1.968443 | 0.357721 | 0.191362 | -1.212361 | -1.898266 | -0.004756 | -1.847670 | 3.233519 | -0.009834 | 0.603155 | 0.895595 |
4 | 1.440925 | 1.383722 | -1.968443 | 1.323474 | -0.243833 | -0.931043 | -1.798600 | 0.126966 | 0.062612 | 3.356206 | -1.238940 | -0.073268 | 3.614511 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
300 | 0.575701 | 0.569860 | -0.649059 | -1.573785 | -0.044368 | -0.274634 | -1.008126 | -0.334060 | 0.305739 | -1.183203 | 0.492982 | -0.073268 | -0.797469 |
301 | 0.575701 | 0.569860 | -0.649059 | -1.573785 | -0.044368 | -0.274634 | -1.008126 | -0.334060 | 0.027880 | -0.324396 | 0.772325 | -0.073268 | -0.610003 |
302 | -0.722136 | 1.383722 | -0.649059 | 0.357721 | 0.082564 | -0.262131 | -1.274442 | -0.421874 | 0.045246 | -0.385739 | 0.269509 | -0.073268 | -0.501934 |
303 | -0.722136 | 1.383722 | -0.649059 | 0.357721 | 0.082564 | -0.262131 | -1.274442 | -0.421874 | 0.045246 | -0.385739 | 0.269509 | -0.073268 | -0.501934 |
304 | 0.575701 | 0.569860 | -1.088854 | 0.840597 | -1.005424 | -1.295194 | -1.751793 | -1.212203 | -1.326684 | 3.049489 | 0.492982 | -0.073268 | 0.447630 |
305 rows × 13 columns
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# full_pipeline = ColumnTransformer([
#     ("num", num_pipeline, num_attribs),
#     ("cat", OneHotEncoder(), cat_attribs),
# ])
# df_prepared = full_pipeline.fit_transform(fires_features)
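The commented skeleton above can be made concrete. A sketch of how `ColumnTransformer` combines a numeric pipeline with one-hot encoding, on a toy frame whose column names are assumed to mimic `forestfires.csv`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame mimicking a slice of the forestfires.csv layout
df = pd.DataFrame({
    'temp': [8.2, 18.0, None],   # one missing value for the imputer
    'wind': [6.7, 0.9, 1.3],
    'month': ['mar', 'oct', 'oct'],
})
num_attribs = ['temp', 'wind']
cat_attribs = ['month']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])
prepared = full_pipeline.fit_transform(df)
```

The result has one column per numeric attribute plus one per observed category (here 'mar' and 'oct').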
Back to the beginning - splitting the data into train and test sets¶
# random split using sklearn
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(
fires_features_tr, test_size=0.3, random_state=23
)
# the identical random_state guarantees the same permutation, keeping y aligned with X
y_train, y_test = train_test_split(
    fires_area, test_size=0.3, random_state=23
)
y_test.head()
156 1.61 337 56.04 161 1.90 442 3.35 392 70.76 Name: area, dtype: float64
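Splitting X and y in two separate calls works only because the identical `random_state` reproduces the same permutation; the safer idiom passes both arrays to a single call, which keeps them aligned by construction. A self-contained sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # row i is [2i, 2i+1]
y = np.arange(10)                 # label i matches row i

# one call splits X and y together, so rows and labels stay paired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23
)
```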
Model selection and training¶
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=23)
# first, exploratory training run
tree_reg.fit(X_train, y_train)
DecisionTreeRegressor(random_state=23)
# prediction
fires_predictions = tree_reg.predict(X_test)
# first evaluation using RMSE
from sklearn.metrics import mean_squared_error
tree_mse = mean_squared_error(y_test, fires_predictions)
tree_rmse = np.sqrt(tree_mse)
print(round(tree_rmse, 2))
66.63
from sklearn.metrics import mean_absolute_error
tree_mae = mean_absolute_error(y_test, fires_predictions)
print(round(tree_mae, 2))
22.97
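An MAE around 23 ha says little on its own. A useful reference point, sketched here on synthetic targets, is `DummyRegressor`, which always predicts the training mean; a real model should at least approach it on data this skewed:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# synthetic skewed targets standing in for burned areas
y_train = np.array([0.0, 0.0, 0.0, 2.0, 50.0])
y_test = np.array([0.0, 1.0, 40.0])

dummy = DummyRegressor(strategy='mean')  # always predicts the training mean
dummy.fit(np.zeros((len(y_train), 1)), y_train)
baseline_mae = mean_absolute_error(
    y_test, dummy.predict(np.zeros((len(y_test), 1)))
)
```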
Model tuning¶
from sklearn.model_selection import cross_val_score
# Decision Tree regressor
scores = cross_val_score(
tree_reg, X_train, y_train, scoring="neg_mean_absolute_error", cv=10
)
tree_mae_scores = (-scores)
def display_scores(scores):
    # print("Scores:", scores)
    print("Mean MAE:", round(scores.mean(), 2))
    print("Standard deviation:", round(scores.std(), 2))
display_scores(tree_mae_scores)
Mean MAE: 20.24 Standard deviation: 10.68
Grid Search¶
from sklearn.model_selection import GridSearchCV
# the model hyper-parameters
# help: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
# DecisionTreeRegressor(splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, ...)
param_grid = [
{
'max_depth': [3, 4, 5, 10],
'min_samples_split': [3, 4, 5, 10],
'splitter': ['random', 'best']
}
]
# grid search application
tree_reg = DecisionTreeRegressor(random_state=23)
# cv=10 - ten-fold cross-validation, i.e. cv * (number of parameter combinations) model fits
grid_search = GridSearchCV(
tree_reg, param_grid, cv=10, scoring="neg_mean_absolute_error", return_train_score=True
)
grid_search.fit(X_train, y_train)
/home/pesek/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/numpy/ma/core.py:2881: RuntimeWarning: invalid value encountered in cast _data = np.array(data, dtype=dtype, copy=copy,
GridSearchCV(cv=10, estimator=DecisionTreeRegressor(random_state=23), param_grid=[{'max_depth': [3, 4, 5, 10], 'min_samples_split': [3, 4, 5, 10], 'splitter': ['random', 'best']}], return_train_score=True, scoring='neg_mean_absolute_error')
DecisionTreeRegressor(max_depth=3, min_samples_split=3, random_state=23)
grid_search.best_params_
{'max_depth': 3, 'min_samples_split': 3, 'splitter': 'best'}
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(-mean_score, params)
18.514869629268855 {'max_depth': 3, 'min_samples_split': 3, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 3, 'splitter': 'best'}
18.514869629268855 {'max_depth': 3, 'min_samples_split': 4, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 4, 'splitter': 'best'}
18.514869629268855 {'max_depth': 3, 'min_samples_split': 5, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 5, 'splitter': 'best'}
18.490661198514218 {'max_depth': 3, 'min_samples_split': 10, 'splitter': 'random'}
16.041471761809444 {'max_depth': 3, 'min_samples_split': 10, 'splitter': 'best'}
19.435980243140406 {'max_depth': 4, 'min_samples_split': 3, 'splitter': 'random'}
17.62946315452267 {'max_depth': 4, 'min_samples_split': 3, 'splitter': 'best'}
19.05396623932332 {'max_depth': 4, 'min_samples_split': 4, 'splitter': 'random'}
17.55007426563378 {'max_depth': 4, 'min_samples_split': 4, 'splitter': 'best'}
18.58107237595373 {'max_depth': 4, 'min_samples_split': 5, 'splitter': 'random'}
17.688783725093238 {'max_depth': 4, 'min_samples_split': 5, 'splitter': 'best'}
20.12845363862052 {'max_depth': 4, 'min_samples_split': 10, 'splitter': 'random'}
17.775441132500646 {'max_depth': 4, 'min_samples_split': 10, 'splitter': 'best'}
19.9505417184069 {'max_depth': 5, 'min_samples_split': 3, 'splitter': 'random'}
18.60957245145 {'max_depth': 5, 'min_samples_split': 3, 'splitter': 'best'}
20.303624660893725 {'max_depth': 5, 'min_samples_split': 4, 'splitter': 'random'}
18.125947076074624 {'max_depth': 5, 'min_samples_split': 4, 'splitter': 'best'}
21.911845419693922 {'max_depth': 5, 'min_samples_split': 5, 'splitter': 'random'}
18.287378757756308 {'max_depth': 5, 'min_samples_split': 5, 'splitter': 'best'}
18.968975573126443 {'max_depth': 5, 'min_samples_split': 10, 'splitter': 'random'}
18.133173841530557 {'max_depth': 5, 'min_samples_split': 10, 'splitter': 'best'}
26.559439062648032 {'max_depth': 10, 'min_samples_split': 3, 'splitter': 'random'}
18.071162044051874 {'max_depth': 10, 'min_samples_split': 3, 'splitter': 'best'}
21.534413780549198 {'max_depth': 10, 'min_samples_split': 4, 'splitter': 'random'}
18.102266241106072 {'max_depth': 10, 'min_samples_split': 4, 'splitter': 'best'}
22.4833334290822 {'max_depth': 10, 'min_samples_split': 5, 'splitter': 'random'}
19.753292174182008 {'max_depth': 10, 'min_samples_split': 5, 'splitter': 'best'}
22.52042047089359 {'max_depth': 10, 'min_samples_split': 10, 'splitter': 'random'}
18.29324359971468 {'max_depth': 10, 'min_samples_split': 10, 'splitter': 'best'}
Randomized Search¶
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
'max_depth': randint(low=3, high=10),
'min_samples_split': randint(low=3, high=10),
}
tree_reg = DecisionTreeRegressor(random_state=23)
# sample 10 random hyperparameter combinations and score each with 10-fold CV
rnd_search = RandomizedSearchCV(
    tree_reg, param_distributions=param_distribs, n_iter=10,
    cv=10, scoring="neg_mean_absolute_error", random_state=23
)
rnd_search.fit(X_train, y_train)
RandomizedSearchCV(cv=10, estimator=DecisionTreeRegressor(random_state=23), param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x761eb8b9fe80>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x761eb8b1f610>}, random_state=23, scoring='neg_mean_absolute_error')
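The `randint` objects passed in `param_distribs` are frozen scipy distributions from which `RandomizedSearchCV` draws one candidate value per iteration. A minimal standalone sketch of how such a distribution samples (the variable names here are illustrative, not part of the notebook):

```python
from scipy.stats import randint

# randint(low, high) samples integers uniformly from {low, ..., high - 1};
# note that `high` is exclusive, so max_depth above is never sampled as 10
dist = randint(low=3, high=10)
samples = dist.rvs(size=8, random_state=23)
print(samples)
```

This exclusivity of `high` explains why none of the sampled candidates below use `max_depth=10`, even though the earlier grid explicitly included it.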
rnd_search.best_estimator_
DecisionTreeRegressor(max_depth=3, min_samples_split=4, random_state=23)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(-mean_score, params)
18.369909683541902 {'max_depth': 6, 'min_samples_split': 9}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 4}
20.204774203776793 {'max_depth': 9, 'min_samples_split': 3}
19.705889251512108 {'max_depth': 8, 'min_samples_split': 7}
18.148671621372586 {'max_depth': 6, 'min_samples_split': 5}
17.77067261398213 {'max_depth': 4, 'min_samples_split': 6}
20.130670309085396 {'max_depth': 9, 'min_samples_split': 6}
19.705889251512108 {'max_depth': 8, 'min_samples_split': 7}
17.688783725093238 {'max_depth': 4, 'min_samples_split': 5}
18.434572877986348 {'max_depth': 6, 'min_samples_split': 8}
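The minus sign in front of `mean_score` is needed because scikit-learn scorers follow a "greater is better" convention, so error metrics such as MAE are stored negated. A small self-contained illustration on synthetic data (the data and names here are made up for demonstration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, purely for demonstration
rng = np.random.default_rng(23)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)

# "neg_mean_absolute_error" yields non-positive scores ...
scores = cross_val_score(DecisionTreeRegressor(random_state=23), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
# ... so we negate to report the usual positive MAE
print(-scores.mean())
```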
# Evaluate the grid search's best model on the test set!
sel_model = grid_search.best_estimator_
sel_predictions = sel_model.predict(X_test)
print('MAE: {}'.format(round(mean_absolute_error(y_test, sel_predictions), 2)))
MAE: 18.43
# Evaluate the randomized search's best model on the test set!
sel_model = rnd_search.best_estimator_
sel_predictions = sel_model.predict(X_test)
print('MAE: {}'.format(round(mean_absolute_error(y_test, sel_predictions), 2)))
MAE: 18.43

Both searches reach the same test MAE here, which indicates that the randomized search selected a model equivalent to the best one found by the exhaustive grid search.
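As a side note, with the default `refit=True` the search object automatically refits the winning hyperparameters on the whole training set, so `best_estimator_` can be used for prediction directly, without any manual retraining. A self-contained sketch of this end-to-end pattern on synthetic data (all data and names here are illustrative):

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# synthetic regression data, purely for demonstration
rng = np.random.default_rng(23)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=23)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=23),
    {'max_depth': randint(2, 8)}, n_iter=5, cv=3,
    scoring="neg_mean_absolute_error", random_state=23,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr (refit=True by default)
mae = mean_absolute_error(y_te, search.best_estimator_.predict(X_te))
print(round(mae, 2))
```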