07: Designing and Solving an ML Project¶
Predicting forest fires in Montesinho park (PT)¶
Forest fires are a serious environmental problem, causing economic and ecological damage while endangering human lives. Fast detection is a key element in controlling such events. One way to achieve it is to use automatic tools based on local sensors, such as those at meteorological stations. Meteorological conditions (e.g. temperature, wind) are known to influence forest fires, and several fire indexes, such as the forest fire weather index (FWI), make use of such data. In this work we explore a machine learning (ML) approach to predicting the area burned by forest fires.
Source:
P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
Problem¶
- regression with multiple input variables
- the data do not arrive continuously -> batch learning
- FFMC - fine fuel moisture code
- DMC - duff moisture code
- DC - drought code
- ISI - initial spread index
- BUI - buildup index
- FWI - forest fire weather index
For more information see [Cortez and Morais, 2007] or Fire Weather Maps Canada.
Environment setup¶
# imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
# fix the pseudo-random generator seed
np.random.seed(23)
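A side note: `np.random.seed` sets global state; newer NumPy code tends to use a local `Generator` instead. A minimal sketch (not used in the rest of this notebook):

```python
import numpy as np

# a local Generator avoids the global state of np.random.seed
rng = np.random.default_rng(23)
sample = rng.permutation(10)  # reproducible permutation of 0..9
```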
Loading the dataset¶
The data can be downloaded here.
# read the data
fires = pd.read_csv('data/07/forestfires.csv')
# take a look at the data
fires.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |
- X - X coordinate within Montesinho park: 1 to 9
- Y - Y coordinate within Montesinho park: 2 to 9
- month - month of the year in English: "jan" to "dec"
- day - day of the week in English: "mon" to "sun"
- FFMC: 18.7 to 96.20
- DMC: 1.1 to 291.3
- DC: 7.9 to 860.6
- ISI: 0.0 to 56.10
- temp - temperature in degrees Celsius: 2.2 to 33.30
- RH - relative humidity in %: 15.0 to 100
- wind - wind speed in km/h: 0.40 to 9.40
- rain - rain in mm/m2: 0.0 to 6.4
- area - burned area in ha: 0.00 to 1090.84 (this output variable is heavily skewed toward 0, so it may make sense to model it with a logarithmic transform).
For more information see [Cortez and Morais, 2007].
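The note on the skewed `area` variable can be made concrete: a log transform compresses the heavy right tail. A small sketch on synthetic values (not the real dataset), using `log1p` so that zero areas stay well defined:

```python
import numpy as np

# synthetic sample of burned areas (most values are 0, one is huge)
area = np.array([0.0, 0.0, 1.61, 56.04, 1090.84])
area_log = np.log1p(area)       # log(1 + x), defined at x == 0
area_back = np.expm1(area_log)  # inverse transform, for reporting predictions in ha
```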
# number of measurements with burned area greater than 0 ha
print(len(fires))
print(len(fires[fires['area'] > 0]))
517 270
# sklearn provides train_test_split() - writing our own function serves only to practice and understand the underlying algorithm
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(fires, 0.2)
print(len(train_set), "train +", len(test_set), "test")
414 train + 103 test
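The custom split above has a caveat: a new run (or a grown dataset) reshuffles rows between train and test. A common remedy, sketched here under the assumption that a stable identifier column exists, is to hash each identifier, so a row's assignment never changes:

```python
from zlib import crc32

import numpy as np
import pandas as pd

def split_train_test_by_id(data, test_ratio, id_column):
    # a row goes to the test set iff the hash of its id falls into
    # the lowest test_ratio fraction of the 32-bit hash range
    ids = data[id_column]
    in_test = ids.apply(lambda i: crc32(np.int64(i)) < test_ratio * 2**32)
    return data.loc[~in_test], data.loc[in_test]

df = pd.DataFrame({'id': range(100), 'v': range(100)})
train, test = split_train_test_by_id(df, 0.2, 'id')
```

Because the assignment depends only on the id, rerunning the split always yields the same partition.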
test_set.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 2 | 4 | sep | sat | 93.4 | 145.4 | 721.4 | 8.1 | 28.6 | 27 | 2.2 | 0.0 | 1.61 |
337 | 6 | 3 | sep | mon | 91.6 | 108.4 | 764.0 | 6.2 | 23.0 | 34 | 2.2 | 0.0 | 56.04 |
161 | 6 | 4 | aug | thu | 95.2 | 131.7 | 578.8 | 10.4 | 20.3 | 41 | 4.0 | 0.0 | 1.90 |
442 | 6 | 5 | apr | mon | 87.9 | 24.9 | 41.6 | 3.7 | 10.9 | 64 | 3.1 | 0.0 | 3.35 |
392 | 1 | 3 | sep | sun | 91.0 | 276.3 | 825.1 | 7.1 | 21.9 | 43 | 4.0 | 0.0 | 70.76 |
Splitting the dataset with sklearn¶
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(fires, test_size=0.2, random_state=23)
print(len(train_set), "train +", len(test_set), "test")
413 train + 104 test
test_set.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
156 | 2 | 4 | sep | sat | 93.4 | 145.4 | 721.4 | 8.1 | 28.6 | 27 | 2.2 | 0.0 | 1.61 |
337 | 6 | 3 | sep | mon | 91.6 | 108.4 | 764.0 | 6.2 | 23.0 | 34 | 2.2 | 0.0 | 56.04 |
161 | 6 | 4 | aug | thu | 95.2 | 131.7 | 578.8 | 10.4 | 20.3 | 41 | 4.0 | 0.0 | 1.90 |
442 | 6 | 5 | apr | mon | 87.9 | 24.9 | 41.6 | 3.7 | 10.9 | 64 | 3.1 | 0.0 | 3.35 |
392 | 1 | 3 | sep | sun | 91.0 | 276.3 | 825.1 | 7.1 | 21.9 | 43 | 4.0 | 0.0 | 70.76 |
Stratified sampling by class using StratifiedShuffleSplit¶
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=23)
# the input data should not contain 'area'
attributes_sel = [
'X', 'Y', 'month', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain'
]
X = np.array(fires[attributes_sel])
# y holds the categorical classes (1 == fire, 0 == no fire)
y = np.array((fires['area'] > 0))
print(X.shape)
print(y.shape)
(517, 11) (517,)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index)
    print("\nTEST:", test_index)
TRAIN: [124 20 197 484 175 236 435 289 398 54 303 486 391 87 463 293 432 208 56 34 348 448 1 151 133 311 136 137 256 297 266 161 498 190 160 11 214 245 429 174 209 374 47 352 19 513 150 331 276 260 335 146 85 338 349 186 102 505 452 200 105 446 308 316 241 126 370 453 194 14 507 76 112 482 421 292 138 401 362 363 346 140 430 199 188 17 447 96 465 93 273 501 320 172 201 315 360 106 319 471 171 426 408 414 466 251 499 3 347 500 213 462 454 337 420 433 339 198 488 10 332 478 101 473 69 134 490 144 165 192 173 272 496 322 400 380 225 516 424 295 364 55 268 417 207 485 265 224 358 405 228 114 324 226 442 66 167 369 183 33 184 494 286 313 5 350 217 390 244 223 464 195 18 367 135 240 104 13 147 29 239 253 60 84 449 283 444 302 220 353 368 127 73 4 232 120 145 68 259 246 152 88 258 170 113 81 59 44 409 437 310 211 277 402 280 376 479 67 382 235 2 72 389 92 189 159 460 510 515 132 71 377 45 457 288 181 42 384 341 70 74 326 89 222 294 502 336 238 328 270 8 477 22 318 162 28 204 227 99 23 440 411 287 343 458 508 191 176 97 248 394 212 111 52 361 216 30 418 46 242 404 480 476 237 139 62 506 257 82 0 131 50 373 107 115 9 355 143 243 64 119 344 24 459 38 267 385 375 215 425 231 16 32 323 300 196 51 386 109 403 193 164 312 261 234 512 431 330 168 306 356 179 58 26 345 249 177 511 334 321 37 78 365 379 514 333 445 381 503 438 354 7 415 80 262 182 130 100 416 474 481 250 309 285 509 428 455 475 110 314 75 301 392 269 218 98 443 163 61 366 94 427 305 247 493 95 108 233 117 43 229 122 27 489 450 397 271 86 298 274 153 169 128 359 156 77 487 121 142 63 299 155 53] TEST: [383 35 141 263 264 399 39 388 340 419 439 396 291 6 180 166 49 472 371 357 25 15 65 461 12 158 57 118 187 495 351 255 230 329 468 483 304 497 91 423 40 116 470 154 254 157 491 206 210 504 21 185 275 221 395 202 407 284 451 83 325 406 422 252 378 372 456 434 327 413 178 203 296 123 79 317 149 342 31 90 281 219 148 393 103 307 412 492 467 282 441 205 387 469 36 129 410 125 290 278 48 279 41 436]
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
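To verify that `StratifiedShuffleSplit` really preserves the fire/no-fire ratio in both parts, we can compare class proportions. A self-contained sketch with synthetic labels matching the dataset's 270-of-517 positive rate:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# synthetic binary labels with the dataset's rough positive rate
y = np.array([1] * 270 + [0] * 247)
X = np.zeros((len(y), 1))  # features are irrelevant for this ratio check

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=23)
train_idx, test_idx = next(sss.split(X, y))
train_ratio = y[train_idx].mean()
test_ratio = y[test_idx].mean()
```

Both ratios stay within a fraction of a percent of the overall 270/517 rate.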
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_figwidth(15)
ax1.set_title('Full data')
ax1.hist(fires['DMC'])
X_train_df = pd.DataFrame(X_train, columns=attributes_sel)
ax2.set_title('Train data')
ax2.hist(X_train_df['DMC'])
X_test_df = pd.DataFrame(X_test, columns=attributes_sel)
ax3.set_title('Test data')
ax3.hist(X_test_df['DMC'])
bins = np.linspace(0, 500, 5)
print(bins)
[ 0. 125. 250. 375. 500.]
Continuous stratified sampling based on y¶
len(fires.index)
517
# create bins of the data according to y
y = np.array(fires['area'])
bins = [0, 1, 10]
y_binned = np.digitize(y, bins)
y_binned
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, 3, 2, 3, 3, 3, 2, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2, 1, 1, 2, 1, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 3, 3, 2, 1, 1, 1, 3, 2, 2, 2, 1, 1, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 2, 2, 2, 1, 2, 2, 3, 2, 1, 3, 1, 3, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 3, 2, 3, 3, 3, 3, 1, 3, 1, 2, 3, 3, 1, 1, 3, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 3, 2, 1, 2, 2, 3, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 3, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 1, 3, 2, 3, 3, 2, 2, 2, 2, 3, 2, 1, 2, 1, 3, 2, 2, 3, 3, 1, 1, 1, 1, 3, 2, 1, 2, 3, 3, 3, 1, 1, 1, 2, 3, 2, 1, 1, 1, 2, 1, 1, 2, 3, 3, 1, 1])
# reuse the familiar train_test_split function, now with the stratify argument defined
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y_binned
)
X_train
array([[1, 2, 'aug', ..., 47, 0.9, 0.0], [8, 6, 'aug', ..., 41, 3.6, 0.0], [1, 4, 'sep', ..., 28, 4.0, 0.0], ..., [6, 4, 'feb', ..., 77, 5.4, 0.0], [7, 3, 'oct', ..., 27, 4.0, 0.0], [7, 4, 'sep', ..., 77, 4.0, 0.0]], dtype=object)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.set_figwidth(15)
ax1.set_title('Full data')
ax1.hist(fires['DMC'])
X_train_df = pd.DataFrame(X_train, columns=attributes_sel)
ax2.set_title('Train data')
ax2.hist(X_train_df['DMC'])
X_test_df = pd.DataFrame(X_test, columns=attributes_sel)
ax3.set_title('Test data')
ax3.hist(X_test_df['DMC'])
Visualization¶
plt.scatter(
    fires['X'], fires['Y'],
    c=fires['area'], s=fires['area'], cmap="jet",  # color and size reflect the burned area
    alpha=0.5  # multiple fires may have occurred at the same location
)
# the Y coordinate runs from top to bottom, see [Cortez and Morais, 2007]
plt.gca().invert_yaxis()
plt.colorbar(label="area")
plt.show()
# correlation matrix
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[28], line 2 1 # korelacni matice ----> 2 corr_matrix = fires.corr() 3 corr_matrix["area"].sort_values(ascending=False) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/frame.py:11049, in DataFrame.corr(self, method, min_periods, numeric_only) 11047 cols = data.columns 11048 idx = cols.copy() > 11049 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False) 11051 if method == "pearson": 11052 correl = libalgos.nancorr(mat, minp=min_periods) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/frame.py:1993, in DataFrame.to_numpy(self, dtype, copy, na_value) 1991 if dtype is not None: 1992 dtype = np.dtype(dtype) -> 1993 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value) 1994 if result.dtype is not dtype: 1995 result = np.asarray(result, dtype=dtype) File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/internals/managers.py:1694, in BlockManager.as_array(self, dtype, copy, na_value) 1692 arr.flags.writeable = False 1693 else: -> 1694 arr = self._interleave(dtype=dtype, na_value=na_value) 1695 # The underlying data was copied within _interleave, so no need 1696 # to further copy if copy=True or setting na_value 1698 if na_value is lib.no_default: File ~/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/pandas/core/internals/managers.py:1753, in BlockManager._interleave(self, dtype, na_value) 1751 else: 1752 arr = blk.get_values(dtype) -> 1753 result[rl.indexer] = arr 1754 itemmask[rl.indexer] = 1 1756 if not itemmask.all(): ValueError: could not convert string to float: 'mar'
# the 'day' and 'month' columns must be converted to numbers
days = [
'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun'
]
months = [
'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'
]
for i, day in enumerate(days):
    fires.loc[fires['day'] == day, 'day'] = i
for i, month in enumerate(months):
    fires.loc[fires['month'] == month, 'month'] = i
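The loops above work, but the same ordinal encoding is usually written with `map` and a lookup dictionary; a small sketch on synthetic rows (note that ordinal codes impose an artificial order on `day`, so one-hot encoding is a common alternative):

```python
import pandas as pd

days = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
df = pd.DataFrame({'day': ['fri', 'tue', 'sat']})

# map each label to its position in the reference list
df['day'] = df['day'].map({d: i for i, d in enumerate(days)})
```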
# correlation matrix
corr_matrix = fires.corr()
corr_matrix["area"].sort_values(ascending=False)
area 1.000000 temp 0.097844 DMC 0.072994 X 0.063385 month 0.056496 DC 0.049383 Y 0.044873 FFMC 0.040122 day 0.023226 wind 0.012317 ISI 0.008258 rain -0.007366 RH -0.075519 Name: area, dtype: float64
from pandas.plotting import scatter_matrix
attributes = ["temp", "DMC", "DC", "area"]
scatter_matrix(fires[attributes], figsize=(12, 8), alpha=0.5)
array([[<Axes: xlabel='temp', ylabel='temp'>, <Axes: xlabel='DMC', ylabel='temp'>, <Axes: xlabel='DC', ylabel='temp'>, <Axes: xlabel='area', ylabel='temp'>], [<Axes: xlabel='temp', ylabel='DMC'>, <Axes: xlabel='DMC', ylabel='DMC'>, <Axes: xlabel='DC', ylabel='DMC'>, <Axes: xlabel='area', ylabel='DMC'>], [<Axes: xlabel='temp', ylabel='DC'>, <Axes: xlabel='DMC', ylabel='DC'>, <Axes: xlabel='DC', ylabel='DC'>, <Axes: xlabel='area', ylabel='DC'>], [<Axes: xlabel='temp', ylabel='area'>, <Axes: xlabel='DMC', ylabel='area'>, <Axes: xlabel='DC', ylabel='area'>, <Axes: xlabel='area', ylabel='area'>]], dtype=object)
Data preparation¶
# split the data into features and target values
fires_features = fires.drop("area", axis=1)
fires_area = fires["area"].copy()
fires_area.head()
0 0.0 1 0.0 2 0.0 3 0.0 4 0.0 Name: area, dtype: float64
# fires_features.drop("month", inplace=True, axis=1)
# fires_features.drop("day", inplace=True, axis=1)
fires_features.head(10)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 |
5 | 8 | 6 | 7 | 6 | 92.3 | 85.3 | 488.0 | 14.7 | 22.2 | 29 | 5.4 | 0.0 |
6 | 8 | 6 | 7 | 0 | 92.3 | 88.9 | 495.6 | 8.5 | 24.1 | 27 | 3.1 | 0.0 |
7 | 8 | 6 | 7 | 0 | 91.5 | 145.4 | 608.2 | 10.7 | 8.0 | 86 | 2.2 | 0.0 |
8 | 8 | 6 | 8 | 1 | 91.0 | 129.5 | 692.6 | 7.0 | 13.1 | 63 | 5.4 | 0.0 |
9 | 7 | 5 | 8 | 5 | 92.5 | 88.0 | 698.6 | 7.1 | 22.8 | 40 | 4.0 | 0.0 |
# simulate missing values - just to make things more interesting...
# the syntax used here is outdated (see the warnings) - it is shown only as an example you may still encounter; prefer the syntax used above
fires_features['temp'][4] = np.nan
fires_features['temp'][104] = np.nan
fires_features['temp'][240] = np.nan
/tmp/ipykernel_7376/2677343666.py:2: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][4] = np.nan /tmp/ipykernel_7376/2677343666.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][4] = np.nan /tmp/ipykernel_7376/2677343666.py:3: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. 
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][104] = np.nan /tmp/ipykernel_7376/2677343666.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][104] = np.nan /tmp/ipykernel_7376/2677343666.py:4: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0! You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy. A typical example is when you are setting values in a column of a DataFrame, like: df["col"][row_indexer] = value Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][240] = np.nan /tmp/ipykernel_7376/2677343666.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy fires_features['temp'][240] = np.nan
# which rows contain a NaN value?
sample_incomplete_rows = fires_features[
fires_features.isnull().any(axis=1)
]
sample_incomplete_rows
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
104 | 2 | 4 | 0 | 5 | 82.1 | 3.7 | 9.3 | 2.9 | NaN | 78 | 3.1 | 0.0 |
240 | 6 | 3 | 3 | 2 | 88.0 | 17.2 | 43.5 | 3.8 | NaN | 51 | 2.7 | 0.0 |
# how would we get rid of them?
sample_incomplete_rows.dropna(subset=["temp"])
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain |
---|
Filling missing values with the median¶
# the syntax used here is outdated (see the warnings) - it is shown only as an example you may still encounter; prefer the syntax used above
median = fires_features["temp"].median()
sample_incomplete_rows["temp"].fillna(median, inplace=True)
sample_incomplete_rows
/tmp/ipykernel_7376/2230859196.py:2: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy. For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object. sample_incomplete_rows["temp"].fillna(median, inplace=True) /tmp/ipykernel_7376/2230859196.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy sample_incomplete_rows["temp"].fillna(median, inplace=True)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | 19.3 | 99 | 1.8 | 0.0 |
104 | 2 | 4 | 0 | 5 | 82.1 | 3.7 | 9.3 | 2.9 | 19.3 | 78 | 3.1 | 0.0 |
240 | 6 | 3 | 3 | 2 | 88.0 | 17.2 | 43.5 | 3.8 | 19.3 | 51 | 2.7 | 0.0 |
# what were the original values?
print(fires['temp'][4])
print(fires['temp'][104])
print(fires['temp'][240])
11.4 5.3 15.2
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median", missing_values=np.nan)
fires_features.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
# fit the imputer and build a DataFrame from its output
fires_features_imputed = pd.DataFrame(
imputer.fit_transform(fires_features), columns=fires_features.columns
)
fires_features_imputed.head()
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 5.0 | 2.0 | 4.0 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51.0 | 6.7 | 0.0 |
1 | 7.0 | 4.0 | 9.0 | 1.0 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33.0 | 0.9 | 0.0 |
2 | 7.0 | 4.0 | 9.0 | 5.0 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33.0 | 1.3 | 0.0 |
3 | 8.0 | 6.0 | 2.0 | 4.0 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97.0 | 4.0 | 0.2 |
4 | 8.0 | 6.0 | 2.0 | 6.0 | 89.3 | 51.3 | 102.2 | 9.6 | 19.3 | 99.0 | 1.8 | 0.0 |
rh_index = fires_features.columns.get_loc('RH')
wind_index = fires_features.columns.get_loc('wind')
# function for creating new features
def add_extra_features(X):
    RH_per_wind = X[:, rh_index] / X[:, wind_index]
    return np.c_[X, RH_per_wind]
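Before wiring `add_extra_features` into a pipeline, it is worth sanity-checking it on a tiny array. The sketch below is self-contained, so it fixes the RH and wind column positions itself rather than reading them from the DataFrame:

```python
import numpy as np

rh_index, wind_index = 0, 1  # assumed column positions for this sketch

def add_extra_features(X):
    # append the RH/wind ratio as a new column
    RH_per_wind = X[:, rh_index] / X[:, wind_index]
    return np.c_[X, RH_per_wind]

X = np.array([[50.0, 5.0],
              [30.0, 3.0]])
X_ext = add_extra_features(X)
```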
The whole procedure in summary¶
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median", missing_values=np.nan)),
('attribs_adder', FunctionTransformer(add_extra_features)),
('std_scaler', StandardScaler()),
])
# fires_features
fires_features_tr = pd.DataFrame(
num_pipeline.fit_transform(fires_features),
columns=fires_features.columns.append(pd.Index(['RHw']))
)
fires_features.head(305)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | 5 | 2 | 4 | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 |
1 | 7 | 4 | 9 | 1 | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 |
2 | 7 | 4 | 9 | 5 | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 |
3 | 8 | 6 | 2 | 4 | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 |
4 | 8 | 6 | 2 | 6 | 89.3 | 51.3 | 102.2 | 9.6 | NaN | 99 | 1.8 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
300 | 6 | 5 | 5 | 0 | 90.4 | 93.3 | 298.1 | 7.5 | 20.7 | 25 | 4.9 | 0.0 |
301 | 6 | 5 | 5 | 0 | 90.4 | 93.3 | 298.1 | 7.5 | 19.1 | 39 | 5.4 | 0.0 |
302 | 3 | 6 | 5 | 4 | 91.1 | 94.1 | 232.1 | 7.1 | 19.2 | 38 | 4.5 | 0.0 |
303 | 3 | 6 | 5 | 4 | 91.1 | 94.1 | 232.1 | 7.1 | 19.2 | 38 | 4.5 | 0.0 |
304 | 6 | 5 | 4 | 5 | 85.1 | 28.0 | 113.8 | 3.5 | 11.3 | 94 | 4.9 | 0.0 |
305 rows × 12 columns
fires_features_tr.head(305)
X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | RHw | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.008313 | 0.569860 | -1.968443 | 0.357721 | -0.805959 | -1.323326 | -1.830477 | -0.860946 | -1.865037 | 0.411724 | 1.498614 | -0.073268 | -0.575544 |
1 | 1.008313 | -0.244001 | 1.110120 | -1.090909 | -0.008102 | -1.179541 | 0.488891 | -0.509688 | -0.163148 | -0.692456 | -1.741756 | -0.073268 | 1.993477 |
2 | 1.008313 | -0.244001 | 1.110120 | 0.840597 | -0.008102 | -1.049822 | 0.560715 | -0.509688 | -0.753599 | -0.692456 | -1.518282 | -0.073268 | 0.995917 |
3 | 1.440925 | 1.383722 | -1.968443 | 0.357721 | 0.191362 | -1.212361 | -1.898266 | -0.004756 | -1.847670 | 3.233519 | -0.009834 | 0.603155 | 0.895595 |
4 | 1.440925 | 1.383722 | -1.968443 | 1.323474 | -0.243833 | -0.931043 | -1.798600 | 0.126966 | 0.062612 | 3.356206 | -1.238940 | -0.073268 | 3.614511 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
300 | 0.575701 | 0.569860 | -0.649059 | -1.573785 | -0.044368 | -0.274634 | -1.008126 | -0.334060 | 0.305739 | -1.183203 | 0.492982 | -0.073268 | -0.797469 |
301 | 0.575701 | 0.569860 | -0.649059 | -1.573785 | -0.044368 | -0.274634 | -1.008126 | -0.334060 | 0.027880 | -0.324396 | 0.772325 | -0.073268 | -0.610003 |
302 | -0.722136 | 1.383722 | -0.649059 | 0.357721 | 0.082564 | -0.262131 | -1.274442 | -0.421874 | 0.045246 | -0.385739 | 0.269509 | -0.073268 | -0.501934 |
303 | -0.722136 | 1.383722 | -0.649059 | 0.357721 | 0.082564 | -0.262131 | -1.274442 | -0.421874 | 0.045246 | -0.385739 | 0.269509 | -0.073268 | -0.501934 |
304 | 0.575701 | 0.569860 | -1.088854 | 0.840597 | -1.005424 | -1.295194 | -1.751793 | -1.212203 | -1.326684 | 3.049489 | 0.492982 | -0.073268 | 0.447630 |
305 rows × 13 columns
# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# full_pipeline = ColumnTransformer([
#     ("num", num_pipeline, num_attribs),
#     ("cat", OneHotEncoder(), cat_attribs),
# ])
# df_prepared = full_pipeline.fit_transform(fires_features)
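The commented skeleton above can be made concrete. A sketch of how `ColumnTransformer` combines a numeric pipeline with one-hot encoding, on a toy frame whose column names are assumed to mimic `forestfires.csv`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame mimicking a slice of the forestfires.csv layout
df = pd.DataFrame({
    'temp': [8.2, 18.0, None],   # one missing value for the imputer
    'wind': [6.7, 0.9, 1.3],
    'month': ['mar', 'oct', 'oct'],
})
num_attribs = ['temp', 'wind']
cat_attribs = ['month']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])
prepared = full_pipeline.fit_transform(df)
```

The result has one column per numeric attribute plus one per observed category (here 'mar' and 'oct').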
Back to the beginning - splitting the data into train and test sets¶
# random split using sklearn
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(
fires_features_tr, test_size=0.3, random_state=23
)
# the identical random_state guarantees the same permutation, keeping y aligned with X
y_train, y_test = train_test_split(
    fires_area, test_size=0.3, random_state=23
)
y_test.head()
156 1.61 337 56.04 161 1.90 442 3.35 392 70.76 Name: area, dtype: float64
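Splitting X and y in two separate calls works only because the identical `random_state` reproduces the same permutation; the safer idiom passes both arrays to a single call, which keeps them aligned by construction. A self-contained sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # row i is [2i, 2i+1]
y = np.arange(10)                 # label i matches row i

# one call splits X and y together, so rows and labels stay paired
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23
)
```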
Model selection and training¶
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=23)
# first, exploratory training run
tree_reg.fit(X_train, y_train)
DecisionTreeRegressor(random_state=23)
# prediction
fires_predictions = tree_reg.predict(X_test)
# first evaluation using RMSE
from sklearn.metrics import mean_squared_error
tree_mse = mean_squared_error(y_test, fires_predictions)
tree_rmse = np.sqrt(tree_mse)
print(round(tree_rmse, 2))
66.63
from sklearn.metrics import mean_absolute_error
tree_mae = mean_absolute_error(y_test, fires_predictions)
print(round(tree_mae, 2))
22.97
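An MAE around 23 ha says little on its own. A useful reference point, sketched here on synthetic targets, is `DummyRegressor`, which always predicts the training mean; a real model should at least approach it on data this skewed:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# synthetic skewed targets standing in for burned areas
y_train = np.array([0.0, 0.0, 0.0, 2.0, 50.0])
y_test = np.array([0.0, 1.0, 40.0])

dummy = DummyRegressor(strategy='mean')  # always predicts the training mean
dummy.fit(np.zeros((len(y_train), 1)), y_train)
baseline_mae = mean_absolute_error(
    y_test, dummy.predict(np.zeros((len(y_test), 1)))
)
```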
Model tuning¶
from sklearn.model_selection import cross_val_score
# Decision Tree regressor
scores = cross_val_score(
tree_reg, X_train, y_train, scoring="neg_mean_absolute_error", cv=10
)
tree_mae_scores = (-scores)
def display_scores(scores):
    # print("Scores:", scores)
    print("Mean MAE:", round(scores.mean(), 2))
    print("Standard deviation:", round(scores.std(), 2))
display_scores(tree_mae_scores)
Mean MAE: 20.24 Standard deviation: 10.68
Grid Search¶
from sklearn.model_selection import GridSearchCV
# the model hyper-parameters
# help: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
# DecisionTreeRegressor(splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, ...)
param_grid = [
{
'max_depth': [3, 4, 5, 10],
'min_samples_split': [3, 4, 5, 10],
'splitter': ['random', 'best']
}
]
# grid search application
tree_reg = DecisionTreeRegressor(random_state=23)
# cv=10 - ten-fold cross-validation, i.e. cv * (number of parameter combinations) model fits
grid_search = GridSearchCV(
tree_reg, param_grid, cv=10, scoring="neg_mean_absolute_error", return_train_score=True
)
grid_search.fit(X_train, y_train)
/home/pesek/workspace/yusu/.direnv/python-3.10.12/lib/python3.10/site-packages/numpy/ma/core.py:2881: RuntimeWarning: invalid value encountered in cast _data = np.array(data, dtype=dtype, copy=copy,
GridSearchCV(cv=10, estimator=DecisionTreeRegressor(random_state=23), param_grid=[{'max_depth': [3, 4, 5, 10], 'min_samples_split': [3, 4, 5, 10], 'splitter': ['random', 'best']}], return_train_score=True, scoring='neg_mean_absolute_error')
DecisionTreeRegressor(max_depth=3, min_samples_split=3, random_state=23)
grid_search.best_params_
{'max_depth': 3, 'min_samples_split': 3, 'splitter': 'best'}
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(-mean_score, params)
18.514869629268855 {'max_depth': 3, 'min_samples_split': 3, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 3, 'splitter': 'best'}
18.514869629268855 {'max_depth': 3, 'min_samples_split': 4, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 4, 'splitter': 'best'}
18.514869629268855 {'max_depth': 3, 'min_samples_split': 5, 'splitter': 'random'}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 5, 'splitter': 'best'}
18.490661198514218 {'max_depth': 3, 'min_samples_split': 10, 'splitter': 'random'}
16.041471761809444 {'max_depth': 3, 'min_samples_split': 10, 'splitter': 'best'}
19.435980243140406 {'max_depth': 4, 'min_samples_split': 3, 'splitter': 'random'}
17.62946315452267 {'max_depth': 4, 'min_samples_split': 3, 'splitter': 'best'}
19.05396623932332 {'max_depth': 4, 'min_samples_split': 4, 'splitter': 'random'}
17.55007426563378 {'max_depth': 4, 'min_samples_split': 4, 'splitter': 'best'}
18.58107237595373 {'max_depth': 4, 'min_samples_split': 5, 'splitter': 'random'}
17.688783725093238 {'max_depth': 4, 'min_samples_split': 5, 'splitter': 'best'}
20.12845363862052 {'max_depth': 4, 'min_samples_split': 10, 'splitter': 'random'}
17.775441132500646 {'max_depth': 4, 'min_samples_split': 10, 'splitter': 'best'}
19.9505417184069 {'max_depth': 5, 'min_samples_split': 3, 'splitter': 'random'}
18.60957245145 {'max_depth': 5, 'min_samples_split': 3, 'splitter': 'best'}
20.303624660893725 {'max_depth': 5, 'min_samples_split': 4, 'splitter': 'random'}
18.125947076074624 {'max_depth': 5, 'min_samples_split': 4, 'splitter': 'best'}
21.911845419693922 {'max_depth': 5, 'min_samples_split': 5, 'splitter': 'random'}
18.287378757756308 {'max_depth': 5, 'min_samples_split': 5, 'splitter': 'best'}
18.968975573126443 {'max_depth': 5, 'min_samples_split': 10, 'splitter': 'random'}
18.133173841530557 {'max_depth': 5, 'min_samples_split': 10, 'splitter': 'best'}
26.559439062648032 {'max_depth': 10, 'min_samples_split': 3, 'splitter': 'random'}
18.071162044051874 {'max_depth': 10, 'min_samples_split': 3, 'splitter': 'best'}
21.534413780549198 {'max_depth': 10, 'min_samples_split': 4, 'splitter': 'random'}
18.102266241106072 {'max_depth': 10, 'min_samples_split': 4, 'splitter': 'best'}
22.4833334290822 {'max_depth': 10, 'min_samples_split': 5, 'splitter': 'random'}
19.753292174182008 {'max_depth': 10, 'min_samples_split': 5, 'splitter': 'best'}
22.52042047089359 {'max_depth': 10, 'min_samples_split': 10, 'splitter': 'random'}
18.29324359971468 {'max_depth': 10, 'min_samples_split': 10, 'splitter': 'best'}
Randomized Search¶
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
'max_depth': randint(low=3, high=10),
'min_samples_split': randint(low=3, high=10),
}
tree_reg = DecisionTreeRegressor(random_state=23)
# sample 10 random hyperparameter combinations and score each with 10-fold CV
rnd_search = RandomizedSearchCV(
    tree_reg, param_distributions=param_distribs, n_iter=10,
    cv=10, scoring="neg_mean_absolute_error", random_state=23
)
rnd_search.fit(X_train, y_train)
RandomizedSearchCV(cv=10, estimator=DecisionTreeRegressor(random_state=23), param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x761eb8b9fe80>, 'min_samples_split': <scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x761eb8b1f610>}, random_state=23, scoring='neg_mean_absolute_error')
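The `randint` objects passed in `param_distribs` are frozen scipy distributions from which `RandomizedSearchCV` draws one candidate value per iteration. A minimal standalone sketch of how such a distribution samples (the variable names here are illustrative, not part of the notebook):

```python
from scipy.stats import randint

# randint(low, high) samples integers uniformly from {low, ..., high - 1};
# note that `high` is exclusive, so max_depth above is never sampled as 10
dist = randint(low=3, high=10)
samples = dist.rvs(size=8, random_state=23)
print(samples)
```

This exclusivity of `high` explains why none of the sampled candidates below use `max_depth=10`, even though the earlier grid explicitly included it.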
rnd_search.best_estimator_
DecisionTreeRegressor(max_depth=3, min_samples_split=4, random_state=23)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
print(-mean_score, params)
18.369909683541902 {'max_depth': 6, 'min_samples_split': 9}
15.959582872920553 {'max_depth': 3, 'min_samples_split': 4}
20.204774203776793 {'max_depth': 9, 'min_samples_split': 3}
19.705889251512108 {'max_depth': 8, 'min_samples_split': 7}
18.148671621372586 {'max_depth': 6, 'min_samples_split': 5}
17.77067261398213 {'max_depth': 4, 'min_samples_split': 6}
20.130670309085396 {'max_depth': 9, 'min_samples_split': 6}
19.705889251512108 {'max_depth': 8, 'min_samples_split': 7}
17.688783725093238 {'max_depth': 4, 'min_samples_split': 5}
18.434572877986348 {'max_depth': 6, 'min_samples_split': 8}
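The minus sign in front of `mean_score` is needed because scikit-learn scorers follow a "greater is better" convention, so error metrics such as MAE are stored negated. A small self-contained illustration on synthetic data (the data and names here are made up for demonstration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic regression data, purely for demonstration
rng = np.random.default_rng(23)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(size=100)

# "neg_mean_absolute_error" yields non-positive scores ...
scores = cross_val_score(DecisionTreeRegressor(random_state=23), X, y,
                         cv=5, scoring="neg_mean_absolute_error")
# ... so we negate to report the usual positive MAE
print(-scores.mean())
```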
# Evaluate the grid search's best model on the test set!
sel_model = grid_search.best_estimator_
sel_predictions = sel_model.predict(X_test)
print('MAE: {}'.format(round(mean_absolute_error(y_test, sel_predictions), 2)))
MAE: 18.43
# Evaluate the randomized search's best model on the test set!
sel_model = rnd_search.best_estimator_
sel_predictions = sel_model.predict(X_test)
print('MAE: {}'.format(round(mean_absolute_error(y_test, sel_predictions), 2)))
MAE: 18.43

Both searches reach the same test MAE here, which indicates that the randomized search selected a model equivalent to the best one found by the exhaustive grid search.
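As a side note, with the default `refit=True` the search object automatically refits the winning hyperparameters on the whole training set, so `best_estimator_` can be used for prediction directly, without any manual retraining. A self-contained sketch of this end-to-end pattern on synthetic data (all data and names here are illustrative):

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# synthetic regression data, purely for demonstration
rng = np.random.default_rng(23)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=23)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=23),
    {'max_depth': randint(2, 8)}, n_iter=5, cv=3,
    scoring="neg_mean_absolute_error", random_state=23,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr (refit=True by default)
mae = mean_absolute_error(y_te, search.best_estimator_.predict(X_te))
print(round(mae, 2))
```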