|
5 | 5 | \usepackage{amsmath} |
6 | 6 | \usepackage{amssymb} |
7 | 7 | \usepackage{enumerate} |
| 8 | +\usepackage{caption} |
8 | 9 |
|
9 | 10 | \setlength\parindent{0pt} |
10 | 11 |
|
@@ -1260,6 +1261,210 @@ \subsection{The scoring metrics for multiclass classification} |
1260 | 1261 | \section{Summary} |
1261 | 1262 |
|
1262 | 1263 |
|
| 1264 | + |
| 1265 | + |
| 1266 | +%%%%%%%%%%%%%%% |
| 1267 | +% CHAPTER 7 |
| 1268 | +%%%%%%%%%%%%%%% |
| 1269 | + |
| 1270 | +\chapter{Combining Different Models for Ensemble Learning} |
| 1271 | + |
| 1272 | +\section{Learning with ensembles} |
| 1273 | + |
| 1274 | +To predict a class label via a simple majority or plurality voting, we combine the predicted class labels of each individual classifier $C_j$ and select the class label $\hat{y}$ that received the most votes: |
| 1275 | + |
| 1276 | +\[ |
| 1277 | +\hat{y} = \text{mode} \{ C_1 (\mathbf{x}), C_2 (\mathbf{x}), \dots, C_m (\mathbf{x}) \}
| 1278 | +\] |
| 1279 | + |
| 1280 | +For example, in a binary classification task where $class1 = -1$ and $class2 = +1$, we can write the majority vote prediction as follows: |
| 1281 | + |
| 1282 | +\[ |
| 1283 | +C(\mathbf{x}) = \text{sign} \Bigg[ \sum_{j=1}^{m} C_j (\mathbf{x}) \Bigg] = \begin{cases}
| 1284 | + 1 & \text{ if } \sum_j C_j (\mathbf{x}) \ge 0 \\ |
| 1285 | + -1 & \text{ otherwise }. |
| 1286 | + \end{cases} |
| 1287 | +\] |
| 1288 | + |
| 1289 | +To illustrate why ensemble methods can work better than individual classifiers alone, let's apply some simple concepts of combinatorics. For the following example, we assume that all $n$ base classifiers for a binary classification task have an equal error rate $\epsilon$. Furthermore, we assume that the classifiers are independent and the error rates are not correlated. Under those assumptions, we can simply express the error probability of an ensemble of base classifiers as a probability mass function of a binomial distribution:
| 1290 | + |
| 1291 | +\[ |
| 1292 | +P(y \ge k) = \sum_{k}^{n} \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k} = \epsilon_{\text{ensemble}} |
| 1293 | +\] |
| 1294 | + |
| 1295 | +Here, $\binom{n}{k}$ is the binomial coefficient \textit{n choose k}. In other words, we compute the probability that the prediction of the ensemble is wrong. Now let's take a look at a more concrete example of 11 base classifiers ($n=11$) with an error rate of 0.25 ($\epsilon = 0.25$): |
| 1296 | + |
| 1297 | +\[ |
| 1298 | +P(y \ge k) = \sum_{k=6}^{11} \binom{11}{k} 0.25^k (1 - 0.25)^{11-k} = 0.034
| 1299 | +\] |
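We can verify this number with a short Python function. `ensemble_error` is a hypothetical helper name; the computation is just the binomial tail from the equation above, with the sum starting at the smallest number of wrong votes that flips the majority:

```python
from math import comb  # Python 3.8+

def ensemble_error(n_classifier, error):
    """Probability that a majority of n independent base classifiers,
    each with error rate `error`, is wrong (a binomial tail)."""
    k_start = n_classifier // 2 + 1  # smallest majority of wrong votes
    return sum(comb(n_classifier, k)
               * error**k * (1 - error)**(n_classifier - k)
               for k in range(k_start, n_classifier + 1))

print(round(ensemble_error(11, 0.25), 3))  # 0.034
```

Note that the ensemble error of 0.034 is far below the base error rate of 0.25, which is the whole point of the argument.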
| 1300 | + |
| 1301 | +\section{Implementing a simple majority vote classifier} |
| 1302 | + |
| 1303 | +Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more precise mathematical terms, we can write the weighted majority vote as follows: |
| 1304 | + |
| 1305 | +\[ |
| 1306 | +\hat{y} = \text{arg} \max_i \sum_{j=1}^{m} w_j \chi_A \big(C_j (\mathbf{x})=i\big) |
| 1307 | +\] |
| 1308 | + |
| 1309 | +Let's assume that we have an ensemble of three base classifiers $C_j$ $\big(j \in \{1, 2, 3\}\big)$ and want to predict the class label of a given sample instance $\mathbf{x}$. Two out of three base classifiers predict the class label 0, and one classifier, $C_3$, predicts that the sample belongs to class 1. If we weight the predictions of each base classifier equally, the majority vote will predict that the sample belongs to class 0:
| 1310 | + |
| 1311 | +\[ |
| 1312 | +C_1(\mathbf{x}) \rightarrow 0, C_2 (\mathbf{x}) \rightarrow 0, C_3(\mathbf{x}) \rightarrow 1 |
| 1313 | +\] |
| 1314 | + |
| 1315 | +\[ |
| 1316 | +\hat{y} = \text{mode} \{0, 0, 1\} = 0
| 1317 | +\] |
| 1318 | + |
| 1319 | +Now let's assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient of 0.2, respectively. |
| 1320 | + |
| 1321 | +\[ |
| 1322 | +\hat{y} = \text{arg}\max_i \sum_{j=1}^{m} w_j \chi_A \big( C_j(\mathbf{x}) = i \big) |
| 1323 | +\] |
| 1324 | + |
| 1325 | +\[ |
| 1326 | += \text{arg}\max_i \big[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1 \big] = 1 |
| 1327 | +\] |
| 1328 | + |
| 1329 | +More intuitively, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ has three times more weight than a prediction by $C_1$ or $C_2$. We can write this as follows:
| 1330 | + |
| 1331 | +\[ |
| 1332 | +\hat{y} = \text{mode} \{0, 0, 1, 1, 1\} = 1
| 1333 | +\] |
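NumPy's `bincount` makes this weighted vote a one-liner: it sums the weight assigned to each class label, and `argmax` then picks the winner. The sketch below uses the votes and weights from the example above:

```python
import numpy as np

# Class-label votes of C_1, C_2, C_3 and the weights assumed in the text.
votes = np.array([0, 0, 1])
weights = np.array([0.2, 0.2, 0.6])

# bincount sums the weights per class label: [0.2 + 0.2, 0.6] = [0.4, 0.6];
# argmax then implements the weighted majority vote.
y_hat = int(np.argmax(np.bincount(votes, weights=weights)))
print(y_hat)  # 1
```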
| 1334 | + |
| 1335 | +[...] The modified version of the majority vote for predicting class labels from probabilities can be written as follows: |
| 1336 | + |
| 1337 | +\[ |
| 1338 | +\hat{y} = \text{arg} \max_i \sum^{m}_{j=1} w_j p_{ij} |
| 1339 | +\] |
| 1340 | + |
| 1341 | +Here, $p_{ij}$ is the predicted probability of the $j$th classifier for class label $i$. |
| 1342 | + |
| 1343 | +To continue with our previous example, let's assume that we have a binary classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers $C_j$ $\big(j \in \{1, 2, 3\}\big)$. Let's assume that the classifier $C_j$ returns the following class membership probabilities for a particular sample $\mathbf{x}$:
| 1344 | + |
| 1345 | +\[ |
| 1346 | +C_1(\mathbf{x}) \rightarrow [0.9, 0.1], C_2 (\mathbf{x}) \rightarrow [0.8, 0.2], C_3(\mathbf{x}) \rightarrow [0.4, 0.6] |
| 1347 | +\] |
| 1348 | + |
| 1349 | +We can then calculate the individual class probabilities as follows: |
| 1350 | + |
| 1351 | +\[ |
| 1352 | +p(i_0 | \mathbf{x}) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58 |
| 1353 | +\] |
| 1354 | + |
| 1355 | +\[ |
| 1356 | +p(i_1 | \mathbf{x}) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42
| 1357 | +\] |
| 1358 | + |
| 1359 | +\[ |
| 1360 | +\hat{y} = \text{arg} \max_i \big[ p(i_0 | \mathbf{x}), p(i_1 | \mathbf{x}) \big] = 0 |
| 1361 | +\] |
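The probability-based variant can be sketched the same way, using `np.average` with the weights assumed in the example to average the classifiers' probability vectors before taking the argmax:

```python
import numpy as np

# Predicted class-membership probabilities [p(class 0), p(class 1)]
# of the three classifiers from the text.
probas = np.array([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.4, 0.6]])
weights = [0.2, 0.2, 0.6]

# Weighted average of the probability vectors, then argmax.
p = np.average(probas, axis=0, weights=weights)
print(p)                   # approximately [0.58 0.42]
y_hat = int(np.argmax(p))  # 0
```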
| 1362 | + |
| 1363 | +\subsection{Combining different algorithms for classification with majority vote} |
| 1364 | +\section{Evaluating and tuning the ensemble classifier} |
| 1365 | +\section{Bagging -- building an ensemble of classifiers from bootstrap samples} |
| 1366 | +\section{Leveraging weak learners via adaptive boosting} |
| 1367 | + |
| 1368 | +[...] The original boosting procedure is summarized in four key steps as follows: |
| 1369 | + |
| 1370 | +\begin{enumerate} |
| 1371 | +\item Draw a random subset of training samples $d_1$ without replacement from the training set $D$ to train a weak learner $C_1$. |
| 1372 | +\item Draw a second random training subset $d_2$ without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner $C_2$.
| 1373 | +\item Find the training samples $d_3$ in the training set $D$ on which $C_1$ and $C_2$ disagree to train a third weak learner $C_3$.
| 1374 | +\item Combine the weak learners $C_1, C_2$, and $C_3$ via majority voting. |
| 1375 | +\end{enumerate} |
| 1376 | + |
| 1377 | +[...] Now that we have a better understanding of the basic concept of AdaBoost, let's take a more detailed look at the algorithm using pseudocode. For clarity, we will denote element-wise multiplication by the cross symbol $(\times)$ and the dot product between two vectors by a dot symbol $(\cdot)$. The steps are as follows:
| 1378 | + |
| 1379 | +\begin{enumerate} |
| 1380 | +\item Set weight vector $\mathbf{w}$ to uniform weights where $\sum_i w_i = 1$. |
| 1381 | +\item For $j$ in $m$ boosting rounds, do the following: |
| 1382 | +\begin{enumerate} |
| 1383 | +\item Train a weighted weak learner: $C_j = train(\mathbf{X, y, w})$. |
| 1384 | +\item Predict class labels: $\hat{y} = predict(C_j, \mathbf{X})$. |
| 1385 | +\item Compute the weighted error rate: $\epsilon = \mathbf{w} \cdot (\mathbf{\hat{y}} \ne \mathbf{y})$.
| 1386 | +\item Compute the coefficient $\alpha_j$: $\alpha_j=0.5 \log \frac{1 - \epsilon}{\epsilon}$. |
| 1387 | +\item Update the weights: $\mathbf{w} := \mathbf{w} \times \exp \big( -\alpha_j \times \mathbf{\hat{y}} \times \mathbf{y} \big)$. |
| 1388 | +\item Normalize weights to sum to 1: $\mathbf{w}:= \mathbf{w} / \sum_i w_i$. |
| 1389 | +\end{enumerate} |
| 1390 | +\item Compute the final prediction: $\mathbf{\hat{y}} = \big( \sum^{m}_{j=1} \big( \alpha_j \times predict(C_j, \mathbf{X}) \big) > 0 \big)$.
| 1391 | +\end{enumerate} |
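The steps above can be sketched in plain NumPy. The decision stump used as the weak learner (`stump_train`, `stump_predict`) is a hypothetical helper written for this illustration, not a particular library's API, and `eps` is clamped to keep step 2d well defined:

```python
import numpy as np

def stump_train(X, y, w):
    # Hypothetical weighted weak learner: a decision stump on a 1-D feature.
    # Tries every threshold/polarity pair and keeps the lowest weighted error.
    best = None
    for thresh in X:
        for polarity in (1, -1):
            pred = np.where(X <= thresh, polarity, -polarity)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best[1], best[2]

def stump_predict(model, X):
    thresh, polarity = model
    return np.where(X <= thresh, polarity, -polarity)

def adaboost_predict(X, y, m=10):
    # Minimal AdaBoost loop for labels in {-1, +1}, following the steps above.
    w = np.full(len(y), 1.0 / len(y))            # step 1: uniform weights
    models, alphas = [], []
    for _ in range(m):                           # step 2
        model = stump_train(X, y, w)             # 2a: train weighted weak learner
        y_pred = stump_predict(model, X)         # 2b: predict class labels
        eps = max(w[y_pred != y].sum(), 1e-10)   # 2c: weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)    # 2d: classifier coefficient
        w = w * np.exp(-alpha * y_pred * y)      # 2e: element-wise weight update
        w = w / w.sum()                          # 2f: normalize to sum to 1
        models.append(model)
        alphas.append(alpha)
    # step 3: sign of the alpha-weighted ensemble votes
    agg = sum(a * stump_predict(mdl, X) for a, mdl in zip(alphas, models))
    return np.sign(agg)
```

On the ten-sample toy dataset discussed next, which no single stump can separate, three boosting rounds already classify every sample correctly.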
| 1392 | + |
| 1393 | +Note that the expression $(\mathbf{\hat{y}} \ne \mathbf{y})$ in step 2c refers to a vector of 1s and 0s, where a 1 is assigned if the prediction is incorrect and a 0 is assigned otherwise.
| 1394 | + |
| 1395 | +\begin{table}[!htbp] |
| 1396 | +\centering |
| 1397 | +\caption*{One round of AdaBoost on a toy dataset: the decision stump $\hat{y}(x \le 3.0)$ and the resulting weight updates}
| 1399 | +\begin{tabular}{r | c c c c c | l} |
| 1400 | +\hline |
| 1401 | +Sample indices & $x$ & $y$ & Weights & $\hat{y}\,(x \le 3.0)$? & Correct? & Updated weights \\ \hline
| 1402 | +1 & 1.0 & 1 & 0.1 & 1 & Yes & 0.071 \\
| 1403 | +2 & 2.0 & 1 & 0.1 & 1 & Yes & 0.071 \\
| 1404 | +3 & 3.0 & 1 & 0.1 & 1 & Yes & 0.071 \\
| 1405 | +4 & 4.0 & -1 & 0.1 & -1 & Yes & 0.071 \\
| 1406 | +5 & 5.0 & -1 & 0.1 & -1 & Yes & 0.071 \\
| 1407 | +6 & 6.0 & -1 & 0.1 & -1 & Yes & 0.071 \\
| 1408 | +7 & 7.0 & 1 & 0.1 & -1 & No & 0.167 \\
| 1409 | +8 & 8.0 & 1 & 0.1 & -1 & No & 0.167 \\
| 1410 | +9 & 9.0 & 1 & 0.1 & -1 & No & 0.167 \\
| 1411 | +10 & 10.0 & -1 & 0.1 & -1 & Yes & 0.071 \\ \hline
| 1412 | +\end{tabular} |
| 1413 | +\end{table} |
| 1414 | + |
| 1415 | +Since the computation of the weight updates may look a little complicated at first, we will now follow the calculation step by step. We start by computing the weighted error rate $\epsilon$ as described in step 2c:
| 1416 | + |
| 1417 | +\[ |
| 1418 | +\epsilon = 0.1\times 0+0.1\times 0+0.1 \times 0+0.1 \times 0+0.1 \times 0+0.1 \times 0+0.1\times 1+0.1 \times 1 + 0.1 \times 1+0.1 \times 0 |
| 1419 | +\] |
| 1420 | +\[ |
| 1421 | += \frac{3}{10} = 0.3 |
| 1422 | +\] |
| 1423 | + |
| 1424 | +Next, we compute the coefficient $\alpha_j$ (shown in step 2d), which is later used in step 2e to update the weights, as well as for the weights in the majority vote prediction (step 3):
| 1425 | + |
| 1426 | +\[ |
| 1427 | +\alpha_j = 0.5 \log \Bigg( \frac{1 - \epsilon}{\epsilon} \Bigg) \approx 0.424 |
| 1428 | +\] |
| 1429 | + |
| 1430 | +After we have computed the coefficient $\alpha_j$, we can now update the weight vector using the following equation:
| 1431 | + |
| 1432 | +\[ |
| 1433 | +\mathbf{w} := \mathbf{w} \times \exp ( -\alpha_j \times \mathbf{\hat{y}} \times \mathbf{y}) |
| 1434 | +\] |
| 1435 | + |
| 1436 | +Here, $\mathbf{\hat{y}} \times \mathbf{y}$ is an element-wise multiplication between the vectors of the predicted and true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will have a positive sign so that we decrease the $i$th weight since $\alpha_j$ is a positive number as well: |
| 1437 | + |
| 1438 | +\[ |
| 1439 | +0.1 \times \exp (-0.424 \times 1 \times 1) \approx 0.065 |
| 1440 | +\] |
| 1441 | + |
| 1442 | +Similarly, we will increase the $i$th weight if $\hat{y}_i$ predicted the label incorrectly like this: |
| 1443 | + |
| 1444 | +\[ |
| 1445 | +0.1 \times \exp (-0.424 \times 1 \times (-1)) \approx 0.153 |
| 1446 | +\] |
| 1447 | + |
| 1448 | +Or like this: |
| 1449 | + |
| 1450 | +\[ |
| 1451 | +0.1 \times \exp (-0.424 \times (-1) \times 1) \approx 0.153 |
| 1452 | +\] |
| 1453 | + |
| 1454 | +After we update each weight in the weight vector, we normalize the weights so that they sum up to 1 (step 2f):
| 1455 | + |
| 1456 | +\[ |
| 1457 | +\mathbf{w} := \frac{\mathbf{w}}{\sum_i w_i} |
| 1458 | +\] |
| 1459 | + |
| 1460 | +Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$. |
| 1461 | + |
| 1462 | +Thus, each weight that corresponds to a correctly classified sample will be reduced from the initial value of $0.1$ to $0.065 / 0.914 \approx 0.071$ for the next round of boosting. Similarly, the weights of each incorrectly classified sample will increase from $0.1$ to $0.153 / 0.914 \approx 0.167$. |
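As a sanity check, the whole worked example can be reproduced in a few lines of NumPy; the stump predictions below are taken from the table above:

```python
import numpy as np

# True labels and predictions of the stump "x <= 3.0" from the table.
y      = np.array([1, 1, 1, -1, -1, -1,  1,  1,  1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])
w = np.full(10, 0.1)                  # initial uniform weights

eps = w[y_pred != y].sum()            # weighted error rate (step 2c)
alpha = 0.5 * np.log((1 - eps) / eps) # coefficient (step 2d)
w = w * np.exp(-alpha * y_pred * y)   # element-wise update (step 2e)
w = w / w.sum()                       # normalize (step 2f)

print(round(eps, 1))    # 0.3
print(round(alpha, 3))  # 0.424
print(np.round(w, 3))   # about 0.071 for correct, 0.167 for misclassified samples
```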
| 1463 | + |
| 1464 | + |
| 1465 | +\section{Summary} |
| 1466 | + |
| 1467 | + |
1263 | 1468 | \newpage |
1264 | 1469 |
|
1265 | 1470 | ... to be continued ... |
|