
Commit db9cfea

committed
equation ch7
1 parent b56d5dc commit db9cfea

2 files changed

Lines changed: 205 additions & 0 deletions

File tree

docs/equations/pymle-equations.pdf

15.4 KB
Binary file not shown.

docs/equations/pymle-equations.tex

Lines changed: 205 additions & 0 deletions
@@ -5,6 +5,7 @@
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{enumerate}
+\usepackage{caption}

\setlength\parindent{0pt}

@@ -1260,6 +1261,210 @@ \subsection{The scoring metrics for multiclass classification}
\section{Summary}


%%%%%%%%%%%%%%%
% CHAPTER 7
%%%%%%%%%%%%%%%

\chapter{Combining Different Models for Ensemble Learning}

\section{Learning with ensembles}

To predict a class label via simple majority or plurality voting, we combine the predicted class labels of each individual classifier $C_j$ and select the class label $\hat{y}$ that received the most votes:

\[
\hat{y} = \text{mode} \{ C_1 (\mathbf{x}), C_2 (\mathbf{x}), \dots, C_m (\mathbf{x}) \}
\]

For example, in a binary classification task where $class1 = -1$ and $class2 = +1$, we can write the majority vote prediction as follows:

\[
C(\mathbf{x}) = \text{sign} \Bigg[ \sum_{j=1}^{m} C_j (\mathbf{x}) \Bigg] = \begin{cases}
1 & \text{ if } \sum_j C_j (\mathbf{x}) \ge 0 \\
-1 & \text{ otherwise }.
\end{cases}
\]

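A minimal NumPy sketch of this sign-based majority vote, assuming three hypothetical base-classifier outputs in $\{-1, +1\}$ for a single sample:

\begin{verbatim}
import numpy as np

# hypothetical predictions C_1(x), C_2(x), C_3(x) for one sample
votes = np.array([1, 1, -1])

# sign of the summed votes implements the equation above
y_hat = 1 if votes.sum() >= 0 else -1
print(y_hat)  # -> 1
\end{verbatim}
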
To illustrate why ensemble methods can work better than individual classifiers alone, let's apply the simple concepts of combinatorics. For the following example, we make the assumption that all $n$ base classifiers for a binary classification task have an equal error rate $\epsilon$. Furthermore, we assume that the classifiers are independent and the error rates are not correlated. Under those assumptions, we can simply express the error probability of an ensemble of base classifiers as a probability mass function of a binomial distribution:

\[
P(y \ge k) = \sum_{k}^{n} \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k} = \epsilon_{\text{ensemble}}
\]

Here, $\binom{n}{k}$ is the binomial coefficient \textit{n choose k}, and $k$ is the smallest number of misclassifying base classifiers that still produces a wrong majority vote ($k = \lceil n/2 \rceil$ for odd $n$). In other words, we compute the probability that the prediction of the ensemble is wrong. Now let's take a look at a more concrete example of 11 base classifiers ($n=11$) with an error rate of 0.25 ($\epsilon = 0.25$):

\[
P(y \ge k) = \sum_{k=6}^{11} \binom{11}{k} 0.25^k (1 - 0.25)^{11-k} = 0.034
\]

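This ensemble error can be checked numerically; the following is a minimal SciPy-based sketch (the helper name \texttt{ensemble\_error} is only illustrative):

\begin{verbatim}
import math
from scipy.special import comb

def ensemble_error(n_classifier, error):
    # smallest number of wrong base classifiers that flips the majority vote
    k_start = int(math.ceil(n_classifier / 2.0))
    probs = [comb(n_classifier, k) * error**k *
             (1 - error)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)

print(ensemble_error(n_classifier=11, error=0.25))  # -> ~0.034
\end{verbatim}
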
\section{Implementing a simple majority vote classifier}

Our goal is to build a stronger meta-classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more precise mathematical terms, we can write the weighted majority vote as follows:

\[
\hat{y} = \text{arg} \max_i \sum_{j=1}^{m} w_j \chi_A \big(C_j (\mathbf{x})=i\big)
\]

Here, $w_j$ is a weight associated with the base classifier $C_j$, $\hat{y}$ is the predicted class label of the ensemble, $\chi_A$ is the characteristic function $\big[ C_j(\mathbf{x}) = i \in A \big]$, and $A$ is the set of unique class labels.

Let's assume that we have an ensemble of three base classifiers $C_j \ (j \in \{1, 2, 3\})$ and want to predict the class label of a given sample instance $\mathbf{x}$. Two out of three base classifiers predict the class label 0, and one, $C_3$, predicts that the sample belongs to class 1. If we weight the predictions of each base classifier equally, the majority vote will predict that the sample belongs to class 0:

\[
C_1(\mathbf{x}) \rightarrow 0, \quad C_2 (\mathbf{x}) \rightarrow 0, \quad C_3(\mathbf{x}) \rightarrow 1
\]

\[
\hat{y} = \text{mode} \{0, 0, 1\} = 0
\]

Now let's assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient of 0.2, respectively:

\[
\hat{y} = \text{arg}\max_i \sum_{j=1}^{m} w_j \chi_A \big( C_j(\mathbf{x}) = i \big)
\]

\[
= \text{arg}\max_i \big[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1 \big] = 1
\]

More intuitively, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ has three times more weight than the predictions by $C_1$ or $C_2$, respectively. We can write this as follows:

\[
\hat{y} = \text{mode}\{0,0,1,1,1\} = 1
\]

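A minimal NumPy sketch of this weighted vote over class labels, using the hypothetical weights and predictions from the example above:

\begin{verbatim}
import numpy as np

preds = np.array([0, 0, 1])            # class labels predicted by C_1, C_2, C_3
weights = np.array([0.2, 0.2, 0.6])    # classifier weights w_j

# sum the weights per class label and pick the label with the largest total
y_hat = np.argmax(np.bincount(preds, weights=weights))
print(y_hat)  # -> 1
\end{verbatim}
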
[...] The modified version of the majority vote for predicting class labels from probabilities can be written as follows:

\[
\hat{y} = \text{arg} \max_i \sum^{m}_{j=1} w_j p_{ij}
\]

Here, $p_{ij}$ is the predicted probability of the $j$th classifier for class label $i$.

To continue with our previous example, let's assume that we have a binary classification problem with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers $C_j \ (j \in \{1, 2, 3\})$. Let's assume that the classifiers $C_j$ return the following class membership probabilities for a particular sample $\mathbf{x}$:

\[
C_1(\mathbf{x}) \rightarrow [0.9, 0.1], \quad C_2 (\mathbf{x}) \rightarrow [0.8, 0.2], \quad C_3(\mathbf{x}) \rightarrow [0.4, 0.6]
\]

We can then calculate the individual class probabilities as follows:

\[
p(i_0 \mid \mathbf{x}) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58
\]

\[
p(i_1 \mid \mathbf{x}) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42
\]

\[
\hat{y} = \text{arg} \max_i \big[ p(i_0 \mid \mathbf{x}), p(i_1 \mid \mathbf{x}) \big] = 0
\]

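The same soft vote can be sketched with NumPy's weighted average, using the hypothetical probabilities and weights from the example above:

\begin{verbatim}
import numpy as np

# rows: [p(class 0), p(class 1)] returned by C_1, C_2, C_3
probas = np.array([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.4, 0.6]])
weights = np.array([0.2, 0.2, 0.6])

# weighted average probability per class, then pick the most probable class
p = np.average(probas, axis=0, weights=weights)
print(p)             # -> [0.58 0.42]
print(np.argmax(p))  # -> 0
\end{verbatim}
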
\subsection{Combining different algorithms for classification with majority vote}
\section{Evaluating and tuning the ensemble classifier}
\section{Bagging -- building an ensemble of classifiers from bootstrap samples}
\section{Leveraging weak learners via adaptive boosting}

[...] The original boosting procedure is summarized in four key steps as follows:

\begin{enumerate}
\item Draw a random subset of training samples $d_1$ without replacement from the training set $D$ to train a weak learner $C_1$.
\item Draw a second random training subset $d_2$ without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner $C_2$.
\item Find the training samples $d_3$ in the training set $D$ on which $C_1$ and $C_2$ disagree to train a third weak learner $C_3$.
\item Combine the weak learners $C_1, C_2$, and $C_3$ via majority voting.
\end{enumerate}

[...] Now that we have a better understanding of the basic concept of AdaBoost, let's take a more detailed look at the algorithm using pseudocode. For clarity, we will denote element-wise multiplication by the cross symbol $(\times)$ and the dot product between two vectors by a dot symbol $(\cdot)$, respectively. The steps are as follows:

\begin{enumerate}
\item Set weight vector $\mathbf{w}$ to uniform weights where $\sum_i w_i = 1$.
\item For $j$ in $m$ boosting rounds, do the following:
\begin{enumerate}
\item Train a weighted weak learner: $C_j = train(\mathbf{X}, \mathbf{y}, \mathbf{w})$.
\item Predict class labels: $\mathbf{\hat{y}} = predict(C_j, \mathbf{X})$.
\item Compute the weighted error rate: $\epsilon = \mathbf{w} \cdot (\mathbf{\hat{y}} \ne \mathbf{y})$.
\item Compute the coefficient $\alpha_j$: $\alpha_j=0.5 \log \frac{1 - \epsilon}{\epsilon}$.
\item Update the weights: $\mathbf{w} := \mathbf{w} \times \exp \big( -\alpha_j \times \mathbf{\hat{y}} \times \mathbf{y} \big)$.
\item Normalize weights to sum to 1: $\mathbf{w}:= \mathbf{w} / \sum_i w_i$.
\end{enumerate}
\item Compute the final prediction: $\mathbf{\hat{y}} = \big( \sum^{m}_{j=1} \big( \alpha_j \times predict(C_j, \mathbf{X}) \big) > 0 \big)$.
\end{enumerate}

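One boosting round of this pseudocode can be sketched directly in NumPy; the function name \texttt{adaboost\_round} is only illustrative, and class labels are assumed to be in $\{-1, +1\}$:

\begin{verbatim}
import numpy as np

def adaboost_round(w, y_true, y_pred):
    # weighted error rate: sum of the weights of misclassified samples
    incorrect = (y_pred != y_true).astype(float)
    epsilon = np.dot(w, incorrect)
    # classifier coefficient alpha_j
    alpha = 0.5 * np.log((1.0 - epsilon) / epsilon)
    # element-wise weight update, then normalization to sum to 1
    w = w * np.exp(-alpha * y_pred * y_true)
    return alpha, w / w.sum()
\end{verbatim}
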
Note that the expression ($\mathbf{\hat{y}} \ne \mathbf{y}$) in step 2c refers to a binary vector of 1s and 0s, where a 1 is assigned if the prediction is incorrect and a 0 is assigned otherwise.

\begin{table}[!htbp]
\centering
\caption*{}
\begin{tabular}{r | c c c c c | l}
\hline
Sample index & $x$ & $y$ & Weight & $\hat{y}\ (x \le 3.0)$ & Correct? & Updated weight \\ \hline
1 & 1.0 & 1 & 0.1 & 1 & Yes & 0.072 \\
2 & 2.0 & 1 & 0.1 & 1 & Yes & 0.072 \\
3 & 3.0 & 1 & 0.1 & 1 & Yes & 0.072 \\
4 & 4.0 & -1 & 0.1 & -1 & Yes & 0.072 \\
5 & 5.0 & -1 & 0.1 & -1 & Yes & 0.072 \\
6 & 6.0 & -1 & 0.1 & -1 & Yes & 0.072 \\
7 & 7.0 & 1 & 0.1 & -1 & No & 0.167 \\
8 & 8.0 & 1 & 0.1 & -1 & No & 0.167 \\
9 & 9.0 & 1 & 0.1 & -1 & No & 0.167 \\
10 & 10.0 & -1 & 0.1 & -1 & Yes & 0.072 \\ \hline
\end{tabular}
\end{table}

Since the computation of the weight updates may look a little bit complicated at first, we will now follow the calculation step by step. We start by computing the weighted error rate $\epsilon$ as described in step 2c:

\[
\epsilon = 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 0 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 1 + 0.1 \times 0
\]
\[
= \frac{3}{10} = 0.3
\]

Next, we compute the coefficient $\alpha_j$ (shown in step 2d), which is later used in step 2e to update the weights, as well as for the weights in the majority vote prediction (step 3):

\[
\alpha_j = 0.5 \log \Bigg( \frac{1 - \epsilon}{\epsilon} \Bigg) \approx 0.424
\]

After we have computed the coefficient $\alpha_j$, we can now update the weight vector using the following equation:

\[
\mathbf{w} := \mathbf{w} \times \exp ( -\alpha_j \times \mathbf{\hat{y}} \times \mathbf{y})
\]

Here, $\mathbf{\hat{y}} \times \mathbf{y}$ is an element-wise multiplication between the vectors of the predicted and true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will have a positive sign so that we decrease the $i$th weight since $\alpha_j$ is a positive number as well:

\[
0.1 \times \exp (-0.424 \times 1 \times 1) \approx 0.065
\]

Similarly, we will increase the $i$th weight if $\hat{y}_i$ predicted the label incorrectly like this:

\[
0.1 \times \exp (-0.424 \times 1 \times (-1)) \approx 0.153
\]

Or like this:

\[
0.1 \times \exp (-0.424 \times (-1) \times 1) \approx 0.153
\]

After we update each weight in the weight vector, we normalize the weights so that they sum up to 1 (step 2f):

\[
\mathbf{w} := \frac{\mathbf{w}}{\sum_i w_i}
\]

Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$.

Thus, each weight that corresponds to a correctly classified sample will be reduced from the initial value of $0.1$ to $0.065 / 0.914 \approx 0.071$ for the next round of boosting. Similarly, the weight of each incorrectly classified sample will increase from $0.1$ to $0.153 / 0.914 \approx 0.167$.

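A minimal NumPy sketch reproducing this walkthrough, using the true labels and the stump predictions ($x \le 3.0 \Rightarrow \hat{y} = 1$) from the table above:

\begin{verbatim}
import numpy as np

y_true = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])    # true labels
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])  # stump predictions
w = np.full(10, 0.1)                                       # uniform initial weights

epsilon = np.dot(w, (y_pred != y_true).astype(float))     # -> 0.3
alpha = 0.5 * np.log((1 - epsilon) / epsilon)              # -> ~0.424
w_new = w * np.exp(-alpha * y_pred * y_true)
w_new /= w_new.sum()                                        # -> ~0.071 / ~0.167
\end{verbatim}
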
\section{Summary}

\newpage

... to be continued ...