Commit b541b53

adding regularized weight update for clarification
1 parent eaf5875 commit b541b53

2 files changed

Lines changed: 28 additions & 19 deletions

File tree

docs/equations/pymle-equations.pdf

353 Bytes
Binary file not shown.

docs/equations/pymle-equations.tex

Lines changed: 28 additions & 19 deletions
@@ -216,7 +216,7 @@ \section{Artificial neurons -- a brief glimpse into the early history of machine
216216
\end{enumerate}
217217
\end{enumerate}
218218

219-
Here, the output value is the class label predicted by the unit step function that we de ned earlier, and the simultaneous update of each weight $w_j$ in the weight vector $\mathbf{w}$ can be more formally written as:
219+
Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector $\mathbf{w}$ can be more formally written as:
220220

221221
\[
222222
w_j := w_j + \Delta w_j
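As a quick numeric illustration of the update rule above, here is a minimal NumPy sketch of a single perceptron step; the learning rate, sample, and label are made-up values, and the update $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$ is assumed from the surrounding text.

    import numpy as np

    # Minimal sketch of one perceptron step, w_j := w_j + Delta w_j, assuming
    # Delta w_j = eta * (y - y_hat) * x_j; eta, x, and y are made-up values.
    eta = 0.1                                  # learning rate
    w = np.array([0.0, 0.2, -0.3])             # w[0] is the bias unit
    x, y = np.array([0.5, -1.0]), 1            # one training sample and its true label

    net_input = w[0] + np.dot(w[1:], x)
    y_hat = 1 if net_input >= 0.0 else -1      # unit step function
    update = eta * (y - y_hat)
    w[1:] += update * x                        # simultaneous update of each w_j
    w[0] += update                             # bias update (x_0 = 1)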
@@ -450,7 +450,7 @@ \subsection{Logistic regression intuition and conditional probabilities}
450450

451451
\subsection{Learning the weights of the logistic cost function}
452452

453-
In the previous chapter, we de ned the sum-squared-error cost function:
453+
In the previous chapter, we defined the sum-squared-error cost function:
454454

455455
\[
456456
J(\mathbf{w}) = \frac{1}{2} \sum_i \bigg( \phi \big( z^{(i)} \big) - y^{(i)} \bigg)^2.
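For reference, a minimal NumPy sketch of this sum-squared-error cost, using the identity (linear) activation of Adaline; X, y, and w are made-up values.

    import numpy as np

    # Minimal sketch of J(w) = 1/2 * sum_i (phi(z_i) - y_i)^2 with a linear
    # activation phi(z) = z, as in Adaline; X, y, and w are made-up values.
    X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.5]])
    y = np.array([1.0, -1.0, 1.0])
    w = np.array([0.1, -0.2, 0.3])             # w[0] is the bias unit

    z = w[0] + X.dot(w[1:])                    # net input
    phi = z                                    # linear activation
    cost = 0.5 * np.sum((phi - y) ** 2)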
@@ -571,13 +571,22 @@ \subsection{Tackling overfitting via regularization}
571571

572572
Here, $\lambda$ is the so-called regularization parameter.
573573

574-
In order to apply regularization, we just need to add the regularization term to the cost function that we de ned for logistic regression to shrink the weights:
574+
In order to apply regularization, we just need to add the regularization term to the cost function that we defined for logistic regression to shrink the weights:
575575

576576
\[
577-
J(\mathbf{w}) = \sum_{i=1}^{n} \bigg[ - y^{(i)} \log \big( \phi(z^{(i)}) \big) - \big( 1 - y ^{(i)} \big) \log \big( 1 - \phi(z^{(i)}) \big) \bigg] + \frac{\lambda}{2} \lVert \mathbf{w}\rVert^2
577+
J(\mathbf{w}) = - \sum_{i=1}^{n} \bigg[ y^{(i)} \log \big( \phi(z^{(i)}) \big) + \big( 1 - y^{(i)} \big) \log \big( 1 - \phi(z^{(i)}) \big) \bigg] + \frac{\lambda}{2} \lVert \mathbf{w}\rVert^2
578578
\]
579579

580-
Via the regularization parameter $\lambda$, we can then control how well we t the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
580+
Then, we have the following regularized weight update for weight $w_j$:
581+
582+
\[
583+
\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \bigg( y^{(i)} - \phi(z^{(i)}) \bigg)x_{j}^{(i)} - \eta \lambda w_j,
584+
\] for $j \in \{1, 2, \dots, m\}$ (i.e., $j \neq 0$), since we don't regularize the bias unit $w_0$. \\
585+
586+
587+
588+
589+
Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
581590

582591
The parameter \textit{C} that is implemented for the \textit{LogisticRegression} class in scikit-learn comes from a convention in support vector machines, which will be the topic of the next section. \textit{C} is directly related to the regularization parameter $\lambda$ , which is its inverse:
583592

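To make the regularized update concrete, here is a minimal NumPy sketch of one gradient-descent step for logistic regression with L2 shrinkage; X, y, eta, and lam are made-up values, and the bias $w_0$ is left unregularized as stated above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Minimal sketch of one regularized step:
    # Delta w_j = eta * sum_i (y_i - phi(z_i)) * x_ij - eta * lambda * w_j for j = 1..m,
    # with the bias w_0 updated without shrinkage; all values below are made up.
    X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.5]])
    y = np.array([1.0, 0.0, 1.0])
    w = np.zeros(1 + X.shape[1])               # w[0] is the bias unit
    eta, lam = 0.1, 0.01                       # learning rate and regularization strength

    phi = sigmoid(w[0] + X.dot(w[1:]))
    errors = y - phi
    w[1:] += eta * X.T.dot(errors) - eta * lam * w[1:]   # regularized weights j = 1..m
    w[0] += eta * errors.sum()                            # bias: no regularization
    # Note: scikit-learn's C parameter corresponds to 1 / lam, as described below.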
@@ -614,7 +623,7 @@ \subsection{Maximum margin intuition}
614623
\Rightarrow \mathbf{w}^T \big( \mathbf{x}_{pos} - \mathbf{x}_{neg} \big) = 2
615624
\]
616625

617-
We can normalize this by the length of the vector $\mathbf{w}$, which is de ned as follows:
626+
We can normalize this by the length of the vector $\mathbf{w}$, which is defined as follows:
618627

619628
\[
620629
\lVert \mathbf{w} \rVert = \sqrt{\sum_{j=1}^{m} w_{j}^{2}}
@@ -812,7 +821,7 @@ \section{Bringing features onto the same scale}
812821
\section{Selecting meaningful features}
813822
\subsection{Sparse solutions with L1 regularization}
814823

815-
We recall from \textit{Chapter 3, A Tour of Machine Learning Classfiers Using Scikit-learn}, that L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we de ned the L2 norm of our weight vector w as follows:
824+
We recall from \textit{Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn}, that L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we defined the L2 norm of our weight vector $\mathbf{w}$ as follows:
816825

817826
\[
818827
L2: \lVert \mathbf{w} \rVert^{2}_{2} = \sum_{j=1}^{m} w^{2}_{j}
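As a small numeric aside, the sketch below contrasts the L2 penalty with the L1 penalty that produces the sparse solutions discussed in this subsection; lam and w are made-up values, and the bias entry is excluded from the penalty.

    import numpy as np

    # Minimal sketch contrasting the L2 and L1 penalty terms; lam and w are made up,
    # and w[0] (the bias) is excluded from the penalty.
    lam = 0.1
    w = np.array([0.0, 3.0, -0.5, 0.0, 1.2])

    l2_penalty = lam * np.sum(w[1:] ** 2)      # lambda * ||w||_2^2
    l1_penalty = lam * np.sum(np.abs(w[1:]))   # lambda * ||w||_1, which favors sparse weights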
@@ -1487,7 +1496,7 @@ \section{Introducing the bag-of-words model}
14871496
\subsection{Transforming words into feature vectors}
14881497
\subsection{Assessing word relevancy via term frequency-inverse document frequency}
14891498

1490-
The \textit{tf-idf} can be de ned as the product of the \textit{term frequency} and the \textit{inverse document frequency}:
1499+
The \textit{tf-idf} can be defined as the product of the \textit{term frequency} and the \textit{inverse document frequency}:
14911500

14921501
\[
14931502
\text{tf-idf}(t, d) = \text{tf} (t, d) \times \text{idf}(t, d)
@@ -1502,7 +1511,7 @@ \subsection{Assessing word relevancy via term frequency-inverse document frequen
15021511

15031512
where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$. Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
15041513

1505-
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the \textit{TfidfTransformer} calculates the tf-idfs slightly differently compared to the standard textbook equations that we de ned earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:
1514+
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the \textit{TfidfTransformer} calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:
15061515

15071516
\[
15081517
\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}
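A minimal sketch of the idf variant shown above and of the tf-idf product; the counts below are made up and chosen so that a term occurring in every document receives a tf-idf of zero.

    import numpy as np

    # Minimal sketch of idf(t, d) = log((1 + n_d) / (1 + df(d, t))) and
    # tf-idf(t, d) = tf(t, d) * idf(t, d); the counts are made-up values.
    n_d = 3                                    # total number of documents
    df = 3                                     # documents that contain the term
    tf = 2                                     # occurrences of the term in document d

    idf = np.log((1.0 + n_d) / (1.0 + df))
    tf_idf = tf * idf                          # 0.0 here, since the term appears in every document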
@@ -1618,7 +1627,7 @@ \subsection{Visualizing the important characteristics of a dataset}
16181627
\section{Implementing an ordinary least squares linear regression model}
16191628
\subsection{Solving regression for regression parameters with gradient descent}
16201629

1621-
Consider our implementation of the \textit{ADAptive LInear NEuron (Adaline)} from \textit{Chapter 2, Training Machine Learning Algorithms for Classifcation}; we remember that the artificial neuron uses a linear activation function and we de ned a cost function $J (\cdot)$, which we minimized to learn the weights via optimization algorithms, such as \textit{Gradient Descent (GD)} and \textit{Stochastic Gradient Descent (SGD)}. This cost function in Adaline is the \textit{Sum of Squared Errors (SSE)}. This is identical to the OLS cost function that we defined:
1630+
Consider our implementation of the \textit{ADAptive LInear NEuron (Adaline)} from \textit{Chapter 2, Training Machine Learning Algorithms for Classification}; we remember that the artificial neuron uses a linear activation function and we defined a cost function $J (\cdot)$, which we minimized to learn the weights via optimization algorithms, such as \textit{Gradient Descent (GD)} and \textit{Stochastic Gradient Descent (SGD)}. This cost function in Adaline is the \textit{Sum of Squared Errors (SSE)}. This is identical to the OLS cost function that we defined:
16221631

16231632
\[
16241633
J(w) = \frac{1}{2} \sum_{i=1}^{n} \big( y^{(i)} - \hat{y}^{(i)} \big)^2
@@ -1637,7 +1646,7 @@ \subsection{Estimating the coefficient of a regression model via scikit-learn}
16371646
\section{Fitting a robust regression model using RANSAC}
16381647
\section{Evaluating the performance of linear regression models}
16391648

1640-
Another useful quantitative measure of a model's performance is the so-called \textit{Mean Squared Error (MSE)}, which is simply the average value of the SSE cost function that we minimize to t the linear regression model. The MSE is useful to for comparing different regression models or for tuning their parameters via a grid search and cross-validation:
1649+
Another useful quantitative measure of a model's performance is the so-called \textit{Mean Squared Error (MSE)}, which is simply the average value of the SSE cost function that we minimize to fit the linear regression model. The MSE is useful for comparing different regression models or for tuning their parameters via a grid search and cross-validation:
16411650

16421651
\[
16431652
MSE = \frac{1}{n} \sum_{i=1}^{n} \big( y^{(i)} - \hat{y}^{(i)} \big)^2
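A minimal sketch of the MSE computation with made-up targets and predictions; scikit-learn exposes the same quantity as sklearn.metrics.mean_squared_error.

    import numpy as np

    # Minimal sketch of MSE = (1/n) * sum_i (y_i - y_hat_i)^2; y and y_hat are made up.
    y = np.array([3.0, -0.5, 2.0, 7.0])        # true target values
    y_hat = np.array([2.5, 0.0, 2.0, 8.0])     # model predictions

    mse = np.mean((y - y_hat) ** 2)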
@@ -1716,7 +1725,7 @@ \subsection{Modeling nonlinear relationships in the Housing Dataset}
17161725
\subsection{Dealing with nonlinear relationships using random forests}
17171726
\subsubsection{Decision tree regression}
17181727

1719-
When we used decision trees for classi cation, we de ned entropy as a measure of impurity to determine which feature split maximizes the \textit{Information Gain (IG)}, which can be de ned as follows for a binary split:
1728+
When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the \textit{Information Gain (IG)}, which can be defined as follows for a binary split:
17201729

17211730
\[
17221731
IG(D_p, x_i) = I(D_p) - \frac{N_{left}}{N_{p}} I (D_{left}) - \frac{N_{right}}{N_p} I (D_{right})
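A minimal sketch of this binary-split information gain, using the within-node variance (the MSE around the node mean) as the impurity measure for regression; the target values and the split point are made up.

    import numpy as np

    # Minimal sketch of IG(D_p, x_i) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right),
    # using each node's MSE around its mean as the impurity I(.); data and split are made up.
    def impurity(y):
        return float(np.mean((y - np.mean(y)) ** 2)) if len(y) else 0.0

    y_parent = np.array([1.0, 1.5, 3.0, 3.2, 5.0, 5.1])
    y_left, y_right = y_parent[:2], y_parent[2:]           # one candidate binary split

    n_p, n_l, n_r = len(y_parent), len(y_left), len(y_right)
    info_gain = impurity(y_parent) - (n_l / n_p) * impurity(y_left) - (n_r / n_p) * impurity(y_right)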
@@ -1757,7 +1766,7 @@ \section{Grouping objects by similarity using k-means}
17571766
\item Randomly pick $k$ centroids from the sample points as initial cluster centers.
17581767
\item Assign each sample to the nearest centroid $\mu^{(j)}, \quad j \in {1, ..., k}.$
17591768
\item Move the centroids to the center of the samples that were assigned to it.
1760-
\item Repeat steps 2 and 3 until the cluster assignments do not change or a user-de ned tolerance or a maximum number of iterations is reached.
1769+
\item Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or a maximum number of iterations is reached.
17611770
\end{enumerate}
17621771

17631772
Now the next question is \textit{how do we measure similarity between objects?} We can de ne similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the \textit{squared Euclidean distance} between two points $\mathbf{x}$ and $\mathbf{y}$ in $m$-dimensional space:
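A minimal sketch of one assignment-and-update pass of the steps above, using the squared Euclidean distance just mentioned; the samples and the two initial centroids are made-up values.

    import numpy as np

    # Minimal sketch of k-means steps 2 and 3 with squared Euclidean distance;
    # X and the initial centroids are made-up values.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    centroids = X[[0, 2]].astype(float)        # step 1: pick k = 2 initial centroids

    # step 2: assign each sample to its nearest centroid
    d2 = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)

    # step 3: move each centroid to the mean of its assigned samples
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(len(centroids))])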
@@ -1819,7 +1828,7 @@ \subsection{Hard versus soft clustering}
18191828
\item Specify the number of $k$ centroids and randomly assign the cluster memberships for each point.
18201829
\item Compute the cluster centroids $\mathbf{\mu^{(j)}}, j \in \{1, \dots, k \}$.
18211830
\item Update the cluster memberships for each point.
1822-
\item Repeat steps 2 and 3 until the membership coefficients do not change or a user-de ned tolerance or a maximum number of iterations is reached.
1831+
\item Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance or a maximum number of iterations is reached.
18231832
\end{enumerate}
18241833

18251834
The objective function of FCM -- we abbreviate it by $J_m$ -- looks very similar to the within cluster sum-squared-error that we minimize in $k$-means:
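A minimal sketch of that objective, assuming the usual FCM form $J_m = \sum_i \sum_j \big(w^{(i,j)}\big)^m \lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \rVert_2^2$ since the equation itself lies outside this hunk; the samples, centroids, memberships, and fuzziness coefficient are made up.

    import numpy as np

    # Minimal sketch of J_m = sum_i sum_j (w_ij)^m * ||x_i - mu_j||^2 (assumed standard
    # FCM form); X, mu, W, and the fuzziness coefficient m are made-up values.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]])
    mu = np.array([[1.1, 0.9], [5.0, 5.0]])    # k = 2 cluster centroids
    W = np.array([[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]])   # memberships, each row sums to 1
    m = 2.0                                    # fuzziness coefficient

    d2 = ((X[:, np.newaxis, :] - mu[np.newaxis, :, :]) ** 2).sum(axis=2)
    J_m = np.sum((W ** m) * d2)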
@@ -1907,7 +1916,7 @@ \subsection{Single-layer neural network recap}
19071916

19081917
In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step into the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the \textit{Sum of Squared Errors (SSE)} cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$ , which we chose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.
19091918

1910-
In gradient descent optimization, we updated all weights simultaneously after each epoch, and we de ned the partial derivative for each weight $w_j$ in the weight vector
1919+
In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector
19111920
$\mathbf{w}$ as follows:
19121921

19131922
\[
@@ -2083,7 +2092,7 @@ \subsection{Computing the logistic cost function}
20832092
a^{(i)} = \phi \big( z^{(i)} \big).
20842093
\]
20852094

2086-
Now, let's add a regularization term, which allows us to reduce the degree of over tting. As you will recall from earlier chapters, the L2 and L1 regularization terms are de ned as follows (remember that we don't regularize the bias units):
2095+
Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):
20872096

20882097
\[
20892098
L2 = \lambda \lVert \mathbf{w} \rVert^{2}_{2} = \lambda \sum_{j=1}^{m} w_{j}^{2} \text{ and } L1 = \lambda \lVert \mathbf{w} \rVert_{1} = \lambda \sum_{j=1}^{m} | w_j |.
@@ -2229,8 +2238,8 @@ \subsection{Training neural networks via backpropagation}
22292238
\section{Developing your intuition for backpropagation}
22302239
\section{Debugging neural networks with gradient checking}
22312240

2232-
In the previous sections, we de ned a cost function $J(\mathbf{W})$ where $\mathbf{W}$ is the matrix
2233-
of the weight coefficients of an artificial network. Note that $J(\mathbf{W})$ is -- roughly speaking -- a "stacked" matrix consisting of the matrices $\mathbf{W}^{(1)} $ and $W^{(2)}$ in a multi-layer perceptron with one hidden unit. We de ned $\mathbf{W}^{(1)}$ as the $h \times [m+1]$-dimensional matrix that connects the input layer to the hidden layer, where $h$ is the number of hidden units and $m$ is the number of features (input units). The matrix $\mathbf{W}^{(2)}$ that connects the hidden layer to the output layer has the dimensions $t \times h$, where $t$ is the number of output units. We then calculated the derivative of the cost function for a weight $w_{i, j}^{l}$ as follows:
2241+
In the previous sections, we defined a cost function $J(\mathbf{W})$ where $\mathbf{W}$ is the matrix
2242+
of the weight coefficients of an artificial neural network. Note that $\mathbf{W}$ is -- roughly speaking -- a "stacked" matrix consisting of the matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in a multi-layer perceptron with one hidden layer. We defined $\mathbf{W}^{(1)}$ as the $h \times [m+1]$-dimensional matrix that connects the input layer to the hidden layer, where $h$ is the number of hidden units and $m$ is the number of features (input units). The matrix $\mathbf{W}^{(2)}$ that connects the hidden layer to the output layer has the dimensions $t \times h$, where $t$ is the number of output units. We then calculated the derivative of the cost function for a weight $w_{i, j}^{(l)}$ as follows:
22342243

22352244
\[
22362245
\frac{\partial}{\partial w_{i, j}^{(i)}}
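As an illustration of the comparison this derivative feeds into, here is a minimal sketch of numerical gradient checking; the quadratic cost below is only a stand-in for the network's $J(\mathbf{W})$, and w and eps are made-up values.

    import numpy as np

    # Minimal sketch of gradient checking: compare the analytical gradient of a cost J
    # with the symmetric difference quotient (J(w + eps) - J(w - eps)) / (2 * eps).
    # The quadratic cost is a stand-in for the network's J(W); w and eps are made up.
    def J(w):
        return 0.5 * np.sum(w ** 2)

    def analytical_grad(w):
        return w

    w = np.array([0.3, -1.2, 0.8])
    eps = 1e-5
    num_grad = np.zeros_like(w)
    for j in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        num_grad[j] = (J(w_plus) - J(w_minus)) / (2.0 * eps)

    ana_grad = analytical_grad(w)
    relative_error = np.linalg.norm(num_grad - ana_grad) / (np.linalg.norm(num_grad) + np.linalg.norm(ana_grad))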
@@ -2294,7 +2303,7 @@ \subsection{Logistic function recap}
22942303
\phi_{logistic} (z) = \frac{1}{1 + e^{-z}}
22952304
\]
22962305

2297-
Here, the scalar variable $z$ is de ned as the net input:
2306+
Here, the scalar variable $z$ is defined as the net input:
22982307

22992308
\[
23002309
z = w_0 x_0 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}
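A minimal sketch of the net input and logistic activation above; w and x are made-up values, with $x_0 = 1$ multiplying the bias $w_0$.

    import numpy as np

    # Minimal sketch of z = w^T x and phi_logistic(z) = 1 / (1 + e^{-z});
    # w and x are made-up values, and x[0] = 1 multiplies the bias w_0.
    w = np.array([0.5, -1.0, 2.0])
    x = np.array([1.0, 0.3, 0.7])

    z = np.dot(w, x)
    phi = 1.0 / (1.0 + np.exp(-z))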
