% File: docs/equations/pymle-equations.tex
\section{Artificial neurons -- a brief glimpse into the early history of machine learning}
\end{enumerate}
\end{enumerate}
Here, the output value is the class label predicted by the unit step function that we defined earlier, and the simultaneous update of each weight $w_j$ in the weight vector $\mathbf{w}$ can be more formally written as:
\[
w_j := w_j + \Delta w_j
\]
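As an illustrative sketch (the function name and vectorized NumPy layout are mine, not the book's), the simultaneous update of all weights for one training sample can be written as:

```python
import numpy as np

def perceptron_update(w, x, y, eta=0.1):
    """One simultaneous perceptron update for a single sample.

    w : weight vector (w[0] is the bias unit), x : feature vector,
    y : true class label in {-1, 1}, eta : learning rate.
    """
    z = w[0] + np.dot(x, w[1:])          # net input
    y_pred = np.where(z >= 0.0, 1, -1)   # unit step function
    delta = eta * (y - y_pred)           # common factor of each Delta w_j
    w = w.copy()
    w[0] += delta                        # bias update (x_0 = 1)
    w[1:] += delta * x                   # Delta w_j = eta * (y - y_hat) * x_j
    return w
```

Note that when the sample is classified correctly, `y - y_pred` is zero and all weights stay unchanged, which is exactly the behavior of the update rule above.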
\subsection{Logistic regression intuition and conditional probabilities}
\subsection{Learning the weights of the logistic cost function}
In the previous chapter, we defined the sum-squared-error cost function:
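A minimal NumPy sketch of that cost (the helper name and the convention that `w[0]` holds the bias unit are assumptions for illustration):

```python
import numpy as np

def sse_cost(w, X, y):
    """Sum-squared-error cost J(w) = 1/2 * sum_i (y_i - phi(z_i))^2,
    with the identity activation phi(z) = z used by Adaline."""
    errors = y - (X.dot(w[1:]) + w[0])
    return 0.5 * np.sum(errors ** 2)
```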
\subsection{Tackling overfitting via regularization}
Here, $\lambda$ is the so-called regularization parameter.
In order to apply regularization, we just need to add the regularization term to the cost function that we defined for logistic regression to shrink the weights:
Then, we have the following regularized weight updates for weight $w_j$:
\[
w_j := w_j - \eta \left( \frac{\partial J}{\partial w_j} + \lambda w_j \right)
\]
for $j \in \{1, 2, ..., m\}$ (i.e., $j \neq 0$), since we don't regularize the bias unit $w_0$. \\
Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.
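A hedged sketch of such a shrinking update (the function name and explicit bias handling are assumptions, not the book's code):

```python
import numpy as np

def regularized_update(w, grad, lam, eta=0.01):
    """Gradient step with an L2 penalty: every weight except the bias
    w[0] is additionally shrunk by eta * lambda * w_j."""
    w = w - eta * grad           # ordinary gradient step
    w[1:] -= eta * lam * w[1:]   # shrink weights; don't regularize w_0
    return w
```

With `lam = 0` this reduces to the unregularized gradient step; increasing `lam` shrinks the non-bias weights more aggressively on every update.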
The parameter \textit{C} that is implemented for the \textit{LogisticRegression} class in scikit-learn comes from a convention in support vector machines, which will be the topic of the next section. \textit{C} is directly related to the regularization parameter $\lambda$, which is its inverse:
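A tiny sketch of that inverse relationship (the helper names are hypothetical, but the convention $C = 1/\lambda$ is the one scikit-learn uses):

```python
# In scikit-learn's LogisticRegression, C is the inverse of lambda:
def c_from_lambda(lam):
    return 1.0 / lam

def lambda_from_c(C):
    return 1.0 / C
```

So a small value of `C` corresponds to a large $\lambda$, that is, strong regularization.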
\section{Bringing features onto the same scale}
\section{Selecting meaningful features}
\subsection{Sparse solutions with L1 regularization}
We recall from \textit{Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn} that L2 regularization is one approach to reduce the complexity of a model by penalizing large individual weights, where we defined the L2 norm of our weight vector $\mathbf{w}$ as follows:
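For concreteness, both the squared L2 norm and the L1 norm can be computed directly in NumPy (the example vector is made up):

```python
import numpy as np

w = np.array([1.0, -2.0, 2.0])

# Squared L2 norm, penalized by L2 regularization:
l2_sq = np.sum(w ** 2)    # ||w||_2^2 = 1 + 4 + 4 = 9
# L1 norm, whose penalty drives many weights to exactly zero (sparsity):
l1 = np.sum(np.abs(w))    # ||w||_1 = 1 + 2 + 2 = 5
```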
\subsection{Assessing word relevancy via term frequency-inverse document frequency}
where $n_d$ is the total number of documents, and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$. Note that adding the constant 1 to the denominator is optional and serves the purpose of assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that low document frequencies are not given too much weight.
However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the \textit{TfidfTransformer} calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:
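A sketch of the two idf variants, assuming the natural logarithm and scikit-learn's default `smooth_idf=True` behavior (the function names are mine):

```python
import numpy as np

def idf_sklearn(n_d, df):
    """scikit-learn's smoothed idf (smooth_idf=True):
    idf(t) = ln((1 + n_d) / (1 + df(d, t))) + 1"""
    return np.log((1.0 + n_d) / (1.0 + df)) + 1.0

def idf_textbook(n_d, df):
    """Textbook variant: idf(t) = ln(n_d / (1 + df(d, t)))"""
    return np.log(n_d / (1.0 + df))
```

For a term that occurs in all three of three documents, the scikit-learn variant yields exactly 1, whereas the textbook variant yields a negative value; this is the "assigning a non-zero value" behavior described above.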
\subsection{Visualizing the important characteristics of a dataset}
\section{Implementing an ordinary least squares linear regression model}
\subsection{Solving regression for regression parameters with gradient descent}
Consider our implementation of the \textit{ADAptive LInear NEuron (Adaline)} from \textit{Chapter 2, Training Machine Learning Algorithms for Classification}; we remember that the artificial neuron uses a linear activation function and we defined a cost function $J (\cdot)$, which we minimized to learn the weights via optimization algorithms, such as \textit{Gradient Descent (GD)} and \textit{Stochastic Gradient Descent (SGD)}. This cost function in Adaline is the \textit{Sum of Squared Errors (SSE)}. This is identical to the OLS cost function that we defined:
\subsection{Estimating the coefficient of a regression model via scikit-learn}
\section{Fitting a robust regression model using RANSAC}
\section{Evaluating the performance of linear regression models}
Another useful quantitative measure of a model's performance is the so-called \textit{Mean Squared Error (MSE)}, which is simply the average value of the SSE cost function that we minimize to fit the linear regression model. The MSE is useful for comparing different regression models or for tuning their parameters via a grid search and cross-validation:
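A minimal sketch (the helper name is assumed):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average squared residual,
    MSE = 1/n * sum_i (y_i - y_hat_i)^2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)
```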
\subsection{Modeling nonlinear relationships in the Housing Dataset}
\subsection{Dealing with nonlinear relationships using random forests}
\subsubsection{Decision tree regression}
When we used decision trees for classification, we defined entropy as a measure of impurity to determine which feature split maximizes the \textit{Information Gain (IG)}, which can be defined as follows for a binary split:
\[
IG(D_p, x_i) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})
\]
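The information gain formula above can be sketched as follows; using the node MSE as the impurity measure $I(D)$ matches the decision tree regression context, and the helper names are mine:

```python
import numpy as np

def information_gain(parent, left, right, impurity):
    """IG(D_p) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)"""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

def mse_impurity(y):
    """For decision tree regression, the impurity of a node is its MSE:
    the variance of the targets around the node mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)
```

A perfectly separating split sends all low targets left and all high targets right, so both child impurities are zero and the gain equals the parent impurity.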
\section{Grouping objects by similarity using k-means}
\item Randomly pick $k$ centroids from the sample points as initial cluster centers.
\item Assign each sample to the nearest centroid $\mu^{(j)}, \quad j \in \{1, \dots, k\}$.
\item Move the centroids to the center of the samples that were assigned to them.
\item Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}
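The four steps above can be sketched as a compact NumPy loop (a toy implementation under stated assumptions: Euclidean distance, well-separated data so no cluster goes empty; not scikit-learn's implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Steps 1-4 above: pick k samples as initial centroids, assign each
    sample to its nearest centroid, move each centroid to the mean of its
    assigned samples, repeat until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # step 2: assign each sample to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: move centroids to the center of their assigned samples
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once centroids move less than the tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```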
Now the next question is \textit{how do we measure similarity between objects?} We can define similarity as the opposite of distance, and a commonly used distance for clustering samples with continuous features is the \textit{squared Euclidean distance} between two points $\mathbf{x}$ and $\mathbf{y}$ in $m$-dimensional space:
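A one-line NumPy sketch of that distance (the helper name is assumed):

```python
import numpy as np

def squared_euclidean(x, y):
    """d(x, y)^2 = sum_j (x_j - y_j)^2 = ||x - y||_2^2"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - y) ** 2)
```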
\subsection{Hard versus soft clustering}
\item Specify the number of centroids $k$ and randomly assign the cluster memberships for each point.
\item Compute the cluster centroids $\mathbf{\mu}^{(j)}, \; j \in \{1, \dots, k\}$.
\item Update the cluster memberships for each point.
\item Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance or a maximum number of iterations is reached.
\end{enumerate}
The objective function of FCM -- we abbreviate it by $J_m$ -- looks very similar to the within-cluster sum-squared-error that we minimize in $k$-means:
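Under the usual FCM definition (an assumption here; the fuzziness exponent $m$ controls how soft the memberships are), that objective can be written as:
\[
J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \big( w^{(i, j)} \big)^m \left\lVert \mathbf{x}^{(i)} - \mathbf{\mu}^{(j)} \right\rVert_2^2, \quad m \in [1, \infty)
\]
For $m \to 1$ the memberships $w^{(i, j)}$ become hard (0 or 1) and $J_m$ reduces to the $k$-means within-cluster SSE.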
In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step into the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the \textit{Sum of Squared Errors (SSE)} cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$ , which we chose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.
In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector
$\mathbf{w}$ as follows:
\[
\frac{\partial J}{\partial w_j} = - \sum_i \big( y^{(i)} - \phi\big(z^{(i)}\big) \big) x_j^{(i)}
\]
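A sketch of one full-batch gradient descent step for the SSE cost (the function name and the bias-in-`w[0]` layout are assumptions, not the book's implementation):

```python
import numpy as np

def gradient_descent_step(w, X, y, eta=0.01):
    """One epoch of batch gradient descent on the SSE cost with a linear
    activation phi(z) = z: compute the gradient over the whole training
    set, then step in the opposite direction scaled by eta."""
    output = X.dot(w[1:]) + w[0]       # phi(z) = z for all samples
    errors = y - output
    w = w.copy()
    w[0] += eta * errors.sum()         # dJ/dw_0 = -sum_i(errors)
    w[1:] += eta * X.T.dot(errors)     # dJ/dw_j = -sum_i(errors * x_j)
    return w
```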
\subsection{Computing the logistic cost function}
a^{(i)} = \phi\big( z^{(i)} \big).
\]
Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows (remember that we don't regularize the bias units):
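A sketch of the two penalty terms; the assumption that the bias units sit in the first column of each weight matrix (and are therefore skipped) is mine, made for illustration:

```python
import numpy as np

def l2_penalty(lam, W1, W2):
    """L2 term: lambda/2 * sum of squared weights, excluding the first
    column of each matrix, which is assumed to hold the bias units."""
    return (lam / 2.0) * (np.sum(W1[:, 1:] ** 2) + np.sum(W2[:, 1:] ** 2))

def l1_penalty(lam, W1, W2):
    """L1 term: lambda * sum of absolute weights, again excluding biases."""
    return lam * (np.sum(np.abs(W1[:, 1:])) + np.sum(np.abs(W2[:, 1:])))
```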
\subsection{Training neural networks via backpropagation}
\section{Developing your intuition for backpropagation}
\section{Debugging neural networks with gradient checking}
In the previous sections, we defined a cost function $J(\mathbf{W})$, where $\mathbf{W}$ is the matrix of the weight coefficients of an artificial neural network. Note that $\mathbf{W}$ is -- roughly speaking -- a ``stacked'' matrix consisting of the matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in a multi-layer perceptron with one hidden layer. We defined $\mathbf{W}^{(1)}$ as the $h \times [m+1]$-dimensional matrix that connects the input layer to the hidden layer, where $h$ is the number of hidden units and $m$ is the number of features (input units). The matrix $\mathbf{W}^{(2)}$ that connects the hidden layer to the output layer has the dimensions $t \times h$, where $t$ is the number of output units. We then calculated the derivative of the cost function for a weight $w_{i, j}^{(l)}$ as follows:
\[
\frac{\partial}{\partial w_{i, j}^{(l)}} J(\mathbf{W})
\]
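Gradient checking compares this analytical derivative against a numerical approximation; a sketch using the symmetric difference quotient (the function name is mine, and any differentiable cost callable works):

```python
import numpy as np

def numerical_gradient(J, W, eps=1e-5):
    """Approximate dJ/dW element-wise via the symmetric difference
    quotient (J(w + eps) - J(w - eps)) / (2 * eps), perturbing one
    weight at a time and restoring it afterwards."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        plus = J(W)
        W[idx] = orig - eps
        minus = J(W)
        W[idx] = orig                       # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad
```

If the relative difference between this numerical gradient and the backpropagated gradient is small (e.g. below $10^{-7}$), the backpropagation implementation is very likely correct.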
\subsection{Logistic function recap}
\phi_{logistic} (z) = \frac{1}{1 + e^{-z}}
\]
Here, the scalar variable $z$ is defined as the net input:
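For completeness, a direct NumPy sketch of the logistic function:

```python
import numpy as np

def logistic(z):
    """phi_logistic(z) = 1 / (1 + e^(-z)), where z = w^T x is the net input."""
    return 1.0 / (1.0 + np.exp(-z))
```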