I'm taking Professor Andrew Ng's Machine Learning (ML) class via Coursera https://class.coursera.org/ml-006.
This page is just my personal notes on the course topics. Some of it fills in details that I did not see in Professor Ng's lecture notes; I wrote it both to understand the material better and to practice using LaTeX with MathJax in HTML.
We wish to find a linear function \(h_\theta(x)\) where \(x \in \mathbb{R}^{n+1}\), \(\theta \in \mathbb{R}^{n+1}\), and \(n\) is the number of features. We call \(x_j\) the \(j\text{th}\) feature. By convention \(x_0\) is always \(1\) so that we may have a constant offset in this linear function as in: \begin{align} h_\theta(x) & = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dotsb \nonumber\\ & = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dotsb \nonumber\\ & = \sum_{j=0}^n \theta_j x_j \nonumber\\ & = \theta^T x = x^T \theta . \end{align}
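Since \(h_\theta(x) = \theta^T x\) is just a dot product, it is a one-liner in numpy. A minimal sketch (the function and variable names are mine, not from the course code):

```python
import numpy as np

# Minimal sketch of the linear hypothesis h_theta(x) = theta^T x.
# By convention x_0 = 1, so theta_0 acts as the constant offset.
def h(theta, x):
    """Linear hypothesis: the dot product theta^T x."""
    return theta @ x

theta = np.array([1.0, 2.0, 3.0])  # theta_0, theta_1, theta_2
x = np.array([1.0, 4.0, 5.0])      # x_0 = 1, then features x_1, x_2
# h_theta(x) = 1 + 2*4 + 3*5 = 24
```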
We are given \(m\) training examples: vector \(x\) values and a corresponding scalar \(y \in \mathbb{R}\) for each \(x\) vector. We denote \(x^{(i)}\) and \(y^{(i)}\) as the \(i\text{th}\) training example, so that if our model \(h_\theta(x)\) were perfect we would have: \( y^{(i)} = h_\theta(x^{(i)}) . \)
We define the cost function \(J(\theta)\) over the \(m\) training examples as \[ J(\theta) = \frac{1}{2 m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \label{J_2} . \] We wish to find the values of all the components of \(\theta\) that minimize \(J(\theta)\). At this minimum \(J(\theta)\) will have: \[ \left. \frac{\partial J(\theta)}{\partial \theta_j} \right|_{\theta = \theta^*} = 0 \label{partialTheta_2} \] for all \(j = 0, 1, 2, \dotsc, n\), where we define \(\theta^*\) as the value of \(\theta\) where all the partial derivatives are zero.
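This cost function translates directly into numpy. A sketch, with illustrative names and made-up data (the rows of `X` are the training vectors with \(x_0 = 1\) prepended):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2, where the
    rows of X are the training vectors x^(i) with x_0 = 1 prepended."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2.0 * m)

# m = 3 training examples generated exactly by y = 1 + 2*x_1,
# so the cost at theta* = (1, 2) is zero.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta_star = np.array([1.0, 2.0])
```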
Differentiating [\ref{J_2}] we get: \begin{align} \frac{\partial J(\theta)}{\partial \theta_j} & = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} \\ & = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \end{align} with which we can find the value of \(\theta\) that minimizes \(J(\theta)\).
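With this gradient we can minimize \(J(\theta)\) by batch gradient descent. A sketch, where the learning rate `alpha` and the iteration count are illustrative choices of mine, not course values:

```python
import numpy as np

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i),
    vectorized over all components j at once."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

# Data generated exactly by theta = (1, 2); gradient descent should
# recover those values starting from zero.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.zeros(2)
alpha = 0.3                      # learning rate (illustrative)
for _ in range(2000):
    theta = theta - alpha * gradient(theta, X, y)
```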
Ng's notes are good enough on this section. We will summarize them here.
We define the matrix \(X\) to hold all the training example \(x\) vectors; for example, with \(m = 3\) and \(n = 2\), \begin{align} X = \left[ \begin{array}{ccc} x^{(1)}_0 & x^{(1)}_1 & x^{(1)}_2 \\ x^{(2)}_0 & x^{(2)}_1 & x^{(2)}_2 \\ x^{(3)}_0 & x^{(3)}_1 & x^{(3)}_2 \\ \end{array} \right] \mathrm{,} \quad \theta = \left[ \begin{array}{c} \theta_0 \\ \theta_1 \\ \theta_2 \end{array} \right] \end{align} where each row is a training example \(x\) vector, so that now we can define the hypothesis function as a column vector \[ h_\theta(X) = X \theta \] and we can get \(\theta\) from \[ \theta = \left(X^T X\right)^{-1} X^T y \, \mathrm{.} \] See the Wikipedia article on linear least squares.
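The normal equation is also a one-liner. In this sketch (the data values are made up) I use `np.linalg.solve` on the system \((X^T X)\theta = X^T y\) rather than forming the inverse explicitly, which is the numerically safer equivalent:

```python
import numpy as np

# Normal equation theta = (X^T X)^{-1} X^T y, solved as the linear
# system (X^T X) theta = X^T y instead of inverting X^T X.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])   # roughly y = 1 + 2*x_1, with noise
theta = np.linalg.solve(X.T @ X, X.T @ y)
```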
This just augments Professor Andrew Ng's neural network lecture notes, filling in some details. For a general neural network with \( L \) layers, including the input \( x \) and outputs \( h_\Theta(x) \), with \( K \) output classes we have \[ \begin{bmatrix} x_0 \equiv 1 \\ x_1 \\ x_2 \\ \vdots \\ x_{s_1} \end{bmatrix} \equiv \begin{bmatrix} a_0^{(1)} \equiv 1 \\ a_1^{(1)} \\ a_2^{(1)} \\ \vdots \\ a_{s_1}^{(1)} \end{bmatrix} \to \begin{bmatrix} a_0^{(2)} \equiv 1 \\ a_1^{(2)} \\ a_2^{(2)} \\ \vdots \\ a_{s_2}^{(2)} \end{bmatrix} \to \begin{bmatrix} a_0^{(3)} \equiv 1 \\ a_1^{(3)} \\ a_2^{(3)} \\ \vdots \\ a_{s_3}^{(3)} \end{bmatrix} \to \dots \to \begin{bmatrix} a_0^{(L-1)} \equiv 1 \\ a_1^{(L-1)} \\ a_2^{(L-1)} \\ \vdots \\ a_{s_{L-1}}^{(L-1)} \end{bmatrix} \to \begin{bmatrix} a_1^{(L)} \\ a_2^{(L)} \\ \vdots \\ a_{s_{L}}^{(L)} \end{bmatrix} \equiv \begin{bmatrix} {h_\Theta}(x)_1 \\ {h_\Theta}(x)_2 \\ {h_\Theta}(x)_3 \\ \vdots \\ {h_\Theta}(x)_K \\ \end{bmatrix} \label{nn} \] where \( s_l \) denotes the number of values in layer \( l \), excluding the bias values \( x_0 \) and \( a_0^{(l)}\), with \( l = 1, 2, \dotsc, L-1 \), and so \( K = s_L \), and \(a^{(l)}_i \in \mathbb{R}\) where \(i = 0, 1, 2, \dotsc, s_{l}\). All numbers \( a_i^{(l)} \) with \( i = 1, 2, \dotsc, s_l \), \(l = 2, 3, \dotsc, L\), are generated from the previous layer's numbers \( a_i^{(l-1)} \). Note that the superscript on \( a^{(l)} \) is the layer number, not a training-example index as in last week's notes. We define \(x \) as the vector in the first, input layer \[ x \equiv \begin{bmatrix} x_0 \equiv 1 \\ x_1 \\ x_2 \\ \vdots \\ x_{s_1} \end{bmatrix} , \] and so \(h_\Theta(x) \) is a function of the vector \(x\). \( a^{(l)} \) is the vector in the \( l\text{-th} \) layer \[ a^{(l)} \equiv \begin{bmatrix} a_0^{(l)} \equiv 1 \\ a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{s_l}^{(l)} \end{bmatrix} , \] where \( l = 1, 2, 3, \dotsc, L \) and \(x = a^{(1)}\). 
\( h_\Theta(x) \) is the vector in the last and output layer \[ h_\Theta(x) \equiv \begin{bmatrix} {h_\Theta}(x)_1 \\ {h_\Theta}(x)_2 \\ {h_\Theta}(x)_3 \\ \vdots \\ {h_\Theta}(x)_K \\ \end{bmatrix} \] and so we have \(h_\Theta(x) = a^{(L)}\). The vector values \( a^{(l)} \) are computed from \[ a^{(l+1)} = g( \Theta^{(l)} a^{(l)}) \] where we define \( g(z) \), the vector (and scalar) sigmoid function, by \[ g(z) \equiv \frac{1}{1 + e^{-z}} \label{gz} \] which is applied to each component of the vector (or scalar) \( z \) independently, and \( \Theta^{(l)} \) is an \( s_{l+1} \times ( s_{l} + 1 ) \) matrix, where \( l = 1, 2, \dotsc, L-1 \). We train the neural network by finding all the numbers in all \( \Theta \) matrices. So the final vector output of our network can be written as \begin{align} h_\Theta(x) \equiv a^{(L)} & = g( \Theta^{(L-1)} a^{(L-1)} ) \label{hThetag}\\ & = g( \Theta^{(L-1)} g( \Theta^{(L-2)} g( \Theta^{(L-3)} \dots g( \Theta^{(1)} x)))) \nonumber. \end{align}
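Forward propagation, [\ref{hThetag}], as a sketch in numpy. I pass `x` without its bias component and prepend the \(1\) inside the loop; the network shape and weight values here are arbitrary placeholders of mine:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """a^{(l+1)} = g(Theta^{(l)} a^{(l)}), prepending the bias unit a_0 = 1
    before each matrix multiply; returns the output vector h_Theta(x)."""
    a = x
    for Theta in Thetas:
        a = sigmoid(Theta @ np.concatenate(([1.0], a)))
    return a

# A 2-3-1 network (s_1 = 2, s_2 = 3, s_3 = K = 1) with arbitrary weights.
rng = np.random.default_rng(0)
Theta1 = 0.1 * rng.normal(size=(3, 3))   # s_2 x (s_1 + 1)
Theta2 = 0.1 * rng.normal(size=(1, 4))   # s_3 x (s_2 + 1)
h = forward(np.array([0.5, -0.2]), [Theta1, Theta2])
```

Since the output passes through the sigmoid, every component of `h` lies strictly between 0 and 1.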
The cost function is \[ J(\Theta) = - \frac{1}{m} \left[ \sum^m_{i=1} \sum^K_{k=1} y^{(i)}_k \log( h_\Theta(x^{(i)})_k) + \left(1 - y^{(i)}_k \right) \log( 1 - h_\Theta(x^{(i)})_k) \right] + \frac{\lambda}{2 m} \sum^{L-1}_{l=1} \sum^{s_l}_{i=1} \sum^{s_{l+1}}_{j=1} \left(\Theta^{(l)}_{j,i} \right)^2 \label{JTheta} \] where \(m\) is the number of training examples, \(x^{(i)}\) is the \(i\text{th}\) training example input vector, and \(y^{(i)}\) is the corresponding training example vector output. There must have been a lot of trial and error in finding this nice cost function. It has some very nice properties, as we'll see.
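A sketch of [\ref{JTheta}] in numpy, assuming the outputs \(h_\Theta(x^{(i)})\) have already been computed and stacked into an \(m \times K\) array (likewise the one-hot targets); note the regularization sums start at index 1, so the bias column of each \(\Theta^{(l)}\) is excluded, which the slice `T[:, 1:]` reproduces. The names and test data are mine:

```python
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    """Regularized cross-entropy cost J(Theta).  H and Y are m x K arrays
    of network outputs h_Theta(x^(i)) and one-hot targets y^(i); the bias
    column of each Theta matrix is left out of the regularization term."""
    m = H.shape[0]
    data = -np.sum(Y * np.log(H) + (1.0 - Y) * np.log(1.0 - H)) / m
    reg = lam / (2.0 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data + reg

# Perfect, confident predictions drive the data term to (nearly) zero;
# the clip just keeps log() finite.
Y = np.eye(3)                        # m = 3 one-hot examples, K = 3
H = np.clip(Y, 1e-12, 1.0 - 1e-12)   # "predictions" equal to the targets
```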
Differentiating [\ref{JTheta}] with respect to \( \Theta^{(l)}_{i,j} \) we get \begin{align} \frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}} & = - \frac{1}{m} \left\{ \sum^m_{t=1} \sum^K_{k=1} \left[ y^{(t)}_k \frac{1}{h_\Theta(x^{(t)})_k} - \left(1 - y^{(t)}_k \right) \frac{1}{1 - h_\Theta(x^{(t)})_k} \right] \frac{\partial h_\Theta(x^{(t)})_k}{\partial \Theta^{(l)}_{i,j} } \right\} + \frac{\lambda}{m} \Theta^{(l)}_{i,j} \nonumber\\ & = - \frac{1}{m} \left[ \sum^m_{t=1} \sum^K_{k=1} \frac{ \left(y^{(t)}_k - h_\Theta(x^{(t)})_k y^{(t)}_k\right) - \left(h_\Theta(x^{(t)})_k - h_\Theta(x^{(t)})_k y^{(t)}_k\right) }{ h_\Theta(x^{(t)})_k \left( 1 - h_\Theta(x^{(t)})_k \right) } \frac{\partial h_\Theta(x^{(t)})_k}{\partial \Theta^{(l)}_{i,j} } \right] + \frac{\lambda}{m} \Theta^{(l)}_{i,j} \nonumber\\ & = \frac{1}{m} \left[ \sum^m_{t=1} \sum^K_{k=1} \frac{h_\Theta(x^{(t)})_k - y^{(t)}_k}{ h_\Theta(x^{(t)})_k \left( 1 - h_\Theta(x^{(t)})_k \right) } \frac{\partial h_\Theta(x^{(t)})_k}{\partial \Theta^{(l)}_{i,j} } \right] + \frac{\lambda}{m} \Theta^{(l)}_{i,j} \label{dJdTheta_00} \end{align} where we changed the dummy training sample summing index \(i \) to \(t \) to avoid conflicting symbols. From [\ref{hThetag}] the \( k \mathrm{-th} \) component of \( h_\Theta(x) \) is \[ h_\Theta(x)_k = g( \sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} a^{(L-1)}_p ) \label{hThetaa} \] where \(g\) is a scalar function of the scalar variable which is the sum. If \(l = L-1\) in \(\Theta^{(l)}_{i,j}\), i.e. for the matrix feeding the last layer, we have \[ \begin{array}{l l} \frac{\partial h_\Theta(x^{(t)})_k}{\partial \Theta^{(L-1)}_{i,j} } = g^\prime(\sum_{p=1}^{s_{L-1}} \Theta_{i,p}^{(L-1)} a^{(L-1)}_p) a^{(L-1)}_j & \quad \text{if } k = i\\ \frac{\partial h_\Theta(x^{(t)})_k}{\partial \Theta^{(L-1)}_{i,j} } = 0 & \quad \text{if } k \neq i \end{array} \label{dhdTheta0} \] where \(g^\prime\) is the derivative of \(g\) with respect to its argument. 
So for the case when \(l = L-1\) this makes [\ref{dJdTheta_00}] become \[ \frac{\partial J(\Theta)}{\partial \Theta^{(L-1)}_{i,j}} = \frac{1}{m} \left[ \sum^m_{t=1} \frac{h_\Theta(x^{(t)})_i - y^{(t)}_i}{ h_\Theta(x^{(t)})_i \left( 1 - h_\Theta(x^{(t)})_i \right) } \frac{\partial h_\Theta(x^{(t)})_i}{\partial \Theta^{(L-1)}_{i,j} } \right] + \frac{\lambda}{m} \Theta^{(L-1)}_{i,j} \label{dJdTheta}. \]
Using [\ref{gz}] we have \begin{align} g^\prime(z) \equiv \frac{\mathrm{d} g(z)}{\mathrm{d} z} & = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} \nonumber\\ & = \frac{1}{1 + e^{-z}} \frac{1 + e^{-z} - 1}{1 + e^{-z}} \nonumber\\ & = g(z) \left(1 - \frac{1}{1 + e^{-z}} \right) \nonumber\\ \implies g^\prime(z) & = g(z) \left[1 - g(z)\right] \label{gzprime} \end{align} where \( z \) is a dummy variable that we introduced. Applying [\ref{gzprime}] to [\ref{dhdTheta0}] with [\ref{dJdTheta}] and [\ref{hThetaa}] gives \begin{align} \frac{\partial J(\Theta)}{\partial \Theta^{(L-1)}_{i,j}} & = \frac{1}{m} \left\{ \sum^m_{t=1} \frac{h_\Theta(x^{(t)})_i - y^{(t)}_i}{h_\Theta(x^{(t)})_i \left( 1 - h_\Theta(x^{(t)})_i \right) } \left[g^\prime (\sum_{p=1}^{s_{L-1}} \Theta^{(L-1)}_{i,p} a^{(L-1)}_p) a^{(L-1)}_j \right] \right\} + \frac{\lambda}{m} \Theta^{(L-1)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \frac{h_\Theta(x^{(t)})_i - y^{(t)}_i}{h_\Theta(x^{(t)})_i \left( 1 - h_\Theta(x^{(t)})_i \right) } \left[ g(\sum_{p=1}^{s_{L-1}} \Theta^{(L-1)}_{i,p} a^{(L-1)}_p) \left\{1 - g(\sum_{p=1}^{s_{L-1}} \Theta^{(L-1)}_{i,p} a^{(L-1)}_p) \right\} a^{(L-1)}_j \right] \right\} + \frac{\lambda}{m} \Theta^{(L-1)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \frac{h_\Theta(x^{(t)})_i - y^{(t)}_i}{h_\Theta(x^{(t)})_i \left( 1 - h_\Theta(x^{(t)})_i \right) } \left[ h_\Theta(x^{(t)})_i \left\{1 - h_\Theta(x^{(t)})_i \right\} a^{(L-1)}_j \right] \right\} + \frac{\lambda}{m} \Theta^{(L-1)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \left(h_\Theta(x^{(t)})_i - y^{(t)}_i \right) a^{(L-1)}_j \right\} + \frac{\lambda}{m} \Theta^{(L-1)}_{i,j} \label{dJDThetaL1} , \end{align} hence this \(J(\Theta)\) is a very nice cost function.
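The identity \(g^\prime(z) = g(z)\left[1 - g(z)\right]\) is easy to verify numerically against a central finite difference:

```python
import numpy as np

def g(z):
    """The sigmoid function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Compare the closed form g'(z) = g(z)(1 - g(z)) with a central
# finite-difference approximation of the derivative.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2.0 * eps)
analytic = g(z) * (1.0 - g(z))
```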
To get partial derivatives in the next layer back \[ \frac{\partial J(\Theta)}{\partial \Theta^{(L-2)}_{i,j}} \] we rewrite [\ref{hThetag}] as \[ h_\Theta(x) = g\left( \Theta^{(L-1)} g\left( \Theta^{(L-2)} a^{(L-2)} \right) \right) \] so that \begin{align} h_\Theta(x^{(t)})_k & = g( \sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} g( \Theta^{(L-2)} a^{(L-2)} )_p ) \nonumber\\ & = g( \sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} g( \sum_{n=1}^{s_{L-2}} \Theta^{(L-2)}_{p,n} a^{(L-2)}_n ) ) \label{hTheta_Lm1} . \end{align} So by applying the chain rule for differentiation to [\ref{hTheta_Lm1}] we get (this is the most illuminating part) \begin{align} \frac{\partial h_\Theta(x)_k}{\partial \Theta^{(L-2)}_{i,j} } & = g^\prime(\sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} a^{(L-1)}_p) \left[ \Theta_{k,i}^{(L-1)} g^\prime(\sum_{n=1}^{s_{L-2}} \Theta^{(L-2)}_{i,n} a^{(L-2)}_n) \right] a^{(L-2)}_j \nonumber\\ & = g(\sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} a^{(L-1)}_p) \left[1 - g(\sum_{p=1}^{s_{L-1}} \Theta_{k,p}^{(L-1)} a^{(L-1)}_p)\right] \left\{ \Theta_{k,i}^{(L-1)} g(\sum_{n=1}^{s_{L-2}} \Theta^{(L-2)}_{i,n} a^{(L-2)}_n) \left[1 - g(\sum_{n=1}^{s_{L-2}} \Theta^{(L-2)}_{i,n} a^{(L-2)}_n)\right] \right\} a^{(L-2)}_j\nonumber\\ & = h_\Theta(x)_k \left[1 - h_\Theta(x)_k\right] \left\{ \Theta_{k,i}^{(L-1)} a^{(L-1)}_i \left[1 - a^{(L-1)}_i\right] \right\} a^{(L-2)}_j \label{dThetaL2}. 
\end{align} Plugging [\ref{dThetaL2}] into [\ref{dJdTheta_00}] gives \begin{align} \frac{\partial J(\Theta)}{\partial \Theta^{(L-2)}_{i,j}} & = \frac{1}{m} \left[ \sum^m_{t=1} \sum^K_{k=1} \frac{h_\Theta(x^{(t)})_k - y^{(t)}_k}{ h_\Theta(x^{(t)})_k \left( 1 - h_\Theta(x^{(t)})_k \right) } h_\Theta(x^{(t)})_k \left[1 - h_\Theta(x^{(t)})_k\right] \left\{ \Theta_{k,i}^{(L-1)} a^{(L-1)}_i \left[1 - a^{(L-1)}_i\right] \right\} a^{(L-2)}_j \right] + \frac{\lambda}{m} \Theta^{(L-2)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \sum^K_{k=1} \left(h_\Theta(x^{(t)})_k - y^{(t)}_k\right) \left[ \Theta_{k,i}^{(L-1)} a^{(L-1)}_i \left(1 - a^{(L-1)}_i\right) \right] a^{(L-2)}_j \right\} + \frac{\lambda}{m} \Theta^{(L-2)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \sum^K_{k=1} \left(h_\Theta(x^{(t)})_k - y^{(t)}_k\right) \Theta_{k,i}^{(L-1)} a^{(L-1)}_i \left(1 - a^{(L-1)}_i\right) a^{(L-2)}_j \right\} + \frac{\lambda}{m} \Theta^{(L-2)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \sum^K_{k=1} \Theta_{k,i}^{(L-1)} \left(h_\Theta(x^{(t)})_k - y^{(t)}_k\right) a^{(L-1)}_i \left(1 - a^{(L-1)}_i\right) a^{(L-2)}_j \right\} + \frac{\lambda}{m} \Theta^{(L-2)}_{i,j} \nonumber\\ & = \frac{1}{m} \left\{ \sum^m_{t=1} \left[ \Theta^{(L-1) T} \left(h_\Theta(x^{(t)}) - y^{(t)}\right) \right]_i a^{(L-1)}_i \left(1 - a^{(L-1)}_i\right) a^{(L-2)}_j \right\} + \frac{\lambda}{m} \Theta^{(L-2)}_{i,j} \label{dJdThetaL2}, \end{align} where \(\Theta^{(L-1) T}\) is the transpose of the \(\Theta^{(L-1)}\) matrix. It helped to look at this slide from Ng's lecture when things got hairy.
By examining [\ref{dJDThetaL1}] and [\ref{dJdThetaL2}] we start to see a pattern emerging. We find it convenient to define the following series of vectors \(\delta^{(l)}\), in a backward iterative manner like so \[ \begin{array}{ll} \delta^{(L)} = a^{(L)} - y & \\ \delta^{(l)} = \left(\Theta^{(l)}\right)^T \delta^{(l+1)} \cdot \ast a^{(l)} \cdot \ast \left(1 - a^{(l)}\right) & \text{for } l = L-1, L-2, \dotsc, 2 \end{array} \label{delta} \] where \(\cdot \ast\) is an element-wise multiply and \(l\) is the layer number. Using this definition for \(\delta\) we can write the following general expression \[ \frac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}} = \frac{1}{m} \sum_{t=1}^m \delta^{(t)(l+1)}_i a^{(t)(l)}_j + \frac{\lambda}{m} \Theta_{i,j}^{(l)} \quad, \label{deltadJdTheta} \] where we sum on \(t\) over the training examples, the extra superscript \((t)\) denoting values computed from training example \(t\). You can prove [\ref{deltadJdTheta}] by mathematical induction using the previous results. Yes, if you followed this to here you can.
So by [\ref{delta}] you need to calculate \(\delta^{(l+1)}\) to get \(\delta^{(l)}\), which is one layer back in the network, for all \(l\) layers; hence the term back propagation.
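Putting [\ref{delta}] and [\ref{deltadJdTheta}] together for a single training example (so the \(\frac{1}{m}\) sum over \(t\) and the regularization term are dropped), here is a sketch of backpropagation checked against a finite difference of the unregularized cost. The network shape, weights, and helper names are mine:

```python
import numpy as np

def g(z):
    """Sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Return the activation vectors a^{(l)} for every layer, with the
    bias unit prepended on every layer except the output."""
    a = [np.concatenate(([1.0], x))]
    for l, Theta in enumerate(Thetas):
        out = g(Theta @ a[-1])
        if l < len(Thetas) - 1:
            out = np.concatenate(([1.0], out))
        a.append(out)
    return a

def cost(x, y, Thetas):
    """Unregularized cross-entropy cost for one training example."""
    h = forward(x, Thetas)[-1]
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradients(x, y, Thetas):
    """delta^{(L)} = a^{(L)} - y, then
    delta^{(l)} = (Theta^{(l)})^T delta^{(l+1)} .* a^{(l)} .* (1 - a^{(l)}),
    dropping the bias component; dJ/dTheta^{(l)}_{ij} = delta^{(l+1)}_i a^{(l)}_j."""
    a = forward(x, Thetas)
    delta = a[-1] - y
    grads = [None] * len(Thetas)
    for l in range(len(Thetas) - 1, -1, -1):
        grads[l] = np.outer(delta, a[l])
        if l > 0:
            delta = ((Thetas[l].T @ delta) * a[l] * (1.0 - a[l]))[1:]
    return grads

# A 2-3-1 network with arbitrary weights and one training example.
rng = np.random.default_rng(1)
Thetas = [0.5 * rng.normal(size=(3, 3)), 0.5 * rng.normal(size=(1, 4))]
x, y = np.array([0.3, -0.1]), np.array([1.0])
grads = gradients(x, y, Thetas)
```

The finite-difference check below plays the same role as the gradient checking step in the course's programming exercise: perturb one weight, re-evaluate the cost, and compare with the backpropagated gradient.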