Gradients of Softmax Output Layer in Gory Details

This article derives the gradients of a softmax output layer in detail. This knowledge proves useful when we want to use the backpropagation algorithm to compute the gradients of a neural network with a softmax output layer. Page 3 of the excellent Notes on Backpropagation by Peter Sadowski inspired much of this article.

Suppose that we have a multiclass classification problem with three choices: label $1$, label $2$, and label $3$. The image below shows a very simple artificial neural network with two layers; in particular, we set the output layer to be a softmax output layer.

Concretely, we utilize one-hot encoding for the three choices as follows:

$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$ are the representations for label $1$, label $2$, and label $3$ respectively.
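As a quick illustration (the article itself contains no code), a minimal NumPy sketch of this encoding could look as follows; the function name `one_hot` is our own:

```python
import numpy as np

def one_hot(label, num_classes=3):
    """Return the one-hot column vector for a 1-based class label."""
    e = np.zeros((num_classes, 1))
    e[label - 1] = 1.0
    return e

print(one_hot(2))   # [[0.], [1.], [0.]] represents label 2
```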

Let us define our dataset, $X = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)}) \}$, which has $m$ instances and

$$x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \end{bmatrix} \text{ and } y^{(i)} = \begin{bmatrix} y_1^{(i)} \\ y_2^{(i)} \\ y_3^{(i)} \end{bmatrix}\tag{1}\label{eq:label-y}$$

where $y_1^{(i)}$, $y_2^{(i)}$, and $y_3^{(i)}$ take only binary values (either $0$ or $1$) for $i = 1, 2, \ldots, m$.

We employ softmax functions as our predictions. Specifically, we define our first hypothesis,

$$h_1(x^{(i)}) = \frac{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}+\exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})}+\exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})}}\tag{2}\label{eq:hyphotesis-1}$$

the second hypothesis,

$$h_2(x^{(i)}) = \frac{\exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}+\exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})}+\exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})}}\tag{3}\label{eq:hyphotesis-2}$$

the third hypothesis,

$$h_3(x^{(i)}) = \frac{\exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}+\exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})}+\exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})}}\tag{4}\label{eq:hyphotesis-4}$$

and the cost function,

$$J(\Theta) = - \sum_{i=1}^{m} \left( y_1^{(i)} \log h_1(x^{(i)}) + y_2^{(i)} \log h_2(x^{(i)}) + y_3^{(i)} \log h_3(x^{(i)}) \right)\tag{5}\label{eq:cost-function}$$

with

$$\Theta = \begin{bmatrix} \Theta_{10} & \Theta_{11} \\ \Theta_{20} & \Theta_{21} \\ \Theta_{30} & \Theta_{31} \end{bmatrix}.\tag{6}\label{eq:the-theta}$$
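To make Equations \eqref{eq:hyphotesis-1} through \eqref{eq:the-theta} concrete, here is a minimal NumPy sketch of the forward pass; the helper names `hypotheses` and `cost` are our own, and `Theta` follows the $3 \times 2$ layout of Equation \eqref{eq:the-theta}:

```python
import numpy as np

def hypotheses(Theta, x):
    """Softmax outputs h_1, h_2, h_3 for one instance x = [1, x_1]."""
    z = Theta @ x                  # z_k = Theta_k0 * 1 + Theta_k1 * x_1
    e = np.exp(z - np.max(z))      # subtracting max(z) avoids overflow, values unchanged
    return e / e.sum()

def cost(Theta, X, Y):
    """Cross-entropy cost J(Theta) of Equation (5).

    X has shape (m, 2) with a leading column of ones,
    Y has shape (m, 3) with one-hot rows.
    """
    J = 0.0
    for x, y in zip(X, Y):
        J -= np.dot(y, np.log(hypotheses(Theta, x)))
    return J
```

Subtracting the maximum of $z$ before exponentiating does not change the softmax values, but it keeps the exponentials from overflowing in practice.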

Now we will show how to derive the gradients of these softmax activation functions. In other words,

What are $\frac{\partial J}{\partial \Theta_{10}}$, $\frac{\partial J}{\partial \Theta_{11}}$, $\frac{\partial J}{\partial \Theta_{20}}$, $\frac{\partial J}{\partial \Theta_{21}}$, $\frac{\partial J}{\partial \Theta_{30}}$, and $\frac{\partial J}{\partial \Theta_{31}}$?

First, we show how to derive $\frac{\partial J}{\partial \Theta_{10}}$ and $\frac{\partial J}{\partial \Theta_{11}}$.

Let’s derive $\frac{\pmb{\partial J}}{\pmb{\partial \Theta_{10}}}$

By applying the chain rule from multivariable calculus, we obtain

$$\frac{\partial J}{\partial \Theta_{10}} = \underbrace{\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}}_{\text{Part I}} + \underbrace{\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}}_{\text{Part II}} + \underbrace{\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}}_{\text{Part III}}.\tag{7}\label{eq:gradient-10}$$

Part I is $\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}$. Specifically,

$$\frac{\partial J}{\partial h_1} = - \sum_{i=1}^{m}{\frac{y_1^{(i)}}{h_1(x^{(i)})}}.\tag{8}\label{eq:gradient-10-1}$$

By defining

$$u = \exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})}\tag{9}\label{eq:gradient-10-2}$$

and

$$v = \exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}\tag{10}\label{eq:gradient-10-3}$$

and applying the quotient rule, we can compute $\frac{\partial h_1}{\partial \Theta_{10}}$ as follows:

\begin{align} \frac{\partial h_1}{\partial \Theta_{10}} &= \frac{u^{\prime} v - u v^{\prime}}{v^2} \\ &= \frac{(\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})})(\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}) - (\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})})^2}{(\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})})^2} \\ &= \frac{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} - \left( \frac{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} \right)^2 \\ &= h_1(x^{(i)}) - (h_1(x^{(i)}))^2 \\ &= h_1(x^{(i)}) (1 - h_1(x^{(i)}))\tag{11}\label{eq:gradient-10-4} \end{align}
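Before moving on, a quick finite-difference check can build confidence in Equation \eqref{eq:gradient-10-4}; this sketch reuses the hypothetical `hypotheses` helper from the earlier forward-pass sketch:

```python
np.random.seed(0)
Theta = np.random.randn(3, 2)
x = np.array([1.0, 0.7])              # one instance x = [1, x_1]
eps = 1e-6

h = hypotheses(Theta, x)

# perturb Theta_10 and approximate dh_1/dTheta_10 numerically
Theta_plus = Theta.copy()
Theta_plus[0, 0] += eps
numeric = (hypotheses(Theta_plus, x)[0] - h[0]) / eps

analytic = h[0] * (1 - h[0])          # Equation (11)
print(numeric, analytic)              # the two values should agree closely
```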

Finally, we can compute $\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}$ by combining Equation \eqref{eq:gradient-10-1} and Equation \eqref{eq:gradient-10-4} as follows:

\require{cancel} \begin{align} \frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}} &= - \sum_{i=1}^{m}{\frac{y_1^{(i)}}{\cancel{h_1(x^{(i)})}} \cancel{h_1(x^{(i)})} (1 - h_1(x^{(i)}) )} \\ &= - \sum_{i=1}^{m}{y_1^{(i)} (1 - h_1(x^{(i)}))}\tag{12}\label{eq:gradient-10-5} \end{align}

Part II is $\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}$. Specifically,

$$\frac{\partial J}{\partial h_2} = - \sum_{i=1}^{m}{\frac{y_2^{(i)}}{h_2(x^{(i)})}}.\tag{13}\label{eq:gradient-10-6}$$

Again, by defining

$$u = \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})},\tag{14}\label{eq:gradient-10-7}$$

keeping $v$ as in Equation \eqref{eq:gradient-10-3}, and applying the quotient rule, we can compute $\frac{\partial h_2}{\partial \Theta_{10}}$:

\begin{align} \frac{\partial h_2}{\partial \Theta_{10}} &= \frac{u^{\prime} v - u v^{\prime}}{v^2} \\ &= \frac{0 - \exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})} \exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}}{(\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})})^2} \\ &= - \left(\frac{\exp{(\Theta_{20} + \Theta_{21} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} \right) \left( \frac{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} \right) \\ &= - h_2(x^{(i)}) h_1(x^{(i)}) \tag{15}\label{eq:gradient-10-8} \end{align}

By using Equation \eqref{eq:gradient-10-6} and Equation \eqref{eq:gradient-10-8}, $\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}$ can be computed as

\begin{align} \frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}} &= + \sum_{i=1}^{m}{\frac{y_2^{(i)}}{\cancel{h_2(x^{(i)})}} \cancel{h_2(x^{(i)})} h_1(x^{(i)})} \\ &= \sum_{i=1}^{m}{y_2^{(i)} h_1(x^{(i)})}.\tag{16}\label{eq:gradient-10-9} \end{align}

Lastly, Part III is $\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}$.

Particularly,

$$\frac{\partial J}{\partial h_3} = - \sum_{i=1}^{m}{\frac{y_3^{(i)}}{h_3(x^{(i)})}}.\tag{17}\label{eq:gradient-10-10}$$

Again, by defining

$$u = \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})},\tag{18}\label{eq:gradient-10-11}$$

keeping $v$ as in Equation \eqref{eq:gradient-10-3}, and applying the quotient rule, we can compute $\frac{\partial h_3}{\partial \Theta_{10}}$:

\begin{align} \frac{\partial h_3}{\partial \Theta_{10}} &= \frac{u^{\prime} v - u v^{\prime}}{v^2} \\ &= \frac{0 - \exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})} \exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}}{(\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})})^2} \\ &= - \left(\frac{\exp{(\Theta_{30} + \Theta_{31} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} \right) \left( \frac{\exp{(\Theta_{10} + \Theta_{11} x_1^{(i)})}}{\exp{(\Theta_{10} + \Theta_{11}x_1^{(i)})} + \exp{(\Theta_{20} + \Theta_{21}x_1^{(i)})} + \exp{(\Theta_{30} + \Theta_{31}x_1^{(i)})}} \right) \\ &= - h_3(x^{(i)}) h_1(x^{(i)}) \tag{19}\label{eq:gradient-10-12} \end{align}

Again, by using Equation \eqref{eq:gradient-10-10} and Equation \eqref{eq:gradient-10-12}, $\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}$ can be computed as

\begin{align} \frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}} &= + \sum_{i=1}^{m}{\frac{y_3^{(i)}}{\cancel{h_3(x^{(i)})}} \cancel{h_3(x^{(i)})} h_1(x^{(i)})} \\ &= \sum_{i=1}^{m}{y_3^{(i)} h_1(x^{(i)})}.\tag{20}\label{eq:gradient-10-13} \end{align}

Finally, by substituting Equations \eqref{eq:gradient-10-5}, \eqref{eq:gradient-10-9}, and \eqref{eq:gradient-10-13} into Equation \eqref{eq:gradient-10}, we obtain

\begin{align} \frac{\partial J}{\partial \Theta_{10}} &= \frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}} + \frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}} + \frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}} \\ &= \sum_{i=1}^{m}{\left( -y_1^{(i)} + y_1^{(i)} h_1(x^{(i)}) + y_2^{(i)} h_1(x^{(i)}) + y_3^{(i)} h_1(x^{(i)}) \right)} \\ &= \sum_{i=1}^{m}{\left( -y_1^{(i)} + h_1(x^{(i)}) \underbrace{(y_1^{(i)} + y_2^{(i)} + y_3^{(i)})}_{\text{equal to }1} \right)} \\ &= \sum_{i=1}^{m}{(h_1(x^{(i)}) - y_1^{(i)})}\tag{21}\label{eq:final-gradient-1} \end{align}
With the same technique, we also obtain

$$\frac{\partial J}{\partial \Theta_{11}} = \sum_{i=1}^{m}{( h_1(x^{(i)}) - y_1^{(i)} ) x_1^{(i)}}.\tag{22}\label{eq:final-gradient-2}$$

More generally, for $k = 1, 2, 3$ and $j = 0, 1$,

$$\frac{\partial J}{\partial \Theta_{kj}} = \sum_{i=1}^{m}{( h_k(x^{(i)}) - y_k^{(i)} ) x_j^{(i)}}\tag{23}\label{eq:final-gradient-3}$$

with $x_0^{(i)} = 1$, as defined in Equation \eqref{eq:label-y}. Although the computation in the output layer is different, Equation \eqref{eq:final-gradient-3} has, perhaps surprisingly, the same form as the gradients of a sigmoid output layer. Hence, there is nothing to worry about when using a softmax output layer.
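Putting everything together, the following sketch (again reusing the hypothetical `hypotheses` and `cost` helpers from the earlier forward-pass sketch) computes all six gradients of Equation \eqref{eq:final-gradient-3} at once and compares one of them against a finite-difference approximation of $J$:

```python
def gradient(Theta, X, Y):
    """dJ/dTheta from Equation (23): sum over i of (h(x^(i)) - y^(i)) x^(i)^T."""
    grad = np.zeros_like(Theta)
    for x, y in zip(X, Y):
        grad += np.outer(hypotheses(Theta, x) - y, x)
    return grad

# toy data: m = 4 instances, x^(i) = [1, x_1^(i)], one-hot labels
X = np.array([[1.0, 0.2], [1.0, -1.3], [1.0, 0.5], [1.0, 2.0]])
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
Theta = np.random.randn(3, 2)

# finite-difference check of dJ/dTheta_10
eps = 1e-6
Theta_plus = Theta.copy()
Theta_plus[0, 0] += eps
numeric = (cost(Theta_plus, X, Y) - cost(Theta, X, Y)) / eps
print(numeric, gradient(Theta, X, Y)[0, 0])   # should agree closely
```

If the two printed numbers agree to several decimal places, the implementation matches the derivation above.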