 ### Hendra Bunyamin

Forgiven sinner and Lecturer at Maranatha Christian University

### Gradients of Softmax Output Layer in Gory Details

This article derives the gradients of a softmax output layer. The result is useful when we use the backpropagation algorithm to compute the gradients of a neural network with a softmax output layer. Page 3 of the outstanding Notes on Backpropagation by Peter Sadowski inspired much of this article.

Suppose that we have a multiclass classification problem with three choices: label $1$, label $2$, and label $3$. The image below shows a very simple artificial neural network with two layers; in particular, we set the output layer as a softmax output layer. Concretely, we use one-hot encoding for the three choices as follows:

$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, and $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$ are the representations for label $1$, label $2$, and label $3$ respectively.
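The encoding above can be sketched in a few lines of Python. The function name `one_hot` and the 1-based label convention are assumptions made for this illustration, not part of the original article.

```python
def one_hot(label, num_classes=3):
    """Return the one-hot list for a 1-based class label (hypothetical helper)."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be between 1 and num_classes")
    # Position k (1-based) is 1 only when it matches the label.
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


print(one_hot(1))  # [1, 0, 0]
print(one_hot(2))  # [0, 1, 0]
print(one_hot(3))  # [0, 0, 1]
```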

Let us define our dataset, $X = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)}) \}$, which has $m$ instances and

$$y^{(i)} = \begin{bmatrix} y_1^{(i)} \\ y_2^{(i)} \\ y_3^{(i)} \end{bmatrix}$$

where $y_1^{(i)}$, $y_2^{(i)}$, and $y_3^{(i)}$ take only binary values (either 0 or 1) for $i = 1, 2, \ldots, m$.

We employ softmax functions as our predictions. Specifically, we define our first hypothesis,

the second hypothesis,

the third hypothesis,

and the cost function,

with
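For reference, one form of these definitions that is consistent with the parameters $\Theta_{k0}$, $\Theta_{k1}$ and with the final gradient in this article is sketched below. It assumes a single input feature $x_1^{(i)}$, a bias input $x_0^{(i)} = 1$, and a cost without a $\frac{1}{m}$ factor; the symbol $z_k^{(i)}$ is introduced here for convenience and is not part of the original notation.

$$
\begin{aligned}
z_k^{(i)} &= \Theta_{k0} x_0^{(i)} + \Theta_{k1} x_1^{(i)}, \qquad k = 1, 2, 3, \\
h_k(x^{(i)}) &= \frac{e^{z_k^{(i)}}}{e^{z_1^{(i)}} + e^{z_2^{(i)}} + e^{z_3^{(i)}}}, \\
J &= -\sum_{i=1}^{m} \sum_{k=1}^{3} y_k^{(i)} \log h_k(x^{(i)}).
\end{aligned}
$$

Under these assumptions the three hypotheses are positive and sum to one for every instance, so they can be read as class probabilities.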

Now we show how to derive the gradients for these softmax activation functions. In other words,

What are $\frac{\partial J}{\partial \Theta_{10}}$, $\frac{\partial J}{\partial \Theta_{11}}$, $\frac{\partial J}{\partial \Theta_{20}}$, $\frac{\partial J}{\partial \Theta_{21}}$, $\frac{\partial J}{\partial \Theta_{30}}$, and $\frac{\partial J}{\partial \Theta_{31}}$?

First, we show how to derive $\frac{\partial J}{\partial \Theta_{10}}$ and $\frac{\partial J}{\partial \Theta_{11}}$.

#### Let’s derive $\frac{\pmb{\partial J}}{\pmb{\partial \Theta_{10}}}$

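By the multivariable chain rule, we obtain

$$\frac{\partial J}{\partial \Theta_{10}} = \underbrace{\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}}_{\text{Part I}} + \underbrace{\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}}_{\text{Part II}} + \underbrace{\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}}_{\text{Part III}}$$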

Part I consists of $\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}$. Specifically,

By defining

and

and applying the quotient rule, we can compute $\frac{\partial h_1}{\partial \Theta_{10}}$ as follows:

Finally, we can compute $\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}}$ by combining Equation \eqref{eq:gradient-10-1} and Equation \eqref{eq:gradient-10-4} as follows:
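A hedged sketch of Part I for a single instance follows. It assumes $h_1 = u / v$ with $u = e^{z_1}$ and $v = e^{z_1} + e^{z_2} + e^{z_3}$, where $z_k = \Theta_{k0} + \Theta_{k1} x_1$ is notation assumed here, and a cross-entropy cost $J = -\sum_{i} \sum_{k} y_k^{(i)} \log h_k(x^{(i)})$. Instance superscripts are dropped for readability.

$$
\begin{aligned}
\frac{\partial J}{\partial h_1} &= -\frac{y_1}{h_1}, \\
\frac{\partial h_1}{\partial \Theta_{10}} &= \frac{\frac{\partial u}{\partial \Theta_{10}} v - u \frac{\partial v}{\partial \Theta_{10}}}{v^2} = \frac{e^{z_1} v - e^{z_1} e^{z_1}}{v^2} = h_1 (1 - h_1), \\
\frac{\partial J}{\partial h_1}\frac{\partial h_1}{\partial \Theta_{10}} &= -\frac{y_1}{h_1} \, h_1 (1 - h_1) = -y_1 (1 - h_1).
\end{aligned}
$$

Summed over the dataset, Part I would then be $-\sum_{i=1}^{m} y_1^{(i)} \left( 1 - h_1(x^{(i)}) \right)$.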

Part II consists of $\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}$. Specifically,

Again, by defining

using Equation \eqref{eq:gradient-10-3}, and applying the quotient rule, we can compute $\frac{\partial h_2}{\partial \Theta_{10}}$

By using Equation \eqref{eq:gradient-10-6} and Equation \eqref{eq:gradient-10-8}, $\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}}$ can be computed as
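Part II can be sketched the same way for a single instance. The assumptions are again $h_2 = u / v$ with $u = e^{z_2}$ and $v = e^{z_1} + e^{z_2} + e^{z_3}$, with $z_k = \Theta_{k0} + \Theta_{k1} x_1$ as assumed notation and a cross-entropy cost. Because $u$ does not depend on $\Theta_{10}$, only the denominator contributes.

$$
\begin{aligned}
\frac{\partial J}{\partial h_2} &= -\frac{y_2}{h_2}, \\
\frac{\partial h_2}{\partial \Theta_{10}} &= \frac{0 \cdot v - e^{z_2} e^{z_1}}{v^2} = -h_1 h_2, \\
\frac{\partial J}{\partial h_2}\frac{\partial h_2}{\partial \Theta_{10}} &= -\frac{y_2}{h_2} \left( -h_1 h_2 \right) = y_2 h_1.
\end{aligned}
$$

Summed over the dataset, Part II would then be $\sum_{i=1}^{m} y_2^{(i)} h_1(x^{(i)})$.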

Lastly, Part III consists of $\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}$.

In particular,

Again, by defining

using Equation \eqref{eq:gradient-10-3}, and applying the quotient rule, we can compute $\frac{\partial h_3}{\partial \Theta_{10}}$

Again, by using Equation \eqref{eq:gradient-10-10} and Equation \eqref{eq:gradient-10-12}, $\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}}$ can be computed as
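A hedged sketch of Part III and of the final sum follows, for a single instance. It assumes $h_k = e^{z_k} / v$ with $v = e^{z_1} + e^{z_2} + e^{z_3}$, where $z_k = \Theta_{k0} + \Theta_{k1} x_1$ is assumed notation, and a cross-entropy cost. Under those assumptions the same quotient-rule computation gives $-y_1 (1 - h_1)$ for Part I and $y_2 h_1$ for Part II.

$$
\begin{aligned}
\frac{\partial J}{\partial h_3}\frac{\partial h_3}{\partial \Theta_{10}} &= -\frac{y_3}{h_3} \cdot \frac{0 \cdot v - e^{z_3} e^{z_1}}{v^2} = -\frac{y_3}{h_3} \left( -h_1 h_3 \right) = y_3 h_1, \\
-y_1 (1 - h_1) + y_2 h_1 + y_3 h_1 &= h_1 (y_1 + y_2 + y_3) - y_1 = h_1 - y_1.
\end{aligned}
$$

The last step uses the one-hot encoding, for which $y_1 + y_2 + y_3 = 1$. Summing over the dataset then gives $\frac{\partial J}{\partial \Theta_{10}} = \sum_{i=1}^{m} \left( h_1(x^{(i)}) - y_1^{(i)} \right)$.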

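With the same technique, we also obtain

$$\begin{equation} \frac{\partial J}{\partial \Theta_{11}} = \sum_{i=1}^{m}{( h_1(x^{(i)}) - y_1^{(i)} ) x_1^{(i)}} \tag{22}\label{eq:final-gradient-2} \end{equation}$$

and, in general, for $k = 1, 2, 3$ and $j = 0, 1$,

$$\begin{equation} \frac{\partial J}{\partial \Theta_{kj}} = \sum_{i=1}^{m}{( h_k(x^{(i)}) - y_k^{(i)} ) x_j^{(i)}} \tag{23}\label{eq:final-gradient-3} \end{equation}$$

with $x_j^{(i)} = 1$ if $j = 0$. Although the computation in the output layer is different, Equation \eqref{eq:final-gradient-3} has, perhaps surprisingly, the same form as the gradient of a sigmoid output layer. Hence, using a softmax output layer adds no extra difficulty to backpropagation.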
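The closed form $\sum_i (h_k - y_k) x_j$ can be sanity-checked numerically. The sketch below is a minimal illustration and not the article's own code: it assumes one input feature, a bias input $x_0 = 1$, softmax hypotheses, and a cross-entropy cost summed over instances without a $\frac{1}{m}$ factor. All function names, parameter values, and the toy dataset are made up for the check. It compares the analytic gradient with a central finite-difference estimate.

```python
import math


def softmax_hypotheses(theta, x1):
    """Return [h_1, h_2, h_3] for one instance, with bias input x_0 = 1."""
    z = [t0 + t1 * x1 for t0, t1 in theta]
    z_max = max(z)  # subtracting the max leaves the softmax unchanged but avoids overflow
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cost(theta, xs, ys):
    """Cross-entropy cost J summed over all instances (no 1/m factor)."""
    J = 0.0
    for x1, y in zip(xs, ys):
        h = softmax_hypotheses(theta, x1)
        J -= sum(yk * math.log(hk) for yk, hk in zip(y, h))
    return J


def analytic_gradient(theta, xs, ys):
    """dJ/dTheta_kj = sum_i (h_k - y_k) * x_j, with x_0 = 1."""
    grad = [[0.0, 0.0] for _ in theta]
    for x1, y in zip(xs, ys):
        h = softmax_hypotheses(theta, x1)
        for k in range(len(theta)):
            grad[k][0] += h[k] - y[k]          # bias term, x_0 = 1
            grad[k][1] += (h[k] - y[k]) * x1   # weight term, x_1
    return grad


def numerical_gradient(theta, xs, ys, eps=1e-6):
    """Central finite-difference estimate of every dJ/dTheta_kj."""
    grad = [[0.0, 0.0] for _ in theta]
    for k in range(len(theta)):
        for j in range(2):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[k][j] += eps
            minus[k][j] -= eps
            grad[k][j] = (cost(plus, xs, ys) - cost(minus, xs, ys)) / (2 * eps)
    return grad


# Arbitrary parameters [Theta_k0, Theta_k1] and a toy dataset (illustrative values only).
theta = [[0.1, -0.3], [0.0, 0.5], [-0.2, 0.2]]
xs = [0.5, -1.0, 2.0, 1.5]
ys = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

analytic = analytic_gradient(theta, xs, ys)
numeric = numerical_gradient(theta, xs, ys)
max_diff = max(
    abs(a - n)
    for row_a, row_n in zip(analytic, numeric)
    for a, n in zip(row_a, row_n)
)
print("largest |analytic - numeric| difference:", max_diff)
```

If the derivation holds under these assumptions, the printed difference should be tiny, limited only by finite-difference and floating-point error.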