Hendra Bunyamin

Forgiven sinner and Lecturer at Maranatha Christian University

Gradients of Softmax Output Layer in Gory Details

This article derives the gradients of a softmax output layer in full detail. This knowledge proves useful when we want to utilize the backpropagation algorithm to compute gradients of neural networks with a softmax output layer. Furthermore, page 3 of the excellent Notes on Backpropagation by Peter Sadowski inspired this article a lot.

Suppose that we have a multiclass classification problem with 3 (three) choices, namely label $1$, label $2$, and label $3$. The image below shows a very simple artificial neural network with two layers; particularly, we set the output layer as a softmax output layer.


Concretely, we utilize one-hot encoding for the three choices as follows:

$$\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \text{and} \quad \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

are the representations for label $1$, label $2$, and label $3$ respectively.

Let us define our dataset, $\{ (x^{(i)}, y^{(i)}) \}_{i=1}^{m}$, which has $m$ instances and

$$y^{(i)} = \begin{bmatrix} y_1^{(i)} \\ y_2^{(i)} \\ y_3^{(i)} \end{bmatrix}$$

with $y_1^{(i)}$, $y_2^{(i)}$, $y_3^{(i)}$ having only binary values (either 0 or 1) for $i = 1, 2, \ldots, m$.

We employ softmax functions as our predictions. Let $z_1$, $z_2$, and $z_3$ denote the inputs to the three output units and, for brevity, consider a single training instance. Specifically, we define our first hypothesis

$$h_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}},$$

the second hypothesis,

$$h_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}},$$

the third hypothesis,

$$h_3 = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3}},$$

and the cost function,

$$J = -\left( y_1 \log h_1 + y_2 \log h_2 + y_3 \log h_3 \right).$$

Now we will show how to derive the gradients of this softmax output layer. In other words,

what are $\frac{\partial J}{\partial h_1}$, $\frac{\partial J}{\partial h_2}$, $\frac{\partial J}{\partial h_3}$, $\frac{\partial J}{\partial z_1}$, $\frac{\partial J}{\partial z_2}$, and $\frac{\partial J}{\partial z_3}$?

Firstly, we show how to derive $\frac{\partial J}{\partial z_1}$ and, along the way, the partial derivatives of $J$ with respect to $h_1$, $h_2$, and $h_3$.

Let us derive $\frac{\partial J}{\partial z_1}$.

By employing multivariable calculus (specifically, the chain rule), we obtain

$$\frac{\partial J}{\partial z_1} = \underbrace{\frac{\partial J}{\partial h_1} \frac{\partial h_1}{\partial z_1}}_{\text{Part I}} + \underbrace{\frac{\partial J}{\partial h_2} \frac{\partial h_2}{\partial z_1}}_{\text{Part II}} + \underbrace{\frac{\partial J}{\partial h_3} \frac{\partial h_3}{\partial z_1}}_{\text{Part III}}.$$

Part I consists of $\frac{\partial J}{\partial h_1}$ and $\frac{\partial h_1}{\partial z_1}$. Specifically,

$$\frac{\partial J}{\partial h_1} = \frac{\partial}{\partial h_1} \left( -y_1 \log h_1 - y_2 \log h_2 - y_3 \log h_3 \right) = -\frac{y_1}{h_1}.$$

By defining

$$u = e^{z_1} \quad \text{and} \quad v = e^{z_1} + e^{z_2} + e^{z_3} \quad \text{so that} \quad h_1 = \frac{u}{v},$$

and applying the Quotient Rule, we are able to compute $\frac{\partial h_1}{\partial z_1}$ as follows:

$$\frac{\partial h_1}{\partial z_1} = \frac{\frac{\partial u}{\partial z_1} v - u \frac{\partial v}{\partial z_1}}{v^2} = \frac{e^{z_1} \left( e^{z_1} + e^{z_2} + e^{z_3} \right) - e^{z_1} e^{z_1}}{\left( e^{z_1} + e^{z_2} + e^{z_3} \right)^2} = h_1 \left( 1 - h_1 \right).$$

Finally, we can compute Part I by combining the two results above:

$$\frac{\partial J}{\partial h_1} \frac{\partial h_1}{\partial z_1} = -\frac{y_1}{h_1} \cdot h_1 \left( 1 - h_1 \right) = -y_1 \left( 1 - h_1 \right).$$
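As a quick sanity check (not part of the original derivation), we can verify $\frac{\partial h_1}{\partial z_1} = h_1(1 - h_1)$ numerically with a central finite difference; the `softmax` helper below is my own:

```python
import math

def softmax(z):
    exps = [math.exp(zk) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.5, -1.2, 2.0]   # arbitrary test point
h = softmax(z)
eps = 1e-6

# central finite difference of h_1 with respect to z_1
numeric = (softmax([z[0] + eps, z[1], z[2]])[0]
           - softmax([z[0] - eps, z[1], z[2]])[0]) / (2 * eps)

analytic = h[0] * (1 - h[0])
print(numeric, analytic)
```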

Part II consists of $\frac{\partial J}{\partial h_2}$ and $\frac{\partial h_2}{\partial z_1}$. Specifically,

$$\frac{\partial J}{\partial h_2} = -\frac{y_2}{h_2}.$$

Again, by defining

$$u = e^{z_2} \quad \text{and} \quad v = e^{z_1} + e^{z_2} + e^{z_3} \quad \text{so that} \quad h_2 = \frac{u}{v},$$

and applying the Quotient Rule, we can compute

$$\frac{\partial h_2}{\partial z_1} = \frac{0 \cdot v - e^{z_2} e^{z_1}}{v^2} = -\frac{e^{z_2}}{v} \cdot \frac{e^{z_1}}{v} = -h_2 h_1.$$

By using the two results above, Part II can be computed as

$$\frac{\partial J}{\partial h_2} \frac{\partial h_2}{\partial z_1} = -\frac{y_2}{h_2} \cdot \left( -h_2 h_1 \right) = y_2 h_1.$$

Lastly, Part III consists of $\frac{\partial J}{\partial h_3}$ and $\frac{\partial h_3}{\partial z_1}$. Specifically,

$$\frac{\partial J}{\partial h_3} = -\frac{y_3}{h_3}.$$

Again, by defining

$$u = e^{z_3} \quad \text{and} \quad v = e^{z_1} + e^{z_2} + e^{z_3} \quad \text{so that} \quad h_3 = \frac{u}{v},$$

and applying the Quotient Rule, we can compute

$$\frac{\partial h_3}{\partial z_1} = \frac{0 \cdot v - e^{z_3} e^{z_1}}{v^2} = -h_3 h_1.$$

Again, by using the two results above, Part III can be computed as

$$\frac{\partial J}{\partial h_3} \frac{\partial h_3}{\partial z_1} = -\frac{y_3}{h_3} \cdot \left( -h_3 h_1 \right) = y_3 h_1.$$
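The off-diagonal derivatives $\frac{\partial h_2}{\partial z_1} = -h_2 h_1$ and $\frac{\partial h_3}{\partial z_1} = -h_3 h_1$ can likewise be checked numerically. This is a sketch under the same assumptions as before (`softmax` is my own helper, and the test point is arbitrary):

```python
import math

def softmax(z):
    exps = [math.exp(zk) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.3, 1.7, -0.8]   # arbitrary test point
h = softmax(z)
eps = 1e-6

# perturb z_1 and watch how h_2 and h_3 respond
num_dh2 = (softmax([z[0] + eps, z[1], z[2]])[1]
           - softmax([z[0] - eps, z[1], z[2]])[1]) / (2 * eps)
num_dh3 = (softmax([z[0] + eps, z[1], z[2]])[2]
           - softmax([z[0] - eps, z[1], z[2]])[2]) / (2 * eps)

print(num_dh2, -h[1] * h[0])
print(num_dh3, -h[2] * h[0])
```

The negative sign makes intuitive sense: increasing $z_1$ raises $h_1$ and, because the three probabilities must sum to one, pushes $h_2$ and $h_3$ down.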

Finally, combining Part I, Part II, and Part III, we obtain

$$\frac{\partial J}{\partial z_1} = -y_1 \left( 1 - h_1 \right) + y_2 h_1 + y_3 h_1 = -y_1 + h_1 \left( y_1 + y_2 + y_3 \right) = h_1 - y_1,$$

since $y_1 + y_2 + y_3 = 1$ for one-hot encoded labels.

With the same technique, we also obtain

$$\frac{\partial J}{\partial z_2} = h_2 - y_2 \quad \text{and} \quad \frac{\partial J}{\partial z_3} = h_3 - y_3,$$

or in general form,

$$\frac{\partial J}{\partial z_k} = h_k - y_k$$

with $k = 1, 2, 3$. Although the calculation in the output layer is different, surprisingly, this gradient has the same form as the gradient of a sigmoid output layer. Hence, there should be no worries about utilizing a softmax output layer.
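To close the loop, we can verify the general result $\frac{\partial J}{\partial z_k} = h_k - y_k$ for all three $k$ at once with a finite-difference gradient check. This is a minimal sketch, with my own helper names and an arbitrary test point:

```python
import math

def softmax(z):
    exps = [math.exp(zk) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def cost(z, y):
    # J = -(y_1 log h_1 + y_2 log h_2 + y_3 log h_3)
    h = softmax(z)
    return -sum(yk * math.log(hk) for yk, hk in zip(y, h))

z = [0.2, -0.5, 1.3]   # arbitrary test point
y = [0, 0, 1]          # one-hot encoding of label 3
h = softmax(z)
eps = 1e-6

numeric = []
for k in range(3):
    z_plus, z_minus = list(z), list(z)
    z_plus[k] += eps
    z_minus[k] -= eps
    numeric.append((cost(z_plus, y) - cost(z_minus, y)) / (2 * eps))

analytic = [h[k] - y[k] for k in range(3)]
print(numeric)
print(analytic)
```

This is exactly the kind of gradient check one would use to debug a hand-written backpropagation implementation.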

Written on May 27, 2020