This article is inspired by an excellent post by Eli Bendersky. Let’s continue with the derivation.

The **cost function** is explained in Week 1 and Week 2 of the *Machine Learning* course taught by Andrew Ng. This post explains how to derive the **normal equation** for *linear regression with multiple variables*. Readers will get the most out of it if they have studied Week 1 and Week 2 first.

The **cost function** of *linear regression with multiple variables*, $J(\theta)$, is formulated as follows:
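
$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}
\label{eq:cost-function}
$$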

where $m$ is the number of instances in the dataset, $h_{\theta}(x^{(i)})$ is our hypothesis (also known as the prediction model) for the $i$th instance, and $y^{(i)}$ is the true value for the $i$th instance.

We have also studied that
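
$$
h_{\theta}(x^{(i)}) = \theta^{T} x^{(i)} = \theta_{0} x_{0}^{(i)} + \theta_{1} x_{1}^{(i)} + \dots + \theta_{n} x_{n}^{(i)}
\label{eq:the-hyphotesis}
$$

with the convention that $x_{0}^{(i)} = 1$ for every instance.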

By substituting \eqref{eq:the-hyphotesis} into \eqref{eq:cost-function}, we obtain
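
$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{T} x^{(i)} - y^{(i)} \right)^{2}
\label{eq:derivation-5}
$$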

By defining
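
$$
X = \begin{bmatrix} \left(x^{(1)}\right)^{T} \\ \left(x^{(2)}\right)^{T} \\ \vdots \\ \left(x^{(m)}\right)^{T} \end{bmatrix}
  = \begin{bmatrix}
      x_{0}^{(1)} & x_{1}^{(1)} & \dots  & x_{n}^{(1)} \\
      x_{0}^{(2)} & x_{1}^{(2)} & \dots  & x_{n}^{(2)} \\
      \vdots      & \vdots      & \ddots & \vdots      \\
      x_{0}^{(m)} & x_{1}^{(m)} & \dots  & x_{n}^{(m)}
    \end{bmatrix}
$$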

and
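
$$
y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
$$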

also
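
$$
\theta = \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{n} \end{bmatrix}
$$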

Equation \eqref{eq:derivation-5} becomes
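
$$
J(\theta) = \frac{1}{2m} \left( X\theta - y \right)^{T} \left( X\theta - y \right)
\label{eq:derivation-10}
$$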

We have arrived at a matrix form of the **linear regression cost function**. Our next step is:

How can we minimize the **cost function** in Equation \eqref{eq:derivation-10}?

We will employ differentiation formulas from Matrix Calculus; specifically, we use **two scalar-by-vector identities** in **denominator layout** (the result is a column vector). The identities are as follows:
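
$$
\frac{\partial \left( x^{T} A x \right)}{\partial x} = \left( A + A^{T} \right) x
\label{eq:identity-1}
$$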

and
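
$$
\frac{\partial \left( a^{T} x \right)}{\partial x} = a
\label{eq:identity-2}
$$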

Now equipped with these identities, let us minimize Equation \eqref{eq:derivation-10} by computing the first derivative of $J(\theta)$ with respect to $\theta$; specifically, Part I below is differentiated with Equation \eqref{eq:identity-1} and Part II with Equation \eqref{eq:identity-2}:
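
$$
\begin{aligned}
J(\theta) &= \frac{1}{2m} \left( \underbrace{\theta^{T} X^{T} X \theta}_{\text{Part I}} - \underbrace{2 \left( X^{T} y \right)^{T} \theta}_{\text{Part II}} + y^{T} y \right) \\
\frac{\partial J(\theta)}{\partial \theta} &= \frac{1}{2m} \left( \left( X^{T} X + \left( X^{T} X \right)^{T} \right) \theta - 2 X^{T} y \right) \\
&= \frac{1}{2m} \left( 2 X^{T} X \theta - 2 X^{T} y \right) \\
&= \frac{1}{m} \left( X^{T} X \theta - X^{T} y \right)
\end{aligned}
$$

where the second line uses Equation \eqref{eq:identity-1} with $A = X^{T} X$ and Equation \eqref{eq:identity-2} with $a = X^{T} y$, and the third line uses the fact that $X^{T} X$ is symmetric.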

To find the $\theta$ that minimizes Equation \eqref{eq:derivation-10}, we need to solve
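
$$
\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m} \left( X^{T} X \theta - X^{T} y \right) = 0,
$$

which rearranges to $X^{T} X \theta = X^{T} y$. Assuming $X^{T} X$ is invertible, we can solve for $\theta$.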

At last, we have derived **the normal equation of the linear regression model**, which is
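
$$
\theta = \left( X^{T} X \right)^{-1} X^{T} y
$$

As a quick numerical sanity check, here is a minimal NumPy sketch of the result; the toy data values are invented purely for illustration, and the linear system $X^{T} X \theta = X^{T} y$ is solved directly instead of forming the inverse explicitly.

```python
import numpy as np

# Toy dataset (values invented for illustration): m = 4 instances, one feature,
# plus a leading column of ones so that theta_0 acts as the intercept (x_0 = 1).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([[2.1], [3.9], [6.2], [8.0]])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the system X^T X theta = X^T y avoids computing the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [[0.05], [2.0]]
```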