Completing the Square for Multivariate Normal Model

Hendra Bunyamin

A Forgiven Sinner and Lecturer at Maranatha Christian University

github.com/hbunyamin
twitter.com/hendrabunyamin
hendra.bunyamin@it.maranatha.edu

Home / data-science-2 / completing the square for multivariate normal model

Completing the Square for Multivariate Normal ModelTweet

The subchapter 3.5 of Bayesian Data Analysis Third Edition gives distributional results of Bayesian inference for the parameters of a multivariate normal distribution with a known variance. Additionally, this article discusses the derivation of those results (Equation 3.13 of the book) in gory details.

$\pmb{\text{Figure 1}}$: A posterior distribution equals to a likelihood times a prior divided by a piece of evidence. Image taken from Wikipedia, some rights reserved.

Suppose we have a model for an observable vector $y$ of $d$ components, that is $y$ is a column vector of $d \times 1$, with the multivariate normal distribution,

\[\begin{equation} y \mid \mu, \Sigma \sim \text{N}(\mu, \Sigma) \tag{1}\label{eq:mvn-one-sample} \end{equation}\]

where $\mu$ is a column vector of length $d$ and $\Sigma$ is a known $d \times d$ variance matrix, which is symmetric and positive definite. Therefore, the likelihood function for a single observation is

\[\begin{equation} \Pr(y \mid \mu, \Sigma) \propto \lvert \Sigma \rvert^{-1/2} \exp \left( - \frac{1}{2} (y-\mu)^T \Sigma^{-1} (y - \mu) \right), \tag{2}\label{eq:likelihood-one-sample} \end{equation}\]

and for a sample of $n$ independent and identically distributed observations, $y_1, \ldots, y_n$, is

\[\begin{align} \Pr( y_1, \ldots, y_n \mid \mu, \Sigma ) &\propto \prod_{i=1}^{n}{ \Pr( y_i \mid \mu, \Sigma ) } \tag{3}\label{eq:likelihood-samples-1} \\ &= \prod_{i=1}^{n}{ \lvert \Sigma \rvert^{-1/2} \exp \left( - \frac{1}{2} (y_i-\mu)^T \Sigma^{-1} (y_i - \mu) \right) } \tag{4}\label{eq:likelihood-samples-2} && \text{using Equation }\eqref{eq:likelihood-one-sample} \\ &= \prod_{i=1}^{n}{ \lvert \Sigma \rvert^{-1/2} } \prod_{i=1}^{n}{\exp \left( - \frac{1}{2} (y_i-\mu)^T \Sigma^{-1} (y_i - \mu) \right) } \tag{5}\label{eq:likelihood-samples-3} \\ &= \lvert \Sigma \rvert^{-n/2} \exp \left( - \frac{1}{2} \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right). \tag{6}\label{eq:likelihood-samples-4} \\ \end{align}\]

Actually, given the following nice trace property,

\[\begin{equation} \sum_{i=1}^{n}{x_i^T A x_i} = \text{tr}\left( A \sum_{i=1}^{n}{x_i x_i^T} \right) \tag{7}\label{eq:trace-property} \end{equation}\]

with $x_i$ is a column vector whose dimension is $d \times 1$, $A$ is a symmetric matrix whose dimension is $d \times d$, and $\text{tr}$ is a trace function, we can rewrite Equation \eqref{eq:trace-property} as follows:

\[\begin{equation} \Pr( y_1, \ldots, y_n \mid \mu, \Sigma ) \propto \lvert \Sigma \rvert^{-n/2} \exp \left( -\frac{1}{2} \text{tr}(\Sigma^{-1} S_0) \right) \tag{8}\label{eq:likelihood-final-version} \end{equation}\]

where $S_0$ is the “sums squares” matrix relative to $\mu$,

\[\begin{equation} S_0 = \sum_{i=1}^{n}{(y_i - \mu)(y_i - \mu)^T}. \tag{9}\label{eq:sum-of-squares} \end{equation}\]

Before we construct the posterior distribution of the model, let’s define the prior distribution as follows:

\[\begin{equation} \Pr( \mu ) \propto \lvert \Lambda_0 \rvert^{-1/2} \exp \left(-\frac{1}{2} (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) \right) \tag{10}\label{eq:prior} \end{equation}\]

that is $\mu \sim \text{N}(\mu_0, \Lambda_0)$. By the way, $\Lambda_0$ is also a symmetric and positive definite matrix as well.

Now that we have both likelihood and prior distributions; let’s compute the posterior distribution of the model,

\[\begin{align} \Pr( \mu \mid y, \Sigma ) &\propto \Pr( y \mid \mu, \Sigma ) \Pr(\mu \mid \Sigma) && \text{by Bayes rule} \tag{11}\label{eq:posterior-def} \\ &= \lvert \Sigma \rvert^{-n/2} \exp \left( - \frac{1}{2} \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right) \times \lvert \Lambda_0 \rvert^{-1/2} \exp \left(-\frac{1}{2} (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) \right) \\ &\propto \exp \left( -\frac{1}{2} \underbrace{ \left( (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right)}_{\text{A}} \right) \tag{12}\label{eq:posterior-1} \end{align}\]

Part $\text{A}$ in Equation \eqref{eq:posterior-1} is actually a “completing the quadratic form” problem.

Let’s solve the problem as follows:

\[\begin{align} \text{A} &= (\mu^T - \mu_0^T) \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i^T - \mu^T)\Sigma^{-1}(y_i - \mu)} && \text{by transpose property} \tag{13}\label{eq:complete-squares-1} \\ &= \underbrace{(\mu^T \Lambda_0^{-1} - \mu_0^T \Lambda_0^{-1}) (\mu - \mu_0)}_{\text{B}} + \underbrace{\sum_{i=1}^{n}{(y_i^T \Sigma^{-1} - \mu^T \Sigma^{-1})(y_i - \mu)}}_{\text{C}} \tag{14}\label{eq:complete-squares-2} \end{align}\]

Let’s multiply out all terms in part $\text{B}$ in Equation \eqref{eq:complete-squares-2} as follows:

\[\begin{align} \text{B} &= \mu^T \Lambda_0^{-1} \mu - \underbrace{\mu^T \Lambda_0^{-1} \mu_0}_{\text{a scalar}} - \underbrace{\mu_0^T \Lambda_0^{-1} \mu}_{\text{a scalar}} + \mu_0^T \Lambda_0^{-1} \mu_0 \tag{15}\label{eq:b-1} \\ &= \mu^T \Lambda_0^{-1} \mu - \mu^T \Lambda_0^{-1} \mu_0 - (\mu^T \Lambda_0^{-1} \mu_0)^T + \mu_0^T \Lambda_0^{-1} \mu_0 && \text{by transpose property} \tag{16}\label{eq:b-2} \\ &= \mu^T \Lambda_0^{-1} \mu - \mu^T \Lambda_0^{-1} \mu_0 - \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0 && \text{as } \mu^T \Lambda_0^{-1} \mu_0 = (\mu^T \Lambda_0^{-1} \mu_0)^T \tag{17}\label{eq:b-3} \\ &= \mu^T \Lambda_0^{-1} \mu - 2 \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0. \tag{18}\label{eq:b-4} \\ \end{align}\]

Let’s also multiply out part $\text{C}$ in Equation \eqref{eq:complete-squares-2},

\[\begin{align} \text{C} &= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - \underbrace{y_i^T \Sigma^{-1} \mu}_{\text{scalar}} - \underbrace{\mu^T \Sigma^{-1} y_i}_{\text{scalar}} + \mu^T \Sigma^{-1} \mu)} \tag{19}\label{eq:c-1} \\ &= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - (\mu^T \Sigma^{-1} y_i)^T - \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} && \text{by transpose property} \tag{20}\label{eq:c-2} \\ &= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - \mu^T \Sigma^{-1} y_i - \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} && \text{as } (\mu^T \Sigma^{-1} y_i)^T = \mu^T \Sigma^{-1} y_i \tag{21}\label{eq:c-3} \\ &= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - 2 \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} \tag{22}\label{eq:c-4} \\ &= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - \sum_{i=1}^{n}{2 \mu^T \Sigma^{-1} y_i} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} && \text{by a linear operator of }\sum \tag{23}\label{eq:c-5} \\ &= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} \sum_{i=1}^{n}{y_i} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} \tag{24}\label{eq:c-6}\\ &= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} n \overline{y} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} && \text{as }\overline{y} = \frac{\sum_{i=1}^n y_i}{n} \tag{25}\label{eq:c-7} \\ &= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} n \overline{y} + n \mu^T \Sigma^{-1} \mu && \text{as }\sum_{i=1}^{n}{\text{constant}} = n \times \text{constant} \tag{26}\label{eq:c-8} \\ &= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T n \Sigma^{-1} \overline{y} + \mu^T n \Sigma^{-1} \mu \tag{27}\label{eq:c-9} \\ &= \mu^T n \Sigma^{-1} \mu - 2 \mu^T n \Sigma^{-1} \overline{y} + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} && \text{just rearrange terms} \tag{28}\label{eq:c-10} \end{align}\]

Now let’s combine both part $\text{B}$ (Equation \eqref{eq:b-4}) and part $\text{C}$ (Equation \eqref{eq:c-10}) into part $\text{A}$ in Equation \eqref{eq:complete-squares-2},

\[\begin{align} \text{A} =& \mu^T \Lambda_0^{-1} \mu - 2 \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0 + \\ & \mu^T n \Sigma^{-1} \mu - 2 \mu^T n \Sigma^{-1} \overline{y} + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} \tag{29}\label{eq:c-11} \\ =& \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - 2 \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) + \underbrace{\mu_0^T \Lambda_0^{-1} \mu_0 + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i}}_{\text{constant}_1} && \text{sum all terms accordingly} \tag{30}\label{eq:c-12} \\ =& \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - 2 \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) + \text{constant}_1 \tag{31}\label{eq:c-13} \\ =& \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - \underbrace{\mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )}_{\text{scalar}} - \underbrace{\mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )}_{\text{scalar}} + \text{constant}_1 && \text{separate the middle term} \tag{32}\label{eq:c-14} \\ =& \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) - ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )^T \mu + \text{constant}_1 && \text{by transpose property} \tag{33}\label{eq:c-15} \\ =& \left( \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T \right) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) \underbrace{- (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) + \text{constant}_1 }_{\text{constant}_2} && \text{by factoring & inverse matrix} \tag{34}\label{eq:c-16} \\ =& \left( \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T \right) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 \tag{35}\label{eq:c-17} \\ =& \left( \mu^T - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} \right) (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 && \text{get }(\Lambda_0^{-1} + n \Sigma^{-1} ) \text{ out} \tag{36}\label{eq:c-18} \\ =& \left( \mu^T - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1} )^{-T} \right) (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 && \text{symmetric property, }A^{-T} = A^{-1} \tag{37}\label{eq:c-19} \\ =& \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right)^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 && \text{by transpose property} \tag{38}\label{eq:c-20} \\ =& \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) + \text{constant}_2 \tag{39}\label{eq:c-21} \\ \end{align}\]

where

\[\begin{align} \mu_n &= (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \text{ and} \nonumber \\ \Lambda_n^{-1} &= \Lambda_0^{-1} + n \Sigma^{-1}. \tag{40}\label{eq:final-mean-variance} \end{align}\]

Next, let’s substitute the part $\text{A}$ (Equation \eqref{eq:c-21}) into the posterior distribution (Equation \eqref{eq:posterior-1}),

\[\begin{align} \Pr( \mu \mid y, \Sigma ) &\propto \exp \left( -\frac{1}{2} \underbrace{ \left( (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right)}_{\text{A}} \right) \tag{41}\label{eq:posterior-2} \\ &= \exp \left( -\frac{1}{2} \left( \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) + \text{constant}_2 \right) \right) \tag{42}\label{eq:posterior-3} \\ &= \exp \left( -\frac{1}{2} \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) \right) \times \exp \left( \text{constant}_2 \right) \tag{43}\label{eq:posterior-4} \\ &\propto \exp \left( -\frac{1}{2} \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) \right) \tag{44}\label{eq:posterior-5} \\ &= \text{N}(\mu \mid \mu_n, \Lambda_n) \tag{45}\label{eq:posterior-6} \end{align}\]

where

\[\begin{align} \mu_n &= (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \nonumber \\ \Lambda_n^{-1} &= \Lambda_0^{-1} + n \Sigma^{-1}. \nonumber \end{align}\]

By the way, the above derivation is also mentioned as Exercise 3.13 in the book.

Written on June 26, 2021