Jekyll2022-05-22T11:59:48+00:00http://hbunyamin.github.io/feed.xmlHendra BunyaminForgiven sinner and Lecturer at Maranatha Christian UniversityThe Posterior Mode of Beta Distribution2021-12-30T00:00:00+00:002021-12-30T00:00:00+00:00http://hbunyamin.github.io/data-science-2/The_Posterior_Mode_of_Beta<p>This article answers <strong>Exercise 4.14</strong> from the <em>highly recommended</em> <a href="https://www.bayesrulesbook.com/chapter-4.html#practice-balancing-the-data-prior"><strong>Bayes Rules!</strong></a> book.</p> <p><a href="/assets/images/bechdel-test.png"><img src="/assets/images/bechdel-test.png" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: The Bechdel Test. Image taken from <a href="https://commons.wikimedia.org/wiki/File:Bechdel_test.png">Wikipedia</a>.</center></em></p> <p>Recall from <a href="https://www.bayesrulesbook.com/chapter-4.html#ch4-priors">Chapter 4 of the book</a> that a movie passes the <strong>Bechdel test</strong> if it meets the following conditions:</p> <ul> <li>the movie has at least two women in it,</li> <li>these two women talk to each other, and</li> <li>the two women also talk about something other than a man.</li> </ul> <p>$\text{Figure 1}$ summarizes these three rules.</p> <p>Suppose that we review a sample of $n$ recent movies and record $Y$, the number of movies that pass the Bechdel test. Treating $Y$ as the number of “successes” in a fixed number of independent trials, we can model $Y$ with a Binomial distribution whose parameter is $\pi$. Moreover, we place a Beta prior on $\pi$ with hyperparameters $\alpha$ and $\beta$:</p> \begin{align} Y \mid \pi &amp;\sim \text{Bin}(n,\pi) \\ \pi &amp;\sim \text{Beta}(\alpha, \beta). \end{align} <p>Thus, the posterior of $\pi$ in the Beta-Binomial model is given by</p> $\begin{equation} \pi \mid (Y = y) \sim \text{Beta}(\alpha + y, \beta + n - y). 
\tag{1}\label{eq:the-posterior} \end{equation}$ <p><strong>The Question:</strong></p> <blockquote> <p>In the Beta-Binomial setting, show that we can write the posterior mode of $\pi$ as the weighted average of the prior mode and observed sample success rate: $$\begin{equation} \text{Mode}(\pi \mid Y = y) = \frac{\alpha + \beta - 2}{\alpha + \beta + n - 2} \cdot \text{Mode}(\pi) + \frac{n}{\alpha + \beta + n - 2} \cdot \frac{y}{n} \tag{2}\label{eq:the-problem} \end{equation}$$</p> </blockquote> <p><strong>Answer</strong>: <br /> Recall that mode of the prior is</p> $\begin{equation} \text{Mode}(\pi) = \frac{\alpha - 1}{\alpha + \beta - 2} \tag{3}\label{eq:mode-prior} \end{equation}$ <p>and mode of the posterior is</p> $\begin{equation} \text{Mode}(\pi \mid Y = y) = \frac{\alpha + y - 1}{\alpha + \beta + n -2}. \tag{4}\label{eq:mode-posterior} \end{equation}$ <p>Next, we show that Equation \eqref{eq:mode-posterior} can be written as Equation \eqref{eq:the-problem} as follows:</p> \begin{align} \text{Mode}(\pi \mid Y = y) &amp;= \frac{\alpha + y - 1}{\alpha + \beta + n -2} \\ &amp;= \frac{\alpha - 1}{\alpha + \beta + n - 2} + \frac{y}{\alpha + \beta + n - 2} \\ &amp;= \frac{\alpha - 1}{\alpha + \beta + n - 2} \cdot \frac{\alpha + \beta -2}{\alpha + \beta -2} + \frac{y}{\alpha + \beta + n - 2} \cdot \frac{n}{n} \\ &amp;= \frac{\alpha + \beta -2}{\alpha + \beta + n - 2} \cdot \frac{\alpha - 1}{\alpha + \beta -2} + \frac{n}{\alpha + \beta + n - 2} \cdot \frac{y}{n} &amp;&amp; \text{Rearrange the terms} \\ &amp;= \frac{\alpha + \beta -2}{\alpha + \beta + n - 2} \cdot \text{Mode}(\pi) + \frac{n}{\alpha + \beta + n - 2} \cdot \frac{y}{n}. &amp;&amp; \text{Utilize Equation }\eqref{eq:mode-prior} \end{align} <p>At last, we have shown that Equation \eqref{eq:the-problem} is indeed true.</p>This article answers Exercise 4.14 from the highly recommended Bayes Rules! 
book.One of Many Inverse Theorems2021-07-10T00:00:00+00:002021-07-10T00:00:00+00:00http://hbunyamin.github.io/linear-algebra/One_of_Inverse_Theorem<p>This article is inspired by <a href="http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html"><em>a website</em></a>, which unfortunately has been down since around July 8, 2021. The website explained in detail that both the <strong>marginal distribution</strong> of a <em>subvector of multivariate normal random variables</em> and its <strong>conditional distribution</strong> given <em>the remaining elements</em> are themselves <strong>multivariate normal distributions</strong>. I feel obliged to write up the content of the broken website in a blog post from which, hopefully, every learner can benefit.</p> <p>Before we show that the previous statement is indeed true, we need a theorem about the inverse of a matrix.</p> <p>Is it true that</p> $\begin{equation} (A + CBD)^{-1} = A^{-1} - A^{-1} C (B^{-1} + DA^{-1} C)^{-1} D A^{-1}? \tag{1}\label{eq:the-theorem} \end{equation}$ <h4 id="proof"><em>Proof:</em></h4> <p>We need to prove that</p> $\begin{equation} (A + CBD)(A + CBD)^{-1} = I \tag{2}\label{eq:first-part} \end{equation}$ <p>and</p> $\begin{equation} (A + CBD)^{-1}(A + CBD) = I \tag{3}\label{eq:second-part} \end{equation}$ <p>where $I$ is an identity matrix, assuming that $A$, $B$, and $B^{-1} + DA^{-1}C$ are invertible.</p> <blockquote> <p>We need to prove that both Eq. \eqref{eq:first-part} and Eq. \eqref{eq:second-part} are true.</p> </blockquote> <p>Firstly, let’s prove Eq. \eqref{eq:first-part} by substituting the right-hand side of Eq. \eqref{eq:the-theorem} for the inverse:</p> \begin{align} (A + CBD)(A + CBD)^{-1} &amp;= (A + CBD)(A^{-1} - A^{-1} C (B^{-1} + D A^{-1} C)^{-1} D A^{-1}) \\ &amp;= (A + CBD)A^{-1} - (A + CBD) A^{-1} C (B^{-1} + D A^{-1} C)^{-1} D A^{-1} \\ &amp;= I + CBDA^{-1} - (C + CBDA^{-1}C)(B^{-1} + DA^{-1}C)^{-1}DA^{-1} \\ &amp;= I + CBDA^{-1} - CB(B^{-1} + DA^{-1}C)(B^{-1} + DA^{-1}C)^{-1}DA^{-1} \\ &amp;= I + CBDA^{-1} - CBDA^{-1} \\ &amp;= I. 
\end{align} <p>Secondly, let’s prove Eq. \eqref{eq:second-part} by employing Eq. \eqref{eq:the-theorem},</p> \begin{align} (A + CBD)^{-1}(A + CBD) &amp;= (A^{-1} - A^{-1} C (B^{-1} + D A^{-1} C)^{-1} D A^{-1})(A+CBD) \\ &amp;= A^{-1}(A + CBD) - A^{-1} C(B^{-1} + DA^{-1} C)^{-1} DA^{-1} (A + CBD) \\ &amp;= I + A^{-1} CBD - A^{-1} C (B^{-1} + DA^{-1} C)^{-1} (D + DA^{-1} CBD) \\ &amp;= I + A^{-1}CBD - A^{-1}C \underbrace{(B^{-1} + DA^{-1}C)^{-1} (B^{-1} + DA^{-1}C)}_{I} BD \\ &amp;= I + A^{-1}CBD - A^{-1}CBD \\ &amp;= I. \end{align} <p>Therefore, we have shown that this inverse theorem, Eq. \eqref{eq:the-theorem} is true.</p>This article is inspired by a website, which is unfortunately has been down since around July 8, 2021. The website elaborately explained that both marginal distributions and conditional distributions of subvector of multivariate normal random variables given the remaining elements are indeed multivariate normal distributions as well. I feel obliged to write the content of the broken website in a blog which, hopefully, every learner can learn and benefit.Completing the Square for Multivariate Normal Model2021-06-26T00:00:00+00:002021-06-26T00:00:00+00:00http://hbunyamin.github.io/data-science-2/Completing_the_Square_for_Multivariate_Model<p>The subchapter 3.5 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>Bayesian Data Analysis Third Edition</strong></a> gives distributional results of Bayesian inference for the parameters of a multivariate normal distribution with a <strong>known</strong> variance. <em>Additionally, this article discusses the derivation of those results (Equation 3.13 of the book) in <strong>gory details</strong>.</em></p> <p><a href="/assets/images/bayes-theorem.jpg"><img src="/assets/images/bayes-theorem.jpg" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: A posterior distribution equals to a likelihood times a prior divided by a piece of evidence. 
Image taken from <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Wikipedia</a>, some rights reserved.</center></em></p> <p>Suppose we have a model for an observable vector $y$ of $d$ components, that is $y$ is a column vector of $d \times 1$, with the multivariate normal distribution,</p> $\begin{equation} y \mid \mu, \Sigma \sim \text{N}(\mu, \Sigma) \tag{1}\label{eq:mvn-one-sample} \end{equation}$ <p>where $\mu$ is a column vector of length $d$ and $\Sigma$ is a known $d \times d$ variance matrix, which is <a href="https://en.wikipedia.org/wiki/Symmetric_matrix"><em>symmetric</em></a> and <a href="https://en.wikipedia.org/wiki/Definite_matrix"><em>positive definite</em></a>. Therefore, the <em>likelihood function</em> for a single observation is</p> $\begin{equation} \Pr(y \mid \mu, \Sigma) \propto \lvert \Sigma \rvert^{-1/2} \exp \left( - \frac{1}{2} (y-\mu)^T \Sigma^{-1} (y - \mu) \right), \tag{2}\label{eq:likelihood-one-sample} \end{equation}$ <p>and for a sample of $n$ independent and identically distributed observations, $y_1, \ldots, y_n$, is</p> \begin{align} \Pr( y_1, \ldots, y_n \mid \mu, \Sigma ) &amp;\propto \prod_{i=1}^{n}{ \Pr( y_i \mid \mu, \Sigma ) } \tag{3}\label{eq:likelihood-samples-1} \\ &amp;= \prod_{i=1}^{n}{ \lvert \Sigma \rvert^{-1/2} \exp \left( - \frac{1}{2} (y_i-\mu)^T \Sigma^{-1} (y_i - \mu) \right) } \tag{4}\label{eq:likelihood-samples-2} &amp;&amp; \text{using Equation }\eqref{eq:likelihood-one-sample} \\ &amp;= \prod_{i=1}^{n}{ \lvert \Sigma \rvert^{-1/2} } \prod_{i=1}^{n}{\exp \left( - \frac{1}{2} (y_i-\mu)^T \Sigma^{-1} (y_i - \mu) \right) } \tag{5}\label{eq:likelihood-samples-3} \\ &amp;= \lvert \Sigma \rvert^{-n/2} \exp \left( - \frac{1}{2} \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right). 
\tag{6}\label{eq:likelihood-samples-4} \\ \end{align} <p>Now, given the following convenient <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)"><em>trace property</em></a>,</p> $\begin{equation} \sum_{i=1}^{n}{x_i^T A x_i} = \text{tr}\left( A \sum_{i=1}^{n}{x_i x_i^T} \right) \tag{7}\label{eq:trace-property} \end{equation}$ <p>where $x_i$ is a $d \times 1$ column vector, $A$ is a $d \times d$ symmetric matrix, and $\text{tr}$ is the <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)"><em>trace function</em></a>, we can rewrite Equation \eqref{eq:likelihood-samples-4} as follows:</p> $\begin{equation} \Pr( y_1, \ldots, y_n \mid \mu, \Sigma ) \propto \lvert \Sigma \rvert^{-n/2} \exp \left( -\frac{1}{2} \text{tr}(\Sigma^{-1} S_0) \right) \tag{8}\label{eq:likelihood-final-version} \end{equation}$ <p>where $S_0$ is the “<em>sum of squares</em>” matrix relative to $\mu$,</p> $\begin{equation} S_0 = \sum_{i=1}^{n}{(y_i - \mu)(y_i - \mu)^T}. \tag{9}\label{eq:sum-of-squares} \end{equation}$ <p>Before we construct the posterior distribution of the model, let’s define the <em>prior distribution</em> as follows:</p> $\begin{equation} \Pr( \mu ) \propto \lvert \Lambda_0 \rvert^{-1/2} \exp \left(-\frac{1}{2} (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) \right) \tag{10}\label{eq:prior} \end{equation}$ <p>that is, $\mu \sim \text{N}(\mu_0, \Lambda_0)$. 
By the way, $\Lambda_0$ is also a symmetric and positive definite matrix as well.</p> <blockquote> <p>Now that we have both <em>likelihood</em> and <em>prior</em> distributions; let’s compute the posterior distribution of the model,</p> </blockquote> \begin{align} \Pr( \mu \mid y, \Sigma ) &amp;\propto \Pr( y \mid \mu, \Sigma ) \Pr(\mu \mid \Sigma) &amp;&amp; \text{by Bayes rule} \tag{11}\label{eq:posterior-def} \\ &amp;= \lvert \Sigma \rvert^{-n/2} \exp \left( - \frac{1}{2} \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right) \times \lvert \Lambda_0 \rvert^{-1/2} \exp \left(-\frac{1}{2} (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) \right) \\ &amp;\propto \exp \left( -\frac{1}{2} \underbrace{ \left( (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right)}_{\text{A}} \right) \tag{12}\label{eq:posterior-1} \end{align} <blockquote> <p>Part $\text{A}$ in Equation \eqref{eq:posterior-1} is actually a “<em>completing the quadratic form</em>” problem.</p> </blockquote> <p>Let’s solve the problem as follows:</p> \begin{align} \text{A} &amp;= (\mu^T - \mu_0^T) \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i^T - \mu^T)\Sigma^{-1}(y_i - \mu)} &amp;&amp; \text{by transpose property} \tag{13}\label{eq:complete-squares-1} \\ &amp;= \underbrace{(\mu^T \Lambda_0^{-1} - \mu_0^T \Lambda_0^{-1}) (\mu - \mu_0)}_{\text{B}} + \underbrace{\sum_{i=1}^{n}{(y_i^T \Sigma^{-1} - \mu^T \Sigma^{-1})(y_i - \mu)}}_{\text{C}} \tag{14}\label{eq:complete-squares-2} \end{align} <p>Let’s multiply out all terms in part $\text{B}$ in Equation \eqref{eq:complete-squares-2} as follows:</p> \begin{align} \text{B} &amp;= \mu^T \Lambda_0^{-1} \mu - \underbrace{\mu^T \Lambda_0^{-1} \mu_0}_{\text{a scalar}} - \underbrace{\mu_0^T \Lambda_0^{-1} \mu}_{\text{a scalar}} + \mu_0^T \Lambda_0^{-1} \mu_0 \tag{15}\label{eq:b-1} \\ &amp;= \mu^T \Lambda_0^{-1} \mu - \mu^T \Lambda_0^{-1} \mu_0 - (\mu^T \Lambda_0^{-1} \mu_0)^T + \mu_0^T \Lambda_0^{-1} \mu_0 
&amp;&amp; \text{by transpose property} \tag{16}\label{eq:b-2} \\ &amp;= \mu^T \Lambda_0^{-1} \mu - \mu^T \Lambda_0^{-1} \mu_0 - \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0 &amp;&amp; \text{as } \mu^T \Lambda_0^{-1} \mu_0 = (\mu^T \Lambda_0^{-1} \mu_0)^T \tag{17}\label{eq:b-3} \\ &amp;= \mu^T \Lambda_0^{-1} \mu - 2 \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0. \tag{18}\label{eq:b-4} \\ \end{align} <p>Let’s also multiply out part $\text{C}$ in Equation \eqref{eq:complete-squares-2},</p> \begin{align} \text{C} &amp;= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - \underbrace{y_i^T \Sigma^{-1} \mu}_{\text{scalar}} - \underbrace{\mu^T \Sigma^{-1} y_i}_{\text{scalar}} + \mu^T \Sigma^{-1} \mu)} \tag{19}\label{eq:c-1} \\ &amp;= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - (\mu^T \Sigma^{-1} y_i)^T - \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} &amp;&amp; \text{by transpose property} \tag{20}\label{eq:c-2} \\ &amp;= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - \mu^T \Sigma^{-1} y_i - \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} &amp;&amp; \text{as } (\mu^T \Sigma^{-1} y_i)^T = \mu^T \Sigma^{-1} y_i \tag{21}\label{eq:c-3} \\ &amp;= \sum_{i=1}^{n}{(y_i^T \Sigma^{-1} y_i - 2 \mu^T \Sigma^{-1} y_i + \mu^T \Sigma^{-1} \mu)} \tag{22}\label{eq:c-4} \\ &amp;= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - \sum_{i=1}^{n}{2 \mu^T \Sigma^{-1} y_i} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} &amp;&amp; \text{by a linear operator of }\sum \tag{23}\label{eq:c-5} \\ &amp;= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} \sum_{i=1}^{n}{y_i} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} \tag{24}\label{eq:c-6}\\ &amp;= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} n \overline{y} + \sum_{i=1}^{n}{\mu^T \Sigma^{-1} \mu} &amp;&amp; \text{as }\overline{y} = \frac{\sum_{i=1}^n y_i}{n} \tag{25}\label{eq:c-7} \\ &amp;= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T \Sigma^{-1} n \overline{y} + n \mu^T \Sigma^{-1} \mu &amp;&amp; \text{as }\sum_{i=1}^{n}{\text{constant}} = 
n \times \text{constant} \tag{26}\label{eq:c-8} \\ &amp;= \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} - 2 \mu^T n \Sigma^{-1} \overline{y} + \mu^T n \Sigma^{-1} \mu \tag{27}\label{eq:c-9} \\ &amp;= \mu^T n \Sigma^{-1} \mu - 2 \mu^T n \Sigma^{-1} \overline{y} + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} &amp;&amp; \text{just rearrange terms} \tag{28}\label{eq:c-10} \end{align} <p>Now let’s combine both part $\text{B}$ (Equation \eqref{eq:b-4}) and part $\text{C}$ (Equation \eqref{eq:c-10}) into part $\text{A}$ in Equation \eqref{eq:complete-squares-2},</p> \begin{align} \text{A} =&amp; \mu^T \Lambda_0^{-1} \mu - 2 \mu^T \Lambda_0^{-1} \mu_0 + \mu_0^T \Lambda_0^{-1} \mu_0 + \\ &amp; \mu^T n \Sigma^{-1} \mu - 2 \mu^T n \Sigma^{-1} \overline{y} + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i} \tag{29}\label{eq:c-11} \\ =&amp; \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - 2 \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) + \underbrace{\mu_0^T \Lambda_0^{-1} \mu_0 + \sum_{i=1}^{n}{y_i^T \Sigma^{-1} y_i}}_{\text{constant}_1} &amp;&amp; \text{sum all terms accordingly} \tag{30}\label{eq:c-12} \\ =&amp; \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - 2 \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) + \text{constant}_1 \tag{31}\label{eq:c-13} \\ =&amp; \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - \underbrace{\mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )}_{\text{scalar}} - \underbrace{\mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )}_{\text{scalar}} + \text{constant}_1 &amp;&amp; \text{separate the middle term} \tag{32}\label{eq:c-14} \\ =&amp; \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \mu - \mu^T ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} ) - ( \Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y} )^T \mu + \text{constant}_1 &amp;&amp; \text{by transpose property} \tag{33}\label{eq:c-15} \\ =&amp; \left( \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T \right) \left( \mu - (\Lambda_0^{-1} + n 
\Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) \underbrace{- (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) + \text{constant}_1 }_{\text{constant}_2} &amp;&amp; \text{by factoring &amp; inverse matrix} \tag{34}\label{eq:c-16} \\ =&amp; \left( \mu^T (\Lambda_0^{-1} + n \Sigma^{-1} ) - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T \right) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 \tag{35}\label{eq:c-17} \\ =&amp; \left( \mu^T - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} \right) (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 &amp;&amp; \text{get }(\Lambda_0^{-1} + n \Sigma^{-1} ) \text{ out} \tag{36}\label{eq:c-18} \\ =&amp; \left( \mu^T - (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y})^T (\Lambda_0^{-1} + n \Sigma^{-1} )^{-T} \right) (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 &amp;&amp; \text{symmetric property, }A^{-T} = A^{-1} \tag{37}\label{eq:c-19} \\ =&amp; \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right)^T (\Lambda_0^{-1} + n \Sigma^{-1} ) \left( \mu - (\Lambda_0^{-1} + n \Sigma^{-1})^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \right) + \text{constant}_2 &amp;&amp; \text{by transpose property} \tag{38}\label{eq:c-20} \\ =&amp; \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) + \text{constant}_2 \tag{39}\label{eq:c-21} \\ \end{align} <p>where</p> \begin{align} \mu_n &amp;= (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} 
\overline{y}) \text{ and} \nonumber \\ \Lambda_n^{-1} &amp;= \Lambda_0^{-1} + n \Sigma^{-1}. \tag{40}\label{eq:final-mean-variance} \end{align} <p>Next, let’s substitute part $\text{A}$ (Equation \eqref{eq:c-21}) into the posterior distribution (Equation \eqref{eq:posterior-1}),</p> \begin{align} \Pr( \mu \mid y, \Sigma ) &amp;\propto \exp \left( -\frac{1}{2} \underbrace{ \left( (\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^{n}{(y_i-\mu)^T \Sigma^{-1} (y_i - \mu)} \right)}_{\text{A}} \right) \tag{41}\label{eq:posterior-2} \\ &amp;= \exp \left( -\frac{1}{2} \left( \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) + \text{constant}_2 \right) \right) \tag{42}\label{eq:posterior-3} \\ &amp;= \exp \left( -\frac{1}{2} \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) \right) \times \exp \left( -\frac{1}{2} \text{constant}_2 \right) \tag{43}\label{eq:posterior-4} \\ &amp;\propto \exp \left( -\frac{1}{2} \left( \mu - \mu_n \right)^T \Lambda_n^{-1} \left( \mu - \mu_n \right) \right) \tag{44}\label{eq:posterior-5} \\ &amp;\propto \text{N}(\mu \mid \mu_n, \Lambda_n) \tag{45}\label{eq:posterior-6} \end{align} <p>where</p> \begin{align} \mu_n &amp;= (\Lambda_0^{-1} + n \Sigma^{-1} )^{-1} (\Lambda_0^{-1} \mu_0 + n \Sigma^{-1} \overline{y}) \nonumber \\ \Lambda_n^{-1} &amp;= \Lambda_0^{-1} + n \Sigma^{-1}. \nonumber \end{align} <p>Incidentally, this derivation also appears as <em>Exercise 3.13</em> in <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><em>the book</em></a>.</p>
Completing the Square for Normal Model with Multiple Observations2021-04-12T00:00:00+00:002021-04-12T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Completing_the_Square_for_Normal_Model<p>Subchapter 2.5 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>Bayesian Data Analysis Third Edition</strong></a> explains how to estimate a normal mean with known variance; in particular, the subchapter extends the development of a normal model with a single observation to the more realistic situation where <em>a sample of independent and identically distributed observations</em> $y = (y_1, \ldots, y_n)$ is available.</p> <p><a href="/assets/images/normal-dist.jpg"><img src="/assets/images/normal-dist.jpg" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: Example of a normal distribution consisting of a horde of rabbits. Image taken from <a href="https://vimeo.com/75089338">Casey Dunn</a>, some rights reserved.</center></em></p> <p>The <em>posterior</em> density of the normal model is proportional to the product of a <em>likelihood</em>, $\Pr(y \mid \theta)$, and a <em>prior</em> distribution, $\Pr(\theta)$. Specifically,</p> \begin{align} y_i \mid \theta &amp;\sim \text{N}(\theta, \sigma^2) &amp;&amp; \text{A normal distribution with mean = }\theta \text{ and variance = }\sigma^2\text{, for }i=1, \ldots, n \\ \theta &amp;\sim \text{N}(\mu_0, \tau_0^2) &amp;&amp; \text{A normal distribution with mean = }\mu_0 \text{ and variance = }\tau_0^2. 
\end{align} <p>Proceeding formally, the posterior density is</p> <p>\begin{align} \Pr(\theta \mid y) &amp;\propto \Pr(\theta) \Pr(y \mid \theta) &amp;&amp; \text{posterior definition} \tag{1}\label{eq:definition}\\ &amp;= \Pr(\theta) \prod_{i=1}^{n} \Pr(y_i \mid \theta) &amp;&amp; \text{i.i.d observations} \tag{2}\label{eq:iid} \\ &amp;\propto \exp \left( -\frac{1}{2 \tau_0^2} (\theta - \mu_0)^2 \right) \prod_{i=1}^n \exp \left( - \frac{1}{2 \sigma^2} (y_i - \theta)^2 \right) &amp;&amp; \text{normal distributions} \tag{3}\label{eq:exposition-normal} \\ &amp;= \exp \left( -\frac{1}{2} \left( \frac{1}{\tau_0^2} (\theta - \mu_0)^2 + \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2 \right) \right) &amp;&amp; \text{sum all terms} \tag{4}\label{eq:sum-all-terms} \\ &amp;= \exp \left( -\frac{1}{2} \left( \frac{1}{\tau_0^2} \theta^2 - \frac{2 \theta \mu_0}{\tau_0^2} + \frac{\mu_0^2}{\tau_0^2} + \frac{1}{\sigma^2} \sum_{i=1}^n (y_i^2 - 2 \theta y_i + \theta^2) \right) \right) &amp;&amp; \text{expand all squares} \tag{5}\label{eq:expand-all} \\ &amp;= \exp \left( -\frac{1}{2} \left( \frac{1}{\tau_0^2} \theta^2 - \frac{2 \theta \mu_0}{\tau_0^2} + \frac{\mu_0^2}{\tau_0^2} + \frac{\sum_{i=1}^n y_i^2}{\sigma^2} - \frac{2 \theta \sum_{i=1}^n y_i}{\sigma^2} + \frac{n \theta^2}{\sigma^2} \right) \right) &amp;&amp; \text{expand the last term} \tag{6}\label{eq:expand-again} \\ &amp;= \exp \left( -\frac{1}{2} \left( \frac{\theta^2}{\tau_0^2} + \frac{n \theta^2}{\sigma^2} - 2 \theta \left( \frac{\mu_0}{\tau_0^2} + \frac{\sum_{i=1}^n y_i}{\sigma^2} \right) + \frac{\mu_0^2}{\tau_0^2} + \frac{\sum_{i=1}^n y_i^2}{\sigma^2} \right) \right) &amp;&amp; \text{group all }\theta s \text{ &amp; } \theta^2 s \tag{7}\label{eq:collect-all} \\ &amp;= \exp \left( -\frac{1}{2} \left( \theta^2 \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right) - 2 \theta \left( \frac{\mu_0}{\tau_0^2} + \frac{\sum_{i=1}^n y_i}{\sigma^2} \right) + \frac{\mu_0^2}{\tau_0^2} + \frac{\sum_{i=1}^n y_i^2}{\sigma^2} 
\right) \times \frac{\frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}}{\frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}} \right) &amp;&amp; \text{use a trick} \tag{8}\label{eq:multiply-by} \\ &amp;= \exp \left( - \frac{1}{2} \frac{ \left( \theta^2 - 2 \theta \frac{ \frac{\mu_0}{\tau_0^2} + \frac{\sum y_i}{\sigma^2}}{ \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} + \frac{\frac{\mu_0^2}{\tau_0^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} + \frac{\frac{\sum y_i^2}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \right) }{\frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}} \right) \tag{9}\label{eq:atas-bawah} \\ &amp;= \exp \left( - \frac{1}{2} \frac{\left( \theta - \frac{\frac{\mu_0}{\tau_0^2} + \frac{\sum y_i}{\sigma^2} }{ \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} } \right)^2 + C}{\frac{1}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}} \right) &amp;&amp; \text{with }C \text{ is a constant} \tag{10}\label{eq:a-constant} \\ &amp;\propto \exp \left( -\frac{1}{2} \frac{(\theta - \mu_n)^2}{\tau_n^2} \right) \tag{10}\label{eq:almost} \\ &amp;\propto \text{N}(\mu_n, \tau_n^2) &amp;&amp; \text{a normal distribution} \tag{11}\label{eq:finally} \end{align} with</p> <p>\begin{align} \mu_n &amp;= \frac{\frac{\mu_0}{\tau_0^2} + \frac{\sum_{i=1}^n y_i}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2} } \tag{12}\label{eq:mu-n} \\ &amp;= \frac{\frac{\mu_0}{\tau_0^2} + \frac{n \bar{y}}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2} } &amp;&amp; \text{because }\bar{y} = \frac{\sum_{i=1}^n y_i}{n} \tag{13}\label{eq:mu-n-2} \end{align} and</p> $\begin{equation} \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}. 
\tag{14}\label{eq:sigma-n} \end{equation}$ <p>Finally, we have shown that the <em>posterior</em> distribution of the normal model is also a normal distribution, as stated in Equations (2.11) and (2.12) on page 42 of the <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>book</strong></a>.</p>A Quantum of the Fisher Information Derivation2021-02-23T00:00:00+00:002021-02-23T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Fisher_Information<p>This post elaborates on a derivation of Equation (2.20) on page 53 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>Bayesian Data Analysis Third Edition</strong></a>.</p> <p><a href="/assets/images/Youngronaldfisher2.JPG"><img src="/assets/images/Youngronaldfisher2.JPG" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: Sir Ronald Aylmer Fisher (17 February 1890 - 29 July 1962). 
One of his many great contributions to Statistics is <a href="https://en.wikipedia.org/wiki/Fisher_information">Fisher Information</a>. Image taken from <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Wikipedia</a>, some rights reserved.</center></em></p> <p>Concretely, we want to show how $J(\theta)$, the <em>Fisher Information</em>, can be rewritten from</p> $\begin{equation} J(\theta) = \text{E}\left( \left( \frac{d \log \Pr(y \mid \theta )}{d\theta} \right)^2 \, \middle| \, \theta \right) \tag{1}\label{eq:start-point} \end{equation}$ <p>to</p> $\begin{equation} J(\theta) = - \text{E}\left( \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \, \middle| \, \theta \right). \tag{2}\label{eq:end-point} \end{equation}$ <p>The idea of this derivation comes from a <a href="https://web.stanford.edu/class/stats311/Lectures/lec-09.pdf"><strong>lecture note by John Duchi from Stanford Statistics class</strong></a>. The difference between this post and the lecture note is that the lecture note deals with the <em>multivariate</em> case, which uses the matrix of second derivatives (<a href="https://en.wikipedia.org/wiki/Hessian_matrix"><em>Hessian matrix</em></a>); this post, on the other hand, deals with a single parameter and uses an ordinary second derivative.</p> <p>Let’s start by computing $$\begin{equation} \text{E} \left( \frac{d \log \Pr(y \mid \theta)}{d\theta} \, \middle| \, \theta \right) \end{equation}$$. Since the expectation is taken over $y$ given $\theta$, every integral below is with respect to $y$.</p> \require{cancel} \begin{align} \text{E}\left( \frac{d \log \Pr(y \mid \theta)}{d\theta} \, \middle| \, \theta \right) &amp;= \int \frac{d \log \Pr(y \mid \theta)}{d\theta} \Pr(y \mid \theta) \, dy &amp;&amp; \text{definition of expectation} \tag{3}\label{eq:dlog-1} \\ &amp;= \int \frac{d \Pr(y \mid \theta)}{d\theta} \frac{1}{\Pr(y \mid \theta)} \; \Pr(y \mid \theta) \, dy &amp;&amp; \text{derivative of }\log \Pr(y \mid \theta) \tag{4}\label{eq:dlog-2} \\ &amp;= \int \frac{d \Pr(y \mid \theta)}{d\theta} \frac{1}{\cancel{\Pr(y \mid 
\theta)}} \; \cancel{\Pr(y \mid \theta)} \, dy \tag{5}\label{eq:dlog-3} \\ &amp;= \int \frac{d \Pr(y \mid \theta)}{d\theta} \, dy \tag{6}\label{eq:dlog-4} \\ &amp;= \frac{d}{d\theta} \int \Pr(y \mid \theta) \, dy &amp;&amp; \text{exchange }\frac{d}{d\theta} \text{ and } \int \tag{7}\label{eq:dlog-5} \\ &amp;= \frac{d}{d\theta} \underbrace{\int \Pr(y \mid \theta) \, dy}_{1} &amp;&amp; \text{property of a pdf}\tag{8}\label{eq:dlog-6} \\ &amp;= \frac{d}{d\theta} (1) \tag{9}\label{eq:dlog-7} \\ &amp;= 0. \tag{10}\label{eq:dlog-8} \end{align} <p>Note the step in Equation \eqref{eq:dlog-5}, which is valid under the usual regularity conditions; we shall use this exchange of <em>integration</em> and <em>differentiation</em> again later.</p> <p>Equation \eqref{eq:dlog-2} states that</p> $\begin{equation} \frac{d \log \Pr(y \mid \theta)}{d\theta} = \underbrace{\frac{1}{\Pr(y \mid \theta)}}_{u} \underbrace{\frac{d \Pr(y \mid \theta)}{d\theta}}_{v}. \tag{11}\label{eq:first-order-derivation} \end{equation}$ <p>Therefore,</p> \begin{align} \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} &amp;= \underbrace{- \frac{1}{\Pr( y \mid \theta )^2} \frac{d \Pr(y \mid \theta)}{d\theta}}_{u^{\prime}} \underbrace{\frac{d \Pr(y \mid \theta)}{d\theta}}_{v} + \underbrace{\frac{1}{\Pr(y \mid \theta)}}_{u} \underbrace{\frac{d^2 \Pr(y \mid \theta)}{d\theta^2}}_{v^{\prime}} &amp;&amp; \text{based on } u^{\prime} v + u v^{\prime} \tag{12}\label{eq:second-order-1} \\ &amp;= \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} - \left( \frac{d\Pr(y \mid \theta)}{d\theta} \frac{1}{\Pr(y \mid \theta)} \right) \left( \frac{d\Pr(y \mid \theta)}{d\theta} \frac{1}{\Pr(y \mid \theta)} \right) &amp;&amp; \text{just rearranging} \tag{13}\label{eq:second-order-2} \\ &amp;= \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} - \left( \frac{d \log \Pr( y \mid \theta)}{d\theta} \right) \left( \frac{d \log \Pr( y \mid \theta)}{d\theta} \right) &amp;&amp; \text{based on Equation }\eqref{eq:first-order-derivation} 
\tag{14}\label{eq:second-order-3} \\ &amp;= \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} - \left( \frac{d \log \Pr( y \mid \theta)}{d\theta} \right)^{2} \tag{15}\label{eq:second-order-4} \end{align} <p>From Equation \eqref{eq:second-order-4} we obtain</p> \begin{align} \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} = \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} - \left( \frac{d \log \Pr( y \mid \theta)}{d\theta} \right)^{2} &amp;\Longleftrightarrow \left( \frac{d \log \Pr( y \mid \theta)}{d\theta} \right)^{2} = - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} + \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} \tag{16}\label{eq:second-order-last} \end{align} <p>Now we are ready to calculate $$\begin{equation} \text{E}\left( \left( \frac{d \log \Pr(y \mid \theta )}{d\theta} \right)^2 \, \middle| \, \theta \right), \end{equation}$$ where the expectation is again an integral over $y$.</p> \require{cancel} \begin{align} \text{E}\left( \left( \frac{d \log \Pr(y \mid \theta )}{d\theta} \right)^2 \, \middle| \, \theta \right) &amp;= \int \left( \frac{d \log \Pr(y \mid \theta )}{d\theta} \right)^2 \Pr(y \mid \theta) \, dy &amp;&amp; \text{by definition} \tag{17}\label{eq:final-showdown-1}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} + \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} \right) \Pr(y \mid \theta) \, dy &amp;&amp; \text{by Equation }\eqref{eq:second-order-last} \tag{18}\label{eq:final-showdown-2}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy + \int \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\Pr(y \mid \theta)} \Pr(y \mid \theta) \, dy &amp;&amp; \text{by linearity of the integral} \tag{19}\label{eq:final-showdown-3}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy + \int \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \frac{1}{\cancel{\Pr(y \mid \theta)}} \cancel{\Pr(y \mid \theta)} \, dy 
\tag{20}\label{eq:final-showdown-4}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy + \int \frac{d^2 \Pr(y \mid \theta)}{d\theta^2} \, dy \tag{21}\label{eq:final-showdown-5}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy + \frac{d^2}{d\theta^2} \left( \int \Pr(y \mid \theta) \, dy \right) &amp;&amp; \text{exchange derivative and integral again} \tag{22}\label{eq:final-showdown-6}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy + \underbrace{\frac{d^2}{d\theta^2} \left( \int \Pr(y \mid \theta) \, dy \right)}_{0} &amp;&amp; \text{similar to Equation }\eqref{eq:dlog-8} \tag{23}\label{eq:final-showdown-7}\\ &amp;= \int \left( - \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy \tag{24}\label{eq:final-showdown-8}\\ &amp;= - \int \left( \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \right) \Pr(y \mid \theta) \, dy \tag{25}\label{eq:final-showdown-9}\\ &amp;= - \text{E}\left( \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \, \middle| \, \theta \right). 
&amp;&amp; \text{by definition} \tag{26}\label{eq:final-showdown-10} \end{align} <p>At last, we have finally shown that</p> $\begin{equation} J(\theta) = \text{E}\left( \left( \frac{d \log \Pr(y \mid \theta )}{d\theta} \right)^2 \, \middle| \, \theta \right) = - \text{E}\left( \frac{d^2 \log \Pr(y \mid \theta)}{d\theta^2} \, \middle| \, \theta \right) \end{equation}$ <p>as it is explained by Equation (2.20) on page 53 of the <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>book</strong></a>.</p>This post elaborates a derivation of Equation (2.20) on page 53 of Bayesian Data Analysis Third Edition.Mean and Variance of The Negative Binomial through Conditionals2020-12-30T00:00:00+00:002020-12-30T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Mean_and_Variance_of_Negative_Binomial<p>This post is the continuation of <a href="https://hbunyamin.github.io/data-science-1/Derivation_Marginal_Distribution/"><strong>the post which derives a predictive distribution from Poisson &amp; Gamma Conjugate Pair</strong></a>.</p> <p class="center-image"><a href="/assets/images/highest-cancer-death-rate.png"><img src="/assets/images/highest-cancer-death-rate.png" alt="img4" class="img-resize-2" /></a> <a href="/assets/images/lowest-cancer-death-rate.png"><img src="/assets/images/lowest-cancer-death-rate.png" alt="img5" class="img-resize-2" /></a><em><center>$\pmb{\text{Figure 1}}$: The counties of the United States with the highest ($\pmb{\text{left}}$) and the lowest ($\pmb{\text{right}}$) 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989. Image taken from <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf">BDA 3rd Edition</a>, some rights reserved.</center></em></p> <p>Previously, $\text{Figure 1}$ shows misleading patterns in the maps of cancer death rates which are modeled by a <em>posterior distribution</em>, in this case a <em>Gamma distribution</em>. 
The <em>likelihood</em> is defined as</p> $\begin{equation} y_j \mid \theta_j \sim \text{Poisson}(10 n_j \theta_j), \tag{1}\label{eq:likelihood} \end{equation}$ <p>the <em>prior distribution</em> is</p> $\begin{equation} \theta_j \sim \text{Gamma}(\alpha, \beta). \tag{2}\label{eq:prior} \end{equation}$ <p>Multiplying the likelihood in Equation \eqref{eq:likelihood} by the prior in Equation \eqref{eq:prior} and normalizing, we arrive at the <em>posterior distribution</em></p> $\begin{equation} \theta_j \mid y_j \sim \text{Gamma}(\alpha + y_j, \beta + 10 \, n_j). \tag{3}\label{eq:posterior} \end{equation}$ <p><a href="https://hbunyamin.github.io/data-science-1/Derivation_Marginal_Distribution/"><strong>The previous post</strong></a> shows that</p> $\begin{equation} \Pr(y_j) = \int \Pr(y_j \mid \theta_j) \Pr(\theta_j) \, d\theta_j \tag{4}\label{eq:predictive-distribution} \end{equation}$ <p>is a <em>negative binomial distribution</em>, $\text{Neg-bin}( \alpha, \frac{\beta}{10 n_j} )$.</p> <blockquote> <p>This post derives the <strong>mean</strong> ($\text{E}(y_j)$) and <strong>variance</strong> ($\text{var}(y_j)$) of this <em>negative binomial distribution</em>.</p> </blockquote> <p>Specifically, we utilize the following two equations,</p> $\begin{equation} \text{E}(u) = \text{E}(\text{E}( u \mid v )) \tag{5}\label{eq:conditional-mean} \end{equation}$ <p>and</p> $\begin{equation} \text{var}(u) = \text{E}(\text{var}(u \mid v)) + \text{var}(\text{E}(u \mid v)) \tag{6}\label{eq:conditional-variance} \end{equation}$ <p>in the derivation.</p> <p>Firstly, we employ Equation \eqref{eq:conditional-mean} to find $\text{E}(y_j)$ as follows:</p> \begin{align} \text{E}(y_j) &amp;= \iint y_j \Pr(y_j, \theta_j) \, dy_j \, d\theta_j &amp;&amp; \text{definition of expectation} \tag{7}\label{eq:definition-expectation} \\ &amp;= \iint y_j \Pr(y_j \mid \theta_j) \Pr(\theta_j) \, dy_j \, d\theta_j &amp;&amp; \text{definition of conditional probability} \tag{8}\label{eq:definition-conditional-prob} \\ &amp;= \iint y_j \Pr(y_j 
\mid \theta_j) \, dy_j \Pr(\theta_j) \, d\theta_j &amp;&amp; \text{just rearranging} \tag{9}\label{eq:rearranging} \\ &amp;= \int \underbrace{\int y_j \Pr(y_j \mid \theta_j) \, dy_j}_{\text{An expectation}} \Pr(\theta_j) \, d\theta_j &amp;&amp; \tag{10}\label{eq:an-expectation} \\ &amp;= \int \text{E}(y_j \mid \theta_j) \Pr(\theta_j) \, d\theta_j. \tag{11}\label{eq:gathering-expectation} \end{align} <p>Recall that $y_j \mid \theta_j$ follows a $\text{Poisson}(10 n_j \theta_j)$ distribution by Equation \eqref{eq:likelihood}; therefore, we can proceed from Equation \eqref{eq:gathering-expectation} as follows:</p> \begin{align} \text{E}(y_j) &amp;= \int 10 n_j \theta_j \Pr(\theta_j) \, d\theta_j &amp;&amp; \text{because }\text{E}(y_j \mid \theta_j) = 10 n_j \theta_j \tag{12}\label{eq:inserting-expectation} \\ &amp;= \int 10 n_j \theta_j \frac{\beta^\alpha}{\Gamma (\alpha)} \theta_j^{\alpha-1} e^{-\beta \theta_j} \, d\theta_j &amp;&amp; \text{based on Equation }\eqref{eq:prior}, \text{a Gamma} \tag{13}\label{eq:inserting-gamma} \\ &amp;= \int 10 n_j \frac{\beta^\alpha}{\Gamma (\alpha)} \theta_j^{\alpha} e^{-\beta \theta_j} \, d\theta_j &amp;&amp; \text{absorbing }\theta_j \text{ into }\theta_j^{\alpha-1} \tag{14}\label{eq:mean-1} \\ &amp;= 10 n_j \frac{\beta^\alpha}{\Gamma(\alpha)} \int \theta_j^{(\alpha+1)-1} e^{-\beta \theta_j} \, d\theta_j &amp;&amp; \text{pulling out }10 n_j \frac{\beta^\alpha}{\Gamma(\alpha)} \tag{15}\label{eq:mean-2} \end{align} <p>Remember that for a Gamma distribution with shape $\alpha + 1$ and rate $\beta$,</p> $\begin{equation} \theta_j \sim \text{Gamma}(\alpha + 1, \beta), \tag{16}\label{eq:gamma-dist} \end{equation}$ <p>the integral of its <em>probability density function</em> over $[0, \infty)$ is $1$,</p> \begin{align} \int_{0}^{\infty} \frac{\beta^{\alpha+1}}{\Gamma(\alpha+1)} \theta_j^{(\alpha+1)-1} e^{-\beta \theta_j} \, d\theta_j = 1 &amp;\Longleftrightarrow \int_{0}^{\infty} \theta_j^{(\alpha+1)-1} e^{-\beta \theta_j} \, d\theta_j = \frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}}. 
\tag{17}\label{eq:gamma-dist-1} \end{align} <p>Substituting Equation \eqref{eq:gamma-dist-1} into Equation \eqref{eq:mean-2}, we have the <strong>mean of the negative binomial distribution</strong>:</p> \require{cancel} \begin{align} \text{E}(y_j) &amp;= 10 n_j \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{\beta^{\alpha+1}} \\ &amp;= 10 n_j \frac{\cancel{\beta^\alpha}}{\Gamma(\alpha)} \frac{\Gamma(\alpha+1)}{\cancel{\beta^\alpha}\beta} \\ &amp;= 10 n_j \frac{\alpha !}{(\alpha - 1)!} \frac{1}{\beta} &amp;&amp; \text{because }\Gamma(\alpha) = (\alpha-1)! \text{ for integer }\alpha \\ &amp;= 10 n_j \frac{\alpha \cdot (\alpha-1)!}{(\alpha - 1)!} \frac{1}{\beta} \\ &amp;= 10 n_j \frac{\alpha \cdot \cancel{(\alpha-1)!}}{\cancel{(\alpha - 1)!}} \frac{1}{\beta} \\ &amp;= 10 n_j \frac{\alpha}{\beta}. \tag{18}\label{eq:mean-neg-bin} \end{align} <p>(For non-integer $\alpha$, the same result follows from $\Gamma(\alpha+1) = \alpha \, \Gamma(\alpha)$.) Next, we shall compute $\text{var}(y_j)$ by utilizing Equation \eqref{eq:conditional-variance},</p> $\begin{equation} \text{var}(y_j) = \text{E}(\text{var}(y_j \mid \theta_j)) + \text{var}(\text{E}(y_j \mid \theta_j)). \tag{19}\label{eq:variance-1} \end{equation}$ <p>Recall that</p> $\begin{equation} y_j \mid \theta_j \sim \text{Poisson}(10 n_j \theta_j); \end{equation}$ <p>therefore, we have</p> $\begin{equation} \text{E}(y_j \mid \theta_j) = \text{var}(y_j \mid \theta_j) = 10 n_j \theta_j. \tag{20}\label{eq:mean-variance-poisson} \end{equation}$ <p>By substituting Equation \eqref{eq:mean-variance-poisson} into Equation \eqref{eq:variance-1}, we get the <strong>variance of the negative binomial distribution</strong></p> \begin{align} \text{var}(y_j) &amp;= \text{E}(10 n_j \theta_j) + \text{var}(10 n_j \theta_j) \\ &amp;= 10 n_j \text{E}(\theta_j) + (10 n_j)^2 \, \text{var}(\theta_j) &amp;&amp; \text{note: }\theta_j \sim \text{Gamma}(\alpha, \beta) \text{ and var}(k \theta_j) = k^2 \text{var}(\theta_j) \\ &amp;= 10 n_j \frac{\alpha}{\beta} + (10 n_j)^2 \frac{\alpha}{\beta^2}. 
&amp;&amp; \text{E}(\theta_j) = \frac{\alpha}{\beta} \text{ and } \text{var}(\theta_j) = \frac{\alpha}{\beta^2} \tag{21}\label{eq:variance-negative-binomial} \end{align} <p>At last, we have shown the <strong>mean</strong> and <strong>variance of negative binomial distribution</strong> in Equation \eqref{eq:mean-neg-bin} and \eqref{eq:variance-negative-binomial} respectively.</p> <p>This post is also a <strong>solution of exercise number 6</strong> from <em>Chapter 2</em> of the <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>book</strong></a>.</p>This post is the continuation of the post which derives a predictive distribution from Poisson &amp; Gamma Conjugate Pair.Deriving Marginal Distribution from Poisson &amp; Gamma Conjugate Pair2020-12-23T00:00:00+00:002020-12-23T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Derivation_Marginal_Distribution<p>This post shows the derivation of <em>marginal distribution</em> from a <strong>Poisson</strong> model with <strong>Gamma</strong> prior distribution. Specifically, the idea comes from Chapter 2 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><em><strong>Bayesian Data Analysis (BDA) 3rd Edition</strong></em></a> on page 49.</p> <p><a href="/assets/images/highest-cancer-death-rate.png"><img src="/assets/images/highest-cancer-death-rate.png" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: The counties of the United States with the highest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989. 
Image taken from <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf">BDA 3rd Edition</a>, some rights reserved.</center></em></p> <p>$\text{Figure 1}$ shows that most of the shaded counties are located in the middle of the country (<a href="https://en.wikipedia.org/wiki/Great_Plains"><strong>Great Plains</strong></a>).</p> <p><a href="/assets/images/lowest-cancer-death-rate.png"><img src="/assets/images/lowest-cancer-death-rate.png" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 2}}$: The counties of the United States with the lowest 10% age-standardized death rates for cancer of kidney/ureter for U.S. white males, 1980-1989. Interestingly, the pattern is somewhat similar to the map of the highest rates in $\text{Figure 1}$. Image taken from <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf">BDA 3rd Edition</a>, some rights reserved.</center></em></p> <p>Both $\text{Figure 1}$ and $\text{Figure 2}$ show that the <em>Great Plains</em> has both the highest and the lowest rates. Recall that the reason for this pattern is <em>sample size</em>. The <em>Great Plains</em> has many low-population counties, where the raw death rate for a rare cause such as kidney cancer is highly variable; therefore these counties are overrepresented in both maps. Neither map is evidence that cancer rates in the region are particularly high (please read page 47 of the <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf">excellent book</a> for more details).</p> <p>These misleading patterns in the maps of raw death rates suggest that a Poisson model-based approach to estimating the true underlying rates might be helpful. 
Let’s construct a <em>likelihood</em> from a Poisson distribution,</p> $\begin{equation} y_j \mid \theta_j \sim \text{Poisson}(10 \, n_j \, \theta_j) \tag{1}\label{eq:likelihood} \end{equation}$ <p>where $y_j$ denotes the number of kidney cancer deaths in county $j$ from 1980-1989, $n_j$ is the population of the county, and $\theta_j$ is the underlying rate in units of deaths per person per year.</p> <p>The conjugate prior for the <em>Poisson</em> model is a <em>Gamma</em> distribution with parameters $\alpha$ and $\beta$:</p> $\begin{equation} \theta_j \sim \text{Gamma}(\alpha, \beta). \tag{2}\label{eq:prior} \end{equation}$ <p>By multiplying Equation \eqref{eq:likelihood} and \eqref{eq:prior} and normalizing, we obtain the posterior</p> $\begin{equation} \theta_j \mid y_j \sim \text{Gamma}(\alpha + y_j, \beta + 10 \, n_j). \tag{3}\label{eq:posterior} \end{equation}$ <p>Recall that Bayes’ rule states that</p> $\begin{equation} \Pr( \theta_j \mid y_j ) = \frac{\Pr( y_j \mid \theta_j ) \Pr(\theta_j)}{\Pr(y_j)} \Longleftrightarrow \Pr(y_j) = \frac{\Pr( y_j \mid \theta_j ) \Pr(\theta_j)}{\Pr( \theta_j \mid y_j )}. \tag{4}\label{eq:bayes-rule} \end{equation}$ <p>Specifically, we will <strong>derive the predictive distribution</strong>, that is, the marginal distribution of $y_j$ averaged over the prior distribution of $\theta_j$ or, in short, $\pmb{\Pr(y_j)}$. <em><strong>The objective of this post is to show this derivation</strong></em>.</p> <blockquote> <p><em>How do we derive $\Pr(y_j)$?</em></p> </blockquote> <p>Firstly, we have the likelihood, a Poisson distribution, as shown in Equation \eqref{eq:likelihood}</p> $\begin{equation} \Pr(y_j \mid \theta_j) = \frac{1}{y_j!} (10 n_j \theta_j)^{y_j} \, e^{-10 n_j \theta_j}. \tag{5}\label{eq:likelihood-poisson} \end{equation}$ <p>Secondly, we also have the prior, a Gamma distribution, as described in Equation \eqref{eq:prior}</p> $\begin{equation} \Pr(\theta_j) = \frac{\beta^\alpha}{\Gamma (\alpha) } \theta_j^{\alpha - 1} e^{-\beta \theta_j}. 
\tag{6}\label{eq:prior-gamma} \end{equation}$ <p>Last but not least, we have our posterior distribution, a Gamma distribution, as shown in Equation \eqref{eq:posterior}</p> $\begin{equation} \Pr( \theta_j \mid y_j ) = \frac{(\beta + 10 n_j )^{\alpha + y_j}}{\Gamma (\alpha + y_j) } \theta_j^{\alpha + y_j -1} e^{-(\beta + 10 n_j) \theta_j}. \tag{7}\label{eq:posterior-gamma} \end{equation}$ <p>Let’s substitute Equation \eqref{eq:likelihood-poisson}, \eqref{eq:prior-gamma}, and \eqref{eq:posterior-gamma} into Equation \eqref{eq:bayes-rule} as follows:</p> \require{cancel} \begin{align} \Pr(y_j) &amp;= \frac{\frac{1}{y_j!} (10 n_j \theta_j)^{y_j} \, e^{-10 n_j \theta_j} \times \frac{\beta^\alpha}{\Gamma (\alpha) } \theta_j^{\alpha - 1} e^{-\beta \theta_j} }{\frac{(\beta + 10 n_j )^{\alpha + y_j}}{\Gamma (\alpha + y_j) } \theta_j^{\alpha + y_j -1} e^{-(\beta + 10 n_j) \theta_j}} \tag{8}\label{eq:derivation-1}\\ &amp;= \frac{1}{y_j !} \frac{(10 n_j)^{y_j} \theta_j^{y_j} e^{-10 n_j \theta_j} \frac{\beta^\alpha}{\Gamma (\alpha)} \theta_j^{\alpha-1} e^{-\beta \theta_j} \Gamma (\alpha + y_j)}{(\beta + 10 n_j)^{\alpha+y_j} \theta_j^{y_j} \theta_j^{\alpha-1} e^{-\beta \theta_j} e^{-10 n_j \theta_j}} \tag{9}\label{eq:derivation-2} \\ &amp;= \frac{1}{y_j !} \frac{(10 n_j)^{y_j} \cancel{\theta_j^{y_j}} \cancel{e^{-10 n_j \theta_j}} \frac{\beta^\alpha}{\Gamma (\alpha)} \cancel{\theta_j^{\alpha-1}} \cancel{e^{-\beta \theta_j}} \Gamma (\alpha + y_j)}{(\beta + 10 n_j)^{\alpha+y_j} \cancel{\theta_j^{y_j}} \cancel{\theta_j^{\alpha-1}} \cancel{e^{-\beta \theta_j}} \cancel{e^{-10 n_j \theta_j}}} \tag{10}\label{eq:derivation-3} \\ &amp;= \frac{1}{y_j !} \frac{(10 n_j)^{y_j} \beta^{\alpha} \Gamma(\alpha + y_j)}{\Gamma(\alpha) (\beta + 10 n_j)^{\alpha + y_j}} \tag{11}\label{eq:derivation-4} \\ &amp;= \frac{1}{y_j !} \frac{\Gamma(\alpha + y_j)}{\Gamma(\alpha)} \frac{(10 n_j)^{y_j}}{(\beta + 10 n_j )^{\alpha + y_j} } \beta^{\alpha} \tag{12}\label{eq:derivation-5} \\ &amp;= \frac{1}{y_j !} 
\frac{\Gamma(\alpha + y_j)}{\Gamma(\alpha)} \frac{(10 n_j)^{y_j}}{(\beta + 10 n_j )^{y_j} } \frac{\beta^{\alpha}}{(\beta+10 n_j)^{\alpha}} \tag{13}\label{eq:derivation-6} \\ &amp;= \frac{1}{y_j !} \frac{(\alpha + y_j - 1)!}{(\alpha - 1)!} \frac{(10 n_j)^{y_j}}{(\beta + 10 n_j )^{y_j} } \frac{\beta^{\alpha}}{(\beta+10 n_j)^{\alpha}} &amp;&amp; \text{because }\Gamma(n) = (n-1)! \tag{14}\label{eq:derivation-7} \\ &amp;= \binom{\alpha + y_j - 1}{\alpha - 1} \left( \frac{\beta}{\beta + 10 n_j} \right)^\alpha \left( \frac{10 n_j}{\beta + 10 n_j} \right)^{y_j} &amp;&amp; \text{because }\binom{n}{r} = \frac{n!}{r! \, (n-r)!} \tag{15}\label{eq:derivation-8} \\ &amp;= \binom{y_j + \alpha - 1}{\alpha - 1} \left( \frac{\frac{\beta}{10 n_j}}{ \frac{\beta}{10 n_j} + 1} \right)^\alpha \left( \frac{1}{\frac{\beta}{10 n_j } + 1 } \right)^{y_j} \tag{16}\label{eq:derivation-9} \end{align} <p>As we know that a <strong><em>negative binomial</em></strong> distribution, $\text{Neg-bin}(\alpha, \beta)$, is</p> $\begin{equation} \theta \sim \binom{\theta + \alpha - 1}{\alpha - 1} \left( \frac{\beta}{\beta+1} \right)^\alpha \left( \frac{1}{\beta + 1} \right)^\theta, \qquad \theta = 0, 1, 2, \ldots \tag{17}\label{eq:derivation-10} \end{equation}$ <p>Therefore, we conclude that $\Pr(y_j)$ in Equation \eqref{eq:derivation-9} is indeed a <strong><em>negative binomial distribution</em></strong>,</p> $\begin{equation} y_j \sim \text{Neg-bin}\left( \alpha, \frac{\beta}{10 n_j} \right) \tag{18}\label{eq:final-derivation} \end{equation}$ <p>as explained on page 49 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><strong>the book</strong></a>.</p>This post shows the derivation of marginal distribution from a Poisson model with Gamma prior distribution. 
Specifically, the idea comes from Chapter 2 of Bayesian Data Analysis (BDA) 3rd Edition on page 49.Showing Binomial is an Exponential Family with a Natural Parameter2020-12-08T00:00:00+00:002020-12-08T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Natural_Parameter<p>This post shows that the binomial is indeed an <strong>exponential family</strong> with <strong>natural parameter</strong> $\text{logit}(\theta)$. Specifically, this exercise comes from Chapter 2 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><em><strong>Bayesian Data Analysis (BDA) 3rd Edition</strong></em></a> on page 37.</p> <p>Recall that for a binomial <strong>likelihood</strong> $\Pr(y \mid \theta, n ) = \text{Bin}(y \mid n, \theta)$ with $n$ known, the <strong>conjugate prior distribution</strong> on $\theta$ is a <strong>beta distribution</strong>. In particular, the <strong>likelihood</strong> (a <em>binomial distribution</em>) is</p> <p>$$\begin{equation} \Pr( y \mid \theta ) \propto \theta^y (1 - \theta)^{n-y} \tag{1}\label{eq:likelihood} \end{equation}$$ <br /> where $\theta$ denotes the probability of a head, $n$ is the number of trials, and $y$ is the number of heads. Additionally, the prior (a <em>beta distribution</em>) is</p> <p>$$\begin{equation} \Pr( \theta ) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \tag{2}\label{eq:prior} \end{equation}$$ <br /> where $\alpha$ and $\beta$ play the role of prior counts of heads and tails, respectively.</p> <p>We will show that</p> $\begin{equation} \Pr(\theta \mid y ) \propto g(\theta)^{\eta + n} \exp{\left( \phi(\theta)^T (\nu + t(y)) \right)} \tag{3}\label{eq:posterior-density} \end{equation}$ <p>Equation \eqref{eq:posterior-density} is the general form, which holds for a vector $\phi(\theta)$ and in which $\eta$ and $\nu$ are constants. 
Let’s start computing the posterior density as follows:</p> \begin{align} \Pr(\theta \mid y ) &amp;\propto \Pr(y \mid \theta) \Pr( \theta ) \tag{4}\label{eq:posterior-start} &amp;&amp; \text{by Bayes Rule} \\ &amp;\propto \theta^y (1- \theta)^{n-y} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\ &amp;= \theta^{y+\alpha-1} (1 - \theta)^{n - y + \beta - 1} \\ &amp;= \theta^{y+\alpha-1} \, \frac{1}{(1 - \theta)^{-n+y-\beta +1}} \\ &amp;= \frac{\theta^{\alpha-1}}{(1 - \theta)^{-n-\beta+1}} \, \frac{\theta^y}{(1-\theta)^y} &amp;&amp; \text{by rearranging terms} \\ &amp;= \frac{\theta^{\alpha-1}}{(1 - \theta)^{-n-\beta+1}} \, \exp{\left( \log{ \left( \frac{\theta}{1-\theta} \right)^y } \right)} \\ &amp;= \theta^{\alpha-1} (1 - \theta)^{\beta-1} (1 - \theta)^n \, \exp{( y \; \text{logit}{ (\theta) } )} \\ &amp;= \left( \theta^{\frac{\alpha - 1}{n}} (1 - \theta)^{\frac{\beta-1}{n}} (1-\theta) \right)^n \, \exp{( \text{logit}{ (\theta) } \; y )} \\ &amp;= g(\theta)^n \exp{( \phi(\theta) \; t(y) )} &amp;&amp; \text{by referring to Equation }\eqref{eq:posterior-density} \end{align} <p>with $g(\theta) = \left( \theta^{\frac{\alpha - 1}{n}} (1 - \theta)^{\frac{\beta-1}{n}+1} \right)$, $t(y) = y$, and $\phi(\theta) = \text{logit}(\theta)$. <br /> Finally, we have shown that the binomial is indeed an <strong>exponential family</strong> with <strong>natural parameter</strong> $\text{logit}(\theta)$.</p>This post shows that the binomial is indeed an exponential family with natural parameter $\text{logit}(\theta)$. Specifically, this exercise comes from Chapter 2 of Bayesian Data Analysis (BDA) 3rd Edition on page 37. Recall that a binomial distribution whose likelihood $\Pr(y \mid \theta, n ) = \text{Bin}(y \mid n, \theta)$ with $n$ known, the conjugate prior distribution on $\theta$ is a beta distribution. 
Particularly, the likelihood (a binomial distribution) isPredictive Distributions (BDA 3rd Edition, Chapter 2)2020-11-25T00:00:00+00:002020-11-25T00:00:00+00:00http://hbunyamin.github.io/data-science-1/Predictive_Distributions<p>This post provides an answer for Exercise 2 from Chapter 2 of <a href="http://www.stat.columbia.edu/~gelman/book/BDA3.pdf"><em><strong>Bayesian Data Analysis (BDA) 3rd Edition</strong></em></a>. Let’s state the problem from the <em>beloved</em> book.</p> <p>Consider two coins, $C_1$ and $C_2$, with the following characteristics:</p> \begin{align} \Pr(\text{head} \mid C_1) &amp;= 0.6, \\ \Pr(\text{head} \mid C_2) &amp;= 0.4. \end{align} <p>Choose one of the coins at random and imagine spinning it repeatedly. <br /> Here is the question:</p> <blockquote> <p>Given that the <strong>first two spins</strong> from the chosen coin are <strong>tails</strong>, what is the <strong>expectation of the number of additional spins until a head shows up</strong>?</p> </blockquote> <p>To simplify our writing, we denote $\text{head}$ and $\text{tail}$ as $H$ and $T$ respectively. Therefore, we have</p> \begin{align*} C_1 &amp;\rightarrow \Pr(H \mid C_1) = 0.6 = \frac{3}{5}, \tag{1}\label{eq:c1}\\ C_2 &amp;\rightarrow \Pr(H \mid C_2) = 0.4 = \frac{2}{5}. \tag{2}\label{eq:c2} \end{align*} <p><a href="/assets/images/the-experiment.png"><img src="/assets/images/the-experiment.png" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: The problem poses an experiment consisting of two steps. Specifically, $N \sim$ geometric distribution.</center></em></p> <p>If we read the problem carefully, we may find that the problem consists of two steps as depicted in $\pmb{\text{Figure 1}}$. 
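</p> <p>The two-step experiment can also be mimicked with a small simulation, which is handy as a sanity check on the algebra. The sketch below is only illustrative: the function name <code>simulate_additional_spins</code>, the number of trials, and the seed are our own assumptions, not part of the book's exercise. It picks a coin at random, keeps only the runs whose first two spins are tails, and then counts the additional spins until the first head.</p>

```python
import random

# Pr(head | coin), taken from the problem statement.
HEAD_PROB = {"C1": 0.6, "C2": 0.4}


def simulate_additional_spins(num_trials=100_000, seed=0):
    """Monte Carlo estimate of E(N | TT); a hypothetical helper for illustration."""
    rng = random.Random(seed)
    total_spins = 0
    kept = 0
    while kept < num_trials:
        # Step 1: choose one of the two coins at random.
        p_head = HEAD_PROB[rng.choice(["C1", "C2"])]
        # Condition on TT: discard the run if either of the first two spins is a head.
        if rng.random() < p_head or rng.random() < p_head:
            continue
        # Step 2: count the additional spins up to and including the first head.
        n = 1
        while rng.random() >= p_head:
            n += 1
        total_spins += n
        kept += 1
    return total_spins / kept


if __name__ == "__main__":
    print(simulate_additional_spins())
```

<p>The printed estimate can be compared with the closed-form value of $\text{E}(N \mid TT)$ derived in the rest of this post.</p> <p>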
In particular, the random variable $N$ follows a geometric distribution whose <em>probability mass function</em> is</p> <p>$$\begin{equation} \Pr(n) = (1-C_i)^{n-1} C_i \tag{3}\label{eq:pmf-geometri} \end{equation}$$ where, with a slight abuse of notation, $C_i$ stands for $\Pr(H \mid C_i)$, the head probability of the chosen coin ($C_1$ or $C_2$).</p> <p>Now, let us compute $$\begin{equation} E(N \mid TT) = ? \tag{4}\label{eq:problem} \end{equation}$$ as shown</p> \begin{align} \text{E}(N \mid TT) &amp;= \int N \, \Pr(N \mid TT) \, dN &amp;&amp; \text{by definition} \tag{5}\label{eq:compute-1} \\ &amp;= \int \int N \, \Pr(N, C \mid TT) \, dC \, dN &amp;&amp; \text{by marginalizing over }C \tag{6}\label{eq:compute-2} \\ &amp;= \int \int N \, \Pr(N \mid TT,C) \Pr(C \mid TT) \, dC \, dN &amp;&amp; \text{by conditional probability} \tag{7}\label{eq:compute-3} \\ &amp;= \int \underbrace{\int N \, \Pr(N \mid TT,C) \, dN}_{\text{E}(N \mid TT, C)} \, \Pr(C \mid TT) \, dC &amp;&amp; \text{just rearranging} \tag{8}\label{eq:compute-4} \\ &amp;= \int \text{E}(N \mid TT, C) \, \Pr(C \mid TT) \, dC &amp;&amp; \text{by expectation definition} \tag{9}\label{eq:compute-5} \\ &amp;= \sum_{i=1}^2 \text{E}(N \mid TT, C_i) \, \Pr(C_i \mid TT) &amp;&amp; \text{since }C \text{ is discrete.} \tag{10}\label{eq:compute-6} \end{align} <p>Recall that $N$ follows a geometric distribution; accordingly,</p> \begin{align} \text{E}(N \mid TT, C_i) &amp;= \text{E}(N \mid C_i) &amp;&amp; \text{given the coin, later spins do not depend on }TT \tag{11}\label{eq:expectation-1} \\ &amp;= \frac{1}{C_i}. \tag{12}\label{eq:expectation-2} 
\end{align} <p>We proceed from Equation \eqref{eq:compute-6} as shown</p> \begin{align} \text{E}(N \mid TT) &amp;= \text{E}(N \mid TT, C_1) \, \Pr(C_1 \mid TT) + \text{E}(N \mid TT, C_2) \, \Pr(C_2 \mid TT) \tag{13}\label{eq:final-1} \\ &amp;= \frac{1}{C_1} \Pr(C_1 \mid TT) + \frac{1}{C_2} \Pr(C_2 \mid TT) \tag{14}\label{eq:final-2} \\ &amp;= \frac{1}{C_1} \underbrace{\frac{\Pr(TT \mid C_1) \, \Pr(C_1)}{\Pr(TT)}}_{\text{Part 1}} + \frac{1}{C_2} \underbrace{\frac{\Pr(TT \mid C_2) \, \Pr(C_2)}{\Pr(TT)}}_{\text{Part 2}} \tag{15}\label{eq:final-3} \\ \end{align} <p>Next, let’s compute $\text{Part 1}$ which looks like</p> \begin{align} \frac{\Pr(TT \mid C_1) \, \Pr(C_1)}{\Pr(TT)} &amp;= \frac{\Pr(TT \mid C_1) \, \Pr(C_1)}{\Pr(TT \mid C_1) \, \Pr(C_1) + \Pr(TT \mid C_2) \, \Pr(C_2)} &amp;&amp; \text{expanding }\Pr(TT) \tag{16}\label{eq:part-1-1} \\ &amp;= \frac{\left( \frac{2}{5} \right) \left( \frac{2}{5} \right) \left(\frac{1}{2} \right)}{\left( \frac{2}{5} \right) \left( \frac{2}{5} \right) \left( \frac{1}{2} \right) + \left( \frac{3}{5} \right) \left( \frac{3}{5} \right) \left( \frac{1}{2} \right)} = \frac{4}{13}. \tag{17}\label{eq:part-1-2} \\ \end{align} <p>Similarly, we also calculate $\text{Part 2}$ as follows:</p> \begin{align} \frac{\Pr(TT \mid C_2) \, \Pr(C_2)}{\Pr(TT)} &amp;= \frac{\Pr(TT \mid C_2) \, \Pr(C_2)}{\Pr(TT \mid C_1) \, \Pr(C_1) + \Pr(TT \mid C_2) \, \Pr(C_2)} &amp;&amp; \text{expanding }\Pr(TT) \tag{18}\label{eq:part-2-1} \\ &amp;= \frac{\left( \frac{3}{5} \right) \left( \frac{3}{5} \right) \left( \frac{1}{2} \right) }{ \left( \frac{2}{5} \right) \left( \frac{2}{5} \right) \left( \frac{1}{2} \right) + \left( \frac{3}{5} \right) \left( \frac{3}{5} \right) \left( \frac{1}{2} \right) } = \frac{9}{13}. 
\tag{19}\label{eq:part-2-2} \\ \end{align} <p>Finally, we are able to compute Equation \eqref{eq:problem} as</p> \begin{align} \text{E}(N \mid TT) &amp;= \frac{1}{C_1} \, \text{Part 1} + \frac{1}{C_2} \, \text{Part 2} \tag{20}\label{eq:final-answer}\\ &amp;= \frac{1}{3/5} \, \text{Part 1} + \frac{1}{2/5} \, \text{Part 2} \\ &amp;= \frac{5}{3} \cdot \frac{4}{13} + \frac{5}{2} \cdot \frac{9}{13} &amp;&amp; \text{utilizing Eq. }\eqref{eq:part-1-2}\text{ and Eq. }\eqref{eq:part-2-2} \\ &amp;= \frac{175}{78} \\ &amp;\approx 2.24. \end{align} <p>This means that, after observing two <strong>tails</strong> and averaging over which coin was chosen, we need about $\pmb{2.24}$ more spins <strong>on average</strong> to <em>find a head</em>.</p>This post provides an answer for Exercise 2 from Chapter 2 of Bayesian Data Analysis (BDA) 3rd Edition. Let’s state the problem from the beloved book.Variance and Covariance of Categorical Distribution2020-09-18T00:00:00+00:002020-09-18T00:00:00+00:00http://hbunyamin.github.io/ml-2/Expectation_Variance_and_Covariance_of_Categorical_Distribution<p>This post is inspired by <a href="http://www.cs.columbia.edu/~blei/fogm/2020F/index.html">the lecture given by David Blei</a> on Thursday, 17 September 2020. One of the topics he explained was related to a <strong><em>categorical variable</em></strong> and a <strong><em>categorical distribution</em></strong>. This post will elaborate on these two concepts. Let’s get started.</p> <p><a href="/assets/images/one-hot-encoding.png"><img src="/assets/images/one-hot-encoding.png" alt="img1" class="img-responsive" /></a><em><center>$\pmb{\text{Figure 1}}$: A categorical variable ($\text{Color}$) and its values ($\text{Red}$, $\text{Yellow}$, and $\text{Green}$). Image taken from <a href="https://www.kaggle.com/alexisbcook/categorical-variables">Kaggle</a>, some rights reserved.</center></em></p> <p>$\text{Figure 1}$ shows an example of categorical values stored in a categorical variable, $\text{Color}$. 
Basically, a categorical variable takes one of $K$ values and each categorical value is represented by a $K$-vector with a single $1$ and otherwise $0$s.</p> <blockquote> <p>Let’s denote a categorical value as $x^{(k)}$, which means that its $k$th component is $1$ and all other components are $0$.</p> </blockquote> <p>For example, the categorical variable in $\text{Figure 1}$ has $3$ values ($\text{Red}$, $\text{Yellow}$, and $\text{Green}$) and each value is represented by a $3$-vector with a single $1$ and otherwise $0$s as follows:</p> \begin{align} \text{Red} &amp;= x^{(1)} = (1, 0, 0 ) \\ \text{Yellow} &amp;= x^{(2)} = (0, 1, 0 ) \tag{1}\label{eq:yellow} \\ \text{Green} &amp;= x^{(3)} = (0, 0, 1 ). \end{align} <p>A $K$-vector with a single $1$ and otherwise $0$s is commonly called a <a href="https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/"><strong>one-hot vector</strong></a>.</p> <p>A <strong><em>categorical distribution</em></strong> is parameterized by $\theta$. Moreover, $\pmb{\theta}$ <strong>specifies the probability of each categorical value</strong>. Suppose we have $K$ categorical values; therefore,</p> $\begin{equation} \theta = (\theta_1, \theta_2, \ldots, \theta_K) \tag{2}\label{eq:theta} \end{equation}$ <p>with</p> $\begin{equation} \sum_{k=1}^{K}{\theta_k} = 1 \text{ and } 0 \leq \theta_k \leq 1 \text{ for }k=1, \ldots, K. \tag{3}\label{eq:theta-constraints} \end{equation}$ <p>Consider that $X^{(k)}$ is a random <em>categorical</em> variable which takes one of $K$ values. Moreover, since $X^{(k)}$ is a random variable, it has a <strong><em>categorical distribution</em></strong> that is described by a discrete probability distribution,</p> <p>$$\begin{equation} \text{p}(x^{(k)}) = \prod_{l=1}^{K}{\theta_{l}^{x^{(l)}}} \tag{4}\label{eq:pdf-categorical} \end{equation}$$<br /> where, inside the product, $x^{(l)}$ denotes the $l$th component of $x^{(k)}$. 
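</p> <p>The product in Equation \eqref{eq:pdf-categorical} can be written out directly in code. The following is a minimal sketch, assuming the one-hot vector and $\theta$ are given as plain Python sequences; the function name <code>categorical_pmf</code> and the probabilities in <code>theta</code> are hypothetical and chosen only for illustration.</p>

```python
def categorical_pmf(one_hot, theta):
    """Return prod_l theta_l ** x_l for a one-hot vector x (hypothetical helper)."""
    if len(one_hot) != len(theta):
        raise ValueError("x and theta must have the same length")
    if sorted(one_hot) != [0] * (len(one_hot) - 1) + [1]:
        raise ValueError("x must be a one-hot vector")
    prob = 1.0
    for x_l, theta_l in zip(one_hot, theta):
        # theta_l ** 0 == 1, so only the component with x_l == 1 contributes.
        prob *= theta_l ** x_l
    return prob


theta = (0.2, 0.5, 0.3)  # assumed probabilities for Red, Yellow, Green
yellow = (0, 1, 0)       # the one-hot vector x^(2)
print(categorical_pmf(yellow, theta))  # prints 0.5, i.e. theta_2
```

<p>Every factor with exponent $0$ equals $1$, so the product picks out the single $\theta_l$ whose component is $1$.</p> <p>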
Additionally, we expand Equation \eqref{eq:pdf-categorical} as</p> \begin{align} \text{p}(x^{(k)}) &amp;= \prod_{l=1}^{K}{\theta_{l}^{x^{(l)}}} \\ &amp;= \theta_{1}^{x^{(1)}} \times \theta_{2}^{x^{(2)}} \times \cdots \times \theta_{k}^{x^{(k)}} \times \cdots \times \theta_{K}^{x^{(K)}} \\ &amp;= \theta_{1}^{0} \times \theta_{2}^{0} \times \cdots \times \theta_{k}^{1} \times \cdots \times \theta_{K}^{0} \\ &amp;= \theta_{k}. \tag{5}\label{eq:pdf-categorical-simplified} \end{align} <p>Let’s put Equation \eqref{eq:pdf-categorical-simplified} into practice with an example. Suppose we want to compute $\text{p}(\text{Yellow})$ in Equation \eqref{eq:yellow},</p> \begin{align} \text{p}(\text{Yellow}) &amp;= \text{p}(x^{(2)}) \\ &amp;= \text{p}((0,1,0)) &amp; \Rightarrow \text{1st} = 0, \text{2nd} = 1, \text{3rd} = 0 \\ &amp;= \prod_{l=1}^{3}{\theta_{l}^{x^{(l)}}} \\ &amp;= \theta_{1}^{x^{(1)}} \times \theta_{2}^{x^{(2)}} \times \theta_{3}^{x^{(3)}} \\ &amp;= \theta_{1}^{0} \times \theta_{2}^{1} \times \theta_{3}^{0} \\ &amp;= \theta_{2}. \end{align} <p>With Equation \eqref{eq:pdf-categorical-simplified} in hand, we are now ready to compute the <em>expectation</em> of $X^{(k)}$ as</p> \begin{align} \text{E}(X^{(k)}) &amp;= \sum_{l=1}^{K}{x^{(l)} \text{p}(x^{(l)})} \\ &amp;= \underbrace{0 \times \text{p}(x^{(1)})}_{1\text{st}} + \underbrace{0 \times \text{p}(x^{(2)})}_{2\text{nd}} + \cdots + \underbrace{1 \times \text{p}(x^{(k)})}_{k\text{th}} + \cdots + \underbrace{0 \times \text{p}(x^{(K)})}_{K\text{th}} \\ &amp;= \text{p}(x^{(k)}) \\ &amp;= \theta_k. \tag{6}\label{eq:expectation} \end{align} <p>Next, we compute the variance, $\text{Var}$, as follows:</p> \begin{align} \text{Var}(X^{(k)}) &amp;= \underbrace{\text{E}((X^{(k)})^2)}_{\text{Part I}} - \underbrace{(\text{E}(X^{(k)}))^2}_{\text{Part II}}. 
&amp;&amp; \text{the definition of variance} \tag{7}\label{eq:variance-definition} \end{align} <p>First, we compute $\text{Part I}$, $\text{E}((X^{(k)})^2)$, as follows:</p> \begin{align} \text{E}((X^{(k)})^2) &amp;= \sum_{l=1}^{K}{(x^{(l)})^2 \text{p}(x^{(l)})} \\ &amp;= \underbrace{0^2 \times \text{p}(x^{(1)})}_{1\text{st}} + \underbrace{0^2 \times \text{p}(x^{(2)})}_{2\text{nd}} + \cdots + \underbrace{1^2 \times \text{p}(x^{(k)})}_{k\text{th}} + \cdots + \underbrace{0^2 \times \text{p}(x^{(K)})}_{K\text{th}} \\ &amp;= \text{p}(x^{(k)}) \\ &amp;= \theta_k. \tag{8}\label{eq:expectation-x-square} \end{align} <p>Now, we can finish computing the variance in Equation \eqref{eq:variance-definition},</p> \begin{align} \text{Var}(X^{(k)}) &amp;= \text{E}((X^{(k)})^2) - (\text{E}(X^{(k)}))^2 &amp;&amp; \text{by definition of variance} \\ &amp;= \theta_k - (\theta_k)^2 &amp;&amp; \text{using Equation }\eqref{eq:expectation} \text{ and }\eqref{eq:expectation-x-square} \\ &amp;= \theta_k (1 - \theta_k). &amp;&amp; \text{using distributive property} \tag{9}\label{eq:variance} \end{align} <p>Last but not least, we shall compute the covariance, $\text{Cov}(X^{(j)}, X^{(k)})$, for $j \neq k$. We start from the definition of covariance,</p> \begin{align} \text{Cov}(X^{(j)}, X^{(k)}) &amp;= \underbrace{\text{E}(X^{(j)} X^{(k)})}_{\text{Part I}} - \underbrace{(\text{E}(X^{(j)}) \text{E}(X^{(k)}))}_{\text{Part II}}. &amp;&amp; \text{by definition} \tag{10}\label{eq:covariance} \end{align} <p>Let’s compute $\text{Part I}$. Because a one-hot vector contains only a single $1$, its $j$th and $k$th components are never both $1$, so every term in the sum vanishes:</p> \begin{align} \text{E}(X^{(j)} X^{(k)}) &amp;= (0)(0) \theta_1 + \cdots + \underbrace{(1)(0) \theta_j}_{j\text{th}} + \cdots + \underbrace{(0)(1) \theta_k}_{k\text{th}} + \cdots + (0)(0) \theta_K \\ &amp;= 0. 
\tag{11}\label{eq:covariance-zero} \end{align} <p>Finally, we can complete Equation \eqref{eq:covariance} as</p> \begin{align} \text{Cov}(X^{(j)}, X^{(k)}) &amp;= \text{E}(X^{(j)} X^{(k)}) - (\text{E}(X^{(j)}) \text{E}(X^{(k)})) \\ &amp;= 0 - \theta_j \theta_k &amp;&amp; \text{using Equation } \eqref{eq:expectation} \text{ and }\eqref{eq:covariance-zero} \\ &amp;= - \theta_j \theta_k. \end{align} <p>To conclude, we have shown how to derive the <strong><em>expectation</em></strong>, <strong><em>variance</em></strong>, and <strong><em>covariance</em></strong> of a <em>categorical distribution</em>. We hope this post helps anyone who wants to understand a <em>categorical distribution</em>.</p>This post is inspired by the lecture given by David Blei on Thursday, 17 September 2020. One of the topics he explained was related to a categorical variable and a categorical distribution. This post will elaborate those two concepts. Let’s get started.
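<p>The three results for the categorical distribution (expectation $\theta_k$, variance $\theta_k (1 - \theta_k)$, and covariance $-\theta_j \theta_k$ for $j \neq k$) can be checked numerically. The following is only a minimal sketch in plain Python: the function name <code>categorical_moments</code> and the probabilities $\theta = (0.2, 0.5, 0.3)$ are assumptions chosen for illustration and do not come from the lecture. It enumerates the $K$ one-hot vectors and weights each one by its probability.</p>

```python
def categorical_moments(theta):
    """Exact mean vector and covariance matrix of a one-hot categorical variable.

    `theta` is a list of K probabilities that sum to 1 (illustrative input).
    """
    K = len(theta)
    # Row l is the one-hot vector x^(l), which occurs with probability theta[l].
    outcomes = [[1 if i == l else 0 for i in range(K)] for l in range(K)]
    # E(X^(k)): sum over outcomes of (k-th component) * probability.
    mean = [sum(theta[l] * outcomes[l][k] for l in range(K)) for k in range(K)]
    # E(X^(j) X^(k)): sum over outcomes of (j-th comp.) * (k-th comp.) * probability.
    second = [
        [sum(theta[l] * outcomes[l][j] * outcomes[l][k] for l in range(K))
         for k in range(K)]
        for j in range(K)
    ]
    # Cov(X^(j), X^(k)) = E(X^(j) X^(k)) - E(X^(j)) E(X^(k)); the diagonal is Var.
    cov = [[second[j][k] - mean[j] * mean[k] for k in range(K)] for j in range(K)]
    return mean, cov


# Hypothetical probabilities for Red, Yellow, and Green.
theta = [0.2, 0.5, 0.3]
mean, cov = categorical_moments(theta)
print(mean)
print(cov)
```

<p>Under these assumptions, the diagonal of <code>cov</code> should agree with $\theta_k (1 - \theta_k)$ and the off-diagonal entries with $-\theta_j \theta_k$, up to floating-point rounding.</p>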