
A Foundation of Faster Optimization Algorithms, Part 1

This article summarizes the topic “Exponentially Weighted Moving Averages” (Week 2 of Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization) from deeplearning.ai.

There are a few optimization algorithms which are faster than gradient descent. In order to understand these optimization algorithms, we need to understand the concept of exponentially weighted moving averages.

img1

Figure 1: A plot of temperatures for each day in a year. Image taken from Deeplearning.ai, some rights reserved.

Suppose that the temperature on day i is denoted by θi for i=1,2,…,365. Visually, the temperatures are shown in Figure 1. Furthermore, several examples of the temperatures are as follows:

$$\theta_1 = 40^{\circ}\mathrm{F}, \quad \theta_2 = 49^{\circ}\mathrm{F}, \quad \theta_3 = 40^{\circ}\mathrm{F}, \quad \ldots, \quad \theta_{180} = 60^{\circ}\mathrm{F}, \quad \theta_{181} = 56^{\circ}\mathrm{F}.$$

As we can see in Figure 1, the temperature values over the year are noisy, which means that there is considerable variation in them. This variation is caused by noise, and we need to remove it if we want to expose the underlying temperature signal (Brownlee, 2019).

How do we remove the noise which resides in the values of time series?

One technique for removing the noise is called smoothing. In particular, a smoothing technique commonly used in time series forecasting is exponentially weighted averages. Computing exponentially weighted averages involves constructing a new series whose values are weighted averages of the raw observations in the original time series. Let’s denote the new series by vt for t=1,2,…,365, defined as follows:

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t \tag{1}$$

where vt is the average at time t, θt is the temperature at time t, and β (0<β<1) is a parameter that determines the number of days’ temperatures being averaged. Specifically,

$$v_t \approx \text{average over } \frac{1}{1-\beta} \text{ days' temperature.}$$

For example, let’s define β=0.9 which means that we compute vt as approximately an average over

$$\frac{1}{1-\beta} = \frac{1}{1-0.9} = \frac{1}{0.1} = 10 \text{ days.}$$

Similarly, β=0.98 gives approximately an average over 50 days. Figure 2 shows the plot of these weighted averages.
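The recursion in Equation (1) can be sketched in a few lines of Python. Note that the temperature series below is a hypothetical seasonal curve invented for illustration, not the course’s actual data:

```python
import math

def ewma(thetas, beta):
    """Exponentially weighted moving average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    averages = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages

# Hypothetical seasonal temperatures (not the course's data).
temps = [40 + 20 * math.sin(2 * math.pi * t / 365) for t in range(365)]

v_10day = ewma(temps, beta=0.9)    # roughly a 10-day average
v_50day = ewma(temps, beta=0.98)   # roughly a 50-day average
```

With a larger β, each new observation gets a smaller weight (1−β), so `v_50day` reacts more slowly to day-to-day fluctuations than `v_10day`.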

img1

Figure 2: A plot of vt at β=0.9 (red) and β=0.98 (green). Image taken from Deeplearning.ai, some rights reserved.

As an extreme example, β=0.5 computes approximately an average over 2 days, as depicted in Figure 3. These weighted averages are noisy because each average is computed over only the previous 2 days.

img1

Figure 3: A plot of vt at β=0.5 (yellow). Image taken from Deeplearning.ai, some rights reserved.

From these three values of β, we can see that as β gets larger, the plot becomes smoother; conversely, as β gets smaller, the plot becomes noisier. In conclusion,

$$\beta \uparrow \implies \text{smoother, because we are averaging over more days;}$$
$$\beta \downarrow \implies \text{noisier, because we are averaging over fewer days.}$$

Bias Correction

Previously, we defined Equation (1):

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t.$$

Let’s substitute v0=0 and β=0.98 into Equation (1).
At t=1,

$$\begin{aligned} v_1 &= \beta v_0 + (1-\beta)\theta_1 \\ &= (0.98)(0) + (1-0.98)\theta_1 \\ &= 0.02\,\theta_1. \end{aligned}$$

At t=2,

$$\begin{aligned} v_2 &= \beta v_1 + (1-\beta)\theta_2 \\ &= (0.98)(0.02\,\theta_1) + (1-0.98)\theta_2 \\ &= 0.0196\,\theta_1 + 0.02\,\theta_2. \end{aligned}$$

With the assumption that θ1, θ2>0, we arrive at

$$v_2 \ll \theta_1 \quad \text{and} \quad v_2 \ll \theta_2,$$

which means that v2 is not a very good estimate of the first two days’ temperature of the year.
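We can check these two steps numerically. The values of θ1 and θ2 below reuse the example temperatures from the start of the article:

```python
beta = 0.98
theta_1, theta_2 = 40.0, 49.0   # example temperatures from the start of the article

v_0 = 0.0
v_1 = beta * v_0 + (1 - beta) * theta_1   # = 0.02 * theta_1, about 0.8
v_2 = beta * v_1 + (1 - beta) * theta_2   # = 0.0196 * theta_1 + 0.02 * theta_2, about 1.76

# Both v_1 and v_2 are far below the actual temperatures (40 and 49),
# which is exactly the startup bias introduced by initializing v_0 = 0.
```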

How can we improve the estimate of the first two days’ temperature?

Introducing bias correction will improve the estimate. With β=0.98,

$$v_t \approx \text{average over } \frac{1}{1-0.98} = 50 \text{ days' temperature;}$$

moreover, at t=2, the bias correction factor will be

$$1-\beta^t = 1-(0.98)^2 = 0.0396.$$

Therefore, the bias-corrected v2 will be

$$\frac{v_2}{1-\beta^2} = \frac{v_2}{0.0396}.$$

Since v2 is divided by 0.0396, which acts as a bias correction, the bias in v2 is removed and the estimate improves. In general, the weighted average with bias correction is defined as

$$\frac{v_t}{1-\beta^t}, \quad \text{where } v_t = \beta v_{t-1} + (1-\beta)\theta_t \text{ and } t = 1, 2, \ldots, 365.$$
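The corrected update can be sketched as follows. The sanity check at the bottom uses a constant series, for which the corrected average should recover the constant from day one:

```python
def ewma_corrected(thetas, beta):
    """EWMA with bias correction: returns v_t / (1 - beta**t) for each t."""
    v = 0.0
    corrected = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta        # Equation (1)
        corrected.append(v / (1 - beta ** t))    # bias correction
    return corrected

# Sanity check: on a constant series of 50 degrees, the uncorrected v_t
# starts near 0, but the corrected estimate is 50 from the first day.
estimates = ewma_corrected([50.0] * 5, beta=0.98)
```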

As a final note, bias correction helps us find better estimates of a series during the initial phase of learning, as we will see in Part 2 of this blog. As t grows, $\beta^t \to 0$ and hence $1-\beta^t \to 1$; therefore, for large t the bias correction has virtually no effect on the series.

References

Brownlee, J. (2019). Introduction to Time Series Forecasting in Python. https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/. Accessed: 2020-09-05.


Written on August 22, 2020