This article summarizes the topic “Exponentially Weighted Moving Averages” (Week 2 of Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization) from deeplearning.ai.
There are a few optimization algorithms that are faster than gradient descent. To understand these algorithms, we first need to understand the concept of exponentially weighted moving averages.
Suppose that the temperature on day $i$ is denoted by $\theta_i$ for $i = 1, \dots, 365$ (the number of days in a year). Visually, the temperatures are shown in Figure 1. Furthermore, we show several examples of the temperatures as follows:
$$\theta_1 = 40^\circ F, \quad \theta_2 = 49^\circ F, \quad \theta_3 = 40^\circ F, \quad \dots, \quad \theta_{180} = 60^\circ F, \quad \theta_{181} = 56^\circ F, \quad \dots$$

As we can see in Figure 1, the temperature values over a year are noisy, meaning there is considerable variation in the values. This variation is caused by noise, and we need to remove it if we want to expose the underlying temperature values (Brownlee, 2019).
How do we remove the noise that resides in the values of a time series?

One technique for removing the noise is called smoothing. In particular, a technique commonly used in time series forecasting is the exponentially weighted moving average. Computing exponentially weighted averages involves constructing a new series whose values are averages of the raw observations in the original time series. Let's denote the new series by $v_t$ for $t = 1, 2, \dots, 365$, defined as follows:
$$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t, \tag{1}$$

where $v_t$ is the average at time $t$, $\theta_t$ is the temperature at time $t$, and $\beta$ is a parameter that determines how many days' temperatures are being averaged ($0 < \beta < 1$). Specifically,
$$v_t \approx \text{average over } \frac{1}{1-\beta} \text{ days' temperature}.$$

For example, let's define $\beta = 0.9$, which means that we compute $v_t$ as approximately an average over

$$\frac{1}{1-\beta} = \frac{1}{1-0.9} = \frac{1}{0.1} = 10 \text{ days}.$$

Similarly, $\beta = 0.98$ gives approximately an average over 50 days. Figure 2 shows the plot of these weighted averages.
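To make Equation (1) concrete, here is a minimal Python sketch of the computation. The function name `ewma` and the synthetic temperature series are my own illustrative choices, not from the course:

```python
import numpy as np

def ewma(theta, beta):
    """Exponentially weighted moving average (Equation (1)), starting from v_0 = 0."""
    v = np.empty(len(theta))
    prev = 0.0  # v_0 = 0
    for t, theta_t in enumerate(theta):
        # v_t = beta * v_{t-1} + (1 - beta) * theta_t
        prev = beta * prev + (1 - beta) * theta_t
        v[t] = prev
    return v

# Synthetic daily temperatures for one year (illustrative values, not real data).
rng = np.random.default_rng(0)
days = np.arange(365)
theta = 50 + 15 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 5, size=365)

v = ewma(theta, beta=0.9)  # roughly a 10-day average, per the calculation above
```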
As an extreme example, $\beta = 0.5$ computes approximately an average over 2 days, as depicted in Figure 3. These weighted averages are noisy because each average takes only the previous 2 days' temperatures into account.
From these three values of $\beta$, we can see that as $\beta$ gets larger, the plot becomes smoother; conversely, as $\beta$ gets smaller, the plot becomes noisier. In summary,
$$\begin{aligned}
\beta \text{ large} &\implies \text{smoother, because we are averaging over more days;}\\
\beta \text{ small} &\implies \text{noisier, because we are averaging over fewer days.}
\end{aligned}$$
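Continuing the sketch above (and reusing its hypothetical `ewma` function and synthetic `theta` series), a quick numerical check of this trade-off might look as follows, using the standard deviation of day-to-day changes as a rough measure of noisiness:

```python
for beta in (0.5, 0.9, 0.98):
    v = ewma(theta, beta)
    roughness = np.std(np.diff(v))  # day-to-day variation of the smoothed series
    print(f"beta={beta}: ~{1 / (1 - beta):.0f}-day average, roughness={roughness:.3f}")
# beta=0.5  -> ~2-day average: the roughest (noisiest) curve
# beta=0.98 -> ~50-day average: the smoothest curve
```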
Bias Correction
Previously, we defined Equation (1), which is

$$v_t = \beta v_{t-1} + (1-\beta)\,\theta_t.$$

Let's substitute $v_0 = 0$ and $\beta = 0.98$ into Equation (1).
At $t = 1$,

$$v_1 = \beta v_0 + (1-\beta)\,\theta_1 = (0.98)(0) + (1-0.98)\,\theta_1 = 0.02\,\theta_1.$$
At $t = 2$,

$$\begin{aligned}
v_2 &= \beta v_1 + (1-\beta)\,\theta_2 \\
&= (0.98)(0.02\,\theta_1) + (1-0.98)\,\theta_2 \\
&= 0.0196\,\theta_1 + 0.02\,\theta_2.
\end{aligned}$$

With the assumption that $\theta_1, \theta_2 > 0$, we arrive at
$$v_2 \ll \theta_1 \quad \text{and} \quad v_2 \ll \theta_2,$$

which means that $v_2$ is not a very good estimate of the first two days' temperatures of the year.
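We can verify this arithmetic directly; the snippet below plugs in the example temperatures $\theta_1 = 40$ and $\theta_2 = 49$ from the beginning of the article:

```python
beta = 0.98
theta1, theta2 = 40.0, 49.0  # example temperatures from the start of the article

v0 = 0.0
v1 = beta * v0 + (1 - beta) * theta1  # = 0.02 * theta1 = 0.8
v2 = beta * v1 + (1 - beta) * theta2  # = 0.0196 * theta1 + 0.02 * theta2 = 1.764

print(f"v1 = {v1:.3f}, v2 = {v2:.3f}")  # both far below the actual 40-49 °F range
```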
How can we improve the estimate of the first two days' temperatures?
Introducing bias correction will improve the estimate. With β=0.98,
$$v_t \approx \text{average over } \frac{1}{1-0.98} = 50 \text{ days' temperature};$$

moreover, at $t = 2$, the bias correction factor will be
$$1 - \beta^t = 1 - (0.98)^2 = 0.0396. \tag{5}$$

Therefore, the bias correction for $v_2$ will be
$$\frac{v_2}{1-\beta^2} = \frac{v_2}{0.0396} \quad \text{from Equation (5)}.$$

Since $v_2$ is divided by 0.0396, which acts as a bias correction, the bias in $v_2$ is removed and the estimate improves. In general, the weighted average with bias correction is defined as
$$\frac{v_t}{1-\beta^t}, \quad \text{where } v_t = \beta v_{t-1} + (1-\beta)\,\theta_t,$$

for $t = 1, 2, \dots, 365$.
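As a minimal sketch (again with a made-up function name and assuming $v_0 = 0$), the bias-corrected average could be computed as follows; note how it fixes the first two estimates from the example above:

```python
def ewma_bias_corrected(theta, beta):
    """EWMA with bias correction: divide each v_t by (1 - beta**t)."""
    estimates, prev = [], 0.0
    for t, theta_t in enumerate(theta, start=1):
        prev = beta * prev + (1 - beta) * theta_t  # plain EWMA, Equation (1)
        estimates.append(prev / (1 - beta ** t))   # bias-corrected estimate
    return estimates

print(ewma_bias_corrected([40.0, 49.0], beta=0.98))
# -> [40.0, ~44.55]: sensible estimates for the first two days,
#    versus the uncorrected values 0.8 and 1.764 computed earlier.
```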
As a final note, bias correction helps us find better estimates of a series during the initial phase of learning, as we will see in Part 2 of this blog. As $t$ gets larger, $\beta^t \approx 0$ (since $0 < \beta < 1$). Therefore, if $t$ is large, the bias correction has essentially no effect on the series.
References
Brownlee, J. (2019). Introduction to Time Series Forecasting in Python. https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/. Accessed: 2020-09-05.