Computational Trick for Harmonic Mean

A Helpful Step in Computing the Bayes Factor

Author

Robert Granger

In this article, I derive a computational trick for calculating the harmonic mean when the values \(x_1, x_2, ..., x_n\) are so small that computing them directly results in an underflow error. The method is a variation of the ubiquitous log-sum-exp technique. I then demonstrate its usefulness by applying it to the calculation of the Bayes Factor. Throughout this article, code snippets are provided in R.

The Harmonic Mean

In the early years of schooling, and certainly in any introductory statistics course, students are taught a handful of measures of central tendency, i.e., ways of computing an average. The most common are the mean, median, and mode. Occasionally other summaries, such as the mid-range or the interquartile range, make an appearance, but the three above are by far the most popular. Unknown to many, the mean that is taught (specifically, the arithmetic mean) is just one of the three Pythagorean means studied by the early Greek mathematicians: the arithmetic mean, the geometric mean, and the harmonic mean.

Arithmetic Mean:

\[ AM(x_1, x_2, ..., x_n) = \frac{x_1 + x_2 + ... + x_n}{n} \]

Geometric Mean:

\[ GM(x_1, x_2, ..., x_n) = \sqrt[n]{x_1 \times x_2 \times ... \times x_n} \]

Harmonic Mean:

\[ HM(x_1, x_2, ..., x_n) = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + ... + \frac{1}{x_n}} \]
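As a quick illustration with an arbitrary example vector, all three means are easy to compute side by side in R; for positive data they always satisfy \(AM \ge GM \ge HM\), with equality only when all values are equal:

### All three Pythagorean means on an arbitrary example vector
x = c(1, 2, 4, 8)
am = mean(x)              # arithmetic mean
gm = exp(mean(log(x)))    # geometric mean, computed via logs
hm = length(x)/sum(1/x)   # harmonic mean
print(c(AM = am, GM = gm, HM = hm))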

This article focuses on the calculation of the harmonic mean. Calculating the harmonic mean is easy! Suppose we have a rather small dataset, \(\boldsymbol{X}\), with 10 values:

4 3 1 6 4
2 5 9 3 1
X = c(4, 3, 1, 6, 4,
      2, 5, 9, 3, 1)
      
### Harmonic mean: n divided by the sum of reciprocals
HM <- function(data){
  length(data)/sum(1/data)
}

print(paste("Harmonic Mean is equal to", HM(X)))
[1] "Harmonic Mean is equal to 2.41286863270777"

As we can see, the computation is fairly straightforward, so why are we in need of a trick?

The Computational Trick

Suppose instead we had a different dataset, \(\boldsymbol{Y}\), whose values are so small they can only be tracked through their natural logs. This may seem like an odd situation, but it is fairly common in statistics, as the data may be a list of joint probabilities. Let \(\log(\boldsymbol{Y})\) be equal to:

-1000 -1001 -999 -1001 -1008
-1006 -1000 -1000 -998 -1003
logY = c(-1000, -1001, -999, -1001, -1008,
         -1006, -1000, -1000, -998, -1003)

Y = exp(logY)

print(Y)
 [1] 0 0 0 0 0 0 0 0 0 0

Notice that the computer returns “0” for every value of \(\boldsymbol{Y}\) in our dataset. This is wrong, though, as the exponential of any finite number is strictly positive. If I attempt to get \(\log(\boldsymbol{Y})\) back by taking the log,

print(log(Y))
 [1] -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf -Inf

it now returns negative infinity for every entry instead of our original values. The information in \(\log(\boldsymbol{Y})\) is lost because we have an underflow error: the values of \(\boldsymbol{Y}\) are so close to 0 that the computer cannot distinguish them from 0.
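To get a feel for where this breaking point sits, note that R stores these values as double-precision floating-point numbers; the smallest positive normal double is reported by .Machine$double.xmin (roughly 2.2e-308 on a typical IEEE 754 machine, i.e., about \(e^{-708}\)), so anything whose natural log is much below roughly -745 is rounded all the way down to 0:

### Where double-precision underflow kicks in (values are typical for
### IEEE 754 doubles and may vary slightly by machine)
print(.Machine$double.xmin)   # smallest positive "normal" double, ~2.2e-308
print(exp(-700))              # still representable
print(exp(-1000))             # underflows to exactly 0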

Now, it should be noted that the harmonic mean of \(\boldsymbol{Y}\) is itself so small that it is essentially 0. However, just as we tracked our original data on the natural log scale, it would be advantageous to track the harmonic mean on the natural log scale as well. This is of course impossible with our current function for the harmonic mean,

print(paste("Harmonic Mean is equal to", HM(exp(logY))))
[1] "Harmonic Mean is equal to 0"

as we must first exponentiate the natural log values before summing their reciprocals, and taking the log of the result then returns negative infinity for the natural log of the harmonic mean. So… how do we get around this? We begin by writing down the equation for the harmonic mean in terms of the \(\log(y_i)\):

\[\begin{align} HM(\boldsymbol{Y}) & = \frac{n}{\frac{1}{\exp(\log(y_1))} + \frac{1}{\exp(\log(y_2))} + ... + \frac{1}{\exp(\log(y_n))}} \\ & = \frac{n}{\exp(-\log(y_1)) + \exp(-\log(y_2)) + ... + \exp(-\log(y_n))} \\ & = \left(\frac{n}{\exp(c-\log(y_1)) + \exp(c-\log(y_2)) + ... + \exp(c-\log(y_n))}\right)\exp(c) \\ \end{align}\]

Now, let’s take the natural log of both sides:

\[\begin{align} \log\Big(HM(\boldsymbol{Y})\Big) & = \log(n) + c - \log\Big(\exp(c-\log(y_1)) + \exp(c-\log(y_2)) + ... + \exp(c-\log(y_n))\Big) \\ \end{align}\]

The equation above no longer requires exponentiating \(\log(y_i)\) directly, but rather \(c-\log(y_i)\), where \(c\) is an arbitrary constant. We want to select \(c\) so that the exponentials stay within the range the computer can represent; a natural choice here is \(c=\max(\log(\boldsymbol{Y}))\), which keeps every exponent non-negative. I code up the function below and first check it against our \(\boldsymbol{X}\) data:

### Computes the log of the harmonic mean from the log of the data
logHM <- function(logdata){
  maxlogdata = max(logdata)
  N = length(logdata)
  log(N) + maxlogdata - log(sum(exp(maxlogdata-logdata)))
}

### Original Method without computational trick
print(paste("Log Harmonic Mean is equal to", log(HM(X))))
[1] "Log Harmonic Mean is equal to 0.880816343680495"
### New Method using computational trick
print(paste("Log Harmonic Mean is equal to", logHM(log(X))))
[1] "Log Harmonic Mean is equal to 0.880816343680495"

As expected, the two methods return the exact same value. Now, let’s run it on the \(\boldsymbol{Y}\) dataset:

### Original Method without computational trick
print(paste("Log Harmonic Mean is equal to", log(HM(Y))))
[1] "Log Harmonic Mean is equal to -Inf"
### New Method using computational trick
print(paste("Log Harmonic Mean is equal to", logHM(logY)))
[1] "Log Harmonic Mean is equal to -1005.83288259162"

While the simple, direct computation failed with a return of “-Inf”, the new method using the computational trick was able to successfully compute the log harmonic mean.
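As an aside, the body of logHM is just the familiar log-sum-exp trick applied to the negated log values, i.e., \(\log(HM) = \log(n) - \text{logSumExp}(-\log(\boldsymbol{Y}))\). If the matrixStats package happens to be installed, this gives an independent cross-check of our function (shown purely as a sanity check, not as part of the method above):

### Cross-check of logHM via matrixStats::logSumExp (requires matrixStats)
logHM2 <- function(logdata){
  log(length(logdata)) - matrixStats::logSumExp(-logdata)
}
### logHM2(logY) should agree with logHM(logY) up to floating-point error

A stable log-sum-exp implementation shifts by the maximum of its argument, which here corresponds to choosing \(c=\min(\log(\boldsymbol{Y}))\); mathematically, any choice of \(c\) gives the same answer.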

Extending the Trick to Compute the Bayes Factor

The goal of almost every Bayesian statistics problem is to determine the posterior distribution of some model parameters, \(\theta\),

\[ p(\theta|data) = \frac{p(data|\theta)p(\theta)}{p(data)}. \]

For example, if you have some data that you believe come from a normal distribution, you may want to learn its mean and variance, so \(\theta = [\mu,\sigma^2]\). Essentially, we start with some prior information on the parameters, \(p(\theta)\), update that information with our data likelihood, \(p(data|\theta)\), and obtain our posterior estimate, \(p(\theta|data)\). The denominator of our equation, \(p(data)\), is called the marginal data likelihood and is often avoided in calculations by taking advantage of conjugacy, or by normalization when \(\theta\) is discrete.
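As a toy illustration of the "normalization when \(\theta\) is discrete" remark, suppose \(\theta\) could take only two values with equal prior probability; the likelihood numbers below are made up purely for the example, and the marginal data likelihood is just the sum of the numerator over both values:

### Toy discrete example with made-up numbers: two candidate values of theta
prior      = c(0.5, 0.5)          # p(theta)
likelihood = c(0.20, 0.05)        # p(data | theta), hypothetical values
numerator  = likelihood * prior   # p(data | theta) * p(theta)
pdata      = sum(numerator)       # marginal data likelihood p(data)
posterior  = numerator / pdata    # p(theta | data), sums to 1
print(posterior)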

Unfortunately, the posterior distribution depends on the modeling choices we make, and we often want to compare competing models. A common approach to model selection in Bayesian statistics is the calculation of the Bayes Factor. The Bayes Factor is simply the ratio of the probabilities of obtaining the observed data under two models, \(M_1\) and \(M_2\):

\[ BF(M_1,M_2) = \frac{p(data|M_1)}{p(data|M_2)} \]

Notice that the Bayes Factor is the ratio of the marginal data likelihoods of the two models. The reason this term is often avoided is due to the challenge of its calculation, as it is the probability of obtaining the data integrated over all parameters:

\[ p(data|M_j) = \int p(data|\theta,M_j)p(\theta|M_j)d\theta \]

Nevertheless, Newton and Raftery (1994) show that we can approximate the marginal data likelihood through the harmonic mean of the likelihood evaluated at samples \(\theta_i\) drawn from the posterior distribution \(p(\theta|data,M_j)\):

\[ \hat{p}(data|M_j) = \frac{n}{\sum_{i=1}^n \frac{1}{p(data|\theta_i,M_j)}} \]
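The reason this works is a short identity worth spelling out: the posterior expectation of the reciprocal likelihood equals the reciprocal of the marginal data likelihood, so the arithmetic mean of the reciprocal likelihoods over posterior draws estimates \(1/p(data|M_j)\), and inverting that average gives exactly the harmonic mean estimator above:

\[\begin{align} E_{\theta|data,M_j}\left[\frac{1}{p(data|\theta,M_j)}\right] & = \int \frac{1}{p(data|\theta,M_j)} \, p(\theta|data,M_j) \, d\theta \\ & = \int \frac{1}{p(data|\theta,M_j)} \cdot \frac{p(data|\theta,M_j)\,p(\theta|M_j)}{p(data|M_j)} \, d\theta \\ & = \frac{1}{p(data|M_j)} \int p(\theta|M_j) \, d\theta = \frac{1}{p(data|M_j)} \\ \end{align}\]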

We can therefore use our computational trick for the harmonic mean to obtain the log of the marginal data likelihood of both models, \(M_1\) and \(M_2\). We can then compute the log of the Bayes Factor,

\[ \log\Big(BF(M_1,M_2)\Big) = \log\Big(p(data|M_1)\Big) - \log\Big(p(data|M_2)\Big), \]

and then just take the exponential when we’re all done to get the Bayes Factor.

### Bayes Factor via the log harmonic mean of likelihood samples
BF12 <- function(logdata1,logdata2){
  maxlogdata1 = max(logdata1); maxlogdata2 = max(logdata2)
  N1 = length(logdata1); N2 = length(logdata2)
  ### log marginal data likelihood estimate for each model
  logpdata1 = log(N1) + maxlogdata1 - log(sum(exp(maxlogdata1-logdata1)))
  logpdata2 = log(N2) + maxlogdata2 - log(sum(exp(maxlogdata2-logdata2)))
  exp(logpdata1 - logpdata2)
}

And let’s just generate some data to test this function.

X1 = c(4, 3, 1, 6, 4,
       2, 5, 9, 3, 1)
X2 = c(3, 3, 2, 2, 2,
       1, 5, 1, 7, 7)
### No Computational Trick
HM(X1)/HM(X2)
[1] 1.122558
### Computational Trick
BF12(log(X1),log(X2))
[1] 1.122558

And now let’s generate some data with values so small that they must be tracked on the natural log scale:

logY1 = c(-1000, -1000, -1001, -1003, -999,
          -1010, -1002, -1004, -1003, -998)
logY2 = c(-1001, -1009, -1007, -1003, -997,
          -1010, -1002, -1002, -1002, -999)
### No Computational Trick
HM(exp(logY1))/HM(exp(logY2))
[1] NaN
### Computational Trick
BF12(logY1,logY2)
[1] 1.41284

Once again, we see the importance of this computational trick as the naive, direct computation fails.

Example: Computing the Bayes Factor with a Normal Distribution

In this section, let’s look at an actual example where we generate samples from our posterior using a Monte Carlo simulation. Suppose you generate 500 observations from a normal distribution with mean, \(\mu=10\), and variance, \(\sigma^2 = 3\).

m = 500
MU = 10
SIGMA2 = 3
X = rnorm(m, MU, sqrt(SIGMA2))

Let’s take a look at this data:

hist(X)

You show this histogram to two of your friends, Daniel and Brandon. We are interested in developing a model where we estimate the mean, \(\mu\), but keep the variance, \(\sigma^2\), fixed. Daniel (\(M_1\)) looks at the histogram and proposes that the variance should be fixed at \(\sigma^2=3.0\). Brandon (\(M_2\)), on the other hand, looks at the histogram and thinks the variance is \(\sigma^2=2.8\). Which of your friends has proposed the better model?

Since the variance, \(\sigma^2\), is fixed, we can simply write our posterior as:

\[ p(\mu|X,M_j) = \frac{p(X|\mu,M_j)p(\mu|M_j)}{p(X|M_j)} \]

To ensure an easy-to-compute posterior, we use a standard normal prior, \(p(\mu)=Normal(0,1)\). This results in a conjugate normal posterior distribution,

\[ p(\mu|X,M_j) = Normal\left(\frac{m\bar{X}}{\sigma^2+m},\frac{\sigma^2}{\sigma^2+m}\right) \]

where \(m\) is the number of observations and \(\bar{X}\) is their mean. (See pg. 41 of Gelman et al. 2014 for this solution.)
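For completeness, here is a sketch of where this posterior comes from: the standard normal-normal conjugate update with known variance and the \(Normal(0,1)\) prior, obtained by completing the square in \(\mu\),

\[\begin{align} p(\mu|X,M_j) & \propto p(X|\mu,M_j)\,p(\mu|M_j) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^m (x_i-\mu)^2\right)\exp\left(-\frac{\mu^2}{2}\right) \\ & \propto \exp\left(-\frac{1}{2}\left(\frac{m}{\sigma^2}+1\right)\left(\mu - \frac{m\bar{X}/\sigma^2}{m/\sigma^2+1}\right)^2\right), \\ \end{align}\]

which is the kernel of a normal distribution with mean \(\frac{m\bar{X}}{\sigma^2+m}\) and variance \(\frac{\sigma^2}{\sigma^2+m}\), matching the formula above.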

Now that we know the posterior distribution, we can draw samples from it directly.

samples = 10000

### Model 1 (Daniel)
SIGMA2M1 = 3.0
posteriorMUM1 = rnorm(samples, m*mean(X)/(SIGMA2M1+m),sqrt(SIGMA2M1/(SIGMA2M1+m)))


### Model 2 (Brandon)
SIGMA2M2 = 2.8
posteriorMUM2 = rnorm(samples, m*mean(X)/(SIGMA2M2+m),sqrt(SIGMA2M2/(SIGMA2M2+m)))

After obtaining 10,000 samples from the posterior, the data likelihood at each sample can be computed via \(p(data|\theta) = p(X|\mu) = \prod_{i=1}^m Normal(x_i|\mu,\sigma^2)\).

mlikeM1 = rep(NA,samples)
for(samp in 1:samples){
  MUsamp = posteriorMUM1[samp]
  mlikeM1[samp] = prod(dnorm(X,MUsamp,sqrt(SIGMA2M1)))
}

mlikeM2 = rep(NA,samples)
for(samp in 1:samples){
  MUsamp = posteriorMUM2[samp]
  mlikeM2[samp] = prod(dnorm(X,MUsamp,sqrt(SIGMA2M2)))
}

Then compute the Bayes Factor by taking the ratio of harmonic means of the data likelihoods:

HM(mlikeM1)/HM(mlikeM2)
[1] NaN

Oh no! It returned NaN! Something has gone wrong. Let’s look at the first few likelihood values from each model:

head(mlikeM1); head(mlikeM2)
[1] 0 0 0 0 0 0
[1] 0 0 0 0 0 0

All of the values are being reported as 0, but this is not possible as we are multiplying a bunch of terms that are all strictly greater than 0 (albeit very small). It looks like we’re experiencing an underflow issue. Let’s instead compute the log likelihoods and use our computational trick.

mloglikeM1 = rep(NA,samples)
for(samp in 1:samples){
  MUsamp = posteriorMUM1[samp]
  mloglikeM1[samp] = sum(dnorm(X,MUsamp,sqrt(SIGMA2M1),log=TRUE))
}

mloglikeM2 = rep(NA,samples)
for(samp in 1:samples){
  MUsamp = posteriorMUM2[samp]
  mloglikeM2[samp] = sum(dnorm(X,MUsamp,sqrt(SIGMA2M2),log=TRUE))
}
BF12(mloglikeM1,mloglikeM2)
[1] 4.323865

We have successfully avoided the underflow issue and computed the Bayes Factor. Since we created the data, we know the better model is Daniel’s, as he proposed the true variance, \(\sigma^2=3\). Even if we hadn’t created the data, we could have reached the same conclusion, since the Bayes Factor is greater than 1, indicating that \(M_1\) is the better fit. Without this computational trick, we would not have been able to compute the Bayes Factor and reach this conclusion.

I make one final note to users of this method. While we demonstrated the usefulness of this computational trick, we did not demonstrate the quality of the harmonic mean estimator of the marginal likelihood itself. It is a very simple technique; however, it tends to be highly unstable and requires a very large number of posterior samples. Nevertheless, I have found that for answering the binary question of which model is better it does quite well, even though the value it returns is typically considerably off from the true Bayes Factor.
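A quick way to get a feel for this instability is to reuse the log-likelihood vector for Daniel’s model from above, split the 10,000 posterior draws into batches (the batch size of 1,000 is arbitrary), and compare the batch-wise estimates of the log marginal data likelihood; the spread across batches gives a rough sense of the estimator’s Monte Carlo error:

### Rough stability check: estimate log p(data | M1) on separate batches
### of posterior draws and look at the spread (batch size is arbitrary)
batches = split(mloglikeM1, rep(1:10, each = 1000))
batchEstimates = sapply(batches, logHM)
print(range(batchEstimates))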

References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014). Bayesian Data Analysis. Chapman & Hall/CRC, third edition.

Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian Inference with the Weighted Likelihood Bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3-26.