# Bayesian Inference Cheatsheet

Note: This cheatsheet is in "beta". Suggestions are welcome

When performing Bayesian Inference, there are numerous ways to solve, or approximate, a posterior distribution. Usually an author of a book or tutorial will choose one, or they will present both but many chapters apart. This notebook solves the same problem each way all in Python. References are provided for each method for further exploration as well.

## Another great reference¶

Chapter 8 of Osvaldo Martin's book contains a more in depth overview of the methods below. I highly recommend reading it as well, particularly if you would like a deeper understanding of each method.

In [1]:
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc3 as pm
import pystan
import arviz as az

WARNING (theano.configdefaults): install mkl with conda install mkl-service: No module named 'mkl'
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.


## The problem¶

What is the proportion of water on the globe, given measurements of random tosses? This question is taken from Richard McElreath's books and his lecture. Professor McElreath explains the problem in his lecture or alternatively I have a short write in my Bayesian Glossary

### Problem definition in Python¶

Usually this is done with numpy arrays, where water and land is encoded with 0 and 1 for success. I'm doing it here in two steps to highlight that the ones and zeros just stand for success and failure

In [2]:
observations = pd.Series(["w", "l", "w", "w", "w","l","w", "l", "w"])

# Convert to binomial representation
observations_binom = (observations == "w").astype(int).values
observations_binom

Out[2]:
array([1, 0, 1, 1, 1, 0, 1, 0, 1])
In [3]:
water_observations = sum(observations_binom)
total_observations = len(observations_binom)

water_observations, total_observations

Out[3]:
(6, 9)

# Inference Methods¶

There are multiple ways of doing inference, even more than the ones here. For each method there is an example, reference links, and short list of pros and cons.

## Conjugate Prior¶

Conjugate priors are the "original" way of solving Bayes rule. Conjugate priors are analytical solutions, meaning that a good (probably great) mathematician was able to create a closed form solution by hand. For our Beta Binominal water tossing problem we can use a formula (taken from Wikipedia) and to get the exact solution.

#### References¶

https://en.wikipedia.org/wiki/Conjugate_prior

In [4]:
# Analytical Formula magic
_alpha = 1 + water_observations
_beta = 1 + total_observations - water_observations

In [5]:
x = np.linspace(0,1,100)
posterior_conjugate = stats.beta.pdf(x, a=_alpha, b=_beta)

fig, ax = plt.subplots()
plt.plot(x,posterior_conjugate)
ax.set_yticks([])
fig.suptitle("Exact Posterior through Conjugate Prior")

Out[5]:
Text(0.5, 0.98, 'Exact Posterior through Conjugate Prior')
In [6]:
posterior_conjugate[40] / posterior_conjugate[70]

Out[6]:
0.2931796007669374

### Conjugate Prior Pros¶

• Posterior distribution is exact. It is not a numerical approximation
• The fastest way to perform Bayesian Inference. At least 4 orders of magnitude faster than other methods
• Speed allows for novel applications, such as online updates for posteriors for real time applications

### Conjugate Prior Cons¶

#### Steps¶

1. Generate a grid of points
2. Calculate likelihood for each point
3. For each beta take another grid of points to get probability
4. Calculate likelihood and posterior

#### References¶

Statistical Rethinking Chapter 3 contains a great overview of Grid Search
https://xcelab.net/rm/statistical-rethinking/

In [7]:
possible_probabilities = np.linspace(0,1,100)
prior = np.repeat(5,100)
likelihood = stats.binom.pmf(water_observations, total_observations, possible_probabilities)

posterior_unstandardized = likelihood * prior
posterior_grid_search = posterior_unstandardized / sum(posterior_unstandardized)

In [8]:
fig, ax = plt.subplots()
plt.plot(possible_probabilities, posterior_grid_search)
ax.set_yticks([])
fig.suptitle("Posterior through Grid Approximation")

Out[8]:
Text(0.5, 0.98, 'Posterior through Grid Approximation')
In [9]:
posterior_grid_search[40] / posterior_grid_search[70]

Out[9]:
0.29317960076693733

### Grid Search Pros¶

• Conceptually simple
• Can theoretically be used to solve any problems

### Grid Search Cons¶

• Due to combinatoric explosion ("the curse of dimensionality"), this solution scales very poorly and can take a very long time to solve for non trivial models. The sun will explode before inference is completed on most models

#### References¶

Bayesian Analysis in Python Chapter 8 https://www.packtpub.com/big-data-and-business-intelligence/bayesian-analysis-python-second-edition

In [10]:
with pm.Model() as normal_aproximation:
p = pm.Uniform('p', 0, 1)
w = pm.Binomial('w', n=total_observations, p=p, observed=water_observations)
mean_q = pm.find_MAP()
std_q = ((1/pm.find_hessian(mean_q, vars=[p]))**0.5)[0]
mean_q['p'], std_q

/home/BayesUser/miniconda3/envs/arviz/lib/python3.6/site-packages/pymc3/tuning/starting.py:61: UserWarning: find_MAP should not be used to initialize the NUTS sampler, simply call pymc3.sample() and it will automatically initialize NUTS in a better way.
warnings.warn('find_MAP should not be used to initialize the NUTS sampler, simply call pymc3.sample() and it will automatically initialize NUTS in a better way.')
logp = -1.8075, ||grad|| = 1.5: 100%|██████████| 7/7 [00:00<00:00, 151.39it/s]

Out[10]:
(array(0.66666667), array([0.15713484]))
In [11]:
possible_probabilities = np.linspace(0,1,100)
posterior_laplace = stats.norm.pdf(possible_probabilities, mean_q['p'], std_q)
fig, ax = plt.subplots()
ax.set_yticks([])
plt.plot(possible_probabilities, posterior_laplace)

Out[11]:
Text(0.5, 0.98, 'Posterior Approximation through Quadratic Approximation')
In [12]:
posterior_laplace[40] / posterior_laplace[70]

Out[12]:
0.255729157826732

### Laplace Approximation Pros¶

• Relatively fast compared to methods such as MCMC or ADVI

### Laplace Approximation Cons¶

• Normal distributions can't approximate everything, for example bimodal or discrete distributions

## Markov Chain Monte Carlo¶

Markov Chain Monte Carlo (MCMC) is a way to numerically approximate a posterior distribution by iteratively sampling from it. I found the visualizations in the link below make it easier to see what this means. The advances in MCMC are coming from advances in samplers. The No U Turn Sampler (NUTS) for example is one of the most widely popular at the moment, since it can "find" good samples more often than older samples such as Metropolis Hastings.

There are many Bayesian libraries in addition to the two shown here.

### "Hand built" Metropolis Hastings¶

This example is adapated from Thomas Wiecki's blog post

In [13]:
samples = 5000
proposal_width = .1
p_accept = .5

# Assume Normal Prior for this example
p_water_current = .5
p_water_prior_normal_mu = .5
p_water_prior_normal_sd = 1

mh_posterior = []
for i in range(samples):
# Proposal Normal distribution
p_water_proposal = stats.norm(p_water_current, proposal_width).rvs()

# Calculate likelihood of each
likelihood_current = stats.binom.pmf(water_observations, total_observations, p_water_current)
likelihood_proposal = stats.binom.pmf(water_observations, total_observations, p_water_proposal)

# Convert likelihoods that evaluate to nans to zero
likelihood_current = np.nan_to_num(likelihood_current)
likelihood_proposal = np.nan_to_num(likelihood_proposal)

# Calculate prior value given proposal and current
prior_current = stats.norm(p_water_prior_normal_mu, p_water_prior_normal_sd).pdf(p_water_current)
prior_proposal = stats.norm(p_water_prior_normal_mu, p_water_prior_normal_sd).pdf(p_water_proposal)

# Calculate posterior probability of current and proposals
p_current = likelihood_current * prior_current
p_proposal = likelihood_proposal * prior_proposal

p_accept = p_proposal / p_current

if np.random.rand() < p_accept:
p_water_current = p_water_proposal

assert p_water_current > 0

mh_posterior.append(p_water_current)

In [14]:
mh_posterior= np.asarray(mh_posterior)
az.plot_posterior(mh_posterior)

Out[14]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fac4c4cf5f8>],
dtype=object)
In [15]:
np.asarray(mh_posterior)

Out[15]:
array([0.6391594 , 0.53115708, 0.55828625, ..., 0.54408089, 0.57741335,
0.5757483 ])

### Estimate local probability¶

Because MCMC provides samples and not an exact distribution, we have to take a ratio of counts, not a ratio of areas as before

In [16]:
def estimate_prob_ratio_from_samples(mcmc_posterior, prob_1=possible_probabilities[40],
prob_2=possible_probabilities[70], width=.01):

# We need to compare the same interval in the distribution as we did above
# In the above case x[20] is .01 wide centered at
num_samples_1 = np.sum(np.isclose(prob_1, mcmc_posterior, rtol=0, atol=width))
num_samples_2 = np.sum(np.isclose(prob_2, mcmc_posterior, rtol=0, atol=width))
return num_samples_1 / num_samples_2

estimate_prob_ratio_from_samples(mh_posterior)

Out[16]:
0.28214285714285714

## PyMC3 MCMC Hamiltonian Monte Carlo (Specifically No U Turn Sampler)¶

Probabilistic Programming languages simplify Bayesian modeling tremendously but letting letting the statistician focus on the model, and less so on the sampler or other implementation details. From a s

In [17]:
with pm.Model() as mcmc_nuts:
p_water = pm.Uniform("p", 0 ,1)
w = pm.Binomial("w", p=p_water, n=total_observations, observed=water_observations)
trace = pm.sample(50000, chains=2)

az.plot_posterior(trace)

Auto-assigning NUTS sampler...
INFO:pymc3:Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 4 jobs)
INFO:pymc3:Multiprocess sampling (2 chains in 4 jobs)
NUTS: [p]
INFO:pymc3:NUTS: [p]
Sampling 2 chains: 100%|██████████| 101000/101000 [00:16<00:00, 6234.09draws/s]
/home/BayesUser/repos/arviz/arviz/data/io_pymc3.py:56: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
chain_likelihoods.append(np.stack(log_like))

Out[17]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fac48a8b6d8>],
dtype=object)
In [18]:
pymc3_samples = trace.get_values("p")
estimate_prob_ratio_from_samples(pymc3_samples)

Out[18]:
0.3025787965616046

## Stan MCMC Hamiltonian Monte Carlo (Specifically No U Turn Sampler)¶

We can fit the same model in pystan as well, in this case pystan, which is a python interface to Stan.

In [19]:
water_code = """
data {
int<lower=0> water_observations; // number of water observations
int<lower=0> number_of_tosses; // number of globe tosses
}

parameters {
real p_water; // Proportion of water on globe
}

model {
p_water~uniform(0.0,1.0);
water_observations ~ binomial(number_of_tosses, p_water);
}
"""
water_dat = {'water_observations': 6,
'number_of_tosses': 9}

sm = pystan.StanModel(model_code=water_code)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_dedfcbf67db9d2d5064f58308236c51b NOW.

In [20]:
fit = sm.sampling(data=water_dat, iter=50000, chains=4)

WARNING:pystan:2776 of 100000 iterations ended with a divergence (2.78 %).
WARNING:pystan:Try running with adapt_delta larger than 0.8 to remove the divergences.

In [21]:
stan_samples = fit["p_water"]
estimate_prob_ratio_from_samples(stan_samples)

Out[21]:
0.29606625258799174

### Markov Monte Carlo Chain Pros¶

• MCMC seems to be most popular method at time of writing
• Popular libraries such as Stan, PyMC3, emcee, Pyro, use MCMC as main inference engine

## PyMC3 Variational Inference (Specifically Automatic Differentiation Variational Inference)¶

In short Variational Inference iteratively transforms a model into an unconstrained space, then tries to optimize the Kullback-Leibler divergence.

#### References¶

A great explanation of the exact steps can be found at the bottom of page 3 in the paper below.
http://www.jmlr.org/papers/volume18/16-107/16-107.pdf

In [22]:
with pm.Model() as advi:
p_water = pm.Uniform("p", 0 ,1)
w = pm.Binomial("w", p=p_water, n=total_observations, observed=water_observations)


Average Loss = 2.3092: 100%|██████████| 50000/50000 [00:10<00:00, 4553.81it/s]
Finished [100%]: Average Loss = 2.3102
INFO:pymc3.variational.inference:Finished [100%]: Average Loss = 2.3102

Out[22]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fac3dfb98d0>],
dtype=object)
In [23]:
estimate_prob_ratio_from_samples(advi_samples)

Out[23]:
0.358148893360161

### Variational Inference Pros¶

• No sampling required (unlike MCMC)
• Able to handle complex models
• Scales to large amounts of data better than MCMC
• Faster than MCMC

### Variational Inference Cons¶

• Not guaranteed to reach "optimal" KL Divergence
• The true posterior may not be well approximated by any distribution in the chosen class

# Other Inference Methods¶

There are other Bayesian methods that I frankly don't know enough about to write a full treatment here. I will list the ones I do know about however.

## Approximate Bayesian Computing¶

Agustina Arroyuelo has a great write up for Google Summer of Code 2018.
https://agustinaarroyuelo.github.io/jekyll/update/2018/05/29/PyMC3's-Journal-Club.html