Four Tips You Should Know When Writing A Stan Programme
July 3, 2023
Estimating parameters of a model is always a thrilling moment, especially when that is your own model! One popular procedure for realizing this goal is the Markov chain Monte Carlo (MCMC) method. It has a long tradition of success in constructing desired posterior distributions and facilitating solutions. For MCMCers, Stan offers a really fast sampler and an easy-to-learn modeling language. With its spectacular features, it has amassed a considerable user base. Although most of our interactions with Stan are joyful, there are still potential holes that might trouble us for several weeks or even months. In this article, I will share four tips to help you avoid pitfalls and save your valuable time!
Generally, the essential purpose of the this article is to provide you with the optimal workflow I found for writing Stan after lessons paid for with nearly despair and countless hours. I will introduce you two must be done things *before* officially starting to write your .stan file. Next, I will highlight a point that commonly being ignored by many people *during* the estimation process and equip you with knowledge to diagnose possible problems. Following, I will give you a strategy to speed up your codes.
With the 4S above, you would be further efficient for obtaining the target distribution. Admittedly, it is impossible to include all information in a short article, so I enclosed a list of awesome Resources at the end. Now let's start!
During the process of programming, the primary concern should be whether the results generated by the program are the desired parameters. A powerful method to verify whether a program can obtain correct and valid parameters is through a simulation study. Therefore, it is a great start to generate some data using your model.
With simulation, you can (1) have in-depth communication with the model you wish to estimate. When generating simulated data, you can clearly understand how each parameter works in your model; (2) ensure that the foundation of the Stan program is solid. By analyzing whether the simulated variables have the features that you assume, you can verify whether you have correctly set the prior distribution and likelihood function; (3) get well-prepared for the validation procedure. After writing an MCMC program, you can easily utilize true values set in the simulation to validate your algorithm.
Here is an example. Let's say I want to write an MCMC program for the deterministic inputs, noisy “and” gate (DINA) model. (This is merely an illustrative example and unfamiliarity with this model will not impede your understanding.) The formula of DINA is as followed, with $0 < g_{j} < 1-s_{j} < 1$ :
$$Pr(Y_{ij}=1)=(1-s_{j})^{\eta_{ij}}g_{j}^{1 - \eta_{ij}}$$
where $i$ represents the index of examinees with $i = (1, ..., N)$ and $j$ represents the index of items with $j = (1, ..., J)$. $Y_{ij}$ stands for the response of examinee $i$ on item $j$ with 1 = correct and 0 = incorrect. $ \boldsymbol{s} = (s_1, s_2, ..., s_j, ..., s_J)$, $ \boldsymbol{g} = (g_1, g_2, ..., g_j, ..., g_J)$ and $ \boldsymbol{\eta} = (0, 1)$. It would be excellent to generate some $Y_{ij}$ before writing a Stan programme.
# set variables ---------------------------
# (Please note that for the purpose of clarity, I have simplified the actual data generation process.)
J <- 10
N <- 1000
g <- rep(x = 0.2, times = J)
s <- rep(x = 0.2, times = J)
eta_tem <- mvtnorm::rmvnorm(n = N, mean = rep(0, J), sigma = diag(J))
eta <- 1 * (eta_tem > 0)
# define function -------------------------
response <- function(eta, slip, guess){
# To generate examinees' response for one item
nper <- length(eta)
## generate response for one item
presp <- sapply(1:nper, function(i) {
(1 - slip)^eta[i] * guess^(1 - eta[i])
})
re <- ifelse(test = presp >= runif(n = nper, min = 0, max = 1),
yes = 1, no = 0)
return(re)
}
# data generation -------------------------
respon <- matrix(data = NA, nrow = N, ncol = J)
for (j in 1:J){
respon[, j] <- response(eta = eta[, j], slip = s[j], guess = g[j])
}
head(respon)
Ideally, $Y_{ij}$ should be a random variable taking on values of 0 or 1. If the generated data conforms to this feature, congratulations!! Additionally, please save this simulation program as it not only generates data but also provides true values for each parameter. These will be used to validate your Stan program.
Another way to guarantee we can achieve targeted estimates is to carefully examine every parameter you set or every output after your calculation. Confirming that those variables are indeed what you intend them to be. Since in a Stan file we cannot directly view the variables, the print statement is extremely helpful here.
To demonstrate its usage, I presented here a nothing_but_debug file containing a Stan file that solely includes variables of interest, as well as an R script used for running Stan.
In the Stan file, you can just include very simple and few elements in the script as our main purpose is to check the values of some specific variables. In our example, we focus on the $s$ and $g$ parameters.
data{
int<lower = 1> J;
}
parameters{
real<lower = 0, upper = 1> slip[J];
real<lower = 0, upper = 1> guess[J];
}
model{
slip ~ beta(5, 10); # prior of your parameters
guess ~ beta(5, 10);
print("slip = ", slip); # cannot be used in the data and parameter blocks
print("guess = ", guess);
}
Next, we can execute the .stan file in R. One chain with a very short length is favored, which is fast and can print the results directly to the console.
J <- 10
data_input <- list(J = J)
# run stan -----------------------------
library(rstan)
rstan_options(auto_write = TRUE)
estMCMC <- stan_model(file= 'model.stan')
set.seed(13137)
stan_inits <- list(list(guess = rep(0.1, times = J),
slip = rep(0.1, times = J)))
# MCMC
fitMCMC <- sampling(estMCMC,
data = data_input,
init = stan_inits,
iter = 10,
chains = 1)
As mentioned eariler, our parameters should satisfy: $0 < g_{j} < 1-s_{j} < 1$. Clearly, the parameters do not meet the requirements as necessary constraints have not been imposed upon them.
After highlighting the role of simulated data and the print statement, I would like to direct your attention to model convergence. Many individuals are eager to retrieve the posterior mean and perform further analysis once finally get to the end of the sampling process. However, if the model has not converged, the posterior mean is highly unreliable, and results based on it are not warranted. Checking the convergence carefully before moving on can prevent you from wasting several months on untrustworthy results.
Here are two points you don't want to miss:
- Has the posterior parameter space been sufficiently explored? This question can be answered with
the trace plot and the Gelman–Rubin $\hat{R}$. For the trace plot, sometimes we are encouraged to use one chain to diagnose
problematic parameters, but it is also necessary for us to run more than three chains to detect potential nonconvergence.
There is a chance that two or even three chains just happened to mix well with each other, but the fourth chain ended up
in a different region which indicates a nonconvergence. For $\hat{R}$, we aim for all parameters with values less than 1.1.
This rule should be satisfied not only for the important variables that we are interested in, but also for less significant parameters,
as the exploration of one parameter space can make a great difference to the other.
You can draw a trace plot by using
mcmc_trace()
in the bayesplot package, and applyingsummary()
to your stan object to extract the Gelman–Rubin $\hat{R}$. - Have we collected enough samples from the targeted distribution? The effective sample size of estimates
is a powerful indicator. The estimates would be ready for advanced analysis if the effective sample size for each of them could
account for more than 50% total sample size in each chain. This requirement might be a little bit strong, but you would never expect
the effective sample size to be less than 100, especially when the chain length is more than 5,000.
You can view the effective sample size with the
summary()
function.
There are various sources that may lead to convergence problems. Two common causes are identifiability and prior distribution. For identifiability, you are encouraged to carefully check your model. If it turns out that part of your parameter displayed an extremely low effective sample size, it might be a signal for nonidentifiability. In terms of the prior setting, always using independent and simple prior is a good standard. If a restricted and dependent distribution is required, as in our example which indicated $0 < g_{j} < 1-s_{j} < 1$, we will need to transform the "complex" prior to a "simple" one for convergence. Here are stan codes for the illustration of the transformation process.
data{
int<lower = 1> J; // number of items
}
parameters {
// create a new vector (actually matrix) to store ordered parameters
row_vector[2] new_pars[J];
}
transformed parameters {
real<lower = 0, upper = 1> slip[J];
real<lower = 0, upper = 1> guess[J];
for (j in 1:J) {
// arrange the new parameter in ascending order
// and assign them to the desired position
guess[j] = sort_asc(new_pars[j])[1];
slip[j] = 1 - sort_asc(new_pars[j])[2];
}
}
model {
for (j in 1:J){
// sample new parameters from a simple prior
new_pars[j] ~ beta(1, 1);
}
...
}
Some final words for this section: setting different initial values for different chains can be helpful to avoid being stuck in the local optimal solution :)
Memory issues commonly arise when running a model with multiple parameters or generating posterior predictive samples. Exceeding the memory limit can result in the forced termination of a program. Below are two tricks that can help us out:
- Exclude non-interested parameters. We can eliminate specific parameters with the "pars" and "include" arguments.
As in the previous case, we are not interested in the transformed parameter, so we can
specify
sampling(object, data, pars = "new_pars", include = FALSE)
. In this way, samples of "new_pars" will not be stored. - Define a local block. In some cases, we need to generate posterior predictive samples for the posterior predictive checking.
The generated posterior samples may occupy considerable storage space, but in reality, we only need to compute some statistical
metrics based on these samples. In such cases, we can store only the statistical metrics and treat the posterior samples as local
variables enclosed within
{}
. This restricts their scope to the local block only, significantly reducing memory consumption.
- Welcome to the big family of Stan: https://discourse.mc-stan.org/
- Here are everything about Stan: https://mc-stan.org/users/documentation/
- Excellent Bayesian courses offered by Dr. Ben Lambert: https://youtube.com/playlist?list=PLwJRxp3blEvZ8AKMXOy0fc0cqT61GsKCG
- Perfect guide on errors and warnings: https://mc-stan.org/misc/warnings.html