Hello,
I have a dataset, in the form of a pandas DataFrame `c`, with columns `cat1`, `cat2`, and `value`. As the names suggest, `cat1` and `cat2` are categorical variables with 3 and 24 distinct values, respectively, and `value` is continuous. I am trying to model `value` with a Beta distribution whose alpha and beta parameters depend on the combination of `cat1` & `cat2`, i.e. for each pair `(cat1, cat2) = (c1, c2)` I assume there are parameters `alpha_c1_c2` and `beta_c1_c2` such that `c.loc[(c["cat1"] == c1) & (c["cat2"] == c2), "value"]` is `Beta(alpha_c1_c2, beta_c1_c2)`-distributed.
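Note: for the index-based parameter lookup in the model below, `cat1` and `cat2` need to hold integer codes (0–2 and 0–23). A minimal sketch of how I encode them, assuming the raw columns start out as arbitrary categories (`pd.factorize` is just one way to do it):

```python
import pandas as pd

# Map the raw categories to integer codes, so they can be used
# directly as indices into the (cat1, cat2) parameter arrays
c["cat1"], cat1_levels = pd.factorize(c["cat1"])  # codes 0..2
c["cat2"], cat2_levels = pd.factorize(c["cat2"])  # codes 0..23
```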
I have created & run my model as follows:
```python
import pymc as pm

with pm.Model(
    coords={
        "obs_idx": c.index,  # I don't know if this is necessary; I also tried without it, avoiding dims in the model's variables
        "cat1": c["cat1"].unique(),  # 3 distinct values
        "cat2": c["cat2"].unique(),  # 24 distinct values
    }
) as m:
    # idx = pm.Data("idx", c.index, dims="obs_idx")  # I don't know whether this is important
    cat1 = pm.Data("cat1", c["cat1"], dims="obs_idx")
    cat2 = pm.Data("cat2", c["cat2"], dims="obs_idx")
    output_data = pm.Data(
        "value",
        c["value"],
        dims="obs_idx",
    )
    # Alpha & beta priors
    alpha_prior = pm.HalfNormal("alpha_prior", sigma=4, dims=["cat1", "cat2"])
    beta_prior = pm.HalfNormal("beta_prior", sigma=4, dims=["cat1", "cat2"])
    # Every (cat1, cat2) pair has its own (alpha, beta) parameters
    alpha = pm.Deterministic("alpha", alpha_prior[cat1, cat2], dims="obs_idx")
    beta = pm.Deterministic("beta", beta_prior[cat1, cat2], dims="obs_idx")
    val = pm.Beta(
        "value",
        alpha=alpha,
        beta=beta,
        observed=output_data,
        shape=cat1.shape[0],  # or idx.shape[0], if idx is used
        dims="obs_idx",
    )
    pdata = pm.sample(
        draws=1600,
        chains=6,
        cores=6,
    )
```
The model runs as expected; if I check the posterior distributions of the `alpha_prior` and `beta_prior` variables, they are distinct, depending on the `cat1` and `cat2` indices; see the attached PDFs of the Beta distributions (stupid names, I know, because the priors become posteriors):
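To give an idea of what the attached figure shows, here is a rough sketch that plots the posterior-mean Beta PDF for every (cat1, cat2) pair, one subplot per `cat1` value (an approximation of the actual plotting code):

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Posterior-mean Beta parameters, shape (3, 24): (cat1, cat2)
a_mean = pdata.posterior.alpha_prior.mean(dim=["chain", "draw"])
b_mean = pdata.posterior.beta_prior.mean(dim=["chain", "draw"])

x = np.linspace(0.001, 0.999, 500)
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for i, ax in enumerate(axes):  # one subplot per cat1 value
    for j in range(a_mean.shape[1]):  # one curve per cat2 value
        ax.plot(x, scipy.stats.beta.pdf(x, a_mean.values[i, j], b_mean.values[i, j]))
    ax.set_title(f"cat1 = {i}")
plt.show()
```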
However, if I try to generate posterior predictive samples, they seem to be independent of the `cat1` and `cat2` data setup:
```python
with m:
    m.set_dim("obs_idx", new_length=168, coord_values=list(range(100000, 100168)))
    # if I avoid the `obs_idx` coordinate in the model definition, I can skip the line above
    pm.set_data(
        {
            "cat1": 24 * [2] + 24 * [0] + 5 * 24 * [2],
            "cat2": 7 * list(range(24)),
        }
    )
    post_data = pm.sample_posterior_predictive(
        pdata, var_names=["value"], random_seed=256, predictions=True
    )
```
I would expect the first 24 and the last 120 values generated in `post_data` to follow the distributions in the third subplot above (cat1 = 2), and the values from 24 to 48 to follow the distributions in the first subplot (cat1 = 0). But if I plot

```python
plt.plot(post_data.predictions.value.mean(dim=["draw", "chain"]).data)
```
I get the following:
If I manually implement sampling from the inference data (named `pdata` in my case), I get reasonable output:
```python
import numpy as np
import scipy.stats

cat1_seq = 24 * [2] + 24 * [0] + 5 * 24 * [2]
cat2_seq = 7 * list(range(24))
values = []
np.random.seed(314)
for c1id, c2id in zip(cat1_seq, cat2_seq):
    # pick a random posterior draw, then sample from its Beta distribution
    _chain_id = np.random.randint(6)
    _draw_id = np.random.randint(1600)
    values.append(
        scipy.stats.beta(
            a=pdata.posterior.alpha_prior[_chain_id, _draw_id, c1id, c2id],
            b=pdata.posterior.beta_prior[_chain_id, _draw_id, c1id, c2id],
        ).rvs()
    )
plt.plot(range(168), values)
```
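(The loop above can equivalently be vectorized with xarray's pointwise indexing; a sketch, assuming the `cat1`/`cat2` coordinates are the integer codes 0–2 and 0–23 in order:)

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(314)
n = len(cat1_seq)  # 168

# one random posterior (chain, draw) per generated observation
chain_ids = xr.DataArray(rng.integers(0, 6, size=n), dims="obs")
draw_ids = xr.DataArray(rng.integers(0, 1600, size=n), dims="obs")
c1 = xr.DataArray(cat1_seq, dims="obs")
c2 = xr.DataArray(cat2_seq, dims="obs")

# vectorized pointwise selection of alpha/beta, then one Beta draw each
a = pdata.posterior.alpha_prior.isel(chain=chain_ids, draw=draw_ids, cat1=c1, cat2=c2)
b = pdata.posterior.beta_prior.isel(chain=chain_ids, draw=draw_ids, cat1=c1, cat2=c2)
values_vec = rng.beta(a, b)
```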
So it seems, from the first graph with three subplots, that the model learns the differences between the (cat1, cat2) combinations. But I cannot get posterior predictive sampling to use a sequence of (cat1, cat2) combinations of my choice.
How can I make the model use the provided input data for `cat1` and `cat2` during posterior predictive sampling?
Thanks,
Gasper


