Hello,
I have a dataset, in the form of a pandas DataFrame `c`, with columns `cat1`, `cat2`, and `value`. As the names suggest, `cat1` and `cat2` are categorical variables with 3 and 24 distinct values, respectively, and `value` is continuous. I am trying to model `value` with a Beta distribution whose alpha and beta parameters depend on the combination of `cat1` & `cat2`, i.e. for each pair `(cat1, cat2) = (c1, c2)` I assume there are parameters `alpha_c1_c2` and `beta_c1_c2` such that `c.loc[(c["cat1"] == c1) & (c["cat2"] == c2), "value"]` is `Beta(alpha_c1_c2, beta_c1_c2)`-distributed.
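Note: for the index-based parameter lookup in the model below, `cat1` and `cat2` need to hold integer codes (0–2 and 0–23). A minimal sketch of how I encode them, assuming the raw columns start out as arbitrary categories (`pd.factorize` is just one way to do it):

```python
import pandas as pd

# Map the raw categories to integer codes, so they can be used
# directly as indices into the (cat1, cat2) parameter arrays
c["cat1"], cat1_levels = pd.factorize(c["cat1"])  # codes 0..2
c["cat2"], cat2_levels = pd.factorize(c["cat2"])  # codes 0..23
```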
I have created & run my model as follows:
```python
import pymc as pm

with pm.Model(
    coords={
        "obs_idx": c.index,  # I don't know if this is necessary; I also tried without it, avoiding dims in the model's variables
        "cat1": c["cat1"].unique(),  # 3 distinct values
        "cat2": c["cat2"].unique(),  # 24 distinct values
    }
) as m:
    # idx = pm.Data("idx", c.index, dims="obs_idx")  # I don't know whether this is important
    cat1 = pm.Data("cat1", c["cat1"], dims="obs_idx")
    cat2 = pm.Data("cat2", c["cat2"], dims="obs_idx")
    output_data = pm.Data(
        "value",
        c["value"],
        dims="obs_idx",
    )
    # Alpha & beta priors
    alpha_prior = pm.HalfNormal("alpha_prior", sigma=4, dims=["cat1", "cat2"])
    beta_prior = pm.HalfNormal("beta_prior", sigma=4, dims=["cat1", "cat2"])
    # Every (cat1, cat2) pair has its own (alpha, beta) parameters
    alpha = pm.Deterministic("alpha", alpha_prior[cat1, cat2], dims="obs_idx")
    beta = pm.Deterministic("beta", beta_prior[cat1, cat2], dims="obs_idx")
    val = pm.Beta(
        "value",
        alpha=alpha,
        beta=beta,
        observed=output_data,
        shape=cat1.shape[0],  # or idx.shape[0], if idx is used
        dims="obs_idx",
    )
    pdata = pm.sample(
        draws=1600,
        chains=6,
        cores=6,
    )
```
The model runs as expected; if I check the posterior distributions of the `alpha_prior` and `beta_prior` variables, they are distinct, depending on the `cat1` and `cat2` indices; see the attached PDFs of the Beta distributions (stupid names, I know, because the priors become posteriors):
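To give an idea of what the attached figure shows, here is a rough sketch that plots the posterior-mean Beta PDF for every (cat1, cat2) pair, one subplot per `cat1` value (an approximation of the actual plotting code):

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Posterior-mean Beta parameters, shape (3, 24): (cat1, cat2)
a_mean = pdata.posterior.alpha_prior.mean(dim=["chain", "draw"])
b_mean = pdata.posterior.beta_prior.mean(dim=["chain", "draw"])

x = np.linspace(0.001, 0.999, 500)
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for i, ax in enumerate(axes):  # one subplot per cat1 value
    for j in range(a_mean.shape[1]):  # one curve per cat2 value
        ax.plot(x, scipy.stats.beta.pdf(x, a_mean.values[i, j], b_mean.values[i, j]))
    ax.set_title(f"cat1 = {i}")
plt.show()
```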
However, if I try to generate posterior predictive samples, they seem to be independent of the `cat1` and `cat2` data setup:
```python
with m:
    m.set_dim("obs_idx", new_length=168, coord_values=list(range(100000, 100168)))
    # if I avoid the `obs_idx` coordinate in the model definition, I can skip the line above
    pm.set_data(
        {
            "cat1": 24 * [2] + 24 * [0] + 5 * 24 * [2],
            "cat2": 7 * list(range(24)),
        }
    )
    post_data = pm.sample_posterior_predictive(
        pdata, var_names=["value"], random_seed=256, predictions=True
    )
```
I would expect the first 24 and the last 120 values generated in `post_data` to follow the distributions in the third subplot above (cat1 = 2), and the values from 24 to 48 to follow the distributions in the first subplot (cat1 = 0). But if I plot

```python
plt.plot(post_data.predictions.value.mean(dim=["draw", "chain"]).data)
```
I get the following:
If I manually implement sampling from the inference data (named `pdata` in my case), I get reasonable output:
```python
import numpy as np
import scipy.stats

cat1_seq = 24 * [2] + 24 * [0] + 5 * 24 * [2]
cat2_seq = 7 * list(range(24))
values = []
np.random.seed(314)
for c1id, c2id in zip(cat1_seq, cat2_seq):
    # pick a random posterior draw, then sample from its Beta distribution
    _chain_id = np.random.randint(6)
    _draw_id = np.random.randint(1600)
    values.append(
        scipy.stats.beta(
            a=pdata.posterior.alpha_prior[_chain_id, _draw_id, c1id, c2id],
            b=pdata.posterior.beta_prior[_chain_id, _draw_id, c1id, c2id],
        ).rvs()
    )
plt.plot(range(168), values)
```
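(The loop above can equivalently be vectorized with xarray's pointwise indexing; a sketch, assuming the `cat1`/`cat2` coordinates are the integer codes 0–2 and 0–23 in order:)

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(314)
n = len(cat1_seq)  # 168

# one random posterior (chain, draw) per generated observation
chain_ids = xr.DataArray(rng.integers(0, 6, size=n), dims="obs")
draw_ids = xr.DataArray(rng.integers(0, 1600, size=n), dims="obs")
c1 = xr.DataArray(cat1_seq, dims="obs")
c2 = xr.DataArray(cat2_seq, dims="obs")

# vectorized pointwise selection of alpha/beta, then one Beta draw each
a = pdata.posterior.alpha_prior.isel(chain=chain_ids, draw=draw_ids, cat1=c1, cat2=c2)
b = pdata.posterior.beta_prior.isel(chain=chain_ids, draw=draw_ids, cat1=c1, cat2=c2)
values_vec = rng.beta(a, b)
```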
So it seems, from the first graph with three subplots, that the model learns the differences between the (cat1, cat2) combinations. But I cannot get posterior predictive sampling to use a sequence of (cat1, cat2) combinations of my choice.
How can I make the model use the provided input data for `cat1` and `cat2` during posterior predictive sampling?
Thanks,
Gasper


