Hi, I’m new to Bayesian statistics and PyMC.
Say I have N variables correlated according to a multivariate normal (MVN) distribution. I have a training dataset that I can use to find the posterior distribution of the MVN parameters. Then I have new data with missing observations on some subset of the N variables, and I want to predict those missing values.
What’s the way to do this with MCMC? If the parameters were fixed at a single point, there would be a closed-form solution:
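For concreteness, the standard result (partitioning x into an observed block x_o and a missing block x_m, with mu and Sigma partitioned conformably) is, if I have it right:

```latex
x_m \mid x_o \sim \mathcal{N}\!\left(
  \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o),\;
  \Sigma_{mm} - \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\Sigma_{om}
\right)
```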
For prototyping, I generated mock train and test datasets, then masked some values in the test dataset. My goal is to recover the masked values.
import numpy as np
from scipy import stats
import pymc as pm

n_var = 10

# Random mean vector and standard deviations
_mu = stats.norm.rvs(loc=0, scale=0.8, size=n_var)
_sigmas = np.sqrt(stats.expon.rvs(0, 0.5, size=n_var))

# Random correlation matrix (the eigenvalues must sum to n_var)
random_eigs = np.random.rand(n_var) + 0.1
random_eigs = random_eigs / np.sum(random_eigs) * n_var
_corr = stats.random_correlation.rvs(eigs=random_eigs)
_cov = np.diag(_sigmas) @ _corr @ np.diag(_sigmas)

# Draw 2000 samples and split into train/test
data = stats.multivariate_normal.rvs(mean=_mu, cov=_cov, size=2000)
bern = stats.bernoulli.rvs(0.5, size=2000)
train_data = data[bern == 1]
test_data = data[bern == 0]

# 1 marks an entry I want to treat as missing and recover
test_data_mask = stats.bernoulli.rvs(p=0.3, size=test_data.shape)
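To make the mask convention explicit, here is a minimal self-contained sketch (mirroring the `test_data`/`test_data_mask` setup above, but with smaller sizes) of turning the 0/1 mask into a NumPy masked array, where masked entries are the ones to recover:

```python
import numpy as np
from scipy import stats

# Small stand-ins for test_data and test_data_mask
test_data = stats.multivariate_normal.rvs(mean=np.zeros(3), cov=np.eye(3), size=5)
test_data_mask = stats.bernoulli.rvs(p=0.3, size=test_data.shape)

# mask == 1 marks a missing entry; np.ma treats True as "invalid"
test_masked = np.ma.masked_array(test_data, mask=test_data_mask.astype(bool))

# Observed entries remain accessible; count() gives the number of unmasked ones
n_observed = test_masked.count()
```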
with pm.Model() as m:
    mu = pm.Normal("mu", mu=0, sigma=1.6, shape=n_var)
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol_cov",
        n=n_var,
        eta=2,
        sd_dist=pm.Exponential.dist(1.0),
        compute_corr=True,
    )
    mvn = pm.MvNormal("mvn", mu=mu, chol=chol, observed=train_data)
    trace = pm.sample(2000)
Is this possible to do with pm.sample_posterior_predictive? And if so, how do I pass test_data and the mask into it?
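In case it helps frame the question: one workaround I have considered (plain NumPy, not a PyMC API) is to apply the closed-form conditional separately for each posterior draw and average the results. A minimal sketch, assuming posterior draws are available as arrays `mu_draws` and `cov_draws` (here filled with hypothetical values; in practice they would come from `trace`):

```python
import numpy as np

def conditional_mean(mu, cov, x, missing):
    """Closed-form E[x_missing | x_observed] for one MVN parameter draw.

    mu: (n,) mean vector; cov: (n, n) covariance;
    x: (n,) observation (values at missing positions are ignored);
    missing: (n,) boolean, True where the entry is missing.
    """
    obs = ~missing
    cov_mo = cov[np.ix_(missing, obs)]
    cov_oo = cov[np.ix_(obs, obs)]
    return mu[missing] + cov_mo @ np.linalg.solve(cov_oo, x[obs] - mu[obs])

# Hypothetical posterior draws standing in for trace contents
rng = np.random.default_rng(0)
n_draws, n = 100, 4
mu_draws = 0.01 * rng.normal(size=(n_draws, n))
cov_draws = np.tile(np.eye(n), (n_draws, 1, 1))

x = np.array([1.0, 2.0, np.nan, np.nan])
missing = np.isnan(x)

# Posterior-mean prediction: average the conditional mean over draws
preds = np.mean(
    [conditional_mean(m, c, x, missing) for m, c in zip(mu_draws, cov_draws)],
    axis=0,
)
```

This averages over parameter uncertainty but only yields point predictions; drawing from the full conditional normal per draw would give predictive samples instead.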