How to find conditional distribution of missing values?

Hi, I’m new to bayesian statistics and pymc.

Say I have N variable correlated according to a MVN distribution. I have train dataset that I can use to find the posterior distribution of the MVN. Then I have new data which has missing observation on some subset of the N variables. I want to predict the values for those missing data.

What’s the way to do this in MCMC way? If we have a fixed single-point parameters, there’s a closed form solution for this:

X_U|X_O \sim \mathcal{N}(\mu_{X_U} + \Sigma_{X_UX_O} \Sigma_{X_OX_O}^{-1}(X_O - \mu_{X_O}), \Sigma_{X_UX_U} - \Sigma_{X_UX_O} \Sigma_{X_OX_O}^{-1} \Sigma_{X_OX_U}) \\ \Sigma = \begin{pmatrix} \Sigma_{X_UX_U} & \Sigma_{X_UX_O} \\ \Sigma_{X_OX_U} & \Sigma_{X_OX_O} \end{pmatrix} = \text{covariance of the N variables}\\ X_O = \text{observed variables}\\ X_U = \text{unobserved variables}\\

For prototyping, I generated a mock train dataset and test dataset, and then mask some values in the test dataset. My purpose is then to recover the masked values.

n_var = 10
_mu = stats.norm.rvs(loc=0, scale=0.8, size=n_var)
_sigmas = np.sqrt(stats.expon.rvs(0,0.5,size=n_var))
random_eigs = np.random.rand(n_var)+0.1
random_eigs = random_eigs / np.sum(random_eigs) * n_var
_corr = stats.random_correlation.rvs(eigs=random_eigs)
_cov = np.diag(_sigmas)@_corr@np.diag(_sigmas)

data = stats.multivariate_normal.rvs(mean=_mu, cov=_cov, size=2000)
bern = stats.bernoulli.rvs(0.5,size=2000)
train_data = data[bern==1]

test_data = data[bern==0]
test_data_mask =  stats.bernoulli.rvs(p=0.3,size=test_data.shape)

with pm.Model() as m:
    mu = pm.Normal("mu", mu=0,sigma=1.6, shape=n_var)

    chol, corr, vars = pm.LKJCholeskyCov(
        "chol_cov",
        n=n_var,
        eta=2,
        sd_dist=pm.Exponential.dist(1.0),
        compute_corr=True,
    )

    mvn = pm.MvNormal("mvn", mu=mu, chol=chol, observed=train_data)

    trace = pm.sample(2000)

Is this possible to do with pm.sample_posterior_predictive? How do I inject the test_data and the mask in it?

If you set the masked values to nan and pass that as observed, pymc will automatically infer those during the course of sample. You can find them in trace.posterior.

For test_data you will need to write the conditional model yourself. This blogpost may give some ideas, although it doesn’t have your specific example: Out of model predictions with PyMC

We want to add some tooling to create the right conditional MvNormal for prediction but we haven’t got around to it yet

1 Like

There a discussion about how to do this here. The whole thread is maybe interesting, but I linked the post with the best code.

1 Like

Thanks. That solved my problem.

The way you do it is by manually defining the transformation. It’s great since MVN has a nice closed form transformation for it. Now I’m curious, what if I my incomplete observation are from another distribution parameterized by the MVN? something like this:

\text{logit}(\vec{p}) \sim \mathcal{N}(\mu,\Sigma)\\ \text{observation}_i \sim \text{Binomial}(p_i,n_i)\\ \vec{p} = [p_0,p_1, \dots p_N]

Can we still find the posterior predictive for some missing \text{observation}_i?

I don’t really have a problem I need to solve with that kind of model right now, just curious if we can have pymc handle it.

You would have to rewrite the graph downstream of the conditional mvnormal but it should work. Simply gonna have more sources of randomness between the latent mvnormal and the final observations.

1 Like

I am going to assume that you are assuming that the “training data” and “new data” are both drawn from the same (parametric conditional) distribution. In that case, you can do slightly better than the recommendations here.

In Bayesian statistics, if you model the data and parameters, you can always treat missing data in the same way as parameters—just another latent or unknown value that you want to do posterior inference for conditioned on some observed data. Viewed this way, there’s nothing special about missing data inference—you just do joint inference for all your unknowns as usual. The marginal for the unknowns are your predictions based on the observed data.

This is different than fitting the model with the training data, then using posterior predictive inference on the test data. You can’t easily do that in a joint Bayesian model without something called the “cut” operator that was introduced in BUGS (the cut in question prevents information from the “new data” filtering back to inform the parameter estimates as it would in the general Bayesian joint model). And you can’t easily do it in models where you can’t generate the conditional prediction in a forward way as you can with multivariate normal distributions or the Bernoulli example (i.e., it requires MCMC sampling).

1 Like

That’s what setting the data as nan and calling sample would achieve.

But for pipelining predictions of new data, using forward sampling will be much faster, even if not a general solution.

Similarly, conjugacy if you want efficient new predictions and incremental learning.

I didn’t know about the cut operator. How is that implemented? One strategy that people now and then reach for is to try to reuse draws from a previous inference run, for parameters that are not meant to be updated.

This would work much like posterior predictive sampling except you still get mcmc inference for the new parameters. You probably run into the same random walk issues that arbitrarily partitioned step samplers (as opposed to carefully designed conditional gibbs samplers, or joint samplers ofc) tend to have.

I guess you can also infer two distinct subsets of parameters at once by not considering the logp contribution of the variables in the other set. Is that how the cut operator works?

I don’t know how it’s implemented, but the effect is exactly what you describe.

This comes up all the time in pharmacometric/pharmacodynamic (PK/PD) models. There you have a very well-informed PK model that given a time series of concentration of drugs in the body, but a less good PD model that given a time series of concentration in the body gives the conditional probability of a disease measurement such as blood pressure or number of correct guesses on an eye chart. The skeleton of the model is this

p(disease_meas, drug_conc, pk_params, pd_params)
= p(disease_meas | drug_meas, pd_params)
  * p(drug_meas | pk_params)
  * p(pk_params) * p(pd_params)
  • PD model: p(disease | drug_meas, pd_params)
  • PK model: p(drug_meas | pk_params)
  • priors: p(rate_params) and p(disease_params)

The Bayesian way to do inference would be to sample from the full posterior

p(pk_params, pd_params | disease_meas, drug_conc)

This can be problematic when the PK model is a lot better than the PD model because inference flows from the data jointly to the PK and PD parameters and the bad PD model can pull the PK model off course to the point where pharmacologists just know it’s wrong.

The “cut” way to do inference solves this “problem” by sampling N draws form the PK posterior,

pk_params(1), ..., pk_params(N)  ~ p(pk_params | drug_conc)

then plugs them in one at a time to draw the PD parameters,

pd_params(1) ~ p(pd_params | pk_params(1), disease_meas)
pd_params(2) ~ p(pd_params | pk_params(2), disease_meas)
...
pd_params(N) ~ p(pd_params | pk_params(N), disease_meas)

So you take a draw from the PD posterior given the drawn PK parameters and disease measurement. Now there’s nothing in the PD model that can infect the PK parameter estimate. The only problem is that the pair pd-param(n), pk_params(n) is not a draw from the Bayesian posterior. It has been “cut”.

Multiple imputation for missing data often works the same way when it’s done through preprocessing as it so often is.

See: Cuts in Bayesian graphical models. Martyn Plummer: https://statmath.wu.ac.at/research/talks/resources/Plummer_WU_2015.pdf

1 Like