# Model class selection
When averaging over partitions, we may be interested in evaluating which
**model class** provides a better fit to the data, considering all
possible parameter choices. This is done by evaluating the model
evidence, summed over all possible partitions {cite}`inf-peixoto_nonparametric_2017`:
$$
P(\boldsymbol A) = \sum_{\boldsymbol\theta,\boldsymbol b}P(\boldsymbol A,\boldsymbol\theta, \boldsymbol b) = \sum_{\boldsymbol b}P(\boldsymbol A,\boldsymbol b).
$$
This quantity is analogous to a [partition function](<https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)>)
in statistical physics, which we can write more conveniently as a
negative [free energy](https://en.wikipedia.org/wiki/Thermodynamic_free_energy) by taking
its logarithm
$$
\ln P(\boldsymbol A) = \underbrace{\sum_{\boldsymbol b}q(\boldsymbol b)\ln P(\boldsymbol A,\boldsymbol b)}_{-\left<\Sigma\right>}\;
\underbrace{- \sum_{\boldsymbol b}q(\boldsymbol b)\ln q(\boldsymbol b)}_{\mathcal{S}}
$$ (free-energy)
where
$$
q(\boldsymbol b) = \frac{P(\boldsymbol A,\boldsymbol b)}{\sum_{\boldsymbol b'}P(\boldsymbol A,\boldsymbol b')}
$$
is the posterior probability of partition $\boldsymbol b$. The
first term of Eq. {eq}`free-energy` (the "negative energy") is minus the
average description length $\left<\Sigma\right>$, weighted
according to the posterior distribution. The second term
$\mathcal{S}$ is the [entropy](<https://en.wikipedia.org/wiki/Entropy_(information_theory)>) of the
posterior distribution, and measures, in a sense, the "quality of fit"
of the model: If the posterior is very "peaked", i.e. dominated by a
single partition with a very large probability, the entropy will tend to
zero. However, if there are many partitions with similar probabilities
--- meaning that there is no single partition that describes the network
uniquely well --- it will take a large value instead.
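As a toy illustration of this decomposition (not involving graph-tool), suppose the posterior is supported on only three partitions with hypothetical joint probabilities $P(\boldsymbol A,\boldsymbol b)$; Eq. {eq}`free-energy` can then be verified numerically:

```python
import numpy as np

# Hypothetical joint probabilities P(A, b) for three partitions b
P_joint = np.array([0.02, 0.01, 0.005])

# The evidence P(A) is the sum over all partitions
P_A = P_joint.sum()

# Posterior q(b) = P(A, b) / P(A)
q = P_joint / P_A

# Description lengths Σ(b) = -ln P(A, b)
Sigma = -np.log(P_joint)

# "Negative energy": -<Σ>, the posterior average of ln P(A, b)
neg_energy = -np.sum(q * Sigma)

# Posterior entropy: S = -Σ_b q(b) ln q(b)
S = -np.sum(q * np.log(q))

# The two terms recover ln P(A) exactly
print(np.allclose(neg_energy + S, np.log(P_A)))  # → True
```

The identity holds exactly here because $q(\boldsymbol b)$ is just the normalized joint distribution; in practice neither sum can be computed exhaustively, which is what motivates the sampling approach below.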
Since the MCMC algorithm samples partitions from the distribution
$q(\boldsymbol b)$, it can be used to compute
$\left<\Sigma\right>$ easily, simply by averaging the description
length values encountered while sampling from the posterior
distribution.
The computation of the posterior entropy $\mathcal{S}$, however,
is significantly more difficult, since it involves measuring the precise
value of $q(\boldsymbol b)$. A direct "brute force" computation of
$\mathcal{S}$ is implemented via
{meth}`~graph_tool.inference.BlockState.collect_partition_histogram`
and {func}`~graph_tool.inference.microstate_entropy`; however,
this is only feasible for very small networks. For larger networks, we
are forced to perform approximations. One possibility is to employ the
method described in {cite}`inf-peixoto_revealing_2021`, based on fitting a
mixture "random label" model to the posterior distribution, which allows
us to compute its entropy. In graph-tool this is done by using
{class}`~graph_tool.inference.ModeClusterState`, as we
show in the example below.
```{testsetup} model-evidence
np.random.seed(43)
gt.seed_rng(43)
```
```{testcode} model-evidence
from scipy.special import gammaln

g = gt.collection.data["lesmis"]

for deg_corr in [True, False]:
    # Initialize the Markov chain from the "ground state"
    state = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=deg_corr))

    dls = []         # description length history
    bs = []          # partitions

    def collect_partitions(s):
        global bs, dls
        bs.append(s.get_state().a.copy())
        dls.append(s.entropy())

    # Now we collect 2000 partitions; but the larger this is, the
    # more accurate will be the calculation
    gt.mcmc_equilibrate(state, force_niter=2000, mcmc_args=dict(niter=10),
                        callback=collect_partitions)

    # Infer partition modes
    pmode = gt.ModeClusterState(bs)

    # Minimize the mode state itself
    gt.mcmc_equilibrate(pmode, wait=1, mcmc_args=dict(niter=1, beta=np.inf))

    # Posterior entropy
    H = pmode.posterior_entropy()

    # log(B!) term
    logB = np.mean(gammaln(np.array([len(np.unique(b)) for b in bs]) + 1))

    # Evidence
    L = -np.mean(dls) + logB + H

    print(f"Model log-evidence for deg_corr = {deg_corr}: {L}")
```
```{testoutput} model-evidence
Model log-evidence for deg_corr = True: -680.709748...
Model log-evidence for deg_corr = False: -670.79211...
```
The outcome shows a slight preference for the non-degree-corrected model.
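The strength of this preference can be quantified via the posterior odds ratio between the two model classes, obtained from the difference of their log-evidences (assuming equal prior probabilities for both classes). A quick sketch using the values printed above:

```python
import numpy as np

# Log-evidences obtained above for the lesmis network
L_dc = -680.709748    # degree-corrected
L_ndc = -670.79211    # non-degree-corrected

# Log posterior odds in favor of the non-degree-corrected model,
# assuming equal prior probabilities for both model classes
log_odds = L_ndc - L_dc
print(f"Posterior odds (NDC : DC) = exp({log_odds:.2f}) = {np.exp(log_odds):.0f}")
```

A difference of around 10 nats corresponds to odds on the order of $10^4$ in favor of the non-degree-corrected model.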
When using the nested model, the approach is entirely analogous, as we
show below for the same network.
```{testsetup} nested-model-evidence
np.random.seed(43)
gt.seed_rng(42)
```
```{testcode} nested-model-evidence
from scipy.special import gammaln

g = gt.collection.data["lesmis"]

for deg_corr in [True, False]:
    state = gt.NestedBlockState(g, state_args=dict(deg_corr=deg_corr))

    # Equilibrate
    gt.mcmc_equilibrate(state, force_niter=1000, mcmc_args=dict(niter=10))

    dls = []         # description length history
    bs = []          # partitions

    def collect_partitions(s):
        global bs, dls
        bs.append(s.get_state())
        dls.append(s.entropy())

    # Now we collect 5000 partitions; but the larger this is, the
    # more accurate will be the calculation
    gt.mcmc_equilibrate(state, force_niter=5000, mcmc_args=dict(niter=10),
                        callback=collect_partitions)

    # Infer partition modes
    pmode = gt.ModeClusterState(bs, nested=True)

    # Minimize the mode state itself
    gt.mcmc_equilibrate(pmode, wait=1, mcmc_args=dict(niter=1, beta=np.inf))

    # Posterior entropy
    H = pmode.posterior_entropy()

    # log(B!) term
    logB = np.mean([sum(gammaln(len(np.unique(bl)) + 1) for bl in b) for b in bs])

    # Evidence
    L = -np.mean(dls) + logB + H

    print(f"Model log-evidence for deg_corr = {deg_corr}: {L}")
```
```{testoutput} nested-model-evidence
Model log-evidence for deg_corr = True: -670.572583...
Model log-evidence for deg_corr = False: -662.13989...
```
The situation is the same: the non-degree-corrected model possesses the largest
evidence. Note also that the nested models achieve a higher evidence
than their non-nested counterparts, which is not entirely surprising, since the
non-nested model is a special case of the nested one.
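With equal priors over model classes, the four log-evidences obtained above can also be converted into posterior probabilities for each model via a softmax; a brief sketch (using the values printed above, and subtracting the maximum before exponentiating for numerical stability):

```python
import numpy as np

# Log-evidences from the examples above
L = np.array([-680.709748,   # non-nested, degree-corrected
              -670.79211,    # non-nested, non-degree-corrected
              -670.572583,   # nested, degree-corrected
              -662.13989])   # nested, non-degree-corrected

# Posterior probability of each model class, assuming equal priors;
# shifting by L.max() avoids underflow when exponentiating
p = np.exp(L - L.max())
p /= p.sum()
print(p)  # the nested non-degree-corrected model dominates
```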