---
title: "The Sum-to-Zero Constraint in Stan"
author: "Mitzi Morris"
format:
  html:
    theme:
      light: [cosmo, quarto-config/theme.scss]
      dark: [cosmo, quarto-config/theme-dark.scss]
    syntax-definitions:
      - quarto-config/stan.xml
    highlight-style:
      light: quarto-config/tango.theme
      dark: quarto-config/nord.theme
    code-copy: true
    code-overflow: wrap
    code-fold: true
    code-summary: "Show Code"
    toc: true
    css: quarto-config/quarto_styles.css
    page-layout: full
    toc-location: right
    embed-resources: true
execute:
  engine: jupyter
include-in-header:
  - text: |
      <style>
      .table {
        width: auto;
      }
      </style>
---
*Load libraries and set up notebook.*
```{python}
# libraries used in this notebook
import os
import json
import numpy as np
import pandas as pd
import geopandas as gpd
import plotnine as p9
import libpysal
from splot.libpysal import plot_spatial_weights
from random import randint
from cmdstanpy import CmdStanModel, cmdstan_path, set_cmdstan_path
import logging
cmdstanpy_logger = logging.getLogger("cmdstanpy")
cmdstanpy_logger.setLevel(logging.FATAL)
set_cmdstan_path(os.path.join('/Users', 'mitzi', 'github', 'stan-dev', 'cmdstan'))
import warnings
warnings.filterwarnings('ignore')
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from utils_html import *
# notebook display options
plt.rcParams['figure.figsize'] = (7, 7)
plt.rcParams['figure.dpi'] = 100
np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)
pd.set_option('display.precision', 2)
pd.options.display.float_format = '{:.2f}'.format
```
## Introducing the `sum_to_zero_vector` Constrained Parameter Type
The [`sum_to_zero_vector`](https://mc-stan.org/docs/reference-manual/transforms.html#zero-sum-vector)
constrained parameter type was introduced in the [Stan 2.36 release](https://github.com/stan-dev/cmdstan/releases/tag/v2.36.0).
The parameter declaration:
```stan
sum_to_zero_vector[K] beta;
```
produces a vector of size `K` such that `sum(beta) = 0`.
The unconstrained representation requires only `K - 1` values,
since the constraint removes one degree of freedom.
If you put a normal prior on a zero-sum vector, the resulting marginal variance
of each element will be less than the variance of that normal distribution.
To recover the intended variance, scale the prior's standard deviation:
```stan
parameters {
  sum_to_zero_vector[N] z;
}
model {
  z ~ normal(0, sqrt(N / (N - 1.0)) * sigma);
}
```
where `sigma` is the intended standard deviation.
(Note: it is slightly more efficient to pre-compute `sqrt(N / (N - 1.0))` in the `transformed data` block.)
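The variance claim is easy to check numerically. The following NumPy sketch (not part of this notebook's analysis) relies on the fact that a zero-sum vector with independent normal priors on its elements is equivalent to projecting i.i.d. normal draws onto the sum-to-zero subspace, which shrinks each marginal variance to $\sigma^2 (N-1)/N$:

```python
import numpy as np

# Sketch: project i.i.d. normal(0, sigma) draws onto the sum-to-zero subspace
# by subtracting each row's mean, then inspect the marginal variances.
rng = np.random.default_rng(0)
N, sigma = 5, 2.0
draws = rng.normal(0.0, sigma, size=(200_000, N))
zero_sum = draws - draws.mean(axis=1, keepdims=True)  # rows now sum to zero

print(zero_sum.var(axis=0))  # each element: ~ sigma^2 * (N - 1) / N = 3.2, not 4.0
# scaling the prior sd by sqrt(N / (N - 1)) restores the intended variance:
print((np.sqrt(N / (N - 1.0)) * sigma) ** 2 * (N - 1.0) / N)  # = sigma^2 = 4.0
```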
Prior to Stan 2.36, a sum-to-zero constraint could be implemented either
- as a "hard" sum to zero constraint, where the parameter is declared to be an $N-1$ length vector with a corresponding $N$-length transformed parameter
whose first $N-1$ elements are the same as the corresponding parameter vector, and the $N^{th}$ element is the negative sum of the $N-1$ elements.
- as a "soft" sum to zero constraint, where the parameter is an $N$-length vector with a prior on the sum of the elements that keeps it within $\epsilon$ of $0$.
The choice between the two was largely empirical.
As a general rule, the hard sum-to-zero constraint is more efficient for small vectors
and the soft sum-to-zero constraint is faster for larger ones,
but much depends on the specifics of the model and the data.
The hard sum-to-zero constraint is often problematic.
For small $N$ and models with sensible priors it is usually satisfactory,
but as the size of the vector grows, it distorts the marginal variance of the $N^{th}$ element.
Given a parameter vector:
$$
x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2)
$$
by the properties of independent normal variables, each of the free elements $x_1, \ldots, x_{N-1}$ has variance $\sigma^2$.
However, the $N^{th}$ element is defined deterministically as:
$$
x_N = -\sum_{i=1}^{N-1} x_i
$$
and its variance is inflated by a factor of $N-1$.
$$
\operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr)
= \sum_{i=1}^{N-1} \operatorname{Var}(x_i)
= (N-1)\sigma^2.
$$
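A short simulation (NumPy sketch, not from the notebook) confirms this inflation by constructing the hard constraint directly:

```python
import numpy as np

# Sketch: the first N-1 elements are i.i.d. normal(0, sigma);
# the hard constraint fills in the last element as their negative sum.
rng = np.random.default_rng(1)
N, sigma = 10, 1.0
free = rng.normal(0.0, sigma, size=(100_000, N - 1))
x_last = -free.sum(axis=1)

print(free.var(axis=0).mean())  # ~ sigma^2 = 1.0
print(x_last.var())             # ~ (N - 1) * sigma^2 = 9.0
```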
For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint,
as every change to any of the $N-1$ elements also requires a corresponding change to
the $N^{th}$ element; balancing these changes introduces potential non-identifiabilities.
The soft sum-to-zero constraint is problematic for the following reasons.
* The tolerance $\epsilon$ (the scale of the penalty) must be chosen by the analyst.
If it is too large, the sum strays too far from zero for the constraint to be effective;
if it is too small, the sampler cannot satisfy the constraint.
* The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters.
This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
* The marginal variances may not reflect the intended prior.
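For concreteness, a typical soft sum-to-zero formulation looks like the following sketch (illustrative only; the tolerance shown is a common but arbitrary choice, not taken from this notebook's models):

```stan
parameters {
  vector[N] phi;
}
model {
  // soft sum-to-zero: penalize deviations of the sum from zero;
  // the scale 0.001 * N is the analyst's choice of epsilon
  sum(phi) ~ normal(0, 0.001 * N);
}
```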
The `sum_to_zero_vector` transform ensures that each element of the resulting constrained vector has the same variance.
This improves the sampler performance, providing fast computation and good effective sample size.
This becomes increasingly noticeable as models increase in size and complexity.
To demonstrate this, in this notebook we consider two different classes of models:
- Multi-level regressions for binomial data with group-level categorical predictors.
- Spatial models for areal data.
For these models, we provide three implementations which differ only in the
implementation of the sum-to-zero constraint: the built-in `sum_to_zero_vector`,
and the hard and soft sum-to-zero implementations.
We fit each model to the same dataset, using the same random seed, and then
compare the summary statistics for the constrained parameter values.
Since the models are nearly equivalent except in the implied prior distributions
on the vector, we expect that all three implementations
should produce nearly equivalent estimates; what differs is the speed of computation,
as measured by effective samples per second.
::: {.callout-note}
The spatial models are taken from a set of notebooks available from the GitHub repo [https://github.com/mitzimorris/geomed_2024](https://github.com/mitzimorris/geomed_2024).
:::
## Multi-level Models with Group-level Categorical Predictors
In this section we consider a model which estimates per-demographic disease prevalence rates for a population.
The model is taken from the Gelman and Carpenter, 2020
[Bayesian Analysis of Tests with Unknown Specificity and Sensitivity](https://doi.org/10.1111/rssc.12435).
It combines a model for multilevel regression and post-stratification with a model that
accounts for test sensitivity and specificity.
The data consists of:
* A set of per-demographic aggregated outcomes of a diagnostic test procedure
with unequal number of tests per demographic.
* A corresponding set of demographic descriptors encoded as a vector of categorical values.
In this example these are named `sex`, `age`, `eth`, and `edu`, but there can be any number
of demographic predictors with any number of categories.
* The specified test sensitivity and specificity.
Putting a sum-to-zero constraint on the categorical coefficient vectors allows us to model all categories as offsets from the mean.
We prefer this to the alternatives of dropping a category as a reference or using some other contrast coding strategy.
### The Stan model
The full model is in file [binomial_4_preds_ozs.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4_preds_ozs.stan).
It provides an estimate of the true prevalence based on binary tests with
a given (or unknown) test sensitivity and specificity, and uses a `sum_to_zero_vector`
to constrain the group-level parameters `age`, `eth`, and `edu`.
In order to put a standard normal prior on `beta_age`, `beta_eth`, and `beta_edu`,
we need to scale the variance, as suggested above.
The scaling factors are pre-computed in the `transformed data` block,
and applied as part of the prior.
*Relevant excerpts from* `binomial_4_preds_ozs.stan`
```stan
transformed data {
// scaling factors for marginal variances of sum_to_zero_vectors
real s_age = sqrt(N_age / (N_age - 1.0));
real s_eth = sqrt(N_eth / (N_eth - 1.0));
real s_edu = sqrt(N_edu / (N_edu - 1.0));
}
parameters {
real beta_0;
real beta_sex;
real<lower=0> sigma_age, sigma_eth, sigma_edu;
sum_to_zero_vector[N_age] beta_age;
sum_to_zero_vector[N_eth] beta_eth;
sum_to_zero_vector[N_edu] beta_edu;
}
transformed parameters {
// true prevalence
vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age]
+ beta_eth[eth] + beta_edu[edu]);
// incorporate test sensitivity and specificity.
vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
pos_tests ~ binomial(tests, p_sample); // data model
// scale normal priors on sum_to_zero_vectors
beta_age ~ normal(0, s_age * sigma_age);
beta_eth ~ normal(0, s_eth * sigma_eth);
beta_edu ~ normal(0, s_edu * sigma_edu);
...
}
```
#### Simulating Data
To investigate the predictive behavior of this model at different timepoints in a pandemic,
we have written a data-generating program to create datasets given the
baseline disease prevalence, test specificity and sensitivity,
and the desired number of diagnostic tests.
In the `generated quantities` block we use Stan's PRNG functions to populate
the true weights for the categorical coefficient vectors and the relative percentages
of per-category observations.
Then we use a set of nested loops to generate the data for each demographic,
using the PRNG equivalent of the data model.
The full data-generating program is in file [gen_binomial_4_preds.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/gen_binomial_4_preds.stan).
The helper function `simulate_data` in file `utils.py` sets up the data-generating program
according to the specified number of demographic categories, approximate average number of
observations per category, and baseline disease prevalence, test specificity and sensitivity.
This allows us to create datasets for large and small populations
and for finer- or coarser-grained sets of categories.
The more strata overall, the more observations are needed to get good coverage.
Because the modeled data `pos_tests` is generated according to the Stan model,
the model is a priori well-specified with respect to the data.
::: {.callout-note}
Because the data-generating parameters and percentage of observations per category are generated at random,
some datasets may have very low overall disease rates and/or many unobserved strata.
While this is informative for understanding what is consistent with the generating distributions,
it results in estimates of varying effect parameters with near-zero values,
which are on the boundary of parameter space, and hence very challenging to fit.
Because we wish to compare baseline performance,
we have cherry-picked a seed which generates datasets that avoid this issue.
:::
*Generate several datasets of different sizes.*
```{python}
#| output: false
# specify data shapes and test characteristics
N_eth = 3
N_edu = 5
N_age = 9
baseline = -3.5
sens = 0.75
spec = 0.9995
# generate datasets
from utils import simulate_data
data_tiny = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 4, seed=56789)
data_small = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 17, seed=data_tiny['seed'])
data_large = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 200, seed=data_tiny['seed'])
```
The data generating program creates unequal numbers of tests per demographic.
The following tables summarize the resulting distribution.
:::: columns
::: {.column width="32%"}
```{python}
print("Small Dataset")
print("tests total " + str(np.sum(data_small['tests'])))
small_df = pd.DataFrame({'tests per demographic': data_small['tests']}).describe()
small_df.drop('count', inplace=True)
small_df
```
:::
::: {.column width="32%"}
```{python}
print("Large Dataset")
print("tests total " + str(np.sum(data_large['tests'])))
large_df = pd.DataFrame({'tests per demographic': data_large['tests']}).describe()
large_df.drop('count', inplace=True)
large_df
```
:::
::: {.column width="32%"}
```{python}
print("Tiny Dataset")
print("tests total " + str(np.sum(data_tiny['tests'])))
tiny_df = pd.DataFrame({'tests per demographic': data_tiny['tests']}).describe()
tiny_df.drop('count', inplace=True)
tiny_df
```
:::
::::
### Model Fitting
We fit each model to the large and small datasets.
We record the seed used for the first run and reuse it for all subsequent fits; this ensures that the models have the same set of initial parameter values
for all parameters except the constrained parameters `beta_eth`, `beta_edu`, and `beta_age`.
**Model 1: `sum_to_zero_vector`**
This model is in file [binomial_4_preds_ozs.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4_preds_ozs.stan).
*Compile and fit the model to data.*
```{python}
#| output: false
binomial_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs.stan'))
print("Model 1: sum_to_zero_vector, small dataset")
binomial_ozs_fit = binomial_ozs_mod.sample(data=data_small, parallel_chains=4)
a_seed = binomial_ozs_fit.metadata.cmdstan_config['seed']
print("Model 1: sum_to_zero_vector, large dataset")
binomial_ozs_fit_lg = binomial_ozs_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
```
**Model 2: Hard sum-to-zero constraint**
This model is in file [binomial_4_preds_hard.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4_preds_hard.stan).
*Compile and fit the model to data.*
```{python}
#| output: false
binomial_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard.stan'))
print("\nModel 2: hard sum-to-zero constraint, small dataset")
binomial_hard_fit = binomial_hard_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("Model 2: hard sum-to-zero constraint, large dataset")
binomial_hard_fit_lg = binomial_hard_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
```
**Model 3: soft sum-to-zero constraint**
This model is in file [binomial_4_preds_soft.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4_preds_soft.stan).
*Compile and fit the model to data.*
```{python}
#| output: false
binomial_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_soft.stan'))
print("\nModel 3: soft sum-to-zero constraint, small dataset")
binomial_soft_fit = binomial_soft_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("Model 3: soft sum-to-zero constraint, large dataset")
binomial_soft_fit_lg = binomial_soft_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)
```
### Model Checking and Comparison
#### Check implementations
If all three implementations are correct, we expect that they will produce the same
estimates for all parameters. As always, we check the R-hat values and effective sample sizes.
To do this, we tabulate the summary statistics from each of the three models, fit to the
large and small datasets.
*Tabulate summary statistics from all fits.*
```{python}
# small dataset
ozs_fit_summary = binomial_ozs_fit.summary(sig_figs=2)
ozs_fit_summary.index = ozs_fit_summary.index.astype(str) + " - ozs"
hard_fit_summary = binomial_hard_fit.summary(sig_figs=2)
hard_fit_summary.index = hard_fit_summary.index.astype(str) + " - hard"
soft_fit_summary = binomial_soft_fit.summary(sig_figs=2)
soft_fit_summary.index = soft_fit_summary.index.astype(str) + " - soft"
small_data_fits_summary = pd.concat([ozs_fit_summary, hard_fit_summary, soft_fit_summary])
# large dataset
ozs_fit_lg_summary = binomial_ozs_fit_lg.summary(sig_figs=2)
ozs_fit_lg_summary.index = ozs_fit_lg_summary.index.astype(str) + " - ozs"
hard_fit_lg_summary = binomial_hard_fit_lg.summary(sig_figs=2)
hard_fit_lg_summary.index = hard_fit_lg_summary.index.astype(str) + " - hard"
soft_fit_lg_summary = binomial_soft_fit_lg.summary(sig_figs=2)
soft_fit_lg_summary.index = soft_fit_lg_summary.index.astype(str) + " - soft"
large_data_fits_summary = pd.concat([ozs_fit_lg_summary, hard_fit_lg_summary, soft_fit_lg_summary])
```
**Eth**
*Fitted estimates*
```{python}
beta_eth_summary = summarize_predictor(small_data_fits_summary, 'beta_eth\[')
beta_eth_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_eth\[')
labels = expand_true(data_small['beta_eth'], "Datagen<br>values")
small_html = style_dataframe(beta_eth_summary, 3).to_html()
large_html = style_dataframe(beta_eth_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)
sigma_eth_summary = summarize_predictor(small_data_fits_summary, 'sigma_eth')
sigma_eth_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_eth')
small_html = style_dataframe(sigma_eth_summary, 4).to_html()
large_html = style_dataframe(sigma_eth_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
```
**Edu**
*Fitted estimates*
```{python}
beta_edu_summary = summarize_predictor(small_data_fits_summary, 'beta_edu\[')
beta_edu_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_edu\[')
labels = expand_true(data_small['beta_edu'], "Datagen<br>values")
small_html = style_dataframe(beta_edu_summary, 3).to_html()
large_html = style_dataframe(beta_edu_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)
sigma_edu_summary = summarize_predictor(small_data_fits_summary, 'sigma_edu')
sigma_edu_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_edu')
small_html = style_dataframe(sigma_edu_summary, 4).to_html()
large_html = style_dataframe(sigma_edu_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
```
**Age**
*Fitted estimates*
```{python}
beta_age_summary = summarize_predictor(small_data_fits_summary, 'beta_age\[')
beta_age_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_age\[')
labels = expand_true(data_small['beta_age'], "Datagen<br>values")
small_html = style_dataframe(beta_age_summary, 3).to_html()
large_html = style_dataframe(beta_age_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)
sigma_age_summary = summarize_predictor(small_data_fits_summary, 'sigma_age')
sigma_age_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_age')
small_html = style_dataframe(sigma_age_summary, 4).to_html()
large_html = style_dataframe(sigma_age_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
```
These comparisons demonstrate that
* All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.
* The models recover the sign of the parameter, but not the exact value.
In the case of lots of observations and only a few categories they do a better job of recovering the true parameters.
* In almost all cases, estimates for each parameter are the same across implementations to 2 significant figures.
In a few cases they are off by 0.01; where they are off, the percentage of observations for that parameter is correspondingly low.
* The `sum_to_zero_vector` implementation has the highest effective sample size per second,
excepting a few individual parameters for which the hard sum-to-zero performs equally well.
#### Sampling efficiency
Sampling efficiency is often measured in iterations per second; however, because draws from an MCMC sampler
may be correlated, we instead compute the effective sample size across all chains
divided by the total sampling time: *ESS_bulk/s*, the effective sample size per second.
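To make the correlation adjustment concrete, here is a minimal sketch of the idea for a single AR(1) chain. (Stan's actual bulk-ESS estimator uses rank normalization and autocorrelations pooled across multiple chains, so this is illustrative only.)

```python
import numpy as np

def ess_ar1(chain):
    """Crude effective sample size for an AR(1) chain:
    ESS ~ n * (1 - rho) / (1 + rho), with rho the lag-1 autocorrelation."""
    x = chain - chain.mean()
    rho = (x[:-1] @ x[1:]) / (x @ x)
    return len(chain) * (1 - rho) / (1 + rho)

rng = np.random.default_rng(2)
n, rho_true = 50_000, 0.8
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):  # AR(1): each draw correlated with the previous one
    chain[t] = rho_true * chain[t - 1] + rng.normal()

ess = ess_ar1(chain)  # ~ n / 9 for rho = 0.8; divide by wall time for ESS/s
```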
The following table shows the average runtime for 100 runs
of each of the three models on large and small datasets.
This data was generated by script `eval_efficiencies.py`.
*Tabulate average runtimes and sampling efficiencies.*
```{python}
with open(os.path.join('binomial_runtimes_small.json'), 'r') as file:
data = json.load(file)
df_small = pd.DataFrame(data)
with open(os.path.join('binomial_runtimes_large.json'), 'r') as file:
data = json.load(file)
df_large = pd.DataFrame(data)
small_html = style_dataframe(df_small, 3).to_html()
large_html = style_dataframe(df_large, 3).to_html()
display_side_by_side(small_html, large_html)
```
The `sum_to_zero_vector` is far more efficient than either the hard or soft sum-to-zero constraint.
#### Prior predictive checks
[Prior predictive checks](https://mc-stan.org/docs/stan-users-guide/posterior-predictive-checks.html#prior-predictive-checks)
simulate data directly from the prior, in the absence of any observed data.
Here we use the simulated dataset to compare the prior marginal variances
of the elements of the sum-to-zero constrained parameters under the hard sum-to-zero constraint
and under the built-in `sum_to_zero_vector` transform.
We have written two models: [binomial_4preds_ozs_ppc.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4preds_ozs_ppc.stan)
and [binomial_4preds_hard_ppc.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4preds_hard_ppc.stan).
We generate a sample from each, and compare the marginal variances of the elements of sum-to-zero constrained parameter `beta_age`.
**Marginal variances of the hard sum-to-zero constraint, sample from the prior**
*Generate sample from the prior of the hard sum-to-zero model.*
```{python}
#| output: false
binomial_hard_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard_ppc.stan'))
binomial_hard_ppc_fit = binomial_hard_ppc_mod.sample(data=data_small, parallel_chains=4)
```
*Get marginal variances of elements of parameter `beta_age` from the prior.*
```{python}
np.var(binomial_hard_ppc_fit.beta_age, axis=0)
```
The above summaries show how the hard sum-to-zero constraint inflates the variance of the $N^{th}$ element.
This is only a problem for very sparse datasets, where the prior swamps the data;
given enough data, the problem goes away. To see this, we fit the hard sum-to-zero model
to the tiny and small datasets.
**Marginal variances of the hard sum-to-zero constraint, tiny dataset**
*Fit hard sum-to-zero model to the tiny dataset.*
```{python}
#| output: false
binomial_hard_fit_tiny = binomial_hard_mod.sample(data=data_tiny, parallel_chains=4)
```
*Get marginal variances of elements of parameter `beta_age` from the posterior.*
```{python}
np.var(binomial_hard_fit_tiny.beta_age, axis=0)
```
**Marginal variances of the hard sum-to-zero constraint, small dataset**
*Fit hard sum-to-zero model to the small dataset.*
```{python}
#| output: false
binomial_hard_fit_small = binomial_hard_mod.sample(data=data_small, parallel_chains=4)
```
*Get marginal variances of elements of parameter `beta_age` from the posterior.*
```{python}
np.var(binomial_hard_fit_small.beta_age, axis=0)
```
The marginal variances for the `sum_to_zero_vector` elements are roughly equal in both the prior
and the posterior.
**Marginal variances of the built-in `sum_to_zero_vector`, sample from the prior**
*Generate sample from the prior of the `sum_to_zero_vector` model.*
```{python}
#| output: false
binomial_ozs_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs_ppc.stan'))
binomial_ozs_ppc_fit = binomial_ozs_ppc_mod.sample(data=data_small, parallel_chains=4)
```
*Get marginal variances of elements of parameter `beta_age` from the prior.*
```{python}
np.var(binomial_ozs_ppc_fit.beta_age, axis=0)
```
**Marginal variances of the built-in `sum_to_zero_vector`, tiny dataset**
*Fit the `sum_to_zero_vector` model to the tiny dataset.*
```{python}
#| output: false
binomial_ozs_fit_tiny = binomial_ozs_mod.sample(data=data_tiny, parallel_chains=4)
```
*Get marginal variances of elements of parameter `beta_age` from the posterior.*
```{python}
np.var(binomial_ozs_fit_tiny.beta_age, axis=0)
```
### Discussion
Comparison of these three models with different data regimes demonstrates the following.
* For a multi-level model with group-level categorical predictors the `sum_to_zero_vector` provides fast results and good effective sample sizes for both datasets.
* Model [binomial_4preds_ozs.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/binomial_4preds_ozs.stan).
shows how to properly scale the variance of a `sum_to_zero_vector` constrained parameter in order to put a standard normal prior on it.
* Prior predictive checks demonstrate the difference between the marginal variances of
the `sum_to_zero_vector` and hard sum-to-zero implementations.
## Spatial Models with an ICAR component
Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics.
Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models,
first introduced by Besag, 1974, account for this by pooling information from neighboring regions.
The BYM model (Besag, York, Mollié, 1991) extends a lognormal Poisson model
with an ICAR component for spatial auto-correlation and an ordinary
random-effects component for non-spatial heterogeneity.
The BYM2 model builds on this model and subsequent refinements.
The ICAR, BYM2, and BYM2_multicomp models are more fully explained in a series of notebooks
available from GitHub repo: [https://github.com/mitzimorris/geomed_2024](https://github.com/mitzimorris/geomed_2024), see notebooks:
* [The ICAR model in Stan](https://github.com/mitzimorris/geomed_2024/blob/main/python-notebooks/h4_icar.ipynb)
* [The BYM2 model in Stan](https://github.com/mitzimorris/geomed_2024/blob/main/python-notebooks/h5_bym2.ipynb)
* [The BYM2_multicomp model in Stan](https://github.com/mitzimorris/geomed_2024/blob/main/python-notebooks/h6_bym2_multicomp.ipynb)
### ICAR Models
The key element of the BYM2 model is the ICAR component.
Its conditional specification is a
multivariate normal random vector $\mathbf{\phi}$
where each ${\phi}_i$ is conditioned on the values of its neighbors.
The joint specification rewrites to a _pairwise difference_ form:
$$ p(\phi) \propto \exp \left\{ {- \frac{1}{2}} \sum_{i \sim j}{({\phi}_i - {\phi}_j)}^2 \right\} $$
Each term ${({\phi}_i - {\phi}_j)}^2$ contributes a
penalty based on the distance between the values of neighboring regions.
However, $\phi$ is non-identifiable: a constant added to every element of $\phi$ washes out of the differences ${\phi}_i - {\phi}_j$.
Therefore, **a sum-to-zero constraint is needed both to identify and to center $\phi$.**
The Stan implementation of the ICAR component computes the sum of the pairwise differences
by representing the spatial adjacency matrix as an array of pairs of neighbor indices.
```stan
data {
  ...
  // spatial structure
  int<lower = 0> N_edges;  // number of neighbor pairs
  array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
}
```
The ICAR prior comes into the model as parameter `phi`.
```stan
model {
  ...
  target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR prior
}
```
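Both the pairwise-difference computation and the identifiability problem are easy to see in a toy example (NumPy sketch mirroring the Stan statement above; the graph and values are made up):

```python
import numpy as np

# Toy 4-node path graph with edges (1,2), (2,3), (3,4), stored columnwise
# like Stan's `neighbors` array (0-based indices here).
neighbors = np.array([[0, 1, 2],
                      [1, 2, 3]])
phi = np.array([0.5, -0.2, 0.1, -0.4])  # sums to zero

diffs = phi[neighbors[0]] - phi[neighbors[1]]
icar_lp = -0.5 * np.dot(diffs, diffs)   # mirrors -0.5 * dot_self(...)

# Non-identifiability: shifting phi by any constant leaves the density unchanged.
shifted = phi + 10.0
diffs_shifted = shifted[neighbors[0]] - shifted[neighbors[1]]
assert np.isclose(icar_lp, -0.5 * np.dot(diffs_shifted, diffs_shifted))
```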
The ICAR model requires that the neighbor graph be fully connected, for two reasons:
* The joint distribution is computed from the pairwise differences between a node and its neighbors;
singleton nodes have no neighbors and are therefore undefined.
* Even if the graph doesn't have any singleton nodes, when the graph has multiple connected components
a sum-to-zero constraint on the entire vector fails to properly identify the model.
Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph.
In order to apply the BYM2 model to the full NYC dataset, it is necessary to
extend the BYM2 model to account for disconnected components and singleton nodes.
First we use the BYM2 model for fully connected graphs
to compare implementations of the sum-to-zero vector on `phi`,
and then we extend the BYM2 model following the INLA implementation
developed by Freni-Sterrantino et al. in 2018 and presented in
[A note on intrinsic Conditional Autoregressive models for disconnected graphs](https://arxiv.org/abs/1705.04854).
The dataset we use is the one from the analysis published in 2019 as
[Bayesian Hierarchical Spatial Models: Implementing the Besag York Mollié Model in Stan](https://www.sciencedirect.com/science/article/pii/S1877584518301175).
The data consist of motor vehicle collisions in New York City,
as recorded by the NYC Department of Transportation between the years 2005-2014,
restricted to collisions involving school-age children (5-18 years of age) as pedestrians.
Each crash was localized to the US Census tract in which it occurred, using boundaries from the 2010 United States Census,
using the [2010 Census block map for New York City](https://data.cityofnewyork.us/City-Government/2010-Census-Blocks/v2h8-6mxf).
New York City comprises several large and small islands, a chunk of the mainland (the Bronx), and the western end of Long Island (Brooklyn and Queens).
To analyze the full New York City dataset requires the extended BYM2 model, which we do in the second part of this section.
### Model 1: The BYM2 model, Riebler et al. 2016
We compare three implementations of the BYM2 model which differ only in their implementation of `phi`,
which is constrained to sum to zero.
**BYM2 v1: `sum_to_zero_vector`**
Parameter `phi` is declared as a `sum_to_zero_vector`.
See file [bym2_ozs.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/bym2_ozs.stan).
**BYM2 v2: Hard sum-to-zero constraint**
Parameter `phi_raw` is a vector of size $N-1$, and the length-$N$ vector `phi` is computed in the `transformed parameters` block.
See file [bym2_hard.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/bym2_hard.stan).
**BYM2 v3: Soft sum-to-zero constraint**
Parameter `phi` is a vector of size N and its sum is soft-centered on $0$.
See file [bym2_soft.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/bym2_soft.stan).
#### Data assembly
The inputs to the BYM2 model are
* The Poisson regression data
+ `int<lower=0> N` - number of regions
+ `array[N] int<lower=0> y` - per-region count outcome
+ `vector<lower=0>[N] E` - the population of each region (a.k.a. "exposure"),
+ `int<lower=1> K` - the number of predictors
+ `matrix[N, K] xs` - the design matrix
* The spatial structure
+ `int<lower = 0> N_edges` - the number of neighbor pairs
+ `array[2, N_edges] int<lower = 1, upper = N> neighbors` - the graph structure
+ `real tau` - the scaling factor, introduced in the BYM2
::: {.callout-note}
The scaling factor `tau` was introduced by Riebler et al. so that the
variances of the spatial and ordinary random effects are both approximately equal to 1,
thus allowing for a straightforward estimate of the amount of spatial and non-spatial variance.
We have written a helper function called `get_scaling_factor`, in file
[utils_bym2.py](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/utils_bym2.py),
which takes the neighbor graph as its argument and computes the geometric mean of the
marginal variances implied by the corresponding adjacency structure.
:::
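A minimal sketch of how such a scaling factor can be computed (the 4-node chain graph is made up, and the actual `get_scaling_factor` helper may differ in its details): build the ICAR precision matrix from the adjacency matrix, take its generalized inverse, and compute the geometric mean of the resulting marginal variances.

```python
import numpy as np

# Hypothetical 4-node chain graph, encoded as an adjacency matrix
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

Q = np.diag(A.sum(axis=1)) - A   # ICAR precision: degree matrix minus adjacency
Q_pinv = np.linalg.pinv(Q)       # generalized inverse (Q is rank-deficient)
marginal_var = np.diag(Q_pinv)   # per-node marginal variances

# Scaling factor: geometric mean of the marginal variances, so that the
# scaled ICAR effect has variance approximately 1.
tau = np.exp(np.mean(np.log(marginal_var)))
print(tau > 0)  # True
```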
To assemble the spatial data from the US Census data maps we use a set of utilities
described more fully in notebooks 2 and 5 of the [geomed_2024 Spatial Data Processing in Stan workshop](https://github.com/mitzimorris/geomed_2024).
The neighbors graph of New York City consists of several large connected components,
notably Brooklyn and Queens, the Bronx, Manhattan, and Staten Island, and a few other
smaller components and singletons.
The helper function `nyc_sort_by_comp_size` adds component info to the geodataframe
and sorts the data by component size.
*Get NYC dataset, pre-process as needed for BYM2 model, display neighbors graph.*
```{python}
from utils_nyc_map import nyc_sort_by_comp_size
nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
from splot.libpysal import plot_spatial_weights
print("New York City neighbors graph")
plot_spatial_weights(nyc_nbs, nyc_gdf)
```
*The BYM2 model requires that the neighbors graph be fully connected -
use the largest connected component, Brooklyn and Queens.*
```{python}
from libpysal.weights import Queen
brklyn_qns_gdf = nyc_gdf[nyc_gdf['comp_id']==0].reset_index(drop=True)
brklyn_qns_nbs = Queen.from_dataframe(brklyn_qns_gdf , geom_col='geometry')
from utils_bym2 import get_scaling_factor, nbs_to_adjlist
brklyn_qns_nbs_adj = nbs_to_adjlist(brklyn_qns_nbs)
tau = get_scaling_factor(brklyn_qns_nbs)
```
*Put all NYC predictors on the same scale - log transform columns as needed.*
```{python}
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = brklyn_qns_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])
```
*Assemble all inputs into dictionary `brklyn_qns_data`.*
```{python}
brklyn_qns_data = {"N":brklyn_qns_gdf.shape[0],
"y":brklyn_qns_gdf['count'].astype('int'),
"E":brklyn_qns_gdf['kid_pop'].astype('int'),
"K":4,
"xs":design_mat,
"N_edges": brklyn_qns_nbs_adj.shape[1],
"neighbors": brklyn_qns_nbs_adj,
"tau": tau
}
```
#### Model fitting
The BYM2 model requires many warmup iterations for MCMC to converge for all parameters,
including the hyperparameters `rho` and `sigma`.
We run all three models using the same seed, in order to make the initial parameters as similar
as possible.
*Compile and fit all three models to the Brooklyn-Queens data.*
```{python}
#| output: false
bym2_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_ozs.stan'))
brklyn_qns_ozs_fit = bym2_ozs_mod.sample(data=brklyn_qns_data, iter_warmup=7000)
a_seed = brklyn_qns_ozs_fit.metadata.cmdstan_config['seed']
bym2_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_hard.stan'))
brklyn_qns_hard_fit = bym2_hard_mod.sample(data=brklyn_qns_data, iter_warmup=7000, seed=a_seed)
bym2_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_soft.stan'))
brklyn_qns_soft_fit = bym2_soft_mod.sample(data=brklyn_qns_data, iter_warmup=7000, seed=a_seed)
```
#### Check implementations
If all three implementations are correct, we expect that they will produce the same
estimates for all parameters. As always, we check the R-hat values and effective sample sizes.
To do this, we tabulate the summary statistics from each of the three models
for the hierarchical parameters `beta`, `sigma`, and `rho`.
*Tabulate summary statistics from all fits.*
```{python}
brklyn_qns_ozs_summary = brklyn_qns_ozs_fit.summary()
brklyn_qns_ozs_summary.index = brklyn_qns_ozs_summary.index.astype(str) + " - ozs"
brklyn_qns_hard_summary = brklyn_qns_hard_fit.summary()
brklyn_qns_hard_summary.index = brklyn_qns_hard_summary.index.astype(str) + " - hard"
brklyn_qns_soft_summary = brklyn_qns_soft_fit.summary()
brklyn_qns_soft_summary.index = brklyn_qns_soft_summary.index.astype(str) + " - soft"
brklyn_qns_fits_summary = pd.concat([brklyn_qns_ozs_summary, brklyn_qns_hard_summary, brklyn_qns_soft_summary])
intercept_summary = summarize_predictor(brklyn_qns_fits_summary, 'beta_intercept')
betas_summary = summarize_predictor(brklyn_qns_fits_summary, 'betas')
sigma_summary = summarize_predictor(brklyn_qns_fits_summary, 'sigma')
rho_summary = summarize_predictor(brklyn_qns_fits_summary, 'rho')
brklyn_qns_summary = pd.concat([intercept_summary, betas_summary, sigma_summary, rho_summary])
display(HTML(style_dataframe(brklyn_qns_summary, 3).to_html()))
```
The `sum_to_zero_vector` implementation is the most efficient.
In contrast to the outcomes for the binomial model above,
the soft sum-to-zero constraint is more efficient than the hard sum-to-zero constraint.
### Model 2: The BYM2_multicomp model, Freni-Sterrantino et al., 2018
In order to apply the BYM2 model to the full NYC dataset, it is necessary to
extend the BYM2 model to account for disconnected components and singleton nodes.
(The alternative is to add edges to the neighbors graph so that it forms
a single connected component. This is problematic; see Notebook 6,
[The BYM2_multicomp model in Stan](https://github.com/mitzimorris/geomed_2024/blob/main/python-notebooks/h6_bym2_multicomp.ipynb), for details.)
In [A note on intrinsic Conditional Autoregressive models for disconnected graphs](https://arxiv.org/abs/1705.04854) (2018),
Freni-Sterrantino et al. show how to extend the BYM2 model to handle disconnected components and singleton nodes.
They provide the following recommendations:
* Non-singleton nodes are given the BYM2 prior
* Singleton nodes (islands) are given a standard Normal prior
* Compute a scaling factor for each connected component
* **Impose a sum-to-zero constraint on each connected component**
We have followed these recommendations and implemented this model in Stan.
The full model is in file
[bym2_multicomp.stan](https://github.com/stan-dev/example-models/tree/master/jupyter/sum-to-zero/stan/bym2_multicomp.stan).
For this case study, we compare two implementations of the BYM2_multicomp model:
one which uses the `sum_to_zero_vector` and one which implements the soft sum-to-zero constraint.
It is necessary to constrain the elements of the spatial effects vector `phi` on a component-by-component basis.
Stan's [slicing with range indexes](https://mc-stan.org/docs/stan-users-guide/multi-indexing.html#slicing-with-range-indexes)
provides a way to efficiently access each component.
The helper function `nyc_sort_by_comp_size` both sorts the study data by component and adds the component index to the geodataframe.
In the BYM2 model for a fully connected graph the sum-to-zero constraint on `phi`
is implemented directly by declaring `phi` to be a `sum_to_zero_vector`, which is a
[constrained parameter type](https://mc-stan.org/docs/reference-manual/transforms.html#variable-transforms.chapter).
The declaration:
```stan
sum_to_zero_vector[N] phi; // spatial effects
```
creates a *constrained* variable of length $N$, with a corresponding unconstrained variable of length $N-1$.
In order to constrain slices of the parameter vector `phi`, we do the following:
* In the `parameters` block, we declare the *unconstrained* parameter `phi_raw` as a regular `vector` (instead of a `sum_to_zero_vector`).
+ For a fully connected graph of size $N$, the size of the unconstrained sum-to-zero vector is $N-1$.
For a disconnected graph with $M$ non-singleton nodes, the size of `phi_raw` is $M$ minus the
number of connected components.
```stan
vector[N_connected - N_components] phi_raw; // spatial effects
```
* In the `functions` block, we implement the constraining transform.
* In the `transformed parameters` block, we apply the constraining transform.
```stan
vector[N_connected] phi = zero_sum_components(phi_raw, component_idxs, component_sizes);
```
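The sizing arithmetic can be illustrated with made-up component sizes (these numbers are hypothetical, not the actual NYC component sizes): one degree of freedom is lost per connected component.

```python
# Hypothetical component sizes for a disconnected graph (singletons excluded):
component_sizes = [1800, 300, 50]

N_connected = sum(component_sizes)    # M = 2150 non-singleton nodes
N_components = len(component_sizes)   # 3 connected components

# One degree of freedom is lost per component, so phi_raw has size M - C.
print(N_connected - N_components)  # 2147
```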
The constraining transform is broken into two functions:
* function `zero_sum_constrain`, the actual constraining transform, which corresponds directly
to the built-in `sum_to_zero_vector` transform.
* function `zero_sum_components`, which handles the slicing and calls `zero_sum_constrain` on each component.
```stan
/**
* Constrain sum-to-zero vector
*
* @param y unconstrained zero-sum parameters
* @return vector z, the constrained vector, which sums to zero
*/
vector zero_sum_constrain(vector y) {
int N = num_elements(y);
vector[N + 1] z = zeros_vector(N + 1);
real sum_w = 0;
for (ii in 1:N) {
int i = N - ii + 1;
real n = i;
real w = y[i] * inv_sqrt(n * (n + 1));
sum_w += w;
z[i] += sum_w;
z[i + 1] -= w * n;
}
return z;
}
```
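As a check, here is a line-by-line NumPy port of the Stan function above (for illustration only, not part of the model): the constrained output vector is one element longer than the input and sums to zero.

```python
import numpy as np

def zero_sum_constrain(y):
    """NumPy port of the Stan zero_sum_constrain function (for checking only)."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for i in range(N, 0, -1):        # Stan's 1-based loop runs i = N down to 1
        w = y[i - 1] / np.sqrt(i * (i + 1.0))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * i
    return z

z = zero_sum_constrain(np.array([0.7, -1.2, 0.4]))
print(np.isclose(z.sum(), 0.0))  # True: the length-4 constrained vector sums to zero
```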
* `zero_sum_components`: slices the unconstrained vector by component and applies the constraining transform to each slice.
```stan
/**
* Component-wise constrain sum-to-zero vectors
*
* @param phi unconstrained vector of zero-sum slices
* @param idxs component start and end indices
* @param sizes component sizes
* @return vector phi_ozs, the vector whose slices sum to zero
*/
vector zero_sum_components(vector phi,
array[ , ] int idxs,
array[] int sizes) {
vector[sum(sizes)] phi_ozs;
int idx_phi = 1;
int idx_ozs = 1;
for (i in 1:size(sizes)) {
phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1] =
zero_sum_constrain(segment(phi, idx_phi, sizes[i] - 1));
idx_phi += sizes[i] - 1;
idx_ozs += sizes[i];
}
return phi_ozs;
}
```
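The slicing bookkeeping can likewise be checked in NumPy (the component sizes below are made up): each component of size `s` consumes `s - 1` unconstrained values and yields a size-`s` slice that sums to zero.

```python
import numpy as np

def zero_sum_constrain(y):
    """NumPy port of the Stan zero_sum_constrain function (for checking only)."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for i in range(N, 0, -1):
        w = y[i - 1] / np.sqrt(i * (i + 1.0))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * i
    return z

def zero_sum_components(phi_raw, sizes):
    """Slice phi_raw by component and constrain each slice, as in the Stan code."""
    out, idx = [], 0
    for s in sizes:
        out.append(zero_sum_constrain(phi_raw[idx : idx + s - 1]))
        idx += s - 1
    return np.concatenate(out)

sizes = [4, 3, 5]  # hypothetical component sizes
rng = np.random.default_rng(1)
phi_raw = rng.normal(size=sum(sizes) - len(sizes))
phi = zero_sum_components(phi_raw, sizes)

start = 0
for s in sizes:
    print(np.isclose(phi[start : start + s].sum(), 0.0))  # True for each component
    start += s
```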
::: {.callout-note}
The constraining transform is linear, so its Jacobian determinant is constant
and need not be included.
As of Stan 2.36, transforms which include a Jacobian adjustment can do so with the
`jacobian +=` statement and must have names ending in `_jacobian`.
See section [functions implementing change-of-variable adjustments](https://mc-stan.org/docs/stan-users-guide/user-functions.html#functions-implementing-change-of-variable-adjustments) in the Stan User's Guide
chapter on user-defined functions for details.
:::
#### Data Assembly
*Get NYC dataset, pre-process as needed for BYM2_multicomp model*
```{python}
from utils_nyc_map import nyc_sort_by_comp_size
from utils_bym2 import nbs_to_adjlist, get_scaling_factors
nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)
component_sizes = [x for x in nyc_comp_sizes if x > 1]
scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)
```
*Put all NYC predictors on the same scale - log transform columns as needed.*
```{python}
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = nyc_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])
```
*Assemble all inputs into dictionary `bym2_multicomp_data`.*
```{python}
bym2_multicomp_data = {
"N":nyc_gdf.shape[0],
"y":nyc_gdf['count'].astype('int'),
"E":nyc_gdf['kid_pop'].astype('int'),
"K":4,
"xs":design_mat,
"N_edges": nyc_nbs_adj.shape[1],
"neighbors": nyc_nbs_adj,
"N_components": len(component_sizes),
"component_sizes": component_sizes,
"scaling_factors": scaling_factors
}
```