|
39 | 39 | { |
40 | 40 | "cell_type": "code", |
41 | 41 | "execution_count": null, |
42 | | - "id": "d03ed3eb", |
| 42 | + "id": "23be27b5", |
43 | 43 | "metadata": {}, |
44 | 44 | "outputs": [], |
45 | 45 | "source": [ |
|
67 | 67 | }, |
68 | 68 | { |
69 | 69 | "cell_type": "markdown", |
70 | | - "id": "74391bea", |
| 70 | + "id": "c21c80d9", |
71 | 71 | "metadata": {}, |
72 | 72 | "source": [ |
73 | 73 | "The radon data may sometimes trigger spurious warnings from plotnine and CmdStanPy, which we will ignore. During model development we recommend leaving CmdStanPy logging level at `logging.WARNING`." |
|
299 | 299 | { |
300 | 300 | "cell_type": "code", |
301 | 301 | "execution_count": null, |
302 | | - "id": "c08867d5", |
| 302 | + "id": "fccb5170", |
303 | 303 | "metadata": {}, |
304 | 304 | "outputs": [], |
305 | 305 | "source": [ |
|
314 | 314 | { |
315 | 315 | "cell_type": "code", |
316 | 316 | "execution_count": null, |
317 | | - "id": "66961dbc", |
| 317 | + "id": "546c552c", |
318 | 318 | "metadata": {}, |
319 | 319 | "outputs": [], |
320 | 320 | "source": [ |
|
420 | 420 | { |
421 | 421 | "cell_type": "code", |
422 | 422 | "execution_count": null, |
423 | | - "id": "67bc90a9", |
| 423 | + "id": "afb04f4d", |
424 | 424 | "metadata": {}, |
425 | 425 | "outputs": [], |
426 | 426 | "source": [ |
|
510 | 510 | "source": [ |
511 | 511 | "## Best Practice #2: start with a simple model\n", |
512 | 512 | "\n", |
513 | | - "\n", |
514 | | - "We can use a simple linear regression model the relationship between\n", |
| 513 | + "Starting from a simple model ensures that there is a good baseline\n", |
| 514 | + "against which to measure performance.\n", |
| 515 | + "Gelman and Hill start from a simple linear regression which models the relationship between\n", |
515 | 516 | "the log radon measurement and the floor on which this measurement was taken.\n", |
516 | | - "This is the **complete pooling** model because it pools all measurements from all counties.\n", |
517 | 517 | "\n", |
| 518 | + "The **complete pooling** model pools all measurements from all counties.\n", |
518 | 519 | "While it's possible to fit this model in Stan, we can also do this in plotnine by adding\n", |
519 | 520 | "[plotnine.geom.geom_smooth(method=\"lm\")](https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_smooth.html#plotnine.geoms.geom_smooth)\n", |
520 | 521 | "to the earlier scatter plot of radon measurements by floor.\n", |
|
535 | 536 | }, |
536 | 537 | { |
537 | 538 | "cell_type": "markdown", |
538 | | - "id": "2b5e6077", |
| 539 | + "id": "58c552b0", |
539 | 540 | "metadata": {}, |
540 | 541 | "source": [ |
541 | 542 | "If we add faceting to this plot, we are fitting 85 individual regressions. Given the sparsity of the data, this is a non-starter. There are 25 counties where all measurements were taken on the basement level; for these counties, the trend line is absent. For the remaining counties, most of which have very few measurments, in several instances, e.g., counties \"Lyon\" and \"Redwood\", the regression line between floor 0 and 1 has a positvie slope; in contrast to the complete pooling model. This goes against what we know about how radon enters the home.\n", |
|
545 | 546 | { |
546 | 547 | "cell_type": "code", |
547 | 548 | "execution_count": null, |
548 | | - "id": "d18ffec9", |
| 549 | + "id": "6f3c5760", |
549 | 550 | "metadata": {}, |
550 | 551 | "outputs": [], |
551 | 552 | "source": [ |
|
622 | 623 | }, |
623 | 624 | { |
624 | 625 | "cell_type": "markdown", |
625 | | - "id": "b1977e2e", |
| 626 | + "id": "a754da06", |
626 | 627 | "metadata": {}, |
627 | 628 | "source": [ |
628 | 629 | "The CmdStanMCMC object provides accessor methods which return the set of draws for each model variable.\n", |
|
640 | 641 | "[stan_variables](https://mc-stan.org/cmdstanpy/api.html#cmdstanpy.CmdStanMCMC.stan_variables)\n", |
641 | 642 | "returns a Python dict from variable names to numpy.ndarrays.\n", |
642 | 643 | "\n", |
643 | | - "Because we are going to be plotting the results, we use the `draws_pd` method." |
| 644 | + "Because we are going to be plotting the results with plotnine, we use the `draws_pd` method to extract the model parameters `alpha`, `beta`, and `sigma`." |
644 | 645 | ] |
645 | 646 | }, |
646 | 647 | { |
647 | 648 | "cell_type": "code", |
648 | 649 | "execution_count": null, |
649 | | - "id": "9e4c872c", |
| 650 | + "id": "3781d33a", |
650 | 651 | "metadata": {}, |
651 | 652 | "outputs": [], |
652 | 653 | "source": [ |
|
703 | 704 | { |
704 | 705 | "cell_type": "code", |
705 | 706 | "execution_count": null, |
706 | | - "id": "7d0c83a6", |
| 707 | + "id": "1e0037a4", |
707 | 708 | "metadata": {}, |
708 | 709 | "outputs": [], |
709 | 710 | "source": [ |
|
715 | 716 | { |
716 | 717 | "cell_type": "code", |
717 | 718 | "execution_count": null, |
718 | | - "id": "8350618b", |
| 719 | + "id": "22aafd3e", |
719 | 720 | "metadata": {}, |
720 | 721 | "outputs": [], |
721 | 722 | "source": [ |
|
728 | 729 | { |
729 | 730 | "cell_type": "code", |
730 | 731 | "execution_count": null, |
731 | | - "id": "74999c7a", |
| 732 | + "id": "5ffc0a68", |
732 | 733 | "metadata": {}, |
733 | 734 | "outputs": [], |
734 | 735 | "source": [ |
|
740 | 741 | { |
741 | 742 | "cell_type": "code", |
742 | 743 | "execution_count": null, |
743 | | - "id": "e99d5185", |
| 744 | + "id": "4e5314d4", |
744 | 745 | "metadata": {}, |
745 | 746 | "outputs": [], |
746 | 747 | "source": [ |
|
758 | 759 | { |
759 | 760 | "cell_type": "code", |
760 | 761 | "execution_count": null, |
761 | | - "id": "fb9ac61e", |
| 762 | + "id": "114086ca", |
762 | 763 | "metadata": {}, |
763 | 764 | "outputs": [], |
764 | 765 | "source": [ |
|
776 | 777 | " alpha=0.5,\n", |
777 | 778 | " )\n", |
778 | 779 | " + geom_point(no_pool_alpha_pd, aes(x='county', y='mean'))\n", |
779 | | - " + scale_x_discrete(limits=pop_desc['county']) \n", |
| 780 | + " + geom_hline(yintercept=pool_stats.alpha[1], color='darkorange', size=1.5)\n", |
| 781 | + " + scale_x_discrete(limits=pop_desc['county'])\n", |
| 782 | + " + ggtitle(\"No pooling model estimates for alpha (basement log_radon level)\")\n", |
| 783 | + " + ylab(\"central 67% interval\") + xlab(\"ordered by observations per county, desc\")\n", |
780 | 784 | " + flip_xlabels\n", |
781 | 785 | " + theme(figure_size=(12,4)) \n", |
782 | 786 | ")" |
783 | 787 | ] |
784 | 788 | }, |
| 789 | + { |
| 790 | + "cell_type": "markdown", |
| 791 | + "id": "b6808b62", |
| 792 | + "metadata": {}, |
| 793 | + "source": [ |
| 794 | + "The complete pooling model ignores the county level variance. The no-pooling model overstates the variance between counties." |
| 795 | + ] |
| 796 | + }, |
785 | 797 | { |
786 | 798 | "cell_type": "markdown", |
787 | 799 | "id": "d15f8442", |
|
1046 | 1058 | }, |
1047 | 1059 | { |
1048 | 1060 | "cell_type": "markdown", |
1049 | | - "id": "dca8dfc3", |
| 1061 | + "id": "2b3d68e6", |
1050 | 1062 | "metadata": {}, |
1051 | 1063 | "source": [ |
1052 | 1064 | "## Appendix C: Stan tips" |
1053 | 1065 | ] |
1054 | 1066 | }, |
1055 | 1067 | { |
1056 | 1068 | "cell_type": "markdown", |
1057 | | - "id": "a60524e1", |
| 1069 | + "id": "06387fc5", |
1058 | 1070 | "metadata": {}, |
1059 | 1071 | "source": [ |
1060 | 1072 | "* it uses the more efficient [`normal_id_glm` distribution](https://mc-stan.org/docs/functions-reference/normal-id-glm.html) instead of the `normal` distribution. The `normal_id_glm` distribution is optimized for simple linear regresion models and takes as arguments:\n", |
|
1069 | 1081 | { |
1070 | 1082 | "cell_type": "code", |
1071 | 1083 | "execution_count": null, |
1072 | | - "id": "add9533a", |
| 1084 | + "id": "a324f22c", |
1073 | 1085 | "metadata": {}, |
1074 | 1086 | "outputs": [], |
1075 | 1087 | "source": [] |
|
0 commit comments