Use touchstone by VisruthSK · Pull Request #352 · stan-dev/loo

VisruthSK · 2026-04-08T23:22:10Z

Use touchstone to evaluate PRs' impact on performance. Starting with a monolithic approach to testing, calling just loo().

Closes #348.

codecov-commenter · 2026-04-08T23:25:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.80%. Comparing base (60f012d) to head (c73be4d).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #352   +/-   ##
=======================================
  Coverage   92.80%   92.80%           
=======================================
  Files          31       31           
  Lines        3004     3004           
=======================================
  Hits         2788     2788           
  Misses        216      216

VisruthSK · 2026-04-09T06:05:09Z

Should lower reps to maybe 5, perhaps also use smaller data? The current tests are fully LLM generated, so more careful creation could maybe cut down on execution time while still being informative.

@jgabry any thoughts? The main file is touchstone/script.R. The code rn is ugly, I'll fix it tomorrow probably, but any thoughts on the actual tests? The run is taking 1hr, is that way too slow?

jgabry · 2026-04-09T19:36:54Z

If we switch to the example that @avehtari suggested, is that a bigger log lik matrix/array than what you have here or smaller?

VisruthSK · 2026-04-10T02:36:53Z

Wine one is a decent bit larger. How long do you figure is too long?

jgabry · 2026-04-10T19:14:37Z

Is there an easy way to tell it to skip the touchstone step? For important PRs it’s fine for it to be slow, even very slow, but for simple PRs that we know won’t affect runtime at all (or only trivially) it would be good to be able to skip it. I guess we could always just merge the PR without waiting for it to finish, but sometimes we’re not checking the CI results immediately, and then we’d end up wasting a bunch of computation, which I think might prevent other things in the Stan org from running.

avehtari · 2026-04-10T20:01:23Z

Storing that log_lik_matrix and using the stored matrix for loo takes a couple seconds. I don't think that's too slow.
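
A minimal sketch of the store-then-read approach described above, assuming the loo package is installed. The simulated matrix is only a stand-in for the wine example's log-likelihood, which is not included in this thread; the sizes, the temporary file path, and the four-chain layout are hypothetical choices for illustration.

```r
library(loo)

set.seed(1)
n_draws <- 1000  # posterior draws (rows); hypothetical size
n_obs <- 50      # observations (columns); hypothetical size

# Stand-in log-likelihood matrix (draws x observations) from a toy normal model.
y <- rnorm(n_obs)
mu <- rnorm(n_draws, mean = 0, sd = 0.1)
log_lik_matrix <- sapply(y, function(yi) dnorm(yi, mean = mu, sd = 1, log = TRUE))

# Store the matrix once; a benchmark script would then only read it back.
path <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(log_lik_matrix, path)

# Benchmark side: read the stored matrix and time only the loo() call.
stored <- readRDS(path)
chain_id <- rep(1:4, each = n_draws / 4)  # assumes 4 equal-length chains
r_eff <- relative_eff(exp(stored), chain_id = chain_id)
timing <- system.time(res <- loo(stored, r_eff = r_eff))
print(timing[["elapsed"]])
```

Keeping the model fit out of the timed code means the benchmark measures loo() itself rather than the cost of producing the log-likelihood.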

jgabry · 2026-04-10T20:14:17Z

I agree, that’s not too slow. Maybe I’m just confused. I thought that example was bigger than the one Visruth is currently using, and the current run takes 1hr. It runs things many times to compute the benchmarks, but running something that takes a few seconds many times should still take way less than an hour. So then why does the current example take 1hr? Sorry if I’m missing something simple here.

VisruthSK · 2026-04-10T20:17:31Z

I think the LLM code is doing the wrong thing--I'll run the wine thing and store the log lik (probably in the touchstone dir?) and just read from it and run loo. I think the LLMd code is just abysmal and so is too slow. Will try to get this in by Monday.

jgabry · 2026-04-10T21:22:24Z

> I think the LLM code is doing the wrong thing

It might be, I haven't had time to dig into it. Either way, thanks for setting this up. I took a quick look, and I do like that it's also testing the loo.function method in addition to the matrix and array methods. But yeah, let's use @avehtari's example.

avehtari · 2026-04-11T08:26:01Z

I tested the current benchmark code by running it manually on my laptop, and it takes less than 10s.

Barebones touchstone setup

ce9689c

VisruthSK mentioned this pull request Apr 8, 2026

use posterior::gpdfit and posterior::qgeneralized_pareto() #305

Open

jgabry mentioned this pull request Apr 9, 2026

Rely on posterior for pareto smooth tails #290

Open

VisruthSK added 2 commits April 8, 2026 18:14

Generated tests

faaca77

Fixed script calling

c73be4d

VisruthSK wants to merge 3 commits into master from use-touchstone

4 participants
