Changes from 46 commits (85 commits total)
36a875a
open router
bradleyshep Mar 23, 2026
3b88747
no guidelines variant, new workflows, results save updates
bradleyshep Mar 23, 2026
76016e7
new evals batch one
bradleyshep Mar 23, 2026
3bdecca
query evals
bradleyshep Mar 23, 2026
b3ce8f7
more evals + categories
bradleyshep Mar 23, 2026
52e28b9
fixes
bradleyshep Mar 23, 2026
617e052
fixes
bradleyshep Mar 23, 2026
6eb1168
fmt
bradleyshep Mar 23, 2026
b9a545f
llm benchmark site
bradleyshep Mar 23, 2026
afee2e0
Create ModelDetail.tsx
bradleyshep Mar 23, 2026
61d815e
site + details
bradleyshep Mar 23, 2026
e132ed8
benchmark site + run
bradleyshep Mar 24, 2026
56e693f
more evals + fixes
bradleyshep Mar 24, 2026
1216af6
fixes
bradleyshep Mar 25, 2026
ec966f9
refinements
bradleyshep Mar 26, 2026
00d6598
updates
bradleyshep Mar 26, 2026
850254e
updates; guidelines mode
bradleyshep Mar 27, 2026
4abe096
Create README.md
bradleyshep Mar 27, 2026
b432278
fixes
bradleyshep Mar 27, 2026
bed39d0
updates
bradleyshep Mar 27, 2026
9ffba0b
remove tools/site
bradleyshep Mar 27, 2026
bb26681
normalize model names
bradleyshep Mar 28, 2026
139408e
scoring fixes
bradleyshep Mar 30, 2026
6f740ed
fixes
bradleyshep Mar 30, 2026
dd35c66
results
bradleyshep Mar 30, 2026
db0e185
rust concurrency and details updates
bradleyshep Mar 31, 2026
25a246e
Update spacetimedb-typescript.mdc
bradleyshep Mar 31, 2026
741fcf4
update actions
bradleyshep Mar 31, 2026
603b5ee
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Mar 31, 2026
5fd1a0e
Update llm-benchmark-periodic.yml
bradleyshep Mar 31, 2026
b6677f9
updates
bradleyshep Mar 31, 2026
b9e43b8
Update spacetimedb-typescript.mdc
bradleyshep Mar 31, 2026
68ae3ef
refinements
bradleyshep Apr 1, 2026
e8b039a
updates
bradleyshep Apr 1, 2026
28662c6
fixes/cleanup
bradleyshep Apr 1, 2026
8f070f4
cleanup
bradleyshep Apr 1, 2026
a3e4421
cleanup
bradleyshep Apr 1, 2026
6feb97d
Update global.json
bradleyshep Apr 1, 2026
920217c
Delete llm-comparison-details.lock
bradleyshep Apr 1, 2026
e5a5546
fmt
bradleyshep Apr 1, 2026
aa3caf1
fmt
bradleyshep Apr 1, 2026
c9770da
lints
bradleyshep Apr 1, 2026
fc1d685
clippy
bradleyshep Apr 1, 2026
5cf84bb
Single source ai docs
bradleyshep Apr 8, 2026
ef78cac
separate csharp client / unity; cpp; rust refinements
bradleyshep Apr 9, 2026
d5afe95
Update init.rs
bradleyshep Apr 9, 2026
0f356c2
Update llms.md
bradleyshep Apr 9, 2026
b91f238
docusaurus md generation
bradleyshep Apr 9, 2026
be7ca88
Merge branch 'master' into bradley/llm-single-source-of-truth
bradleyshep Apr 9, 2026
e663777
Remove unused
bradleyshep Apr 9, 2026
8b96bca
Merge remote-tracking branch 'origin/bradley/llm-benchmarks-improveme…
bradleyshep Apr 9, 2026
9987996
fixes
bradleyshep Apr 9, 2026
9acb33c
skill updates
bradleyshep Apr 9, 2026
c9c9732
unreal skill
bradleyshep Apr 10, 2026
dc7a0f1
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Apr 10, 2026
acbb618
Update SKILL.md
bradleyshep Apr 10, 2026
0974acf
multi index preference
bradleyshep Apr 13, 2026
5ab0938
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Apr 13, 2026
dee42a6
Remove summary, file io, ci quickfix/check; add analysis; remove jsons
bradleyshep Apr 14, 2026
2223a75
Update client.rs
bradleyshep Apr 14, 2026
dfb6779
analysis command + permissions
bradleyshep Apr 14, 2026
514068b
updates
bradleyshep Apr 14, 2026
61192f5
Merge remote-tracking branch 'origin/bradley/llm-single-source-of-tru…
bradleyshep Apr 14, 2026
4eac30e
remove cursor rules mode + dead code cleanup
bradleyshep Apr 14, 2026
91f5861
add goldens to evals, save runs and anlysis locally if dry run
bradleyshep Apr 14, 2026
3689d9c
Update client.rs
bradleyshep Apr 14, 2026
a77321a
BOM
bradleyshep Apr 15, 2026
57b3a34
lints
bradleyshep Apr 15, 2026
989f11b
some cleanup
bradleyshep Apr 15, 2026
7f954b1
remove guidelines -> use skills
bradleyshep Apr 15, 2026
6fa6418
cleanup
bradleyshep Apr 15, 2026
b0436d0
prompt normalization
bradleyshep Apr 15, 2026
7a77779
Merge branch 'master' into bradley/llm-benchmarks-improvements
bradleyshep Apr 15, 2026
61a8d86
dry run + local analysis
bradleyshep Apr 15, 2026
4434307
Update runner.rs
bradleyshep Apr 15, 2026
40ef9e8
results
bradleyshep Apr 15, 2026
8a2b485
Merge branch 'bradley/llm-benchmarks-improvements' of https://github.…
bradleyshep Apr 15, 2026
b9e12f3
Revert "results"
bradleyshep Apr 15, 2026
f794ffe
delete
bradleyshep Apr 15, 2026
58317f5
Merge branch 'bradley/llm-benchmarks-improvements' of https://github.…
bradleyshep Apr 15, 2026
595aa7a
;omts
bradleyshep Apr 15, 2026
2a156d3
Analyze run of given date support
bradleyshep Apr 16, 2026
474c448
SKILLS: randomness + some cleanup
bradleyshep Apr 20, 2026
d74b2a9
Update SKILL.md
bradleyshep Apr 20, 2026
e1b2224
Merge branch 'master' into bradley/llm-benchmarks-improvements
cloutiertyler Apr 28, 2026
33 changes: 0 additions & 33 deletions .github/workflows/ci.yml
@@ -590,39 +590,6 @@ jobs:
         run: |
           cargo ci cli-docs

-  llm_ci_check:
-    name: Verify LLM benchmark is up to date
-    permissions:
-      contents: read
-    runs-on: ubuntu-latest
-    # Disable the tests because they are causing us headaches with merge conflicts and re-runs etc.
-    if: false
-    steps:
-      # Build the tool from master to ensure consistent hash computation
-      # with the llm-benchmark-update workflow (which also uses master's tool).
-      - name: Checkout master (build tool from trusted code)
-        uses: actions/checkout@v4
-        with:
-          ref: master
-          fetch-depth: 1
-
-      - uses: dtolnay/rust-toolchain@stable
-      - uses: Swatinem/rust-cache@v2
-
-      - name: Install llm-benchmark tool from master
-        run: |
-          cargo install --path tools/xtask-llm-benchmark --locked
-          command -v llm_benchmark
-
-      # Now checkout the PR branch to verify its benchmark files
-      - name: Checkout PR branch
-        uses: actions/checkout@v4
-        with:
-          clean: false
-
-      - name: Run hash check (both langs)
-        run: llm_benchmark ci-check
-
   unity-testsuite:
     needs: [lints]
     # Skip if this is an external contribution.
115 changes: 115 additions & 0 deletions .github/workflows/llm-benchmark-periodic.yml
@@ -0,0 +1,115 @@
+name: Periodic LLM benchmarks
+
+on:
+  schedule:
+    # Daily at midnight UTC. Change to '0 */6 * * *' for every 6h,
+    # or '0 */4 * * *' for every 4h.
+    - cron: '0 0 * * *'
+  workflow_dispatch:
+    inputs:
+      models:
+        description: 'Models to run (provider:model format, comma-separated, or "all")'
+        required: false
+        default: 'all'
+      languages:
+        description: 'Languages to benchmark (comma-separated: rust,csharp,typescript)'
+        required: false
+        default: 'rust,csharp,typescript'
+      modes:
+        description: 'Modes to run (comma-separated: guidelines,no_context,docs,...)'
+        required: false
+        default: 'guidelines,no_context'
+
+concurrency:
+  group: llm-benchmark-periodic
+  cancel-in-progress: true
+
+jobs:
+  run-benchmarks:
+    runs-on: spacetimedb-new-runner
+    container:
+      image: localhost:5000/spacetimedb-ci:latest
+      options: >-
+        --privileged
+    timeout-minutes: 180
+
+    steps:
+      - name: Install spacetime CLI
+        run: |
+          curl -sSf https://install.spacetimedb.com | sh -s -- -y
+          echo "$HOME/.local/bin" >> $GITHUB_PATH
+
+      - name: Checkout master
+        uses: actions/checkout@v4
+        with:
+          ref: master
+          fetch-depth: 1
+
+      - uses: dtolnay/rust-toolchain@stable
+      - uses: Swatinem/rust-cache@v2
+
+      - name: Setup .NET SDK
+        uses: actions/setup-dotnet@v4
+        with:
+          dotnet-version: "8.0.x"
+
+      - name: Install WASI workload
+        env:
+          DOTNET_MULTILEVEL_LOOKUP: "0"
+          DOTNET_CLI_HOME: ${{ runner.temp }}/dotnet-home
+          DOTNET_SKIP_FIRST_TIME_EXPERIENCE: "1"
+        run: |
+          dotnet workload install wasi-experimental --skip-manifest-update --disable-parallel
+
+      - name: Set up Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: 22
+
+      - name: Install pnpm
+        uses: pnpm/action-setup@v4
+
+      - name: Build llm-benchmark tool
+        run: cargo install --path tools/xtask-llm-benchmark --locked
+
+      - name: Run benchmarks
+        env:
+          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          LLM_BENCHMARK_API_KEY: ${{ secrets.LLM_BENCHMARK_API_KEY }}
+          LLM_BENCHMARK_UPLOAD_URL: ${{ secrets.LLM_BENCHMARK_UPLOAD_URL }}
+          MSBUILDDISABLENODEREUSE: "1"
+          DOTNET_CLI_USE_MSBUILD_SERVER: "0"
+          INPUT_LANGUAGES: ${{ inputs.languages || 'rust,csharp,typescript' }}
+          INPUT_MODELS: ${{ inputs.models || 'all' }}
+          INPUT_MODES: ${{ inputs.modes || 'guidelines,no_context' }}
+        run: |
+          LANGS="$INPUT_LANGUAGES"
+          MODELS="$INPUT_MODELS"
+          MODES="$INPUT_MODES"
+
+          SUCCEEDED=0
+          FAILED=0
+          for LANG in $(echo "$LANGS" | tr ',' ' '); do
+            if [ "$MODELS" = "all" ]; then
+              if llm_benchmark run --lang "$LANG" --modes "$MODES"; then
+                SUCCEEDED=$((SUCCEEDED + 1))
+              else
+                echo "::warning::Benchmark run failed for lang=$LANG"
+                FAILED=$((FAILED + 1))
+              fi
+            else
+              if llm_benchmark run --lang "$LANG" --modes "$MODES" --models "$MODELS"; then
+                SUCCEEDED=$((SUCCEEDED + 1))
+              else
+                echo "::warning::Benchmark run failed for lang=$LANG models=$MODELS"
+                FAILED=$((FAILED + 1))
+              fi
+            fi
+          done
+          echo "Benchmark runs: $SUCCEEDED succeeded, $FAILED failed"
+          if [ "$SUCCEEDED" -eq 0 ] && [ "$FAILED" -gt 0 ]; then
+            echo "::error::All benchmark runs failed"
+            exit 1
+          fi
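
The tallying logic in the "Run benchmarks" step can be tried locally without API keys. The following is a minimal sketch, not the workflow's actual script: `run_benchmark` is a hypothetical stub standing in for `llm_benchmark run ...` (here it pretends the `csharp` run fails), and `tally_runs` is an illustrative name. It also uses a lowercase loop variable, on the assumption that leaving the `LANG` locale environment variable untouched is preferable to reusing it as the loop variable.

```shell
#!/usr/bin/env bash
# Sketch only: run_benchmark is a hypothetical stand-in for
# `llm_benchmark run --lang ... --modes ...`; it is NOT the real CLI.
run_benchmark() {
  # Pretend the csharp run fails so both branches are exercised.
  [ "$1" != "csharp" ]
}

# Runs the stub once per comma-separated language and tallies results.
tally_runs() {
  local langs="$1"
  local succeeded=0
  local failed=0
  local lang
  # Lowercase `lang` leaves the LANG locale variable alone.
  for lang in $(echo "$langs" | tr ',' ' '); do
    if run_benchmark "$lang"; then
      succeeded=$((succeeded + 1))
    else
      echo "::warning::Benchmark run failed for lang=$lang"
      failed=$((failed + 1))
    fi
  done
  echo "Benchmark runs: $succeeded succeeded, $failed failed"
  # Fail only when every run failed, mirroring the workflow's exit rule.
  if [ "$succeeded" -eq 0 ] && [ "$failed" -gt 0 ]; then
    echo "::error::All benchmark runs failed"
    return 1
  fi
}

tally_runs "rust,csharp,typescript"
```

With the stub above, a partial failure still returns success; only a call such as `tally_runs "csharp"`, where every run fails, returns non-zero. That matches the workflow's choice to warn on individual failures and fail the job only when nothing succeeded.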

Check warning: Code scanning / CodeQL

Workflow does not contain permissions (severity: Medium)

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

github-advanced-security[bot] marked this conversation as resolved. Status: Fixed