
Purpose

Fit-for-purpose goal: integrate plugin-ipc into ~/src/netdata/netdata/ so Netdata can immediately replace the current Linux cgroups.plugin -> ebpf.plugin custom metadata transport with typed IPC that is reliable, maintainable, testable, and ready for guarded production rollout.

User Decisions

  • Decision recorded on 2026-04-18:
    • PR scope for Thiago's formatter-only netdata-otel change:
      • keep the netdata-otel formatting/cosmetic change in the Netdata integration PR
      • rationale:
        • it is formatter-driven cosmetic churn
        • it is harmless
        • there is no requirement to split it out at this stage
  • Decision recorded on 2026-04-18:
    • Upstream sync policy for vendored library changes:
      • every change that touches vendored library code in the Netdata integration PR must be copied back to the upstream plugin-ipc repository
      • rationale:
        • the Netdata vendoring script will overwrite vendored library trees on the next sync
        • leaving library-only fixes in Netdata would make them disappear on the next revendor
      • concrete scope identified from 33ecdf4de..d97e8fa1c:
        • src/crates/netipc/src/protocol/cgroups.rs
        • src/crates/netipc/src/transport/shm.rs
        • src/crates/netipc/src/transport/shm_tests.rs
      • implication:
        • Netdata-only integration files stay in the Netdata PR
        • vendored netipc Rust fixes must be ported upstream before the next vendoring run
  • Decision recorded on 2026-04-14:
    • Limit negotiation contract:
      • the general contract is:
        • client proposes
        • server decides
        • the final negotiated values are returned by the server in the handshake response
      • each negotiated field must be specified independently; there is no single universal formula such as "always min()" or "always the server's value"
      • examples explicitly clarified by the user:
        • request size limit:
          • client proposes
          • server decides whether to echo it unchanged or alter it
          • this must be defined explicitly in the spec, field by field
        • response size limit:
          • client may propose
          • server returns its own value because only the server knows what it may need to send
        • packet chunking / packet size:
          • server decides using min(client, server) so the session can actually communicate
      • the specification must be updated to state this explicitly and unambiguously for every negotiated field
      • all implementations must be reviewed and aligned to this rule field by field
  • Decision recorded on 2026-04-14:
    • Transport-profile lock after handshake:
      • the selected transport/profile is negotiated during handshake and locked for the lifetime of that session
      • no fallback is allowed after transport negotiation has completed
      • if SHM is negotiated, SHM must be usable for that session
      • any post-handshake SHM fallback to baseline transport is considered a contract violation and must not be adopted as the upstream fix
  • Decision recorded on 2026-04-14:
    • Request-direction negotiation policy:
      • max_request_payload_bytes
        • client proposes the whole-request payload ceiling
        • server echoes it back unchanged when it is acceptable
        • hard-cap the field at 1 MiB
        • if the client proposes more than 1 MiB, reject the handshake
        • do not silently clamp down
      • max_request_batch_items
        • client proposes the intended batch size
        • if there is no concrete protocol-level constraint, server echoes it back unchanged
        • do not invent hypothetical lowering logic without evidence
  • Decision recorded on 2026-04-14:
    • SHM readiness and profile lock:
      • the negotiated profile is locked for the lifetime of the session
      • if SHM is selected, SHM must already be guaranteed usable for that session when the handshake succeeds
      • no post-handshake fallback is allowed
      • this requires moving SHM readiness earlier than it happens in the current implementation
  • Decision recorded on 2026-04-14:
    • Typed L2 request sizing policy:
      • the client should proactively propose max_request_payload_bytes
      • it should not rely primarily on overflow/reconnect learning
      • for typed L2 methods, the library should calculate the initial request payload proposal from:
        • the method schema
        • the configured/requested batch size
        • explicit sizing assumptions for dynamic fields
      • current approved sizing assumption for strings:
        • assume strings up to 1024 bytes when deriving request payload ceilings unless a method-specific rule says otherwise
      • objective:
        • initial negotiation should already be close to the real need
        • reconnects due to request overflow should be rare safety-net events, not the normal sizing mechanism
      • implication:
        • public typed L2 methods need method-specific sizing rules
        • opaque raw/internal L2 paths may still need reactive overflow recovery as a fallback because they cannot infer request schema automatically
  • Decision recorded on 2026-04-14:
    • max_request_payload_bytes policy:
      • hard-cap the negotiated request payload ceiling at 1 MiB
      • if the client proposes anything above 1 MiB, reject the handshake
      • do not silently clamp down and continue
      • below 1 MiB, preferred behavior is to echo the client proposal back unchanged
  • Decision recorded on 2026-04-14:
    • max_request_batch_items policy:
      • if there is no concrete protocol-level constraint, echo the client proposal back unchanged
      • do not invent hypothetical lowering logic without evidence
  • Decision recorded on 2026-04-14:
    • max_response_batch_items protocol field:
      • keep it in the protocol handshake payloads
      • define it as symmetric with request batch items
      • the server must return the same effective batch-item ceiling for requests and responses
      • rationale:
        • current protocol/method behavior is symmetric by position for batch responses
        • current implementations mirror request item_count into batch response item_count
      • implication:
        • this is now a semantic contract clarification, not a handshake wire-layout removal
        • the specs and all implementations must still be aligned so this field is never independently negotiated
  • Decision recorded on 2026-04-14:
    • Handshake specification deliverable requirements:
      • before implementation, the docs/specs must contain the full handshake description as an overall process/strategy
      • the handshake docs/specs must include per-field analysis:
        • what the client does
        • what the client sends
        • what the server does
        • what the server sends back
      • the docs/specs must include Mermaid sequence diagrams for the handshake process
  • Decision recorded on 2026-04-14:
    • Handshake correctness and guarantee requirements:
      • the negotiated profile must be guaranteed to work after handshake
      • this guarantee must be explicit in the docs/specs and enforced by implementation/tests
  • Decision recorded on 2026-04-14:
    • Handshake test requirements:
      • the handshake process must be fully tested field by field
      • tests must ensure all implementations comply with the documented handshake semantics 100%
      • all auth failures must be tested individually
      • reconnection due to payload overflow must be fully tested
  • Decision recorded on 2026-04-14:
    • L2 public API requirement:
      • L2 users must not provide max_request_payload_bytes
      • request payload sizing is internal library logic, not a user-facing L2 knob
  • Decision recorded on 2026-04-14:
    • Handshake wire evolution for max_response_batch_items:
      • keep the field on the wire
      • do not introduce a new handshake layout version for this point alone
      • document and enforce that it is symmetric with max_request_batch_items
  • Decision recorded on 2026-04-15:
    • Cross-machine workflow and completion bar:
      • the only valid workflow is:
        • commit and push in /home/costa/src/plugin-ipc.git
        • pull on win11:~/src/plugin-ipc.git
        • if fixes are needed after Windows validation:
          • fix locally in /home/costa/src/plugin-ipc.git
          • commit and push locally
          • pull again on win11:~/src/plugin-ipc.git
      • do not leave uncommitted divergence as the way to sync Linux and Windows
      • the task is not complete until:
        • the entire relevant Linux test suite is green
        • the entire relevant native Windows (win11) test suite is green
        • the repository state is judged correct enough to proceed to the Netdata integration PR follow-up
  • Decision recorded on 2026-04-15:
    • Windows sync cleanup before validation:
      • win11:~/src/plugin-ipc.git/benchmarks-windows.csv may be discarded if locally dirty
      • rationale:
        • it is a generated artifact
        • win11 must remain a clean validation checkout
        • the authoritative workflow is commit/push here, pull there
  • Decision recorded on 2026-04-15:
    • Baseline validation pass:
      • run all practical test suites and benchmark suites on both Linux and native Windows in parallel
      • objective:
        • confirm that the handshake rewrite and related fixes did not regress correctness
        • confirm that the benchmark baselines still hold after the handshake and SHM-readiness changes
      • Linux validation matrix:
        • cmake --build build -j4
        • ctest --test-dir build --output-on-failure -j4
        • cd src/crates/netipc && cargo test
        • cd src/go && go test ./...
        • bash tests/run-go-race.sh
        • bash tests/run-extended-fuzz.sh
        • bash tests/run-posix-bench.sh
      • Windows validation matrix on win11:~/src/plugin-ipc.git:
        • cmake --build build -j4
        • ctest --test-dir build --output-on-failure -j4
        • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • cd src/go && go test ./...
        • bash tests/run-windows-msys-validation.sh
        • bash tests/run-windows-bench.sh
      • precondition verified before launch:
        • local /home/costa/src/plugin-ipc.git and win11:~/src/plugin-ipc.git are both on commit 50c4a2d21d3009c53520d1b7fc4fac78ce77e876
        • 50c4a2d is a TODO-only validation-matrix commit on top of code commit 313f7ed
        • no tracked local modifications are present on either host
  • Decision recorded on 2026-04-15:
    • Expanded validation scope for the baseline pass:
      • "all possible tests" must include standalone validation entrypoints that are not covered by the basic ctest / cargo test / go test lanes
      • additional Linux validation entrypoints to run:
        • bash tests/run-coverage-c.sh
        • bash tests/run-coverage-go.sh
        • bash tests/run-coverage-rust.sh
        • bash tests/run-sanitizer-asan.sh
        • bash tests/run-sanitizer-tsan.sh
        • bash tests/run-valgrind.sh
        • bash tests/interop_codec.sh
        • bash tests/test_uds_interop.sh
        • bash tests/test_shm_interop.sh
        • bash tests/test_service_interop.sh
        • bash tests/test_service_shm_interop.sh
        • bash tests/test_cache_interop.sh
        • bash tests/test_cache_shm_interop.sh
      • additional native Windows validation entrypoints on win11:~/src/plugin-ipc.git to run after the strict native benchmark finishes:
        • bash tests/run-verifier-windows.sh
        • bash tests/run-coverage-c-windows.sh
        • bash tests/run-coverage-go-windows.sh
        • bash tests/run-coverage-rust-windows.sh
        • bash tests/test_named_pipe_interop.sh
        • bash tests/test_win_shm_interop.sh
        • bash tests/test_service_win_interop.sh
        • bash tests/test_service_win_shm_interop.sh
        • bash tests/test_cache_win_interop.sh
        • bash tests/test_cache_win_shm_interop.sh
      • benchmark scope already covered by:
        • bash tests/run-posix-bench.sh
        • bash tests/run-windows-msys-validation.sh
        • bash tests/run-windows-bench.sh
  • Finding recorded on 2026-04-15 during full Windows validation:
    • Windows C coverage still meets the configured threshold, but gcov showed the new successful-dispatch oversized-response branch in src/libnetdata/netipc/src/service/netipc_service_win.c was uncovered:
      • branch location: server_handle_session() case NIPC_OK
      • lines observed uncovered in the run:
        • response_len > session->max_response_payload_bytes
        • server_note_response_capacity(...)
        • resp_hdr.transport_status = NIPC_STATUS_LIMIT_EXCEEDED
        • response_len = 0
    • This is not acceptable for the spec-compliance bar because Linux has a targeted payload-limit test for this path and Windows must prove the same server behavior.
    • Implementation plan:
      • add a small dedicated Windows coverage-only C target for service payload limits
      • keep it separate from the already oversized test_win_service_guards.c
      • wire it into tests/run-coverage-c-windows.sh
      • rerun Windows C coverage and then rerun the required full validation loop after the fix is committed and pulled on win11:~/src/plugin-ipc.git
  • Finding recorded on 2026-04-15 during native Windows/MSYS validation:
    • The full Windows run reached tests/run-windows-msys-validation.sh and failed at:
      • test_service_win_shm_interop
      • pair: C server, C client
      • observed client output: client: not ready
    • Immediate targeted reproduction on the same win11:~/src/plugin-ipc.git checkout passed test_service_win_shm_interop 10/10 times, so the failure is intermittent.
    • Concrete harness issue found:
      • tests/test_service_win_interop.sh gives the server up to TIMEOUT=10 seconds to print READY
      • the C/Rust/Go Windows service clients each only wait 200 x 10ms = 2 seconds for client readiness
      • under heavy validation load, one early client attempt can return client: not ready even though the same pair passes immediately on repeat
    • Implementation plan:
      • harden the Windows service and cache interop shell harnesses so client: not ready is retried until the existing TIMEOUT budget expires
      • keep persistent failures visible after the timeout
      • do not hide real call/decoding failures; only retry the pre-call readiness race
  • Finding recorded on 2026-04-15 during full Windows validation after commit cf5cf8dfdaf223460763bf8287ce55394ab912f0:
    • the full run reached the MSYS bounded benchmark comparison and exposed a harness-policy contradiction in the native reference row:
      • failing row: snapshot-baseline c->c @ max
      • samples path reported by the runner: /tmp/netipc-bench-143223/samples-snapshot-baseline-c-c-0.csv
      • observed raw spread:
        • raw_min=5730.000
        • raw_max=22687.000
        • raw_ratio=3.959337
        • configured max ratio: 2.00
      • observed trimmed stable core:
        • stable_min=21620.000
        • stable_max=22637.000
        • stable_ratio=1.047040
        • the same full run later exposed another noisy compare-lane row:
        • failing row: snapshot-shm c->c @ max
        • samples path reported by the runner: /tmp/netipc-bench-143696/samples-snapshot-shm-c-c-0.csv
        • observed stable spread:
          • stable_min=175040.000
          • stable_max=462863.000
          • stable_ratio=2.644327
          • configured max ratio: 2.00
      • tests/compare-windows-bench-toolchains.sh intentionally runs a bounded comparison with regression floors, but tests/run-windows-bench.sh still rejects the whole row when the raw sample set contains an outlier, even if the trimmed stable core is valid.
    • implementation plan:
      • keep the normal Windows benchmark publication path fail-closed on raw instability
      • add an explicit opt-in runner mode for comparison lanes that allows a row with raw outliers only when the trimmed stable core already passed the existing stable-sample and stable-ratio checks
      • add bounded per-row retry to the targeted runner so compare lanes do not accept rows with unstable trimmed cores; instead, noisy rows must rerun and produce a stable sample set before they are used in the policy CSV
      • enable that opt-in only from tests/compare-windows-bench-toolchains.sh
      • add a shell policy test for this pure stability decision so the compare lane cannot silently regress back to raw-outlier flapping
  • Finding recorded on 2026-04-15 during full Windows validation after commit d4ff75787743d28ee7f6eedd73e274d0cb608506:
    • the full Windows run failed inside tests/run-windows-msys-validation.sh before the strict full native Windows benchmark could run
    • exact failed artifact:
      • /tmp/plugin-ipc-full-windows-20260415-153255/msys-validation/bench-compare/policy.csv
    • concrete policy failures:
      • np-100k: MSYS 49.5% of mingw64, required 70.0%
      • shm-max: MSYS 54.0% of mingw64, required 85.0%
      • shm-100k: MSYS 32.7% of mingw64, required 95.0%
      • snapshot-np: MSYS 28.3% of mingw64, required 80.0%
    • concrete harness issue:
      • tests/compare-windows-bench-toolchains.sh measures all mingw64 rows first and all msys rows second
      • this makes the policy ratio vulnerable to cross-phase host-load drift during the requested parallel Linux + Windows validation
      • the comparison policy is meant to compare paired measurements, not two long phases that may run under different external load
    • implementation plan:
      • keep the existing policy floors unchanged
      • change the compare harness to run each row as an adjacent pair:
        • mingw64 measurement for the row
        • msys measurement for the same row
      • keep the existing per-row retry and raw-outlier opt-in policy
      • rerun Windows validation after committing and pulling on win11:~/src/plugin-ipc.git
  • Finding recorded on 2026-04-15 during the full Linux/POSIX benchmark run after commit 1e3da7da60c1923dbd8f436ae6bef35b29066b5c:
    • the POSIX benchmark does not fail the performance floors at this point, but the C server rows repeatedly emit:
      • Server c (...) did not exit cleanly within 5s; forcing kill
    • concrete cause:
      • tests/run-posix-bench.sh gives every server extra lifetime: server_duration=$((duration + 5))
      • bench/drivers/c/bench_posix.c starts a timer that sleeps duration_sec + 3
      • bench/drivers/c/bench_posix.c then calls pthread_join(timer_tid, NULL) after nipc_server_run() returns
      • when the harness sends SIGTERM after the client run completes, the C signal handler stops nipc_server_run(), but process exit is still delayed while joining the timer thread
      • for the normal 5-second benchmark rows this can leave roughly 7-8 seconds of timer sleep, while the harness only waits 5 seconds before force-killing
    • cross-platform audit:
      • bench/drivers/c/bench_windows.c has the same timer-wait pattern with WaitForSingleObject(timer, INFINITE) after the server thread exits
      • the current Windows harness usually avoids the warning because it passes server duration without the POSIX +5, but the C driver still has the same shutdown-latency bug
    • implementation plan:
      • make the POSIX C benchmark timer cancelable when the server exits before the timer fires
      • make the Windows C benchmark timer wait on a cancellation event instead of an unconditional sleep
      • keep timer-driven self-stop behavior unchanged for standalone benchmark server runs
      • keep harness thresholds and benchmark floors unchanged
      • rerun affected benchmark validation after committing and pulling on win11:~/src/plugin-ipc.git
    • fix applied locally:
      • bench/drivers/c/bench_posix.c
        • cancel and join the timer thread when nipc_server_run() exits before the timer fires
        • report timer-thread creation failure instead of silently continuing
      • bench/drivers/c/bench_windows.c
        • replace unconditional timer sleep with a cancel event
        • signal the cancel event and join the timer thread when the server thread exits
        • report timer-thread creation failure instead of silently continuing
    • targeted local verification:
      • cmake --build build-bench-posix --target bench_posix_c -j24 passed
      • manual bench_posix_c C server/client row:
        • server command: uds-ping-pong-server ... 10
        • client command: uds-ping-pong-client ... 1 1000
        • after SIGTERM the server exited within the 2-second proof window
        • server output contained READY and SERVER_CPU_SEC=...
        • no forced-kill warning was needed
      • git diff --check passed
      • bash tests/test_windows_bench_stability_policy.sh passed

Implementation Status (2026-04-14)

  • Specs/docs updated before implementation:
    • docs/level1-wire-envelope.md
    • docs/level1-transport.md
    • docs/level1-posix-uds.md
    • docs/level1-windows-np.md
    • docs/level2-typed-api.md
    • docs/getting-started.md
  • Implemented in C / Go / Rust:
    • handshake negotiation aligned to the documented field-by-field contract
    • max_request_payload_bytes hard-capped at 1 MiB
    • proposals above 1 MiB rejected with LIMIT_EXCEEDED
    • max_request_payload_bytes echoed unchanged below cap
    • max_request_batch_items echoed unchanged
    • max_response_payload_bytes server-owned
    • max_response_batch_items kept symmetric with request batch items
    • packet_size negotiated as min(client, server) and rejected if not usable
    • typed L2 public configs no longer expose max_request_payload_bytes
    • SHM readiness moved before handshake completion in managed/raw service paths so negotiated SHM is guaranteed for that session
  • Verified test results after implementation:
    • Rust:
      • cd src/crates/netipc && cargo test
      • result: 305 passed; 0 failed
    • Go POSIX/raw/cgroups/protocol:
      • cd src/go && go test ./pkg/netipc/protocol ./pkg/netipc/service/cgroups ./pkg/netipc/service/raw ./pkg/netipc/transport/posix
      • result: all passed
    • Go Windows compile checks:
      • GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows
      • GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw
      • result: both compile successfully
    • C targeted transport/service tests:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^test_uds$'
      • result: passed
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_service|test_hardening|test_service_extra|test_ping_pong)$'
      • result:
        • test_uds passed after fixing C client mapping for HELLO_ACK transport_status = LIMIT_EXCEEDED
        • test_service passed
        • test_service_extra passed
        • test_hardening passed
        • test_ping_pong timed out
  • Open obstacle discovered during verification:
    • tests/fixtures/c/test_ping_pong.c hangs in the empty-snapshot case
    • concrete trace:
      • running timeout 20 stdbuf -o0 ./build/bin/test_ping_pong stalls after:
        • Test: empty snapshot is valid for the service kind
        • PASS: server started
        • PASS: client ready
      • strace -ff -o /tmp/test_ping_pong.trace timeout 10 stdbuf -o0 ./build/bin/test_ping_pong shows:
        • the server sends a STATUS_OK response with payload_len = 0 for the empty snapshot request
        • the client then disconnects and reconnects
        • the server accept loop subsequently polls invalid fds (0 / 32765) instead of the listening socket
    • refined local findings from the reproduction on 2026-04-15:
      • the earlier "empty snapshot sends zero payload" theory was wrong
      • debugger evidence showed the empty-snapshot typed dispatch path computes a non-zero typed payload as expected
      • the real root cause was the test fixture lifecycle:
        • tests/fixtures/c/test_ping_pong.c detached the server accept thread and never joined it
        • after test teardown, detached accept loops continued polling invalid or reused fds, contaminating later cases
        • that produced the observed fd=0 / fd=32765 evidence and the spurious header-only UNSUPPORTED response seen by the third test
      • evidence:
        • tests/fixtures/c/test_ping_pong.c
        • src/libnetdata/netipc/src/service/netipc_service.c
        • gdb trace on server_handle_session showed normal non-zero response sizes for the first two tests and no empty-snapshot dispatch failure
        • strace artifacts:
          • /tmp/test_ping_pong.recheck.431492
          • /tmp/test_ping_pong.recheck.431494
          • /tmp/test_ping_pong.recheck.431496
    • fix applied on 2026-04-15:
      • tests/fixtures/c/test_ping_pong.c
        • store the accept thread
        • stop detaching it
        • join it during teardown after nipc_server_drain()
    • verification after the fix:
      • timeout 20 stdbuf -o0 ./build/bin/test_ping_pong
      • result: 20 passed, 0 failed
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_service|test_hardening|test_service_extra|test_ping_pong)$'
      • result: 100% tests passed, 0 tests failed out of 5
    • next full-suite obstacle discovered on 2026-04-15 after commit 2674826:
      • the full Linux/native-Windows validation loop is still blocked by stale Go typed-L2 fixture code in:
        • tests/fixtures/go/cmd/interop_service/main.go
        • tests/fixtures/go/cmd/interop_service_win/main.go
        • tests/fixtures/go/cmd/interop_cache/main.go
        • tests/fixtures/go/cmd/interop_cache_win/main.go
        • bench/drivers/go/main.go
        • bench/drivers/go/main_windows.go
      • exact build failures:
        • unknown field MaxRequestPayloadBytes in struct literal of type cgroups.ServerConfig
        • unknown field MaxResponseBatchItems in struct literal of type cgroups.ServerConfig
        • unknown field MaxRequestPayloadBytes in struct literal of type cgroups.ClientConfig
        • unknown field MaxResponseBatchItems in struct literal of type cgroups.ClientConfig
      • meaning:
        • the public typed L2 API cleanup is correct
        • several typed Go service/benchmark helpers still reference removed fields and must be aligned before the full Linux and native Windows suites can pass
    • next full-suite obstacle discovered on 2026-04-15 after aligning the Go typed helpers:
      • the full Linux/native-Windows validation loop is also blocked by stale C typed-L2 service tests and interop helpers
      • concrete failing files already observed during the Linux build:
        • tests/fixtures/c/test_multi_server.c
        • tests/fixtures/c/interop_service.c
      • broader audit shows the same stale public-field pattern in multiple typed C files:
        • tests/fixtures/c/interop_cache.c
        • tests/fixtures/c/interop_cache_win.c
        • tests/fixtures/c/interop_service.c
        • tests/fixtures/c/interop_service_win.c
        • tests/fixtures/c/test_cache.c
        • tests/fixtures/c/test_chaos.c
        • tests/fixtures/c/test_hardening.c
        • tests/fixtures/c/test_multi_server.c
        • tests/fixtures/c/test_service.c
        • tests/fixtures/c/test_stress.c
        • tests/fixtures/c/test_win_service.c
        • tests/fixtures/c/test_win_service_extra.c
        • tests/fixtures/c/test_win_service_guards.c
        • tests/fixtures/c/test_win_service_guards_extra.c
        • tests/fixtures/c/test_win_stress.c
      • important nuance:
        • raw transport configs in the same files still legitimately carry max_request_payload_bytes and max_response_batch_items
        • only public typed nipc_client_config_t / nipc_server_config_t uses are stale
      • implication:
        • the cleanup must be type-aware
        • Windows-only white-box overflow tests that used the removed public request-payload field need manual rewriting so overflow-reconnect remains covered without reintroducing the public knob

Baseline Validation Status (2026-04-15)

  • Final result of the full baseline pass:
    • code checkout under test:
      • Linux: 50c4a2d21d3009c53520d1b7fc4fac78ce77e876
      • Windows: win11:~/src/plugin-ipc.git at 50c4a2d21d3009c53520d1b7fc4fac78ce77e876
      • note: 50c4a2d is the TODO-only validation-matrix commit on top of code commit 313f7ed
    • Linux is not fully green:
      • first full ctest failed once in go_FuzzDecodeHello with context deadline exceeded
      • isolated fuzz rerun passed
      • second full ctest passed
      • C coverage failed the per-file gate because netipc_service.c is 87.2% against a 90% threshold
    • native Windows is not fully green:
      • go test ./... failed in TestWinServerDispatchSingleSnapshotZeroCapacity
      • run-windows-msys-validation.sh failed on 3 targeted comparison rows
      • run-verifier-windows.sh failed because gflags.exe /p /enable test_named_pipe.exe /full returned exit code 1
      • run-coverage-c-windows.sh failed in test_win_service_guards.exe with 8 failed guard assertions
    • benchmark floors:
      • Linux POSIX benchmark generated 202 CSV lines and passed all performance floors
      • native Windows strict benchmark generated 202 CSV lines and passed all Windows performance floors
      • MSYS comparison benchmark did not pass because the MSYS validation script failed targeted rows
  • Validation artifacts:
    • Linux:
      • /tmp/plugin-ipc-validate-linux-20260415-045044
      • /tmp/plugin-ipc-validate-linux-extra-20260415-0615
      • /tmp/plugin-ipc-validate-linux-coverage-20260415-0616
      • /tmp/plugin-ipc-validate-linux-coverage-split-20260415-0616
      • /tmp/plugin-ipc-validate-linux-ctest-rerun-20260415-0615
    • Windows:
      • /tmp/plugin-ipc-validate-windows-20260415-045103
      • /tmp/plugin-ipc-validate-windows-extra-20260415-0622
  • Linux results:
    • cmake --build build -j4: passed
    • first ctest --test-dir build --output-on-failure -j4:
      • failed only in go_FuzzDecodeHello
      • exact log:
        • FuzzDecodeHello (30.06s)
        • context deadline exceeded
      • evidence:
        • /tmp/plugin-ipc-validate-linux-20260415-045044/linux-ctest.log
    • isolated rerun:
      • cd src/go/pkg/netipc/protocol && go test -run=^$ -fuzz=^FuzzDecodeHello$ -fuzztime=30s
      • passed
      • evidence:
        • /tmp/plugin-ipc-validate-linux-20260415-045044/linux-fuzzdecodehello-isolated.log
    • second full ctest --test-dir build --output-on-failure -j4:
      • passed
      • evidence:
        • /tmp/plugin-ipc-validate-linux-ctest-rerun-20260415-0615/linux-ctest-rerun.log
      • meaning:
        • the first Linux ctest failure is currently classified as a flake / scheduling-sensitive failure, not a deterministic regression
    • cargo test: passed
    • go test ./...: passed
    • bash tests/run-go-race.sh: passed
    • bash tests/run-extended-fuzz.sh: passed
    • bash tests/run-posix-bench.sh: passed
    • bash tests/generate-benchmarks-posix.sh: passed
      • generator confirmed:
        • All performance floors met
      • CSV rows:
        • 202
    • bash tests/run-sanitizer-asan.sh: passed
    • bash tests/run-sanitizer-tsan.sh: passed
    • bash tests/run-valgrind.sh: passed
    • Linux interop shell tests:
      • tests/interop_codec.sh: passed
      • tests/test_uds_interop.sh: passed
      • tests/test_shm_interop.sh: passed
      • tests/test_service_interop.sh: passed
      • tests/test_service_shm_interop.sh: passed
      • tests/test_cache_interop.sh: passed
      • tests/test_cache_shm_interop.sh: passed
    • coverage:
      • bash tests/run-coverage-go.sh: passed
        • total coverage 94.3%
      • bash tests/run-coverage-rust.sh: passed
        • total coverage 95.17%
      • bash tests/run-coverage-c.sh: failed
        • a direct functional rerun of ./build-coverage/bin/test_service reported 209 passed, 0 failed
        • the script still fails because the per-file coverage gate is missed:
          • netipc_service.c = 87.2%
          • threshold = 90%
          • the overall total is still 90.7%
        • this is below the repository's documented Linux/POSIX C coverage baseline:
          • COVERAGE-EXCLUSIONS.md records netipc_service.c = 92.1%
        • evidence:
          • /tmp/plugin-ipc-validate-linux-coverage-20260415-0616/linux-coverage-c.log
          • /tmp/plugin-ipc-validate-linux-coverage-20260415-0616/test_service_direct.log
  • Windows results:
    • cmake --build build -j4: passed
    • ctest --test-dir build --output-on-failure -j4: passed
    • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1: passed
    • cd src/go && go test ./...: failed
      • exact failing test:
        • TestWinServerDispatchSingleSnapshotZeroCapacity
      • exact panic:
        • CgroupsBuilder buffer too small: need at least 48 bytes, got 0
      • evidence:
        • /tmp/plugin-ipc-validate-windows-20260415-045103/win-go.log
    • bash tests/run-windows-msys-validation.sh: failed
      • exact evidence:
        • /tmp/plugin-ipc-validate-windows-20260415-045103/win-msys-validation.log
      • concrete failing targeted rows already captured there:
        • snapshot-baseline c->c @ 0
        • snapshot-shm c->c @ 0
        • shm-ping-pong c->rust @ 0
      • final script summary:
        • 3 targeted row(s) failed
      • refined failing-row evidence:
        • snapshot-baseline c->c @ 0: stable_ratio=2.332047, max allowed 2.00
        • snapshot-shm c->c @ 0: raw_ratio=2.411989, max allowed 2.00
        • shm-ping-pong c->rust @ 0: raw_ratio=7.526888, max allowed 2.00
    • bash tests/run-windows-bench.sh: passed
      • evidence:
        • /tmp/plugin-ipc-validate-windows-20260415-045103/win-bench-native.log
        • /tmp/plugin-ipc-validate-windows-20260415-045103/benchmarks-windows-full.csv
        • /tmp/plugin-ipc-validate-windows-20260415-045103/win-bench-gen-native.log
        • /tmp/plugin-ipc-validate-windows-20260415-045103/benchmarks-windows-full.md
      • CSV rows:
        • 202
      • generator confirmed:
        • All performance floors met
    • bash tests/run-verifier-windows.sh: failed
      • evidence:
        • /tmp/plugin-ipc-validate-windows-extra-20260415-0622/win-verifier.log
      • exact failure:
        • gflags.exe /p /enable test_named_pipe.exe /full
        • exit code 1
    • bash tests/run-coverage-c-windows.sh: failed
      • evidence:
        • /tmp/plugin-ipc-validate-windows-extra-20260415-0622/win-coverage-c.log
      • exact summary:
        • 190 passed, 8 failed
      • failing assertions:
        • increment batch transparently resizes and succeeds
        • increment batch negotiated request size grows
        • string reverse transparently resizes and succeeds
        • string reverse negotiated request size grows
        • hybrid SHM request overflow transparently recovers
        • hybrid send-capacity resize keeps client READY
        • hybrid batch request overflow transparently recovers
        • hybrid batch resize keeps client READY
    • bash tests/run-coverage-go-windows.sh: passed
    • bash tests/run-coverage-rust-windows.sh: passed
    • Windows interop shell tests:
      • tests/test_named_pipe_interop.sh: passed
      • tests/test_win_shm_interop.sh: passed
      • tests/test_service_win_interop.sh: passed
      • tests/test_service_win_shm_interop.sh: passed
      • tests/test_cache_win_interop.sh: passed
      • tests/test_cache_win_shm_interop.sh: passed
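
The three failing MSYS validation rows above are ratio-gate misses against the native baseline. A minimal sketch of that kind of gate follows; the names and the exact ratio definition are assumptions for illustration (the real gating logic lives in tests/run-windows-msys-validation.sh), with only the 2.00 ceiling taken from the log.

```go
package main

import "fmt"

// maxAllowedRatio mirrors the "max allowed 2.00" ceiling in the failing rows.
const maxAllowedRatio = 2.0

// rowRatio assumes the ratio compares baseline throughput against the
// measured throughput, so values above 1.0 mean the row got slower.
func rowRatio(baselineOpsPerSec, measuredOpsPerSec float64) float64 {
	return baselineOpsPerSec / measuredOpsPerSec
}

// rowPasses applies the gate: a row fails once its ratio exceeds the cap.
func rowPasses(ratio float64) bool {
	return ratio <= maxAllowedRatio
}

func main() {
	// A row like "shm-ping-pong c->rust @ 0: raw_ratio=7.526888" fails the gate.
	r := rowRatio(7_526_888, 1_000_000)
	fmt.Printf("ratio=%.6f pass=%v\n", r, rowPasses(r))
}
```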

Active Fix Pass (2026-04-15)

  • Fixes started after the full baseline pass exposed red lanes:
    • Go cgroups snapshot dispatch:
      • evidence:
        • Windows go test ./... panicked in TestWinServerDispatchSingleSnapshotZeroCapacity
        • the panic came from protocol.NewCgroupsBuilder() with a zero-capacity response buffer and an explicit maxItems = 3
      • fix:
        • expose protocol.CgroupsBuilderMinBytes(maxItems)
        • make SnapshotDispatch() return ErrOverflow before constructing a builder when the response buffer cannot reserve the requested directory slots
        • also reuse the helper in DispatchCgroupsSnapshot()
      • local verification:
        • cd src/go && go test ./pkg/netipc/protocol ./pkg/netipc/service/raw
        • result: passed
    • CTest Go fuzz timeout margin:
      • evidence:
        • first full Linux ctest failed only in go_FuzzDecodeHello
        • failure was context deadline exceeded at approximately the requested 30s fuzz duration
        • isolated rerun passed, showing the target is not deterministically crashing
      • fix:
        • keep these as short CTest smoke fuzzers, but run -fuzztime=20s
        • longer fuzz coverage remains owned by tests/run-extended-fuzz.sh
      • local verification:
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^go_FuzzDecodeHello$'
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^go_FuzzDecodeHelloAck$'
        • result: both passed
    • Windows verifier gflags.exe invocation:
      • evidence:
        • verifier log showed GFLAGS: Unexpected argument - 'P:/'
        • direct test confirmed MSYS argument conversion changed /p into P:/
      • fix:
        • call gflags.exe through env MSYS2_ARG_CONV_EXCL='*'
      • Windows verification:
        • bash tests/run-verifier-windows.sh test_named_pipe.exe
        • result: passed
    • C transport outbound limit enforcement:
      • evidence:
        • Windows C guard tests still failed the request-overflow reconnect cases after the first fix
        • debugger evidence on test_win_service_guards.exe showed:
          • first batch request payload was 32 bytes
          • session request limit was 8
          • final error remained NIPC_ERR_OVERFLOW
          • client request capacity stayed 8
        • source evidence:
          • C nipc_np_send() and nipc_uds_send() did not validate outgoing payload size against the negotiated directional payload limit before writing
          • the receiver returned LIMIT_EXCEEDED, so the client learned response capacity instead of request capacity
      • fix:
        • C POSIX UDS send now rejects over-limit outbound request/response payloads before writing
        • C Windows named-pipe send now rejects over-limit outbound request/response payloads before writing
        • the client can now learn request overflow locally and reconnect with a larger request proposal
    • Managed SHM pre-handshake request capacity:
      • evidence:
        • Windows C hybrid guard tests failed the request-overflow reconnect cases
        • SHM regions are created before the server reads HELLO
        • with the approved handshake contract, the server may echo any client request proposal up to 1 MiB
        • therefore a SHM request segment sized from the server's current learned request value can be smaller than the request size the handshake just accepted
      • fix:
        • POSIX and Windows managed servers now pre-create SHM request segments at NIPC_MAX_PAYLOAD_CAP + NIPC_HEADER_LEN
        • response segments remain server-sized because response capacity is server-owned
      • local verification:
        • cmake --build build -j4
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_shm|test_service|test_service_extra|test_cache|test_ping_pong)$'
        • result: passed
      • Windows verification after commit b5f9cd6:
        • bash tests/run-coverage-c-windows.sh
        • previous test_win_service_guards.exe failures are fixed: 198 passed, 0 failed
        • test_win_service_guards_extra.exe: 93 passed, 0 failed
        • test_win_service_extra.exe: 167 passed, 0 failed
        • remaining failure: coverage threshold; netipc_service_win.c is 88.3% against required 90%
        • next fix: add focused Windows service tests for real uncovered branches; do not lower the threshold
    • Windows verification after commit aed8e57:
      • bash tests/run-coverage-c-windows.sh
      • test_win_service_guards.exe: 226 passed, 0 failed
      • coverage results:
        • netipc_service_win.c: 90.5%
        • netipc_named_pipe.c: 92.6%
        • netipc_win_shm.c: 94.2%
      • result: all Windows C files meet the 90% coverage threshold
    • Linux C coverage after commit 1ce2446:
      • command: bash tests/run-coverage-c.sh 90
      • all C coverage test binaries passed
      • remaining failure: netipc_service.c is 87.2% against required 90%
      • uncovered branches include the POSIX typed-client request-overflow recovery paths and the no-growth overflow guard
      • next fix: add the POSIX equivalent of the Windows production request-overflow guard tests; do not lower the threshold
    • Linux C coverage after adding POSIX request-overflow tests:
      • test_service_extra: 90 passed, 0 failed
      • bash tests/run-coverage-c.sh 90: all C coverage test binaries passed
      • remaining failure: netipc_service.c improved to 88.0%, still below required 90%
      • next fix: add focused POSIX typed response-overflow recovery tests for baseline and hybrid profiles; do not lower the threshold
    • POSIX response-overflow evidence after adding response-overflow tests:
      • test_service_extra: 109 passed, 0 failed
      • bash tests/run-coverage-c.sh 90: all C coverage test binaries passed, but netipc_service.c stayed at 88.0%
      • important finding:
        • the tests recovered through broken-session retry after the server learned response capacity
        • the explicit NIPC_STATUS_LIMIT_EXCEEDED response path was still uncovered
        • this means a successful dispatch whose encoded response exceeds the negotiated response cap was reaching transport send as an oversized response instead of being converted into a zero-payload LIMIT_EXCEEDED response
      • fix:
        • POSIX and Windows service dispatch now convert successful-but-oversized responses into NIPC_STATUS_LIMIT_EXCEEDED before transport send
        • this aligns response-overflow recovery with the negotiated handshake contract instead of relying on transport breakage
    • Linux C coverage after the response-overflow service fix:
      • bash tests/run-coverage-c.sh 90: all C coverage test binaries passed
      • netipc_service.c improved to 89.0%, still below required 90%
      • next fix: add a focused POSIX managed-server unsupported-method response test to cover the explicit NIPC_STATUS_UNSUPPORTED path
    • Linux C coverage after adding focused POSIX unsupported-method and dispatch-overflow tests:
      • command: bash tests/run-coverage-c.sh 90
      • all C coverage test binaries passed
      • coverage results:
        • netipc_protocol.c: 96.3%
        • netipc_uds.c: 91.7%
        • netipc_shm.c: 92.6%
        • netipc_service.c: 90.7%
        • total: 92.3%
      • result: all POSIX C files now meet the 90% coverage threshold
      • added proof covers:
        • SHM unsupported-method response path after profile negotiation
        • typed dispatch overflow returning explicit LIMIT_EXCEEDED, learning response capacity, reconnecting, and succeeding
      • test organization:
        • new payload-limit coverage lives in tests/fixtures/c/test_service_payload_limits.c
        • new method-status coverage lives in tests/fixtures/c/test_service_method_limits.c
        • shared fixture setup lives in tests/fixtures/c/test_service_limit_helpers.h
    • Windows/MSYS validation on win11:~/src/plugin-ipc.git before the final POSIX service commit:
      • command:
        • bash tests/run-windows-msys-validation.sh /tmp/netipc-msys-validation-20260415-141321 3
      • result: passed
      • evidence:
        • summary: /tmp/netipc-msys-validation-20260415-141321/summary.txt
        • policy: /tmp/netipc-msys-validation-20260415-141321/bench-compare/policy.csv
        • joined comparison: /tmp/netipc-msys-validation-20260415-141321/bench-compare/joined.csv
      • caveat:
        • this was run before the current local service/test changes were committed and pulled to Windows, so the affected Windows checks still need to be rerun after sync
    • next full-suite obstacle discovered on 2026-04-15 during the first native Windows rebuild from commit b4a44fa:
      • Windows-only Rust typed-L2 helpers still reference removed public cgroups config fields
      • concrete failing files:
        • tests/fixtures/rust/src/bin/interop_service_win.rs
        • tests/fixtures/rust/src/bin/interop_cache_win.rs
        • bench/drivers/rust/src/bench_windows.rs
      • important nuance:
        • the stale fields are only on typed netipc::service::cgroups::{ClientConfig, ServerConfig}
        • raw transport configs in interop_named_pipe.rs, interop_uds.rs, and the Windows transport client helpers remain valid and must not be changed
    • next native Windows runtime obstacles discovered on 2026-04-15 during full ctest on commit 6651ba6:
      • test_named_pipe_go
        • failure:
          • TestSessionSendRejectsTooSmallPacketSize
          • Connect failed: protocol or layout version mismatch
        • source:
          • src/go/pkg/netipc/transport/windows/pipe_edge_test.go
        • verified root cause:
          • the test still expects connect success followed by Send() rejection for too-small negotiated packet size
          • current Windows named-pipe handshake rejects unusable packet sizes during HELLO_ACK negotiation with STATUS_INCOMPATIBLE
          • evidence:
            • src/go/pkg/netipc/transport/windows/pipe_edge_test.go
            • src/go/pkg/netipc/transport/windows/pipe.go
      • test_win_service_extra
        • failure:
          • server SHM create fault disconnects hybrid client
        • source:
          • tests/fixtures/c/test_win_service_extra.c
        • verified root cause:
          • the test still expects disconnect when server-side SHM creation fails after the new handshake guarantee work
          • current Windows managed server now pre-creates SHM before handshake and strips failing SHM profiles from the accept config, so the correct behavior is baseline-ready without SHM, not disconnect
          • evidence:
            • tests/fixtures/c/test_win_service_extra.c
            • src/libnetdata/netipc/src/service/netipc_service_win.c
      • implication:
        • the tree is not yet green on native Windows
        • these are behavioral/runtime regressions, not more stale API field references
      • local follow-up on 2026-04-15:
        • both remaining failures were patched as stale test expectations, not runtime/library logic changes:
        • src/go/pkg/netipc/transport/windows/pipe_edge_test.go
          • TestSessionSendRejectsTooSmallPacketSize now expects handshake rejection with ErrIncompatible
        • tests/fixtures/c/test_win_service_extra.c
          • first patch attempt was only partially correct:
            • the reconnect expectation was right
            • but the fault was armed too late to guarantee baseline fallback on the first session
          • verified root cause:
            • start_server_named() returns after setting the ready event, and the server thread can immediately enter nipc_server_run()
            • nipc_server_run() pre-creates SHM in server_prepare_accept_config() before any client connect
            • so arming NIPC_WIN_SHM_TEST_FAULT_CREATE_MAPPING after start_server_named() races with the already-prepared first session
          • implication:
            • the test must arm the create-mapping fault before starting the server thread if it wants deterministic baseline fallback on the first handshake
      • local Linux verification after these patches:
        • cmake --build build -j4
        • /usr/bin/ctest --test-dir build --output-on-failure -j4
        • result:
          • 100% tests passed, 0 tests failed out of 39
      • local Windows Go compile verification after these patches:
        • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows
      • external reviewer round after both Linux and native Windows went green:
        • verified true findings:
          • stale Go raw-client SHM attach comment still says the server creates SHM after handshake
          • stale Windows C test function name still says disconnects after the test now verifies baseline fallback
        • verified false positives:
          • "Go/Rust lack explicit request-payload over-cap rejection tests"
          • evidence already exists in:
            • src/go/pkg/netipc/transport/posix/uds_test.go
            • src/go/pkg/netipc/transport/windows/pipe_integration_test.go
            • src/crates/netipc/src/transport/posix_tests.rs

Current Handshake Audit (2026-04-14)

  • Wire-level negotiated fields are explicit in the protocol payloads:
    • supported_profiles
    • preferred_profiles
    • max_request_payload_bytes
    • max_request_batch_items
    • max_response_payload_bytes
    • max_response_batch_items
    • packet_size
    • evidence:
      • src/libnetdata/netipc/include/netipc/netipc_protocol.h
      • tests/test_protocol.c
  • The docs are currently too coarse:
    • they describe request limits generically as sender-driven and response limits generically as server-driven
    • they do not define a per-field negotiation matrix
    • evidence:
      • docs/level1-wire-envelope.md
      • docs/level1-posix-uds.md
      • docs/level1-windows-np.md
      • docs/level1-transport.md
  • The current transport implementations are aligned with one hard-coded generic policy:
    • request payload = max(client, server) capped at MAX_PAYLOAD_CAP
    • request batch items = max(client, server)
    • response payload = server value
    • response batch items = server value
    • packet size = min(client, server)
    • profile = highest bit from preferred intersection, else highest bit from full intersection
    • evidence:
      • C POSIX:
        • src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • C Windows:
        • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c
      • Go POSIX:
        • src/go/pkg/netipc/transport/posix/uds.go
      • Go Windows:
        • src/go/pkg/netipc/transport/windows/pipe.go
      • Rust POSIX:
        • src/crates/netipc/src/transport/posix.rs
      • Rust Windows:
        • src/crates/netipc/src/transport/windows.rs
  • The tests also encode that same generic request-side max() policy today:
    • C:
      • tests/fixtures/c/test_uds.c
    • Go POSIX:
      • src/go/pkg/netipc/transport/posix/uds_test.go
    • Go Windows:
      • src/go/pkg/netipc/transport/windows/pipe_integration_test.go
    • Rust:
      • src/crates/netipc/src/transport/posix_tests.rs
  • The old request-side max() policy is not arbitrary:
    • L2/L3 clients learn larger request capacities after overflow/reconnect
    • managed servers remember learned request/response capacities and advertise them to later sessions
    • this is why the existing transport handshake upgrades later clients to the larger server-advertised request envelope
    • evidence:
      • C:
        • src/libnetdata/netipc/src/service/netipc_service.c
        • src/libnetdata/netipc/src/service/netipc_service_win.c
      • Go:
        • src/go/pkg/netipc/service/raw/client.go
        • src/go/pkg/netipc/service/raw/client_windows.go
      • Rust:
        • src/crates/netipc/src/service/raw.rs
  • Under the user-approved contract, that is still insufficient:
    • the protocol contract must be:
      • client proposes
      • server decides
      • server returns final negotiated values in HELLO_ACK
    • but each field must define its own decision rule explicitly
    • therefore the current generic request-side max() rule cannot remain as an undocumented blanket policy
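
The audited generic policy can be collapsed into one sketch for reference. Names are illustrative; the real logic is spread across the C, Go, and Rust transport handshakes cited as evidence above.

```go
package main

import "fmt"

// Current hard-coded cap (NIPC_MAX_PAYLOAD_CAP today).
const maxPayloadCap uint32 = 256 << 20 // 256 MB

type sideLimits struct {
	reqPayload, reqItems, respPayload, respItems, packet uint32
}

func minU32(a, b uint32) uint32 {
	if a < b {
		return a
	}
	return b
}

func maxU32(a, b uint32) uint32 {
	if a > b {
		return a
	}
	return b
}

// highestBit returns the highest set bit in mask, or 0 when mask is 0.
func highestBit(mask uint32) uint32 {
	var hi uint32
	for b := uint32(1); b != 0; b <<= 1 {
		if mask&b != 0 {
			hi = b
		}
	}
	return hi
}

// negotiateCurrent applies the generic policy listed above: request
// sides use max() (payload capped), response sides take the server
// value, packet size takes min(), and the profile is the highest
// preferred-intersection bit, else the highest intersection bit.
func negotiateCurrent(client, server sideLimits,
	clientSup, clientPref, serverSup, serverPref uint32) (sideLimits, uint32) {
	intersection := clientSup & serverSup
	profile := highestBit(intersection & clientPref & serverPref)
	if profile == 0 {
		profile = highestBit(intersection)
	}
	return sideLimits{
		reqPayload:  minU32(maxU32(client.reqPayload, server.reqPayload), maxPayloadCap),
		reqItems:    maxU32(client.reqItems, server.reqItems),
		respPayload: server.respPayload,
		respItems:   server.respItems,
		packet:      minU32(client.packet, server.packet),
	}, profile
}

func main() {
	agreed, profile := negotiateCurrent(
		sideLimits{reqPayload: 8 << 10, reqItems: 4, packet: 4096},
		sideLimits{reqPayload: 64 << 10, reqItems: 2, respPayload: 1 << 20, respItems: 8, packet: 8192},
		0b111, 0b100, 0b011, 0b010)
	fmt.Printf("agreed=%+v profile=%b\n", agreed, profile)
}
```

Note how the request side silently upgrades the client to the larger server-advertised envelope; that is exactly the blanket behavior the user-approved contract says cannot remain undocumented.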

Current Obstacles For The Requested Handshake Rewrite (2026-04-14)

  • Obstacle 1: the current SHM guarantee is false
    • current docs still say the data plane switches to SHM after the handshake completes
    • current code still provisions SHM after the handshake-selected profile is already returned
    • implication:
      • the requested rule "negotiated profile is guaranteed to work after handshake" requires architectural change, not only docs/tests
  • Obstacle 2: public typed APIs and current docs still treat max_response_batch_items as independently tunable
    • current HELLO and HELLO_ACK payload layouts still carry it as if it were independently negotiated
    • current docs and configs still expose it as a separate knob
    • implication:
      • the requested contract requires semantic cleanup across docs, APIs, codecs, and tests so it stays symmetric with request batch items
  • Obstacle 3: public typed L2 APIs currently expose internal handshake knobs the user wants removed
    • public service client/server configs still expose max_request_payload_bytes
    • public configs also still expose max_response_batch_items
    • implication:
      • the requested contract requires public API cleanup in C / Rust and likely Go typed surfaces, not only handshake docs
  • SHM currently violates the user's transport-lock expectation at the architectural level:
    • Level 1 handshake already returns selected_profile = SHM
    • but SHM create/attach still happens later in L2 service code
    • this is why Thiago's PR added post-handshake fallback in vendored POSIX C service code
    • under the user-approved contract, that fallback must not be adopted as the upstream fix
    • evidence:
      • handshake/profile selection:
        • docs/level1-transport.md
        • docs/level1-posix-uds.md
        • docs/level1-windows-np.md
      • late SHM setup:
        • src/libnetdata/netipc/src/service/netipc_service.c
        • src/libnetdata/netipc/src/service/netipc_service_win.c
        • src/go/pkg/netipc/service/raw/client.go
        • src/go/pkg/netipc/service/raw/client_windows.go
        • src/crates/netipc/src/service/raw.rs

Negotiated Field Policy Draft (2026-04-14)

  • This section is the corrected handshake matrix draft derived from the user's decisions so far.

  • Global rule:

    • the client sends HELLO
    • the server decides the final session values
    • the server returns those values in HELLO_ACK
    • every field has its own decision rule
    • on handshake failure, the server sends HELLO_ACK with non-OK transport_status and then closes
  • Important distinction:

    • some HELLO fields are proposal inputs only
    • the operational values for the session are the HELLO_ACK fields
  • auth_token -> transport_status

    • client sends:
      • auth_token
    • server does:
      • validates exact match
    • server returns:
      • no negotiated auth value
      • only transport_status
    • operational meaning:
      • OK means authorized
      • AUTH_FAILED means handshake rejected before session establishment
  • supported_profiles + preferred_profiles -> server_supported_profiles + intersection_profiles + selected_profile

    • client sends:
      • supported_profiles
      • preferred_profiles
    • server does:
      • computes intersection = client_supported & server_supported
      • if intersection == 0, returns transport_status = UNSUPPORTED
      • otherwise selects the final profile
    • current source-of-truth selection algorithm:
      • highest bit of (intersection & client_preferred & server_preferred)
      • else highest bit of intersection
    • server returns:
      • server_supported_profiles
      • intersection_profiles
      • selected_profile
    • operational meaning:
      • client does not continue using its own supported_profiles
      • both sides use selected_profile for the session
    • user-approved invariant:
      • once returned by handshake, the profile is locked for the session
      • if SHM is selected, SHM must already be usable for that session
  • max_request_payload_bytes -> agreed_max_request_payload_bytes

    • client sends:
      • proposed request payload ceiling
    • user direction so far:
      • typed L2 should proactively compute and propose this from method schema, desired batch size, and dynamic-field assumptions
      • the server should not increase it
      • preferred behavior is to echo it back unchanged
    • concrete current protocol constraint:
      • source-of-truth currently enforces a hard payload cap of NIPC_MAX_PAYLOAD_CAP = 256 MB
    • evidence:
      • src/libnetdata/netipc/include/netipc/netipc_protocol.h
      • src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c
    • server returns:
      • agreed_max_request_payload_bytes
    • operational meaning:
      • both sides use agreed_max_request_payload_bytes for the session
    • user decision now recorded:
      • replace the current 256 MB hard cap with 1 MiB
      • if the client proposes a larger value, reject the handshake
      • do not silently cap-down
  • max_request_batch_items -> agreed_max_request_batch_items

    • client sends:
      • proposed request batch-item ceiling
    • user direction so far:
      • client proposes intended batch size
      • preferred behavior is to echo it back unchanged
      • if there is no concrete server-side constraint, echo it back unchanged
    • concrete current evidence:
      • none found yet for a protocol-level hard maximum analogous to NIPC_MAX_PAYLOAD_CAP
      • current source-of-truth raises this value with max(client, server), but that is existing behavior, not evidence of necessity
    • server returns:
      • agreed_max_request_batch_items
    • operational meaning:
      • both sides use agreed_max_request_batch_items for the session
    • user decision now recorded:
      • if there is no concrete protocol-level constraint, echo the client proposal back unchanged
  • max_response_payload_bytes -> agreed_max_response_payload_bytes

    • client sends:
      • optional hint/current expectation
    • user-approved direction:
      • server ignores the client for the final value
      • server returns the value it will actually use
    • server returns:
      • agreed_max_response_payload_bytes
    • operational meaning:
      • both sides use agreed_max_response_payload_bytes for the session
  • max_response_batch_items -> agreed_max_response_batch_items

    • concrete current evidence:
      • current typed batch semantics are symmetric by position
      • docs:
        • docs/level1-transport.md says batch request/response items are correlated by array position
        • docs/level2-typed-api.md says the managed server assembles one batch response preserving request order
      • implementations:
        • C sets resp_hdr.item_count = hdr.item_count for batch responses
        • Go sets resp_hdr.ItemCount = hdr.ItemCount for batch responses
        • Rust sets resp_hdr.item_count = hdr.item_count for batch responses
    • user decision now recorded:
      • keep the field on the wire
      • require strict symmetry with request batch items
      • server must return the same effective batch-item ceiling for requests and responses for the session
    • impact:
      • no handshake layout removal for this field
      • docs, APIs, codecs, and tests still need coordinated semantic cleanup so the field is never treated as independently negotiated
  • packet_size -> agreed_packet_size

    • client sends:
      • proposed transport packet size
    • user-approved direction:
      • server decides with min(client, server)
    • server returns:
      • agreed_packet_size
    • operational meaning:
      • both sides use agreed_packet_size for chunking in the session
  • session_id

    • client sends:
      • nothing
    • server does:
      • allocates a per-session identifier
    • server returns:
      • session_id
    • operational meaning:
      • identifies this session
      • used in per-session SHM naming/derivation

TL;DR

  • Analyze how plugin-ipc should be integrated into the Netdata repo and build.
  • Before any Netdata integration, implement transparent SHM resizing in plugin-ipc itself.
  • Validate that feature thoroughly first, including full C/Rust/Go interop matrices on Unix and Windows.
  • Use it first to replace the current cgroups.plugin -> ebpf.plugin metadata channel on Linux.
  • Make the library available to C, Rust, and Go code inside Netdata.
  • Record integration design decisions before implementation.
  • User-approved local workspace cleanup in this slice:
    • remove the generated Go test / helper binaries after the push
    • affected files:
      • src/go/cgroups.test.exe
      • src/go/main
      • src/go/raw.test.exe
      • src/go/windows.test.exe
  • User-directed benchmark follow-up now in scope:
    • treat the Linux shm-batch-ping-pong C/Rust spread as two independent problems:
      • Rust server penalty versus C server with the same C client
      • Rust client penalty versus C client with the same C server
    • worst-case rust -> rust is the compounded result of both penalties
    • objective:
      • identify the exact Rust-side hot paths responsible for the server-side and client-side losses
      • fix Rust until the Linux C/Rust SHM batch path is materially closer to the C baseline
    • scope expansion approved by the user:
      • do the same benchmark-delta investigation across all material language/client/server combinations
      • identify every real implementation issue behind the benchmark gaps
      • fix the implementation issues, not just explain them
      • keep benchmark artifacts and benchmark-derived docs in sync after each validated fix
    • first verified benchmark-delta findings:
      • POSIX shm-batch-ping-pong with client ∈ {c,rust} and server ∈ {c,rust} still has a real Rust penalty on both sides:
        • c -> c = 64,148,960
        • c -> rust = 58,334,803
        • rust -> c = 52,277,542
        • rust -> rust = 48,220,338
        • implication:
          • Rust server penalty is real
          • Rust client penalty is larger
          • rust -> rust is the compounded case
      • benchmark-driver distortion is also real and must be fixed before drawing deeper transport conclusions:
        • Go lookup benchmark does a synthetic linear scan instead of using the actual O(1) cache structure:
          • bench/drivers/go/main.go
        • Rust lookup benchmark also does a synthetic linear scan:
          • bench/drivers/rust/src/main.rs
        • Rust actual cache lookup currently allocates name.to_string() on every lookup:
          • src/crates/netipc/src/service/raw.rs
        • Go and Rust batch / pipeline clients still do avoidable hot-loop allocations that C avoids or minimizes:
          • Go:
            • bench/drivers/go/main.go
          • Rust:
            • bench/drivers/rust/src/main.rs
  • Current execution scope:
    • remove the multi-method service drift from docs, code, tests, and public APIs
    • align the implementation to one-service-kind-per-endpoint
    • implement the accepted SHM resize / renegotiation behavior
    • eliminate contradictory wording and examples across the repository
    • refresh the Linux and Windows benchmark matrices on the current tree
    • update benchmark artifacts and all benchmark-derived docs so everything is in sync
    • investigate the remaining benchmark spreads and identify whether they reflect real transport/runtime inefficiency, measurement distortion, or pair-specific implementation overhead
    • correct the benchmark build path so C benchmark results are generated from optimized C libraries, not from a local Debug CMake tree
  • Current implementation status:
    • docs/specs/TODOs now explicitly state service-oriented discovery and one request kind per endpoint
    • Go public cgroups APIs and Go raw service/tests were rewritten to the single-kind model
    • cd src/go && go test -count=1 ./pkg/netipc/service/raw now passes after aligning the raw client/server with learned SHM req/resp capacities and transparent overflow-driven reconnect/retry
    • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups now passes
    • Rust public cgroups facade now uses the single-kind raw server constructor instead of the old multi-handler bundle
    • targeted Rust verification now passes:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::cgroups:: -- --test-threads=1
    • Rust raw Unix tests no longer use the old mixed pingpong_handlers() helper
    • the Rust raw service subset now passes after binding increment-only and string-reverse-only endpoints explicitly and teaching the raw client/server the learned SHM req/resp resize path:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
    • Go raw L2 now tracks learned request/response capacities, treats STATUS_LIMIT_EXCEEDED as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
    • Rust raw L2 now tracks learned request/response capacities, treats STATUS_LIMIT_EXCEEDED as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
    • Go and Rust transport listeners now expose payload-limit setters so the server can advertise learned capacities to later clients before accept():
      • Go POSIX: src/go/pkg/netipc/transport/posix/uds.go
      • Go Windows: src/go/pkg/netipc/transport/windows/pipe.go
      • Rust POSIX: src/crates/netipc/src/transport/posix.rs
      • Rust Windows: src/crates/netipc/src/transport/windows.rs
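The overflow-driven recovery described above can be sketched as a client-side loop. All names here are illustrative stand-ins: the toy `Client` stands in for the real L2 client and `CallError::LimitExceeded` for STATUS_LIMIT_EXCEEDED.

```rust
#[derive(Debug, PartialEq)]
enum CallError { LimitExceeded }

// Toy stand-in for the L2 client: tracks a learned request capacity.
struct Client { req_capacity: usize }

impl Client {
    fn call(&self, req: &[u8]) -> Result<usize, CallError> {
        if req.len() > self.req_capacity {
            return Err(CallError::LimitExceeded); // server signaled overflow
        }
        Ok(req.len()) // echo the length as a dummy "response"
    }

    // Reconnect and renegotiate a larger capacity: a new session,
    // not an in-place SHM resize.
    fn reconnect_with_capacity(&mut self, needed: usize) {
        self.req_capacity = needed;
    }
}

// Transparent retry for overflow-safe calls: may reconnect more than once
// while capacities grow, per the accepted semantics.
fn call_with_resize(client: &mut Client, req: &[u8]) -> usize {
    loop {
        match client.call(req) {
            Ok(n) => return n,
            Err(CallError::LimitExceeded) => {
                let needed = req.len().max(client.req_capacity * 2);
                client.reconnect_with_capacity(needed);
            }
        }
    }
}

fn main() {
    let mut c = Client { req_capacity: 8 };
    let big = vec![0u8; 100];
    assert_eq!(call_with_resize(&mut c, &big), 100);
    assert!(c.req_capacity >= 100);
    println!("ok");
}
```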
    • src/crates/netipc/src/service/raw.rs no longer exposes the generic Handlers bundle or the transitional new_single_kind / with_workers_single_kind constructors
    • src/crates/netipc/src/service/raw.rs now models managed servers as single-kind endpoints directly:
      • ManagedServer::new(..., expected_method_code, handler)
      • ManagedServer::with_workers(..., expected_method_code, handler, worker_count)
    • Rust POSIX and Windows benchmark drivers now use the single-kind raw service surface instead of the deleted multi-handler Handlers bundle:
      • bench/drivers/rust/src/main.rs
      • bench/drivers/rust/src/bench_windows.rs
    • src/crates/netipc/src/service/raw_unix_tests.rs and src/crates/netipc/src/service/raw_windows_tests.rs now use that single-kind raw service surface directly instead of feeding a generic handler bundle into the raw server
    • verified source-level residue scan for src/crates/netipc/src/service/raw_windows_tests.rs is now clean:
      • no remaining Handlers
      • no remaining test_cgroups_handlers()
      • no remaining increment_handlers()
    • verified source-level residue scan for src/crates/netipc/src/service/raw.rs and src/crates/netipc/src/service/raw_unix_tests.rs is now clean:
      • no remaining Handlers
      • no remaining new_single_kind
      • no remaining with_workers_single_kind
    • C public naming drift was reduced from plural handler bundles to singular service-handler naming
    • tests/fixtures/c/test_win_service.c is now snapshot-only; it no longer starts a typed snapshot service and then exercises increment / string-reverse / batch calls against it
    • source-level cleanup of the remaining Windows C fixtures is only partial so far:
      • the obvious typed snapshot .on_increment / .on_string_reverse bundle drift was removed from:
        • tests/fixtures/c/test_win_service_extra.c
        • tests/fixtures/c/test_win_stress.c
        • tests/fixtures/c/test_win_service_guards.c
        • tests/fixtures/c/test_win_service_guards_extra.c
      • but real win11 compilation later proved these files still contain stale calls to removed C APIs and stale raw-server assumptions
    • verified source-level residue scan across the touched Windows C fixtures is therefore not enough on its own:
      • it proves only that the obvious typed-handler bundle names were removed
      • it does not prove runtime or even compile-time correctness on Windows
    • verified source-level residue scan for the touched Windows Go raw helpers/tests is now clean:
      • no remaining Handlers{...} bundle initializers
      • no remaining winTestHandlers() / winFailingHandlers() helpers
      • no remaining server.handlers references in the Windows raw tests
    • Windows Go package cross-compile proof now passes from this Linux host:
      • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw
      • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/cgroups
    • the Unix interop/service/cache matrix now passes end-to-end after the resize rewrite:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop)$'
    • the broader Unix shm/service/cache slice across C, Rust, and Go now also passes:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go)$'
    • the previously exposed POSIX UDS mismatch is now resolved:
      • Rust cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1 now passes 299/299
      • the stale transport tests were rewritten to match the accepted directional negotiation semantics:
        • requests are sender-driven
        • responses are server-driven
      • C test_uds now proves directional negotiation explicitly and keeps direct receive-limit coverage through a raw malformed-response path
      • the broader non-fuzz Unix CTest sweep now passes end-to-end:
        • /usr/bin/ctest --test-dir build --output-on-failure -E '^(fuzz_protocol_30s|go_FuzzDecodeHeader|go_FuzzDecodeChunkHeader|go_FuzzDecodeHello|go_FuzzDecodeHelloAck|go_FuzzDecodeCgroupsRequest|go_FuzzDecodeCgroupsResponse|go_FuzzBatchDirDecode|go_FuzzBatchItemGet)$'
        • result: 28/28 passed
    • the public docs now match the accepted directional handshake semantics:
      • docs/level1-wire-envelope.md explicitly says request limits are sender-driven and response limits are server-driven
      • docs/getting-started.md no longer documents the deleted Rust CgroupsHandlers / CgroupsServer surface
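Per the accepted contract (client proposes, server decides, the server returns the final values), the directional rule could be sketched as below. The struct and function names are illustrative, not the wire types, and the server is assumed to accept the sender's request proposal as-is.

```rust
#[derive(Debug, PartialEq)]
struct Limits { max_request: u32, max_response: u32 }

// Hypothetical server-side combination of the two proposals:
// - request limits are sender-driven: the client's proposal governs
// - response limits are server-driven: the server's own value governs
fn negotiate(client_hello: &Limits, server_config: &Limits) -> Limits {
    Limits {
        max_request: client_hello.max_request,
        max_response: server_config.max_response,
    }
}

fn main() {
    let client = Limits { max_request: 4096, max_response: 1024 };
    let server = Limits { max_request: 2048, max_response: 65536 };
    // Note: NOT min()-style negotiation; each field is decided independently.
    let final_limits = negotiate(&client, &server);
    assert_eq!(final_limits, Limits { max_request: 4096, max_response: 65536 });
    println!("ok");
}
```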
    • Windows transport test sources were aligned to the same directional contract:
      • Go src/go/pkg/netipc/transport/windows/pipe_integration_test.go no longer expects the old min-style negotiation
      • Rust src/crates/netipc/src/transport/windows.rs now contains a matching directional negotiation test
      • Go Windows transport tests still have cross-compile proof from this Linux host:
        • cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows
    • local source checks are clean for the touched Windows C files:
      • git diff --check -- tests/fixtures/c/test_win_stress.c tests/fixtures/c/test_win_service_guards.c tests/fixtures/c/test_win_service_guards_extra.c TODO-netdata-plugin-ipc-integration.md
    • local source checks are also clean for the touched Go/Rust raw files:
      • git diff --check -- src/crates/netipc/src/service/raw.rs src/crates/netipc/src/service/raw_unix_tests.rs src/go/pkg/netipc/service/raw/client.go src/go/pkg/netipc/service/raw/client_windows.go src/go/pkg/netipc/service/raw/shm_unix_test.go src/go/pkg/netipc/service/raw/helpers_windows_test.go src/go/pkg/netipc/service/raw/more_windows_test.go src/go/pkg/netipc/service/raw/shm_windows_test.go TODO-netdata-plugin-ipc-integration.md
    • limitation:
      • this Linux host does not have x86_64-w64-mingw32-gcc
      • so local source cleanup alone is not enough for the edited Windows C fixtures
      • the same host limitation means the raw_windows_tests.rs source cleanup is not backed by a real Windows Rust compile/run proof from this environment either
      • the touched Windows Go packages now have cross-compile proof, but still do not have a real Windows runtime proof from this environment
    • current verified Windows runtime status from the real win11 workflow:
      • the documented ssh win11 + MSYSTEM=MINGW64 toolchain path works and has been used for real validation
      • after syncing the local tree, cmake --build build -j4 on win11 exposed real stale C fixture/API mismatches that were not visible from Linux source scans alone
      • the first verified win11 failure classes were:
        • stale removed client helpers:
          • nipc_client_call_increment
          • nipc_client_call_increment_batch
          • nipc_client_call_string_reverse
        • stale internal error enum usage:
          • NIPC_ERR_INTERNAL_ERROR
        • stale raw-server handler signature assumptions:
          • old bool raw handlers instead of nipc_error_t (*)(..., const nipc_header_t *, ...)
        • stale nipc_server_init(...) argument ordering under the internal test macro path
        • stale client struct field assumptions such as client.request_buf_size
      • those compile-time failures have now been corrected locally and revalidated on win11:
        • test_win_service_extra.exe now builds and passes on win11
      • the remaining active Windows C problem was then narrower and runtime-only; after correcting the stale Windows C fixture/API mismatches and the baseline request-overflow signaling gap, test_win_service_guards.exe now passes on win11:
        • === Results: 141 passed, 0 failed ===
        • the previous apparent timeout was not a persistent runtime hang:
          • later reruns completed normally once the stale one-item batch test drift was removed
        • the last real guard-binary contradiction was:
          • a one-item increment "batch" test still expecting reconnect/growth
        • that expectation was wrong under the accepted semantics:
          • one-item increment batches are normalized to the plain increment path
          • the guard was rewritten to use a real 2-item batch for baseline request-resize coverage
      • the rest of the edited Windows C runtime slice has now been validated on win11 too:
        • test_win_service.exe:
          • === Results: 80 passed, 0 failed ===
        • test_win_service_extra.exe:
          • === Results: 82 passed, 0 failed ===
        • test_win_service_guards_extra.exe:
          • === Results: 93 passed, 0 failed ===
        • test_win_stress.exe:
          • === Results: 1 passed, 0 failed ===
        • a combined rerun of all edited Windows C binaries also passed cleanly on win11
      • the earlier test_win_service.exe timeout is not currently reproducible as a deterministic bug:
        • it timed out once in a combined slice and once in an early soak run
        • after the stale guard/test contradictions were removed, a focused rerun passed
        • a subsequent combined rerun passed
        • a targeted 3-run win11 soak of test_win_service.exe also passed 3/3
        • working theory:
          • that earlier timeout was a transient host/process stall, not a currently reproducible library correctness bug
      • a real L2 behavior gap was exposed and fixed during this win11 investigation:
        • on baseline request overflow, the server session loop now emits a zero-payload LIMIT_EXCEEDED response before disconnecting, instead of silently breaking the session
        • this fix was needed for transparent request-side resize/reconnect to work on Windows baseline transport at all
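The fixed server-side behavior (signal, then disconnect, instead of silently breaking the session) can be sketched with toy types; the `Session` struct, `send_status`, and the status string below are stand-ins for the real C server internals.

```rust
// Toy session: records what the server sent before closing.
struct Session { req_capacity: usize, sent: Vec<&'static str>, open: bool }

impl Session {
    fn send_status(&mut self, status: &'static str) { self.sent.push(status); }
    fn disconnect(&mut self) { self.open = false; }
}

// On baseline request overflow, emit a zero-payload LIMIT_EXCEEDED response
// BEFORE disconnecting, so the client can resize/reconnect transparently.
fn handle_request(session: &mut Session, request_len: usize) {
    if request_len > session.req_capacity {
        session.send_status("LIMIT_EXCEEDED"); // zero-payload status response
        session.disconnect();                  // then end the session
        return;
    }
    session.send_status("OK");
}

fn main() {
    let mut s = Session { req_capacity: 16, sent: vec![], open: true };
    handle_request(&mut s, 64);
    assert_eq!(s.sent, vec!["LIMIT_EXCEEDED"]);
    assert!(!s.open);
    println!("ok");
}
```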
      • current remaining Windows Rust runtime blocker:
        • focused win11 run:
          • timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1
        • current observed behavior:
          • build completes
          • test process prints:
            • running 1 test
            • test service::cgroups::windows_tests::test_cache_round_trip_windows ...
          • then stalls without completing
        • strongest current evidence:
          • Rust raw Windows tests already implement reliable Windows shutdown by:
            • storing the service name + wake client config
            • setting running_flag = false
            • issuing a dummy NpSession::connect(...) to wake the blocking ConnectNamedPipe()
          • cgroups Windows tests and Rust Windows interop binaries still use the weaker pattern:
            • only running_flag = false
            • no wake connection
          • the Windows accept loop in src/crates/netipc/src/service/raw.rs blocks in listener.accept(), which ultimately blocks in ConnectNamedPipe(), so running_flag = false alone is not sufficient to stop the server reliably on Windows
        • working theory:
          • the cache test body may already be completing
          • the stall is very likely in Windows server shutdown/join, not in snapshot/cache decoding itself
      • that Rust Windows blocker is now verified fixed on win11:
        • fix:
          • cgroups Windows tests and Rust Windows interop binaries now use the same reliable Windows stop pattern already used by the Rust raw Windows tests:
            • set running_flag = false
            • then issue a wake connection so the blocking ConnectNamedPipe() returns and the accept loop can observe shutdown
        • focused proof:
          • timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1
          • result:
            • test service::cgroups::windows_tests::test_cache_round_trip_windows ... ok
        • full Rust Windows lib proof:
          • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
          • result:
            • 176 passed
            • 0 failed
            • 1 ignored
        • factual conclusion:
          • the live bug was stale Windows shutdown/test-fixture behavior, not a current Rust cache decode/refresh correctness issue
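The wake-connection stop pattern above generalizes to any blocking accept loop. A runnable sketch, using a TCP listener as a stand-in for the blocking ConnectNamedPipe() accept path:

```rust
use std::net::{TcpListener, TcpStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Returns true once the blocked accept loop has been woken and joined.
fn stop_via_wake_connection() -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let running = Arc::new(AtomicBool::new(true));
    let flag = running.clone();

    let server = thread::spawn(move || {
        for conn in listener.incoming() {
            // The flag alone cannot stop this loop: accept() blocks until a
            // connection arrives, so the flag is only observed afterwards.
            if !flag.load(Ordering::SeqCst) {
                break; // woken by the dummy connection below
            }
            drop(conn); // would normally serve the session here
        }
    });

    // Reliable stop: clear the flag FIRST, then issue a wake connection so
    // the blocked accept returns and the loop can observe the shutdown.
    running.store(false, Ordering::SeqCst);
    let _ = TcpStream::connect(addr);
    server.join().is_ok()
}

fn main() {
    assert!(stop_via_wake_connection());
    println!("ok");
}
```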
      • broader real Windows interop/service/cache proof is now also green on win11:
        • command:
          • timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"
        • result:
          • test_named_pipe_interop: passed
          • test_win_shm_interop: passed
          • test_service_win_interop: passed
          • test_service_win_shm_interop: passed
          • test_cache_win_interop: passed
          • test_cache_win_shm_interop: passed
          • summary:
            • 100% tests passed, 0 tests failed out of 6
    • targeted C rebuild and runtime verification now passes:
      • cmake --build build --target test_service test_hardening test_ping_pong
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_hardening|test_ping_pong)$'
    • the latest naming / contract cleanup slice is now backed by both local Linux and real win11 proof:
      • local Linux rerun:
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_hardening|test_ping_pong)$'
        • result:
          • 100% tests passed, 0 failed
      • after syncing this slice's edited files to win11, targeted rebuild passed:
        • cmake --build build -j4 --target test_win_service test_win_service_extra test_win_service_guards test_win_service_guards_extra
      • direct win11 runtime proof for the edited guard binaries passed:
        • ./test_win_service_guards.exe
          • result:
            • === Results: 141 passed, 0 failed ===
        • ./test_win_service_guards_extra.exe
          • result:
            • === Results: 93 passed, 0 failed ===
      • direct win11 runtime proof for the edited service binaries also passed via CTest:
        • ctest --test-dir build --output-on-failure -R "^(test_win_service|test_win_service_extra)$"
        • result:
          • test_win_service: passed
          • test_win_service_extra: passed
    • benchmark refresh on the current tree is now complete and synced:
      • factual root cause of the benchmark blocker:
        • the C and Rust batch benchmark clients still generated random batch sizes in the range 1..1000
        • the actual batch protocol normalizes item_count == 1 to the non-batch path
        • Go was already correct and generated 2..1000, which is why the same C batch server still interoperated with the Go client
      • fixed in:
        • bench/drivers/c/bench_posix.c
        • bench/drivers/c/bench_windows.c
        • bench/drivers/rust/src/main.rs
        • bench/drivers/rust/src/bench_windows.rs
        • bench/drivers/go/main.go
        • tests/run-posix-bench.sh
        • tests/run-windows-bench.sh
      • specific fixes:
        • batch benchmark generators now use 2..1000 items for real batch scenarios
        • Windows benchmark failure reporting now defines server_out before calling dump_server_output
      • targeted proof after the fix:
        • the previously failing pairs now succeed locally and on win11:
          • uds-batch-ping-pong c->c
          • uds-batch-ping-pong rust->c
          • shm-batch-ping-pong c->c
          • shm-batch-ping-pong rust->c
          • np-batch-ping-pong c->c
          • np-batch-ping-pong rust->c
      • clean official reruns:
        • Linux:
          • bash tests/run-posix-bench.sh benchmarks-posix.csv 5
          • result:
            • Total measurements: 201
        • Windows:
          • ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/run-windows-bench.sh benchmarks-windows.csv 5'
          • result:
            • Total measurements: 201
      • clean generated artifacts:
        • bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md
          • result:
            • All performance floors met
        • ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md'
          • result:
            • All performance floors met
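The batch-size root cause above reduces to never emitting item_count == 1 from the benchmark generators. A hedged sketch of the 2..=1000 rule (xorshift64 is an arbitrary stand-in PRNG here, not the drivers' actual generator):

```rust
// Pick batch sizes in 2..=1000 so a generated "batch" can never be
// normalized down to the single-item (non-batch) protocol path.
fn next_batch_size(state: &mut u64) -> u32 {
    // xorshift64 stand-in PRNG (state must start nonzero)
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    2 + (*state % 999) as u32 // maps into 2..=1000
}

fn main() {
    let mut state = 0x9E3779B97F4A7C15u64;
    for _ in 0..10_000 {
        let n = next_batch_size(&mut state);
        assert!((2..=1000).contains(&n));
    }
    println!("ok");
}
```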
    • the follow-up benchmark spread investigation has now established a real benchmark-build bug on POSIX:
      • the local benchmark runner used:
        • C from build/bin/bench_posix_c
        • Rust from src/crates/netipc/target/release/bench_posix
        • Go from build/bin/bench_posix_go
      • the local CMake tree used for the C benchmark was configured as:
        • build/CMakeCache.txt:
          • CMAKE_BUILD_TYPE:STRING=Debug
      • the benchmark target itself added -O2, but the C libraries it linked against were still unoptimized:
        • build/CMakeFiles/bench_posix_c.dir/flags.make:
          • C_FLAGS = -g -std=gnu11 -O2
        • build/CMakeFiles/netipc_protocol.dir/flags.make:
          • C_FLAGS = -g -std=gnu11
        • build/CMakeFiles/netipc_service.dir/flags.make:
          • C_FLAGS = -g -std=gnu11
      • a dedicated optimized benchmark tree proved this materially changes the published POSIX rows:
        • release build setup:
          • cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release
          • cmake --build build-release --target bench_posix_c bench_posix_go -j8
        • direct targeted reruns:
          • published shm-batch-ping-pong c->c:
            • 25,947,290
          • optimized C libs shm-batch-ping-pong c(rel)->c(rel):
            • 63,699,472
          • published uds-pipeline-batch-d16 c->c:
            • 49,512,090
          • optimized C libs uds-pipeline-batch-d16 c(rel)->c(rel):
            • 103,212,623
        • mixed-language targeted reruns also moved sharply upward when the C side used optimized libraries:
          • intended shm-batch-ping-pong c(rel)->rust:
            • 57,122,454
          • intended shm-batch-ping-pong rust->c(rel):
            • 52,041,263
          • intended uds-pipeline-batch-d16 c(rel)->rust:
            • 91,093,895
          • intended uds-pipeline-batch-d16 rust->c(rel):
            • 101,978,294
      • implemented fix:
        • tests/run-posix-bench.sh now configures and uses a dedicated optimized benchmark tree:
          • default: build-bench-posix
          • build type: Release
        • tests/run-windows-bench.sh now configures and uses a dedicated optimized benchmark tree:
          • default: build-bench-windows
          • build type: Release
          • explicit MinGW toolchain export on win11
      • factual conclusion:
        • the old checked-in POSIX benchmark report was distorted by linking the C benchmark binary against Debug-built C libraries
        • the current checked-in POSIX and Windows benchmark artifacts now come from the corrected dedicated benchmark build paths
    • the Windows benchmark tree is not affected by the same local Debug-build distortion:
      • ssh win11 '... grep CMAKE_BUILD_TYPE build/CMakeCache.txt'
        • CMAKE_BUILD_TYPE:STRING=RelWithDebInfo
      • the previously suspicious Windows SHM batch outlier did not survive the corrected rerun:
        • old checked-in row:
          • shm-batch-ping-pong c->rust = 9,282,667
        • corrected clean rerun row:
          • shm-batch-ping-pong c->rust = 55,868,058
      • final artifact sanity checks:
        • benchmarks-posix.csv
          • rows: 201
          • duplicate keys: 0
          • zero-throughput rows: 0
        • benchmarks-windows.csv
          • rows: 201
          • duplicate keys: 0
          • zero-throughput rows: 0
      • checked-in benchmark docs are now synced to the refreshed artifacts:
        • benchmarks-posix.csv
        • benchmarks-posix.md
        • benchmarks-windows.csv
        • benchmarks-windows.md
        • README.md
      • corrected max-throughput ranges from the current checked-in artifacts:
        • POSIX:
          • uds-ping-pong: 182,963 to 231,160
          • shm-ping-pong: 2,460,317 to 3,450,961
          • uds-batch-ping-pong: 27,182,404 to 40,240,940
          • shm-batch-ping-pong: 31,250,784 to 64,148,960
          • uds-pipeline-d16: 568,373 to 735,829
          • uds-pipeline-batch-d16: 51,960,946 to 102,954,841
          • snapshot-baseline: 158,948 to 205,624
          • snapshot-shm: 1,006,053 to 1,738,616
          • lookup: 114,556,227 to 203,279,430
        • Windows:
          • np-ping-pong: 18,241 to 21,039
          • shm-ping-pong: 2,099,392 to 2,715,487
          • np-batch-ping-pong: 7,013,700 to 8,550,220
          • shm-batch-ping-pong: 36,494,096 to 58,768,397
          • np-pipeline-d16: 245,420 to 270,488
          • np-pipeline-batch-d16: 28,977,365 to 41,270,903
          • snapshot-baseline: 16,090 to 20,967
          • snapshot-shm: 857,823 to 1,262,493
          • lookup: 107,472,315 to 164,305,717
    • current remaining raw Rust drift is now narrower and well-scoped:
      • the raw managed server already enforces one expected_method_code
      • the raw client surface still exposes a generic constructor and mixed call surface under the stale internal name CgroupsClient
      • the next cleanup slice is to bind the raw Rust client constructors to one service kind and migrate the raw Rust tests to those constructors, matching the already-correct Go raw design
    • raw Rust client drift is now removed from the active service surface:
      • src/crates/netipc/src/service/raw.rs now exposes RawClient instead of the stale internal multi-kind name CgroupsClient
      • the raw client is now created only through service-kind-specific constructors:
        • RawClient::new_snapshot(...)
        • RawClient::new_increment(...)
        • RawClient::new_string_reverse(...)
      • request kind remains only as envelope validation on the raw client
      • the raw Rust Unix/Windows tests now create snapshot, increment, and string-reverse clients explicitly instead of reusing one generic constructor across service kinds
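The single-kind binding can be illustrated as follows. The constructor names mirror the ones listed above, but the struct body and method codes are assumptions for illustration, not the library's real definitions:

```rust
// Illustrative method codes; the real values live in the protocol layer.
const METHOD_SNAPSHOT: u16 = 1;
const METHOD_INCREMENT: u16 = 2;
const METHOD_STRING_REVERSE: u16 = 3;

// One client, one service kind: the bound code is fixed at construction,
// so a single client can never mix service kinds on one endpoint.
struct RawClient { expected_method_code: u16 }

impl RawClient {
    fn new_snapshot() -> Self { Self { expected_method_code: METHOD_SNAPSHOT } }
    fn new_increment() -> Self { Self { expected_method_code: METHOD_INCREMENT } }
    fn new_string_reverse() -> Self { Self { expected_method_code: METHOD_STRING_REVERSE } }

    // Request kind survives only as envelope validation on the raw client.
    fn accepts(&self, envelope_method_code: u16) -> bool {
        envelope_method_code == self.expected_method_code
    }
}

fn main() {
    let c = RawClient::new_increment();
    assert!(c.accepts(METHOD_INCREMENT));
    assert!(!c.accepts(METHOD_SNAPSHOT));
    println!("ok");
}
```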
    • local Linux Rust proof for that slice is now green:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
        • result:
          • 75 passed
          • 0 failed
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 299 passed
          • 0 failed
    • real win11 Rust proof for that slice is now green too:
      • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 176 passed
          • 0 failed
          • 1 ignored
    • the broader win11 interop/service/cache matrix initially exposed two more stale constructor residues outside the Rust raw tests:
      • Rust benchmark drivers still imported the deleted raw CgroupsClient instead of using the public snapshot facade
        • fixed in:
          • bench/drivers/rust/src/main.rs
          • bench/drivers/rust/src/bench_windows.rs
      • Go public cgroups wrappers still called the deleted generic raw constructor:
        • raw.NewClient(...)
        • fixed in:
          • src/go/pkg/netipc/service/cgroups/client.go
          • src/go/pkg/netipc/service/cgroups/client_windows.go
      • Go benchmark drivers still hand-rolled the stale raw dispatch signature instead of using the single-kind increment adapter
        • fixed in:
          • bench/drivers/go/main.go
    • the next verified contradiction slice was documentation-heavy and is now resolved:
      • low-level SHM / handshake docs now describe the accepted directional negotiation model and the current session-scoped SHM lifecycle:
        • request limits are sender-driven
        • response limits are server-driven
        • SHM capacities are fixed per session
        • larger learned capacities require a reconnect and a new session, not in-place SHM resize
      • docs/level1-wire-envelope.md no longer says handshake rule 6 takes the minimum of client and server values
      • docs/level1-windows-np.md now documents per-session Windows SHM object names with session_id, aligned with both code and docs/level1-windows-shm.md
      • public L2 comments/docs no longer claim a blanket "retry ONCE":
        • ordinary failures still retry once
        • overflow-driven resize recovery may reconnect more than once while capacities grow
      • Unix test/script cleanup helpers no longer remove the stale pre-session path {service}.ipcshm; they now use per-session cleanup that matches {service}-{session_id}.ipcshm
      • validation for this slice is green:
        • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
          • result:
            • 75 passed
            • 0 failed
        • cd src/go && go test -count=1 ./pkg/netipc/service/raw
          • result:
            • ok
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service_interop|test_cache_interop|test_shm_interop)$'
          • result:
            • 100% tests passed
            • 0 failed
    • the next verified residue slice is narrower and fixture-focused:
      • several Unix C/Go fixture cleanup helpers still unlink the dead pre-session path {service}.ipcshm instead of using per-session cleanup
      • current proven hits:
        • tests/fixtures/c/test_service.c
        • tests/fixtures/c/test_cache.c
        • tests/fixtures/c/test_hardening.c
        • tests/fixtures/c/test_chaos.c
        • tests/fixtures/c/test_multi_server.c
        • tests/fixtures/c/test_stress.c
        • src/go/pkg/netipc/service/cgroups/cgroups_unix_test.go
    • that Unix fixture-cleanup residue slice is now resolved:
      • the touched Unix C fixtures now use nipc_shm_cleanup_stale(TEST_RUN_DIR, service) instead of unlinking the dead {service}.ipcshm path
      • the touched Go public cgroups Unix tests now use posix.ShmCleanupStale(testRunDirUnix, service) instead of removing the dead {service}.ipcshm path
      • validation for this slice is green:
        • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups
          • result:
            • ok
        • cmake --build build --target test_service test_cache test_hardening test_multi_server test_chaos test_stress
          • result:
            • rebuild passed
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_cache|test_hardening|test_multi_server|test_chaos|test_stress)$'
          • result:
            • 100% tests passed
            • 0 failed
    • one more live Unix fixture contradiction remains after that cleanup pass:
      • tests/fixtures/c/test_chaos.c:test_shm_chaos() still opens the dead pre-session SHM path {run_dir}/{service}.ipcshm
      • this is not just stale cleanup text; it likely means the SHM-chaos path is not actually targeting the live per-session SHM file today
    • that live SHM-chaos contradiction is now resolved:
      • tests/fixtures/c/test_chaos.c:test_shm_chaos() now captures the live session_id from the ready client session and opens {run_dir}/{service}-{session_id}.ipcshm
      • the test no longer treats "SHM file not found" as an acceptable skip on this path
      • validation:
        • cmake --build build --target test_chaos
          • result:
            • rebuild passed
        • /usr/bin/ctest --test-dir build --output-on-failure -R '^test_chaos$'
          • result:
            • 100% tests passed
            • 0 failed
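The per-session naming that the chaos test now targets can be sketched as a small helper; this function is hypothetical (the real paths are built inside the C/Go/Rust transport layers), but it encodes the same rule:

```rust
use std::path::PathBuf;

// Live SHM files are {run_dir}/{service}-{session_id}.ipcshm;
// the old pre-session {service}.ipcshm path is dead.
fn session_shm_path(run_dir: &str, service: &str, session_id: u64) -> PathBuf {
    PathBuf::from(run_dir).join(format!("{service}-{session_id}.ipcshm"))
}

fn main() {
    let p = session_shm_path("/tmp/run", "cgroups", 7);
    assert_eq!(p.to_str().unwrap(), "/tmp/run/cgroups-7.ipcshm");
    println!("ok");
}
```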
    • current residue scan excluding this TODO file is now clean for the main drift markers:
      • no remaining old {service}.ipcshm path literals
      • no remaining deleted CgroupsHandlers / CgroupsServer API references
      • no remaining deleted raw.NewClient(...) / service::raw::CgroupsClient references
      • no remaining deleted new_single_kind / with_workers_single_kind references
    • broader Unix validation after these cleanup passes is also green:
      • /usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop|test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go|test_hardening|test_ping_pong|test_multi_server|test_chaos|test_stress)$'
        • result:
          • 100% tests passed
          • 0 failed
          • 19/19 passed
    • local Go proof for the wrapper/benchmark cleanup is now green:
      • cd src/go && go test -count=1 ./pkg/netipc/service/cgroups
        • result:
          • ok
      • cd bench/drivers/go && go test -run '^$' ./...
        • result:
          • compile-only pass
    • real win11 build + matrix proof after those residue fixes is now green:
      • cmake --build build -j4
        • result:
          • build succeeds again after the Rust/Go constructor cleanup
      • timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"
        • result:
          • test_named_pipe_interop: passed
          • test_win_shm_interop: passed
          • test_service_win_interop: passed
          • test_service_win_shm_interop: passed
          • test_cache_win_interop: passed
          • test_cache_win_shm_interop: passed
          • summary:
            • 100% tests passed, 0 tests failed out of 6
    • verified residue scan for the stale constructor names used in this slice is now clean:
      • no remaining raw.NewClient
      • no remaining service::raw::CgroupsClient
      • no remaining RawClient::new(
    • a smaller cross-platform residue cleanup is now also complete:
      • the test-only Rust helper dispatch_single() in src/crates/netipc/src/service/raw.rs is now explicitly marked as dead-code-tolerant under test builds, so Windows lib-test builds no longer emit the stale unused-function warning
      • the remaining public docs/spec wording in this slice was normalized away from the older "method-specific" phrasing where it described the public L2 service surface or service contracts:
        • docs/level1-transport.md
        • docs/codec.md
        • docs/level2-typed-api.md
        • docs/code-organization.md
        • docs/codec-cgroups-snapshot.md
    • local Linux validation after that wording/test-helper cleanup is still green:
      • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 299 passed
          • 0 failed
    • real win11 validation after that cleanup is also still green:
      • timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
        • result:
          • 176 passed
          • 0 failed
          • 1 ignored
        • factual note:
          • the previous Windows-only dispatch_single unused-function warning is no longer present in this run
        • the Windows guard output still shows the accepted request-resize behavior:
          • transparent recovery
          • exactly one reconnect
          • negotiated request-size growth
    • new verified internal raw-client alignment:
      • fact:
        • the raw managed servers in Go and Rust were already bound to one expected_method_code
        • the remaining client-side drift was that one long-lived raw client context still exposed multiple service-kind calls
      • implementation slice now completed in Go:
        • raw Go clients are now created per service kind:
          • NewSnapshotClient(...)
          • NewIncrementClient(...)
          • NewStringReverseClient(...)
        • each client now stores one expected request code and rejects wrong-kind calls as validation failures instead of pretending one client can legitimately serve multiple service kinds
        • the cache helpers now bind explicitly to cgroups-snapshot
      • exact local Unix proof:
        • cd src/go && go test -count=1 ./pkg/netipc/service/raw
        • result:
          • ok
      • exact real Windows proof on win11:
        • cd ~/src/plugin-ipc.git/src/go && go test -count=1 ./pkg/netipc/service/raw
        • first rerun exposed one Windows-only missed constructor site:
          • pkg/netipc/service/raw/shm_windows_test.go:334
          • stale NewClient(...)
        • after correcting that last Windows-only leftover and resyncing:
          • result:
            • ok
      • factual conclusion:
        • the Go raw helper layer is now materially aligned with the accepted single-service-kind design on both Unix and Windows
        • remaining work is to carry the same invariant through the remaining Rust raw helper surface
    • a full Rust cargo test --lib run is still blocked by one unrelated transport failure outside this rewrite slice:
      • transport::posix::tests::test_receive_batch_count_exceeds_limit
    • remaining heavy work is now concentrated in:
      • proving the accepted resize behavior with the full interop/service/cache matrices on Unix and Windows, not just the targeted raw suites
      • getting real Windows compile/run proof for the edited Rust/Go/C Windows test surfaces
      • reconciling the current C path with the final single-kind + learned-size design language everywhere, then validating all 3 languages together

Analysis

Verified facts about Netdata today

  • cgroups.plugin is not an external executable. It runs inside the Netdata daemon:
    • cgroups_main() is started from src/daemon/static_threads_linux.c.
  • ebpf.plugin is a separate external executable:
    • built by add_executable(ebpf.plugin ...) in CMakeLists.txt.
  • Current cgroups.plugin -> ebpf.plugin integration is a custom SHM + semaphore contract:
    • producer: src/collectors/cgroups.plugin/cgroup-discovery.c
    • shared structs: src/collectors/cgroups.plugin/sys_fs_cgroup.h
    • consumer: src/collectors/ebpf.plugin/ebpf_cgroup.c
  • The shared payload currently transports cgroup metadata, not PID membership:
    • fields: name, hash, options, enabled, path
    • ebpf.plugin still reads each cgroup.procs file itself.
  • Netdata already has a stable per-run invocation identifier:
    • src/libnetdata/log/nd_log-init.c
    • Netdata reads NETDATA_INVOCATION_ID, else INVOCATION_ID, else generates a UUID and exports NETDATA_INVOCATION_ID.
  • External plugins are documented to receive NETDATA_INVOCATION_ID:
    • src/plugins.d/README.md
  • Netdata already exposes plugin environment variables centrally:
    • src/daemon/environment.c
  • Netdata already has the right build roots for all 3 languages:
    • C via top-level CMakeLists.txt
    • Rust workspace in src/crates/Cargo.toml
    • Go module in src/go/go.mod

Verified facts about plugin-ipc today

  • plugin-ipc already has the exact L3 cgroups snapshot API for this use case:
    • docs/level3-snapshot-api.md
  • The typed snapshot schema closely matches Netdata’s current SHM payload:
    • src/libnetdata/netipc/include/netipc/netipc_protocol.h
  • The C API already supports:
    • managed server lifecycle
    • typed cgroups client/cache
    • POSIX transport with negotiated SHM fast path
  • Authentication in plugin-ipc is a uint64_t auth_token:
    • src/libnetdata/netipc/include/netipc/netipc_service.h
    • src/libnetdata/netipc/include/netipc/netipc_uds.h
    • Rust/Go implementations use the same concept.

Important integration implications

  • Phase 1 can replace the metadata transport only.
  • Phase 1 will not remove ebpf.plugin reads of cgroup.procs.
  • The default plugin-ipc response size is too small for real Netdata snapshots on large hosts, so Linux integration must use an explicit large response limit.
  • The best build/distribution model is in-tree vendoring inside Netdata, not an external system dependency.
  • Current Netdata payload sizing evidence already proves this:
    • cgroup_root_max default is 1000 in src/collectors/cgroups.plugin/sys_fs_cgroup.c
    • current per-item SHM body carries name[256] and path[FILENAME_MAX + 1] in src/collectors/cgroups.plugin/sys_fs_cgroup.h
    • FILENAME_MAX on this Linux build environment is 4096
    • this means the current per-item shape is already about 4.3 KiB before protocol framing/alignment
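The sizing evidence above reduces to quick arithmetic. A minimal Go sketch (Go chosen only for convenience; the 16-byte allowance for the small fixed fields is a nominal assumption, and protocol framing/alignment is deliberately ignored, so this is a lower bound):

```go
package main

import "fmt"

// worstCaseSnapshotBytes multiplies the per-item shape from the evidence
// above: name[256] + path[FILENAME_MAX + 1] with FILENAME_MAX = 4096,
// plus a nominal 16 bytes for the small fixed fields (hash, options,
// enabled). Framing and alignment are ignored.
func worstCaseSnapshotBytes(items int) int {
	const perItem = 256 + (4096 + 1) + 16 // ≈ 4.3 KiB per cgroup
	return items * perItem
}

func main() {
	total := worstCaseSnapshotBytes(1000) // cgroup_root_max default
	fmt.Printf("worst case ≈ %d bytes (~%.1f MiB)\n", total, float64(total)/(1024*1024))
}
```

At the default cgroup_root_max of 1000 this already lands above 4 MiB, which is the multi-megabyte worst case cited later in this document.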

Verified design-drift findings

  • The original written phase plan did not describe a multi-method server.
    • Evidence:
      • TODO-plugin-ipc.history.md
      • historical phase plan still says:
        • Define and freeze a minimal v1 typed schema for one RPC method ('increment')
  • The first generated L2 spec also did not need a multi-method server model.
    • Evidence:
      • initial docs/level2-typed-api.md from commit 1722f95
      • handler contract was framed as one typed request view + one response builder per handler callback
      • no raw transport-level switch over multiple method codes in that initial text
  • The history TODO already contained the correct service-oriented discovery model.
    • Evidence:
      • TODO-plugin-ipc.history.md
      • explicit historical decisions already said:
        • discovery is service-oriented, not plugin-oriented
        • service names are the stable public contract
        • one endpoint per service
        • one persistent client context per service
        • startup order can remain random
        • caller owns reconnect cadence via refresh(ctx)
    • Implication:
      • the later multi-method server model was not a missing discussion
      • it was drift away from an already-decided service model
  • The first explicit spec drift appears in commit 53b5e5a on 2026-03-16.
    • Evidence:
      • docs/level2-typed-api.md in commit 53b5e5a
      • handler contract changed to:
        • raw-byte transport handler
        • switch(method_code)
        • INCREMENT
        • STRING_REVERSE
        • CGROUPS
      • this is the first clear documentation model where one server endpoint dispatches multiple request kinds
  • The first strong implementation-level generalization appears the same day in commit 69bb794.
    • Evidence:
      • commit message explicitly says:
        • Add dispatch_increment(), dispatch_string_reverse(), dispatch_cgroups_snapshot()
      • docs/getting-started.md in that commit adds typed helper examples for more than one method family
      • this widened the implementation and examples toward a generic multi-method dispatch surface
  • The drift was then reinforced in public examples in commit 6014b0e on 2026-03-17.
    • Evidence:
      • docs/getting-started.md
      • C example registers:
        • .on_increment
        • .on_cgroups
      • Rust example registers:
        • on_increment
        • on_cgroups
      • Go example registers:
        • OnIncrement
        • OnSnapshot
      • text says:
        • You register typed callbacks for the supported methods
  • The drift became operationally entrenched in interop in commit 099945b on 2026-03-16.
    • Evidence:
      • commit message explicitly says:
        • Cross-language interop now tests all method types
      • interop fixtures for C, Rust, and Go on POSIX and Windows all dispatch:
        • INCREMENT
        • CGROUPS_SNAPSHOT
        • STRING_REVERSE
  • The drift later propagated into current coverage/TODO planning and the repository README.
    • Evidence:
      • TODO-pending-from-rewrite.md planned:
        • snapshot / increment / string-reverse / batch over SHM
      • README.md now says:
        • servers register typed handlers

Current factual conclusion from the drift investigation

  • There is currently no evidence in the TODO history that the original direction from the user was:
    • one server should serve multiple request kinds
  • The strongest historical evidence points the other way:
    • the original phase plan explicitly named one RPC method only
  • Working theory:
    • the drift started when the typed API was generalized from:
      • one typed request kind per server
      • to
      • one generic server dispatching multiple method codes
    • then examples, interop fixtures, tests, coverage plans, and README text copied that model until it felt normal

Decisions

Made

  1. Windows runtime validation host

    • User decision: use win11 over SSH for real Windows proof instead of stopping at source cleanup or cross-compilation from Linux.
    • Constraint:
      • prefer the already-documented win11 workflow from this repository's TODOs/docs
      • do not guess the Windows execution flow when the repo already documents it
    • Implication:
      • touched Windows Rust/Go/C transport/service/interop/cache surfaces should now be proven on a real Windows runtime, not just by static review or Linux-hosted cross-compilation
      • the next implementation slice should follow the existing win11 operational guidance already captured in the repo
  2. Authentication source

    • User decision: use NETDATA_INVOCATION_ID for authentication.
    • Meaning:
      • the auth value changes on every Netdata run
      • only plugins launched under the same Netdata instance can authenticate
    • Evidence:
      • src/libnetdata/log/nd_log-init.c creates/exports NETDATA_INVOCATION_ID
      • src/plugins.d/README.md documents it for external plugins
    • Implication:
      • this is stronger than a machine-stable token for local plugin-to-plugin IPC
      • restarts invalidate old clients automatically
  3. Source layout in Netdata

    • User decision: native Netdata layout.
    • Layout:
      • C in src/libnetdata/netipc/
      • Rust in src/crates/netipc/
      • Go in src/go/pkg/netipc/
    • Implication:
      • the library becomes a first-class internal Netdata component in all 3 languages
      • future sync from plugin-ipc upstream will be manual/curated, not subtree-based
  4. Invocation ID to auth-token mapping

    • User decision: derive the plugin-ipc uint64_t auth_token from NETDATA_INVOCATION_ID using a deterministic hash.
    • Constraint:
      • the mapping must be identical in C, Rust, and Go
    • Implication:
      • only processes launched under the same Netdata run can authenticate
      • Netdata restart rotates auth automatically
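The decision above fixes the requirement (a deterministic uint64 hash, identical in C, Rust, and Go) but not the hash function itself. As an illustration only, a minimal sketch using FNV-1a 64, chosen here purely because it is trivial to reimplement byte-identically in all three languages; the actual mapping remains an implementation decision:

```go
package main

import (
	"fmt"
	"os"
)

// fnv1a64 is one candidate deterministic mapping from the textual
// NETDATA_INVOCATION_ID to the plugin-ipc uint64 auth_token.
// Constants are the standard FNV-1a 64-bit offset basis and prime.
func fnv1a64(s string) uint64 {
	const offsetBasis uint64 = 0xcbf29ce484222325
	const prime uint64 = 0x100000001b3
	h := offsetBasis
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= prime
	}
	return h
}

func main() {
	id := os.Getenv("NETDATA_INVOCATION_ID") // exported by the Netdata daemon
	fmt.Printf("auth_token = 0x%016x\n", fnv1a64(id))
}
```

Because the input rotates on every Netdata run, the derived token rotates with it, which is exactly the restart-invalidation property the decision relies on.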
  5. Rollout mode

    • User decision: big-bang switch.
    • Implication:
      • there will be no legacy custom-SHM fallback path for this metadata channel
    • Risk:
      • any bug in the new path blocks ebpf.plugin cgroup metadata integration immediately
  6. Linux response size policy

    • User concern/decision direction:
      • do not accept a large fixed memory cost such as 16 MiB just for this IPC path
      • prefer dynamic behavior that adapts to actual payload size
      • allocation should happen only when needed
    • Implication:
      • the current plugin-ipc response budgeting model needs review before integration
      • response sizing / negotiation may need design changes, not just configuration
  7. Snapshot overflow handling direction

    • User decision direction:
      • reconnect is acceptable for snapshot overflow handling
      • growth policy should be power-of-two
      • SHM L2 should transparently handle overflow-driven resizing, hidden from both L2 clients and L2 servers
    • User design intent:
      • the server should not need to know the final safe snapshot size before the first request
      • the first real overflow during response preparation should trigger the resize path
      • once the server has learned a larger size from a real snapshot, later clients should negotiate into that larger size automatically
    • Implication:
      • current fixed per-session SHM sizing and current HELLO/HELLO_ACK limit semantics are not sufficient as-is for this Netdata use case
      • the growth mechanism likely needs new L2 protocol behavior, not only implementation tweaks
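The accepted power-of-two growth policy can be sketched as a small helper (names are illustrative, not the real plugin-ipc sizing code):

```go
package main

import "fmt"

// nextCapacity sketches the accepted overflow policy: keep doubling the
// learned SHM capacity until the observed payload fits, then retain the
// learned value for later sessions. Illustrative only.
func nextCapacity(current, needed uint64) uint64 {
	c := current
	if c == 0 {
		c = 1 // guard: avoid a zero capacity that could never grow
	}
	for c < needed {
		c *= 2
	}
	return c
}

func main() {
	// A 1 MiB region overflowed by a ~4.2 MiB snapshot grows to 8 MiB.
	fmt.Println(nextCapacity(1<<20, 4369000)) // 8388608
}
```

A payload that already fits leaves the capacity untouched, which keeps the steady-state fast path free of resize churn.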
  8. Pre-integration gating

    • User decision:
      • implement this transparent SHM resize behavior in plugin-ipc first
      • do not start Netdata integration before it is done
      • require thorough validation first, including full interop matrices across C/Rust/Go on Unix and Windows
    • Verified evidence that the repo already has the right validation scaffolding:
      • POSIX interop tests in CMakeLists.txt:
        • test_uds_interop
        • test_shm_interop
        • test_service_interop
        • test_service_shm_interop
        • test_cache_interop
        • test_cache_shm_interop
      • Windows interop tests in CMakeLists.txt:
        • test_named_pipe_interop
        • test_win_shm_interop
        • test_service_win_interop
        • test_cache_win_interop
      • Existing transport-specific integration tests already exist:
        • POSIX SHM: tests/fixtures/c/test_shm.c, Rust src/crates/netipc/src/transport/shm_tests.rs
        • Windows SHM: tests/fixtures/c/test_win_shm.c, Rust src/crates/netipc/src/transport/win_shm.rs, Go src/go/pkg/netipc/transport/windows/shm_test.go
    • Implication:
      • the resize feature must be proven at:
        • L1 transport level
        • L2 service/client level
        • cross-language interop level
        • both POSIX and Windows implementations
  9. Design priorities for the resize rewrite

    • User decision:
      • optimize for long-term correctness, reliability, robustness, and performance
      • backward compatibility is not required
      • do not optimize for minimizing work now
      • prefer the right design even if that means a substantial rewrite
    • Implication:
      • decisions should favor clean semantics and maintainability over preserving current handshake/transport structure
      • a third rewrite is acceptable if it produces a better architecture
  10. User design constraints from follow-up discussion

    • IPC servers should service a single request kind.
    • Sessions should be assumed long-lived:
      • connect once
      • serve many requests
      • disconnect on shutdown or exceptional recovery
  11. Benchmark refresh slice disposition

  • User decision:
    • commit and push the refreshed benchmark slice now
    • then investigate the remaining benchmark spreads separately
  • Implication:
    • commit only the benchmark-fix, benchmark-artifact, and benchmark-doc sync files from this slice
    • do not mix this commit with unrelated cleanup or integration work
  12. Current commit scope
  • User decision:
    • commit and push the full remaining work from this task now
  • Implication:
    • stage the remaining drift-removal, SHM-resize, service-kind alignment, test, and doc changes that belong to this task
    • avoid unrelated local or user-owned changes outside this task
  • Additional user design constraints (continuing the follow-up discussion above):
    • Steady-state fast path matters far more than the rare resize path.
    • Learned transport sizes are important:
      • adapt automatically
      • stabilize quickly
      • then remain fixed for the lifetime of the process
      • reset on restart
    • Separate request and response sizing should exist.
    • Variable sizing pressure is expected mainly on responses, not requests.
    • Artificial hard caps are not acceptable as a design crutch.
    • Disconnect-based recovery is acceptable if it is reliable and the system stabilizes.
  13. Accepted architecture decisions for the SHM resize rewrite
  • User accepted:
    • L2 service model: single-method-per-server
    • Resize signaling path: explicit LIMIT_EXCEEDED signal, then disconnect/reconnect
    • Auto-resize scope: separate learned request and response sizing, both supported
    • Initial size policy: per-server-kind compile-time defaults
    • Learned-size lifetime: in-memory only for the current process lifetime, reset on restart
  • Implication:
    • the current generic multi-method service abstraction is now known design drift
    • the rewrite should simplify transport/service code around one request kind per server
  14. Service discovery and availability model
  • User clarified the intended service model explicitly:
    • clients connect to a service kind, not to a specific plugin implementation
    • each service endpoint serves one request kind only
    • example service kinds include:
      • cgroups-snapshot
      • ip-to-asn
      • pid-traffic
    • the serving plugin is intentionally abstracted away from clients
  • User clarified the intended runtime model explicitly:
    • plugins are asynchronous peers
    • startup order is not guaranteed
    • enrichments from other plugins/services are optional
    • a client plugin may start before the service it needs exists
    • a service may disappear and reappear during runtime
    • clients must reconnect periodically and tolerate service absence
  • Implication:
    • repository docs/specs/TODOs must describe:
      • service-name-based discovery
      • service-type ownership independent from plugin identity
      • optional dependency semantics
      • reconnect / retry behavior for not-yet-available services
  15. Execution mandate for this phase
  • User decision:
    • proceed autonomously to remove the drift from implementation and docs
    • align code, tests, and examples to the single-service-kind model
    • implement the accepted SHM size renegotiation / resize behavior
    • remove contradictory wording and stale examples that preserve the wrong model
  • Implication:
    • this is now a repository-wide consistency and implementation task
    • active docs, public APIs, interop fixtures, and validation must converge on the same model before Netdata integration
  16. Request-kind field semantics
  • User clarification:
    • request type / method code may remain in wire structures and headers
    • its role is validation, not public multi-method dispatch
    • a service endpoint expects exactly one request kind
    • any other request kind must be rejected
  • Implication:
    • we can keep method codes in the protocol
    • service implementations must bind one endpoint to one expected request kind
    • public APIs/tests/docs must not imply that one service endpoint accepts multiple unrelated request kinds
  17. Payload-vs-service boundary
  • User clarification:
    • if a service needs arrays of things, batching belongs to that service payload/codec
    • batching is not a reason for one L2 endpoint to expose multiple public request kinds
  • Implication:
    • the public L2 service layer should not keep generic multi-method or generic batch dispatch as part of its contract
    • INCREMENT, STRING_REVERSE, and batch ping-pong traffic can remain at protocol / transport / benchmark level
    • the public cgroups snapshot service should be snapshot-only

Pending

  1. Service naming and endpoint placement

    • Context:
      • POSIX transport needs a service name and run-dir placement.
      • Netdata already has os_run_dir(true).
    • Open question:
      • exact service name/versioning strategy for the cgroups snapshot endpoint
  2. Exact Linux response-size budget

    • Context:
      • user rejected a large fixed per-connection budget as bad for footprint
      • dynamic/adaptive options must be evaluated against the current plugin-ipc design
    • Current hard payload evidence:
      • 1000 cgroups at roughly 4.3 KiB each already implies multi-megabyte worst-case snapshots
    • Open question:
      • what protocol / implementation change best preserves low idle footprint while still supporting large snapshots
  3. Dynamic response sizing model

    • Context:
      • current plugin-ipc session handshake negotiates agreed_max_response_payload_bytes once
      • current implementations then size buffers against that session-wide maximum
    • Verified evidence:
      • handshake uses min(client, server) in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • C client allocates request/response/send buffers eagerly in src/libnetdata/netipc/src/service/netipc_service.c
      • C server allocates per-session response buffer sized to the full negotiated maximum in src/libnetdata/netipc/src/service/netipc_service.c
      • Linux SHM region size is fixed from negotiated request/response capacities in src/libnetdata/netipc/src/transport/posix/netipc_shm.c
      • UDS chunked receive is already dynamically grown with realloc in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • Rust and Go clients are already more dynamic and grow buffers lazily in:
        • src/crates/netipc/src/service/cgroups.rs
        • src/go/pkg/netipc/service/cgroups/client.go
      • Netdata ebpf.plugin refreshes cgroup metadata every 30 seconds:
        • src/collectors/ebpf.plugin/ebpf_process.h
        • src/collectors/ebpf.plugin/ebpf_cgroup.c
    • Decision needed:
      • choose whether to keep the current protocol and improve allocation policy only, or evolve the protocol to support truly dynamic large snapshots
    • Options:
      • A. Keep protocol, make implementation adaptive, and use baseline-only transport for the cgroups snapshot service in phase 1
      • B. Add paginated snapshot requests/responses
      • C. Add out-of-band exact-sized bulk snapshot transfer for large responses
      • D. Keep the current fixed session-wide max model and just configure a large cap
      • E. Keep SHM for data, but negotiate/create SHM capacity per request instead of per session
      • F. Split transport into a tiny control channel plus ephemeral payload channel/object
      • G. Add a small size-probe step before fetching the full snapshot
      • H. Add true server-streamed snapshot responses (multi-message response sequence)
      • I. Allow snapshot responses to return "resize to X bytes and retry", so the client grows once on demand and reuses that larger buffer from then on
      • J. Make SHM L2 transparently reconnect and double capacities on overflow, so resizing is hidden from both clients and servers and the server retains the learned larger size for future sessions
    • Current preferred direction under discussion:
      • J, but it still needs stress-testing against the current HELLO/HELLO_ACK semantics, SHM lifecycle, and L2 retry behavior
  4. Transparent SHM resize semantics

    • Context:
      • user direction is to make SHM L2 resizing automatic and transparent to both clients and servers
      • reconnect is acceptable and growth should be power-of-two on overflow
    • Verified evidence:
      • current server sends NIPC_STATUS_INTERNAL_ERROR on handler/batch failure in src/libnetdata/netipc/src/service/netipc_service.c
      • current C/Go/Rust clients treat any non-OK response transport status as bad layout / failure:
        • src/libnetdata/netipc/src/service/netipc_service.c
        • src/go/pkg/netipc/service/cgroups/client.go
        • src/crates/netipc/src/service/cgroups.rs
      • NIPC_STATUS_LIMIT_EXCEEDED already exists in src/libnetdata/netipc/include/netipc/netipc_protocol.h
    • Corrected layering rule from user discussion:
      • transport/L2 may handle overflow signaling, reconnect, and shared-memory remap mechanics
      • replay detection for mutating RPCs belongs to the request payload and the server business logic, not to transport-level semantic dedupe
    • Clarified implication:
      • transport should not try to "understand" whether a mutation was already applied
      • if a mutating method cares about replay safety, it must carry a request identity / idempotency token in its own payload and the server method must enforce it
    • For the Netdata cgroups snapshot use case:
      • this is not a blocker, because snapshot is read-only
    • Open question:
      • whether transparent reconnect-and-retry should be generic transport behavior for all methods, or exposed as a capability that higher layers opt into when their payload semantics make replay safe
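The corrected layering rule can be illustrated with a minimal sketch (all names hypothetical): the idempotency token lives in the request payload and the handler's business logic, so a transport-level reconnect-and-retry replay becomes a harmless no-op without the transport understanding mutation semantics:

```go
package main

import "fmt"

// incrementRequest carries a caller-chosen request ID in its own payload;
// the transport never inspects it.
type incrementRequest struct {
	requestID uint64 // idempotency token owned by the payload
	delta     int64
}

type incrementServer struct {
	applied map[uint64]bool
	counter int64
}

// handle enforces replay safety in business logic: a request ID that was
// already applied is acknowledged without re-applying the mutation.
func (s *incrementServer) handle(req incrementRequest) int64 {
	if !s.applied[req.requestID] {
		s.applied[req.requestID] = true
		s.counter += req.delta
	}
	return s.counter
}

func main() {
	srv := &incrementServer{applied: map[uint64]bool{}}
	fmt.Println(srv.handle(incrementRequest{requestID: 7, delta: 5})) // 5
	// Transport-level retry after reconnect replays the same payload.
	fmt.Println(srv.handle(incrementRequest{requestID: 7, delta: 5})) // still 5
}
```

For the read-only cgroups snapshot service none of this machinery is required, which is why the document treats it as a non-blocker.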
  5. Negotiation semantics for learned SHM size

    • Context:
      • user correctly rejected the current min(client, server) rule for learned snapshot sizing
      • current handshake stores only one scalar per direction, so it cannot distinguish:
        • client hard cap
        • client initial size
        • server learned target size
    • Verified evidence:
      • current HELLO/HELLO_ACK uses fixed agreed_max_* fields in:
        • src/libnetdata/netipc/src/transport/posix/netipc_uds.c
        • src/crates/netipc/src/transport/posix.rs
        • src/crates/netipc/src/transport/windows.rs
    • Open question:
      • should the protocol split "current operational size" from "hard ceiling", so the server can advertise a learned larger target without losing the client’s ability to refuse absurd allocations
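One possible shape for that split, sketched with illustrative field names rather than the real HELLO/HELLO_ACK layout: the client proposes an initial size plus a hard ceiling, the server decides by adopting its learned target when larger, and the ceiling preserves the client's veto over absurd allocations:

```go
package main

import "fmt"

// hello splits the single scalar the current handshake carries into the
// two values the open question says it cannot distinguish today.
type hello struct {
	initialResponseBytes uint64 // what the client would start with
	hardCeilingBytes     uint64 // absolute refusal point for allocations
}

type helloAck struct {
	agreedResponseBytes uint64 // server decides; client ceiling respected
}

// negotiate follows the recorded contract: client proposes, server
// decides, and the final value is returned in the handshake response.
func negotiate(h hello, serverLearned uint64) helloAck {
	agreed := h.initialResponseBytes
	if serverLearned > agreed {
		agreed = serverLearned // later clients adopt the learned size automatically
	}
	if agreed > h.hardCeilingBytes {
		agreed = h.hardCeilingBytes // client keeps its refusal right
	}
	return helloAck{agreedResponseBytes: agreed}
}

func main() {
	// Fresh server: the client's initial size wins.
	fmt.Println(negotiate(hello{1 << 20, 64 << 20}, 0).agreedResponseBytes) // 1048576
	// Server already learned 8 MiB from a real snapshot: clients follow.
	fmt.Println(negotiate(hello{1 << 20, 64 << 20}, 8<<20).agreedResponseBytes) // 8388608
}
```

This is a sketch of one resolution of the open question, not a committed protocol change.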
  6. Request-side vs response-side SHM growth asymmetry

    • Verified evidence:
      • POSIX SHM send rejects oversize messages locally before the peer can react:
        • src/libnetdata/netipc/src/transport/posix/netipc_shm.c
      • existing tests already cover this class of failure:
        • tests/fixtures/c/test_shm.c
        • tests/fixtures/c/test_service.c (test_shm_batch_send_overflow_on_negotiated_limit)
        • tests/fixtures/c/test_win_shm.c
        • tests/fixtures/c/test_win_service_guards.c
    • Implication:
      • response-capacity growth can be learned by the server while building a response
      • request-capacity growth cannot be learned the same way, because an oversize request fails client-side before the server sees it
    • Open question:
      • should the first implementation cover:
        • response-side transparent resize only
        • or symmetric request+response resize with separate client-learned request sizing semantics
  7. Netdata lifecycle ownership details

    • Context:
      • cgroups.plugin runs in-daemon
      • ebpf.plugin is external
    • Open question:
      • exact daemon init/shutdown points for starting/stopping the plugin-ipc cgroups server and for initializing the ebpf.plugin client cache

Plan

2026-04-14 Handshake implementation phase

  1. Update the wire-level negotiation implementation in C / Go / Rust transports.

    • Enforce max_request_payload_bytes <= 1 MiB during handshake.
    • Reject oversized request proposals with handshake LIMIT_EXCEEDED.
    • Stop using request-side max(client, server).
    • Make agreed_max_response_batch_items strictly equal to the effective request batch-item limit.
    • Evidence:
      • src/libnetdata/netipc/src/transport/posix/netipc_uds.c
      • src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c
      • src/go/pkg/netipc/transport/posix/uds.go
      • src/go/pkg/netipc/transport/windows/pipe.go
      • src/crates/netipc/src/transport/posix.rs
      • src/crates/netipc/src/transport/windows.rs
  2. Move SHM readiness earlier so successful handshake guarantees the selected profile is already usable.

    • The server must not complete successful handshake with selected_profile = SHM unless SHM for that session is already ready.
    • This requires changing the current “accept first, create SHM later” flow.
    • Evidence:
      • src/libnetdata/netipc/src/service/netipc_service.c
      • src/libnetdata/netipc/src/service/netipc_service_win.c
      • src/go/pkg/netipc/service/raw/client.go
      • src/go/pkg/netipc/service/raw/client_windows.go
      • src/crates/netipc/src/service/raw.rs
  3. Remove max_request_payload_bytes from the public typed L2 API surfaces.

    • Keep internal learned sizing / overflow recovery machinery.
    • Make typed L2 derive initial request sizing internally instead of exposing it publicly.
    • Evidence:
      • src/libnetdata/netipc/include/netipc/netipc_service.h
      • src/go/pkg/netipc/service/cgroups/types.go
      • src/crates/netipc/src/service/cgroups.rs
  4. Keep overflow-driven reconnect as a tested internal fallback.

    • Preserve the existing recovery model, but align it to the new 1 MiB ceiling and the new public API contract.
    • Evidence:
      • src/libnetdata/netipc/src/service/netipc_service.c
      • src/libnetdata/netipc/src/service/netipc_service_win.c
      • src/go/pkg/netipc/service/raw/client.go
      • src/go/pkg/netipc/service/raw/client_windows.go
      • src/crates/netipc/src/service/raw.rs
  5. Rewrite and extend handshake tests so each negotiated field is validated individually across all implementations.

    • Include all auth failures individually.
    • Include explicit request-payload-cap rejection.
    • Include request/response batch-item symmetry checks.
    • Include overflow reconnect tests under the new contract.
    • Evidence:
      • tests/fixtures/c/test_uds.c
      • tests/fixtures/c/test_named_pipe.c
      • tests/fixtures/c/test_service.c
      • tests/fixtures/c/test_win_service.c
      • src/go/pkg/netipc/transport/posix/uds_test.go
      • src/go/pkg/netipc/transport/windows/pipe_integration_test.go
      • src/go/pkg/netipc/service/raw/more_unix_test.go
      • src/go/pkg/netipc/service/raw/more_windows_test.go
      • src/crates/netipc/src/transport/posix_tests.rs
      • src/crates/netipc/src/transport/windows.rs
      • src/crates/netipc/src/service/raw_unix_tests.rs
      • src/crates/netipc/src/service/raw_windows_tests.rs
  6. Audit the current implementation surfaces that still encode multi-method service behavior.

  7. Define the replacement public model in code terms:

    • one service module per service kind
    • one endpoint per request kind
    • service-specific typed clients/servers/cache helpers
  8. Redesign SHM resize semantics in implementation terms:

    • explicit LIMIT_EXCEEDED
    • disconnect/reconnect recovery
    • separate learned request/response sizes
    • process-lifetime learned sizing
  9. Rewrite the C, Rust, and Go Level 2 service layers to match the corrected model.

  10. Rewrite interop/service fixtures and validation scripts to test one service kind per server.

  11. Rewrite public docs/examples/specs to remove contradictory multi-method wording.

  12. Run targeted tests first, then the full relevant Unix/Windows matrices required to trust the rewrite.

  13. Summarize any residual risk or remaining ambiguity before starting Netdata integration work.

  14. Rerun the current Linux and Windows benchmark matrices on the aligned tree.

  15. Regenerate benchmark artifacts and update all benchmark-derived docs/README summaries.

Implied decisions

  • Preserve Level 1 transport interoperability work where still valid.
  • Preserve codec/message-family work where it remains useful under a service-oriented split.
  • Prefer removal/rename of drifted APIs over keeping compatibility shims, because backward compatibility is not required.
  • Keep request-kind and outer-envelope metadata available to single-kind handlers only for:
    • validating that the endpoint received the expected request kind
    • reading transport batch metadata when a single service kind supports batched payloads
  • Do not use that metadata to reintroduce generic multi-method dispatch at the public Level 2 surface.
  • If a generic Level 2 helper remains for tests/benchmarks, keep it internal and single-kind:
    • one expected request kind per endpoint
    • no public multi-method callback surface
    • no docs/examples presenting it as a production service model

Testing requirements

  • C, Rust, and Go unit tests for the rewritten service APIs
  • POSIX interop matrix for corrected service identities and SHM resize behavior
  • Windows interop matrix for corrected service identities and SHM resize behavior
  • Explicit tests for:
    • late provider startup
    • reconnect after provider restart
    • service absence as a tolerated state
    • SHM resize on response overflow
    • learned-size reuse after reconnect
    • request-side and response-side learned sizing behavior

Documentation updates required

  • Keep README, docs specs, and active TODOs aligned with:
    • service-oriented discovery
    • one request kind per endpoint
    • optional asynchronous enrichments
    • reconnect-driven recovery
    • SHM resize / renegotiation behavior
  1. Finalize remaining design details above.
  2. Vendor plugin-ipc into Netdata in the chosen native layout.
  3. Add a Linux cgroups typed server inside Netdata daemon lifecycle.
  4. Replace ebpf.plugin shared-memory metadata reader with plugin-ipc cgroups cache client.
  5. Keep existing PID membership logic in ebpf.plugin unchanged in phase 1.
  6. Remove the old custom SHM metadata path as part of the big-bang switch.
  7. Add tests for:
    • normal metadata refresh
    • stale/restarted Netdata invalidating old clients
    • large snapshots
    • ebpf.plugin recovery on server restart

Implied decisions

  • Phase 1 is Linux-only.
  • Phase 1 targets cgroups.plugin -> ebpf.plugin metadata only.
  • Current collectors-ipc/ebpf-ipc.* apps/pid SHM remains untouched.
  • NETDATA_INVOCATION_ID must be available to the ebpf.plugin launcher path and any future external clients.
  • A deterministic invocation-id hashing helper will be needed in C, Rust, and Go.

Testing requirements

  • Unit tests for invocation-id to auth-token derivation in C, Rust, and Go.
  • Integration test proving only same-run plugins can connect.
  • Integration test proving restart rotates auth and old clients fail cleanly.
  • Snapshot scale test with high cgroup counts and long names/paths.
  • ebpf.plugin regression test for existing cgroup discovery semantics.

Documentation updates required

  • Netdata integration design note for the new cgroups metadata transport.
  • Developer docs for the new in-tree netipc layout and per-language use.
  • ebpf.plugin and cgroups.plugin internal docs describing the new IPC path.
  • Rollout/kill-switch documentation if dual-path rollout is selected.

Benchmark remediation progress

  • Verified benchmark-distortion findings before changing code:
    • POSIX shm-batch-ping-pong for c/rust exceeds the 1.2x threshold:
      • c->c = 64,148,960
      • c->rust = 58,334,803
      • rust->c = 52,277,542
      • rust->rust = 48,220,338
    • The full corrected Linux and Windows matrices also showed broader benchmark-driver artifacts:
      • Go lookup benchmark used a synthetic linear scan instead of the actual cache-style hash lookup.
      • Rust lookup benchmark used a synthetic linear scan too.
      • Rust cache lookup allocated name.to_string() on every lookup.
      • Go and Rust benchmark clients still had hot-loop buffer allocations in batch, pipeline, and ping-pong paths.
  • Implemented first remediation pass:
    • src/crates/netipc/src/service/raw.rs
      • replaced the flat (hash, String) lookup key with nested per-hash maps so Rust cache lookups stop allocating per call
    • bench/drivers/rust/src/main.rs
      • removed hot-loop allocations from SHM batch client
      • removed hot-loop allocations from ping-pong client
      • moved pipeline-batch receive buffer allocation out of the outer loop
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/rust/src/bench_windows.rs
      • removed the same hot-loop allocations on Windows
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/go/main.go
      • removed hot-loop allocations from batch, pipeline, pipeline-batch, and ping-pong clients
      • replaced lookup linear scan with hash-map lookup
    • bench/drivers/go/main_windows.go
      • removed the same hot-loop allocations on Windows
      • replaced lookup linear scan with hash-map lookup
  • Validation after the first remediation pass:
    • cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1
      • 299 passed, 0 failed
    • cd bench/drivers/go && go test -run '^$' ./...
      • compile-only pass
    • cd src/go && go test -count=1 ./pkg/netipc/service/raw ./pkg/netipc/service/cgroups
      • both packages passed
  • Targeted Linux rerun after the first remediation pass:
    • lookup
      • c = 173,132,146
      • rust = 45,886,102
      • go = 47,703,281
      • fact: the fake benchmark scans are gone; the remaining gap is now in the actual lookup data structures
    • shm-batch-ping-pong, target 0
      • c->c = 62,314,895
      • c->rust = 57,112,806
      • rust->c = 51,620,887
      • rust->rust = 47,356,599
      • fact: the Rust client and Rust server penalties are both still real
    • uds-pipeline-d16, target 0
      • c->c = 721,232
      • c->rust = 717,024
      • c->go = 572,552
      • rust->c = 719,458
      • rust->rust = 727,197
      • rust->go = 576,525
      • fact: the remaining delta is mostly a Go server issue, not a client issue
    • uds-pipeline-batch-d16, target 0
      • c->c = 103,250,763
      • c->rust = 91,495,522
      • c->go = 51,623,524
      • rust->c = 102,367,177
      • rust->rust = 89,465,821
      • rust->go = 52,915,850
      • fact: the earlier client-side benchmark distortion is gone; the remaining large delta is mainly the Go server path
  • Next concrete fixes identified from code + rerun evidence:
    • Go and Rust cache lookup should mirror the C open-addressing hash table:
      • evidence:
        • C uses hash ^ djb2(name) with open addressing in src/libnetdata/netipc/src/service/netipc_service.c
        • Go still uses a composite map[{hash,name}] in src/go/pkg/netipc/service/raw/cache.go
        • Rust still uses nested HashMap<u32, HashMap<String, usize>> in src/crates/netipc/src/service/raw.rs
      • implication:
        • Go and Rust still pay full runtime string hashing on every lookup while C does not
    • Go POSIX UDS transport should mirror the C/Rust vectored send path:
      • evidence:
        • C uses sendmsg + two iovecs in src/libnetdata/netipc/src/transport/posix/netipc_uds.c
        • Rust uses raw_send_iov() in src/crates/netipc/src/transport/posix.rs
        • Go still copies header + payload into a merged scratch buffer in src/go/pkg/netipc/transport/posix/uds.go
      • implication:
        • Go server responses on UDS still pay an extra memcpy per message on the hot path
  • Next measurement step:
    • apply the lookup-index and Go UDS send fixes
    • rerun only the affected slices first:
      • Linux: lookup, shm-batch-ping-pong, uds-pipeline-d16, uds-pipeline-batch-d16
      • Windows: lookup, shm-batch-ping-pong, np-pipeline-d16, np-pipeline-batch-d16
    • only after the slice reruns are understood should the full matrices and docs be refreshed again.
  • Second targeted Linux rerun after rebuilding the Rust release benchmark:
    • lookup
      • c = 170,976,986
      • rust = 150,660,413
      • go = 121,278,244
      • fact:
        • Rust lookup is now near C after mirroring the C open-addressing structure
        • Go lookup improved materially too, but it is still above the 1.2x threshold versus C
    • shm-batch-ping-pong, target 0
      • c->c = 60,929,552
      • c->rust = 55,151,867
      • rust->c = 49,426,036
      • rust->rust = 45,104,001
      • fact:
        • Rust still has a real server-side penalty on this path
        • Rust still has a larger real client-side penalty on this path
    • uds-pipeline-d16, target 0
      • c->c = 713,563
      • c->rust = 720,602
      • rust->c = 722,202
      • rust->rust = 712,371
      • c->go = 548,145
      • rust->go = 563,484
      • fact:
        • Rust is now aligned with C on the non-batch UDS pipeline path
        • the remaining delta is almost entirely the Go server path
    • uds-pipeline-batch-d16, target 0
      • c->c = 101,588,680
      • c->rust = 83,396,588
      • rust->c = 99,570,528
      • rust->rust = 86,762,291
      • c->go = 52,899,078
      • rust->go = 51,902,022
      • fact:
        • Rust client-side is now close to C on this path
        • Rust server-side still shows a real batch-path penalty
        • Go server-side is still the dominant outlier
  • Structural batch-path asymmetry verified from code:
    • C managed server exposes a whole-request callback:
      • src/libnetdata/netipc/include/netipc/netipc_service.h:187-192
      • callback receives request_hdr, full request_payload, and whole response_buf
    • C benchmark server uses that whole-request callback to batch-specialize increment in one loop:
      • bench/drivers/c/bench_posix.c:164-216
      • the callback sees NIPC_FLAG_BATCH, loops all items itself, and emits the whole batch response directly
    • Rust managed server exposes only per-item raw dispatch:
      • src/crates/netipc/src/service/raw.rs:1285-1297
      • batch handling is then forced through the managed-server loop:
        • src/crates/netipc/src/service/raw.rs:2002-2047
        • per item: batch_item_get() -> dispatch_single_internal() -> bb.add()
    • Go managed server exposes the same per-item dispatch shape:
      • src/go/pkg/netipc/service/raw/types.go:57-59
      • batch handling is forced through:
        • src/go/pkg/netipc/service/raw/client.go:903-946
        • per item: BatchItemGet() -> dispatchSingle() -> bb.Add()
    • fact:
      • the remaining Rust and Go batch server gaps are not just transport issues
      • C can specialize whole-batch increment handling at the callback boundary; Rust and Go cannot
  • Working theory for the remaining Linux gaps:
    • shm-batch-ping-pong
      • Rust still has both client-side and server-side cost versus C
      • the server-side part aligns with the batch callback asymmetry above
    • uds-pipeline-batch-d16
      • Rust client-side is now nearly aligned with C
      • the remaining Rust delta is mainly server-side batch handling overhead
      • the much larger Go delta is likely server-side too, with the same structural asymmetry plus extra Go dispatch/runtime overhead
  • Decision required before the next implementation step:
    • Background:
      • The remaining batch-path gap is now tied to the managed-server design.
      • Any serious fix must choose whether to optimize only the benchmarks or to change the service/server implementation model.
      • Decision 1: batch server optimization strategy
      • Evidence:
        • C whole-request callback:
          • src/libnetdata/netipc/include/netipc/netipc_service.h:187-192
          • bench/drivers/c/bench_posix.c:164-216
        • Rust per-item batch loop:
          • src/crates/netipc/src/service/raw.rs:1285-1297
          • src/crates/netipc/src/service/raw.rs:2002-2047
        • Go per-item batch loop:
          • src/go/pkg/netipc/service/raw/types.go:57-59
          • src/go/pkg/netipc/service/raw/client.go:903-946
      • A. Benchmark-only fast path
        • Implement dedicated Rust/Go benchmark servers that bypass the managed server for increment batch.
        • Pros:
          • fastest way to measure the upper bound
          • smallest code change
        • Implications:
          • benchmark numbers improve, but the library/server path stays asymmetric
        • Risks:
          • hides a real product/library performance issue
          • docs and benchmarks stop representing real library behavior
      • B. Internal managed-server specialization
        • Keep the external single-kind API shape, but add internal fast paths for known service kinds such as increment batch.
        • Pros:
          • fixes real library behavior
          • avoids large public API churn
          • aligned with one-service-kind servers
        • Implications:
          • managed-server internals become aware of service-kind-specific fast paths
        • Risks:
          • hidden complexity if done ad hoc
          • may still leave the public abstraction less explicit than the implementation
      • C. Explicit service-kind-specific server APIs
        • Redesign Rust/Go managed servers so each service kind gets its own whole-request server callback surface, matching the accepted single-kind architecture.
        • Pros:
          • cleanest long-term design
          • makes the fast path explicit instead of hidden
          • best fit for maintainability and performance
        • Implications:
          • broader API/implementation/test/doc rewrite in Rust and Go
        • Risks:
          • largest scope before the next measurement
      • Recommendation:
        • Option C: explicit service-kind-specific server APIs
        • Reason:
          • the evidence shows a real API/implementation asymmetry, not just a hot-loop bug
          • your accepted single-kind-service design already points in this direction
  • Priority check raised by Costa:
    • Background:
      • Current benchmark results are already very high in absolute terms.
      • The remaining gaps are real, but fixing them now would require a broader Rust/Go managed-server redesign for batch-heavy paths.
    • Facts:
      • Clean Linux rerun:
        • lookup
          • c = 170,976,986
          • rust = 150,660,413
          • go = 121,278,244
        • shm-batch-ping-pong
          • c->c = 60,929,552
          • rust->rust = 45,104,001
        • uds-pipeline-batch-d16
          • c->c = 101,588,680
          • rust->rust = 86,762,291
          • go->go = 51,355,370
      • Fact:
        • these are already very high throughputs in absolute terms
        • the remaining work is now mainly about closing relative efficiency gaps, not about making the library viable
    • Working theory:
      • Deferring the remaining batch-path optimization is reasonable if there are more fundamental correctness, architecture, or product-fit issues still open.
      • The benchmark investigation has already done its job by identifying the structural asymmetry and proving where it lives.
  • Updated decision from Costa:
    • continue the benchmark investigation for trust in the framework
    • investigate all remaining >1.20x differences
    • treat the Rust/Go batch-path asymmetry as already identified, and focus next on the remaining unexplained gaps
  • Remaining unexplained Linux gaps after excluding the known batch-path issue:
    • lookup
      • c = 170,976,986
      • rust = 150,660,413
      • go = 121,278,244
    • uds-pipeline-d16
      • c->c = 713,563
      • c->go = 548,145
      • rust->go = 563,484
      • fact:
        • the Go server remains the unexplained outlier on the non-batch pipeline path
  • New concrete finding: Go lookup still pays a by-value item copy on every successful bucket probe
    • Evidence:
      • actual cache lookup:
        • src/go/pkg/netipc/service/raw/cache.go:122-130
        • item := c.items[c.buckets[slot].index] copies the whole CacheItem
      • Go lookup benchmark mirrors the same behavior:
        • bench/drivers/go/main.go:1133-1136
        • bucketItem := cacheItems[lookupIndex[slot].index] copies the whole struct
      • Rust uses a reference:
        • src/crates/netipc/src/service/raw.rs:2376-2379
      • C returns a pointer:
        • src/libnetdata/netipc/src/service/netipc_service.c:1492-1497
    • Implication:
      • the current Go lookup gap is still at least partly a real Go implementation issue, not just a benchmark artifact
  • Follow-up measurement on the Go lookup bucket-copy fix:
    • Applied:
      • src/go/pkg/netipc/service/raw/cache.go
      • bench/drivers/go/main.go
      • changed bucket probes from by-value CacheItem copies to pointer/reference access
    • Rerun results:
      • c = 172,638,775
      • rust = 153,518,048
      • go = 115,783,444
    • Fact:
      • the fix had no material positive effect on Go lookup throughput
      • therefore the by-value bucket copy was not the dominant cause of the remaining Go lookup gap
  • Go lookup profile after the bucket-copy fix:
    • Evidence:
      • live perf profile of bench_posix_go lookup-bench
      • output row:
        • lookup,go,go,127,972,744
      • visible hot frames from /tmp/nipc-go-lookup-perf.data:
        • main.runLookupBench almost all samples
        • time.runtimeNano about 8%
        • runtime.memequal about 2%
    • Fact:
      • no single framework/library helper stands out as the dominant hotspot
      • the operation is so small that benchmark loop overhead and inlining dominate the profile
    • Working theory:
      • the remaining Go lookup gap is not currently a strong signal about the IPC framework itself
      • it is at least partly a benchmark-methodology issue for a tiny in-memory operation
  • Go non-batch pipeline server profile:
    • Evidence:
      • live perf profile of bench_posix_go uds-ping-pong-server under uds-pipeline-d16 load from a C client
      • client result during profile:
        • uds-pipeline-d16,c,c,567,061,...
      • hot frames from /tmp/nipc-go-server-perf.data:
        • Session.Send about 39.5%
        • Session.Receive about 33.8%
        • raw.pollFd about 23.1%
        • increment dispatch does not materially appear
    • Fact:
      • the remaining Go server gap on uds-pipeline-d16 is not in increment handler logic
      • it is dominated by the Go UDS server transport/poll path
    • Supporting fact:
      • Go as a client on the same scenario is only slightly slower than C/Rust:
        • go->c = 699,976 vs c->c = 713,563
        • go->rust = 685,614 vs c->rust = 720,602
      • implication:
        • the big remaining gap is mainly server-side, and pollFd is the strongest server-only suspect
  • New concrete finding: Go non-batch server gap is transport/poll dominated, not dispatch dominated
    • Evidence:
      • live perf profile of bench_posix_go uds-ping-pong-server under uds-pipeline-d16 load
      • hot path breakdown from /tmp/nipc-go-server-perf.data:
        • Session.Send about 39.5%
        • Session.Receive about 33.8%
        • raw.pollFd about 23.1%
        • increment dispatch does not materially appear in the hot path
    • Working theory:
      • the remaining Go server delta on uds-pipeline-d16 is in the Go UDS server transport/wrapper path, especially poll + recvmsg + sendmsg, not in the increment handler logic

Benchmark refresh slice (2026-03-26)

  • TL;DR:
    • rerun the full official benchmark suites on the current worktree for both Linux and Windows
    • regenerate the checked-in benchmark artifacts from those reruns
    • compare the refreshed Linux and Windows matrices and flag any materially strange language deltas
    • review and follow the existing repo TODO guidance for the real Windows win11 benchmark workflow
  • Analysis:
    • current checked-in benchmark artifacts are from 2026-03-25:
      • benchmarks-posix.md
      • benchmarks-windows.md
      • README.md
    • the official full-matrix runners are:
      • Linux:
        • tests/run-posix-bench.sh
        • tests/generate-benchmarks-posix.sh
      • Windows:
        • tests/run-windows-bench.sh
        • tests/generate-benchmarks-windows.sh
    • the verified Windows execution guidance already exists in repo TODOs and README:
      • README.md:342-365
      • TODO-pending-from-rewrite.md:2754-2849
    • current runner/generator methodology facts for Windows trustworthiness:
      • tests/run-windows-bench.sh currently writes exactly one CSV row per benchmark cell:
        • run_pair() parses one client result and immediately appends it to OUTPUT_CSV
        • there is no built-in repetition, aggregation, or instability gate
      • tests/generate-benchmarks-windows.sh validates completeness and floors, but it trusts each CSV row as final truth:
        • it has no notion of repeated samples, medians, spread, or outlier detection
      • implication:
        • a single noisy Windows measurement can currently become the published benchmark artifact if it still parses and keeps throughput above zero
    • benchmark methodology references gathered before changing the Windows workflow:
      • Google Benchmark user guide:
        • repeated benchmarks exist because a single result may not be representative when benchmarks are noisy
        • when repetitions are used, mean / median / standard deviation are reported
        • source examined:
          • /tmp/google-benchmark-20260326/docs/user_guide.md
      • Criterion.rs analysis and user guide:
        • noisy runs should be treated skeptically
        • longer measurement time reduces the influence of outliers
        • outlier classification is a first-class part of reliable benchmark analysis
        • sources examined:
          • /tmp/criterion-rs-20260326/book/src/user_guide/command_line_output.md
          • /tmp/criterion-rs-20260326/book/src/analysis.md
    • verified workflow facts from those docs:
      • real Windows benchmark proof is expected on win11, not via Linux cross-compilation
      • login shell may start as MSYSTEM=MSYS; benchmark runs should set:
        • PATH="/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH"
        • MSYSTEM=MINGW64
        • CC=/mingw64/bin/gcc
        • CXX=/mingw64/bin/g++
      • official Windows benchmark commands are:
        • bash tests/run-windows-bench.sh benchmarks-windows.csv 5
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
    • the current local worktree is not clean and includes benchmark-related source edits:
      • bench/drivers/go/main.go
      • bench/drivers/go/main_windows.go
      • bench/drivers/rust/src/main.rs
      • bench/drivers/rust/src/bench_windows.rs
      • plus service/transport files that can affect benchmark behavior
    • implication:
      • the refreshed artifacts must reflect this exact current tree
      • benchmark interpretation must distinguish:
        • real implementation/runtime asymmetry
        • normal platform differences
        • measurement distortion or stale artifact drift
  • Decisions:
    • no new user decision required before execution
    • using the existing official full-suite runners is the correct path
    • using the existing real win11 workflow is the correct Windows path
  • Plan:
    • run the full Linux benchmark suite locally on the current tree
    • regenerate benchmarks-posix.md
    • run the full Windows benchmark suite on win11 using the documented native-toolchain environment
    • regenerate benchmarks-windows.md
    • compare refreshed CSVs and summarize the largest cross-language spreads by scenario
    • classify strange deltas as:
      • expected platform/runtime behavior
      • suspicious and possibly measurement-related
      • suspicious and likely implementation-related
    • update benchmark-derived docs if the refreshed artifacts materially change the published snapshot
    • for the Windows trustworthiness fix:
      • change the Windows runner to collect multiple measured repetitions per benchmark cell instead of trusting a single sample
      • aggregate repeated samples into one publication row using a robust statistic instead of one lucky or unlucky run
      • preserve a fail-closed path:
        • if repeated Windows samples for a cell diverge beyond a configured spread threshold, fail the run instead of publishing that cell
      • keep the published CSV shape stable if possible, so the existing generator/report consumers do not need a schema rewrite just to gain trustworthiness
  • Implied decisions:
    • benchmark duration remains the documented default 5 seconds unless the runner fails and forces a diagnostic rerun
    • the first full pass should use the official artifact filenames:
      • benchmarks-posix.csv
      • benchmarks-posix.md
      • benchmarks-windows.csv
      • benchmarks-windows.md
    • if Windows artifacts are produced remotely, copy them back into this repo without resetting unrelated local files
  • Testing requirements:
    • Linux benchmark CSV must contain 201 data rows and pass the generator validation
    • Windows benchmark CSV must contain 201 data rows and pass the generator validation
    • refreshed artifacts must have no duplicate scenario keys and no zero-throughput rows
  • Documentation updates required:
    • update the checked-in benchmark markdown files to match the refreshed CSVs
    • update README.md only if the published generated dates, machine snapshot, or headline benchmark ranges are no longer true after the refresh
  • Execution results:
    • reviewed Windows benchmark handoff guidance before execution:
      • README.md:342-365
      • TODO-pending-from-rewrite.md:2754-2849
    • Linux benchmark refresh completed successfully on the current worktree:
      • command:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_posix
        • bash tests/run-posix-bench.sh benchmarks-posix.csv 5
        • bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md
      • result:
        • 201 rows
        • generator passed
        • all configured POSIX floors passed
    • Windows benchmark refresh completed on win11 native MSYS/MinGW toolchain path:
      • disposable synced tree:
        • /tmp/plugin-ipc-bench-20260326
      • command:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows
        • bash tests/run-windows-bench.sh benchmarks-windows.csv 5
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
      • factual result:
        • benchmark runner completed 201 rows
        • generator wrote benchmarks-windows.md
        • generator exited non-zero because of one floor violation:
          • shm-ping-pong rust->c @ max = 850,994
          • configured floor: 1,000,000
    • new user requirement after the unstable Windows reruns:
      • make Windows benchmarks trustworthy instead of relying on single noisy runs
      • allowed direction from user:
        • increase duration
        • run multiple repetitions
        • use any stronger methodology needed, as long as the published Windows benchmark artifacts become trustworthy
      • fit-for-purpose clarification:
        • Windows benchmark artifacts must be publication-grade on win11
        • single-run outliers must not be able to define the checked-in benchmark matrix
    • Windows trustworthiness implementation now applied locally:
      • tests/run-windows-bench.sh
        • new default: 5 measured samples per Windows benchmark cell
        • each published CSV row is now the median aggregate of those samples
        • the runner now persists per-cell repeated samples in RUN_DIR during execution
        • initial implementation used a blunt raw spread gate:
          • fail if max(sample_throughput) / min(sample_throughput) > 1.35
      • tests/generate-benchmarks-windows.sh
        • markdown output now states that the current Windows report is based on repeated aggregated measurements instead of one single sample
    • targeted proof of the new Windows trust method on win11:
      • synced the updated Windows runner/generator into the same disposable proof tree:
        • /tmp/plugin-ipc-bench-20260326
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-trust.csv 5
      • factual result:
        • completed successfully with the new 5-sample median path
        • no stability-gate failure
        • the previously suspicious rows are now stable:
          • shm-ping-pong rust->c @ max = 2,527,551
          • shm-ping-pong rust->rust @ 10000 = 9,999
        • all reported SHM sample ratios observed during that proof stayed well below the 1.35 gate
      • implication:
        • the old single-shot Windows SHM collapses were publication-methodology failures
        • with repeated measurement + median aggregation + spread gating, the same win11 host now produces a stable SHM matrix
    • first stability-gate refinement after proof runs on win11:
      • fact:
        • the initial raw max/min gate was too blunt for legitimate runs with one obvious transient outlier
      • evidence:
        • repeated sample file from the first full repeated run:
          • /tmp/netipc-bench-300472/samples-np-ping-pong-c-go-100000.csv
        • measured throughputs:
          • 17,798
          • 19,059
          • 15,586
          • 6,741
          • 18,303
        • implication:
          • one bad transient sample should not discard the whole row if the remaining samples agree tightly
      • attempted follow-up:
        • a Tukey-style outlier fence was tested next
      • fact:
        • with only 5 samples, that approach was too aggressive and incorrectly marked normal edge values as outliers
      • evidence:
        • repeated sample file:
          • /tmp/netipc-bench-287769/samples-np-ping-pong-go-c-0.csv
        • measured throughputs:
          • 17,419
          • 18,049
          • 18,078
          • 18,229
          • 18,533
        • implication:
          • the real spread there is only about 1.06x, so that row is stable and should be published
    • final trust method now applied locally after those proof runs:
      • tests/run-windows-bench.sh
        • keep 5 measured samples per published row
        • publish medians for throughput and latency/CPU columns
        • when there are at least 5 samples, drop exactly one lowest and one highest throughput sample before the stability check
        • require the remaining stable core to contain at least 3 samples
        • require stable-core throughput spread:
          • stable_max / stable_min <= 1.35
        • if the raw extremes are noisy but the stable core is good:
          • publish the row
          • print a warning that records both raw and stable spreads
      • tests/generate-benchmarks-windows.sh
        • methodology text updated to describe the stable-core rule instead of the original raw-spread wording
    • second stability-gate refinement after full-suite evidence on win11:
      • fact:
        • the first repeated full-suite rerun still found a real unstable case at 5s max-throughput duration:
          • snapshot-shm rust->go @ max
      • evidence:
        • repeated sample file:
          • /tmp/netipc-bench-300472/samples-snapshot-shm-rust-go-0.csv
        • measured throughputs:
          • 1,042,824
          • 977,680
          • 648,337
          • 367,491
          • 1,027,273
        • stable core after dropping one low and one high sample:
          • 648,337
          • 977,680
          • 1,027,273
        • stable-core ratio:
          • 1.584474
      • implication:
        • repeated measurement alone was not enough for all Windows max-throughput rows
        • some max rows needed a longer measurement window, not just more samples
    • max-throughput duration refinement now applied locally:
      • tests/run-windows-bench.sh
        • fixed-rate rows still use the CLI duration default:
          • 5s
        • max-throughput rows now use a separate default duration:
          • NIPC_BENCH_MAX_DURATION=10
        • the runner logs both durations at startup
      • targeted proof on win11 for the previously failing case:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=4 NIPC_BENCH_LAST_BLOCK=4 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/snapshot-shm-10s.csv 10
        • factual result:
          • previously failing snapshot-shm rust->go @ max became stable:
            • median throughput 1,053,376
            • stable-core ratio 1.018280
          • another noisy row also stabilized after trimming one low and one high sample:
            • snapshot-shm rust->c @ max
            • raw range:
              • 460,343 .. 1,167,598
            • stable-core range:
              • 1,109,218 .. 1,133,875
            • stable-core ratio:
              • 1.022229
      • implication:
        • the final trustworthy Windows method is now:
          • repeated measurement
          • median publication
          • stable-core gating
          • longer max-throughput samples
    • final proof run status after the trust-method changes:
      • full-suite rerun now in progress on win11 with the final method:
        • fixed-rate rows:
          • 5 samples x 5s
        • max-throughput rows:
          • 5 samples x 10s
        • stability rule:
          • publish only if the trimmed stable core stays within 1.35x
      • live confirmed progress:
        • np-ping-pong block completed cleanly under the final method
        • shm-ping-pong block started cleanly under the final method
    • first full repeated rerun with the 10s max default found one remaining unstable row late in the suite:
      • scenario:
        • np-pipeline-batch-d16 rust->rust @ max
      • preserved sample file:
        • /tmp/netipc-bench-331471/samples-np-pipeline-batch-d16-rust-rust-0.csv
      • measured throughputs:
        • 37,400,757
        • 31,635,302
        • 26,609,207
        • 39,324,202
        • 24,312,207
      • trimmed stable core:
        • 26,609,207 .. 37,400,757
      • stable-core ratio:
        • 1.405557
      • implication:
        • the runner correctly failed closed
        • the remaining instability was no longer global Windows SHM noise
        • it was narrowed to np-pipeline-batch @ max on win11
    • targeted proof for the remaining pipeline-batch max instability:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=9 NIPC_BENCH_LAST_BLOCK=9 NIPC_BENCH_MAX_DURATION=20 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv 5
      • factual result:
        • the full np-pipeline-batch-d16 matrix passed cleanly at 20s
        • previously failing row became stable:
          • rust->rust @ max = 34,184,748
          • stable-core ratio 1.064913
        • previously noisy go->c @ max also tightened materially:
          • 38,364,026
          • stable-core ratio 1.024521
      • implication:
        • the remaining issue was short-window measurement noise for np-pipeline-batch @ max
        • a longer max window fixes it without relaxing the trust gate
    • final Windows trust method now applied locally:
      • tests/run-windows-bench.sh
        • fixed-rate rows:
          • 5s
        • most max-throughput rows:
          • 10s
        • np-pipeline-batch-d16 @ max:
          • 20s
        • runner knobs now include:
          • NIPC_BENCH_MAX_DURATION
          • NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION
      • tests/generate-benchmarks-windows.sh
        • methodology section now documents the 20s pipeline-batch max window explicitly
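The resulting per-row duration selection can be sketched like this; the function name is illustrative, but the knob names and defaults mirror the ones documented above:

```shell
# Hypothetical sketch of per-row duration selection in the final method.
row_duration() {
  local scenario=$1 rate=$2               # rate is "max" or a target msgs/s
  local max_d=${NIPC_BENCH_MAX_DURATION:-10}
  local pb_max_d=${NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION:-20}
  if [[ $rate != max ]]; then
    echo 5                                # fixed-rate rows: CLI default 5s
  elif [[ $scenario == np-pipeline-batch-d16 ]]; then
    echo "$pb_max_d"                      # pipeline-batch @ max: 20s
  else
    echo "$max_d"                         # other max-throughput rows: 10s
  fi
}
```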
    • final published Windows artifact assembly:
      • full repeated rerun output from:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows.csv
        • used for all stable rows outside np-pipeline-batch-d16
        • notable publishable warning retained from that full rerun:
          • np-pipeline-d16 go->c @ max
          • raw range:
            • 111,201 .. 255,780
            • raw ratio 2.300159
          • trimmed stable core:
            • 234,582 .. 241,982
            • stable ratio 1.031545
          • implication:
            • the outlier-handling path is doing real work on win11
            • the published median row is still trustworthy because the stable core stayed tight
      • targeted validated 20s rerun output from:
        • /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv
        • used to replace the incomplete/unstable np-pipeline-batch-d16 block
      • locally assembled final CSV:
        • 202 lines total
        • 201 data rows
        • scenario counts all correct
      • local validation:
        • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
        • result:
          • all configured Windows floors pass
          • report generation passes cleanly
    • follow-up approved by Costa after the first trustworthy publish:
      • run one fresh full Windows suite on win11 with the current default methodology
      • objective:
        • remove the remaining "assembled artifact" caveat if the one-shot full run now passes end to end
      • execution rule:
        • sync the current local benchmark-related sources to the disposable win11 proof tree first
        • only replace the checked-in Windows CSV/MD if that single fresh rerun passes with all floors green
    • the current fresh-proof-tree rerun on win11 uses a new disposable tree: origin/main with the current local benchmark-related worktree files overlaid on top:
      • fresh tree:
        • /tmp/plugin-ipc-bench-20260327-fullrun-150313
      • factual setup issue discovered before the real rerun:
        • tests/run-windows-bench.sh builds the C and Go benchmark binaries itself, but it only consumes an already-built Rust benchmark binary
        • on a fresh disposable tree, the first launch printed:
          • Rust benchmark binary not found: .../src/crates/netipc/target/release/bench_windows.exe (Rust tests will be skipped)
        • implication:
          • a fresh tree needs an explicit Rust build before the full Windows benchmark suite, or the run degrades to a 2-language matrix and is not publishable
      • corrective action applied on win11 before the real rerun:
        • cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows
      • real rerun then restarted from the same fresh tree with diagnostics enabled:
        • NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv 5
      • live evidence from the ongoing one-shot full rerun:
        • no new diagnostics summary file has appeared so far
        • block 1 (np-ping-pong) is already materially clean end to end:
          • np-ping-pong c->c @ max = 19,627, stable_ratio=1.018133
          • np-ping-pong rust->c @ max = 19,880, stable_ratio=1.045638
          • np-ping-pong go->go @ max = 19,195, with one low and one high outlier trimmed, stable_ratio=1.098122
          • all published 10000/s rows reached target cleanly:
            • examples:
              • rust->c = 9,999, stable_ratio=1.000000
              • rust->rust = 9,999, stable_ratio=1.000000
              • go->go = 10,000, stable_ratio=1.000000
          • the first published 1000/s rows are also landing at target:
            • go->c = 1,000, stable_ratio=1.000000
            • go->go = 1,000, stable_ratio=1.000000
        • the rerun has already crossed into the historically suspicious SHM block without reproducing the old collapse:
          • shm-ping-pong c->c @ max = 2,565,990, stable_ratio=1.042022
          • shm-ping-pong rust->c @ max = 2,443,021, stable_ratio=1.089130
          • shm-ping-pong c->rust @ max = 2,611,306, stable_ratio=1.071212
          • shm-ping-pong rust->rust @ max = 2,617,581, stable_ratio=1.027963
          • shm-ping-pong go->rust @ max = 2,327,904, stable_ratio=1.012447
        • factual interim conclusion:
          • the current one-shot full rerun is already materially stronger evidence than the older failing full runs
          • the earlier full-suite shm-ping-pong rust->c collapse is not reproducing on the same win11 host after the current lifecycle and Windows SHM fixes
      • live continuation coordinates for the long one-shot rerun:
        • win11 source tree:
          • /tmp/plugin-ipc-bench-20260327-fullrun-150313
        • live output files:
          • CSV:
            • /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv
          • log:
            • /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.log
        • last verified progress in this session:
          • 75 lines in the CSV (74 data rows)
          • blocks 1 and 2 completed cleanly
          • block 3 (snapshot-baseline) had started and was publishing stable @ max rows:
            • c->c = 19,872, stable_ratio=1.029521
            • rust->c = 19,291, stable_ratio=1.043116
          • no new diagnostics summary file had appeared yet
      • later checkpoint from the same still-running one-shot rerun:
        • 121 lines in the CSV (120 data rows)
        • blocks 1 through 4 had already cleared cleanly and the run had advanced deep into block 5 (np-batch-ping-pong)
        • live batch evidence:
          • np-batch-ping-pong c->go @ max = 7,699,399, stable_ratio=1.045676
          • np-batch-ping-pong rust->go @ max = 7,532,805, stable_ratio=1.018880
          • np-batch-ping-pong go->go @ max = 7,152,856, stable_ratio=1.030591
          • np-batch-ping-pong c->c @ 100000/s = 7,693,465, stable_ratio=1.011300
          • np-batch-ping-pong rust->c @ 100000/s = 7,497,010, stable_ratio=1.015083
        • no new diagnostics summary file had appeared yet at this checkpoint either
      • completed outcome of the clean one-shot Windows rerun:
        • the long win11 one-shot rerun finished cleanly
        • final CSV size:
          • 202 logical lines
          • 201 data rows
        • no new diagnostics summary file was produced during this rerun
        • tests/generate-benchmarks-windows.sh passed on win11 against the final CSV:
          • All performance floors met
        • the final generated report was copied back into the repo as:
          • benchmarks-windows.csv
          • benchmarks-windows.md
        • the same generator also passed locally after copying the artifacts back:
          • bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md
          • result:
            • All performance floors met
        • user-approved follow-up after the successful one-shot rerun:
          • commit the Windows artifact refresh and the TODO update as a separate git commit
          • do not include unrelated dirty files from the broader worktree
        • user-approved follow-up after the local commit:
          • push commit 768cca3 to origin/main
          • do not include any of the remaining unrelated dirty files
        • implication:
          • the remaining "assembled artifact" caveat is now removed
          • the checked-in Windows artifacts now come from a single clean one-shot full rerun on win11
        • stable final Windows max-throughput spreads from that clean one-shot artifact:
          • shm-ping-pong:
            • best:
              • rust->rust = 2,617,581
            • worst:
              • go->go = 2,113,834
            • spread:
              • 1.238x
            • conclusion:
              • no strange SHM collapse remains in the final clean artifact
          • lookup:
            • best:
              • rust = 176,259,707
            • worst:
              • go = 98,385,649
            • spread:
              • 1.792x
          • np-pipeline-d16:
            • best:
              • go->rust = 240,205
            • worst:
              • c->go = 216,940
            • spread:
              • 1.107x
          • np-pipeline-batch-d16:
            • best:
              • go->c = 39,065,948
            • worst:
              • c->go = 27,896,181
            • spread:
              • 1.400x
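The spreads above are plain fastest/slowest throughput ratios; recomputing one from the artifact numbers as a sanity check:

```shell
# shm-ping-pong: fastest rust->rust vs slowest go->go in the final artifact
awk 'BEGIN { printf "%.3fx\n", 2617581 / 2113834 }'
# prints 1.238x
```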
    • first one-shot full rerun attempt with the current defaults did not produce a clean replacement artifact:
      • partial output path:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot.csv
      • factual failure observed during block 1:
        • np-ping-pong rust->rust @ 1000/s
        • Rust client exited non-zero
        • streamed client output reported:
          • client: 4207 errors
          • partial line:
            • np-ping-pong,rust,rust,159,75.500,177.400,177.400,5.6,0.0,5.6
      • implication:
        • the one-shot rerun cannot replace the current published Windows artifact
        • before attempting another full rerun, the new failure should be isolated on block 1 to determine whether it is reproducible or a one-off transport/runtime glitch
    • isolated recheck of block 1 completed cleanly on the same win11 proof tree:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/block1-recheck.csv 5
      • output path:
        • /tmp/plugin-ipc-bench-20260326/block1-recheck.csv
      • factual result:
        • all 36 block-1 measurements completed with exit code 0
        • the previously failing row completed cleanly:
          • np-ping-pong rust->rust @ 1000/s = 1000
          • p50=66.200us
          • p95=248.200us
          • p99=369.500us
          • stable_ratio=1.000000
      • implication:
        • the first one-shot block-1 failure is not immediately reproducible
        • this currently looks like a transient host/runtime glitch, not an established deterministic instability in the rust->rust @ 1000/s pair
        • the next valid check is another clean one-shot full Windows rerun with the same default methodology
    • second one-shot full rerun with the current defaults also failed to produce a clean replacement artifact:
      • partial output path:
        • /tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot-2.csv
      • factual failure observed during block 2:
        • shm-ping-pong rust->c @ max
        • repeated-sample file:
          • /tmp/netipc-bench-410987/samples-shm-ping-pong-rust-c-0.csv
        • repeated throughputs:
          • 618,076
          • 618,160
          • 1,951,036
          • 2,303,714
          • 2,476,081
        • stable-core gate result:
          • stable_min=618,160
          • stable_max=2,303,714
          • stable_ratio=3.726728
          • configured max: 1.35
      • implication:
        • the current default methodology still does not guarantee a clean one-shot full Windows run on win11
        • the blocker has moved from a random-looking block-1 client failure to a concrete SHM max-throughput instability event
    • focused reproduction of the same SHM pair in isolation did not reproduce the collapse:
      • direct pair under the same synced win11 tree:
        • C server: bench_windows_c.exe shm-ping-pong-server
        • Rust client: bench_windows.exe shm-ping-pong-client
      • isolated rust -> c @ max repeated 10 times with 10s samples:
        • throughput range:
          • 2,446,407 .. 2,578,450
        • all 10 runs stayed in the fast band
      • isolated rust -> c @ max repeated 10 times with 20s samples:
        • throughput range:
          • 2,363,335 .. 2,589,588
        • all 10 runs stayed in the fast band
      • implication:
        • the SHM collapse is not a simple deterministic rust client -> c server bug
        • longer isolated samples are stable, but that alone does not explain the one-shot full-run failure
    • sequence test also failed to reproduce the SHM collapse:
      • setup:
        • one c -> c @ max SHM prime run
        • followed immediately by 5 direct rust -> c @ max SHM runs
        • repeated for 5 cycles on the same RUN_DIR
      • factual result:
        • all 25 post-prime rust -> c runs stayed in the fast band:
          • 2,357,337 .. 2,664,284
      • implication:
        • the failure is not explained by a simple "previous c -> c SHM row poisons the next rust -> c row" theory
        • current best description:
          • rare transient host/runtime glitch during full-matrix execution on win11
          • not immediately reproducible in dedicated pair or simple sequence tests
    • pending user decision before more Windows runner code changes:
      • context:
        • Costa asked for trustworthy Windows benchmarks
        • current state is better than before, but a clean one-shot full run is still not guaranteed
      • user constraint raised during decision review:
        • automatic retries must not hide real failures or real bugs
        • if retries are ever used, first-attempt failures must remain visible and reportable
      • user decision:
        • keep the main Windows benchmark publication path fail-closed
        • do not add silent self-healing retries to publish mode
        • add a separate diagnostic mode that can rerun failed rows in isolation
        • diagnostic mode must preserve and report the original first-attempt failure evidence side by side with any diagnostic rerun evidence
      • option A:
        • add automatic per-row retry on Windows when a row fails because of client error or stability-gate failure
        • keep the current 5-sample median + 1.35 stable-core gate inside each attempt
        • implications:
          • one transient bad row no longer destroys a 2-hour full run
          • a row is still published only if a full fresh attempt passes the same gate
        • risks:
          • published rows may come from retry attempt 2 or 3, not from the first pass
          • the report and logs must say that retries happened, or the methodology becomes misleading
      • option B:
        • keep fail-closed behavior, but increase Windows SHM max collection further:
          • for example 20s per sample and/or 7-9 repeats
        • implications:
          • simpler story than retries
          • every accepted row is still strictly one attempt
        • risks:
          • much longer full-suite runtime
          • evidence so far does not prove that longer duration alone fixes the rare full-run glitch
      • option C:
        • keep the current runner and accept targeted reruns / assembled Windows artifacts when one-shot full runs glitch
        • implications:
          • fastest operationally
          • still produces trustworthy rows when each replacement row is validated carefully
        • risks:
          • no clean single-command reproduction
          • more manual work and more caveats around publication
      • accepted direction:
        • strict publish mode plus separate diagnostic reruns
        • rationale:
          • failures stay visible
          • diagnostic reruns can still accelerate root-cause work without turning the publication path into silent self-healing
    • implemented Windows diagnostic mode for failed rows:
      • file:
        • tests/run-windows-bench.sh
      • new behavior:
        • publish mode remains fail-closed by default
        • opt-in diagnostics via:
          • NIPC_BENCH_DIAGNOSE_FAILURES=1
        • when a row fails in publish mode:
          • the original failure remains authoritative
          • the original RUN_DIR and first-attempt sample file remain preserved
          • the same row is rerun in an isolated diagnostic subdirectory under the preserved RUN_DIR
          • diagnostic rerun output is recorded in:
            • ${RUN_DIR}/diagnostics-summary.txt
          • diagnostic reruns never write rows into the publish CSV
      • implementation details:
        • row-level measurement state is now tracked explicitly:
          • failure reason
          • sample-file path
          • aggregate throughput/latency/CPU values
          • stability metrics
        • diagnostic reruns restore the original first-failure state after logging the isolated rerun evidence
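A condensed sketch of that control flow, with hypothetical names (the real implementation in tests/run-windows-bench.sh tracks much more per-row state and actually reruns the row; the rerun itself is elided here):

```shell
# Hypothetical sketch of the fail-closed publish path with opt-in diagnostics.
# Assumes RUN_DIR is already set to the preserved run directory.
handle_row_failure() {
  local row=$1 reason=$2 seq=$3
  # The first-attempt failure stays authoritative: nothing is published.
  echo "FAIL: $row ($reason)" >&2
  if [[ ${NIPC_BENCH_DIAGNOSE_FAILURES:-0} = 1 ]]; then
    # Allocate an isolated subdirectory under the preserved RUN_DIR for the
    # diagnostic rerun; evidence is appended to diagnostics-summary.txt and
    # never written into the publish CSV.
    local diag_dir="${RUN_DIR}/diagnostics/$(printf '%03d' "$seq")-${row}"
    mkdir -p "$diag_dir"
    {
      echo "row=$row"
      echo "first_attempt_reason=$reason"
      echo "diagnostic_dir=$diag_dir"
    } >> "${RUN_DIR}/diagnostics-summary.txt"
  fi
  return 1                                # publish mode remains fail-closed
}
```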
    • forced validation of the new diagnostic mode on win11:
      • purpose:
        • prove that publish mode still fails closed
        • prove that diagnostic reruns preserve the original evidence and create side-by-side isolated rerun evidence
      • command:
        • NIPC_BENCH_FIRST_BLOCK=7 NIPC_BENCH_LAST_BLOCK=7 NIPC_BENCH_DIAGNOSE_FAILURES=1 NIPC_BENCH_REPETITIONS=3 NIPC_BENCH_MAX_DURATION=1 NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION=1 NIPC_BENCH_MAX_THROUGHPUT_RATIO=0.9 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv 1
      • factual result:
        • runner exited non-zero as expected
        • publish CSV remained header-only:
          • /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv
        • preserved original run dir:
          • /tmp/netipc-bench-425494
        • diagnostic summary created:
          • /tmp/netipc-bench-425494/diagnostics-summary.txt
        • distinct diagnostic rerun dirs created per failed row:
          • /tmp/netipc-bench-425494/diagnostics/001-lookup-c-c-0
          • /tmp/netipc-bench-425494/diagnostics/002-lookup-rust-rust-0
          • /tmp/netipc-bench-425494/diagnostics/003-lookup-go-go-0
      • implication:
        • the new mode preserves truth in publish mode
        • it also gives immediate isolated rerun evidence for investigation without silently healing the benchmark artifact
    • next-step approval from Costa:
      • commit and push the strict publish + diagnostic-mode runner changes
      • then proceed immediately to the real Windows SHM investigation using the new diagnostic mode on the actual failing slice
    • commit / push completed for the diagnostic-mode runner change:
      • commit:
        • 870fc93
      • subject:
        • bench: add Windows diagnostic reruns
      • pushed to:
        • origin/main
    • real Windows SHM investigation with the new diagnostic mode:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5
      • factual result:
        • block 2 completed successfully with exit code 0
        • no diagnostic rerun triggered for any SHM row
        • the previously suspicious row completed cleanly:
          • shm-ping-pong rust->c @ max = 2,465,857
          • stable_ratio=1.021516
        • the full SHM max matrix stayed stable:
          • c->c = 2,461,053
          • rust->c = 2,465,857
          • go->c = 2,162,135
          • c->rust = 2,597,936
          • rust->rust = 2,530,435
          • go->rust = 2,065,765
          • c->go = 2,570,619
          • rust->go = 2,254,772
          • go->go = 2,079,323
        • all 100000/s, 10000/s, and 1000/s SHM rows also completed stably in the same block run
      • implication:
        • the Windows SHM instability still does not reproduce when block 2 runs in isolation under the real runner
        • current strongest working theory:
          • the failure depends on broader full-suite context on win11
          • not on the standalone SHM block itself
    • targeted confirmation of the Windows SHM anomaly:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-confirm.csv 5
      • confirmed max-throughput rerun on the same win11 tree:
        • c->c = 2,396,963
        • rust->c = 1,708,649
        • go->c = 886,451
        • c->rust = 2,566,391
        • rust->rust = 2,563,582
        • go->rust = 2,053,507
        • c->go = 2,539,899
        • rust->go = 2,215,733
        • go->go = 2,047,115
      • factual conclusion:
        • the original rust->c full-suite collapse is not stable
        • max-throughput Windows SHM rows can swing materially between reruns on win11
        • target-rate Windows SHM rows remain stable near their requested rates
        • implication:
          • the strange Windows SHM max delta is currently a measurement-stability / host-noise issue, not a proven deterministic language regression
  • Refreshed max-throughput spread summary:
    • Linux:
      • lookup:
        • fastest c->c = 167,974,040
        • slowest go->go = 127,908,975
        • spread: 1.31x
        • improvement versus checked-in previous artifact: 1.77x -> 1.31x
      • shm-ping-pong:
        • fastest rust->rust = 3,486,454
        • slowest go->go = 1,725,340
        • spread: 2.02x
        • note:
          • this widened versus the previous checked-in artifact because go->go max throughput dropped materially
      • shm-batch-ping-pong:
        • fastest c->c = 61,778,266
        • slowest go->go = 31,810,209
        • spread: 1.94x
      • uds-pipeline-d16:
        • fastest rust->c = 712,544
        • slowest rust->go = 550,630
        • spread: 1.29x
      • uds-pipeline-batch-d16:
        • fastest c->c = 99,746,787
        • slowest go->go = 50,690,629
        • spread: 1.97x
    • Windows:
      • lookup:
        • fastest rust->rust = 178,835,588
        • slowest go->go = 97,109,788
        • spread: 1.84x
      • shm-ping-pong full suite:
        • fastest c->rust = 2,650,754
        • slowest rust->c = 850,994
        • spread: 3.11x
        • but targeted confirmation disproved rust->c as a stable deterministic outlier
      • shm-batch-ping-pong:
        • fastest c->c = 52,520,469
        • slowest go->go = 34,390,650
        • spread: 1.53x
      • np-pipeline-batch-d16:
        • fastest go->rust = 38,249,582
        • slowest go->go = 24,333,588
        • spread: 1.57x
  • Strange delta findings that remain real after the refresh:
    • Linux uds-pipeline-d16:
      • Go server remains the clear slow case across clients:
        • c->go = 559,691
        • rust->go = 550,630
        • go->go = 553,858
        • versus C/Rust servers near 686k-713k
      • implication:
        • this is a stable Go-server transport/runtime cost, not client-specific noise
    • Linux uds-pipeline-batch-d16:
      • server choice dominates:
        • C server: 96.2M-99.7M
        • Rust server: 84.1M-86.3M
        • Go server: 50.7M-51.3M
      • implication:
        • the known batch-path server asymmetry is still real
    • Linux shm-batch-ping-pong:
      • C server stays strongest
      • Rust server is mid-band
      • Go server is slowest
      • implication:
        • still consistent with real server-side implementation overhead, not runner corruption
    • Linux / Windows lookup:
      • Linux:
        • c = 167.97M
        • rust = 146.15M
        • go = 127.91M
      • Windows:
        • rust = 178.84M
        • c = 125.60M
        • go = 97.11M
      • implication:
        • lookup is now measuring runtime/data-structure efficiency more than IPC transport behavior
        • the previous fake linear-scan distortion is gone, but cross-language runtime overhead remains visible
  • Strange delta finding that is currently suspicious but not yet proven real:
    • Windows shm-ping-pong @ max:
      • full-suite run made rust->c miss the floor
      • immediate confirmation run moved the collapse to go->c instead
      • conclusion:
        • this is currently a max-throughput measurement-stability issue on win11
        • do not interpret a single bad max row there as a stable language-specific regression without targeted rerun confirmation
    • second isolated Windows SHM rerun on the same win11 tree reinforced the same conclusion:
      • command:
        • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-rerun.csv 5
      • @max rows:
        • c->c = 2,516,450
        • rust->c = 2,430,413
        • go->c = 2,179,591
        • c->rust = 2,497,180
        • rust->rust = 2,473,159
        • go->rust = 2,114,944
        • c->go = 2,571,394
        • rust->go = 2,282,433
        • go->go = 2,100,658
      • implication:
        • the full-suite rust->c collapse to 850,994 is definitely not stable
      • additional warning sign from the same isolated rerun:
        • some target_rps=10000 rows also became unstable:
          • c->rust = 5,073
          • rust->rust = 4,098
          • while other rows in the same block stayed near 10,000
        • implication:
          • the Windows SHM benchmark instability is not limited to one language pair or only to the first full-suite run
  • Post-commit diagnostic runner work (870fc93 bench: add Windows diagnostic reruns):
    • committed and pushed:
      • commit: 870fc93
      • pushed to origin/main
    • immediate next investigation on win11:
      • goal:
        • identify the smallest Windows benchmark context that reproduces the earlier full-suite SHM collapse
      • standalone SHM block with diagnostics enabled:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5
        • result:
          • exited 0
          • no diagnostics triggered
        • key shm-ping-pong @ max rows:
          • c->c = 2,461,053 with stable_ratio=1.018190
          • rust->c = 2,465,857 with stable_ratio=1.021516
          • go->c = 2,162,135 with stable_ratio=1.017540
          • c->rust = 2,597,936 with stable_ratio=1.016334
          • rust->rust = 2,530,435 with stable_ratio=1.020250
          • go->rust = 2,065,765 with stable_ratio=1.029206
          • c->go = 2,571,619 with stable_ratio=1.013998
          • rust->go = 2,254,772 with stable_ratio=1.022145
          • go->go = 2,079,323 with stable_ratio=1.010925
        • factual conclusion:
          • block 2 alone is stable under the real repeated-median runner
          • the earlier full-suite rust->c collapse is not a standalone SHM bug
      • combined NP -> SHM prefix with diagnostics enabled:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-diagnose.csv 5
        • result:
          • exited 0
          • no diagnostics triggered
          • total measurements: 72
        • key np-ping-pong @ max rows:
          • c->c = 19,411
          • rust->c = 19,735
          • go->c = 18,744
          • c->rust = 20,188
          • rust->rust = 20,301
          • go->rust = 19,277
          • c->go = 19,383
          • rust->go = 18,558
          • go->go = 19,241
        • key shm-ping-pong @ max rows:
          • c->c = 2,522,584
          • rust->c = 2,522,004
          • go->c = 2,071,095
          • c->rust = 2,580,971
          • rust->rust = 2,511,775
          • go->rust = 2,308,182
          • c->go = 2,657,019
          • rust->go = 2,273,563
          • go->go = 2,109,132
        • factual conclusion:
          • the failure does not reproduce with blocks 1-2
          • the earlier bad rust->c full-suite row requires broader full-suite context than just the NP -> SHM transition
    • updated working theory:
      • speculation:
        • a later block, or cumulative state from multiple later blocks, is needed to trigger the rare full-suite Windows instability
      • not supported by evidence anymore:
        • standalone SHM bug
        • simple NP -> SHM transition bug
    • next diagnostic step:
      • extend the prefix to block 3 and repeat:
        • NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-batch-diagnose.csv 5
  • Decision needed before code, based on the deep-dive findings after extending the prefix to block 3:
    • Option A: Fix the Windows benchmark runner only.
      • scope:
        • replace hard-kill shutdown with graceful server stop / wait and hard-kill fallback only on timeout
        • make per-repeat server/client output files unique
        • fix diagnostic bookkeeping so preserved run dirs and summaries always match the actual run
      • benefits:
        • directly targets the strongest evidence
        • smallest code-change surface
        • most likely enough to make the benchmark harness trustworthy
      • implications:
        • benchmark methodology changes only, not transport semantics
        • if Windows SHM object-collision handling is also weak, the benchmark harness may become stable while the product bug remains latent
      • risks:
        • could leave a real Windows transport bug hidden until another scenario hits it outside the benchmark harness
    • Option B: Fix the Windows benchmark runner and harden Windows SHM object creation in C, Rust, and Go.
      • scope:
        • everything in Option A
        • plus explicit ERROR_ALREADY_EXISTS handling for Windows SHM mappings/events and clearer collision errors
      • benefits:
        • addresses both the likely benchmark root cause and a real transport safety gap
        • makes leaked object collisions explicit instead of nondeterministic
      • implications:
        • larger change across multiple language implementations
        • requires more testing
      • risks:
        • broader patch, more review surface, more chance of side effects if the three implementations are not kept perfectly aligned
    • Option C: Continue diagnosis without code changes.
      • scope:
        • more targeted reruns and more artifact collection
      • benefits:
        • lowest code risk
      • implications:
        • more benchmark time burned with a runner that is already known to violate the server lifecycle on Windows
      • risks:
        • low leverage
        • likely delays the obvious fix
    • recommendation:
      • Option B
      • reasoning:
        • the hard-kill runner behavior is the strongest causal explanation for the benchmark instability
        • but the Windows SHM create path also has a real hardening gap
        • if the goal is trustworthy Windows benchmarks, fixing only the runner is probably enough for the harness but not for the underlying transport robustness
    • user decision:
      • Option B
      • accepted scope:
        • fix the Windows benchmark runner lifecycle and diagnostics bookkeeping
        • harden Windows SHM object creation in C, Rust, and Go to detect existing named objects explicitly
    • implementation and verification after Option B:
      • local code changes completed:
        • runner:
          • tests/run-windows-bench.sh
        • Windows SHM hardening:
          • src/libnetdata/netipc/include/netipc/netipc_win_shm.h
          • src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c
          • src/crates/netipc/src/transport/win_shm.rs
          • src/go/pkg/netipc/transport/windows/shm.go
        • regression coverage:
          • tests/fixtures/c/test_win_shm.c
          • src/crates/netipc/src/transport/win_shm.rs
          • src/go/pkg/netipc/transport/windows/shm_test.go
      • factual runner behavior after the patch:
        • the Windows runner now:
          • uses a unique per-repeat runtime/artifact directory instead of reusing the same RUN_DIR for every repeat
          • waits for benchmark servers to stop themselves before killing them
          • preserves the root run dir on any measurement-command failure, not only on stability-gate failures
          • records the first-attempt artifact directory in the diagnostics summary
      • factual transport behavior after the patch:
        • C, Rust, and Go Windows SHM server-create paths now reject existing named mappings/events explicitly instead of treating them as successful creates
        • new error surface:
          • C:
            • NIPC_WIN_SHM_ERR_ADDR_IN_USE
          • Rust:
            • WinShmError::AddrInUse
          • Go:
            • ErrWinShmAddrInUse
      • first verification on win11:
        • focused Windows SHM duplicate-create coverage now passes in all three implementations:
          • Go:
            • cd src/go && GOOS=windows GOARCH=amd64 go test -run TestWinShmServerCreateRejectsExistingObjects -count=1 ./pkg/netipc/transport/windows
          • Rust:
            • cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_create_rejects_existing_objects_windows -- --test-threads=1
          • C:
            • cmake --build build -j4 --target test_win_shm
            • ctest --test-dir build --output-on-failure -R '^test_win_shm$'
        • result:
          • all passed
      • factual new issue exposed by the stricter runner:
        • extending the real benchmark rerun to NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 no longer reproduced the old random SHM collapse
        • instead, it exposed a deterministic Rust benchmark-driver shutdown bug:
          • every row using a Rust Windows benchmark server failed with:
            • Server rust (...) did not exit cleanly within 10s; forcing kill
          • preserved server output contained only:
            • READY
          • implication:
            • the stricter runner removed the old benchmark-driver hard-kill masking and surfaced a real Rust benchmark-driver lifecycle bug
        • root cause:
          • bench/drivers/rust/src/bench_windows.rs still used the old Windows stop pattern:
            • only running_flag.store(false, ...)
            • no wake connection
          • this is the same Windows accept-loop issue already fixed earlier in the Rust Windows tests:
            • ConnectNamedPipe() stays blocked until a connection wakes it
        • fix:
          • bench/drivers/rust/src/bench_windows.rs now mirrors the tested shutdown pattern:
            • after duration+3, set running_flag = false
            • then issue a dummy NpSession::connect(...) so the blocked accept loop can observe shutdown and exit cleanly
        • direct proof on win11:
          • command:
            • timeout 20 src/crates/netipc/target/release/bench_windows.exe np-ping-pong-server /tmp/plugin-ipc-bench-20260327 rust-stop-check 1
          • result:
            • READY
            • SERVER_CPU_SEC=0.000000
          • implication:
            • the Rust Windows benchmark server now exits on its own instead of hanging until killed
      • focused real benchmark proof after all fixes:
        • command:
          • NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/block-2-after-fix.csv 5
        • result:
          • exited 0
          • no diagnostic reruns were needed
          • all 36 shm-ping-pong rows published
        • key evidence:
          • the previously suspicious Windows row is now stable:
            • shm-ping-pong rust->c @ max = 2,458,786 with stable_ratio=1.009920
          • all SHM @ max rows completed inside the stability gate:
            • c->c = 2,505,981 with stable_ratio=1.038817
            • rust->c = 2,458,786 with stable_ratio=1.009920
            • c->rust = 2,588,642 with stable_ratio=1.028021
            • rust->rust = 2,649,571 with stable_ratio=1.018367
            • rust->go = 2,242,750 with stable_ratio=1.045399
          • the previously suspicious fixed-rate rows are now also stable:
            • rust->c @ 100000/s = 99,997 with stable_ratio=1.000010
            • rust->c @ 10000/s = 9,999 with stable_ratio=1.000000
            • rust->rust @ 10000/s = 9,999 with stable_ratio=1.000000
        • factual conclusion from the focused SHM rerun:
          • the Windows SHM benchmark instability is materially reduced after:
            • runner lifecycle fixes
            • per-repeat runtime isolation
            • explicit Windows SHM collision detection
            • Rust benchmark-server wake-on-stop fix
          • the earlier rust->c SHM collapse no longer reproduces in the real benchmark block that used to be suspicious
      • partial full-suite proof after the focused fixes:
        • command started on win11:
          • NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/full-after-fix.csv 5
        • factual behavior before manual interruption:
          • no diagnostics were emitted
          • no forced-kill Rust benchmark-server failures reappeared
          • the run cleared the exact NP area where the stricter runner had previously exposed the Rust benchmark-server shutdown bug:
            • np-ping-pong @ max rows for Rust servers completed cleanly
            • np-ping-pong @ 100000/s rows for Rust servers completed cleanly
            • np-ping-pong @ 10000/s rows were still running cleanly when the run was stopped intentionally for time
        • reason for interruption:
          • no new technical blocker remained
          • the rest of the work was wall-clock runtime only
  • Finding recorded on 2026-04-15 during clean full validation after commit 074a3a5c8f552f15473a0bc929b96da9e71f79b7:
    • Windows full validation failed in the MSYS bounded benchmark policy step.
    • Failing command context:
      • bash tests/run-windows-msys-validation.sh
      • comparison artifact directory:
        • /tmp/plugin-ipc-full-windows-20260415-162223/msys-validation/bench-compare/
    • Concrete failing policy row:
      • np-max,np-ping-pong,c,c,0,both,70.0,44.9,fail
    • Concrete joined comparison row:
      • np-max,np-ping-pong,c,c,0,both,14058.000,6317.000,44.9,55.1,185.000,459.200,148.2,31.900,30.300
    • Rows that passed in the same policy run:
      • np-100k: MSYS was 100.5% of native mingw64
      • snapshot-np: MSYS was 96.9% of native mingw64
      • shm-max: MSYS was 88.5% of native mingw64
      • all mixed C/Rust SHM rows passed their configured floors
    • Interpretation:
      • the failure is narrow: only unbounded/max-rate C-to-C named-pipe ping-pong under MSYS is affected
      • fixed-rate named-pipe behavior did not regress in the same run
      • no code or policy change is justified yet without investigating whether the np-max floor is a valid semantic regression floor or an invalid saturation policy for MSYS
    • Immediate investigation plan:
      • inspect tests/compare-windows-bench-toolchains.sh policy intent for np-max
      • inspect the per-sample artifacts for both mingw64 and msys np-max
      • compare C named-pipe benchmark behavior under unbounded and fixed-rate modes
      • rerun the narrow np-max pair enough to determine whether this is repeatable or a noisy saturation outlier
      • only then decide whether to fix code, fix harness policy, or both
    • Follow-up evidence from a narrow rerun on the same win11:~/src/plugin-ipc.git checkout:
      • command:
        • paired targeted rerun of np-ping-pong,c,c,0 for mingw64 and msys
        • artifact directory:
          • /tmp/plugin-ipc-investigate-npmax-20260415-165234/
      • result:
        • mingw64: 22489.000
        • msys: 22492.000
        • effective MSYS/native ratio: approximately 100.0%
      • conclusion:
        • the implementation can pass the intended np-max relative policy immediately after the failed full run
        • the problem is a compare-lane false failure mode: a stable-looking saturation outlier can pass row-local guards and fail only at the final relative-policy stage
    • Implemented harness fix:
      • tests/compare-windows-bench-toolchains.sh now reruns policy-failed labels as paired mingw64+msys rows before final failure
      • default final-policy attempt budget:
        • NIPC_BENCH_COMPARE_POLICY_ATTEMPTS=3
      • prior failed attempts are preserved as:
        • summary.attempt-N.csv
        • joined.attempt-N.csv
        • policy.attempt-N.csv
      • no throughput floor was lowered
    • Documentation/test updates:
      • README.md and WINDOWS-COVERAGE.md document the paired policy retry behavior
      • added tests/test_windows_compare_policy_retry.sh with a stub targeted runner proving:
        • first policy attempt fails np-max
        • only the failed label is rerun
        • final policy passes after the paired retry
    • Local verification:
      • bash -n tests/compare-windows-bench-toolchains.sh tests/test_windows_compare_policy_retry.sh tests/run-windows-msys-validation.sh
      • bash tests/test_windows_compare_policy_retry.sh
      • bash tests/test_windows_bench_stability_policy.sh
      • git diff --check
      • all passed
  • Validation evidence recorded after commit bb996c638f6c73cb9f3a8b0aac55d09819548979:
    • Linux full validation evidence:
      • command family:
        • build, ctest, Rust tests, Go tests, Go race, extended fuzz, C/Go/Rust coverage, ASAN, TSAN, Valgrind, all POSIX interop matrices, and POSIX benchmark generation
      • artifact directory:
        • /tmp/plugin-ipc-full-linux-20260415-162222/
      • result:
        • full-linux.log ended with FULL LINUX VALIDATION PASSED
        • ctest: 100% tests passed, 0 tests failed out of 42
        • Rust: 305 passed; 0 failed
        • Go: all packages passed
        • Go race: all packages passed
        • extended fuzz: 11 passed, 0 failed
        • C coverage total: 92.3%
        • Go coverage total: 94.3%
        • Rust coverage total: 95.17%
        • ASAN: 7/7 passed
        • TSAN: 6/6 passed with no races
        • Valgrind: 7/7 passed with zero errors/leaks/invalid accesses
        • POSIX benchmark CSV: /tmp/plugin-ipc-full-linux-20260415-162222/benchmarks-posix.csv
        • POSIX benchmark data rows: 201
        • POSIX generator result: all performance floors met
      • exactness caveat:
        • this full Linux matrix ran at commit 074a3a5c8f552f15473a0bc929b96da9e71f79b7
        • commit bb996c638f6c73cb9f3a8b0aac55d09819548979 changed only Windows compare/docs/TODO/test files:
          • tests/compare-windows-bench-toolchains.sh
          • tests/test_windows_compare_policy_retry.sh
          • README.md
          • WINDOWS-COVERAGE.md
          • TODO-netdata-plugin-ipc-integration.md
        • the changed Windows compare test was verified locally after bb996c6 with:
          • bash tests/test_windows_compare_policy_retry.sh
          • bash tests/test_windows_bench_stability_policy.sh
    • Native Windows correctness/coverage/interop evidence from the full validation run:
      • checkout:
        • win11:~/src/plugin-ipc.git
        • commit 074a3a5c8f552f15473a0bc929b96da9e71f79b7
      • artifact directory:
        • /tmp/plugin-ipc-full-windows-20260415-162223/
      • results before the MSYS bounded-compare failure:
        • build and ctest: 30/30 passed
        • Rust Windows lib tests: 195 passed; 0 failed
        • Go Windows tests: passed
        • App Verifier/PageHeap: passed
        • Windows C coverage: all files met the 90% threshold
        • Windows Go coverage: 92.0%
        • Windows Rust line coverage: 90.46%
        • native standalone Windows interop/service/cache matrices: passed
        • MSYS functional slice: passed, including test_win_shm repeated 10/10
      • exact failure from that full run:
        • only the MSYS bounded compare policy failed
        • failing policy artifact:
          • /tmp/plugin-ipc-full-windows-20260415-162223/msys-validation/bench-compare/policy.csv
        • failing row:
          • np-max,np-ping-pong,c,c,0,both,70.0,44.9,fail
        • this is the issue fixed by commit bb996c6
    • MSYS validation evidence after the paired policy-retry fix:
      • checkout:
        • win11:~/src/plugin-ipc.git
        • commit bb996c638f6c73cb9f3a8b0aac55d09819548979
      • command:
        • bash tests/run-windows-msys-validation.sh /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830 3
      • artifact directory:
        • /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/
      • result:
        • exited 0
        • summary: /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/summary.txt
        • policy: /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/bench-compare/policy.csv
        • joined comparison: /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/bench-compare/joined.csv
        • every configured MSYS-vs-native policy row passed
    • Strict native Windows benchmark evidence after the paired policy-retry fix:
      • checkout:
        • win11:~/src/plugin-ipc.git
        • commit bb996c638f6c73cb9f3a8b0aac55d09819548979
      • command:
        • bash tests/run-windows-bench.sh /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.csv 5
        • bash tests/generate-benchmarks-windows.sh /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.csv /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.md
      • artifact directory:
        • /tmp/plugin-ipc-windows-native-bench-20260415-171700/
      • result:
        • exited 0
        • total measurements: 201
        • CSV line count including header: 202
        • generator result: All performance floors met
        • summary: /tmp/plugin-ipc-windows-native-bench-20260415-171700/summary.txt
        • report: /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.md
    • Current validation conclusion:
      • the concrete failures found during the full validation loop were fixed:
        • C benchmark server lifecycle/timer shutdown fixed in commit 074a3a5
        • MSYS bounded compare policy false failure fixed in commit bb996c6
      • no tracked source divergence between the Linux and Windows checkouts is being relied on for sync; the checkouts are synchronized through plain git push/pull
      • next operational step:
        • commit this evidence update locally
        • push
        • pull on win11:~/src/plugin-ipc.git

Active Task: Client-Side SHM Attach Fallback

  • Goal:
    • implement the refined transport rule in plugin-ipc
    • if handshake selects SHM and the client cannot attach SHM, the client must close that session, exclude SHM from future proposals for that client context, and reconnect on baseline
    • no server-side same-session fallback is allowed
  • Scope:
    • update C, Rust, and Go L2 client reconnect logic
    • keep L3 behavior inherited from L2
    • update the handshake/spec docs to describe this exception precisely
    • add tests for client-side SHM attach failure fallback
  • Implementation status:
    • done locally in C, Rust, and Go
    • behavior now is:
      • handshake may negotiate SHM
      • if client-side SHM attach fails, that session is closed
      • the client removes SHM from future proposals in that client context
      • the client reconnects through a new handshake and falls back to baseline
      • no same-session fallback is used
  • Linux evidence:
    • go test ./pkg/netipc/service/raw -run 'TestUnixShmAttachFailureFallsBackToBaseline|TestUnixShmPrepareFailureFallsBackToBaseline' -count=1 -v
      • passed
    • cargo test --manifest-path src/crates/netipc/Cargo.toml test_refresh_shm_attach_failure_falls_back_to_baseline -- --nocapture
      • passed
    • cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_falls_back_to_baseline_when_linux_shm_prepare_fails -- --nocapture
      • passed
    • cmake --build build --target test_service -j4
      • passed
    • ./build/bin/test_service
      • passed, including:
        • Test: Client-side SHM attach failure falls back to baseline
  • Windows evidence on win11:~/src/plugin-ipc.git after pulling commit 2bca7bb:
    • cd src/go && go test ./pkg/netipc/service/raw -count=1 -run 'TestWinShmAttachFailureFallsBackToBaseline|TestWinShmPrepareFailureFallsBackToBaseline' -v
      • passed
    • cargo test --manifest-path src/crates/netipc/Cargo.toml test_refresh_winshm_attach_failure_falls_back_to_baseline -- --nocapture
      • passed
    • bash tests/run-coverage-c-windows.sh
      • passed
      • included:
        • test_win_service_guards.exe
        • test_win_service_guards_extra.exe
        • test_win_service_extra.exe
        • targeted Windows C interop/service matrix
      • coverage summary:
        • netipc_service_win.c: 90.6%
        • netipc_named_pipe.c: 92.4%
        • netipc_win_shm.c: 94.2%
        • total: 91.9%
      • included the new guard:
        • Hybrid attach failure falls back to baseline
    • bash tests/run-coverage-rust-windows.sh
      • passed
      • total line coverage: 90.54%
      • critical-file line coverage:
        • service/cgroups.rs: 92.37%
        • transport/windows.rs: 91.65%
        • transport/win_shm.rs: 94.11%
    • bash tests/run-coverage-go-windows.sh
      • passed
      • total coverage: 92.1%
    • NETIPC_BUILD_DIR="$HOME/src/plugin-ipc.git/build-windows-coverage-c" bash tests/run-verifier-windows.sh
      • passed
      • no Application Verifier or PageHeap findings for:
        • test_named_pipe.exe
        • test_win_shm.exe
        • test_win_service.exe
        • test_win_service_extra.exe
  • Sync status:
    • local /home/costa/src/plugin-ipc.git and win11:~/src/plugin-ipc.git were synchronized to 2bca7bb for the validation run