Fit-for-purpose goal: integrate `plugin-ipc` into `~/src/netdata/netdata/` so Netdata can immediately replace the current Linux `cgroups.plugin` -> `ebpf.plugin` custom metadata transport with typed IPC that is reliable, maintainable, testable, and ready for guarded production rollout.
- Decision recorded on 2026-04-18:
  - PR scope for Thiago's formatter-only `netdata-otel` change:
    - keep the `netdata-otel` formatting/cosmetic change in the Netdata integration PR
    - rationale:
      - it is formatter-driven cosmetic churn
      - it is harmless
      - there is no requirement to split it out at this stage
- Decision recorded on 2026-04-18:
  - Upstream sync policy for vendored library changes:
    - every change that touches vendored library code in the Netdata integration PR must be copied back to the upstream `plugin-ipc` repository
    - rationale:
      - the Netdata vendoring script will overwrite vendored library trees on the next sync
      - leaving library-only fixes in Netdata would make them disappear on the next revendor
    - concrete scope identified from `33ecdf4de..d97e8fa1c`:
      - `src/crates/netipc/src/protocol/cgroups.rs`
      - `src/crates/netipc/src/transport/shm.rs`
      - `src/crates/netipc/src/transport/shm_tests.rs`
    - implication:
      - Netdata-only integration files stay in the Netdata PR
      - vendored `netipc` Rust fixes must be ported upstream before the next vendoring run
- Decision recorded on 2026-04-14:
  - Limit negotiation contract:
    - the general contract is:
      - client proposes
      - server decides
      - the final negotiated values are returned by the server in the handshake response
    - each negotiated field must be specified independently; there is no single universal formula such as always `min()` or always the server value
    - examples explicitly clarified by the user:
      - request size limit:
        - client proposes
        - server decides whether to echo it unchanged or alter it
        - this must be defined explicitly in the spec, field by field
      - response size limit:
        - client may propose
        - server returns its own value because only the server knows what it may need to send
      - packet chunking / packet size:
        - server decides using `min(client, server)` so the session can actually communicate
    - the specification must be updated to state this explicitly and unambiguously for every negotiated field
    - all implementations must be reviewed and aligned to this rule field by field (a server-side sketch follows this decision list)
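A minimal server-side sketch of the field-by-field contract above. This is not the library's code: the struct shape, helper name, and the `NIPC_REQUEST_PAYLOAD_HARD_CAP` constant are assumptions for illustration; only the per-field rules come from the recorded decisions.

```c
/* Hypothetical sketch of the per-field decision rules; each field gets its
 * own explicit rule instead of one blanket formula. */
#include <stdbool.h>
#include <stdint.h>

#define NIPC_REQUEST_PAYLOAD_HARD_CAP (1024u * 1024u) /* 1 MiB cap per the decision log */

typedef struct {
    uint32_t max_request_payload_bytes;
    uint32_t max_request_batch_items;
    uint32_t max_response_payload_bytes;
    uint32_t max_response_batch_items;
    uint32_t packet_size;
} hello_limits_t;

static bool negotiate_limits(const hello_limits_t *client,
                             const hello_limits_t *server,
                             hello_limits_t *agreed) {
    /* request payload: echo the client proposal, reject above the hard cap */
    if (client->max_request_payload_bytes > NIPC_REQUEST_PAYLOAD_HARD_CAP)
        return false; /* reject the handshake; do not silently clamp down */
    agreed->max_request_payload_bytes = client->max_request_payload_bytes;

    /* request batch items: no protocol-level constraint -> echo unchanged */
    agreed->max_request_batch_items = client->max_request_batch_items;

    /* response payload: server-owned; only the server knows what it sends */
    agreed->max_response_payload_bytes = server->max_response_payload_bytes;

    /* response batch items: symmetric with request batch items by contract */
    agreed->max_response_batch_items = agreed->max_request_batch_items;

    /* packet size: min(client, server) so both sides can chunk traffic */
    agreed->packet_size = client->packet_size < server->packet_size
                              ? client->packet_size : server->packet_size;
    return true;
}
```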
- Decision recorded on 2026-04-14:
- Transport-profile lock after handshake:
- the selected transport/profile is negotiated during handshake and locked for the lifetime of that session
- no fallback is allowed after transport negotiation has completed
- if SHM is negotiated, SHM must be usable for that session
- any post-handshake SHM fallback to baseline transport is considered a contract violation and must not be adopted as the upstream fix
- Decision recorded on 2026-04-14:
  - Request-direction negotiation policy:
    - `max_request_payload_bytes`:
      - client proposes the whole-request payload ceiling
      - server echoes it back unchanged when it is acceptable
      - hard-cap the field at 1 MiB
      - if the client proposes more than 1 MiB, reject the handshake
      - do not silently clamp down
    - `max_request_batch_items`:
      - client proposes the intended batch size
      - if there is no concrete protocol-level constraint, the server echoes it back unchanged
      - do not invent hypothetical lowering logic without evidence
- Decision recorded on 2026-04-14:
- SHM readiness and profile lock:
- the negotiated profile is locked for the lifetime of the session
- if SHM is selected, SHM must already be guaranteed usable for that session when the handshake succeeds
- no post-handshake fallback is allowed
- this requires moving SHM readiness earlier than the current implementation does today
- Decision recorded on 2026-04-14:
- Typed L2 request sizing policy:
    - the client should proactively propose `max_request_payload_bytes`
      - it should not rely primarily on overflow/reconnect learning
    - for typed L2 methods, the library should calculate the initial request payload proposal from:
      - the method schema
      - the configured/requested batch size
      - explicit sizing assumptions for dynamic fields
    - current approved sizing assumption for strings:
      - assume strings up to 1024 bytes when deriving request payload ceilings unless a method-specific rule says otherwise
    - objective:
      - initial negotiation should already be close to the real need
      - reconnects due to request overflow should be rare safety-net events, not the normal sizing mechanism
    - implication (see the sizing sketch after this list):
      - public typed L2 methods need method-specific sizing rules
      - opaque raw/internal L2 paths may still need reactive overflow recovery as a fallback because they cannot infer the request schema automatically
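A hypothetical sizing sketch for a typed L2 method, following the policy above: proposal = envelope + batch items x worst-case item size, with dynamic strings assumed at most 1024 bytes. The schema struct and function name are illustrative assumptions, not library API.

```c
#include <stdint.h>

#define ASSUMED_MAX_STRING_BYTES 1024u /* approved default string assumption */

typedef struct {
    uint32_t fixed_item_bytes; /* fixed per-item size from the method schema */
    uint32_t string_fields;    /* number of dynamic string fields per item */
} method_schema_t;

/* Derive the initial max_request_payload_bytes proposal so negotiation is
 * already close to the real need and overflow-reconnect stays a safety net. */
static uint32_t propose_request_payload_bytes(const method_schema_t *schema,
                                              uint32_t batch_items,
                                              uint32_t envelope_bytes) {
    uint32_t per_item = schema->fixed_item_bytes
                      + schema->string_fields * ASSUMED_MAX_STRING_BYTES;
    return envelope_bytes + batch_items * per_item;
}
```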
- Decision recorded on 2026-04-14:
  - `max_request_payload_bytes` policy:
    - hard-cap the negotiated request payload ceiling at 1 MiB
    - if the client proposes anything above 1 MiB, reject the handshake
    - do not silently clamp down and continue
    - below 1 MiB, the preferred behavior is to echo the client proposal back unchanged
- Decision recorded on 2026-04-14:
  - `max_request_batch_items` policy:
    - if there is no concrete protocol-level constraint, echo the client proposal back unchanged
- do not invent hypothetical lowering logic without evidence
- Decision recorded on 2026-04-14:
  - `max_response_batch_items` protocol field:
    - keep it in the protocol handshake payloads
    - define it as symmetric with request batch items
    - the server must return the same effective batch-item ceiling for requests and responses
    - rationale:
      - current protocol/method behavior is symmetric by position for batch responses
      - current implementations mirror the request `item_count` into the batch response `item_count`
    - implication:
      - this is now a semantic contract clarification, not a handshake wire-layout removal
      - the specs and all implementations must still be aligned so this field is never independently negotiated
- Decision recorded on 2026-04-14:
- Handshake specification deliverable requirements:
- before implementation, the docs/specs must contain the full handshake description as an overall process/strategy
- the handshake docs/specs must include per-field analysis:
- what the client does
- what the client sends
- what the server does
- what the server sends back
- the docs/specs must include Mermaid sequence diagrams for the handshake process
- Decision recorded on 2026-04-14:
- Handshake correctness and guarantee requirements:
- the negotiated profile must be guaranteed to work after handshake
- this guarantee must be explicit in the docs/specs and enforced by implementation/tests
- Decision recorded on 2026-04-14:
- Handshake test requirements:
- the handshake process must be fully tested field by field
- tests must ensure all implementations comply with the documented handshake semantics 100%
- all auth failures must be tested individually
- reconnection due to payload overflow must be fully tested
- Decision recorded on 2026-04-14:
- L2 public API requirement:
    - L2 users must not provide `max_request_payload_bytes`
    - request payload sizing is internal library logic, not a user-facing L2 knob
- Decision recorded on 2026-04-14:
  - Handshake wire evolution for `max_response_batch_items`:
    - keep the field on the wire
    - do not introduce a new handshake layout version for this point alone
    - document and enforce that it is symmetric with `max_request_batch_items`
- Decision recorded on 2026-04-15:
  - Cross-machine workflow and completion bar:
    - the only valid workflow is:
      - commit and push in `/home/costa/src/plugin-ipc.git`
      - pull on `win11:~/src/plugin-ipc.git`
      - if fixes are needed after Windows validation:
        - fix locally in `/home/costa/src/plugin-ipc.git`
        - commit and push locally
        - pull again on `win11:~/src/plugin-ipc.git`
    - do not leave uncommitted divergence as the way to sync Linux and Windows
    - the task is not complete until:
      - the entire relevant Linux test suite is green
      - the entire relevant native Windows (`win11`) test suite is green
      - the repository state is judged correct enough to proceed to the Netdata integration PR follow-up
- Decision recorded on 2026-04-15:
  - Windows sync cleanup before validation:
    - `win11:~/src/plugin-ipc.git/benchmarks-windows.csv` may be discarded if locally dirty
    - rationale:
      - it is a generated artifact
      - `win11` must remain a clean validation checkout
      - the authoritative workflow is commit/push here, pull there
- Decision recorded on 2026-04-15:
  - Baseline validation pass:
    - run all practical test suites and benchmark suites on both Linux and native Windows in parallel
    - objective:
      - confirm that the handshake rewrite and related fixes did not regress correctness
      - confirm that the benchmark baselines still hold after the handshake and SHM-readiness changes
    - Linux validation matrix:
      - `cmake --build build -j4`
      - `ctest --test-dir build --output-on-failure -j4`
      - `cd src/crates/netipc && cargo test`
      - `cd src/go && go test ./...`
      - `bash tests/run-go-race.sh`
      - `bash tests/run-extended-fuzz.sh`
      - `bash tests/run-posix-bench.sh`
    - Windows validation matrix on `win11:~/src/plugin-ipc.git`:
      - `cmake --build build -j4`
      - `ctest --test-dir build --output-on-failure -j4`
      - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
      - `cd src/go && go test ./...`
      - `bash tests/run-windows-msys-validation.sh`
      - `bash tests/run-windows-bench.sh`
    - precondition verified before launch:
      - local `/home/costa/src/plugin-ipc.git` and `win11:~/src/plugin-ipc.git` are both on commit `50c4a2d21d3009c53520d1b7fc4fac78ce77e876`
      - `50c4a2d` is a TODO-only validation-matrix commit on top of code commit `313f7ed`
      - no tracked local modifications are present on either host
- Decision recorded on 2026-04-15:
  - Expanded validation scope for the baseline pass:
    - "all possible tests" must include standalone validation entrypoints that are not covered by the basic `ctest` / `cargo test` / `go test` lanes
    - additional Linux validation entrypoints to run:
      - `bash tests/run-coverage-c.sh`
      - `bash tests/run-coverage-go.sh`
      - `bash tests/run-coverage-rust.sh`
      - `bash tests/run-sanitizer-asan.sh`
      - `bash tests/run-sanitizer-tsan.sh`
      - `bash tests/run-valgrind.sh`
      - `bash tests/interop_codec.sh`
      - `bash tests/test_uds_interop.sh`
      - `bash tests/test_shm_interop.sh`
      - `bash tests/test_service_interop.sh`
      - `bash tests/test_service_shm_interop.sh`
      - `bash tests/test_cache_interop.sh`
      - `bash tests/test_cache_shm_interop.sh`
    - additional native Windows validation entrypoints on `win11:~/src/plugin-ipc.git`, to run after the strict native benchmark finishes:
      - `bash tests/run-verifier-windows.sh`
      - `bash tests/run-coverage-c-windows.sh`
      - `bash tests/run-coverage-go-windows.sh`
      - `bash tests/run-coverage-rust-windows.sh`
      - `bash tests/test_named_pipe_interop.sh`
      - `bash tests/test_win_shm_interop.sh`
      - `bash tests/test_service_win_interop.sh`
      - `bash tests/test_service_win_shm_interop.sh`
      - `bash tests/test_cache_win_interop.sh`
      - `bash tests/test_cache_win_shm_interop.sh`
    - benchmark scope already covered by:
      - `bash tests/run-posix-bench.sh`
      - `bash tests/run-windows-msys-validation.sh`
      - `bash tests/run-windows-bench.sh`
- "all possible tests" must include standalone validation entrypoints that are not covered by the basic
- Expanded validation scope for the baseline pass:
- Finding recorded on 2026-04-15 during full Windows validation:
  - Windows C coverage still meets the configured threshold, but gcov showed the new successful-dispatch oversized-response branch in `src/libnetdata/netipc/src/service/netipc_service_win.c` was uncovered (a branch sketch follows this finding):
    - branch location: `server_handle_session()`, `case NIPC_OK`
    - lines observed uncovered in the run:
      - `response_len > session->max_response_payload_bytes`
      - `server_note_response_capacity(...)`
      - `resp_hdr.transport_status = NIPC_STATUS_LIMIT_EXCEEDED`
      - `response_len = 0`
  - This is not acceptable for the spec-compliance bar because Linux has a targeted payload-limit test for this path and Windows must prove the same server behavior.
  - Implementation plan:
    - add a small dedicated Windows coverage-only C target for service payload limits
    - keep it separate from the already oversized `test_win_service_guards.c`
    - wire it into `tests/run-coverage-c-windows.sh`
    - rerun Windows C coverage and then rerun the required full validation loop after the fix is committed and pulled on `win11:~/src/plugin-ipc.git`
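A sketch of the oversized-response branch named in the finding above, reconstructed from the uncovered lines. The type and constant definitions are stand-ins; only the branch logic mirrors what the finding describes.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the session/server types in
 * netipc_service_win.c; only the fields used by the branch are shown. */
typedef struct { uint32_t max_response_payload_bytes; } session_t;
typedef struct { uint32_t learned_response_capacity; } server_t;
typedef struct { uint32_t transport_status; } resp_header_t;

#define NIPC_STATUS_OK             0u
#define NIPC_STATUS_LIMIT_EXCEEDED 3u /* illustrative value */

static void server_note_response_capacity(server_t *srv, uint32_t needed) {
    if (needed > srv->learned_response_capacity)
        srv->learned_response_capacity = needed;
}

/* After a successful dispatch, convert an oversized response into an
 * explicit zero-payload LIMIT_EXCEEDED response before transport send. */
static uint32_t clamp_oversized_response(server_t *srv, const session_t *s,
                                         resp_header_t *hdr,
                                         uint32_t response_len) {
    if (response_len > s->max_response_payload_bytes) {
        server_note_response_capacity(srv, response_len);
        hdr->transport_status = NIPC_STATUS_LIMIT_EXCEEDED;
        return 0; /* send the header only */
    }
    hdr->transport_status = NIPC_STATUS_OK;
    return response_len;
}
```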
- Finding recorded on 2026-04-15 during native Windows/MSYS validation:
  - The full Windows run reached `tests/run-windows-msys-validation.sh` and failed at `test_service_win_shm_interop`:
    - pair: C server, C client
    - observed client output: `client: not ready`
  - Immediate targeted reproduction on the same `win11:~/src/plugin-ipc.git` checkout passed `test_service_win_shm_interop` 10/10 times, so the failure is intermittent.
  - Concrete harness issue found:
    - `tests/test_service_win_interop.sh` gives the server up to `TIMEOUT=10` seconds to print `READY`
    - the C/Rust/Go Windows service clients each only wait 200 x 10 ms = 2 seconds for client readiness
    - under heavy validation load, one early client attempt can return `client: not ready` even though the same pair passes immediately on repeat
  - Implementation plan:
    - harden the Windows service and cache interop shell harnesses so `client: not ready` is retried until the existing `TIMEOUT` budget expires
    - keep persistent failures visible after the timeout
    - do not hide real call/decoding failures; only retry the pre-call readiness race
- Finding recorded on 2026-04-15 during full Windows validation after commit `cf5cf8dfdaf223460763bf8287ce55394ab912f0`:
  - the full run reached the MSYS bounded benchmark comparison and exposed a harness-policy contradiction in the native reference row:
    - failing row: `snapshot-baseline c->c @ max`
    - samples path reported by the runner: `/tmp/netipc-bench-143223/samples-snapshot-baseline-c-c-0.csv`
    - observed raw spread:
      - `raw_min=5730.000`
      - `raw_max=22687.000`
      - `raw_ratio=3.959337`
      - configured max ratio: `2.00`
    - observed trimmed stable core:
      - `stable_min=21620.000`
      - `stable_max=22637.000`
      - `stable_ratio=1.047040`
  - the same (by then stale) full run later exposed another compare-lane noisy row:
    - failing row: `snapshot-shm c->c @ max`
    - samples path reported by the runner: `/tmp/netipc-bench-143696/samples-snapshot-shm-c-c-0.csv`
    - observed stable spread:
      - `stable_min=175040.000`
      - `stable_max=462863.000`
      - `stable_ratio=2.644327`
      - configured max ratio: `2.00`
  - `tests/compare-windows-bench-toolchains.sh` intentionally runs a bounded comparison with regression floors, but `tests/run-windows-bench.sh` still rejects the whole row when the raw sample set contains an outlier, even if the trimmed stable core is valid.
  - implementation plan:
    - keep the normal Windows benchmark publication path fail-closed on raw instability
    - add an explicit opt-in runner mode for comparison lanes that allows a row with raw outliers only when the trimmed stable core already passed the existing stable-sample and stable-ratio checks
    - add bounded per-row retry to the targeted runner so compare lanes do not accept rows with unstable trimmed cores; instead, noisy rows must rerun and produce a stable sample set before they are used in the policy CSV
    - enable that opt-in only from `tests/compare-windows-bench-toolchains.sh`
    - add a shell policy test for this pure stability decision so the compare lane cannot silently regress back to raw-outlier flapping
- Finding recorded on 2026-04-15 during full Windows validation after commit `d4ff75787743d28ee7f6eedd73e274d0cb608506`:
  - the full Windows run failed inside `tests/run-windows-msys-validation.sh` before the strict full native Windows benchmark could run
  - exact failed artifact: `/tmp/plugin-ipc-full-windows-20260415-153255/msys-validation/bench-compare/policy.csv`
  - concrete policy failures:
    - `np-100k`: MSYS `49.5%` of mingw64, required `70.0%`
    - `shm-max`: MSYS `54.0%` of mingw64, required `85.0%`
    - `shm-100k`: MSYS `32.7%` of mingw64, required `95.0%`
    - `snapshot-np`: MSYS `28.3%` of mingw64, required `80.0%`
  - concrete harness issue:
    - `tests/compare-windows-bench-toolchains.sh` measures all `mingw64` rows first and all `msys` rows second
    - this makes the policy ratio vulnerable to cross-phase host-load drift during the requested parallel Linux + Windows validation
    - the comparison policy is meant to compare paired measurements, not two long phases that may run under different external load
  - implementation plan:
    - keep the existing policy floors unchanged
    - change the compare harness to run each row as an adjacent pair:
      - `mingw64` measurement for the row
      - `msys` measurement for the same row
    - keep the existing per-row retry and raw-outlier opt-in policy
    - rerun Windows validation after committing and pulling on `win11:~/src/plugin-ipc.git`
- Finding recorded on 2026-04-15 during the full Linux/POSIX benchmark run after commit `1e3da7da60c1923dbd8f436ae6bef35b29066b5c`:
  - the POSIX benchmark does not fail the performance floors at this point, but the C server rows repeatedly emit:
    - `Server c (...) did not exit cleanly within 5s; forcing kill`
  - concrete cause:
    - `tests/run-posix-bench.sh` gives every server extra lifetime: `server_duration=$((duration + 5))`
    - `bench/drivers/c/bench_posix.c` starts a timer that sleeps `duration_sec + 3`
    - `bench/drivers/c/bench_posix.c` then calls `pthread_join(timer_tid, NULL)` after `nipc_server_run()` returns
    - when the harness sends SIGTERM after the client run completes, the C signal handler stops `nipc_server_run()`, but process exit is still delayed while joining the timer thread
    - for the normal 5-second benchmark rows this can leave roughly 7-8 seconds of timer sleep, while the harness only waits 5 seconds before force-killing
  - cross-platform audit:
    - `bench/drivers/c/bench_windows.c` has the same timer-wait pattern with `WaitForSingleObject(timer, INFINITE)` after the server thread exits
    - the current Windows harness usually avoids the warning because it passes the server duration without the POSIX `+5`, but the C driver still has the same shutdown-latency bug
  - implementation plan (a cancelable-timer sketch follows this finding):
    - make the POSIX C benchmark timer cancelable when the server exits before the timer fires
    - make the Windows C benchmark timer wait on a cancellation event instead of an unconditional sleep
    - keep timer-driven self-stop behavior unchanged for standalone benchmark server runs
    - keep harness thresholds and benchmark floors unchanged
    - rerun affected benchmark validation after committing and pulling on `win11:~/src/plugin-ipc.git`
  - fix applied locally:
    - `bench/drivers/c/bench_posix.c`
      - cancel and join the timer thread when `nipc_server_run()` exits before the timer fires
      - report timer-thread creation failure instead of silently continuing
    - `bench/drivers/c/bench_windows.c`
      - replace the unconditional timer sleep with a cancel event
      - signal the cancel event and join the timer thread when the server thread exits
      - report timer-thread creation failure instead of silently continuing
  - targeted local verification:
    - `cmake --build build-bench-posix --target bench_posix_c -j24` passed
    - manual `bench_posix_c` C server/client row:
      - server command: `uds-ping-pong-server ... 10`
      - client command: `uds-ping-pong-client ... 1 1000`
      - after SIGTERM the server exited within the 2-second proof window
      - server output contained `READY` and `SERVER_CPU_SEC=...`
      - no forced-kill warning was needed
    - `git diff --check` passed
    - `bash tests/test_windows_bench_stability_policy.sh` passed
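A minimal cancelable-timer sketch for the POSIX side of the fix above. Names other than the pthread APIs are illustrative, not the `bench_posix.c` code: the timer waits on a condition variable with a deadline instead of sleeping unconditionally, so server exit can cancel it immediately.

```c
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            canceled;
    int             wait_sec;
} bench_timer_t;

static void *timer_main(void *arg) {
    bench_timer_t *t = arg;
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += t->wait_sec;

    pthread_mutex_lock(&t->lock);
    while (!t->canceled) {
        /* returns ETIMEDOUT once the full duration elapses */
        if (pthread_cond_timedwait(&t->cond, &t->lock, &deadline) != 0)
            break;
    }
    pthread_mutex_unlock(&t->lock);
    /* timer-driven self-stop behavior would run here when not canceled */
    return NULL;
}

/* Called when the server exits before the timer fires. */
static void timer_cancel_and_join(bench_timer_t *t, pthread_t tid) {
    pthread_mutex_lock(&t->lock);
    t->canceled = true;
    pthread_cond_signal(&t->cond);
    pthread_mutex_unlock(&t->lock);
    pthread_join(tid, NULL); /* returns promptly instead of after the sleep */
}
```

The Windows driver follows the same idea with a manual-reset cancel event in place of the condition variable, per the fix notes above.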
- Specs/docs updated before implementation:
  - `docs/level1-wire-envelope.md`
  - `docs/level1-transport.md`
  - `docs/level1-posix-uds.md`
  - `docs/level1-windows-np.md`
  - `docs/level2-typed-api.md`
  - `docs/getting-started.md`
- Implemented in C / Go / Rust:
  - handshake negotiation aligned to the documented field-by-field contract
  - `max_request_payload_bytes` hard-capped at 1 MiB
  - proposals above 1 MiB rejected with `LIMIT_EXCEEDED`
  - `max_request_payload_bytes` echoed unchanged below the cap
  - `max_request_batch_items` echoed unchanged
  - `max_response_payload_bytes` server-owned
  - `max_response_batch_items` kept symmetric with request batch items
  - `packet_size` negotiated as `min(client, server)` and rejected if not usable
  - typed L2 public configs no longer expose `max_request_payload_bytes`
  - SHM readiness moved before handshake completion in managed/raw service paths so negotiated SHM is guaranteed for that session
- Verified test results after implementation:
  - Rust: `cd src/crates/netipc && cargo test`
    - result: 305 passed; 0 failed
  - Go POSIX/raw/cgroups/protocol: `cd src/go && go test ./pkg/netipc/protocol ./pkg/netipc/service/cgroups ./pkg/netipc/service/raw ./pkg/netipc/transport/posix`
    - result: all passed
  - Go Windows compile checks:
    - `GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows`
    - `GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw`
    - result: both compile successfully
  - C targeted transport/service tests:
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^test_uds$'`
      - result: passed
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_service|test_hardening|test_service_extra|test_ping_pong)$'`
      - result:
        - `test_uds` passed after fixing the C client mapping for `HELLO_ACK` `transport_status = LIMIT_EXCEEDED`
        - `test_service` passed
        - `test_service_extra` passed
        - `test_hardening` passed
        - `test_ping_pong` timed out
- Open obstacle discovered during verification:
  - `tests/fixtures/c/test_ping_pong.c` hangs in the empty-snapshot case
  - concrete trace:
    - running `timeout 20 stdbuf -o0 ./build/bin/test_ping_pong` stalls after:
      - `Test: empty snapshot is valid for the service kind`
      - `PASS: server started`
      - `PASS: client ready`
    - `strace -ff -o /tmp/test_ping_pong.trace timeout 10 stdbuf -o0 ./build/bin/test_ping_pong` shows:
      - the server sends a `STATUS_OK` response with `payload_len = 0` for the empty snapshot request
      - the client then disconnects and reconnects
      - the server accept loop subsequently polls invalid fds (`0`/`32765`) instead of the listening socket
  - refined local findings from the reproduction on 2026-04-15:
    - the earlier "empty snapshot sends zero payload" theory was wrong
    - debugger evidence showed the empty-snapshot typed dispatch path computes a non-zero typed payload as expected
    - the real root cause was the test fixture lifecycle:
      - `tests/fixtures/c/test_ping_pong.c` detached the server accept thread and never joined it
      - after test teardown, detached accept loops continued polling invalid or reused fds, contaminating later cases
      - that produced the observed `fd=0`/`fd=32765` evidence and the spurious header-only `UNSUPPORTED` response seen by the third test
    - evidence:
      - `tests/fixtures/c/test_ping_pong.c`
      - `src/libnetdata/netipc/src/service/netipc_service.c`
      - a `gdb` trace on `server_handle_session` showed normal non-zero response sizes for the first two tests and no empty-snapshot dispatch failure
      - `strace` artifacts:
        - `/tmp/test_ping_pong.recheck.431492`
        - `/tmp/test_ping_pong.recheck.431494`
        - `/tmp/test_ping_pong.recheck.431496`
- fix applied on 2026-04-15 (see the fixture-lifecycle sketch after this list):
  - `tests/fixtures/c/test_ping_pong.c`
    - store the accept thread
    - stop detaching it
    - join it during teardown after `nipc_server_drain()`
- verification after the fix:
  - `timeout 20 stdbuf -o0 ./build/bin/test_ping_pong`
    - result: 20 passed, 0 failed
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_service|test_hardening|test_service_extra|test_ping_pong)$'`
    - result: `100% tests passed, 0 tests failed out of 5`
- next full-suite obstacle discovered on 2026-04-15 after commit `2674826`:
  - the full Linux/native-Windows validation loop is still blocked by stale Go typed-L2 fixture code in:
    - `tests/fixtures/go/cmd/interop_service/main.go`
    - `tests/fixtures/go/cmd/interop_service_win/main.go`
    - `tests/fixtures/go/cmd/interop_cache/main.go`
    - `tests/fixtures/go/cmd/interop_cache_win/main.go`
    - `bench/drivers/go/main.go`
    - `bench/drivers/go/main_windows.go`
  - exact build failures:
    - `unknown field MaxRequestPayloadBytes in struct literal of type cgroups.ServerConfig`
    - `unknown field MaxResponseBatchItems in struct literal of type cgroups.ServerConfig`
    - `unknown field MaxRequestPayloadBytes in struct literal of type cgroups.ClientConfig`
    - `unknown field MaxResponseBatchItems in struct literal of type cgroups.ClientConfig`
  - meaning:
    - the public typed L2 API cleanup is correct
    - several typed Go service/benchmark helpers still reference the removed fields and must be aligned before the full Linux and native Windows suites can pass
- next full-suite obstacle discovered on 2026-04-15 after aligning the Go typed helpers:
  - the full Linux/native-Windows validation loop is also blocked by stale C typed-L2 service tests and interop helpers
  - concrete failing files already observed during the Linux build:
    - `tests/fixtures/c/test_multi_server.c`
    - `tests/fixtures/c/interop_service.c`
  - a broader audit shows the same stale public-field pattern in multiple typed C files:
    - `tests/fixtures/c/interop_cache.c`
    - `tests/fixtures/c/interop_cache_win.c`
    - `tests/fixtures/c/interop_service.c`
    - `tests/fixtures/c/interop_service_win.c`
    - `tests/fixtures/c/test_cache.c`
    - `tests/fixtures/c/test_chaos.c`
    - `tests/fixtures/c/test_hardening.c`
    - `tests/fixtures/c/test_multi_server.c`
    - `tests/fixtures/c/test_service.c`
    - `tests/fixtures/c/test_stress.c`
    - `tests/fixtures/c/test_win_service.c`
    - `tests/fixtures/c/test_win_service_extra.c`
    - `tests/fixtures/c/test_win_service_guards.c`
    - `tests/fixtures/c/test_win_service_guards_extra.c`
    - `tests/fixtures/c/test_win_stress.c`
  - important nuance:
    - raw transport configs in the same files still legitimately carry `max_request_payload_bytes` and `max_response_batch_items`
    - only the public typed `nipc_client_config_t` / `nipc_server_config_t` uses are stale
  - implication:
    - the cleanup must be type-aware
    - Windows-only white-box overflow tests that used the removed public request-payload field need manual rewriting so overflow-reconnect remains covered without reintroducing the public knob
- Final result of the full baseline pass:
  - code checkout under test:
    - Linux: `50c4a2d21d3009c53520d1b7fc4fac78ce77e876`
    - Windows: `win11:~/src/plugin-ipc.git` at `50c4a2d21d3009c53520d1b7fc4fac78ce77e876`
    - note: `50c4a2d` is the TODO-only validation-matrix commit on top of code commit `313f7ed`
  - Linux is not fully green:
    - the first full `ctest` failed once in `go_FuzzDecodeHello` with `context deadline exceeded`
    - the isolated fuzz rerun passed
    - the second full `ctest` passed
    - C coverage failed the per-file gate because `netipc_service.c` is `87.2%` against a `90%` threshold
  - native Windows is not fully green:
    - `go test ./...` failed in `TestWinServerDispatchSingleSnapshotZeroCapacity`
    - `run-windows-msys-validation.sh` failed 3 targeted comparison rows
    - `run-verifier-windows.sh` failed because `gflags.exe /p /enable test_named_pipe.exe /full` returned exit code 1
    - `run-coverage-c-windows.sh` failed in `test_win_service_guards.exe` with 8 failed guard assertions
  - benchmark floors:
    - the Linux POSIX benchmark generated 202 CSV lines and passed all performance floors
    - the native Windows strict benchmark generated 202 CSV lines and passed all Windows performance floors
    - the MSYS comparison benchmark did not pass because the MSYS validation script failed targeted rows
- Validation artifacts:
  - Linux:
    - `/tmp/plugin-ipc-validate-linux-20260415-045044`
    - `/tmp/plugin-ipc-validate-linux-extra-20260415-0615`
    - `/tmp/plugin-ipc-validate-linux-coverage-20260415-0616`
    - `/tmp/plugin-ipc-validate-linux-coverage-split-20260415-0616`
    - `/tmp/plugin-ipc-validate-linux-ctest-rerun-20260415-0615`
  - Windows:
    - `/tmp/plugin-ipc-validate-windows-20260415-045103`
    - `/tmp/plugin-ipc-validate-windows-extra-20260415-0622`
- Linux results:
  - `cmake --build build -j4`: passed
  - first `ctest --test-dir build --output-on-failure -j4`:
    - failed only in `go_FuzzDecodeHello`
    - exact log: `FuzzDecodeHello (30.06s)` with `context deadline exceeded`
    - evidence: `/tmp/plugin-ipc-validate-linux-20260415-045044/linux-ctest.log`
  - isolated rerun: `cd src/go/pkg/netipc/protocol && go test -run=^$ -fuzz=^FuzzDecodeHello$ -fuzztime=30s`
    - passed
    - evidence: `/tmp/plugin-ipc-validate-linux-20260415-045044/linux-fuzzdecodehello-isolated.log`
  - second full `ctest --test-dir build --output-on-failure -j4`:
    - passed
    - evidence: `/tmp/plugin-ipc-validate-linux-ctest-rerun-20260415-0615/linux-ctest-rerun.log`
  - meaning:
    - the first Linux `ctest` failure is currently classified as a flake / scheduling-sensitive failure, not a deterministic regression
  - `cargo test`: passed
  - `go test ./...`: passed
  - `bash tests/run-go-race.sh`: passed
  - `bash tests/run-extended-fuzz.sh`: passed
  - `bash tests/run-posix-bench.sh`: passed
  - `bash tests/generate-benchmarks-posix.sh`: passed
    - generator confirmed: `All performance floors met`
    - CSV rows: 202
  - `bash tests/run-sanitizer-asan.sh`: passed
  - `bash tests/run-sanitizer-tsan.sh`: passed
  - `bash tests/run-valgrind.sh`: passed
  - Linux interop shell tests:
    - `tests/interop_codec.sh`: passed
    - `tests/test_uds_interop.sh`: passed
    - `tests/test_shm_interop.sh`: passed
    - `tests/test_service_interop.sh`: passed
    - `tests/test_service_shm_interop.sh`: passed
    - `tests/test_cache_interop.sh`: passed
    - `tests/test_cache_shm_interop.sh`: passed
  - coverage:
    - `bash tests/run-coverage-go.sh`: passed
      - total coverage `94.3%`
    - `bash tests/run-coverage-rust.sh`: passed
      - total coverage `95.17%`
    - `bash tests/run-coverage-c.sh`: failed
      - a direct functional rerun of `./build-coverage/bin/test_service` passed with 209 passed, 0 failed
      - the script still fails because the per-file coverage gate is missed:
        - `netipc_service.c = 87.2%`
        - threshold = `90%`
        - overall total still `90.7%`
      - this is below the repository's documented Linux/POSIX C coverage baseline:
        - `COVERAGE-EXCLUSIONS.md` records `netipc_service.c = 92.1%`
      - evidence:
        - `/tmp/plugin-ipc-validate-linux-coverage-20260415-0616/linux-coverage-c.log`
        - `/tmp/plugin-ipc-validate-linux-coverage-20260415-0616/test_service_direct.log`
- Windows results:
  - `cmake --build build -j4`: passed
  - `ctest --test-dir build --output-on-failure -j4`: passed
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`: passed
  - `cd src/go && go test ./...`: failed
    - exact failing test: `TestWinServerDispatchSingleSnapshotZeroCapacity`
    - exact panic: `CgroupsBuilder buffer too small: need at least 48 bytes, got 0`
    - evidence: `/tmp/plugin-ipc-validate-windows-20260415-045103/win-go.log`
  - `bash tests/run-windows-msys-validation.sh`: failed
    - exact evidence: `/tmp/plugin-ipc-validate-windows-20260415-045103/win-msys-validation.log`
    - concrete failing targeted rows already captured there:
      - snapshot-baseline `c->c @ 0`
      - snapshot-shm `c->c @ 0`
      - shm-ping-pong `c->rust @ 0`
    - final script summary: `3 targeted row(s) failed`
    - refined failing-row evidence:
      - snapshot-baseline `c->c @ 0`: `stable_ratio=2.332047`, max allowed `2.00`
      - snapshot-shm `c->c @ 0`: `raw_ratio=2.411989`, max allowed `2.00`
      - shm-ping-pong `c->rust @ 0`: `raw_ratio=7.526888`, max allowed `2.00`
  - `bash tests/run-windows-bench.sh`: passed
    - evidence:
      - `/tmp/plugin-ipc-validate-windows-20260415-045103/win-bench-native.log`
      - `/tmp/plugin-ipc-validate-windows-20260415-045103/benchmarks-windows-full.csv`
      - `/tmp/plugin-ipc-validate-windows-20260415-045103/win-bench-gen-native.log`
      - `/tmp/plugin-ipc-validate-windows-20260415-045103/benchmarks-windows-full.md`
    - CSV rows: 202
    - generator confirmed: `All performance floors met`
  - `bash tests/run-verifier-windows.sh`: failed
    - evidence: `/tmp/plugin-ipc-validate-windows-extra-20260415-0622/win-verifier.log`
    - exact failure: `gflags.exe /p /enable test_named_pipe.exe /full` returned exit code 1
  - `bash tests/run-coverage-c-windows.sh`: failed
    - evidence: `/tmp/plugin-ipc-validate-windows-extra-20260415-0622/win-coverage-c.log`
    - exact summary: `190 passed, 8 failed`
    - failing assertions:
      - increment batch transparently resizes and succeeds
      - increment batch negotiated request size grows
      - string reverse transparently resizes and succeeds
      - string reverse negotiated request size grows
      - hybrid SHM request overflow transparently recovers
      - hybrid send-capacity resize keeps client READY
      - hybrid batch request overflow transparently recovers
      - hybrid batch resize keeps client READY
  - `bash tests/run-coverage-go-windows.sh`: passed
  - `bash tests/run-coverage-rust-windows.sh`: passed
  - Windows interop shell tests:
    - `tests/test_named_pipe_interop.sh`: passed
    - `tests/test_win_shm_interop.sh`: passed
    - `tests/test_service_win_interop.sh`: passed
    - `tests/test_service_win_shm_interop.sh`: passed
    - `tests/test_cache_win_interop.sh`: passed
    - `tests/test_cache_win_shm_interop.sh`: passed
- Fixes started after the full baseline pass exposed red lanes:
  - Go cgroups snapshot dispatch:
    - evidence:
      - Windows `go test ./...` panicked in `TestWinServerDispatchSingleSnapshotZeroCapacity`
      - the panic came from `protocol.NewCgroupsBuilder()` with a zero response buffer and an explicit `maxItems = 3`
    - fix:
      - expose `protocol.CgroupsBuilderMinBytes(maxItems)`
      - make `SnapshotDispatch()` return `ErrOverflow` before constructing a builder when the response buffer cannot reserve the requested directory slots
      - also reuse the helper in `DispatchCgroupsSnapshot()`
    - local verification:
      - `cd src/go && go test ./pkg/netipc/protocol ./pkg/netipc/service/raw`
      - result: passed
  - CTest Go fuzz timeout margin:
    - evidence:
      - the first full Linux `ctest` failed only in `go_FuzzDecodeHello`
      - the failure was `context deadline exceeded` at approximately the requested `30s` fuzz duration
      - the isolated rerun passed, showing the target is not deterministically crashing
    - fix:
      - keep these as short CTest smoke fuzzers, but run `-fuzztime=20s`
      - longer fuzz coverage remains owned by `tests/run-extended-fuzz.sh`
    - local verification:
      - `/usr/bin/ctest --test-dir build --output-on-failure -R '^go_FuzzDecodeHello$'`
      - `/usr/bin/ctest --test-dir build --output-on-failure -R '^go_FuzzDecodeHelloAck$'`
      - result: both passed
  - Windows verifier `gflags.exe` invocation:
    - evidence:
      - the verifier log showed `GFLAGS: Unexpected argument - 'P:/'`
      - a direct test confirmed MSYS argument conversion changed `/p` into `P:/`
    - fix:
      - call `gflags.exe` through `env MSYS2_ARG_CONV_EXCL='*'`
    - Windows verification:
      - `bash tests/run-verifier-windows.sh test_named_pipe.exe`
      - result: passed
  - C transport outbound limit enforcement:
    - evidence:
      - Windows C guard tests still failed the request-overflow reconnect cases after the first fix
      - debugger evidence on `test_win_service_guards.exe` showed:
        - the first batch request payload was 32 bytes
        - the session request limit was 8
        - the final error remained `NIPC_ERR_OVERFLOW`
        - the client request capacity stayed 8
    - source evidence:
      - C `nipc_np_send()` and `nipc_uds_send()` did not validate outgoing payload size against the negotiated directional payload limit before writing
      - the receiver returned `LIMIT_EXCEEDED`, so the client learned response capacity instead of request capacity
    - fix (a send-side limit-check sketch follows the local verification list below):
      - C POSIX UDS send now rejects over-limit outbound request/response payloads before writing
      - C Windows named-pipe send now rejects over-limit outbound request/response payloads before writing
      - the client can now learn request overflow locally and reconnect with a larger request proposal
  - Managed SHM pre-handshake request capacity:
    - evidence:
      - Windows C hybrid guard tests failed the request-overflow reconnect cases
      - SHM regions are created before the server reads `HELLO`
      - with the approved handshake contract, the server may echo any client request proposal up to 1 MiB
      - therefore a SHM request segment sized from the server's current learned request value can be smaller than the request size the handshake just accepted
    - fix:
      - POSIX and Windows managed servers now pre-create SHM request segments at `NIPC_MAX_PAYLOAD_CAP + NIPC_HEADER_LEN`
      - response segments remain server-sized because response capacity is server-owned
    - local verification:
      - `cmake --build build -j4`
      - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds|test_shm|test_service|test_service_extra|test_cache|test_ping_pong)$'`
      - result: passed
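A sketch of the send-side check added by the outbound-limit fix above. The struct shape and error constant are assumptions; only the rule itself, validating the outgoing payload against the negotiated directional limit before writing, comes from the finding.

```c
#include <stdint.h>

#define NIPC_ERR_OVERFLOW (-7) /* illustrative error code */

typedef struct {
    uint32_t max_request_payload_bytes;  /* negotiated, client -> server */
    uint32_t max_response_payload_bytes; /* negotiated, server -> client */
} session_limits_t;

/* Reject an over-limit outbound payload locally, before any bytes hit the
 * transport, so the sender learns its own directional capacity instead of
 * misreading the peer's LIMIT_EXCEEDED as the other direction's limit. */
static int check_outbound_payload(const session_limits_t *limits,
                                  uint32_t payload_len, int is_request) {
    uint32_t cap = is_request ? limits->max_request_payload_bytes
                              : limits->max_response_payload_bytes;
    if (payload_len > cap)
        return NIPC_ERR_OVERFLOW; /* caller can reconnect with a larger proposal */
    return 0;
}
```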
  - Windows verification after commit `b5f9cd6`:
    - `bash tests/run-coverage-c-windows.sh`
    - the previous `test_win_service_guards.exe` failures are fixed: 198 passed, 0 failed
    - `test_win_service_guards_extra.exe`: 93 passed, 0 failed
    - `test_win_service_extra.exe`: 167 passed, 0 failed
    - remaining failure: coverage threshold, `netipc_service_win.c` is `88.3%` against the required `90%`
    - next fix: add focused Windows service tests for real uncovered branches; do not lower the threshold
aed8e57:bash tests/run-coverage-c-windows.shtest_win_service_guards.exe:226 passed, 0 failed- coverage results:
netipc_service_win.c:90.5%netipc_named_pipe.c:92.6%netipc_win_shm.c:94.2%
- result: all Windows C files meet the
90%coverage threshold
- Linux C coverage after commit
1ce2446:- command:
bash tests/run-coverage-c.sh 90 - all C coverage test binaries passed
- remaining failure:
netipc_service.cis87.2%against required90% - uncovered branches include the POSIX typed-client request-overflow recovery paths and no-growth overflow guard
- next fix: add the POSIX equivalent of the Windows production request-overflow guard tests; do not lower the threshold
- command:
  - Linux C coverage after adding POSIX request-overflow tests:
    - `test_service_extra`: 90 passed, 0 failed
    - `bash tests/run-coverage-c.sh 90`: all C coverage test binaries passed
    - remaining failure: `netipc_service.c` improved to `88.0%`, still below the required `90%`
    - next fix: add focused POSIX typed response-overflow recovery tests for baseline and hybrid profiles; do not lower the threshold
  - POSIX response-overflow evidence after adding response-overflow tests:
    - `test_service_extra`: 109 passed, 0 failed
    - `bash tests/run-coverage-c.sh 90`: all C coverage test binaries passed, but `netipc_service.c` stayed at `88.0%`
    - important finding:
      - the tests recovered through broken-session retry after the server learned response capacity
      - the explicit `NIPC_STATUS_LIMIT_EXCEEDED` response path was still uncovered
      - this means a successful dispatch whose encoded response exceeds the negotiated response cap was reaching transport send as an oversized response instead of being converted into a zero-payload `LIMIT_EXCEEDED` response
  - fix:
    - POSIX and Windows service dispatch now convert successful-but-oversized responses into `NIPC_STATUS_LIMIT_EXCEEDED` before transport send
    - this aligns response-overflow recovery with the negotiated handshake contract instead of relying on transport breakage
  - Linux C coverage after the response-overflow service fix:
    - `bash tests/run-coverage-c.sh 90`: all C coverage test binaries passed
    - `netipc_service.c` improved to `89.0%`, still below the required `90%`
    - next fix: add a focused POSIX managed-server unsupported-method response test to cover the explicit `NIPC_STATUS_UNSUPPORTED` path
  - Linux C coverage after adding focused POSIX unsupported-method and dispatch-overflow tests:
    - command: `bash tests/run-coverage-c.sh 90`
    - all C coverage test binaries passed
    - coverage results:
      - `netipc_protocol.c`: `96.3%`
      - `netipc_uds.c`: `91.7%`
      - `netipc_shm.c`: `92.6%`
      - `netipc_service.c`: `90.7%`
      - total: `92.3%`
    - result: all POSIX C files now meet the `90%` coverage threshold
    - added proof covers:
      - the SHM unsupported-method response path after profile negotiation
      - typed dispatch overflow returning explicit `LIMIT_EXCEEDED`, learning response capacity, reconnecting, and succeeding
    - test organization:
      - new payload-limit coverage lives in `tests/fixtures/c/test_service_payload_limits.c`
      - new method-status coverage lives in `tests/fixtures/c/test_service_method_limits.c`
      - shared fixture setup lives in `tests/fixtures/c/test_service_limit_helpers.h`
  - Windows/MSYS validation on `win11:~/src/plugin-ipc.git` before the final POSIX service commit:
    - command: `bash tests/run-windows-msys-validation.sh /tmp/netipc-msys-validation-20260415-141321 3`
    - result: passed
    - evidence:
      - summary: `/tmp/netipc-msys-validation-20260415-141321/summary.txt`
      - policy: `/tmp/netipc-msys-validation-20260415-141321/bench-compare/policy.csv`
      - joined comparison: `/tmp/netipc-msys-validation-20260415-141321/bench-compare/joined.csv`
    - caveat:
      - this was run before the current local service/test changes were committed and pulled to Windows, so the affected Windows checks still need to be rerun after sync
- next full-suite obstacle discovered on 2026-04-15 during the first native Windows rebuild from commit `b4a44fa`:
  - Windows-only Rust typed-L2 helpers still reference removed public cgroups config fields
  - concrete failing files:
    - `tests/fixtures/rust/src/bin/interop_service_win.rs`
    - `tests/fixtures/rust/src/bin/interop_cache_win.rs`
    - `bench/drivers/rust/src/bench_windows.rs`
  - important nuance:
    - the stale fields are only on the typed `netipc::service::cgroups::{ClientConfig, ServerConfig}`
    - raw transport configs in `interop_named_pipe.rs`, `interop_uds.rs`, and the Windows transport client helpers remain valid and must not be changed
- next native Windows runtime obstacles discovered on 2026-04-15 during full `ctest` on commit `6651ba6`:
  - `test_named_pipe_go`
    - failure: `TestSessionSendRejectsTooSmallPacketSize` with `Connect failed: protocol or layout version mismatch`
    - source: `src/go/pkg/netipc/transport/windows/pipe_edge_test.go`
    - verified root cause:
      - the test still expects connect success followed by `Send()` rejection for a too-small negotiated packet size
      - the current Windows named-pipe handshake rejects unusable packet sizes during `HELLO_ACK` negotiation with `STATUS_INCOMPATIBLE`
      - evidence:
        - `src/go/pkg/netipc/transport/windows/pipe_edge_test.go`
        - `src/go/pkg/netipc/transport/windows/pipe.go`
  - `test_win_service_extra`
    - failure: `server SHM create fault disconnects hybrid client`
    - source: `tests/fixtures/c/test_win_service_extra.c`
    - verified root cause:
      - the test still expects a disconnect when server-side SHM creation fails after the new handshake guarantee work
      - the current Windows managed server now pre-creates SHM before the handshake and strips failing SHM profiles from the accept config, so the correct behavior is baseline-ready without SHM, not disconnect
      - evidence:
        - `tests/fixtures/c/test_win_service_extra.c`
        - `src/libnetdata/netipc/src/service/netipc_service_win.c`
  - implication:
    - the tree is not yet green on native Windows
    - these are behavioral/runtime regressions, not more stale API field references
- local follow-up on 2026-04-15:
  - both remaining failures were patched as stale test expectations, not runtime/library logic changes:
    - `src/go/pkg/netipc/transport/windows/pipe_edge_test.go`
      - `TestSessionSendRejectsTooSmallPacketSize` now expects handshake rejection with `ErrIncompatible`
    - `tests/fixtures/c/test_win_service_extra.c`
      - the first patch attempt was only partially correct:
        - the reconnect expectation was right
        - but the fault was armed too late to guarantee baseline fallback on the first session
      - verified root cause:
        - `start_server_named()` returns after setting the ready event, and the server thread can immediately enter `nipc_server_run()`
        - `nipc_server_run()` pre-creates SHM in `server_prepare_accept_config()` before any client connect
        - so arming `NIPC_WIN_SHM_TEST_FAULT_CREATE_MAPPING` after `start_server_named()` races with the already-prepared first session
      - implication:
        - the test must arm the create-mapping fault before starting the server thread if it wants deterministic baseline fallback on the first handshake
  - local Linux verification after these patches:
    - `cmake --build build -j4`
    - `/usr/bin/ctest --test-dir build --output-on-failure -j4`
    - result: `100% tests passed, 0 tests failed out of 39`
  - local Windows Go compile verification after these patches:
    - `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows`
- external reviewer round after both Linux and native Windows went green:
  - useful true findings:
    - a stale Go raw-client SHM attach comment still says the server creates SHM after the handshake
    - a stale Windows C test function name still says `disconnects` after the test now verifies baseline fallback
  - verified false positives:
    - "Go/Rust lack explicit request-payload over-cap rejection tests"
      - evidence already exists in:
        - `src/go/pkg/netipc/transport/posix/uds_test.go`
        - `src/go/pkg/netipc/transport/windows/pipe_integration_test.go`
        - `src/crates/netipc/src/transport/posix_tests.rs`
- Wire-level negotiated fields are explicit in the protocol payloads (a hedged struct sketch follows this item):
  - `supported_profiles`
  - `preferred_profiles`
  - `max_request_payload_bytes`
  - `max_request_batch_items`
  - `max_response_payload_bytes`
  - `max_response_batch_items`
  - `packet_size`
  - evidence:
    - `src/libnetdata/netipc/include/netipc/netipc_protocol.h`
    - `tests/test_protocol.c`
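A hedged C sketch of how these negotiated fields might appear in the handshake payloads. The field names come from the list above and the matrix later in this document, but the widths, ordering, and struct names here are assumptions, not the layout in `netipc_protocol.h`.

```c
#include <stdint.h>

/* Assumed shape only: real widths/ordering live in netipc_protocol.h. */
typedef struct {
    uint32_t supported_profiles;         /* bitmask proposed by the client */
    uint32_t preferred_profiles;         /* bitmask of client preferences */
    uint32_t max_request_payload_bytes;  /* client proposal */
    uint32_t max_request_batch_items;    /* client proposal */
    uint32_t max_response_payload_bytes; /* hint; final value is server-owned */
    uint32_t max_response_batch_items;   /* kept symmetric with request items */
    uint32_t packet_size;                /* proposal; final is min(client, server) */
} hello_payload_t;

typedef struct {
    uint32_t transport_status;           /* OK or a handshake rejection code */
    uint32_t server_supported_profiles;
    uint32_t intersection_profiles;
    uint32_t selected_profile;           /* locked for the session */
    uint32_t agreed_max_request_payload_bytes;
    uint32_t agreed_max_request_batch_items;
    uint32_t agreed_max_response_payload_bytes;
    uint32_t agreed_max_response_batch_items;
    uint32_t agreed_packet_size;
    uint64_t session_id;                 /* server-allocated */
} hello_ack_payload_t;
```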
- The docs are currently too coarse:
  - they describe request limits generically as sender-driven and response limits generically as server-driven
  - they do not define a per-field negotiation matrix
  - evidence:
    - `docs/level1-wire-envelope.md`
    - `docs/level1-posix-uds.md`
    - `docs/level1-windows-np.md`
    - `docs/level1-transport.md`
- The current transport implementations are aligned with one hard-coded generic policy:
  - request payload = `max(client, server)` capped at `MAX_PAYLOAD_CAP`
  - request batch items = `max(client, server)`
  - response payload = server value
  - response batch items = server value
  - packet size = `min(client, server)`
  - profile = highest bit from the preferred intersection, else highest bit from the full intersection
  - evidence:
    - C POSIX: `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - C Windows: `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
    - Go POSIX: `src/go/pkg/netipc/transport/posix/uds.go`
    - Go Windows: `src/go/pkg/netipc/transport/windows/pipe.go`
    - Rust POSIX: `src/crates/netipc/src/transport/posix.rs`
    - Rust Windows: `src/crates/netipc/src/transport/windows.rs`
- The tests also encode that same generic request-side `max()` policy today:
  - C: `tests/fixtures/c/test_uds.c`
  - Go POSIX: `src/go/pkg/netipc/transport/posix/uds_test.go`
  - Go Windows: `src/go/pkg/netipc/transport/windows/pipe_integration_test.go`
  - Rust: `src/crates/netipc/src/transport/posix_tests.rs`
- The old request-side `max()` policy is not arbitrary:
  - L2/L3 clients learn larger request capacities after overflow/reconnect
  - managed servers remember learned request/response capacities and advertise them to later sessions
  - this is why the existing transport handshake upgrades later clients to the larger server-advertised request envelope
  - evidence:
    - C: `src/libnetdata/netipc/src/service/netipc_service.c`, `src/libnetdata/netipc/src/service/netipc_service_win.c`
    - Go: `src/go/pkg/netipc/service/raw/client.go`, `src/go/pkg/netipc/service/raw/client_windows.go`
    - Rust: `src/crates/netipc/src/service/raw.rs`
- Under the user-approved contract, that is still insufficient:
  - the protocol contract must be:
    - client proposes
    - server decides
    - server returns the final negotiated values in `HELLO_ACK`
  - but each field must define its own decision rule explicitly
  - therefore the current generic request-side `max()` rule cannot remain as an undocumented blanket policy
- Obstacle 1: the current SHM guarantee is false
- current docs still say the data plane switches to SHM after the handshake completes
- current code still provisions SHM after the handshake-selected profile is already returned
- implication:
- the requested rule "negotiated profile is guaranteed to work after handshake" requires architectural change, not only docs/tests
- Obstacle 2: public typed APIs and current docs still treat `max_response_batch_items` as independently tunable
  - current `HELLO` and `HELLO_ACK` payload layouts still carry it as if it were independently negotiated
  - current docs and configs still expose it as a separate knob
  - implication:
    - the requested contract requires semantic cleanup across docs, APIs, codecs, and tests so it stays symmetric with request batch items
- Obstacle 3: public typed L2 APIs currently expose internal handshake knobs the user wants removed
  - public service client/server configs still expose `max_request_payload_bytes`
  - public configs also still expose `max_response_batch_items`
  - implication:
    - the requested contract requires public API cleanup in the C / Rust and likely Go typed surfaces, not only handshake docs
- SHM currently violates the user's transport-lock expectation at the architectural level:
  - the Level 1 handshake already returns `selected_profile = SHM`
  - but SHM create/attach still happens later in L2 service code
  - this is why Thiago's PR added post-handshake fallback in vendored POSIX C service code
  - under the user-approved contract, that fallback must not be adopted as the upstream fix
  - evidence:
    - handshake/profile selection:
      - `docs/level1-transport.md`
      - `docs/level1-posix-uds.md`
      - `docs/level1-windows-np.md`
    - late SHM setup:
      - `src/libnetdata/netipc/src/service/netipc_service.c`
      - `src/libnetdata/netipc/src/service/netipc_service_win.c`
      - `src/go/pkg/netipc/service/raw/client.go`
      - `src/go/pkg/netipc/service/raw/client_windows.go`
      - `src/crates/netipc/src/service/raw.rs`
- This section is the corrected handshake matrix draft derived from the user's decisions so far.
- Global rule:
  - the client sends `HELLO`
  - the server decides the final session values
  - the server returns those values in `HELLO_ACK`
  - every field has its own decision rule
  - on handshake failure, the server sends `HELLO_ACK` with a non-OK `transport_status` and then closes
- Important distinction:
  - some `HELLO` fields are proposal inputs only
  - the operational values for the session are the `HELLO_ACK` fields
- `auth_token` -> `transport_status`
  - client sends:
    - `auth_token`
  - server does:
    - validates exact match
  - server returns:
    - no negotiated auth value
    - only `transport_status`
  - operational meaning:
    - `OK` means authorized
    - `AUTH_FAILED` means the handshake was rejected before session establishment
- `supported_profiles` + `preferred_profiles` -> `server_supported_profiles` + `intersection_profiles` + `selected_profile`
  - client sends:
    - `supported_profiles`
    - `preferred_profiles`
  - server does:
    - computes `intersection = client_supported & server_supported`
    - if `intersection == 0`, returns `transport_status = UNSUPPORTED`
    - otherwise selects the final profile
  - current source-of-truth selection algorithm (see the selection sketch after this field):
    - highest bit of `(intersection & client_preferred & server_preferred)`
    - else highest bit of `intersection`
  - server returns:
    - `server_supported_profiles`
    - `intersection_profiles`
    - `selected_profile`
  - operational meaning:
    - the client does not continue using its own `supported_profiles`
    - both sides use `selected_profile` for the session
  - user-approved invariant:
    - once returned by the handshake, the profile is locked for the session
    - if SHM is selected, SHM must already be usable for that session
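A small C sketch of the source-of-truth selection algorithm described above, treating profiles as bitmasks. The helper names are illustrative; only the selection order comes from the matrix.

```c
#include <stdint.h>

static uint32_t highest_bit(uint32_t v) {
    uint32_t bit = 0;
    while (v >>= 1) bit++;
    return 1u << bit;
}

/* Returns 0 when there is no usable intersection, which maps to the
 * transport_status = UNSUPPORTED handshake rejection above. */
static uint32_t select_profile(uint32_t client_supported, uint32_t client_preferred,
                               uint32_t server_supported, uint32_t server_preferred) {
    uint32_t intersection = client_supported & server_supported;
    if (intersection == 0)
        return 0;
    uint32_t preferred = intersection & client_preferred & server_preferred;
    return preferred ? highest_bit(preferred) : highest_bit(intersection);
}
```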
- `max_request_payload_bytes` -> `agreed_max_request_payload_bytes`
  - client sends:
    - the proposed request payload ceiling
  - user direction so far:
    - typed L2 should proactively compute and propose this from the method schema, desired batch size, and dynamic-field assumptions
    - the server should not increase it
    - preferred behavior is to echo it back unchanged
  - concrete current protocol constraint:
    - the source of truth currently enforces a hard payload cap of `NIPC_MAX_PAYLOAD_CAP = 256 MB`
    - evidence:
      - `src/libnetdata/netipc/include/netipc/netipc_protocol.h`
      - `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
      - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
  - server returns:
    - `agreed_max_request_payload_bytes`
  - operational meaning:
    - both sides use `agreed_max_request_payload_bytes` for the session
  - user decision now recorded:
    - replace the current `256 MB` hard cap with `1 MiB`
    - if the client proposes a larger value, reject the handshake
    - do not silently cap down
- `max_request_batch_items` -> `agreed_max_request_batch_items`
  - client sends:
    - the proposed request batch-item ceiling
  - user direction so far:
    - the client proposes the intended batch size
    - preferred behavior is to echo it back unchanged
    - if there is no concrete server-side constraint, echo it back unchanged
  - concrete current evidence:
    - none found yet for a protocol-level hard maximum analogous to `NIPC_MAX_PAYLOAD_CAP`
    - the current source of truth raises this value with `max(client, server)`, but that is existing behavior, not evidence of necessity
  - server returns:
    - `agreed_max_request_batch_items`
  - operational meaning:
    - both sides use `agreed_max_request_batch_items` for the session
  - user decision now recorded:
    - if there is no concrete protocol-level constraint, echo the client proposal back unchanged
- `max_response_payload_bytes` -> `agreed_max_response_payload_bytes`
  - client sends:
    - an optional hint/current expectation
  - user-approved direction:
    - the server ignores the client for the final value
    - the server returns the value it will actually use
  - server returns:
    - `agreed_max_response_payload_bytes`
  - operational meaning:
    - both sides use `agreed_max_response_payload_bytes` for the session
- `max_response_batch_items` -> `agreed_max_response_batch_items`
  - concrete current evidence:
    - current typed batch semantics are symmetric by position
    - docs:
      - `docs/level1-transport.md` says batch request/response items are correlated by array position
      - `docs/level2-typed-api.md` says the managed server assembles one batch response preserving request order
    - implementations:
      - C sets `resp_hdr.item_count = hdr.item_count` for batch responses
      - Go sets `resp_hdr.ItemCount = hdr.ItemCount` for batch responses
      - Rust sets `resp_hdr.item_count = hdr.item_count` for batch responses
  - user decision now recorded:
    - keep the field on the wire
    - require strict symmetry with request batch items
    - the server must return the same effective batch-item ceiling for requests and responses for the session
  - impact:
    - no handshake layout removal for this field
    - docs, APIs, codecs, and tests still need coordinated semantic cleanup so the field is never treated as independently negotiated
- `packet_size` -> `agreed_packet_size`
  - client sends:
    - the proposed transport packet size
  - user-approved direction:
    - the server decides with `min(client, server)`
  - server returns:
    - `agreed_packet_size`
  - operational meaning:
    - both sides use `agreed_packet_size` for chunking in the session
- `session_id`
  - client sends:
    - nothing
  - server does:
    - allocates a per-session identifier
  - server returns:
    - `session_id`
  - operational meaning:
    - identifies this session
    - used in per-session SHM naming/derivation
- Analyze how `plugin-ipc` should be integrated into the Netdata repo and build.
- Before any Netdata integration, implement transparent SHM resizing in `plugin-ipc` itself.
- Validate that feature thoroughly first, including full C/Rust/Go interop matrices on Unix and Windows.
- Use it first to replace the current `cgroups.plugin` -> `ebpf.plugin` metadata channel on Linux.
- Make the library available to C, Rust, and Go code inside Netdata.
- Record integration design decisions before implementation.
- User-approved local workspace cleanup in this slice:
  - remove the generated Go test / helper binaries after the push
  - affected files:
    - `src/go/cgroups.test.exe`
    - `src/go/main`
    - `src/go/raw.test.exe`
    - `src/go/windows.test.exe`
- User-directed benchmark follow-up now in scope:
- treat the Linux
shm-batch-ping-pongC/Rust spread as two independent problems:- Rust server penalty versus C server with the same C client
- Rust client penalty versus C client with the same C server
- worst-case `rust -> rust` is the compounded result of both penalties
- objective:
  - identify the exact Rust-side hot paths responsible for the server-side and client-side losses
- fix Rust until the Linux C/Rust SHM batch path is materially closer to the C baseline
- scope expansion approved by the user:
- do the same benchmark-delta investigation across all material language/client/server combinations
- identify every real implementation issue behind the benchmark gaps
- fix the implementation issues, not just explain them
- keep benchmark artifacts and benchmark-derived docs in sync after each validated fix
- first verified benchmark-delta findings:
- POSIX `shm-batch-ping-pong` with `client ∈ {c,rust}` and `server ∈ {c,rust}` still has a real Rust penalty on both sides:
  - `c -> c = 64,148,960`
  - `c -> rust = 58,334,803`
  - `rust -> c = 52,277,542`
  - `rust -> rust = 48,220,338`
- implication:
  - Rust server penalty is real
- Rust client penalty is larger
- `rust -> rust` is the compounded case
- benchmark-driver distortion is also real and must be fixed before deeper transport conclusions:
- Go `lookup` benchmark does a synthetic linear scan instead of using the actual O(1) cache structure:
  - `bench/drivers/go/main.go`
- Rust `lookup` benchmark also does a synthetic linear scan:
  - `bench/drivers/rust/src/main.rs`
- Rust actual cache lookup currently allocates `name.to_string()` on every lookup:
  - `src/crates/netipc/src/service/raw.rs`
- Go and Rust batch / pipeline clients still do avoidable hot-loop allocations that C avoids or minimizes:
- Go: `bench/drivers/go/main.go`
- Rust: `bench/drivers/rust/src/main.rs`
- Current execution scope:
- remove the multi-method service drift from docs, code, tests, and public APIs
- align the implementation to one-service-kind-per-endpoint
- implement the accepted SHM resize / renegotiation behavior
- eliminate contradictory wording and examples across the repository
- refresh the Linux and Windows benchmark matrices on the current tree
- update benchmark artifacts and all benchmark-derived docs so everything is in sync
- investigate the remaining benchmark spreads and identify whether they reflect real transport/runtime inefficiency, measurement distortion, or pair-specific implementation overhead
- correct the benchmark build path so C benchmark results are generated from optimized C libraries, not from a local Debug CMake tree
- Current implementation status:
- docs/specs/TODOs now explicitly state service-oriented discovery and one request kind per endpoint
- Go public cgroups APIs and Go raw service/tests were rewritten to the single-kind model
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw` now passes after aligning the raw client/server with learned SHM req/resp capacities and transparent overflow-driven reconnect/retry
  - `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups` now passes
- Rust public cgroups facade now uses the single-kind raw server constructor instead of the old multi-handler bundle
- targeted Rust verification now passes:
cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::cgroups:: -- --test-threads=1
- Rust raw Unix tests no longer use the old mixed `pingpong_handlers()` helper
- the Rust raw service subset now passes after binding increment-only and string-reverse-only endpoints explicitly and teaching the raw client/server the learned SHM req/resp resize path:
cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1
- Go raw L2 now tracks learned request/response capacities, treats `STATUS_LIMIT_EXCEEDED` as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
- Rust raw L2 now tracks learned request/response capacities, treats `STATUS_LIMIT_EXCEEDED` as an overflow signal, reconnects, renegotiates larger capacities, and retries transparently for overflow-safe calls
- Go and Rust transport listeners now expose payload-limit setters so the server can advertise learned capacities to later clients before `accept()` (see the sketch after this list):
  - Go POSIX: `src/go/pkg/netipc/transport/posix/uds.go`
  - Go Windows: `src/go/pkg/netipc/transport/windows/pipe.go`
  - Rust POSIX: `src/crates/netipc/src/transport/posix.rs`
  - Rust Windows: `src/crates/netipc/src/transport/windows.rs`
- `src/crates/netipc/src/service/raw.rs` no longer exposes the generic `Handlers` bundle or the transitional `new_single_kind` / `with_workers_single_kind` constructors
- `src/crates/netipc/src/service/raw.rs` now models managed servers as single-kind endpoints directly:
  - `ManagedServer::new(..., expected_method_code, handler)`
  - `ManagedServer::with_workers(..., expected_method_code, handler, worker_count)`
- Rust POSIX and Windows benchmark drivers now use the single-kind raw service surface instead of the deleted multi-handler `Handlers` bundle:
  - `bench/drivers/rust/src/main.rs`
  - `bench/drivers/rust/src/bench_windows.rs`
- `src/crates/netipc/src/service/raw_unix_tests.rs` and `src/crates/netipc/src/service/raw_windows_tests.rs` now use that single-kind raw service surface directly instead of feeding a generic handler bundle into the raw server
- verified source-level residue scan for `src/crates/netipc/src/service/raw_windows_tests.rs` is now clean:
  - no remaining `Handlers`
  - no remaining `test_cgroups_handlers()`
  - no remaining `increment_handlers()`
- verified source-level residue scan for `src/crates/netipc/src/service/raw.rs` and `src/crates/netipc/src/service/raw_unix_tests.rs` is now clean:
  - no remaining `Handlers`
  - no remaining `new_single_kind`
  - no remaining `with_workers_single_kind`
- C public naming drift was reduced from plural handler bundles to singular service-handler naming
- `tests/fixtures/c/test_win_service.c` is now snapshot-only; it no longer starts a typed snapshot service and then exercises increment / string-reverse / batch calls against it
- source-level cleanup of the remaining Windows C fixtures is only partial so far:
  - the obvious typed snapshot `.on_increment` / `.on_string_reverse` bundle drift was removed from:
    - `tests/fixtures/c/test_win_service_extra.c`
    - `tests/fixtures/c/test_win_stress.c`
    - `tests/fixtures/c/test_win_service_guards.c`
    - `tests/fixtures/c/test_win_service_guards_extra.c`
  - but real `win11` compilation later proved these files still contain stale calls to removed C APIs and stale raw-server assumptions
- verified source-level residue scan across the touched Windows C fixtures is therefore not enough on its own:
- it proves only that the obvious typed-handler bundle names were removed
- it does not prove runtime or even compile-time correctness on Windows
- verified source-level residue scan for the touched Windows Go raw helpers/tests is now clean:
- no remaining `Handlers{...}` bundle initializers
- no remaining `winTestHandlers()` / `winFailingHandlers()` helpers
- no remaining `server.handlers` references in the Windows raw tests
- Windows Go package cross-compile proof now passes from this Linux host:
- `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/raw`
- `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/service/cgroups`
- the Unix interop/service/cache matrix now passes end-to-end after the resize rewrite:
/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop)$'
- the broader Unix shm/service/cache slice across C, Rust, and Go now also passes:
/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go)$'
- the previously exposed POSIX UDS mismatch is now resolved:
- Rust `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1` now passes `299/299`
- the stale transport tests were rewritten to match the accepted directional negotiation semantics:
  - requests are sender-driven
  - responses are server-driven
- C `test_uds` now proves directional negotiation explicitly and keeps direct receive-limit coverage through a raw malformed-response path
- the broader non-fuzz Unix CTest sweep now passes end-to-end:
  - `/usr/bin/ctest --test-dir build --output-on-failure -E '^(fuzz_protocol_30s|go_FuzzDecodeHeader|go_FuzzDecodeChunkHeader|go_FuzzDecodeHello|go_FuzzDecodeHelloAck|go_FuzzDecodeCgroupsRequest|go_FuzzDecodeCgroupsResponse|go_FuzzBatchDirDecode|go_FuzzBatchItemGet)$'`
  - result: `28/28` passed
- the public docs now match the accepted directional handshake semantics:
- `docs/level1-wire-envelope.md` explicitly says request limits are sender-driven and response limits are server-driven
- `docs/getting-started.md` no longer documents the deleted Rust `CgroupsHandlers` / `CgroupsServer` surface
- Windows transport test sources were aligned to the same directional contract:
- Go `src/go/pkg/netipc/transport/windows/pipe_integration_test.go` no longer expects the old min-style negotiation
- Rust `src/crates/netipc/src/transport/windows.rs` now contains a matching directional negotiation test
- Go Windows transport tests still have cross-compile proof from this Linux host:
  - `cd src/go && GOOS=windows GOARCH=amd64 go test -c ./pkg/netipc/transport/windows`
- local source checks are clean for the touched Windows C files:
git diff --check -- tests/fixtures/c/test_win_stress.c tests/fixtures/c/test_win_service_guards.c tests/fixtures/c/test_win_service_guards_extra.c TODO-netdata-plugin-ipc-integration.md
- local source checks are also clean for the touched Go/Rust raw files:
git diff --check -- src/crates/netipc/src/service/raw.rs src/crates/netipc/src/service/raw_unix_tests.rs src/go/pkg/netipc/service/raw/client.go src/go/pkg/netipc/service/raw/client_windows.go src/go/pkg/netipc/service/raw/shm_unix_test.go src/go/pkg/netipc/service/raw/helpers_windows_test.go src/go/pkg/netipc/service/raw/more_windows_test.go src/go/pkg/netipc/service/raw/shm_windows_test.go TODO-netdata-plugin-ipc-integration.md
- limitation:
- this Linux host does not have `x86_64-w64-mingw32-gcc`
- so local source cleanup alone is not enough for the edited Windows C fixtures
- the same host limitation means the `raw_windows_tests.rs` source cleanup is not backed by a real Windows Rust compile/run proof from this environment either
- the touched Windows Go packages now have cross-compile proof, but still do not have a real Windows runtime proof from this environment
- current verified Windows runtime status from the real `win11` workflow:
  - the documented `ssh win11` + `MSYSTEM=MINGW64` toolchain path works and has been used for real validation
  - after syncing the local tree, `cmake --build build -j4` on `win11` exposed real stale C fixture/API mismatches that were not visible from Linux source scans alone
  - the first verified `win11` failure classes were:
    - stale removed client helpers:
      - `nipc_client_call_increment`
      - `nipc_client_call_increment_batch`
      - `nipc_client_call_string_reverse`
    - stale internal error enum usage:
      - `NIPC_ERR_INTERNAL_ERROR`
    - stale raw-server handler signature assumptions:
      - old `bool` raw handlers instead of `nipc_error_t (*)(..., const nipc_header_t *, ...)`
    - stale `nipc_server_init(...)` argument ordering under the internal test macro path
    - stale client struct field assumptions such as `client.request_buf_size`
- those compile-time failures have now been corrected locally and revalidated on `win11`:
  - `test_win_service_extra.exe` now builds and passes on `win11`
- the remaining active Windows C problem is now narrower and runtime-only:
- after correcting the stale Windows C fixture/API mismatches and the baseline request-overflow signaling gap, `test_win_service_guards.exe` now passes on `win11`:
  - `=== Results: 141 passed, 0 failed ===`
- the previous apparent timeout was not a persistent runtime hang:
- later reruns completed normally once the stale one-item batch test drift was removed
- the last real guard-binary contradiction was:
- a one-item increment "batch" test still expecting reconnect/growth
- that expectation was wrong under the accepted semantics:
- one-item increment batches are normalized to the plain increment path
- the guard was rewritten to use a real 2-item batch for baseline request-resize coverage
- the rest of the edited Windows C runtime slice has now been validated on `win11` too:
  - `test_win_service.exe`: `=== Results: 80 passed, 0 failed ===`
  - `test_win_service_extra.exe`: `=== Results: 82 passed, 0 failed ===`
  - `test_win_service_guards_extra.exe`: `=== Results: 93 passed, 0 failed ===`
  - `test_win_stress.exe`: `=== Results: 1 passed, 0 failed ===`
- a combined rerun of all edited Windows C binaries also passed cleanly on `win11`
- the earlier `test_win_service.exe` timeout is not currently reproducible as a deterministic bug:
  - it timed out once in a combined slice and once in an early soak run
  - after the stale guard/test contradictions were removed, a focused rerun passed
  - a subsequent combined rerun passed
  - a targeted 3-run `win11` soak of `test_win_service.exe` also passed `3/3`
  - working theory:
- that earlier timeout was a transient host/process stall, not a currently reproducible library correctness bug
- a real L2 behavior gap was exposed and fixed during this `win11` investigation:
  - on baseline request overflow, the server session loop now emits a zero-payload `LIMIT_EXCEEDED` response before disconnecting, instead of silently breaking the session (sketched below)
  - this fix was needed for transparent request-side resize/reconnect to work on Windows baseline transport at all
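A minimal Rust sketch of that overflow-signaling path; all names are illustrative stand-ins, not the vendored server code:

```rust
// Hedged sketch of the baseline request-overflow signaling fix above.
enum Status { Ok, LimitExceeded }

struct Session { max_request_payload_bytes: usize }

impl Session {
    fn send_response(&mut self, _status: Status, _payload: &[u8]) -> std::io::Result<()> { Ok(()) }
    fn disconnect(self) -> std::io::Result<()> { Ok(()) }

    fn on_request(mut self, request_len: usize) -> std::io::Result<()> {
        if request_len > self.max_request_payload_bytes {
            // emit an explicit zero-payload LIMIT_EXCEEDED response instead
            // of silently breaking the session, so the client can reconnect
            // and renegotiate a larger request capacity
            self.send_response(Status::LimitExceeded, &[])?;
            return self.disconnect();
        }
        // ... normal dispatch path ...
        self.send_response(Status::Ok, &[])
    }
}
```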
- current remaining Windows Rust runtime blocker:
  - focused `win11` run:
    - `timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1`
  - current observed behavior:
    - build completes
    - test process prints:
      - `running 1 test`
      - `test service::cgroups::windows_tests::test_cache_round_trip_windows ...`
    - then stalls without completing
  - strongest current evidence:
    - Rust raw Windows tests already implement reliable Windows shutdown by:
      - storing the service name + wake client config
      - setting `running_flag = false`
      - issuing a dummy `NpSession::connect(...)` to wake the blocking `ConnectNamedPipe()`
    - cgroups Windows tests and Rust Windows interop binaries still use the weaker pattern:
      - only `running_flag = false`
      - no wake connection
    - the Windows accept loop in `src/crates/netipc/src/service/raw.rs` blocks in `listener.accept()`, which ultimately blocks in `ConnectNamedPipe()`, so `running_flag = false` alone is not sufficient to stop the server reliably on Windows
  - working theory:
    - the cache test body may already be completing
    - the stall is very likely in Windows server shutdown/join, not in snapshot/cache decoding itself
- that Rust Windows blocker is now verified fixed on `win11`:
  - fix:
    - cgroups Windows tests and Rust Windows interop binaries now use the same reliable Windows stop pattern already used by the Rust raw Windows tests (see the stop-pattern sketch after this item):
      - set `running_flag = false`
      - then issue a wake connection so the blocking `ConnectNamedPipe()` returns and the accept loop can observe shutdown
  - focused proof:
    - `timeout 120 cargo test --manifest-path src/crates/netipc/Cargo.toml test_cache_round_trip_windows -- --nocapture --test-threads=1`
    - result: `test service::cgroups::windows_tests::test_cache_round_trip_windows ... ok`
  - full Rust Windows lib proof:
    - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - result: `176 passed`, `0 failed`, `1 ignored`
  - factual conclusion:
    - the live bug was stale Windows shutdown/test-fixture behavior, not a current Rust cache decode/refresh correctness issue
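A minimal Rust sketch of that stop pattern; `NpSession` stands in for the real wake-client type, and the rest of the names are illustrative:

```rust
// Hedged sketch of the reliable Windows stop pattern described above; the
// wake connection is what unblocks ConnectNamedPipe().
use std::sync::atomic::{AtomicBool, Ordering};

struct NpSession;
impl NpSession {
    // stand-in for the real NpSession::connect(...) used as a wake client
    fn connect(_pipe_name: &str) -> std::io::Result<NpSession> { Ok(NpSession) }
}

fn stop_server(running_flag: &AtomicBool, pipe_name: &str) {
    // 1. tell the accept loop to stop
    running_flag.store(false, Ordering::SeqCst);
    // 2. issue a dummy wake connection so the blocking ConnectNamedPipe()
    //    inside listener.accept() returns and the loop observes the flag;
    //    the flag alone is not sufficient on Windows
    let _ = NpSession::connect(pipe_name);
    // 3. the accept loop can now exit and the server thread can be joined
}
```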
- broader real Windows interop/service/cache proof is now also green on `win11`:
  - command:
    - `timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"`
  - result:
    - `test_named_pipe_interop`: passed
    - `test_win_shm_interop`: passed
    - `test_service_win_interop`: passed
    - `test_service_win_shm_interop`: passed
    - `test_cache_win_interop`: passed
    - `test_cache_win_shm_interop`: passed
  - summary: `100% tests passed, 0 tests failed out of 6`
- targeted C rebuild and runtime verification now passes:
- `cmake --build build --target test_service test_hardening test_ping_pong`
- `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_hardening|test_ping_pong)$'`
- the latest naming / contract cleanup slice is now backed by both local Linux and real `win11` proof:
  - local Linux rerun:
    - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_hardening|test_ping_pong)$'`
    - result: `100% tests passed, 0 failed`
  - after syncing this slice's edited files to `win11`, targeted rebuild passed:
    - `cmake --build build -j4 --target test_win_service test_win_service_extra test_win_service_guards test_win_service_guards_extra`
  - direct `win11` runtime proof for the edited guard binaries passed:
    - `./test_win_service_guards.exe`
      - result: `=== Results: 141 passed, 0 failed ===`
    - `./test_win_service_guards_extra.exe`
      - result: `=== Results: 93 passed, 0 failed ===`
  - direct `win11` runtime proof for the edited service binaries also passed via CTest:
    - `ctest --test-dir build --output-on-failure -R "^(test_win_service|test_win_service_extra)$"`
    - result:
      - `test_win_service`: passed
      - `test_win_service_extra`: passed
- benchmark refresh on the current tree is now complete and synced:
- factual root cause of the benchmark blocker:
  - the C and Rust batch benchmark clients still generated random batch sizes in the range `1..1000`
  - the actual batch protocol normalizes `item_count == 1` to the non-batch path
  - Go was already correct and generated `2..1000`, which is why the same C batch server still interoperated with the Go client
- fixed in:
  - `bench/drivers/c/bench_posix.c`
  - `bench/drivers/c/bench_windows.c`
  - `bench/drivers/rust/src/main.rs`
  - `bench/drivers/rust/src/bench_windows.rs`
  - `bench/drivers/go/main.go`
  - `tests/run-posix-bench.sh`
  - `tests/run-windows-bench.sh`
- specific fixes (see the batch-size sketch after this list):
  - batch benchmark generators now use `2..1000` items for real batch scenarios
  - Windows benchmark failure reporting now defines `server_out` before calling `dump_server_output`
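A minimal Rust sketch of the corrected generation; `next_u64` stands in for whatever RNG the drivers actually use:

```rust
// The batch protocol normalizes item_count == 1 to the non-batch path, so
// real batch scenarios must never draw 1. Hedged sketch, not driver code.
fn random_batch_items(next_u64: &mut impl FnMut() -> u64) -> u32 {
    // uniform in 2..=1000 instead of the old 1..=1000
    2 + (next_u64() % 999) as u32
}
```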
- targeted proof after the fix:
- the previously failing pairs now succeed locally and on `win11`:
  - `uds-batch-ping-pong c->c`
  - `uds-batch-ping-pong rust->c`
  - `shm-batch-ping-pong c->c`
  - `shm-batch-ping-pong rust->c`
  - `np-batch-ping-pong c->c`
  - `np-batch-ping-pong rust->c`
- clean official reruns:
- Linux:
  - `bash tests/run-posix-bench.sh benchmarks-posix.csv 5`
  - result: `Total measurements: 201`
- Windows:
  - `ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/run-windows-bench.sh benchmarks-windows.csv 5'`
  - result: `Total measurements: 201`
- clean generated artifacts:
  - `bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md`
    - result: `All performance floors met`
  - `ssh win11 'cd /tmp/plugin-ipc-bench-fixed && ... && bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md'`
    - result: `All performance floors met`
- the follow-up benchmark spread investigation has now established a real benchmark-build bug on POSIX:
- the local benchmark runner used:
- C from `build/bin/bench_posix_c`
- Rust from `src/crates/netipc/target/release/bench_posix`
- Go from `build/bin/bench_posix_go`
- the local CMake tree used for the C benchmark was configured as:
  - `build/CMakeCache.txt`: `CMAKE_BUILD_TYPE:STRING=Debug`
- the benchmark target itself added `-O2`, but the C libraries it linked against were still unoptimized:
  - `build/CMakeFiles/bench_posix_c.dir/flags.make`: `C_FLAGS = -g -std=gnu11 -O2`
  - `build/CMakeFiles/netipc_protocol.dir/flags.make`: `C_FLAGS = -g -std=gnu11`
  - `build/CMakeFiles/netipc_service.dir/flags.make`: `C_FLAGS = -g -std=gnu11`
- a dedicated optimized benchmark tree proved this materially changes the published POSIX rows:
- release build setup:
  - `cmake -S . -B build-release -DCMAKE_BUILD_TYPE=Release`
  - `cmake --build build-release --target bench_posix_c bench_posix_go -j8`
- direct targeted reruns:
  - published `shm-batch-ping-pong c->c`: `25,947,290`
  - optimized C libs `shm-batch-ping-pong c(rel)->c(rel)`: `63,699,472`
  - published `uds-pipeline-batch-d16 c->c`: `49,512,090`
  - optimized C libs `uds-pipeline-batch-d16 c(rel)->c(rel)`: `103,212,623`
- mixed-language targeted reruns also moved sharply upward when the C side used optimized libraries:
  - intended `shm-batch-ping-pong c(rel)->rust`: `57,122,454`
  - intended `shm-batch-ping-pong rust->c(rel)`: `52,041,263`
  - intended `uds-pipeline-batch-d16 c(rel)->rust`: `91,093,895`
  - intended `uds-pipeline-batch-d16 rust->c(rel)`: `101,978,294`
- implemented fix:
- `tests/run-posix-bench.sh` now configures and uses a dedicated optimized benchmark tree:
  - default: `build-bench-posix`
  - build type: `Release`
- `tests/run-windows-bench.sh` now configures and uses a dedicated optimized benchmark tree:
  - default: `build-bench-windows`
  - build type: `Release`
  - explicit MinGW toolchain export on `win11`
- factual conclusion:
- the old checked-in POSIX benchmark report was distorted by linking the C benchmark binary against Debug-built C libraries
- the current checked-in POSIX and Windows benchmark artifacts now come from the corrected dedicated benchmark build paths
- the Windows benchmark tree is not affected by the same local Debug-build distortion:
- `ssh win11 '... grep CMAKE_BUILD_TYPE build/CMakeCache.txt'`
- `CMAKE_BUILD_TYPE:STRING=RelWithDebInfo`
- the previously suspicious Windows SHM batch outlier did not survive the corrected rerun:
- old checked-in row: `shm-batch-ping-pong c->rust = 9,282,667`
- corrected clean rerun row: `shm-batch-ping-pong c->rust = 55,868,058`
- final artifact sanity checks:
- `benchmarks-posix.csv`
  - rows: `201`
  - duplicate keys: `0`
  - zero-throughput rows: `0`
- `benchmarks-windows.csv`
  - rows: `201`
  - duplicate keys: `0`
  - zero-throughput rows: `0`
- checked-in benchmark docs are now synced to the refreshed artifacts:
- `benchmarks-posix.csv`
- `benchmarks-posix.md`
- `benchmarks-windows.csv`
- `benchmarks-windows.md`
- `README.md`
- corrected max-throughput ranges from the current checked-in artifacts:
- POSIX:
  - `uds-ping-pong`: `182,963` to `231,160`
  - `shm-ping-pong`: `2,460,317` to `3,450,961`
  - `uds-batch-ping-pong`: `27,182,404` to `40,240,940`
  - `shm-batch-ping-pong`: `31,250,784` to `64,148,960`
  - `uds-pipeline-d16`: `568,373` to `735,829`
  - `uds-pipeline-batch-d16`: `51,960,946` to `102,954,841`
  - `snapshot-baseline`: `158,948` to `205,624`
  - `snapshot-shm`: `1,006,053` to `1,738,616`
  - `lookup`: `114,556,227` to `203,279,430`
- Windows:
  - `np-ping-pong`: `18,241` to `21,039`
  - `shm-ping-pong`: `2,099,392` to `2,715,487`
  - `np-batch-ping-pong`: `7,013,700` to `8,550,220`
  - `shm-batch-ping-pong`: `36,494,096` to `58,768,397`
  - `np-pipeline-d16`: `245,420` to `270,488`
  - `np-pipeline-batch-d16`: `28,977,365` to `41,270,903`
  - `snapshot-baseline`: `16,090` to `20,967`
  - `snapshot-shm`: `857,823` to `1,262,493`
  - `lookup`: `107,472,315` to `164,305,717`
- current remaining raw Rust drift is now narrower and well-scoped:
- the raw managed server already enforces one `expected_method_code`
- the raw client surface still exposes a generic constructor and mixed call surface under the stale internal name `CgroupsClient`
- the next cleanup slice is to bind the raw Rust client constructors to one service kind and migrate the raw Rust tests to those constructors, matching the already-correct Go raw design
- raw Rust client drift is now removed from the active service surface:
- `src/crates/netipc/src/service/raw.rs` now exposes `RawClient` instead of the stale internal multi-kind name `CgroupsClient`
- the raw client is now created only through service-kind-specific constructors:
  - `RawClient::new_snapshot(...)`
  - `RawClient::new_increment(...)`
  - `RawClient::new_string_reverse(...)`
- request kind remains only as envelope validation on the raw client
- the raw Rust Unix/Windows tests now create snapshot, increment, and string-reverse clients explicitly instead of reusing one generic constructor across service kinds
- local Linux Rust proof for that slice is now green:
- `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1`
  - result: `75 passed`, `0 failed`
- `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `299 passed`, `0 failed`
- real `win11` Rust proof for that slice is now green too:
  - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `176 passed`, `0 failed`, `1 ignored`
- the broader `win11` interop/service/cache matrix initially exposed two more stale constructor residues outside the Rust raw tests:
  - Rust benchmark drivers still imported the deleted raw `CgroupsClient` instead of using the public snapshot facade
    - fixed in:
      - `bench/drivers/rust/src/main.rs`
      - `bench/drivers/rust/src/bench_windows.rs`
  - Go public cgroups wrappers still called the deleted generic raw constructor `raw.NewClient(...)`
    - fixed in:
      - `src/go/pkg/netipc/service/cgroups/client.go`
      - `src/go/pkg/netipc/service/cgroups/client_windows.go`
  - Go benchmark drivers still hand-rolled the stale raw dispatch signature instead of using the single-kind increment adapter
    - fixed in:
      - `bench/drivers/go/main.go`
      - `bench/drivers/go/main_windows.go`
- the next verified contradiction slice was documentation-heavy and is now resolved:
- low-level SHM / handshake docs now describe the accepted directional negotiation model and the current session-scoped SHM lifecycle:
- request limits are sender-driven
- response limits are server-driven
- SHM capacities are fixed per session
- larger learned capacities require a reconnect and a new session, not in-place SHM resize
- `docs/level1-wire-envelope.md` no longer says handshake rule 6 takes the minimum of client and server values
- `docs/level1-windows-np.md` now documents per-session Windows SHM object names with `session_id`, aligned with both code and `docs/level1-windows-shm.md`
- public L2 comments/docs no longer claim a blanket "retry ONCE":
  - ordinary failures still retry once
  - overflow-driven resize recovery may reconnect more than once while capacities grow
- Unix test/script cleanup helpers no longer remove the stale pre-session path `{service}.ipcshm`; they now use per-session cleanup that matches `{service}-{session_id}.ipcshm` (see the path sketch after this item)
- validation for this slice is green:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib service::raw::tests:: -- --test-threads=1`
    - result: `75 passed`, `0 failed`
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw`
    - result: `ok`
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service_interop|test_cache_interop|test_shm_interop)$'`
    - result: `100% tests passed`, `0 failed`
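A minimal Rust illustration of the per-session path that cleanup must now target; `run_dir`, `service`, and `session_id` are parameters, not fixed names:

```rust
// Illustrative sketch of the documented {service}-{session_id}.ipcshm
// naming; the old pre-session {service}.ipcshm path no longer exists.
use std::path::{Path, PathBuf};

fn session_shm_path(run_dir: &Path, service: &str, session_id: u64) -> PathBuf {
    run_dir.join(format!("{service}-{session_id}.ipcshm"))
}
```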
- the next verified residue slice is narrower and fixture-focused:
- several Unix C/Go fixture cleanup helpers still unlink the dead pre-session path `{service}.ipcshm` instead of using per-session cleanup
- current proven hits:
  - `tests/fixtures/c/test_service.c`
  - `tests/fixtures/c/test_cache.c`
  - `tests/fixtures/c/test_hardening.c`
  - `tests/fixtures/c/test_chaos.c`
  - `tests/fixtures/c/test_multi_server.c`
  - `tests/fixtures/c/test_stress.c`
  - `src/go/pkg/netipc/service/cgroups/cgroups_unix_test.go`
- that Unix fixture-cleanup residue slice is now resolved:
- the touched Unix C fixtures now use `nipc_shm_cleanup_stale(TEST_RUN_DIR, service)` instead of unlinking the dead `{service}.ipcshm` path
- the touched Go public cgroups Unix tests now use `posix.ShmCleanupStale(testRunDirUnix, service)` instead of removing the dead `{service}.ipcshm` path
- validation for this slice is green:
  - `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups`
    - result: `ok`
  - `cmake --build build --target test_service test_cache test_hardening test_multi_server test_chaos test_stress`
    - result: rebuild passed
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_service|test_cache|test_hardening|test_multi_server|test_chaos|test_stress)$'`
    - result: `100% tests passed`, `0 failed`
- one more live Unix fixture contradiction remains after that cleanup pass:
- `tests/fixtures/c/test_chaos.c:test_shm_chaos()` still opens the dead pre-session SHM path `{run_dir}/{service}.ipcshm`
- this is not just stale cleanup text; it likely means the SHM-chaos path is not actually targeting the live per-session SHM file today
- that live SHM-chaos contradiction is now resolved:
- `tests/fixtures/c/test_chaos.c:test_shm_chaos()` now captures the live `session_id` from the ready client session and opens `{run_dir}/{service}-{session_id}.ipcshm`
- the test no longer treats "SHM file not found" as an acceptable skip on this path
- validation:
  - `cmake --build build --target test_chaos`
    - result: rebuild passed
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^test_chaos$'`
    - result: `100% tests passed`, `0 failed`
- current residue scan excluding this TODO file is now clean for the main drift markers:
- no remaining old `{service}.ipcshm` path literals
- no remaining deleted `CgroupsHandlers` / `CgroupsServer` API references
- no remaining deleted `raw.NewClient(...)` / `service::raw::CgroupsClient` references
- no remaining deleted `new_single_kind` / `with_workers_single_kind` references
- broader Unix validation after these cleanup passes is also green:
  - `/usr/bin/ctest --test-dir build --output-on-failure -R '^(test_uds_interop|test_shm_interop|test_service_interop|test_service_shm_interop|test_cache_interop|test_cache_shm_interop|test_shm|test_service|test_cache|test_shm_rust|test_service_rust|test_shm_go|test_service_go|test_cache_go|test_hardening|test_ping_pong|test_multi_server|test_chaos|test_stress)$'`
  - result: `100% tests passed`, `0 failed`, `19/19` passed
- local Go proof for the wrapper/benchmark cleanup is now green:
- `cd src/go && go test -count=1 ./pkg/netipc/service/cgroups`
  - result: `ok`
- `cd bench/drivers/go && go test -run '^$' ./...`
  - result: compile-only pass
- real `win11` build + matrix proof after those residue fixes is now green:
  - `cmake --build build -j4`
    - result: build succeeds again after the Rust/Go constructor cleanup
  - `timeout 1800 ctest --test-dir build --output-on-failure -R "^(test_named_pipe_interop|test_win_shm_interop|test_service_win_interop|test_service_win_shm_interop|test_cache_win_interop|test_cache_win_shm_interop)$"`
    - result:
      - `test_named_pipe_interop`: passed
      - `test_win_shm_interop`: passed
      - `test_service_win_interop`: passed
      - `test_service_win_shm_interop`: passed
      - `test_cache_win_interop`: passed
      - `test_cache_win_shm_interop`: passed
    - summary: `100% tests passed, 0 tests failed out of 6`
- verified residue scan for the stale constructor names used in this slice is now clean:
- no remaining `raw.NewClient`
- no remaining `service::raw::CgroupsClient`
- no remaining `RawClient::new(`
- a smaller cross-platform residue cleanup is now also complete:
- the test-only Rust helper `dispatch_single()` in `src/crates/netipc/src/service/raw.rs` is now explicitly marked as dead-code-tolerant under test builds, so Windows lib-test builds no longer emit the stale unused-function warning
- the remaining public docs/spec wording in this slice was normalized away from the older "method-specific" phrasing where it described the public L2 service surface or service contracts:
  - `docs/level1-transport.md`
  - `docs/codec.md`
  - `docs/level2-typed-api.md`
  - `docs/code-organization.md`
  - `docs/codec-cgroups-snapshot.md`
- local Linux validation after that wording/test-helper cleanup is still green:
- `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `299 passed`, `0 failed`
- real `win11` validation after that cleanup is also still green:
  - `timeout 900 cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
  - result: `176 passed`, `0 failed`, `1 ignored`
- factual note:
  - the previous Windows-only `dispatch_single` unused-function warning is no longer present in this run
- the Windows guard output still shows the accepted request-resize behavior:
- transparent recovery
- exactly one reconnect
- negotiated request-size growth
- new verified internal raw-client alignment:
- fact:
- the raw managed servers in Go and Rust were already bound to one `expected_method_code`
- the remaining client-side drift was that one long-lived raw client context still exposed multiple service-kind calls
- implementation slice now completed in Go:
- raw Go clients are now created per service kind:
  - `NewSnapshotClient(...)`
  - `NewIncrementClient(...)`
  - `NewStringReverseClient(...)`
- each client now stores one expected request code and rejects wrong-kind calls as validation failures instead of pretending one client can legitimately serve multiple service kinds
- the cache helpers now bind explicitly to `cgroups-snapshot`
- exact local Unix proof:
- `cd src/go && go test -count=1 ./pkg/netipc/service/raw`
  - result: `ok`
- exact real Windows proof on `win11`:
  - `cd ~/src/plugin-ipc.git/src/go && go test -count=1 ./pkg/netipc/service/raw`
  - first rerun exposed one Windows-only missed constructor site:
    - `pkg/netipc/service/raw/shm_windows_test.go:334`
    - stale `NewClient(...)`
  - after correcting that last Windows-only leftover and resyncing:
    - result: `ok`
- factual conclusion:
- the Go raw helper layer is now materially aligned with the accepted single-service-kind design on both Unix and Windows
- remaining work is to carry the same invariant through the remaining Rust raw helper surface
- fact:
- a full Rust `cargo test --lib` run is still blocked by one unrelated transport failure outside this rewrite slice:
  - `transport::posix::tests::test_receive_batch_count_exceeds_limit`
- remaining heavy work is now concentrated in:
- proving the accepted resize behavior with the full interop/service/cache matrices on Unix and Windows, not just the targeted raw suites
- getting real Windows compile/run proof for the edited Rust/Go/C Windows test surfaces
- reconciling the current C path with the final single-kind + learned-size design language everywhere, then validating all 3 languages together
- `cgroups.plugin` is not an external executable. It runs inside the Netdata daemon:
  - `cgroups_main()` is started from `src/daemon/static_threads_linux.c`.
- `ebpf.plugin` is a separate external executable:
  - built by `add_executable(ebpf.plugin ...)` in `CMakeLists.txt`.
- Current `cgroups.plugin` -> `ebpf.plugin` integration is a custom SHM + semaphore contract:
  - producer: `src/collectors/cgroups.plugin/cgroup-discovery.c`
  - shared structs: `src/collectors/cgroups.plugin/sys_fs_cgroup.h`
  - consumer: `src/collectors/ebpf.plugin/ebpf_cgroup.c`
- The shared payload currently transports cgroup metadata, not PID membership:
- fields: `name`, `hash`, `options`, `enabled`, `path`
- `ebpf.plugin` still reads each `cgroup.procs` file itself.
- Netdata already has a stable per-run invocation identifier:
- `src/libnetdata/log/nd_log-init.c`
- Netdata reads `NETDATA_INVOCATION_ID`, else `INVOCATION_ID`, else generates a UUID and exports `NETDATA_INVOCATION_ID`.
- External plugins are documented to receive `NETDATA_INVOCATION_ID`:
  - `src/plugins.d/README.md`
- Netdata already exposes plugin environment variables centrally:
src/daemon/environment.c
- Netdata already has the right build roots for all 3 languages:
- C via top-level `CMakeLists.txt`
- Rust workspace in `src/crates/Cargo.toml`
- Go module in `src/go/go.mod`
- `plugin-ipc` already has the exact L3 cgroups snapshot API for this use case:
  - `docs/level3-snapshot-api.md`
- The typed snapshot schema closely matches Netdata’s current SHM payload:
src/libnetdata/netipc/include/netipc/netipc_protocol.h
- The C API already supports:
- managed server lifecycle
- typed cgroups client/cache
- POSIX transport with negotiated SHM fast path
- Authentication in `plugin-ipc` is a `uint64_t auth_token`:
  - `src/libnetdata/netipc/include/netipc/netipc_service.h`
  - `src/libnetdata/netipc/include/netipc/netipc_uds.h`
  - Rust/Go implementations use the same concept.
- Phase 1 can replace the metadata transport only.
- Phase 1 will not remove `ebpf.plugin` reads of `cgroup.procs`.
- The default `plugin-ipc` response size is too small for real Netdata snapshots on large hosts, so Linux integration must use an explicit large response limit.
- The best build/distribution model is in-tree vendoring inside Netdata, not an external system dependency.
- Current Netdata payload sizing evidence already proves this:
- `cgroup_root_max` default is `1000` in `src/collectors/cgroups.plugin/sys_fs_cgroup.c`
- current per-item SHM body carries `name[256]` and `path[FILENAME_MAX + 1]` in `src/collectors/cgroups.plugin/sys_fs_cgroup.h`
- `FILENAME_MAX` on this Linux build environment is `4096`
- this means the current per-item shape is already about `4.3 KiB` before protocol framing/alignment (256 + 4097 bytes plus the remaining fields), so `1000` items already imply roughly `4.2 MiB` of worst-case snapshot payload
- The original written phase plan did not describe a multi-method server.
- Evidence:
- `TODO-plugin-ipc.history.md`
- historical phase plan still says:
Define and freeze a minimal v1 typed schema for one RPC method ('increment')
- The first generated L2 spec also did not need a multi-method server model.
- Evidence:
- initial `docs/level2-typed-api.md` from commit `1722f95`
- no raw transport-level switch over multiple method codes in that initial text
- initial
- The history TODO already contained the correct service-oriented discovery model.
- Evidence:
- `TODO-plugin-ipc.history.md`
- explicit historical decisions already said:
- discovery is service-oriented, not plugin-oriented
- service names are the stable public contract
- one endpoint per service
- one persistent client context per service
- startup order can remain random
- caller owns reconnect cadence via `refresh(ctx)`
- Implication:
- the later multi-method server model was not a missing discussion
- it was drift away from an already-decided service model
- The first explicit spec drift appears in commit `53b5e5a` on `2026-03-16`.
  - Evidence:
    - `docs/level2-typed-api.md` in commit `53b5e5a`
    - handler contract changed to:
      - raw-byte transport handler
      - `switch(method_code)` over `INCREMENT`, `STRING_REVERSE`, `CGROUPS`
    - this is the first clear documentation model where one server endpoint dispatches multiple request kinds
- The first strong implementation-level generalization appears the same day in commit `69bb794`.
  - Evidence:
    - commit message explicitly says:
      - `Add dispatch_increment(), dispatch_string_reverse(), dispatch_cgroups_snapshot()`
    - `docs/getting-started.md` in that commit adds typed helper examples for more than one method family
    - this widened the implementation and examples toward a generic multi-method dispatch surface
- The drift was then reinforced in public examples in commit `6014b0e` on `2026-03-17`.
  - Evidence:
    - `docs/getting-started.md`
    - C example registers: `.on_increment`, `.on_cgroups`
    - Rust example registers: `on_increment`, `on_cgroups`
    - Go example registers: `OnIncrement`, `OnSnapshot`
    - text says: `You register typed callbacks for the supported methods`
- The drift became operationally entrenched in interop in commit `099945b` on `2026-03-16`.
  - Evidence:
    - commit message explicitly says:
      - `Cross-language interop now tests all method types`
    - interop fixtures for C, Rust, and Go on POSIX and Windows all dispatch:
      - `INCREMENT`
      - `CGROUPS_SNAPSHOT`
      - `STRING_REVERSE`
- The drift later propagated into current coverage/TODO planning and the repository README.
- Evidence:
- `TODO-pending-from-rewrite.md` planned: `snapshot / increment / string-reverse / batch over SHM`
- `README.md` now says: `servers register typed handlers`
- There is currently no evidence in the TODO history that the original direction from the user was:
- one server should serve multiple request kinds
- The strongest historical evidence points the other way:
- the original phase plan explicitly named one RPC method only
- Working theory:
- the drift started when the typed API was generalized from one typed request kind per server to one generic server dispatching multiple method codes
- then examples, interop fixtures, tests, coverage plans, and README text copied that model until it felt normal
- Windows runtime validation host
  - User decision: use `win11` over SSH for real Windows proof instead of stopping at source cleanup or cross-compilation from Linux.
  - Constraint:
    - prefer the already-documented `win11` workflow from this repository's TODOs/docs
    - do not guess the Windows execution flow when the repo already documents it
  - Implication:
    - touched Windows Rust/Go/C transport/service/interop/cache surfaces should now be proven on a real Windows runtime, not just by static review or Linux-hosted cross-compilation
    - the next implementation slice should follow the existing `win11` operational guidance already captured in the repo
- Authentication source
  - User decision: use `NETDATA_INVOCATION_ID` for authentication.
  - Meaning:
    - the auth value changes on every Netdata run
    - only plugins launched under the same Netdata instance can authenticate
  - Evidence:
    - `src/libnetdata/log/nd_log-init.c` creates/exports `NETDATA_INVOCATION_ID`
    - `src/plugins.d/README.md` documents it for external plugins
  - Implication:
    - this is stronger than a machine-stable token for local plugin-to-plugin IPC
    - restarts invalidate old clients automatically
- Source layout in Netdata
  - User decision: native Netdata layout.
  - Layout:
    - C in `src/libnetdata/netipc/`
    - Rust in `src/crates/netipc/`
    - Go in `src/go/pkg/netipc/`
  - Implication:
    - the library becomes a first-class internal Netdata component in all 3 languages
    - future sync from `plugin-ipc` upstream will be manual/curated, not subtree-based
- Invocation ID to auth-token mapping
  - User decision: derive the `plugin-ipc` `uint64_t auth_token` from `NETDATA_INVOCATION_ID` using a deterministic hash (a candidate sketch follows this item).
  - Constraint:
    - the mapping must be identical in C, Rust, and Go
  - Implication:
    - only processes launched under the same Netdata run can authenticate
    - Netdata restart rotates auth automatically
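The decision fixes "deterministic hash", not a specific algorithm; as one candidate that is trivial to reproduce byte-identically in C, Rust, and Go, FNV-1a over the invocation-ID string:

```rust
// Illustrative candidate only: the recorded decision does not pin the
// algorithm. FNV-1a 64-bit is shown because it is easy to implement
// identically in all three languages.
fn auth_token_from_invocation_id(invocation_id: &str) -> u64 {
    const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = FNV_OFFSET;
    for byte in invocation_id.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash // same NETDATA_INVOCATION_ID -> same auth_token in every language
}
```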
- Rollout mode
  - User decision: big-bang switch.
  - Implication:
    - there will be no legacy custom-SHM fallback path for this metadata channel
  - Risk:
    - any bug in the new path blocks `ebpf.plugin` cgroup metadata integration immediately
- Linux response size policy
  - User concern/decision direction:
    - do not accept a large fixed memory cost such as `16 MiB` just for this IPC path
    - prefer dynamic behavior that adapts to actual payload size
    - allocation should happen only when needed
  - Implication:
    - the current `plugin-ipc` response budgeting model needs review before integration
    - response sizing / negotiation may need design changes, not just configuration
- Snapshot overflow handling direction
  - User decision direction:
    - reconnect is acceptable for snapshot overflow handling
    - growth policy should be power-of-two
    - SHM L2 should transparently handle overflow-driven resizing, hidden from both L2 clients and L2 servers
  - User design intent:
    - the server should not need to know the final safe snapshot size before the first request
    - the first real overflow during response preparation should trigger the resize path
    - once the server has learned a larger size from a real snapshot, later clients should negotiate into that larger size automatically
  - Implication:
    - current fixed per-session SHM sizing and current HELLO/HELLO_ACK limit semantics are not sufficient as-is for this Netdata use case
    - the growth mechanism likely needs new L2 protocol behavior, not only implementation tweaks (see the recovery-loop sketch after this item)
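A minimal Rust sketch of the accepted recovery direction, assuming a hypothetical client API (`connect_with_response_capacity` and `request_snapshot` are stand-ins):

```rust
// Hedged sketch: power-of-two growth, one reconnect per attempt, and the
// learned size retained so later sessions negotiate straight into it.
enum IpcError { LimitExceeded, Other(String) }

struct Client;
impl Client {
    fn connect_with_response_capacity(_bytes: u64) -> Result<Client, IpcError> { Ok(Client) }
    fn request_snapshot(&mut self) -> Result<Vec<u8>, IpcError> { Ok(Vec::new()) }
}

fn fetch_snapshot(mut capacity: u64) -> Result<Vec<u8>, IpcError> {
    loop {
        let mut client = Client::connect_with_response_capacity(capacity)?;
        match client.request_snapshot() {
            Ok(snapshot) => return Ok(snapshot),
            Err(IpcError::LimitExceeded) => {
                // power-of-two growth, then reconnect and retry; a real
                // implementation would also enforce a sanity ceiling
                capacity *= 2;
            }
            Err(other) => return Err(other),
        }
    }
}
```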
- Pre-integration gating
  - User decision:
    - implement this transparent SHM resize behavior in `plugin-ipc` first
    - do not start Netdata integration before it is done
    - require thorough validation first, including full interop matrices across C/Rust/Go on Unix and Windows
  - Verified evidence that the repo already has the right validation scaffolding:
    - POSIX interop tests in `CMakeLists.txt`:
      - `test_uds_interop`
      - `test_shm_interop`
      - `test_service_interop`
      - `test_service_shm_interop`
      - `test_cache_interop`
      - `test_cache_shm_interop`
    - Windows interop tests in `CMakeLists.txt`:
      - `test_named_pipe_interop`
      - `test_win_shm_interop`
      - `test_service_win_interop`
      - `test_cache_win_interop`
    - Existing transport-specific integration tests already exist:
      - POSIX SHM: `tests/fixtures/c/test_shm.c`, Rust `src/crates/netipc/src/transport/shm_tests.rs`
      - Windows SHM: `tests/fixtures/c/test_win_shm.c`, Rust `src/crates/netipc/src/transport/win_shm.rs`, Go `src/go/pkg/netipc/transport/windows/shm_test.go`
  - Implication:
    - the resize feature must be proven at:
      - L1 transport level
      - L2 service/client level
      - cross-language interop level
      - both POSIX and Windows implementations
- Design priorities for the resize rewrite
- User decision:
- optimize for long-term correctness, reliability, robustness, and performance
- backward compatibility is not required
- do not optimize for minimizing work now
- prefer the right design even if that means a substantial rewrite
- Implication:
- decisions should favor clean semantics and maintainability over preserving current handshake/transport structure
- a third rewrite is acceptable if it produces a better architecture
- User design constraints from follow-up discussion
- IPC servers should service a single request kind.
- Sessions should be assumed long-lived:
- connect once
- serve many requests
- disconnect on shutdown or exceptional recovery
- Benchmark refresh slice disposition
- User decision:
- commit and push the refreshed benchmark slice now
- then investigate the remaining benchmark spreads separately
- Implication:
- commit only the benchmark-fix, benchmark-artifact, and benchmark-doc sync files from this slice
- do not mix this commit with unrelated cleanup or integration work
- Current commit scope
- User decision:
- commit and push the full remaining work from this task now
- Implication:
- stage the remaining drift-removal, SHM-resize, service-kind alignment, test, and doc changes that belong to this task
- avoid unrelated local or user-owned changes outside this task
- Steady-state fast path matters far more than the rare resize path.
- Learned transport sizes are important:
- adapt automatically
- stabilize quickly
- then remain fixed for the lifetime of the process
- reset on restart
- Separate request and response sizing should exist.
- Variable sizing pressure is expected mainly on responses, not requests.
- Artificial hard caps are not acceptable as a design crutch.
- Disconnect-based recovery is acceptable if it is reliable and the system stabilizes.
- Accepted architecture decisions for the SHM resize rewrite
- User accepted:
- L2 service model: single-method-per-server
- Resize signaling path: explicit `LIMIT_EXCEEDED` signal, then disconnect/reconnect
- Initial size policy: per-server-kind compile-time defaults
- Learned-size lifetime: in-memory only for the current process lifetime, reset on restart
- Implication:
- the current generic multi-method service abstraction is now known design drift
- the rewrite should simplify transport/service code around one request kind per server
- Service discovery and availability model
- User clarified the intended service model explicitly:
- clients connect to a service kind, not to a specific plugin implementation
- each service endpoint serves one request kind only
- example service kinds include:
- `cgroups-snapshot`
- `ip-to-asn`
- `pid-traffic`
- the serving plugin is intentionally abstracted away from clients
- User clarified the intended runtime model explicitly:
- plugins are asynchronous peers
- startup order is not guaranteed
- enrichments from other plugins/services are optional
- a client plugin may start before the service it needs exists
- a service may disappear and reappear during runtime
- clients must reconnect periodically and tolerate service absence
- Implication:
- repository docs/specs/TODOs must describe:
- service-name-based discovery
- service-type ownership independent from plugin identity
- optional dependency semantics
- reconnect / retry behavior for not-yet-available services
- repository docs/specs/TODOs must describe:
- Execution mandate for this phase
- User decision:
- proceed autonomously to remove the drift from implementation and docs
- align code, tests, and examples to the single-service-kind model
- implement the accepted SHM size renegotiation / resize behavior
- remove contradictory wording and stale examples that preserve the wrong model
- Implication:
- this is now a repository-wide consistency and implementation task
- active docs, public APIs, interop fixtures, and validation must converge on the same model before Netdata integration
- Request-kind field semantics
- User clarification:
- request type / method code may remain in wire structures and headers
- its role is validation, not public multi-method dispatch
- a service endpoint expects exactly one request kind
- any other request kind must be rejected (see the validation sketch after this item)
- Implication:
- we can keep method codes in the protocol
- service implementations must bind one endpoint to one expected request kind
- public APIs/tests/docs must not imply that one service endpoint accepts multiple unrelated request kinds
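A minimal Rust sketch of that one-endpoint-one-kind validation; the method-code constant and error shape are illustrative, not the real protocol definitions:

```rust
// Hedged sketch: the method code stays on the wire, but its only role is
// validation. One endpoint accepts exactly one request kind.
const EXPECTED_METHOD_CODE: u16 = 0x0001; // e.g. the snapshot request kind

fn validate_request_kind(header_method_code: u16) -> Result<(), String> {
    if header_method_code != EXPECTED_METHOD_CODE {
        return Err(format!(
            "rejected request kind {header_method_code:#06x}; \
             this endpoint serves only {EXPECTED_METHOD_CODE:#06x}"
        ));
    }
    Ok(())
}
```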
- Payload-vs-service boundary
- User clarification:
- if a service needs arrays of things, batching belongs to that service payload/codec
- batching is not a reason for one L2 endpoint to expose multiple public request kinds
- Implication:
- the public L2 service layer should not keep generic multi-method or generic batch dispatch as part of its contract
- `INCREMENT`, `STRING_REVERSE`, and batch ping-pong traffic can remain at protocol / transport / benchmark level
- the public cgroups snapshot service should be snapshot-only
- Service naming and endpoint placement
  - Context:
    - POSIX transport needs a service name and run-dir placement.
    - Netdata already has `os_run_dir(true)`.
  - Open question:
    - exact service name/versioning strategy for the cgroups snapshot endpoint
- Exact Linux response-size budget
  - Context:
    - user rejected a large fixed per-connection budget as bad for footprint
    - dynamic/adaptive options must be evaluated against the current `plugin-ipc` design
  - Current hard payload evidence:
    - `1000` cgroups at roughly `4.3 KiB` each already implies multi-megabyte worst-case snapshots
  - Open question:
    - what protocol / implementation change best preserves low idle footprint while still supporting large snapshots
- Dynamic response sizing model
  - Context:
    - current `plugin-ipc` session handshake negotiates `agreed_max_response_payload_bytes` once
    - current implementations then size buffers against that session-wide maximum
  - Verified evidence:
    - handshake uses `min(client, server)` in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - C client allocates request/response/send buffers eagerly in `src/libnetdata/netipc/src/service/netipc_service.c`
    - C server allocates per-session response buffer sized to the full negotiated maximum in `src/libnetdata/netipc/src/service/netipc_service.c`
    - Linux SHM region size is fixed from negotiated request/response capacities in `src/libnetdata/netipc/src/transport/posix/netipc_shm.c`
    - UDS chunked receive is already dynamically grown with `realloc` in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - Rust and Go clients are already more dynamic and grow buffers lazily in:
      - `src/crates/netipc/src/service/cgroups.rs`
      - `src/go/pkg/netipc/service/cgroups/client.go`
    - Netdata `ebpf.plugin` refreshes cgroup metadata every 30 seconds:
      - `src/collectors/ebpf.plugin/ebpf_process.h`
      - `src/collectors/ebpf.plugin/ebpf_cgroup.c`
- Decision needed:
- choose whether to keep the current protocol and improve allocation policy only, or evolve the protocol to support truly dynamic large snapshots
- Options:
- A. Keep protocol, make implementation adaptive, and use baseline-only transport for the cgroups snapshot service in phase 1
- B. Add paginated snapshot requests/responses
- C. Add out-of-band exact-sized bulk snapshot transfer for large responses
- D. Keep the current fixed session-wide max model and just configure a large cap
- E. Keep SHM for data, but negotiate/create SHM capacity per request instead of per session
- F. Split transport into a tiny control channel plus ephemeral payload channel/object
- G. Add a small size-probe step before fetching the full snapshot
- H. Add true server-streamed snapshot responses (multi-message response sequence)
- I. Allow snapshot responses to return "resize to X bytes and retry", so the client grows once on demand and reuses that larger buffer from then on
- J. Make SHM L2 transparently reconnect and double capacities on overflow, so resizing is hidden from both clients and servers and the server retains the learned larger size for future sessions
- Current preferred direction under discussion:
- J, but it still needs stress-testing against the current HELLO/HELLO_ACK semantics, SHM lifecycle, and L2 retry behavior
- Transparent SHM resize semantics
  - Context:
    - user direction is to make SHM L2 resizing automatic and transparent to both clients and servers
    - reconnect is acceptable and growth should be power-of-two on overflow
  - Verified evidence:
    - current server sends `NIPC_STATUS_INTERNAL_ERROR` on handler/batch failure in `src/libnetdata/netipc/src/service/netipc_service.c`
    - current C/Go/Rust clients treat any non-`OK` response transport status as bad layout / failure:
      - `src/libnetdata/netipc/src/service/netipc_service.c`
      - `src/go/pkg/netipc/service/cgroups/client.go`
      - `src/crates/netipc/src/service/cgroups.rs`
    - `NIPC_STATUS_LIMIT_EXCEEDED` already exists in `src/libnetdata/netipc/include/netipc/netipc_protocol.h`
- Corrected layering rule from user discussion:
- transport/L2 may handle overflow signaling, reconnect, and shared-memory remap mechanics
- replay detection for mutating RPCs belongs to the request payload and the server business logic, not to transport-level semantic dedupe
- Clarified implication:
- transport should not try to "understand" whether a mutation was already applied
- if a mutating method cares about replay safety, it must carry a request identity / idempotency token in its own payload and the server method must enforce it (see the idempotency sketch after this item)
- For the Netdata cgroups snapshot use case:
- this is not a blocker, because snapshot is read-only
- Open question:
- whether transparent reconnect-and-retry should be generic transport behavior for all methods, or exposed as a capability that higher layers opt into when their payload semantics make replay safe
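A minimal Rust sketch of that layering rule: replay protection lives in the request payload and server logic, never in the transport. All types here are illustrative:

```rust
// Hedged sketch: transport-level reconnect/retry may deliver the same
// logical request twice; the *method*, not the transport, decides whether
// it was already applied.
use std::collections::HashMap;

struct MutatingRequest {
    idempotency_token: u64, // chosen by the caller, reused on transparent retry
    payload: Vec<u8>,
}

fn serve_mutation(
    applied: &mut HashMap<u64, Vec<u8>>,
    req: &MutatingRequest,
) -> Vec<u8> {
    if let Some(previous) = applied.get(&req.idempotency_token) {
        return previous.clone(); // already applied: return the prior response
    }
    let response = req.payload.clone(); // stand-in for the real mutation
    applied.insert(req.idempotency_token, response.clone());
    response
}
```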
- Negotiation semantics for learned SHM size
  - Context:
    - user correctly rejected the current `min(client, server)` rule for learned snapshot sizing
    - current handshake stores only one scalar per direction, so it cannot distinguish:
      - client hard cap
      - client initial size
      - server learned target size
  - Verified evidence:
    - current HELLO/HELLO_ACK uses fixed `agreed_max_*` fields in:
      - `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
      - `src/crates/netipc/src/transport/posix.rs`
      - `src/crates/netipc/src/transport/windows.rs`
  - Open question:
    - should the protocol split "current operational size" from "hard ceiling", so the server can advertise a learned larger target without losing the client's ability to refuse absurd allocations
-
Request-side vs response-side SHM growth asymmetry
- Verified evidence:
  - POSIX SHM send rejects oversize messages locally before the peer can react:
    - `src/libnetdata/netipc/src/transport/posix/netipc_shm.c`
  - existing tests already cover this class of failure:
    - `tests/fixtures/c/test_shm.c`
    - `tests/fixtures/c/test_service.c` (`test_shm_batch_send_overflow_on_negotiated_limit`)
    - `tests/fixtures/c/test_win_shm.c`
    - `tests/fixtures/c/test_win_service_guards.c`
- Implication:
- response-capacity growth can be learned by the server while building a response
- request-capacity growth cannot be learned the same way, because an oversize request fails client-side before the server sees it
- Open question:
- should the first implementation cover:
- response-side transparent resize only
- or symmetric request+response resize with separate client-learned request sizing semantics
- Netdata lifecycle ownership details
- Context:
  - `cgroups.plugin` runs in-daemon
  - `ebpf.plugin` is external
- Open question:
  - exact daemon init/shutdown points for starting/stopping the `plugin-ipc` cgroups server and for initializing the `ebpf.plugin` client cache
- Update the wire-level negotiation implementation in C / Go / Rust transports.
  - Enforce `max_request_payload_bytes <= 1 MiB` during handshake (a cap-check sketch follows this item).
  - Reject oversized request proposals with handshake `LIMIT_EXCEEDED`.
  - Stop using request-side `max(client, server)`.
  - Make `agreed_max_response_batch_items` strictly equal to the effective request batch-item limit.
  - Evidence:
    - `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
    - `src/libnetdata/netipc/src/transport/windows/netipc_named_pipe.c`
    - `src/go/pkg/netipc/transport/posix/uds.go`
    - `src/go/pkg/netipc/transport/windows/pipe.go`
    - `src/crates/netipc/src/transport/posix.rs`
    - `src/crates/netipc/src/transport/windows.rs`
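A minimal Go sketch of the request-side cap policy above (reject above 1 MiB, never silently clamp, otherwise echo the client proposal back unchanged); the constant and function names are illustrative, not the literal transport code:

```go
package netipcsketch

import (
	"errors"
	"fmt"
)

// 1 MiB request-payload hard cap from the recorded decision.
const maxRequestPayloadBytes = 1 << 20

var errLimitExceeded = errors.New("LIMIT_EXCEEDED")

// validateRequestProposal applies the decided policy: reject proposals
// above the cap instead of silently clamping them, otherwise echo the
// client value back unchanged.
func validateRequestProposal(proposed uint32) (uint32, error) {
	if proposed > maxRequestPayloadBytes {
		return 0, fmt.Errorf("max_request_payload_bytes=%d: %w", proposed, errLimitExceeded)
	}
	return proposed, nil
}
```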
- Move SHM readiness earlier so successful handshake guarantees the selected profile is already usable.
  - The server must not complete a successful handshake with `selected_profile = SHM` unless SHM for that session is already ready.
  - This requires changing the current "accept first, create SHM later" flow.
  - Evidence:
    - `src/libnetdata/netipc/src/service/netipc_service.c`
    - `src/libnetdata/netipc/src/service/netipc_service_win.c`
    - `src/go/pkg/netipc/service/raw/client.go`
    - `src/go/pkg/netipc/service/raw/client_windows.go`
    - `src/crates/netipc/src/service/raw.rs`
- Remove `max_request_payload_bytes` from the public typed L2 API surfaces.
  - Keep internal learned sizing / overflow recovery machinery.
  - Make typed L2 derive initial request sizing internally instead of exposing it publicly.
  - Evidence:
    - `src/libnetdata/netipc/include/netipc/netipc_service.h`
    - `src/go/pkg/netipc/service/cgroups/types.go`
    - `src/crates/netipc/src/service/cgroups.rs`
- Keep overflow-driven reconnect as a tested internal fallback.
  - Preserve the existing recovery model, but align it to the new 1 MiB ceiling and the new public API contract.
  - Evidence:
    - `src/libnetdata/netipc/src/service/netipc_service.c`
    - `src/libnetdata/netipc/src/service/netipc_service_win.c`
    - `src/go/pkg/netipc/service/raw/client.go`
    - `src/go/pkg/netipc/service/raw/client_windows.go`
    - `src/crates/netipc/src/service/raw.rs`
- Rewrite and extend handshake tests so each negotiated field is validated individually across all implementations.
  - Include all auth failures individually.
  - Include explicit request-payload-cap rejection.
  - Include request/response batch-item symmetry checks.
  - Include overflow reconnect tests under the new contract.
  - Evidence:
    - `tests/fixtures/c/test_uds.c`
    - `tests/fixtures/c/test_named_pipe.c`
    - `tests/fixtures/c/test_service.c`
    - `tests/fixtures/c/test_win_service.c`
    - `src/go/pkg/netipc/transport/posix/uds_test.go`
    - `src/go/pkg/netipc/transport/windows/pipe_integration_test.go`
    - `src/go/pkg/netipc/service/raw/more_unix_test.go`
    - `src/go/pkg/netipc/service/raw/more_windows_test.go`
    - `src/crates/netipc/src/transport/posix_tests.rs`
    - `src/crates/netipc/src/transport/windows.rs`
    - `src/crates/netipc/src/service/raw_unix_tests.rs`
    - `src/crates/netipc/src/service/raw_windows_tests.rs`
- Audit the current implementation surfaces that still encode multi-method service behavior.
- Define the replacement public model in code terms (a rough shape follows this item):
  - one service module per service kind
  - one endpoint per request kind
  - service-specific typed clients/servers/cache helpers
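A rough Go shape for that single-kind model; all type names are hypothetical and only illustrate one-endpoint-per-request-kind:

```go
package netipcsketch

// Hypothetical single-kind service surface: one module per service kind,
// one endpoint per request kind, no generic multi-method dispatch.

type CgroupsSnapshotRequest struct{ /* typed request fields */ }

type CgroupsSnapshotResponse struct{ /* typed response fields */ }

// The server side for this service kind accepts exactly one request kind.
type CgroupsSnapshotHandler func(req *CgroupsSnapshotRequest) (*CgroupsSnapshotResponse, error)

// The typed client exposes exactly one call for this endpoint; cache
// helpers would hang off the same service-specific package.
type CgroupsClient interface {
	Snapshot(req *CgroupsSnapshotRequest) (*CgroupsSnapshotResponse, error)
}
```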
- Redesign SHM resize semantics in implementation terms:
  - explicit `LIMIT_EXCEEDED`
  - disconnect/reconnect recovery
  - separate learned request/response sizes
  - process-lifetime learned sizing
- Rewrite the C, Rust, and Go Level 2 service layers to match the corrected model.
- Rewrite interop/service fixtures and validation scripts to test one service kind per server.
- Rewrite public docs/examples/specs to remove contradictory multi-method wording.
- Run targeted tests first, then the full relevant Unix/Windows matrices required to trust the rewrite.
- Summarize any residual risk or remaining ambiguity before starting Netdata integration work.
- Rerun the current Linux and Windows benchmark matrices on the aligned tree.
- Regenerate benchmark artifacts and update all benchmark-derived docs/README summaries.
- Preserve Level 1 transport interoperability work where still valid.
- Preserve codec/message-family work where it remains useful under a service-oriented split.
- Prefer removal/rename of drifted APIs over keeping compatibility shims, because backward compatibility is not required.
- Keep request-kind and outer-envelope metadata available to single-kind handlers only for:
- validating that the endpoint received the expected request kind
- reading transport batch metadata when a single service kind supports batched payloads
- Do not use that metadata to reintroduce generic multi-method dispatch at the public Level 2 surface.
- If a generic Level 2 helper remains for tests/benchmarks, keep it internal and single-kind:
- one expected request kind per endpoint
- no public multi-method callback surface
- no docs/examples presenting it as a production service model
- C, Rust, and Go unit tests for the rewritten service APIs
- POSIX interop matrix for corrected service identities and SHM resize behavior
- Windows interop matrix for corrected service identities and SHM resize behavior
- Explicit tests for:
- late provider startup
- reconnect after provider restart
- service absence as a tolerated state
- SHM resize on response overflow
- learned-size reuse after reconnect
- request-side and response-side learned sizing behavior
- Keep README, docs specs, and active TODOs aligned with:
- service-oriented discovery
- one request kind per endpoint
- optional asynchronous enrichments
- reconnect-driven recovery
- SHM resize / renegotiation behavior
- Finalize remaining design details above.
- Vendor `plugin-ipc` into Netdata in the chosen native layout.
- Add a Linux `cgroups` typed server inside the Netdata daemon lifecycle.
- Replace the `ebpf.plugin` shared-memory metadata reader with the `plugin-ipc` cgroups cache client.
- Keep existing PID membership logic in `ebpf.plugin` unchanged in phase 1.
- Remove the old custom SHM metadata path as part of the big-bang switch.
- Add tests for:
  - normal metadata refresh
  - stale/restarted Netdata invalidating old clients
  - large snapshots
  - `ebpf.plugin` recovery on server restart
- Phase 1 is Linux-only.
- Phase 1 targets `cgroups.plugin` -> `ebpf.plugin` metadata only.
- Current `collectors-ipc/ebpf-ipc.*` apps/pid SHM remains untouched.
- `NETDATA_INVOCATION_ID` must be available to the `ebpf.plugin` launcher path and any future external clients.
- A deterministic invocation-id hashing helper will be needed in C, Rust, and Go (a derivation sketch follows this list).
- Unit tests for invocation-id to auth-token derivation in C, Rust, and Go.
- Integration test proving only same-run plugins can connect.
- Integration test proving restart rotates auth and old clients fail cleanly.
- Snapshot scale test with high cgroup counts and long names/paths.
- `ebpf.plugin` regression test for existing cgroup discovery semantics.
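A hedged sketch of such a helper in Go; HMAC-SHA256 over the invocation id with a fixed context label is an assumption here, not the recorded design, and the label string is hypothetical:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deriveAuthToken maps NETDATA_INVOCATION_ID to a per-run auth token.
// Any keyed deterministic construction works as long as the C, Rust, and
// Go helpers agree byte-for-byte; HMAC-SHA256 with a fixed context label
// is one such option (the label below is hypothetical).
func deriveAuthToken(invocationID string) string {
	mac := hmac.New(sha256.New, []byte("netipc-auth-v1"))
	mac.Write([]byte(invocationID))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// Same invocation id -> same token in every language; a restarted
	// Netdata rotates the id, so stale clients fail auth cleanly.
	fmt.Println(deriveAuthToken("9f8e7d6c-example-invocation-id"))
}
```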
- Netdata integration design note for the new cgroups metadata transport.
- Developer docs for the new in-tree `netipc` layout and per-language use.
- `ebpf.plugin` and `cgroups.plugin` internal docs describing the new IPC path.
- Rollout/kill-switch documentation if dual-path rollout is selected.
- Verified benchmark-distortion findings before changing code:
  - POSIX `shm-batch-ping-pong` for `c/rust` exceeds the `1.2x` threshold:
    - `c->c = 64,148,960`
    - `c->rust = 58,334,803`
    - `rust->c = 52,277,542`
    - `rust->rust = 48,220,338`
  - The full corrected Linux and Windows matrices also showed broader benchmark-driver artifacts:
    - Go `lookup` benchmark used a synthetic linear scan instead of the actual cache-style hash lookup.
    - Rust `lookup` benchmark used a synthetic linear scan too.
    - Rust cache lookup allocated `name.to_string()` on every lookup.
    - Go and Rust benchmark clients still had hot-loop buffer allocations in batch, pipeline, and ping-pong paths.
- Implemented first remediation pass:
  - `src/crates/netipc/src/service/raw.rs`
    - replaced the flat `(hash, String)` lookup key with nested per-hash maps so Rust cache lookups stop allocating per call
  - `bench/drivers/rust/src/main.rs`
    - removed hot-loop allocations from SHM batch client
    - removed hot-loop allocations from ping-pong client
    - moved pipeline-batch receive buffer allocation out of the outer loop
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/rust/src/bench_windows.rs`
    - removed the same hot-loop allocations on Windows
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/go/main.go`
    - removed hot-loop allocations from batch, pipeline, pipeline-batch, and ping-pong clients
    - replaced lookup linear scan with hash-map lookup
  - `bench/drivers/go/main_windows.go`
    - removed the same hot-loop allocations on Windows
    - replaced lookup linear scan with hash-map lookup
- Validation after the first remediation pass:
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml --lib -- --test-threads=1`
    - `299 passed, 0 failed`
  - `cd bench/drivers/go && go test -run '^$' ./...`
    - compile-only pass
  - `cd src/go && go test -count=1 ./pkg/netipc/service/raw ./pkg/netipc/service/cgroups`
    - both packages passed
- Targeted Linux rerun after the first remediation pass:
  - `lookup`
    - `c = 173,132,146`
    - `rust = 45,886,102`
    - `go = 47,703,281`
    - fact: the fake benchmark scans are gone; the remaining gap is now in the actual lookup data structures
  - `shm-batch-ping-pong`, target `0`
    - `c->c = 62,314,895`
    - `c->rust = 57,112,806`
    - `rust->c = 51,620,887`
    - `rust->rust = 47,356,599`
    - fact: the Rust client and Rust server penalties are both still real
  - `uds-pipeline-d16`, target `0`
    - `c->c = 721,232`
    - `c->rust = 717,024`
    - `c->go = 572,552`
    - `rust->c = 719,458`
    - `rust->rust = 727,197`
    - `rust->go = 576,525`
    - fact: the remaining delta is mostly a Go server issue, not a client issue
  - `uds-pipeline-batch-d16`, target `0`
    - `c->c = 103,250,763`
    - `c->rust = 91,495,522`
    - `c->go = 51,623,524`
    - `rust->c = 102,367,177`
    - `rust->rust = 89,465,821`
    - `rust->go = 52,915,850`
    - fact: the earlier client-side benchmark distortion is gone; the remaining large delta is mainly the Go server path
- Next concrete fixes identified from code + rerun evidence:
  - Go and Rust cache lookup should mirror the C open-addressing hash table:
    - evidence:
      - C uses `hash ^ djb2(name)` with open addressing in `src/libnetdata/netipc/src/service/netipc_service.c`
      - Go still uses a composite `map[{hash,name}]` in `src/go/pkg/netipc/service/raw/cache.go`
      - Rust still uses nested `HashMap<u32, HashMap<String, usize>>` in `src/crates/netipc/src/service/raw.rs`
    - implication:
      - Go and Rust still pay full runtime string hashing on every lookup while C does not
  - Go POSIX UDS transport should mirror the C/Rust vectored send path (a sketch follows this list):
    - evidence:
      - C uses `sendmsg` + two `iovec`s in `src/libnetdata/netipc/src/transport/posix/netipc_uds.c`
      - Rust uses `raw_send_iov()` in `src/crates/netipc/src/transport/posix.rs`
      - Go still copies header + payload into a merged scratch buffer in `src/go/pkg/netipc/transport/posix/uds.go`
    - implication:
      - Go server responses on UDS still pay an extra memcpy per message on the hot path
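A minimal sketch of that vectored-send direction for Go, using `net.Buffers` (which issues `writev` on stream sockets such as `*net.UnixConn`) so header and payload go out without a merge copy; this is illustrative, not the current `uds.go` code:

```go
package netipcsketch

import "net"

// sendVectored writes header and payload as two buffers instead of
// copying both into one merged scratch buffer first. net.Buffers issues
// writev on a *net.UnixConn, mirroring the C sendmsg + two-iovec path.
func sendVectored(conn *net.UnixConn, header, payload []byte) error {
	bufs := net.Buffers{header, payload}
	_, err := bufs.WriteTo(conn) // drains both slices, handling short writes
	return err
}
```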
- Next measurement step:
- apply the lookup-index and Go UDS send fixes
- rerun only the affected slices first:
  - Linux: `lookup`, `shm-batch-ping-pong`, `uds-pipeline-d16`, `uds-pipeline-batch-d16`
  - Windows: `lookup`, `shm-batch-ping-pong`, `np-pipeline-d16`, `np-pipeline-batch-d16`
- only after the slice reruns are understood should the full matrices and docs be refreshed again.
- Second targeted Linux rerun after rebuilding the Rust release benchmark:
  - `lookup`
    - `c = 170,976,986`
    - `rust = 150,660,413`
    - `go = 121,278,244`
    - fact:
      - Rust lookup is now near C after mirroring the C open-addressing structure (a probe sketch follows this list)
      - Go lookup improved materially too, but it is still above the `1.2x` threshold versus C
  - `shm-batch-ping-pong`, target `0`
    - `c->c = 60,929,552`
    - `c->rust = 55,151,867`
    - `rust->c = 49,426,036`
    - `rust->rust = 45,104,001`
    - fact:
      - Rust still has a real server-side penalty on this path
      - Rust still has a larger real client-side penalty on this path
  - `uds-pipeline-d16`, target `0`
    - `c->c = 713,563`
    - `c->rust = 720,602`
    - `rust->c = 722,202`
    - `rust->rust = 712,371`
    - `c->go = 548,145`
    - `rust->go = 563,484`
    - fact:
      - Rust is now aligned with C on the non-batch UDS pipeline path
      - the remaining delta is almost entirely the Go server path
  - `uds-pipeline-batch-d16`, target `0`
    - `c->c = 101,588,680`
    - `c->rust = 83,396,588`
    - `rust->c = 99,570,528`
    - `rust->rust = 86,762,291`
    - `c->go = 52,899,078`
    - `rust->go = 51,902,022`
    - fact:
      - Rust client-side is now close to C on this path
      - Rust server-side still shows a real batch-path penalty
      - Go server-side is still the dominant outlier
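A rough Go sketch of the mirrored C structure: an open-addressing table keyed by a precomputed `hash ^ djb2(name)`, so the hot probe avoids Go's runtime map hashing; the layout and the empty-slot convention here are illustrative, not the committed code:

```go
package netipcsketch

import "bytes"

// djb2 is the classic string hash the C implementation folds into its
// bucket key as hash ^ djb2(name).
func djb2(name []byte) uint32 {
	h := uint32(5381)
	for _, c := range name {
		h = h*33 + uint32(c)
	}
	return h
}

type bucket struct {
	key   uint32 // hash ^ djb2(name); 0 marks an empty slot (illustrative convention)
	index int    // position in the items array
}

// lookup probes linearly from the key's home slot; len(buckets) is a
// power of two, so the mask replaces a modulo. No Go map, and therefore
// no runtime string hashing, sits on the hot path.
func lookup(buckets []bucket, names [][]byte, hash uint32, name []byte) int {
	key := hash ^ djb2(name)
	mask := uint32(len(buckets) - 1)
	for slot := key & mask; ; slot = (slot + 1) & mask {
		b := buckets[slot]
		if b.key == 0 {
			return -1 // hit an empty slot: not present
		}
		if b.key == key && bytes.Equal(names[b.index], name) {
			return b.index
		}
	}
}
```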
- Structural batch-path asymmetry verified from code:
  - C managed server exposes a whole-request callback:
    - `src/libnetdata/netipc/include/netipc/netipc_service.h:187-192`
    - callback receives `request_hdr`, the full `request_payload`, and the whole `response_buf`
  - C benchmark server uses that whole-request callback to batch-specialize increment in one loop:
    - `bench/drivers/c/bench_posix.c:164-216`
    - the callback sees `NIPC_FLAG_BATCH`, loops all items itself, and emits the whole batch response directly
  - Rust managed server exposes only per-item raw dispatch:
    - `src/crates/netipc/src/service/raw.rs:1285-1297`
    - batch handling is then forced through the managed-server loop:
      - `src/crates/netipc/src/service/raw.rs:2002-2047`
      - per item: `batch_item_get()` -> `dispatch_single_internal()` -> `bb.add()`
  - Go managed server exposes the same per-item dispatch shape:
    - `src/go/pkg/netipc/service/raw/types.go:57-59`
    - batch handling is forced through: `src/go/pkg/netipc/service/raw/client.go:903-946`
    - per item: `BatchItemGet()` -> `dispatchSingle()` -> `bb.Add()`
  - fact:
    - the remaining Rust and Go batch server gaps are not just transport issues
    - C can specialize whole-batch increment handling at the callback boundary; Rust and Go cannot (the shape contrast is sketched after this list)
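A compact Go illustration of the two callback shapes; all names are hypothetical and only show why the whole-request form can specialize a batch in one loop while the per-item form cannot:

```go
package netipcsketch

type RequestHeader struct{ Flags uint32 }

type ResponseBuffer struct{ buf []byte }

func (r *ResponseBuffer) Add(item []byte) { r.buf = append(r.buf, item...) }

// C-style whole-request shape: one callback sees the batch flag, loops
// all items itself, and emits the whole batch response directly.
type WholeRequestHandler func(hdr RequestHeader, requestPayload []byte, out *ResponseBuffer) error

// Current Rust/Go shape: the managed server walks the batch and invokes
// a per-item handler, paying dispatch plus append machinery per item.
type PerItemHandler func(item []byte, out *ResponseBuffer) error
```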
- Working theory for the remaining Linux gaps:
  - `shm-batch-ping-pong`
    - Rust still has both client-side and server-side cost versus C
    - the server-side part aligns with the batch callback asymmetry above
  - `uds-pipeline-batch-d16`
    - Rust client-side is now nearly aligned with C
    - the remaining Rust delta is mainly server-side batch handling overhead
    - the much larger Go delta is likely server-side too, with the same structural asymmetry plus extra Go dispatch/runtime overhead
- Decision required before the next implementation step:
- Background:
- The remaining batch-path gap is now tied to the managed-server design.
- Any serious fix must choose whether to optimize only the benchmarks or to change the service/server implementation model.
- Batch server optimization strategy
  - Evidence:
    - C whole-request callback:
      - `src/libnetdata/netipc/include/netipc/netipc_service.h:187-192`
      - `bench/drivers/c/bench_posix.c:164-216`
    - Rust per-item batch loop:
      - `src/crates/netipc/src/service/raw.rs:1285-1297`
      - `src/crates/netipc/src/service/raw.rs:2002-2047`
    - Go per-item batch loop:
      - `src/go/pkg/netipc/service/raw/types.go:57-59`
      - `src/go/pkg/netipc/service/raw/client.go:903-946`
- A. Benchmark-only fast path
- Implement dedicated Rust/Go benchmark servers that bypass the managed server for increment batch.
- Pros:
- fastest way to measure the upper bound
- smallest code change
- Implications:
- benchmark numbers improve, but the library/server path stays asymmetric
- Risks:
- hides a real product/library performance issue
- docs and benchmarks stop representing real library behavior
- B. Internal managed-server specialization
- Keep the external single-kind API shape, but add internal fast paths for known service kinds such as increment batch.
- Pros:
- fixes real library behavior
- avoids large public API churn
- aligned with one-service-kind servers
- Implications:
- managed-server internals become aware of service-kind-specific fast paths
- Risks:
- hidden complexity if done ad hoc
- may still leave the public abstraction less explicit than the implementation
- C. Explicit service-kind-specific server APIs
- Redesign Rust/Go managed servers so each service kind gets its own whole-request server callback surface, matching the accepted single-kind architecture.
- Pros:
- cleanest long-term design
- makes the fast path explicit instead of hidden
- best fit for maintainability and performance
- Implications:
- broader API/implementation/test/doc rewrite in Rust and Go
- Risks:
- largest scope before the next measurement
- Recommendation:
  - first choice: option C (explicit service-kind-specific server APIs)
  - Reason:
    - the evidence shows a real API/implementation asymmetry, not just a hot-loop bug
    - your accepted single-kind-service design already points in this direction
- Priority check raised by Costa:
- Background:
- Current benchmark results are already very high in absolute terms.
- The remaining gaps are real, but fixing them now would require a broader Rust/Go managed-server redesign for batch-heavy paths.
- Facts:
  - Clean Linux rerun:
    - `lookup`: `c = 170,976,986`, `rust = 150,660,413`, `go = 121,278,244`
    - `shm-batch-ping-pong`: `c->c = 60,929,552`, `rust->rust = 45,104,001`
    - `uds-pipeline-batch-d16`: `c->c = 101,588,680`, `rust->rust = 86,762,291`, `go->go = 51,355,370`
  - Fact:
    - these are already very high throughputs in absolute terms
    - the remaining work is now mainly about closing relative efficiency gaps, not about making the library viable
- Working theory:
- Deferring the remaining batch-path optimization is reasonable if there are more fundamental correctness, architecture, or product-fit issues still open.
- The benchmark investigation has already done its job by identifying the structural asymmetry and proving where it lives.
- Updated decision from Costa:
- continue the benchmark investigation for trust in the framework
- investigate all remaining `>1.20x` differences
- treat the Rust/Go batch-path asymmetry as already identified, and focus next on the remaining unexplained gaps
- Remaining unexplained Linux gaps after excluding the known batch-path issue:
  - `lookup`
    - `c = 170,976,986`
    - `rust = 150,660,413`
    - `go = 121,278,244`
  - `uds-pipeline-d16`
    - `c->c = 713,563`
    - `c->go = 548,145`
    - `rust->go = 563,484`
    - fact: the Go server remains the unexplained outlier on the non-batch pipeline path
- New concrete finding: Go lookup still pays a by-value item copy on every successful bucket probe (contrasted in code after this item)
  - Evidence:
    - actual cache lookup: `src/go/pkg/netipc/service/raw/cache.go:122-130`
      - `item := c.items[c.buckets[slot].index]` copies the whole `CacheItem`
    - Go lookup benchmark mirrors the same behavior: `bench/drivers/go/main.go:1133-1136`
      - `bucketItem := cacheItems[lookupIndex[slot].index]` copies the whole struct
    - Rust uses a reference: `src/crates/netipc/src/service/raw.rs:2376-2379`
    - C returns a pointer: `src/libnetdata/netipc/src/service/netipc_service.c:1492-1497`
  - Implication:
    - the current Go lookup gap is still at least partly a real Go implementation issue, not just a benchmark artifact
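The finding in code terms, as a hedged Go contrast with hypothetical types (the real `CacheItem` fields differ):

```go
package netipcsketch

type CacheItem struct {
	Name  string
	Inode uint64
	Meta  [64]byte // big enough that a per-hit copy is measurable
}

// probeByValue mirrors the flagged Go code: every successful probe
// copies the whole CacheItem out of the backing array.
func probeByValue(items []CacheItem, idx int) CacheItem {
	item := items[idx] // full struct copy on every hit
	return item
}

// probeByPointer mirrors the C pointer / Rust reference behavior:
// the hit returns a reference with no per-hit copy.
func probeByPointer(items []CacheItem, idx int) *CacheItem {
	return &items[idx]
}
```

As the next item records, applying this change turned out not to move Go lookup throughput materially.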
- Follow-up measurement on the Go lookup bucket-copy fix:
  - Applied:
    - `src/go/pkg/netipc/service/raw/cache.go`
    - `bench/drivers/go/main.go`
    - changed bucket probes from by-value `CacheItem` copies to pointer/reference access
  - Rerun results:
    - `c = 172,638,775`
    - `rust = 153,518,048`
    - `go = 115,783,444`
  - Fact:
    - the fix had no material positive effect on Go lookup throughput
    - therefore the by-value bucket copy was not the dominant cause of the remaining Go lookup gap
- Go lookup profile after the bucket-copy fix:
  - Evidence:
    - live perf profile of `bench_posix_go lookup-bench`
    - output row: `lookup,go,go,127,972,744`
    - visible hot frames from `/tmp/nipc-go-lookup-perf.data`:
      - `main.runLookupBench` almost all samples
      - `time.runtimeNano` about `8%`
      - `runtime.memequal` about `2%`
  - Fact:
    - no single framework/library helper stands out as the dominant hotspot
    - the operation is so small that benchmark loop overhead and inlining dominate the profile
  - Working theory:
    - the remaining Go lookup gap is not currently a strong signal about the IPC framework itself
    - it is at least partly a benchmark-methodology issue for a tiny in-memory operation
- Go non-batch pipeline server profile:
  - Evidence:
    - live perf profile of `bench_posix_go uds-ping-pong-server` under `uds-pipeline-d16` load from a C client
    - client result during profile: `uds-pipeline-d16,c,c,567,061,...`
    - hot frames from `/tmp/nipc-go-server-perf.data`:
      - `Session.Send` about `39.5%`
      - `Session.Receive` about `33.8%`
      - `raw.pollFd` about `23.1%`
      - increment dispatch does not materially appear
  - Fact:
    - the remaining Go server gap on `uds-pipeline-d16` is not in increment handler logic
    - it is dominated by the Go UDS server transport/poll path
  - Supporting fact:
    - Go as a client on the same scenario is only slightly slower than C/Rust:
      - `go->c = 699,976` vs `c->c = 713,563`
      - `go->rust = 685,614` vs `c->rust = 720,602`
    - implication:
      - the big remaining gap is mainly server-side, and `pollFd` is the strongest server-only suspect
- New concrete finding: Go non-batch server gap is transport/poll dominated, not dispatch dominated
  - Evidence:
    - live perf profile of `bench_posix_go uds-ping-pong-server` under `uds-pipeline-d16` load
    - hot path breakdown from `/tmp/nipc-go-server-perf.data`:
      - `Session.Send` about `39.5%`
      - `Session.Receive` about `33.8%`
      - `raw.pollFd` about `23.1%`
      - increment dispatch does not materially appear in the hot path
  - Working theory:
    - the remaining Go server delta on `uds-pipeline-d16` is in the Go UDS server transport/wrapper path, especially `poll + recvmsg + sendmsg`, not in the increment handler logic
- TL;DR:
- rerun the full official benchmark suites on the current worktree for both Linux and Windows
- regenerate the checked-in benchmark artifacts from those reruns
- compare the refreshed Linux and Windows matrices and flag any materially strange language deltas
- review and follow the existing repo TODO guidance for the real Windows `win11` benchmark workflow
- Analysis:
  - current checked-in benchmark artifacts are from `2026-03-25`:
    - `benchmarks-posix.md`
    - `benchmarks-windows.md`
    - `README.md`
  - the official full-matrix runners are:
    - Linux: `tests/run-posix-bench.sh`, `tests/generate-benchmarks-posix.sh`
    - Windows: `tests/run-windows-bench.sh`, `tests/generate-benchmarks-windows.sh`
  - the verified Windows execution guidance already exists in repo TODOs and README:
    - `README.md:342-365`
    - `TODO-pending-from-rewrite.md:2754-2849`
  - current runner/generator methodology facts for Windows trustworthiness:
    - `tests/run-windows-bench.sh` currently writes exactly one CSV row per benchmark cell:
      - `run_pair()` parses one client result and immediately appends it to `OUTPUT_CSV`
      - there is no built-in repetition, aggregation, or instability gate
    - `tests/generate-benchmarks-windows.sh` validates completeness and floors, but it trusts each CSV row as final truth:
      - it has no notion of repeated samples, medians, spread, or outlier detection
- implication:
- a single noisy Windows measurement can currently become the published benchmark artifact if it still parses and keeps throughput above zero
  - benchmark methodology references gathered before changing the Windows workflow:
    - Google Benchmark user guide:
      - repeated benchmarks exist because a single result may not be representative when benchmarks are noisy
      - when repetitions are used, mean / median / standard deviation are reported
      - source examined: `/tmp/google-benchmark-20260326/docs/user_guide.md`
    - Criterion.rs analysis and user guide:
      - noisy runs should be treated skeptically
      - longer measurement time reduces the influence of outliers
      - outlier classification is a first-class part of reliable benchmark analysis
      - sources examined:
        - `/tmp/criterion-rs-20260326/book/src/user_guide/command_line_output.md`
        - `/tmp/criterion-rs-20260326/book/src/analysis.md`
  - verified workflow facts from the repo guidance:
    - real Windows benchmark proof is expected on `win11`, not via Linux cross-compilation
    - login shell may start as `MSYSTEM=MSYS`; benchmark runs should set:
      - `PATH="/c/Users/costa/.cargo/bin:/c/Program Files/Go/bin:/mingw64/bin:$PATH"`
      - `MSYSTEM=MINGW64`
      - `CC=/mingw64/bin/gcc`
      - `CXX=/mingw64/bin/g++`
    - official Windows benchmark commands are:
      - `bash tests/run-windows-bench.sh benchmarks-windows.csv 5`
      - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
  - the current local worktree is not clean and includes benchmark-related source edits:
    - `bench/drivers/go/main.go`
    - `bench/drivers/go/main_windows.go`
    - `bench/drivers/rust/src/main.rs`
    - `bench/drivers/rust/src/bench_windows.rs`
    - plus service/transport files that can affect benchmark behavior
- implication:
- the refreshed artifacts must reflect this exact current tree
- benchmark interpretation must distinguish:
- real implementation/runtime asymmetry
- normal platform differences
- measurement distortion or stale artifact drift
- Decisions:
- no new user decision required before execution
- using the existing official full-suite runners is the correct path
- using the existing real `win11` workflow is the correct Windows path
- Plan:
- run the full Linux benchmark suite locally on the current tree
- regenerate `benchmarks-posix.md`
- run the full Windows benchmark suite on `win11` using the documented native-toolchain environment
- regenerate `benchmarks-windows.md`
- compare refreshed CSVs and summarize the largest cross-language spreads by scenario
benchmarks-windows.md - compare refreshed CSVs and summarize the largest cross-language spreads by scenario
- classify strange deltas as:
- expected platform/runtime behavior
- suspicious and possibly measurement-related
- suspicious and likely implementation-related
- update benchmark-derived docs if the refreshed artifacts materially change the published snapshot
- for the Windows trustworthiness fix:
- change the Windows runner to collect multiple measured repetitions per benchmark cell instead of trusting a single sample
- aggregate repeated samples into one publication row using a robust statistic instead of one lucky or unlucky run
- preserve a fail-closed path:
- if repeated Windows samples for a cell diverge beyond a configured spread threshold, fail the run instead of publishing that cell
- keep the published CSV shape stable if possible, so the existing generator/report consumers do not need a schema rewrite just to gain trustworthiness
- Implied decisions:
  - benchmark duration remains the documented default `5` seconds unless the runner fails and forces a diagnostic rerun
  - the first full pass should use the official artifact filenames:
    - `benchmarks-posix.csv`
    - `benchmarks-posix.md`
    - `benchmarks-windows.csv`
    - `benchmarks-windows.md`
  - if Windows artifacts are produced remotely, copy them back into this repo without resetting unrelated local files
- Testing requirements:
  - Linux benchmark CSV must contain `201` data rows and pass the generator validation
  - Windows benchmark CSV must contain `201` data rows and pass the generator validation
  - refreshed artifacts must have no duplicate scenario keys and no zero-throughput rows
- Documentation updates required:
- update the checked-in benchmark markdown files to match the refreshed CSVs
- update `README.md` only if the published generated dates, machine snapshot, or headline benchmark ranges are no longer true after the refresh
- Execution results:
  - reviewed Windows benchmark handoff guidance before execution:
    - `README.md:342-365`
    - `TODO-pending-from-rewrite.md:2754-2849`
  - Linux benchmark refresh completed successfully on the current worktree:
    - command:
      - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_posix`
      - `bash tests/run-posix-bench.sh benchmarks-posix.csv 5`
      - `bash tests/generate-benchmarks-posix.sh benchmarks-posix.csv benchmarks-posix.md`
    - result:
      - `201` rows
      - generator passed
      - all configured POSIX floors passed
  - Windows benchmark refresh completed on the `win11` native MSYS/MinGW toolchain path:
    - disposable synced tree: `/tmp/plugin-ipc-bench-20260326`
    - command:
      - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows`
      - `bash tests/run-windows-bench.sh benchmarks-windows.csv 5`
      - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
    - factual result:
      - benchmark runner completed `201` rows
      - generator wrote `benchmarks-windows.md`
      - generator exited non-zero because of one floor violation:
        - `shm-ping-pong rust->c @ max = 850,994`
        - configured floor: `1,000,000`
- new user requirement after the unstable Windows reruns:
- make Windows benchmarks trustworthy instead of relying on single noisy runs
- allowed direction from user:
- increase duration
- run multiple repetitions
- use any stronger methodology needed, as long as the published Windows benchmark artifacts become trustworthy
- fit-for-purpose clarification:
  - Windows benchmark artifacts must be publication-grade on `win11`
  - single-run outliers must not be able to define the checked-in benchmark matrix
- Windows trustworthiness implementation now applied locally:
  - `tests/run-windows-bench.sh`
    - new default: `5` measured samples per Windows benchmark cell
    - each published CSV row is now the median aggregate of those samples
    - the runner now persists per-cell repeated samples in `RUN_DIR` during execution
    - initial implementation used a blunt raw spread gate:
      - fail if `max(sample_throughput) / min(sample_throughput) > 1.35`
  - `tests/generate-benchmarks-windows.sh`
    - markdown output now states that the current Windows report is based on repeated aggregated measurements instead of one single sample
- targeted proof of the new Windows trust method on `win11`:
  - synced the updated Windows runner/generator into the same disposable proof tree: `/tmp/plugin-ipc-bench-20260326`
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-trust.csv 5`
  - factual result:
    - completed successfully with the new 5-sample median path
    - no stability-gate failure
    - the previously suspicious rows are now stable:
      - `shm-ping-pong rust->c @ max = 2,527,551`
      - `shm-ping-pong rust->rust @ 10000 = 9,999`
    - all reported SHM sample ratios observed during that proof stayed well below the `1.35` gate
  - implication:
    - the old single-shot Windows SHM collapses were publication-methodology failures
    - with repeated measurement + median aggregation + spread gating, the same `win11` host now produces a stable SHM matrix
- first stability-gate refinement after proof runs on `win11`:
  - fact:
    - the initial raw `max/min` gate was too blunt for legitimate runs with one obvious transient outlier
  - evidence:
    - repeated sample file from the first full repeated run: `/tmp/netipc-bench-300472/samples-np-ping-pong-c-go-100000.csv`
    - measured throughputs: `17,798`, `19,059`, `15,586`, `6,741`, `18,303`
  - implication:
    - one bad transient sample should not discard the whole row if the remaining samples agree tightly
  - attempted follow-up:
    - a Tukey-style outlier fence was tested next
    - fact:
      - with only `5` samples, that approach was too aggressive and incorrectly marked normal edge values as outliers
    - evidence:
      - repeated sample file: `/tmp/netipc-bench-287769/samples-np-ping-pong-go-c-0.csv`
      - measured throughputs: `17,419`, `18,049`, `18,078`, `18,229`, `18,533`
    - implication:
      - the real spread there is only about `1.06x`, so that row is stable and should be published
- final trust method now applied locally after those proof runs (the gate is sketched in code after this list):
  - `tests/run-windows-bench.sh`
    - keep `5` measured samples per published row
    - publish medians for throughput and latency/CPU columns
    - when there are at least `5` samples, drop exactly one lowest and one highest throughput sample before the stability check
    - require the remaining stable core to contain at least `3` samples
    - require stable-core throughput spread: `stable_max / stable_min <= 1.35`
    - if the raw extremes are noisy but the stable core is good:
      - publish the row
      - print a warning that records both raw and stable spreads
  - `tests/generate-benchmarks-windows.sh`
    - methodology text updated to describe the stable-core rule instead of the original raw-spread wording
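A small Go sketch of that stable-core rule; it restates the shell runner's documented logic for clarity and is not the runner itself:

```go
package netipcsketch

import "sort"

// stableCoreOK mirrors the runner's gate: publish the median of all
// samples, but with >= 5 samples drop exactly one lowest and one highest
// throughput before the stability check, require at least 3 remaining
// samples, and require stable_max/stable_min <= 1.35.
func stableCoreOK(samples []float64) (median float64, ok bool) {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	median = s[len(s)/2] // 5 samples -> true median
	if len(s) >= 5 {
		s = s[1 : len(s)-1] // trim one low and one high outlier
	}
	if len(s) < 3 {
		return median, false // not enough samples for a stable core
	}
	return median, s[len(s)-1]/s[0] <= 1.35
}
```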
- second stability-gate refinement after full-suite evidence on `win11`:
  - fact:
    - the first repeated full-suite rerun still found a real unstable case at `5s` max-throughput duration: `snapshot-shm rust->go @ max`
  - evidence:
    - repeated sample file: `/tmp/netipc-bench-300472/samples-snapshot-shm-rust-go-0.csv`
    - measured throughputs: `1,042,824`, `977,680`, `648,337`, `367,491`, `1,027,273`
    - stable core after dropping one low and one high sample: `648,337`, `977,680`, `1,027,273`
    - stable-core ratio: `1.584474`
  - implication:
    - repeated measurement alone was not enough for all Windows max-throughput rows
    - some max rows needed a longer measurement window, not just more samples
- max-throughput duration refinement now applied locally:
  - `tests/run-windows-bench.sh`
    - fixed-rate rows still use the CLI duration default: `5s`
    - max-throughput rows now use a separate default duration: `NIPC_BENCH_MAX_DURATION=10`
    - the runner logs both durations at startup
- targeted proof on `win11` for the previously failing case:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=4 NIPC_BENCH_LAST_BLOCK=4 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/snapshot-shm-10s.csv 10`
  - factual result:
    - previously failing `snapshot-shm rust->go @ max` became stable:
      - median throughput `1,053,376`
      - stable-core ratio `1.018280`
    - another noisy row also stabilized after trimming one low and one high sample:
      - `snapshot-shm rust->c @ max`
      - raw range: `460,343 .. 1,167,598`
      - stable-core range: `1,109,218 .. 1,133,875`
      - stable-core ratio: `1.022229`
  - implication:
    - the final trustworthy Windows method is now:
      - repeated measurement
      - median publication
      - stable-core gating
      - longer max-throughput samples
- final proof run status after the trust-method changes:
  - full-suite rerun now in progress on `win11` with the final method:
    - fixed-rate rows: `5 samples x 5s`
    - max-throughput rows: `5 samples x 10s`
    - stability rule:
      - publish only if the trimmed stable core stays within `1.35x`
  - live confirmed progress:
    - `np-ping-pong` block completed cleanly under the final method
    - `shm-ping-pong` block started cleanly under the final method
- first full repeated rerun with the `10s` max default found one remaining unstable row late in the suite:
  - scenario: `np-pipeline-batch-d16 rust->rust @ max`
  - preserved sample file: `/tmp/netipc-bench-331471/samples-np-pipeline-batch-d16-rust-rust-0.csv`
  - measured throughputs: `37,400,757`, `31,635,302`, `26,609,207`, `39,324,202`, `24,312,207`
  - trimmed stable core: `26,609,207 .. 37,400,757`
  - stable-core ratio: `1.405557`
  - implication:
    - the runner correctly failed closed
    - the remaining instability was no longer global Windows SHM noise
    - it was narrowed to `np-pipeline-batch @ max` on `win11`
- targeted proof for the remaining pipeline-batch max instability:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=9 NIPC_BENCH_LAST_BLOCK=9 NIPC_BENCH_MAX_DURATION=20 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv 5`
  - factual result:
    - the full `np-pipeline-batch-d16` matrix passed cleanly at `20s`
    - previously failing row became stable: `rust->rust @ max = 34,184,748` with stable-core ratio `1.064913`
    - previously noisy `go->c @ max` also tightened materially: `38,364,026` with stable-core ratio `1.024521`
  - implication:
    - the remaining issue was short-window measurement noise for `np-pipeline-batch @ max`
    - a longer max window fixes it without relaxing the trust gate
- final Windows trust method now applied locally:
  - `tests/run-windows-bench.sh`
    - fixed-rate rows: `5s`
    - most max-throughput rows: `10s`
    - `np-pipeline-batch-d16 @ max`: `20s`
    - runner knobs now include:
      - `NIPC_BENCH_MAX_DURATION`
      - `NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION`
  - `tests/generate-benchmarks-windows.sh`
    - methodology section now documents the `20s` pipeline-batch max window explicitly
- final published Windows artifact assembly:
  - full repeated rerun output from: `/tmp/plugin-ipc-bench-20260326/benchmarks-windows.csv`
    - used for all stable rows outside `np-pipeline-batch-d16`
    - notable publishable warning retained from that full rerun:
      - `np-pipeline-d16 go->c @ max`
      - raw range: `111,201 .. 255,780` with raw ratio `2.300159`
      - trimmed stable core: `234,582 .. 241,982` with stable ratio `1.031545`
      - implication:
        - the outlier-handling path is doing real work on `win11`
        - the published median row is still trustworthy because the stable core stayed tight
  - targeted validated `20s` rerun output from: `/tmp/plugin-ipc-bench-20260326/np-pipeline-batch-20s.csv`
    - used to replace the incomplete/unstable `np-pipeline-batch-d16` block
  - locally assembled final CSV:
    - `202` lines total
    - `201` data rows
    - scenario counts all correct
  - local validation:
    - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
    - result:
      - all configured Windows floors pass
      - report generation passes cleanly
- follow-up approved by Costa after the first trustworthy publish:
  - run one fresh full Windows suite on `win11` with the current default methodology
  - objective:
    - remove the remaining "assembled artifact" caveat if the one-shot full run now passes end to end
  - execution rule:
    - sync the current local benchmark-related sources to the disposable `win11` proof tree first
    - only replace the checked-in Windows CSV/MD if that single fresh rerun passes with all floors green
- current fresh-proof-tree rerun on `win11` uses a new disposable tree based on `origin/main` plus the current local benchmark-related worktree files overlaid onto it:
  - fresh tree: `/tmp/plugin-ipc-bench-20260327-fullrun-150313`
  - factual setup issue discovered before the real rerun:
    - `tests/run-windows-bench.sh` builds the C and Go benchmark binaries itself, but it only consumes an already-built Rust benchmark binary
    - on a fresh disposable tree, the first launch printed:
      - `Rust benchmark binary not found: .../src/crates/netipc/target/release/bench_windows.exe (Rust tests will be skipped)`
    - implication:
      - a fresh tree needs an explicit Rust build before the full Windows benchmark suite, or the run degrades to a 2-language matrix and is not publishable
  - corrective action applied on `win11` before the real rerun:
    - `cargo build --release --manifest-path src/crates/netipc/Cargo.toml --bin bench_windows`
  - real rerun then restarted from the same fresh tree with diagnostics enabled:
    - `NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv 5`
  - live evidence from the ongoing one-shot full rerun:
    - no new diagnostics summary file has appeared so far
    - block `1` (np-ping-pong) is already materially clean end to end:
      - `np-ping-pong c->c @ max = 19,627`, `stable_ratio=1.018133`
      - `np-ping-pong rust->c @ max = 19,880`, `stable_ratio=1.045638`
      - `np-ping-pong go->go @ max = 19,195`, with one low and one high outlier trimmed, `stable_ratio=1.098122`
    - all published `10000/s` rows reached target cleanly:
      - examples:
        - `rust->c = 9,999`, `stable_ratio=1.000000`
        - `rust->rust = 9,999`, `stable_ratio=1.000000`
        - `go->go = 10,000`, `stable_ratio=1.000000`
    - the first published `1000/s` rows are also landing at target:
      - `go->c = 1,000`, `stable_ratio=1.000000`
      - `go->go = 1,000`, `stable_ratio=1.000000`
    - the rerun has already crossed into the historically suspicious SHM block without reproducing the old collapse:
      - `shm-ping-pong c->c @ max = 2,565,990`, `stable_ratio=1.042022`
      - `shm-ping-pong rust->c @ max = 2,443,021`, `stable_ratio=1.089130`
      - `shm-ping-pong c->rust @ max = 2,611,306`, `stable_ratio=1.071212`
      - `shm-ping-pong rust->rust @ max = 2,617,581`, `stable_ratio=1.027963`
      - `shm-ping-pong go->rust @ max = 2,327,904`, `stable_ratio=1.012447`
    - factual interim conclusion:
      - the current one-shot full rerun is already materially stronger evidence than the older failing full runs
      - the earlier full-suite `shm-ping-pong rust->c` collapse is not reproducing on the same `win11` host after the current lifecycle and Windows SHM fixes
  - live continuation coordinates for the long one-shot rerun:
    - `win11` source tree: `/tmp/plugin-ipc-bench-20260327-fullrun-150313`
    - live output files:
      - CSV: `/tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.csv`
      - log: `/tmp/plugin-ipc-bench-20260327-fullrun-150313/full-after-fix.log`
    - last verified progress in this session:
      - `75` lines in the CSV (`74` data rows)
      - blocks `1` and `2` completed cleanly
      - block `3` (snapshot-baseline) had started and was publishing stable `@ max` rows:
        - `c->c = 19,872`, `stable_ratio=1.029521`
        - `rust->c = 19,291`, `stable_ratio=1.043116`
      - no new diagnostics summary file had appeared yet
    - later checkpoint from the same still-running one-shot rerun:
      - `121` lines in the CSV (`120` data rows)
      - blocks `1` through `4` had already cleared cleanly and the run had advanced deep into block `5` (np-batch-ping-pong)
      - live batch evidence:
        - `np-batch-ping-pong c->go @ max = 7,699,399`, `stable_ratio=1.045676`
        - `np-batch-ping-pong rust->go @ max = 7,532,805`, `stable_ratio=1.018880`
        - `np-batch-ping-pong go->go @ max = 7,152,856`, `stable_ratio=1.030591`
        - `np-batch-ping-pong c->c @ 100000/s = 7,693,465`, `stable_ratio=1.011300`
        - `np-batch-ping-pong rust->c @ 100000/s = 7,497,010`, `stable_ratio=1.015083`
      - no new diagnostics summary file had appeared yet at this checkpoint either
  - completed outcome of the clean one-shot Windows rerun:
    - the long `win11` one-shot rerun finished cleanly
    - final CSV size:
      - `202` logical lines
      - `201` data rows
    - no new diagnostics summary file was produced during this rerun
    - `tests/generate-benchmarks-windows.sh` passed on `win11` against the final CSV: `All performance floors met`
    - the final generated report was copied back into the repo as:
      - `benchmarks-windows.csv`
      - `benchmarks-windows.md`
    - the same generator also passed locally after copying the artifacts back:
      - `bash tests/generate-benchmarks-windows.sh benchmarks-windows.csv benchmarks-windows.md`
      - result: `All performance floors met`
  - user-approved follow-up after the successful one-shot rerun:
    - commit the Windows artifact refresh and the TODO update as a separate git commit
    - do not include unrelated dirty files from the broader worktree
  - user-approved follow-up after the local commit:
    - push commit `768cca3` to `origin/main`
    - do not include any of the remaining unrelated dirty files
  - implication:
    - the remaining "assembled artifact" caveat is now removed
    - the checked-in Windows artifacts now come from a single clean one-shot full rerun on `win11`
  - stable final Windows max-throughput spreads from that clean one-shot artifact:
    - `shm-ping-pong`:
      - best: `rust->rust = 2,617,581`
      - worst: `go->go = 2,113,834`
      - spread: `1.238x`
      - conclusion:
        - no strange SHM collapse remains in the final clean artifact
    - `lookup`:
      - best: `rust = 176,259,707`
      - worst: `go = 98,385,649`
      - spread: `1.792x`
    - `np-pipeline-d16`:
      - best: `go->rust = 240,205`
      - worst: `c->go = 216,940`
      - spread: `1.107x`
    - `np-pipeline-batch-d16`:
      - best: `go->c = 39,065,948`
      - worst: `c->go = 27,896,181`
      - spread: `1.400x`
- first one-shot full rerun attempt with the current defaults did not produce a clean replacement artifact:
  - partial output path: `/tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot.csv`
  - factual failure observed during block `1`:
    - `np-ping-pong rust->rust @ 1000/s`
    - Rust client exited non-zero
    - streamed client output reported: `client: 4207 errors`
    - partial line: `np-ping-pong,rust,rust,159,75.500,177.400,177.400,5.6,0.0,5.6`
  - implication:
    - the one-shot rerun cannot replace the current published Windows artifact
    - before attempting another full rerun, the new failure should be isolated on block `1` to determine whether it is reproducible or a one-off transport/runtime glitch
- isolated recheck of block `1` completed cleanly on the same `win11` proof tree:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/block1-recheck.csv 5`
  - output path: `/tmp/plugin-ipc-bench-20260326/block1-recheck.csv`
  - factual result:
    - all `36` block-1 measurements completed with exit code `0`
    - the previously failing row completed cleanly:
      - `np-ping-pong rust->rust @ 1000/s = 1000`
      - `p50=66.200us`, `p95=248.200us`, `p99=369.500us`
      - `stable_ratio=1.000000`
  - implication:
    - the first one-shot block-1 failure is not immediately reproducible
    - this currently looks like a transient host/runtime glitch, not established deterministic instability in the `rust->rust @ 1000/s` pair
    - the next valid check is another clean one-shot full Windows rerun with the same default methodology
- second one-shot full rerun with the current defaults also failed to produce a clean replacement artifact:
  - partial output path: `/tmp/plugin-ipc-bench-20260326/benchmarks-windows-oneshot-2.csv`
  - factual failure observed during block `2`:
    - `shm-ping-pong rust->c @ max`
    - repeated-sample file: `/tmp/netipc-bench-410987/samples-shm-ping-pong-rust-c-0.csv`
    - repeated throughputs: `618,076`, `618,160`, `1,951,036`, `2,303,714`, `2,476,081`
    - stable-core gate result: `stable_min=618,160`, `stable_max=2,303,714`, `stable_ratio=3.726728`
    - configured max: `1.35`
  - implication:
    - the current default methodology still does not guarantee a clean one-shot full Windows run on `win11`
    - the blocker has moved from a random-looking block-1 client failure to a concrete SHM max-throughput instability event
- focused reproduction of the same SHM pair in isolation did not reproduce the collapse:
  - direct pair under the same synced `win11` tree:
    - C server: `bench_windows_c.exe shm-ping-pong-server`
    - Rust client: `bench_windows.exe shm-ping-pong-client`
  - isolated `rust -> c @ max` repeated `10` times with `10s` samples:
    - throughput range: `2,446,407 .. 2,578,450`
    - all `10` runs stayed in the fast band
  - isolated `rust -> c @ max` repeated `10` times with `20s` samples:
    - throughput range: `2,363,335 .. 2,589,588`
    - all `10` runs stayed in the fast band
  - implication:
    - the SHM collapse is not a simple deterministic `rust client -> c server` bug
    - longer isolated samples are stable, but that alone does not explain the one-shot full-run failure
- sequence test also failed to reproduce the SHM collapse:
  - setup:
    - one `c -> c @ max` SHM prime run
    - followed immediately by `5` direct `rust -> c @ max` SHM runs
    - repeated for `5` cycles on the same `RUN_DIR`
  - factual result:
    - all `25` post-prime `rust -> c` runs stayed in the fast band: `2,357,337 .. 2,664,284`
  - implication:
    - the failure is not explained by a simple "previous `c -> c` SHM row poisons the next `rust -> c` row" theory
    - current best description:
      - rare transient host/runtime glitch during full-matrix execution on `win11`
      - not immediately reproducible in dedicated pair or simple sequence tests
- pending user decision before more Windows runner code changes:
- context:
- Costa asked for trustworthy Windows benchmarks
- current state is better than before, but a clean one-shot full run is still not guaranteed
- user constraint raised during decision review:
- automatic retries must not hide real failures or real bugs
- if retries are ever used, first-attempt failures must remain visible and reportable
- user decision:
- keep the main Windows benchmark publication path fail-closed
- do not add silent self-healing retries to publish mode
- add a separate diagnostic mode that can rerun failed rows in isolation
- diagnostic mode must preserve and report the original first-attempt failure evidence side by side with any diagnostic rerun evidence
- option A:
- add automatic per-row retry on Windows when a row fails because of client error or stability-gate failure
- keep the current `5`-sample median + `1.35` stable-core gate inside each attempt
- implications:
- one transient bad row no longer destroys a 2-hour full run
- a row is still published only if a full fresh attempt passes the same gate
- risks:
  - published rows may come from retry attempt `2` or `3`, not from the first pass
  - the report and logs must say that retries happened, or the methodology becomes misleading
- option B:
  - keep fail-closed behavior, but increase Windows SHM max collection further:
    - for example `20s` per sample and/or `7-9` repeats
- implications:
- simpler story than retries
- every accepted row is still strictly one attempt
- risks:
- much longer full-suite runtime
- evidence so far does not prove that longer duration alone fixes the rare full-run glitch
- option C:
- keep the current runner and accept targeted reruns / assembled Windows artifacts when one-shot full runs glitch
- implications:
- fastest operationally
- still produces trustworthy rows when each replacement row is validated carefully
- risks:
- no clean single-command reproduction
- more manual work and more caveats around publication
- accepted direction:
- strict publish mode plus separate diagnostic reruns
- rationale:
- failures stay visible
- diagnostic reruns can still accelerate root-cause work without turning the publication path into silent self-healing
- implemented Windows diagnostic mode for failed rows:
  - file: `tests/run-windows-bench.sh`
  - new behavior:
    - publish mode remains fail-closed by default
    - opt-in diagnostics via: `NIPC_BENCH_DIAGNOSE_FAILURES=1`
    - when a row fails in publish mode:
      - the original failure remains authoritative
      - the original `RUN_DIR` and first-attempt sample file remain preserved
      - the same row is rerun in an isolated diagnostic subdirectory under the preserved `RUN_DIR`
      - diagnostic rerun output is recorded in: `${RUN_DIR}/diagnostics-summary.txt`
      - diagnostic reruns never write rows into the publish CSV
- implementation details:
- row-level measurement state is now tracked explicitly:
- failure reason
- sample-file path
- aggregate throughput/latency/CPU values
- stability metrics
- diagnostic reruns restore the original first-failure state after logging the isolated rerun evidence
- forced validation of the new diagnostic mode on `win11`:
  - purpose:
    - prove that publish mode still fails closed
    - prove that diagnostic reruns preserve the original evidence and create side-by-side isolated rerun evidence
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=7 NIPC_BENCH_LAST_BLOCK=7 NIPC_BENCH_DIAGNOSE_FAILURES=1 NIPC_BENCH_REPETITIONS=3 NIPC_BENCH_MAX_DURATION=1 NIPC_BENCH_PIPELINE_BATCH_MAX_DURATION=1 NIPC_BENCH_MAX_THROUGHPUT_RATIO=0.9 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv 1`
  - factual result:
    - runner exited non-zero as expected
    - publish CSV remained header-only: `/tmp/plugin-ipc-bench-20260326/diag-lookup-2.csv`
    - preserved original run dir: `/tmp/netipc-bench-425494`
    - diagnostic summary created: `/tmp/netipc-bench-425494/diagnostics-summary.txt`
    - distinct diagnostic rerun dirs created per failed row:
      - `/tmp/netipc-bench-425494/diagnostics/001-lookup-c-c-0`
      - `/tmp/netipc-bench-425494/diagnostics/002-lookup-rust-rust-0`
      - `/tmp/netipc-bench-425494/diagnostics/003-lookup-go-go-0`
- implication:
- the new mode preserves truth in publish mode
- it also gives immediate isolated rerun evidence for investigation without silently healing the benchmark artifact
- next-step approval from Costa:
- commit and push the strict publish + diagnostic-mode runner changes
- then proceed immediately to the real Windows SHM investigation using the new diagnostic mode on the actual failing slice
- commit / push completed for the diagnostic-mode runner change:
  - commit: `870fc93`
  - subject: `bench: add Windows diagnostic reruns`
  - pushed to: `origin/main`
- real Windows SHM investigation with the new diagnostic mode:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5`
  - factual result:
    - block `2` completed successfully with exit code `0`
    - no diagnostic rerun triggered for any SHM row
    - the previously suspicious row completed cleanly:
      - `shm-ping-pong rust->c @ max = 2,465,857`, `stable_ratio=1.021516`
    - the full SHM max matrix stayed stable:
      - `c->c = 2,461,053`, `rust->c = 2,465,857`, `go->c = 2,162,135`
      - `c->rust = 2,597,936`, `rust->rust = 2,530,435`, `go->rust = 2,065,765`
      - `c->go = 2,570,619`, `rust->go = 2,254,772`, `go->go = 2,079,323`
    - all `100000/s`, `10000/s`, and `1000/s` SHM rows also completed stably in the same block run
  - implication:
    - the Windows SHM instability still does not reproduce when block `2` runs in isolation under the real runner
    - current strongest working theory:
      - the failure depends on broader full-suite context on `win11`
      - not on the standalone SHM block itself
- targeted confirmation of the Windows SHM anomaly:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-confirm.csv 5`
  - confirmed max-throughput rerun on the same `win11` tree:
    - `c->c = 2,396,963`, `rust->c = 1,708,649`, `go->c = 886,451`
    - `c->rust = 2,566,391`, `rust->rust = 2,563,582`, `go->rust = 2,053,507`
    - `c->go = 2,539,899`, `rust->go = 2,215,733`, `go->go = 2,047,115`
  - factual conclusion:
    - the original `rust->c` full-suite collapse is not stable
    - max-throughput Windows SHM rows can swing materially between reruns on `win11`
    - target-rate Windows SHM rows remain stable near their requested rates
  - implication:
    - the strange Windows SHM max delta is currently a measurement-stability / host-noise issue, not a proven deterministic language regression
- Refreshed max-throughput spread summary:
  - Linux:
    - `lookup`:
      - fastest `c->c = 167,974,040`
      - slowest `go->go = 127,908,975`
      - spread: `1.31x`
      - improvement versus checked-in previous artifact: `1.77x -> 1.31x`
    - `shm-ping-pong`:
      - fastest `rust->rust = 3,486,454`
      - slowest `go->go = 1,725,340`
      - spread: `2.02x`
      - note:
        - this widened versus the previous checked-in artifact because `go->go` max throughput dropped materially
    - `shm-batch-ping-pong`:
      - fastest `c->c = 61,778,266`
      - slowest `go->go = 31,810,209`
      - spread: `1.94x`
    - `uds-pipeline-d16`:
      - fastest `rust->c = 712,544`
      - slowest `rust->go = 550,630`
      - spread: `1.29x`
    - `uds-pipeline-batch-d16`:
      - fastest `c->c = 99,746,787`
      - slowest `go->go = 50,690,629`
      - spread: `1.97x`
  - Windows:
    - `lookup`:
      - fastest `rust->rust = 178,835,588`
      - slowest `go->go = 97,109,788`
      - spread: `1.84x`
    - `shm-ping-pong` full suite:
      - fastest `c->rust = 2,650,754`
      - slowest `rust->c = 850,994`
      - spread: `3.11x`
      - but targeted confirmation disproved `rust->c` as a stable deterministic outlier
    - `shm-batch-ping-pong`:
      - fastest `c->c = 52,520,469`
      - slowest `go->go = 34,390,650`
      - spread: `1.53x`
    - `np-pipeline-batch-d16`:
      - fastest `go->rust = 38,249,582`
      - slowest `go->go = 24,333,588`
      - spread: `1.57x`
- Strange delta findings that remain real after the refresh:
  - Linux `uds-pipeline-d16`:
    - Go server remains the clear slow case across clients:
      - `c->go = 559,691`
      - `rust->go = 550,630`
      - `go->go = 553,858`
      - versus C/Rust servers near `686k-713k`
    - implication:
      - this is a stable Go-server transport/runtime cost, not client-specific noise
  - Linux `uds-pipeline-batch-d16`:
    - server choice dominates:
      - C server: `96.2M-99.7M`
      - Rust server: `84.1M-86.3M`
      - Go server: `50.7M-51.3M`
    - implication:
      - the known batch-path server asymmetry is still real
  - Linux `shm-batch-ping-pong`:
    - C server stays strongest
    - Rust server is mid-band
    - Go server is slowest
    - implication:
      - still consistent with real server-side implementation overhead, not runner corruption
  - Linux / Windows `lookup`:
    - Linux: `c = 167.97M`, `rust = 146.15M`, `go = 127.91M`
    - Windows: `rust = 178.84M`, `c = 125.60M`, `go = 97.11M`
    - implication:
      - lookup is now measuring runtime/data-structure efficiency more than IPC transport behavior
      - the previous fake linear-scan distortion is gone, but cross-language runtime overhead remains visible
- Strange delta finding that is currently suspicious but not yet proven real:
  - Windows `shm-ping-pong @ max`:
    - the full-suite run made `rust->c` miss the floor
    - the immediate confirmation run moved the collapse to `go->c` instead
    - conclusion:
      - this is currently a max-throughput measurement-stability issue on `win11`
      - do not interpret a single bad max row there as a stable language-specific regression without targeted rerun confirmation
    - second isolated Windows SHM rerun on the same `win11` tree reinforced the same conclusion:
      - command:
        - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-rerun.csv 5`
      - `@max` rows:
        - `c->c = 2,516,450`, `rust->c = 2,430,413`, `go->c = 2,179,591`
        - `c->rust = 2,497,180`, `rust->rust = 2,473,159`, `go->rust = 2,114,944`
        - `c->go = 2,571,394`, `rust->go = 2,282,433`, `go->go = 2,100,658`
      - implication:
        - the full-suite `rust->c` collapse to `850,994` is definitely not stable
      - additional warning sign from the same isolated rerun:
        - some `target_rps=10000` rows also became unstable:
          - `c->rust = 5,073`, `rust->rust = 4,098`
          - while other rows in the same block stayed near `10,000`
        - implication:
          - the Windows SHM benchmark instability is not limited to one language pair or only to the first full-suite run
- Post-commit diagnostic runner work (`870fc93 bench: add Windows diagnostic reruns`):
  - committed and pushed:
    - commit: `870fc93`
    - pushed to `origin/main`
  - immediate next investigation on `win11`:
    - goal:
      - identify the smallest Windows benchmark context that reproduces the earlier full-suite SHM collapse
    - standalone SHM block with diagnostics enabled:
      - command:
        - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/shm-diagnose.csv 5`
      - result:
        - exited `0`
        - no diagnostics triggered
      - key `shm-ping-pong @ max` rows:
        - `c->c = 2,461,053` with `stable_ratio=1.018190`
        - `rust->c = 2,465,857` with `stable_ratio=1.021516`
        - `go->c = 2,162,135` with `stable_ratio=1.017540`
        - `c->rust = 2,597,936` with `stable_ratio=1.016334`
        - `rust->rust = 2,530,435` with `stable_ratio=1.020250`
        - `go->rust = 2,065,765` with `stable_ratio=1.029206`
        - `c->go = 2,571,619` with `stable_ratio=1.013998`
        - `rust->go = 2,254,772` with `stable_ratio=1.022145`
        - `go->go = 2,079,323` with `stable_ratio=1.010925`
      - factual conclusion:
        - block `2` alone is stable under the real repeated-median runner
        - the earlier full-suite `rust->c` collapse is not a standalone SHM bug
    - combined `NP -> SHM` prefix with diagnostics enabled:
      - command:
        - `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-diagnose.csv 5`
      - result:
        - exited `0`
        - no diagnostics triggered
        - total measurements: `72`
      - key `np-ping-pong @ max` rows:
        - `c->c = 19,411`, `rust->c = 19,735`, `go->c = 18,744`
        - `c->rust = 20,188`, `rust->rust = 20,301`, `go->rust = 19,277`
        - `c->go = 19,383`, `rust->go = 18,558`, `go->go = 19,241`
      - key `shm-ping-pong @ max` rows:
        - `c->c = 2,522,584`, `rust->c = 2,522,004`, `go->c = 2,071,095`
        - `c->rust = 2,580,971`, `rust->rust = 2,511,775`, `go->rust = 2,308,182`
        - `c->go = 2,657,019`, `rust->go = 2,273,563`, `go->go = 2,109,132`
      - factual conclusion:
        - the failure does not reproduce with blocks `1-2`
        - the earlier bad `rust->c` full-suite row requires broader full-suite context than just the `NP -> SHM` transition
    - updated working theory:
      - speculation:
        - a later block, or cumulative state from multiple later blocks, is needed to trigger the rare full-suite Windows instability
      - no longer supported by evidence:
        - a standalone SHM bug
        - a simple `NP -> SHM` transition bug
    - next diagnostic step:
      - extend the prefix to block `3` and repeat:
        - `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-batch-diagnose.csv 5`
- Current deep-dive findings after extending the prefix to block `3`:
  - factual setup:
    - command executed on `win11`:
      - `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260326/np-shm-shmbatch-diagnose.csv 5`
    - result:
      - exited non-zero
      - the publish CSV was partial, not empty
      - remote CSV line count: `89` total lines, `88` data rows
      - expected for blocks `1-3`: `90` data rows
      - missing rows:
        - `shm-ping-pong,rust,go,0`
        - `shm-ping-pong,c,c,10000`
  - factual evidence that the run continued after failures:
    - the partial CSV still contains later rows after the first failed `shm-ping-pong rust->go @ max`
    - it also contains all `snapshot-baseline` rows for block `3`
    - implication:
      - the runner is correctly fail-recording rows while continuing the remaining matrix
  - factual evidence that the first failure was an intermittent runtime failure, not a stable throughput regression:
    - the preserved first-attempt sample file for `shm-ping-pong rust->go @ max` has only `4` completed repeats in `/tmp/netipc-bench-410987/samples-shm-ping-pong-rust-go-0.csv`
    - those four repeats were all healthy: `2,183,340`, `2,290,965`, `2,295,026`, `2,240,149`
    - implication:
      - the row failed because one repeat died mid-row, not because all repeats drifted slow
  - factual evidence of a runner/server lifecycle bug (a minimal lifecycle sketch follows this list):
    - the runner hard-kills every benchmark server after each sample, in `tests/run-windows-bench.sh:247` to `tests/run-windows-bench.sh:263`
    - the Windows benchmark servers are implemented to stop themselves after `duration+3` seconds and then run normal teardown / CPU reporting
    - implication:
      - the runner is violating the server lifecycle contract on Windows
      - a hard kill can bypass `nipc_server_destroy()` / `server.Stop()` cleanup
      - this directly explains:
        - transient client timeouts
        - `"in use by live server"` collisions on the next repeat
        - immediate success when the same row is rerun in isolation
  - factual evidence that Windows SHM naming is sensitive to leaked sessions (see the naming sketch after this list):
    - Windows server session IDs restart from `1` for every server process, in `src/libnetdata/netipc/src/service/netipc_service_win.c:933`
    - new sessions increment from that counter, in `src/libnetdata/netipc/src/service/netipc_service_win.c:1008`
    - Windows SHM object names include `run_dir + service_name + auth_token + session_id`, in `src/libnetdata/netipc/include/netipc/netipc_win_shm.h:8` to `src/libnetdata/netipc/include/netipc/netipc_win_shm.h:11`
    - stale cleanup is intentionally a no-op on Windows, in `src/libnetdata/netipc/include/netipc/netipc_win_shm.h:215` to `src/libnetdata/netipc/include/netipc/netipc_win_shm.h:220` and `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:781`
    - implication:
      - if a previous sample's server/session stays alive briefly, the next sample for the same service can collide on named-pipe and/or SHM object names
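    To see why restarting counters bite, a toy composition of the object name. The exact format lives in `netipc_win_shm.h:8-11`; this format string is a stand-in.

    ```c
    #include <stdio.h>

    /* Hypothetical name composition mirroring run_dir + service + token + id. */
    static void shm_object_name(char *buf, size_t len, const char *run_dir,
                                const char *service, const char *token,
                                int session_id) {
        snprintf(buf, len, "Local\\netipc-%s-%s-%s-%d",
                 run_dir, service, token, session_id);
    }

    int main(void) {
        char a[128], b[128];
        /* Every server process restarts its session counter at 1, so a leaked
         * server and its freshly started successor produce identical names. */
        shm_object_name(a, sizeof a, "run", "bench", "tok", 1); /* old process */
        shm_object_name(b, sizeof b, "run", "bench", "tok", 1); /* new process */
        printf("%s\n%s\n", a, b); /* same name twice -> open/create collision */
        return 0;
    }
    ```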
  - factual evidence of a separate diagnostic bookkeeping bug:
    - the root run dir found on disk for the latest run was `/tmp/netipc-bench-410987`
    - it contains files only up to the SHM `@ max` rows
    - later successful rows from the same run are present in the output CSV, but their sample files are not present under that root
    - the runner warning printed a different root path (`/tmp/netipc-bench-456611`) that does not exist on disk
    - implication:
      - diagnostic mode currently preserves the truth of the first failure in the terminal output
      - but it does not yet preserve the filesystem evidence reliably enough
  - factual evidence of a Windows SHM transport hardening gap (see the create-path sketch after this list):
    - the C Windows SHM create path does not check `GetLastError() == ERROR_ALREADY_EXISTS` after `CreateFileMappingW` / `CreateEventW`, in `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:198` to `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c:266`
    - the Rust and Go Windows SHM server-create paths also do not appear to check for existing named objects
    - implication:
      - a leaked Windows SHM object may be treated as a successful create instead of an explicit collision
      - this can turn cleanup problems into nondeterministic runtime behavior
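    The hardening direction follows standard Win32 semantics: a named `CreateFileMappingW` / `CreateEventW` call succeeds on an existing object and reports that fact only through `GetLastError() == ERROR_ALREADY_EXISTS`. A minimal exclusive-create sketch; the error constant mirrors the surface adopted later, but its numeric value here is illustrative:

    ```c
    #include <windows.h>

    #define NIPC_WIN_SHM_ERR_ADDR_IN_USE (-100) /* illustrative value only */

    static int create_shm_exclusive(const wchar_t *name, DWORD bytes, HANDLE *out) {
        HANDLE h = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                      PAGE_READWRITE, 0, bytes, name);
        if (!h)
            return -1; /* hard create failure */

        if (GetLastError() == ERROR_ALREADY_EXISTS) {
            /* A leaked mapping with this name already exists: surface an
             * explicit collision instead of silently reusing the object. */
            CloseHandle(h);
            return NIPC_WIN_SHM_ERR_ADDR_IN_USE;
        }

        *out = h;
        return 0; /* fresh object, exclusive create succeeded */
    }
    ```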
- Decision needed before code:
  - Option A: fix the Windows benchmark runner only.
    - scope:
      - replace hard-kill shutdown with graceful server stop/wait and a hard-kill fallback only on timeout
      - make per-repeat server/client output files unique
      - fix diagnostic bookkeeping so preserved run dirs and summaries always match the actual run
    - benefits:
      - directly targets the strongest evidence
      - smallest code-change surface
      - most likely enough to make the benchmark harness trustworthy
    - implications:
      - benchmark methodology changes only, not transport semantics
      - if Windows SHM object-collision handling is also weak, the benchmark harness may become stable while the product bug remains latent
    - risks:
      - could leave a real Windows transport bug hidden until another scenario hits it outside the benchmark harness
  - Option B: fix the Windows benchmark runner and harden Windows SHM object creation in C, Rust, and Go.
    - scope:
      - everything in Option A
      - plus explicit `ERROR_ALREADY_EXISTS` handling for Windows SHM mappings/events and clearer collision errors
    - benefits:
      - addresses both the likely benchmark root cause and a real transport safety gap
      - makes leaked-object collisions explicit instead of nondeterministic
    - implications:
      - larger change across multiple language implementations
      - requires more testing
    - risks:
      - broader patch, more review surface, more chance of side effects if the three implementations are not kept perfectly aligned
  - Option C: continue diagnosis without code changes.
    - scope:
      - more targeted reruns and more artifact collection
    - benefits:
      - lowest code risk
    - implications:
      - more benchmark time burned with a runner we already know is violating the server lifecycle on Windows
    - risks:
      - low leverage
      - likely delays the obvious fix
  - recommendation: Option B
    - reasoning:
      - the hard-kill runner behavior is the strongest causal explanation for the benchmark instability
      - but the Windows SHM create path also has a real hardening gap
      - if the goal is "Windows benchmarks trustworthy", fixing only the runner is probably enough for the harness, but not enough for the underlying transport robustness
  - user decision: Option B
    - accepted scope:
      - fix the Windows benchmark runner lifecycle and diagnostics bookkeeping
      - harden Windows SHM object creation in C, Rust, and Go to detect existing named objects explicitly
- Implementation and verification after Option B:
  - local code changes completed:
    - runner:
      - `tests/run-windows-bench.sh`
    - Windows SHM hardening:
      - `src/libnetdata/netipc/include/netipc/netipc_win_shm.h`
      - `src/libnetdata/netipc/src/transport/windows/netipc_win_shm.c`
      - `src/crates/netipc/src/transport/win_shm.rs`
      - `src/go/pkg/netipc/transport/windows/shm.go`
    - regression coverage:
      - `tests/fixtures/c/test_win_shm.c`
      - `src/crates/netipc/src/transport/win_shm.rs`
      - `src/go/pkg/netipc/transport/windows/shm_test.go`
  - factual runner behavior after the patch:
    - the Windows runner now:
      - uses a unique per-repeat runtime/artifact directory instead of reusing the same `RUN_DIR` for every repeat
      - waits for benchmark servers to stop themselves before killing them
      - preserves the root run dir on any measurement-command failure, not only on stability-gate failures
      - records the first-attempt artifact directory in the diagnostics summary
  - factual transport behavior after the patch:
    - the C, Rust, and Go Windows SHM server-create paths now reject existing named mappings/events explicitly instead of treating them as successful creates
    - new error surface:
      - C: `NIPC_WIN_SHM_ERR_ADDR_IN_USE`
      - Rust: `WinShmError::AddrInUse`
      - Go: `ErrWinShmAddrInUse`
  - first verification on `win11`:
    - focused Windows SHM duplicate-create coverage now passes in all three implementations (a minimal probe sketch follows this list):
      - Go: `cd src/go && GOOS=windows GOARCH=amd64 go test -run TestWinShmServerCreateRejectsExistingObjects -count=1 ./pkg/netipc/transport/windows`
      - Rust: `cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_create_rejects_existing_objects_windows -- --test-threads=1`
      - C: `cmake --build build -j4 --target test_win_shm` followed by `ctest --test-dir build --output-on-failure -R '^test_win_shm$'`
    - result:
      - all passed
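    For reference, a duplicate-create probe with the same shape as that coverage, written against raw Win32 calls rather than the real test helpers; names here are illustrative, not the actual test file contents.

    ```c
    #include <assert.h>
    #include <windows.h>

    /* Exclusive-create probe: success plus ERROR_ALREADY_EXISTS means a
     * duplicate, which the hardened create paths now report explicitly
     * (NIPC_WIN_SHM_ERR_ADDR_IN_USE / AddrInUse / ErrWinShmAddrInUse). */
    static HANDLE create_mapping(const wchar_t *name, BOOL *existed) {
        HANDLE h = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                      PAGE_READWRITE, 0, 4096, name);
        *existed = (h != NULL && GetLastError() == ERROR_ALREADY_EXISTS);
        return h;
    }

    int main(void) {
        BOOL existed = FALSE;
        HANDLE first = create_mapping(L"Local\\netipc-dup-check", &existed);
        assert(first != NULL && !existed);  /* fresh create must not pre-exist */

        HANDLE second = create_mapping(L"Local\\netipc-dup-check", &existed);
        assert(second != NULL && existed);  /* duplicate must be flagged */

        CloseHandle(second);
        CloseHandle(first);
        return 0;
    }
    ```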
- factual new issue exposed by the stricter runner:
  - extending the real benchmark rerun to `NIPC_BENCH_FIRST_BLOCK=1 NIPC_BENCH_LAST_BLOCK=3` no longer reproduced the old random SHM collapse first
  - instead, it exposed a deterministic Rust benchmark-driver shutdown bug:
    - every row using a Rust Windows benchmark server failed with:
      - `Server rust (...) did not exit cleanly within 10s; forcing kill`
    - preserved server output contained only:
      - `READY`
    - implication:
      - the stricter runner removed the old hard-kill masking and surfaced a real Rust benchmark-driver lifecycle bug
  - root cause:
    - `bench/drivers/rust/src/bench_windows.rs` still used the old Windows stop pattern:
      - only `running_flag.store(false, ...)`
      - no wake connection
    - this is the same Windows accept-loop issue already fixed earlier in the Rust Windows tests:
      - `ConnectNamedPipe()` stays blocked until a connection wakes it
  - fix:
    - `bench/drivers/rust/src/bench_windows.rs` now mirrors the tested shutdown pattern (a wake-on-stop sketch follows this entry):
      - after `duration+3`, set `running_flag = false`
      - then issue a dummy `NpSession::connect(...)` so the blocked accept loop can observe shutdown and exit cleanly
  - direct proof on `win11`:
    - command:
      - `timeout 20 src/crates/netipc/target/release/bench_windows.exe np-ping-pong-server /tmp/plugin-ipc-bench-20260327 rust-stop-check 1`
    - result:
      - `READY`
      - `SERVER_CPU_SEC=0.000000`
    - implication:
      - the Rust Windows benchmark server now exits on its own instead of hanging until killed
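  The same wake-on-stop idea, sketched in C against the raw Win32 calls. The blocking behavior of `ConnectNamedPipe` is real; the loop structure and names around it are illustrative, not the driver's actual code.

  ```c
  #include <windows.h>

  #define BENCH_PIPE L"\\\\.\\pipe\\netipc-bench-demo" /* illustrative name */

  static volatile LONG g_running = 1;

  static DWORD WINAPI accept_loop(LPVOID arg) {
      (void)arg;
      while (g_running) {
          HANDLE pipe = CreateNamedPipeW(BENCH_PIPE, PIPE_ACCESS_DUPLEX,
                                         PIPE_TYPE_BYTE | PIPE_WAIT,
                                         PIPE_UNLIMITED_INSTANCES,
                                         4096, 4096, 0, NULL);
          if (pipe == INVALID_HANDLE_VALUE)
              break;
          /* Blocks until a client connects; flipping g_running alone can
           * never wake it, which is exactly the old driver bug. */
          ConnectNamedPipe(pipe, NULL);
          if (!g_running) {         /* woken by the dummy connect below */
              CloseHandle(pipe);
              break;
          }
          /* ... serve the connected client here ... */
          CloseHandle(pipe);
      }
      return 0;
  }

  static void stop_server(void) {
      InterlockedExchange(&g_running, 0); /* 1) flip the stop flag */
      /* 2) dummy client connect so the blocked accept returns and can see
       *    the flag; this mirrors the NpSession::connect(...) wake. */
      HANDLE h = CreateFileW(BENCH_PIPE, GENERIC_READ | GENERIC_WRITE, 0,
                             NULL, OPEN_EXISTING, 0, NULL);
      if (h != INVALID_HANDLE_VALUE)
          CloseHandle(h);
  }

  int main(void) {
      HANDLE t = CreateThread(NULL, 0, accept_loop, NULL, 0, NULL);
      Sleep(2000);                        /* stand-in for duration+3 */
      stop_server();
      WaitForSingleObject(t, INFINITE);   /* joins promptly thanks to the wake */
      CloseHandle(t);
      return 0;
  }
  ```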
- focused real benchmark proof after all fixes:
  - command:
    - `NIPC_BENCH_FIRST_BLOCK=2 NIPC_BENCH_LAST_BLOCK=2 NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/block-2-after-fix.csv 5`
  - result:
    - exited `0`
    - no diagnostic reruns were needed
    - all `36` `shm-ping-pong` rows published
  - key evidence:
    - the previously suspicious Windows row is now stable:
      - `shm-ping-pong rust->c @ max = 2,458,786` with `stable_ratio=1.009920`
    - all SHM `@ max` rows completed inside the stability gate (an illustrative gate sketch follows this entry), including:
      - `c->c = 2,505,981` with `stable_ratio=1.038817`
      - `rust->c = 2,458,786` with `stable_ratio=1.009920`
      - `c->rust = 2,588,642` with `stable_ratio=1.028021`
      - `rust->rust = 2,649,571` with `stable_ratio=1.018367`
      - `rust->go = 2,242,750` with `stable_ratio=1.045399`
    - the previously suspicious fixed-rate rows are now also stable:
      - `rust->c @ 100000/s = 99,997` with `stable_ratio=1.000010`
      - `rust->c @ 10000/s = 9,999` with `stable_ratio=1.000000`
      - `rust->rust @ 10000/s = 9,999` with `stable_ratio=1.000000`
  - factual conclusion from the focused SHM rerun:
    - the Windows SHM benchmark instability is materially reduced after:
      - the runner lifecycle fixes
      - per-repeat runtime isolation
      - explicit Windows SHM collision detection
      - the Rust benchmark-server wake-on-stop fix
    - the earlier `rust->c` SHM collapse no longer reproduces in the real benchmark block that used to be suspicious
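  The authoritative gate lives in `tests/run-windows-bench.sh`; as a rough illustration only, a gate of this shape could be computed as below, assuming `stable_ratio` is the max/min spread across a row's repeat throughputs and using a purely hypothetical threshold of `1.10`.

  ```c
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Gate a row on the spread across its repeat throughputs. */
  static bool row_is_stable(const double *repeats, size_t n,
                            double max_ratio, double *stable_ratio_out) {
      double lo = repeats[0], hi = repeats[0];
      for (size_t i = 1; i < n; i++) {
          if (repeats[i] < lo) lo = repeats[i];
          if (repeats[i] > hi) hi = repeats[i];
      }
      *stable_ratio_out = hi / lo;            /* e.g. the 1.009920 above */
      return *stable_ratio_out <= max_ratio;  /* inside the gate -> publish */
  }

  int main(void) {
      /* Five repeats of one row, in the spirit of the runner's "5" argument. */
      const double repeats[] = { 2458786, 2450112, 2434600, 2460101, 2442937 };
      double ratio;
      bool ok = row_is_stable(repeats, 5, 1.10, &ratio);
      printf("stable_ratio=%.6f -> %s\n", ratio, ok ? "publish" : "diagnose/rerun");
      return 0;
  }
  ```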
- partial full-suite proof after the focused fixes:
  - command started on `win11`:
    - `NIPC_BENCH_DIAGNOSE_FAILURES=1 bash tests/run-windows-bench.sh /tmp/plugin-ipc-bench-20260327/full-after-fix.csv 5`
  - factual behavior before manual interruption:
    - no diagnostics were emitted
    - no forced-kill Rust benchmark-server failures reappeared
    - the run cleared the exact NP area where the stricter runner had previously exposed the Rust benchmark-server shutdown bug:
      - `np-ping-pong @ max` rows for Rust servers completed cleanly
      - `np-ping-pong @ 100000/s` rows for Rust servers completed cleanly
      - `np-ping-pong @ 10000/s` rows were still running cleanly when the run was stopped intentionally for time
  - reason for interruption:
    - no new technical blocker remained
    - the rest of the work was wall-clock runtime only
- Finding recorded on 2026-04-15 during clean full validation after commit `074a3a5c8f552f15473a0bc929b96da9e71f79b7`:
  - Windows full validation failed in the MSYS bounded benchmark policy step.
  - Failing command context:
    - `bash tests/run-windows-msys-validation.sh`
    - comparison artifact directory: `/tmp/plugin-ipc-full-windows-20260415-162223/msys-validation/bench-compare/`
  - Concrete failing policy row:
    - `np-max,np-ping-pong,c,c,0,both,70.0,44.9,fail`
  - Concrete joined comparison row:
    - `np-max,np-ping-pong,c,c,0,both,14058.000,6317.000,44.9,55.1,185.000,459.200,148.2,31.900,30.300`
  - Rows that passed in the same policy run:
    - `np-100k`: MSYS was `100.5%` of native `mingw64`
    - `snapshot-np`: MSYS was `96.9%` of native `mingw64`
    - `shm-max`: MSYS was `88.5%` of native `mingw64`
    - all mixed C/Rust SHM rows passed their configured floors
  - Interpretation:
    - the failure is narrow to unbounded/max-rate C-to-C named-pipe ping-pong under MSYS
    - fixed-rate named-pipe behavior did not regress in the same run
    - no code or policy change is justified yet without investigating whether the `np-max` floor is a valid semantic regression floor or an invalid saturation policy for MSYS
  - Immediate investigation plan:
    - inspect `tests/compare-windows-bench-toolchains.sh` policy intent for `np-max`
    - inspect the per-sample artifacts for both `mingw64` and `msys` `np-max`
    - compare C named-pipe benchmark behavior under unbounded and fixed-rate modes
    - rerun the narrow `np-max` pair enough times to determine whether this is repeatable or a noisy saturation outlier
    - only then decide whether to fix code, fix harness policy, or both
- Follow-up evidence from a narrow rerun on the same `win11:~/src/plugin-ipc.git` checkout:
  - command:
    - paired targeted rerun of `np-ping-pong,c,c,0` for `mingw64` and `msys`
    - artifact directory: `/tmp/plugin-ipc-investigate-npmax-20260415-165234/`
  - result:
    - `mingw64`: `22489.000`
    - `msys`: `22492.000`
    - effective MSYS/native ratio: approximately `100.0%`
  - conclusion:
    - the implementation can pass the intended `np-max` relative policy immediately after the failed full run
    - the problem is a compare-lane false-failure mode: a stable-looking saturation outlier can pass row-local guards and fail only at the final relative-policy stage
- Implemented harness fix:
  - `tests/compare-windows-bench-toolchains.sh` now reruns policy-failed labels as paired `mingw64` + `msys` rows before final failure
  - default final-policy attempt budget: `NIPC_BENCH_COMPARE_POLICY_ATTEMPTS=3`
  - prior failed attempts are preserved as:
    - `summary.attempt-N.csv`
    - `joined.attempt-N.csv`
    - `policy.attempt-N.csv`
  - no throughput floor was lowered
- Documentation/test updates:
  - `README.md` and `WINDOWS-COVERAGE.md` document the paired policy-retry behavior
  - added `tests/test_windows_compare_policy_retry.sh` with a stub targeted runner proving:
    - the first policy attempt fails `np-max`
    - only the failed label is rerun
    - the final policy passes after the paired retry
- Local verification:
  - `bash -n tests/compare-windows-bench-toolchains.sh tests/test_windows_compare_policy_retry.sh tests/run-windows-msys-validation.sh`
  - `bash tests/test_windows_compare_policy_retry.sh`
  - `bash tests/test_windows_bench_stability_policy.sh`
  - `git diff --check`
  - all passed
- Validation evidence recorded after commit `bb996c638f6c73cb9f3a8b0aac55d09819548979`:
  - Linux full validation evidence:
    - command family:
      - build, `ctest`, Rust tests, Go tests, Go race, extended fuzz, C/Go/Rust coverage, ASAN, TSAN, Valgrind, all POSIX interop matrices, and POSIX benchmark generation
    - artifact directory:
      - `/tmp/plugin-ipc-full-linux-20260415-162222/`
    - result:
      - `full-linux.log` ended with `FULL LINUX VALIDATION PASSED`
      - `ctest`: `100% tests passed, 0 tests failed out of 42`
      - Rust: `305 passed; 0 failed`
      - Go: all packages passed
      - Go race: all packages passed
      - extended fuzz: `11 passed, 0 failed`
      - C coverage total: `92.3%`
      - Go coverage total: `94.3%`
      - Rust coverage total: `95.17%`
      - ASAN: `7/7` passed
      - TSAN: `6/6` passed with no races
      - Valgrind: `7/7` passed with zero errors/leaks/invalid accesses
      - POSIX benchmark CSV: `/tmp/plugin-ipc-full-linux-20260415-162222/benchmarks-posix.csv`
      - POSIX benchmark data rows: `201`
      - POSIX generator result: all performance floors met
    - exactness caveat:
      - this full Linux matrix ran at commit `074a3a5c8f552f15473a0bc929b96da9e71f79b7`
      - commit `bb996c638f6c73cb9f3a8b0aac55d09819548979` changed only Windows compare/docs/TODO/test files:
        - `tests/compare-windows-bench-toolchains.sh`
        - `tests/test_windows_compare_policy_retry.sh`
        - `README.md`
        - `WINDOWS-COVERAGE.md`
        - `TODO-netdata-plugin-ipc-integration.md`
      - the changed Windows compare test was verified locally after `bb996c6` with:
        - `bash tests/test_windows_compare_policy_retry.sh`
        - `bash tests/test_windows_bench_stability_policy.sh`
  - Native Windows correctness/coverage/interop evidence from the full validation run:
    - checkout:
      - `win11:~/src/plugin-ipc.git`
      - commit `074a3a5c8f552f15473a0bc929b96da9e71f79b7`
    - artifact directory:
      - `/tmp/plugin-ipc-full-windows-20260415-162223/`
    - results before the MSYS bounded-compare failure:
      - build and `ctest`: `30/30` passed
      - Rust Windows lib tests: `195 passed; 0 failed`
      - Go Windows tests: passed
      - App Verifier/PageHeap: passed
      - Windows C coverage: all files met the `90%` threshold
      - Windows Go coverage: `92.0%`
      - Windows Rust line coverage: `90.46%`
      - native standalone Windows interop/service/cache matrices: passed
      - MSYS functional slice: passed, including `test_win_shm` repeated `10/10`
    - exact failure from that full run:
      - only the MSYS bounded compare policy failed
      - failing policy artifact: `/tmp/plugin-ipc-full-windows-20260415-162223/msys-validation/bench-compare/policy.csv`
      - failing row: `np-max,np-ping-pong,c,c,0,both,70.0,44.9,fail`
      - this is the issue fixed by commit `bb996c6`
  - MSYS validation evidence after the paired policy-retry fix:
    - checkout:
      - `win11:~/src/plugin-ipc.git`
      - commit `bb996c638f6c73cb9f3a8b0aac55d09819548979`
    - command:
      - `bash tests/run-windows-msys-validation.sh /tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830 3`
    - artifact directory:
      - `/tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/`
    - result:
      - exited `0`
      - summary: `/tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/summary.txt`
      - policy: `/tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/bench-compare/policy.csv`
      - joined comparison: `/tmp/plugin-ipc-msys-validation-policy-retry-20260415-165830/bench-compare/joined.csv`
      - every configured MSYS-vs-native policy row passed
  - Strict native Windows benchmark evidence after the paired policy-retry fix:
    - checkout:
      - `win11:~/src/plugin-ipc.git`
      - commit `bb996c638f6c73cb9f3a8b0aac55d09819548979`
    - commands:
      - `bash tests/run-windows-bench.sh /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.csv 5`
      - `bash tests/generate-benchmarks-windows.sh /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.csv /tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.md`
    - artifact directory:
      - `/tmp/plugin-ipc-windows-native-bench-20260415-171700/`
    - result:
      - exited `0`
      - total measurements: `201`
      - CSV line count including header: `202`
      - generator result: `All performance floors met`
      - summary: `/tmp/plugin-ipc-windows-native-bench-20260415-171700/summary.txt`
      - report: `/tmp/plugin-ipc-windows-native-bench-20260415-171700/benchmarks-windows.md`
  - Current validation conclusion:
    - the concrete failures found during the full validation loop were fixed:
      - the C benchmark server lifecycle/timer shutdown was fixed in commit `074a3a5`
      - the MSYS bounded-compare policy false failure was fixed in commit `bb996c6`
    - no tracked Linux or Windows source divergence is being used as the sync mechanism
    - next operational step:
      - commit this evidence update locally
      - push
      - pull on `win11:~/src/plugin-ipc.git`
- Goal:
  - implement the refined transport rule in `plugin-ipc`:
    - if the handshake selects SHM and the client cannot attach SHM, the client must close that session, exclude SHM from future proposals for that client context, and reconnect on baseline
    - no server-side same-session fallback is allowed
- Scope:
  - update the C, Rust, and Go L2 client reconnect logic
  - keep L3 behavior inherited from L2
  - update the handshake/spec docs to describe this exception precisely
  - add tests for client-side SHM attach-failure fallback
- Implementation status:
  - done locally in C, Rust, and Go
  - behavior now is (a hedged sketch follows this list):
    - the handshake may negotiate SHM
    - if the client-side SHM attach fails, that session is closed
    - the client removes SHM from future proposals in that client context
    - the client reconnects through a new handshake and falls back to baseline
    - no same-session fallback is used
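  A minimal C-shaped sketch of that client rule. All `nipc_client_*` names and the transport flags are hypothetical stand-ins for the real API; the stubs simulate a server that negotiates SHM and a client whose SHM attach fails.

  ```c
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>

  enum { NIPC_TRANSPORT_BASELINE = 1u << 0, NIPC_TRANSPORT_SHM = 1u << 1 };

  typedef struct { unsigned transport; } nipc_client;

  /* Stub: the server picks SHM whenever the client proposes it. */
  static nipc_client *nipc_client_connect(const char *service, unsigned proposals) {
      (void)service;
      nipc_client *c = malloc(sizeof *c);
      c->transport = (proposals & NIPC_TRANSPORT_SHM) ? NIPC_TRANSPORT_SHM
                                                      : NIPC_TRANSPORT_BASELINE;
      return c;
  }
  static bool nipc_client_attach_shm(nipc_client *c) { (void)c; return false; } /* simulated failure */
  static void nipc_client_close(nipc_client *c) { free(c); }

  /* The rule: the transport is locked per session, so on SHM attach failure
   * the client closes the session, drops SHM from this context's proposals,
   * and reconnects through a fresh handshake on baseline. */
  static nipc_client *connect_with_shm_fallback(const char *service, unsigned *proposals) {
      nipc_client *c = nipc_client_connect(service, *proposals);
      if (c && c->transport == NIPC_TRANSPORT_SHM && !nipc_client_attach_shm(c)) {
          nipc_client_close(c);                         /* no same-session fallback */
          *proposals &= ~(unsigned)NIPC_TRANSPORT_SHM;  /* exclude SHM going forward */
          c = nipc_client_connect(service, *proposals); /* new handshake, baseline */
      }
      return c;
  }

  int main(void) {
      unsigned proposals = NIPC_TRANSPORT_BASELINE | NIPC_TRANSPORT_SHM;
      nipc_client *c = connect_with_shm_fallback("cgroups", &proposals);
      printf("transport=%s, shm still proposed=%s\n",
             c->transport == NIPC_TRANSPORT_SHM ? "shm" : "baseline",
             (proposals & NIPC_TRANSPORT_SHM) ? "yes" : "no");
      nipc_client_close(c);
      return 0;
  }
  ```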
- Linux evidence:
  - `go test ./pkg/netipc/service/raw -run 'TestUnixShmAttachFailureFallsBackToBaseline|TestUnixShmPrepareFailureFallsBackToBaseline' -count=1 -v`
    - passed
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml test_refresh_shm_attach_failure_falls_back_to_baseline -- --nocapture`
    - passed
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml test_server_falls_back_to_baseline_when_linux_shm_prepare_fails -- --nocapture`
    - passed
  - `cmake --build build --target test_service -j4`
    - passed
  - `./build/bin/test_service`
    - passed, including:
      - `Test: Client-side SHM attach failure falls back to baseline`
- Windows evidence on `win11:~/src/plugin-ipc.git` after pulling commit `2bca7bb`:
  - `cd src/go && go test ./pkg/netipc/service/raw -count=1 -run 'TestWinShmAttachFailureFallsBackToBaseline|TestWinShmPrepareFailureFallsBackToBaseline' -v`
    - passed
  - `cargo test --manifest-path src/crates/netipc/Cargo.toml test_refresh_winshm_attach_failure_falls_back_to_baseline -- --nocapture`
    - passed
  - `bash tests/run-coverage-c-windows.sh`
    - passed
    - included:
      - `test_win_service_guards.exe`
      - `test_win_service_guards_extra.exe`
      - `test_win_service_extra.exe`
      - the targeted Windows C interop/service matrix
    - coverage summary:
      - `netipc_service_win.c`: `90.6%`
      - `netipc_named_pipe.c`: `92.4%`
      - `netipc_win_shm.c`: `94.2%`
      - total: `91.9%`
    - included the new guard:
      - `Hybrid attach failure falls back to baseline`
  - `bash tests/run-coverage-rust-windows.sh`
    - passed
    - total line coverage: `90.54%`
    - critical-file line coverage:
      - `service/cgroups.rs`: `92.37%`
      - `transport/windows.rs`: `91.65%`
      - `transport/win_shm.rs`: `94.11%`
  - `bash tests/run-coverage-go-windows.sh`
    - passed
    - total coverage: `92.1%`
  - `NETIPC_BUILD_DIR="$HOME/src/plugin-ipc.git/build-windows-coverage-c" bash tests/run-verifier-windows.sh`
    - passed
    - no Application Verifier or PageHeap findings for:
      - `test_named_pipe.exe`
      - `test_win_shm.exe`
      - `test_win_service.exe`
      - `test_win_service_extra.exe`
- Sync status:
  - local `/home/costa/src/plugin-ipc.git` and `win11:~/src/plugin-ipc.git` were synchronized to `2bca7bb` for the validation run