drivers/nutdrv_qx.c, drivers/blazer_usb.c: cypress_command()/phoenix_command(): clear endpoint halt and retry on LIBUSB_ERROR_OVERFLOW [#598] by Pyker · Pull Request #3448 · networkupstools/nut

Pyker · 2026-05-18T12:57:02Z

Summary

Adds a usb_clear_halt() + retry on LIBUSB_ERROR_OVERFLOW for the EP 0x81 interrupt read in both cypress_command() and phoenix_command() in nutdrv_qx.c and blazer_usb.c, addressing the long-running #598 hang on the 0665:5161 Cypress USB-Serial bridge family (Salicru SPS ONE, Ippon, Belkin F6C1200-UNV, ViewPower, various Voltronic Power UPSes).

The byte-level root cause analysis, captured via usbmon + tshark on a Salicru SPS 1500 ONE BL, is in this comment on #598. Short version: firmware in this bridge family occasionally emits an interrupt-IN frame larger than the declared wMaxPacketSize=8, the kernel xHCI driver returns LIBUSB_ERROR_OVERFLOW and discards the data, and every subsequent read on the endpoint also returns OVERFLOW until a USB-level reset. Locally observed rate on a steady-state Salicru: 1-3 hangs/day with pollinterval=2; once stuck, only a port reset (or usb_resetter --reset-device) recovers it.

The patch issues a single usb_clear_halt(udev, 0x81) on OVERFLOW and re-reads the same 8-byte slice once. CLEAR_FEATURE(HALT) reliably unsticks both the kernel-side endpoint state and (empirically) the firmware's TX FIFO, so the next read returns the next clean frame and the protocol layer resumes without entering the existing MAXTRIES failure cluster. If the retry also overflows, the existing error path is taken unchanged.
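The clear-halt-and-retry-once behaviour described above can be sketched roughly as follows. This is not the patch itself: the `sketch_*` names, the function-pointer transport, and the scripted fake endpoint are all hypothetical, introduced only so the logic can be exercised without hardware. The only value taken from libusb-1.0 is the numeric code of LIBUSB_ERROR_OVERFLOW (-8).

```c
#include <stddef.h>

/* Same numeric value as LIBUSB_ERROR_OVERFLOW in libusb-1.0. */
#define SKETCH_ERR_OVERFLOW (-8)

/* Hypothetical transport hooks standing in for the real USB calls. */
typedef struct {
    int (*interrupt_read)(void *ctx, unsigned char ep,
                          unsigned char *buf, size_t len);
    int (*clear_halt)(void *ctx, unsigned char ep);
    void *ctx;
} sketch_transport_t;

/* Read one slice; on overflow, clear the endpoint halt and re-read once. */
static int sketch_read_chunk(const sketch_transport_t *t, unsigned char ep,
                             unsigned char *buf, size_t len)
{
    int ret = t->interrupt_read(t->ctx, ep, buf, len);

    if (ret == SKETCH_ERR_OVERFLOW) {
        /* CLEAR_FEATURE(HALT), then retry the same slice exactly once. */
        if (t->clear_halt(t->ctx, ep) == 0)
            ret = t->interrupt_read(t->ctx, ep, buf, len);
    }

    /* Whatever is left in ret (including a second overflow) is returned
     * unchanged, so the caller's existing error path still applies. */
    return ret;
}

/* Scripted fake endpoint (illustration only): replays queued results. */
typedef struct {
    const int *results;
    size_t count;
    size_t next;
    int halts_cleared;
} sketch_fake_t;

static int fake_read(void *ctx, unsigned char ep,
                     unsigned char *buf, size_t len)
{
    sketch_fake_t *f = ctx;
    (void)ep; (void)buf; (void)len;
    /* Once the script is exhausted, keep reporting overflow. */
    return (f->next < f->count) ? f->results[f->next++] : SKETCH_ERR_OVERFLOW;
}

static int fake_clear_halt(void *ctx, unsigned char ep)
{
    sketch_fake_t *f = ctx;
    (void)ep;
    f->halts_cleared++;
    return 0;
}
```

With a script of one overflow followed by an 8-byte read, `sketch_read_chunk()` returns 8 after a single clear-halt; with two overflows in a row it returns the overflow code and clears the halt only once.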

The same fix is applied to cypress_command() and phoenix_command() in both drivers because the read loops are byte-identical and the bug is at the USB transport layer, below either subdriver's protocol logic.

Why draft

Marking as draft until I have a few days of post-patch uptime data from the affected hardware (currently mitigated by a local watchdog that triggers usb_resetter --reset-device on stale status). I will convert to ready-for-review once I have soak data showing the OVERFLOW retries are being hit and absorbed without escalating into the existing MAXTRIES failure cluster.

Test plan

- Build nutdrv_qx and blazer_usb with the patch (verified locally: make -C drivers nutdrv_qx blazer_usb succeeds on Linux x86_64, libusb-1.0)
- Run nut-driver@salicru for at least 48h with debug_min = 2 and confirm:
  - At least one "LIBUSB_ERROR_OVERFLOW on EP 0x81, clearing halt and retrying" debug log line appears
  - Per-cycle failure rate stays at the pre-patch baseline (~0.6-0.9 events/hr) without escalating to MAXTRIES clusters
  - Watchdog at nut-salicru-watchdog.timer does not fire over the soak window
- Report soak results in a comment; convert this PR to ready-for-review

Related issues

- Fixes (or contributes to fixing) #598 ("Communications with Ippon UPS lost (blazer_usb)"): Ippon UPS "Communications with the UPS lost"
- Plausibly relevant to #2453 ("Weird connectivity issues with a generic ViewPower UPS"): ViewPower flaky USB2Serial on the same bridge family
- Plausibly relevant to #1174 ("nutdrv_qx multiple identical ups USB issue"): multiple Cypress UPSes return blank serials (suggests minimal firmware in this bridge family generally)

…command(): clear endpoint halt and retry on LIBUSB_ERROR_OVERFLOW [networkupstools#598]

Some firmware in the 0665:5161 Cypress USB-Serial bridge family (Salicru SPS ONE, Ippon, Belkin F6C1200-UNV, ViewPower, various Voltronic Power UPSes) occasionally emits an interrupt-IN frame larger than the declared wMaxPacketSize=8, mid-reply. The kernel xHCI driver flags this as LIBUSB_ERROR_OVERFLOW and discards the data; from that point on, every subsequent read on the endpoint also returns OVERFLOW until a USB-level reset is performed, which results in "Data stale" from upsd until the driver is restarted.

Issue an immediate CLEAR_FEATURE(HALT) on EP 0x81 and retry the read once. CLEAR_FEATURE(HALT) unsticks the kernel-side endpoint state, and in the observed cases the firmware's TX FIFO state as well, so the next read returns a clean frame. If the retry also overflows, the existing error path is taken unchanged.

Captured at byte level on a Salicru SPS 1500 ONE BL (Voltronic-QS H-protocol over 0665:5161, host xHCI on Linux 6.14): firmware emits chunks 3..6 of the QS reply collapsed into a single frame that exceeds wMaxPacketSize, kernel reports LIBUSB_ERROR_OVERFLOW, and all subsequent reads on EP 0x81 return OVERFLOW until a port reset.

The same fix is applied to cypress_command() and phoenix_command() in both drivers because the read loops are byte-identical and the bug is at the USB transport layer, below either subdriver's protocol logic.

Signed-off-by: Pedro Cunha <pedroagracio+nut@gmail.com>

github-actions · 2026-05-18T12:57:22Z

A ZIP file with standard source tarball and another tarball with pre-built docs for commit 9eefcfd is temporarily available: NUT-tarballs-PR-3448.zip.

AppVeyorBot · 2026-05-18T13:42:55Z

✅ Build nut 2.8.5.4740-master completed (commit ca38a9e7d6 by @Pyker)

artifacts
A 7-zip archive with built NUT for Windows debug binaries is temporarily available: NUT-for-Windows-x86_64-SNAPSHOT-2.8.5.4740-master.7z

AppVeyorBot · 2026-05-18T13:42:56Z

✅ Build nut 2.8.5.4740-master completed (commit ca38a9e7d6 by @Pyker)

Pyker · 2026-05-20T22:10:46Z

Status update after ~2.5 days soaking this on the originally-affected hardware (Salicru SPS 1500 ONE BL, 0665:5161, libusb-1.0, Linux 7.0.x).

The endpoint-halt-clear in this PR turns out to be necessary but not sufficient. It does cleanly recover isolated overflows: I captured one where usb_clear_halt(0x81) plus a single re-read returned valid data on the same poll. But it does nothing for the sustained lockup that is the actual chronic hang. In one captured event the firmware returned LIBUSB_ERROR_OVERFLOW on every read for ~25 seconds (76 consecutive overflows); the clear-halt-and-retry fired on all of them and not one recovered. Only a USB-level device reset (an external usb_resetter --reset-device, i.e. libusb_reset_device) cleared it.

Why the driver never self-recovers: in qx_command(), LIBUSB_ERROR_OVERFLOW is grouped with LIBUSB_ERROR_TIMEOUT and default (all break, i.e. "transient, retry next poll"), so a wedged endpoint never reaches the usb_reset() + reconnect path that PIPE/ETIME/IO/NO_DEVICE already use. From git history this grouping appears to date to the original blazer import and has only been mechanically renamed since (-EOVERFLOW -> ERROR_OVERFLOW -> LIBUSB_ERROR_OVERFLOW); it does not look like a deliberate "overflow is benign" choice.

Revised approach (under soak now; I'll replace the commit here once it's validated): route LIBUSB_ERROR_OVERFLOW to usb_reset() + reconnect, gated behind a small consecutive-overflow counter that any clean read resets. The first couple of overflows are still retried on the next poll, so genuine one-off transients never trigger a reset; only a sustained run escalates. This reuses the existing recovery machinery rather than the in-loop retry currently in this PR.
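A minimal sketch of the gating described above, assuming a threshold of three consecutive overflows (the actual threshold is still an open question in this PR). The `sketch_*` names, the action enum, and the state struct are hypothetical and are not the nutdrv_qx code; only the overflow path is modelled, and other errors are deliberately left to whatever handling already exists.

```c
/* Same numeric value as LIBUSB_ERROR_OVERFLOW in libusb-1.0. */
#define SKETCH_ERR_OVERFLOW   (-8)

/* Assumed escalation threshold; the PR has not settled on a value. */
#define SKETCH_OVERFLOW_LIMIT 3

typedef enum {
    SKETCH_OK,                  /* clean read */
    SKETCH_RETRY_NEXT_POLL,     /* treat as transient, try again next poll */
    SKETCH_RESET_AND_RECONNECT  /* escalate to device reset + reconnect */
} sketch_action_t;

typedef struct {
    unsigned consecutive_overflows;
} sketch_overflow_state_t;

/* Decide what to do with the result of one read attempt. */
static sketch_action_t sketch_classify(sketch_overflow_state_t *st, int ret)
{
    if (ret >= 0) {
        /* Any clean read resets the run of overflows. */
        st->consecutive_overflows = 0;
        return SKETCH_OK;
    }

    if (ret == SKETCH_ERR_OVERFLOW) {
        if (++st->consecutive_overflows >= SKETCH_OVERFLOW_LIMIT) {
            /* Sustained run: hand over to the reset path, start counting afresh. */
            st->consecutive_overflows = 0;
            return SKETCH_RESET_AND_RECONNECT;
        }
        /* Isolated overflow: stay on the "retry next poll" path. */
        return SKETCH_RETRY_NEXT_POLL;
    }

    /* Other errors are outside this sketch; the counter is left untouched
     * (an assumption) and the caller's existing handling would decide. */
    return SKETCH_RETRY_NEXT_POLL;
}
```

With this shape, two overflows followed by a clean read never trigger a reset, while three overflows in a row do, which keeps the change inert for hardware that only overflows occasionally.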

A couple of things I'd value maintainer input on:

Is changing the shared LIBUSB_ERROR_OVERFLOW handling acceptable for all nutdrv_qx devices, or would you prefer it scoped behind a subdriver flag? Argument for leaving it global: a compliant device never overruns wMaxPacketSize, so the path is inert for healthy hardware, while for this firmware it converts an indefinite hang into a bounded auto-recovery.
Any preference on the escalation threshold, and on applying the same change to blazer_usb.c (structurally identical switch, and the driver that #598, "Communications with Ippon UPS lost (blazer_usb)", was originally filed against)?

Leaving this as draft until the soak confirms an in-driver reset clears the wedge as reliably as the external one does.

jimklimov · 2026-05-20T22:28:12Z

Thanks for the details and revision. As for the questions - I think yes to both, better nail this exceptional situation widely.

jimklimov added the "USB", "Qx protocol driver" (Driver based on Megatec Q<number> such as new nutdrv_qx, or obsoleted blazer and some others) and "Connection stability issues" (Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time) labels May 18, 2026

jimklimov added this to the 2.8.6 milestone May 18, 2026

Pyker wants to merge 1 commit into networkupstools:master from Pyker:issue-598-cypress-overflow-clear-halt
