Skip to content

fix: radio lifecycle and SX1262 settings update bugs#70

Open
zindello wants to merge 5 commits into
pyMC-dev:devfrom
zindello:fix/radio-management-bugs
Open

fix: radio lifecycle and SX1262 settings update bugs#70
zindello wants to merge 5 commits into
pyMC-dev:devfrom
zindello:fix/radio-management-bugs

Conversation

@zindello
Copy link
Copy Markdown

@zindello zindello commented May 15, 2026

Problems

1. Fatal errors instead of exceptions in GPIOPinManager
Nine places across setup_output_pin, setup_input_pin, and setup_interrupt_pin called sys.exit(1) on GPIO failures (pin in use, permission denied, general error). This killed the entire process rather than allowing the caller to handle or report the failure cleanly.

2. is_connected was a mutable flag, not derived state
Both KissModemWrapper and KissSerialWrapper maintained a boolean is_connected that was set manually in connect(), disconnect(), and error paths. This created a split-brain problem: the flag could say connected while the worker threads had already died (or vice versa), causing the workers to suppress error logs and exit silently.

3. SX1262 GPIO not released on cleanup
cleanup() called gpio_manager.cleanup_all() which closes the GPIO file descriptors — but the Linux GPIO character device does not reset physical pin voltage on close. TX enable, RX enable, and EN pins were left in whatever state they were in when cleanup was called, leaving hardware in an unknown state for the next init.

4. _rx_irq_task not cancelled on SX1262 cleanup
The asyncio task running the RX IRQ background loop was lazily created in set_rx_callback(). cleanup() called lora.end() without first cancelling the task, leaving it running against a torn-down driver.

5. SX1262 edge detection thread held GPIO for up to 30s after cleanup
_monitor_edge_events blocked in gpio.poll(30.0). cleanup_all() joined with a 2s timeout — the thread was still alive inside the poll call, so it held the GPIO file descriptor open, blocking the kernel release.

6. KISS wrappers had no cleanup() method
KissModemWrapper and KissSerialWrapper had no cleanup(). RadioManager guarded with hasattr(radio, "cleanup") and silently skipped it, leaving the serial file descriptor open after disconnect.

7. SX1262 had no configure_radio() method
KissModemWrapper, USBLoRaRadio, and TCPLoRaRadio all expose configure_radio(frequency, bandwidth, spreading_factor, coding_rate). The repeater and companion use hasattr(radio, "configure_radio") to detect and call it. SX1262Radio had no equivalent, so callers fell back to individual setters (set_frequency, set_spreading_factor, set_bandwidth). Each of those issues a hardware command that leaves the radio in standby — setFrequency internally calls calibrateImage (opcode 0x98) which requires STANDBY_RC per the Semtech datasheet and does not return to RX. The radio stopped receiving after any settings update.

8. _rx_irq_background_task swallowed errors and kept running
On unexpected exceptions in the RX IRQ loop, the task logged and slept for 1s then continued, leaving _initialized = True despite the radio being in an unknown state.


Fix Approaches

  • Replace sys.exit(1) with raise RuntimeError(...) from e so failures propagate normally.
  • Derive is_connected as a @property from thread liveness rather than maintaining a flag manually. Worker loops driven solely by stop_event so they don't depend on the flag.
  • Drive all output pins (txen, rxen, en_pins) LOW before calling cleanup_all(). -1 pins explicitly skipped.
  • Cancel _rx_irq_task before lora.end() in SX1262Radio.cleanup(), guarded with hasattr since the task is lazily created.
  • Reduce gpio.poll(30.0)gpio.poll(1.0) — the timeout only controls how long the thread blocks before looping to re-check stop_event. It does not affect interrupt reliability: the Linux GPIO character device queues edge events in the kernel regardless of whether poll() is currently blocking, so any edge that fires during the loop turnaround is already in the queue and consumed on the next iteration. The sole effect of the shorter timeout is that the edge detection thread now exits within the existing join(2.0) window during cleanup, rather than holding the GPIO file descriptor open for up to 30 seconds.
    Added cleanup() → self.disconnect() to both KISS wrappers rather than exposing disconnect() directly — RadioManager calls cleanup() on all radio types, and SX1262Radio, USBLoRaRadio, and TCPLoRaRadio all already implement it. The KISS wrappers were the only exceptions, papered over with a hasattr(radio, "cleanup") guard. This brings them in line so RadioManager can call cleanup() unconditionally.
  • Add configure_radio() to SX1262Radio matching the same signature and return convention as the KISS implementation. One standby/restore cycle: apply frequency and modulation together, reset noise floor state, reconfigure IRQ masks, return to RX_CONTINUOUS.
  • Set _initialized = False and break on unexpected errors in the RX IRQ loop.

Changes

gpio_manager.py — 9× sys.exit(1) replaced with raise RuntimeError(...) from e. gpio.poll(30.0)gpio.poll(1.0).

kiss_modem_wrapper.pyis_connected converted to @property derived from rx_thread.is_alive() and tx_thread.is_alive(). Manual flag assignments removed. Worker loops simplified to stop_event only. Added cleanup() → self.disconnect().

kiss_serial_wrapper.py — Same is_connected and worker loop changes as above. Added cleanup() → self.disconnect().

sx1262_wrapper.py_rx_irq_background_task sets _initialized = False and breaks on error. cleanup() now cancels _rx_irq_task first (with hasattr guard), drives txen/rxen/en_pins LOW before cleanup_all(). Added configure_radio(frequency, bandwidth, spreading_factor, coding_rate) → bool: waits up to 10s for any in-progress TX to complete (polling _tx_lock.locked() every 50ms), then applies all params in a single standby/restore cycle and returns the radio to RX_CONTINUOUS.

TX wait note — the clean solution for the TX wait would be async with self._tx_lock, but making configure_radio async is a breaking change: companion_radio.set_radio_params() calls it synchronously, and the KISS implementation is also sync. Keeping the interface consistent across radio types outweighs the elegance here. LoRa TX completes in well under a second even at the slowest settings, so the 10s polling timeout gives enormous headroom.

Joshua Mesilane and others added 2 commits May 15, 2026 13:46
Replace sys.exit(1) with raise RuntimeError in gpio_manager so init
failures are catchable by except Exception. Make is_connected a
@Property derived from thread liveness in KissModemWrapper and
KissSerialWrapper so runtime failures are reflected accurately without
manual state management. Fix SX1262 background task to set
_initialized=False and break on unexpected error rather than sleeping
and looping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gpio_manager: reduce edge detection poll timeout from 30s to 1s so
cleanup_all() join(2.0) is sufficient to stop the thread synchronously
and GPIO lines are released before the caller returns.

sx1262_wrapper: cancel _rx_irq_background_task before teardown; drive
txen, rxen, and en_pins LOW before cleanup_all() so output pins are in
a known safe state when the kernel GPIO claim is released.

kiss_modem_wrapper, kiss_serial_wrapper: add cleanup() delegating to
disconnect(), giving callers a consistent interface across all radio
types and ensuring the serial fd is properly closed on teardown.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zindello zindello changed the title hardware: fix inconsistent radio failure signalling DRAFT: hardware: fix inconsistent radio failure signalling May 15, 2026
@zindello zindello marked this pull request as draft May 15, 2026 06:18
Individual setters (set_frequency, set_bandwidth, set_spreading_factor) leave the
radio in standby after calling calibrateImage/setLoRaModulation, causing the noise
floor sampler to read RSSI=255 (-127.5 dBm) and the radio to stop receiving.

configure_radio() applies all params in one standby/restore cycle, resets the noise
floor state, and returns the radio to RX_CONTINUOUS. Waits up to 10s for any
in-progress TX to complete before reconfiguring.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zindello zindello marked this pull request as ready for review May 15, 2026 08:12
@zindello zindello changed the title DRAFT: hardware: fix inconsistent radio failure signalling fix: radio lifecycle and SX1262 settings update bugs May 15, 2026
@rightup
Copy link
Copy Markdown
Collaborator

rightup commented May 15, 2026

Thanks for tackling these radio lifecycle issues. I think the overall direction is right, but there are a few things here I don’t think we should merge as-is.

The biggest one is is_connected. In both wrappers it’s gone from being a mutable attribute to a read-only property (see kiss_modem_wrapper.py and kiss_serial_wrapper.py). That breaks existing usage where state gets assigned directly, especially tests/mocks, and it’s already blowing up a lot of cases in test_kiss_modem_wrapper.py.

I’m fine with moving toward a property-based approach, but we need backward compatibility here somehow. Either add a setter back in, or introduce a proper override/mock mechanism and update tests plus migration notes in this in code comments and PR.

Also noticed GPIO handling changed from exiting the process to raising RuntimeError (in gpio_manager.py). I agree raising is probably cleaner for a library, but it’s still a runtime change. We should ensure top-level shutdown flow catches and handles it properly.

One other thing I want clarified: SX1262 RX task behavior changed significantly in sx1262_wrapper.py. Previously unexpected exceptions did await asyncio.sleep(1.0) and continued; now that sleep/retry path is gone and the task marks itself uninitialized and exits immediately. That worries me for long-running deployments where transient radio faults are common.

If that change is intentional, can you confirm what is now responsible for recovery/restart?

@agessaman do you have any comments?

Joshua Mesilane and others added 2 commits May 15, 2026 21:08
- Revert is_connected to plain bool attribute on KissModemWrapper and
  KissSerialWrapper — property approach broke backward compatibility with
  existing callers and tests that assign the flag directly
- Revert SX1262 RX task to sleep-and-continue on unexpected errors rather
  than marking uninitialized and exiting — transient faults are common in
  long-running deployments and check_radio_health() cannot revive the task
  once _initialized is False
- Call cleanup() before re-raising RuntimeError in SX1262Radio.begin() so
  GPIO file descriptors and pin state are released regardless of whether
  the caller handles the exception

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a worker thread exits due to an exception, set is_connected=False
and stop_event so RadioManager's _wait_for_disconnect() polling loop
detects the failure within 1s and triggers a reconnect attempt.

Full disconnect() cannot be called from within a worker thread as it
attempts to join the current thread, which raises RuntimeError.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zindello
Copy link
Copy Markdown
Author

zindello commented May 15, 2026

You're right. I hadn't considered a few things that were fundamental here. Backwards comaptability. I also hadn't considered the transient error situation.

I've reverted the change that converts the is_connected to a property, and added some extra code to catch the failure mode that it was missing before. I do think this is probably better as a read only item, but, if we're relying on it not to be in other places, I guess that's set now.

For the Exceptions in the GPIO handler, as discussed I think exit() was a little heavy handed, but you're right, the way I had done it left the GPIO pins dangling in whatever state they were in. Not good. I've added a cleanup to the exception now that resets the pins back to a default state and releases the GPIO pins now before we throw the exception. I think this is a lot cleaner approach as even in the exit() approach before, it would exit and close the GPIO subsystem but it would leave the pins in whatever state they were last in. Potentially leaving any Rx, Tx or En pin that was applied to the radio in that state even after the SPI had stopped.

The exit approach had this bug too however, to be fair.

I think this is much cleaner now.

Please re-review. Have tested with the new repeater branch and all is working as intended. All tests in pyMC_Core pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants