fix: radio lifecycle and SX1262 settings update bugs#70
Conversation
Replace sys.exit(1) with raise RuntimeError in gpio_manager so init failures are catchable by except Exception. Make is_connected a @Property derived from thread liveness in KissModemWrapper and KissSerialWrapper so runtime failures are reflected accurately without manual state management. Fix SX1262 background task to set _initialized=False and break on unexpected error rather than sleeping and looping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gpio_manager: reduce edge detection poll timeout from 30s to 1s so cleanup_all() join(2.0) is sufficient to stop the thread synchronously and GPIO lines are released before the caller returns. sx1262_wrapper: cancel _rx_irq_background_task before teardown; drive txen, rxen, and en_pins LOW before cleanup_all() so output pins are in a known safe state when the kernel GPIO claim is released. kiss_modem_wrapper, kiss_serial_wrapper: add cleanup() delegating to disconnect(), giving callers a consistent interface across all radio types and ensuring the serial fd is properly closed on teardown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Individual setters (set_frequency, set_bandwidth, set_spreading_factor) leave the radio in standby after calling calibrateImage/setLoRaModulation, causing the noise floor sampler to read RSSI=255 (-127.5 dBm) and the radio to stop receiving. configure_radio() applies all params in one standby/restore cycle, resets the noise floor state, and returns the radio to RX_CONTINUOUS. Waits up to 10s for any in-progress TX to complete before reconfiguring. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for tackling these radio lifecycle issues. I think the overall direction is right, but there are a few things here I don’t think we should merge as-is. The biggest one is is_connected. In both wrappers it’s gone from being a mutable attribute to a read-only property (see kiss_modem_wrapper.py and kiss_serial_wrapper.py). That breaks existing usage where state gets assigned directly, especially tests/mocks, and it’s already blowing up a lot of cases in test_kiss_modem_wrapper.py. I’m fine with moving toward a property-based approach, but we need backward compatibility here somehow. Either add a setter back in, or introduce a proper override/mock mechanism and update tests plus migration notes in this in code comments and PR. Also noticed GPIO handling changed from exiting the process to raising RuntimeError (in gpio_manager.py). I agree raising is probably cleaner for a library, but it’s still a runtime change. We should ensure top-level shutdown flow catches and handles it properly. One other thing I want clarified: SX1262 RX task behavior changed significantly in sx1262_wrapper.py. Previously unexpected exceptions did await asyncio.sleep(1.0) and continued; now that sleep/retry path is gone and the task marks itself uninitialized and exits immediately. That worries me for long-running deployments where transient radio faults are common. If that change is intentional, can you confirm what is now responsible for recovery/restart? @agessaman do you have any comments? |
- Revert is_connected to plain bool attribute on KissModemWrapper and KissSerialWrapper — property approach broke backward compatibility with existing callers and tests that assign the flag directly - Revert SX1262 RX task to sleep-and-continue on unexpected errors rather than marking uninitialized and exiting — transient faults are common in long-running deployments and check_radio_health() cannot revive the task once _initialized is False - Call cleanup() before re-raising RuntimeError in SX1262Radio.begin() so GPIO file descriptors and pin state are released regardless of whether the caller handles the exception Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a worker thread exits due to an exception, set is_connected=False and stop_event so RadioManager's _wait_for_disconnect() polling loop detects the failure within 1s and triggers a reconnect attempt. Full disconnect() cannot be called from within a worker thread as it attempts to join the current thread, which raises RuntimeError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
You're right. I hadn't considered a few things that were fundamental here. Backwards comaptability. I also hadn't considered the transient error situation. I've reverted the change that converts the is_connected to a property, and added some extra code to catch the failure mode that it was missing before. I do think this is probably better as a read only item, but, if we're relying on it not to be in other places, I guess that's set now. For the Exceptions in the GPIO handler, as discussed I think exit() was a little heavy handed, but you're right, the way I had done it left the GPIO pins dangling in whatever state they were in. Not good. I've added a cleanup to the exception now that resets the pins back to a default state and releases the GPIO pins now before we throw the exception. I think this is a lot cleaner approach as even in the exit() approach before, it would exit and close the GPIO subsystem but it would leave the pins in whatever state they were last in. Potentially leaving any Rx, Tx or En pin that was applied to the radio in that state even after the SPI had stopped. The exit approach had this bug too however, to be fair. I think this is much cleaner now. Please re-review. Have tested with the new repeater branch and all is working as intended. All tests in pyMC_Core pass. |
Problems
1. Fatal errors instead of exceptions in
GPIOPinManagerNine places across
setup_output_pin,setup_input_pin, andsetup_interrupt_pincalledsys.exit(1)on GPIO failures (pin in use, permission denied, general error). This killed the entire process rather than allowing the caller to handle or report the failure cleanly.2.
is_connectedwas a mutable flag, not derived stateBoth
KissModemWrapperandKissSerialWrappermaintained a booleanis_connectedthat was set manually inconnect(),disconnect(), and error paths. This created a split-brain problem: the flag could say connected while the worker threads had already died (or vice versa), causing the workers to suppress error logs and exit silently.3. SX1262 GPIO not released on cleanup
cleanup()calledgpio_manager.cleanup_all()which closes the GPIO file descriptors — but the Linux GPIO character device does not reset physical pin voltage on close. TX enable, RX enable, and EN pins were left in whatever state they were in when cleanup was called, leaving hardware in an unknown state for the next init.4.
_rx_irq_tasknot cancelled on SX1262 cleanupThe asyncio task running the RX IRQ background loop was lazily created in
set_rx_callback().cleanup()calledlora.end()without first cancelling the task, leaving it running against a torn-down driver.5. SX1262 edge detection thread held GPIO for up to 30s after cleanup
_monitor_edge_eventsblocked ingpio.poll(30.0).cleanup_all()joined with a 2s timeout — the thread was still alive inside the poll call, so it held the GPIO file descriptor open, blocking the kernel release.6. KISS wrappers had no
cleanup()methodKissModemWrapperandKissSerialWrapperhad nocleanup().RadioManagerguarded withhasattr(radio, "cleanup")and silently skipped it, leaving the serial file descriptor open after disconnect.7. SX1262 had no
configure_radio()methodKissModemWrapper,USBLoRaRadio, andTCPLoRaRadioall exposeconfigure_radio(frequency, bandwidth, spreading_factor, coding_rate). The repeater and companion usehasattr(radio, "configure_radio")to detect and call it.SX1262Radiohad no equivalent, so callers fell back to individual setters (set_frequency,set_spreading_factor,set_bandwidth). Each of those issues a hardware command that leaves the radio in standby —setFrequencyinternally callscalibrateImage(opcode 0x98) which requires STANDBY_RC per the Semtech datasheet and does not return to RX. The radio stopped receiving after any settings update.8.
_rx_irq_background_taskswallowed errors and kept runningOn unexpected exceptions in the RX IRQ loop, the task logged and slept for 1s then continued, leaving
_initialized = Truedespite the radio being in an unknown state.Fix Approaches
sys.exit(1)withraise RuntimeError(...) from eso failures propagate normally.is_connectedas a@propertyfrom thread liveness rather than maintaining a flag manually. Worker loops driven solely bystop_eventso they don't depend on the flag.cleanup_all().-1pins explicitly skipped._rx_irq_taskbeforelora.end()inSX1262Radio.cleanup(), guarded withhasattrsince the task is lazily created.gpio.poll(30.0)→gpio.poll(1.0)— the timeout only controls how long the thread blocks before looping to re-checkstop_event. It does not affect interrupt reliability: the Linux GPIO character device queues edge events in the kernel regardless of whetherpoll()is currently blocking, so any edge that fires during the loop turnaround is already in the queue and consumed on the next iteration. The sole effect of the shorter timeout is that the edge detection thread now exits within the existingjoin(2.0)window during cleanup, rather than holding the GPIO file descriptor open for up to 30 seconds.Added
cleanup() → self.disconnect()to both KISS wrappers rather than exposingdisconnect()directly —RadioManagercallscleanup()on all radio types, andSX1262Radio,USBLoRaRadio, andTCPLoRaRadioall already implement it. The KISS wrappers were the only exceptions, papered over with ahasattr(radio, "cleanup")guard. This brings them in line soRadioManagercan callcleanup()unconditionally.configure_radio()toSX1262Radiomatching the same signature and return convention as the KISS implementation. One standby/restore cycle: apply frequency and modulation together, reset noise floor state, reconfigure IRQ masks, return toRX_CONTINUOUS._initialized = Falseand break on unexpected errors in the RX IRQ loop.Changes
gpio_manager.py— 9×sys.exit(1)replaced withraise RuntimeError(...) from e.gpio.poll(30.0)→gpio.poll(1.0).kiss_modem_wrapper.py—is_connectedconverted to@propertyderived fromrx_thread.is_alive() and tx_thread.is_alive(). Manual flag assignments removed. Worker loops simplified tostop_eventonly. Addedcleanup() → self.disconnect().kiss_serial_wrapper.py— Sameis_connectedand worker loop changes as above. Addedcleanup() → self.disconnect().sx1262_wrapper.py—_rx_irq_background_tasksets_initialized = Falseand breaks on error.cleanup()now cancels_rx_irq_taskfirst (withhasattrguard), drives txen/rxen/en_pins LOW beforecleanup_all(). Addedconfigure_radio(frequency, bandwidth, spreading_factor, coding_rate) → bool: waits up to 10s for any in-progress TX to complete (polling_tx_lock.locked()every 50ms), then applies all params in a single standby/restore cycle and returns the radio toRX_CONTINUOUS.TX wait note — the clean solution for the TX wait would be
async with self._tx_lock, but makingconfigure_radioasync is a breaking change:companion_radio.set_radio_params()calls it synchronously, and the KISS implementation is also sync. Keeping the interface consistent across radio types outweighs the elegance here. LoRa TX completes in well under a second even at the slowest settings, so the 10s polling timeout gives enormous headroom.