tg3: add napi_enabled flag to track napi_enable/napi_disable calls #570
Open
yurypm wants to merge 1 commit into sonic-net:master from
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
paulmenzel
suggested changes
May 7, 2026
Added new napi_enable flag in tg3 struct and don't call napi_disable if
napi_enable was not called before.
Index: source_sonic/drivers/net/ethernet/broadcom/tg3.c
Contributor
Please add a patch properly formatted with git format-patch, including a Signed-off-by line. Is the issue present in the current Linux upstream release?
Contributor
Author
- quilt is used in Debian Trixie (202511). Converted the quilt patch to a git diff.
- Yes, the same issue is present in upstream Linux.
paulmenzel
suggested changes
May 7, 2026
  rtnl_unlock();
- pci_disable_device(pdev);
+ if (pci_is_enabled(pdev))
Contributor
Author
If an AER error is reported, recovery is started and tg3_io_error_detected is called. In tg3_io_error_detected, NAPI is disabled and pci_disable_device is called. Then, if we try to reset the device, pci_disable_device will be called again; we want to avoid this.
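The double call described in this comment can be illustrated with a small user-space model. This is only a sketch: `struct model_pci_dev` and every `model_*` function are hypothetical stand-ins, not kernel APIs, and the model assumes the PCI core tracks enablement with a simple counter that must not drop below zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; only the fields this model needs. */
struct model_pci_dev {
    int enable_cnt;           /* assumed enable counter, > 0 means enabled */
    int unbalanced_disables;  /* how often disable ran on a disabled device */
};

/* Models pci_is_enabled(): true while the counter is positive. */
static bool model_pci_is_enabled(const struct model_pci_dev *pdev)
{
    return pdev->enable_cnt > 0;
}

/* Models pci_disable_device(): records an unbalanced call, then decrements. */
static void model_pci_disable_device(struct model_pci_dev *pdev)
{
    if (pdev->enable_cnt <= 0)
        pdev->unbalanced_disables++;
    pdev->enable_cnt--;
}

/* Guarded variant, mirroring the "if (pci_is_enabled(pdev))" check. */
static void model_guarded_disable(struct model_pci_dev *pdev)
{
    if (model_pci_is_enabled(pdev))
        model_pci_disable_device(pdev);
}

/*
 * Disable twice in a row, as happens when the error handler has already
 * disabled the device and a later reset path disables it again.
 * Returns the number of unbalanced disable calls that were observed.
 */
static int model_unbalanced_after_two_disables(bool guarded)
{
    struct model_pci_dev pdev = { .enable_cnt = 1, .unbalanced_disables = 0 };

    for (int i = 0; i < 2; i++) {
        if (guarded)
            model_guarded_disable(&pdev);
        else
            model_pci_disable_device(&pdev);
    }
    return pdev.unbalanced_disables;
}
```

In this model the unguarded path records one unbalanced disable, while the guarded path records none because the second call is skipped.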
We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors can be reported. As a result, the AER kernel driver can initiate AER recovery for the tg3 device. The tg3_io_error_detected function is the AER error recovery handler.
From tg3_io_error_detected, we call tg3_netif_stop -> tg3_napi_disable -> napi_disable and return PCI_ERS_RESULT_NEED_RESET on a non-fatal error.
We expect tg3_io_slot_reset and tg3_io_resume to be called during AER recovery. But AER error recovery can fail, for example when one of the PCIe devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. In that case tg3_io_slot_reset and tg3_io_resume are not called, and both the PCIe device and NAPI stay disabled (pci_disable_device and napi_disable were called from tg3_io_error_detected). If the PCIe link is then disabled, napi_disable is called again:
  napi_disable+0x1b/0x1b0
  tg3_napi_disable+0x89/0xa0 [tg3]
  tg3_netif_stop+0x37/0xe3 [tg3]
  tg3_stop+0x30/0x160 [tg3]
  tg3_close+0x2a/0x60 [tg3]
  __dev_close_many+0xad/0x130
  dev_close_many+0xb2/0x190
  unregister_netdevice_many_notify+0x19d/0xa00
  ? try_to_wake_up+0x302/0x680
  unregister_netdevice_queue+0xf8/0x140
  unregister_netdev+0x1c/0x30
  tg3_remove_one+0xaa/0x150 [tg3]
  pci_device_remove+0x42/0xb0
  device_release_driver_internal+0x19c/0x200
  pci_stop_bus_device+0x85/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_bus_device+0x2c/0xb0
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x9f/0x160
  pciehp_disable_slot+0x67/0x100
  pciehp_handle_presence_or_link_change+0x77/0x350
napi_disable does not expect this second call, and the calling thread can block in napi_disable forever. We have pcierr_recovery to cover a similar issue, but only for fatal errors. We cannot reuse that flag because it is reset in tg3_io_resume, which is not called when AER recovery fails.
Added a new napi_enabled flag to the tg3 struct; napi_disable is not called if napi_enable was not called before.
Signed-off-by: Yury Murashka <yurypm@arista.com>
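The flag-based guard described above can be sketched as a user-space model. This is not the patch itself: `struct model_tg3` and the `model_*` functions are hypothetical names, and the counters only stand in for calls that would reach the real napi_enable and napi_disable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real struct tg3. */
struct model_tg3 {
    bool napi_enabled;       /* models the flag added by the patch */
    int  napi_enable_calls;  /* calls that would reach napi_enable */
    int  napi_disable_calls; /* calls that would reach napi_disable */
};

/* Models the enable path: enable NAPI and remember that it is enabled. */
static void model_napi_enable(struct model_tg3 *tp)
{
    tp->napi_enable_calls++;
    tp->napi_enabled = true;
}

/* Models the disable path: skip the call unless NAPI is currently enabled. */
static void model_napi_disable(struct model_tg3 *tp)
{
    if (!tp->napi_enabled)
        return;              /* already disabled, nothing to wait for */
    tp->napi_disable_calls++;
    tp->napi_enabled = false;
}

/*
 * Replays the failing sequence: the device is up, the AER error handler
 * disables NAPI, recovery fails so nothing re-enables it, and device
 * removal then walks the close path and tries to disable NAPI again.
 * Returns how many disable calls would reach napi_disable.
 */
static int model_failed_recovery_then_remove(void)
{
    struct model_tg3 tp = { 0 };

    model_napi_enable(&tp);   /* interface opened */
    model_napi_disable(&tp);  /* error handler stops the interface */
    /* slot reset and resume are never called in this scenario */
    model_napi_disable(&tp);  /* removal path reaches the disable again */

    return tp.napi_disable_calls;
}
```

With the guard in place, only the first disable goes through in this model; the second one returns early instead of waiting on a NAPI instance that is already disabled.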