Skip to content

tg3: add napi_enabled flag to track napi_enable/napi_disable calls#570

Open
yurypm wants to merge 1 commit intosonic-net:masterfrom
yurypm:master
Open

tg3: add napi_enabled flag to track napi_enable/napi_disable calls#570
yurypm wants to merge 1 commit intosonic-net:masterfrom
yurypm:master

Conversation

@yurypm
Copy link
Copy Markdown
Contributor

@yurypm yurypm commented May 7, 2026

We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors could be reported. As a result, AER recovery for the tg3 device can be initiated by the AER kernel driver. The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume will
be called. But AER error recovery can fail. For example, when one of PCIe
devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. As a result,
tg3_io_slot_reset and tg3_io_resume are not called, PCIe device is
disabled and NAPI is disabled (pci_disable_device and napi_disabled
are called from tg3_io_error_detected). Then we can try to disable PCIe link
and napi_disable will be called again:
napi_disable+0x1b/0x1b0
tg3_napi_disable+0x89/0xa0 [tg3]
tg3_netif_stop+0x37/0xe3 [tg3]
tg3_stop+0x30/0x160 [tg3]
tg3_close+0x2a/0x60 [tg3]
__dev_close_many+0xad/0x130
dev_close_many+0xb2/0x190
unregister_netdevice_many_notify+0x19d/0xa00
? try_to_wake_up+0x302/0x680
unregister_netdevice_queue+0xf8/0x140
unregister_netdev+0x1c/0x30
tg3_remove_one+0xaa/0x150 [tg3]
pci_device_remove+0x42/0xb0
device_release_driver_internal+0x19c/0x200
pci_stop_bus_device+0x85/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_and_remove_bus_device+0x12/0x20
pciehp_unconfigure_device+0x9f/0x160
pciehp_disable_slot+0x67/0x100
pciehp_handle_presence_or_link_change+0x77/0x350
This is not expected by napi_disable and a thread can be locked in
napi_disable forever. We have pcierr_recovery to cover similar issue, but for
fatal errors. We cannot reuse this flag because it is reset in tg3_io_resume,
but it is not called when AER recovery fails.
Added new napi_enable flag in tg3 struct and don't call napi_disable if
napi_enable was not called before.

@yurypm yurypm requested a review from a team as a code owner May 7, 2026 15:46
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented May 7, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: yurypm / name: Yury Murashka (9564cef)

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Added new napi_enable flag in tg3 struct and don't call napi_disable if
napi_enable was not called before.

Index: source_sonic/drivers/net/ethernet/broadcom/tg3.c
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a properly git format-patch-formatted patch with Signed-off-by line. Is the issue present in the current Linux upstream release?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. quilt is used in Debian Trixie (202511).
    Converted quilt patch to git diff
  2. Yes, there is the same issue in Linux upstream

rtnl_unlock();

- pci_disable_device(pdev);
+ if (pci_is_enabled(pdev))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an AER error is reported, recovery is started and tg3_io_error_detected is called. In tg3_io_error_detected, NAPI is disabled and pci_disable_device is called. Then, if we try to reset the device, pci_disable_device will be called again; we want to avoid this.

Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
Comment thread patches-sonic/driver-arista-net-tg3-napi-enable-called-flag.patch
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

We need this patch to fix a soft lockup in the Linux kernel on Arista modular chassis in the 202511 branch.
During linecard resets, uncorrectable errors could be reported. As a result, AER recovery for the tg3 device can be initiated by the AER kernel driver. The tg3_io_error_detected function is the AER error recovery handler.

From tg3_io_error_detected, we call tg3_netif_stop->tg3_napi_disable->
napi_disable and return PCI_ERS_RESULT_NEED_RESET on non-fatal error.
We expect that during AER recovery tg3_io_slot_reset and tg3_io_resume will
be called. But AER error recovery can fail. For example, when one of PCIe
devices on the same bus reports PCI_ERS_RESULT_NO_AER_DRIVER. As a result,
tg3_io_slot_reset and tg3_io_resume are not called, PCIe device is
disabled and NAPI is disabled (pci_disable_device and napi_disabled
are called from tg3_io_error_detected). Then we can try to disable PCIe link
and napi_disable will be called again:
napi_disable+0x1b/0x1b0
tg3_napi_disable+0x89/0xa0 [tg3]
tg3_netif_stop+0x37/0xe3 [tg3]
tg3_stop+0x30/0x160 [tg3]
tg3_close+0x2a/0x60 [tg3]
__dev_close_many+0xad/0x130
dev_close_many+0xb2/0x190
unregister_netdevice_many_notify+0x19d/0xa00
? try_to_wake_up+0x302/0x680
unregister_netdevice_queue+0xf8/0x140
unregister_netdev+0x1c/0x30
tg3_remove_one+0xaa/0x150 [tg3]
pci_device_remove+0x42/0xb0
device_release_driver_internal+0x19c/0x200
pci_stop_bus_device+0x85/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_bus_device+0x2c/0xb0
pci_stop_and_remove_bus_device+0x12/0x20
pciehp_unconfigure_device+0x9f/0x160
pciehp_disable_slot+0x67/0x100
pciehp_handle_presence_or_link_change+0x77/0x350
This is not expected by napi_disable and a thread can be locked in
napi_disable forever. We have pcierr_recovery to cover similar issue, but for
fatal errors. We cannot reuse this flag because it is reset in tg3_io_resume,
but it is not called when AER recovery fails.
Added new napi_enable flag in tg3 struct and don't call napi_disable if
napi_enable was not called before.

Signed-off-by: Yury Murashka <yurypm@arista.com>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants