-
Notifications
You must be signed in to change notification settings - Fork 5
bpf, arm64: fix bpf line info #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Master branch: edc21dc |
|
Master branch: d2b94f3 |
94a52c0 to
0d3d900
Compare
|
Master branch: 8cbf062 |
0d3d900 to
3ec6d4d
Compare
|
Master branch: 8cbf062 |
3ec6d4d to
4a4a2e3
Compare
|
Master branch: 477bb4c |
4a4a2e3 to
217d192
Compare
|
Master branch: 2e3f7be |
217d192 to
8f0cfc6
Compare
2e3f7be to
f76d850
Compare
|
Master branch: f76d850 |
8f0cfc6 to
a679793
Compare
|
Master branch: 9b6eb04 |
a679793 to
9177c6c
Compare
|
Master branch: 1b8c924 |
9177c6c to
84c1c6a
Compare
|
Master branch: 9e98ace |
84c1c6a to
df4c8bd
Compare
9e98ace to
1b8c924
Compare
|
Master branch: 1b8c924 |
df4c8bd to
ecb65a0
Compare
|
Master branch: b38101c |
ecb65a0 to
ea250fb
Compare
|
Master branch: b75daca |
ea250fb to
89eae2a
Compare
Problem description
===================
Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.
phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
-> phy_config_inband() // acquires &pl->phydev->lock
whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().
The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.
phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.
Problem impact
==============
I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.
Proposed solution
=================
Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.
Solution details, considerations, notes
=======================================
This is the phy_config_inband() call graph:
sfp_upstream_ops :: connect_phy()
|
v
phylink_sfp_connect_phy()
|
v
phylink_sfp_config_phy()
|
| sfp_upstream_ops :: module_insert()
| |
| v
| phylink_sfp_module_insert()
| |
| | sfp_upstream_ops :: module_start()
| | |
| | v
| | phylink_sfp_module_start()
| | |
| v v
| phylink_sfp_config_optical()
phylink_start() | |
| phylink_resume() v v
| | phylink_sfp_set_config()
| | |
v v v
phylink_mac_initial_config()
| phylink_resolve()
| | phylink_ethtool_ksettings_set()
v v v
phylink_major_config()
|
v
phy_config_inband()
phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().
phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.
phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.
Other solutions
===============
The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.
Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250904125238.193990-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says: ==================== ipv4: icmp: Fix source IP derivation in presence of VRFs Align IPv4 with IPv6 and in the presence of VRFs generate ICMP error messages with a source IP that is derived from the receiving interface and not from its VRF master. This is especially important when the error messages are "Time Exceeded" messages as it means that utilities like traceroute will show an incorrect packet path. Patches #1-#2 are preparations. Patch #3 is the actual change. Patches #4-#7 make small improvements in the existing traceroute test. Patch #8 extends the traceroute test with VRF test cases for both IPv4 and IPv6. Changes since v1 [1]: * Rebase. [1] https://lore.kernel.org/netdev/20250901083027.183468-1-idosch@nvidia.com/ ==================== Link: https://patch.msgid.link/20250908073238.119240-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1
This fixes the following UAF caused by not properly locking hdev when processing HCI_EV_NUM_COMP_PKTS: BUG: KASAN: slab-use-after-free in hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 Read of size 4 at addr ffff8880740f0940 by task kworker/u11:0/54 CPU: 1 UID: 0 PID: 54 Comm: kworker/u11:0 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci1 hci_rx_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_conn_tx_dequeue+0x1be/0x220 net/bluetooth/hci_conn.c:3036 hci_num_comp_pkts_evt+0x1c8/0xa50 net/bluetooth/hci_event.c:4404 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 54: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 le_conn_complete_evt+0x3d6/0x1220 net/bluetooth/hci_event.c:5628 hci_le_enh_conn_complete_evt+0x189/0x470 net/bluetooth/hci_event.c:5794 hci_event_func net/bluetooth/hci_event.c:7474 [inline] hci_event_packet+0x78c/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Freed by task 9572: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_abort_conn_sync+0x5d1/0xdf0 net/bluetooth/hci_sync.c:5689 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: 134f4b3 ("Bluetooth: add support for skb TX SND/COMPLETION timestamping") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the following UFA in hci_acl_create_conn_sync where a connection still pending is command submission (conn->state == BT_OPEN) maybe freed, also since this also can happen with the likes of hci_le_create_conn_sync fix it as well: BUG: KASAN: slab-use-after-free in hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 Write of size 2 at addr ffff88805ffcc038 by task kworker/u11:2/9541 CPU: 1 UID: 0 PID: 9541 Comm: kworker/u11:2 Not tainted 6.16.0-rc7 #3 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Workqueue: hci3 hci_cmd_sync_work Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x230 mm/kasan/report.c:480 kasan_report+0x118/0x150 mm/kasan/report.c:593 hci_acl_create_conn_sync+0x5ef/0x790 net/bluetooth/hci_sync.c:6861 hci_cmd_sync_work+0x210/0x3a0 net/bluetooth/hci_sync.c:332 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 </TASK> Allocated by task 123736: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __kmalloc_cache_noprof+0x230/0x3d0 mm/slub.c:4359 kmalloc_noprof include/linux/slab.h:905 [inline] kzalloc_noprof include/linux/slab.h:1039 [inline] __hci_conn_add+0x233/0x1b30 net/bluetooth/hci_conn.c:939 hci_conn_add_unset net/bluetooth/hci_conn.c:1051 [inline] hci_connect_acl+0x16c/0x4e0 net/bluetooth/hci_conn.c:1634 pair_device+0x418/0xa70 net/bluetooth/mgmt.c:3556 hci_mgmt_cmd+0x9c9/0xef0 net/bluetooth/hci_sock.c:1719 hci_sock_sendmsg+0x6ca/0xef0 net/bluetooth/hci_sock.c:1839 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:727 sock_write_iter+0x258/0x330 net/socket.c:1131 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x54b/0xa90 fs/read_write.c:686 ksys_write+0x145/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 103680: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:576 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x62/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2381 [inline] slab_free mm/slub.c:4643 [inline] kfree+0x18e/0x440 mm/slub.c:4842 device_release+0x9c/0x1c0 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 hci_conn_cleanup net/bluetooth/hci_conn.c:175 [inline] hci_conn_del+0x8ff/0xcb0 net/bluetooth/hci_conn.c:1173 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:47 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:548 insert_work+0x3d/0x330 kernel/workqueue.c:2183 __queue_work+0xbd9/0xfe0 kernel/workqueue.c:2345 queue_delayed_work_on+0x18b/0x280 kernel/workqueue.c:2561 pairing_complete+0x1e7/0x2b0 net/bluetooth/mgmt.c:3451 pairing_complete_cb+0x1ac/0x230 net/bluetooth/mgmt.c:3487 hci_connect_cfm include/net/bluetooth/hci_core.h:2064 [inline] hci_conn_failed+0x24d/0x310 net/bluetooth/hci_conn.c:1275 hci_conn_complete_evt+0x3c7/0x1040 net/bluetooth/hci_event.c:3199 hci_event_func net/bluetooth/hci_event.c:7477 [inline] hci_event_packet+0x7e0/0x1200 net/bluetooth/hci_event.c:7531 hci_rx_work+0x46a/0xe80 net/bluetooth/hci_core.c:4070 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 home/kwqcheii/source/fuzzing/kernel/kasan/linux-6.16-rc7/arch/x86/entry/entry_64.S:245 Fixes: aef2aa4 ("Bluetooth: hci_event: Fix creating hci_conn object on error status") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Ido Schimmel says: ==================== nexthop: Various fixes Patch #1 fixes a NPD that was recently reported by syzbot. Patch #2 fixes an issue in the existing FIB nexthop selftest. Patch #3 extends the selftest with test cases for the bug that was fixed in the first patch. ==================== Link: https://patch.msgid.link/20250921150824.149157-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We check the version of SPE twice, and we'll add one more check in the next commit so factor out a macro to do this. Change the #3 magic number to the actual SPE version define (V1p2) to make it more readable. No functional changes intended. Tested-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
Write combining is an optimization feature in CPUs that is frequently used by modern devices to generate 32 or 64 byte TLPs at the PCIe level. These large TLPs allow certain optimizations in the driver to HW communication that improve performance. As WC is unpredictable and optional the HW designs all tolerate cases where combining doesn't happen and simply experience a performance degradation. Unfortunately many virtualization environments on all architectures have done things that completely disable WC inside the VM with no generic way to detect this. For example WC was fully blocked in ARM64 KVM until commit 8c47ce3 ("KVM: arm64: Set io memory s2 pte as normalnc for vfio pci device"). Trying to use WC when it is known not to work has a measurable performance cost (~5%). Long ago mlx5 developed an boot time algorithm to test if WC is available or not by using unique mlx5 HW features to measure how many large TLPs the device is receiving. The SW generates a large number of combining opportunities and if any succeed then WC is declared working. In mlx5 the WC optimization feature is never used by the kernel except for the boot time test. The WC is only used by userspace in rdma-core. Sadly modern ARM CPUs, especially NVIDIA Grace, have a combining implementation that is very unreliable compared to pretty much everything prior. This is being fixed architecturally in new CPUs with a new ST64B instruction, but current shipping devices suffer this problem. Unreliable means the SW can present thousands of combining opportunities and the HW will not combine for any of them, which creates a performance degradation, and critically fails the mlx5 boot test. However, the CPU is very sensitive to the instruction sequence used, with the better options being sufficiently good that the performance loss from the unreliable CPU is not measurable. Broadly there are several options, from worst to best: 1) A C loop doing a u64 memcpy. This was used prior to commit ef30228 ("IB/mlx5: Use __iowrite64_copy() for write combining stores") and failed almost all the time on Grace CPUs. 2) ARM64 assembly with consecutive 8 byte stores. This was implemented as an arch-generic __iowriteXX_copy() family of functions suitable for performance use in drivers for WC. commit ead7911 ("arm64/io: Provide a WC friendly __iowriteXX_copy()") provided the ARM implementation. 3) ARM64 assembly with consecutive 16 byte stores. This was rejected from kernel use over fears of virtualization failures. Common ARM VMMs will crash if STP is used against emulated memory. 4) A single NEON store instruction. Userspace has used this option for a very long time, it performs well. 5) For future silicon the new ST64B instruction is guaranteed to generate a 64 byte TLP 100% of the time The past upgrade from #1 to #2 was thought to be sufficient to solve this problem. However, more testing on more systems shows that #3 is still problematic at a low frequency and the kernel test fails. Thus, make the mlx5 use the same instructions as userspace during the boot time WC self test. This way the WC test matches the userspace and will properly detect the ability of HW to support the WC workload that userspace will generate. While #4 still has imperfect combining performance, it is substantially better than #2, and does actually give a performance win to applications. Self-test failures with #2 are like 3/10 boots, on some systems, #4 has never seen a boot failure. There is no real general use case for a NEON based WC flow in the kernel. This is not suitable for any performance path work as getting into/out of a NEON context is fairly expensive compared to the gain of WC. Future CPUs are going to fix this issue by using an new ARM instruction and __iowriteXX_copy() will be updated to use that automatically, probably using the ALTERNATES mechanism. Since this problem is constrained to mlx5's unique situation of needing a non-performance code path to duplicate what mlx5 userspace is doing as a matter of self-testing, implement it as a one line inline assembly in the driver directly. Lastly, this was concluded from the discussion with ARM maintainers which confirms that this is the best approach for the solution: https://lore.kernel.org/r/aHqN_hpJl84T1Usi@arm.com Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1759093688-841357-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are affected by Shadow Stacks and/or Indirect Branch Tracking when said features are enabled in the guest, as fully emulating CET would require significant complexity for no practical benefit (KVM shouldn't need to emulate branch instructions on modern hosts). Simply doing nothing isn't an option as that would allow a malicious entity to subvert CET protections via the emulator. To detect instructions that are subject to IBT or affect IBT state, use the existing IsBranch flag along with the source operand type to detect indirect branches, and the existing NearBranch flag to detect far JMPs and CALLs, all of which are effectively indirect. Explicitly check for emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches) instead of adding another flag, e.g. IsRet, as it's unlikely the emulator will ever need to check for return-like instructions outside of this one specific flow. Use an allow-list instead of a deny-list because (a) it's a shorter list and (b) so that a missed entry gets a false positive, not a false negative (i.e. reject emulation instead of clobbering CET state). For Shadow Stacks, explicitly track instructions that directly affect the current SSP, as KVM's emulator doesn't have existing flags that can be used to precisely detect such instructions. Alternatively, the em_xxx() helpers could directly check for ShadowStack interactions, but using a dedicated flag is arguably easier to audit, and allows for handling both IBT and SHSTK in one fell swoop. Note! On far transfers, do NOT consult the current privilege level and instead treat SHSTK/IBT as being enabled if they're enabled for User *or* Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT can be in play for the target privilege level, i.e. checking the current privilege could get a false negative, and KVM doesn't know the target privilege level until emulation gets under way. Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP as SHSTK, which would be rather confusing and would result in FAR JMP being rejected unnecessarily the vast majority of the time (ignoring that it's unlikely to ever be emulated). A future commit will add the #GP(0) check for the specific FAR JMP scenario. Note #3, task switches also modify SSP and so need to be rejected. That too will be addressed in a future commit. Suggested-by: Chao Gao <chao.gao@intel.com> Originally-by: Yang Weijiang <weijiang.yang@intel.com> Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs. Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable() however. In particular when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs() there was still no locking around the pci_iov_remove_virtfn() calls. On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed: PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56) GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001 00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480 0000000000000001 0000000000000000 0000000000000000 0000000180692828 00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8 #0 [3800313fb20] device_del at c9158ad5c #1 [3800313fb88] pci_remove_bus_device at c915105ba #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198 #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0 #4 [3800313fc60] zpci_bus_remove_device at c90fb6104 #5 [3800313fca0] __zpci_event_availability at c90fb3dca #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2 #7 [3800313fd60] crw_collect_info at c91905822 #8 [3800313fe10] kthread at c90feb390 #9 [3800313fe68] __ret_from_fork at c90f6aa64 #10 [3800313fe98] ret_from_fork at c9194f3f2. This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. This being the reverse operation to the hotplug events generated by sriov_enable() and handled via pdev->no_vf_scan. And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy. Other races may also be possible of course though given that this lack of locking persisted so long observable races seem very rare. Even on s390 the list corruption was only observed with certain devices since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way the locking is missing so fix this by adding it to the sriov_del_vfs() helper. Just like PCI rescan-remove, locking is also missing in sriov_add_vfs() including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking. Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.17, round #3 - Invalidate nested MMUs upon freeing the PGD to avoid WARNs when visiting from an MMU notifier - Fixes to the TLB match process and TLB invalidation range for managing the VCNR pseudo-TLB - Prevent SPE from erroneously profiling guests due to UNKNOWN reset values in PMSCR_EL1 - Fix save/restore of host MDCR_EL2 to account for eagerly programming at vcpu_load() on VHE systems - Correct lock ordering when dealing with VGIC LPIs, avoiding scenarios where an xarray's spinlock was nested with a *raw* spinlock - Permit stage-2 read permission aborts which are possible in the case of NV depending on the guest hypervisor's stage-2 translation - Call raw_spin_unlock() instead of the internal spinlock API - Fix parameter ordering when assigning VBAR_EL1 [Pull into kvm/master to fix conflicts. - Paolo]
The ns_bpf_qdisc selftest triggers a kernel panic: Oops[#1]: CPU 0 Unable to handle kernel paging request at virtual address 0000000000741d58, era == 90000000851b5ac0, ra == 90000000851b5aa4 CPU: 0 UID: 0 PID: 449 Comm: test_progs Tainted: G OE 6.16.0+ #3 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU QEMU Virtual Machine, BIOS unknown 2/2/2022 pc 90000000851b5ac0 ra 90000000851b5aa4 tp 90000001076b8000 sp 90000001076bb600 a0 0000000000741ce8 a1 0000000000000001 a2 90000001076bb5c0 a3 0000000000000008 a4 90000001004c4620 a5 9000000100741ce8 a6 0000000000000000 a7 0100000000000000 t0 0000000000000010 t1 0000000000000000 t2 9000000104d24d30 t3 0000000000000001 t4 4f2317da8a7e08c4 t5 fffffefffc002f00 t6 90000001004c4620 t7 ffffffffc61c5b3d t8 0000000000000000 u0 0000000000000001 s9 0000000000000050 s0 90000001075bc800 s1 0000000000000040 s2 900000010597c400 s3 0000000000000008 s4 90000001075bc880 s5 90000001075bc8f0 s6 0000000000000000 s7 0000000000741ce8 s8 0000000000000000 ra: 90000000851b5aa4 __qdisc_run+0xac/0x8d8 ERA: 90000000851b5ac0 __qdisc_run+0xc8/0x8d8 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 00000004 (PPLV0 +PIE -PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000741d58 PRID: 0014c010 (Loongson-64bit, Loongson-3A5000) Modules linked in: bpf_testmod(OE) [last unloaded: bpf_testmod(OE)] Process test_progs (pid: 449, threadinfo=000000009af02b3a, task=00000000e9ba4956) Stack : 0000000000000000 90000001075bc8ac 90000000869524a8 9000000100741ce8 90000001075bc800 9000000100415300 90000001075bc8ac 0000000000000000 900000010597c400 900000008694a000 0000000000000000 9000000105b59000 90000001075bc800 9000000100741ce8 0000000000000050 900000008513000c 9000000086936000 0000000100094d4c fffffff400676208 0000000000000000 9000000105b59000 900000008694a000 9000000086bf0dc0 9000000105b59000 9000000086bf0d68 9000000085147010 90000001075be788 0000000000000000 9000000086bf0f98 0000000000000001 0000000000000010 9000000006015840 0000000000000000 9000000086be6c40 0000000000000000 0000000000000000 0000000000000000 4f2317da8a7e08c4 0000000000000101 4f2317da8a7e08c4 ... Call Trace: [<90000000851b5ac0>] __qdisc_run+0xc8/0x8d8 [<9000000085130008>] __dev_queue_xmit+0x578/0x10f0 [<90000000853701c0>] ip6_finish_output2+0x2f0/0x950 [<9000000085374bc8>] ip6_finish_output+0x2b8/0x448 [<9000000085370b24>] ip6_xmit+0x304/0x858 [<90000000853c4438>] inet6_csk_xmit+0x100/0x170 [<90000000852b32f0>] __tcp_transmit_skb+0x490/0xdd0 [<90000000852b47fc>] tcp_connect+0xbcc/0x1168 [<90000000853b9088>] tcp_v6_connect+0x580/0x8a0 [<90000000852e7738>] __inet_stream_connect+0x170/0x480 [<90000000852e7a98>] inet_stream_connect+0x50/0x88 [<90000000850f2814>] __sys_connect+0xe4/0x110 [<90000000850f2858>] sys_connect+0x18/0x28 [<9000000085520c94>] do_syscall+0x94/0x1a0 [<9000000083df1fb8>] handle_syscall+0xb8/0x158 Code: 4001ad80 2400873f 2400832d <240073cc> 001137ff 001133ff 6407b41f 001503cc 0280041d ---[ end trace 0000000000000000 ]--- The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer is treated as a 32bit value and sign extend to 64bit in epilogue. This behavior is right for most bpf prog types but wrong for struct ops which requires LoongArch ABI. So let's sign extend struct ops return values according to the LoongArch ABI ([1]) and return value spec in function model. [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html Cc: stable@vger.kernel.org Fixes: 6abf17d ("LoongArch: BPF: Add struct ops support for trampoline") Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.
If the device list is too long, this will lock more device mutexes than
lockdep can handle:
unshare -n \
bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by kworker/u16:1/69:
#0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
#1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
#2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
#3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
#4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]
Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.
Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.
Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20251013185052.14021-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expand the prefault memory selftest to add a regression test for a KVM bug where KVM's retry logic would result in (breakable) deadlock due to the memslot deletion waiting on prefaulting to release SRCU, and prefaulting waiting on the memslot to fully disappear (KVM uses a two-step process to delete memslots, and KVM x86 retries page faults if a to-be-deleted, a.k.a. INVALID, memslot is encountered). To exercise concurrent memslot remove, spawn a second thread to initiate memslot removal at roughly the same time as prefaulting. Test memslot removal for all testcases, i.e. don't limit concurrent removal to only the success case. There are essentially three prefault scenarios (so far) that are of interest: 1. Success 2. ENOENT due to no memslot 3. EAGAIN due to INVALID memslot For all intents and purposes, #1 and #2 are mutually exclusive, or rather, easier to test via separate testcases since writing to non-existent memory is trivial. But for #3, making it mutually exclusive with #1 _or_ #2 is actually more complex than testing memslot removal for all scenarios. The only requirement to let memslot removal coexist with other scenarios is a way to guarantee a stable result, e.g. that the "no memslot" test observes ENOENT, not EAGAIN, for the final checks. So, rather than make memslot removal mutually exclusive with the ENOENT scenario, simply restore the memslot and retry prefaulting. For the "no memslot" case, KVM_PRE_FAULT_MEMORY should be idempotent, i.e. should always fail with ENOENT regardless of how many times userspace attempts prefaulting. Pass in both the base GPA and the offset (instead of the "full" GPA) so that the worker can recreate the memslot. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250924174255.2141847-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
The original code causes a circular locking dependency found by lockdep. ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S U ------------------------------------------------------ xe_fault_inject/5091 is trying to acquire lock: ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660 but task is already holding lock: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&devcd->mutex){+.+.}-{3:3}: mutex_lock_nested+0x4e/0xc0 devcd_data_write+0x27/0x90 sysfs_kf_bin_write+0x80/0xf0 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (kn->active#236){++++}-{0:0}: kernfs_drain+0x1e2/0x200 __kernfs_remove+0xae/0x400 kernfs_remove_by_name_ns+0x5d/0xc0 remove_files+0x54/0x70 sysfs_remove_group+0x3d/0xa0 sysfs_remove_groups+0x2e/0x60 device_remove_attrs+0xc7/0x100 device_del+0x15d/0x3b0 devcd_del+0x19/0x30 process_one_work+0x22b/0x6f0 worker_thread+0x1e8/0x3d0 kthread+0x11c/0x250 ret_from_fork+0x26c/0x2e0 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}: __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 __flush_work+0x27a/0x660 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&devcd->mutex); lock(kn->active#236); lock(&devcd->mutex); lock((work_completion)(&(&devcd->del_wk)->work)); *** DEADLOCK *** 5 locks held by xe_fault_inject/5091: #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660 stack backtrace: CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S U 6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)} Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021 Call Trace: <TASK> dump_stack_lvl+0x91/0xf0 dump_stack+0x10/0x20 print_circular_bug+0x285/0x360 check_noncircular+0x135/0x150 ? register_lock_class+0x48/0x4a0 __lock_acquire+0x1661/0x2860 lock_acquire+0xc4/0x2f0 ? __flush_work+0x25d/0x660 ? mark_held_locks+0x46/0x90 ? __flush_work+0x25d/0x660 __flush_work+0x27a/0x660 ? __flush_work+0x25d/0x660 ? trace_hardirqs_on+0x1e/0xd0 ? __pfx_wq_barrier_func+0x10/0x10 flush_delayed_work+0x5d/0xa0 dev_coredump_put+0x63/0xa0 xe_driver_devcoredump_fini+0x12/0x20 [xe] devm_action_release+0x12/0x30 release_nodes+0x3a/0x120 devres_release_all+0x8a/0xd0 device_unbind_cleanup+0x12/0x80 device_release_driver_internal+0x23a/0x280 ? bus_find_device+0xa8/0xe0 device_driver_detach+0x14/0x20 unbind_store+0xaf/0xc0 drv_attr_store+0x21/0x50 sysfs_kf_write+0x4a/0x80 kernfs_fop_write_iter+0x169/0x220 vfs_write+0x293/0x560 ksys_write+0x72/0xf0 __x64_sys_write+0x19/0x30 x64_sys_call+0x2bf/0x2660 do_syscall_64+0x93/0xb60 ? __f_unlock_pos+0x15/0x20 ? __x64_sys_getdents64+0x9b/0x130 ? __pfx_filldir64+0x10/0x10 ? do_syscall_64+0x1a2/0xb60 ? clear_bhb_loop+0x30/0x80 ? clear_bhb_loop+0x30/0x80 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x76e292edd574 Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574 RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063 R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000 </TASK> xe 0000:03:00.0: [drm] Xe device coredump has been deleted. Fixes: 01daccf ("devcoredump : Serialize devcd_del work") Cc: Mukesh Ojha <quic_mojha@quicinc.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Maarten Lankhorst <dev@lankhorst.se> Cc: Matthew Brost <matthew.brost@intel.com> Acked-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250723142416.1020423-1-dev@lankhorst.se Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ido Schimmel says:
====================
icmp: Add RFC 5837 support
tl;dr
=====
This patchset extends certain ICMP error messages (e.g., "Time
Exceeded") with incoming interface information in accordance with RFC
5837 [1]. This is required for more meaningful traceroute results in
unnumbered networks. Like other ICMP settings, the feature is controlled
via a per-{netns, address family} sysctl. The interface and the
implementation are designed to support more ICMP extensions.
Motivation
==========
Over the years, the kernel was extended with the ability to derive the
source IP of ICMP error messages from the interface that received the
datagram which elicited the ICMP error [2][3][4]. This is especially
important for "Time Exceeded" messages as it allows traceroute users to
trace the actual packet path along the network.
The above scheme does not work in unnumbered networks. In these
networks, only the loopback / VRF interface is assigned a global IP
address while router interfaces are assigned IPv6 link-local addresses.
As such, ICMP error messages are generated with a source IP derived from
the loopback / VRF interface, making it impossible to trace the actual
packet path when parallel links exist between routers.
The problem can be solved by implementing the solution proposed by RFC
4884 [5] and RFC 5837. The former defines an ICMP extension structure
that can be appended to selected ICMP messages and carry extension
objects. The latter defines an extension object called the "Interface
Information Object" (IIO) that can carry interface information (e.g.,
name, index, MTU) about interfaces with certain roles such as the
interface that received the datagram which elicited the ICMP error.
The payload of the datagram that elicited the error (potentially padded
/ trimmed) along with the ICMP extension structure will be queued to the
error queue of the originating socket, thereby allowing traceroute
applications to parse and display the information encoded in the ICMP
extension structure. Example:
# traceroute6 -e 2001:db8:1::3
traceroute to 2001:db8:1::3 (2001:db8:1::3), 30 hops max, 80 byte packets
1 2001:db8:1::2 (2001:db8:1::2) <INC:11,"eth1",mtu=1500> 0.214 ms 0.171 ms 0.162 ms
2 2001:db8:1::3 (2001:db8:1::3) <INC:12,"eth2",mtu=1500> 0.154 ms 0.135 ms 0.127 ms
# traceroute -e 192.0.2.3
traceroute to 192.0.2.3 (192.0.2.3), 30 hops max, 60 byte packets
1 192.0.2.2 (192.0.2.2) <INC:11,"eth1",mtu=1500> 0.191 ms 0.148 ms 0.144 ms
2 192.0.2.3 (192.0.2.3) <INC:12,"eth2",mtu=1500> 0.137 ms 0.122 ms 0.114 ms
Implementation
==============
As previously stated, the feature is controlled via a per-{netns,
address} sysctl. Specifically, a bit mask where each bit controls the
addition of a different ICMP extension to ICMP error messages.
Currently, only a single value is supported, to append the incoming
interface information.
Key points:
1. Global knob vs finer control. I am not aware of users who require
finer control, but it is possible that some users will want to avoid
appending ICMP extensions when the packet is sent out of a specific
interface (e.g., the management interface) or to a specific subnet. This
can be accomplished via a tc-bpf program that trims the ICMP extension
structure. An example program can be found here [6].
2. Split implementation between IPv4 / IPv6. While the implementation is
currently similar, there are some differences between both address
families. In addition, some extensions (e.g., RFC 8883 [7]) are
IPv6-specific. Given the above and given that the implementation is not
very complex, it makes sense to keep both implementations separate.
3. Compatibility with legacy applications. RFC 4884 from 2007 extended
certain ICMP messages with a length field that encodes the length of the
"original datagram" field, so that applications will be able to tell
where the "original datagram" ends and where the ICMP extension
structure starts.
Before the introduction of the IP{,6}_RECVERR_RFC4884 socket options
[8][9] in 2020 it was impossible for applications to know where the ICMP
extension structure starts and to this day some applications assume that
it starts at offset 128, which is the minimum length of the "original
datagram" field as specified by RFC 4884.
Therefore, in order to be compatible with both legacy and modern
applications, the datagram that elicited the ICMP error is trimmed /
padded to 128 bytes before appending the ICMP extension structure.
This behavior is specifically called out by RFC 4884: "Those wishing to
be backward compatible with non-compliant TRACEROUTE implementations
will include exactly 128 octets" [10].
Note that in 128 bytes we should be able to include enough headers for
the originating node to match the ICMP error message with the relevant
socket. For example, the following headers will be present in the
"original datagram" field when a VXLAN encapsulated IPv6 packet elicits
an ICMP error in an IPv6 underlay: IPv6 (40) | UDP (8) | VXLAN (8) | Eth
(14) | IPv6 (40) | UDP (8). Overall, 118 bytes.
If the 128 bytes limit proves to be insufficient for some use case, we
can consider dedicating a new bit in the previously mentioned sysctl to
allow for more bytes to be included in the "original datagram" field.
4. Extensibility. This patchset adds partial support for a single ICMP
extension. However, the interface and the implementation should be able
to support more extensions, if needed. Examples:
* More interface information objects as part of RFC 5837. We should be
able to derive the outgoing interface information and nexthop IP from
the dst entry attached to the packet that elicited the error.
* Node identification object (e.g., hostname / loopback IP) [11].
* Extended Information object which encodes aggregate header limits as
part of RFC 8883.
A previous proposal from Ishaan Gandhi and Ron Bonica is available here
[12].
Testing
=======
The existing traceroute selftest is extended to test that ICMP
extensions are reported correctly when enabled. Both address families
are tested and with different packet sizes in order to make sure that
trimming / padding works correctly. Tested that packets are parsed
correctly by the IP{,6}_RECVERR_RFC4884 socket options using Willem's
selftest [13].
Changelog
=========
Changes since v1 [14]:
* Patches #1-#2: Added a comment about field ordering and review tags.
* Patch #3: Converted "sysctl" to "echo" when testing the return value.
Added a check to skip the test if traceroute version is older
than 2.1.5.
[1] https://datatracker.ietf.org/doc/html/rfc5837
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1c2fb7f93cb20621772bf304f3dba0849942e5db
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fac6fce9bdb59837bb89930c3a92f5e0d1482f0b
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4a8c416602d97a4e2073ed563d4d4c7627de19cf
[5] https://datatracker.ietf.org/doc/html/rfc4884
[6] https://gist.github.com/idosch/5013448cdb5e9e060e6bfdc8b433577c
[7] https://datatracker.ietf.org/doc/html/rfc8883
[8] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eba75c587e811d3249c8bd50d22bb2266ccd3c0f
[9] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=01370434df85eb76ecb1527a4466013c4aca2436
[10] https://datatracker.ietf.org/doc/html/rfc4884#section-5.3
[11] https://datatracker.ietf.org/doc/html/draft-ietf-intarea-extended-icmp-nodeid-04
[12] https://lore.kernel.org/netdev/20210317221959.4410-1-ishaangandhi@gmail.com/
[13] https://lore.kernel.org/netdev/aPpMItF35gwpgzZx@shredder/
[14] https://lore.kernel.org/netdev/20251022065349.434123-1-idosch@nvidia.com/
====================
Link: https://patch.msgid.link/20251027082232.232571-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan says: ==================== bnxt_en: Bug fixes Patches 1, 3, and 4 are bug fixes related to the FW log tracing driver coredump feature recently added in 6.13. Patch #1 adds the necessary call to shutdown the FW logging DMA during PCI shutdown. Patch #3 fixes a possible null pointer derefernce when using early versions of the FW with this feature. Patch #4 adds the coredump header information unconditionally to make it more robust. Patch #2 fixes a possible memory leak during PTP shutdown. Patch #5 eliminates a dmesg warning when doing devlink reload. ==================== Link: https://patch.msgid.link/20251104005700.542174-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leon Hwang says: ==================== In the discussion thread "[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1], it was pointed out that missing calls to bpf_obj_free_fields() could lead to memory leaks. A selftest was added to confirm that this is indeed a real issue - the refcount of BPF_KPTR_REF field is not decremented when bpf_obj_free_fields() is missing after copy_map_value[,_long](). Further inspection of copy_map_value[,_long]() call sites revealed two locations affected by this issue: 1. pcpu_copy_value() 2. htab_map_update_elem() when used with BPF_F_LOCK Similar case happens when update local storage maps with BPF_F_LOCK. This series fixes the cases where BPF_F_LOCK is not involved by properly calling bpf_obj_free_fields() after copy_map_value[,_long](), and adds a selftest to verify the fix. The remaining cases involving BPF_F_LOCK will be addressed in a separate patch set after the series "bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps" is applied. Changes: v5 -> v6: * Update the test name to include "refcounted_kptr". * Update some local variables' name in the test (per Alexei). * v5: https://lore.kernel.org/bpf/20251104142714.99878-1-leon.hwang@linux.dev/ v4 -> v5: * Use a local variable to store the this_cpu_ptr()/per_cpu_ptr() result, and reuse it between copy_map_value[,_long]() and bpf_obj_free_fields() in patch #1 (per Andrii). * Drop patch #2 and #3, because the combination of BPF_F_LOCK with other special fields (except for BPF_SPIN_LOCK) will be disallowed on the UAPI side in the future (per Alexei). * v4: https://lore.kernel.org/bpf/20251030152451.62778-1-leon.hwang@linux.dev/ v3 -> v4: * Target bpf-next tree. * Address comments from Amery: * Drop 'bpf_obj_free_fields()' in the path of updating local storage maps without BPF_F_LOCK. * Drop the corresponding self test. * Respin the other test of local storage maps using syscall BPF programs. * v3: https://lore.kernel.org/bpf/20251026154000.34151-1-leon.hwang@linux.dev/ v2 -> v3: * Free special fields when update local storage maps without BPF_F_LOCK. * Add test to verify decrementing refcount when update cgroup local storage maps without BPF_F_LOCK. * Address review from AI bot: * Slow path with BPF_F_LOCK (around line 642-646) in 'bpf_local_storage.c'. * v2: https://lore.kernel.org/bpf/20251020164608.20536-1-leon.hwang@linux.dev/ v1 -> v2: * Add test to verify decrementing refcount when update cgroup local storage maps with BPF_F_LOCK. * Address review from AI bot: * Fast path without bucket lock (around line 610) in 'bpf_local_storage.c'. * v1: https://lore.kernel.org/bpf/20251016145801.47552-1-leon.hwang@linux.dev/ ==================== Link: https://patch.msgid.link/20251105151407.12723-1-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD guest attempts to execute SEAMCALL or TDCALL. Neither SEAMCALL nor TDCALL is gated by any software enablement other than VMXON, and so will generate a VM-Exit instead of e.g. a native #UD when executed from the guest kernel. Note! No unprivileged DoS of the L1 kernel is possible as TDCALL and SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can crash itself, but not L1. Note #2! The Intel® Trust Domain CPU Architectural Extensions spec's pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit, but that appears to be a documentation bug (likely because the CPL > 0 check was incorrectly bundled with other lower-priority #GP checks). Testing on SPR and EMR shows that the CPL > 0 check is performed before the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode. Note #3! The aforementioned Trust Domain spec uses confusing pseudocode that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM" specifically means in SEAM Root Mode, i.e. in the TDX-Module. The long- form description explicitly states that SEAMCALL generates an exit when executed in "SEAM VMX non-root operation". But that's a moot point as the TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as documented in the "Unconditionally Blocked Instructions" section of the TDX-Module base specification. Cc: stable@vger.kernel.org Cc: Kai Huang <kai.huang@intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20251016182148.69085-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
On completion of i915_vma_pin_ww(), a synchronous variant of dma_fence_work_commit() is called. When pinning a VMA to GGTT address space on a Cherry View family processor, or on a Broxton generation SoC with VTD enabled, i.e., when stop_machine() is then called from intel_ggtt_bind_vma(), that can potentially lead to lock inversion among reservation_ww and cpu_hotplug locks. [86.861179] ====================================================== [86.861193] WARNING: possible circular locking dependency detected [86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 Tainted: G U [86.861226] ------------------------------------------------------ [86.861238] i915_module_loa/1432 is trying to acquire lock: [86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50 [86.861290] but task is already holding lock: [86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.862233] which lock already depends on the new lock. [86.862251] the existing dependency chain (in reverse order) is: [86.862265] -> #5 (reservation_ww_class_mutex){+.+.}-{3:3}: [86.862292] dma_resv_lockdep+0x19a/0x390 [86.862315] do_one_initcall+0x60/0x3f0 [86.862334] kernel_init_freeable+0x3cd/0x680 [86.862353] kernel_init+0x1b/0x200 [86.862369] ret_from_fork+0x47/0x70 [86.862383] ret_from_fork_asm+0x1a/0x30 [86.862399] -> #4 (reservation_ww_class_acquire){+.+.}-{0:0}: [86.862425] dma_resv_lockdep+0x178/0x390 [86.862440] do_one_initcall+0x60/0x3f0 [86.862454] kernel_init_freeable+0x3cd/0x680 [86.862470] kernel_init+0x1b/0x200 [86.862482] ret_from_fork+0x47/0x70 [86.862495] ret_from_fork_asm+0x1a/0x30 [86.862509] -> #3 (&mm->mmap_lock){++++}-{3:3}: [86.862531] down_read_killable+0x46/0x1e0 [86.862546] lock_mm_and_find_vma+0xa2/0x280 [86.862561] do_user_addr_fault+0x266/0x8e0 [86.862578] exc_page_fault+0x8a/0x2f0 [86.862593] asm_exc_page_fault+0x27/0x30 [86.862607] filldir64+0xeb/0x180 [86.862620] kernfs_fop_readdir+0x118/0x480 [86.862635] iterate_dir+0xcf/0x2b0 [86.862648] __x64_sys_getdents64+0x84/0x140 [86.862661] x64_sys_call+0x1058/0x2660 [86.862675] do_syscall_64+0x91/0xe90 [86.862689] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.862703] -> #2 (&root->kernfs_rwsem){++++}-{3:3}: [86.862725] down_write+0x3e/0xf0 [86.862738] kernfs_add_one+0x30/0x3c0 [86.862751] kernfs_create_dir_ns+0x53/0xb0 [86.862765] internal_create_group+0x134/0x4c0 [86.862779] sysfs_create_group+0x13/0x20 [86.862792] topology_add_dev+0x1d/0x30 [86.862806] cpuhp_invoke_callback+0x4b5/0x850 [86.862822] cpuhp_issue_call+0xbf/0x1f0 [86.862836] __cpuhp_setup_state_cpuslocked+0x111/0x320 [86.862852] __cpuhp_setup_state+0xb0/0x220 [86.862866] topology_sysfs_init+0x30/0x50 [86.862879] do_one_initcall+0x60/0x3f0 [86.862893] kernel_init_freeable+0x3cd/0x680 [86.862908] kernel_init+0x1b/0x200 [86.862921] ret_from_fork+0x47/0x70 [86.862934] ret_from_fork_asm+0x1a/0x30 [86.862947] -> #1 (cpuhp_state_mutex){+.+.}-{3:3}: [86.862969] __mutex_lock+0xaa/0xed0 [86.862982] mutex_lock_nested+0x1b/0x30 [86.862995] __cpuhp_setup_state_cpuslocked+0x67/0x320 [86.863012] __cpuhp_setup_state+0xb0/0x220 [86.863026] page_alloc_init_cpuhp+0x2d/0x60 [86.863041] mm_core_init+0x22/0x2d0 [86.863054] start_kernel+0x576/0xbd0 [86.863068] x86_64_start_reservations+0x18/0x30 [86.863084] x86_64_start_kernel+0xbf/0x110 [86.863098] common_startup_64+0x13e/0x141 [86.863114] -> #0 (cpu_hotplug_lock){++++}-{0:0}: [86.863135] __lock_acquire+0x1635/0x2810 [86.863152] lock_acquire+0xc4/0x2f0 [86.863166] cpus_read_lock+0x41/0x100 [86.863180] stop_machine+0x1c/0x50 [86.863194] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.863987] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.864735] __vma_bind+0x55/0x70 [i915] [86.865510] fence_work+0x26/0xa0 [i915] [86.866248] fence_notify+0xa1/0x140 [i915] [86.866983] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.867719] i915_sw_fence_commit+0x39/0x60 [i915] [86.868453] i915_vma_pin_ww+0x462/0x1360 [i915] [86.869228] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.870001] initial_plane_vma+0x307/0x840 [i915] [86.870774] intel_initial_plane_config+0x33f/0x670 [i915] [86.871546] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.872330] i915_driver_probe+0x7fa/0xe80 [i915] [86.873057] i915_pci_probe+0xe6/0x220 [i915] [86.873782] local_pci_probe+0x47/0xb0 [86.873802] pci_device_probe+0xf3/0x260 [86.873817] really_probe+0xf1/0x3c0 [86.873833] __driver_probe_device+0x8c/0x180 [86.873848] driver_probe_device+0x24/0xd0 [86.873862] __driver_attach+0x10f/0x220 [86.873876] bus_for_each_dev+0x7f/0xe0 [86.873892] driver_attach+0x1e/0x30 [86.873904] bus_add_driver+0x151/0x290 [86.873917] driver_register+0x5e/0x130 [86.873931] __pci_register_driver+0x7d/0x90 [86.873945] i915_pci_register_driver+0x23/0x30 [i915] [86.874678] i915_init+0x37/0x120 [i915] [86.875347] do_one_initcall+0x60/0x3f0 [86.875369] do_init_module+0x97/0x2a0 [86.875385] load_module+0x2c54/0x2d80 [86.875398] init_module_from_file+0x96/0xe0 [86.875413] idempotent_init_module+0x117/0x330 [86.875426] __x64_sys_finit_module+0x77/0x100 [86.875440] x64_sys_call+0x24de/0x2660 [86.875454] do_syscall_64+0x91/0xe90 [86.875470] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.875486] other info that might help us debug this: [86.875502] Chain exists of: cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex [86.875539] Possible unsafe locking scenario: [86.875552] CPU0 CPU1 [86.875563] ---- ---- [86.875573] lock(reservation_ww_class_mutex); [86.875588] lock(reservation_ww_class_acquire); [86.875606] lock(reservation_ww_class_mutex); [86.875624] rlock(cpu_hotplug_lock); [86.875637] *** DEADLOCK *** [86.875650] 3 locks held by i915_module_loa/1432: [86.875663] #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220 [86.875699] #1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.876512] #2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.877305] stack backtrace: [86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G U 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 PREEMPT(voluntary) [86.877334] Tainted: [U]=USER [86.877336] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020 [86.877339] Call Trace: [86.877344] <TASK> [86.877353] dump_stack_lvl+0x91/0xf0 [86.877364] dump_stack+0x10/0x20 [86.877369] print_circular_bug+0x285/0x360 [86.877379] check_noncircular+0x135/0x150 [86.877390] __lock_acquire+0x1635/0x2810 [86.877403] lock_acquire+0xc4/0x2f0 [86.877408] ? stop_machine+0x1c/0x50 [86.877422] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878173] cpus_read_lock+0x41/0x100 [86.878182] ? stop_machine+0x1c/0x50 [86.878191] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878916] stop_machine+0x1c/0x50 [86.878927] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.879652] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.880375] __vma_bind+0x55/0x70 [i915] [86.881133] fence_work+0x26/0xa0 [i915] [86.881851] fence_notify+0xa1/0x140 [i915] [86.882566] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.883286] i915_sw_fence_commit+0x39/0x60 [i915] [86.884003] i915_vma_pin_ww+0x462/0x1360 [i915] [86.884756] ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915] [86.885513] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.886281] initial_plane_vma+0x307/0x840 [i915] [86.887049] intel_initial_plane_config+0x33f/0x670 [i915] [86.887819] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.888587] i915_driver_probe+0x7fa/0xe80 [i915] [86.889293] ? mutex_unlock+0x12/0x20 [86.889301] ? drm_privacy_screen_get+0x171/0x190 [86.889308] ? acpi_dev_found+0x66/0x80 [86.889321] i915_pci_probe+0xe6/0x220 [i915] [86.890038] local_pci_probe+0x47/0xb0 [86.890049] pci_device_probe+0xf3/0x260 [86.890058] really_probe+0xf1/0x3c0 [86.890067] __driver_probe_device+0x8c/0x180 [86.890072] driver_probe_device+0x24/0xd0 [86.890078] __driver_attach+0x10f/0x220 [86.890083] ? __pfx___driver_attach+0x10/0x10 [86.890088] bus_for_each_dev+0x7f/0xe0 [86.890097] driver_attach+0x1e/0x30 [86.890101] bus_add_driver+0x151/0x290 [86.890107] driver_register+0x5e/0x130 [86.890113] __pci_register_driver+0x7d/0x90 [86.890119] i915_pci_register_driver+0x23/0x30 [i915] [86.890833] i915_init+0x37/0x120 [i915] [86.891482] ? __pfx_i915_init+0x10/0x10 [i915] [86.892135] do_one_initcall+0x60/0x3f0 [86.892145] ? __kmalloc_cache_noprof+0x33f/0x470 [86.892157] do_init_module+0x97/0x2a0 [86.892164] load_module+0x2c54/0x2d80 [86.892168] ? __kernel_read+0x15c/0x300 [86.892185] ? kernel_read_file+0x2b1/0x320 [86.892195] init_module_from_file+0x96/0xe0 [86.892199] ? init_module_from_file+0x96/0xe0 [86.892211] idempotent_init_module+0x117/0x330 [86.892224] __x64_sys_finit_module+0x77/0x100 [86.892230] x64_sys_call+0x24de/0x2660 [86.892236] do_syscall_64+0x91/0xe90 [86.892243] ? irqentry_exit+0x77/0xb0 [86.892249] ? sysvec_apic_timer_interrupt+0x57/0xc0 [86.892256] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.892261] RIP: 0033:0x7303e1b2725d [86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48 [86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d [86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c [86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80 [86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0 [86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710 [86.892304] </TASK> Call asynchronous variant of dma_fence_work_commit() in that case. v3: Provide more verbose in-line comment (Andi), - mention target environments in commit message. Fixes: 7d1c261 ("drm/i915: Take reservation lock around i915_vma_pin.") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985 Cc: Andi Shyti <andi.shyti@kernel.org> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com> Acked-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20251023082925.351307-6-janusz.krzysztofik@linux.intel.com (cherry picked from commit 648ef13) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.18, take #3 - Only adjust the ID registers when no irqchip has been created once per VM run, instead of doing it once per vcpu, as this otherwise triggers a pretty bad conbsistency check failure in the sysreg code. - Make sure the per-vcpu Fine Grain Traps are computed before we load the system registers on the HW, as we otherwise start running without anything set until the first preemption of the vcpu.
For some reason, of_find_node_with_property() is creating a spinlock recursion issue along with fwnode_count_parents(), and this issue is making all MediaTek boards unbootable. As of kernel v6.18-rc6, there are only three users of this function, one of which is this driver. Migrate away from of_find_node_with_property() by adding a local scpsys_get_legacy_regmap_node() function, which acts similarly to of_find_node_with_property(), and calling the former in place of the latter. This resolves the following spinlock recursion issue: [ 1.773979] BUG: spinlock recursion on CPU#2, kworker/u24:1/60 [ 1.790485] lock: devtree_lock+0x0/0x40, .magic: dead4ead, .owner: kworker/u24:1/60, .owner_cpu: 2 [ 1.791644] CPU: 2 UID: 0 PID: 60 Comm: kworker/u24:1 Tainted: G W 6.18.0-rc6 #3 PREEMPT [ 1.791649] Tainted: [W]=WARN [ 1.791650] Hardware name: MediaTek Genio-510 EVK (DT) [ 1.791653] Workqueue: events_unbound deferred_probe_work_func [ 1.791658] Call trace: [ 1.791659] show_stack+0x18/0x30 (C) [ 1.791664] dump_stack_lvl+0x68/0x94 [ 1.791668] dump_stack+0x18/0x24 [ 1.791672] spin_dump+0x78/0x88 [ 1.791678] do_raw_spin_lock+0x110/0x140 [ 1.791684] _raw_spin_lock_irqsave+0x58/0x6c [ 1.791690] of_get_parent+0x28/0x74 [ 1.791694] of_fwnode_get_parent+0x38/0x7c [ 1.791700] fwnode_count_parents+0x34/0xf0 [ 1.791705] fwnode_full_name_string+0x28/0x120 [ 1.791710] device_node_string+0x3e4/0x50c [ 1.791715] pointer+0x294/0x430 [ 1.791718] vsnprintf+0x21c/0x5bc [ 1.791722] vprintk_store+0x108/0x47c [ 1.791728] vprintk_emit+0xc4/0x350 [ 1.791732] vprintk_default+0x34/0x40 [ 1.791736] vprintk+0x24/0x30 [ 1.791740] _printk+0x60/0x8c [ 1.791744] of_node_release+0x154/0x194 [ 1.791749] kobject_put+0xa0/0x120 [ 1.791753] of_node_put+0x18/0x28 [ 1.791756] of_find_node_with_property+0x74/0x100 [ 1.791761] scpsys_probe+0x338/0x5e0 [ 1.791765] platform_probe+0x5c/0xa4 [ 1.791770] really_probe+0xbc/0x2ac [ 1.791774] __driver_probe_device+0x78/0x118 [ 1.791779] driver_probe_device+0x3c/0x170 [ 1.791783] __device_attach_driver+0xb8/0x150 [ 1.791788] bus_for_each_drv+0x88/0xe8 [ 1.791792] __device_attach+0x9c/0x1a0 [ 1.791796] device_initial_probe+0x14/0x20 [ 1.791801] bus_probe_device+0xa0/0xa4 [ 1.791805] deferred_probe_work_func+0x88/0xd0 [ 1.791809] process_one_work+0x1e8/0x448 [ 1.791813] worker_thread+0x1ac/0x340 [ 1.791816] kthread+0x138/0x220 [ 1.791821] ret_from_fork+0x10/0x20 Fixes: c29345f ("pmdomain: mediatek: Refactor bus protection regmaps retrieval") Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Tested-by: Louis-Alexis Eyraud <louisalexis.eyraud@collabora.com> Tested-by: Macpaul Lin <macpaul.lin@mediatek.com> Reviewed-by: Macpaul Lin <macpaul.lin@mediatek.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Marc Kleine-Budde <mkl@pengutronix.de> says: Similarly to how CAN FD reuses the bittiming logic of Classical CAN, CAN XL also reuses the entirety of CAN FD features, and, on top of that, adds new features which are specific to CAN XL. A so-called 'mixed-mode' is intended to have (XL-tolerant) CAN FD nodes and CAN XL nodes on one CAN segment, where the FD-controllers can talk CC/FD and the XL-controllers can talk CC/FD/XL. This mixed-mode utilizes the known error-signalling (ES) for sending CC/FD/XL frames. For CAN FD and CAN XL the tranceiver delay compensation (TDC) is supported to use common CAN and CAN-SIG transceivers. The CANXL-only mode disables the error-signalling in the CAN XL controller. This mode does not allow CC/FD frames to be sent but additionally offers a CAN XL transceiver mode switching (TMS) to send CAN XL frames with up to 20Mbit/s data rate. The TMS utilizes a PWM configuration which is added to the netlink interface. Configured with CAN_CTRLMODE_FD and CAN_CTRLMODE_XL this leads to: FD=0 XL=0 CC-only mode (ES=1) FD=1 XL=0 FD/CC mixed-mode (ES=1) FD=1 XL=1 XL/FD/CC mixed-mode (ES=1) FD=0 XL=1 XL-only mode (ES=0, TMS optional) Patch #1 print defined ctrlmode strings capitalized to increase the readability and to be in line with the 'ip' tool (iproute2). Patch #2 is a small clean-up which makes can_calc_bittiming() use NL_SET_ERR_MSG() instead of netdev_err(). Patch #3 adds a check in can_dev_dropped_skb() to drop CAN FD frames when CAN FD is turned off. Patch #4 adds CAN_CTRLMODE_RESTRICTED. Note that contrary to the other CAN_CTRL_MODE_XL_* that are introduced in the later patches, this control mode is not specific to CAN XL. The nuance is that because this restricted mode was only added in ISO 11898-1:2024, it is made mandatory for CAN XL devices but optional for other protocols. This is why this patch is added as a preparation before introducing the core CAN XL logic. Patch #5 adds all the CAN XL features which are inherited from CAN FD: the nominal bittiming, the data bittiming and the TDC. Patch #6 add a new CAN_CTRLMODE_XL_TMS control mode which is specific to CAN XL to enable the transceiver mode switching (TMS) in XL-only mode. Patch #7 adds a check in can_dev_dropped_skb() to drop CAN CC/FD frames when the CAN XL controller is in CAN XL-only mode. The introduced can_dev_in_xl_only_mode() function also determines the error-signalling configuration for the CAN XL controllers. Patch #8 to #11 add the PWM logic for the CAN XL TMS mode. Patch #12 to #14 add different default sample-points for standard CAN and CAN SIG transceivers (with TDC) and CAN XL transceivers using PWM in the CAN XL TMS mode. Patch #15 add a dummy_can driver for netlink testing and debugging. Patch #16 check CAN frame type (CC/FD/XL) when writing those frames to the CAN_RAW socket and reject them if it's not supported by the CAN interface. Patch #17 increase the resolution when printing the bitrate error and round-up the value to 0.01% in the case the resolution would still provide values which would lead to 0.00%. Link: https://patch.msgid.link/20251126-canxl-v8-0-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When interrupting perf stat in repeat mode with a signal the signal is passed to the child process but the repeat doesn't terminate: ``` $ perf stat -v --null --repeat 10 sleep 1 Control descriptor is not initialized [ perf stat: executing run #1 ... ] [ perf stat: executing run #2 ... ] ^Csleep: Interrupt [ perf stat: executing run #3 ... ] [ perf stat: executing run #4 ... ] [ perf stat: executing run #5 ... ] [ perf stat: executing run #6 ... ] [ perf stat: executing run #7 ... ] [ perf stat: executing run #8 ... ] [ perf stat: executing run #9 ... ] [ perf stat: executing run #10 ... ] Performance counter stats for 'sleep 1' (10 runs): 0.9500 +- 0.0512 seconds time elapsed ( +- 5.39% ) 0.01user 0.02system 0:09.53elapsed 0%CPU (0avgtext+0avgdata 18940maxresident)k 29944inputs+0outputs (0major+2629minor)pagefaults 0swaps ``` Terminate the repeated run and give a reasonable exit value: ``` $ perf stat -v --null --repeat 10 sleep 1 Control descriptor is not initialized [ perf stat: executing run #1 ... ] [ perf stat: executing run #2 ... ] [ perf stat: executing run #3 ... ] ^Csleep: Interrupt Performance counter stats for 'sleep 1' (10 runs): 0.680 +- 0.321 seconds time elapsed ( +- 47.16% ) Command exited with non-zero status 130 0.00user 0.01system 0:02.05elapsed 0%CPU (0avgtext+0avgdata 70688maxresident)k 0inputs+0outputs (0major+5002minor)pagefaults 0swaps ``` Note, this also changes the exit value for non-repeat runs when interrupted by a signal. Reported-by: Ingo Molnar <mingo@kernel.org> Closes: https://lore.kernel.org/lkml/aS5wjmbAM9ka3M2g@gmail.com/ Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Ihor Solodrai says: ==================== resolve_btfids: Support for BTF modifications This series changes resolve_btfids and kernel build scripts to enable BTF transformations in resolve_btfids. Main motivation for enhancing resolve_btfids is to reduce dependency of the kernel build on pahole capabilities [1] and enable BTF features and optimizations [2][3] particular to the kernel. Patches #1-#4 in the series are non-functional changes in resolve_btfids. Patch #5 makes kernel build notice pahole version changes between builds. Patch #6 changes minimum version of pahole required for CONFIG_DEBUG_INFO_BTF to v1.22 Patch #7 makes a small prep change in selftests/bpf build. The last patch (#8) makes significant changes in resolve_btfids and introduces scripts/gen-btf.sh. See implementation details in the patch description. Successful BPF CI run: https://github.com/kernel-patches/bpf/actions/runs/20378061470 [1] https://lore.kernel.org/dwarves/ba1650aa-fafd-49a8-bea4-bdddee7c38c9@linux.dev/ [2] https://lore.kernel.org/bpf/20251029190113.3323406-1-ihor.solodrai@linux.dev/ [3] https://lore.kernel.org/bpf/20251119031531.1817099-1-dolinux.peng@gmail.com/ --- v6->v7: - documentation edits in patches #5 and #6 (Nicolas) v6: https://lore.kernel.org/bpf/20251219020006.785065-1-ihor.solodrai@linux.dev/ v5->v6: - patch #8: fix double free when btf__distill_base fails (reported by AI) https://lore.kernel.org/bpf/e269870b8db409800045ee0061fc02d21721e0efadd99ca83960b48f8db7b3f3@mail.kernel.org/ v5: https://lore.kernel.org/bpf/20251219003147.587098-1-ihor.solodrai@linux.dev/ v4->v5: - patch #3: fix an off-by-one bug (reported by AI) https://lore.kernel.org/bpf/106b6e71bce75b8f12a85f2f99e75129e67af7287f6d81fa912589ece14044f9@mail.kernel.org/ - patch #8: cleanup GEN_BTF in Makefile.btf v4: https://lore.kernel.org/bpf/20251218003314.260269-1-ihor.solodrai@linux.dev/ v3->v4: - add patch #4: "resolve_btfids: Always build with -Wall -Werror" - add patch #5: "kbuild: Sync kconfig when PAHOLE_VERSION changes" (Alan) - fix clang cross-compilation (LKP) https://lore.kernel.org/bpf/cecb6351-ea9a-4f8a-863a-82c9ef02f012@linux.dev/ - remove GEN_BTF env variable (Andrii) - nits and cleanup in resolve_btfids/main.c (Andrii, Eduard) - nits in a patch bumping minimum pahole version (Andrii, AI) v3: https://lore.kernel.org/bpf/20251205223046.4155870-1-ihor.solodrai@linux.dev/ v2->v3: - add patch #4 bumping minimum pahole version (Andrii, Alan) - add patch #5 pre-fixing resolve_btfids test (Donglin) - add GEN_BTF var and assemble RESOLVE_BTFIDS_FLAGS in Makefile.btf (Alan) - implement --distill_base flag in resolve_btfids, set it depending on KBUILD_EXTMOD in Makefile.btf (Eduard) - various implementation nits, see the v2 thread for details (Andrii, Eduard) v2: https://lore.kernel.org/bpf/20251127185242.3954132-1-ihor.solodrai@linux.dev/ v1->v2: - gen-btf.sh and other shell script fixes (Donglin) - update selftests build (Donglin) - generate .BTF.base only when KBUILD_EXTMOD is set (Alan) - proper endianness handling for cross-compilation - change elf_begin mode from ELF_C_RDWR_MMAP to ELF_C_READ_MMAP_PRIVATE - remove compressed_section_fix() - nit NULL check in patch #3 (suggested by AI) v1: https://lore.kernel.org/bpf/20251126012656.3546071-1-ihor.solodrai@linux.dev/ ==================== Link: https://patch.msgid.link/20251219181321.1283664-1-ihor.solodrai@linux.dev Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
We sometimes observe use-after-free when dereferencing a neighbour [1]. The problem seems to be that the driver stores a pointer to the neighbour, but without holding a reference on it. A reference is only taken when the neighbour is used by a nexthop. Fix by simplifying the reference counting scheme. Always take a reference when storing a neighbour pointer in a neighbour entry. Avoid taking a referencing when the neighbour is used by a nexthop as the neighbour entry associated with the nexthop already holds a reference. Tested by running the test that uncovered the problem over 300 times. Without this patch the problem was reproduced after a handful of iterations. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_neigh_entry_update+0x2d4/0x310 Read of size 8 at addr ffff88817f8e3420 by task ip/3929 CPU: 3 UID: 0 PID: 3929 Comm: ip Not tainted 6.18.0-rc4-virtme-g36b21a067510 #3 PREEMPT(full) Hardware name: Nvidia SN5600/VMOD0013, BIOS 5.13 05/31/2023 Call Trace: <TASK> dump_stack_lvl+0x6f/0xa0 print_address_description.constprop.0+0x6e/0x300 print_report+0xfc/0x1fb kasan_report+0xe4/0x110 mlxsw_sp_neigh_entry_update+0x2d4/0x310 mlxsw_sp_router_rif_gone_sync+0x35f/0x510 mlxsw_sp_rif_destroy+0x1ea/0x730 mlxsw_sp_inetaddr_port_vlan_event+0xa1/0x1b0 __mlxsw_sp_inetaddr_lag_event+0xcc/0x130 __mlxsw_sp_inetaddr_event+0xf5/0x3c0 mlxsw_sp_router_netdevice_event+0x1015/0x1580 notifier_call_chain+0xcc/0x150 call_netdevice_notifiers_info+0x7e/0x100 __netdev_upper_dev_unlink+0x10b/0x210 netdev_upper_dev_unlink+0x79/0xa0 vrf_del_slave+0x18/0x50 do_set_master+0x146/0x7d0 do_setlink.isra.0+0x9a0/0x2880 rtnl_newlink+0x637/0xb20 rtnetlink_rcv_msg+0x6fe/0xb90 netlink_rcv_skb+0x123/0x380 netlink_unicast+0x4a3/0x770 netlink_sendmsg+0x75b/0xc90 __sock_sendmsg+0xbe/0x160 ____sys_sendmsg+0x5b2/0x7d0 ___sys_sendmsg+0xfd/0x180 __sys_sendmsg+0x124/0x1c0 do_syscall_64+0xbb/0xfd0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [...] Allocated by task 109: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x7b/0x90 __kmalloc_noprof+0x2c1/0x790 neigh_alloc+0x6af/0x8f0 ___neigh_create+0x63/0xe90 mlxsw_sp_nexthop_neigh_init+0x430/0x7e0 mlxsw_sp_nexthop_type_init+0x212/0x960 mlxsw_sp_nexthop6_group_info_init.constprop.0+0x81f/0x1280 mlxsw_sp_nexthop6_group_get+0x392/0x6a0 mlxsw_sp_fib6_entry_create+0x46a/0xfd0 mlxsw_sp_router_fib6_replace+0x1ed/0x5f0 mlxsw_sp_router_fib6_event_work+0x10a/0x2a0 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Freed by task 154: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x43/0x70 kmem_cache_free_bulk.part.0+0x1eb/0x5e0 kvfree_rcu_bulk+0x1f2/0x260 kfree_rcu_work+0x130/0x1b0 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Last potentially related work creation: kasan_save_stack+0x30/0x50 kasan_record_aux_stack+0x8c/0xa0 kvfree_call_rcu+0x93/0x5b0 mlxsw_sp_router_neigh_event_work+0x67d/0x860 process_one_work+0xd57/0x1390 worker_thread+0x4d6/0xd40 kthread+0x355/0x5b0 ret_from_fork+0x1d4/0x270 ret_from_fork_asm+0x11/0x20 Fixes: 6cf3c97 ("mlxsw: spectrum_router: Add private neigh table") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/92d75e21d95d163a41b5cea67a15cd33f547cba6.1764695650.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata says: ==================== selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs an overlay traffic, forwarded over a multicast-routed VXLAN underlay. In order to determine whether packets reach their intended destination, it uses a TC match. For convenience, it uses a flower match, which however does not allow matching on the encapsulated packet. So various service traffic ends up being indistinguishable from the test packets, and ends up confusing the test. To alleviate the problem, the test uses sleep to allow the necessary service traffic to run and clear the channel, before running the test traffic. This worked for a while, but lately we have nevertheless seen flakiness of the test in the CI. In this patchset, first generalize tc_rule_stats_get() to support u32 in patch #1, then in patch #2 convert the test to use u32 to allow parsing deeper into the packet, and in #3 drop the now-unnecessary sleep. ==================== Link: https://patch.msgid.link/cover.1765289566.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a loop scenario of ethx:egress->ethx:egress
Example setup to reproduce:
tc qdisc add dev ethx root handle 1: drr
tc filter add dev ethx parent 1: protocol ip prio 1 matchall \
action mirred egress redirect dev ethx
Now ping out of ethx and you get a deadlock:
[ 116.892898][ T307] ============================================
[ 116.893182][ T307] WARNING: possible recursive locking detected
[ 116.893418][ T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted
[ 116.893682][ T307] --------------------------------------------
[ 116.893926][ T307] ping/307 is trying to acquire lock:
[ 116.894133][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.894517][ T307]
[ 116.894517][ T307] but task is already holding lock:
[ 116.894836][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.895252][ T307]
[ 116.895252][ T307] other info that might help us debug this:
[ 116.895608][ T307] Possible unsafe locking scenario:
[ 116.895608][ T307]
[ 116.895901][ T307] CPU0
[ 116.896057][ T307] ----
[ 116.896200][ T307] lock(&sch->root_lock_key);
[ 116.896392][ T307] lock(&sch->root_lock_key);
[ 116.896605][ T307]
[ 116.896605][ T307] *** DEADLOCK ***
[ 116.896605][ T307]
[ 116.896864][ T307] May be due to missing lock nesting notation
[ 116.896864][ T307]
[ 116.897123][ T307] 6 locks held by ping/307:
[ 116.897302][ T307] #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0
[ 116.897808][ T307] #1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600
[ 116.898138][ T307] #2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0
[ 116.898459][ T307] #3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.898782][ T307] #4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50
[ 116.899132][ T307] #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50
[ 116.899442][ T307]
[ 116.899442][ T307] stack backtrace:
[ 116.899667][ T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary)
[ 116.899672][ T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 116.899675][ T307] Call Trace:
[ 116.899678][ T307] <TASK>
[ 116.899680][ T307] dump_stack_lvl+0x6f/0xb0
[ 116.899688][ T307] print_deadlock_bug.cold+0xc0/0xdc
[ 116.899695][ T307] __lock_acquire+0x11f7/0x1be0
[ 116.899704][ T307] lock_acquire+0x162/0x300
[ 116.899707][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899713][ T307] ? srso_alias_return_thunk+0x5/0xfbef5
[ 116.899717][ T307] ? stack_trace_save+0x93/0xd0
[ 116.899723][ T307] _raw_spin_lock+0x30/0x40
[ 116.899728][ T307] ? __dev_queue_xmit+0x2210/0x3b50
[ 116.899731][ T307] __dev_queue_xmit+0x2210/0x3b50
Fixes: 178ca30 ("Revert "net/sched: Fix mirred deadlock on device recursion"")
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251210162255.1057663-1-jhs@mojatatu.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reject attempts to disable KVM_MEM_GUEST_MEMFD on a memslot that was initially created with a guest_memfd binding, as KVM doesn't support toggling KVM_MEM_GUEST_MEMFD on existing memslots. KVM prevents enabling KVM_MEM_GUEST_MEMFD, but doesn't prevent clearing the flag. Failure to reject the new memslot results in a use-after-free due to KVM not unbinding from the guest_memfd instance. Unbinding on a FLAGS_ONLY change is easy enough, and can/will be done as a hardening measure (in anticipation of KVM supporting dirty logging on guest_memfd at some point), but fixing the use-after-free would only address the immediate symptom. ================================================================== BUG: KASAN: slab-use-after-free in kvm_gmem_release+0x362/0x400 [kvm] Write of size 8 at addr ffff8881111ae908 by task repro/745 CPU: 7 UID: 1000 PID: 745 Comm: repro Not tainted 6.18.0-rc6-115d5de2eef3-next-kasan #3 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x51/0x60 print_report+0xcb/0x5c0 kasan_report+0xb4/0xe0 kvm_gmem_release+0x362/0x400 [kvm] __fput+0x2fa/0x9d0 task_work_run+0x12c/0x200 do_exit+0x6ae/0x2100 do_group_exit+0xa8/0x230 __x64_sys_exit_group+0x3a/0x50 x64_sys_call+0x737/0x740 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f581f2eac31 </TASK> Allocated by task 745 on cpu 6 at 9.746971s: kasan_save_stack+0x20/0x40 kasan_save_track+0x13/0x50 __kasan_kmalloc+0x77/0x90 kvm_set_memory_region.part.0+0x652/0x1110 [kvm] kvm_vm_ioctl+0x14b0/0x3290 [kvm] __x64_sys_ioctl+0x129/0x1a0 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Freed by task 745 on cpu 6 at 9.747467s: kasan_save_stack+0x20/0x40 kasan_save_track+0x13/0x50 __kasan_save_free_info+0x37/0x50 __kasan_slab_free+0x3b/0x60 kfree+0xf5/0x440 kvm_set_memslot+0x3c2/0x1160 [kvm] kvm_set_memory_region.part.0+0x86a/0x1110 [kvm] kvm_vm_ioctl+0x14b0/0x3290 [kvm] __x64_sys_ioctl+0x129/0x1a0 do_syscall_64+0x5b/0x900 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Reported-by: Alexander Potapenko <glider@google.com> Fixes: a7800aa ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251202020334.1171351-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
… to macb_open() In the non-RT kernel, local_bh_disable() merely disables preemption, whereas it maps to an actual spin lock in the RT kernel. Consequently, when attempting to refill RX buffers via netdev_alloc_skb() in macb_mac_link_up(), a deadlock scenario arises as follows: WARNING: possible circular locking dependency detected 6.18.0-08691-g2061f18ad76e #39 Not tainted ------------------------------------------------------ kworker/0:0/8 is trying to acquire lock: ffff00080369bbe0 (&bp->lock){+.+.}-{3:3}, at: macb_start_xmit+0x808/0xb7c but task is already holding lock: ffff000803698e58 (&queue->tx_ptr_lock){+...}-{3:3}, at: macb_start_xmit +0x148/0xb7c which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&queue->tx_ptr_lock){+...}-{3:3}: rt_spin_lock+0x50/0x1f0 macb_start_xmit+0x148/0xb7c dev_hard_start_xmit+0x94/0x284 sch_direct_xmit+0x8c/0x37c __dev_queue_xmit+0x708/0x1120 neigh_resolve_output+0x148/0x28c ip6_finish_output2+0x2c0/0xb2c __ip6_finish_output+0x114/0x308 ip6_output+0xc4/0x4a4 mld_sendpack+0x220/0x68c mld_ifc_work+0x2a8/0x4f4 process_one_work+0x20c/0x5f8 worker_thread+0x1b0/0x35c kthread+0x144/0x200 ret_from_fork+0x10/0x20 -> #2 (_xmit_ETHER#2){+...}-{3:3}: rt_spin_lock+0x50/0x1f0 sch_direct_xmit+0x11c/0x37c __dev_queue_xmit+0x708/0x1120 neigh_resolve_output+0x148/0x28c ip6_finish_output2+0x2c0/0xb2c __ip6_finish_output+0x114/0x308 ip6_output+0xc4/0x4a4 mld_sendpack+0x220/0x68c mld_ifc_work+0x2a8/0x4f4 process_one_work+0x20c/0x5f8 worker_thread+0x1b0/0x35c kthread+0x144/0x200 ret_from_fork+0x10/0x20 -> #1 ((softirq_ctrl.lock)){+.+.}-{3:3}: lock_release+0x250/0x348 __local_bh_enable_ip+0x7c/0x240 __netdev_alloc_skb+0x1b4/0x1d8 gem_rx_refill+0xdc/0x240 gem_init_rings+0xb4/0x108 macb_mac_link_up+0x9c/0x2b4 phylink_resolve+0x170/0x614 process_one_work+0x20c/0x5f8 worker_thread+0x1b0/0x35c kthread+0x144/0x200 ret_from_fork+0x10/0x20 -> #0 (&bp->lock){+.+.}-{3:3}: __lock_acquire+0x15a8/0x2084 lock_acquire+0x1cc/0x350 rt_spin_lock+0x50/0x1f0 macb_start_xmit+0x808/0xb7c dev_hard_start_xmit+0x94/0x284 sch_direct_xmit+0x8c/0x37c __dev_queue_xmit+0x708/0x1120 neigh_resolve_output+0x148/0x28c ip6_finish_output2+0x2c0/0xb2c __ip6_finish_output+0x114/0x308 ip6_output+0xc4/0x4a4 mld_sendpack+0x220/0x68c mld_ifc_work+0x2a8/0x4f4 process_one_work+0x20c/0x5f8 worker_thread+0x1b0/0x35c kthread+0x144/0x200 ret_from_fork+0x10/0x20 other info that might help us debug this: Chain exists of: &bp->lock --> _xmit_ETHER#2 --> &queue->tx_ptr_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&queue->tx_ptr_lock); lock(_xmit_ETHER#2); lock(&queue->tx_ptr_lock); lock(&bp->lock); *** DEADLOCK *** Call trace: show_stack+0x18/0x24 (C) dump_stack_lvl+0xa0/0xf0 dump_stack+0x18/0x24 print_circular_bug+0x28c/0x370 check_noncircular+0x198/0x1ac __lock_acquire+0x15a8/0x2084 lock_acquire+0x1cc/0x350 rt_spin_lock+0x50/0x1f0 macb_start_xmit+0x808/0xb7c dev_hard_start_xmit+0x94/0x284 sch_direct_xmit+0x8c/0x37c __dev_queue_xmit+0x708/0x1120 neigh_resolve_output+0x148/0x28c ip6_finish_output2+0x2c0/0xb2c __ip6_finish_output+0x114/0x308 ip6_output+0xc4/0x4a4 mld_sendpack+0x220/0x68c mld_ifc_work+0x2a8/0x4f4 process_one_work+0x20c/0x5f8 worker_thread+0x1b0/0x35c kthread+0x144/0x200 ret_from_fork+0x10/0x20 Notably, invoking the mog_init_rings() callback upon link establishment is unnecessary. Instead, we can exclusively call mog_init_rings() within the ndo_open() callback. This adjustment resolves the deadlock issue. Furthermore, since MACB_CAPS_MACB_IS_EMAC cases do not use mog_init_rings() when opening the network interface via at91ether_open(), moving mog_init_rings() to macb_open() also eliminates the MACB_CAPS_MACB_IS_EMAC check. Fixes: 633e98a ("net: macb: use resolved link config in mac_link_up()") Cc: stable@vger.kernel.org Suggested-by: Kevin Hao <kexin.hao@windriver.com> Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Link: https://patch.msgid.link/20251222015624.1994551-1-xiaolei.wang@windriver.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pull request for series with
subject: bpf, arm64: fix bpf line info
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=612002