Cleanup submit bh patchv4 #1

riteshharjani · 2022-08-17T09:55:59Z

No description provided.

submit_bh always returns 0. This patch cleans up 2 of it's caller in jbd2 to drop submit_bh's useless return value. Once all submit_bh callers are cleaned up, we can make it's return type as void. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

…or_read submit_bh always returns 0. This patch drops the useless return value of submit_bh from ntfs_submit_bh_for_read(). Once all of submit_bh callers are cleaned up, we can make it's return type as void. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Ritesh Harjani <ritesh.list@gmail.com>

submit_bh always returns 0. This patch drops the useless return value of submit_bh from __sync_dirty_buffer(). Once all of submit_bh callers are cleaned up, we can make it's return type as void. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

submit_bh/submit_bh_wbc are non-blocking functions which just submit the bio and return. The caller of submit_bh/submit_bh_wbc needs to wait on buffer till I/O completion and then check buffer head's b_state field to know if there was any I/O error. Hence there is no need for these functions to have any return type. Even now they always returns 0. Hence drop the return value and make their return type as void to avoid any confusion. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Fix possible NULL pointer dereference, due to freeing of adapter->vf_res in iavf_init_get_resources. Previous commit introduced a regression, where receiving IAVF_ERR_ADMIN_QUEUE_NO_WORK from iavf_get_vf_config would free adapter->vf_res. However, netdev is still registered, so ethtool_ops can be called. Calling iavf_get_link_ksettings with no vf_res, will result with: [ 9385.242676] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 9385.242683] #PF: supervisor read access in kernel mode [ 9385.242686] #PF: error_code(0x0000) - not-present page [ 9385.242690] PGD 0 P4D 0 [ 9385.242696] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 9385.242701] CPU: 6 PID: 3217 Comm: pmdalinux Kdump: loaded Tainted: G S E 5.18.0-04958-ga54ce3703613-dirty #1 [ 9385.242708] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.11.0 11/02/2019 [ 9385.242710] RIP: 0010:iavf_get_link_ksettings+0x29/0xd0 [iavf] [ 9385.242745] Code: 00 0f 1f 44 00 00 b8 01 ef ff ff 48 c7 46 30 00 00 00 00 48 c7 46 38 00 00 00 00 c6 46 0b 00 66 89 46 08 48 8b 87 68 0e 00 00 <f6> 40 08 80 75 50 8b 87 5c 0e 00 00 83 f8 08 74 7a 76 1d 83 f8 20 [ 9385.242749] RSP: 0018:ffffc0560ec7fbd0 EFLAGS: 00010246 [ 9385.242755] RAX: 0000000000000000 RBX: ffffc0560ec7fc08 RCX: 0000000000000000 [ 9385.242759] RDX: ffffffffc0ad4550 RSI: ffffc0560ec7fc08 RDI: ffffa0fc66674000 [ 9385.242762] RBP: 00007ffd1fb2bf50 R08: b6a2d54b892363ee R09: ffffa101dc14fb00 [ 9385.242765] R10: 0000000000000000 R11: 0000000000000004 R12: ffffa0fc66674000 [ 9385.242768] R13: 0000000000000000 R14: ffffa0fc66674000 R15: 00000000ffffffa1 [ 9385.242771] FS: 00007f93711a2980(0000) GS:ffffa0fad72c0000(0000) knlGS:0000000000000000 [ 9385.242775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 9385.242778] CR2: 0000000000000008 CR3: 0000000a8e61c003 CR4: 00000000003706e0 [ 9385.242781] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 9385.242784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 9385.242787] Call Trace: [ 9385.242791] <TASK> [ 9385.242793] ethtool_get_settings+0x71/0x1a0 [ 9385.242814] __dev_ethtool+0x426/0x2f40 [ 9385.242823] ? slab_post_alloc_hook+0x4f/0x280 [ 9385.242836] ? kmem_cache_alloc_trace+0x15d/0x2f0 [ 9385.242841] ? dev_ethtool+0x59/0x170 [ 9385.242848] dev_ethtool+0xa7/0x170 [ 9385.242856] dev_ioctl+0xc3/0x520 [ 9385.242866] sock_do_ioctl+0xa0/0xe0 [ 9385.242877] sock_ioctl+0x22f/0x320 [ 9385.242885] __x64_sys_ioctl+0x84/0xc0 [ 9385.242896] do_syscall_64+0x3a/0x80 [ 9385.242904] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 9385.242918] RIP: 0033:0x7f93702396db [ 9385.242923] Code: 73 01 c3 48 8b 0d ad 57 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 57 38 00 f7 d8 64 89 01 48 [ 9385.242927] RSP: 002b:00007ffd1fb2bf18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 9385.242932] RAX: ffffffffffffffda RBX: 000055671b1d2fe0 RCX: 00007f93702396db [ 9385.242935] RDX: 00007ffd1fb2bf20 RSI: 0000000000008946 RDI: 0000000000000007 [ 9385.242937] RBP: 00007ffd1fb2bf20 R08: 0000000000000003 R09: 0030763066307330 [ 9385.242940] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd1fb2bf80 [ 9385.242942] R13: 0000000000000007 R14: 0000556719f6de90 R15: 00007ffd1fb2c1b0 [ 9385.242948] </TASK> [ 9385.242949] Modules linked in: iavf(E) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables rfkill nfnetlink vfat fat irdma ib_uverbs ib_core intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support ice irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl i40e pcspkr intel_cstate joydev mei_me intel_uncore mxm_wmi mei ipmi_ssif lpc_ich ipmi_si acpi_power_meter xfs libcrc32c mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper sd_mod t10_pi crc64_rocksoft crc64 syscopyarea sg sysfillrect sysimgblt fb_sys_fops drm ixgbe ahci libahci libata crc32c_intel mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse [ 9385.243065] [last unloaded: iavf] Dereference happens in if (ADV_LINK_SUPPORT(adapter)) statement Fixes: 209f2f9 ("iavf: Add support for VIRTCHNL_VF_OFFLOAD_VLAN_V2 negotiation") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

Do not call iavf_close in iavf_reset_task error handling. Doing so can lead to double call of napi_disable, which can lead to deadlock there. Removing VF would lead to iavf_remove task being stuck, because it requires crit_lock, which is held by iavf_close. Call iavf_disable_vf if reset fail, so that driver will clean up remaining invalid resources. During rapid VF resets, HW can fail to setup VF mailbox. Wrong error handling can lead to iavf_remove being stuck with: [ 5218.999087] iavf 0000:82:01.0: Failed to init adminq: -53 ... [ 5267.189211] INFO: task repro.sh:11219 blocked for more than 30 seconds. [ 5267.189520] Tainted: G S E 5.18.0-04958-ga54ce3703613-dirty #1 [ 5267.189764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5267.190062] task:repro.sh state:D stack: 0 pid:11219 ppid: 8162 flags:0x00000000 [ 5267.190347] Call Trace: [ 5267.190647] <TASK> [ 5267.190927] __schedule+0x460/0x9f0 [ 5267.191264] schedule+0x44/0xb0 [ 5267.191563] schedule_preempt_disabled+0x14/0x20 [ 5267.191890] __mutex_lock.isra.12+0x6e3/0xac0 [ 5267.192237] ? iavf_remove+0xf9/0x6c0 [iavf] [ 5267.192565] iavf_remove+0x12a/0x6c0 [iavf] [ 5267.192911] ? _raw_spin_unlock_irqrestore+0x1e/0x40 [ 5267.193285] pci_device_remove+0x36/0xb0 [ 5267.193619] device_release_driver_internal+0xc1/0x150 [ 5267.193974] pci_stop_bus_device+0x69/0x90 [ 5267.194361] pci_stop_and_remove_bus_device+0xe/0x20 [ 5267.194735] pci_iov_remove_virtfn+0xba/0x120 [ 5267.195130] sriov_disable+0x2f/0xe0 [ 5267.195506] ice_free_vfs+0x7d/0x2f0 [ice] [ 5267.196056] ? pci_get_device+0x4f/0x70 [ 5267.196496] ice_sriov_configure+0x78/0x1a0 [ice] [ 5267.196995] sriov_numvfs_store+0xfe/0x140 [ 5267.197466] kernfs_fop_write_iter+0x12e/0x1c0 [ 5267.197918] new_sync_write+0x10c/0x190 [ 5267.198404] vfs_write+0x24e/0x2d0 [ 5267.198886] ksys_write+0x5c/0xd0 [ 5267.199367] do_syscall_64+0x3a/0x80 [ 5267.199827] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 5267.200317] RIP: 0033:0x7f5b381205c8 [ 5267.200814] RSP: 002b:00007fff8c7e8c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 5267.201981] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5b381205c8 [ 5267.202620] RDX: 0000000000000002 RSI: 00005569420ee900 RDI: 0000000000000001 [ 5267.203426] RBP: 00005569420ee900 R08: 000000000000000a R09: 00007f5b38180820 [ 5267.204327] R10: 000000000000000a R11: 0000000000000246 R12: 00007f5b383c06e0 [ 5267.205193] R13: 0000000000000002 R14: 00007f5b383bb880 R15: 0000000000000002 [ 5267.206041] </TASK> [ 5267.206970] Kernel panic - not syncing: hung_task: blocked tasks [ 5267.207809] CPU: 48 PID: 551 Comm: khungtaskd Kdump: loaded Tainted: G S E 5.18.0-04958-ga54ce3703613-dirty #1 [ 5267.208726] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.11.0 11/02/2019 [ 5267.209623] Call Trace: [ 5267.210569] <TASK> [ 5267.211480] dump_stack_lvl+0x33/0x42 [ 5267.212472] panic+0x107/0x294 [ 5267.213467] watchdog.cold.8+0xc/0xbb [ 5267.214413] ? proc_dohung_task_timeout_secs+0x30/0x30 [ 5267.215511] kthread+0xf4/0x120 [ 5267.216459] ? kthread_complete_and_exit+0x20/0x20 [ 5267.217505] ret_from_fork+0x22/0x30 [ 5267.218459] </TASK> Fixes: f0db789 ("i40evf: use netdev variable in reset task") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags() to obtain the value of sk->sk_user_data, but that function is only usable if the RCU read lock is held, and neither that function nor any of its callers hold it. Fix this by adding a new helper, __locked_read_sk_user_data_with_flags() that checks to see if sk->sk_callback_lock() is held and use that here instead. Alternatively, making __rcu_dereference_sk_user_data_with_flags() use rcu_dereference_checked() might suffice. Without this, the following warning can be occasionally observed: ============================= WARNING: suspicious RCU usage 6.0.0-rc1-build2+ #563 Not tainted ----------------------------- include/net/sock.h:592 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by locktest/29873: #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4 stack backtrace: CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x4c/0x5f bpf_sk_reuseport_detach+0x6d/0xa4 reuseport_detach_sock+0x75/0xdd inet_unhash+0xa5/0x1c0 tcp_set_state+0x169/0x20f ? lockdep_sock_is_held+0x3a/0x3a ? __lock_release.isra.0+0x13e/0x220 ? reacquire_held_locks+0x1bb/0x1bb ? hlock_class+0x31/0x96 ? mark_lock+0x9e/0x1af __tcp_close+0x50/0x4b6 tcp_close+0x28/0x70 inet_release+0x8e/0xa7 __sock_release+0x95/0x121 sock_close+0x14/0x17 __fput+0x20f/0x36a task_work_run+0xa3/0xcc exit_to_user_mode_prepare+0x9c/0x14d syscall_exit_to_user_mode+0x18/0x44 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: cf8c1e9 ("net: refactor bpf_sk_reuseport_detach()") Signed-off-by: David Howells <dhowells@redhat.com> cc: Hawkins Jiawei <yin31149@gmail.com> Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cec_unregister_adapter() assumes that the underlying adapter ops are callable. For example, if the CEC adapter currently has a valid physical address, then the unregistration procedure will invalidate the physical address by setting it to f.f.f.f. Whence the following kernel oops observed after removing the adv7511 module: Unable to handle kernel execution of user memory at virtual address 0000000000000000 Internal error: Oops: 86000004 [#1] PREEMPT_RT SMP Call trace: 0x0 adv7511_cec_adap_log_addr+0x1ac/0x1c8 [adv7511] cec_adap_unconfigure+0x44/0x90 [cec] __cec_s_phys_addr.part.0+0x68/0x230 [cec] __cec_s_phys_addr+0x40/0x50 [cec] cec_unregister_adapter+0xb4/0x118 [cec] adv7511_remove+0x60/0x90 [adv7511] i2c_device_remove+0x34/0xe0 device_release_driver_internal+0x114/0x1f0 driver_detach+0x54/0xe0 bus_remove_driver+0x60/0xd8 driver_unregister+0x34/0x60 i2c_del_driver+0x2c/0x68 adv7511_exit+0x1c/0x67c [adv7511] __arm64_sys_delete_module+0x154/0x288 invoke_syscall+0x48/0x100 el0_svc_common.constprop.0+0x48/0xe8 do_el0_svc+0x28/0x88 el0_svc+0x1c/0x50 el0t_64_sync_handler+0xa8/0xb0 el0t_64_sync+0x15c/0x160 Code: bad PC value ---[ end trace 0000000000000000 ]--- Protect against this scenario by unregistering i2c_cec after unregistering the CEC adapter. Duly disable the CEC clock afterwards too. Fixes: 3b1b975 ("drm: adv7511/33: add HDMI CEC support") Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Robert Foss <robert.foss@linaro.org> Signed-off-by: Robert Foss <robert.foss@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20220612144854.2223873-3-alvin@pqrs.dk

There are some struct drm_driver fields that are required by drivers since drm_copy_field() attempts to copy them to user-space via DRM_IOCTL_VERSION. But it can be possible that a driver has a bug and did not set some of the fields, which leads to drm_copy_field() attempting to copy a NULL pointer: [ +10.395966] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000 [ +0.010955] Mem abort info: [ +0.002835] ESR = 0x0000000096000004 [ +0.003872] EC = 0x25: DABT (current EL), IL = 32 bits [ +0.005395] SET = 0, FnV = 0 [ +0.003113] EA = 0, S1PTW = 0 [ +0.003182] FSC = 0x04: level 0 translation fault [ +0.004964] Data abort info: [ +0.002919] ISV = 0, ISS = 0x00000004 [ +0.003886] CM = 0, WnR = 0 [ +0.003040] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000115dad000 [ +0.006536] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ +0.006925] Internal error: Oops: 96000004 [#1] SMP ... [ +0.011113] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ +0.007061] pc : __pi_strlen+0x14/0x150 [ +0.003895] lr : drm_copy_field+0x30/0x1a4 [ +0.004156] sp : ffff8000094b3a50 [ +0.003355] x29: ffff8000094b3a50 x28: ffff8000094b3b70 x27: 0000000000000040 [ +0.007242] x26: ffff443743c2ba00 x25: 0000000000000000 x24: 0000000000000040 [ +0.007243] x23: ffff443743c2ba00 x22: ffff8000094b3b70 x21: 0000000000000000 [ +0.007241] x20: 0000000000000000 x19: ffff8000094b3b90 x18: 0000000000000000 [ +0.007241] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaab14b9af40 [ +0.007241] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ +0.007239] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa524ad67d4d8 [ +0.007242] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : 6c6e6263606e7141 [ +0.007239] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ +0.007241] x2 : 0000000000000000 x1 : ffff8000094b3b90 x0 : 0000000000000000 [ +0.007240] Call trace: [ +0.002475] __pi_strlen+0x14/0x150 [ +0.003537] drm_version+0x84/0xac [ +0.003448] drm_ioctl_kernel+0xa8/0x16c [ +0.003975] drm_ioctl+0x270/0x580 [ +0.003448] __arm64_sys_ioctl+0xb8/0xfc [ +0.003978] invoke_syscall+0x78/0x100 [ +0.003799] el0_svc_common.constprop.0+0x4c/0xf4 [ +0.004767] do_el0_svc+0x38/0x4c [ +0.003357] el0_svc+0x34/0x100 [ +0.003185] el0t_64_sync_handler+0x11c/0x150 [ +0.004418] el0t_64_sync+0x190/0x194 [ +0.003716] Code: 92402c04 b200c3e8 f13fc09f 5400088c (a9400c02) [ +0.006180] ---[ end trace 0000000000000000 ]--- Reported-by: Peter Robinson <pbrobinson@gmail.com> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20220705100215.572498-3-javierm@redhat.com

Kurt Kanzenbach says: ==================== Hi, add a BPF-helper for accessing CLOCK_TAI. Use cases for such a BPF helper include functionalities such as Tx launch time (e.g. ETF and TAPRIO Qdiscs), timestamping and policing. Patch #1 - Introduce BPF helper Patch #2 - Add test case (skb based) Changes since v1: * Update changelog (Alexei Starovoitov) * Add test case (Alexei Starovoitov, Andrii Nakryiko) * Add missing function prototype (netdev ci) Previous versions: * v1: https://lore.kernel.org/r/20220606103734.92423-1-kurt@linutronix.de/ Jesper Dangaard Brouer (1): bpf: Add BPF-helper for accessing CLOCK_TAI ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>

We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Commit c89191c ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS") missed one use case of SWAPGS in entry_INT80_compat(). Removing of the SWAPGS macro led to asm just using "swapgs", as it is accepting instructions in capital letters, too. This in turn leads to splats in Xen PV guests like: [ 36.145223] general protection fault, maybe for address 0x2d: 0000 [#1] PREEMPT SMP NOPTI [ 36.145794] CPU: 2 PID: 1847 Comm: ld-linux.so.2 Not tainted 5.19.1-1-default #1 \ openSUSE Tumbleweed f3b44bfb672cdb9f235aff53b57724eba8b9411b [ 36.146608] Hardware name: HP ProLiant ML350p Gen8, BIOS P72 11/14/2013 [ 36.148126] RIP: e030:entry_INT80_compat+0x3/0xa3 Fix that by open coding this single instance of the SWAPGS macro. Fixes: c89191c ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Cc: <stable@vger.kernel.org> # 5.19 Link: https://lore.kernel.org/r/20220816071137.4893-1-jgross@suse.com

If amdgpu_cs_vm_handling returns r != 0, then it will unlock the bo_list_mutex inside the function amdgpu_cs_vm_handling and again on amdgpu_cs_parser_fini. This problem results in the following use-after-free problem: [ 220.280990] ------------[ cut here ]------------ [ 220.281000] refcount_t: underflow; use-after-free. [ 220.281019] WARNING: CPU: 1 PID: 3746 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 220.281029] ------------[ cut here ]------------ [ 220.281415] CPU: 1 PID: 3746 Comm: chrome:cs0 Tainted: G W L ------- --- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1 [ 220.281421] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 220.281426] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 220.281431] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 98 c6 05 ce 7e be 01 01 e8 56 4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 220.281437] RSP: 0018:ffffb4b0d18d7a80 EFLAGS: 00010282 [ 220.281443] RAX: 0000000000000026 RBX: 0000000000000003 RCX: 0000000000000000 [ 220.281448] RDX: 0000000000000001 RSI: ffffffff988d06dc RDI: 00000000ffffffff [ 220.281452] RBP: 00000000ffffffff R08: 0000000000000000 R09: ffffb4b0d18d7930 [ 220.281457] R10: 0000000000000003 R11: ffffa0672e2fffe8 R12: ffffa058ca360400 [ 220.281461] R13: ffffa05846c50a18 R14: 00000000fffffe00 R15: 0000000000000003 [ 220.281465] FS: 00007f82683e06c0(0000) GS:ffffa066e2e00000(0000) knlGS:0000000000000000 [ 220.281470] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 220.281475] CR2: 00003590005cc000 CR3: 00000001fca46000 CR4: 0000000000350ee0 [ 220.281480] Call Trace: [ 220.281485] <TASK> [ 220.281490] amdgpu_cs_ioctl+0x4e2/0x2070 [amdgpu] [ 220.281806] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282028] drm_ioctl_kernel+0xa4/0x150 [ 220.282043] drm_ioctl+0x21f/0x420 [ 220.282053] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282275] ? lock_release+0x14f/0x460 [ 220.282282] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282290] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282297] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282305] ? _raw_spin_unlock_irqrestore+0x40/0x60 [ 220.282317] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 220.282534] __x64_sys_ioctl+0x90/0xd0 [ 220.282545] do_syscall_64+0x5b/0x80 [ 220.282551] ? futex_wake+0x6c/0x150 [ 220.282568] ? lock_is_held_type+0xe8/0x140 [ 220.282580] ? do_syscall_64+0x67/0x80 [ 220.282585] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282592] ? do_syscall_64+0x67/0x80 [ 220.282597] ? do_syscall_64+0x67/0x80 [ 220.282602] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282609] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 220.282616] RIP: 0033:0x7f8282a4f8bf [ 220.282639] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 220.282644] RSP: 002b:00007f82683df410 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 220.282651] RAX: ffffffffffffffda RBX: 00007f82683df588 RCX: 00007f8282a4f8bf [ 220.282655] RDX: 00007f82683df4d0 RSI: 00000000c0186444 RDI: 0000000000000018 [ 220.282659] RBP: 00007f82683df4d0 R08: 00007f82683df5e0 R09: 00007f82683df4b0 [ 220.282663] R10: 00001d04000a0600 R11: 0000000000000246 R12: 00000000c0186444 [ 220.282667] R13: 0000000000000018 R14: 00007f82683df588 R15: 0000000000000003 [ 220.282689] </TASK> [ 220.282693] irq event stamp: 6232311 [ 220.282697] hardirqs last enabled at (6232319): [<ffffffff9718cd7e>] __up_console_sem+0x5e/0x70 [ 220.282704] hardirqs last disabled at (6232326): [<ffffffff9718cd63>] __up_console_sem+0x43/0x70 [ 220.282709] softirqs last enabled at (6232072): [<ffffffff970ff669>] __irq_exit_rcu+0xf9/0x170 [ 220.282716] softirqs last disabled at (6232061): [<ffffffff970ff669>] __irq_exit_rcu+0xf9/0x170 [ 220.282722] ---[ end trace 0000000000000000 ]--- Therefore, remove the mutex_unlock from the amdgpu_cs_vm_handling function, so that amdgpu_cs_submit and amdgpu_cs_parser_fini can handle the unlock. Fixes: 90af0ca ("drm/amdgpu: Protect the amdgpu_bo_list list with a mutex v2") Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Maíra Canal <mairacanal@riseup.net> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

When we try to transmit an skb with metadata_dst attached (i.e. dst->dev == NULL) through xfrm interface we can hit a null pointer dereference[1] in xfrmi_xmit2() -> xfrm_lookup_with_ifid() due to the check for a loopback skb device when there's no policy which dereferences dst->dev unconditionally. Not having dst->dev can be interepreted as it not being a loopback device, so just add a check for a null dst_orig->dev. With this fix xfrm interface's Tx error counters go up as usual. [1] net-next calltrace captured via netconsole: BUG: kernel NULL pointer dereference, address: 00000000000000c0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 7231 Comm: ping Kdump: loaded Not tainted 5.19.0+ #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:xfrm_lookup_with_ifid+0x5eb/0xa60 Code: 8d 74 24 38 e8 26 a4 37 00 48 89 c1 e9 12 fc ff ff 49 63 ed 41 83 fd be 0f 85 be 01 00 00 41 be ff ff ff ff 45 31 ed 48 8b 03 <f6> 80 c0 00 00 00 08 75 0f 41 80 bc 24 19 0d 00 00 01 0f 84 1e 02 RSP: 0018:ffffb0db82c679f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffd0db7fcad430 RCX: ffffb0db82c67a10 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb0db82c67a80 RBP: ffffb0db82c67a80 R08: ffffb0db82c67a14 R09: 0000000000000000 R10: 0000000000000000 R11: ffff8fa449667dc8 R12: ffffffff966db880 R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000 FS: 00007ff35c83f000(0000) GS:ffff8fa478480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 000000001ebb7000 CR4: 0000000000350ee0 Call Trace: <TASK> xfrmi_xmit+0xde/0x460 ? tcf_bpf_act+0x13d/0x2a0 dev_hard_start_xmit+0x72/0x1e0 __dev_queue_xmit+0x251/0xd30 ip_finish_output2+0x140/0x550 ip_push_pending_frames+0x56/0x80 raw_sendmsg+0x663/0x10a0 ? try_charge_memcg+0x3fd/0x7a0 ? __mod_memcg_lruvec_state+0x93/0x110 ? sock_sendmsg+0x30/0x40 sock_sendmsg+0x30/0x40 __sys_sendto+0xeb/0x130 ? handle_mm_fault+0xae/0x280 ? do_user_addr_fault+0x1e7/0x680 ? kvm_read_and_reset_apf_flags+0x3b/0x50 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7ff35cac1366 Code: eb 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 RSP: 002b:00007fff738e4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007fff738e57b0 RCX: 00007ff35cac1366 RDX: 0000000000000040 RSI: 0000557164e4b450 RDI: 0000000000000003 RBP: 0000557164e4b450 R08: 00007fff738e7a2c R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 R13: 00007fff738e5770 R14: 00007fff738e4030 R15: 0000001d00000001 </TASK> Modules linked in: netconsole veth br_netfilter bridge bonding virtio_net [last unloaded: netconsole] CR2: 00000000000000c0 CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Daniel Borkmann <daniel@iogearbox.net> Fixes: 2d151d3 ("xfrm: Add possibility to set the default to block if we have no policy") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

The thermal zone is freed after being unregistered. The release method devm_thermal_zone_device_register() calls -> thermal_of_zone_device_unregister() This one calls thermal_zone_device_unregister() which frees the thermal zone. However, thermal_of_zone_device_unregister() does access this freed pointer to free different resources allocated by the thermal_of framework which is invalid. It results in a kernel panic: [ 1.915140] thermal_sys: Failed to find thermal zone for tmu id=2 [ 1.921279] qoriq_thermal 1f80000.tmu: Failed to register sensors [ 1.927395] qoriq_thermal: probe of 1f80000.tmu failed with error -22 [ 1.934189] Unable to handle kernel paging request at virtual address 01adadadadadad88 [ 1.942146] Mem abort info: [ 1.944948] ESR = 0x0000000096000004 [ 1.948708] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.954042] SET = 0, FnV = 0 [ 1.957107] EA = 0, S1PTW = 0 [ 1.960253] FSC = 0x04: level 0 translation fault [ 1.965147] Data abort info: [ 1.968030] ISV = 0, ISS = 0x00000004 [ 1.971878] CM = 0, WnR = 0 [ 1.974852] [01adadadadadad88] address between user and kernel address ranges [ 1.982016] Internal error: Oops: 96000004 [#1] SMP [ 1.986907] Modules linked in: [ 1.989969] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-next-20220808-00080-g1c46f44502e0 #1697 [ 1.999135] Hardware name: Kontron KBox A-230-LS (DT) [ 2.004199] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2.011185] pc : kfree+0x5c/0x3c0 [ 2.014516] lr : devm_thermal_of_zone_release+0x38/0x60 [ 2.019761] sp : ffff80000a22bad0 [ 2.023081] x29: ffff80000a22bad0 x28: 0000000000000000 x27: ffff800009960464 [ 2.030245] x26: ffff800009a16960 x25: 0000000000000006 x24: ffff800009f09a40 [ 2.037407] x23: ffff800009ab9008 x22: ffff800008d0eea8 x21: 01adadadadadad80 [ 2.044569] x20: 6b6b6b6b6b6b6b6b x19: ffff00200232b800 x18: 00000000fffffffb [ 2.051731] x17: ffff800008d0eea0 x16: ffff800008d07d44 x15: ffff800008d0d154 [ 2.056647] usb 1-1: new high-speed USB device number 2 using xhci-hcd [ 2.058893] x14: ffff800008d0cddc x13: ffff8000088d1c2c x12: ffff8000088d5034 [ 2.072597] x11: ffff8000088d46d4 x10: 0000000000000000 x9 : ffff800008d0eea8 [ 2.079759] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : ffff80000a0f53b8 [ 2.086921] x5 : ffff80000a22b960 x4 : 0000000000000000 x3 : 0000000000000000 [ 2.094082] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 01adb1adadadad80 [ 2.101244] Call trace: [ 2.103692] kfree+0x5c/0x3c0 [ 2.106666] devm_thermal_of_zone_release+0x38/0x60 [ 2.111561] release_nodes+0x64/0xd0 [ 2.115146] devres_release_all+0xbc/0x350 [ 2.119253] device_unbind_cleanup+0x20/0x70 [ 2.123536] really_probe+0x1a0/0x2e4 [ 2.127208] __driver_probe_device+0x80/0xec [ 2.131490] driver_probe_device+0x44/0x130 [ 2.135685] __driver_attach+0x104/0x1b4 [ 2.139619] bus_for_each_dev+0x7c/0xe0 [ 2.143465] driver_attach+0x30/0x40 [ 2.147048] bus_add_driver+0x160/0x210 [ 2.150894] driver_register+0x84/0x140 [ 2.154741] __platform_driver_register+0x34/0x40 [ 2.159461] qoriq_tmu_init+0x28/0x34 [ 2.163133] do_one_initcall+0x50/0x250 [ 2.166979] kernel_init_freeable+0x278/0x31c [ 2.171349] kernel_init+0x30/0x140 [ 2.174847] ret_from_fork+0x10/0x20 [ 2.178433] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1) [ 2.184546] ---[ end trace 0000000000000000 ]--- Store the allocated resource pointers before the thermal zone is free and use them to release the resource after unregistering the thermal zone. Fixes: 3bd52ac ("thermal/of: Rework the thermal device tree initialization") Reported-by: Michael Walle <michael@walle.cc> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Tested-by: Michael Walle <michael@walle.cc> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20220809085629.509116-4-daniel.lezcano@linaro.org

We have been hitting the following lockdep splat with btrfs/187 recently WARNING: possible circular locking dependency detected 5.19.0-rc8+ #775 Not tainted ------------------------------------------------------ btrfs/752500 is trying to acquire lock: ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 but task is already holding lock: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-01/1){+.+.}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_init_new_buffer+0x7d/0x2c0 btrfs_alloc_tree_block+0x120/0x3b0 __btrfs_cow_block+0x136/0x600 btrfs_cow_block+0x10b/0x230 btrfs_search_slot+0x53b/0xb70 btrfs_lookup_inode+0x2a/0xa0 __btrfs_update_delayed_inode+0x5f/0x280 btrfs_async_run_delayed_root+0x24c/0x290 btrfs_work_helper+0xf2/0x3e0 process_one_work+0x271/0x590 worker_thread+0x52/0x3b0 kthread+0xf0/0x120 ret_from_fork+0x1f/0x30 -> #1 (btrfs-tree-01){++++}-{3:3}: down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_search_slot+0x3c3/0xb70 do_relocation+0x10c/0x6b0 relocate_tree_blocks+0x317/0x6d0 relocate_block_group+0x1f1/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}: __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 down_write_nested+0x41/0x80 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Chain exists of: btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-01/1); lock(btrfs-tree-01); lock(btrfs-tree-01/1); lock(btrfs-treloc-02#2); *** DEADLOCK *** 7 locks held by btrfs/752500: #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90 #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40 #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400 #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610 #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0 #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110 stack backtrace: CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack_lvl+0x56/0x73 check_noncircular+0xd6/0x100 ? lock_is_held_type+0xe2/0x140 __lock_acquire+0x1122/0x1e10 lock_acquire+0xc2/0x2d0 ? __btrfs_tree_lock+0x24/0x110 down_write_nested+0x41/0x80 ? __btrfs_tree_lock+0x24/0x110 __btrfs_tree_lock+0x24/0x110 btrfs_lock_root_node+0x31/0x50 btrfs_search_slot+0x1cb/0xb70 ? lock_release+0x137/0x2d0 ? _raw_spin_unlock+0x29/0x50 ? release_extent_buffer+0x128/0x180 replace_path+0x541/0x9f0 merge_reloc_root+0x1d6/0x610 merge_reloc_roots+0xe2/0x260 relocate_block_group+0x2c8/0x560 btrfs_relocate_block_group+0x23e/0x400 btrfs_relocate_chunk+0x4c/0x140 btrfs_balance+0x755/0xe40 btrfs_ioctl+0x1ea2/0x2c90 ? lock_is_held_type+0xe2/0x140 ? lock_is_held_type+0xe2/0x140 ? __x64_sys_ioctl+0x88/0xc0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd This isn't necessarily new, it's just tricky to hit in practice. There are two competing things going on here. With relocation we create a snapshot of every fs tree with a reloc tree. Any extent buffers that get initialized here are initialized with the reloc root lockdep key. However since it is a snapshot, any blocks that are currently in cache that originally belonged to the fs tree will have the normal tree lockdep key set. This creates the lock dependency of reloc tree -> normal tree for the extent buffer locking during the first phase of the relocation as we walk down the reloc root to relocate blocks. However this is problematic because the final phase of the relocation is merging the reloc root into the original fs root. This involves searching down to any keys that exist in the original fs root and then swapping the relocated block and the original fs root block. We have to search down to the fs root first, and then go search the reloc root for the block we need to replace. This creates the dependency of normal tree -> reloc tree which is why lockdep complains. Additionally even if we were to fix this particular mismatch with a different nesting for the merge case, we're still slotting in a block that has a owner of the reloc root objectid into a normal tree, so that block will have its lockdep key set to the tree reloc root, and create a lockdep splat later on when we wander into that block from the fs root. Unfortunately the only solution here is to make sure we do not set the lockdep key to the reloc tree lockdep key normally, and then reset any blocks we wander into from the reloc root when we're doing the merged. This solves the problem of having mixed tree reloc keys intermixed with normal tree keys, and then allows us to make sure in the merge case we maintain the lock order of normal tree -> reloc tree We handle this by setting a bit on the reloc root when we do the search for the block we want to relocate, and any block we search into or COW at that point gets set to the reloc tree key. This works correctly because we only ever COW down to the parent node, so we aren't resetting the key for the block we're linking into the fs root. With this patch we no longer have the lockdep splat in btrfs/187. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Syzkaller reported BUG_ON as follows: ------------[ cut here ]------------ kernel BUG at fs/ntfs/dir.c:86! invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 3 PID: 758 Comm: a.out Not tainted 5.19.0-next-20220808 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:ntfs_lookup_inode_by_name+0xd11/0x2d10 Code: ff e9 b9 01 00 00 e8 1e fe d6 fe 48 8b 7d 98 49 8d 5d 07 e8 91 85 29 ff 48 c7 45 98 00 00 00 00 e9 5a fb ff ff e8 ff fd d6 fe <0f> 0b e8 f8 fd d6 fe 0f 0b e8 f1 fd d6 fe 48 8b b5 50 ff ff ff 4c RSP: 0018:ffff888079607978 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000008000 RCX: 0000000000000000 RDX: ffff88807cf10000 RSI: ffffffff82a4a081 RDI: 0000000000000003 RBP: ffff888079607a70 R08: 0000000000000001 R09: ffff88807a6d01d7 R10: ffffed100f4da03a R11: 0000000000000000 R12: ffff88800f0fb110 R13: ffff88800f0ee000 R14: ffff88800f0fb000 R15: 0000000000000001 FS: 00007f33b63c7540(0000) GS:ffff888108580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f33b635c090 CR3: 000000000f39e005 CR4: 0000000000770ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> load_system_files+0x1f7f/0x3620 ntfs_fill_super+0xa01/0x1be0 mount_bdev+0x36a/0x440 ntfs_mount+0x3a/0x50 legacy_get_tree+0xfb/0x210 vfs_get_tree+0x8f/0x2f0 do_new_mount+0x30a/0x760 path_mount+0x4de/0x1880 __x64_sys_mount+0x2b3/0x340 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f33b62ff9ea Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffd0c471aa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f33b62ff9ea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd0c471be0 RBP: 00007ffd0c471c60 R08: 00007ffd0c471ae0 R09: 00007ffd0c471c24 R10: 0000000000000000 R11: 0000000000000202 R12: 000055bac5afc160 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- Fix this by adding sanity check on extended system files' directory inode to ensure that it is directory, just like ntfs_extend_init() when mounting ntfs3. Link: https://lkml.kernel.org/r/20220809064730.2316892-1-chenxiaosong2@huawei.com Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

An argument list like "arg=val arg2 \"" can trigger a page fault if the page pointed by 'args[0xffffffff]' is not mapped and potential memory corruption otherwise (unlikely but possible if the bogus address is mapped and contents happen to match the ascii value of the quote character). The fix is to ensure that we load 'args[i-1]' only when (i > 0). Prior to this commit the following command would trigger an unhandled page fault in the kernel: root@(none):/linus/fs/fat# insmod ./fat.ko "foo=bar \"" [ 33.870507] BUG: unable to handle page fault for address: ffff888204252608 [ 33.872180] #PF: supervisor read access in kernel mode [ 33.873414] #PF: error_code(0x0000) - not-present page [ 33.874650] PGD 4401067 P4D 4401067 PUD 0 [ 33.875321] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [ 33.876113] CPU: 16 PID: 399 Comm: insmod Not tainted 5.19.0-dbg-DEV #4 [ 33.877193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 33.878739] RIP: 0010:next_arg+0xd1/0x110 [ 33.879399] Code: 22 75 1d 41 c6 04 01 00 41 80 f8 22 74 18 eb 35 4c 89 0e 45 31 d2 4c 89 cf 48 c7 02 00 00 00 00 41 80 f8 22 75 1f 41 8d 42 ff <41> 80 3c 01 22 75 14 41 c6 04 01 00 eb 0d 48 c7 02 00 00 00 00 41 [ 33.882338] RSP: 0018:ffffc90001253d08 EFLAGS: 00010246 [ 33.883174] RAX: 00000000ffffffff RBX: ffff888104252608 RCX: 0fc317bba1c1dd00 [ 33.884311] RDX: ffffc90001253d40 RSI: ffffc90001253d48 RDI: ffff888104252609 [ 33.885450] RBP: ffffc90001253d10 R08: 0000000000000022 R09: ffff888104252609 [ 33.886595] R10: 0000000000000000 R11: ffffffff82c7ff20 R12: 0000000000000282 [ 33.887748] R13: 00000000ffff8000 R14: 0000000000000000 R15: 0000000000007fff [ 33.888887] FS: 00007f04ec7432c0(0000) GS:ffff88813d300000(0000) knlGS:0000000000000000 [ 33.890183] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 33.891111] CR2: ffff888204252608 CR3: 0000000100f36005 CR4: 0000000000170ee0 [ 33.892241] Call Trace: [ 33.892641] <TASK> [ 33.892989] parse_args+0x8f/0x220 [ 33.893538] load_module+0x138b/0x15a0 [ 33.894149] ? prepare_coming_module+0x50/0x50 [ 33.894879] ? kernel_read_file_from_fd+0x5f/0x90 [ 33.895639] __se_sys_finit_module+0xce/0x130 [ 33.896342] __x64_sys_finit_module+0x1d/0x20 [ 33.897042] do_syscall_64+0x44/0xa0 [ 33.897622] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 33.898434] RIP: 0033:0x7f04ec85ef79 [ 33.899009] Code: 48 8d 3d da db 0d 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c7 9e 0d 00 f7 d8 64 89 01 48 [ 33.901912] RSP: 002b:00007fffae81bfe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 33.903081] RAX: ffffffffffffffda RBX: 0000559c5f1d2640 RCX: 00007f04ec85ef79 [ 33.904191] RDX: 0000000000000000 RSI: 0000559c5f1d12a0 RDI: 0000000000000003 [ 33.905304] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 33.906421] R10: 0000000000000003 R11: 0000000000000246 R12: 0000559c5f1d12a0 [ 33.907526] R13: 0000000000000000 R14: 0000559c5f1d25f0 R15: 0000559c5f1d12a0 [ 33.908631] </TASK> [ 33.908986] Modules linked in: fat(+) [last unloaded: fat] [ 33.909843] CR2: ffff888204252608 [ 33.910375] ---[ end trace 0000000000000000 ]--- [ 33.911172] RIP: 0010:next_arg+0xd1/0x110 [ 33.911796] Code: 22 75 1d 41 c6 04 01 00 41 80 f8 22 74 18 eb 35 4c 89 0e 45 31 d2 4c 89 cf 48 c7 02 00 00 00 00 41 80 f8 22 75 1f 41 8d 42 ff <41> 80 3c 01 22 75 14 41 c6 04 01 00 eb 0d 48 c7 02 00 00 00 00 41 [ 33.914643] RSP: 0018:ffffc90001253d08 EFLAGS: 00010246 [ 33.915446] RAX: 00000000ffffffff RBX: ffff888104252608 RCX: 0fc317bba1c1dd00 [ 33.916544] RDX: ffffc90001253d40 RSI: ffffc90001253d48 RDI: ffff888104252609 [ 33.917636] RBP: ffffc90001253d10 R08: 0000000000000022 R09: ffff888104252609 [ 33.918727] R10: 0000000000000000 R11: ffffffff82c7ff20 R12: 0000000000000282 [ 33.919821] R13: 00000000ffff8000 R14: 0000000000000000 R15: 0000000000007fff [ 33.920908] FS: 00007f04ec7432c0(0000) GS:ffff88813d300000(0000) knlGS:0000000000000000 [ 33.922125] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 33.923017] CR2: ffff888204252608 CR3: 0000000100f36005 CR4: 0000000000170ee0 [ 33.924098] Kernel panic - not syncing: Fatal exception [ 33.925776] Kernel Offset: disabled [ 33.926347] Rebooting in 10 seconds.. Link: https://lkml.kernel.org/r/20220728232434.1666488-1-neelnatu@google.com Signed-off-by: Neel Natu <neelnatu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Patch series "Dump command line of faulting process to syslog", v3. This patch series dumps the command line (including the program parameters) of a faulting process to the syslog. The motivation for this patch is that it's sometimes quite hard to find out and annoying to not know which program *exactly* faulted when looking at the syslog. For example, a dump on parisc shows: do_page_fault() command='cc1' type=15 address=0x00000000 in libc-2.33.so[f6abb000+184000] -> We see the "cc1" compiler crashed, but it would be useful to know which file was compiled. With this patch you will see that cc1 crashed while compiling some haskell code: cc1[13472] cmdline: /usr/lib/gcc/hppa-linux-gnu/12/cc1 -quiet @/tmp/ccRkFSfY -imultilib . -imultiarch hppa-linux-gnu -D USE_MINIINTERPRETER -D NO_REGS -D _HPUX_SOURCE -D NOSMP -D THREADED_RTS -include /build/ghc/ghc-9.0.2/includes/dist-install/build/ghcversion.h -iquote compiler/GHC/Iface -quiet -dumpdir /tmp/ghc13413_0/ -dumpbase ghc_5.hc -dumpbase-ext .hc -O -Wimplicit -fno-PIC -fwrapv -fno-builtin -fno-strict-aliasing -o /tmp/ghc13413_0/ghc_5.s Another example are the glibc testcases which always segfault in "ld.so.1" with no other info: do_page_fault() command='ld.so.1' type=15 address=0x565921d8 in libc.so[f7339000+1bb000] -> With the patch you can see it was the "tst-safe-linking-malloc-hugetlb1" testcase: ld.so.1[1151] cmdline: /home/gnu/glibc/objdir/elf/ld.so.1 --library-path /home/gnu/glibc/objdir:/home/gnu/glibc/objdir/math:/home/gnu/ /home/gnu/glibc/objdir/malloc/tst-safe-linking-malloc-hugetlb1 An example of a typical x86 fault shows up as: crash[2326]: segfault at 0 ip 0000561a7969c12e sp 00007ffe97a05630 error 6 in crash[561a7969c000+1000] Code: 68 ff ff ff c6 05 19 2f 00 00 01 5d c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 ... -> with this patch you now see the whole command line: crash[2326] cmdline: ./crash test_write_to_page_0 The patches are relatively small, and reuse functions which are used to create the output for the /proc/<pid>/cmdline files. The relevant changes are in patches #1 and #2. Patch #3 adds the cmdline dump on x86. Patch #4 drops code from arc which now becomes unnecessary as this is done by generic code. This patch (of 4): Add a new function get_task_cmdline_kernel() which reads the command line of a process into a kernel buffer. This command line can then be dumped by arch code to give additional debug info via the parameters with which a faulting process was started. The new function reuses the existing code which provides the cmdline for the procfs. For that the existing functions were modified so that the buffer page is allocated outside of get_mm_proctitle() and get_mm_cmdline() and instead provided as parameter. Link: https://lkml.kernel.org/r/20220808130917.30760-1-deller@gmx.de Link: https://lkml.kernel.org/r/20220808130917.30760-2-deller@gmx.de Signed-off-by: Helge Deller <deller@gmx.de> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Syzkaller reported a triggered kernel BUG as follows: ------------[ cut here ]------------ kernel BUG at kernel/bpf/cgroup.c:925! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0 Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84 28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 <0f> 0b 48 8b 0c5 RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578 RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000 R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00 FS: 00007f68213d2b80(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0 Call Trace: <TASK> cgroup_bpf_prog_detach+0xcc/0x100 __sys_bpf+0x2273/0x2a00 __x64_sys_bpf+0x17/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f68214dbcb9 Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff8 RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141 RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9 RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009 RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003 R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20 R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- Repetition steps: For the following cgroup tree, root | cg1 | cg2 1. attach prog2 to cg2, and then attach prog1 to cg1, both bpf progs attach type is NONE or OVERRIDE. 2. write 1 to /proc/thread-self/fail-nth for failslab. 3. detach prog1 for cg1, and then kernel BUG occur. Failslab injection will cause kmalloc fail and fall back to purge_effective_progs. The problem is that cg2 have attached another prog, so when go through cg2 layer, iteration will add pos to 1, and subsequent operations will be skipped by the following condition, and cg will meet NULL in the end. `if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))` The NULL cg means no link or prog match, this is as expected, and it's not a bug. So here just skip the no match situation. Fixes: 4c46091 ("bpf: Fix KASAN use-after-free Read in compute_effective_progs") Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com

If ntfs_fill_super() wasn't called then sbi->sb will be equal to NULL. Code should check this ptr before dereferencing. Syzbot hit this issue via passing wrong mount param as can be seen from log below Fail log: ntfs3: Unknown parameter 'iochvrset' general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 1 PID: 3589 Comm: syz-executor210 Not tainted 5.18.0-rc3-syzkaller-00016-gb253435746d9 #0 ... Call Trace: <TASK> put_ntfs+0x1ed/0x2a0 fs/ntfs3/super.c:463 ntfs_fs_free+0x6a/0xe0 fs/ntfs3/super.c:1363 put_fs_context+0x119/0x7a0 fs/fs_context.c:469 do_new_mount+0x2b4/0xad0 fs/namespace.c:3044 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] Fixes: 82cae26 ("fs/ntfs3: Add initialization of super block") Reported-and-tested-by: syzbot+c95173762127ad76a824@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

The recent change to get_phb_number() causes a DEBUG_ATOMIC_SLEEP warning on some systems: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 1 lock held by swapper/1: #0: c157efb0 (hose_spinlock){+.+.}-{2:2}, at: pcibios_alloc_controller+0x64/0x220 Preemption disabled at: [<00000000>] 0x0 CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0-yocto-standard+ #1 Call Trace: [d101dc90] [c073b264] dump_stack_lvl+0x50/0x8c (unreliable) [d101dcb0] [c0093b70] __might_resched+0x258/0x2a8 [d101dcd0] [c0d3e634] __mutex_lock+0x6c/0x6ec [d101dd50] [c0a84174] of_alias_get_id+0x50/0xf4 [d101dd80] [c002ec78] pcibios_alloc_controller+0x1b8/0x220 [d101ddd0] [c140c9dc] pmac_pci_init+0x198/0x784 [d101de50] [c140852c] discover_phbs+0x30/0x4c [d101de60] [c0007fd4] do_one_initcall+0x94/0x344 [d101ded0] [c1403b40] kernel_init_freeable+0x1a8/0x22c [d101df10] [c00086e0] kernel_init+0x34/0x160 [d101df30] [c001b334] ret_from_kernel_thread+0x5c/0x64 This is because pcibios_alloc_controller() holds hose_spinlock but of_alias_get_id() takes of_mutex which can sleep. The hose_spinlock protects the phb_bitmap, and also the hose_list, but it doesn't need to be held while get_phb_number() calls the OF routines, because those are only looking up information in the device tree. So fix it by having get_phb_number() take the hose_spinlock itself, only where required, and then dropping the lock before returning. pcibios_alloc_controller() then needs to take the lock again before the list_add() but that's safe, the order of the list is not important. Fixes: 0fe1e96 ("powerpc/pci: Prefer PCI domain assignment via DT 'linux,pci-domain' and alias") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220815065550.1303620-1-mpe@ellerman.id.au

Petr Machata says: ==================== mlxsw: Fixes for PTP support This set fixes several issues in mlxsw PTP code. - Patch #1 fixes compilation warnings. - Patch #2 adjusts the order of operation during cleanup, thereby closing the window after PTP state was already cleaned in the ASIC for the given port, but before the port is removed, when the user could still in theory make changes to the configuration. - Patch #3 protects the PTP configuration with a custom mutex, instead of relying on RTNL, which is not held in all access paths. - Patch #4 forbids enablement of PTP only in RX or only in TX. The driver implicitly assumed this would be the case, but neglected to sanitize the configuration. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>

The following BUG was reported: traps: Missing ENDBR: andw_ax_dx+0x0/0x10 [kvm] ------------[ cut here ]------------ kernel BUG at arch/x86/kernel/traps.c:253! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI <TASK> asm_exc_control_protection+0x2b/0x30 RIP: 0010:andw_ax_dx+0x0/0x10 [kvm] Code: c3 cc cc cc cc 0f 1f 44 00 00 66 0f 1f 00 48 19 d0 c3 cc cc cc cc 0f 1f 40 00 f3 0f 1e fa 20 d0 c3 cc cc cc cc 0f 1f 44 00 00 <66> 0f 1f 00 66 21 d0 c3 cc cc cc cc 0f 1f 40 00 66 0f 1f 00 21 d0 ? andb_al_dl+0x10/0x10 [kvm] ? fastop+0x5d/0xa0 [kvm] x86_emulate_insn+0x822/0x1060 [kvm] x86_emulate_instruction+0x46f/0x750 [kvm] complete_emulated_mmio+0x216/0x2c0 [kvm] kvm_arch_vcpu_ioctl_run+0x604/0x650 [kvm] kvm_vcpu_ioctl+0x2f4/0x6b0 [kvm] ? wake_up_q+0xa0/0xa0 The BUG occurred because the ENDBR in the andw_ax_dx() fastop function had been incorrectly "sealed" (converted to a NOP) by apply_ibt_endbr(). Objtool marked it to be sealed because KVM has no compile-time references to the function. Instead KVM calculates its address at runtime. Prevent objtool from annotating fastop functions as sealable by creating throwaway dummy compile-time references to the functions. Fixes: 6649fa8 ("x86/ibt,kvm: Add ENDBR to fastops") Reported-by: Pengfei Xu <pengfei.xu@intel.com> Debugged-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Message-Id: <0d4116f90e9d0c1b754bb90c585e6f0415a1c508.1660837839.git.jpoimboe@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

…kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.0, take #1 - Fix unexpected sign extension of KVM_ARM_DEVICE_ID_MASK - Tidy-up handling of AArch32 on asymmetric systems

Check the bo->resource value before accessing the resource mem_type. v2: Fix commit description unwrapped warning <log snip> [ 40.191227][ T184] general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN PTI [ 40.192995][ T184] KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] [ 40.194411][ T184] CPU: 1 PID: 184 Comm: systemd-udevd Not tainted 5.19.0-rc4-00721-gb297c22b7070 #1 [ 40.196063][ T184] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 40.199605][ T184] RIP: 0010:ttm_bo_validate+0x1b3/0x240 [ttm] [ 40.200754][ T184] Code: e8 72 c5 ff ff 83 f8 b8 74 d4 85 c0 75 54 49 8b 9e 58 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 10 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 04 3c 03 7e 44 8b 53 10 31 c0 85 d2 0f 85 58 [ 40.203685][ T184] RSP: 0018:ffffc900006df0c8 EFLAGS: 00010202 [ 40.204630][ T184] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 1ffff1102f4bb71b [ 40.205864][ T184] RDX: 0000000000000002 RSI: ffffc900006df208 RDI: 0000000000000010 [ 40.207102][ T184] RBP: 1ffff920000dbe1a R08: ffffc900006df208 R09: 0000000000000000 [ 40.208394][ T184] R10: ffff88817a5f0000 R11: 0000000000000001 R12: ffffc900006df110 [ 40.209692][ T184] R13: ffffc900006df0f0 R14: ffff88817a5db800 R15: ffffc900006df208 [ 40.210862][ T184] FS: 00007f6b1d16e8c0(0000) GS:ffff88839d700000(0000) knlGS:0000000000000000 [ 40.212250][ T184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.213275][ T184] CR2: 000055a1001d4ff0 CR3: 00000001700f4000 CR4: 00000000000006e0 [ 40.214469][ T184] Call Trace: [ 40.214974][ T184] <TASK> [ 40.215438][ T184] ? ttm_bo_bounce_temp_buffer+0x140/0x140 [ttm] [ 40.216572][ T184] ? mutex_spin_on_owner+0x240/0x240 [ 40.217456][ T184] ? drm_vma_offset_add+0xaa/0x100 [drm] [ 40.218457][ T184] ttm_bo_init_reserved+0x3d6/0x540 [ttm] [ 40.219410][ T184] ? shmem_get_inode+0x744/0x980 [ 40.220231][ T184] ttm_bo_init_validate+0xb1/0x200 [ttm] [ 40.221172][ T184] ? bo_driver_evict_flags+0x340/0x340 [drm_vram_helper] [ 40.222530][ T184] ? ttm_bo_init_reserved+0x540/0x540 [ttm] [ 40.223643][ T184] ? __do_sys_finit_module+0x11a/0x1c0 [ 40.224654][ T184] ? __shmem_file_setup+0x102/0x280 [ 40.234764][ T184] drm_gem_vram_create+0x305/0x480 [drm_vram_helper] [ 40.235766][ T184] ? bo_driver_evict_flags+0x340/0x340 [drm_vram_helper] [ 40.236846][ T184] ? __kasan_slab_free+0x108/0x180 [ 40.237650][ T184] drm_gem_vram_fill_create_dumb+0x134/0x340 [drm_vram_helper] [ 40.238864][ T184] ? local_pci_probe+0xdf/0x180 [ 40.239674][ T184] ? drmm_vram_helper_init+0x400/0x400 [drm_vram_helper] [ 40.240826][ T184] drm_client_framebuffer_create+0x19c/0x400 [drm] [ 40.241955][ T184] ? drm_client_buffer_delete+0x200/0x200 [drm] [ 40.243001][ T184] ? drm_client_pick_crtcs+0x554/0xb80 [drm] [ 40.244030][ T184] drm_fb_helper_generic_probe+0x23f/0x940 [drm_kms_helper] [ 40.245226][ T184] ? __cond_resched+0x1c/0xc0 [ 40.245987][ T184] ? drm_fb_helper_memory_range_to_clip+0x180/0x180 [drm_kms_helper] [ 40.247316][ T184] ? mutex_unlock+0x80/0x100 [ 40.248005][ T184] ? __mutex_unlock_slowpath+0x2c0/0x2c0 [ 40.249083][ T184] drm_fb_helper_single_fb_probe+0x907/0xf00 [drm_kms_helper] [ 40.250314][ T184] ? drm_fb_helper_check_var+0x1180/0x1180 [drm_kms_helper] [ 40.251540][ T184] ? __cond_resched+0x1c/0xc0 [ 40.252321][ T184] ? mutex_lock+0x9f/0x100 [ 40.253062][ T184] __drm_fb_helper_initial_config_and_unlock+0xb9/0x2c0 [drm_kms_helper] [ 40.254394][ T184] drm_fbdev_client_hotplug+0x56f/0x840 [drm_kms_helper] [ 40.255477][ T184] drm_fbdev_generic_setup+0x165/0x3c0 [drm_kms_helper] [ 40.256607][ T184] bochs_pci_probe+0x6b7/0x900 [bochs] [ 40.257515][ T184] ? _raw_spin_lock_irqsave+0x87/0x100 [ 40.258312][ T184] ? bochs_hw_init+0x480/0x480 [bochs] [ 40.259244][ T184] ? bochs_hw_init+0x480/0x480 [bochs] [ 40.260186][ T184] local_pci_probe+0xdf/0x180 [ 40.260928][ T184] pci_call_probe+0x15f/0x500 [ 40.265798][ T184] ? _raw_spin_lock+0x81/0x100 [ 40.266508][ T184] ? pci_pm_suspend_noirq+0x980/0x980 [ 40.267322][ T184] ? pci_assign_irq+0x81/0x280 [ 40.268096][ T184] ? pci_match_device+0x351/0x6c0 [ 40.268883][ T184] ? kernfs_put+0x18/0x40 [ 40.269611][ T184] pci_device_probe+0xee/0x240 [ 40.270352][ T184] really_probe+0x435/0xa80 [ 40.271021][ T184] __driver_probe_device+0x2ab/0x480 [ 40.271828][ T184] driver_probe_device+0x49/0x140 [ 40.272627][ T184] __driver_attach+0x1bd/0x4c0 [ 40.273372][ T184] ? __device_attach_driver+0x240/0x240 [ 40.274273][ T184] bus_for_each_dev+0x11e/0x1c0 [ 40.275080][ T184] ? subsys_dev_iter_exit+0x40/0x40 [ 40.275951][ T184] ? klist_add_tail+0x132/0x280 [ 40.276767][ T184] bus_add_driver+0x39b/0x580 [ 40.277574][ T184] driver_register+0x20f/0x3c0 [ 40.278281][ T184] ? 0xffffffffc04a2000 [ 40.278894][ T184] do_one_initcall+0x8a/0x300 [ 40.279642][ T184] ? trace_event_raw_event_initcall_level+0x1c0/0x1c0 [ 40.280707][ T184] ? kasan_unpoison+0x23/0x80 [ 40.281479][ T184] ? kasan_unpoison+0x23/0x80 [ 40.282197][ T184] do_init_module+0x190/0x640 [ 40.282926][ T184] load_module+0x221b/0x2780 [ 40.283611][ T184] ? layout_and_allocate+0x5c0/0x5c0 [ 40.284401][ T184] ? kernel_read_file+0x286/0x6c0 [ 40.285216][ T184] ? __x64_sys_fspick+0x2c0/0x2c0 [ 40.286043][ T184] ? mmap_region+0x4e7/0x1300 [ 40.286832][ T184] ? __do_sys_finit_module+0x11a/0x1c0 [ 40.287743][ T184] __do_sys_finit_module+0x11a/0x1c0 [ 40.288636][ T184] ? __ia32_sys_init_module+0xc0/0xc0 [ 40.289557][ T184] ? __seccomp_filter+0x15e/0xc80 [ 40.290341][ T184] ? vm_mmap_pgoff+0x185/0x240 [ 40.291060][ T184] do_syscall_64+0x3b/0xc0 [ 40.291763][ T184] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 40.292678][ T184] RIP: 0033:0x7f6b1d6279b9 [ 40.293438][ T184] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48 [ 40.296302][ T184] RSP: 002b:00007ffe7f51b798 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [ 40.297633][ T184] RAX: ffffffffffffffda RBX: 00005642dcca2880 RCX: 00007f6b1d6279b9 [ 40.298890][ T184] RDX: 0000000000000000 RSI: 00007f6b1d7b2e2d RDI: 0000000000000016 [ 40.300199][ T184] RBP: 0000000000020000 R08: 0000000000000000 R09: 00005642dccd5530 [ 40.301547][ T184] R10: 0000000000000016 R11: 0000000000000246 R12: 00007f6b1d7b2e2d [ 40.302698][ T184] R13: 0000000000000000 R14: 00005642dcca4230 R15: 00005642dcca2880 Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com> Reported-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20220726162205.2778-1-Arunpravin.PaneerSelvam@amd.com Link: https://patchwork.freedesktop.org/patch/msgid/20220809095623.3569-1-Arunpravin.PaneerSelvam@amd.com Signed-off-by: Christian König <christian.koenig@amd.com> CC: stable@vger.kernel.org

…ace is dead ftrace_startup does not remove ops from ftrace_ops_list when ftrace_startup_enable fails: register_ftrace_function ftrace_startup __register_ftrace_function ... add_ftrace_ops(&ftrace_ops_list, ops) ... ... ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1 ... return 0 // ops is in the ftrace_ops_list. When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything: unregister_ftrace_function ftrace_shutdown if (unlikely(ftrace_disabled)) return -ENODEV; // return here, __unregister_ftrace_function is not executed, // as a result, ops is still in the ftrace_ops_list __unregister_ftrace_function ... If ops is dynamically allocated, it will be free later, in this case, is_ftrace_trampoline accesses NULL pointer: is_ftrace_trampoline ftrace_ops_trampoline do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL! Syzkaller reports as follows: [ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b [ 1203.508039] #PF: supervisor read access in kernel mode [ 1203.508798] #PF: error_code(0x0000) - not-present page [ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0 [ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI [ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G B W 5.10.0 #8 [ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0 [ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00 [ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246 [ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866 [ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b [ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07 [ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399 [ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008 [ 1203.525634] FS: 00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000 [ 1203.526801] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0 [ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Therefore, when ftrace_startup_enable fails, we need to rollback registration process and remove ops from ftrace_ops_list. Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

While playing with event probes (eprobes), I tried to see what would happen if I attempted to retrieve the instruction pointer (%rip) knowing that event probes do not use pt_regs. The result was: BUG: kernel NULL pointer dereference, address: 0000000000000024 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016 RIP: 0010:get_event_field.isra.0+0x0/0x50 Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8 50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24 8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74 RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086 RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000 RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8 R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff916c9ea40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0 Call Trace: <TASK> get_eprobe_size+0xb4/0x640 ? __mod_node_page_state+0x72/0xc0 __eprobe_trace_func+0x59/0x1a0 ? __mod_lruvec_page_state+0xaa/0x1b0 ? page_remove_file_rmap+0x14/0x230 ? page_remove_rmap+0xda/0x170 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x18f/0x240 trace_event_raw_event_sched_wakeup_template+0x7a/0xb0 try_to_wake_up+0x260/0x4c0 __wake_up_common+0x80/0x180 __wake_up_common_lock+0x7c/0xc0 do_notify_parent+0x1c9/0x2a0 exit_notify+0x1a9/0x220 do_exit+0x2ba/0x450 do_group_exit+0x2d/0x90 __x64_sys_exit_group+0x14/0x20 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Obviously this is not the desired result. Move the testing for TPARG_FL_TPOINT which is only used for event probes to the top of the "$" variable check, as all the other variables are not used for event probes. Also add a check in the register parsing "%" to fail if an event probe is used. Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org Cc: stable@vger.kernel.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Tom Zanussi <zanussi@kernel.org> Fixes: 7491e2c ("tracing: Add a probe that attaches to trace events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

…odel' Petr Machata says: ==================== selftests: mlxsw: Add ordering tests for unified bridge model Amit Cohen writes: Commit 798661c ("Merge branch 'mlxsw-unified-bridge-conversion-part-6'") converted mlxsw driver to use unified bridge model. In the legacy model, when a RIF was created / destroyed, it was firmware's responsibility to update it in the relevant FID classification records. In the unified bridge model, this responsibility moved to software. This set adds tests to check the order of configuration for the following classifications: 1. {Port, VID} -> FID 2. VID -> FID 3. VNI -> FID (after decapsulation) In addition, in the legacy model, software is responsible to update a table which is used to determine the packet's egress VID. Add a test to check that the order of configuration does not impact switch behavior. See more details in the commit messages. Note that the tests supposed to pass also using the legacy model, they are added now as with the new model they test the driver and not the firmware. Patch set overview: Patch #1 adds test for {Port, VID} -> FID Patch #2 adds test for VID -> FID Patch #3 adds test for VNI -> FID Patch #4 adds test for egress VID classification ==================== Link: https://lore.kernel.org/r/cover.1660747162.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Xfstests generic/335, generic/336 sometimes crash with the following message: F2FS-fs (dm-0): detect filesystem reference count leak during umount, type: 9, count: 1 ------------[ cut here ]------------ kernel BUG at fs/f2fs/super.c:1939! Oops: invalid opcode: 0000 [linuxppc#1] SMP NOPTI CPU: 1 UID: 0 PID: 609351 Comm: umount Tainted: G W 6.17.0-rc5-xfstests-g9dd1835ecda5 linuxppc#1 PREEMPT(none) Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:f2fs_put_super+0x3b3/0x3c0 Call Trace: <TASK> generic_shutdown_super+0x7e/0x190 kill_block_super+0x1a/0x40 kill_f2fs_super+0x9d/0x190 deactivate_locked_super+0x30/0xb0 cleanup_mnt+0xba/0x150 task_work_run+0x5c/0xa0 exit_to_user_mode_loop+0xb7/0xc0 do_syscall_64+0x1ae/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> ---[ end trace 0000000000000000 ]--- It appears that sometimes it is possible that f2fs_put_super() is called before all node page reads are completed. Adding a call to f2fs_wait_on_all_pages() for F2FS_RD_NODE fixes the problem. Cc: stable@kernel.org Fixes: 2087258 ("f2fs: fix to drop all dirty meta/node pages during umount()") Signed-off-by: Jan Prusakowski <jprusakowski@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

With below scripts, it will trigger panic in f2fs: mkfs.f2fs -f /dev/vdd mount /dev/vdd /mnt/f2fs touch /mnt/f2fs/foo sync echo 111 >> /mnt/f2fs/foo f2fs_io fsync /mnt/f2fs/foo f2fs_io shutdown 2 /mnt/f2fs umount /mnt/f2fs mount -o ro,norecovery /dev/vdd /mnt/f2fs or mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0 F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f F2FS-fs (vdd): Stopped filesystem due to reason: 0 F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1 Filesystem f2fs get_tree() didn't set fc->root, returned 1 ------------[ cut here ]------------ kernel BUG at fs/super.c:1761! Oops: invalid opcode: 0000 [linuxppc#1] SMP PTI CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:vfs_get_tree.cold+0x18/0x1a Call Trace: <TASK> fc_mount+0x13/0xa0 path_mount+0x34e/0xc50 __x64_sys_mount+0x121/0x150 do_syscall_64+0x84/0x800 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa6cc126cfe The root cause is we missed to handle error number returned from f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or ro,disable_roll_forward mount option, result in returning a positive error number to vfs_get_tree(), fix it. Cc: stable@kernel.org Fixes: 6781eab ("f2fs: give -EINVAL for norecovery and rw mount") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

kbd_led_set() can sleep, and so may not be used as the brightness_set() callback. Otherwise using this led with a trigger leads to system hangs accompanied by: BUG: scheduling while atomic: acpi_fakekeyd/2588/0x00000003 CPU: 4 UID: 0 PID: 2588 Comm: acpi_fakekeyd Not tainted 6.17.9+deb14-amd64 linuxppc#1 PREEMPT(lazy) Debian 6.17.9-1 Hardware name: ASUSTeK COMPUTER INC. ASUS EXPERTBOOK B9403CVAR/B9403CVAR, BIOS B9403CVAR.311 12/24/2024 Call Trace: <TASK> [...] schedule_timeout+0xbd/0x100 __down_common+0x175/0x290 down_timeout+0x67/0x70 acpi_os_wait_semaphore+0x57/0x90 [...] asus_wmi_evaluate_method3+0x87/0x190 [asus_wmi] led_trigger_event+0x3f/0x60 [...] Fixes: 9fe44fc ("platform/x86: asus-wmi: Simplify the keyboard brightness updating process") Signed-off-by: Anton Khirnov <anton@khirnov.net> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Denis Benato <benato.denis96@gmail.com> Link: https://patch.msgid.link/20251129101307.18085-3-anton@khirnov.net Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

…stats Cited commit added a dedicated mutex (instead of RTNL) to protect the multicast route list, so that it will not change while the driver periodically traverses it in order to update the kernel about multicast route stats that were queried from the device. One instance of list entry deletion (during route replace) was missed and it can result in a use-after-free [1]. Fix by acquiring the mutex before deleting the entry from the list and releasing it afterwards. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_mr_stats_update+0x4a5/0x540 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:1006 [mlxsw_spectrum] Read of size 8 at addr ffff8881523c2fa8 by task kworker/2:5/22043 CPU: 2 UID: 0 PID: 22043 Comm: kworker/2:5 Not tainted 6.18.0-rc1-custom-g1a3d6d7cd014 linuxppc#1 PREEMPT(full) Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017 Workqueue: mlxsw_core mlxsw_sp_mr_stats_update [mlxsw_spectrum] Call Trace: <TASK> dump_stack_lvl+0xba/0x110 print_report+0x174/0x4f5 kasan_report+0xdf/0x110 mlxsw_sp_mr_stats_update+0x4a5/0x540 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:1006 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 </TASK> Allocated by task 29933: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 mlxsw_sp_mr_route_add+0xd8/0x4770 [mlxsw_spectrum] mlxsw_sp_router_fibmr_event_work+0x371/0xad0 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:7965 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 Freed by task 29933: kasan_save_stack+0x30/0x50 kasan_save_track+0x14/0x30 __kasan_save_free_info+0x3b/0x70 __kasan_slab_free+0x43/0x70 kfree+0x14e/0x700 mlxsw_sp_mr_route_add+0x2dea/0x4770 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:444 [mlxsw_spectrum] mlxsw_sp_router_fibmr_event_work+0x371/0xad0 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:7965 [mlxsw_spectrum] process_one_work+0x9cc/0x18e0 worker_thread+0x5df/0xe40 kthread+0x3b8/0x730 ret_from_fork+0x3e9/0x560 ret_from_fork_asm+0x1a/0x30 Fixes: f38656d ("mlxsw: spectrum_mr: Protect multicast route list with a lock") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/f996feecfd59fde297964bfc85040b6d83ec6089.1764695650.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

At the moment - the memory allocation for fwsec-sb is created as-needed and is released after being used. Typically this is at some point well after driver load, which can cause runtime suspend/resume to initially work on driver load but then later fail on a machine that has been running for long enough with sufficiently high enough memory pressure: kworker/7:1: page allocation failure: order:5, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 7 UID: 0 PID: 875159 Comm: kworker/7:1 Not tainted 6.17.8-300.fc43.x86_64 linuxppc#1 PREEMPT(lazy) Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024 Workqueue: pm pm_runtime_work Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 warn_alloc+0x163/0x190 ? __alloc_pages_direct_compact+0x1b3/0x220 __alloc_pages_slowpath.constprop.0+0x57a/0xb10 __alloc_frozen_pages_noprof+0x334/0x350 __alloc_pages_noprof+0xe/0x20 __dma_direct_alloc_pages.isra.0+0x1eb/0x330 dma_direct_alloc_pages+0x3c/0x190 dma_alloc_pages+0x29/0x130 nvkm_firmware_ctor+0x1ae/0x280 [nouveau] nvkm_falcon_fw_ctor+0x3e/0x60 [nouveau] nvkm_gsp_fwsec+0x10e/0x2c0 [nouveau] ? sysvec_apic_timer_interrupt+0xe/0x90 nvkm_gsp_fwsec_sb+0x27/0x70 [nouveau] tu102_gsp_fini+0x65/0x110 [nouveau] ? ktime_get+0x3c/0xf0 nvkm_subdev_fini+0x67/0xc0 [nouveau] nvkm_device_fini+0x94/0x140 [nouveau] nvkm_udevice_fini+0x50/0x70 [nouveau] nvkm_object_fini+0xb1/0x140 [nouveau] nvkm_object_fini+0x70/0x140 [nouveau] ? __pfx_pci_pm_runtime_suspend+0x10/0x10 nouveau_do_suspend+0xe4/0x170 [nouveau] nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau] pci_pm_runtime_suspend+0x67/0x1a0 ? __pfx_pci_pm_runtime_suspend+0x10/0x10 __rpm_callback+0x45/0x1f0 ? __pfx_pci_pm_runtime_suspend+0x10/0x10 rpm_callback+0x6d/0x80 rpm_suspend+0xe5/0x5e0 ? finish_task_switch.isra.0+0x99/0x2c0 pm_runtime_work+0x98/0xb0 process_one_work+0x18f/0x350 worker_thread+0x25a/0x3a0 ? __pfx_worker_thread+0x10/0x10 kthread+0xf9/0x240 ? __pfx_kthread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0xf1/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> The reason this happens is because the fwsec-sb firmware image only supports being booted from a contiguous coherent sysmem allocation. If a system runs into enough memory fragmentation from memory pressure, such as what can happen on systems with low amounts of memory, this can lead to a situation where it later becomes impossible to find space for a large enough contiguous allocation to hold fwsec-sb. This causes us to fail to boot the firmware image, causing the GPU to fail booting and causing the driver to fail. Since this firmware can't use non-contiguous allocations, the best solution to avoid this issue is to simply allocate the memory for fwsec-sb during initial driver-load, and reuse the memory allocation when fwsec-sb needs to be used. We then release the memory allocations on driver unload. Signed-off-by: Lyude Paul <lyude@redhat.com> Fixes: 594766c ("drm/nouveau/gsp: move booter handling to GPU-specific code") Cc: <stable@vger.kernel.org> # v6.16+ Reviewed-by: Timur Tabi <ttabi@nvidia.com> Link: https://patch.msgid.link/20251202175918.63533-1-lyude@redhat.com

Since commit a735831 ("drm/nouveau: vendor in drm_encoder_slave API") nouveau appears to be broken for all dispnv04 GPUs (before NV50). Depending on the kernel version, either having no display output and hanging in kernel for a long time, or even oopsing in the cleanup path like: Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac ... nouveau 0000:0a:00.0: drm: 0x14C5: Parsing digital output script table BUG: Unable to handle kernel data access on read at 0x00041520 Faulting instruction address: 0xc0003d0001be0844 Oops: Kernel access of bad area, sig: 11 [linuxppc#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=8 NUMA PowerMac Modules linked in: windfarm_cpufreq_clamp windfarm_smu_sensors windfarm_smu_controls windfarm_pm112 snd_aoa_codec_onyx snd_aoa_fabric_layout snd_aoa windfarm_pid jo apple_mfi_fastcharge rndis_host cdc_ether usbnet mii snd_aoa_i2sbus snd_aoa_soundbus snd_pcm snd_timer snd soundcore rack_meter windfarm_smu_sat windfarm_max6690_s m75_sensor windfarm_core gpu_sched drm_gpuvm drm_exec drm_client_lib drm_ttm_helper ttm drm_display_helper drm_kms_helper drm drm_panel_orientation_quirks syscopyar _sys_fops i2c_algo_bit backlight uio_pdrv_genirq uio uninorth_agp agpgart zram dm_mod dax ipv6 nfsv4 dns_resolver nfs lockd grace sunrpc offb cfbfillrect cfbimgblt ont input_leds sr_mod cdrom sd_mod uas ata_generic hid_apple hid_generic usbhid hid usb_storage pata_macio sata_svw libata firewire_ohci scsi_mod firewire_core ohci ehci_pci ehci_hcd tg3 ohci_hcd libphy usbcore usb_common nls_base led_class CPU: 0 UID: 0 PID: 245 Comm: (udev-worker) Not tainted 6.14.0-09584-g7d06015d936c #7 PREEMPTLAZY Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac NIP: c0003d0001be0844 LR: c0003d0001be0830 CTR: 0000000000000000 REGS: c0000000053f70e0 TRAP: 0300 Not tainted (6.14.0-09584-g7d06015d936c) MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 24222220 XER: 00000000 DAR: 0000000000041520 DSISR: 40000000 IRQMASK: 0 \x0aGPR00: c0003d0001be0830 c0000000053f7380 c0003d0000911900 c000000007bc6800 \x0aGPR04: 0000000000000000 0000000000000000 c000000007bc6e70 0000000000000001 \x0aGPR08: 01f3040000000000 0000000000041520 0000000000000000 c0003d0000813958 \x0aGPR12: c000000000071a48 c000000000e28000 0000000000000020 0000000000000000 \x0aGPR16: 0000000000000000 0000000000f52630 0000000000000000 0000000000000000 \x0aGPR20: 0000000000000000 0000000000000000 0000000000000001 c0003d0000928528 \x0aGPR24: c0003d0000928598 0000000000000000 c000000007025480 c000000007025480 \x0aGPR28: c0000000010b4000 0000000000000000 c000000007bc1800 c000000007bc6800 NIP [c0003d0001be0844] nv_crtc_destroy+0x44/0xd4 [nouveau] LR [c0003d0001be0830] nv_crtc_destroy+0x30/0xd4 [nouveau] Call Trace: [c0000000053f7380] [c0003d0001be0830] nv_crtc_destroy+0x30/0xd4 [nouveau] (unreliable) [c0000000053f73c0] [c0003d00007f7bf4] drm_mode_config_cleanup+0x27c/0x30c [drm] [c0000000053f7490] [c0003d0001bdea50] nouveau_display_create+0x1cc/0x550 [nouveau] [c0000000053f7500] [c0003d0001bcc29c] nouveau_drm_device_init+0x1c8/0x844 [nouveau] [c0000000053f75e0] [c0003d0001bcc9ec] nouveau_drm_probe+0xd4/0x1e0 [nouveau] [c0000000053f7670] [c000000000557d24] local_pci_probe+0x50/0xa8 [c0000000053f76f0] [c000000000557fa8] pci_device_probe+0x22c/0x240 [c0000000053f7760] [c0000000005fff3c] really_probe+0x188/0x31c [c0000000053f77e0] [c000000000600204] __driver_probe_device+0x134/0x13c [c0000000053f7860] [c0000000006002c0] driver_probe_device+0x3c/0xb4 [c0000000053f78a0] [c000000000600534] __driver_attach+0x118/0x128 [c0000000053f78e0] [c0000000005fe038] bus_for_each_dev+0xa8/0xf4 [c0000000053f7950] [c0000000005ff460] driver_attach+0x2c/0x40 [c0000000053f7970] [c0000000005fea68] bus_add_driver+0x130/0x278 [c0000000053f7a00] [c00000000060117c] driver_register+0x9c/0x1a0 [c0000000053f7a80] [c00000000055623c] __pci_register_driver+0x5c/0x70 [c0000000053f7aa0] [c0003d0001c058a0] nouveau_drm_init+0x254/0x278 [nouveau] [c0000000053f7b10] [c00000000000e9bc] do_one_initcall+0x84/0x268 [c0000000053f7bf0] [c0000000001a0ba0] do_init_module+0x70/0x2d8 [c0000000053f7c70] [c0000000001a42bc] init_module_from_file+0xb4/0x108 [c0000000053f7d50] [c0000000001a4504] sys_finit_module+0x1ac/0x478 [c0000000053f7e10] [c000000000023230] system_call_exception+0x1a4/0x20c [c0000000053f7e50] [c00000000000c554] system_call_common+0xf4/0x258 --- interrupt: c00 at 0xfd5f988 NIP: 000000000fd5f988 LR: 000000000ff9b148 CTR: 0000000000000000 REGS: c0000000053f7e80 TRAP: 0c00 Not tainted (6.14.0-09584-g7d06015d936c) MSR: 100000000000d032 <HV,EE,PR,ME,IR,DR,RI> CR: 28222244 XER: 00000000 IRQMASK: 0 \x0aGPR00: 0000000000000161 00000000ffcdc2d0 00000000405db160 0000000000000020 \x0aGPR04: 000000000ffa2c9c 0000000000000000 000000000000001f 0000000000000045 \x0aGPR08: 0000000011a13770 0000000000000000 0000000000000000 0000000000000000 \x0aGPR12: 0000000000000000 0000000010249d8c 0000000000000020 0000000000000000 \x0aGPR16: 0000000000000000 0000000000f52630 0000000000000000 0000000000000000 \x0aGPR20: 0000000000000000 0000000000000000 0000000000000000 0000000011a11a70 \x0aGPR24: 0000000011a13580 0000000011a11950 0000000011a11a70 0000000000020000 \x0aGPR28: 000000000ffa2c9c 0000000000000000 000000000ffafc40 0000000011a11a70 NIP [000000000fd5f988] 0xfd5f988 LR [000000000ff9b148] 0xff9b148 --- interrupt: c00 Code: f821ffc1 418200ac e93f0000 e9290038 e9291468 eba90000 48026c0d e8410018 e93f06aa 3d290001 392982a4 79291f24 <7fdd482a> 2c3e0000 41820030 7fc3f378 ---[ end trace 0000000000000000 ]--- This is caused by the i2c encoder modules vendored into nouveau/ now depending on the equally vendored nouveau_i2c_encoder_destroy function. Trying to auto-load this modules hangs on nouveau initialization until timeout, and nouveau continues without i2c video encoders. Fix by avoiding nouveau dependency by __always_inlining that helper functions into those i2c video encoder modules. Fixes: a735831 ("drm/nouveau: vendor in drm_encoder_slave API") Signed-off-by: René Rebe <rene@exactco.de> Reviewed-by: Lyude Paul <lyude@redhat.com> [Lyude: fixed commit reference in description] Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://patch.msgid.link/20251202.164952.2216481867721531616.rene@exactco.de

Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)". One functional fix, one performance regression fix, and two related comment fixes. I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch linuxppc#1 and linuxppc#4. Patch linuxppc#1 fixes hugetlb_pmd_shared() not detecting any sharing Patch linuxppc#2 + linuxppc#3 are simple comment fixes that patch linuxppc#4 interacts with. Patch linuxppc#4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated I added as much comments + description that I possibly could, and I am hoping for review from Jann. There are plenty of cleanups in the future to be had + one reasonable optimization on x86. But that's all out of scope for this series. This patch (of 4): We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing. We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table. Page migration, like mbind() or migrate_pages() would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface. Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared(). Link: https://lkml.kernel.org/r/20251205213558.2980480-1-david@kernel.org Link: https://lkml.kernel.org/r/20251205213558.2980480-2-david@kernel.org Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [1] Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Liu Shixin <liushixin2@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ linuxppc#1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Jakub reported an MPTCP deadlock at fallback time: WARNING: possible recursive locking detected 6.18.0-rc7-virtme linuxppc#1 Not tainted -------------------------------------------- mptcp_connect/20858 is trying to acquire lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_try_fallback+0xd8/0x280 but task is already holding lock: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&msk->fallback_lock); lock(&msk->fallback_lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by mptcp_connect/20858: #0: ff1100001da18290 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x114/0x1bc0 linuxppc#1: ff1100001db40fd0 (k-sk_lock-AF_INET#2){+.+.}-{0:0}, at: __mptcp_retrans+0x2cb/0xaa0 linuxppc#2: ff1100001da18b60 (&msk->fallback_lock){+.-.}-{3:3}, at: __mptcp_retrans+0x352/0xaa0 stack backtrace: CPU: 0 UID: 0 PID: 20858 Comm: mptcp_connect Not tainted 6.18.0-rc7-virtme linuxppc#1 PREEMPT(full) Hardware name: Bochs, BIOS Bochs 01/01/2011 Call Trace: <TASK> dump_stack_lvl+0x6f/0xa0 print_deadlock_bug.cold+0xc0/0xcd validate_chain+0x2ff/0x5f0 __lock_acquire+0x34c/0x740 lock_acquire.part.0+0xbc/0x260 _raw_spin_lock_bh+0x38/0x50 __mptcp_try_fallback+0xd8/0x280 mptcp_sendmsg_frag+0x16c2/0x3050 __mptcp_retrans+0x421/0xaa0 mptcp_release_cb+0x5aa/0xa70 release_sock+0xab/0x1d0 mptcp_sendmsg+0xd5b/0x1bc0 sock_write_iter+0x281/0x4d0 new_sync_write+0x3c5/0x6f0 vfs_write+0x65e/0xbb0 ksys_write+0x17e/0x200 do_syscall_64+0xbb/0xfd0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fa5627cbc5e Code: 4d 89 d8 e8 14 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff1fe14700 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa5627cbc5e RDX: 0000000000001f9c RSI: 00007fff1fe16984 RDI: 0000000000000005 RBP: 00007fff1fe14710 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff1fe16920 R13: 0000000000002000 R14: 0000000000001f9c R15: 0000000000001f9c The packet scheduler could attempt a reinjection after receiving an MP_FAIL and before the infinite map has been transmitted, causing a deadlock since MPTCP needs to do the reinjection atomically from WRT fallback. Address the issue explicitly avoiding the reinjection in the critical scenario. Note that this is the only fallback critical section that could potentially send packets and hit the double-lock. Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://netdev-ctrl.bots.linux.dev/logs/vmksft/mptcp-dbg/results/412720/1-mptcp-join-sh/stderr Fixes: f8a1d9b ("mptcp: make fallback action and fallback decision atomic") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251205-net-mptcp-misc-fixes-6-19-rc1-v1-4-9e4781a6c1b8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)". One functional fix, one performance regression fix, and two related comment fixes. I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch linuxppc#1 and linuxppc#4. Patch linuxppc#1 fixes hugetlb_pmd_shared() not detecting any sharing Patch linuxppc#2 + linuxppc#3 are simple comment fixes that patch linuxppc#4 interacts with. Patch linuxppc#4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated I added as much comments + description that I possibly could, and I am hoping for review from Jann. There are plenty of cleanups in the future to be had + one reasonable optimization on x86. But that's all out of scope for this series. This patch (of 4): We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing. We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table. Page migration, like mbind() or migrate_pages() would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface. Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared(). Link: https://lkml.kernel.org/r/20251205213558.2980480-1-david@kernel.org Link: https://lkml.kernel.org/r/20251205213558.2980480-2-david@kernel.org Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [1] Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Liu Shixin <liushixin2@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

In the rare event if eeprom has only invalid address entries, allocation is skipped, this causes following NULL pointer issue [ 547.103445] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 547.118897] #PF: supervisor read access in kernel mode [ 547.130292] #PF: error_code(0x0000) - not-present page [ 547.141689] PGD 124757067 P4D 0 [ 547.148842] Oops: 0000 [linuxppc#1] PREEMPT SMP NOPTI [ 547.158504] CPU: 49 PID: 8167 Comm: cat Tainted: G OE 6.8.0-38-generic #38-Ubuntu [ 547.177998] Hardware name: Supermicro AS -8126GS-TNMR/H14DSG-OD, BIOS 1.7 09/12/2025 [ 547.195178] RIP: 0010:amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu] [ 547.210375] Code: e8 63 78 82 c0 45 31 d2 45 3b 75 08 48 8b 45 a0 73 44 44 89 f1 48 8b 7d 88 48 89 ca 48 c1 e2 05 48 29 ca 49 8b 4d 00 48 01 d1 <48> 83 79 10 00 74 17 49 63 f2 48 8b 49 08 41 83 c2 01 48 8d 34 76 [ 547.252045] RSP: 0018:ffa0000067287ac0 EFLAGS: 00010246 [ 547.263636] RAX: ff11000167c28130 RBX: ff11000127600000 RCX: 0000000000000000 [ 547.279467] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000125b1c800 [ 547.295298] RBP: ffa0000067287b50 R08: 0000000000000000 R09: 0000000000000000 [ 547.311129] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 547.326959] R13: ff11000217b1de00 R14: 0000000000000000 R15: 0000000000000092 [ 547.342790] FS: 0000746e59d14740(0000) GS:ff11017dfda80000(0000) knlGS:0000000000000000 [ 547.360744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 547.373489] CR2: 0000000000000010 CR3: 000000019585e001 CR4: 0000000000f71ef0 [ 547.389321] PKRU: 55555554 [ 547.395316] Call Trace: [ 547.400737] <TASK> [ 547.405386] ? show_regs+0x6d/0x80 [ 547.412929] ? __die+0x24/0x80 [ 547.419697] ? page_fault_oops+0x99/0x1b0 [ 547.428588] ? do_user_addr_fault+0x2ee/0x6b0 [ 547.438249] ? exc_page_fault+0x83/0x1b0 [ 547.446949] ? asm_exc_page_fault+0x27/0x30 [ 547.456225] ? amdgpu_ras_sysfs_badpages_read+0x2f2/0x5d0 [amdgpu] [ 547.470040] ? mas_wr_modify+0xcd/0x140 [ 547.478548] sysfs_kf_bin_read+0x63/0xb0 [ 547.487248] kernfs_file_read_iter+0xa1/0x190 [ 547.496909] kernfs_fop_read_iter+0x25/0x40 [ 547.506182] vfs_read+0x255/0x390 This also result in space left assigned to negative values. Moving data alloc call before bad page check resolves both the issue. Signed-off-by: Asad Kamal <asad.kamal@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Petr Machata says: ==================== selftests: forwarding: vxlan_bridge_1q_mc_ul: Fix flakiness The net/forwarding/vxlan_bridge_1q_mc_ul selftest runs an overlay traffic, forwarded over a multicast-routed VXLAN underlay. In order to determine whether packets reach their intended destination, it uses a TC match. For convenience, it uses a flower match, which however does not allow matching on the encapsulated packet. So various service traffic ends up being indistinguishable from the test packets, and ends up confusing the test. To alleviate the problem, the test uses sleep to allow the necessary service traffic to run and clear the channel, before running the test traffic. This worked for a while, but lately we have nevertheless seen flakiness of the test in the CI. In this patchset, first generalize tc_rule_stats_get() to support u32 in patch linuxppc#1, then in patch linuxppc#2 convert the test to use u32 to allow parsing deeper into the packet, and in linuxppc#3 drop the now-unnecessary sleep. ==================== Link: https://patch.msgid.link/cover.1765289566.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)". One functional fix, one performance regression fix, and two related comment fixes. I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch linuxppc#1 and linuxppc#4. Patch linuxppc#1 fixes hugetlb_pmd_shared() not detecting any sharing Patch linuxppc#2 + linuxppc#3 are simple comment fixes that patch linuxppc#4 interacts with. Patch linuxppc#4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated I added as much comments + description that I possibly could, and I am hoping for review from Jann. There are plenty of cleanups in the future to be had + one reasonable optimization on x86. But that's all out of scope for this series. This patch (of 4): We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing. We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table. Page migration, like mbind() or migrate_pages() would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface. Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared(). Link: https://lkml.kernel.org/r/20251205213558.2980480-1-david@kernel.org Link: https://lkml.kernel.org/r/20251205213558.2980480-2-david@kernel.org Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [1] Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Liu Shixin <liushixin2@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

… transaction We can't log a conflicting inode if it's a directory and it was moved from one parent directory to another parent directory in the current transaction, as this can result an attempt to have a directory with two hard links during log replay, one for the old parent directory and another for the new parent directory. The following scenario triggers that issue: 1) We have directories "dir1" and "dir2" created in a past transaction. Directory "dir1" has inode A as its parent directory; 2) We move "dir1" to some other directory; 3) We create a file with the name "dir1" in directory inode A; 4) We fsync the new file. This results in logging the inode of the new file and the inode for the directory "dir1" that was previously moved in the current transaction. So the log tree has the INODE_REF item for the new location of "dir1"; 5) We move the new file to some other directory. This results in updating the log tree to included the new INODE_REF for the new location of the file and removes the INODE_REF for the old location. This happens during the rename when we call btrfs_log_new_name(); 6) We fsync the file, and that persists the log tree changes done in the previous step (btrfs_log_new_name() only updates the log tree in memory); 7) We have a power failure; 8) Next time the fs is mounted, log replay happens and when processing the inode for directory "dir1" we find a new INODE_REF and add that link, but we don't remove the old link of the inode since we have not logged the old parent directory of the directory inode "dir1". As a result after log replay finishes when we trigger writeback of the subvolume tree's extent buffers, the tree check will detect that we have a directory a hard link count of 2 and we get a mount failure. The errors and stack traces reported in dmesg/syslog are like this: [ 3845.729764] BTRFS info (device dm-0): start tree-log replay [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c [ 3845.731236] memcg:ffff9264c02f4e00 [ 3845.731751] aops:btree_aops [btrfs] ino:1 [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff) [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8 [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00 [ 3845.735305] page dumped because: eb page dump [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5 [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701 [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384 [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737797] rdev 0 sequence 2 flags 0x0 [ 3845.737798] atime 1764259517.0 [ 3845.737800] ctime 1764259517.572889464 [ 3845.737801] mtime 1764259517.572889464 [ 3845.737802] otime 1764259517.0 [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 [ 3845.737805] index 0 name_len 2 [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34 [ 3845.737808] location key (257 1 0) type 2 [ 3845.737810] transid 9 data_len 0 name_len 4 [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34 [ 3845.737813] location key (258 1 0) type 2 [ 3845.737814] transid 9 data_len 0 name_len 4 [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34 [ 3845.737816] location key (257 1 0) type 2 [ 3845.737818] transid 9 data_len 0 name_len 4 [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34 [ 3845.737820] location key (258 1 0) type 2 [ 3845.737821] transid 9 data_len 0 name_len 4 [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160 [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0 [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0 [ 3845.737826] rdev 0 sequence 1 flags 0x0 [ 3845.737827] atime 1764259517.572889464 [ 3845.737828] ctime 1764259517.572889464 [ 3845.737830] mtime 1764259517.572889464 [ 3845.737831] otime 1764259517.572889464 [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14 [ 3845.737833] index 2 name_len 4 [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14 [ 3845.737836] index 2 name_len 4 [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33 [ 3845.737838] location key (259 1 0) type 1 [ 3845.737839] transid 10 data_len 0 name_len 3 [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33 [ 3845.737842] location key (259 1 0) type 1 [ 3845.737843] transid 10 data_len 0 name_len 3 [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160 [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0 [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0 [ 3845.737848] rdev 0 sequence 1 flags 0x0 [ 3845.737849] atime 1764259517.572889464 [ 3845.737850] ctime 1764259517.572889464 [ 3845.737851] mtime 1764259517.572889464 [ 3845.737852] otime 1764259517.572889464 [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14 [ 3845.737855] index 3 name_len 4 [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34 [ 3845.737857] location key (257 1 0) type 2 [ 3845.737858] transid 10 data_len 0 name_len 4 [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34 [ 3845.737861] location key (257 1 0) type 2 [ 3845.737862] transid 10 data_len 0 name_len 4 [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160 [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0 [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0 [ 3845.737867] rdev 0 sequence 2 flags 0x0 [ 3845.737868] atime 1764259517.580874966 [ 3845.737869] ctime 1764259517.586121869 [ 3845.737870] mtime 1764259517.580874966 [ 3845.737872] otime 1764259517.580874966 [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13 [ 3845.737874] index 2 name_len 3 [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected [ 3845.739448] ------------[ cut here ]------------ [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...) [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G W 6.18.0-rc6-btrfs-next-218+ linuxppc#1 PREEMPT(full) [ 3845.752414] Tainted: [W]=WARN [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs] [ 3845.755460] Code: 31 f6 48 89 (...) [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246 [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000 [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000 [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468 [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680 [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8 [ 3845.765094] FS: 00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000 [ 3845.766226] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0 [ 3845.768009] Call Trace: [ 3845.768392] <TASK> [ 3845.768714] btrfs_submit_bbio+0x6ee/0x7f0 [btrfs] [ 3845.769640] ? write_one_eb+0x28e/0x340 [btrfs] [ 3845.770588] btree_write_cache_pages+0x2f0/0x550 [btrfs] [ 3845.771286] ? alloc_extent_state+0x19/0x100 [btrfs] [ 3845.771967] ? merge_next_state+0x1a/0x90 [btrfs] [ 3845.772586] ? set_extent_bit+0x233/0x8b0 [btrfs] [ 3845.773198] ? xas_load+0x9/0xc0 [ 3845.773589] ? xas_find+0x14d/0x1a0 [ 3845.773969] do_writepages+0xc6/0x160 [ 3845.774367] filemap_fdatawrite_wbc+0x48/0x60 [ 3845.775003] __filemap_fdatawrite_range+0x5b/0x80 [ 3845.775902] btrfs_write_marked_extents+0x61/0x170 [btrfs] [ 3845.776707] btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs] [ 3845.777379] ? _raw_spin_unlock_irqrestore+0x23/0x40 [ 3845.777923] btrfs_commit_transaction+0x5ea/0xd20 [btrfs] [ 3845.778551] ? _raw_spin_unlock+0x15/0x30 [ 3845.778986] ? release_extent_buffer+0x34/0x160 [btrfs] [ 3845.779659] btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs] [ 3845.780416] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs] [ 3845.781499] open_ctree+0x10bb/0x15f0 [btrfs] [ 3845.782194] btrfs_get_tree.cold+0xb/0x16c [btrfs] [ 3845.782764] ? fscontext_read+0x15c/0x180 [ 3845.783202] ? rw_verify_area+0x50/0x180 [ 3845.783667] vfs_get_tree+0x25/0xd0 [ 3845.784047] vfs_cmd_create+0x59/0xe0 [ 3845.784458] __do_sys_fsconfig+0x4f6/0x6b0 [ 3845.784914] do_syscall_64+0x50/0x1220 [ 3845.785340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3845.785980] RIP: 0033:0x7f751fc7f4aa [ 3845.786759] Code: 73 01 c3 48 (...) [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000 [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23 [ 3845.798633] </TASK> [ 3845.799067] ---[ end trace 0000000000000000 ]--- [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction) [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction. [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5) [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree) [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5 Fix this by never logging a conflicting inode that is a directory and was moved in the current transaction (its last_unlink_trans equals the current transaction) and instead fallback to a transaction commit. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/ CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

The commit d5d399e ("printk/nbcon: Release nbcon consoles ownership in atomic flush after each emitted record") prevented stall of a CPU which lost nbcon console ownership because another CPU entered an emergency flush. But there is still the problem that the CPU doing the emergency flush might cause a stall on its own. Let's go even further and restore IRQ in the atomic flush after each emitted record. It is not a complete solution. The interrupts and/or scheduling might still be blocked when the emergency atomic flush was called with IRQs and/or scheduling disabled. But it should remove the following lockup: mlx5_core 0000:03:00.0: Shutdown was called kvm: exiting hardware virtualization arm-smmu-v3 arm-smmu-v3.10.auto: CMD_SYNC timeout at 0x00000103 [hwprod 0x00000104, hwcons 0x00000102] smp: csd: Detected non-responsive CSD lock (linuxppc#1) on CPU#4, waiting 5000000032 ns for CPU#00 do_nothing (kernel/smp.c:1057) smp: csd: CSD lock (linuxppc#1) unresponsive. [...] Call trace: pl011_console_write_atomic (./arch/arm64/include/asm/vdso/processor.h:12 drivers/tty/serial/amba-pl011.c:2540) (P) nbcon_emit_next_record (kernel/printk/nbcon.c:1049) __nbcon_atomic_flush_pending_con (kernel/printk/nbcon.c:1517) __nbcon_atomic_flush_pending.llvm.15488114865160659019 (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:192 kernel/printk/nbcon.c:1562 kernel/printk/nbcon.c:1612) nbcon_atomic_flush_pending (kernel/printk/nbcon.c:1629) printk_kthreads_shutdown (kernel/printk/printk.c:?) syscore_shutdown (drivers/base/syscore.c:120) kernel_kexec (kernel/kexec_core.c:1045) __arm64_sys_reboot (kernel/reboot.c:794 kernel/reboot.c:722 kernel/reboot.c:722) invoke_syscall (arch/arm64/kernel/syscall.c:50) el0_svc_common.llvm.14158405452757855239 (arch/arm64/kernel/syscall.c:?) do_el0_svc (arch/arm64/kernel/syscall.c:152) el0_svc (./arch/arm64/include/asm/alternative-macros.h:254 ./arch/arm64/include/asm/cpufeature.h:808 ./arch/arm64/include/asm/irqflags.h:73 arch/arm64/kernel/entry-common.c:169 arch/arm64/kernel/entry-common.c:182 arch/arm64/kernel/entry-common.c:749) el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:820) el0t_64_sync (arch/arm64/kernel/entry.S:600) In this case, nbcon_atomic_flush_pending() is called from printk_kthreads_shutdown() with IRQs and scheduling enabled. Note that __nbcon_atomic_flush_pending_con() is directly called also from nbcon_device_release() where the disabled IRQs might break PREEMPT_RT guarantees. But the atomic flush is called only in emergency or panic situations where the latencies are irrelevant anyway. An ultimate solution would be a touching of watchdogs. But it would hide all problems. Let's do it later when anyone reports a stall which does not have a better solution. Closes: https://lore.kernel.org/r/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251212124520.244483-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>

In ath12k_mac_op_link_sta_statistics(), the atomic context scope introduced by dp_lock also covers firmware stats request. Since that request could block, below issue is hit: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:575 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6866, name: iw preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 3 locks held by iw/6866: #0:[...] linuxppc#1:[...] linuxppc#2: ffff9748f43230c8 (&dp->dp_lock){+.-.}-{3:3}, at: ath12k_mac_op_link_sta_statistics+0xc6/0x380 [ath12k] Preemption disabled at: [<ffffffffc0349656>] ath12k_mac_op_link_sta_statistics+0xc6/0x380 [ath12k] Call Trace: <TASK> show_stack dump_stack_lvl dump_stack __might_resched.cold __might_sleep __mutex_lock mutex_lock_nested ath12k_mac_get_fw_stats ath12k_mac_op_link_sta_statistics </TASK> Since firmware stats request doesn't require protection from dp_lock, move it outside to fix this issue. While moving, also refine that code hunk to make function parameters get populated when really necessary. Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.1.c5-00302-QCAHMTSWPL_V1.0_V2.0_SILICONZ-1.115823.3 Signed-off-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com> Link: https://patch.msgid.link/20251119-ath12k-ng-sleep-in-atomic-v1-1-5d1a726597db@oss.qualcomm.com Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>

When releasing a timeslot there is a slight chance we may end up with the wrong payload mask due to overflow if the delayed_destroy_work ends up coming into play after a DP 2.1 monitor gets disconnected which causes vcpi to become 0 then we try to make the payload = ~BIT(vcpi - 1) which is a negative shift. VCPI id should never really be 0 hence skip changing the payload mask if VCPI is 0. Otherwise it leads to <7> [515.287237] xe 0000:03:00.0: [drm:drm_dp_mst_get_port_malloc [drm_display_helper]] port ffff888126ce9000 (3) <4> [515.287267] -----------[ cut here ]----------- <3> [515.287268] UBSAN: shift-out-of-bounds in ../drivers/gpu/drm/display/drm_dp_mst_topology.c:4575:36 <3> [515.287271] shift exponent -1 is negative <4> [515.287275] CPU: 7 UID: 0 PID: 3108 Comm: kworker/u64:33 Tainted: G S U 6.17.0-rc6-lgci-xe-xe-3795-3e79699fa1b216e92+ linuxppc#1 PREEMPT(voluntary) <4> [515.287279] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER <4> [515.287279] Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 1645 03/15/2024 <4> [515.287281] Workqueue: drm_dp_mst_wq drm_dp_delayed_destroy_work [drm_display_helper] <4> [515.287303] Call Trace: <4> [515.287304] <TASK> <4> [515.287306] dump_stack_lvl+0xc1/0xf0 <4> [515.287313] dump_stack+0x10/0x20 <4> [515.287316] __ubsan_handle_shift_out_of_bounds+0x133/0x2e0 <4> [515.287324] ? drm_atomic_get_private_obj_state+0x186/0x1d0 <4> [515.287333] drm_dp_atomic_release_time_slots.cold+0x17/0x3d [drm_display_helper] <4> [515.287355] mst_connector_atomic_check+0x159/0x180 [xe] <4> [515.287546] drm_atomic_helper_check_modeset+0x4d9/0xfa0 <4> [515.287550] ? __ww_mutex_lock.constprop.0+0x6f/0x1a60 <4> [515.287562] intel_atomic_check+0x119/0x2b80 [xe] <4> [515.287740] ? find_held_lock+0x31/0x90 <4> [515.287747] ? lock_release+0xce/0x2a0 <4> [515.287754] drm_atomic_check_only+0x6a2/0xb40 <4> [515.287758] ? drm_atomic_add_affected_connectors+0x12b/0x140 <4> [515.287765] drm_atomic_commit+0x6e/0xf0 <4> [515.287766] ? _pfx__drm_printfn_info+0x10/0x10 <4> [515.287774] drm_client_modeset_commit_atomic+0x25c/0x2b0 <4> [515.287794] drm_client_modeset_commit_locked+0x60/0x1b0 <4> [515.287795] ? mutex_lock_nested+0x1b/0x30 <4> [515.287801] drm_client_modeset_commit+0x26/0x50 <4> [515.287804] __drm_fb_helper_restore_fbdev_mode_unlocked+0xdc/0x110 <4> [515.287810] drm_fb_helper_hotplug_event+0x120/0x140 <4> [515.287814] drm_fbdev_client_hotplug+0x28/0xd0 <4> [515.287819] drm_client_hotplug+0x6c/0xf0 <4> [515.287824] drm_client_dev_hotplug+0x9e/0xd0 <4> [515.287829] drm_kms_helper_hotplug_event+0x1a/0x30 <4> [515.287834] drm_dp_delayed_destroy_work+0x3df/0x410 [drm_display_helper] <4> [515.287861] process_one_work+0x22b/0x6f0 <4> [515.287874] worker_thread+0x1e8/0x3d0 <4> [515.287879] ? __pfx_worker_thread+0x10/0x10 <4> [515.287882] kthread+0x11c/0x250 <4> [515.287886] ? __pfx_kthread+0x10/0x10 <4> [515.287890] ret_from_fork+0x2d7/0x310 <4> [515.287894] ? __pfx_kthread+0x10/0x10 <4> [515.287897] ret_from_fork_asm+0x1a/0x30 Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6303 Signed-off-by: Suraj Kandpal <suraj.kandpal@intel.com> Reviewed-by: Imre Deak <imre.deak@intel.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Link: https://patch.msgid.link/20251119094650.799135-1-suraj.kandpal@intel.com

When a page is freed it coalesces with a buddy into a higher order page while possible. When the buddy page migrate type differs, it is expected to be updated to match the one of the page being freed. However, only the first pageblock of the buddy page is updated, while the rest of the pageblocks are left unchanged. That causes warnings in later expand() and other code paths (like below), since an inconsistency between migration type of the list containing the page and the page-owned pageblocks migration types is introduced. [ 308.986589] ------------[ cut here ]------------ [ 308.987227] page type is 0, passed migratetype is 1 (nr=256) [ 308.987275] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:812 expand+0x23c/0x270 [ 308.987293] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E) [ 308.987439] Unloaded tainted modules: hmac_s390(E):2 [ 308.987650] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G E 6.18.0-gcc-bpf-debug #431 PREEMPT [ 308.987657] Tainted: [E]=UNSIGNED_MODULE [ 308.987661] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0) [ 308.987666] Krnl PSW : 0404f00180000000 00000349976fa600 (expand+0x240/0x270) [ 308.987676] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [ 308.987682] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88 [ 308.987688] 0000000000000005 0000034980000005 000002be803ac000 0000023efe6c8300 [ 308.987692] 0000000000000008 0000034998d57290 000002be00000100 0000023e00000008 [ 308.987696] 0000000000000000 0000000000000000 00000349976fa5fc 000002c99b1eb6f0 [ 308.987708] Krnl Code: 00000349976fa5f0: c020008a02f2 larl %r2,000003499883abd4 00000349976fa5f6: c0e5ffe3f4b5 brasl %r14,0000034997378f60 #00000349976fa5fc: af000000 mc 0,0 >00000349976fa600: a7f4ff4c brc 15,00000349976fa498 00000349976fa604: b9040026 lgr %r2,%r6 00000349976fa608: c0300088317f larl %r3,0000034998800906 00000349976fa60e: c0e5fffdb6e1 brasl %r14,00000349976b13d0 00000349976fa614: af000000 mc 0,0 [ 308.987734] Call Trace: [ 308.987738] [<00000349976fa600>] expand+0x240/0x270 [ 308.987744] ([<00000349976fa5fc>] expand+0x23c/0x270) [ 308.987749] [<00000349976ff95e>] rmqueue_bulk+0x71e/0x940 [ 308.987754] [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0 [ 308.987759] [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40 [ 308.987763] [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0 [ 308.987768] [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400 [ 308.987774] [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220 [ 308.987781] [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0 [ 308.987786] [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0 [ 308.987791] [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240 [ 308.987799] [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210 [ 308.987804] [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500 [ 308.987809] [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0 [ 308.987813] [<000003499734d70e>] do_exception+0x1de/0x540 [ 308.987822] [<0000034998387390>] __do_pgm_check+0x130/0x220 [ 308.987830] [<000003499839a934>] pgm_check_handler+0x114/0x160 [ 308.987838] 3 locks held by mempig_verify/5224: [ 308.987842] #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0 [ 308.987859] linuxppc#1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40 [ 308.987871] linuxppc#2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940 [ 308.987886] Last Breaking-Event-Address: [ 308.987890] [<0000034997379096>] __warn_printk+0x136/0x140 [ 308.987897] irq event stamp: 52330356 [ 308.987901] hardirqs last enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220 [ 308.987907] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0 [ 308.987913] softirqs last enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530 [ 308.987922] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140 [ 308.987929] ---[ end trace 0000000000000000 ]--- [ 308.987936] ------------[ cut here ]------------ [ 308.987940] page type is 0, passed migratetype is 1 (nr=256) [ 308.987951] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:860 __del_page_from_free_list+0x1be/0x1e0 [ 308.987960] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E) [ 308.988070] Unloaded tainted modules: hmac_s390(E):2 [ 308.988087] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G W E 6.18.0-gcc-bpf-debug #431 PREEMPT [ 308.988095] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE [ 308.988100] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0) [ 308.988105] Krnl PSW : 0404f00180000000 00000349976f9e32 (__del_page_from_free_list+0x1c2/0x1e0) [ 308.988118] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3 [ 308.988127] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88 [ 308.988133] 0000000000000005 0000034980000005 0000034998d57290 0000023efe6c8300 [ 308.988139] 0000000000000001 0000000000000008 000002be00000100 000002be803ac000 [ 308.988144] 0000000000000000 0000000000000001 00000349976f9e2e 000002c99b1eb728 [ 308.988153] Krnl Code: 00000349976f9e22: c020008a06d9 larl %r2,000003499883abd4 00000349976f9e28: c0e5ffe3f89c brasl %r14,0000034997378f60 #00000349976f9e2e: af000000 mc 0,0 >00000349976f9e32: a7f4ff4e brc 15,00000349976f9cce 00000349976f9e36: b904002b lgr %r2,%r11 00000349976f9e3a: c030008a06e7 larl %r3,000003499883ac08 00000349976f9e40: c0e5fffdbac8 brasl %r14,00000349976b13d0 00000349976f9e46: af000000 mc 0,0 [ 308.988184] Call Trace: [ 308.988188] [<00000349976f9e32>] __del_page_from_free_list+0x1c2/0x1e0 [ 308.988195] ([<00000349976f9e2e>] __del_page_from_free_list+0x1be/0x1e0) [ 308.988202] [<00000349976ff946>] rmqueue_bulk+0x706/0x940 [ 308.988208] [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0 [ 308.988214] [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40 [ 308.988221] [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0 [ 308.988227] [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400 [ 308.988233] [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220 [ 308.988240] [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0 [ 308.988247] [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0 [ 308.988253] [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240 [ 308.988260] [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210 [ 308.988267] [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500 [ 308.988273] [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0 [ 308.988279] [<000003499734d70e>] do_exception+0x1de/0x540 [ 308.988286] [<0000034998387390>] __do_pgm_check+0x130/0x220 [ 308.988293] [<000003499839a934>] pgm_check_handler+0x114/0x160 [ 308.988300] 3 locks held by mempig_verify/5224: [ 308.988305] #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0 [ 308.988322] linuxppc#1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40 [ 308.988334] linuxppc#2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940 [ 308.988346] Last Breaking-Event-Address: [ 308.988350] [<0000034997379096>] __warn_printk+0x136/0x140 [ 308.988356] irq event stamp: 52330356 [ 308.988360] hardirqs last enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220 [ 308.988366] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0 [ 308.988373] softirqs last enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530 [ 308.988380] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140 [ 308.988388] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20251215081002.3353900A9c-agordeev@linux.ibm.com Link: https://lkml.kernel.org/r/20251212151457.3898073Add-agordeev@linux.ibm.com Fixes: e6cf9e1 ("mm: page_alloc: fix up block types when merging compatible blocks") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Closes: https://lore.kernel.org/linux-mm/87wmalyktd.fsf@linux.ibm.com/ Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Marc Hartmayer <mhartmay@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

irdma_net_event() should not dereference anything from "neigh" (alias "ptr") until it has checked that the event is NETEVENT_NEIGH_UPDATE. Other events come with different structures pointed to by "ptr" and they may be smaller than struct neighbour. Move the read of neigh->dev under the NETEVENT_NEIGH_UPDATE case. The bug is mostly harmless, but it triggers KASAN on debug kernels: BUG: KASAN: stack-out-of-bounds in irdma_net_event+0x32e/0x3b0 [irdma] Read of size 8 at addr ffffc900075e07f0 by task kworker/27:2/542554 CPU: 27 PID: 542554 Comm: kworker/27:2 Kdump: loaded Not tainted 5.14.0-630.el9.x86_64+debug linuxppc#1 Hardware name: [...] Workqueue: events rt6_probe_deferred Call Trace: <IRQ> dump_stack_lvl+0x60/0xb0 print_address_description.constprop.0+0x2c/0x3f0 print_report+0xb4/0x270 kasan_report+0x92/0xc0 irdma_net_event+0x32e/0x3b0 [irdma] notifier_call_chain+0x9e/0x180 atomic_notifier_call_chain+0x5c/0x110 rt6_do_redirect+0xb91/0x1080 tcp_v6_err+0xe9b/0x13e0 icmpv6_notify+0x2b2/0x630 ndisc_redirect_rcv+0x328/0x530 icmpv6_rcv+0xc16/0x1360 ip6_protocol_deliver_rcu+0xb84/0x12e0 ip6_input_finish+0x117/0x240 ip6_input+0xc4/0x370 ipv6_rcv+0x420/0x7d0 __netif_receive_skb_one_core+0x118/0x1b0 process_backlog+0xd1/0x5d0 __napi_poll.constprop.0+0xa3/0x440 net_rx_action+0x78a/0xba0 handle_softirqs+0x2d4/0x9c0 do_softirq+0xad/0xe0 </IRQ> Fixes: 915cc7a ("RDMA/irdma: Add miscellaneous utility definitions") Link: https://patch.msgid.link/r/20251127143150.121099-1-mschmidt@redhat.com Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

A race condition was found in sg_proc_debug_helper(). It was observed on a system using an IBM LTO-9 SAS Tape Drive (ULTRIUM-TD9) and monitoring /proc/scsi/sg/debug every second. A very large elapsed time would sometimes appear. This is caused by two race conditions. We reproduced the issue with an IBM ULTRIUM-HH9 tape drive on an x86_64 architecture. A patched kernel was built, and the race condition could not be observed anymore after the application of this patch. A reproducer C program utilising the scsi_debug module was also built by Changhui Zhong and can be viewed here: https://github.com/MichaelRabek/linux-tests/blob/master/drivers/scsi/sg/sg_race_trigger.c The first race happens between the reading of hp->duration in sg_proc_debug_helper() and request completion in sg_rq_end_io(). The hp->duration member variable may hold either of two types of information: linuxppc#1 - The start time of the request. This value is present while the request is not yet finished. linuxppc#2 - The total execution time of the request (end_time - start_time). If sg_proc_debug_helper() executes *after* the value of hp->duration was changed from linuxppc#1 to linuxppc#2, but *before* srp->done is set to 1 in sg_rq_end_io(), a fresh timestamp is taken in the else branch, and the elapsed time (value type linuxppc#2) is subtracted from a timestamp, which cannot yield a valid elapsed time (which is a type linuxppc#2 value as well). To fix this issue, the value of hp->duration must change under the protection of the sfp->rq_list_lock in sg_rq_end_io(). Since sg_proc_debug_helper() takes this read lock, the change to srp->done and srp->header.duration will happen atomically from the perspective of sg_proc_debug_helper() and the race condition is thus eliminated. The second race condition happens between sg_proc_debug_helper() and sg_new_write(). Even though hp->duration is set to the current time stamp in sg_add_request() under the write lock's protection, it gets overwritten by a call to get_sg_io_hdr(), which calls copy_from_user() to copy struct sg_io_hdr from userspace into kernel space. hp->duration is set to the start time again in sg_common_write(). If sg_proc_debug_helper() is called between these two calls, an arbitrary value set by userspace (usually zero) is used to compute the elapsed time. To fix this issue, hp->duration must be set to the current timestamp again after get_sg_io_hdr() returns successfully. A small race window still exists between get_sg_io_hdr() and setting hp->duration, but this window is only a few instructions wide and does not result in observable issues in practice, as confirmed by testing. Additionally, we fix the format specifier from %d to %u for printing unsigned int values in sg_proc_debug_helper(). Signed-off-by: Michal Rábek <mrabek@redhat.com> Suggested-by: Tomas Henzl <thenzl@redhat.com> Tested-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Link: https://patch.msgid.link/20251212160900.64924-1-mrabek@redhat.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)", v2. One functional fix, one performance regression fix, and two related comment fixes. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch linuxppc#1 and linuxppc#4. Patch linuxppc#1 fixes hugetlb_pmd_shared() not detecting any sharing Patch linuxppc#2 + linuxppc#3 are simple comment fixes that patch linuxppc#4 interacts with. Patch linuxppc#4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated This patch (of 4): We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing. We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table. Page migration, like mbind() or migrate_pages() would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface. Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared(). Link: https://lkml.kernel.org/r/20251212071019.471146-1-david@kernel.org Link: https://lkml.kernel.org/r/20251212071019.471146-2-david@kernel.org Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Liu Shixin <liushixin2@huawei.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Patch series "kallsyms: Prevent invalid access when showing module buildid", v3. We have seen nested crashes in __sprint_symbol(), see below. They seem to be caused by an invalid pointer to "buildid". This patchset cleans up kallsyms code related to module buildid and fixes this invalid access when printing backtraces. I made an audit of __sprint_symbol() and found several situations when the buildid might be wrong: + bpf_address_lookup() does not set @modbuildid + ftrace_mod_address_lookup() does not set @modbuildid + __sprint_symbol() does not take rcu_read_lock and the related struct module might get removed before mod->build_id is printed. This patchset solves these problems: + 1st, 2nd patches are preparatory + 3rd, 4th, 6th patches fix the above problems + 5th patch cleans up a suspicious initialization code. This is the backtrace, we have seen. But it is not really important. The problems fixed by the patchset are obvious: crash64> bt [62/2029] PID: 136151 TASK: ffff9f6c981d4000 CPU: 367 COMMAND: "btrfs" #0 [ffffbdb687635c28] machine_kexec at ffffffffb4c845b3 linuxppc#1 [ffffbdb687635c80] __crash_kexec at ffffffffb4d86a6a linuxppc#2 [ffffbdb687635d08] hex_string at ffffffffb51b3b61 linuxppc#3 [ffffbdb687635d40] crash_kexec at ffffffffb4d87964 linuxppc#4 [ffffbdb687635d50] oops_end at ffffffffb4c41fc8 #5 [ffffbdb687635d70] do_trap at ffffffffb4c3e49a #6 [ffffbdb687635db8] do_error_trap at ffffffffb4c3e6a4 #7 [ffffbdb687635df8] exc_stack_segment at ffffffffb5666b33 #8 [ffffbdb687635e20] asm_exc_stack_segment at ffffffffb5800cf9 ... This patch (of 7) The function kallsyms_lookup_buildid() initializes the given @namebuf by clearing the first and the last byte. It is not clear why. The 1st byte makes sense because some callers ignore the return code and expect that the buffer contains a valid string, for example: - function_stat_show() - kallsyms_lookup() - kallsyms_lookup_buildid() The initialization of the last byte does not make much sense because it can later be overwritten. Fortunately, it seems that all called functions behave correctly: - kallsyms_expand_symbol() explicitly adds the trailing '\0' at the end of the function. - All *__address_lookup() functions either use the safe strscpy() or they do not touch the buffer at all. Document the reason for clearing the first byte. And remove the useless initialization of the last byte. Link: https://lkml.kernel.org/r/20251128135920.217303-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Kees Cook <kees@kernel.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

With the RAPL PMU addition, there is a recursive locking when CPU online callback function calls rapl_package_add_pmu(). Here cpu_hotplug_lock is already acquired by cpuhp_thread_fun() and rapl_package_add_pmu() tries to acquire again. <4>[ 8.197433] ============================================ <4>[ 8.197437] WARNING: possible recursive locking detected <4>[ 8.197440] 6.19.0-rc1-lgci-xe-xe-4242-05b7c58b3367dca84+ linuxppc#1 Not tainted <4>[ 8.197444] -------------------------------------------- <4>[ 8.197447] cpuhp/0/20 is trying to acquire lock: <4>[ 8.197450] ffffffff83487870 (cpu_hotplug_lock){++++}-{0:0}, at: rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197463] but task is already holding lock: <4>[ 8.197466] ffffffff83487870 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x6d/0x290 <4>[ 8.197477] other info that might help us debug this: <4>[ 8.197480] Possible unsafe locking scenario: <4>[ 8.197483] CPU0 <4>[ 8.197485] ---- <4>[ 8.197487] lock(cpu_hotplug_lock); <4>[ 8.197490] lock(cpu_hotplug_lock); <4>[ 8.197493] *** DEADLOCK *** .. .. <4>[ 8.197542] __lock_acquire+0x146e/0x2790 <4>[ 8.197548] lock_acquire+0xc4/0x2c0 <4>[ 8.197550] ? rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197556] cpus_read_lock+0x41/0x110 <4>[ 8.197558] ? rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197561] rapl_package_add_pmu+0x37/0x370 [intel_rapl_common] <4>[ 8.197565] rapl_cpu_online+0x85/0x87 [intel_rapl_msr] <4>[ 8.197568] ? __pfx_rapl_cpu_online+0x10/0x10 [intel_rapl_msr] <4>[ 8.197570] cpuhp_invoke_callback+0x41f/0x6c0 <4>[ 8.197573] ? cpuhp_thread_fun+0x6d/0x290 <4>[ 8.197575] cpuhp_thread_fun+0x1e2/0x290 <4>[ 8.197578] ? smpboot_thread_fn+0x26/0x290 <4>[ 8.197581] smpboot_thread_fn+0x12f/0x290 <4>[ 8.197584] ? __pfx_smpboot_thread_fn+0x10/0x10 <4>[ 8.197586] kthread+0x11f/0x250 <4>[ 8.197589] ? __pfx_kthread+0x10/0x10 <4>[ 8.197592] ret_from_fork+0x344/0x3a0 <4>[ 8.197595] ? __pfx_kthread+0x10/0x10 <4>[ 8.197597] ret_from_fork_asm+0x1a/0x30 <4>[ 8.197604] </TASK> Fix this issue in the same way as rapl powercap package domain is added from the same CPU online callback by introducing another interface which doesn't call cpus_read_lock(). Add rapl_package_add_pmu_locked() and rapl_package_remove_pmu_locked() which don't call cpus_read_lock(). Fixes: 748d6ba ("powercap: intel_rapl: Enable MSR-based RAPL PMU support") Reported-by: Borah, Chaitanya Kumar <chaitanya.kumar.borah@intel.com> Closes: https://lore.kernel.org/linux-pm/5427ede1-57a0-43d1-99f3-8ca4b0643e82@intel.com/T/#u Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Tested-by: RavitejaX Veesam <ravitejax.veesam@intel.com> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://patch.msgid.link/20251217153455.3560176-1-srinivas.pandruvada@linux.intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Avoid a possible UAF in GPU recovery due to a race between the sched timeout callback and the tdr work queue. The gpu recovery function calls drm_sched_stop() and later drm_sched_start(). drm_sched_start() restarts the tdr queue which will eventually free the job. If the tdr queue frees the job before time out callback completes, the job will be freed and we'll get a UAF when accessing the pasid. Cache it early to avoid the UAF. Example KASAN trace: [ 493.058141] BUG: KASAN: slab-use-after-free in amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.067530] Read of size 4 at addr ffff88b0ce3f794c by task kworker/u128:1/323 [ 493.074892] [ 493.076485] CPU: 9 UID: 0 PID: 323 Comm: kworker/u128:1 Tainted: G E 6.16.0-1289896.2.zuul.bf4f11df81c1410bbe901c4373305a31 linuxppc#1 PREEMPT(voluntary) [ 493.076493] Tainted: [E]=UNSIGNED_MODULE [ 493.076495] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 493.076500] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 493.076512] Call Trace: [ 493.076515] <TASK> [ 493.076518] dump_stack_lvl+0x64/0x80 [ 493.076529] print_report+0xce/0x630 [ 493.076536] ? _raw_spin_lock_irqsave+0x86/0xd0 [ 493.076541] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 493.076545] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077253] kasan_report+0xb8/0xf0 [ 493.077258] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077965] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.078672] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu] [ 493.079378] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu] [ 493.080111] amdgpu_job_timedout+0x642/0x1400 [amdgpu] [ 493.080903] ? pick_task_fair+0x24e/0x330 [ 493.080910] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu] [ 493.081702] ? _raw_spin_lock+0x75/0xc0 [ 493.081708] ? __pfx__raw_spin_lock+0x10/0x10 [ 493.081712] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched] [ 493.081721] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081725] process_one_work+0x679/0xff0 [ 493.081732] worker_thread+0x6ce/0xfd0 [ 493.081736] ? __pfx_worker_thread+0x10/0x10 [ 493.081739] kthread+0x376/0x730 [ 493.081744] ? __pfx_kthread+0x10/0x10 [ 493.081748] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081751] ? __pfx_kthread+0x10/0x10 [ 493.081755] ret_from_fork+0x247/0x330 [ 493.081761] ? __pfx_kthread+0x10/0x10 [ 493.081764] ret_from_fork_asm+0x1a/0x30 [ 493.081771] </TASK> Fixes: a72002c ("drm/amdgpu: Make use of drm_wedge_task_info") Link: HansKristian-Work/vkd3d-proton#2670 Cc: SRINIVASAN.SHANMUGAM@amd.com Cc: vitaly.prosyak@amd.com Cc: christian.koenig@amd.com Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

@offset

System crash seen during load/unload test in a loop, [61110.449331] qla2xxx [0000:27:00.0]-0042:0: Disabled MSI-X. [61110.467494] ============================================================================= [61110.467498] BUG qla2xxx_srbs (Tainted: G OE -------- --- ): Objects remaining in qla2xxx_srbs on __kmem_cache_shutdown() [61110.467501] ----------------------------------------------------------------------------- [61110.467502] Slab 0x000000000ffc8162 objects=51 used=1 fp=0x00000000e25d3d85 flags=0x57ffffc0010200(slab|head|node=1|zone=2|lastcpupid=0x1fffff) [61110.467509] CPU: 53 PID: 455206 Comm: rmmod Kdump: loaded Tainted: G OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 linuxppc#1 [61110.467513] Hardware name: HPE ProLiant DL385 Gen10 Plus v2/ProLiant DL385 Gen10 Plus v2, BIOS A42 08/17/2023 [61110.467515] Call Trace: [61110.467516] <TASK> [61110.467519] dump_stack_lvl+0x34/0x48 [61110.467526] slab_err.cold+0x53/0x67 [61110.467534] __kmem_cache_shutdown+0x16e/0x320 [61110.467540] kmem_cache_destroy+0x51/0x160 [61110.467544] qla2x00_module_exit+0x93/0x99 [qla2xxx] [61110.467607] ? __do_sys_delete_module.constprop.0+0x178/0x280 [61110.467613] ? syscall_trace_enter.constprop.0+0x145/0x1d0 [61110.467616] ? do_syscall_64+0x5c/0x90 [61110.467619] ? exc_page_fault+0x62/0x150 [61110.467622] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd [61110.467626] </TASK> [61110.467627] Disabling lock debugging due to kernel taint [61110.467635] Object 0x0000000026f7e6e6 @offset=16000 [61110.467639] ------------[ cut here ]------------ [61110.467639] kmem_cache_destroy qla2xxx_srbs: Slab cache still has objects when called from qla2x00_module_exit+0x93/0x99 [qla2xxx] [61110.467659] WARNING: CPU: 53 PID: 455206 at mm/slab_common.c:520 kmem_cache_destroy+0x14d/0x160 [61110.467718] CPU: 53 PID: 455206 Comm: rmmod Kdump: loaded Tainted: G B OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 linuxppc#1 [61110.467720] Hardware name: HPE ProLiant DL385 Gen10 Plus v2/ProLiant DL385 Gen10 Plus v2, BIOS A42 08/17/2023 [61110.467721] RIP: 0010:kmem_cache_destroy+0x14d/0x160 [61110.467724] Code: 99 7d 07 00 48 89 ef e8 e1 6a 07 00 eb b3 48 8b 55 60 48 8b 4c 24 20 48 c7 c6 70 fc 66 90 48 c7 c7 f8 ef a1 90 e8 e1 ed 7c 00 <0f> 0b eb 93 c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 [61110.467725] RSP: 0018:ffffa304e489fe80 EFLAGS: 00010282 [61110.467727] RAX: 0000000000000000 RBX: ffffffffc0d9a860 RCX: 0000000000000027 [61110.467729] RDX: ffff8fd5ff9598a8 RSI: 0000000000000001 RDI: ffff8fd5ff9598a0 [61110.467730] RBP: ffff8fb6aaf78700 R08: 0000000000000000 R09: 0000000100d863b7 [61110.467731] R10: ffffa304e489fd20 R11: ffffffff913bef48 R12: 0000000040002000 [61110.467731] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [61110.467733] FS: 00007f64c89fb740(0000) GS:ffff8fd5ff940000(0000) knlGS:0000000000000000 [61110.467734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [61110.467735] CR2: 00007f0f02bfe000 CR3: 00000020ad6dc005 CR4: 0000000000770ee0 [61110.467736] PKRU: 55555554 [61110.467737] Call Trace: [61110.467738] <TASK> [61110.467739] qla2x00_module_exit+0x93/0x99 [qla2xxx] [61110.467755] ? __do_sys_delete_module.constprop.0+0x178/0x280 Free sp in the error path to fix the crash. Fixes: f352eeb ("scsi: qla2xxx: Add ability to use GPNFT/GNNFT for RSCN handling") Cc: stable@vger.kernel.org Signed-off-by: Anil Gurumurthy <agurumurthy@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com> Link: https://patch.msgid.link/20251210101604.431868-9-njavali@marvell.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

System crash with the following signature [154563.214890] nvme nvme2: NVME-FC{1}: controller connect complete [154564.169363] qla2xxx [0000:b0:00.1]-3002:2: nvme: Sched: Set ZIO exchange threshold to 3. [154564.169405] qla2xxx [0000:b0:00.1]-ffffff:2: SET ZIO Activity exchange threshold to 5. [154565.539974] qla2xxx [0000:b0:00.1]-5013:2: RSCN database changed – 0078 0080 0000. [154565.545744] qla2xxx [0000:b0:00.1]-5013:2: RSCN database changed – 0078 00a0 0000. [154565.545857] qla2xxx [0000:b0:00.1]-11a2:2: FEC=enabled (data rate). [154565.552760] qla2xxx [0000:b0:00.1]-11a2:2: FEC=enabled (data rate). [154565.553079] BUG: kernel NULL pointer dereference, address: 00000000000000f8 [154565.553080] #PF: supervisor read access in kernel mode [154565.553082] #PF: error_code(0x0000) - not-present page [154565.553084] PGD 80000010488ab067 P4D 80000010488ab067 PUD 104978a067 PMD 0 [154565.553089] Oops: 0000 1 PREEMPT SMP PTI [154565.553092] CPU: 10 PID: 858 Comm: qla2xxx_2_dpc Kdump: loaded Tainted: G OE ------- --- 5.14.0-503.11.1.el9_5.x86_64 linuxppc#1 [154565.553096] Hardware name: HPE Synergy 660 Gen10/Synergy 660 Gen10 Compute Module, BIOS I43 09/30/2024 [154565.553097] RIP: 0010:qla_fab_async_scan.part.0+0x40b/0x870 [qla2xxx] [154565.553141] Code: 00 00 e8 58 a3 ec d4 49 89 e9 ba 12 20 00 00 4c 89 e6 49 c7 c0 00 ee a8 c0 48 c7 c1 66 c0 a9 c0 bf 00 80 00 10 e8 15 69 00 00 <4c> 8b 8d f8 00 00 00 4d 85 c9 74 35 49 8b 84 24 00 19 00 00 48 8b [154565.553143] RSP: 0018:ffffb4dbc8aebdd0 EFLAGS: 00010286 [154565.553145] RAX: 0000000000000000 RBX: ffff8ec2cf0908d0 RCX: 0000000000000002 [154565.553147] RDX: 0000000000000000 RSI: ffffffffc0a9c896 RDI: ffffb4dbc8aebd47 [154565.553148] RBP: 0000000000000000 R08: ffffb4dbc8aebd45 R09: 0000000000ffff0a [154565.553150] R10: 0000000000000000 R11: 000000000000000f R12: ffff8ec2cf0908d0 [154565.553151] R13: ffff8ec2cf090900 R14: 0000000000000102 R15: ffff8ec2cf084000 [154565.553152] FS: 0000000000000000(0000) GS:ffff8ed27f800000(0000) knlGS:0000000000000000 [154565.553154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [154565.553155] CR2: 00000000000000f8 CR3: 000000113ae0a005 CR4: 00000000007706f0 [154565.553157] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [154565.553158] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [154565.553159] PKRU: 55555554 [154565.553160] Call Trace: [154565.553162] <TASK> [154565.553165] ? show_trace_log_lvl+0x1c4/0x2df [154565.553172] ? show_trace_log_lvl+0x1c4/0x2df [154565.553177] ? qla_fab_async_scan.part.0+0x40b/0x870 [qla2xxx] [154565.553215] ? __die_body.cold+0x8/0xd [154565.553218] ? page_fault_oops+0x134/0x170 [154565.553223] ? snprintf+0x49/0x70 [154565.553229] ? exc_page_fault+0x62/0x150 [154565.553238] ? asm_exc_page_fault+0x22/0x30 Check for sp being non NULL before freeing any associated memory Fixes: a423994 ("scsi: qla2xxx: Add switch command to simplify fabric discovery") Cc: stable@vger.kernel.org Signed-off-by: Anil Gurumurthy <agurumurthy@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com> Link: https://patch.msgid.link/20251210101604.431868-10-njavali@marvell.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Kernel panic observed on system, [5353358.825191] BUG: unable to handle page fault for address: ff5f5e897b024000 [5353358.825194] #PF: supervisor write access in kernel mode [5353358.825195] #PF: error_code(0x0002) - not-present page [5353358.825196] PGD 100006067 P4D 0 [5353358.825198] Oops: 0002 [linuxppc#1] PREEMPT SMP NOPTI [5353358.825200] CPU: 5 PID: 2132085 Comm: qlafwupdate.sub Kdump: loaded Tainted: G W L ------- --- 5.14.0-503.34.1.el9_5.x86_64 linuxppc#1 [5353358.825203] Hardware name: HPE ProLiant DL360 Gen11/ProLiant DL360 Gen11, BIOS 2.44 01/17/2025 [5353358.825204] RIP: 0010:memcpy_erms+0x6/0x10 [5353358.825211] RSP: 0018:ff591da8f4f6b710 EFLAGS: 00010246 [5353358.825212] RAX: ff5f5e897b024000 RBX: 0000000000007090 RCX: 0000000000001000 [5353358.825213] RDX: 0000000000001000 RSI: ff591da8f4fed090 RDI: ff5f5e897b024000 [5353358.825214] RBP: 0000000000010000 R08: ff5f5e897b024000 R09: 0000000000000000 [5353358.825215] R10: ff46cf8c40517000 R11: 0000000000000001 R12: 0000000000008090 [5353358.825216] R13: ff591da8f4f6b720 R14: 0000000000001000 R15: 0000000000000000 [5353358.825218] FS: 00007f1e88d47740(0000) GS:ff46cf935f940000(0000) knlGS:0000000000000000 [5353358.825219] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [5353358.825220] CR2: ff5f5e897b024000 CR3: 0000000231532004 CR4: 0000000000771ef0 [5353358.825221] PKRU: 55555554 [5353358.825222] Call Trace: [5353358.825223] <TASK> [5353358.825224] ? show_trace_log_lvl+0x1c4/0x2df [5353358.825229] ? show_trace_log_lvl+0x1c4/0x2df [5353358.825232] ? sg_copy_buffer+0xc8/0x110 [5353358.825236] ? __die_body.cold+0x8/0xd [5353358.825238] ? page_fault_oops+0x134/0x170 [5353358.825242] ? kernelmode_fixup_or_oops+0x84/0x110 [5353358.825244] ? exc_page_fault+0xa8/0x150 [5353358.825247] ? asm_exc_page_fault+0x22/0x30 [5353358.825252] ? memcpy_erms+0x6/0x10 [5353358.825253] sg_copy_buffer+0xc8/0x110 [5353358.825259] qla2x00_process_vendor_specific+0x652/0x1320 [qla2xxx] [5353358.825317] qla24xx_bsg_request+0x1b2/0x2d0 [qla2xxx] Most routines in qla_bsg.c call bsg_done() only for success cases. However a few invoke it for failure case as well leading to a double free. Validate before calling bsg_done(). Cc: stable@vger.kernel.org Signed-off-by: Anil Gurumurthy <agurumurthy@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com> Link: https://patch.msgid.link/20251210101604.431868-12-njavali@marvell.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Fix a loop scenario of ethx:egress->ethx:egress Example setup to reproduce: tc qdisc add dev ethx root handle 1: drr tc filter add dev ethx parent 1: protocol ip prio 1 matchall \ action mirred egress redirect dev ethx Now ping out of ethx and you get a deadlock: [ 116.892898][ T307] ============================================ [ 116.893182][ T307] WARNING: possible recursive locking detected [ 116.893418][ T307] 6.18.0-rc6-01205-ge05021a829b8-dirty #204 Not tainted [ 116.893682][ T307] -------------------------------------------- [ 116.893926][ T307] ping/307 is trying to acquire lock: [ 116.894133][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.894517][ T307] [ 116.894517][ T307] but task is already holding lock: [ 116.894836][ T307] ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.895252][ T307] [ 116.895252][ T307] other info that might help us debug this: [ 116.895608][ T307] Possible unsafe locking scenario: [ 116.895608][ T307] [ 116.895901][ T307] CPU0 [ 116.896057][ T307] ---- [ 116.896200][ T307] lock(&sch->root_lock_key); [ 116.896392][ T307] lock(&sch->root_lock_key); [ 116.896605][ T307] [ 116.896605][ T307] *** DEADLOCK *** [ 116.896605][ T307] [ 116.896864][ T307] May be due to missing lock nesting notation [ 116.896864][ T307] [ 116.897123][ T307] 6 locks held by ping/307: [ 116.897302][ T307] #0: ffff88800b4b0250 (sk_lock-AF_INET){+.+.}-{0:0}, at: raw_sendmsg+0xb20/0x2cf0 [ 116.897808][ T307] linuxppc#1: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_output+0xa9/0x600 [ 116.898138][ T307] linuxppc#2: ffffffff88c839c0 (rcu_read_lock){....}-{1:3}, at: ip_finish_output2+0x2c6/0x1ee0 [ 116.898459][ T307] linuxppc#3: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50 [ 116.898782][ T307] linuxppc#4: ffff88800c122908 (&sch->root_lock_key){+...}-{3:3}, at: __dev_queue_xmit+0x2210/0x3b50 [ 116.899132][ T307] #5: ffffffff88c83960 (rcu_read_lock_bh){....}-{1:3}, at: __dev_queue_xmit+0x200/0x3b50 [ 116.899442][ T307] [ 116.899442][ T307] stack backtrace: [ 116.899667][ T307] CPU: 2 UID: 0 PID: 307 Comm: ping Not tainted 6.18.0-rc6-01205-ge05021a829b8-dirty #204 PREEMPT(voluntary) [ 116.899672][ T307] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 116.899675][ T307] Call Trace: [ 116.899678][ T307] <TASK> [ 116.899680][ T307] dump_stack_lvl+0x6f/0xb0 [ 116.899688][ T307] print_deadlock_bug.cold+0xc0/0xdc [ 116.899695][ T307] __lock_acquire+0x11f7/0x1be0 [ 116.899704][ T307] lock_acquire+0x162/0x300 [ 116.899707][ T307] ? __dev_queue_xmit+0x2210/0x3b50 [ 116.899713][ T307] ? srso_alias_return_thunk+0x5/0xfbef5 [ 116.899717][ T307] ? stack_trace_save+0x93/0xd0 [ 116.899723][ T307] _raw_spin_lock+0x30/0x40 [ 116.899728][ T307] ? __dev_queue_xmit+0x2210/0x3b50 [ 116.899731][ T307] __dev_queue_xmit+0x2210/0x3b50 Fixes: 178ca30 ("Revert "net/sched: Fix mirred deadlock on device recursion"") Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251210162255.1057663-1-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

…ked_inode() In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [27386.164433] ====================================================== [27386.164574] WARNING: possible circular locking dependency detected [27386.164583] 6.18.0+ linuxppc#4 Tainted: G U [27386.164591] ------------------------------------------------------ [27386.164599] kswapd0/117 is trying to acquire lock: [27386.164606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164625] but task is already holding lock: [27386.164633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.164646] which lock already depends on the new lock. [27386.164657] the existing dependency chain (in reverse order) is: [27386.164667] -> linuxppc#2 (fs_reclaim){+.+.}-{0:0}: [27386.164677] fs_reclaim_acquire+0x9d/0xd0 [27386.164685] __kmalloc_cache_noprof+0x59/0x750 [27386.164694] btrfs_init_file_extent_tree+0x90/0x100 [27386.164702] btrfs_read_locked_inode+0xc3/0x6b0 [27386.164710] btrfs_iget+0xbb/0xf0 [27386.164716] btrfs_lookup_dentry+0x3c5/0x8e0 [27386.164724] btrfs_lookup+0x12/0x30 [27386.164731] lookup_open.isra.0+0x1aa/0x6a0 [27386.164739] path_openat+0x5f7/0xc60 [27386.164746] do_filp_open+0xd6/0x180 [27386.164753] do_sys_openat2+0x8b/0xe0 [27386.164760] __x64_sys_openat+0x54/0xa0 [27386.164768] do_syscall_64+0x97/0x3e0 [27386.164776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164784] -> linuxppc#1 (btrfs-tree-00){++++}-{3:3}: [27386.164794] lock_release+0x127/0x2a0 [27386.164801] up_read+0x1b/0x30 [27386.164808] btrfs_search_slot+0x8e0/0xff0 [27386.164817] btrfs_lookup_inode+0x52/0xd0 [27386.164825] __btrfs_update_delayed_inode+0x73/0x520 [27386.164833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [27386.164842] btrfs_log_inode+0x608/0x1aa0 [27386.164849] btrfs_log_inode_parent+0x249/0xf80 [27386.164857] btrfs_log_dentry_safe+0x3e/0x60 [27386.164865] btrfs_sync_file+0x431/0x690 [27386.164872] do_fsync+0x39/0x80 [27386.164879] __x64_sys_fsync+0x13/0x20 [27386.164887] do_syscall_64+0x97/0x3e0 [27386.164894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [27386.164903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [27386.164913] __lock_acquire+0x15e9/0x2820 [27386.164920] lock_acquire+0xc9/0x2d0 [27386.164927] __mutex_lock+0xcc/0x10a0 [27386.164934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.164944] btrfs_evict_inode+0x20b/0x4b0 [27386.164952] evict+0x15a/0x2f0 [27386.164958] prune_icache_sb+0x91/0xd0 [27386.164966] super_cache_scan+0x150/0x1d0 [27386.164974] do_shrink_slab+0x155/0x6f0 [27386.164981] shrink_slab+0x48e/0x890 [27386.164988] shrink_one+0x11a/0x1f0 [27386.164995] shrink_node+0xbfd/0x1320 [27386.165002] balance_pgdat+0x67f/0xc60 [27386.165321] kswapd+0x1dc/0x3e0 [27386.165643] kthread+0xff/0x240 [27386.165965] ret_from_fork+0x223/0x280 [27386.166287] ret_from_fork_asm+0x1a/0x30 [27386.166616] other info that might help us debug this: [27386.167561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [27386.168503] Possible unsafe locking scenario: [27386.169110] CPU0 CPU1 [27386.169411] ---- ---- [27386.169707] lock(fs_reclaim); [27386.169998] lock(btrfs-tree-00); [27386.170291] lock(fs_reclaim); [27386.170581] lock(&delayed_node->mutex); [27386.170874] *** DEADLOCK *** [27386.171716] 2 locks held by kswapd0/117: [27386.171999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [27386.172294] linuxppc#1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [27386.172596] stack backtrace: [27386.173183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ linuxppc#4 PREEMPT(lazy) [27386.173185] Tainted: [U]=USER [27386.173186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [27386.173187] Call Trace: [27386.173187] <TASK> [27386.173189] dump_stack_lvl+0x6e/0xa0 [27386.173192] print_circular_bug.cold+0x17a/0x1c0 [27386.173194] check_noncircular+0x175/0x190 [27386.173197] __lock_acquire+0x15e9/0x2820 [27386.173200] lock_acquire+0xc9/0x2d0 [27386.173201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173204] __mutex_lock+0xcc/0x10a0 [27386.173206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [27386.173215] btrfs_evict_inode+0x20b/0x4b0 [27386.173217] ? lock_acquire+0xc9/0x2d0 [27386.173220] evict+0x15a/0x2f0 [27386.173222] prune_icache_sb+0x91/0xd0 [27386.173224] super_cache_scan+0x150/0x1d0 [27386.173226] do_shrink_slab+0x155/0x6f0 [27386.173228] shrink_slab+0x48e/0x890 [27386.173229] ? shrink_slab+0x2d2/0x890 [27386.173231] shrink_one+0x11a/0x1f0 [27386.173234] shrink_node+0xbfd/0x1320 [27386.173236] ? shrink_node+0xa2d/0x1320 [27386.173236] ? shrink_node+0xbd3/0x1320 [27386.173239] ? balance_pgdat+0x67f/0xc60 [27386.173239] balance_pgdat+0x67f/0xc60 [27386.173241] ? finish_task_switch.isra.0+0xc4/0x2a0 [27386.173246] kswapd+0x1dc/0x3e0 [27386.173247] ? __pfx_autoremove_wake_function+0x10/0x10 [27386.173249] ? __pfx_kswapd+0x10/0x10 [27386.173250] kthread+0xff/0x240 [27386.173251] ? __pfx_kthread+0x10/0x10 [27386.173253] ret_from_fork+0x223/0x280 [27386.173255] ? __pfx_kthread+0x10/0x10 [27386.173257] ret_from_fork_asm+0x1a/0x30 [27386.173260] </TASK> This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/ Fixes: 8679d26 ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Avoid a possible UAF in GPU recovery due to a race between the sched timeout callback and the tdr work queue. The gpu recovery function calls drm_sched_stop() and later drm_sched_start(). drm_sched_start() restarts the tdr queue which will eventually free the job. If the tdr queue frees the job before time out callback completes, the job will be freed and we'll get a UAF when accessing the pasid. Cache it early to avoid the UAF. Example KASAN trace: [ 493.058141] BUG: KASAN: slab-use-after-free in amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.067530] Read of size 4 at addr ffff88b0ce3f794c by task kworker/u128:1/323 [ 493.074892] [ 493.076485] CPU: 9 UID: 0 PID: 323 Comm: kworker/u128:1 Tainted: G E 6.16.0-1289896.2.zuul.bf4f11df81c1410bbe901c4373305a31 #1 PREEMPT(voluntary) [ 493.076493] Tainted: [E]=UNSIGNED_MODULE [ 493.076495] Hardware name: TYAN B8021G88V2HR-2T/S8021GM2NR-2T, BIOS V1.03.B10 04/01/2019 [ 493.076500] Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched] [ 493.076512] Call Trace: [ 493.076515] <TASK> [ 493.076518] dump_stack_lvl+0x64/0x80 [ 493.076529] print_report+0xce/0x630 [ 493.076536] ? _raw_spin_lock_irqsave+0x86/0xd0 [ 493.076541] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ 493.076545] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077253] kasan_report+0xb8/0xf0 [ 493.077258] ? amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.077965] amdgpu_device_gpu_recover+0x968/0x990 [amdgpu] [ 493.078672] ? __pfx_amdgpu_device_gpu_recover+0x10/0x10 [amdgpu] [ 493.079378] ? amdgpu_coredump+0x1fd/0x4c0 [amdgpu] [ 493.080111] amdgpu_job_timedout+0x642/0x1400 [amdgpu] [ 493.080903] ? pick_task_fair+0x24e/0x330 [ 493.080910] ? __pfx_amdgpu_job_timedout+0x10/0x10 [amdgpu] [ 493.081702] ? _raw_spin_lock+0x75/0xc0 [ 493.081708] ? __pfx__raw_spin_lock+0x10/0x10 [ 493.081712] drm_sched_job_timedout+0x1b0/0x4b0 [gpu_sched] [ 493.081721] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081725] process_one_work+0x679/0xff0 [ 493.081732] worker_thread+0x6ce/0xfd0 [ 493.081736] ? __pfx_worker_thread+0x10/0x10 [ 493.081739] kthread+0x376/0x730 [ 493.081744] ? __pfx_kthread+0x10/0x10 [ 493.081748] ? __pfx__raw_spin_lock_irq+0x10/0x10 [ 493.081751] ? __pfx_kthread+0x10/0x10 [ 493.081755] ret_from_fork+0x247/0x330 [ 493.081761] ? __pfx_kthread+0x10/0x10 [ 493.081764] ret_from_fork_asm+0x1a/0x30 [ 493.081771] </TASK> Fixes: a72002c ("drm/amdgpu: Make use of drm_wedge_task_info") Link: HansKristian-Work/vkd3d-proton#2670 Cc: SRINIVASAN.SHANMUGAM@amd.com Cc: vitaly.prosyak@amd.com Cc: christian.koenig@amd.com Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 20880a3)

riteshharjani added 4 commits August 17, 2022 04:40

riteshharjani closed this Aug 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cleanup submit bh patchv4 #1

Cleanup submit bh patchv4 #1

Uh oh!

riteshharjani commented Aug 17, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Cleanup submit bh patchv4 #1

Cleanup submit bh patchv4 #1

Uh oh!

Conversation

riteshharjani commented Aug 17, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant