forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 2
net: include xdp generic metadata definition #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Definition is only a proposal. There should be free place for 8B of tx timestamp. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
alobakin
pushed a commit
that referenced
this pull request
Aug 25, 2021
…itch
FPU_STATUS register contains FP exception flags bits which are updated
by core as side-effect of FP instructions but can also be manually
wiggled such as by glibc C99 functions fe{raise,clear,test}except() etc.
To effect the update, the programming model requires OR'ing FWE
bit (31). This bit is write-only and RAZ, meaning it is effectively
auto-cleared after write and thus needs to be set everytime: which
is how glibc implements this.
However there's another usecase of FPU_STATUS update, at the time of
Linux task switch when incoming task value needs to be programmed into
the register. This was added as part of f45ba2b ("ARCv2:
fpu: preserve userspace fpu state") which missed OR'ing FWE bit,
meaning the new value is effectively not being written at all.
This patch remedies that.
Interestingly, this snafu was not caught in interm glibc testing as the
race window which relies on a specific exception bit to be set/clear is
really small specially when it nvolves context switch.
Fortunately this was caught by glibc's math/test-fenv-tls test which
repeatedly set/clear exception flags in a big loop, concurrently in main
program and also in a thread.
Fixes: foss-for-synopsys-dwc-arc-processors#54
Fixes: f45ba2b ("ARCv2: fpu: preserve userspace fpu state")
Cc: stable@vger.kernel.org #5.6+
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
alobakin
pushed a commit
that referenced
this pull request
Aug 25, 2021
Ammar reports that he's seeing a lockdep splat on running test/rsrc_tags from the regression suite: ====================================================== WARNING: possible circular locking dependency detected 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Tainted: G OE ------------------------------------------------------ kworker/2:4/2684 is trying to acquire lock: ffff88814bb1c0a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_rsrc_put_work+0x13d/0x1a0 but task is already holding lock: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}: __flush_work+0x31b/0x490 io_rsrc_ref_quiesce.part.0.constprop.0+0x35/0xb0 __do_sys_io_uring_register+0x45b/0x1060 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 __mutex_lock+0x86/0x740 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 kthread+0x135/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); lock((work_completion)(&(&ctx->rsrc_put_work)->work)); lock(&ctx->uring_lock); *** DEADLOCK *** 2 locks held by kworker/2:4/2684: #0: ffff88810004d938 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 #1: ffffc90001c6be70 ((work_completion)(&(&ctx->rsrc_put_work)->work)){+.+.}-{0:0}, at: process_one_work+0x1bc/0x530 stack backtrace: CPU: 2 PID: 2684 Comm: kworker/2:4 Tainted: G OE 5.14.0-rc3-bluetea-test-00249-gc7d102232649 #5 Hardware name: Acer Aspire ES1-421/OLVIA_BE, BIOS V1.05 07/02/2015 Workqueue: events io_rsrc_put_work Call Trace: dump_stack_lvl+0x6a/0x9a check_noncircular+0xfe/0x110 __lock_acquire+0x119a/0x1e10 lock_acquire+0xc8/0x2f0 ? io_rsrc_put_work+0x13d/0x1a0 __mutex_lock+0x86/0x740 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? io_rsrc_put_work+0x13d/0x1a0 ? process_one_work+0x1ce/0x530 io_rsrc_put_work+0x13d/0x1a0 process_one_work+0x236/0x530 worker_thread+0x52/0x3b0 ? process_one_work+0x530/0x530 kthread+0x135/0x160 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 which is due to holding the ctx->uring_lock when flushing existing pending work, while the pending work flushing may need to grab the uring lock if we're using IOPOLL. Fix this by dropping the uring_lock a bit earlier as part of the flush. Cc: stable@vger.kernel.org Link: axboe/liburing#404 Tested-by: Ammar Faizi <ammarfaizi2@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
alobakin
pushed a commit
that referenced
this pull request
Sep 13, 2021
Following test case reproduces lockdep warning. Test case: $ mkfs.btrfs -f <dev1> $ btrfstune -S 1 <dev1> $ mount <dev1> <mnt> $ btrfs device add <dev2> <mnt> -f $ umount <mnt> $ mount <dev2> <mnt> $ umount <mnt> The warning claims a possible ABBA deadlock between the threads initiated by [#1] btrfs device add and [#0] the mount. [ 540.743122] WARNING: possible circular locking dependency detected [ 540.743129] 5.11.0-rc7+ #5 Not tainted [ 540.743135] ------------------------------------------------------ [ 540.743142] mount/2515 is trying to acquire lock: [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs] [ 540.743458] but task is already holding lock: [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743541] which lock already depends on the new lock. [ 540.743543] the existing dependency chain (in reverse order) is: [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}: [ 540.743566] down_read_nested+0x48/0x2b0 [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs] [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs] [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs] [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs] [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs] [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs] [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs] [ 540.744166] __x64_sys_ioctl+0xe4/0x140 [ 540.744170] do_syscall_64+0x4b/0x80 [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}: [ 540.744184] __lock_acquire+0x155f/0x2360 [ 540.744188] lock_acquire+0x10b/0x5c0 [ 540.744190] __mutex_lock+0xb1/0xf80 [ 540.744193] mutex_lock_nested+0x27/0x30 [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs] [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs] [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs] [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs] [ 540.744472] legacy_get_tree+0x38/0x90 [ 540.744475] vfs_get_tree+0x30/0x140 [ 540.744478] fc_mount+0x16/0x60 [ 540.744482] vfs_kern_mount+0x91/0x100 [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs] [ 540.744536] legacy_get_tree+0x38/0x90 [ 540.744537] vfs_get_tree+0x30/0x140 [ 540.744539] path_mount+0x8d8/0x1070 [ 540.744541] do_mount+0x8d/0xc0 [ 540.744543] __x64_sys_mount+0x125/0x160 [ 540.744545] do_syscall_64+0x4b/0x80 [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 540.744551] other info that might help us debug this: [ 540.744552] Possible unsafe locking scenario: [ 540.744553] CPU0 CPU1 [ 540.744554] ---- ---- [ 540.744555] lock(btrfs-chunk-00); [ 540.744557] lock(&fs_devs->device_list_mutex); [ 540.744560] lock(btrfs-chunk-00); [ 540.744562] lock(&fs_devs->device_list_mutex); [ 540.744564] *** DEADLOCK *** [ 540.744565] 3 locks held by mount/2515: [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450 [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs] [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs] [ 540.744708] stack backtrace: [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5 But the device_list_mutex in clone_fs_devices() is redundant, as explained below. Two threads [1] and [2] (below) could lead to clone_fs_device(). [1] open_ctree <== mount sprout fs btrfs_read_chunk_tree() mutex_lock(&uuid_mutex) <== global lock read_one_dev() open_seed_devices() clone_fs_devices() <== seed fs_devices mutex_lock(&orig->device_list_mutex) <== seed fs_devices [2] btrfs_init_new_device() <== sprouting mutex_lock(&uuid_mutex); <== global lock btrfs_prepare_sprout() lockdep_assert_held(&uuid_mutex) clone_fs_devices(seed_fs_device) <== seed fs_devices Both of these threads hold uuid_mutex which is sufficient to protect getting the seed device(s) freed while we are trying to clone it for sprouting [2] or mounting a sprout [1] (as above). A mounted seed device can not free/write/replace because it is read-only. An unmounted seed device can be freed by btrfs_free_stale_devices(), but it needs uuid_mutex. So this patch removes the unnecessary device_list_mutex in clone_fs_devices(). And adds a lockdep_assert_held(&uuid_mutex) in clone_fs_devices(). Reported-by: Su Yue <l@damenly.su> Tested-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
alobakin
pushed a commit
that referenced
this pull request
Sep 13, 2021
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other combinations), lockdep complains circular locking dependency at __loop_clr_fd(), for major_names_lock serves as a locking dependency aggregating hub across multiple block modules. ====================================================== WARNING: possible circular locking dependency detected 5.14.0+ torvalds#757 Tainted: G E ------------------------------------------------------ systemd-udevd/7568 is trying to acquire lock: ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560 but task is already holding lock: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (&lo->lo_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_killable_nested+0x17/0x20 lo_open+0x23/0x50 [loop] blkdev_get_by_dev+0x199/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #5 (&disk->open_mutex){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 bd_register_pending_holders+0x20/0x100 device_add_disk+0x1ae/0x390 loop_add+0x29c/0x2d0 [loop] blk_request_module+0x5a/0xb0 blkdev_get_no_open+0x27/0xa0 blkdev_get_by_dev+0x5f/0x540 blkdev_open+0x58/0x90 do_dentry_open+0x144/0x3a0 path_openat+0xa57/0xda0 do_filp_open+0x9f/0x140 do_sys_openat2+0x71/0x150 __x64_sys_openat+0x78/0xa0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (major_names_lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 blkdev_show+0x19/0x80 devinfo_show+0x52/0x60 seq_read_iter+0x2d5/0x3e0 proc_reg_read_iter+0x41/0x80 vfs_read+0x2ac/0x330 ksys_read+0x6b/0xd0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (&p->lock){+.+.}-{3:3}: lock_acquire+0xbe/0x1f0 __mutex_lock_common+0xb6/0xe10 mutex_lock_nested+0x17/0x20 seq_read_iter+0x37/0x3e0 generic_file_splice_read+0xf3/0x170 splice_direct_to_actor+0x14e/0x350 do_splice_direct+0x84/0xd0 do_sendfile+0x263/0x430 __se_sys_sendfile64+0x96/0xc0 do_syscall_64+0x3d/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#3){.+.+}-{0:0}: lock_acquire+0xbe/0x1f0 lo_write_bvec+0x96/0x280 [loop] loop_process_work+0xa68/0xc10 [loop] process_one_work+0x293/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: lock_acquire+0xbe/0x1f0 process_one_work+0x280/0x480 worker_thread+0x23d/0x4b0 kthread+0x163/0x180 ret_from_fork+0x1f/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: validate_chain+0x1f0d/0x33e0 __lock_acquire+0x92d/0x1030 lock_acquire+0xbe/0x1f0 flush_workqueue+0x8c/0x560 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 __loop_clr_fd+0xb4/0x400 [loop] blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 2 locks held by systemd-udevd/7568: #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0 #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop] stack backtrace: CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G E 5.14.0+ torvalds#757 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020 Call Trace: dump_stack_lvl+0x79/0xbf print_circular_bug+0x5d6/0x5e0 ? stack_trace_save+0x42/0x60 ? save_trace+0x3d/0x2d0 check_noncircular+0x10b/0x120 validate_chain+0x1f0d/0x33e0 ? __lock_acquire+0x953/0x1030 ? __lock_acquire+0x953/0x1030 __lock_acquire+0x92d/0x1030 ? flush_workqueue+0x70/0x560 lock_acquire+0xbe/0x1f0 ? flush_workqueue+0x70/0x560 flush_workqueue+0x8c/0x560 ? flush_workqueue+0x70/0x560 ? sched_clock_cpu+0xe/0x1a0 ? drain_workqueue+0x41/0x140 drain_workqueue+0x80/0x140 destroy_workqueue+0x47/0x4f0 ? blk_mq_freeze_queue_wait+0xac/0xd0 __loop_clr_fd+0xb4/0x400 [loop] ? __mutex_unlock_slowpath+0x35/0x230 blkdev_put+0x14a/0x1d0 blkdev_close+0x1c/0x20 __fput+0xfd/0x220 task_work_run+0x69/0xc0 exit_to_user_mode_prepare+0x1ce/0x1f0 syscall_exit_to_user_mode+0x26/0x60 do_syscall_64+0x4c/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0fd4c661f7 Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006 RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000 R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050 Commit 1c500ad ("loop: reduce the loop_ctl_mutex scope") is for breaking "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling a different block module results in forming circular locking dependency due to shared major_names_lock mutex. The simplest fix is to call probe function without holding major_names_lock [1], but Christoph Hellwig does not like such idea. Therefore, instead of holding major_names_lock in blkdev_show(), introduce a different lock for blkdev_show() in order to break "sb_writers#$N => &p->lock => major_names_lock" dependency chain. Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
alobakin
pushed a commit
that referenced
this pull request
Sep 23, 2021
As previously noted in commit 66e4f4a ("rtc: cmos: Use spin_lock_irqsave() in cmos_interrupt()"): <4>[ 254.192378] WARNING: inconsistent lock state <4>[ 254.192384] 5.12.0-rc1-CI-CI_DRM_9834+ #1 Not tainted <4>[ 254.192396] -------------------------------- <4>[ 254.192400] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. <4>[ 254.192409] rtcwake/5309 [HC0[0]:SC0[0]:HE1:SE1] takes: <4>[ 254.192429] ffffffff8263c5f8 (rtc_lock){?...}-{2:2}, at: cmos_interrupt+0x18/0x100 <4>[ 254.192481] {IN-HARDIRQ-W} state was registered at: <4>[ 254.192488] lock_acquire+0xd1/0x3d0 <4>[ 254.192504] _raw_spin_lock+0x2a/0x40 <4>[ 254.192519] cmos_interrupt+0x18/0x100 <4>[ 254.192536] rtc_handler+0x1f/0xc0 <4>[ 254.192553] acpi_ev_fixed_event_detect+0x109/0x13c <4>[ 254.192574] acpi_ev_sci_xrupt_handler+0xb/0x28 <4>[ 254.192596] acpi_irq+0x13/0x30 <4>[ 254.192620] __handle_irq_event_percpu+0x43/0x2c0 <4>[ 254.192641] handle_irq_event_percpu+0x2b/0x70 <4>[ 254.192661] handle_irq_event+0x2f/0x50 <4>[ 254.192680] handle_fasteoi_irq+0x9e/0x150 <4>[ 254.192693] __common_interrupt+0x76/0x140 <4>[ 254.192715] common_interrupt+0x96/0xc0 <4>[ 254.192732] asm_common_interrupt+0x1e/0x40 <4>[ 254.192750] _raw_spin_unlock_irqrestore+0x38/0x60 <4>[ 254.192767] resume_irqs+0xba/0xf0 <4>[ 254.192786] dpm_resume_noirq+0x245/0x3d0 <4>[ 254.192811] suspend_devices_and_enter+0x230/0xaa0 <4>[ 254.192835] pm_suspend.cold.8+0x301/0x34a <4>[ 254.192859] state_store+0x7b/0xe0 <4>[ 254.192879] kernfs_fop_write_iter+0x11d/0x1c0 <4>[ 254.192899] new_sync_write+0x11d/0x1b0 <4>[ 254.192916] vfs_write+0x265/0x390 <4>[ 254.192933] ksys_write+0x5a/0xd0 <4>[ 254.192949] do_syscall_64+0x33/0x80 <4>[ 254.192965] entry_SYSCALL_64_after_hwframe+0x44/0xae <4>[ 254.192986] irq event stamp: 43775 <4>[ 254.192994] hardirqs last enabled at (43775): [<ffffffff81c00c42>] asm_sysvec_apic_timer_interrupt+0x12/0x20 <4>[ 254.193023] hardirqs last disabled at (43774): [<ffffffff81aa691a>] sysvec_apic_timer_interrupt+0xa/0xb0 <4>[ 254.193049] softirqs last enabled at (42548): [<ffffffff81e00342>] __do_softirq+0x342/0x48e <4>[ 254.193074] softirqs last disabled at (42543): [<ffffffff810b45fd>] irq_exit_rcu+0xad/0xd0 <4>[ 254.193101] other info that might help us debug this: <4>[ 254.193107] Possible unsafe locking scenario: <4>[ 254.193112] CPU0 <4>[ 254.193117] ---- <4>[ 254.193121] lock(rtc_lock); <4>[ 254.193137] <Interrupt> <4>[ 254.193142] lock(rtc_lock); <4>[ 254.193156] *** DEADLOCK *** <4>[ 254.193161] 6 locks held by rtcwake/5309: <4>[ 254.193174] #0: ffff888104861430 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x5a/0xd0 <4>[ 254.193232] #1: ffff88810f823288 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xe7/0x1c0 <4>[ 254.193282] #2: ffff888100cef3c0 (kn->active#285 <7>[ 254.192706] i915 0000:00:02.0: [drm:intel_modeset_setup_hw_state [i915]] [CRTC:51:pipe A] hw state readout: disabled <4>[ 254.193307] ){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xf0/0x1c0 <4>[ 254.193333] #3: ffffffff82649fa8 (system_transition_mutex){+.+.}-{3:3}, at: pm_suspend.cold.8+0xce/0x34a <4>[ 254.193387] #4: ffffffff827a2108 (acpi_scan_lock){+.+.}-{3:3}, at: acpi_suspend_begin+0x47/0x70 <4>[ 254.193433] #5: ffff8881019ea178 (&dev->mutex){....}-{3:3}, at: device_resume+0x68/0x1e0 <4>[ 254.193485] stack backtrace: <4>[ 254.193492] CPU: 1 PID: 5309 Comm: rtcwake Not tainted 5.12.0-rc1-CI-CI_DRM_9834+ #1 <4>[ 254.193514] Hardware name: Google Soraka/Soraka, BIOS MrChromebox-4.10 08/25/2019 <4>[ 254.193524] Call Trace: <4>[ 254.193536] dump_stack+0x7f/0xad <4>[ 254.193567] mark_lock.part.47+0x8ca/0xce0 <4>[ 254.193604] __lock_acquire+0x39b/0x2590 <4>[ 254.193626] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 <4>[ 254.193660] lock_acquire+0xd1/0x3d0 <4>[ 254.193677] ? cmos_interrupt+0x18/0x100 <4>[ 254.193716] _raw_spin_lock+0x2a/0x40 <4>[ 254.193735] ? cmos_interrupt+0x18/0x100 <4>[ 254.193758] cmos_interrupt+0x18/0x100 <4>[ 254.193785] cmos_resume+0x2ac/0x2d0 <4>[ 254.193813] ? acpi_pm_set_device_wakeup+0x1f/0x110 <4>[ 254.193842] ? pnp_bus_suspend+0x10/0x10 <4>[ 254.193864] pnp_bus_resume+0x5e/0x90 <4>[ 254.193885] dpm_run_callback+0x5f/0x240 <4>[ 254.193914] device_resume+0xb2/0x1e0 <4>[ 254.193942] ? pm_dev_err+0x25/0x25 <4>[ 254.193974] dpm_resume+0xea/0x3f0 <4>[ 254.194005] dpm_resume_end+0x8/0x10 <4>[ 254.194030] suspend_devices_and_enter+0x29b/0xaa0 <4>[ 254.194066] pm_suspend.cold.8+0x301/0x34a <4>[ 254.194094] state_store+0x7b/0xe0 <4>[ 254.194124] kernfs_fop_write_iter+0x11d/0x1c0 <4>[ 254.194151] new_sync_write+0x11d/0x1b0 <4>[ 254.194183] vfs_write+0x265/0x390 <4>[ 254.194207] ksys_write+0x5a/0xd0 <4>[ 254.194232] do_syscall_64+0x33/0x80 <4>[ 254.194251] entry_SYSCALL_64_after_hwframe+0x44/0xae <4>[ 254.194274] RIP: 0033:0x7f07d79691e7 <4>[ 254.194293] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 <4>[ 254.194312] RSP: 002b:00007ffd9cc2c768 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 <4>[ 254.194337] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f07d79691e7 <4>[ 254.194352] RDX: 0000000000000004 RSI: 0000556ebfc63590 RDI: 000000000000000b <4>[ 254.194366] RBP: 0000556ebfc63590 R08: 0000000000000000 R09: 0000000000000004 <4>[ 254.194379] R10: 0000556ebf0ec2a6 R11: 0000000000000246 R12: 0000000000000004 which breaks S3-resume on fi-kbl-soraka presumably as that's slow enough to trigger the alarm during the suspend. Fixes: 6950d04 ("rtc: cmos: Replace spin_lock_irqsave with spin_lock in hard IRQ") References: 66e4f4a ("rtc: cmos: Use spin_lock_irqsave() in cmos_interrupt()"): Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Xiaofei Tan <tanxiaofei@huawei.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20210305122140.28774-1-chris@chris-wilson.co.uk
alobakin
pushed a commit
that referenced
this pull request
Sep 23, 2021
It's later supposed to be either a correct address or NULL. Without the initialization, it may contain an undefined value which results in the following segmentation fault: # perf top --sort comm -g --ignore-callees=do_idle terminates with: #0 0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6 #1 0x00007ffff55e3802 in strdup () from /lib64/libc.so.6 #2 0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489 #3 hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564 #4 0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420, sample_self=sample_self@entry=true) at util/hist.c:657 #5 0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0, sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288 #6 0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38) at util/hist.c:1056 #7 iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056 #8 0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0) at util/hist.c:1231 #9 0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0) at builtin-top.c:842 #10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202 #11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244 #12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323 #13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339 #14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341 #15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339 #16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114 #17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0 #18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6 If you look at the frame #2, the code is: 488 if (he->srcline) { 489 he->srcline = strdup(he->srcline); 490 if (he->srcline == NULL) 491 goto err_rawdata; 492 } If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish), it gets strdupped and strdupping a rubbish random string causes the problem. Also, if you look at the commit 1fb7d06, it adds the srcline property into the struct, but not initializing it everywhere needed. Committer notes: Now I see, when using --ignore-callees=do_idle we end up here at line 2189 in add_callchain_ip(): 2181 if (al.sym != NULL) { 2182 if (perf_hpp_list.parent && !*parent && 2183 symbol__match_regex(al.sym, &parent_regex)) 2184 *parent = al.sym; 2185 else if (have_ignore_callees && root_al && 2186 symbol__match_regex(al.sym, &ignore_callees_regex)) { 2187 /* Treat this symbol as the root, 2188 forgetting its callees. */ 2189 *root_al = al; 2190 callchain_cursor_reset(cursor); 2191 } 2192 } And the al that doesn't have the ->srcline field initialized will be copied to the root_al, so then, back to: 1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al, 1212 int max_stack_depth, void *arg) 1213 { 1214 int err, err2; 1215 struct map *alm = NULL; 1216 1217 if (al) 1218 alm = map__get(al->map); 1219 1220 err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent, 1221 iter->evsel, al, max_stack_depth); 1222 if (err) { 1223 map__put(alm); 1224 return err; 1225 } 1226 1227 err = iter->ops->prepare_entry(iter, al); 1228 if (err) 1229 goto out; 1230 1231 err = iter->ops->add_single_entry(iter, al); 1232 if (err) 1233 goto out; 1234 That al at line 1221 is what hist_entry_iter__add() (called from sample__resolve_callchain()) saw as 'root_al', and then: iter->ops->add_single_entry(iter, al); will go on with al->srcline with a bogus value, I'll add the above sequence to the cset and apply, thanks! Signed-off-by: Michael Petlan <mpetlan@redhat.com> CC: Milian Wolff <milian.wolff@kdab.com> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries") Link: https //lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com Reported-by: Juri Lelli <jlelli@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Sep 23, 2021
FD uses xyarray__entry that may return NULL if an index is out of
bounds. If NULL is returned then a segv happens as FD unconditionally
dereferences the pointer. This was happening in a case of with perf
iostat as shown below. The fix is to make FD an "int*" rather than an
int and handle the NULL case as either invalid input or a closed fd.
$ sudo gdb --args perf stat --iostat list
...
Breakpoint 1, perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
50 {
(gdb) bt
#0 perf_evsel__alloc_fd (evsel=0x5555560951a0, ncpus=1, nthreads=1) at evsel.c:50
#1 0x000055555585c188 in evsel__open_cpu (evsel=0x5555560951a0, cpus=0x555556093410,
threads=0x555556086fb0, start_cpu=0, end_cpu=1) at util/evsel.c:1792
#2 0x000055555585cfb2 in evsel__open (evsel=0x5555560951a0, cpus=0x0, threads=0x555556086fb0)
at util/evsel.c:2045
#3 0x000055555585d0db in evsel__open_per_thread (evsel=0x5555560951a0, threads=0x555556086fb0)
at util/evsel.c:2065
#4 0x00005555558ece64 in create_perf_stat_counter (evsel=0x5555560951a0,
config=0x555555c34700 <stat_config>, target=0x555555c2f1c0 <target>, cpu=0) at util/stat.c:590
#5 0x000055555578e927 in __run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
at builtin-stat.c:833
#6 0x000055555578f3c6 in run_perf_stat (argc=1, argv=0x7fffffffe4a0, run_idx=0)
at builtin-stat.c:1048
#7 0x0000555555792ee5 in cmd_stat (argc=1, argv=0x7fffffffe4a0) at builtin-stat.c:2534
#8 0x0000555555835ed3 in run_builtin (p=0x555555c3f540 <commands+288>, argc=3,
argv=0x7fffffffe4a0) at perf.c:313
#9 0x0000555555836154 in handle_internal_command (argc=3, argv=0x7fffffffe4a0) at perf.c:365
#10 0x000055555583629f in run_argv (argcp=0x7fffffffe2ec, argv=0x7fffffffe2e0) at perf.c:409
#11 0x0000555555836692 in main (argc=3, argv=0x7fffffffe4a0) at perf.c:539
...
(gdb) c
Continuing.
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (uncore_iio_0/event=0x83,umask=0x04,ch_mask=0xF,fc_mask=0x07/).
/bin/dmesg | grep -i perf may provide additional information.
Program received signal SIGSEGV, Segmentation fault.
0x00005555559b03ea in perf_evsel__close_fd_cpu (evsel=0x5555560951a0, cpu=1) at evsel.c:166
166 if (FD(evsel, cpu, thread) >= 0)
v3. fixes a bug in perf_evsel__run_ioctl where the sense of a branch was
backward.
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20210918054440.2350466-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Sep 28, 2021
Peter's e-mail in MAINTAINERS is defunct:
This is the qmail-send program at a.mx.sunsite.dk.
<jacmet@sunsite.dk>:
Sorry, no mailbox here by that name. (#5.1.1)
Peter says:
** Ahh yes, it should be changed to peter@korsgaard.com.
However he also says:
** I haven't had access to c67x00 hw for quite some years though, so maybe
** it should be marked Orphan instead?
So change as he wishes.
Cc: Peter Korsgaard <peter@korsgaard.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-usb@vger.kernel.org
Acked-by: Peter Korsgaard <peter@korsgaard.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20210922063008.25758-1-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
alobakin
pushed a commit
that referenced
this pull request
Oct 1, 2021
Ido Schimmel says: ==================== mlxsw: Add support for IP-in-IP with IPv6 underlay Currently, mlxsw only supports IP-in-IP with IPv4 underlay. Traffic routed through 'gre' netdevs is encapsulated with IPv4 and GRE headers. Similarly, incoming IPv4 GRE packets are decapsulated and routed in the overlay VRF (which can be the same as the underlay VRF). This patchset adds support for IPv6 underlay using the 'ip6gre' netdev. Due to architectural differences between Spectrum-1 and later ASICs, this functionality is only supported on Spectrum-2 onwards (the software data path is used for Spectrum-1). Patchset overview: Patches #1-#5 are preparations. Patches #6-#9 add and extend required device registers. Patches #10-#14 gradually add IPv6 underlay support. A follow-up patchset will add net/forwarding/ selftests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Oct 7, 2021
…r-mode' Ido Schimmel says: ==================== ethtool: Add ability to control transceiver modules' power mode This patchset extends the ethtool netlink API to allow user space to control transceiver modules. Two specific APIs are added, but the plan is to extend the interface with more APIs in the future (see "Future plans"). This submission is a complete rework of a previous submission [1] that tried to achieve the same goal by allowing user space to write to the EEPROMs of these modules. It was rejected as it could have enabled user space binary blob drivers. However, the main issue is that by directly writing to some pages of these EEPROMs, we are interfering with the entity that is controlling the modules (kernel / device firmware). In addition, some functionality cannot be implemented solely by writing to the EEPROM, as it requires the assertion / de-assertion of hardware signals (e.g., "ResetL" pin in SFF-8636). Motivation ========== The kernel can currently dump the contents of module EEPROMs to user space via the ethtool legacy ioctl API or the new netlink API. These dumps can then be parsed by ethtool(8) according to the specification that defines the memory map of the EEPROM. For example, SFF-8636 [2] for QSFP and CMIS [3] for QSFP-DD. In addition to read-only elements, these specifications also define writeable elements that can be used to control the behavior of the module. For example, controlling whether the module is put in low or high power mode to limit its power consumption. The CMIS specification even defines a message exchange mechanism (CDB, Command Data Block) on top of the module's memory map. This allows the host to send various commands to the module. For example, to update its firmware. Implementation ============== The ethtool netlink API is extended with two new messages, 'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that allow user space to set and get transceiver module parameters. Specifically, the 'ETHTOOL_A_MODULE_POWER_MODE_POLICY' attribute allows user space to control the power mode policy of the module in order to limit its power consumption. See detailed description in patch #1. The user API is designed to be generic enough so that it could be used for modules with different memory maps (e.g., SFF-8636, CMIS). The only implementation of the device driver API in this series is for a MAC driver (mlxsw) where the module is controlled by the device's firmware, but it is designed to be generic enough so that it could also be used by implementations where the module is controlled by the kernel. Testing and introspection ========================= See detailed description in patches #1 and #5. Patchset overview ================= Patch #1 adds the initial infrastructure in ethtool along with the ability to control transceiver modules' power mode. Patches #2-#3 add required device registers in mlxsw. Patch #4 implements in mlxsw the ethtool operations added in patch #1. Patch #5 adds extended link states in order to allow user space to troubleshoot link down issues related to transceiver modules. Patch #6 adds support for these extended states in mlxsw. Future plans ============ * Extend 'ETHTOOL_MSG_MODULE_SET' to control Tx output among other attributes. * Add new ethtool message(s) to update firmware on transceiver modules. * Extend ethtool(8) to parse more diagnostic information from CMIS modules. No kernel changes required. [1] https://lore.kernel.org/netdev/20210623075925.2610908-1-idosch@idosch.org/ [2] https://members.snia.org/document/dl/26418 [3] http://www.qsfp-dd.com/wp-content/uploads/2021/05/CMIS5p0.pdf Previous versions: [4] https://lore.kernel.org/netdev/20211003073219.1631064-1-idosch@idosch.org/ [5] https://lore.kernel.org/netdev/20210824130344.1828076-1-idosch@idosch.org/ [6] https://lore.kernel.org/netdev/20210818155202.1278177-1-idosch@idosch.org/ [7] https://lore.kernel.org/netdev/20210809102152.719961-1-idosch@idosch.org/ ==================== Link: https://lore.kernel.org/r/20211006104647.2357115-1-idosch@idosch.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Oct 11, 2021
Ido Schimmel says: ==================== selftests: forwarding: Add ip6gre tests This patchset adds forwarding selftests for ip6gre. The tests can be run with veth pairs or with physical loopbacks. Patch #1 adds a new config option to determine if 'skip_sw' / 'skip_hw' flags are used when installing tc filters. By default, it is not set which means the flags are not used. 'skip_sw' is useful to ensure traffic is forwarded by the hardware data path. Patch #2 adds a new helper function. Patches #3-#4 add the forwarding selftests. Patch #5 adds a mlxsw-specific selftest to validate correct behavior of the 'decap_error' trap with IPv6 underlay. Patches #6-#8 align the corresponding IPv4 underlay test to the IPv6 one. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Oct 11, 2021
On SiFive Unmatched, I recently fell onto the following BUG when booting: [ 0.000000] ftrace: allocating 36610 entries in 144 pages [ 0.000000] Oops - illegal instruction [#1] [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.1+ #5 [ 0.000000] Hardware name: SiFive HiFive Unmatched A00 (DT) [ 0.000000] epc : riscv_cpuid_to_hartid_mask+0x6/0xae [ 0.000000] ra : __sbi_rfence_v02+0xc8/0x10a [ 0.000000] epc : ffffffff80007240 ra : ffffffff80009964 sp : ffffffff81803e10 [ 0.000000] gp : ffffffff81a1ea70 tp : ffffffff8180f500 t0 : ffffffe07fe30000 [ 0.000000] t1 : 0000000000000004 t2 : 0000000000000000 s0 : ffffffff81803e60 [ 0.000000] s1 : 0000000000000000 a0 : ffffffff81a22238 a1 : ffffffff81803e10 [ 0.000000] a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 [ 0.000000] a5 : 0000000000000000 a6 : ffffffff8000989c a7 : 0000000052464e43 [ 0.000000] s2 : ffffffff81a220c8 s3 : 0000000000000000 s4 : 0000000000000000 [ 0.000000] s5 : 0000000000000000 s6 : 0000000200000100 s7 : 0000000000000001 [ 0.000000] s8 : ffffffe07fe04040 s9 : ffffffff81a22c80 s10: 0000000000001000 [ 0.000000] s11: 0000000000000004 t3 : 0000000000000001 t4 : 0000000000000008 [ 0.000000] t5 : ffffffcf04000808 t6 : ffffffe3ffddf188 [ 0.000000] status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000002 [ 0.000000] [<ffffffff80007240>] riscv_cpuid_to_hartid_mask+0x6/0xae [ 0.000000] [<ffffffff80009474>] sbi_remote_fence_i+0x1e/0x26 [ 0.000000] [<ffffffff8000b8f4>] flush_icache_all+0x12/0x1a [ 0.000000] [<ffffffff8000666c>] patch_text_nosync+0x26/0x32 [ 0.000000] [<ffffffff8000884e>] ftrace_init_nop+0x52/0x8c [ 0.000000] [<ffffffff800f051e>] ftrace_process_locs.isra.0+0x29c/0x360 [ 0.000000] [<ffffffff80a0e3c6>] ftrace_init+0x80/0x130 [ 0.000000] [<ffffffff80a00f8c>] start_kernel+0x5c4/0x8f6 [ 0.000000] ---[ end trace f67eb9af4d8d492b ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! [ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- While ftrace is looping over a list of addresses to patch, it always failed when patching the same function: riscv_cpuid_to_hartid_mask. Looking at the backtrace, the illegal instruction is encountered in this same function. However, patch_text_nosync, after patching the instructions, calls flush_icache_range. But looking at what happens in this function: flush_icache_range -> flush_icache_all -> sbi_remote_fence_i -> __sbi_rfence_v02 -> riscv_cpuid_to_hartid_mask The icache and dcache of the current cpu are never synchronized between the patching of riscv_cpuid_to_hartid_mask and calling this same function. So fix this by flushing the current cpu's icache before asking for the other cpus to do the same. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Fixes: fab957c ("RISC-V: Atomic and Locking Code") Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
alobakin
pushed a commit
that referenced
this pull request
Oct 14, 2021
Ido Schimmel says: ==================== mlxsw: Add support for ECN mirroring Petr says: Patches in this set have been floating around for some time now together with trap_fwd support. That will however need more work, time for which is nowhere to be found, apparently. Instead, this patchset enables offload of only packet mirroring on RED mark qevent, enabling mirroring of ECN-marked packets. Formally it enables offload of filters added to blocks bound to the RED qevent mark if: - The switch ASIC is Spectrum-2 or above. - Only a single filter is attached at the block, at chain 0 (the default), and its classifier is matchall. - The filter has hw_stats set to disabled. - The filter has a single action, which is mirror. This differs from early_drop qevent offload, which supports mirroring and trapping. However trapping in context of ECN-marked packets is not suitable, because the HW does not drop the packet, as the trap action implies. And there is as of now no way to express only the part of trapping that transfers the packet to the SW datapath, sans the HW-datapath drop. The patchset progresses as follows: Patch #1 is an extack propagation. Mirroring of ECN-marked packets is configured in the ASIC through an ECN trigger, which is considered "egress", unlike the EARLY_DROP trigger. In patch #2, add a helper to classify triggers as ingress. As clarified above, traps cannot be offloaded on mark qevent. Similarly, given a trap_fwd action, it would not be offloadable on early_drop qevent. In patch #3, introduce support for tracking actions permissible on a given block. Patch #4 actually adds the mark qevent offload. In patch #5, fix a small style issue in one of the selftests, and in patch #6 add mark offload selftests. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Oct 18, 2021
Sebastian Andrzej Siewior says:
====================
Try to simplify the gnet_stats and remove qdisc->running sequence counter.
The first few patches is a follow up to
https://lore.kernel.org/all/20211007175000.2334713-1-bigeasy@linutronix.de/
The remaining patches (#5+) remove the seqcount_t (Qdisc::running) from
the Qdisc. The statistics (Qdisc::bstats and Qdisc::cpu_bstats) use
u64_stats_t and the "running state" is now represented by a bit in
Qdisc::state.
By removing the seqcount_t from Qdisc and decoupling the bstats
statistics from the seqcount_t it is possible to query the statistics
even if the Qdisc is running instead of waiting until it is idle again.
The try-lock like usage of the seqcount_t in qdisc_run_begin() is
problematic on PREEMPT_RT. Inside the qdisc_run_begin/end() qdisc->running
sequence counter write sections, at sch_direct_xmit(), the seqcount write
serialization lock is released then re-acquired. This is fine for !RT, because
the writer is in a BH disabled region and there is a no in-IRQ reader. For RT
though, BH sections are preemptible. The earlier introduced seqcount_LOCKNAME_t
mechanism, which for RT the reader acquires then relesaes the write
serailization lock to avoid infinite spinning if it preempts a seqcount write
section, cannot work: the qdisc->running write serialization lock is already
intermittingly released inside the seqcount write section.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Oct 20, 2021
Ido Schimmel says: ==================== mlxsw: Multi-level qdisc offload Petr says: Currently, mlxsw admits for offload a suitable root qdisc, and its children. Thus up to two levels of hierarchy are offloaded. Often, this is enough: one can configure TCs with RED and TCs with a shaper on, and can even see counters for each TC by looking at a qdisc at a sufficiently shallow position. While simple, the system has obvious shortcomings. It is not possible to configure both RED and shaping on one TC. It is not possible to place a PRIO below root TBF, which would then be offloaded as port shaper. FIFOs are only offloaded at root or directly below, which is confusing to users, because RED and TBF of course have their own FIFO. This patch set lifts assumptions that prevent offloading multi-level qdisc trees. In patch #1, offload of a graft operation is added to TBF. Grafts are issued as another qdisc is linked to the qdisc in question, and give drivers a chance to react to the linking. The absence of this event was not a major issue so far, because TBF was not considered classful, which changes with this patchset. The codebase currently assumes that ETS and PRIO are the only classful qdiscs. The following patches gradually lift this assumption. In patch #2, calculation of traffic class and priomap of a qdisc is fixed. Patch #3 fixes handling of future FIFOs. Child FIFO qdiscs may be created and notified before their parent qdisc exists and therefore need special handling. Patches #4, #5 and #6 unify, respectively, child destruction, child grafting, and cleanup of statistics. Patch #7 adds a function that validates whether a given qdisc topology is offloadable. Finally in patch #8, TBF and RED become classful. At this point, FIFO qdiscs grafted to an offloaded qdisc should always be offloaded. Patch #9 adds a selftest to verify some offloadable and unoffloadable qdisc trees. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Oct 29, 2021
Ido Schimmel says: ==================== mlxsw: Support multiple RIF MAC prefixes Currently, mlxsw enforces that all the netdevs used as router interfaces (RIFs) have the same MAC prefix (e.g., same 38 MSBs in Spectrum-1). Otherwise, an error is returned to user space with extack. This patchset relaxes the limitation through the use of RIF MAC profiles. A RIF MAC profile is a hardware entity that represents a particular MAC prefix which multiple RIFs can reference. Therefore, the number of possible MAC prefixes is no longer one, but the number of profiles supported by the device. The ability to change the MAC of a particular netdev is useful, for example, for users who use the netdev to connect to an upstream provider that performs MAC filtering. Currently, such users are either forced to negotiate with the provider or change the MAC address of all other netdevs so that they share the same prefix. Patchset overview: Patches #1-#3 are preparations. Patch #4 adds actual support for RIF MAC profiles. Patch #5 exposes RIF MAC profiles as a devlink resource, so that user space has visibility into the maximum number of profiles and current occupancy. Useful for debugging and testing (next 3 patches). Patches #6-#8 add both scale and functional tests. Patch #9 removes tests that validated the previous limitation. It is now covered by patch #6 for devices that support a single profile. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
alobakin
pushed a commit
that referenced
this pull request
Nov 3, 2021
We got the following lockdep splat while running fstests (specifically btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered by 87579e9 ("loop: use worker per cgroup instead of kworker") which converted loop to using workqueues, which comes with lockdep annotations that don't exist with kworkers. The lockdep splat is as follows: WARNING: possible circular locking dependency detected 5.14.0-rc2-custom+ torvalds#34 Not tainted ------------------------------------------------------ losetup/156417 is trying to acquire lock: ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600 but task is already holding lock: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&lo->lo_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 lo_open+0x28/0x60 [loop] blkdev_get_whole+0x28/0xf0 blkdev_get_by_dev.part.0+0x168/0x3c0 blkdev_open+0xd2/0xe0 do_dentry_open+0x163/0x3a0 path_openat+0x74d/0xa40 do_filp_open+0x9c/0x140 do_sys_openat2+0xb1/0x170 __x64_sys_openat+0x54/0x90 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #4 (&disk->open_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 blkdev_get_by_dev.part.0+0xd1/0x3c0 blkdev_get_by_path+0xc0/0xd0 btrfs_scan_one_device+0x52/0x1f0 [btrfs] btrfs_control_ioctl+0xac/0x170 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #3 (uuid_mutex){+.+.}-{3:3}: __mutex_lock+0xba/0x7c0 btrfs_rm_device+0x48/0x6a0 [btrfs] btrfs_ioctl+0x2d1c/0x3110 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #2 (sb_writers#11){.+.+}-{0:0}: lo_write_bvec+0x112/0x290 [loop] loop_process_work+0x25f/0xcb0 [loop] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}: process_one_work+0x266/0x5d0 worker_thread+0x55/0x3c0 kthread+0x140/0x170 ret_from_fork+0x22/0x30 -> #0 ((wq_completion)loop0){+.+.}-{0:0}: __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 flush_workqueue+0xae/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Chain exists of: (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&lo->lo_mutex); lock(&disk->open_mutex); lock(&lo->lo_mutex); lock((wq_completion)loop0); *** DEADLOCK *** 1 lock held by losetup/156417: #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop] stack backtrace: CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ torvalds#34 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack_lvl+0x57/0x72 check_noncircular+0x10a/0x120 __lock_acquire+0x1130/0x1dc0 lock_acquire+0xf5/0x320 ? flush_workqueue+0x84/0x600 flush_workqueue+0xae/0x600 ? flush_workqueue+0x84/0x600 drain_workqueue+0xa0/0x110 destroy_workqueue+0x36/0x250 __loop_clr_fd+0x9a/0x650 [loop] lo_ioctl+0x29d/0x780 [loop] ? __lock_acquire+0x3a0/0x1dc0 ? update_dl_rq_load_avg+0x152/0x360 ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 block_ioctl+0x3f/0x50 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f645884de6b Usually the uuid_mutex exists to protect the fs_devices that map together all of the devices that match a specific uuid. In rm_device we're messing with the uuid of a device, so it makes sense to protect that here. However in doing that it pulls in a whole host of lockdep dependencies, as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus we end up with the dependency chain under the uuid_mutex being added under the normal sb write dependency chain, which causes problems with loop devices. We don't need the uuid mutex here however. If we call btrfs_scan_one_device() before we scratch the super block we will find the fs_devices and not find the device itself and return EBUSY because the fs_devices is open. If we call it after the scratch happens it will not appear to be a valid btrfs file system. We do not need to worry about other fs_devices modifying operations here because we're protected by the exclusive operations locking. So drop the uuid_mutex here in order to fix the lockdep splat. A more detailed explanation from the discussion: We are worried about rm and scan racing with each other, before this change we'll zero the device out under the UUID mutex so when scan does run it'll make sure that it can go through the whole device scan thing without rm messing with us. We aren't worried if the scratch happens first, because the result is we don't think this is a btrfs device and we bail out. The only case we are concerned with is we scratch _after_ scan is able to read the superblock and gets a seemingly valid super block, so lets consider this case. Scan will call device_list_add() with the device we're removing. We'll call find_fsid_with_metadata_uuid() and get our fs_devices for this UUID. At this point we lock the fs_devices->device_list_mutex. This is what protects us in this case, but we have two cases here. 1. We aren't to the device removal part of the RM. We found our device, and device name matches our path, we go down and we set total_devices to our super number of devices, which doesn't affect anything because we haven't done the remove yet. 2. We are past the device removal part, which is protected by the device_list_mutex. Scan doesn't find the device, it goes down and does the if (fs_devices->opened) return -EBUSY; check and we bail out. Nothing about this situation is ideal, but the lockdep splat is real, and the fix is safe, tho admittedly a bit scary looking. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more from the discussion ] Signed-off-by: David Sterba <dsterba@suse.com>
alobakin
pushed a commit
that referenced
this pull request
Nov 3, 2021
Attempting to defragment a Btrfs file containing a transparent huge page immediately deadlocks with the following stack trace: #0 context_switch (kernel/sched/core.c:4940:2) #1 __schedule (kernel/sched/core.c:6287:8) #2 schedule (kernel/sched/core.c:6366:3) #3 io_schedule (kernel/sched/core.c:8389:2) #4 wait_on_page_bit_common (mm/filemap.c:1356:4) #5 __lock_page (mm/filemap.c:1648:2) #6 lock_page (./include/linux/pagemap.h:625:3) #7 pagecache_get_page (mm/filemap.c:1910:4) #8 find_or_create_page (./include/linux/pagemap.h:420:9) #9 defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9) #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14) #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9) #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9) #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9) #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10) #15 vfs_ioctl (fs/ioctl.c:51:10) #16 __do_sys_ioctl (fs/ioctl.c:874:11) #17 __se_sys_ioctl (fs/ioctl.c:860:1) #18 __x64_sys_ioctl (fs/ioctl.c:860:1) #19 do_syscall_x64 (arch/x86/entry/common.c:50:14) #20 do_syscall_64 (arch/x86/entry/common.c:80:7) #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113) A huge page is represented by a compound page, which consists of a struct page for each PAGE_SIZE page within the huge page. The first struct page is the "head page", and the remaining are "tail pages". Defragmentation attempts to lock each page in the range. However, lock_page() on a tail page actually locks the corresponding head page. So, if defragmentation tries to lock more than one struct page in a compound page, it tries to lock the same head page twice and deadlocks with itself. Ideally, we should be able to defragment transparent huge pages. However, THP for filesystems is currently read-only, so a lot of code is not ready to use huge pages for I/O. For now, let's just return ETXTBUSY. This can be reproduced with the following on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y: $ cat create_thp_file.c #include <fcntl.h> #include <stdbool.h> #include <stdio.h> #include <stdint.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> static const char zeroes[1024 * 1024]; static const size_t FILE_SIZE = 2 * 1024 * 1024; int main(int argc, char **argv) { if (argc != 2) { fprintf(stderr, "usage: %s PATH\n", argv[0]); return EXIT_FAILURE; } int fd = creat(argv[1], 0777); if (fd == -1) { perror("creat"); return EXIT_FAILURE; } size_t written = 0; while (written < FILE_SIZE) { ssize_t ret = write(fd, zeroes, sizeof(zeroes) < FILE_SIZE - written ? sizeof(zeroes) : FILE_SIZE - written); if (ret < 0) { perror("write"); return EXIT_FAILURE; } written += ret; } close(fd); fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("open"); return EXIT_FAILURE; } /* * Reserve some address space so that we can align the file mapping to * the huge page size. */ void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (placeholder_map == MAP_FAILED) { perror("mmap (placeholder)"); return EXIT_FAILURE; } void *aligned_address = (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1)); void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC, MAP_SHARED | MAP_FIXED, fd, 0); if (map == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; } if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) { perror("madvise"); return EXIT_FAILURE; } char *line = NULL; size_t line_capacity = 0; FILE *smaps_file = fopen("/proc/self/smaps", "r"); if (!smaps_file) { perror("fopen"); return EXIT_FAILURE; } for (;;) { for (size_t off = 0; off < FILE_SIZE; off += 4096) ((volatile char *)map)[off]; ssize_t ret; bool this_mapping = false; while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) { unsigned long start, end, huge; if (sscanf(line, "%lx-%lx", &start, &end) == 2) { this_mapping = (start <= (uintptr_t)map && (uintptr_t)map < end); } else if (this_mapping && sscanf(line, "FilePmdMapped: %ld", &huge) == 1 && huge > 0) { return EXIT_SUCCESS; } } sleep(6); rewind(smaps_file); fflush(smaps_file); } } $ ./create_thp_file huge $ btrfs fi defrag -czstd ./huge Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
alobakin
pushed a commit
that referenced
this pull request
Nov 3, 2021
The Xen interrupt injection for event channels relies on accessing the
guest's vcpu_info structure in __kvm_xen_has_interrupt(), through a
gfn_to_hva_cache.
This requires the srcu lock to be held, which is mostly the case except
for this code path:
[ 11.822877] WARNING: suspicious RCU usage
[ 11.822965] -----------------------------
[ 11.823013] include/linux/kvm_host.h:664 suspicious rcu_dereference_check() usage!
[ 11.823131]
[ 11.823131] other info that might help us debug this:
[ 11.823131]
[ 11.823196]
[ 11.823196] rcu_scheduler_active = 2, debug_locks = 1
[ 11.823253] 1 lock held by dom:0/90:
[ 11.823292] #0: ffff998956ec8118 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x680
[ 11.823379]
[ 11.823379] stack backtrace:
[ 11.823428] CPU: 2 PID: 90 Comm: dom:0 Kdump: loaded Not tainted 5.4.34+ #5
[ 11.823496] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[ 11.823612] Call Trace:
[ 11.823645] dump_stack+0x7a/0xa5
[ 11.823681] lockdep_rcu_suspicious+0xc5/0x100
[ 11.823726] __kvm_xen_has_interrupt+0x179/0x190
[ 11.823773] kvm_cpu_has_extint+0x6d/0x90
[ 11.823813] kvm_cpu_accept_dm_intr+0xd/0x40
[ 11.823853] kvm_vcpu_ready_for_interrupt_injection+0x20/0x30
< post_kvm_run_save() inlined here >
[ 11.823906] kvm_arch_vcpu_ioctl_run+0x135/0x6a0
[ 11.823947] kvm_vcpu_ioctl+0x263/0x680
Fixes: 40da8cc ("KVM: x86/xen: Add event channel interrupt vector upcall")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Message-Id: <606aaaf29fca3850a63aa4499826104e77a72346.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Nov 23, 2021
Convert the netfs helper library to use folios throughout, convert the 9p and afs filesystems to use folios in their file I/O paths and convert the ceph filesystem to use just enough folios to compile. With these changes, afs passes -g quick xfstests. Changes ======= ver #5: - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does instead (Willy decided he didn't want this after all). ver #4: - Fixed a bug in afs_redirty_page() whereby it didn't set the next page index in the loop and returned too early. - Simplified a check in v9fs_vfs_write_folio_locked()[1]. - Undid a change to afs_symlink_readpage()[1]. - Used offset_in_folio() in afs_write_end()[1]. - Changed from using page_endio() to folio_end{io,_read,_write}()[1]. ver #2: - Add 9p foliation. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dominique Martinet <asmadeus@codewreck.org> Tested-by: kafs-testing@auristor.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Dominique Martinet <asmadeus@codewreck.org> cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cachefs@redhat.com Link: https://lore.kernel.org/r/YYKa3bfQZxK5/wDN@casper.infradead.org/ [1] Link: https://lore.kernel.org/r/2408234.1628687271@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
alobakin
pushed a commit
that referenced
this pull request
Nov 23, 2021
The exit function fixes a memory leak with the src field as detected by
leak sanitizer. An example of which is:
Indirect leak of 25133184 byte(s) in 207 object(s) allocated from:
#0 0x7f199ecfe987 in __interceptor_calloc libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55defe638224 in annotated_source__alloc_histograms util/annotate.c:803
#2 0x55defe6397e4 in symbol__hists util/annotate.c:952
#3 0x55defe639908 in symbol__inc_addr_samples util/annotate.c:968
#4 0x55defe63aa29 in hist_entry__inc_addr_samples util/annotate.c:1119
#5 0x55defe499a79 in hist_iter__report_callback tools/perf/builtin-report.c:182
#6 0x55defe7a859d in hist_entry_iter__add util/hist.c:1236
#7 0x55defe49aa63 in process_sample_event tools/perf/builtin-report.c:315
#8 0x55defe731bc8 in evlist__deliver_sample util/session.c:1473
#9 0x55defe731e38 in machines__deliver_event util/session.c:1510
#10 0x55defe732a23 in perf_session__deliver_event util/session.c:1590
#11 0x55defe72951e in ordered_events__deliver_event util/session.c:183
#12 0x55defe740082 in do_flush util/ordered-events.c:244
#13 0x55defe7407cb in __ordered_events__flush util/ordered-events.c:323
#14 0x55defe740a61 in ordered_events__flush util/ordered-events.c:341
#15 0x55defe73837f in __perf_session__process_events util/session.c:2390
#16 0x55defe7385ff in perf_session__process_events util/session.c:2420
...
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Liška <mliska@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: https://lore.kernel.org/r/20211112035124.94327-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
When reading the voltage: $ cat /sys/bus/iio/devices/iio\:device0/in_voltage0_raw Lockdep complains: [ 153.910616] ====================================================== [ 153.916918] WARNING: possible circular locking dependency detected [ 153.923221] 5.14.0+ #5 Not tainted [ 153.926692] ------------------------------------------------------ [ 153.932992] cat/717 is trying to acquire lock: [ 153.937525] c2585358 (&indio_dev->mlock){+.+.}-{3:3}, at: iio_device_claim_direct_mode+0x28/0x44 [ 153.946541] but task is already holding lock: [ 153.952487] c2585860 (&dln2->mutex){+.+.}-{3:3}, at: dln2_adc_read_raw+0x94/0x2bc [dln2_adc] [ 153.961152] which lock already depends on the new lock. Fix this by not calling into the iio core underneath the dln2->mutex lock. Fixes: 7c0299e ("iio: adc: Add support for DLN2 ADC") Cc: Jack Andersen <jackoalan@gmail.com> Signed-off-by: Noralf Trønnes <noralf@tronnes.org> Link: https://lore.kernel.org/r/20211018113731.25723-1-noralf@tronnes.org Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
With preemption enabled (CONFIG_DEBUG_PREEMPT=y), the following appeared when rnbd client tries to map remote block device. BUG: using smp_processor_id() in preemptible [00000000] code: bash/1733 caller is debug_smp_processor_id+0x17/0x20 CPU: 0 PID: 1733 Comm: bash Not tainted 5.16.0-rc1 #5 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x78 dump_stack+0x10/0x12 check_preemption_disabled+0xe4/0xf0 debug_smp_processor_id+0x17/0x20 rtrs_clt_update_all_stats+0x3b/0x70 [rtrs_client] rtrs_clt_read_req+0xc3/0x380 [rtrs_client] ? rtrs_clt_init_req+0xe3/0x120 [rtrs_client] rtrs_clt_request+0x1a7/0x320 [rtrs_client] ? 0xffffffffc0ab1000 send_usr_msg+0xbf/0x160 [rnbd_client] ? rnbd_clt_put_sess+0x60/0x60 [rnbd_client] ? send_usr_msg+0x160/0x160 [rnbd_client] ? sg_alloc_table+0x27/0xb0 ? sg_zero_buffer+0xd0/0xd0 send_msg_sess_info+0xe9/0x180 [rnbd_client] ? rnbd_clt_put_sess+0x60/0x60 [rnbd_client] ? blk_mq_alloc_tag_set+0x2ef/0x370 rnbd_clt_map_device+0xba8/0xcd0 [rnbd_client] ? send_msg_open+0x200/0x200 [rnbd_client] rnbd_clt_map_device_store+0x3e5/0x620 [rnbd_client To supress the calltrace, let's call get_cpu_ptr/put_cpu_ptr pair in rtrs_clt_update_rdma_stats to disable preemption when accessing per-cpu variable. While at it, let's make the similar change in rtrs_clt_update_wc_stats. And for rtrs_clt_inc_failover_cnt, though it was only called inside rcu section, but it still can be preempted in case CONFIG_PREEMPT_RCU is enabled, so change it to {get,put}_cpu_ptr pair either. Link: https://lore.kernel.org/r/20211128133501.38710-1-guoqing.jiang@linux.dev Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
When kmalloc in nfc_genl_dump_devices() fails then nfc_genl_dump_devices_done() segfaults as below KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 25 Comm: kworker/0:1 Not tainted 5.16.0-rc4-01180-g2a987e65025e-dirty #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-6.fc35 04/01/2014 Workqueue: events netlink_sock_destruct_work RIP: 0010:klist_iter_exit+0x26/0x80 Call Trace: <TASK> class_dev_iter_exit+0x15/0x20 nfc_genl_dump_devices_done+0x3b/0x50 genl_lock_done+0x84/0xd0 netlink_sock_destruct+0x8f/0x270 __sk_destruct+0x64/0x3b0 sk_destruct+0xa8/0xd0 __sk_free+0x2e8/0x3d0 sk_free+0x51/0x90 netlink_sock_destruct_work+0x1c/0x20 process_one_work+0x411/0x710 worker_thread+0x6fd/0xa80 Link: https://syzkaller.appspot.com/bug?id=fc0fa5a53db9edd261d56e74325419faf18bd0df Reported-by: syzbot+f9f76f4a0766420b4a02@syzkaller.appspotmail.com Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Link: https://lore.kernel.org/r/20211208182742.340542-1-tadeusz.struk@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
For whatever reason, some devices like QCA6390, WCN6855 using ath11k are not in M3 state during PM resume, but still functional. The mhi_pm_resume should then not fail in those cases, and let the higher level device specific stack continue resuming process. Add an API mhi_pm_resume_force(), to force resuming irrespective of the current MHI state. This fixes a regression with non functional ath11k WiFi after suspend/resume cycle on some machines. Bug report: https://bugzilla.kernel.org/show_bug.cgi?id=214179 Link: https://lore.kernel.org/regressions/871r5p0x2u.fsf@codeaurora.org/ Fixes: 020d3b2 ("bus: mhi: Early MHI resume failure in non M3 state") Cc: stable@vger.kernel.org #5.13 Reported-by: Kalle Valo <kvalo@codeaurora.org> Reported-by: Pengyu Ma <mapengyu@gmail.com> Tested-by: Kalle Valo <kvalo@kernel.org> Acked-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Loic Poulain <loic.poulain@linaro.org> [mani: Switched to API, added bug report, reported-by tags and CCed stable] Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Link: https://lore.kernel.org/r/20211209131633.4168-1-manivannan.sadhasivam@linaro.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
The fixed commit attempts to close inject.output even if it was never opened e.g. $ perf record uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.002 MB perf.data (7 samples) ] $ perf inject -i perf.data --vm-time-correlation=dry-run Segmentation fault (core dumped) $ gdb --quiet perf Reading symbols from perf... (gdb) r inject -i perf.data --vm-time-correlation=dry-run Starting program: /home/ahunter/bin/perf inject -i perf.data --vm-time-correlation=dry-run [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. 0x00007eff8afeef5b in _IO_new_fclose (fp=0x0) at iofclose.c:48 48 iofclose.c: No such file or directory. (gdb) bt #0 0x00007eff8afeef5b in _IO_new_fclose (fp=0x0) at iofclose.c:48 #1 0x0000557fc7b74f92 in perf_data__close (data=data@entry=0x7ffcdafa6578) at util/data.c:376 #2 0x0000557fc7a6b807 in cmd_inject (argc=<optimized out>, argv=<optimized out>) at builtin-inject.c:1085 #3 0x0000557fc7ac4783 in run_builtin (p=0x557fc8074878 <commands+600>, argc=4, argv=0x7ffcdafb6a60) at perf.c:313 #4 0x0000557fc7a25d5c in handle_internal_command (argv=<optimized out>, argc=<optimized out>) at perf.c:365 #5 run_argv (argcp=<optimized out>, argv=<optimized out>) at perf.c:409 #6 main (argc=4, argv=0x7ffcdafb6a60) at perf.c:539 (gdb) Fixes: 02e6246 ("perf inject: Close inject.output on exit") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/20211213084829.114772-2-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Dec 22, 2021
The fixed commit attempts to get the output file descriptor even if the file was never opened e.g. $ perf record uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.002 MB perf.data (7 samples) ] $ perf inject -i perf.data --vm-time-correlation=dry-run Segmentation fault (core dumped) $ gdb --quiet perf Reading symbols from perf... (gdb) r inject -i perf.data --vm-time-correlation=dry-run Starting program: /home/ahunter/bin/perf inject -i perf.data --vm-time-correlation=dry-run [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. __GI___fileno (fp=0x0) at fileno.c:35 35 fileno.c: No such file or directory. (gdb) bt #0 __GI___fileno (fp=0x0) at fileno.c:35 #1 0x00005621e48dd987 in perf_data__fd (data=0x7fff4c68bd08) at util/data.h:72 #2 perf_data__fd (data=0x7fff4c68bd08) at util/data.h:69 #3 cmd_inject (argc=<optimized out>, argv=0x7fff4c69c1f0) at builtin-inject.c:1017 #4 0x00005621e4936783 in run_builtin (p=0x5621e4ee6878 <commands+600>, argc=4, argv=0x7fff4c69c1f0) at perf.c:313 #5 0x00005621e4897d5c in handle_internal_command (argv=<optimized out>, argc=<optimized out>) at perf.c:365 #6 run_argv (argcp=<optimized out>, argv=<optimized out>) at perf.c:409 #7 main (argc=4, argv=0x7fff4c69c1f0) at perf.c:539 (gdb) Fixes: 0ae0389 ("perf tools: Pass a fd to perf_file_header__read_pipe()") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Riccardo Mancini <rickyman7@gmail.com> Cc: stable@vger.kernel.org Link: http://lore.kernel.org/lkml/20211213084829.114772-3-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Mar 28, 2025
…uctions
Add several ./test_progs tests:
- arena_atomics/load_acquire
- arena_atomics/store_release
- verifier_load_acquire/*
- verifier_store_release/*
- verifier_precision/bpf_load_acquire
- verifier_precision/bpf_store_release
The last two tests are added to check if backtrack_insn() handles the
new instructions correctly.
Additionally, the last test also makes sure that the verifier
"remembers" the value (in src_reg) we store-release into e.g. a stack
slot. For example, if we take a look at the test program:
#0: r1 = 8;
/* store_release((u64 *)(r10 - 8), r1); */
#1: .8byte %[store_release];
#2: r1 = *(u64 *)(r10 - 8);
#3: r2 = r10;
#4: r2 += r1;
#5: r0 = 0;
#6: exit;
At #1, if the verifier doesn't remember that we wrote 8 to the stack,
then later at #4 we would be adding an unbounded scalar value to the
stack pointer, which would cause the program to be rejected:
VERIFIER LOG:
=============
...
math between fp pointer and register with unbounded min value is not allowed
For easier CI integration, instead of using built-ins like
__atomic_{load,store}_n() which depend on the new
__BPF_FEATURE_LOAD_ACQ_STORE_REL pre-defined macro, manually craft
load-acquire/store-release instructions using __imm_insn(), as suggested
by Eduard.
All new tests depend on:
(1) Clang major version >= 18, and
(2) ENABLE_ATOMICS_TESTS is defined (currently implies -mcpu=v3 or
v4), and
(3) JIT supports load-acquire/store-release (currently arm64 and
x86-64)
In .../progs/arena_atomics.c:
/* 8-byte-aligned */
__u8 __arena_global load_acquire8_value = 0x12;
/* 1-byte hole */
__u16 __arena_global load_acquire16_value = 0x1234;
That 1-byte hole in the .addr_space.1 ELF section caused clang-17 to
crash:
fatal error: error in backend: unable to write nop sequence of 1 bytes
To work around such llvm-17 CI job failures, conditionally define
__arena_global variables as 64-bit if __clang_major__ < 18, to make sure
.addr_space.1 has no holes. Ideally we should avoid compiling this file
using clang-17 at all (arena tests depend on
__BPF_FEATURE_ADDR_SPACE_CAST, and are skipped for llvm-17 anyway), but
that is a separate topic.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/1b46c6feaf0f1b6984d9ec80e500cc7383e9da1a.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Mar 28, 2025
In recent kernels, there are lockdep splats around the struct request_queue::io_lockdep_map, similar to [1], but they typically don't show up until reclaim with writeback happens. Having multiple kernel versions released with a known risc of kernel deadlock during reclaim writeback should IMHO be addressed and backported to -stable with the highest priority. In order to have these lockdep splats show up earlier, preferrably during system initialization, prime the struct request_queue::io_lockdep_map as GFP_KERNEL reclaim- tainted. This will instead lead to lockdep splats looking similar to [2], but without the need for reclaim + writeback happening. [1]: [ 189.762244] ====================================================== [ 189.762432] WARNING: possible circular locking dependency detected [ 189.762441] 6.14.0-rc6-xe+ #6 Tainted: G U [ 189.762450] ------------------------------------------------------ [ 189.762459] kswapd0/119 is trying to acquire lock: [ 189.762467] ffff888110ceb710 (&q->q_usage_counter(io)#26){++++}-{0:0}, at: __submit_bio+0x76/0x230 [ 189.762485] but task is already holding lock: [ 189.762494] ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.762507] which lock already depends on the new lock. [ 189.762519] the existing dependency chain (in reverse order) is: [ 189.762529] -> #2 (fs_reclaim){+.+.}-{0:0}: [ 189.762540] fs_reclaim_acquire+0xc5/0x100 [ 189.762548] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 189.762558] alloc_inode+0xaa/0xe0 [ 189.762566] iget_locked+0x157/0x330 [ 189.762573] kernfs_get_inode+0x1b/0x110 [ 189.762582] kernfs_get_tree+0x1b0/0x2e0 [ 189.762590] sysfs_get_tree+0x1f/0x60 [ 189.762597] vfs_get_tree+0x2a/0xf0 [ 189.762605] path_mount+0x4cd/0xc00 [ 189.762613] __x64_sys_mount+0x119/0x150 [ 189.762621] x64_sys_call+0x14f2/0x2310 [ 189.762630] do_syscall_64+0x91/0x180 [ 189.762637] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762647] -> #1 (&root->kernfs_rwsem){++++}-{3:3}: [ 189.762659] down_write+0x3e/0xf0 [ 189.762667] kernfs_remove+0x32/0x60 [ 189.762676] sysfs_remove_dir+0x4f/0x60 [ 189.762685] __kobject_del+0x33/0xa0 [ 189.762709] kobject_del+0x13/0x30 [ 189.762716] elv_unregister_queue+0x52/0x80 [ 189.762725] elevator_switch+0x68/0x360 [ 189.762733] elv_iosched_store+0x14b/0x1b0 [ 189.762756] queue_attr_store+0x181/0x1e0 [ 189.762765] sysfs_kf_write+0x49/0x80 [ 189.762773] kernfs_fop_write_iter+0x17d/0x250 [ 189.762781] vfs_write+0x281/0x540 [ 189.762790] ksys_write+0x72/0xf0 [ 189.762798] __x64_sys_write+0x19/0x30 [ 189.762807] x64_sys_call+0x2a3/0x2310 [ 189.762815] do_syscall_64+0x91/0x180 [ 189.762823] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 189.762833] -> #0 (&q->q_usage_counter(io)#26){++++}-{0:0}: [ 189.762845] __lock_acquire+0x1525/0x2760 [ 189.762854] lock_acquire+0xca/0x310 [ 189.762861] blk_mq_submit_bio+0x8a2/0xba0 [ 189.762870] __submit_bio+0x76/0x230 [ 189.762878] submit_bio_noacct_nocheck+0x323/0x430 [ 189.762888] submit_bio_noacct+0x2cc/0x620 [ 189.762896] submit_bio+0x38/0x110 [ 189.762904] __swap_writepage+0xf5/0x380 [ 189.762912] swap_writepage+0x3c7/0x600 [ 189.762920] shmem_writepage+0x3da/0x4f0 [ 189.762929] pageout+0x13f/0x310 [ 189.762937] shrink_folio_list+0x61c/0xf60 [ 189.763261] evict_folios+0x378/0xcd0 [ 189.763584] try_to_shrink_lruvec+0x1b0/0x360 [ 189.763946] shrink_one+0x10e/0x200 [ 189.764266] shrink_node+0xc02/0x1490 [ 189.764586] balance_pgdat+0x563/0xb00 [ 189.764934] kswapd+0x1e8/0x430 [ 189.765249] kthread+0x10b/0x260 [ 189.765559] ret_from_fork+0x44/0x70 [ 189.765889] ret_from_fork_asm+0x1a/0x30 [ 189.766198] other info that might help us debug this: [ 189.767089] Chain exists of: &q->q_usage_counter(io)#26 --> &root->kernfs_rwsem --> fs_reclaim [ 189.767971] Possible unsafe locking scenario: [ 189.768555] CPU0 CPU1 [ 189.768849] ---- ---- [ 189.769136] lock(fs_reclaim); [ 189.769421] lock(&root->kernfs_rwsem); [ 189.769714] lock(fs_reclaim); [ 189.770016] rlock(&q->q_usage_counter(io)#26); [ 189.770305] *** DEADLOCK *** [ 189.771167] 1 lock held by kswapd0/119: [ 189.771453] #0: ffffffff834c97c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xbe/0xb00 [ 189.771770] stack backtrace: [ 189.772351] CPU: 4 UID: 0 PID: 119 Comm: kswapd0 Tainted: G U 6.14.0-rc6-xe+ #6 [ 189.772353] Tainted: [U]=USER [ 189.772354] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 189.772354] Call Trace: [ 189.772355] <TASK> [ 189.772356] dump_stack_lvl+0x6e/0xa0 [ 189.772359] dump_stack+0x10/0x18 [ 189.772360] print_circular_bug.cold+0x17a/0x1b7 [ 189.772363] check_noncircular+0x13a/0x150 [ 189.772365] ? __pfx_stack_trace_consume_entry+0x10/0x10 [ 189.772368] __lock_acquire+0x1525/0x2760 [ 189.772368] ? ret_from_fork_asm+0x1a/0x30 [ 189.772371] lock_acquire+0xca/0x310 [ 189.772372] ? __submit_bio+0x76/0x230 [ 189.772375] ? lock_release+0xd5/0x2c0 [ 189.772376] blk_mq_submit_bio+0x8a2/0xba0 [ 189.772378] ? __submit_bio+0x76/0x230 [ 189.772380] __submit_bio+0x76/0x230 [ 189.772382] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772384] submit_bio_noacct_nocheck+0x323/0x430 [ 189.772386] ? submit_bio_noacct_nocheck+0x323/0x430 [ 189.772387] ? __might_sleep+0x58/0xa0 [ 189.772390] submit_bio_noacct+0x2cc/0x620 [ 189.772391] ? count_memcg_events+0x68/0x90 [ 189.772393] submit_bio+0x38/0x110 [ 189.772395] __swap_writepage+0xf5/0x380 [ 189.772396] swap_writepage+0x3c7/0x600 [ 189.772397] shmem_writepage+0x3da/0x4f0 [ 189.772401] pageout+0x13f/0x310 [ 189.772406] shrink_folio_list+0x61c/0xf60 [ 189.772409] ? isolate_folios+0xe80/0x16b0 [ 189.772410] ? mark_held_locks+0x46/0x90 [ 189.772412] evict_folios+0x378/0xcd0 [ 189.772414] ? evict_folios+0x34a/0xcd0 [ 189.772415] ? lock_is_held_type+0xa3/0x130 [ 189.772417] try_to_shrink_lruvec+0x1b0/0x360 [ 189.772420] shrink_one+0x10e/0x200 [ 189.772421] shrink_node+0xc02/0x1490 [ 189.772423] ? shrink_node+0xa08/0x1490 [ 189.772424] ? shrink_node+0xbd8/0x1490 [ 189.772425] ? mem_cgroup_iter+0x366/0x480 [ 189.772427] balance_pgdat+0x563/0xb00 [ 189.772428] ? balance_pgdat+0x563/0xb00 [ 189.772430] ? trace_hardirqs_on+0x1e/0xe0 [ 189.772431] ? finish_task_switch.isra.0+0xcb/0x330 [ 189.772433] ? __switch_to_asm+0x33/0x70 [ 189.772437] kswapd+0x1e8/0x430 [ 189.772438] ? __pfx_autoremove_wake_function+0x10/0x10 [ 189.772440] ? __pfx_kswapd+0x10/0x10 [ 189.772441] kthread+0x10b/0x260 [ 189.772443] ? __pfx_kthread+0x10/0x10 [ 189.772444] ret_from_fork+0x44/0x70 [ 189.772446] ? __pfx_kthread+0x10/0x10 [ 189.772447] ret_from_fork_asm+0x1a/0x30 [ 189.772450] </TASK> [2]: [ 8.760253] ====================================================== [ 8.760254] WARNING: possible circular locking dependency detected [ 8.760255] 6.14.0-rc6-xe+ #7 Tainted: G U [ 8.760256] ------------------------------------------------------ [ 8.760257] (udev-worker)/674 is trying to acquire lock: [ 8.760259] ffff888100e39148 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_remove+0x32/0x60 [ 8.760265] but task is already holding lock: [ 8.760266] ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760272] which lock already depends on the new lock. [ 8.760272] the existing dependency chain (in reverse order) is: [ 8.760273] -> #2 (&q->q_usage_counter(io)#27){++++}-{0:0}: [ 8.760276] blk_alloc_queue+0x30a/0x350 [ 8.760279] blk_mq_alloc_queue+0x6b/0xe0 [ 8.760281] scsi_alloc_sdev+0x276/0x3c0 [ 8.760284] scsi_probe_and_add_lun+0x22a/0x440 [ 8.760286] __scsi_scan_target+0x109/0x230 [ 8.760288] scsi_scan_channel+0x65/0xc0 [ 8.760290] scsi_scan_host_selected+0xff/0x140 [ 8.760292] do_scsi_scan_host+0xa7/0xc0 [ 8.760293] do_scan_async+0x1c/0x160 [ 8.760295] async_run_entry_fn+0x32/0x150 [ 8.760299] process_one_work+0x224/0x5f0 [ 8.760302] worker_thread+0x1d4/0x3e0 [ 8.760304] kthread+0x10b/0x260 [ 8.760306] ret_from_fork+0x44/0x70 [ 8.760309] ret_from_fork_asm+0x1a/0x30 [ 8.760312] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 8.760315] fs_reclaim_acquire+0xc5/0x100 [ 8.760317] kmem_cache_alloc_lru_noprof+0x4a/0x480 [ 8.760319] alloc_inode+0xaa/0xe0 [ 8.760322] iget_locked+0x157/0x330 [ 8.760323] kernfs_get_inode+0x1b/0x110 [ 8.760325] kernfs_get_tree+0x1b0/0x2e0 [ 8.760327] sysfs_get_tree+0x1f/0x60 [ 8.760329] vfs_get_tree+0x2a/0xf0 [ 8.760332] path_mount+0x4cd/0xc00 [ 8.760334] __x64_sys_mount+0x119/0x150 [ 8.760336] x64_sys_call+0x14f2/0x2310 [ 8.760338] do_syscall_64+0x91/0x180 [ 8.760340] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760342] -> #0 (&root->kernfs_rwsem){++++}-{3:3}: [ 8.760345] __lock_acquire+0x1525/0x2760 [ 8.760347] lock_acquire+0xca/0x310 [ 8.760348] down_write+0x3e/0xf0 [ 8.760350] kernfs_remove+0x32/0x60 [ 8.760351] sysfs_remove_dir+0x4f/0x60 [ 8.760353] __kobject_del+0x33/0xa0 [ 8.760355] kobject_del+0x13/0x30 [ 8.760356] elv_unregister_queue+0x52/0x80 [ 8.760358] elevator_switch+0x68/0x360 [ 8.760360] elv_iosched_store+0x14b/0x1b0 [ 8.760362] queue_attr_store+0x181/0x1e0 [ 8.760364] sysfs_kf_write+0x49/0x80 [ 8.760366] kernfs_fop_write_iter+0x17d/0x250 [ 8.760367] vfs_write+0x281/0x540 [ 8.760370] ksys_write+0x72/0xf0 [ 8.760372] __x64_sys_write+0x19/0x30 [ 8.760374] x64_sys_call+0x2a3/0x2310 [ 8.760376] do_syscall_64+0x91/0x180 [ 8.760377] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760380] other info that might help us debug this: [ 8.760380] Chain exists of: &root->kernfs_rwsem --> fs_reclaim --> &q->q_usage_counter(io)#27 [ 8.760384] Possible unsafe locking scenario: [ 8.760384] CPU0 CPU1 [ 8.760385] ---- ---- [ 8.760385] lock(&q->q_usage_counter(io)#27); [ 8.760387] lock(fs_reclaim); [ 8.760388] lock(&q->q_usage_counter(io)#27); [ 8.760390] lock(&root->kernfs_rwsem); [ 8.760391] *** DEADLOCK *** [ 8.760391] 6 locks held by (udev-worker)/674: [ 8.760392] #0: ffff8881209ac420 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x72/0xf0 [ 8.760398] #1: ffff88810c80f488 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x136/0x250 [ 8.760402] #2: ffff888125d1d330 (kn->active#101){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x13f/0x250 [ 8.760406] #3: ffff888110dc7bb0 (&q->sysfs_lock){+.+.}-{3:3}, at: queue_attr_store+0x148/0x1e0 [ 8.760411] #4: ffff888110dc7680 (&q->q_usage_counter(io)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760416] #5: ffff888110dc76b8 (&q->q_usage_counter(queue)#27){++++}-{0:0}, at: blk_mq_freeze_queue_nomemsave+0x12/0x30 [ 8.760421] stack backtrace: [ 8.760422] CPU: 7 UID: 0 PID: 674 Comm: (udev-worker) Tainted: G U 6.14.0-rc6-xe+ #7 [ 8.760424] Tainted: [U]=USER [ 8.760425] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [ 8.760426] Call Trace: [ 8.760427] <TASK> [ 8.760428] dump_stack_lvl+0x6e/0xa0 [ 8.760431] dump_stack+0x10/0x18 [ 8.760433] print_circular_bug.cold+0x17a/0x1b7 [ 8.760437] check_noncircular+0x13a/0x150 [ 8.760441] ? save_trace+0x54/0x360 [ 8.760445] __lock_acquire+0x1525/0x2760 [ 8.760446] ? irqentry_exit+0x3a/0xb0 [ 8.760448] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ 8.760452] lock_acquire+0xca/0x310 [ 8.760453] ? kernfs_remove+0x32/0x60 [ 8.760457] down_write+0x3e/0xf0 [ 8.760459] ? kernfs_remove+0x32/0x60 [ 8.760460] kernfs_remove+0x32/0x60 [ 8.760462] sysfs_remove_dir+0x4f/0x60 [ 8.760464] __kobject_del+0x33/0xa0 [ 8.760466] kobject_del+0x13/0x30 [ 8.760467] elv_unregister_queue+0x52/0x80 [ 8.760470] elevator_switch+0x68/0x360 [ 8.760472] elv_iosched_store+0x14b/0x1b0 [ 8.760475] queue_attr_store+0x181/0x1e0 [ 8.760479] ? lock_acquire+0xca/0x310 [ 8.760480] ? kernfs_fop_write_iter+0x13f/0x250 [ 8.760482] ? lock_is_held_type+0xa3/0x130 [ 8.760485] sysfs_kf_write+0x49/0x80 [ 8.760487] kernfs_fop_write_iter+0x17d/0x250 [ 8.760489] vfs_write+0x281/0x540 [ 8.760494] ksys_write+0x72/0xf0 [ 8.760497] __x64_sys_write+0x19/0x30 [ 8.760499] x64_sys_call+0x2a3/0x2310 [ 8.760502] do_syscall_64+0x91/0x180 [ 8.760504] ? trace_hardirqs_off+0x5d/0xe0 [ 8.760506] ? handle_softirqs+0x479/0x4d0 [ 8.760508] ? hrtimer_interrupt+0x13f/0x280 [ 8.760511] ? irqentry_exit_to_user_mode+0x8b/0x260 [ 8.760513] ? clear_bhb_loop+0x15/0x70 [ 8.760515] ? clear_bhb_loop+0x15/0x70 [ 8.760516] ? clear_bhb_loop+0x15/0x70 [ 8.760518] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 8.760520] RIP: 0033:0x7aa3bf2f5504 [ 8.760522] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89 [ 8.760523] RSP: 002b:00007ffc1e3697d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 8.760526] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007aa3bf2f5504 [ 8.760527] RDX: 0000000000000003 RSI: 00007ffc1e369ae0 RDI: 000000000000001c [ 8.760528] RBP: 00007ffc1e369800 R08: 00007aa3bf3f51c8 R09: 00007ffc1e3698b0 [ 8.760528] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [ 8.760529] R13: 00007ffc1e369ae0 R14: 0000613ccf21f2f0 R15: 00007aa3bf3f4e80 [ 8.760533] </TASK> v2: - Update a code comment to increase readability (Ming Lei). Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318095548.5187-1-thomas.hellstrom@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
alobakin
pushed a commit
that referenced
this pull request
Apr 1, 2025
perf test 11 hwmon fails on s390 with this error
# ./perf test -Fv 11
--- start ---
---- end ----
11.1: Basic parsing test : Ok
--- start ---
Testing 'temp_test_hwmon_event1'
Using CPUID IBM,3931,704,A01,3.7,002f
temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/
FAILED tests/hwmon_pmu.c:189 Unexpected config for
'temp_test_hwmon_event1', 292470092988416 != 655361
---- end ----
11.2: Parsing without PMU name : FAILED!
--- start ---
Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/'
FAILED tests/hwmon_pmu.c:189 Unexpected config for
'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/',
292470092988416 != 655361
---- end ----
11.3: Parsing with PMU name : FAILED!
#
The root cause is in member test_event::config which is initialized
to 0xA0001 or 655361. During event parsing a long list event parsing
functions are called and end up with this gdb call stack:
#0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8,
term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623
#1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8,
terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662
#2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0,
attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false,
apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519
#3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8,
head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8)
at util/pmu.c:1545
#4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8,
list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090,
auto_merge_stats=true, alternate_hw_config=10)
at util/parse-events.c:1508
#5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8,
event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10,
const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0)
at util/parse-events.c:1592
#6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8,
scanner=0x16878c0) at util/parse-events.y:293
#7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8
"temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8)
at util/parse-events.c:1867
#8 0x000000000126a1e8 in __parse_events (evlist=0x168b580,
str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0,
err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true,
fake_tp=false) at util/parse-events.c:2136
#9 0x00000000011e36aa in parse_events (evlist=0x168b580,
str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8)
at /root/linux/tools/perf/util/parse-events.h:41
#10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false)
at tests/hwmon_pmu.c:164
#11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false)
at tests/hwmon_pmu.c:219
#12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368
<suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23
where the attr::config is set to value 292470092988416 or 0x10a0000000000
in line 625 of file ./util/hwmon_pmu.c:
attr->config = key.type_and_num;
However member key::type_and_num is defined as union and bit field:
union hwmon_pmu_event_key {
long type_and_num;
struct {
int num :16;
enum hwmon_type type :8;
};
};
s390 is big endian and Intel is little endian architecture.
The events for the hwmon dummy pmu have num = 1 or num = 2 and
type is set to HWMON_TYPE_TEMP (which is 10).
On s390 this assignes member key::type_and_num the value of
0x10a0000000000 (which is 292470092988416) as shown in above
trace output.
Fix this and export the structure/union hwmon_pmu_event_key
so the test shares the same implementation as the event parsing
functions for union and bit fields. This should avoid
endianess issues on all platforms.
Output after:
# ./perf test -F 11
11.1: Basic parsing test : Ok
11.2: Parsing without PMU name : Ok
11.3: Parsing with PMU name : Ok
#
Fixes: 531ee0f ("perf test: Add hwmon "PMU" test")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250131112400.568975-1-tmricht@linux.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Apr 1, 2025
Ian told me that there are many memory leaks in the hierarchy mode. I
can easily reproduce it with the follwing command.
$ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak
$ perf record --latency -g -- ./perf test -w thloop
$ perf report -H --stdio
...
Indirect leak of 168 byte(s) in 21 object(s) allocated from:
#0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75
#1 0x55ed3602346e in map__get util/map.h:189
#2 0x55ed36024cc4 in hist_entry__init util/hist.c:476
#3 0x55ed36025208 in hist_entry__new util/hist.c:588
#4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587
#5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638
#6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685
#7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776
#8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735
#9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119
#10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867
#11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351
#12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404
#13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448
#14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556
#15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
...
$ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
93
I found that hist_entry__delete() missed to release child entries in the
hierarchy tree (hroot_{in,out}). It needs to iterate the child entries
and call hist_entry__delete() recursively.
After this change:
$ perf report -H --stdio 2>&1 | grep -c '^Indirect leak'
0
Reported-by: Ian Rogers <irogers@google.com>
Tested-by Thomas Falcon <thomas.falcon@intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250307061250.320849-2-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Apr 1, 2025
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD.
For a pipe data, it reads the header data including pmu_mapping from
PERF_RECORD_HEADER_FEATURE runtime. But it's already set in:
perf_session__new()
__perf_session__new()
evlist__init_trace_event_sample_raw()
evlist__has_amd_ibs()
perf_env__nr_pmu_mappings()
Then it'll overwrite that when it processes the HEADER_FEATURE record.
Here's a report from address sanitizer.
Direct leak of 2689 byte(s) in 1 object(s) allocated from:
#0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
#1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64
#2 0x5595a7d414ef in strbuf_init util/strbuf.c:25
#3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362
#4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517
#5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315
#6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23
#7 0x5595a7d7f893 in __perf_session__new util/session.c:179
#8 0x5595a7b79572 in perf_session__new util/session.h:115
#9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603
#10 0x5595a7c019eb in run_builtin perf.c:351
#11 0x5595a7c01c92 in handle_internal_command perf.c:404
#12 0x5595a7c01deb in run_argv perf.c:448
#13 0x5595a7c02134 in main perf.c:556
#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Let's free the existing pmu_mapping data if any.
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250311000416.817631-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Apr 3, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().
An example from v5.4, similar problem also exists in upstream:
crash> bt 2091206
PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0"
#0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
#1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
#2 [ffff800084a2f880] schedule at ffff800040bfa4b4
#3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
#4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
#5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
#6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
#8 [ffff800084a2fa60] generic_make_request at ffff800040570138
#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
#18 [ffff800084a2fe70] kthread at ffff800040118de4
After commit 2def284 ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.
Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Tianxiang Peng <txpeng@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
May 12, 2025
…e probe The spin lock tx_handling_spinlock in struct m_can_classdev is not being initialized. This leads the following spinlock bad magic complaint from the kernel, eg. when trying to send CAN frames with cansend from can-utils: | BUG: spinlock bad magic on CPU#0, cansend/95 | lock: 0xff60000002ec1010, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 | CPU: 0 UID: 0 PID: 95 Comm: cansend Not tainted 6.15.0-rc3-00032-ga79be02bba5c #5 NONE | Hardware name: MachineWare SIM-V (DT) | Call Trace: | [<ffffffff800133e0>] dump_backtrace+0x1c/0x24 | [<ffffffff800022f2>] show_stack+0x28/0x34 | [<ffffffff8000de3e>] dump_stack_lvl+0x4a/0x68 | [<ffffffff8000de70>] dump_stack+0x14/0x1c | [<ffffffff80003134>] spin_dump+0x62/0x6e | [<ffffffff800883ba>] do_raw_spin_lock+0xd0/0x142 | [<ffffffff807a6fcc>] _raw_spin_lock_irqsave+0x20/0x2c | [<ffffffff80536dba>] m_can_start_xmit+0x90/0x34a | [<ffffffff806148b0>] dev_hard_start_xmit+0xa6/0xee | [<ffffffff8065b730>] sch_direct_xmit+0x114/0x292 | [<ffffffff80614e2a>] __dev_queue_xmit+0x3b0/0xaa8 | [<ffffffff8073b8fa>] can_send+0xc6/0x242 | [<ffffffff8073d1c0>] raw_sendmsg+0x1a8/0x36c | [<ffffffff805ebf06>] sock_write_iter+0x9a/0xee | [<ffffffff801d06ea>] vfs_write+0x184/0x3a6 | [<ffffffff801d0a88>] ksys_write+0xa0/0xc0 | [<ffffffff801d0abc>] __riscv_sys_write+0x14/0x1c | [<ffffffff8079ebf8>] do_trap_ecall_u+0x168/0x212 | [<ffffffff807a830a>] handle_exception+0x146/0x152 Initializing the spin lock in m_can_class_allocate_dev solves that problem. Fixes: 1fa80e2 ("can: m_can: Introduce a tx_fifo_in_flight counter") Signed-off-by: Antonios Salios <antonios@mwa.re> Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Link: https://patch.msgid.link/20250425111744.37604-2-antonios@mwa.re Reviewed-by: Markus Schneider-Pargmann <msp@baylibre.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
alobakin
pushed a commit
that referenced
this pull request
May 12, 2025
As shown in [1], it is possible to corrupt a BPF ELF file such that
arbitrary BPF instructions are loaded by libbpf. This can be done by
setting a symbol (BPF program) section offset to a large (unsigned)
number such that <section start + symbol offset> overflows and points
before the section data in the memory.
Consider the situation below where:
- prog_start = sec_start + symbol_offset <-- size_t overflow here
- prog_end = prog_start + prog_size
prog_start sec_start prog_end sec_end
| | | |
v v v v
.....................|################################|............
The report in [1] also provides a corrupted BPF ELF which can be used as
a reproducer:
$ readelf -S crash
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
...
[ 2] uretprobe.mu[...] PROGBITS 0000000000000000 00000040
0000000000000068 0000000000000000 AX 0 0 8
$ readelf -s crash
Symbol table '.symtab' contains 8 entries:
Num: Value Size Type Bind Vis Ndx Name
...
6: ffffffffffffffb8 104 FUNC GLOBAL DEFAULT 2 handle_tp
Here, the handle_tp prog has section offset ffffffffffffffb8, i.e. will
point before the actual memory where section 2 is allocated.
This is also reported by AddressSanitizer:
=================================================================
==1232==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7c7302fe0000 at pc 0x7fc3046e4b77 bp 0x7ffe64677cd0 sp 0x7ffe64677490
READ of size 104 at 0x7c7302fe0000 thread T0
#0 0x7fc3046e4b76 in memcpy (/lib64/libasan.so.8+0xe4b76)
#1 0x00000040df3e in bpf_object__init_prog /src/libbpf/src/libbpf.c:856
#2 0x00000040df3e in bpf_object__add_programs /src/libbpf/src/libbpf.c:928
#3 0x00000040df3e in bpf_object__elf_collect /src/libbpf/src/libbpf.c:3930
#4 0x00000040df3e in bpf_object_open /src/libbpf/src/libbpf.c:8067
#5 0x00000040f176 in bpf_object__open_file /src/libbpf/src/libbpf.c:8090
#6 0x000000400c16 in main /poc/poc.c:8
#7 0x7fc3043d25b4 in __libc_start_call_main (/lib64/libc.so.6+0x35b4)
#8 0x7fc3043d2667 in __libc_start_main@@GLIBC_2.34 (/lib64/libc.so.6+0x3667)
#9 0x000000400b34 in _start (/poc/poc+0x400b34)
0x7c7302fe0000 is located 64 bytes before 104-byte region [0x7c7302fe0040,0x7c7302fe00a8)
allocated by thread T0 here:
#0 0x7fc3046e716b in malloc (/lib64/libasan.so.8+0xe716b)
#1 0x7fc3045ee600 in __libelf_set_rawdata_wrlock (/lib64/libelf.so.1+0xb600)
#2 0x7fc3045ef018 in __elf_getdata_rdlock (/lib64/libelf.so.1+0xc018)
#3 0x00000040642f in elf_sec_data /src/libbpf/src/libbpf.c:3740
The problem here is that currently, libbpf only checks that the program
end is within the section bounds. There used to be a check
`while (sec_off < sec_sz)` in bpf_object__add_programs, however, it was
removed by commit 6245947 ("libbpf: Allow gaps in BPF program
sections to support overriden weak functions").
Add a check for detecting the overflow of `sec_off + prog_sz` to
bpf_object__init_prog to fix this issue.
[1] https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Fixes: 6245947 ("libbpf: Allow gaps in BPF program sections to support overriden weak functions")
Reported-by: lmarch2 <2524158037@qq.com>
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://github.com/lmarch2/poc/blob/main/libbpf/libbpf.md
Link: https://lore.kernel.org/bpf/20250415155014.397603-1-vmalik@redhat.com
alobakin
pushed a commit
that referenced
this pull request
May 12, 2025
Ido Schimmel says: ==================== vxlan: Convert FDB table to rhashtable The VXLAN driver currently stores FDB entries in a hash table with a fixed number of buckets (256), resulting in reduced performance as the number of entries grows. This patchset solves the issue by converting the driver to use rhashtable which maintains a more or less constant performance regardless of the number of entries. Measured transmitted packets per second using a single pktgen thread with varying number of entries when the transmitted packet always hits the default entry (worst case): Number of entries | Improvement ------------------|------------ 1k | +1.12% 4k | +9.22% 16k | +55% 64k | +585% 256k | +2460% The first patches are preparations for the conversion in the last patch. Specifically, the series is structured as follows: Patch #1 adds RCU read-side critical sections in the Tx path when accessing FDB entries. Targeting at net-next as I am not aware of any issues due to this omission despite the code being structured that way for a long time. Without it, traces will be generated when converting FDB lookup to rhashtable_lookup(). Patch #2-#5 simplify the creation of the default FDB entry (all-zeroes). Current code assumes that insertion into the hash table cannot fail, which will no longer be true with rhashtable. Patches #6-#10 add FDB entries to a linked list for entry traversal instead of traversing over them using the fixed size hash table which is removed in the last patch. Patches #11-#12 add wrappers for FDB lookup that make it clear when each should be used along with lockdep annotations. Needed as a preparation for rhashtable_lookup() that must be called from an RCU read-side critical section. Patch #13 treats dst cache initialization errors as non-fatal. See more info in the commit message. The current code happens to work because insertion into the fixed size hash table is slow enough for the per-CPU allocator to be able to create new chunks of per-CPU memory. Patch #14 adds an FDB key structure that includes the MAC address and source VNI. To be used as rhashtable key. Patch #15 does the conversion to rhashtable. ==================== Link: https://patch.msgid.link/20250415121143.345227-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Jun 3, 2025
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired, a held_lock is still
placed onto the lockdep stack by __lock_acquire() which is called
regardless of whether the trylock* or regular locking API was used.
This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.
For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Luckily, in all instances that require locking all vCPUs, the
'kvm->lock' is taken a priori, and that fact makes it possible to use
the little known feature of lockdep, called a 'nest_lock', to avoid this
warning and subsequent lockdep self-disablement.
The action of 'nested lock' being provided to lockdep's lock_acquire(),
causes the lockdep to detect that the top of the held lock stack contains
a lock of the same class and then increment its reference counter instead
of pushing a new held_lock item onto that stack.
See __lock_acquire for more information.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Jun 3, 2025
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.
This fixes the following false lockdep warning:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Jun 3, 2025
Running a modified trace-cmd record --nosplice where it does a mmap of the ring buffer when '--nosplice' is set, caused the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.15.0-rc7-test-00002-gfb7d03d8a82f torvalds#551 Not tainted ------------------------------------------------------ trace-cmd/1113 is trying to acquire lock: ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70 but task is already holding lock: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}: __mutex_lock+0x192/0x18c0 ring_buffer_map+0xcf/0xe70 tracing_buffers_mmap+0x1c4/0x3b0 __mmap_region+0xd8d/0x1f70 do_mmap+0x9d7/0x1010 vm_mmap_pgoff+0x20b/0x390 ksys_mmap_pgoff+0x2e9/0x440 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #4 (&mm->mmap_lock){++++}-{4:4}: __might_fault+0xa5/0x110 _copy_to_user+0x22/0x80 _perf_ioctl+0x61b/0x1b70 perf_ioctl+0x62/0x90 __x64_sys_ioctl+0x134/0x190 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #3 (&cpuctx_mutex){+.+.}-{4:4}: __mutex_lock+0x192/0x18c0 perf_event_init_cpu+0x325/0x7c0 perf_event_init+0x52a/0x5b0 start_kernel+0x263/0x3e0 x86_64_start_reservations+0x24/0x30 x86_64_start_kernel+0x95/0xa0 common_startup_64+0x13e/0x141 -> #2 (pmus_lock){+.+.}-{4:4}: __mutex_lock+0x192/0x18c0 perf_event_init_cpu+0xb7/0x7c0 cpuhp_invoke_callback+0x2c0/0x1030 __cpuhp_invoke_callback_range+0xbf/0x1f0 _cpu_up+0x2e7/0x690 cpu_up+0x117/0x170 cpuhp_bringup_mask+0xd5/0x120 bringup_nonboot_cpus+0x13d/0x170 smp_init+0x2b/0xf0 kernel_init_freeable+0x441/0x6d0 kernel_init+0x1e/0x160 ret_from_fork+0x34/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2a/0xd0 ring_buffer_resize+0x610/0x14e0 __tracing_resize_ring_buffer.part.0+0x42/0x120 tracing_set_tracer+0x7bd/0xa80 tracing_set_trace_write+0x132/0x1e0 vfs_write+0x21c/0xe80 ksys_write+0xf9/0x1c0 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&buffer->mutex){+.+.}-{4:4}: __lock_acquire+0x1405/0x2210 lock_acquire+0x174/0x310 __mutex_lock+0x192/0x18c0 ring_buffer_map+0x11c/0xe70 tracing_buffers_mmap+0x1c4/0x3b0 __mmap_region+0xd8d/0x1f70 do_mmap+0x9d7/0x1010 vm_mmap_pgoff+0x20b/0x390 ksys_mmap_pgoff+0x2e9/0x440 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&cpu_buffer->mapping_lock); lock(&mm->mmap_lock); lock(&cpu_buffer->mapping_lock); lock(&buffer->mutex); *** DEADLOCK *** 2 locks held by trace-cmd/1113: #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390 #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70 stack backtrace: CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f torvalds#551 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6e/0xa0 print_circular_bug.cold+0x178/0x1be check_noncircular+0x146/0x160 __lock_acquire+0x1405/0x2210 lock_acquire+0x174/0x310 ? ring_buffer_map+0x11c/0xe70 ? ring_buffer_map+0x11c/0xe70 ? __mutex_lock+0x169/0x18c0 __mutex_lock+0x192/0x18c0 ? ring_buffer_map+0x11c/0xe70 ? ring_buffer_map+0x11c/0xe70 ? function_trace_call+0x296/0x370 ? __pfx___mutex_lock+0x10/0x10 ? __pfx_function_trace_call+0x10/0x10 ? __pfx___mutex_lock+0x10/0x10 ? _raw_spin_unlock+0x2d/0x50 ? ring_buffer_map+0x11c/0xe70 ? ring_buffer_map+0x11c/0xe70 ? __mutex_lock+0x5/0x18c0 ring_buffer_map+0x11c/0xe70 ? do_raw_spin_lock+0x12d/0x270 ? find_held_lock+0x2b/0x80 ? _raw_spin_unlock+0x2d/0x50 ? rcu_is_watching+0x15/0xb0 ? _raw_spin_unlock+0x2d/0x50 ? trace_preempt_on+0xd0/0x110 tracing_buffers_mmap+0x1c4/0x3b0 __mmap_region+0xd8d/0x1f70 ? ring_buffer_lock_reserve+0x99/0xff0 ? __pfx___mmap_region+0x10/0x10 ? ring_buffer_lock_reserve+0x99/0xff0 ? __pfx_ring_buffer_lock_reserve+0x10/0x10 ? __pfx_ring_buffer_lock_reserve+0x10/0x10 ? bpf_lsm_mmap_addr+0x4/0x10 ? security_mmap_addr+0x46/0xd0 ? lock_is_held_type+0xd9/0x130 do_mmap+0x9d7/0x1010 ? 0xffffffffc0370095 ? __pfx_do_mmap+0x10/0x10 vm_mmap_pgoff+0x20b/0x390 ? __pfx_vm_mmap_pgoff+0x10/0x10 ? 0xffffffffc0370095 ksys_mmap_pgoff+0x2e9/0x440 do_syscall_64+0x79/0x1c0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fb0963a7de2 Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64 RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2 RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000 RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90 </TASK> The issue is that cpus_read_lock() is taken within buffer->mutex. The memory mapped pages are taken with the mmap_lock held. The buffer->mutex is taken within the cpu_buffer->mapping_lock. There's quite a chain with all these locks, where the deadlock can be fixed by moving the cpus_read_lock() outside the taking of the buffer->mutex. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home Fixes: 117c392 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
alobakin
pushed a commit
that referenced
this pull request
Jun 18, 2025
Subbaraya Sundeep says: ==================== CN20K silicon with mbox support CN20K is the next generation silicon in the Octeon series with various improvements and new features. Along with other changes the mailbox communication mechanism between RVU (Resource virtualization Unit) SRIOV PFs/VFs with Admin function (AF) has also gone through some changes. Some of those changes are - Separate IRQs for mbox request and response/ack. - Configurable mbox size, default being 64KB. - Ability for VFs to communicate with RVU AF instead of going through parent SRIOV PF. Due to more memory requirement due to configurable mbox size, mbox memory will now have to be allocated by - AF (PF0) for communicating with other PFs and all VFs in the system. - PF for communicating with it's child VFs. On previous silicons mbox memory was reserved and configured by firmware. This patch series add basic mbox support for AF (PF0) <=> PFs and PF <=> VFs. AF <=> VFs communication and variable mbox size support will come in later. Patch #1 Supported co-existance of bit encoding PFs and VFs in 16-bit hardware pcifunc format between CN20K silicon and older octeon series. Also exported PF,VF masks and shifts present in mailbox module to all other modules. Patch #2 Added basic mbox operation APIs and structures to support both CN20K and previous version of silicons. Patch #3 This patch adds support for basic mbox infrastructure implementation for CN20K silicon in AF perspective. There are few updates w.r.t MBOX ACK interrupt and offsets in CN20k. Patch #4 Added mbox implementation between NIC PF and AF for CN20K. Patch #5 Added mbox communication support between AF and AF's VFs. Patch #6 This patch adds support for MBOX communication between NIC PF and its VFs. ==================== Link: https://patch.msgid.link/1749639716-13868-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Jun 18, 2025
Petr Machata says: ==================== ipmr, ip6mr: Allow MC-routing locally-generated MC packets Multicast routing is today handled in the input path. Locally generated MC packets don't hit the IPMR code. Thus if a VXLAN remote address is multicast, the driver needs to set an OIF during route lookup. In practice that means that MC routing configuration needs to be kept in sync with the VXLAN FDB and MDB. Ideally, the VXLAN packets would be routed by the MC routing code instead. To that end, this patchset adds support to route locally generated multicast packets. However, an installation that uses a VXLAN underlay netdevice for which it also has matching MC routes, would get a different routing with this patch. Previously, the MC packets would be delivered directly to the underlay port, whereas now they would be MC-routed. In order to avoid this change in behavior, introduce an IPCB/IP6CB flag. Unless the flag is set, the new MC-routing code is skipped. All this is keyed to a new VXLAN attribute, IFLA_VXLAN_MC_ROUTE. Only when it is set does any of the above engage. In addition to that, and as is the case today with MC forwarding, IPV4_DEVCONF_MC_FORWARDING must be enabled for the netdevice that acts as a source of MC traffic (i.e. the VXLAN PHYS_DEV), so an MC daemon must be attached to the netdevice. When a VXLAN netdevice with a MC remote is brought up, the physical netdevice joins the indicated MC group. This is important for local delivery of MC packets, so it is still necessary to configure a physical netdevice -- the parameter cannot go away. The netdevice would however typically not be a front panel port, but a dummy. An MC daemon would then sit on top of that netdevice as well as any front panel ports that it needs to service, and have routes set up between the two. A way to configure the VXLAN netdevice to take advantage of the new MC routing would be: # ip link add name d up type dummy # ip link add name vx10 up type vxlan id 1000 dstport 4789 \ local 192.0.2.1 group 225.0.0.1 ttl 16 dev d mrcoute # ip link set dev vx10 master br # plus vlans etc. With the following MC routes: (192.0.2.1, 225.0.0.1) iif=d oil=swp1,swp2 # TX route (*, 225.0.0.1) iif=swp1 oil=d,swp2 # RX route (*, 225.0.0.1) iif=swp2 oil=d,swp1 # RX route The RX path has not changed, with the exception of an extra MC hop. Packets are delivered to the front panel port and MC-forwarded to the VXLAN physical port, here "d". Since the port has joined the multicast group, the packets are locally delivered, and end up being processed by the VXLAN netdevice. This patchset is based on earlier patches from Nikolay Aleksandrov and Roopa Prabhu, though it underwent significant changes. Roopa broadly presented the topic on LPC 2019 [0]. Patchset progression: - Patches #1 to #4 add ip_mr_output() - Patches #5 to #10 add ip6_mr_output() - Patch #11 adds the VXLAN bits to enable MR engagement - Patches #12 to #14 prepare selftest libraries - Patch #15 includes a new test suite [0] https://www.youtube.com/watch?v=xlReECfi-uo ==================== Link: https://patch.msgid.link/cover.1750113335.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Jul 11, 2025
When reconnecting a channel in smb2_reconnect_server(), a dummy tcon is passed down to smb2_reconnect() with ->query_interface uninitialized, so we can't call queue_delayed_work() on it. Fix the following warning by ensuring that we're queueing the delayed worker from correct tcon. WARNING: CPU: 4 PID: 1126 at kernel/workqueue.c:2498 __queue_delayed_work+0x1d2/0x200 Modules linked in: cifs cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 4 UID: 0 PID: 1126 Comm: kworker/4:0 Not tainted 6.16.0-rc3 #5 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-4.fc42 04/01/2014 Workqueue: cifsiod smb2_reconnect_server [cifs] RIP: 0010:__queue_delayed_work+0x1d2/0x200 Code: 41 5e 41 5f e9 7f ee ff ff 90 0f 0b 90 e9 5d ff ff ff bf 02 00 00 00 e8 6c f3 07 00 89 c3 eb bd 90 0f 0b 90 e9 57 f> 0b 90 e9 65 fe ff ff 90 0f 0b 90 e9 72 fe ff ff 90 0f 0b 90 e9 RSP: 0018:ffffc900014afad8 EFLAGS: 00010003 RAX: 0000000000000000 RBX: ffff888124d99988 RCX: ffffffff81399cc1 RDX: dffffc0000000000 RSI: ffff888114326e00 RDI: ffff888124d999f0 RBP: 000000000000ea60 R08: 0000000000000001 R09: ffffed10249b3331 R10: ffff888124d9998f R11: 0000000000000004 R12: 0000000000000040 R13: ffff888114326e00 R14: ffff888124d999d8 R15: ffff888114939020 FS: 0000000000000000(0000) GS:ffff88829f7fe000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffe7a2b4038 CR3: 0000000120a6f000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> queue_delayed_work_on+0xb4/0xc0 smb2_reconnect+0xb22/0xf50 [cifs] smb2_reconnect_server+0x413/0xd40 [cifs] ? __pfx_smb2_reconnect_server+0x10/0x10 [cifs] ? local_clock_noinstr+0xd/0xd0 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 process_one_work+0x4c5/0xa10 ? __pfx_process_one_work+0x10/0x10 ? __list_add_valid_or_report+0x37/0x120 worker_thread+0x2f1/0x5a0 ? __kthread_parkme+0xde/0x100 ? __pfx_worker_thread+0x10/0x10 kthread+0x1fe/0x380 ? kthread+0x10f/0x380 ? __pfx_kthread+0x10/0x10 ? local_clock_noinstr+0xd/0xd0 ? ret_from_fork+0x1b/0x1f0 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 ? rcu_is_watching+0x20/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x15b/0x1f0 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 1116206 hardirqs last enabled at (1116205): [<ffffffff8143af42>] __up_console_sem+0x52/0x60 hardirqs last disabled at (1116206): [<ffffffff81399f0e>] queue_delayed_work_on+0x6e/0xc0 softirqs last enabled at (1116138): [<ffffffffc04562fd>] __smb_send_rqst+0x42d/0x950 [cifs] softirqs last disabled at (1116136): [<ffffffff823d35e1>] release_sock+0x21/0xf0 Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: 42ca547 ("cifs: do not disable interface polling on failure") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Steve French <stfrench@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
alobakin
pushed a commit
that referenced
this pull request
Jul 25, 2025
…terface collect_md property on xfrm interfaces can only be set on device creation, thus xfrmi_changelink() should fail when called on such interfaces. The check to enforce this was done only in the case where the xi was returned from xfrmi_locate() which doesn't look for the collect_md interface, and thus the validation was never reached. Calling changelink would thus errornously place the special interface xi in the xfrmi_net->xfrmi hash, but since it also exists in the xfrmi_net->collect_md_xfrmi pointer it would lead to a double free when the net namespace was taken down [1]. Change the check to use the xi from netdev_priv which is available earlier in the function to prevent changes in xfrm collect_md interfaces. [1] resulting oops: [ 8.516540] kernel BUG at net/core/dev.c:12029! [ 8.516552] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 8.516559] CPU: 0 UID: 0 PID: 12 Comm: kworker/u80:0 Not tainted 6.15.0-virtme #5 PREEMPT(voluntary) [ 8.516565] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 8.516569] Workqueue: netns cleanup_net [ 8.516579] RIP: 0010:unregister_netdevice_many_notify+0x101/0xab0 [ 8.516590] Code: 90 0f 0b 90 48 8b b0 78 01 00 00 48 8b 90 80 01 00 00 48 89 56 08 48 89 32 4c 89 80 78 01 00 00 48 89 b8 80 01 00 00 eb ac 90 <0f> 0b 48 8b 45 00 4c 8d a0 88 fe ff ff 48 39 c5 74 5c 41 80 bc 24 [ 8.516593] RSP: 0018:ffffa93b8006bd30 EFLAGS: 00010206 [ 8.516598] RAX: ffff98fe4226e000 RBX: ffffa93b8006bd58 RCX: ffffa93b8006bc60 [ 8.516601] RDX: 0000000000000004 RSI: 0000000000000000 RDI: dead000000000122 [ 8.516603] RBP: ffffa93b8006bdd8 R08: dead000000000100 R09: ffff98fe4133c100 [ 8.516605] R10: 0000000000000000 R11: 00000000000003d2 R12: ffffa93b8006be00 [ 8.516608] R13: ffffffff96c1a510 R14: ffffffff96c1a510 R15: ffffa93b8006be00 [ 8.516615] FS: 0000000000000000(0000) GS:ffff98fee73b7000(0000) knlGS:0000000000000000 [ 8.516619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8.516622] CR2: 00007fcd2abd0700 CR3: 000000003aa40000 CR4: 0000000000752ef0 [ 8.516625] PKRU: 55555554 [ 8.516627] Call Trace: [ 8.516632] <TASK> [ 8.516635] ? rtnl_is_locked+0x15/0x20 [ 8.516641] ? unregister_netdevice_queue+0x29/0xf0 [ 8.516650] ops_undo_list+0x1f2/0x220 [ 8.516659] cleanup_net+0x1ad/0x2e0 [ 8.516664] process_one_work+0x160/0x380 [ 8.516673] worker_thread+0x2aa/0x3c0 [ 8.516679] ? __pfx_worker_thread+0x10/0x10 [ 8.516686] kthread+0xfb/0x200 [ 8.516690] ? __pfx_kthread+0x10/0x10 [ 8.516693] ? __pfx_kthread+0x10/0x10 [ 8.516697] ret_from_fork+0x82/0xf0 [ 8.516705] ? __pfx_kthread+0x10/0x10 [ 8.516709] ret_from_fork_asm+0x1a/0x30 [ 8.516718] </TASK> Fixes: abc340b ("xfrm: interface: support collect metadata mode") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
alobakin
pushed a commit
that referenced
this pull request
Aug 4, 2025
pert script tests fails with segmentation fault as below:
92: perf script tests:
--- start ---
test child forked, pid 103769
DB test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
/usr/libexec/perf-core/tests/shell/script.sh: line 35:
103780 Segmentation fault (core dumped)
perf script -i "${perfdatafile}" -s "${db_test}"
--- Cleaning up ---
---- end(-1) ----
92: perf script tests : FAILED!
Backtrace pointed to :
#0 0x0000000010247dd0 in maps.machine ()
#1 0x00000000101d178c in db_export.sample ()
#2 0x00000000103412c8 in python_process_event ()
#3 0x000000001004eb28 in process_sample_event ()
#4 0x000000001024fcd0 in machines.deliver_event ()
#5 0x000000001025005c in perf_session.deliver_event ()
#6 0x00000000102568b0 in __ordered_events__flush.part.0 ()
#7 0x0000000010251618 in perf_session.process_events ()
#8 0x0000000010053620 in cmd_script ()
#9 0x00000000100b5a28 in run_builtin ()
#10 0x00000000100b5f94 in handle_internal_command ()
#11 0x0000000010011114 in main ()
Further investigation reveals that this occurs in the `perf script tests`,
because it uses `db_test.py` script. This script sets `perf_db_export_mode = True`.
With `perf_db_export_mode` enabled, if a sample originates from a hypervisor,
perf doesn't set maps for "[H]" sample in the code. Consequently, `al->maps` remains NULL
when `maps__machine(al->maps)` is called from `db_export__sample`.
As al->maps can be NULL in case of Hypervisor samples , use thread->maps
because even for Hypervisor sample, machine should exist.
If we don't have machine for some reason, return -1 to avoid segmentation fault.
Reported-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Aditya Bodkhe <aditya.b1@linux.ibm.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Link: https://lore.kernel.org/r/20250429065132.36839-1-adityab1@linux.ibm.com
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Aug 4, 2025
Without the change `perf `hangs up on charaster devices. On my system
it's enough to run system-wide sampler for a few seconds to get the
hangup:
$ perf record -a -g --call-graph=dwarf
$ perf report
# hung
`strace` shows that hangup happens on reading on a character device
`/dev/dri/renderD128`
$ strace -y -f -p 2780484
strace: Process 2780484 attached
pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached
It's call trace descends into `elfutils`:
$ gdb -p 2780484
(gdb) bt
#0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
at ../sysdeps/unix/sysv/linux/pread64.c:25
#1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
#2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
at util/dso.h:537
#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
best_effort=best_effort@entry=false) at util/thread.h:152
#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
max_stack=127, symbols=true) at util/machine.c:2920
#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
at util/machine.c:2970
#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
file_path=0x105ffbf0 "perf.data") at util/session.c:1419
#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
--Type <RET> for more, q to quit, c to continue without paging--
quit
prog=prog@entry=0x7fff9df81220) at util/session.c:2132
#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
at util/session.c:2181
#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
torvalds#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
torvalds#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
at perf.c:351
torvalds#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
torvalds#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
torvalds#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556
The hangup happens because nothing in` perf` or `elfutils` checks if a
mapped file is easily readable.
The change conservatively skips all non-regular files.
Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250505174419.2814857-1-slyich@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Aug 4, 2025
Symbolize stack traces by creating a live machine. Add this
functionality to dump_stack and switch dump_stack users to use
it. Switch TUI to use it. Add stack traces to the child test function
which can be useful to diagnose blocked code.
Example output:
```
$ perf test -vv PERF_RECORD_
...
7: PERF_RECORD_* events & perf_sample fields:
7: PERF_RECORD_* events & perf_sample fields : Running (1 active)
^C
Signal (2) while running tests.
Terminating tests with the same signal
Internal test harness failure. Completing any started tests:
: 7: PERF_RECORD_* events & perf_sample fields:
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#4 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#5 0x7fc12ff02d68 in __sleep sleep.c:55
#6 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#7 0x55788c620fb0 in run_test_child builtin-test.c:0
#8 0x55788c5bd18d in start_command run-command.c:127
#9 0x55788c621ef3 in __cmd_test builtin-test.c:0
#10 0x55788c6225bf in cmd_test ??:0
#11 0x55788c5afbd0 in run_builtin perf.c:0
#12 0x55788c5afeeb in handle_internal_command perf.c:0
#13 0x55788c52b383 in main ??:0
#14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#16 0x55788c52b9d1 in _start ??:0
---- unexpected signal (2) ----
#0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0
#1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45
#3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26
#4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36
#5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0
#6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0
#7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64
#8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72
#9 0x7fc12fef1393 in __nanosleep nanosleep.c:26
#10 0x7fc12ff02d68 in __sleep sleep.c:55
#11 0x55788c63196b in test__PERF_RECORD perf-record.c:0
#12 0x55788c620fb0 in run_test_child builtin-test.c:0
#13 0x55788c5bd18d in start_command run-command.c:127
#14 0x55788c621ef3 in __cmd_test builtin-test.c:0
#15 0x55788c6225bf in cmd_test ??:0
#16 0x55788c5afbd0 in run_builtin perf.c:0
#17 0x55788c5afeeb in handle_internal_command perf.c:0
#18 0x55788c52b383 in main ??:0
#19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74
#20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#21 0x55788c52b9d1 in _start ??:0
7: PERF_RECORD_* events & perf_sample fields : Skip (permissions)
```
Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250624210500.2121303-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Aug 4, 2025
Calling perf top with branch filters enabled on Intel CPU's
with branch counters logging (A.K.A LBR event logging [1]) support
results in a segfault.
$ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter
...
Thread 27 "perf" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffafff76c0 (LWP 949003)]
perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
653 *width = env->cpu_pmu_caps ? env->br_cntr_width :
(gdb) bt
#0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653
#1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345
#2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389
#3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422
#4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850
#5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737
#6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359
#7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845
#8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211
#9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245
#10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324
#11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342
#12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120
#13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448
#14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
The cause is that perf_env__find_br_cntr_info tries to access a
null pointer pmu_caps in the perf_env struct. A similar issue exists
for homogeneous core systems which use the cpu_pmu_caps structure.
Fix this by populating cpu_pmu_caps and pmu_caps structures with
values from sysfs when calling perf top with branch stack sampling
enabled.
[1], LBR event logging introduced here:
https://lore.kernel.org/all/20231025201626.3000228-5-kan.liang@linux.intel.com/
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250612163659.1357950-2-thomas.falcon@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Aug 5, 2025
profile allocation is wrongly setting the number of entries on the rules vector before any ruleset is assigned. If profile allocation fails between ruleset allocation and assigning the first ruleset, free_ruleset() will be called with a null pointer resulting in an oops. [ 107.350226] kernel BUG at mm/slub.c:545! [ 107.350912] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 107.351447] CPU: 1 UID: 0 PID: 27 Comm: ksoftirqd/1 Not tainted 6.14.6-hwe-rlee287-dev+ #5 [ 107.353279] Hardware name:[ 107.350218] -QE-----------[ cutMU here ]--------- Ub--- [ 107.3502untu26] kernel BUG a 24t mm/slub.c:545.!04 P [ 107.350912]C ( Oops: invalid oi4pcode: 0000 [#1]40 PREEMPT SMP NOPFXTI + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 107.356054] RIP: 0010:__slab_free+0x152/0x340 [ 107.356444] Code: 00 4c 89 ff e8 0f ac df 00 48 8b 14 24 48 8b 4c 24 20 48 89 44 24 08 48 8b 03 48 c1 e8 09 83 e0 01 88 44 24 13 e9 71 ff ff ff <0f> 0b 41 f7 44 24 08 87 04 00 00 75 b2 eb a8 41 f7 44 24 08 87 04 [ 107.357856] RSP: 0018:ffffad4a800fbbb0 EFLAGS: 00010246 [ 107.358937] RAX: ffff97ebc2a88e70 RBX: ffffd759400aa200 RCX: 0000000000800074 [ 107.359976] RDX: ffff97ebc2a88e60 RSI: ffffd759400aa200 RDI: ffffad4a800fbc20 [ 107.360600] RBP: ffffad4a800fbc50 R08: 0000000000000001 R09: ffffffff86f02cf2 [ 107.361254] R10: 0000000000000000 R11: 0000000000000000 R12: ffff97ecc0049400 [ 107.361934] R13: ffff97ebc2a88e60 R14: ffff97ecc0049400 R15: 0000000000000000 [ 107.362597] FS: 0000000000000000(0000) GS:ffff97ecfb200000(0000) knlGS:0000000000000000 [ 107.363332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 107.363784] CR2: 000061c9545ac000 CR3: 0000000047aa6000 CR4: 0000000000750ef0 [ 107.364331] PKRU: 55555554 [ 107.364545] Call Trace: [ 107.364761] <TASK> [ 107.364931] ? local_clock+0x15/0x30 [ 107.365219] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.365593] ? kfree_sensitive+0x32/0x70 [ 107.365900] kfree+0x29d/0x3a0 [ 107.366144] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.366510] ? local_clock_noinstr+0xe/0xd0 [ 107.366841] ? srso_alias_return_thunk+0x5/0xfbef5 [ 107.367209] kfree_sensitive+0x32/0x70 [ 107.367502] aa_free_profile.part.0+0xa2/0x400 [ 107.367850] ? rcu_do_batch+0x1e6/0x5e0 [ 107.368148] aa_free_profile+0x23/0x60 [ 107.368438] label_free_switch+0x4c/0x80 [ 107.368751] label_free_rcu+0x1c/0x50 [ 107.369038] rcu_do_batch+0x1e8/0x5e0 [ 107.369324] ? rcu_do_batch+0x157/0x5e0 [ 107.369626] rcu_core+0x1b0/0x2f0 [ 107.369888] rcu_core_si+0xe/0x20 [ 107.370156] handle_softirqs+0x9b/0x3d0 [ 107.370460] ? smpboot_thread_fn+0x26/0x210 [ 107.370790] run_ksoftirqd+0x3a/0x70 [ 107.371070] smpboot_thread_fn+0xf9/0x210 [ 107.371383] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 107.371746] kthread+0x10d/0x280 [ 107.372010] ? __pfx_kthread+0x10/0x10 [ 107.372310] ret_from_fork+0x44/0x70 [ 107.372655] ? __pfx_kthread+0x10/0x10 [ 107.372974] ret_from_fork_asm+0x1a/0x30 [ 107.373316] </TASK> [ 107.373505] Modules linked in: af_packet_diag mptcp_diag tcp_diag udp_diag raw_diag inet_diag snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore qrtr binfmt_misc intel_rapl_msr intel_rapl_common kvm_amd ccp kvm irqbypass polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd i2c_piix4 i2c_smbus input_leds joydev sch_fq_codel msr parport_pc ppdev lp parport efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci dmi_sysfs qemu_fw_cfg ip_tables x_tables autofs4 hid_generic usbhid hid psmouse serio_raw floppy bochs pata_acpi [ 107.379086] ---[ end trace 0000000000000000 ]--- Don't set the count until a ruleset is actually allocated and guard against free_ruleset() being called with a null pointer. Reported-by: Ryan Lee <ryan.lee@canonical.com> Fixes: 217af7e ("apparmor: refactor profile rules and attachments") Signed-off-by: John Johansen <john.johansen@canonical.com>
alobakin
pushed a commit
that referenced
this pull request
Sep 12, 2025
Petr Machata says: ==================== bridge: Allow keeping local FDB entries only on VLAN 0 The bridge FDB contains one local entry per port per VLAN, for the MAC of the port in question, and likewise for the bridge itself. This allows bridge to locally receive and punt "up" any packets whose destination MAC address matches that of one of the bridge interfaces or of the bridge itself. The number of these local "service" FDB entries grows linearly with number of bridge-global VLAN memberships, but that in turn will tend to grow quadratically with number of ports and per-port VLAN memberships. While that does not cause issues during forwarding lookups, it does make dumps impractically slow. As an example, with 100 interfaces, each on 4K VLANs, a full dump of FDB that just contains these 400K local entries, takes 6.5s. That's _without_ considering iproute2 formatting overhead, this is just how long it takes to walk the FDB (repeatedly), serialize it into netlink messages, and parse the messages back in userspace. This is to illustrate that with growing number of ports and VLANs, the time required to dump this repetitive information blows up. Arguably 4K VLANs per interface is not a very realistic configuration, but then modern switches can instead have several hundred interfaces, and we have fielded requests for >1K VLAN memberships per port among customers. FDB entries are currently all kept on a single linked list, and then dumping uses this linked list to walk all entries and dump them in order. When the message buffer is full, the iteration is cut short, and later restarted. Of course, to restart the iteration, it's first necessary to walk the already-dumped front part of the list before starting dumping again. So one possibility is to organize the FDB entries in different structure more amenable to walk restarts. One option is to walk directly the hash table. The advantage is that no auxiliary structure needs to be introduced. With a rough sketch of this approach, the above scenario gets dumped in not quite 3 s, saving over 50 % of time. However hash table iteration requires maintaining an active cursor that must be collected when the dump is aborted. It looks like that would require changes in the NDO protocol to allow to run this cleanup. Moreover, on hash table resize the iteration is simply restarted. FDB dumps are currently not guaranteed to correspond to any one particular state: entries can be missed, or be duplicated. But with hash table iteration we would get that plus the much less graceful resize behavior, where swaths of FDB are duplicated. Another option is to maintain the FDB entries in a red-black tree. We have a PoC of this approach on hand, and the above scenario is dumped in about 2.5 s. Still not as snappy as we'd like it, but better than the hash table. However the savings come at the expense of a more expensive insertion, and require locking during dumps, which blocks insertion. The upside of these approaches is that they provide benefits whatever the FDB contents. But it does not seem like either of these is workable. However we intend to clean up the RB tree PoC and present it for consideration later on in case the trade-offs are considered acceptable. Yet another option might be to use in-kernel FDB filtering, and to filter the local entries when dumping. Unfortunately, this does not help all that much either, because the linked-list walk still needs to happen. Also, with the obvious filtering interface built around ndm_flags / ndm_state filtering, one can't just exclude pure local entries in one query. One needs to dump all non-local entries first, and then to get permanent entries in another run filter local & added_by_user. I.e. one needs to pay the iteration overhead twice, and then integrate the result in userspace. To get significant savings, one would need a very specific knob like "dump, but skip/only include local entries". But if we are adding a local-specific knobs, maybe let's have an option to just not duplicate them in the first place. All this FDB duplication is there merely to make things snappy during forwarding. But high-radix switches with thousands of VLANs typically do not process much traffic in the SW datapath at all, but rather offload vast majority of it. So we could exchange some of the runtime performance for a neater FDB. To that end, in this patchset, introduce a new bridge option, BR_BOOLOPT_FDB_LOCAL_VLAN_0, which when enabled, has local FDB entries installed only on VLAN 0, instead of duplicating them across all VLANs. Then to maintain the local termination behavior, on FDB miss, the bridge does a second lookup on VLAN 0. Enabling this option changes the bridge behavior in expected ways. Since the entries are only kept on VLAN 0, FDB get, flush and dump will not perceive them on non-0 VLANs. And deleting the VLAN 0 entry affects forwarding on all VLANs. This patchset is loosely based on a privately circulated patch by Nikolay Aleksandrov. The patchset progresses as follows: - Patch #1 introduces a bridge option to enable the above feature. Then patches #2 to #5 gradually patch the bridge to do the right thing when the option is enabled. Finally patch #6 adds the UAPI knob and the code for when the feature is enabled or disabled. - Patches #7, #8 and #9 contain fixes and improvements to selftest libraries - Patch #10 contains a new selftest ==================== Link: https://patch.msgid.link/cover.1757004393.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Sep 22, 2025
…rnal() A crash was observed with the following output: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none) RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911 Call Trace: <TASK> trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785 vfs_write+0x2b6/0xd00 fs/read_write.c:684 ksys_write+0x129/0x240 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94 </TASK> Function kmemdup() may return NULL in trace_kprobe_create_internal(), add check for it's return value. Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/ Fixes: 33b4e38 ("tracing: kprobe-event: Allocate string buffers from heap") Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Oct 9, 2025
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs. Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable() however. In particular when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs() there was still no locking around the pci_iov_remove_virtfn() calls. On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed: PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56) GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001 00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480 0000000000000001 0000000000000000 0000000000000000 0000000180692828 00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8 #0 [3800313fb20] device_del at c9158ad5c #1 [3800313fb88] pci_remove_bus_device at c915105ba #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198 #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0 #4 [3800313fc60] zpci_bus_remove_device at c90fb6104 #5 [3800313fca0] __zpci_event_availability at c90fb3dca #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2 #7 [3800313fd60] crw_collect_info at c91905822 #8 [3800313fe10] kthread at c90feb390 #9 [3800313fe68] __ret_from_fork at c90f6aa64 #10 [3800313fe98] ret_from_fork at c9194f3f2. This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. This being the reverse operation to the hotplug events generated by sriov_enable() and handled via pdev->no_vf_scan. And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy. Other races may also be possible of course though given that this lack of locking persisted so long observable races seem very rare. Even on s390 the list corruption was only observed with certain devices since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way the locking is missing so fix this by adding it to the sriov_del_vfs() helper. Just like PCI rescan-remove, locking is also missing in sriov_add_vfs() including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking. Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
alobakin
pushed a commit
that referenced
this pull request
Oct 9, 2025
The test starts a workload and then opens events. If the events fail
to open, for example because of perf_event_paranoid, the gopipe of the
workload is leaked and the file descriptor leak check fails when the
test exits. To avoid this cancel the workload when opening the events
fails.
Before:
```
$ perf test -vv 7
7: PERF_RECORD_* events & perf_sample fields:
--- start ---
test child forked, pid 1189568
Using CPUID GenuineIntel-6-B7-1
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
sys_perf_event_open failed, error -13
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
disabled 1
exclude_kernel 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
sys_perf_event_open failed, error -13
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
disabled 1
exclude_kernel 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3
Attempt to add: software/cpu-clock/
..after resolving event: software/config=0/
cpu-clock -> software/cpu-clock/
------------------------------------------------------------
perf_event_attr:
type 1 (PERF_TYPE_SOFTWARE)
size 136
config 0x9 (PERF_COUNT_SW_DUMMY)
sample_type IP|TID|TIME|CPU
read_format ID|LOST
disabled 1
inherit 1
mmap 1
comm 1
enable_on_exec 1
task 1
sample_id_all 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
{ wakeup_events, wakeup_watermark } 1
------------------------------------------------------------
sys_perf_event_open: pid 1189569 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open failed, error -13
perf_evlist__open: Permission denied
---- end(-2) ----
Leak of file descriptor 6 that opened: 'pipe:[14200347]'
---- unexpected signal (6) ----
iFailed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
#0 0x565358f6666e in child_test_sig_handler builtin-test.c:311
#1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0
#2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44
#3 0x7f29ce849cc2 in raise raise.c:27
#4 0x7f29ce8324ac in abort abort.c:81
#5 0x565358f662d4 in check_leaks builtin-test.c:226
#6 0x565358f6682e in run_test_child builtin-test.c:344
#7 0x565358ef7121 in start_command run-command.c:128
#8 0x565358f67273 in start_test builtin-test.c:545
#9 0x565358f6771d in __cmd_test builtin-test.c:647
#10 0x565358f682bd in cmd_test builtin-test.c:849
#11 0x565358ee5ded in run_builtin perf.c:349
#12 0x565358ee6085 in handle_internal_command perf.c:401
#13 0x565358ee61de in run_argv perf.c:448
#14 0x565358ee6527 in main perf.c:555
#15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74
#16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
#17 0x565358e391c1 in _start perf[851c1]
7: PERF_RECORD_* events & perf_sample fields : FAILED!
```
After:
```
$ perf test 7
7: PERF_RECORD_* events & perf_sample fields : Skip (permissions)
```
Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object")
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Chun-Tse Shao <ctshao@google.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
alobakin
pushed a commit
that referenced
this pull request
Oct 16, 2025
Phil reported a boot failure once sheaves become used in commits 59faa4d ("maple_tree: use percpu sheaves for maple_node_cache") and 3accabd ("mm, vma: use percpu sheaves for vm_area_struct cache"): BUG: kernel NULL pointer dereference, address: 0000000000000040 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary) Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025 RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0 Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89 RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300 RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388 R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0 R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0 FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0 Call Trace: <TASK> ? srso_return_thunk+0x5/0x5f ? vm_area_alloc+0x1e/0x60 kmem_cache_alloc_noprof+0x4ec/0x5b0 vm_area_alloc+0x1e/0x60 create_init_stack_vma+0x26/0x210 alloc_bprm+0x139/0x200 kernel_execve+0x4a/0x140 call_usermodehelper_exec_async+0xd0/0x190 ? __pfx_call_usermodehelper_exec_async+0x10/0x10 ret_from_fork+0xf0/0x110 ? __pfx_call_usermodehelper_exec_async+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 0000000000000040 ---[ end trace 0000000000000000 ]--- RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0 Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89 RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300 RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388 R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0 R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0 FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such that memory is only on 2 of them." # numactl --hardware available: 8 nodes (0-7) node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90 node 1 size: 31584 MB node 1 free: 30397 MB node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92 node 2 size: 0 MB node 2 free: 0 MB node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94 node 3 size: 0 MB node 3 free: 0 MB node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89 node 4 size: 0 MB node 4 free: 0 MB node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91 node 5 size: 32214 MB node 5 free: 31625 MB node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93 node 6 size: 0 MB node 6 free: 0 MB node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95 node 7 size: 0 MB node 7 free: 0 MB Linus decoded the stacktrace to get_barn() and get_node() and determined that kmem_cache->node[numa_mem_id()] is NULL. The problem is due to a wrong assumption that memoryless nodes only exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id() points to the nearest node that has memory. SLUB has been allocating its kmem_cache_node structures only on nodes with memory and so it does with struct node_barn. For kmem_cache_node, get_partial_node() checks if get_node() result is not NULL, which I assumed was for protection from a bogus node id passed to kmalloc_node() but apparently it's also for systems where numa_mem_id() (used when no specific node is given) might return a memoryless node. Fix the sheaves code the same way by checking the result of get_node() and bailing out if it's NULL. Note that cpus on such memoryless nodes will have degraded sheaves performance, which can be improved later, preferably by making numa_mem_id() work properly on such systems. Fixes: 2d517aa ("slab: add opt-in caching layer of percpu sheaves") Reported-and-tested-by: Phil Auld <pauld@redhat.com> Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/ Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
alobakin
pushed a commit
that referenced
this pull request
Nov 7, 2025
With CONFIG_PROVE_RCU_LIST=y and by executing $ netcat -l --sctp & $ netcat --sctp localhost & $ ss --sctp one can trigger the following Lockdep-RCU splat(s): WARNING: suspicious RCU usage 6.18.0-rc1-00093-g7f864458e9a6 #5 Not tainted ----------------------------- net/sctp/diag.c:76 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by ss/215: #0: ffff9c740828bec0 (nlk_cb_mutex-SOCK_DIAG){+.+.}-{4:4}, at: __netlink_dump_start+0x84/0x2b0 #1: ffff9c7401d72cd0 (sk_lock-AF_INET6){+.+.}-{0:0}, at: sctp_sock_dump+0x38/0x200 stack backtrace: CPU: 0 UID: 0 PID: 215 Comm: ss Not tainted 6.18.0-rc1-00093-g7f864458e9a6 #5 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x90 lockdep_rcu_suspicious.cold+0x4e/0xa3 inet_sctp_diag_fill.isra.0+0x4b1/0x5d0 sctp_sock_dump+0x131/0x200 sctp_transport_traverse_process+0x170/0x1b0 ? __pfx_sctp_sock_filter+0x10/0x10 ? __pfx_sctp_sock_dump+0x10/0x10 sctp_diag_dump+0x103/0x140 __inet_diag_dump+0x70/0xb0 netlink_dump+0x148/0x490 __netlink_dump_start+0x1f3/0x2b0 inet_diag_handler_cmd+0xcd/0x100 ? __pfx_inet_diag_dump_start+0x10/0x10 ? __pfx_inet_diag_dump+0x10/0x10 ? __pfx_inet_diag_dump_done+0x10/0x10 sock_diag_rcv_msg+0x18e/0x320 ? __pfx_sock_diag_rcv_msg+0x10/0x10 netlink_rcv_skb+0x4d/0x100 netlink_unicast+0x1d7/0x2b0 netlink_sendmsg+0x203/0x450 ____sys_sendmsg+0x30c/0x340 ___sys_sendmsg+0x94/0xf0 __sys_sendmsg+0x83/0xf0 do_syscall_64+0xbb/0x390 entry_SYSCALL_64_after_hwframe+0x77/0x7f ... </TASK> Fixes: 8f840e4 ("sctp: add the sctp_diag.c file") Signed-off-by: Stefan Wiehler <stefan.wiehler@nokia.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251028161506.3294376-2-stefan.wiehler@nokia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Nov 7, 2025
Michael Chan says: ==================== bnxt_en: Bug fixes Patches 1, 3, and 4 are bug fixes related to the FW log tracing driver coredump feature recently added in 6.13. Patch #1 adds the necessary call to shutdown the FW logging DMA during PCI shutdown. Patch #3 fixes a possible null pointer derefernce when using early versions of the FW with this feature. Patch #4 adds the coredump header information unconditionally to make it more robust. Patch #2 fixes a possible memory leak during PTP shutdown. Patch #5 eliminates a dmesg warning when doing devlink reload. ==================== Link: https://patch.msgid.link/20251104005700.542174-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
alobakin
pushed a commit
that referenced
this pull request
Nov 20, 2025
On completion of i915_vma_pin_ww(), a synchronous variant of dma_fence_work_commit() is called. When pinning a VMA to GGTT address space on a Cherry View family processor, or on a Broxton generation SoC with VTD enabled, i.e., when stop_machine() is then called from intel_ggtt_bind_vma(), that can potentially lead to lock inversion among reservation_ww and cpu_hotplug locks. [86.861179] ====================================================== [86.861193] WARNING: possible circular locking dependency detected [86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 Tainted: G U [86.861226] ------------------------------------------------------ [86.861238] i915_module_loa/1432 is trying to acquire lock: [86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50 [86.861290] but task is already holding lock: [86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.862233] which lock already depends on the new lock. [86.862251] the existing dependency chain (in reverse order) is: [86.862265] -> #5 (reservation_ww_class_mutex){+.+.}-{3:3}: [86.862292] dma_resv_lockdep+0x19a/0x390 [86.862315] do_one_initcall+0x60/0x3f0 [86.862334] kernel_init_freeable+0x3cd/0x680 [86.862353] kernel_init+0x1b/0x200 [86.862369] ret_from_fork+0x47/0x70 [86.862383] ret_from_fork_asm+0x1a/0x30 [86.862399] -> #4 (reservation_ww_class_acquire){+.+.}-{0:0}: [86.862425] dma_resv_lockdep+0x178/0x390 [86.862440] do_one_initcall+0x60/0x3f0 [86.862454] kernel_init_freeable+0x3cd/0x680 [86.862470] kernel_init+0x1b/0x200 [86.862482] ret_from_fork+0x47/0x70 [86.862495] ret_from_fork_asm+0x1a/0x30 [86.862509] -> #3 (&mm->mmap_lock){++++}-{3:3}: [86.862531] down_read_killable+0x46/0x1e0 [86.862546] lock_mm_and_find_vma+0xa2/0x280 [86.862561] do_user_addr_fault+0x266/0x8e0 [86.862578] exc_page_fault+0x8a/0x2f0 [86.862593] asm_exc_page_fault+0x27/0x30 [86.862607] filldir64+0xeb/0x180 [86.862620] kernfs_fop_readdir+0x118/0x480 [86.862635] iterate_dir+0xcf/0x2b0 [86.862648] __x64_sys_getdents64+0x84/0x140 [86.862661] x64_sys_call+0x1058/0x2660 [86.862675] do_syscall_64+0x91/0xe90 [86.862689] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.862703] -> #2 (&root->kernfs_rwsem){++++}-{3:3}: [86.862725] down_write+0x3e/0xf0 [86.862738] kernfs_add_one+0x30/0x3c0 [86.862751] kernfs_create_dir_ns+0x53/0xb0 [86.862765] internal_create_group+0x134/0x4c0 [86.862779] sysfs_create_group+0x13/0x20 [86.862792] topology_add_dev+0x1d/0x30 [86.862806] cpuhp_invoke_callback+0x4b5/0x850 [86.862822] cpuhp_issue_call+0xbf/0x1f0 [86.862836] __cpuhp_setup_state_cpuslocked+0x111/0x320 [86.862852] __cpuhp_setup_state+0xb0/0x220 [86.862866] topology_sysfs_init+0x30/0x50 [86.862879] do_one_initcall+0x60/0x3f0 [86.862893] kernel_init_freeable+0x3cd/0x680 [86.862908] kernel_init+0x1b/0x200 [86.862921] ret_from_fork+0x47/0x70 [86.862934] ret_from_fork_asm+0x1a/0x30 [86.862947] -> #1 (cpuhp_state_mutex){+.+.}-{3:3}: [86.862969] __mutex_lock+0xaa/0xed0 [86.862982] mutex_lock_nested+0x1b/0x30 [86.862995] __cpuhp_setup_state_cpuslocked+0x67/0x320 [86.863012] __cpuhp_setup_state+0xb0/0x220 [86.863026] page_alloc_init_cpuhp+0x2d/0x60 [86.863041] mm_core_init+0x22/0x2d0 [86.863054] start_kernel+0x576/0xbd0 [86.863068] x86_64_start_reservations+0x18/0x30 [86.863084] x86_64_start_kernel+0xbf/0x110 [86.863098] common_startup_64+0x13e/0x141 [86.863114] -> #0 (cpu_hotplug_lock){++++}-{0:0}: [86.863135] __lock_acquire+0x1635/0x2810 [86.863152] lock_acquire+0xc4/0x2f0 [86.863166] cpus_read_lock+0x41/0x100 [86.863180] stop_machine+0x1c/0x50 [86.863194] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.863987] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.864735] __vma_bind+0x55/0x70 [i915] [86.865510] fence_work+0x26/0xa0 [i915] [86.866248] fence_notify+0xa1/0x140 [i915] [86.866983] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.867719] i915_sw_fence_commit+0x39/0x60 [i915] [86.868453] i915_vma_pin_ww+0x462/0x1360 [i915] [86.869228] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.870001] initial_plane_vma+0x307/0x840 [i915] [86.870774] intel_initial_plane_config+0x33f/0x670 [i915] [86.871546] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.872330] i915_driver_probe+0x7fa/0xe80 [i915] [86.873057] i915_pci_probe+0xe6/0x220 [i915] [86.873782] local_pci_probe+0x47/0xb0 [86.873802] pci_device_probe+0xf3/0x260 [86.873817] really_probe+0xf1/0x3c0 [86.873833] __driver_probe_device+0x8c/0x180 [86.873848] driver_probe_device+0x24/0xd0 [86.873862] __driver_attach+0x10f/0x220 [86.873876] bus_for_each_dev+0x7f/0xe0 [86.873892] driver_attach+0x1e/0x30 [86.873904] bus_add_driver+0x151/0x290 [86.873917] driver_register+0x5e/0x130 [86.873931] __pci_register_driver+0x7d/0x90 [86.873945] i915_pci_register_driver+0x23/0x30 [i915] [86.874678] i915_init+0x37/0x120 [i915] [86.875347] do_one_initcall+0x60/0x3f0 [86.875369] do_init_module+0x97/0x2a0 [86.875385] load_module+0x2c54/0x2d80 [86.875398] init_module_from_file+0x96/0xe0 [86.875413] idempotent_init_module+0x117/0x330 [86.875426] __x64_sys_finit_module+0x77/0x100 [86.875440] x64_sys_call+0x24de/0x2660 [86.875454] do_syscall_64+0x91/0xe90 [86.875470] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.875486] other info that might help us debug this: [86.875502] Chain exists of: cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex [86.875539] Possible unsafe locking scenario: [86.875552] CPU0 CPU1 [86.875563] ---- ---- [86.875573] lock(reservation_ww_class_mutex); [86.875588] lock(reservation_ww_class_acquire); [86.875606] lock(reservation_ww_class_mutex); [86.875624] rlock(cpu_hotplug_lock); [86.875637] *** DEADLOCK *** [86.875650] 3 locks held by i915_module_loa/1432: [86.875663] #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220 [86.875699] #1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.876512] #2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915] [86.877305] stack backtrace: [86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G U 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 PREEMPT(voluntary) [86.877334] Tainted: [U]=USER [86.877336] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020 [86.877339] Call Trace: [86.877344] <TASK> [86.877353] dump_stack_lvl+0x91/0xf0 [86.877364] dump_stack+0x10/0x20 [86.877369] print_circular_bug+0x285/0x360 [86.877379] check_noncircular+0x135/0x150 [86.877390] __lock_acquire+0x1635/0x2810 [86.877403] lock_acquire+0xc4/0x2f0 [86.877408] ? stop_machine+0x1c/0x50 [86.877422] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878173] cpus_read_lock+0x41/0x100 [86.878182] ? stop_machine+0x1c/0x50 [86.878191] ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915] [86.878916] stop_machine+0x1c/0x50 [86.878927] bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915] [86.879652] intel_ggtt_bind_vma+0x43/0x70 [i915] [86.880375] __vma_bind+0x55/0x70 [i915] [86.881133] fence_work+0x26/0xa0 [i915] [86.881851] fence_notify+0xa1/0x140 [i915] [86.882566] __i915_sw_fence_complete+0x8f/0x270 [i915] [86.883286] i915_sw_fence_commit+0x39/0x60 [i915] [86.884003] i915_vma_pin_ww+0x462/0x1360 [i915] [86.884756] ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915] [86.885513] i915_vma_pin.constprop.0+0x133/0x1d0 [i915] [86.886281] initial_plane_vma+0x307/0x840 [i915] [86.887049] intel_initial_plane_config+0x33f/0x670 [i915] [86.887819] intel_display_driver_probe_nogem+0x1c6/0x260 [i915] [86.888587] i915_driver_probe+0x7fa/0xe80 [i915] [86.889293] ? mutex_unlock+0x12/0x20 [86.889301] ? drm_privacy_screen_get+0x171/0x190 [86.889308] ? acpi_dev_found+0x66/0x80 [86.889321] i915_pci_probe+0xe6/0x220 [i915] [86.890038] local_pci_probe+0x47/0xb0 [86.890049] pci_device_probe+0xf3/0x260 [86.890058] really_probe+0xf1/0x3c0 [86.890067] __driver_probe_device+0x8c/0x180 [86.890072] driver_probe_device+0x24/0xd0 [86.890078] __driver_attach+0x10f/0x220 [86.890083] ? __pfx___driver_attach+0x10/0x10 [86.890088] bus_for_each_dev+0x7f/0xe0 [86.890097] driver_attach+0x1e/0x30 [86.890101] bus_add_driver+0x151/0x290 [86.890107] driver_register+0x5e/0x130 [86.890113] __pci_register_driver+0x7d/0x90 [86.890119] i915_pci_register_driver+0x23/0x30 [i915] [86.890833] i915_init+0x37/0x120 [i915] [86.891482] ? __pfx_i915_init+0x10/0x10 [i915] [86.892135] do_one_initcall+0x60/0x3f0 [86.892145] ? __kmalloc_cache_noprof+0x33f/0x470 [86.892157] do_init_module+0x97/0x2a0 [86.892164] load_module+0x2c54/0x2d80 [86.892168] ? __kernel_read+0x15c/0x300 [86.892185] ? kernel_read_file+0x2b1/0x320 [86.892195] init_module_from_file+0x96/0xe0 [86.892199] ? init_module_from_file+0x96/0xe0 [86.892211] idempotent_init_module+0x117/0x330 [86.892224] __x64_sys_finit_module+0x77/0x100 [86.892230] x64_sys_call+0x24de/0x2660 [86.892236] do_syscall_64+0x91/0xe90 [86.892243] ? irqentry_exit+0x77/0xb0 [86.892249] ? sysvec_apic_timer_interrupt+0x57/0xc0 [86.892256] entry_SYSCALL_64_after_hwframe+0x76/0x7e [86.892261] RIP: 0033:0x7303e1b2725d [86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48 [86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 [86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d [86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c [86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80 [86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0 [86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710 [86.892304] </TASK> Call asynchronous variant of dma_fence_work_commit() in that case. v3: Provide more verbose in-line comment (Andi), - mention target environments in commit message. Fixes: 7d1c261 ("drm/i915: Take reservation lock around i915_vma_pin.") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985 Cc: Andi Shyti <andi.shyti@kernel.org> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com> Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com> Acked-by: Andi Shyti <andi.shyti@linux.intel.com> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com> Link: https://lore.kernel.org/r/20251023082925.351307-6-janusz.krzysztofik@linux.intel.com (cherry picked from commit 648ef13) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
alobakin
pushed a commit
that referenced
this pull request
Dec 4, 2025
Marc Kleine-Budde <mkl@pengutronix.de> says: Similarly to how CAN FD reuses the bittiming logic of Classical CAN, CAN XL also reuses the entirety of CAN FD features, and, on top of that, adds new features which are specific to CAN XL. A so-called 'mixed-mode' is intended to have (XL-tolerant) CAN FD nodes and CAN XL nodes on one CAN segment, where the FD-controllers can talk CC/FD and the XL-controllers can talk CC/FD/XL. This mixed-mode utilizes the known error-signalling (ES) for sending CC/FD/XL frames. For CAN FD and CAN XL the tranceiver delay compensation (TDC) is supported to use common CAN and CAN-SIG transceivers. The CANXL-only mode disables the error-signalling in the CAN XL controller. This mode does not allow CC/FD frames to be sent but additionally offers a CAN XL transceiver mode switching (TMS) to send CAN XL frames with up to 20Mbit/s data rate. The TMS utilizes a PWM configuration which is added to the netlink interface. Configured with CAN_CTRLMODE_FD and CAN_CTRLMODE_XL this leads to: FD=0 XL=0 CC-only mode (ES=1) FD=1 XL=0 FD/CC mixed-mode (ES=1) FD=1 XL=1 XL/FD/CC mixed-mode (ES=1) FD=0 XL=1 XL-only mode (ES=0, TMS optional) Patch #1 print defined ctrlmode strings capitalized to increase the readability and to be in line with the 'ip' tool (iproute2). Patch #2 is a small clean-up which makes can_calc_bittiming() use NL_SET_ERR_MSG() instead of netdev_err(). Patch #3 adds a check in can_dev_dropped_skb() to drop CAN FD frames when CAN FD is turned off. Patch #4 adds CAN_CTRLMODE_RESTRICTED. Note that contrary to the other CAN_CTRL_MODE_XL_* that are introduced in the later patches, this control mode is not specific to CAN XL. The nuance is that because this restricted mode was only added in ISO 11898-1:2024, it is made mandatory for CAN XL devices but optional for other protocols. This is why this patch is added as a preparation before introducing the core CAN XL logic. Patch #5 adds all the CAN XL features which are inherited from CAN FD: the nominal bittiming, the data bittiming and the TDC. Patch #6 add a new CAN_CTRLMODE_XL_TMS control mode which is specific to CAN XL to enable the transceiver mode switching (TMS) in XL-only mode. Patch #7 adds a check in can_dev_dropped_skb() to drop CAN CC/FD frames when the CAN XL controller is in CAN XL-only mode. The introduced can_dev_in_xl_only_mode() function also determines the error-signalling configuration for the CAN XL controllers. Patch #8 to #11 add the PWM logic for the CAN XL TMS mode. Patch #12 to #14 add different default sample-points for standard CAN and CAN SIG transceivers (with TDC) and CAN XL transceivers using PWM in the CAN XL TMS mode. Patch #15 add a dummy_can driver for netlink testing and debugging. Patch #16 check CAN frame type (CC/FD/XL) when writing those frames to the CAN_RAW socket and reject them if it's not supported by the CAN interface. Patch #17 increase the resolution when printing the bitrate error and round-up the value to 0.01% in the case the resolution would still provide values which would lead to 0.00%. Link: https://patch.msgid.link/20251126-canxl-v8-0-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Definition is only a proposal. There should be free place for 8B of tx
timestamp.
Signed-off-by: Michal Swiatkowski michal.swiatkowski@intel.com