value swap macro change with xor #32

halit · 2013-05-12T12:33:33Z

No description provided.

VdeVatman · 2013-05-12T17:59:12Z

may I ask what is the swap value for?

jesus-ramos · 2013-05-14T18:11:00Z

The compiler will get rid of the temporary variable anyway this is just a different way or writing it.

tawseef · 2013-05-14T19:12:47Z

12 GB
On 12 May 2013 23:30, "Rubén Marrero Ruiz" notifications@github.com wrote:

may I ask what is the value swap for?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/32#issuecomment-17782140
.

VdeVatman · 2013-05-14T19:36:40Z

Thanks

There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel: > My fuzz tester keeps hitting this. Every instance shows the non-irq stack > came in from mlockall. I'm only seeing this on one box, but that has more > ram (8gb) than my other machines, which might explain it. > > Dave > > INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) > sending NMI to all CPUs: > NMI backtrace for cpu 3 > CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ torvalds#32 > task: ffff88023e743fc0 ti: ffff88022f6f2000 task.ti: ffff88022f6f2000 > RIP: 0010:[<ffffffff810bf7d1>] [<ffffffff810bf7d1>] trace_hardirqs_off_caller+0x21/0xb0 > RSP: 0018:ffff880244e03c30 EFLAGS: 00000046 > RAX: ffff88023e743fc0 RBX: 0000000000000001 RCX: 000000000000003c > RDX: 000000000000000f RSI: 0000000000000004 RDI: ffffffff81033cab > RBP: ffff880244e03c38 R08: ffff880243288a80 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000001 R12: ffff880243288a80 > R13: ffff8802437eda40 R14: 0000000000080000 R15: 000000000000d010 > FS: 00007f50ae33b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000000000097f000 CR3: 0000000240fa0000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 > Stack: > ffffffff810bf86d ffff880244e03c98 ffffffff81033cab 0000000000000096 > 000000000000d008 0000000300000002 0000000000000004 0000000000000003 > 0000000000002710 ffffffff81c50d00 ffffffff81c50d00 ffff880244fcde00 > Call Trace: > <IRQ> > [<ffffffff810bf86d>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff81033cab>] __x2apic_send_IPI_mask+0x1ab/0x1c0 > [<ffffffff81033cdc>] x2apic_send_IPI_all+0x1c/0x20 > [<ffffffff81030115>] arch_trigger_all_cpu_backtrace+0x65/0xa0 > [<ffffffff811144b1>] rcu_check_callbacks+0x331/0x8e0 > [<ffffffff8108bfa0>] ? hrtimer_run_queues+0x20/0x180 > [<ffffffff8109e905>] ? sched_clock_cpu+0xb5/0x100 > [<ffffffff81069557>] update_process_times+0x47/0x80 > [<ffffffff810bd115>] tick_sched_handle.isra.16+0x25/0x60 > [<ffffffff810bd231>] tick_sched_timer+0x41/0x60 > [<ffffffff8108ace1>] __run_hrtimer+0x81/0x4e0 > [<ffffffff810bd1f0>] ? tick_sched_do_timer+0x60/0x60 > [<ffffffff8108b93f>] hrtimer_interrupt+0xff/0x240 > [<ffffffff8102de84>] local_apic_timer_interrupt+0x34/0x60 > [<ffffffff81718c5f>] smp_apic_timer_interrupt+0x3f/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8105f101>] ? __do_softirq+0xb1/0x440 > [<ffffffff8105f64d>] irq_exit+0xcd/0xe0 > [<ffffffff81718c65>] smp_apic_timer_interrupt+0x45/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > <EOI> > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8170b830>] ? wait_for_completion_killable+0x170/0x170 > [<ffffffff8170c853>] ? preempt_schedule_irq+0x53/0x90 > [<ffffffff8170e9f6>] retint_kernel+0x26/0x30 > [<ffffffff8107a523>] ? queue_work_on+0x43/0x90 > [<ffffffff8107c369>] schedule_on_each_cpu+0xc9/0x1a0 > [<ffffffff81167770>] ? lru_add_drain+0x50/0x50 > [<ffffffff811677c5>] lru_add_drain_all+0x15/0x20 > [<ffffffff81186965>] SyS_mlockall+0xa5/0x1a0 > [<ffffffff81716e94>] tracesys+0xdd/0xe2 This commit addresses this problem by inserting the required preemption point. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>

There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel. Dave Jones reports: "My fuzz tester keeps hitting this. Every instance shows the non-irq stack came in from mlockall. I'm only seeing this on one box, but that has more ram (8gb) than my other machines, which might explain it. INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) sending NMI to all CPUs: NMI backtrace for cpu 3 CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32 Call Trace: lru_add_drain_all+0x15/0x20 SyS_mlockall+0xa5/0x1a0 tracesys+0xdd/0xe2" This commit addresses this problem by inserting the required preemption point. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

torvalds#125 IRQ torvalds#125's status is not constant on different boards, IRQ torvalds#32 is IOMUXC's interrupt which can be triggered manually at anytime, use this irq instead of torvalds#125 to generate interrupt for avoiding CCM enter low power mode by mistake. Signed-off-by: Anson Huang <b20788@freescale.com>

There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel: > My fuzz tester keeps hitting this. Every instance shows the non-irq stack > came in from mlockall. I'm only seeing this on one box, but that has more > ram (8gb) than my other machines, which might explain it. > > Dave > > INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) > sending NMI to all CPUs: > NMI backtrace for cpu 3 > CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ torvalds#32 > task: ffff88023e743fc0 ti: ffff88022f6f2000 task.ti: ffff88022f6f2000 > RIP: 0010:[<ffffffff810bf7d1>] [<ffffffff810bf7d1>] trace_hardirqs_off_caller+0x21/0xb0 > RSP: 0018:ffff880244e03c30 EFLAGS: 00000046 > RAX: ffff88023e743fc0 RBX: 0000000000000001 RCX: 000000000000003c > RDX: 000000000000000f RSI: 0000000000000004 RDI: ffffffff81033cab > RBP: ffff880244e03c38 R08: ffff880243288a80 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000001 R12: ffff880243288a80 > R13: ffff8802437eda40 R14: 0000000000080000 R15: 000000000000d010 > FS: 00007f50ae33b740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 000000000097f000 CR3: 0000000240fa0000 CR4: 00000000001407e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 > Stack: > ffffffff810bf86d ffff880244e03c98 ffffffff81033cab 0000000000000096 > 000000000000d008 0000000300000002 0000000000000004 0000000000000003 > 0000000000002710 ffffffff81c50d00 ffffffff81c50d00 ffff880244fcde00 > Call Trace: > <IRQ> > [<ffffffff810bf86d>] ? trace_hardirqs_off+0xd/0x10 > [<ffffffff81033cab>] __x2apic_send_IPI_mask+0x1ab/0x1c0 > [<ffffffff81033cdc>] x2apic_send_IPI_all+0x1c/0x20 > [<ffffffff81030115>] arch_trigger_all_cpu_backtrace+0x65/0xa0 > [<ffffffff811144b1>] rcu_check_callbacks+0x331/0x8e0 > [<ffffffff8108bfa0>] ? hrtimer_run_queues+0x20/0x180 > [<ffffffff8109e905>] ? sched_clock_cpu+0xb5/0x100 > [<ffffffff81069557>] update_process_times+0x47/0x80 > [<ffffffff810bd115>] tick_sched_handle.isra.16+0x25/0x60 > [<ffffffff810bd231>] tick_sched_timer+0x41/0x60 > [<ffffffff8108ace1>] __run_hrtimer+0x81/0x4e0 > [<ffffffff810bd1f0>] ? tick_sched_do_timer+0x60/0x60 > [<ffffffff8108b93f>] hrtimer_interrupt+0xff/0x240 > [<ffffffff8102de84>] local_apic_timer_interrupt+0x34/0x60 > [<ffffffff81718c5f>] smp_apic_timer_interrupt+0x3f/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8105f101>] ? __do_softirq+0xb1/0x440 > [<ffffffff8105f64d>] irq_exit+0xcd/0xe0 > [<ffffffff81718c65>] smp_apic_timer_interrupt+0x45/0x60 > [<ffffffff817178ef>] apic_timer_interrupt+0x6f/0x80 > <EOI> > [<ffffffff8170e8e0>] ? retint_restore_args+0xe/0xe > [<ffffffff8170b830>] ? wait_for_completion_killable+0x170/0x170 > [<ffffffff8170c853>] ? preempt_schedule_irq+0x53/0x90 > [<ffffffff8170e9f6>] retint_kernel+0x26/0x30 > [<ffffffff8107a523>] ? queue_work_on+0x43/0x90 > [<ffffffff8107c369>] schedule_on_each_cpu+0xc9/0x1a0 > [<ffffffff81167770>] ? lru_add_drain+0x50/0x50 > [<ffffffff811677c5>] lru_add_drain_all+0x15/0x20 > [<ffffffff81186965>] SyS_mlockall+0xa5/0x1a0 > [<ffffffff81716e94>] tracesys+0xdd/0xe2 This commit addresses this problem by inserting the required preemption point. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>

As the new x86 CPU bootup printout format code maintainer, I am taking immediate action to improve and clean (and thus indulge my OCD) the reporting of the cores when coming up online. Fix padding to a right-hand alignment, cleanup code and bind reporting width to the max number of supported CPUs on the system, like this: [ 0.074509] smpboot: Booting Node 0, Processors: #1 #2 #3 #4 #5 torvalds#6 torvalds#7 OK [ 0.644008] smpboot: Booting Node 1, Processors: torvalds#8 torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 OK [ 1.245006] smpboot: Booting Node 2, Processors: torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 OK [ 1.864005] smpboot: Booting Node 3, Processors: torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31 OK [ 2.489005] smpboot: Booting Node 4, Processors: torvalds#32 torvalds#33 torvalds#34 torvalds#35 torvalds#36 torvalds#37 torvalds#38 torvalds#39 OK [ 3.093005] smpboot: Booting Node 5, Processors: torvalds#40 torvalds#41 torvalds#42 torvalds#43 torvalds#44 torvalds#45 torvalds#46 torvalds#47 OK [ 3.698005] smpboot: Booting Node 6, Processors: torvalds#48 torvalds#49 torvalds#50 torvalds#51 #52 #53 torvalds#54 torvalds#55 OK [ 4.304005] smpboot: Booting Node 7, Processors: torvalds#56 torvalds#57 #58 torvalds#59 torvalds#60 torvalds#61 torvalds#62 torvalds#63 OK [ 4.961413] Brought up 64 CPUs and this: [ 0.072367] smpboot: Booting Node 0, Processors: #1 #2 #3 #4 #5 torvalds#6 torvalds#7 OK [ 0.686329] Brought up 8 CPUs Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Libin <huawei.libin@huawei.com> Cc: wangyijing@huawei.com Cc: fenghua.yu@intel.com Cc: guohanjun@huawei.com Cc: paul.gortmaker@windriver.com Link: http://lkml.kernel.org/r/20130927143554.GF4422@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>

Turn it into (for example): [ 0.073380] x86: Booting SMP configuration: [ 0.074005] .... node #0, CPUs: #1 #2 #3 #4 #5 torvalds#6 torvalds#7 [ 0.603005] .... node #1, CPUs: torvalds#8 torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 [ 1.200005] .... node #2, CPUs: torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 [ 1.796005] .... node #3, CPUs: torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31 [ 2.393005] .... node #4, CPUs: torvalds#32 torvalds#33 torvalds#34 torvalds#35 torvalds#36 torvalds#37 torvalds#38 torvalds#39 [ 2.996005] .... node #5, CPUs: torvalds#40 torvalds#41 torvalds#42 torvalds#43 torvalds#44 torvalds#45 torvalds#46 torvalds#47 [ 3.600005] .... node torvalds#6, CPUs: torvalds#48 torvalds#49 torvalds#50 torvalds#51 #52 #53 torvalds#54 torvalds#55 [ 4.202005] .... node torvalds#7, CPUs: torvalds#56 torvalds#57 #58 torvalds#59 torvalds#60 torvalds#61 torvalds#62 torvalds#63 [ 4.811005] .... node torvalds#8, CPUs: torvalds#64 torvalds#65 torvalds#66 torvalds#67 torvalds#68 torvalds#69 #70 torvalds#71 [ 5.421006] .... node torvalds#9, CPUs: torvalds#72 torvalds#73 torvalds#74 torvalds#75 torvalds#76 torvalds#77 torvalds#78 torvalds#79 [ 6.032005] .... node torvalds#10, CPUs: torvalds#80 torvalds#81 torvalds#82 torvalds#83 torvalds#84 torvalds#85 torvalds#86 torvalds#87 [ 6.648006] .... node torvalds#11, CPUs: torvalds#88 torvalds#89 torvalds#90 torvalds#91 torvalds#92 torvalds#93 torvalds#94 torvalds#95 [ 7.262005] .... node torvalds#12, CPUs: torvalds#96 torvalds#97 torvalds#98 torvalds#99 torvalds#100 torvalds#101 torvalds#102 torvalds#103 [ 7.865005] .... node torvalds#13, CPUs: torvalds#104 torvalds#105 torvalds#106 torvalds#107 torvalds#108 torvalds#109 torvalds#110 torvalds#111 [ 8.466005] .... node torvalds#14, CPUs: torvalds#112 torvalds#113 torvalds#114 torvalds#115 torvalds#116 torvalds#117 torvalds#118 torvalds#119 [ 9.073006] .... node torvalds#15, CPUs: torvalds#120 torvalds#121 torvalds#122 torvalds#123 torvalds#124 torvalds#125 torvalds#126 torvalds#127 [ 9.679901] x86: Booted up 16 nodes, 128 CPUs and drop useless elements. Change num_digits() to hpa's division-avoiding, cell-phone-typed version which he went at great lengths and pains to submit on a Saturday evening. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: huawei.libin@huawei.com Cc: wangyijing@huawei.com Cc: fenghua.yu@intel.com Cc: guohanjun@huawei.com Cc: paul.gortmaker@windriver.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20130930095624.GB16383@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>

The 'driver' field of the i2c_client struct is redundant. The same data can be accessed through to_i2c_driver(client->dev.driver). The generated code for both approaches in more or less the same. E.g. on ARM the expression client->driver->command(...) generates ... ldr r3, [r0, torvalds#28] ldr r3, [r3, torvalds#32] blx r3 ... and the expression to_i2c_driver(client->dev.driver)->command(...) generates ... ldr r3, [r0, torvalds#160] ldr r3, [r3, #-4] blx r3 ... Other architectures will generate similar code. All users of the 'driver' field outside of the I2C core have already been converted. So this only leaves the core itself. This patch converts the remaining few users in the I2C core and then removes the 'driver' field from the i2c_client struct. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>

commit 7ed47b7 upstream. The ghash_update function passes a pointer to gf128mul_4k_lle which will be NULL if ghash_setkey is not called or if the most recent call to ghash_setkey failed to allocate memory. This causes an oops. Fix this up by returning an error code in the null case. This is trivially triggered from unprivileged userspace through the AF_ALG interface by simply writing to the socket without setting a key. The ghash_final function has a similar issue, but triggering it requires a memory allocation failure in ghash_setkey _after_ at least one successful call to ghash_update. BUG: unable to handle kernel NULL pointer dereference at 00000670 IP: [<d88c92d4>] gf128mul_4k_lle+0x23/0x60 [gf128mul] *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP Modules linked in: ghash_generic gf128mul algif_hash af_alg nfs lockd nfs_acl sunrpc bridge ipv6 stp llc Pid: 1502, comm: hashatron Tainted: G W 3.1.0-rc9-00085-ge9308cf torvalds#32 Bochs Bochs EIP: 0060:[<d88c92d4>] EFLAGS: 00000202 CPU: 0 EIP is at gf128mul_4k_lle+0x23/0x60 [gf128mul] EAX: d69db1f0 EBX: d6b8ddac ECX: 00000004 EDX: 00000000 ESI: 00000670 EDI: d6b8ddac EBP: d6b8ddc8 ESP: d6b8dda4 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process hashatron (pid: 1502, ti=d6b8c000 task=d6810000 task.ti=d6b8c000) Stack: 00000000 d69db1f0 00000163 00000000 d6b8ddc8 c101a520 d69db1f0 d52aa000 00000ff0 d6b8dde8 d88d310f d6b8a3f8 d52aa000 00001000 d88d502c d6b8ddfc 00001000 d6b8ddf4 c11676e d69db1e8 d6b8de24 c11679ad d52aa000 00000000 Call Trace: [<c101a520>] ? kmap_atomic_prot+0x37/0xa6 [<d88d310f>] ghash_update+0x85/0xbe [ghash_generic] [<c11676ed>] crypto_shash_update+0x18/0x1b [<c11679ad>] shash_ahash_update+0x22/0x36 [<c11679cc>] shash_async_update+0xb/0xd [<d88ce0ba>] hash_sendpage+0xba/0xf2 [algif_hash] [<c121b24c>] kernel_sendpage+0x39/0x4e [<d88ce000>] ? 0xd88cdfff [<c121b298>] sock_sendpage+0x37/0x3e [<c121b261>] ? kernel_sendpage+0x4e/0x4e [<c10b4dbc>] pipe_to_sendpage+0x56/0x61 [<c10b4e1f>] splice_from_pipe_feed+0x58/0xcd [<c10b4d66>] ? splice_from_pipe_begin+0x10/0x10 [<c10b51f5>] __splice_from_pipe+0x36/0x55 [<c10b4d66>] ? splice_from_pipe_begin+0x10/0x10 [<c10b6383>] splice_from_pipe+0x51/0x64 [<c10b63c2>] ? default_file_splice_write+0x2c/0x2c [<c10b63d5>] generic_splice_sendpage+0x13/0x15 [<c10b4d66>] ? splice_from_pipe_begin+0x10/0x10 [<c10b527f>] do_splice_from+0x5d/0x67 [<c10b6865>] sys_splice+0x2bf/0x363 [<c129373b>] ? sysenter_exit+0xf/0x16 [<c104dc1e>] ? trace_hardirqs_on_caller+0x10e/0x13f [<c129370c>] sysenter_do_call+0x12/0x32 Code: 83 c4 0c 5b 5e 5f c9 c3 55 b9 04 00 00 00 89 e5 57 8d 7d e4 56 53 8d 5d e4 83 ec 18 89 45 e0 89 55 dc 0f b6 70 0f c1 e6 04 01 d6 <f3> a5 be 0f 00 00 00 4e 89 d8 e8 48 ff ff ff 8b 45 e0 89 da 0f EIP: [<d88c92d4>] gf128mul_4k_lle+0x23/0x60 [gf128mul] SS:ESP 0068:d6b8dda4 CR2: 0000000000000670 ---[ end trace 4eaa2a86a8e2da24 ]--- note: hashatron[1502] exited with preempt_count 1 BUG: scheduling while atomic: hashatron/1502/0x10000002 INFO: lockdep is turned off. [...] Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

There is a defect in imx6 LPM design. When SW tries to enter low power mode with following sequence, the chip will enter low power mode before A9 CPU execute WFI instruction: 1. Set CCM_CLPCR[1:0] to 2'b00; 2. ARM CPU enters WFI; 3. ARM CPU wakeup from an interrupt event, which is masked by GPC or not visible to GPC, such as interrupt from local timer; 4. Set CCM_CLPCR[1:0] to 2'b01 or 2'b10; 5. ARM CPU execute WFI. Before the last step, the chip will enter WAIT mode if CCM_CLPCR[1:0] is set to 2'b01, or enter STOP mode if CCM_CLPCR[1:0] is set to 2'b10. The patch implements a recommended workaround for this issue. 1. SW triggers irq torvalds#32(IOMUX) to be always pending manually by setting IOMUX_GPR1_GINT bit; 2. SW should then unmask it in GPC before setting CCM LPM; 3. SW should mask it right after CCM LPM is set (bit0-1 of CCM_CLPCR). Signed-off-by: Shawn Guo <shawn.guo@linaro.org>

Lockdep complains about btrfs's async commit: [ 2372.462171] [ BUG: bad unlock balance detected! ] [ 2372.462191] 3.12.0+ #32 Tainted: G W [ 2372.462209] ------------------------------------- [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at: [ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462305] but there are no more locks to release! [ 2372.462324] [ 2372.462324] other info that might help us debug this: [ 2372.462349] no locks held by ceph-osd/14048. [ 2372.462367] [ 2372.462367] stack backtrace: [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32 [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320 [ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650 [ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000 [ 2372.462562] Call Trace: [ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56 [ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100 [ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210 [ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs] [ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs] [ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs] [ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0 [ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0 [ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100 [ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0 [ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0 [ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0 [ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2 ==================================================== It's because that we don't do the right thing when checking if it's ok to tell lockdep that we're trying to release the rwsem. If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze rwsem, which makes lockdep complains a lot. Reported-by: Ma Jianpeng <majianpeng@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>

commit b1a06a4 upstream. Lockdep complains about btrfs's async commit: [ 2372.462171] [ BUG: bad unlock balance detected! ] [ 2372.462191] 3.12.0+ #32 Tainted: G W [ 2372.462209] ------------------------------------- [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at: [ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462305] but there are no more locks to release! [ 2372.462324] [ 2372.462324] other info that might help us debug this: [ 2372.462349] no locks held by ceph-osd/14048. [ 2372.462367] [ 2372.462367] stack backtrace: [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32 [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320 [ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650 [ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000 [ 2372.462562] Call Trace: [ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56 [ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100 [ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210 [ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs] [ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs] [ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs] [ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0 [ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0 [ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100 [ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0 [ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0 [ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0 [ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2 ==================================================== It's because that we don't do the right thing when checking if it's ok to tell lockdep that we're trying to release the rwsem. If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze rwsem, which makes lockdep complains a lot. Reported-by: Ma Jianpeng <majianpeng@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Improve the comment of SW workaround for CCM lpm issue using hardware errata description to avoid confusion. ERR007265: CCM: When improper low-power sequence is used, the SoC enters low power mode before the ARM core executes WFI. Software workaround: 1) Software should trigger IRQ torvalds#32 (IOMUX) to be always pending by setting IOMUX_GPR1_GINT. 2) Software should then unmask IRQ torvalds#32 in GPC before setting CCM Low-Power mode. 3) Software should mask IRQ torvalds#32 right after CCM Low-Power mode is set (set bits 0-1 of CCM_CLPCR). Signed-off-by: Anson Huang <b20788@freescale.com> Signed-off-by: Shawn Guo <shawn.guo@linaro.org>

torvalds#125 IRQ torvalds#125's status is not constant on different boards, IRQ torvalds#32 is IOMUXC's interrupt which can be triggered manually at anytime, use this irq instead of torvalds#125 to generate interrupt for avoiding CCM enter low power mode by mistake. Signed-off-by: Anson Huang <b20788@freescale.com>

When doing some numa tests on powerpc, I triggered an oops bug. I find it is caused by using page->_last_cpupid. It should be initialized as "-1 & LAST_CPUPID_MASK", but not "-1". Otherwise, in task_numa_fault(), we will miss the checking (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpu is less than possible cpu. PPC needs the LAST_CPUPID_NOT_IN_PAGE_FLAGS case because ppc needs to support a large physical address region, up to 2^46 but small section size (2^24). So when NR_CPUS grows up, it is easily to cause not-in-page-flags. Call trace: [ 55.978091] SMP NR_CPUS=64 NUMA PowerNV [ 55.978118] Modules linked in: [ 55.978145] CPU: 24 PID: 804 Comm: systemd-udevd Not tainted 3.13.0-rc1+ torvalds#32 [ 55.978183] task: c000001e2746aa80 ti: c000001e32c50000 task.ti: c000001e32c50000 [ 55.978219] NIP: c0000000000f5ad0 LR: c0000000000f5ac8 CTR: c000000000913cf0 [ 55.978256] REGS: c000001e32c53510 TRAP: 0300 Not tainted (3.13.0-rc1+) [ 55.978286] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 28024424 XER: 20000000 [ 55.978380] CFAR: c000000000009324 DAR: 7265717569726857 DSISR: 40000000 SOFTE: 1 GPR00: c0000000000f5ac8 c000001e32c53790 c000000001f34338 0000000000000021 GPR04: 0000000000000000 0000000000000031 c000000001f74338 0000ffffffffffff GPR08: 0000000000000001 7265717569726573 0000000000000000 0000000000000000 GPR12: 0000000028024422 c00000000ffdd800 00000000296b2e64 0000000000000020 GPR16: 0000000000000002 0000000000000003 c000001e2f8e4658 c000001e25c1c1d8 GPR20: c000001e2f8e4000 c000000001f7a858 0000000000000658 0000000040000392 GPR24: 00000000000000a8 c000001e33c1a400 00000000000001d8 c000001e25c1c000 GPR28: c000001e33c37ff0 0007837840000392 000000000000003f c000001e32c53790 [ 55.978903] NIP [c0000000000f5ad0] .task_numa_fault+0x1470/0x2370 [ 55.978934] LR [c0000000000f5ac8] .task_numa_fault+0x1468/0x2370 [ 55.978964] Call Trace: [ 55.978978] [c000001e32c53790] [c0000000000f5ac8] .task_numa_fault+0x1468/0x2370 (unreliable) [ 55.979036] [c000001e32c539e0] [c00000000020a820] .do_numa_page+0x480/0x4a0 [ 55.979072] [c000001e32c53b10] [c00000000020bfec] .handle_mm_fault+0x4ec/0xc90 [ 55.979123] [c000001e32c53c00] [c000000000e88c98] .do_page_fault+0x3a8/0x890 [ 55.979161] [c000001e32c53e30] [c000000000009568] handle_page_fault+0x10/0x30 [ 55.979197] Instruction dump: [ 55.979216] 3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fef 3863af78 3c82fefb [ 55.979277] 3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b4 5529063e 7d2a07b4 [ 55.979354] ---[ end trace 15f2510da5ae07cf ]--- Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

I can trigger a lockdep warning: # mount -t cgroup -o cpuset xxx /cgroup # mkdir /cgroup/cpuset # mkdir /cgroup/tmp # echo 0 > /cgroup/tmp/cpuset.cpus # echo 0 > /cgroup/tmp/cpuset.mems # echo 1 > /cgroup/tmp/cpuset.memory_migrate # echo $$ > /cgroup/tmp/tasks # echo 1 > /cgruop/tmp/cpuset.mems =============================== [ INFO: suspicious RCU usage. ] 3.14.0-rc1-0.1-default+ torvalds#32 Not tainted ------------------------------- include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage! ... [<ffffffff81582174>] dump_stack+0x72/0x86 [<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140 [<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0 ... We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now we hold cpuset_mutex, which causes task_css() to complain. This is not a false-positive but a real issue. Holding cpuset_mutex won't prevent a task from migrating to another cpuset, and it won't prevent the original task->cgroup from destroying during this change. Fixes: 5d21cc2 (cpuset: replace cgroup_mutex locking with cpuset internal locking) Cc: <stable@vger.kernel.org> # 3.9+ Signed-off-by: Li Zefan <lizefan@huawei.com> Sigend-off-by: Tejun Heo <tj@kernel.org>

When doing some numa tests on powerpc, I triggered an oops bug. I find it is caused by using page->_last_cpupid. It should be initialized as "-1 & LAST_CPUPID_MASK", but not "-1". Otherwise, in task_numa_fault(), we will miss the checking (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpu is less than possible cpu. This happen with CONFIG_SPARSE_VMEMMAP disabled Call trace: SMP NR_CPUS=64 NUMA PowerNV Modules linked in: CPU: 24 PID: 804 Comm: systemd-udevd Not tainted3.13.0-rc1+ #32 task: c000001e2746aa80 ti: c000001e32c50000 task.ti:c000001e32c50000 REGS: c000001e32c53510 TRAP: 0300 Not tainted(3.13.0-rc1+) MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR:28024424 XER: 20000000 CFAR: c000000000009324 DAR: 7265717569726857 DSISR:40000000 SOFTE: 1 NIP .task_numa_fault+0x1470/0x2370 LR .task_numa_fault+0x1468/0x2370 Call Trace: .task_numa_fault+0x1468/0x2370 (unreliable) .do_numa_page+0x480/0x4a0 .handle_mm_fault+0x4ec/0xc90 .do_page_fault+0x3a8/0x890 handle_page_fault+0x10/0x30 Instruction dump: 3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb 3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b45529063e 7d2a07b4 ---[ end trace 15f2510da5ae07cf ]--- Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

When doing some numa tests on powerpc, I triggered an oops bug. I find it is caused by using page->_last_cpupid. It should be initialized as "-1 & LAST_CPUPID_MASK", but not "-1". Otherwise, in task_numa_fault(), we will miss the checking (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpu is less than possible cpu. This happen with CONFIG_SPARSE_VMEMMAP disabled Call trace: [ 55.978091] SMP NR_CPUS=64 NUMA PowerNV [ 55.978118] Modules linked in: [ 55.978145] CPU: 24 PID: 804 Comm: systemd-udevd Not tainted3.13.0-rc1+ torvalds#32 [ 55.978183] task: c000001e2746aa80 ti: c000001e32c50000 task.ti:c000001e32c50000 [ 55.978219] NIP: c0000000000f5ad0 LR: c0000000000f5ac8 CTR:c000000000913cf0 [ 55.978256] REGS: c000001e32c53510 TRAP: 0300 Not tainted(3.13.0-rc1+) [ 55.978286] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR:28024424 XER: 20000000 [ 55.978380] CFAR: c000000000009324 DAR: 7265717569726857 DSISR:40000000 SOFTE: 1 GPR00: c0000000000f5ac8 c000001e32c53790 c000000001f343380000000000000021 GPR04: 0000000000000000 0000000000000031 c000000001f743380000ffffffffffff GPR08: 0000000000000001 7265717569726573 00000000000000000000000000000000 GPR12: 0000000028024422 c00000000ffdd800 00000000296b2e640000000000000020 GPR16: 0000000000000002 0000000000000003 c000001e2f8e4658c000001e25c1c1d8 GPR20: c000001e2f8e4000 c000000001f7a858 00000000000006580000000040000392 GPR24: 00000000000000a8 c000001e33c1a400 00000000000001d8c000001e25c1c000 GPR28: c000001e33c37ff0 0007837840000392 000000000000003fc000001e32c53790 [ 55.978903] NIP [c0000000000f5ad0] .task_numa_fault+0x1470/0x2370 [ 55.978934] LR [c0000000000f5ac8] .task_numa_fault+0x1468/0x2370 [ 55.978964] Call Trace: [ 55.978978] [c000001e32c53790] [c0000000000f5ac8].task_numa_fault+0x1468/0x2370 (unreliable) [ 55.979036] [c000001e32c539e0] [c00000000020a820].do_numa_page+0x480/0x4a0 [ 55.979072] [c000001e32c53b10] [c00000000020bfec].handle_mm_fault+0x4ec/0xc90 [ 55.979123] [c000001e32c53c00] [c000000000e88c98].do_page_fault+0x3a8/0x890 [ 55.979161] [c000001e32c53e30] [c000000000009568]handle_page_fault+0x10/0x30 [ 55.979197] Instruction dump: [ 55.979216] 3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb [ 55.979277] 3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b45529063e 7d2a07b4 [ 55.979354] ---[ end trace 15f2510da5ae07cf ]--- Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

commit 4729583 upstream. I can trigger a lockdep warning: # mount -t cgroup -o cpuset xxx /cgroup # mkdir /cgroup/cpuset # mkdir /cgroup/tmp # echo 0 > /cgroup/tmp/cpuset.cpus # echo 0 > /cgroup/tmp/cpuset.mems # echo 1 > /cgroup/tmp/cpuset.memory_migrate # echo $$ > /cgroup/tmp/tasks # echo 1 > /cgruop/tmp/cpuset.mems =============================== [ INFO: suspicious RCU usage. ] 3.14.0-rc1-0.1-default+ torvalds#32 Not tainted ------------------------------- include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage! ... [<ffffffff81582174>] dump_stack+0x72/0x86 [<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140 [<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0 ... We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now we hold cpuset_mutex, which causes task_css() to complain. This is not a false-positive but a real issue. Holding cpuset_mutex won't prevent a task from migrating to another cpuset, and it won't prevent the original task->cgroup from destroying during this change. Fixes: 5d21cc2 (cpuset: replace cgroup_mutex locking with cpuset internal locking) Signed-off-by: Li Zefan <lizefan@huawei.com> Sigend-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>

…ixes WARNING: please, no spaces at the start of a line torvalds#29: FILE: mm/memcontrol.c:689: + int nid = zone_to_nid(zone);$ WARNING: please, no spaces at the start of a line torvalds#30: FILE: mm/memcontrol.c:690: + int zid = zone_idx(zone);$ WARNING: please, no spaces at the start of a line torvalds#32: FILE: mm/memcontrol.c:692: + return mem_cgroup_zoneinfo(memcg, nid, zid);$ total: 0 errors, 3 warnings, 35 lines checked ./patches/mm-memcontrolc-introduce-helper-mem_cgroup_zoneinfo_zone.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The asm-generic, big-endian version of zero_bytemask creates a mask of bytes preceding the first zero-byte by left shifting ~0ul based on the position of the first zero byte. Unfortunately, if the first (top) byte is zero, the output of prep_zero_mask has only the top bit set, resulting in undefined C behaviour as we shift left by an amount equal to the width of the type. As it happens, GCC doesn't manage to spot this through the call to fls(), but the issue remains if architectures choose to implement their shift instructions differently. An example would be arch/arm/ (AArch32), where LSL Rd, Rn, #32 results in Rd == 0x0, whilst on arch/arm64 (AArch64) LSL Xd, Xn, #64 results in Xd == Xn. Rather than check explicitly for the problematic shift, this patch adds an extra shift by 1, replacing fls with __fls. Since zero_bytemask is never called with a zero argument (has_zero() is used to check the data first), we don't need to worry about calling __fls(0), which is undefined. Cc: <stable@vger.kernel.org> Cc: Victor Kamensky <victor.kamensky@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

…ixes WARNING: please, no spaces at the start of a line torvalds#29: FILE: mm/memcontrol.c:689: + int nid = zone_to_nid(zone);$ WARNING: please, no spaces at the start of a line torvalds#30: FILE: mm/memcontrol.c:690: + int zid = zone_idx(zone);$ WARNING: please, no spaces at the start of a line torvalds#32: FILE: mm/memcontrol.c:692: + return mem_cgroup_zoneinfo(memcg, nid, zid);$ total: 0 errors, 3 warnings, 35 lines checked ./patches/mm-memcontrolc-introduce-helper-mem_cgroup_zoneinfo_zone.patch has style problems, please review. If any of these errors are false positives, please report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, torvalds#16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, torvalds#16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, torvalds#16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, torvalds#16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, torvalds#16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, torvalds#16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, torvalds#16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, torvalds#16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 0796ddf git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next) Bug: 450671466 Change-Id: Icb6faa509da6dc282270d763f763bc943d461119 Signed-off-by: Reka Norman <rekanorman@google.com>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. [1]: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>

Do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Update syscall_set_arguments() too to keep syscall_get_arguments() and syscall_set_arguments() in sync. With Generic Entry patch[1] and turn on audit, the performance benchmarks from perf bench basic syscall on kunpeng920 gives roughly a 1% performance uplift. | Metric | W/O this patch | With this patch | Change | | ---------- | -------------- | --------------- | --------- | | Total time | 2.241 [sec] | 2.211 [sec] | ↓1.36% | | usecs/op | 0.224157 | 0.221146 | ↓1.36% | | ops/sec | 4,461,157 | 4,501,409 | ↑0.9% | Disassembly shows that using direct assignment causes syscall_set_arguments() to be inlined and cuts the instruction count by five or six compared to memcpy(). Because __audit_syscall_entry() only uses four syscall arguments, the compiler has also elided the copy of regs->regs[4] and regs->regs[5]. Before: <syscall_get_arguments.constprop.0>: aa0103e2 mov x2, x1 91002003 add x3, x0, #0x8 f9408804 ldr x4, [x0, torvalds#272] f8008444 str x4, [x2], torvalds#8 a9409404 ldp x4, x5, [x0, torvalds#8] a9009424 stp x4, x5, [x1, torvalds#8] a9418400 ldp x0, x1, [x0, torvalds#24] a9010440 stp x0, x1, [x2, torvalds#16] f9401060 ldr x0, [x3, torvalds#32] f9001040 str x0, [x2, torvalds#32] d65f03c0 ret d503201f nop After: a9408e82 ldp x2, x3, [x20, torvalds#8] 2a1603e0 mov w0, w22 f9400e84 ldr x4, [x20, torvalds#24] f9408a81 ldr x1, [x20, torvalds#272] 9401c4ba bl ffff800080215ca8 <__audit_syscall_entry> This also aligns the implementation with x86 and RISC-V. Link: https://lkml.kernel.org/r/20251201120633.1193122-3-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20251126071446.3234218-1-ruanjinjie@huawei.com/ [1] Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Christian Zankel <chris@zankel.net> Cc: "Dmitry V. Levin" <ldv@strace.io> Cc: Helge Deller <deller@gmx.de> Cc: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

The cpuidle governor callbacks for update, select and reflect are always running on the actual idle entering/exiting CPU, so use the more optimized this_cpu_ptr() to access the internal teo data. This brings down the latency-critical teo_reflect() from static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffcff0: hint #0x19 ffffffc080ffcff4: stp x29, x30, [sp, #-48]! struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffcff8: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> { ffffffc080ffcffc: add x29, sp, #0x0 ffffffc080ffd000: stp x19, x20, [sp, torvalds#16] ffffffc080ffd004: orr x20, xzr, x0 struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd008: add x0, x2, #0xc20 { ffffffc080ffd00c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); ffffffc080ffd010: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> ffffffc080ffd014: add x19, x19, #0xbb0 ffffffc080ffd018: ldr w3, [x20, #4] dev->last_state_idx = state; to static void teo_reflect(struct cpuidle_device *dev, int state) { ffffffc080ffd034: hint #0x19 ffffffc080ffd038: stp x29, x30, [sp, #-48]! ffffffc080ffd03c: add x29, sp, #0x0 ffffffc080ffd040: stp x19, x20, [sp, torvalds#16] ffffffc080ffd044: orr x20, xzr, x0 struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd048: adrp x19, ffffffc083eb5000 <cpu_devices+0x78> { ffffffc080ffd04c: stp x21, x22, [sp, torvalds#32] struct teo_cpu *cpu_data = this_cpu_ptr(&teo_cpus); ffffffc080ffd050: add x19, x19, #0xbb0 dev->last_state_idx = state; This saves us: adrp x2, ffffffc0848c0000 <gicv5_global_data+0x28> add x0, x2, #0xc20 ldr w3, [x20, #4] Signed-off-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Subject tweak ] Link: https://patch.msgid.link/20251110120819.714560-1-christian.loehle@arm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

swap value macro change with xor

385b8c7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

value swap macro change with xor #32

value swap macro change with xor #32

Uh oh!

halit commented May 12, 2013

Uh oh!

VdeVatman commented May 12, 2013

Uh oh!

jesus-ramos commented May 14, 2013

Uh oh!

tawseef commented May 14, 2013

Uh oh!

VdeVatman commented May 14, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

value swap macro change with xor #32

value swap macro change with xor #32

Uh oh!

Conversation

halit commented May 12, 2013

Uh oh!

VdeVatman commented May 12, 2013

Uh oh!

jesus-ramos commented May 14, 2013

Uh oh!

tawseef commented May 14, 2013

Uh oh!

VdeVatman commented May 14, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants