Accelerate access to interrupt status #13486

hujun260 · 2024-09-16T06:09:02Z

Summary

Using per-CPU storage for g_current_regs, or reading the interrupt status
registers to determine whether code is running in interrupt context, can improve performance.

The benefits are very significant.
Before the modification, if we needed to obtain the interrupt status, it required three steps:

1 Obtain the CPU index
2 Access the global variable
3 Disable/Enable interrupts

This process involved at least 7 CPU instructions.

However, now it only requires a single CPU instruction.
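
The difference between the two access paths can be sketched on a host machine. This is a minimal simulation, not the NuttX implementation: every `sim_` name is hypothetical, and the IRQ mask and the per-CPU register are modeled as plain variables where real code would use architecture-specific instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_NCPUS 2

/* Hypothetical host-side stand-ins for hardware state. */

static int       sim_cpu;                      /* CPU this code runs on      */
static bool      sim_irq_masked;               /* true while IRQs are masked */
static uint32_t *sim_current_regs[SIM_NCPUS];  /* old: global per-CPU array  */
static uint32_t *sim_percpu_reg;               /* new: register of this CPU  */

/* Record interrupt entry in both schemes (regs must be a valid frame). */

static void sim_irq_enter(uint32_t *regs)
{
  assert(regs != NULL);
  sim_current_regs[sim_cpu] = regs;
  sim_percpu_reg = regs;
}

/* Record interrupt exit in both schemes. */

static void sim_irq_leave(void)
{
  sim_current_regs[sim_cpu] = NULL;
  sim_percpu_reg = NULL;
}

/* Before: mask IRQs, get the CPU index, read the global, restore IRQs. */

static bool sim_interrupt_context_old(void)
{
  bool saved = sim_irq_masked;
  bool ret;

  sim_irq_masked = true;                      /* disable interrupts   */
  ret = sim_current_regs[sim_cpu] != NULL;    /* index + global read  */
  sim_irq_masked = saved;                     /* restore interrupts   */
  return ret;
}

/* After: one read of the per-CPU register. */

static bool sim_interrupt_context_new(void)
{
  return sim_percpu_reg != NULL;
}
```

Both functions return the same answer. The second one simply has no CPU index lookup and no critical section around the read.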

Impact

none

Testing

ostest

yf13 · 2024-09-16T23:33:58Z

@hujun260, can you enrich the testing section of the PR description with checked targets (with configs) and observed performance changes?

hujun260 · 2024-09-18T01:21:59Z

> @hujun260, can you enrich the testing section of the PR description with checked targets (with configs) and observed performance changes?

The benefits are very significant.
Before the modification, if we needed to obtain the interrupt status, it required three steps:

1 Obtain the CPU index
2 Access the global variable
3 Disable/Enable interrupts

This process involved at least 7 CPU instructions.

However, now it only requires a single CPU instruction.

anchao · 2024-09-18T02:04:51Z

> The benefits are very significant. Before the modification, if we needed to obtain the interrupt status, it required three steps:
>
> 1 Obtain the CPU index
> 2 Access the global variable
> 3 Disable/Enable interrupts
>
> This process involved at least 6 CPU instructions.
>
> However, now it only requires a single CPU instruction.

The IRQ disable/enable pair in up_interrupt_context() could actually be removed, since a 32-bit access is atomic on arm32 CPU cores.
The cycle timing of the MCR instruction may add overhead: it requires 6 cycles in the worst case.

https://developer.arm.com/documentation/100026/0104/smr1465219161191

Do we have a relevant performance test? For example, how many cycles does it take to call up_set_current_regs()/up_current_regs() 10,000 times with and without this PR?

hujun260 · 2024-09-18T02:35:58Z

>> The benefits are very significant. Before the modification, if we needed to obtain the interrupt status, it required three steps:
>>
>> 1 Obtain the CPU index
>> 2 Access the global variable
>> 3 Disable/Enable interrupts
>>
>> This process involved at least 6 CPU instructions.
>>
>> However, now it only requires a single CPU instruction.
>
> The IRQ disable/enable pair in up_interrupt_context() could actually be removed, since a 32-bit access is atomic on arm32 CPU cores.
>
> The cycle timing of the MCR instruction may add overhead: it requires 6 cycles in the worst case.
>
> https://developer.arm.com/documentation/100026/0104/smr1465219161191
>
> Do we have a relevant performance test? For example, how many cycles does it take to call up_set_current_regs()/up_current_regs() 10,000 times with and without this PR?

First, IRQ masking cannot be removed here, because we must ensure that the current task is not rescheduled after the CPU index is acquired. Otherwise, the CPU index may no longer correspond to the CPU on which the current task is running, leading to logic errors.
The implementation of this_task follows the same principle.
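
The migration hazard described above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the `demo_` names are hypothetical, and a task moving to another CPU is modeled by a preemption point that changes `demo_cpu` unless interrupts are masked.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NCPUS 2

static int  demo_cpu;                             /* CPU the task runs on  */
static bool demo_irq_masked;                      /* true while IRQs masked */
static int  demo_percpu[DEMO_NCPUS] = {100, 200}; /* per-CPU data          */

/* Hypothetical preemption point: the scheduler moves the task to the
 * other CPU unless interrupts are masked.
 */

static void demo_preempt_point(void)
{
  if (!demo_irq_masked)
    {
      demo_cpu = (demo_cpu + 1) % DEMO_NCPUS;
    }
}

/* Unprotected: the index can go stale before it is used. */

static int demo_read_unprotected(void)
{
  int cpu = demo_cpu;

  demo_preempt_point();          /* the task may migrate here */
  return demo_percpu[cpu];
}

/* Protected: index and access happen inside one critical section. */

static int demo_read_protected(void)
{
  bool saved = demo_irq_masked;
  int cpu;
  int value;

  demo_irq_masked = true;        /* enter the critical section   */
  cpu = demo_cpu;
  demo_preempt_point();          /* migration is suppressed here */
  value = demo_percpu[cpu];
  demo_irq_masked = saved;       /* leave the critical section   */
  return value;
}

/* True when the value belongs to the CPU the task is on right now. */

static bool demo_matches_current_cpu(int value)
{
  assert(demo_cpu >= 0 && demo_cpu < DEMO_NCPUS);
  return value == demo_percpu[demo_cpu];
}
```

In this model the unprotected read returns the data of the CPU the task has just left, while the protected read always matches the CPU the task is still on.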

The current implementation needs at least 3 executions of msr/mrs instructions plus 4 normal instructions, so the benefit of this optimization is evident. After the optimization, only a single mrs instruction is needed, with no additional overhead.

Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.

anchao · 2024-09-18T03:16:45Z

> The current implementation needs at least 3 executions of msr/mrs instructions plus 4 normal instructions, so the benefit of this optimization is evident. After the optimization, only a single mrs instruction is needed, with no additional overhead.

Have you checked the assembly code? The previous implementation only reads the Affinity ID through MRC (1 cycle) and does not call MCR. Your implementation now uses the MCR instruction to save the current regs, with higher overhead than before.

> Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
> Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.

Could you provide a performance diagram before and after adding this commit, or an API-level test? Why do you conclude that the performance is better than before without any test?

My concern is that MCR may perform worse than the current code, but I'm not sure. Could you help confirm this?

anchao · 2024-09-18T03:23:29Z

> First, IRQ masking cannot be removed here, because we must ensure that the current task is not rescheduled after the CPU index is acquired. Otherwise, the CPU index may no longer correspond to the CPU on which the current task is running, leading to logic errors. The implementation of this_task follows the same principle.

Yes, you are right. I forgot SMP mode, which does require disabling interrupts. I am currently using AMP/BMP mode, where the performance is much higher than in SMP.

hujun260 · 2024-09-18T09:13:50Z

>> The current implementation needs at least 3 executions of msr/mrs instructions plus 4 normal instructions, so the benefit of this optimization is evident. After the optimization, only a single mrs instruction is needed, with no additional overhead.
>
> Have you checked the assembly code? The previous implementation only reads the Affinity ID through MRC (1 cycle) and does not call MCR. Your implementation now uses the MCR instruction to save the current regs, with higher overhead than before.
>
>> Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
>> Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.
>
> Could you provide a performance diagram before and after adding this commit, or an API-level test? Why do you conclude that the performance is better than before without any test?
>
> My concern is that MCR may perform worse than the current code, but I'm not sure. Could you help confirm this?

I ran a test on the armv7-a arch, 200 million cycles:

before:
up_current_regs + irq enable/disable: 2884 ms
up_set_current_regs: 1358 ms

after:
up_current_regs: 1017 ms
up_set_current_regs: 339 ms
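
To put the timings above in relative terms, the following helpers compute a speedup factor and an average per-call cost. This is only illustrative arithmetic: the `bench_` names are hypothetical, and the per-call figure assumes that "200 million cycles" means 200 million calls, which the comment does not state explicitly.

```c
#include <assert.h>

/* Average cost of one call in nanoseconds, given the total run time in
 * milliseconds and the number of calls (assumed, see the text above).
 */

static double bench_ns_per_call(double total_ms, double calls)
{
  assert(calls > 0.0);
  return total_ms * 1.0e6 / calls;
}

/* Speedup factor: how many times faster "after" is than "before". */

static double bench_speedup(double before_ms, double after_ms)
{
  assert(after_ms > 0.0);
  return before_ms / after_ms;
}
```

With the reported figures, bench_speedup(2884, 1017) is about 2.8 for up_current_regs and bench_speedup(1358, 339) is about 4.0 for up_set_current_regs. Under the call-count assumption, up_current_regs drops from roughly 14.4 ns to roughly 5.1 ns per call.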

This is a continuation of #13486. Discussion here: #13486 (comment)
1. move cp15.h to arch public
2. replace cp15 instructions with macros to align operations
3. add memory barrier to avoid compiler optimization
Signed-off-by: chao an <anchao@lixiang.com>

acassis requested review from anchao and yf13 September 16, 2024 14:25

xiaoxiang781216 mentioned this pull request Sep 16, 2024

Revert "irq: add [enter|leave]_critical_section_nonirq" #13485

Merged

xiaoxiang781216 reviewed Sep 16, 2024

arch/arm/include/tlsr82/irq.h

arch/arm/include/armv8-m/irq.h (outdated)

pussuw reviewed Sep 16, 2024

arch/arm64/src/common/arm64_arch.h (outdated)

hujun260 force-pushed the apache_5 branch from 4cffe8a to 2592a03 September 18, 2024 01:16

hujun260 force-pushed the apache_5 branch from 2592a03 to 0129a86 September 18, 2024 01:25

anchao requested changes Sep 18, 2024

arch/arm/include/armv6-m/irq.h

arch/arm/include/armv6-m/irq.h

arch/arm/include/armv7-a/irq.h

anchao requested changes Sep 18, 2024

arch/arm/include/armv7-a/irq.h

hujun260 force-pushed the apache_5 branch from 0129a86 to 370538f September 18, 2024 09:14

hujun260 added 3 commits September 18, 2024 19:08

irq: use per-cpu reg to replace g_current_regs

341b58f

Signed-off-by: hujun5 <hujun5@xiaomi.com>

arch: move up_interrupt_context to arch specific irq.h

5b1700b

Signed-off-by: hujun5 <hujun5@xiaomi.com>

arm: optimize up_interrupt_context used in armv[6/7/8]-m

9f45b97

reason: using per-CPU storage for g_current_regs or leveraging interrupt status registers to determine if code is running within an interrupt context can enhance performance.
Signed-off-by: hujun5 <hujun5@xiaomi.com>

hujun260 force-pushed the apache_5 branch from 370538f to 9f45b97 September 18, 2024 11:14

anchao approved these changes Sep 19, 2024

anchao merged commit 0561b55 into apache:master Sep 19, 2024

anchao mentioned this pull request Sep 19, 2024

arm/cortex-a,r: replace cp15 instruct to macros to align operation #13529

Merged

hujun260 deleted the apache_5 branch September 29, 2024 03:28
