Skip to content

Conversation

@hujun260
Copy link
Contributor

@hujun260 hujun260 commented Sep 16, 2024

Summary

using percpu storage for g_current_regs or leveraging interrupt status
registers to determine if code is running within an interrupt context can enhance performance.

The benefits are very significant.
Before the modification, if we needed to obtain the interrupt status, it required three steps:

1 Obtain the CPU index
2 Access the global variable
4 Disable/Enable interrupts
This process involved at least 7 CPU instructions.

However, now it only requires a single CPU instruction.

Impact

none

Testing

ostest

@yf13
Copy link
Contributor

yf13 commented Sep 16, 2024

@hujun260, Can you enrich the testing section of PR description with checked targets (with configs) and observed performance changes?

@hujun260
Copy link
Contributor Author

hujun260 commented Sep 18, 2024

@hujun260, Can you enrich the testing section of PR description with checked targets (with configs) and observed performance changes?

The benefits are very significant.
Before the modification, if we needed to obtain the interrupt status, it required three steps:

1 Obtain the CPU index
2 Access the global variable
4 Disable/Enable interrupts
This process involved at least 7 CPU instructions.

However, now it only requires a single CPU instruction.

@anchao
Copy link
Contributor

anchao commented Sep 18, 2024

The benefits are very significant. Before the modification, if we needed to obtain the interrupt status, it required three steps:

1 Obtain the CPU index 2 Access the global variable 4 Disable/Enable interrupts This process involved at least 6 CPU instructions.

However, now it only requires a single CPU instruction.

  1. The switch irq enable/disable in up_interrupt_context() could be removed actually, as 32-bit is atomic type on arm32 CPU core
  2. The instructions cycle timings of MCR may bring more overhead, requiring 6 cycles in the worst case

image

https://developer.arm.com/documentation/100026/0104/smr1465219161191

Do we have relevant performance test? For example, how many cycles does it take to call up_set_current_regs()/up_current_regs() 10,000 times with/out this PR?

@hujun260
Copy link
Contributor Author

hujun260 commented Sep 18, 2024

The benefits are very significant. Before the modification, if we needed to obtain the interrupt status, it required three steps:
1 Obtain the CPU index 2 Access the global variable 4 Disable/Enable interrupts This process involved at least 6 CPU instructions.
However, now it only requires a single CPU instruction.

  1. The switch irq enable/disable in up_interrupt_context() could be removed actually, as 32-bit is atomic type on arm32 CPU core
  2. The instructions cycle timings of MCR may bring more overhead, requiring 6 cycles in the worst case

image

https://developer.arm.com/documentation/100026/0104/smr1465219161191

Do we have relevant performance test? For example, how many cycles does it take to call up_set_current_regs()/up_current_regs() 10,000 times with/out this PR?

Firstly, irq masking cannot be removed here due to the crucial reason that we must ensure no scheduling occurs for the current task
after the cpuindex is acquired. Otherwise, the cpuindex will not correspond to the CPU where the current task resides, leading to logical errors.
The implementation of this_task follows a similar principle.

The current implementation need at least 3 executions of msr/mrs instructions plus 4 normal instructions, making this optimization evident. After optimization, only a single mrs instruction is needed, with no additional overhead.

Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.

@anchao
Copy link
Contributor

anchao commented Sep 18, 2024

The current implementation need at least 3 executions of msr/mrs instructions plus 4 normal instructions, making this optimization evident. After optimization, only a single mrs instruction is needed, with no additional overhead.

Have you checked the assembly code? The previous implementation only reads the Affinity ID through MRC (1 cycle) and does not call MCR. Now your implementation uses MCR instruction to save the current regs with higher overhead than before.

Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.

Could you provide performance diagram before and after adding this commit? Or API level test? Why you conclude that the performance is better than before without any test?

My concern is that MCR may perform worse than it currently does, but I'm not sure, could you help confirm this?

@anchao
Copy link
Contributor

anchao commented Sep 18, 2024

Firstly, irq masking cannot be removed here due to the crucial reason that we must ensure no scheduling occurs for the current task after the cpuindex is acquired. Otherwise, the cpuindex will not correspond to the CPU where the current task resides, leading to logical errors. The implementation of this_task follows a similar principle.

Yes, you are right, I forgot SMP mode, which does require disabling interrupts. I am currently using AMP/BMP mode, the performance is much higher than SMP

@hujun260
Copy link
Contributor Author

hujun260 commented Sep 18, 2024

The current implementation need at least 3 executions of msr/mrs instructions plus 4 normal instructions, making this optimization evident. After optimization, only a single mrs instruction is needed, with no additional overhead.

Have you checked the assembly code? The previous implementation only reads the Affinity ID through MRC (1 cycle) and does not call MCR. Now your implementation uses MCR instruction to save the current regs with higher overhead than before.

Unfortunately, we haven't conducted tests specifically for this single optimization point alone.
Instead, we've tested the entire message sending/receiving process, and each test incorporates multiple optimization points.

Could you provide performance diagram before and after adding this commit? Or API level test? Why you conclude that the performance is better than before without any test?

My concern is that MCR may perform worse than it currently does, but I'm not sure, could you help confirm this?

I did a test, in armv7-a arch, 200 million cycles

before:
up_current_regs + irq enable/disable
2884 ms
up_set_current_regs
1358 ms


after:
up_current_regs
1017 ms
up_set_current_regs
339 ms

Signed-off-by: hujun5 <hujun5@xiaomi.com>
Signed-off-by: hujun5 <hujun5@xiaomi.com>
resson:
using percpu storage for g_current_regs or leveraging interrupt status
registers to determine if code is running within an interrupt context can enhance performance.

Signed-off-by: hujun5 <hujun5@xiaomi.com>
@anchao anchao merged commit 0561b55 into apache:master Sep 19, 2024
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
anchao added a commit to anchao/nuttx that referenced this pull request Sep 19, 2024
This is continue work of apache#13486

Discussion here:
apache#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
xiaoxiang781216 pushed a commit that referenced this pull request Sep 20, 2024
This is continue work of #13486

Discussion here:
#13486 (comment)

1. move cp15.h to arch public
2. replace cp15 instruct to macros to align operation
3. add memory barrier to avoid compiler optimization

Signed-off-by: chao an <anchao@lixiang.com>
@hujun260 hujun260 deleted the apache_5 branch September 29, 2024 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants