Skip to content

Conversation

@fjpanag
Copy link
Contributor

@fjpanag fjpanag commented Jul 16, 2022

Summary

The PR #5368 removed the memory checks from the idle thread, due to issues with priority inheritance.

This is an attempt to restore these checks, this time running in a worker.

If this feature is enabled, a worker will be started during system boot, that performs the following checks periodically:

  • Checks the stacks of all threads for any possible overflow.
  • Checks the heap for any corruption.

Impact

None on existing systems.

Testing

I did a quick test on a custom board running on an STM32F427 MCU.
The check is started and executed as expected.


Note, although this code is tested to be working, it is marked as a draft.

I am not very sure about the correct structure of the code, and if it adheres to NuttX standards.
Please review it, and provide me with some feedback.

Mostly, is it correct to create this new dir mm_check, or does the code belong anywhere else?
Is the worker started at the correct place?
Any naming issues?
Etc...

@xiaoxiang781216
Copy link
Contributor

* Public Functions
****************************************************************************/

void mm_check_init()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void mm_check_init()
void mm_check_init(void)

kmm_checkcorruption();

work_queue(LPWORK, &work_q, mm_check_worker, NULL,
(CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC));
CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC);

****************************************************************************/

#ifndef CONFIG_SCHED_LPWORK
#error "Low priority work queue is required for the memory health checks."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#error "Low priority work queue is required for the memory health checks."
# error "Low priority work queue is required for the memory health checks."

#include <nuttx/wqueue.h>
#include <nuttx/arch.h>
#include <nuttx/compiler.h>
#include <nuttx/config.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a first include

@acassis
Copy link
Contributor

acassis commented Jul 16, 2022

Hi @fjpanag, you forgot to comment this feature shouldn't be used on production. Suggestion: keep it dependent on CONFIG_DEBUG_MM

@fjpanag
Copy link
Contributor Author

fjpanag commented Jul 16, 2022

@fjpanag use the check here should be enough: https://github.com/apache/incubator-nuttx/blob/master/sched/irq/irq_dispatch.c#L177-L183

I wasn't aware that this check is also present there.

Shall I drop this PR?

@ALTracer
Copy link
Contributor

I wasn't aware that this check is also present there.

Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase).
It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

@xiaoxiang781216
Copy link
Contributor

I wasn't aware that this check is also present there.
Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase). It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

@ALTracer
Copy link
Contributor

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM?
LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps)
And there's always KAsan, which is designed to catch heap access errors with its guard pages.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Jul 18, 2022

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM?

It isn't fast as expect, that's why we add the per TCB filter capability. One approach is do the check when the next thread is idle which give the same check as before.

LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps) And there's always KAsan, which is designed to catch heap access errors with its guard pages.

If you like, the same policy can be inserted in irq_dispatch.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

You can try STACK_CANARIES for arm6/7-m which is fast and lightweight:
https://lwn.net/Articles/584225/
https://www.redhat.com/en/blog/security-technologies-stack-smashing-protection-stackguard

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Jul 31, 2022

@fjpanag @ALTracer
#6724 provide the method to enable the heap check from nsh.
So, you can do:

  1. Check the heap corruption issue by:
    a. KSan https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L181-L187
    b. Per-thread heap check(add heapcheck(memcheck) flag #6724) by:
    echo 1 > /proc/xxx/heapcheck
    c. Track the allocator(pid and backtrace) by MM_BACKTRACE:
    https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L189-L200
    and memdump command from:
    https://github.com/apache/incubator-nuttx-apps/blob/master/nshlib/nsh_mmcmds.c#L53-L64
  2. Check the stack corruption issue by:
    a. The stack coloration:
    https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1869-L1878
    b. The stack canaries:
    https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1880-L1892
    c. The stack hw/sw check:
    https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/armv8-m/Kconfig#L108-L146

The infrastructure is enough to cover the most debug case.

@xiaoxiang781216
Copy link
Contributor

@fjpanag do you want to improve this patch?

@fjpanag
Copy link
Contributor Author

fjpanag commented Dec 3, 2022

@fjpanag do you want to improve this patch?

Yes yes, I will.

I am struggling to find some free time, first for the TCP connections, and then for this.

I will try my best to finish this soon.

@linguini1
Copy link
Contributor

Hi @fjpanag, it's been almost 3 years since this PR has seen any activity. Are you still working on this or can it be closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants