Added memory health checks. #6631

fjpanag · 2022-07-16T14:38:25Z

Summary

The PR #5368 removed the memory checks from the idle thread, due to issues with priority inheritance.

This is an attempt to restore these checks, this time running in a worker.

If this feature is enabled, a worker will be started during system boot, that performs the following checks periodically:

Checks the stacks of all threads for any possible overflow.
Checks the heap for any corruption.

Impact

None on existing systems.

Testing

I did a quick test on a custom board running on an STM32F427 MCU.
The check is started and executed as expected.

Note, although this code is tested to be working, it is marked as a draft.

I am not very sure about the correct structure of the code, and if it adheres to NuttX standards.
Please review it, and provide me with some feedback.

Mostly, is it correct to create this new dir mm_check, or does the code belong anywhere else?
Is the worker started at the correct place?
Any naming issues?
Etc...

xiaoxiang781216 · 2022-07-16T16:22:36Z

@fjpanag use the check here should be enough:
https://github.com/apache/incubator-nuttx/blob/master/sched/irq/irq_dispatch.c#L177-L183

pkarashchenko · 2022-07-16T16:31:35Z

mm/mm_check/mm_check.c

+ * Public Functions
+ ****************************************************************************/
+
+void mm_check_init()


Suggested change

void mm_check_init()

void mm_check_init(void)

pkarashchenko · 2022-07-16T16:31:49Z

mm/mm_check/mm_check.c

+  kmm_checkcorruption();
+
+  work_queue(LPWORK, &work_q, mm_check_worker, NULL,
+             (CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC));


Suggested change

(CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC));

CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC);

pkarashchenko · 2022-07-16T16:32:04Z

mm/mm_check/mm_check.c

+ ****************************************************************************/
+
+#ifndef CONFIG_SCHED_LPWORK
+#error "Low priority work queue is required for the memory health checks."


Suggested change

#error "Low priority work queue is required for the memory health checks."

# error "Low priority work queue is required for the memory health checks."

pkarashchenko · 2022-07-16T16:32:24Z

mm/mm_check/mm_check.c

+#include <nuttx/wqueue.h>
+#include <nuttx/arch.h>
+#include <nuttx/compiler.h>
+#include <nuttx/config.h>


Please make this a first include

acassis · 2022-07-16T20:04:34Z

Hi @fjpanag, you forgot to comment this feature shouldn't be used on production. Suggestion: keep it dependent on CONFIG_DEBUG_MM

fjpanag · 2022-07-16T21:04:59Z

@fjpanag use the check here should be enough: https://github.com/apache/incubator-nuttx/blob/master/sched/irq/irq_dispatch.c#L177-L183

I wasn't aware that this check is also present there.

Shall I drop this PR?

ALTracer · 2022-07-17T19:44:30Z

I wasn't aware that this check is also present there.

Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase).
It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

xiaoxiang781216 · 2022-07-18T02:39:44Z

I wasn't aware that this check is also present there.
Shall I drop this PR?

The check in irq_dispatch() only calls kmm_checkforcorruption() on IRQ exits to check heaps for TCBs marked with TCB_FLAG_MEM_CHECK (and there are no mentions of the flag whatsoever in the master codebase). It also build-depends on CONFIG_DEBUG_MM, as @acassis suggests for this checker as well.

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

I see the PR as still useful because it also checks stacks, and it doesn't require setting a tcb flag. Yes, nsh> ps can show stack usage in interactive shell, though it requires STACK_COLORATION and provides virtually a very slow sample rate, as does stackmonitor. However, it might conflict / do double work with irq_dispatch. The good thing is we're decoupling memory checks from IDLE thread which used to cause problems in #5266, and that issue/PR mentioned something about moving checks to LPWORK thread (half a year ago).

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

@fjpanag How many processes/threads (and pthreads) were running on your STM32F427 board? Did you enable DEBUG_MM as well? Can you somehow verify that there won't be deadlock problems with semaphores of mm? Was there any external (FMC SDRAM) memory attached to system heaps?

ALTracer · 2022-07-18T09:32:18Z

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM?
LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps)
And there's always KAsan, which is designed to catch heap access errors with its guard pages.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

xiaoxiang781216 · 2022-07-18T09:50:40Z

We can refine how to trigger the check. But irq_dispatch is the best place to get the reliable result since if kmm_checkforcorruption report the error, we can ascertain that the interrupted thread corrupt the memory. On the another hand, even LP work detect the memory corruption, it's still hard to identify which thread corrupt the memory.

Okay, can someone launch SEGGER SystemView and verify the board spends a reasonable amount of time checking all the heaps on every context switch? What if a heap is located in slow external SDRAM?

It isn't fast as expect, that's why we add the per TCB filter capability. One approach is do the check when the next thread is idle which give the same check as before.

LPWORK thread, on the other hand, runs once a second or slower. I'd change the Kconfig setting from seconds to milliseconds, though. (Or microseconds, to be consistent with other apps) And there's always KAsan, which is designed to catch heap access errors with its guard pages.

If you like, the same policy can be inserted in irq_dispatch.

It's better to enable STACK_CANARIES, ARMV8M_STACKCHECK_HARDWARE or ARMV[7|8]M_STACKCHECK, since they can report the stack overflow immediately.

Yes, on armv8-m stackcheck_hardware is essentially free. However, on armv6/7-m it reserves a register (r10), requires userspace be compiled with additional CFLAGS, and checks stacks on function exits (not on context switch, which might be more often) IIRC. Again, that doesn't let us set a safety margin of e.g. 90%, like in this PR.

You can try STACK_CANARIES for arm6/7-m which is fast and lightweight:
https://lwn.net/Articles/584225/
https://www.redhat.com/en/blog/security-technologies-stack-smashing-protection-stackguard

xiaoxiang781216 · 2022-07-31T14:16:12Z

@fjpanag @ALTracer
#6724 provide the method to enable the heap check from nsh.
So, you can do:

Check the heap corruption issue by:
a. KSan https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L181-L187
b. Per-thread heap check(add heapcheck(memcheck) flag #6724) by:
echo 1 > /proc/xxx/heapcheck
c. Track the allocator(pid and backtrace) by MM_BACKTRACE:
https://github.com/apache/incubator-nuttx/blob/master/mm/Kconfig#L189-L200
and memdump command from:
https://github.com/apache/incubator-nuttx-apps/blob/master/nshlib/nsh_mmcmds.c#L53-L64
Check the stack corruption issue by:
a. The stack coloration:
https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1869-L1878
b. The stack canaries:
https://github.com/apache/incubator-nuttx/blob/master/Kconfig#L1880-L1892
c. The stack hw/sw check:
https://github.com/apache/incubator-nuttx/blob/master/arch/arm/src/armv8-m/Kconfig#L108-L146

The infrastructure is enough to cover the most debug case.

xiaoxiang781216 · 2022-12-02T15:11:09Z

@fjpanag do you want to improve this patch?

fjpanag · 2022-12-03T13:34:04Z

@fjpanag do you want to improve this patch?

Yes yes, I will.

I am struggling to find some free time, first for the TCP connections, and then for this.

I will try my best to finish this soon.

linguini1 · 2025-09-25T20:25:23Z

Hi @fjpanag, it's been almost 3 years since this PR has seen any activity. Are you still working on this or can it be closed?

Added memory health checks.

a6d7080

pkarashchenko reviewed Jul 16, 2022

View reviewed changes

	(CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC));
	CONFIG_MM_CHECK_PERIOD * CLOCKS_PER_SEC);

	#error "Low priority work queue is required for the memory health checks."
	# error "Low priority work queue is required for the memory health checks."

Added memory health checks. #6631

Are you sure you want to change the base?

Added memory health checks. #6631

Conversation

fjpanag commented Jul 16, 2022

Summary

Impact

Testing

Uh oh!

xiaoxiang781216 commented Jul 16, 2022

Uh oh!

pkarashchenko Jul 16, 2022

Choose a reason for hiding this comment

Uh oh!

pkarashchenko Jul 16, 2022

Choose a reason for hiding this comment

Uh oh!

pkarashchenko Jul 16, 2022

Choose a reason for hiding this comment

Uh oh!

pkarashchenko Jul 16, 2022

Choose a reason for hiding this comment

Uh oh!

acassis commented Jul 16, 2022

Uh oh!

fjpanag commented Jul 16, 2022

Uh oh!

ALTracer commented Jul 17, 2022

Uh oh!

xiaoxiang781216 commented Jul 18, 2022

Uh oh!

ALTracer commented Jul 18, 2022

Uh oh!

xiaoxiang781216 commented Jul 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xiaoxiang781216 commented Jul 31, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xiaoxiang781216 commented Dec 2, 2022

Uh oh!

fjpanag commented Dec 3, 2022

Uh oh!

linguini1 commented Sep 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

xiaoxiang781216 commented Jul 18, 2022 •

edited

Loading

xiaoxiang781216 commented Jul 31, 2022 •

edited

Loading