Skip to content

Conversation

@GUIDINGLI
Copy link
Contributor

@GUIDINGLI GUIDINGLI commented Feb 15, 2022

Summary

mm: handle take mm sem in IRQ

In #5368 we remove the heap check in idle thread for performance, and don't allow user call sem_trywait in IRQ context.

But there is also have a debug feature in sched/irq/irq_dispatch.c, when tcb's flag TCB_FLAG_MEM_CHECK setted.
It also need call mm_takesemaphore -> sem_trywait in IRQ context.

So I used another way to handle this, that's is the patch.

Impact

mm take sem in IRQ

Testing

VELA

Signed-off-by: ligd <liguiding1@xiaomi.com>
Copy link
Contributor

@pkarashchenko pkarashchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@acassis acassis merged commit 0169a51 into apache:master Feb 16, 2022
@masayuki2009
Copy link
Contributor

@GUIDINGLI
I noticed that stress tests such as the nxplayer plus telnetd with sprense:wifi_smp failed with this PR.

NuttShell (NSH) NuttX-10.2.0
nsh> uname -a
NuttX  10.2.0 0169a51220 Feb 18 2022 09:45:37 arm spresense
nsh> ps
  PID GROUP CPU PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0     0   0   0 FIFO     Kthread N-- Assigned           00000000 001000 000488  48.8%  CPU0 IDLE
    1     1   1   0 FIFO     Kthread N-- Running            00000000 001000 000228  22.8%  CPU1 IDLE
    2     2 --- 224 RR       Kthread --- Waiting  Signal    00000000 002016 000468  23.2%  hpwork 0x2d05d388
    3     3 ---  60 RR       Kthread --- Waiting  Semaphore 00000000 002016 000292  14.4%  lpwork 0x2d05d394
    5     5 --- 200 RR       Task    --- Waiting  MQ empty  00000000 001000 000496  49.6%  cxd56_pm_task
    6     6   0 100 RR       Task    --- Running            00000000 003048 001188  38.9%  spresense_main
nsh> free
                   total       used       free    largest  nused  nfree
        Umem:    1176144      42240    1133904    1131840    111      2
nsh> mount
  /mnt/spif type smartfs
  /proc type procfs
nsh> ifconfig
wlan0	Link encap:Ethernet HWaddr 00:00:00:00:00:00 at UP
	inet addr:0.0.0.0 DRaddr:0.0.0.0 Mask:0.0.0.0

nsh> gs2200m raspi3-g wifi-test-24g & 
gs2200m [7:50]
nsh> renew wlan0
nsh> ntpcstart
Started the NTP daemon as PID=24
nsh> ifconfig
wlan0	Link encap:Ethernet HWaddr 3c:95:09:00:89:96 at UP
	inet addr:192.168.10.22 DRaddr:192.168.10.1 Mask:255.255.255.0

nsh> mount
  /mnt/sd0 type vfat
  /mnt/spif type smartfs
  /proc type procfs
nsh> telnetd
nsh> webserver &
webserver [26:100]
nsh> Starting webserver
date
Fri, Feb 18 00:46:43 2022
nsh> nxplayer
NxPlayer version 1.05
h for commands, q to exit

nxplayer> play http://192.168.10.11/~ishikawa/audio/01-Technopolis-48k.wav
nxplayer> [   18.868085] [CPU1] up_assert: Assertion failed CPU1 at file:mm_heap/mm_foreach.c line: 88 task: Telnet session
[   18.875226] [CPU0] arm_registerdump: R0: 00000001 R1: 2d070570 R2: 041ac000  R3: 00000000
[   18.883374] [CPU0] arm_registerdump: R4: 2d0704f0 R5: 2d05fea8 R6: 2d072498  FP: 2d070570
[   18.891522] [CPU0] arm_registerdump: R8: 2d072518 SB: 2d060d98 SL: 00000048 R11: 0d049b10
[   18.899671] [CPU0] arm_registerdump: IP: 00000000 SP: 2d072498 LR: 0d00a275  PC: 0d00a900
[   18.907849] [CPU0] arm_registerdump: xPSR: 61000000 BASEPRI: 000000e0 CONTROL: 00000004
[   18.915814] [CPU0] arm_registerdump: EXC_RETURN: ffffffe9
[   18.921216] [CPU0] arm_dump_stack: IRQ Stack:
[   18.925549] [CPU0] arm_dump_stack: sp:     2d072498
[   18.930401] [CPU0] arm_dump_stack:   base: 2d05bf08
[   18.935284] [CPU0] arm_dump_stack:   size: 00000800
[   18.940136] [CPU0] arm_dump_stack: ERROR: IRQ Stack pointer is not within the stack
[   18.947765] [CPU0] arm_dump_stack: User Stack:
[   18.952190] [CPU0] arm_dump_stack: sp:     2d072498
[   18.957042] [CPU0] arm_dump_stack:   base: 2d071f20
[   18.961925] [CPU0] arm_dump_stack:   size: 000007e8
[   18.966777] [CPU0] arm_stackdump: 2d072480: 2d05fea8 2d0704f0 2d05fea8 2d072498 2d070570 0d00a1d1 00000001 00000004
[   18.977184] [CPU0] arm_stackdump: 2d0724a0: 0d00a900 00000000 00000000 2d05bf08 00000000 2d096fc8 2d060c50 0d00935d
[   18.987620] [CPU0] arm_stackdump: 2d0724c0: 2d0967b8 0d0077a1 00000810 0d009519 2d072518 2d060c50 2d06fdc8 2d072510
[   18.998027] [CPU0] arm_stackdump: 2d0724e0: 2d095188 0d0093fd 0d0093d9 000003b8 2d071e68 0d00e1fb 2d072510 0d049afd
[   19.008433] [CPU0] arm_stackdump: 2d072500: 0d049af7 0d049b02 0d049afc 0d049af6 00000000 00000000 00000000 0000000a
[   19.018870] [CPU0] arm_stackdump: 2d072520: 00000104 000e9030 00034cd8 000ea570 00000018 2d095140 2d09619c 00000000
[   19.029276] [CPU0] arm_stackdump: 2d072540: 2d095140 00000003 00000000 0d058b0e 00000000 0d03438b 00000003 00000400
[   19.039683] [CPU0] arm_stackdump: 2d072560: 00000400 2d071e68 2d095d30 0d034395 2d095d30 0d02f239 2d095d30 2d09619c
[   19.050089] [CPU0] arm_stackdump: 2d072580: 0d059749 2d095d30 2d07260c 00000000 ffffffff 00000001 00000000 00000004
[   19.060526] [CPU0] arm_stackdump: 2d0725a0: 00000000 0d02b91d 00000000 2d070570 2d05d76c 0d004cb5 00000001 0d004e11
[   19.070932] [CPU0] arm_stackdump: 2d0725c0: 00000001 0d004e11 2d05ff0d 2d05ff0c 2d05ff0d 2d095d30 2d09619c 00000001
[   19.081338] [CPU0] arm_stackdump: 2d0725e0: 00000000 00000000 00000000 00000004 00000000 0d02c3d7 00000000 0d004e11
[   19.091775] [CPU0] arm_stackdump: 2d072600: 2d05ff0d 2d0961a1 00000000 2d09619c 00000000 00000000 00000000 00000000
[   19.102181] [CPU0] arm_stackdump: 2d072620: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   19.112588] [CPU0] arm_stackdump: 2d072640: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   19.123024] [CPU0] arm_stackdump: 2d072660: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   19.133431] [CPU0] arm_stackdump: 2d072680: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   19.143837] [CPU0] arm_stackdump: 2d0726a0: 00000000 00000000 2d09619c 2d095d30 2d09619c 0d058b86 00000000 00000000
[   19.154243] [CPU0] arm_stackdump: 2d0726c0: 00000000 00000000 00000000 0d02feb5 00000001 2d071f08 0d02fe39 00000101
[   19.164680] [CPU0] arm_stackdump: 2d0726e0: 00000000 0d0079d3 00000001 2d071f08 2d071f08 0d005647 00000000 00000000
[   19.175117] [CPU0] arm_showtasks:    PID    PRI      USED     STACK   FILLED    COMMAND
[   19.183082] [CPU0] arm_showtasks:   ----   ----       264      2048    12.8%    irq
[   19.190864] [CPU1] arm_dump_task:      0      0       488      1000    48.8%    CPU0 IDLE
[   19.198920] [CPU1] arm_dump_task:      1      0       228      1000    22.8%    CPU1 IDLE
[   19.207099] [CPU1] arm_dump_task:      2    224       644      2016    31.9%    hpwork
[   19.215003] [CPU1] arm_dump_task:      3     60       756      2016    37.5%    lpwork
[   19.222846] [CPU1] arm_dump_task:     36    100      1468      2024    72.5%    Telnet session
[   19.231451] [CPU1] arm_dump_task:      5    200       496      1000    49.6%    cxd56_pm_task
[   19.239996] [CPU1] arm_dump_task:      6    100      1580      3048    51.8%    spresense_main
[   19.248541] [CPU1] arm_dump_task:      7     50      1556      2000    77.8%    gs2200m
[   19.256506] [CPU1] arm_dump_task:     24    100      1764      1976    89.2%!   NTP daemon
[   19.264856] [CPU1] arm_dump_task:     25    100       624      2008    31.0%    Telnet daemon
[   19.273370] [CPU1] arm_dump_task:     26    100       628      2024    31.0%    webserver
[   19.281549] [CPU1] arm_dump_task:     27    100      1092      3048    35.8%    nxplayer
[   19.289636] [CPU1] arm_dump_task:     28    246       868      3072    28.2%    playthread
[   19.297784] [CPU1] arm_dump_task:     29    252       500      1024    48.8%    cxd56
[   19.305596] [CPU1] arm_dump_task:     30    100       556      1000    55.6%    telnet_io

The tests normally work for more than 10hrs.
Did you test with your SMP board?

@masayuki2009
Copy link
Contributor

@GUIDINGLI
The stress tests with spresense:wifi_smp have been working for 3hrs without this PR.
Shall we revert this PR until the SMP issue is fixed?

@GUIDINGLI
Copy link
Contributor Author

GUIDINGLI commented Feb 18, 2022

No, I don't think this crash caused by this PR.
For merge reason (my branch code base is no with b316611).
This PR is not reach my purpose.

There still have if (up_interrupt_context) branch at top of mm_takesemaphore().
In another word, this PR, only do the mm_givesemaphore() return when meet up_interrupt_context.
This has no harmful for the logic.

So, could you please run more test without this PR ? Guess will also have the problem.

76 bool mm_takesemaphore(FAR struct mm_heap_s *heap)
77 {
78 #if defined(CONFIG_BUILD_FLAT) || defined(__KERNEL__)
79   /* Check current environment */
80 
81   if (up_interrupt_context())                                // same with 105, will fix it later
82     {
83       /* Can't take semaphore in the interrupt handler */
84 
85       return false;
86     }
87   else
88 #endif
89 
90   /* getpid() returns the task ID of the task at the head of the ready-to-
91    * run task list.  mm_takesemaphore() may be called during context
92    * switches.  There are certain situations during context switching when
93    * the OS data structures are in flux and then can't be freed immediately
94    * (e.g. the running thread stack).
95    *
96    * This is handled by getpid() to return the special value -ESRCH to
97    * indicate this special situation.
98    */
99 
100   if (getpid() < 0)
101     {
102       return false;
103     }
104 #if defined(CONFIG_BUILD_FLAT) || defined(__KERNEL__)
105   else if (sched_idletask())
106     {
107       return false;
108     }
109   else if (up_interrupt_context())
110     {
111 #ifdef CONFIG_SMP
112       return false;
113 #else
114       int val;
115 
116       /* Check the semaphore value, if held by someone, then return false.
117        * Else, we can take it, return true.
118        */
119 
120       _SEM_GETVALUE(&heap->mm_semaphore, &val);
121 
122       return val > 0;
123 #endif

@masayuki2009
Copy link
Contributor

So, could you please run more test without this PR ? Guess will also have the problem.

@GUIDINGLI
OK, so you mean that there should be another issue ** without ** this PR.
My tests are still runing for 3.7hrs ** without ** this PR.

By the way, did you test your SMP board ** with ** this PR?
My tests are running nxplayer for http audio streaming and run several commands such as ps/free via telnet from Linux host.

@GUIDINGLI
Copy link
Contributor Author

By the way, did you test your SMP board ** with ** this PR? My tests are running nxplayer for http audio streaming and run several commands such as ps/free via telnet from Linux host.

Currently just test with NON-SMP board,
OK, I will find a SMP board and test it.

@GUIDINGLI
Copy link
Contributor Author

[ 18.868085] [CPU1] up_assert: Assertion failed CPU1 at file:mm_heap/mm_foreach.c line: 88

-- This seems like a memory corruption, someone overwrite ?
And, mm_foreach only called from mm_checkcorruption and mm_mallinfo, which function do you called ?
And, also, THIS PR only have effect with mm_takesemaphore() called from IRQ, do you have this situation ?

@masayuki2009
Copy link
Contributor

And, mm_foreach only called from mm_checkcorruption and mm_mallinfo, which function do you called ?

@GUIDINGLI
As I said above, my test executes free command via telnet.

Connection closed by foreign host.
Trying 192.168.10.22...
Connected to 192.168.10.22.
Escape character is '^]'.

NuttShell (NSH) NuttX-10.2.0
nsh> uname -a
NuttX  10.2.0 48f54853b2 Feb 18 2022 09:50:19 arm spresense
nsh> cat /proc/uptime
    750.77
nsh> ps
  PID GROUP CPU PRI POLICY   TYPE    NPX STATE    EVENT     SIGMASK   STACK   USED  FILLED COMMAND
    0     0   0   0 FIFO     Kthread N-- Running            00000000 001000 000488  48.8%  CPU0 IDLE
    1     1   1   0 FIFO     Kthread N-- Assigned           00000000 001000 000228  22.8%  CPU1 IDLE
    2     2 --- 224 RR       Kthread --- Waiting  Semaphore 00000000 002016 000644  31.9%  hpwork 0x2d05d398
    3     3 ---  60 RR       Kthread --- Waiting  Semaphore 00000000 002016 000844  41.8%  lpwork 0x2d05d3a4
    5     5 --- 200 RR       Task    --- Waiting  MQ empty  00000000 001000 000496  49.6%  cxd56_pm_task
    6     6 --- 100 RR       Task    --- Waiting  Semaphore 00000000 003048 001604  52.6%  spresense_main
    7     7 ---  50 RR       Task    --- Waiting  Semaphore 00000000 002000 001556  77.8%  gs2200m raspi3-g wifi-test-24g
   24    24 --- 100 RR       Task    --- Waiting  Signal    00000000 001976 001764  89.2%! NTP daemon 0.pool.ntp.org;1.pool.ntp.org;
   25    25 --- 100 RR       Task    --- Waiting  Semaphore 00000000 002008 000636  31.6%  Telnet daemon 0x2d06bca0
   26    26 --- 100 RR       Task    --- Waiting  Semaphore 00000000 002024 000628  31.0%  webserver
   27    27 --- 100 RR       Task    --- Waiting  Semaphore 00000000 003048 001092  35.8%  nxplayer
 2236    27 --- 246 RR       pthread --- Waiting  Semaphore 00000000 003072 000860  27.9%  playthread 0x2d06fbf0
 2237    27 --- 252 RR       pthread --- Waiting  MQ empty  00000000 001024 000500  48.8%  cxd56 0x2d069ae0
   30    30 --- 100 RR       Kthread --- Waiting  Semaphore 00000000 001000 000556  55.6%  telnet_io
 3327  3327   1 100 RR       Task    --- Running            00000000 002024 001468  72.5%  Telnet session
nsh> free
                   total       used       free    largest  nused  nfree
        Umem:    1176144     209632     966512     956864    256     13
nsh> ifconfig
wlan0	Link encap:Ethernet HWaddr 3c:95:09:00:89:96 at UP
	inet addr:192.168.10.22 DRaddr:192.168.10.1 Mask:255.255.255.0

nsh> exit

@xiaoxiang781216
Copy link
Contributor

@masayuki2009 could you try this patch: #5539?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants