Skip to content

Conversation

@Donny9
Copy link
Contributor

@Donny9 Donny9 commented Oct 20, 2024

Summary

Revert "fs/inode: add pre-allocated task files to avoid allocator access

This reverts commit 086834aae1722af2797fd5d4d0f79226ac11c2e9.

The refs in the filelist cannot protect the handles in the prefiles The prefiles and its associated group are allocated together, and when the group leaves, they are released together.
During a reboot process, the sync operation may encounter a "used after free" issue.

issue stack:

(filep=filep@entry=0x3c482770) at ../../../fs/vfs/fs_close.c:80
../../../fs/vfs/fs_close.c:118
../../../fs/inode/fs_files.c:476
/home/cibuild/Public/jenkinsversion/2148/nuttx/include/nuttx/fs/fs.h:870
at ../../../fs/inode/fs_files.c:223
<task_fssync(tcb_s*, void*)>, arg=0x0) at

../../../sched/sched/sched_foreach.c:73
data=) at ../../../fs/fs_initialize.c:48
../../../sched/misc/reboot_notifier.c:87
../../../boards/boardctl.c:415
argv=0x3d3e07e8) at ../../../../apps/nshlib/nsh_syscmds.c:465
argc=argc@entry=2, argv=0x3d3e07e8, argv@entry=0x3d3e0820)
at ../../../../apps/nshlib/nsh_command.c:1247
argv=0x3d3e0820, argc=2, vtbl=0x3c5d2aa8) at
../../../../apps/nshlib/nsh_parse.c:847
cmdline=cmdline@entry=0x3d3df93a "reboot") at
../../../../apps/nshlib/nsh_parse.c:2757
"reboot") at ../../../../apps/nshlib/nsh_parse.c:2844
argc=argc@entry=3, argv=0x3d3df920) at
../../../../apps/nshlib/nsh_session.c:146
char**)>, isctty=isctty@entry=0) at
../../../../apps/nshlib/nsh_system.c:47
out>) at ../../../../apps/nshlib/nsh_system.c:80
entrypt=) at
../../../libs/libc/sched/task_startup.c:70

And the original prefile improvement was only 3 microseconds, which is very minimal, so reverting it is the best approach.

Impact

Bug fix

Testing

local test

…ess"

This reverts commit 086834aae1722af2797fd5d4d0f79226ac11c2e9.

The refs in the filelist cannot protect the handles in the prefiles
The prefiles and its associated group are allocated together, and when
the group leaves, they are released together.
During a reboot process, the sync operation may encounter a "used after free"
issue.

issue stack:

    (filep=filep@entry=0x3c482770) at ../../../fs/vfs/fs_close.c:80
    ../../../fs/vfs/fs_close.c:118
    ../../../fs/inode/fs_files.c:476
    /home/cibuild/Public/jenkinsversion/2148/nuttx/include/nuttx/fs/fs.h:870
    at ../../../fs/inode/fs_files.c:223
    <task_fssync(tcb_s*, void*)>, arg=0x0) at
../../../sched/sched/sched_foreach.c:73
    data=<optimized out>) at ../../../fs/fs_initialize.c:48
    ../../../sched/misc/reboot_notifier.c:87
    ../../../boards/boardctl.c:415
    argv=0x3d3e07e8) at ../../../../apps/nshlib/nsh_syscmds.c:465
    argc=argc@entry=2, argv=0x3d3e07e8, argv@entry=0x3d3e0820)
    at ../../../../apps/nshlib/nsh_command.c:1247
    argv=0x3d3e0820, argc=2, vtbl=0x3c5d2aa8) at
../../../../apps/nshlib/nsh_parse.c:847
    cmdline=cmdline@entry=0x3d3df93a "reboot") at
../../../../apps/nshlib/nsh_parse.c:2757
    "reboot") at ../../../../apps/nshlib/nsh_parse.c:2844
    argc=argc@entry=3, argv=0x3d3df920) at
../../../../apps/nshlib/nsh_session.c:146
    char**)>, isctty=isctty@entry=0) at
../../../../apps/nshlib/nsh_system.c:47
    out>) at ../../../../apps/nshlib/nsh_system.c:80
    entrypt=<optimized out>) at
../../../libs/libc/sched/task_startup.c:70

And the original prefile improvement was only 3 microseconds, which is
very minimal, so reverting it is the best approach.

Signed-off-by: dongjiuzhu1 <dongjiuzhu1@xiaomi.com>
@github-actions github-actions bot added Area: File System File System issues Size: S The size of the change in this PR is small labels Oct 20, 2024
@nuttxpr
Copy link

nuttxpr commented Oct 20, 2024

[Experimental Bot, please feedback here]

The PR summary does not fully meet NuttX requirements. Here's a breakdown and suggestions for improvement:

Strengths:

  • Clear problem description: The summary clearly explains the bug caused by the original change (used after free issue during reboot).
  • Rationale for revert: The justification for reverting is well-explained (minimal performance gain vs. introduced bug).
  • Impact is concise: Clearly states the PR is a bug fix.
  • Testing mentioned: Mentions local testing, which is a good start.

Areas for Improvement:

  • Missing Technical Details:
    • Functional Change: The summary mentions reverting file pre-allocation but lacks specifics on what code is being reverted and how the file handling mechanism will work after the revert.
    • How Revert Works: A brief explanation of how reverting solves the "used after free" issue would be beneficial (e.g., "Reverting ensures that file handles are released correctly during the shutdown process...").
  • Incomplete Testing Information:
    • No Build Host Details: Specify the operating system, CPU architecture, and compiler used for local testing.
    • No Target Details: Provide information about the target architecture and board configuration used for testing.
    • Missing Logs: The "Testing logs before change" and "Testing logs after change" sections are empty. Include relevant log snippets demonstrating the bug before the revert and the fix afterward.

Revised Summary (Example):

This PR reverts commit 086834aae1722af2797fd5d4d0f79226ac11c2e9, which introduced pre-allocated task files to improve performance. However, this change caused a "used after free" issue during reboot due to the way pre-allocated files and their associated groups are released.

This revert removes the pre-allocation mechanism, reverting to the original file handling logic. This ensures that file handles are released correctly during shutdown, preventing the "used after free" error.

Revised Testing (Example):

Build Host(s): Linux (Ubuntu 20.04), x86_64, GCC 9.4.0
Target(s): SIM, qemu-system-arm

Testing logs before change:

[Relevant log output showing the "used after free" error during reboot]

Testing logs after change:

[Relevant log output showing the reboot process completing successfully without errors]

@raiden00pl raiden00pl requested a review from anchao October 20, 2024 15:42
@raiden00pl
Copy link
Member

And the original prefile improvement was only 3 microseconds, which is very minimal, so reverting it is the best approach.

I think, the main argument for the reverted commit (d297482) is functional safety requirements, not time consumption.

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

@xiaoxiang781216

  1. The optimized time here is 30% (10.65(us) -> 7.35(us)), and the performance of CPUs with different frequencies will be different
  2. This PR fs/inode: add pre-allocated task files to avoid allocator access #11850 is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)
  3. Please tackle the issue head on, this is probably a regression introduced by CONFIG_FS_REFCOUNT, filep Reference count #13296

#11850 was merged earlier

Copy link
Contributor

@anchao anchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not merge this PR, thanks

@anchao anchao marked this pull request as draft October 21, 2024 00:54
@Donny9
Copy link
Contributor Author

Donny9 commented Oct 21, 2024

@xiaoxiang781216

  1. The optimized time here is 30% (10.65(us) -> 7.35(us)), and the performance of CPUs with different frequencies will be different
  2. This PR fs/inode: add pre-allocated task files to avoid allocator access #11850 is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)
  3. Please tackle the issue head on, this is probably a regression introduced by CONFIG_FS_REFCOUNT, filep Reference count #13296

#11850 was merged earlier

@anchao
If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

This is an issue related to filelist references, not file references.

Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

We will ensure that the application code does not exceed CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, so file_extend() is useless.

This is an issue related to filelist references, not file references.
Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

No, static reservation is a better solution. As long as the nuttx task is created, it needs 3 handles for standard input 、output、 error(0,1,2),So pre-reserve for the file list is a certain event and cannot be skipped, so why not make it static reservation?

@Donny9
Copy link
Contributor Author

Donny9 commented Oct 21, 2024

If the used file descriptor exceeds CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, it will still invoke file_extend and then call the allocator. How can this situation be resolved? By adjusting CONFIG_NFILE_DESCRIPTORS_PER_BLOCK?

We will ensure that the application code does not exceed CONFIG_NFILE_DESCRIPTORS_PER_BLOCK, so file_extend() is useless.

This is an issue related to filelist references, not file references.
Can this functionality be made configurable with a default setting to turn it off? Or are there any other good solutions to this problem?

No, static reservation is a better solution. As long as the nuttx task is created, it needs 3 handles for standard input 、output、 error(0,1,2),So pre-reserve for the file list is a certain event and cannot be skipped, so why not make it static reservation?

If that's the case, tasks using a static file list shouldn't exit, now using such a static file list for tasks that need to exit doesn't make sense. However, given the current requirement to enforce the use of such static lists, it would be best if there were options available.

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

If that's the case, tasks using a static file list shouldn't exit, now using such a static file list for tasks that need to exit doesn't make sense. However, given the current requirement to enforce the use of such static lists, it would be best if there were options available.

The static file list could be exited and closed, but the reserved static buffer should not be freed. is it be possible to calculate whether the current filep is in the static file list to decide whether to release the file list?

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

2. This PR [fs/inode: add pre-allocated task files to avoid allocator access #11850](https://github.com/apache/nuttx/pull/11850)  is mainly to prevent the kernel code to access the memory allocator during task_create/task_terminate, which is part of functional safety requirements. (Our internal branch could disable CONFIG_SIGNAL and CONFIG_MM in nuttx kernel space)

is the "functional safety requirements" documented somewhere?
or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

  1. We haven't fully passed ISO26262 yet, so most of the changes are not suitable for upstream.
  2. All task creation will be bound to fd 0/1/2, why do we need to allocate for additional memory segment? most kernel tasks will not needs additional file descriptors.

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

is the "functional safety requirements" documented somewhere? or, is it just your internal requirement, which doesn't matter much for the nuttx project?

1. We haven't fully passed ISO26262 yet, so most of the changes are not suitable for upstream.

i don't understand how ISO26262 is related here.
but i guess you meant "not ready for upstream yet, but maybe in future", right?
if it isn't ready, maybe it's an option to keep this in your internal branch along with the rest of the necessary changes for now?

2. All task creation will be bound to fd 0/1/2, why do we need to allocate for additional memory segment?

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

also, the associated code/structural complexity (which might caused the crash @Donny9 was observing, i dunno) can be considered as another downside.

most kernel tasks will not needs additional file descriptors.

while i agree about the kernel threads, this stuff is not limited to kernel threads, is it?

(well, actually i personally tend to think kernel threads should not use file descriptors at all.
but i guess it's a separate topic.)

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

I think you should read this part of the code further. PR #11850 just replace dynamic allocation to static definition. It does not waste any memory.

also, the associated code/structural complexity (which might caused the crash @Donny9 was observing, i dunno) can be considered as another downside.

This flaw should be solved in a more elegant fix instead of revert.

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

it seems to have an obvious downside.

  • waste memory when the file lists are extended
  • waste memory if CONFIG_NFILE_DESCRIPTORS_PER_BLOCK is too big

that is, when the workload on the system is not static as yours.

I think you should read this part of the code further. PR #11850 just replace dynamic allocation to static definition. It does not waste any memory.

it makes "struct filelist" larger, which consumes memory.

@anchao
Copy link
Contributor

anchao commented Oct 21, 2024

it makes "struct filelist" larger, which consumes memory.

But task_create will also request the same memory, I just moved it to tcb->group by static.

@raiden00pl
Copy link
Member

is the "functional safety requirements" documented somewhere?
or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project.
NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

BTW. dynamic files allocation is a relatively new feature (Apache era) which was also not fully discussed during its introduction and at that time introduced a difficult-to-detect bug (#6012)

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

is the "functional safety requirements" documented somewhere?
or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is.
i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

it makes "struct filelist" larger, which consumes memory.

But task_create will also request the same memory, I just moved it to tcb->group by static.

if all tasks have a few open files, maybe.
however, isn't it allowed to have tasks w/o open files?

@xiaoxiang781216
Copy link
Contributor

As the convention, we could hold the revert for several days if someone could provide a fix, otherwise the revert has to been merged.

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

is the "functional safety requirements" documented somewhere?
or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

is this what people talking about here?
https://www.iso.org/obp/ui/en/#iso:std:iso:26262:-1:ed-2:v1:en

@raiden00pl
Copy link
Member

raiden00pl commented Oct 21, 2024

i just asked what "functional safety requirements" is because i had no idea what it is.
i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

OK, that make sense :)
Functional safety is a requirement in some applications (most obviously automotive) to avoid critical faults that can cause fatal results (like a system error causing the user to die). From a dev's point of view, this often comes down to eliminating dynamic allocation during system runtime (so we never use free), or eliminating it completely (in the case of NuttX, only the first option is probably possible).

Functional safety opens up the possibility for NuttX to be used in many other industries

@raiden00pl
Copy link
Member

raiden00pl commented Oct 21, 2024

is the "functional safety requirements" documented somewhere?
or, is it just your internal requirement, which doesn't matter much for the nuttx project?

@yamt why do you think that functional safety doesn't matter much for NuttX? I think this is a big deal for this project. NuttX is not an "IoT-only RTOS", many applications can benefit from meeting functional safety requirements.

i just asked what "functional safety requirements" is because i had no idea what it is. i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

is this what people talking about here? https://www.iso.org/obp/ui/en/#iso:std:iso:26262:-1:ed-2:v1:en

yes, ISO 26262 is an automotive-specific standard, more general standard is IEC 61508

@yamt
Copy link
Contributor

yamt commented Oct 21, 2024

i just asked what "functional safety requirements" is because i had no idea what it is.
i still have no idea what it is. so i have no idea if it matters for nuttx or not either.

OK, that make sense :) Functional safety is a requirement in some applications (most obviously automotive) to avoid critical faults that can cause fatal results (like a system error causing the user to die). From a dev's point of view, this often comes down to eliminating dynamic allocation during system runtime (so we never use free), or eliminating it completely (in the case of NuttX, only the first option is probably possible).

Functional safety opens up the possibility for NuttX to be used in many other industries

ok. thank you for explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: File System File System issues Size: S The size of the change in this PR is small

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants