Conversation

@bsbernd bsbernd commented Nov 10, 2025

No description provided.

bsbernd and others added 30 commits November 7, 2025 19:54
When modpost processes .prelink.o files during module linking, it looks
for corresponding .mod files that list the object files comprising each
module. However, the build system only creates .mod files for the original
module names, not for the .prelink.o filenames.

This causes modpost to fail with "No such file or directory" errors when
trying to open the .prelink.mod files, breaking module compilation for
any module using the prelink extension.

Fix this by adding a Makefile rule that automatically creates .prelink.mod
files by copying the original .mod files, and add these files as
dependencies to the modpost rule.

Fixes: modpost failures during module compilation with .prelink.o files
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
commit 670d21c
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 22 14:38:58 2022 -0700

    fuse: remove reliance on bdi congestion

    The bdi congestion tracking is not widely used and will be removed.

    Fuse is one of a small number of filesystems that uses it, setting both
    the sync (read) and async (write) congestion flags at what it determines
    are appropriate times.

    The only remaining effect of the sync flag is to cause read-ahead to be
    skipped.  The only remaining effect of the async flag is to cause (some)
    WB_SYNC_NONE writes to be skipped.

    So instead of setting the flags, change:

     - .readahead to stop when it has submitted all non-async pages for
       read.

     - .writepages to do nothing if WB_SYNC_NONE and the flag would be set

     - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
       flag would be set.

    The writepages change causes a behavioural change in that pageout() can
    now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be
    called on the page which (I think) will further delay the next attempt at
    writeout.  This might be a good thing.

    Link: https://lkml.kernel.org/r/164549983737.9187.2627117501000365074.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Paolo Valente <paolo.valente@linaro.org>
    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
(cherry picked from commit cab0fc3)

commit efc4105
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Sat Oct 7 23:39:56 2023 +0800

    fuse: remove unneeded lock which protecting update of congestion_threshold

    Commit 670d21c ("fuse: remove reliance on bdi congestion") changed how
    congestion_threshold is used, so the lock in
    fuse_conn_congestion_threshold_write() is no longer needed:
    1. Access to the super_block was removed along with bdi congestion, so
    down_read(&fc->killsb), which protected access to the super_block, is
    not needed anymore.
    2. num_background and congestion_threshold are compared without holding
    bg_lock, so there is no need to hold bg_lock to update
    congestion_threshold either.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
(cherry picked from commit 168fb63)
jira LE-2157
cve CVE-2024-44947
Rebuild_History Non-Buildable kernel-5.14.0-503.14.1.el9_5
commit-author Jann Horn <jannh@google.com>
commit 3c0da3d

fuse_notify_store(), unlike fuse_do_readpage(), does not enable page
zeroing (because it can be used to change partial page contents).

So fuse_notify_store() must be more careful to fully initialize page
contents (including parts of the page that are beyond end-of-file)
before marking the page uptodate.

The current code can leave beyond-EOF page contents uninitialized, which
makes these uninitialized page contents visible to userspace via mmap().

This is an information leak, but only affects systems which do not
enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the
corresponding kernel command line parameter).
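
A sketch of the fix as described (assuming the zeroing happens right
before SetPageUptodate() in fuse_notify_store(); the exact diff may
differ):

    /* after copying this_num bytes into the page at offset 0 */
    if (!err && offset == 0 &&
        (this_num == PAGE_SIZE || file_size == end)) {
            /* zero beyond-EOF bytes so they cannot leak via mmap() */
            zero_user_segment(page, this_num, PAGE_SIZE);
            SetPageUptodate(page);
    }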

Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574
	Cc: stable@kernel.org
Fixes: a1d75f2 ("fuse: add store request")
	Signed-off-by: Jann Horn <jannh@google.com>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3c0da3d)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
(cherry picked from commit 04ad364)
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 70e986c
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 1 16:59:02 2023 +0200

    fuse: update ki_pos in fuse_perform_write

    Both callers of fuse_perform_write need to update ki_pos; move the
    update into common code.
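
    Roughly (a sketch, not the verbatim diff), the tail of
    fuse_perform_write() becomes:

        if (res > 0)
                iocb->ki_pos += res;  /* formerly duplicated in both callers */
        return res;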

    Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Anna Schumaker <anna@kernel.org>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
(cherry picked from commit 4669357)
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 596df33
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 1 16:59:03 2023 +0200

    fuse: drop redundant arguments to fuse_perform_write

    pos is always equal to iocb->ki_pos, and mapping is always equal to
    iocb->ki_filp->f_mapping.
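
    The resulting signature change looks roughly like this (sketch):

        /* before */
        static ssize_t fuse_perform_write(struct kiocb *iocb,
                                          struct address_space *mapping,
                                          struct iov_iter *ii, loff_t pos);

        /* after: both values are derived inside the function */
        static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
        {
                struct address_space *mapping = iocb->ki_filp->f_mapping;
                loff_t pos = iocb->ki_pos;
                /* ... */
        }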

    Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Acked-by: Miklos Szeredi <mszeredi@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Anna Schumaker <anna@kernel.org>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
(cherry picked from commit 69700d9)
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended
purpose is exactly the same. This conversion fixes a long-standing race
between hole punching and the read(2) / readahead(2) paths that can
lead to stale page cache contents.

CC: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 8bcbbe9)
(cherry picked from commit 23d6d83)
There is a potential race between fuse_read_interrupt() and
fuse_request_end().

TASK1
  in fuse_read_interrupt(): delete req->intr_entry (while holding
  fiq->lock)

TASK2
  in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
  wake up TASK3

TASK3
  request is freed

TASK1
  in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***

Fix by always grabbing fiq->lock if the request was ever interrupted
(FR_INTERRUPTED set) thereby serializing with concurrent
fuse_read_interrupt() calls.

FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
Dequeuing the request is done with list_del_init() but FR_INTERRUPTED is not
cleared in this case.
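
In fuse_request_end() the serialization then looks roughly like this (a
sketch of the logic described above):

    /*
     * FR_INTERRUPTED stays set even after list_del_init(), so take
     * fiq->lock whenever fuse_read_interrupt() may still look at the
     * request:
     */
    if (test_bit(FR_INTERRUPTED, &req->flags)) {
            spin_lock(&fiq->lock);
            list_del_init(&req->intr_entry);
            spin_unlock(&fiq->lock);
    }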

Reported-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit e1e71c1)
(cherry picked from commit fefb72b)
Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.

If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.

There are at least two cases where this can cause problems:

 - copy_file_range() will return short count if the file is not extended
   up to end of the source range.

 - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
   hence the region may not be fully allocated.

Fix by flushing writes from the start of the range up to the end of the
file.  This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.
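
In essence (sketch), fuse_writeback_range() now flushes up to EOF rather
than only to the supplied end:

    static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
    {
            /* flush to EOF, so extending writes beyond 'end' are pushed
             * to the server as well */
            int err = filemap_write_and_wait_range(inode->i_mapping,
                                                   start, LLONG_MAX);
            /* ... */
            return err;
    }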

Fixes: a2bc923 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb5 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org>  # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 59bda8e)
(cherry picked from commit 00445ea)
The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit a9667ac)
(cherry picked from commit 6e78684)
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped.  This results in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

 - fallocate(2)/copy_file_range(2): these call file_update_time() or
   file_modified(), so flush the inode before returning from the call

 - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
   flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 5c791fe)
(cherry picked from commit 46a668d)
Fuse ->release() is otherwise asynchronous for the reason that it can
happen in contexts unrelated to close/munmap.

Inode is already written back from fuse_flush().  Add it to
fuse_vma_close() as well to make sure inode dirtying from mmaps also get
written out before the file is released.

Also add error handling.
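
A minimal sketch of the resulting handler (assuming write_inode_now() is
used for the flush):

    static void fuse_vma_close(struct vm_area_struct *vma)
    {
            struct inode *inode = file_inode(vma->vm_file);
            int err;

            /* flush cached mtime/ctime before the file is released */
            err = write_inode_now(inode, 1);
            mapping_set_error(inode->i_mapping, err);
    }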

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 36ea233)
(cherry picked from commit f451e1b)
Add missing inode lock annotation; found by syzbot.

Reported-and-tested-by: syzbot+9f747458f5990eaa8d43@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit bda9a71)
(cherry picked from commit 66cf13d)
Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread.  kmap_atomic()
disables preemption, while kmap_local_*() only disables migration.

There is no need to disable preemption at the several kmap_atomic() call
sites used in fuse.
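
The conversion pattern is (sketch; buffer names are illustrative):

    /* before: also disables preemption */
    void *addr = kmap_atomic(page);
    memcpy(buf, addr + offset, count);
    kunmap_atomic(addr);

    /* after: only disables migration */
    void *addr = kmap_local_page(page);
    memcpy(buf, addr + offset, count);
    kunmap_local(addr);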

Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 5fe0fc9)
(cherry picked from commit 90f0b8a)
'ia->io=io' has been set in fuse_io_alloc.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit b5d9758)
(cherry picked from commit 0510052)
Logically it belongs there since attributes are invalidated due to the
updated ctime.  This is a cleanup and should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 371e8fd)
(cherry picked from commit dba6ae4)
Use list_first_entry_or_null() instead of list_empty() + list_entry().
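
The pattern being replaced, illustrated on a pending-list style lookup
(sketch):

    /* before */
    if (!list_empty(&fiq->pending)) {
            req = list_entry(fiq->pending.next, struct fuse_req, list);
            /* ... */
    }

    /* after */
    req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
    if (req) {
            /* ... */
    }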

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 84840ef)
(cherry picked from commit 5d48884)
Rename didn't decrement/clear nlink on overwritten target inode.

Create a common helper fuse_entry_unlinked() that handles this for unlink,
rmdir and rename.
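
A condensed sketch of such a helper (details may differ from the actual
patch):

    static void fuse_entry_unlinked(struct dentry *entry)
    {
            struct inode *inode = d_inode(entry);
            struct fuse_inode *fi = get_fuse_inode(inode);

            spin_lock(&fi->lock);
            fi->attr_version = atomic64_inc_return(&get_fuse_conn(inode)->attr_version);
            if (S_ISDIR(inode->i_mode))
                    clear_nlink(inode);
            else if (inode->i_nlink > 0)
                    drop_nlink(inode);
            spin_unlock(&fi->lock);
    }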

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit cefd1b8)
(cherry picked from commit 0234295)
The fuse_iget() call in create_new_entry() already updated the inode with
all the new attributes and incremented the attribute version.

Incrementing the nlink will result in the wrong count.  This wasn't noticed
because the attributes were invalidated right after this.

Updating ctime is still needed for the writeback case when the ctime is not
refreshed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 97f044f)
(cherry picked from commit 0ac01b5)
Only invalidate attributes that the operation might have changed.

Introduce two constants for common combinations of changed attributes:

  FUSE_STATX_MODIFY: file contents are modified but not size

  FUSE_STATX_MODSIZE: size and/or file contents modified
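
The two constants are defined along these lines (sketch; the exact STATX
bits are an assumption):

    /* Attributes possibly changed on data modification */
    #define FUSE_STATX_MODIFY   (STATX_MTIME | STATX_CTIME | STATX_BLOCKS)

    /* Attributes possibly changed on data and/or size modification */
    #define FUSE_STATX_MODSIZE  (FUSE_STATX_MODIFY | STATX_SIZE)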

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit fa5eee5)
(cherry picked from commit 5f061d1)
The attribute version in fuse_inode should be updated whenever the
attributes might have changed on the server.  In case of cached writes this
is not the case, so updating the attr_version is unnecessary and could
possibly affect performance.

Open code the remaining part of fuse_write_update_size().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 8c56e03)
(cherry picked from commit 5b3e1e3)
This function already updates the attr_version in fuse_inode, regardless of
whether the size was changed or not.

Rename the helper to fuse_write_update_attr() to reflect the more generic
nature.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 27ae449)
(cherry picked from commit 3480f87)
Extend the fuse_write_update_attr() helper to invalidate cached attributes
after a write.

This has already been done in all cases except in fuse_notify_store(), so
this is mostly a cleanup.

fuse_direct_write_iter() calls fuse_direct_IO() which already calls
fuse_write_update_attr(), so don't repeat that again in the former.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit d347739)
(cherry picked from commit 4905303)
A READ request returning a short count is taken as indication of EOF, and
the cached file size is modified accordingly.

Fix the attribute version checking to allow for changes to fc->attr_version
on other inodes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 484ce65)
(cherry picked from commit 9a48447)
It's safe to call file_update_time() if writeback cache is not enabled,
since S_NOCMTIME is set in this case.  This part is purely a cleanup.

__fuse_copy_file_range() also calls fuse_write_update_attr() only in the
writeback cache case.  This is inconsistent with other callers, where it's
called unconditionally.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 20235b4)
(cherry picked from commit f2abe85)
There are two instances of "bool is_wb = fc->writeback_cache" where the
actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)".

Clean up these cases by storing the second condition in the local variable.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit c15016b)
(cherry picked from commit 8712ae1)
In case of writeback_cache fuse_fillattr() would revert the queried
attributes to the cached version.

Move this to fuse_change_attributes() in order to manage the writeback
logic in a central helper.  This will be necessary for patches that follow.

Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling
fuse_change_attributes(), so this should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 04d82db)
(cherry picked from commit 4856207)
If writeback_cache is enabled, then the size, mtime and ctime attributes of
regular files are always valid in the kernel's cache.  They are retrieved
from userspace only when the inode is freshly looked up.

Add a more generic "cache_mask", that indicates which attributes are
currently valid in cache.

This patch doesn't change behavior.
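
The cache mask can be computed with a helper along these lines (sketch):

    static u32 fuse_get_cache_mask(struct inode *inode)
    {
            struct fuse_conn *fc = get_fuse_conn(inode);

            if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
                    return 0;

            /* size, mtime and ctime are always valid in cache */
            return STATX_MTIME | STATX_CTIME | STATX_SIZE;
    }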

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 4b52f05)
(cherry picked from commit 10c2fe3)
When deciding to send a GETATTR request take into account the cache mask
(which attributes are always valid).  The cache mask takes precedence over
the invalid mask.

This results in the GETATTR request not being sent unnecessarily.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit ec85537)
(cherry picked from commit 9da4e8f)
fuse_update_attributes() refreshes metadata for internal use.

Each use needs a particular set of attributes to be refreshed, but
currently that cannot be expressed and all but atime are refreshed.

Add a mask argument, which lets fuse_update_get_attr() decide based on
the cache_mask and the inval_mask whether a GETATTR call is needed or not.
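
The wrapper then looks roughly like this (sketch):

    int fuse_update_attributes(struct inode *inode, struct file *file, u32 mask)
    {
            /* atime is never needed for internal purposes */
            return fuse_update_get_attr(inode, file, NULL,
                                        mask & ~STATX_ATIME, 0);
    }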

Reported-by: Yongji Xie <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit c6c745b)
(cherry picked from commit 4e2c49d)
bsbernd and others added 24 commits November 8, 2025 21:03
Readahead is currently limited to bdi->ra_pages. One can change
that after the mount with something like

minor=$(stat -c "%d" /path/to/fuse)
echo 1024 > /sys/class/bdi/0:$minor/read_ahead_kb

The issue is that the fuse server cannot do that from its ->init method,
as it has to know the device minor, and querying it blocks before init
is complete.

Fuse already sets the bdi value, but the upper limit is the current bdi
value. For CAP_SYS_ADMIN we can allow higher values.
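
A hypothetical sketch of the check (variable names are assumptions, not
the actual patch):

    /* unprivileged servers stay capped at the current bdi value */
    if (ra_pages > bdi->ra_pages && !capable(CAP_SYS_ADMIN))
            ra_pages = bdi->ra_pages;
    bdi->ra_pages = ra_pages;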

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 763c96d)
(cherry picked from commit 99bc3f2)
When mounting a user-space filesystem on multiple clients, stale inode
attributes may be cached on some nodes after concurrent ->setattr()
calls from different nodes.

This is caused by fuse_setattr() racing with
fuse_reverse_inval_inode().

When filesystem server receives setattr request, the client node
with valid iattr cached will be required to update the fuse_inode's
attr_version and invalidate the cache by fuse_reverse_inval_inode(),
and at the next call to ->getattr() they will be fetched from user
space.

The race scenario is:
1. client-1 sends setattr (iattr-1) request to server
2. client-1 receives the reply from server
3. before client-1 updates iattr-1 to the cached attributes by
   fuse_change_attributes_common(), server receives another setattr
   (iattr-2) request from client-2
4. server requests client-1 to update the inode attr_version and
   invalidate the cached iattr, and iattr-1 becomes stale
5. client-2 receives the reply from server, and caches iattr-2
6. continue with step 2, client-1 invokes
   fuse_change_attributes_common(), and caches iattr-1

The issue has been observed with concurrent chmod, chown, or truncate
operations, all of which invoke the ->setattr() call.

The solution is to use fuse_inode's attr_version to check whether
the attributes have been modified during the setattr request's
lifetime.  If so, mark the attributes as invalid in the function
fuse_change_attributes_common().
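
A hypothetical sketch of that check inside
fuse_change_attributes_common() (not the verbatim patch):

    if (attr_version != 0 && fi->attr_version > attr_version) {
            /*
             * Attributes were invalidated while the request was in
             * flight (step 4 above): don't cache the stale reply,
             * force a fresh GETATTR on the next access.
             */
            fuse_invalidate_attr(inode);
            return;
    }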

Signed-off-by: Guang Yuan Wu <gwu@ddn.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 78dec11)
(cherry picked from commit 1e71f3f)
When the writeback cache is enabled, it is beneficial for data consistency
to tell the FUSE server when the kernel prepares a page for caching.
This lets the FUSE server react and lock the page.

Additionally, the same call lets the FUSE server decide how much data it
locks, and the kernel keeps that information in its dlm lock management.

If the feature is not supported, it will be disabled after the first
unsuccessful use.

- Add DLM_LOCK fuse opcode
- Add page lock caching for the writeback cache functionality.
This means sending out a FUSE call whenever the kernel prepares a page
for the writeback cache. The kernel will manage the cache so that it
keeps track of already acquired locks
(except for the case that is documented in the code).
- Use rb-trees for the management of the already 'locked' page ranges
- Use an rw_semaphore for synchronization in fuse_dlm_cache

(cherry picked from commit 287c884)
(cherry picked from commit 3b0ae3b)
Add support for invalidating inode aliases when doing inode invalidation.
This is useful for distributed file systems, which use DLM for cache
coherency. So, when a client loses its inode lock, it should invalidate
its inode cache and dentry cache, since another client may delete
this file after getting the inode lock.

Signed-off-by: Yong Ze Chen <yochen@ddn.com>
(cherry picked from commit 49720b5)
(cherry picked from commit d126533)
Renumber the operation code to a high value to avoid conflicts with upstream.

(cherry picked from commit 27a0e9e)
(cherry picked from commit 9cd8b25)
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE
filesystems to acquire a distributed lock manager (DLM) lock for
protecting upcoming dirty pages when a previously read-only mapped
page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit ec36c45)
(cherry picked from commit 360f652)
Allow read_folio to return an EAGAIN error and translate it to
AOP_TRUNCATED_PAGE to retry page fault and read operations.
This is used to prevent a deadlock from folio lock/DLM lock order reversal:
 - Fault or read operations acquire the folio lock first, then the DLM lock.
 - The FUSE daemon blocks new DLM lock acquisition while it is invalidating
   the page cache; invalidate_inode_pages2_range() acquires the folio lock.
To prevent the deadlock, the FUSE daemon will fail its DLM lock acquisition
with EAGAIN if it detects an in-flight page cache invalidation.
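
A hypothetical sketch of the translation in fuse's ->read_folio (the
helper name is an assumption):

    err = fuse_do_readfolio(file, folio);
    if (err == -EAGAIN) {
            folio_unlock(folio);
            return AOP_TRUNCATED_PAGE;  /* fault/read path retries */
    }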

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit 8ecf118)
(cherry picked from commit 02eecf9)
generic/488 fails with fuse2fs in the following fashion:

generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts.  Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn.  That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.
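
A hypothetical sketch of such a helper (names are assumptions, not the
actual patch):

    static void fuse_flush_bg_and_wait(struct fuse_conn *fc)
    {
            spin_lock(&fc->bg_lock);
            fc->max_background = UINT_MAX;  /* push everything to the queues */
            flush_bg_queue(fc);
            spin_unlock(&fc->bg_lock);

            /* wait until the server has answered all pending requests */
            wait_event(fc->blocked_waitq, !fc->num_background);
    }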

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit d4262f9)
(cherry picked from commit 14e0495)
This is preparation to allow a fuse-io-uring bg queue flush from
flush_bg_queue().

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit e70ef24)
(cherry picked from commit cd90b4d)
This is useful for having a single API to flush background requests,
for example when the bg queue gets flushed before the rest of
fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit fc4120c)
(cherry picked from commit 487773b)
When the fuse server responds to a dlm request with some error other
than ENOSYS, most likely the lock size will be set to zero. In that case
the kernel will abort the fuse connection, which is completely
unnecessary.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 0bc2f9c)
(cherry picked from commit bccb169)
Check whether dlm is still enabled when interpreting the error
returned from the fuse server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit f6fbf7c)
(cherry picked from commit 3898d27)
Fix reference count leak of payload pages during fuse argument copies.

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit 2436409)
- Increase the possible lock size to 64 bit.
- Change the semantics of DLM locks to request start and end.
- Change the semantics of the DLM request return to mark the start and
  end of the locked area.
- Better prepare the dlm lock range cache rb-tree for unaligned byte
  range locks, which may return any value as long as it is larger than
  the range requested.
- Add the case where start and end are zero to destroy the cache.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 3782c41)
Take actions on the PR merged event of this repo. Run
copy-from-linux-branch.sh and create a PR for redfs.

(cherry picked from commit f54872e)
generic_file_direct_write() also does this and has a large
comment about it.

The reproducer here is xfstests' generic/209, which exercises exactly
this: competing DIO writes and cached IO reads.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 04869c8)
(cherry picked from commit 45d035e)
This was done as a condition on direct_io_allow_mmap, but I believe
this is not right, as a file might be opened two times - once with
write-back enabled and another time with FOPEN_DIRECT_IO.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 45908a9)
(cherry picked from commit c18d105)
This is another preparation and will be used for deciding
which queue to add a request to.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
(cherry picked from commit e4698fa)
(cherry picked from commit 4c3941b)
This is preparation for follow-up commits that allow running with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 2e27c33)
(cherry picked from commit 32d402c)
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit be6edce)
(cherry picked from commit 83db987)
Queue selection (fuse_uring_get_queue) can handle a reduced number of
queues - using io-uring is now possible even with a single
queue and entry.

The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that
reduced queues are possible, i.e. if the flag is set, the fuse server
is free to reduce the number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit f620f3d)
(cherry picked from commit db5c364)
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

The reason is easily visible: the fio process migrates between CPUs
when requests are submitted on the queue for the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, and IO processing should be done on a different core.
Spreading the load on random cores is also not desirable, as those cores
might be frequency scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance
governor instead of schedutil (the Ubuntu default). Obviously that comes
at the cost of higher system power consumption - not desirable either.

Results without io-uring (io-uring uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses a
default of max 10 threads, but the actual max number of threads is a
parameter. Also, results without fuse-io-uring heavily depend on whether
another workload was already running before, as libfuse starts these
threads dynamically - i.e. the more threads are active, the worse the
performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit c6399ea)
(cherry picked from commit 44eaf77)
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests, and a single queue might become CPU limited, i.e. spreading
the load helps.

(cherry picked from commit 2e73b0b)
(cherry picked from commit 7081fec)
With the reduced queue feature, io-uring is marked as ready after
receiving the 1st ring entry. At that time other queues might still be
in the process of registration, and then a race happens:

fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail -> fuse request gets queued

So far, fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests on the
same queue it would actually never send requests from that queue - the
request was stuck.

(cherry picked from commit 4cbee2e)
@bsbernd bsbernd merged commit 2407a55 into redfs-rhel9_4-427.42.1 Nov 10, 2025
@bsbernd bsbernd deleted the ddn-update-redfs-rhel9_4-427.42.1 branch November 12, 2025 16:45
openunix pushed a commit that referenced this pull request Jan 4, 2026
… to macb_open()

In the non-RT kernel, local_bh_disable() merely disables preemption,
whereas it maps to an actual spin lock in the RT kernel. Consequently,
when attempting to refill RX buffers via netdev_alloc_skb() in
macb_mac_link_up(), a deadlock scenario arises as follows:

   WARNING: possible circular locking dependency detected
   6.18.0-08691-g2061f18ad76e #39 Not tainted
   ------------------------------------------------------
   kworker/0:0/8 is trying to acquire lock:
   ffff00080369bbe0 (&bp->lock){+.+.}-{3:3}, at: macb_start_xmit+0x808/0xb7c

   but task is already holding lock:
   ffff000803698e58 (&queue->tx_ptr_lock){+...}-{3:3}, at: macb_start_xmit
   +0x148/0xb7c

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #3 (&queue->tx_ptr_lock){+...}-{3:3}:
          rt_spin_lock+0x50/0x1f0
          macb_start_xmit+0x148/0xb7c
          dev_hard_start_xmit+0x94/0x284
          sch_direct_xmit+0x8c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #2 (_xmit_ETHER#2){+...}-{3:3}:
          rt_spin_lock+0x50/0x1f0
          sch_direct_xmit+0x11c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #1 ((softirq_ctrl.lock)){+.+.}-{3:3}:
          lock_release+0x250/0x348
          __local_bh_enable_ip+0x7c/0x240
          __netdev_alloc_skb+0x1b4/0x1d8
          gem_rx_refill+0xdc/0x240
          gem_init_rings+0xb4/0x108
          macb_mac_link_up+0x9c/0x2b4
          phylink_resolve+0x170/0x614
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #0 (&bp->lock){+.+.}-{3:3}:
          __lock_acquire+0x15a8/0x2084
          lock_acquire+0x1cc/0x350
          rt_spin_lock+0x50/0x1f0
          macb_start_xmit+0x808/0xb7c
          dev_hard_start_xmit+0x94/0x284
          sch_direct_xmit+0x8c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   other info that might help us debug this:

   Chain exists of:
     &bp->lock --> _xmit_ETHER#2 --> &queue->tx_ptr_lock

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&queue->tx_ptr_lock);
                                  lock(_xmit_ETHER#2);
                                  lock(&queue->tx_ptr_lock);
     lock(&bp->lock);

    *** DEADLOCK ***

   Call trace:
    show_stack+0x18/0x24 (C)
    dump_stack_lvl+0xa0/0xf0
    dump_stack+0x18/0x24
    print_circular_bug+0x28c/0x370
    check_noncircular+0x198/0x1ac
    __lock_acquire+0x15a8/0x2084
    lock_acquire+0x1cc/0x350
    rt_spin_lock+0x50/0x1f0
    macb_start_xmit+0x808/0xb7c
    dev_hard_start_xmit+0x94/0x284
    sch_direct_xmit+0x8c/0x37c
    __dev_queue_xmit+0x708/0x1120
    neigh_resolve_output+0x148/0x28c
    ip6_finish_output2+0x2c0/0xb2c
    __ip6_finish_output+0x114/0x308
    ip6_output+0xc4/0x4a4
    mld_sendpack+0x220/0x68c
    mld_ifc_work+0x2a8/0x4f4
    process_one_work+0x20c/0x5f8
    worker_thread+0x1b0/0x35c
    kthread+0x144/0x200
    ret_from_fork+0x10/0x20

Notably, invoking the mog_init_rings() callback upon link establishment
is unnecessary. Instead, we can exclusively call mog_init_rings() within
the ndo_open() callback. This adjustment resolves the deadlock issue.
Furthermore, since MACB_CAPS_MACB_IS_EMAC cases do not use mog_init_rings()
when opening the network interface via at91ether_open(), moving
mog_init_rings() to macb_open() also eliminates the MACB_CAPS_MACB_IS_EMAC
check.
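
In outline (sketch), the ring setup moves into macb_open():

    static int macb_open(struct net_device *dev)
    {
            struct macb *bp = netdev_priv(dev);
            /* ... */
            /* init rings once at open time, not on every link-up */
            bp->macbgem_ops.mog_init_rings(bp);
            /* ... */
    }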

Fixes: 633e98a ("net: macb: use resolved link config in mac_link_up()")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Hao <kexin.hao@windriver.com>
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Link: https://patch.msgid.link/20251222015624.1994551-1-xiaolei.wang@windriver.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>