Conversation

@bsbernd bsbernd commented Nov 10, 2025

No description provided.

bsbernd and others added 30 commits November 7, 2025 19:54
When modpost processes .prelink.o files during module linking, it looks
for corresponding .mod files that list the object files comprising each
module. However, the build system only creates .mod files for the original
module names, not for the .prelink.o filenames.

This causes modpost to fail with "No such file or directory" errors when
trying to open the .prelink.mod files, breaking module compilation for
any module using the prelink extension.

Fix this by adding a Makefile rule that automatically creates .prelink.mod
files by copying the original .mod files, and add these files as
dependencies to the modpost rule.

Fixes: modpost failures during module compilation with .prelink.o files
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
commit 670d21c
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 22 14:38:58 2022 -0700

    fuse: remove reliance on bdi congestion

    The bdi congestion tracking is not widely used and will be removed.

    Fuse is one of a small number of filesystems that uses it, setting both
    the sync (read) and async (write) congestion flags at what it determines
    are appropriate times.

    The only remaining effect of the sync flag is to cause read-ahead to be
    skipped.  The only remaining effect of the async flag is to cause (some)
    WB_SYNC_NONE writes to be skipped.

    So instead of setting the flags, change:

     - .readahead to stop when it has submitted all non-async pages for
       read.

     - .writepages to do nothing if WB_SYNC_NONE and the flag would be set

     - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the
       flag would be set.

    The writepages change causes a behavioural change in that pageout() can
    now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be
    called on the page which (I think) will further delay the next attempt at
    writeout.  This might be a good thing.

    Link: https://lkml.kernel.org/r/164549983737.9187.2627117501000365074.stgit@noble.brown
    Signed-off-by: NeilBrown <neilb@suse.de>
    Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Darrick J. Wong <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Paolo Valente <paolo.valente@linaro.org>
    Cc: Philipp Reisner <philipp.reisner@linbit.com>
    Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
(cherry picked from commit cab0fc3)

commit efc4105
Author: Kemeng Shi <shikemeng@huaweicloud.com>
Date:   Sat Oct 7 23:39:56 2023 +0800

    fuse: remove unneeded lock which protecting update of congestion_threshold

    Commit 670d21c ("fuse: remove reliance on bdi congestion") changed how
    congestion_threshold is used, so the lock in
    fuse_conn_congestion_threshold_write() is no longer needed:
    1. Access to the super_block was removed along with bdi congestion, so
    down_read(&fc->killsb), which protected access to the super_block, is
    not needed anymore.
    2. num_background and congestion_threshold are compared without holding
    bg_lock, so there is no need to hold bg_lock to update
    congestion_threshold either.

    Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
    Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

JIRA: https://issues.redhat.com/browse/RHEL-5619
Signed-off-by: Nico Pache <npache@redhat.com>
(cherry picked from commit 168fb63)
jira LE-2157
cve CVE-2024-44947
Rebuild_History Non-Buildable kernel-5.14.0-503.14.1.el9_5
commit-author Jann Horn <jannh@google.com>
commit 3c0da3d

fuse_notify_store(), unlike fuse_do_readpage(), does not enable page
zeroing (because it can be used to change partial page contents).

So fuse_notify_store() must be more careful to fully initialize page
contents (including parts of the page that are beyond end-of-file)
before marking the page uptodate.

The current code can leave beyond-EOF page contents uninitialized, which
makes these uninitialized page contents visible to userspace via mmap().

This is an information leak, but only affects systems which do not
enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the
corresponding kernel command line parameter).
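
A sketch of the fix as described (assuming the zeroing happens right
before SetPageUptodate() in fuse_notify_store(); the exact diff may
differ):

    /* after copying this_num bytes into the page at offset 0 */
    if (!err && offset == 0 &&
        (this_num == PAGE_SIZE || file_size == end)) {
            /* zero beyond-EOF bytes so they cannot leak via mmap() */
            zero_user_segment(page, this_num, PAGE_SIZE);
            SetPageUptodate(page);
    }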

Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574
	Cc: stable@kernel.org
Fixes: a1d75f2 ("fuse: add store request")
	Signed-off-by: Jann Horn <jannh@google.com>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3c0da3d)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
(cherry picked from commit 04ad364)
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 70e986c
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 1 16:59:02 2023 +0200

    fuse: update ki_pos in fuse_perform_write

    Both callers of fuse_perform_write need to update ki_pos; move the
    update into common code.
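
    Roughly (a sketch, not the verbatim diff), the tail of
    fuse_perform_write() becomes:

        if (res > 0)
                iocb->ki_pos += res;  /* formerly duplicated in both callers */
        return res;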

    Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Anna Schumaker <anna@kernel.org>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
(cherry picked from commit 4669357)
JIRA: https://issues.redhat.com/browse/RHEL-29564

commit 596df33
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 1 16:59:03 2023 +0200

    fuse: drop redundant arguments to fuse_perform_write

    pos is always equal to iocb->ki_pos, and mapping is always equal to
    iocb->ki_filp->f_mapping.
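
    The resulting signature change looks roughly like this (sketch):

        /* before */
        static ssize_t fuse_perform_write(struct kiocb *iocb,
                                          struct address_space *mapping,
                                          struct iov_iter *ii, loff_t pos);

        /* after: both values are derived inside the function */
        static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
        {
                struct address_space *mapping = iocb->ki_filp->f_mapping;
                loff_t pos = iocb->ki_pos;
                /* ... */
        }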

    Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Acked-by: Miklos Szeredi <mszeredi@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Andreas Gruenbacher <agruenba@redhat.com>
    Cc: Anna Schumaker <anna@kernel.org>
    Cc: Chao Yu <chao@kernel.org>
    Cc: Christian Brauner <brauner@kernel.org>
    Cc: "Darrick J. Wong" <djwong@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Cc: Jaegeuk Kim <jaegeuk@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miklos Szeredi <miklos@szeredi.hu>
    Cc: Theodore Ts'o <tytso@mit.edu>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Cc: Xiubo Li <xiubli@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
(cherry picked from commit 69700d9)
Use invalidate_lock instead of fuse's private i_mmap_sem. The intended
purpose is exactly the same. This conversion fixes a long-standing race
between hole punching and the read(2) / readahead(2) paths that can
lead to stale page cache contents.

CC: Miklos Szeredi <miklos@szeredi.hu>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
(cherry picked from commit 8bcbbe9)
(cherry picked from commit 23d6d83)
There is a potential race between fuse_read_interrupt() and
fuse_request_end().

TASK1
  in fuse_read_interrupt(): delete req->intr_entry (while holding
  fiq->lock)

TASK2
  in fuse_request_end(): req->intr_entry is empty -> skip fiq->lock
  wake up TASK3

TASK3
  request is freed

TASK1
  in fuse_read_interrupt(): dereference req->in.h.unique ***BAM***

Fix by always grabbing fiq->lock if the request was ever interrupted
(FR_INTERRUPTED set) thereby serializing with concurrent
fuse_read_interrupt() calls.

FR_INTERRUPTED is set before the request is queued on fiq->interrupts.
Dequeuing the request is done with list_del_init() but FR_INTERRUPTED is not
cleared in this case.
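
In fuse_request_end() the serialization then looks roughly like this (a
sketch of the logic described above):

    /*
     * FR_INTERRUPTED stays set even after list_del_init(), so take
     * fiq->lock whenever fuse_read_interrupt() may still look at the
     * request:
     */
    if (test_bit(FR_INTERRUPTED, &req->flags)) {
            spin_lock(&fiq->lock);
            list_del_init(&req->intr_entry);
            spin_unlock(&fiq->lock);
    }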

Reported-by: lijiazi <lijiazi@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit e1e71c1)
(cherry picked from commit fefb72b)
Callers of fuse_writeback_range() assume that the file is ready for
modification by the server in the supplied byte range after the call
returns.

If there's a write that extends the file beyond the end of the supplied
range, then the file needs to be extended to at least the end of the range,
but currently that's not done.

There are at least two cases where this can cause problems:

 - copy_file_range() will return short count if the file is not extended
   up to end of the source range.

 - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file,
   hence the region may not be fully allocated.

Fix by flushing writes from the start of the range up to the end of the
file.  This could be optimized if the writes are non-extending, etc, but
it's probably not worth the trouble.
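
In essence (sketch), fuse_writeback_range() now flushes up to EOF rather
than only to the supplied end:

    static int fuse_writeback_range(struct inode *inode, loff_t start, loff_t end)
    {
            /* flush to EOF, so extending writes beyond 'end' are pushed
             * to the server as well */
            int err = filemap_write_and_wait_range(inode->i_mapping,
                                                   start, LLONG_MAX);
            /* ... */
            return err;
    }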

Fixes: a2bc923 ("fuse: fix copy_file_range() in the writeback case")
Fixes: 6b1bdb5 ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)")
Cc: <stable@vger.kernel.org>  # v5.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 59bda8e)
(cherry picked from commit 00445ea)
The struct fuse_conn argument is not used and can be removed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit a9667ac)
(cherry picked from commit 6e78684)
In writeback cache mode mtime/ctime updates are cached, and flushed to the
server using the ->write_inode() callback.

Closing the file will result in a dirty inode being immediately written,
but in other cases the inode can remain dirty after all references are
dropped.  This results in the inode being written back from reclaim, which
can deadlock on a regular allocation while the request is being served.

The usual mechanisms (GFP_NOFS/PF_MEMALLOC*) don't work for FUSE, because
serving a request involves unrelated userspace process(es).

Instead do the same as for dirty pages: make sure the inode is written
before the last reference is gone.

 - fallocate(2)/copy_file_range(2): these call file_update_time() or
   file_modified(), so flush the inode before returning from the call

 - unlink(2), link(2) and rename(2): these call fuse_update_ctime(), so
   flush the ctime directly from this helper

Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 5c791fe)
(cherry picked from commit 46a668d)
Fuse ->release() is otherwise asynchronous for the reason that it can
happen in contexts unrelated to close/munmap.

Inode is already written back from fuse_flush().  Add it to
fuse_vma_close() as well to make sure inode dirtying from mmaps also get
written out before the file is released.

Also add error handling.
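
A minimal sketch of the resulting handler (assuming write_inode_now() is
used for the flush):

    static void fuse_vma_close(struct vm_area_struct *vma)
    {
            struct inode *inode = file_inode(vma->vm_file);
            int err;

            /* flush cached mtime/ctime before the file is released */
            err = write_inode_now(inode, 1);
            mapping_set_error(inode->i_mapping, err);
    }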

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 36ea233)
(cherry picked from commit f451e1b)
Add missing inode lock annotation; found by syzbot.

Reported-and-tested-by: syzbot+9f747458f5990eaa8d43@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit bda9a71)
(cherry picked from commit 66cf13d)
Due to the introduction of kmap_local_*, the storage of slots used for
short-term mapping has changed from per-CPU to per-thread.  kmap_atomic()
disables preemption, while kmap_local_*() only disables migration.

There is no need to disable preemption at the several kmap_atomic() call
sites used in fuse.
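
The conversion pattern is (sketch; buffer names are illustrative):

    /* before: also disables preemption */
    void *addr = kmap_atomic(page);
    memcpy(buf, addr + offset, count);
    kunmap_atomic(addr);

    /* after: only disables migration */
    void *addr = kmap_local_page(page);
    memcpy(buf, addr + offset, count);
    kunmap_local(addr);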

Link: https://lwn.net/Articles/836144/
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 5fe0fc9)
(cherry picked from commit 90f0b8a)
'ia->io=io' has been set in fuse_io_alloc.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit b5d9758)
(cherry picked from commit 0510052)
Logically it belongs there since attributes are invalidated due to the
updated ctime.  This is a cleanup and should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 371e8fd)
(cherry picked from commit dba6ae4)
Use list_first_entry_or_null() instead of list_empty() + list_entry().
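
The pattern being replaced, illustrated on a pending-list style lookup
(sketch):

    /* before */
    if (!list_empty(&fiq->pending)) {
            req = list_entry(fiq->pending.next, struct fuse_req, list);
            /* ... */
    }

    /* after */
    req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
    if (req) {
            /* ... */
    }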

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 84840ef)
(cherry picked from commit 5d48884)
Rename didn't decrement/clear nlink on overwritten target inode.

Create a common helper fuse_entry_unlinked() that handles this for unlink,
rmdir and rename.
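
A condensed sketch of such a helper (details may differ from the actual
patch):

    static void fuse_entry_unlinked(struct dentry *entry)
    {
            struct inode *inode = d_inode(entry);
            struct fuse_inode *fi = get_fuse_inode(inode);

            spin_lock(&fi->lock);
            fi->attr_version = atomic64_inc_return(&get_fuse_conn(inode)->attr_version);
            if (S_ISDIR(inode->i_mode))
                    clear_nlink(inode);
            else if (inode->i_nlink > 0)
                    drop_nlink(inode);
            spin_unlock(&fi->lock);
    }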

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit cefd1b8)
(cherry picked from commit 0234295)
The fuse_iget() call in create_new_entry() already updated the inode with
all the new attributes and incremented the attribute version.

Incrementing the nlink will result in the wrong count.  This wasn't noticed
because the attributes were invalidated right after this.

Updating ctime is still needed for the writeback case when the ctime is not
refreshed.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 97f044f)
(cherry picked from commit 0ac01b5)
Only invalidate attributes that the operation might have changed.

Introduce two constants for common combinations of changed attributes:

  FUSE_STATX_MODIFY: file contents are modified but not size

  FUSE_STATX_MODSIZE: size and/or file contents modified
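
The two constants are defined along these lines (sketch; the exact STATX
bits are an assumption):

    /* Attributes possibly changed on data modification */
    #define FUSE_STATX_MODIFY   (STATX_MTIME | STATX_CTIME | STATX_BLOCKS)

    /* Attributes possibly changed on data and/or size modification */
    #define FUSE_STATX_MODSIZE  (FUSE_STATX_MODIFY | STATX_SIZE)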

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit fa5eee5)
(cherry picked from commit 5f061d1)
The attribute version in fuse_inode should be updated whenever the
attributes might have changed on the server.  In case of cached writes this
is not the case, so updating the attr_version is unnecessary and could
possibly affect performance.

Open code the remaining part of fuse_write_update_size().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 8c56e03)
(cherry picked from commit 5b3e1e3)
This function already updates the attr_version in fuse_inode, regardless of
whether the size was changed or not.

Rename the helper to fuse_write_update_attr() to reflect the more generic
nature.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 27ae449)
(cherry picked from commit 3480f87)
Extend the fuse_write_update_attr() helper to invalidate cached attributes
after a write.

This has already been done in all cases except in fuse_notify_store(), so
this is mostly a cleanup.

fuse_direct_write_iter() calls fuse_direct_IO() which already calls
fuse_write_update_attr(), so don't repeat that again in the former.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit d347739)
(cherry picked from commit 4905303)
A READ request returning a short count is taken as indication of EOF, and
the cached file size is modified accordingly.

Fix the attribute version checking to allow for changes to fc->attr_version
on other inodes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 484ce65)
(cherry picked from commit 9a48447)
It's safe to call file_update_time() if writeback cache is not enabled,
since S_NOCMTIME is set in this case.  This part is purely a cleanup.

__fuse_copy_file_range() also calls fuse_write_update_attr() only in the
writeback cache case.  This is inconsistent with other callers, where it's
called unconditionally.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 20235b4)
(cherry picked from commit f2abe85)
There are two instances of "bool is_wb = fc->writeback_cache" where the
actual use mostly involves checking "is_wb && S_ISREG(inode->i_mode)".

Clean up these cases by storing the second condition in the local variable.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit c15016b)
(cherry picked from commit 8712ae1)
In case of writeback_cache fuse_fillattr() would revert the queried
attributes to the cached version.

Move this to fuse_change_attributes() in order to manage the writeback
logic in a central helper.  This will be necessary for patches that follow.

Only fuse_do_getattr() -> fuse_fillattr() uses the attributes after calling
fuse_change_attributes(), so this should not change behavior.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 04d82db)
(cherry picked from commit 4856207)
If writeback_cache is enabled, then the size, mtime and ctime attributes of
regular files are always valid in the kernel's cache.  They are retrieved
from userspace only when the inode is freshly looked up.

Add a more generic "cache_mask", that indicates which attributes are
currently valid in cache.

This patch doesn't change behavior.
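
The cache mask can be computed with a helper along these lines (sketch):

    static u32 fuse_get_cache_mask(struct inode *inode)
    {
            struct fuse_conn *fc = get_fuse_conn(inode);

            if (!fc->writeback_cache || !S_ISREG(inode->i_mode))
                    return 0;

            /* size, mtime and ctime are always valid in cache */
            return STATX_MTIME | STATX_CTIME | STATX_SIZE;
    }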

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 4b52f05)
(cherry picked from commit 10c2fe3)
When deciding to send a GETATTR request take into account the cache mask
(which attributes are always valid).  The cache mask takes precedence over
the invalid mask.

This results in the GETATTR request not being sent unnecessarily.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit ec85537)
(cherry picked from commit 9da4e8f)
fuse_update_attributes() refreshes metadata for internal use.

Each use needs a particular set of attributes to be refreshed, but
currently that cannot be expressed and all but atime are refreshed.

Add a mask argument, which lets fuse_update_get_attr() decide based on
the cache_mask and the inval_mask whether a GETATTR call is needed or not.
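
The wrapper then looks roughly like this (sketch):

    int fuse_update_attributes(struct inode *inode, struct file *file, u32 mask)
    {
            /* atime is never needed for internal purposes */
            return fuse_update_get_attr(inode, file, NULL,
                                        mask & ~STATX_ATIME, 0);
    }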

Reported-by: Yongji Xie <xieyongji@bytedance.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit c6c745b)
(cherry picked from commit 4e2c49d)
bsbernd and others added 24 commits November 8, 2025 21:03
Readahead is currently limited to bdi->ra_pages. One can change
that after the mount with something like

minor=$(stat -c "%d" /path/to/fuse)
echo 1024 > /sys/class/bdi/0:$minor/read_ahead_kb

The issue is that the fuse server cannot do that from its ->init method,
as it has to know the device minor, and querying it blocks before init
is complete.

Fuse already sets the bdi value, but the upper limit is the current bdi
value. For CAP_SYS_ADMIN we can allow higher values.
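
A hypothetical sketch of the check (variable names are assumptions, not
the actual patch):

    /* unprivileged servers stay capped at the current bdi value */
    if (ra_pages > bdi->ra_pages && !capable(CAP_SYS_ADMIN))
            ra_pages = bdi->ra_pages;
    bdi->ra_pages = ra_pages;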

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 763c96d)
(cherry picked from commit 99bc3f2)
When mounting a user-space filesystem on multiple clients, stale inode
attributes may be cached on some nodes after concurrent ->setattr()
calls from different nodes.

This is caused by fuse_setattr() racing with
fuse_reverse_inval_inode().

When filesystem server receives setattr request, the client node
with valid iattr cached will be required to update the fuse_inode's
attr_version and invalidate the cache by fuse_reverse_inval_inode(),
and at the next call to ->getattr() they will be fetched from user
space.

The race scenario is:
1. client-1 sends setattr (iattr-1) request to server
2. client-1 receives the reply from server
3. before client-1 updates iattr-1 to the cached attributes by
   fuse_change_attributes_common(), server receives another setattr
   (iattr-2) request from client-2
4. server requests client-1 to update the inode attr_version and
   invalidate the cached iattr, and iattr-1 becomes stale
5. client-2 receives the reply from server, and caches iattr-2
6. continue with step 2, client-1 invokes
   fuse_change_attributes_common(), and caches iattr-1

The issue has been observed with concurrent chmod, chown, or truncate
operations, all of which invoke the ->setattr() call.

The solution is to use fuse_inode's attr_version to check whether
the attributes have been modified during the setattr request's
lifetime.  If so, mark the attributes as invalid in the function
fuse_change_attributes_common().
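
A hypothetical sketch of that check inside
fuse_change_attributes_common() (not the verbatim patch):

    if (attr_version != 0 && fi->attr_version > attr_version) {
            /*
             * Attributes were invalidated while the request was in
             * flight (step 4 above): don't cache the stale reply,
             * force a fresh GETATTR on the next access.
             */
            fuse_invalidate_attr(inode);
            return;
    }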

Signed-off-by: Guang Yuan Wu <gwu@ddn.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 78dec11)
(cherry picked from commit 1e71f3f)
When the writeback cache is enabled, it is beneficial for data consistency
to tell the FUSE server when the kernel prepares a page for caching.
This lets the FUSE server react and lock the page.

Additionally, the same call lets the FUSE server decide how much data it
locks, and the kernel keeps that information in its dlm lock management.

If the feature is not supported, it will be disabled after the first
unsuccessful use.

- Add DLM_LOCK fuse opcode
- Add page lock caching for the writeback cache functionality.
This means sending out a FUSE call whenever the kernel prepares a page
for the writeback cache. The kernel will manage the cache so that it
keeps track of already acquired locks
(except for the case that is documented in the code).
- Use rb-trees for the management of the already 'locked' page ranges
- Use an rw_semaphore for synchronization in fuse_dlm_cache

(cherry picked from commit 287c884)
(cherry picked from commit 3b0ae3b)
Add support for invalidating inode aliases when doing inode invalidation.
This is useful for distributed file systems, which use DLM for cache
coherency. So, when a client loses its inode lock, it should invalidate
its inode cache and dentry cache, since another client may delete
this file after getting the inode lock.

Signed-off-by: Yong Ze Chen <yochen@ddn.com>
(cherry picked from commit 49720b5)
(cherry picked from commit d126533)
Renumber the operation code to a high value to avoid conflicts with upstream.

(cherry picked from commit 27a0e9e)
(cherry picked from commit 9cd8b25)
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE
filesystems to acquire a distributed lock manager (DLM) lock for
protecting upcoming dirty pages when a previously read-only mapped
page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit ec36c45)
(cherry picked from commit 360f652)
Allow read_folio to return an EAGAIN error and translate it to
AOP_TRUNCATED_PAGE to retry page fault and read operations.
This is used to prevent a deadlock from folio lock/DLM lock order reversal:
 - Fault or read operations acquire the folio lock first, then the DLM lock.
 - The FUSE daemon blocks new DLM lock acquisition while it is invalidating
   the page cache; invalidate_inode_pages2_range() acquires the folio lock.
To prevent the deadlock, the FUSE daemon will fail its DLM lock acquisition
with EAGAIN if it detects an in-flight page cache invalidation.
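
A hypothetical sketch of the translation in fuse's ->read_folio (the
helper name is an assumption):

    err = fuse_do_readfolio(file, folio);
    if (err == -EAGAIN) {
            folio_unlock(folio);
            return AOP_TRUNCATED_PAGE;  /* fault/read path retries */
    }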

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit 8ecf118)
(cherry picked from commit 02eecf9)
generic/488 fails with fuse2fs in the following fashion:

generic/488       _check_generic_filesystem: filesystem on /dev/sdf is inconsistent
(see /var/tmp/fstests/generic/488.full for details)

This test opens a large number of files, unlinks them (which really just
renames them to fuse hidden files), closes the program, unmounts the
filesystem, and runs fsck to check that there aren't any inconsistencies
in the filesystem.

Unfortunately, the 488.full file shows that there are a lot of hidden
files left over in the filesystem, with incorrect link counts.  Tracing
fuse_request_* shows that there are a large number of FUSE_RELEASE
commands that are queued up on behalf of the unlinked files at the time
that fuse_conn_destroy calls fuse_abort_conn.  Had the connection not
aborted, the fuse server would have responded to the RELEASE commands by
removing the hidden files; instead they stick around.

Create a function to push all the background requests to the queue and
then wait for the number of pending events to hit zero, and call this
before fuse_abort_conn.  That way, all the pending events are processed
by the fuse server and we don't end up with a corrupt filesystem.
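
A hypothetical sketch of such a helper (names are assumptions, not the
actual patch):

    static void fuse_flush_bg_and_wait(struct fuse_conn *fc)
    {
            spin_lock(&fc->bg_lock);
            fc->max_background = UINT_MAX;  /* push everything to the queues */
            flush_bg_queue(fc);
            spin_unlock(&fc->bg_lock);

            /* wait until the server has answered all pending requests */
            wait_event(fc->blocked_waitq, !fc->num_background);
    }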

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
(cherry picked from commit d4262f9)
(cherry picked from commit 14e0495)
This is preparation to allow a fuse-io-uring bg queue flush from
flush_bg_queue().

This does two function renames:
fuse_uring_flush_bg -> fuse_uring_flush_queue_bg
fuse_uring_abort_end_requests -> fuse_uring_flush_bg

And fuse_uring_abort_end_queue_requests() is moved to
fuse_uring_stop_queues().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit e70ef24)
(cherry picked from commit cd90b4d)
This is useful for having a single API to flush background requests,
for example when the bg queue gets flushed before the rest of
fuse_conn_destroy().

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit fc4120c)
(cherry picked from commit 487773b)
When the fuse server responds to a dlm request with some error other
than ENOSYS, most likely the lock size will be set to zero. In that case
the kernel will abort the fuse connection, which is completely
unnecessary.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 0bc2f9c)
(cherry picked from commit bccb169)
Check whether dlm is still enabled when interpreting the error
returned from the fuse server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit f6fbf7c)
(cherry picked from commit 3898d27)
Fix reference count leak of payload pages during fuse argument copies.

Signed-off-by: Cheng Ding <cding@ddn.com>
(cherry picked from commit 2436409)
- Increase the possible lock size to 64 bit.
- Change the semantics of DLM locks to request start and end.
- Change the semantics of the DLM request return to mark the start and
  end of the locked area.
- Better prepare the dlm lock range cache rb-tree for unaligned byte
  range locks, which may return any value as long as it is larger than
  the range requested.
- Add the case where start and end are zero to destroy the cache.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
(cherry picked from commit 3782c41)
Take actions on the PR merged event of this repo. Run
copy-from-linux-branch.sh and create a PR for redfs.

(cherry picked from commit f54872e)
generic_file_direct_write() also does this and has a large
comment about it.

The reproducer here is xfstests' generic/209, which exercises exactly
this: competing DIO writes and cached IO reads.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 04869c8)
(cherry picked from commit 45d035e)
This was done as a condition on direct_io_allow_mmap, but I believe
this is not right, as a file might be opened two times - once with
write-back enabled and another time with FOPEN_DIRECT_IO.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 45908a9)
(cherry picked from commit c18d105)
This is another preparation and will be used for deciding
which queue to add a request to.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
(cherry picked from commit e4698fa)
(cherry picked from commit 4c3941b)
This is preparation for follow-up commits that allow running with a
reduced number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit 2e27c33)
(cherry picked from commit 32d402c)
Add per-CPU and per-NUMA node bitmasks to track which
io-uring queues are registered.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit be6edce)
(cherry picked from commit 83db987)
Queue selection (fuse_uring_get_queue) can handle a reduced number of
queues - using io-uring is now possible even with a single
queue and entry.

The FUSE_URING_REDUCED_Q flag is introduced to tell the fuse server that
reduced queues are possible, i.e. if the flag is set, the fuse server
is free to reduce the number of queues.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit f620f3d)
(cherry picked from commit db5c364)
Running background IO on a different core makes quite a difference.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=272MiB/s (285MB/s) ...
patched
   READ: bw=650MiB/s (682MB/s)

The reason is easily visible: the fio process migrates between CPUs
when requests are submitted on the queue for the same core.

With --iodepth=8

unpatched
   READ: bw=466MiB/s (489MB/s)
patched
   READ: bw=641MiB/s (672MB/s)

Without io-uring (--iodepth=8)
   READ: bw=729MiB/s (764MB/s)

Without fuse (--iodepth=8)
   READ: bw=2199MiB/s (2306MB/s)

(Tests were done with
<libfuse>/example/passthrough_hp -o allow_other --nopassthrough  \
[-o io_uring] /tmp/source /tmp/dest
)

Additional notes:

With FURING_NEXT_QUEUE_RETRIES=0 (--iodepth=8)
   READ: bw=903MiB/s (946MB/s)

With just a random qid (--iodepth=8)
   READ: bw=429MiB/s (450MB/s)

With --iodepth=1
unpatched
   READ: bw=195MiB/s (204MB/s)
patched
   READ: bw=232MiB/s (243MB/s)

With --iodepth=1 --numjobs=2
unpatched
   READ: bw=366MiB/s (384MB/s)
patched
   READ: bw=472MiB/s (495MB/s)

With --iodepth=1 --numjobs=8
unpatched
   READ: bw=1437MiB/s (1507MB/s)
patched
   READ: bw=1529MiB/s (1603MB/s)
fuse without io-uring
   READ: bw=1314MiB/s (1378MB/s), 1314MiB/s-1314MiB/s ...
no-fuse
   READ: bw=2566MiB/s (2690MB/s), 2566MiB/s-2566MiB/s ...

In summary, for async requests the core doing application IO is busy
sending requests, and IO processing should be done on a different core.
Spreading the load on random cores is also not desirable, as those cores
might be frequency scaled down and/or in C1 sleep states. Not shown here,
but differences are much smaller when the system uses the performance
governor instead of schedutil (the Ubuntu default). Obviously that comes
at the cost of higher system power consumption - not desirable either.

Results without io-uring (io-uring uses fixed libfuse threads per queue)
heavily depend on the current number of active threads. Libfuse uses a
default of max 10 threads, but the actual max number of threads is a
parameter. Also, results without fuse-io-uring heavily depend on whether
another workload was already running before, as libfuse starts these
threads dynamically - i.e. the more threads are active, the worse the
performance.

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
(cherry picked from commit c6399ea)
(cherry picked from commit 44eaf77)
This is to further improve performance.

fio --directory=/tmp/dest --name=iops.\$jobnum --rw=randread \
--bs=4k --size=1G --numjobs=1 --iodepth=4 --time_based \
--runtime=30s --group_reporting --ioengine=io_uring \
--direct=1

unpatched
   READ: bw=650MiB/s (682MB/s)
patched:
   READ: bw=995MiB/s (1043MB/s)

with --iodepth=8

unpatched
   READ: bw=641MiB/s (672MB/s)
patched
   READ: bw=966MiB/s (1012MB/s)

The reason is that with --iodepth=x (x > 1) fio submits multiple async
requests, and a single queue might become CPU limited, i.e. spreading
the load helps.

(cherry picked from commit 2e73b0b)
(cherry picked from commit 7081fec)
With the reduced queue feature, io-uring is marked as ready after
receiving the 1st ring entry. At that time other queues might still be
in the process of registration, and then a race happens:

fuse_uring_queue_fuse_req -> no queue entry registered yet
    list_add_tail -> fuse request gets queued

So far, fetching requests from the list only happened from
FUSE_IO_URING_CMD_COMMIT_AND_FETCH, but without new requests on the
same queue it would actually never send requests from that queue - the
request was stuck.

(cherry picked from commit 4cbee2e)
@bsbernd bsbernd merged commit 2407a55 into redfs-rhel9_4-427.42.1 Nov 10, 2025
@bsbernd bsbernd deleted the ddn-update-redfs-rhel9_4-427.42.1 branch November 12, 2025 16:45
openunix pushed a commit that referenced this pull request Jan 4, 2026
… to macb_open()

In the non-RT kernel, local_bh_disable() merely disables preemption,
whereas it maps to an actual spin lock in the RT kernel. Consequently,
when attempting to refill RX buffers via netdev_alloc_skb() in
macb_mac_link_up(), a deadlock scenario arises as follows:

   WARNING: possible circular locking dependency detected
   6.18.0-08691-g2061f18ad76e #39 Not tainted
   ------------------------------------------------------
   kworker/0:0/8 is trying to acquire lock:
   ffff00080369bbe0 (&bp->lock){+.+.}-{3:3}, at: macb_start_xmit+0x808/0xb7c

   but task is already holding lock:
   ffff000803698e58 (&queue->tx_ptr_lock){+...}-{3:3}, at: macb_start_xmit
   +0x148/0xb7c

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #3 (&queue->tx_ptr_lock){+...}-{3:3}:
          rt_spin_lock+0x50/0x1f0
          macb_start_xmit+0x148/0xb7c
          dev_hard_start_xmit+0x94/0x284
          sch_direct_xmit+0x8c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #2 (_xmit_ETHER#2){+...}-{3:3}:
          rt_spin_lock+0x50/0x1f0
          sch_direct_xmit+0x11c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #1 ((softirq_ctrl.lock)){+.+.}-{3:3}:
          lock_release+0x250/0x348
          __local_bh_enable_ip+0x7c/0x240
          __netdev_alloc_skb+0x1b4/0x1d8
          gem_rx_refill+0xdc/0x240
          gem_init_rings+0xb4/0x108
          macb_mac_link_up+0x9c/0x2b4
          phylink_resolve+0x170/0x614
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   -> #0 (&bp->lock){+.+.}-{3:3}:
          __lock_acquire+0x15a8/0x2084
          lock_acquire+0x1cc/0x350
          rt_spin_lock+0x50/0x1f0
          macb_start_xmit+0x808/0xb7c
          dev_hard_start_xmit+0x94/0x284
          sch_direct_xmit+0x8c/0x37c
          __dev_queue_xmit+0x708/0x1120
          neigh_resolve_output+0x148/0x28c
          ip6_finish_output2+0x2c0/0xb2c
          __ip6_finish_output+0x114/0x308
          ip6_output+0xc4/0x4a4
          mld_sendpack+0x220/0x68c
          mld_ifc_work+0x2a8/0x4f4
          process_one_work+0x20c/0x5f8
          worker_thread+0x1b0/0x35c
          kthread+0x144/0x200
          ret_from_fork+0x10/0x20

   other info that might help us debug this:

   Chain exists of:
     &bp->lock --> _xmit_ETHER#2 --> &queue->tx_ptr_lock

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&queue->tx_ptr_lock);
                                  lock(_xmit_ETHER#2);
                                  lock(&queue->tx_ptr_lock);
     lock(&bp->lock);

    *** DEADLOCK ***

   Call trace:
    show_stack+0x18/0x24 (C)
    dump_stack_lvl+0xa0/0xf0
    dump_stack+0x18/0x24
    print_circular_bug+0x28c/0x370
    check_noncircular+0x198/0x1ac
    __lock_acquire+0x15a8/0x2084
    lock_acquire+0x1cc/0x350
    rt_spin_lock+0x50/0x1f0
    macb_start_xmit+0x808/0xb7c
    dev_hard_start_xmit+0x94/0x284
    sch_direct_xmit+0x8c/0x37c
    __dev_queue_xmit+0x708/0x1120
    neigh_resolve_output+0x148/0x28c
    ip6_finish_output2+0x2c0/0xb2c
    __ip6_finish_output+0x114/0x308
    ip6_output+0xc4/0x4a4
    mld_sendpack+0x220/0x68c
    mld_ifc_work+0x2a8/0x4f4
    process_one_work+0x20c/0x5f8
    worker_thread+0x1b0/0x35c
    kthread+0x144/0x200
    ret_from_fork+0x10/0x20

Notably, invoking the mog_init_rings() callback upon link establishment
is unnecessary. Instead, we can exclusively call mog_init_rings() within
the ndo_open() callback. This adjustment resolves the deadlock issue.
Furthermore, since MACB_CAPS_MACB_IS_EMAC cases do not use mog_init_rings()
when opening the network interface via at91ether_open(), moving
mog_init_rings() to macb_open() also eliminates the MACB_CAPS_MACB_IS_EMAC
check.
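
In outline (sketch), the ring setup moves into macb_open():

    static int macb_open(struct net_device *dev)
    {
            struct macb *bp = netdev_priv(dev);
            /* ... */
            /* init rings once at open time, not on every link-up */
            bp->macbgem_ops.mog_init_rings(bp);
            /* ... */
    }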

Fixes: 633e98a ("net: macb: use resolved link config in mac_link_up()")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Hao <kexin.hao@windriver.com>
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Link: https://patch.msgid.link/20251222015624.1994551-1-xiaolei.wang@windriver.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>