fuse: multi-node mmap support #7

cding-ddn · 2025-07-07T17:47:31Z

Add PAGE_MKWRITE fuse request to allow FUSE daemon to acquire DLM lock for protecting dirty page creation.

Allow read_folio to return an EAGAIN error and translate it to AOP_TRUNCATED_PAGE so that page fault and read operations are retried. This prevents a deadlock caused by a folio lock/DLM lock order reversal:

- Fault or read operations acquire the folio lock first, then the DLM lock.
- The FUSE daemon blocks new DLM lock acquisition while it is invalidating the page cache, and invalidate_inode_pages2_range() acquires the folio lock.

To prevent the deadlock, the FUSE daemon fails its DLM lock acquisition with EAGAIN if it detects an in-flight page cache invalidation.

This enables memory mapping across cluster nodes with proper distributed locking coordination.
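The retry path described above can be sketched in userspace C. This is only a simulation under stated assumptions, not the patch itself: `daemon_sim`, `daemon_dlm_lock()`, `sim_read_folio()` and `sim_fault()` are hypothetical names, and the only value borrowed from the kernel is the numeric `AOP_TRUNCATED_PAGE` code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Same numeric value as the kernel's AOP_TRUNCATED_PAGE, redefined here
 * so the sketch builds in userspace. */
#define AOP_TRUNCATED_PAGE 0x80001

/* Hypothetical stand-in for the FUSE daemon: counts how many upcoming
 * lock attempts will collide with an in-flight invalidation. */
struct daemon_sim {
	int invalidations_pending;
};

/* Daemon side: fail the DLM lock with -EAGAIN instead of blocking while
 * a page cache invalidation (which needs the folio lock) is running. */
static int daemon_dlm_lock(struct daemon_sim *d)
{
	if (d->invalidations_pending > 0) {
		d->invalidations_pending--;
		return -EAGAIN;
	}
	return 0;
}

/* Kernel side: entered with the folio locked, returns with it unlocked.
 * -EAGAIN from the daemon becomes AOP_TRUNCATED_PAGE for the caller. */
static int sim_read_folio(struct daemon_sim *d, bool *folio_locked)
{
	int err;

	assert(*folio_locked);
	err = daemon_dlm_lock(d);
	*folio_locked = false;	/* dropping the lock lets invalidation proceed */
	if (err == -EAGAIN)
		return AOP_TRUNCATED_PAGE;
	return err;
}

/* Fault/read caller: re-locks the folio and retries while told to. */
static int sim_fault(struct daemon_sim *d, int *attempts)
{
	int ret;

	do {
		bool folio_locked = true;	/* lock order: folio first */

		(*attempts)++;
		ret = sim_read_folio(d, &folio_locked);
		assert(!folio_locked);
	} while (ret == AOP_TRUNCATED_PAGE);
	return ret;
}
```

With two pending invalidations, `sim_fault()` needs three attempts before it succeeds. Neither side ever waits for one lock while holding the other, which is what breaks the cycle.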

bsbernd · 2025-07-07T22:29:49Z

fs/fuse/fuse_i.h

+	unsigned int no_mkwrite:1;
+
 	/* Use io_uring for communication */
 	unsigned int io_uring;


Interesting, needs to be switched to io_uring:1

(unrelated, but had slipped through so far)
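To illustrate the point about `io_uring:1`, here is a small sketch comparing the two declarations. The struct names are hypothetical and the exact sizes depend on the compiler and ABI, so treat the comparison as typical rather than guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Layout as in the quoted hunk: one single-bit flag followed by a
 * full-width unsigned int used as a flag. */
struct flags_full {
	unsigned int no_mkwrite:1;
	unsigned int io_uring;
};

/* Suggested layout: both flags share one storage unit. */
struct flags_packed {
	unsigned int no_mkwrite:1;
	unsigned int io_uring:1;
};

/* Bytes saved by the packed layout on the current compiler/ABI. */
static size_t flags_bytes_saved(void)
{
	return sizeof(struct flags_full) - sizeof(struct flags_packed);
}
```

On common GCC and Clang targets the packed form is one word smaller. Both forms behave the same for a flag that is only ever set to 0 or 1.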

fs/fuse/file.c

bsbernd · 2025-07-07T22:34:26Z

The disadvantage of this approach is that we get a PAGE_MKWRITE request for every page, which will be expensive.

bsbernd · 2025-07-07T22:35:50Z

Needs a "Signed-off-by"

bsbernd · 2025-07-18T10:17:01Z

@cding-ddn I can't merge, there are conflicts. I think the 1st patch in the series is already merged.

Renumber the operation code to a high value to avoid conflicts with upstream.

Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE filesystems to acquire a distributed lock manager (DLM) lock for protecting upcoming dirty pages when a previously read-only mapped page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>
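A rough userspace sketch of that handler flow follows. Everything except the opcode value 100 (taken from the renumbering commit) is an assumption made for illustration: the `sim_conn` struct, the byte-range arguments, the 4096-byte page size, and the ENOSYS fallback that sets `no_mkwrite` are hypothetical and may differ from the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FUSE_DLM_WB_LOCK 100	/* value from the renumbering commit */
#define SIM_PAGE_SIZE 4096u	/* assumed page size */

/* Hypothetical connection state; only no_mkwrite appears in the diff. */
struct sim_conn {
	unsigned int no_mkwrite:1;
	unsigned int requests_sent;
	int (*send)(struct sim_conn *c, uint32_t opcode,
		    uint64_t start, uint64_t end);
};

/* Called when a read-only mapped page is about to become writable. */
static int sim_page_mkwrite(struct sim_conn *c, uint64_t page_index)
{
	uint64_t start = page_index * SIM_PAGE_SIZE;
	uint64_t end = start + SIM_PAGE_SIZE - 1;
	int err;

	if (c->no_mkwrite)	/* daemon already said it has no handler */
		return 0;

	c->requests_sent++;
	err = c->send(c, FUSE_DLM_WB_LOCK, start, end);
	if (err == -ENOSYS) {	/* assumed fallback: remember and proceed */
		c->no_mkwrite = 1;
		return 0;
	}
	return err;
}

/* Simulated daemon that grants the lock for a valid range. */
static int daemon_with_dlm(struct sim_conn *c, uint32_t opcode,
			   uint64_t start, uint64_t end)
{
	(void)c;
	return (opcode == FUSE_DLM_WB_LOCK && start <= end) ? 0 : -EINVAL;
}

/* Simulated daemon that does not implement the request. */
static int daemon_without_dlm(struct sim_conn *c, uint32_t opcode,
			      uint64_t start, uint64_t end)
{
	(void)c; (void)opcode; (void)start; (void)end;
	return -ENOSYS;
}
```

The sketch also makes the cost raised in review visible: with a daemon that implements the request, every first write to a page produces one round trip.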

Allow read_folio to return an EAGAIN error and translate it to AOP_TRUNCATED_PAGE so that page fault and read operations are retried. This prevents a deadlock caused by a folio lock/DLM lock order reversal:

- Fault or read operations acquire the folio lock first, then the DLM lock.
- The FUSE daemon blocks new DLM lock acquisition while it is invalidating the page cache, and invalidate_inode_pages2_range() acquires the folio lock.

To prevent the deadlock, the FUSE daemon fails its DLM lock acquisition with EAGAIN if it detects an in-flight page cache invalidation.

Signed-off-by: Cheng Ding <cding@ddn.com>

cding-ddn · 2025-07-18T10:44:08Z

@bernd, I did a rebase, it can be merged now

jira LE-1907
Rebuild_History Non-Buildable kernel-5.14.0-427.37.1.el9_4
commit-author Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
commit d11a676

Ethtool callbacks can be executed while reset is in progress and try to access deleted resources, e.g. getting coalesce settings can result in a NULL pointer dereference seen below.

Reproduction steps:

Once the driver is fully initialized, trigger reset:

    # echo 1 > /sys/class/net/<interface>/device/reset

when reset is in progress try to get coalesce settings using ethtool:

    # ethtool -c <interface>

    BUG: kernel NULL pointer dereference, address: 0000000000000020
    PGD 0 P4D 0
    Oops: Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 11 PID: 19713 Comm: ethtool Tainted: G S 6.10.0-rc7+ #7
    RIP: 0010:ice_get_q_coalesce+0x2e/0xa0 [ice]
    RSP: 0018:ffffbab1e9bcf6a8 EFLAGS: 00010206
    RAX: 000000000000000c RBX: ffff94512305b028 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff9451c3f2e588 RDI: ffff9451c3f2e588
    RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff9451c3f2e580 R11: 000000000000001f R12: ffff945121fa9000
    R13: ffffbab1e9bcf760 R14: 0000000000000013 R15: ffffffff9e65dd40
    FS: 00007faee5fbe740(0000) GS:ffff94546fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000020 CR3: 0000000106c2e005 CR4: 00000000001706f0
    Call Trace:
    <TASK>
    ice_get_coalesce+0x17/0x30 [ice]
    coalesce_prepare_data+0x61/0x80
    ethnl_default_doit+0xde/0x340
    genl_family_rcv_msg_doit+0xf2/0x150
    genl_rcv_msg+0x1b3/0x2c0
    netlink_rcv_skb+0x5b/0x110
    genl_rcv+0x28/0x40
    netlink_unicast+0x19c/0x290
    netlink_sendmsg+0x222/0x490
    __sys_sendto+0x1df/0x1f0
    __x64_sys_sendto+0x24/0x30
    do_syscall_64+0x82/0x160
    entry_SYSCALL_64_after_hwframe+0x76/0x7e
    RIP: 0033:0x7faee60d8e27

Calling netif_device_detach() before reset makes the net core not call the driver when ethtool command is issued, the attempt to execute an ethtool command during reset will result in the following message:

    netlink error: No such device

instead of NULL pointer dereference. Once reset is done and ice_rebuild() is executing, the netif_device_attach() is called to allow for ethtool operations to occur again in a safe manner.

Fixes: fcea6f3 ("ice: Add stats and ethtool support")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit d11a676)
Signed-off-by: Jonathan Maple <jmaple@ciq.com>

jira LE-4311
cve CVE-2025-38472
Rebuild_History Non-Buildable kernel-5.14.0-570.49.1.el9_6
commit-author Florian Westphal <fw@strlen.de>
commit 2d72afb

A crash in conntrack was reported while trying to unlink the conntrack entry from the hash bucket list:

    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
    #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
    #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
    #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in a partially initialised state:

ct hlist pointer is garbage; looks like the ct hash value (hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like normal udp conntrack entry. If we ignore ct->status and pretend its 0, the entry matches those that are newly allocated but not yet inserted into the hash:

- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED, __nf_conntrack_find_get() would have skipped the entry.

Theory is that we did hit following race:

    cpu x                   cpu y                   cpu z
     found entry E          found entry E
     E is expired           <preemption>
     nf_ct_delete()
     return E to rcu slab
                                                    init_conntrack
                                                    E is re-inited,
                                                    ct->status set to 0
                                                    reply tuplehash hnnode.pprev
                                                    stores hash value.

cpu y found E right before it was deleted on cpu x. E is now re-inited on cpu z. cpu y was preempted before checking for expiry and/or confirm bit.

                                                    ->refcnt set to 1
                                                    E now owned by skb
                                                    ->timeout set to 30000

If cpu y were to resume now, it would observe E as expired but would skip E due to missing CONFIRMED bit.

                                                    nf_conntrack_confirm gets called
                                                    sets: ct->status |= CONFIRMED
                                                    This is wrong: E is not yet added
                                                    to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:

                            <resumes>
                            nf_ct_expired()
                             -> yes (ct->timeout is 30s)
                            confirmed bit set.

cpu y will try to delete E from the hashtable:

                            nf_ct_delete() -> set DYING bit
                            __nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash: cpu z still holds the table bucket lock(s) so y blocks:

                            wait for spinlock held by z

                                                    CONFIRMED is set but there is no
                                                    guarantee ct will be added to hash:
                                                    "chaintoolong" or "clash resolution"
                                                    logic both skip the insert step.
                                                    reply hnnode.pprev still stores the
                                                    hash value.

                                                    unlocks spinlock
                                                    return NF_DROP
                            <unblocks, then
                             crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence: ct refcnt remains at 1, eventually skb will be free'd and E gets destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen before hlist add resp. the timeout fixup, so switch to set_bit and before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right before the CONFIRMED bit was set: Such event cannot be distinguished from above "E is the old incarnation" case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:

1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an expiry check. After this patch, once the confirmed bit check passes ct->timeout has been altered to reflect the absolute 'best before' date instead of a relative time. Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
   4 - entry is still observed as expired
   5 - meanwhile, ct->timeout is corrected to absolute value on other CPU and confirm bit gets set
   6 - confirm bit is seen
   7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either confirmed bit unset (entry gets skipped) or expiry re-check failure for re-inited conntrack objects.

This change cannot be backported to releases before 5.19. Without commit 8a75a2c ("netfilter: conntrack: remove unconfirmed list") |= IPS_CONFIRMED line cannot be moved without further changes.

Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5 ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 2d72afb)
Signed-off-by: Jonathan Maple <jmaple@ciq.com>
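The ordering argument in this commit message (insert and fix up the timeout first, publish the confirmed bit last, and have the gc check test the confirmed bit before the expiry) can be sketched with C11 atomics. This is an analogy only: the kernel patch uses set_bit with a before_atomic barrier, whereas this sketch swaps in release/acquire ordering, and `sim_ct`, `sim_confirm()` and `sim_should_gc()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_CONFIRMED 0x1u

/* Hypothetical, stripped-down conntrack entry. */
struct sim_ct {
	atomic_uint status;
	uint32_t timeout;	/* relative until confirmed, absolute after */
	bool in_table;
};

/* Confirm path: the CONFIRMED bit is published last. */
static void sim_confirm(struct sim_ct *ct, uint32_t now)
{
	ct->in_table = true;	/* 1. insert into the hash table */
	ct->timeout += now;	/* 2. relative -> absolute timeout */
	/* 3. release ordering keeps steps 1 and 2 visible before the bit */
	atomic_fetch_or_explicit(&ct->status, SIM_CONFIRMED,
				 memory_order_release);
}

/* GC re-check: test the confirmed bit first, then the expiry. */
static bool sim_should_gc(struct sim_ct *ct, uint32_t now)
{
	unsigned int s = atomic_load_explicit(&ct->status,
					      memory_order_acquire);

	if (!(s & SIM_CONFIRMED))
		return false;	/* not yet in the table: skip the entry */
	/* signed difference tolerates counter wrap-around */
	return (int32_t)(ct->timeout - now) <= 0;
}
```

An unconfirmed entry is never collected even if its relative timeout looks expired against the current time, and a confirmed entry is collected only once its absolute timeout has passed.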

cding-ddn requested a review from bsbernd July 7, 2025 17:47

bsbernd reviewed Jul 7, 2025

yongzech self-requested a review July 8, 2025 05:35

bsbernd reviewed Jul 10, 2025

cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from e77b0a5 to b6ebff1 Compare July 11, 2025 09:14

bsbernd reviewed Jul 15, 2025

cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from b6ebff1 to 989869f Compare July 16, 2025 08:06

bsbernd reviewed Jul 16, 2025

bsbernd reviewed Jul 16, 2025

bsbernd reviewed Jul 16, 2025

bsbernd approved these changes Jul 16, 2025

cding-ddn changed the title ~~fuse: add PAGE_MKWRITE opcode~~ fuse: multi-node mmap support Jul 17, 2025

cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from 989869f to af8a424 Compare July 17, 2025 17:08

cding-ddn added 3 commits July 18, 2025 10:42

fuse: Renumber FUSE_DLM_WB_LOCK to 100

27a0e9e

Renumber the operation code to a high value to avoid conflicts with upstream.

cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from af8a424 to 8ecf118 Compare July 18, 2025 10:42

bsbernd merged commit 391f71c into DDNStorage:redfs-ubuntu-noble-6.8.0-58.60 Jul 18, 2025

cding-ddn deleted the mkwrite-noble-6.8.0-58.60 branch September 23, 2025 18:13
