@cding-ddn
Collaborator

Add a PAGE_MKWRITE FUSE request so that the FUSE daemon can acquire a DLM lock to protect dirty page creation.

Allow read_folio to return an EAGAIN error and translate it to AOP_TRUNCATED_PAGE so that page fault and read operations are retried. This prevents a deadlock caused by a folio lock/DLM lock order reversal:

  • Fault or read operations acquire the folio lock first, then the DLM lock.
  • The FUSE daemon blocks new DLM lock acquisitions while it is invalidating the page cache, and invalidate_inode_pages2_range() acquires the folio lock.

To prevent the deadlock, the FUSE daemon fails the DLM lock acquisition with EAGAIN if it detects an in-flight page cache invalidation.

This enables memory mapping across cluster nodes with proper distributed locking coordination.
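
The retry behaviour described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel patch: every `sketch_*` name is hypothetical, `SKETCH_AOP_TRUNCATED_PAGE` is an arbitrary stand-in for the kernel's return code, and the daemon is reduced to a counter of in-flight invalidations.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for AOP_TRUNCATED_PAGE; the value is arbitrary. */
enum { SKETCH_AOP_TRUNCATED_PAGE = 1 };

/* Hypothetical model of the FUSE daemon state. */
struct sketch_daemon {
    int invalidations_in_flight; /* > 0 while page cache invalidation runs */
};

/* Daemon side: fail the DLM lock instead of blocking behind invalidation. */
static int sketch_dlm_lock(const struct sketch_daemon *d)
{
    return d->invalidations_in_flight > 0 ? -EAGAIN : 0;
}

/* read_folio analogue: entered with the folio lock held by the caller. */
static int sketch_read_folio(const struct sketch_daemon *d, bool *folio_locked)
{
    int err = sketch_dlm_lock(d);

    /* The folio lock is dropped on every path, so a pending invalidation
     * is never left waiting on a fault that itself waits for the DLM lock. */
    *folio_locked = false;
    if (err == -EAGAIN)
        return SKETCH_AOP_TRUNCATED_PAGE; /* ask the caller to retry */
    return err;
}

/* Fault path analogue: retry until the read no longer asks for a retry.
 * Returns the number of attempts that were needed. */
static int sketch_fault(struct sketch_daemon *d)
{
    int tries = 0;

    for (;;) {
        bool folio_locked = true; /* folio lock taken first */
        int ret = sketch_read_folio(d, &folio_locked);

        tries++;
        if (ret != SKETCH_AOP_TRUNCATED_PAGE)
            return tries;
        /* Model one invalidation finishing between two attempts. */
        if (d->invalidations_in_flight > 0)
            d->invalidations_in_flight--;
    }
}
```

In this model a fault that races with two invalidations needs three attempts, while an uncontended fault completes on the first one.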

@cding-ddn cding-ddn requested a review from bsbernd July 7, 2025 17:47
unsigned int no_mkwrite:1;

/* Use io_uring for communication */
unsigned int io_uring;
Collaborator

Interesting, needs to be switched to io_uring:1

Collaborator

(unrelated, but had slipped through so far)
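
To illustrate the review remark, here is a minimal sketch of the difference between a full-width member and a one-bit bit-field. The struct names are hypothetical and contain only the two members visible in the hunk above; the size difference depends on the ABI, although it holds on common ones.

```c
/* Hypothetical reduction of the reviewed struct: io_uring is full width. */
struct sketch_conn_mixed {
    unsigned int no_mkwrite:1;
    unsigned int io_uring;
};

/* Same members after the suggested switch to io_uring:1. */
struct sketch_conn_bits {
    unsigned int no_mkwrite:1;
    unsigned int io_uring:1;
};

/* Store a value in the one-bit field and read it back. An unsigned
 * one-bit field keeps only the lowest bit of the value. */
static unsigned int sketch_store_flag(unsigned int value)
{
    struct sketch_conn_bits c = { 0 };

    c.io_uring = value;
    return c.io_uring;
}
```

With the bit-field form both flags share one storage unit, and the field behaves as a plain boolean flag.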

@bsbernd
Collaborator

bsbernd commented Jul 7, 2025

The disadvantage of this approach is that we get a PAGE_MKWRITE request for every page, which will be expensive.

@bsbernd
Collaborator

bsbernd commented Jul 7, 2025

Needs a "Signed-off-by"

@yongzech yongzech self-requested a review July 8, 2025 05:35
@cding-ddn cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from e77b0a5 to b6ebff1 Compare July 11, 2025 09:14
@cding-ddn cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from b6ebff1 to 989869f Compare July 16, 2025 08:06
@cding-ddn cding-ddn changed the title fuse: add PAGE_MKWRITE opcode fuse: multi-node mmap support Jul 17, 2025
@cding-ddn cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from 989869f to af8a424 Compare July 17, 2025 17:08
@bsbernd
Collaborator

bsbernd commented Jul 18, 2025

@cding-ddn I can't merge, there are conflicts. I think the 1st patch in the series is already merged.

Renumber the operation code to a high value to avoid conflicts with upstream.
Send a DLM_WB_LOCK request in the page_mkwrite handler to enable FUSE
filesystems to acquire a distributed lock manager (DLM) lock for
protecting upcoming dirty pages when a previously read-only mapped
page is about to be written.

Signed-off-by: Cheng Ding <cding@ddn.com>
Allow read_folio to return an EAGAIN error and translate it to
AOP_TRUNCATED_PAGE so that page fault and read operations are retried.
This prevents a deadlock caused by a folio lock/DLM lock order reversal:
 - Fault or read operations acquire the folio lock first, then the DLM
   lock.
 - The FUSE daemon blocks new DLM lock acquisitions while it is
   invalidating the page cache, and invalidate_inode_pages2_range()
   acquires the folio lock.
To prevent the deadlock, the FUSE daemon fails the DLM lock acquisition
with EAGAIN if it detects an in-flight page cache invalidation.

Signed-off-by: Cheng Ding <cding@ddn.com>
@cding-ddn cding-ddn force-pushed the mkwrite-noble-6.8.0-58.60 branch from af8a424 to 8ecf118 Compare July 18, 2025 10:42
@cding-ddn
Collaborator Author

@bernd, I did a rebase, it can be merged now

@bsbernd bsbernd merged commit 391f71c into DDNStorage:redfs-ubuntu-noble-6.8.0-58.60 Jul 18, 2025
@cding-ddn cding-ddn deleted the mkwrite-noble-6.8.0-58.60 branch September 23, 2025 18:13
bsbernd pushed a commit that referenced this pull request Nov 7, 2025
jira LE-1907
Rebuild_History Non-Buildable kernel-5.14.0-427.37.1.el9_4
commit-author Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
commit d11a676

Ethtool callbacks can be executed while reset is in progress and try to
access deleted resources, e.g. getting coalesce settings can result in a
NULL pointer dereference seen below.

Reproduction steps:
Once the driver is fully initialized, trigger reset:
	# echo 1 > /sys/class/net/<interface>/device/reset
when reset is in progress try to get coalesce settings using ethtool:
	# ethtool -c <interface>

BUG: kernel NULL pointer dereference, address: 0000000000000020
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 11 PID: 19713 Comm: ethtool Tainted: G S                 6.10.0-rc7+ #7
RIP: 0010:ice_get_q_coalesce+0x2e/0xa0 [ice]
RSP: 0018:ffffbab1e9bcf6a8 EFLAGS: 00010206
RAX: 000000000000000c RBX: ffff94512305b028 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9451c3f2e588 RDI: ffff9451c3f2e588
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff9451c3f2e580 R11: 000000000000001f R12: ffff945121fa9000
R13: ffffbab1e9bcf760 R14: 0000000000000013 R15: ffffffff9e65dd40
FS:  00007faee5fbe740(0000) GS:ffff94546fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000020 CR3: 0000000106c2e005 CR4: 00000000001706f0
Call Trace:
<TASK>
ice_get_coalesce+0x17/0x30 [ice]
coalesce_prepare_data+0x61/0x80
ethnl_default_doit+0xde/0x340
genl_family_rcv_msg_doit+0xf2/0x150
genl_rcv_msg+0x1b3/0x2c0
netlink_rcv_skb+0x5b/0x110
genl_rcv+0x28/0x40
netlink_unicast+0x19c/0x290
netlink_sendmsg+0x222/0x490
__sys_sendto+0x1df/0x1f0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x82/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7faee60d8e27

Calling netif_device_detach() before reset makes the net core not call
the driver when an ethtool command is issued. An attempt to execute an
ethtool command during reset will then result in the following message:

    netlink error: No such device

instead of a NULL pointer dereference. Once reset is done and
ice_rebuild() is executing, netif_device_attach() is called to allow
ethtool operations to occur again in a safe manner.
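
The detach/attach pattern described above can be modelled in userspace as follows. This is a hedged sketch, not the ice driver code: all `sketch_*` names are hypothetical, the `present` flag stands in for the device-present state that netif_device_detach() and netif_device_attach() toggle, and the "core" check is folded into the accessor.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue settings that are torn down during reset. */
struct sketch_rings {
    int rx_usecs;
};

/* Hypothetical, heavily reduced net device. */
struct sketch_netdev {
    bool present;               /* cleared by detach, set by attach */
    struct sketch_rings *rings; /* NULL while a reset is in progress */
};

/* Core-side entry point: refuse the request before touching driver
 * state when the device is marked as not present. */
static int sketch_get_coalesce(const struct sketch_netdev *dev, int *rx_usecs)
{
    if (!dev->present)
        return -ENODEV; /* reported as "No such device" */
    *rx_usecs = dev->rings->rx_usecs;
    return 0;
}

/* Reset start: detach first, then release the resources. */
static void sketch_reset_begin(struct sketch_netdev *dev)
{
    dev->present = false;
    dev->rings = NULL;
}

/* Rebuild: restore the resources first, then attach again. */
static void sketch_rebuild(struct sketch_netdev *dev, struct sketch_rings *rings)
{
    dev->rings = rings;
    dev->present = true;
}
```

The ordering is the point of the sketch: the device is marked absent before its resources go away and marked present only after they are back, so the accessor never dereferences a NULL pointer.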

Fixes: fcea6f3 ("ice: Add stats and ethtool support")
	Suggested-by: Jakub Kicinski <kuba@kernel.org>
	Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
	Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
	Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
	Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit d11a676)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
openunix pushed a commit that referenced this pull request Dec 24, 2025
jira LE-4311
cve CVE-2025-38472
Rebuild_History Non-Buildable kernel-5.14.0-570.49.1.el9_6
commit-author Florian Westphal <fw@strlen.de>
commit 2d72afb

A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
 #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
 #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
 #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:

 ct hlist pointer is garbage; looks like the ct hash value
 (hence crash).
 ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
 ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like a normal udp conntrack entry.  If we ignore
ct->status and pretend it's 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
  - ct hlist pointers are overloaded and store/cache the raw tuple hash
  - ct->timeout matches the relative time expected for a new udp flow
    rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.

The theory is that we hit the following race:

cpu x 			cpu y			cpu z
 found entry E		found entry E
 E is expired		<preemption>
 nf_ct_delete()
 return E to rcu slab
					init_conntrack
					E is re-inited,
					ct->status set to 0
					reply tuplehash hnnode.pprev
					stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z.  cpu y was preempted before
checking for expiry and/or confirm bit.

					->refcnt set to 1
					E now owned by skb
					->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

					nf_conntrack_confirm gets called
					sets: ct->status |= CONFIRMED
					This is wrong: E is not yet added
					to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:
			<resumes>
			nf_ct_expired()
			 -> yes (ct->timeout is 30s)
			confirmed bit set.

cpu y will try to delete E from the hashtable:
			nf_ct_delete() -> set DYING bit
			__nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

			wait for spinlock held by z

					CONFIRMED is set but there is no
					guarantee ct will be added to hash:
					"chaintoolong" or "clash resolution"
					logic both skip the insert step.
					reply hnnode.pprev still stores the
					hash value.

					unlocks spinlock
					return NF_DROP
			<unblocks, then
			 crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen
before the hlist add or the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:

Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:
 1. Check if entry has expired, if not skip to next entry
 2. Obtain a reference to the expired entry.
 3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time.  Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

 1. Check if entry has expired.
 2. Obtain a reference.
 3. Call nf_ct_should_gc() to double-check step 1:
    4 - entry is still observed as expired
    5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
      and confirm bit gets set
    6 - confirm bit is seen
    7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
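
The ordering argument above can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the netfilter code: all `sketch_*` names are hypothetical, a release store and an acquire load stand in for the kernel's bit operation and barriers, and time is a plain counter rather than jiffies.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CONFIRMED 0x1u
#define SKETCH_DYING     0x2u

/* Hypothetical, heavily reduced conntrack entry. */
struct sketch_ct {
    atomic_uint status;
    uint64_t timeout; /* relative until confirmed, absolute afterwards */
    bool in_hash;
};

/* New entry: status cleared, timeout still a relative value. */
static void sketch_init(struct sketch_ct *ct, uint64_t relative_timeout)
{
    atomic_init(&ct->status, 0);
    ct->timeout = relative_timeout;
    ct->in_hash = false;
}

/* Confirm: fix up the timeout, insert, and only then publish the bit. */
static void sketch_confirm(struct sketch_ct *ct, uint64_t now)
{
    ct->timeout += now; /* relative -> absolute */
    ct->in_hash = true; /* stands in for the hash table insertion */
    atomic_fetch_or_explicit(&ct->status, SKETCH_CONFIRMED,
                             memory_order_release);
}

/* GC re-check: test the confirmed bit first, then the expiry. */
static bool sketch_should_gc(struct sketch_ct *ct, uint64_t now)
{
    unsigned int status =
        atomic_load_explicit(&ct->status, memory_order_acquire);

    if (!(status & SKETCH_CONFIRMED))
        return false; /* re-inited or not yet inserted: skip */
    /* The timeout is read only after the bit was seen, so it is
     * already the absolute value written by sketch_confirm(). */
    return ct->timeout <= now && !(status & SKETCH_DYING);
}
```

In this model an unconfirmed entry with a relative timeout of 30000 is never collected, even though a naive comparison against the current time would treat it as expired; once confirmed, it is collected only after its absolute deadline has passed.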

This change cannot be backported to releases before 5.19. Without
commit 8a75a2c ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.

	Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5 ("netfilter: conntrack: remove the percpu dying list")
	Signed-off-by: Florian Westphal <fw@strlen.de>
	Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 2d72afb)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
3 participants