Skip to content

Conversation

@hbirth
Copy link
Collaborator

@hbirth hbirth commented Aug 18, 2025

No description provided.

hbirth and others added 4 commits August 14, 2025 13:00
When writing back pages while using writeback caching the code did a copy of data into
temporary pages to avoid a deadlock in reclaiming of memory.

This is an adaptation and backport of a patch by Joanne Koong joannelkoong@gmail.com.

Since we use pinned memory with io_uring we don't need the temporary copies
and we don't use the AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM flag in the pagemap.

Link: https://www.spinics.net/lists/linux-mm/msg407405.html
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
When calling the fuse server with a dlm request and the fuse server
responds with some other error than ENOSYS most likely the lock size
will be set to zero. In that case the kernel will abort the fuse
connection. This is completely unnecessary.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Check whether dlm is still enabled when interpreting the returned
error from fuse server.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
Sometimes the file offset alignment needs to be opt-in to achieve the
optimum performance at the backend store.

For example when ErasureCode [1] is used at the backend store, the
optimum write performance is achieved when the WRITE request is aligned
with the stripe size of ErasureCode.  Otherwise a non-aligned WRITE
request needs to be split at the stripe size boundary.  It is quite
costly to handle these split partial requests, as firstly the whole
stripe to which the split partial request belongs needs to be read out,
then overwrite the read stripe buffer with the request, and finally write
the whole stripe back to the persistent storage.

Thus the backend store can suffer severe performance degradation when
WRITE requests can not fit into one stripe exactly.  The write performance
can be 10x slower when the request is 256KB in size given 4MB stripe size.
Also there can be 50% performance degradation in theory if the request
is not stripe boundary aligned.

Besides, the conveyed test indicates that, the non-alignment issue
becomes more severe when decreasing fuse's max_ratio, maybe partly
because the background writeback now is more likely to run parallelly
with the dirtier.

fuse's max_ratio	ratio of aligned WRITE requests
----------------	-------------------------------
70			99.9%
40			74%
20			45%
10			20%

With the patched version, which makes the alignment constraint opt-in
when constructing WRITE requests, the ratio of aligned WRITE requests
increases to 98% (previously 20%) when fuse's max_ratio is 10.

fuse: fix alignment to work with redfs ubuntu

- small fix to make the fuse alignment patch work with redfs ubuntu 6.8.x
- add writeback_control to fuse_writepage_need_send() to make
more accurate decisions about when to skip sending data
- fix shift number for FUSE_ALIGN_PG_ORDER
- remove test code

[1] https://lore.kernel.org/linux-fsdevel/20240124070512.52207-1-jefflexu@linux.alibaba.com/T/#m9bce469998ea6e4f911555c6f7be1e077ce3d8b4
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
@hbirth hbirth changed the title add latest patches for fuse adapt patches from ubuntu redfs fuse for RHEL 10 Aug 18, 2025
@bsbernd
Copy link
Collaborator

bsbernd commented Aug 18, 2025

Would be good to change "fuse: fix commit 5c021bd to work for RHEL10" to be more descriptive - the commit might changes on rebase.

@hbirth
Copy link
Collaborator Author

hbirth commented Aug 18, 2025

Would be good to change "fuse: fix commit 5c021bd to work for RHEL10" to be more descriptive - the commit might changes on rebase.

oh yes ... forgot about that .. done!

bsbernd
bsbernd previously approved these changes Sep 10, 2025
Fuse in the linux kernel changed slightly since the patches to
remove tmp page copies were developed for linux 6.8.
The main logic did not change only some convenience functions
were added to the kernel which the patch did not use for
compatibility reasons since they were not present in lunx 6.8.

- use fuse_writepage_finish_stat() where appropriate
- use fuse_writepage_args_setup() to setup the args

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
@bsbernd bsbernd merged commit f1ec4a9 into DDNStorage:redfs-rhel10_0 Sep 24, 2025
bsbernd pushed a commit that referenced this pull request Nov 7, 2025
jira LE-1907
cve CVE-2024-41041
Rebuild_History Non-Buildable kernel-5.14.0-427.33.1.el9_4
commit-author Kuniyuki Iwashima <kuniyu@amazon.com>
commit 5c0b485

syzkaller triggered the warning [0] in udp_v4_early_demux().

In udp_v[46]_early_demux() and sk_lookup(), we do not touch the refcount
of the looked-up sk and use sock_pfree() as skb->destructor, so we check
SOCK_RCU_FREE to ensure that the sk is safe to access during the RCU grace
period.

Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
into the hash table.  Moreover, the SOCK_RCU_FREE check is done too early
in udp_v[46]_early_demux() and sk_lookup(), so there could be a small race
window:

  CPU1                                 CPU2
  ----                                 ----
  udp_v4_early_demux()                 udp_lib_get_port()
  |                                    |- hlist_add_head_rcu()
  |- sk = __udp4_lib_demux_lookup()    |
  |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
                                       `- sock_set_flag(sk, SOCK_RCU_FREE)

We had the same bug in TCP and fixed it in commit 871019b ("net:
set SOCK_RCU_FREE before inserting socket into hashtable").

Let's apply the same fix for UDP.

[0]:
WARNING: CPU: 0 PID: 11198 at net/ipv4/udp.c:2599 udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Modules linked in:
CPU: 0 PID: 11198 Comm: syz-executor.1 Not tainted 6.9.0-g93bda33046e7 #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Code: c5 7a 15 fe bb 01 00 00 00 44 89 e9 31 ff d3 e3 81 e3 bf ef ff ff 89 de e8 2c 74 15 fe 85 db 0f 85 02 06 00 00 e8 9f 7a 15 fe <0f> 0b e8 98 7a 15 fe 49 8d 7e 60 e8 4f 39 2f fe 49 c7 46 60 20 52
RSP: 0018:ffffc9000ce3fa58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8318c92c
RDX: ffff888036ccde00 RSI: ffffffff8318c2f1 RDI: 0000000000000001
RBP: ffff88805a2dd6e0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0001ffffffffffff R12: ffff88805a2dd680
R13: 0000000000000007 R14: ffff88800923f900 R15: ffff88805456004e
FS:  00007fc449127640(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc449126e38 CR3: 000000003de4b002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 <TASK>
 ip_rcv_finish_core.constprop.0+0xbdd/0xd20 net/ipv4/ip_input.c:349
 ip_rcv_finish+0xda/0x150 net/ipv4/ip_input.c:447
 NF_HOOK include/linux/netfilter.h:314 [inline]
 NF_HOOK include/linux/netfilter.h:308 [inline]
 ip_rcv+0x16c/0x180 net/ipv4/ip_input.c:569
 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5624
 __netif_receive_skb+0x21/0xd0 net/core/dev.c:5738
 netif_receive_skb_internal net/core/dev.c:5824 [inline]
 netif_receive_skb+0x271/0x300 net/core/dev.c:5884
 tun_rx_batched drivers/net/tun.c:1549 [inline]
 tun_get_user+0x24db/0x2c50 drivers/net/tun.c:2002
 tun_chr_write_iter+0x107/0x1a0 drivers/net/tun.c:2048
 new_sync_write fs/read_write.c:497 [inline]
 vfs_write+0x76f/0x8d0 fs/read_write.c:590
 ksys_write+0xbf/0x190 fs/read_write.c:643
 __do_sys_write fs/read_write.c:655 [inline]
 __se_sys_write fs/read_write.c:652 [inline]
 __x64_sys_write+0x41/0x50 fs/read_write.c:652
 x64_sys_call+0xe66/0x1990 arch/x86/include/generated/asm/syscalls_64.h:2
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fc44a68bc1f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 cf f5 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 3c d0 f5 ff 48
RSP: 002b:00007fc449126c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007fc44a68bc1f
RDX: 0000000000000032 RSI: 00000000200000c0 RDI: 00000000000000c8
RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000032 R11: 0000000000000293 R12: 0000000000000000
R13: 000000000000000b R14: 00007fc44a5ec530 R15: 0000000000000000
 </TASK>

Fixes: 6acc9b4 ("bpf: Add helper to retrieve socket in BPF")
	Reported-by: syzkaller <syzkaller@googlegroups.com>
	Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240709191356.24010-1-kuniyu@amazon.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>

(cherry picked from commit 5c0b485)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants