[5.15-velinux] Intel: Backport to fix EDAC driver for GNR platform #42

Merged
guojinhui-liam merged 4 commits into 5.15-velinux from 5.15-velinux-edac-fix-gnr on May 8, 2025

@AichunShi

This PR fixes the EDAC driver for the Intel GNR (Granite Rapids) platform on the 5.15-velinux kernel.

Upstream commits from v6.13:
a36667037a0c0e36c59407f8ae636295390239a5 EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
2397f795735219caa9c2fe61e7bcdd0652e670d3 EDAC/skx_common: Differentiate memory error sources

Upstream commits from v6.11 to fix a compile warning:
8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
123b158 EDAC, i10nm: make skx_common.o a separate module

Test

Built and ran the kernel successfully.
The EDAC test passes on the GNR platform.

Configs

No Change.

arndb and others added 4 commits March 27, 2025 18:14
EDAC, i10nm: make skx_common.o a separate module

commit 123b158 upstream.

Commit 598afa0 ("kbuild: warn objects shared among multiple modules")
was added to track down cases where the same object is linked into
multiple modules. This can cause serious problems if some modules are
builtin while others are not.

That test triggers this warning:

scripts/Makefile.build:236: drivers/edac/Makefile: skx_common.o is added to multiple modules: i10nm_edac skx_edac

Make this a separate module instead.

[Tony: Added more background details to commit message]

Intel-SIG: commit 123b158 EDAC, i10nm: make skx_common.o a separate module
Backport to fix EDAC driver for GNR

Fixes: d4dc89d ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240529095132.1929397-1-arnd@kernel.org/
[ Aichun Shi: amend commit log ]
Signed-off-by: Aichun Shi <aichun.shi@intel.com>

EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common

commit 8b93582 upstream.

Commit

  afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module")

made skx_common.o a separate module. With skx_common.o now a separate
module, move the common debug code setup_{skx,i10nm}_debug() and
teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to
reduce code duplication. Additionally, prefix these function names with
'skx' to maintain consistency with other names in the file.

Intel-SIG: commit 8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
Backport to fix EDAC driver for GNR

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240829055101.56245-1-qiuxu.zhuo@intel.com
[ Aichun Shi: amend commit log ]
Signed-off-by: Aichun Shi <aichun.shi@intel.com>

EDAC/skx_common: Differentiate memory error sources

commit 2397f795735219caa9c2fe61e7bcdd0652e670d3 upstream.

The current skx_common determines whether the memory error source is the
near memory of the 2LM system and then retrieves the decoded error results
from the ADXL components (near-memory vs. far-memory) accordingly.

However, some memory controllers may have limitations in correctly
reporting the memory error source, leading to the retrieval of incorrect
decoded parts from the ADXL.

To address these limitations, instead of simply determining whether the
memory error is from the near memory of the 2LM system, it is necessary to
distinguish the memory error source details as follows:

  Memory error from the near memory of the 2LM system.
  Memory error from the far memory of the 2LM system.
  Memory error from the 1LM system.
  Not a memory error.

This will enable the i10nm_edac driver to take appropriate actions for
those memory controllers that have limitations in reporting the memory
error source.
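
As an illustration only, the four-way distinction listed above could be modeled in C roughly as follows. The enum, the function names, and the three boolean inputs are hypothetical and are not taken from the driver; they only show how a single "is it 2LM near memory?" question becomes a four-valued classification.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: the driver's own identifiers may differ. */
enum mem_err_source {
	SRC_2LM_NEAR,   /* near memory of a 2LM system */
	SRC_2LM_FAR,    /* far memory of a 2LM system */
	SRC_1LM,        /* 1LM system */
	SRC_NOT_MEMORY, /* not a memory error */
};

/*
 * Classify an error from three assumed inputs instead of answering
 * only "is this the near memory of a 2LM system?".
 */
static enum mem_err_source classify_source(bool is_mem_err, bool is_2lm,
					   bool in_near_mem)
{
	if (!is_mem_err)
		return SRC_NOT_MEMORY;
	if (!is_2lm)
		return SRC_1LM;
	return in_near_mem ? SRC_2LM_NEAR : SRC_2LM_FAR;
}

/* Human-readable label, e.g. for a log line. */
static const char *source_name(enum mem_err_source src)
{
	switch (src) {
	case SRC_2LM_NEAR:   return "2LM near memory";
	case SRC_2LM_FAR:    return "2LM far memory";
	case SRC_1LM:        return "1LM";
	case SRC_NOT_MEMORY: return "not a memory error";
	}
	assert(!"unknown error source"); /* unreachable for valid enum values */
	return "unknown";
}
```

With a four-valued result, a caller can special-case controllers whose near/far report is unreliable without affecting the 1LM and non-memory paths.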

Intel-SIG: commit 2397f7957352 EDAC/skx_common: Differentiate memory error sources
Backport to fix EDAC driver for GNR

Fixes: ba987ea ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com>
Link: https://lore.kernel.org/r/20241015072236.24543-2-qiuxu.zhuo@intel.com
[ Aichun Shi: amend commit log ]
Signed-off-by: Aichun Shi <aichun.shi@intel.com>

EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator

commit a36667037a0c0e36c59407f8ae636295390239a5 upstream.

The Granite Rapids CPUs with Flat2LM memory configurations may
mistakenly report near-memory errors as far-memory errors, resulting
in the invalid decoded ADXL results:

  EDAC skx: Bad imc -1

Fix this incorrect far-memory error source indicator by prefetching the
decoded far-memory controller ID, and adjust the error source indicator
to near-memory if the far-memory controller ID is invalid.
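
A minimal sketch of the adjustment described above, under stated assumptions: the enum and function names are hypothetical, and a controller ID is treated as valid only when it lies in [0, num_imc), which is a guess based on the "Bad imc -1" message rather than the driver's actual check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: the driver's own identifiers may differ. */
enum mem_err_source {
	SRC_2LM_NEAR,
	SRC_2LM_FAR,
	SRC_1LM,
	SRC_NOT_MEMORY,
};

/* Assumption: an ID outside [0, num_imc) means the decode was invalid. */
static bool imc_id_valid(int imc, int num_imc)
{
	return imc >= 0 && imc < num_imc;
}

/*
 * Given the source the hardware reported and the far-memory controller
 * ID decoded ahead of time, fall back to near memory when a reported
 * far-memory error carries an invalid far-memory controller ID.
 */
static enum mem_err_source adjust_source(enum mem_err_source reported,
					 int fm_imc, int num_imc)
{
	assert(num_imc > 0);

	if (reported == SRC_2LM_FAR && !imc_id_valid(fm_imc, num_imc))
		return SRC_2LM_NEAR;
	return reported;
}
```

Only the far-memory case is rewritten; every other reported source passes through unchanged.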

Intel-SIG: commit a36667037a0c EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
Backport to fix EDAC driver for GNR

Fixes: ba987ea ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com>
Link: https://lore.kernel.org/r/20241015072236.24543-3-qiuxu.zhuo@intel.com
[ Aichun Shi: amend commit log ]
Signed-off-by: Aichun Shi <aichun.shi@intel.com>
@AichunShi marked this pull request as ready for review March 27, 2025 10:34
@qiruibd
Collaborator

qiruibd commented Apr 7, 2025

Regarding 123b158 ("EDAC, i10nm: make skx_common.o a separate module"): does this patch cause problems? I recall an upstream patch that fixed a problem caused by this commit.

@AichunShi
Author

> Regarding 123b158 ("EDAC, i10nm: make skx_common.o a separate module"): does this patch cause problems? I recall an upstream patch that fixed a problem caused by this commit.

Do you mean this commit: 8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
which is included in this PR?

@qiruibd
Collaborator

qiruibd commented Apr 8, 2025

> > Regarding 123b158 ("EDAC, i10nm: make skx_common.o a separate module"): does this patch cause problems? I recall an upstream patch that fixed a problem caused by this commit.
>
> Do you mean this commit: 8b93582 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common which is included in this PR?

I have confirmed that no upstream commit fixes this patch; that is, no later commit carries a corresponding Fixes: tag.

@qiruibd
Collaborator

qiruibd commented Apr 8, 2025

Acked

@guojinhui-liam guojinhui-liam merged commit f3c85cc into openvelinux:5.15-velinux May 8, 2025
guojinhui-liam added a commit that referenced this pull request May 8, 2025
[5.15-velinux] Intel: Backport to fix EDAC driver for GNR platform
guojinhui-liam pushed a commit that referenced this pull request Nov 18, 2025
commit 0315a07 upstream.

Commit 719c571 ("net: make napi_disable() symmetric with
enable") accidentally introduced a bug sometimes leading to a kernel
BUG when bringing an iface up/down under heavy traffic load.

Prior to this commit, napi_disable() polled n->state until neither
NAPIF_STATE_SCHED nor NAPIF_STATE_NPSVC was set and then always set
both. Now it is possible to leave with NAPIF_STATE_SCHED unset, because
'continue' inside a do-while loop jumps to the cmpxchg() call in the
loop condition with an uninitialized variable, rather than straight to
another round of the state check.

Error path looks like:

napi_disable():
unsigned long val, new; /* new is uninitialized */

do {
	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
				      NAPIF_STATE_SCHED is set */
	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
		usleep_range(20, 200);
		continue; /* go straight to the condition check */
	}
	new = val | <...>
} while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
						  writes garbage */

napi_enable():
do {
	val = READ_ONCE(n->state);
	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
<...>

while the typical BUG splat is like:

[  172.652461] ------------[ cut here ]------------
[  172.652462] kernel BUG at net/core/dev.c:6937!
[  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
[  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
[  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
[  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
[  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
< snip >
[  172.782403] Call Trace:
[  172.784857]  <TASK>
[  172.786963]  ice_up_complete+0x6f/0x210 [ice]
[  172.791349]  ice_xdp+0x136/0x320 [ice]
[  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
[  172.799648]  dev_xdp_install+0x61/0xe0
[  172.803401]  dev_xdp_attach+0x1e0/0x550
[  172.807240]  dev_change_xdp_fd+0x1e6/0x220
[  172.811338]  do_setlink+0xee8/0x1010
[  172.814917]  rtnl_setlink+0xe5/0x170
[  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
[  172.823732]  ? security_capable+0x36/0x50
< snip >

Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
for-loop with an explicit break.
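
A user-space sketch of the for-loop shape described above, using C11 atomics in place of the kernel's READ_ONCE() and cmpxchg(). The bit values, model_disable(), the wait callback, and release_busy_bits() are stand-ins invented for this illustration; this is not the kernel code, and the real function also clears other state bits.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit values, not the kernel's definitions. */
#define STATE_SCHED (1UL << 0)
#define STATE_NPSVC (1UL << 1)

/* Hypothetical stand-in for usleep_range(): invoked while the state is busy. */
typedef void (*wait_fn)(atomic_ulong *state);

static void model_disable(atomic_ulong *state, wait_fn wait)
{
	unsigned long val, new;

	assert(wait);

	for (;;) {
		val = atomic_load(state);
		if (val & (STATE_SCHED | STATE_NPSVC)) {
			wait(state);
			/* In a for-loop, 'continue' re-reads the state and
			 * cannot reach the exchange with 'new' unset. */
			continue;
		}

		new = val | STATE_SCHED | STATE_NPSVC;

		/* Explicit break: leave only after a successful exchange. */
		if (atomic_compare_exchange_strong(state, &val, new))
			break;
	}
}

/* Example wait callback: pretend the current owner releases both bits. */
static void release_busy_bits(atomic_ulong *state)
{
	atomic_fetch_and(state, ~(STATE_SCHED | STATE_NPSVC));
}
```

Because the exchange now sits in the loop body, every path that reaches it has assigned 'new' first, which is the property the do-while form lost.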

From v1 [0]:
 - just use a for-loop to simplify both the fix and the existing
   code (Eric).

[0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com

Fixes: 719c571 ("net: make napi_disable() symmetric with enable")
Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>