ice: MSI-X managing #4

mswiatko · 2024-01-09T13:32:28Z

MSI-X managing

vmlinux.h declarations are more ergnomic, especially when working with kfuncs. The uapi headers are often incomplete for kfunc definitions. This commit also switches bitfield accesses to use CO-RE helpers. Switching to vmlinux.h definitions makes the verifier very unhappy with raw bitfield accesses. The error is: ; md.u.md2.dir = direction; 33: (69) r1 = *(u16 *)(r2 +11) misaligned stack access off (0x0; 0x0)+-64+11 size 2 Fix by using CO-RE-aware bitfield reads and writes. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/884bde1d9a351d126a3923886b945ea6b1b0776b.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>

test_progs is better than a shell script b/c C is a bit easier to maintain than shell. Also it's easier to use new infra like memory mapped global variables from C via bpf skeleton. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/a350db9e08520c64544562d88ec005a039124d9b.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>

This commit extends test_tunnel selftest to test the new XDP xfrm state lookup kfunc. Co-developed-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/e704e9a4332e3eac7b458e4bfdec8fcc6984cdb6.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Daniel Xu says: ==================== Add bpf_xdp_get_xfrm_state() kfunc This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and bpf_xdp_xfrm_state_release() that wrap xfrm_state_lookup() and xfrm_state_put(). The intent is to support software RSS (via XDP) for the ongoing/upcoming ipsec pcpu work [0]. Recent experiments performed on (hopefully) reproducible AWS testbeds indicate that single tunnel pcpu ipsec can reach line rate on 100G ENA nics. Note this patchset only tests/shows generic xfrm_state access. The "secret sauce" (if you can really even call it that) involves accessing a soon-to-be-upstreamed pcpu_num field in xfrm_state. Early example is available here [1]. [0]: https://datatracker.ietf.org/doc/draft-ietf-ipsecme-multi-sa-performance/03/ [1]: https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406 Changes from v5: * Improve kfunc doc comments * Remove extraneous replay-window setting on selftest reverse path * Squash two kfunc commits into one * Rebase to bpf-next to pick up bitfield write patches * Remove testing of opts.error in selftest prog Changes from v4: * Fixup commit message for selftest * Set opts->error -ENOENT for !x * Revert single file xfrm + bpf Changes from v3: * Place all xfrm bpf integrations in xfrm_bpf.c * Avoid using nval as a temporary * Rebase to bpf-next * Remove extraneous __failure_unpriv annotation for verifier tests Changes from v2: * Fix/simplify BPF_CORE_WRITE_BITFIELD() algorithm * Added verifier tests for bitfield writes * Fix state leakage across test_tunnel subtests Changes from v1: * Move xfrm tunnel tests to test_progs * Fix writing to opts->error when opts is invalid * Use __bpf_kfunc_start_defs() * Remove unused vxlanhdr definition * Add and use BPF_CORE_WRITE_BITFIELD() macro * Make series bisect clean Changes from RFCv2: * Rebased to ipsec-next * Fix netns leak Changes from RFCv1: * Add Antony's commit tags * Add KF_ACQUIRE and KF_RELEASE semantics ==================== Reviewed-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/cover.1702593901.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c 3a0b5a2 ("iavf: Introduce new state machines for flow director") 9526081 ("iavf: use iavf_schedule_aq_request() helper") https://lore.kernel.org/all/84e12519-04dc-bd80-bc34-8cf50d7898ce@intel.com/ drivers/net/ethernet/broadcom/bnxt/bnxt.c c13e268 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic") c2f8063 ("bnxt_en: Refactor RX VLAN acceleration logic.") a7445d6 ("bnxt_en: Add support for new RX and TPA_START completion types for P7") 1c7fd6e ("bnxt_en: Rename some macros for the P5 chips") https://lore.kernel.org/all/20231211110022.27926ad9@canb.auug.org.au/ drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c bd6781c ("bnxt_en: Fix wrong return value check in bnxt_close_nic()") 84793a4 ("bnxt_en: Skip nic close/open when configuring tstamp filters") https://lore.kernel.org/all/20231214113041.3a0c003c@canb.auug.org.au/ drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c 3d7a3f2 ("net/mlx5: Nack sync reset request when HotPlug is enabled") cecf44e ("net/mlx5: Allow sync reset flow when BF MGT interface device is present") https://lore.kernel.org/all/20231211110328.76c925af@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Besides already supported special "any" value and hex bit mask, support string-based parsing of delegation masks based on exact enumerator names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`, `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported symbolic names (ignoring __MAX_xxx guard values and stripping repetitive prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is normalized to lower case in mount option output. So "PROG_LOAD", "prog_load", and "MAP_create" are all valid values to specify for delegate_cmds options, "array" is among supported for map types, etc. Besides supporting string values, we also support multiple values specified at the same time, using colon (':') separator. There are corresponding changes on bpf_show_options side to use known values to print them in human-readable format, falling back to hex mask printing, if there are any unrecognized bits. This shouldn't be necessary when enum BTF information is present, but in general we should always be able to fall back to this even if kernel was built without BTF. As mentioned, emitted symbolic names are normalized to be all lower case. Example below shows various ways to specify delegate_cmds options through mount command and how mount options are printed back: 12/14 14:39:07.604 vmuser@archvm:~/local/linux/tools/testing/selftests/bpf $ mount | rg token $ sudo mkdir -p /sys/fs/bpf/token $ sudo mount -t bpf bpffs /sys/fs/bpf/token \ -o delegate_cmds=prog_load:MAP_CREATE \ -o delegate_progs=kprobe \ -o delegate_attachs=xdp $ mount | grep token bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp) Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Use both hex-based and string-based way to specify delegate mount options for BPF FS. Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231214225016.1209867-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Andrii Nakryiko says: ==================== BPF FS mount options parsing follow ups Original BPF token patch set ([0]) added delegate_xxx mount options which supported only special "any" value and hexadecimal bitmask. This patch set attempts to make specifying and inspecting these mount options more human-friendly by supporting string constants matching corresponding bpf_cmd, bpf_map_type, bpf_prog_type, and bpf_attach_type enumerators. This implementation relies on BTF information to find all supported symbolic names. If kernel wasn't built with BTF, BPF FS will still support "any" and hex-based mask. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=805707&state=* v1->v2: - strip BPF_, BPF_MAP_TYPE_, and BPF_PROG_TYPE_ prefixes, do case-insensitive comparison, normalize to lower case (Alexei). ==================== Link: https://lore.kernel.org/r/20231214225016.1209867-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>

The code gen generates a prototype for dump request free in the header, but no implementation in the source. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Commit 30c9020 ("tools: ynl-gen: use enum name from the spec") added pre-cooked user type for enums. Use it to fix ignoring enum-name provided in the spec. This changes a type in struct ethtool_tunnel_udp_entry but is generally inconsequential for current families. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Support genetlink families using simple fixed headers. Assume fixed header is identical for all ops of the family for now. Fixed headers are added to the request and reply structs as a _hdr member, and copied to/from netlink messages appropriately. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Fill in more empty handlers for TypeUnused. When 'unused' attr gets specified in a nested set we have to cleanly skip it during code generation. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Track which nests are recursive. Non-recursive nesting gets rendered in C as directly nested structs. For recursive ones we need to put a pointer in, rather than full struct. Track this information, no change to generated code, yet. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

We try to keep the structures and helpers "topologically sorted", to avoid forward declarations. When recursive nests are at play we need to sort twice, because structs which end up being marked as recursive will get a full set of forward declarations, so we should ignore them for the purpose of sorting. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

To avoid infinite nesting store recursive structs by pointer. If recursive struct is placed in the op directly - the first instance can be stored by value. That makes the code much less of a pain for majority of practical uses. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

We avoid printing forward declarations and prototypes for most types by sorting things topologically. But if structs nest we do need the forward declarations, there's no other way. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20231213231432.2944749-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…ilies' Jakub Kicinski says: ==================== tools: ynl-gen: fill in the gaps in support of legacy families Fill in the gaps in YNL C code gen so that we can generate user space code for all genetlink families for which we have specs. The two major changes we need are support for fixed headers and support for recursive nests. For fixed header support - place the struct for the fixed header directly in the request struct (and don't bother generating access helpers). The member of a fixed header can't be too complex, and also are by definition not optional so the user has to fill them in. The YNL core needs a bit of a tweak to understand that the attrs may now start at a fixed offset, which is not necessarily equal to sizeof(struct genlmsghdr). Dealing with nested sets is much harder. Previously we'd gen the nested structs as: struct outer { struct inner inner; }; If structs are recursive (e.g. inner contains outer again) we must break this chain and allocate one of the structs dynamically (store a pointer rather than full struct). ==================== Link: https://lore.kernel.org/r/20231213231432.2944749-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

With a clock interval of 400 nsec and a 64 bit transactions (32 bit preamble & 16 bit control & 16 bit data), it is reasonable to assume the mdio transaction will take 25.6 usec. Add a 30 usec delay before the first poll to reduce the chance of a 1000-2000 usec sleep. Reduce the timeout from 1000ms to 100ms as it is unlikely for the bus to take this long. Signed-off-by: Justin Chen <justin.chen@broadcom.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231213222744.2891184-2-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Simplify the code by using read_poll_timeout(). Signed-off-by: Justin Chen <justin.chen@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20231213222744.2891184-3-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Justin Chen says: ==================== net: mdio: mdio-bcm-unimac: optimizations and clean up Clean up mdio poll to use read_poll_timeout() and reduce the potential poll time. ==================== Link: https://lore.kernel.org/r/20231213222744.2891184-1-justin.chen@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Correct spelling as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231213043511.10357-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Correct spelling (s/and/any) and a run-on sentence. Spell out "multi". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20231213043650.12672-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Add a global variable NS_LIST to store all the namespaces that setup_ns created, so the caller could call cleanup_all_ns() instead of remember all the netns names when using cleanup_ns(). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…e namespace As the name \${rt-${rt}} may make reader confuse, convert the variable hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion. ]# ./srv6_end_dt46_l3vpn_test.sh ################################################################################ TEST SECTION: IPv6 routers connectivity test ################################################################################ TEST: Routers connectivity: rt-1 -> rt-2 [ OK ] TEST: Routers connectivity: rt-2 -> rt-1 [ OK ] ... TEST: IPv4 Hosts isolation: hs-t200-4 -X-> hs-t100-2 [ OK ] Tests passed: 34 Tests failed: 0 Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-3-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

… namespace As the name \${rt-${rt}} may make reader confuse, convert the variable hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion. ]# ./srv6_end_dt4_l3vpn_test.sh ################################################################################ TEST SECTION: IPv6 routers connectivity test ################################################################################ TEST: Routers connectivity: rt-1 -> rt-2 [ OK ] TEST: Routers connectivity: rt-2 -> rt-1 [ OK ] ... TEST: Hosts isolation: hs-t200-4 -X-> hs-t100-2 [ OK ] Tests passed: 18 Tests failed: 0 Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-4-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

… namespace As the name \${rt-${rt}} may make reader confuse, convert the variable hs/rt in setup_rt/hs to hid, rid. Here is the test result after conversion. ]# ./srv6_end_dt6_l3vpn_test.sh ################################################################################ TEST SECTION: IPv6 routers connectivity test ################################################################################ TEST: Routers connectivity: rt-1 -> rt-2 [ OK ] TEST: Routers connectivity: rt-2 -> rt-1 [ OK ] ... TEST: Hosts isolation: hs-t200-4 -X-> hs-t100-2 [ OK ] Tests passed: 18 Tests failed: 0 Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-5-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Here is the test result after conversion. There are some failures, but it also exists on my system without this patch. So it's not affectec by this patch and I will check the reason later. ]# time ./fcnal-test.sh /usr/bin/which: no nettest in (/root/.local/bin:/root/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) ########################################################################### IPv4 ping ########################################################################### ################################################################# No VRF SYSCTL: net.ipv4.raw_l3mdev_accept=0 TEST: ping out - ns-B IP [ OK ] TEST: ping out, device bind - ns-B IP [ OK ] TEST: ping out, address bind - ns-B IP [ OK ] ... ################################################################# SNAT on VRF TEST: IPv4 TCP connection over VRF with SNAT [ OK ] TEST: IPv6 TCP connection over VRF with SNAT [ OK ] Tests passed: 893 Tests failed: 21 real 52m48.178s user 0m34.158s sys 1m42.976s BTW, this test needs a really long time. So expand the timeout to 1h. Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-6-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

When running fib_nexthop_multiprefix test I saw all IPv6 test failed. e.g. ]# ./fib_nexthop_multiprefix.sh TEST: IPv4: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv6: host 0 to host 1, mtu 1300 [FAIL] With -v it shows COMMAND: ip netns exec h0 /usr/sbin/ping6 -s 1350 -c5 -w5 2001:db8:101::1 PING 2001:db8:101::1(2001:db8:101::1) 1350 data bytes From 2001:db8:100::64 icmp_seq=1 Packet too big: mtu=1300 --- 2001:db8:101::1 ping statistics --- 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms Route get 2001:db8:101::1 via 2001:db8:100::64 dev eth0 src 2001:db8:100::1 metric 1024 expires 599sec mtu 1300 pref medium Searching for: 2001:db8:101::1 from :: via 2001:db8:100::64 dev eth0 src 2001:db8:100::1 .* mtu 1300 The reason is when CONFIG_IPV6_SUBTREES is not enabled, rt6_fill_node() will not put RTA_SRC info. After fix: ]# ./fib_nexthop_multiprefix.sh TEST: IPv4: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv6: host 0 to host 1, mtu 1300 [ OK ] Fixes: 735ab2f ("selftests: Add test with multiple prefixes using single nexthop") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-7-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…mespace Here is the test result after conversion. ]# ./fib_nexthop_multiprefix.sh TEST: IPv4: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv6: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv4: host 0 to host 2, mtu 1350 [ OK ] TEST: IPv6: host 0 to host 2, mtu 1350 [ OK ] TEST: IPv4: host 0 to host 3, mtu 1400 [ OK ] TEST: IPv6: host 0 to host 3, mtu 1400 [ OK ] TEST: IPv4: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv6: host 0 to host 1, mtu 1300 [ OK ] TEST: IPv4: host 0 to host 2, mtu 1350 [ OK ] TEST: IPv6: host 0 to host 2, mtu 1350 [ OK ] TEST: IPv4: host 0 to host 3, mtu 1400 [ OK ] TEST: IPv6: host 0 to host 3, mtu 1400 [ OK ] Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-8-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

…pace Here is the test result after conversion. ]# ./fib_nexthop_nongw.sh TEST: nexthop: get route with nexthop without gw [ OK ] TEST: nexthop: ping through nexthop without gw [ OK ] Acked-by: David Ahern <dsahern@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231213060856.4030084-9-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Previously, the ice driver had support for using a hanldler for bonding netdev events to ensure that conflicting features were not allowed to be activated at the same time. While this was still in place, additional support was added to specifically support SRIOV and LAG together. These both utilized the netdev event handler, but the SRIOV and LAG feature was behind a capabilities feature check to make sure the current NVM has support. The exclusion part of the event handler should be removed since there are users who have custom made solutions that depend on the non-exclusion of features. Wrap the creation/registration and cleanup of the event handler and associated structs in the probe flow with a feature check so that the only systems that support the full implementation of LAG features will initialize support. This will leave other systems unhindered with functionality as it existed before any LAG code was added. Fixes: bb52f42 ("ice: Add driver support for firmware changes for LAG") Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com>

According to the Intel Software Manual for I225, Section 7.5.2.7, hicredit should be multiplied by the constant link-rate value, 0x7736. Currently, the old constant link-rate value, 0x7735, from the boards supported on igb are being used, most likely due to a copy'n'paste, as the rest of the logic is the same for both drivers. Update hicredit accordingly. Fixes: 1ab011b ("igc: Add support for CBS offloading") Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Rodrigo Cataldo <rodrigo.cadore@l-acoustics.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

idpf_ring::skb serves only for keeping an incomplete frame between several NAPI Rx polling cycles, as one cycle may end up before processing the end of packet descriptor. The pointer is taken from the ring onto the stack before entering the loop and gets written there after the loop exits. When inside the loop, only the onstack pointer is used. For some reason, the logics is broken in the singleq mode, where the pointer is taken from the ring each iteration. This means that if a frame got fragmented into several descriptors, each fragment will have its own skb, but only the last one will be passed up the stack (containing garbage), leaving the rest leaked. Then, on ifdown, rxq::skb is being freed only in the splitq mode, while it can point to a valid skb in singleq as well. This can lead to a yet another skb leak. Just don't touch the ring skb field inside the polling loop, letting the onstack skb pointer work as expected: build a new skb if it's the first frame descriptor and attach a frag otherwise. On ifdown, free rxq::skb unconditionally if the pointer is non-NULL. Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>

The size of "i40e_dbg_command_buf" is 256, the size of "name" depends on "IFNAMSIZ", plus a null character and format size, the total size is more than 256. Improve readability and maintainability by replacing a hardcoded string allocation and formatting by the use of the kasprintf() helper. Fixes: 02e9c29 ("i40e: debugfs interface") Suggested-by: Simon Horman <horms@kernel.org> Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com> Suggested-by: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Kunwu Chan <kunwu.chan@hotmail.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Simon Horman <horms@kernel.org>

The link state of VF devices can be controlled via "ip link set", but the current state (auto/disabled) is not reported by "ip link show". Update ixgbe_ndo_get_vf_config() to make this info available to userspace. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>

Add support for driver-specific devlink loopback param. Supported values are "enabled", "disabled" and "prioritized". Default configuration is set to "enabled". Add documentation in networking/devlink/ice.rst. In previous generations of Intel NICs the trasmit scheduler was only limited by PCIe bandwidth when scheduling/assigning hairpin-badwidth between VFs. Changes to E810 HW design introduced scheduler limitation, so that available hairpin-bandwidth is bound to external port speed. In order to address this limitation and enable NFV services such as "service chaining" a knob to adjust the scheduler config was created. Driver can send a configuration message to the FW over admin queue and internal FW logic will reconfigure HW to prioritize and add more BW to VF to VF traffic. As end result for example 10G port will no longer limit hairpin-badwith to 10G and much higher speeds can be achieved. Devlink loopback param set to "prioritized" enables higher hairpin-badwitdh on related PFs. Configuration is applicable only to 8x10G and 4x25G cards. Changing loopback configuration will trigger CORER reset in order to take effect. Example command to change current value: devlink dev param set pci/0000:b2:00.3 name loopback value prioritized \ cmode runtime Co-developed-by: Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Pawel Kaminski <pawel.kaminski@intel.com>

Commit 6624e78 ("ice: split ice_vsi_setup into smaller functions") has refactored a bunch of code involved in PFR. In this process, TC queue number adjustment for XDP was lost. Bring it back. Lack of such adjustment causes interface to go into no-carrier after a reset, if XDP program is attached, with the following message: ice 0000:b1:00.0: Failed to set LAN Tx queue context, error: -22 ice 0000:b1:00.0 ens801f0np0: Failed to open VSI 0x0006 on switch 0x0001 ice 0000:b1:00.0: enable VSI failed, err -22, VSI index 0, type ICE_VSI_PF ice 0000:b1:00.0: PF VSI rebuild failed: -22 ice 0000:b1:00.0: Rebuild failed, unload and reload driver Fixes: 6624e78 ("ice: split ice_vsi_setup into smaller functions") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>

devm_kasprintf() returns a pointer to dynamically allocated memory which can be NULL upon failure. Fixes: d938a8c ("ice: Auxbus devices & driver for E822 TS") Cc: Kunwu Chan <kunwu.chan@hotmail.com> Suggested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org>

Switchdev mode allows to add mirroring rules to mirror incoming and outgoing packets to the interface's port representor. Previously, this was available only using software functionality. Add possibility to offload this functionality to the NIC hardware. Introduce ICE_MIRROR_PACKET filter action to the ice_sw_fwd_act_type enum to identify the desired action and pass it to the hardware as well as the VSI to mirror. Example of tc mirror command using hardware: tc filter add dev ens1f0np0 ingress protocol ip prio 1 flower src_mac b4:96:91:a5:c7:a7 skip_sw action mirred egress mirror dev eth1 ens1f0np0 - PF b4:96:91:a5:c7:a7 - source MAC address eth1 - PR of a VF to mirror to Co-developed-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Andrii Staikov <andrii.staikov@intel.com>

The e1000e driver supports hardware with a variety of different clock speeds, and thus a variety of different increment values used for programming its PTP hardware clock. The values currently programmed in e1000e_ptp_init are incorrect. In particular, only two maximum adjustments are used: 24000000 - 1, and 600000000 - 1. These were originally intended to be used with the 96 MHz clock and the 25 MHz clock. Both of these values are actually slightly too high. For the 96 MHz clock, the actual maximum value that can safely be programmed is 23,999,938. For the 25 MHz clock, the maximum value is 599,999,904. Worse, several devices use a 24 MHz clock or a 38.4 MHz clock. These parts are incorrectly assigned one of either the 24million or 600million values. For the 24 MHz clock, this is not a significant issue: its current increment value can support an adjustment up to 7billion in the positive direction. However, the 38.4 KHz clock uses an increment value which can only support up to 230,769,157 before it starts overflowing. To understand where these values come from, consider that frequency adjustments have the form of: new_incval = base_incval + (base_incval * adjustment) / (unit of adjustment) The maximum adjustment is reported in terms of parts per billion: new_incval = base_incval + (base_incval * adjustment) / 1 billion The largest possible adjustment is thus given by the following: max_incval = base_incval + (base_incval * max_adj) / 1 billion Re-arranging to solve for max_adj: max_adj = (max_incval - base_incval) * 1 billion / base_incval We also need to ensure that negative adjustments cannot underflow. This can be achieved simply by ensuring max_adj is always less than 1 billion. Introduce new macros in e1000.h codifying the maximum adjustment in PPB for each frequency given its associated increment values. Also clarify where these values come from by commenting about the above equations. Replace the switch statement in e1000e_ptp_init with one which mirrors the increment value switch statement from e1000e_get_base_timinica. For each device, assign the appropriate maximum adjustment based on its frequency. Some parts can have one of two frequency modes as determined by E1000_TSYNCRXCTL_SYSCFI. Since the new flow directly matches the assignments in e1000e_get_base_timinca, and uses well defined macro names, it is much easier to verify that the resulting maximum adjustments are correct. It also avoids difficult to parse construction such as the "hw->mac.type < e1000_phc_lpt", and the use of fallthrough which was especially confusing when combined with a conditional block. Note that I believe the current increment value configuration used for 24MHz clocks is sub-par, as it leaves at least 3 extra bits available in the INCVALUE register. However, fixing that requires more careful review of the clock rate and associated values. Reported-by: Trey Harrison <harrisondigitalmedia@gmail.com> Fixes: 68fe1d5 ("e1000e: Add Support for 38.4MHZ frequency") Fixes: d89777b ("e1000e: add support for IEEE-1588 PTP") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>

The driver should not report an error message when for a medialess port the link_down_on_close flag is enabled and the physical link cannot be set down. Fixes: 8ac7132 ("ice: Fix interface being down after reset with link-down-on-close flag on") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Katarzyna Wieczerzycka <katarzyna.wieczerzycka@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>

Disabling netdev with ethtool private flag "link-down-on-close" enabled can cause NULL pointer dereference bug. Shut down VSI regardless of "link-down-on-close" state. Fixes: 8ac7132 ("ice: Fix interface being down after reset with link-down-on-close flag on") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>

Tell hardware to write back completed descriptors even when interrupts are disabled. Otherwise, descriptors might not be written back until the hardware can flush a full cacheline of descriptors. This can cause unnecessary delays when traffic is light (or even trigger Tx queue timeout). The example scenario to reproduce the Tx timeout if the fix is not applied: - configure at least 2 Tx queues to be assigned to the same q_vector, - generate a huge Tx traffic on the first Tx queue - try to send a few packets using the second Tx queue. In such a case Tx timeout will appear on the second Tx queue because no completion descriptors are written back for that queue while interrupts are disabled due to NAPI polling. The patch is necessary to start work on the AF_XDP implementation for the idpf driver, because there may be a case where a regular LAN Tx queue and an XDP queue share the same NAPI. Fixes: c2d548c ("idpf: add TX splitq napi poll support") Fixes: a5ab9ee ("idpf: add singleq start_xmit and napi poll") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Co-developed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Brett Creeley <brett.creeley@amd.com>

Size of the virtchnl2_rss_key struct should be 7 bytes but the compiler introduces a padding byte for the structure alignment. This results in idpf sending an additional byte of memory to the device control plane than the expected buffer size. As the control plane enforces virtchnl message size checks to validate the message, set RSS key message fails resulting in the driver load failure. Remove implicit compiler padding by using "__packed" structure attribute for the virtchnl2_rss_key struct. Also there is no need to use __DECLARE_FLEX_ARRAY macro for the 'key_flex' struct field. So drop it. Fixes: 0d7502a ("virtchnl: add virtchnl version 2 ops") Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Reviewed-by: Simon Horman <horms@kernel.org>

Commit 3116f59 ("i40e: fix use-after-free in i40e_sync_filters_subtask()") avoided use-after-free issues, by increasing refcount during update the VSI filter list to the HW. However, it missed the unicast situation. When deleting an unicast FDB entry, the i40e driver will release the mac_filter, and i40e_service_task will concurrently request firmware to add the mac_filter, which will lead to the following use-after-free issue. Fix again for both netdev->uc and netdev->mc. BUG: KASAN: use-after-free in i40e_aqc_add_filters+0x55c/0x5b0 [i40e] Read of size 2 at addr ffff888eb3452d60 by task kworker/8:7/6379 CPU: 8 PID: 6379 Comm: kworker/8:7 Kdump: loaded Tainted: G Workqueue: i40e i40e_service_task [i40e] Call Trace: dump_stack+0x71/0xab print_address_description+0x6b/0x290 kasan_report+0x14a/0x2b0 i40e_aqc_add_filters+0x55c/0x5b0 [i40e] i40e_sync_vsi_filters+0x1676/0x39c0 [i40e] i40e_service_task+0x1397/0x2bb0 [i40e] process_one_work+0x56a/0x11f0 worker_thread+0x8f/0xf40 kthread+0x2a0/0x390 ret_from_fork+0x1f/0x40 Allocated by task 21948: kasan_kmalloc+0xa6/0xd0 kmem_cache_alloc_trace+0xdb/0x1c0 i40e_add_filter+0x11e/0x520 [i40e] i40e_addr_sync+0x37/0x60 [i40e] __hw_addr_sync_dev+0x1f5/0x2f0 i40e_set_rx_mode+0x61/0x1e0 [i40e] dev_uc_add_excl+0x137/0x190 i40e_ndo_fdb_add+0x161/0x260 [i40e] rtnl_fdb_add+0x567/0x950 rtnetlink_rcv_msg+0x5db/0x880 netlink_rcv_skb+0x254/0x380 netlink_unicast+0x454/0x610 netlink_sendmsg+0x747/0xb00 sock_sendmsg+0xe2/0x120 __sys_sendto+0x1ae/0x290 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0xa0/0x370 entry_SYSCALL_64_after_hwframe+0x65/0xca Freed by task 21948: __kasan_slab_free+0x137/0x190 kfree+0x8b/0x1b0 __i40e_del_filter+0x116/0x1e0 [i40e] i40e_del_mac_filter+0x16c/0x300 [i40e] i40e_addr_unsync+0x134/0x1b0 [i40e] __hw_addr_sync_dev+0xff/0x2f0 i40e_set_rx_mode+0x61/0x1e0 [i40e] dev_uc_del+0x77/0x90 rtnl_fdb_del+0x6a5/0x860 rtnetlink_rcv_msg+0x5db/0x880 netlink_rcv_skb+0x254/0x380 netlink_unicast+0x454/0x610 netlink_sendmsg+0x747/0xb00 sock_sendmsg+0xe2/0x120 __sys_sendto+0x1ae/0x290 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0xa0/0x370 entry_SYSCALL_64_after_hwframe+0x65/0xca Fixes: 3116f59 ("i40e: fix use-after-free in i40e_sync_filters_subtask()") Fixes: 41c445f ("i40e: main driver core") Signed-off-by: Ke Xiao <xiaoke@sangfor.com.cn> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: Di Zhu <zhudi2@huawei.com> Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Stop dividing the phase_offset value received from firmware. This fault is present since the initial implementation. The phase_offset value received from firmware is in 0.01ps resolution. Dpll subsystem is using the value in 0.001ps, raw value is adjusted before providing it to the user. The user can observe the value of phase offset with response to `pin-get` netlink message of dpll subsystem for an active pin: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml \ --do pin-get --json '{"id":2}' Where example of correct response would be: {'board-label': 'C827_0-RCLKA', 'capabilities': 6, 'clock-id': 4658613174691613800, 'frequency': 1953125, 'id': 2, 'module-name': 'ice', 'parent-device': [{'direction': 'input', 'parent-id': 6, 'phase-offset': -216839550, 'prio': 9, 'state': 'connected'}, {'direction': 'input', 'parent-id': 7, 'phase-offset': -42930, 'prio': 8, 'state': 'connected'}], 'phase-adjust': 0, 'phase-adjust-max': 16723, 'phase-adjust-min': -16723, 'type': 'mux'} Provided phase-offset value (-42930) shall be divided by the user with DPLL_PHASE_OFFSET_DIVIDER to get actual value of -42.930 ps. Before the fix, the response was not correct: {'board-label': 'C827_0-RCLKA', 'capabilities': 6, 'clock-id': 4658613174691613800, 'frequency': 1953125, 'id': 2, 'module-name': 'ice', 'parent-device': [{'direction': 'input', 'parent-id': 6, 'phase-offset': -216839, 'prio': 9, 'state': 'connected'}, {'direction': 'input', 'parent-id': 7, 'phase-offset': -42, 'prio': 8, 'state': 'connected'}], 'phase-adjust': 0, 'phase-adjust-max': 16723, 'phase-adjust-min': -16723, 'type': 'mux'} Where phase-offset value (-42), after division (DPLL_PHASE_OFFSET_DIVIDER) would be: -0.042 ps. Fixes: 8a3a565 ("ice: add admin commands to access cgu configuration") Fixes: 90e1c90 ("ice: dpll: implement phase related callbacks") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>

Although it does not seem to have any untoward side-effects, the use of ';' to separate to assignments seems more appropriate than ','. Flagged by clang-17 -Wcomma No functional change intended. Compile tested only. Signed-off-by: Simon Horman <horms@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>

Currently ixgbe driver is notified of overheating events via internal IXGBE_ERR_OVERTEMP error code. Change the approach for handle_lasi() to use freshly introduced is_overtemp function parameter which set when such event occurs. Change check_overtemp() to bool and return true if overtemp event occurs. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org>

Change returning codes to the kernel ones instead of the internal ones for the entire ixgbe driver. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>

Use generic devlink PF MSI-X parameter to allow user to change MSI-X range. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

With dynamic approach to alloc MSI-X there is no sense to statically split MSI-X between PF features. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

Remove the field to allow having more queues than MSI-X on VSI. As default the number will be the same, but if there won't be more MSI-X available VSI can run with at least one MSI-X. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

Move responsibility of MSI-X requesting for RDMA feature from ice driver to irdma driver. It is done to allow simple fallback when there is not enough MSI-X available. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

It can be needed to have some MSI-X allocated as static and rest as dynamic. For example on PF VSI. We want to always have minimum one MSI-X on it, because of that it is allocated as a static one, rest can be dynamic if it is supported. Change the ice_get_irq_res() to allow using static entries if they are free even if caller wants dynamic one. Fix limit values. Min and max in limit means the values that are valid, so decrease max and num_static by one. Set vsi::irq_dyn_alloc if dynamic allocation is supported. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

Implement enable_rdma devlink parameter to allow user to turn RDMA feature on and off. It is useful when there is no enough interrupts and user doesn't need RDMA feature. Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

After implementing pf->msix.max field, base vector for other use cases (like VFs) can be fixed. This simplify code when changing MSI-X amount on particular VF, because there is no need to move a base vector. Fixed base vector allow to reserve vectors from the beginning instead of from the end, which is also simpler in code. Store total and rest value in the same struct as max and min for PF. Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also use for other none PF use cases (SIOV). Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

Petr Machata says: ==================== mlxsw: Refactor reference counting code Amit Cohen writes: This set converts all reference counters defined as 'unsigned int' to refcount_t type. The reference counting of LAGs can be simplified, so first refactor the related code and then change the type of the reference counter. Patch set overview: Patches #1-#4 are preparations for LAG refactor Patch #5 refactors LAG code and change the type of reference counter Patch #6 converts the remaining reference counters in mlxsw driver ==================== Link: https://lore.kernel.org/r/cover.1706293430.git.petrm@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>

danobi and others added 30 commits December 14, 2023 17:12

dmertman and others added 26 commits December 19, 2023 11:25

ice: devlink PF MSI-X max and min parameter

dc3c377

Use generic devlink PF MSI-X parameter to allow user to change MSI-X range. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

ice: remove splitting MSI-X between features

98e94f1

With dynamic approach to alloc MSI-X there is no sense to statically split MSI-X between PF features. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

ice, irdma: move interrupts code to irdma

fbde820

Move responsibility of MSI-X requesting for RDMA feature from ice driver to irdma driver. It is done to allow simple fallback when there is not enough MSI-X available. Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>

mswiatko force-pushed the dev-queue branch from 087c173 to ac9a786 Compare January 25, 2024 11:41

mswiatko closed this Jan 25, 2024

mswiatko deleted the ice-msix-e1000-v3 branch January 25, 2024 11:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ice: MSI-X managing #4

ice: MSI-X managing #4

Uh oh!

mswiatko commented Jan 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

ice: MSI-X managing #4

ice: MSI-X managing #4

Uh oh!

Conversation

mswiatko commented Jan 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants