TAM-Docker: integrates visualized INT/MOD into a unified platform, providing a dedicated container for all related processing tasks #2141

@yehaili-Kwai

TAM-Docker integrates visualized INT/MOD into a unified platform, providing a dedicated container for all related processing tasks. It supports user configuration delivery, functional logic processing, packet reporting and data parsing, as well as data storage and visualization.

1. Solution Design

Image

The diagram above shows the overall architecture of the TAM system. In general, a self-developed NOS abstracts away the differences between switching chips and provides unified control logic for the upper-layer management platform. Vendors adapt their implementations to Kuaishou's requirements so that all platforms behave consistently.
Currently, the SONiC community does not support the MOD feature, and the SAI functionality needs further development and improvement. SAI must support both MOD data collection modes, Stateless and Stateful. Kuaishou's self-developed KNOS supports two methods for network platform data collection: gRPC-based software collection and UDP-based hardware collection.
1. Command-line enhancement
   - Configure and display TAM (MOD) parameters, including flow table configuration, collection mode setup, sampling rate settings, and collection configuration.
   - Command-line configuration currently interacts only with CONFIG_DB.
2. SWSS container
   - tamOrch: Subscribes to APPL_DB and invokes SAI APIs to configure the chip based on the configuration.
   - MOD statistics collection: Aggregated telemetry collection.
3. SYNCD container: SAI development supports two MOD modes
   - Stateless MOD implemented based on flow tables
   - Stateful MOD implemented based on flow tables
4. ASIC driver
   - Associates MOD data with a specific genetlink channel and multicast group.
   - Places captured samples into that genetlink group.
5. New TAM Docker container
   - tammgrd: Updates the TAM table in APPL_DB based on configuration, and allows selection of the data collection and analysis mode (Stateless/Stateful MOD) via command-line configuration.
   - TAM agent: Performs packet loss data collection and analysis. It parses switch packet loss metadata (such as 5-tuple, packet loss reason, loss queue, and port information) and stores it in STATE_DB. In Stateless MOD agent mode, it implements packet loss statistics.

2. Command-line Design
KNOS delivers the following MOD-related control configuration:
- Switch identifier: Uniquely identifies the switch in the network and is used when delivering SAI attributes.
- Sampling configuration: Sets the packet loss sampling rate.
- Collector configuration: Sets the address (IP and port) of the device that receives dropped packets.
- Aging interval configuration: If no new packet loss is detected within this time interval, the system considers the packet loss behavior to have stopped and performs data aging.
- MOD enable switch: By default, MOD is disabled on the switch and needs to be enabled manually.
  - If MOD is not supported, enabling the MOD switch reports an appropriate error.
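
As a rough illustration of the items above, the sketch below models the MOD control parameters as a plain struct with a validation step. The struct name `ModConfig`, its field names, and the specific checks are assumptions made for illustration; they are not taken from KNOS.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of the MOD control parameters listed above.
struct ModConfig {
    std::string switch_id;        // e.g. the IPv4 loopback address
    uint32_t sample_rate = 0;     // packet loss sampling rate
    std::string collector_ip;     // collector address
    uint16_t collector_port = 0;  // collector port
    uint32_t aging_interval = 0;  // seconds without new drops before aging
    bool enabled = false;         // MOD is disabled by default
};

// Returns a list of human-readable problems. An empty list means the
// configuration is complete enough to enable MOD.
std::vector<std::string> validate(const ModConfig& cfg, bool mod_supported) {
    std::vector<std::string> errors;
    if (cfg.enabled && !mod_supported)
        errors.push_back("MOD is not supported on this platform");
    if (cfg.switch_id.empty())
        errors.push_back("switch-id is not configured");
    if (cfg.sample_rate == 0)
        errors.push_back("sampling rate must be non-zero");
    if (cfg.collector_ip.empty() || cfg.collector_port == 0)
        errors.push_back("collector IP and port are required");
    if (cfg.aging_interval == 0)
        errors.push_back("aging interval must be non-zero");
    return errors;
}
```

Returning every problem at once, rather than stopping at the first, lets a CLI report all missing parameters in a single pass.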

2.1. Configure TAM
2.1.1. Configure the switch's unique TAM identifier
Command: switch-id
Configuration note: This command must be executed within the TAM configuration view.
Feature description: This command configures the switch identifier, which uniquely identifies the switch and can be used in telemetry reports. The IPv4 loopback IP is used as the switch-id.

Example:

SONIC# show tam switch

TAM Device information
Switch ID : 10.6.0.24
Enterprise ID :
Device ID :

2.1.2. Clear the switch's unique TAM identifier
Command: no switch-id
Configuration note: This command must be executed within the TAM configuration view.


Example:
SONIC# configure terminal
SONIC(config)# tam
SONIC(config-tam)# no switch-id
SONIC# show tam switch

2.1.3. Configure Collector
Command: collector
Configuration note: This command must be executed within the TAM configuration view.
Function description: This command configures collector information. The collector configuration is a required attribute for SAI delivery.


Example:
SONIC(config-tam)# collector c2 ip 1.1.1.1 port 7676 protocol UDP
SONIC(config-tam)# show tam collectors
Name  IP Address  Port  Protocol
c2    1.1.1.1     7676  UDP
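
To make the shape of the collector command concrete, here is a minimal sketch that tokenizes a line of the form shown in the example. The `Collector` struct and `parse_collector` function are hypothetical names; the real CLI is expected to be implemented by the NOS management framework rather than by code like this.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical parsed form of one collector entry.
struct Collector {
    std::string name;
    std::string ip;
    uint16_t port = 0;
    std::string protocol;
};

// Parses "collector <name> ip <addr> port <port> protocol <proto>".
// Returns false when a keyword is missing or the port is out of range.
bool parse_collector(const std::string& line, Collector& out) {
    std::istringstream in(line);
    std::string kw_collector, kw_ip, kw_port, kw_proto;
    Collector c;
    long port = 0;
    if (!(in >> kw_collector >> c.name >> kw_ip >> c.ip
             >> kw_port >> port >> kw_proto >> c.protocol))
        return false;
    if (kw_collector != "collector" || kw_ip != "ip" ||
        kw_port != "port" || kw_proto != "protocol")
        return false;
    if (port < 1 || port > 65535)
        return false;
    c.port = static_cast<uint16_t>(port);
    out = c;
    return true;
}
```

The sketch checks only the keyword order and the port range; validating the IP address format is left out for brevity.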
 
2.1.4. Clear collector
Command: no collector
Function description: This command deletes the collector configuration.
 
Example:
SONIC(config-tam)# no collector c2

2.1.5. Configure sampling
Command: sampler
Configuration note: This command must be executed within the TAM configuration view.
Function description: This command configures sampling settings and sets the sampling rate.
 
Example:
SONIC(config-tam)# sampler s1 rate 1
SONIC# show tam samplers
Name  Sample Rate
s1    1

2.1.6. Clear sampling configuration
Command: no sampler
Function description: This command deletes the sampling configuration.
 
Example:
SONIC(config-tam)# no sampler s34
SONIC# show tam samplers
Name  Sample Rate
s1    1
 
2.2. Configure DROP MONITOR feature
2.2.1. Configure aging time
Command: aging-interval
Function Description: This command configures the aging time.
 
Example:
SONIC# configure terminal
SONIC(config)# tam
SONIC(config-tam)# drop-monitor
SONIC(config-tam-dm)# aging-interval 10
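
The aging behavior can be pictured with a small sketch: each flow remembers when its last drop was seen, and a flow with no new drop for the aging interval is removed. `DropTable` and `DropEntry` are hypothetical names, and the use of whole seconds as the time unit is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-flow drop state kept by an agent.
struct DropEntry {
    uint64_t count = 0;      // drops observed for this flow
    uint64_t last_seen = 0;  // time of the most recent drop, in seconds
};

class DropTable {
public:
    explicit DropTable(uint64_t aging_interval)
        : aging_interval_(aging_interval) {}

    // Record one dropped packet for `flow` at time `now` (seconds).
    void record(const std::string& flow, uint64_t now) {
        DropEntry& e = entries_[flow];
        e.count++;
        e.last_seen = now;
    }

    // Remove flows with no new drop for at least the aging interval.
    // Returns the number of flows aged out.
    size_t age(uint64_t now) {
        size_t removed = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (now >= it->second.last_seen &&
                now - it->second.last_seen >= aging_interval_) {
                it = entries_.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    size_t size() const { return entries_.size(); }

private:
    uint64_t aging_interval_;
    std::map<std::string, DropEntry> entries_;
};
```

With an interval of 10, a flow last seen at time 0 is aged out by a call to `age(10)`, while a flow seen at time 5 survives that call.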
 
2.2.2. Clear Aging Configuration
Command: no aging-interval
Function Description: This command resets the aging interval to its default value.
 
Example:
SONIC(config-tam-dm)# no aging-interval

2.2.3. Enable Drop Monitor
Command: enable
Function Description: This command enables the Drop Monitor feature on the switch.
 
Example:
SONIC(config-tam-dm)# enable
SONIC(config-tam-dm)# end
SONIC# show tam drop-monitor
Status : Active
Switch ID : 10.6.0.24
Number of collectors : 1
Aging Interval : 60
 
2.2.4. Disable Drop Monitor
Command: no enable
Function Description: This command disables the Drop Monitor feature on the switch.
 
Example:
SONIC# configure terminal
SONIC(config)# tam
SONIC(config-tam)# drop-monitor
SONIC(config-tam-dm)# no enable
SONIC(config-tam-dm)# end
SONIC# show tam drop-monitor
Status : Inactive
Switch ID : 10.6.0.24
Number of collectors : 1
Aging Interval : 60
 
2.2.5. Configure Database Data Push Interval
Command: poll-interval
Function Description: Sets the interval for pushing collected data to the database. When packet loss statistics change, data is pushed to the database at the configured time interval.
 
Example:
SONIC(config-tam-dm)# poll-interval 10
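
One way to read the description above is "push only when something changed, and no more often than once per poll interval". The sketch below implements that reading. `PollPublisher` is a hypothetical name, and the push callback stands in for the real database write.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

class PollPublisher {
public:
    PollPublisher(uint64_t poll_interval, std::function<void(uint64_t)> push)
        : poll_interval_(poll_interval), push_(std::move(push)) {}

    // Called for every drop sample; marks the statistics as changed.
    void on_drop() {
        ++total_;
        dirty_ = true;
    }

    // Called periodically with the current time in seconds.
    // Pushes the total only if it changed and the interval has elapsed.
    // Returns true when a push happened.
    bool tick(uint64_t now) {
        if (!dirty_ || now < next_push_) return false;
        push_(total_);
        dirty_ = false;
        next_push_ = now + poll_interval_;
        return true;
    }

private:
    uint64_t poll_interval_;
    std::function<void(uint64_t)> push_;
    uint64_t total_ = 0;
    uint64_t next_push_ = 0;
    bool dirty_ = false;
};
```

Separating `on_drop` from `tick` keeps the per-sample path cheap and bounds the database write rate by the configured interval.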

2.3. Verify Configuration
SONIC(config-tam-dm)# end
2.3.1. View TAM Switch Configuration

Command: show tam switch
Function Description: This command displays TAM device information.



Example:

SONIC# show tam switch

TAM Device information
Switch ID : 10.6.0.24
Enterprise ID :
Device ID :
 
2.3.2. Display TAM Collectors
Command: show tam collectors
Function Description: This command lists detailed information about all collectors or a specific collector.
 
Example:
SONIC# show tam collectors
Name  IP Address  Port  Protocol
c1    1.1.1.1     9999  UDP
 
2.3.3. Display TAM Samplers
Command: show tam samplers
Function Description: This command lists detailed information about all samplers or a specific sampler.
 
Example:
SONIC# show tam samplers
Name  Sample Rate
s1    1
 
2.3.4. Display TAM Running Configuration
Command: show running-configuration tam
Function Description: This command displays the current TAM running configuration, including the switch ID, collectors, samplers, and drop-monitor settings.
 
Example:
SONIC# show running-configuration tam
tam
switch-id 10.6.0.24
collector c1 ip 1.1.1.1 port 9999 protocol UDP
sampler s1 rate 1
!
drop-monitor
poll-interval 10
aging-interval 60
 
2.3.5. Display Packet Loss Data on the Switch Side
Command: show tam drop-monitor counters
Function Description: This command displays packet loss data. The agent on the switch side parses the data and pushes it to the database every poll-interval seconds.
 
Example:
SONIC(config-tam-dm)# show tam drop-monitor counters
Image

3. TAM Configuration and Control Workflow
3.1. Configuration and Control Process
Image

1. The user configures TAM collectors, flow tables, collection modes, and sampling-related parameters; these configurations are written to the CONFIG DB.
2. The tammgrd daemon monitors the TAM tables in the CONFIG DB and writes the configuration of TAM collectors, flow tables, collection modes, and sampling-related parameters into TAM_TABLE in the APPL DB.
3. tamOrch detects updates to the TAM table in the APPL DB, processes them, and calls the corresponding SAI TAM APIs to deliver the configuration to the chip.
4. When the SAI TAM configuration is delivered, the SAI driver creates a genetlink family and multicast group for the port. When the chip detects packet loss, it can send packet loss metadata to the TAM agent via netlink for processing.
5. After detecting the user configuration, the TAM container starts the TAM agent. Based on the mode configuration, it selects and starts the corresponding agent mode (stateless/stateful), registers a callback to open a netlink socket for packet reception, and registers a TAM data handler on the netlink socket.
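
Step 2 of the process above can be sketched as a one-way sync between two key/value stores. In this sketch a `std::map` stands in for each Redis database, and the `TAM|` CONFIG DB key prefix and the `TAM_TABLE:` APPL DB key layout are assumptions for illustration; the actual tammgrd schema is not specified here.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-in for a Redis database: key -> (field -> value).
typedef std::map<std::string, std::string> FieldValues;
typedef std::map<std::string, FieldValues> Db;

// Copies every CONFIG DB entry whose key starts with `config_prefix` into
// the APPL DB under "TAM_TABLE:", and removes APPL DB entries whose source
// entry no longer exists. Returns the number of entries written or removed.
size_t sync_tam_config(const Db& config_db, Db& appl_db,
                       const std::string& config_prefix = "TAM|") {
    const std::string appl_prefix = "TAM_TABLE:";
    size_t changes = 0;

    for (Db::const_iterator it = config_db.begin(); it != config_db.end(); ++it) {
        if (it->first.compare(0, config_prefix.size(), config_prefix) != 0)
            continue;  // not a TAM entry
        std::string appl_key = appl_prefix + it->first.substr(config_prefix.size());
        Db::iterator dst = appl_db.find(appl_key);
        if (dst == appl_db.end() || dst->second != it->second) {
            appl_db[appl_key] = it->second;
            ++changes;
        }
    }

    for (Db::iterator it = appl_db.begin(); it != appl_db.end();) {
        bool ours = it->first.compare(0, appl_prefix.size(), appl_prefix) == 0;
        if (ours && config_db.count(config_prefix +
                                    it->first.substr(appl_prefix.size())) == 0) {
            appl_db.erase(it++);  // source entry was deleted
            ++changes;
        } else {
            ++it;
        }
    }
    return changes;
}
```

A second call with unchanged input returns 0, which mirrors the idea that tamOrch only needs to react when the APPL DB actually changes.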

 
3.2. TAM Data Flow
The diagram below shows an example packet path flow for TAM:
Image

1. When the ASIC detects packet loss, it encapsulates the TAM data and uploads it to the CPU, triggering an interrupt to the ASIC driver. The ASIC driver determines that this is a sample buffer received because TAM sampling is enabled on the interface.
2. The ASIC driver checks whether the sample is associated with a specific genetlink channel name and group. It encapsulates the sample in a genetlink buffer and sends it using genlmsg_multicast().
3. The callback handler registered by the TAM agent listens on the multicast group of the specific genetlink family and receives the encapsulated sample. The TAM data parser extracts the encapsulated TAM metadata attributes and writes the sample to the STATE DB using Redis.
4. The Telemetry container retrieves packet loss information from the STATE DB, constructs a gRPC message, and forwards it to the configured TAM collector.

Genetlink attributes carried in a TAM drop sample:

NET_DM_ATTR_UNSPEC,
NET_DM_ATTR_ALERT_MODE,          /* u8 */
NET_DM_ATTR_PC,                  /* u64 */
NET_DM_ATTR_SYMBOL,              /* string */
NET_DM_ATTR_IN_PORT,             /* nested */
NET_DM_ATTR_TIMESTAMP,           /* u64 */
NET_DM_ATTR_PROTO,               /* u16 */
NET_DM_ATTR_PAYLOAD,             /* binary */
NET_DM_ATTR_PAD,
NET_DM_ATTR_TRUNC_LEN,           /* u32 */
NET_DM_ATTR_ORIG_LEN,            /* u32 */
NET_DM_ATTR_QUEUE_LEN,           /* u32 */
NET_DM_ATTR_STATS,               /* nested */
NET_DM_ATTR_HW_STATS,            /* nested */
NET_DM_ATTR_ORIGIN,              /* u16 */
NET_DM_ATTR_HW_TRAP_GROUP_NAME,  /* string */
NET_DM_ATTR_HW_TRAP_NAME,        /* string */
NET_DM_ATTR_HW_ENTRIES,          /* nested */
NET_DM_ATTR_HW_ENTRY,            /* nested */
NET_DM_ATTR_HW_TRAP_COUNT,       /* u32 */
NET_DM_ATTR_SW_DROPS,            /* flag */
NET_DM_ATTR_HW_DROPS,            /* flag */
NET_DM_ATTR_FLOW_ACTION_COOKIE,  /* binary */
NET_DM_ATTR_REASON,              /* string */
NET_DM_ATTR_DROP_REASON,         /* u32 */
NET_DM_ATTR_STAGE,               /* string */
NET_DM_ATTR_COUNT,               /* string */
NET_DM_ATTR_START_TIME_STAMP,    /* string */
NET_DM_ATTR_LAST_TIME_STAMP,     /* string */
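
The attributes listed above travel as netlink type-length-value records: a 4-byte header (16-bit length including the header, 16-bit type) followed by the payload, padded to a 4-byte boundary. The sketch below walks such a buffer without depending on kernel headers or libnl. The helper names `parse_attrs` and `put_attr` are hypothetical, and a real agent would more likely use libnl or the kernel's netlink macros.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

const size_t kAttrHdrLen = 4;  // u16 length + u16 type

inline size_t align4(size_t n) { return (n + 3) & ~static_cast<size_t>(3); }

// Walks a buffer of netlink attributes and returns type -> payload bytes.
// Stops at the first malformed attribute. Nested attributes are returned
// as raw bytes and can be parsed by calling this function again.
std::map<uint16_t, std::vector<uint8_t> > parse_attrs(const uint8_t* buf, size_t len) {
    std::map<uint16_t, std::vector<uint8_t> > out;
    size_t off = 0;
    while (off + kAttrHdrLen <= len) {
        uint16_t nla_len, nla_type;
        std::memcpy(&nla_len, buf + off, 2);
        std::memcpy(&nla_type, buf + off + 2, 2);
        if (nla_len < kAttrHdrLen || nla_len > len - off) break;
        // The top two bits of the type are flags (nested, network byte order).
        out[nla_type & 0x3fff] =
            std::vector<uint8_t>(buf + off + kAttrHdrLen, buf + off + nla_len);
        off += align4(nla_len);
    }
    return out;
}

// Appends one attribute to `buf`, zero-padding the payload to 4 bytes.
void put_attr(std::vector<uint8_t>& buf, uint16_t type, const void* data, uint16_t n) {
    uint16_t nla_len = static_cast<uint16_t>(kAttrHdrLen + n);
    size_t start = buf.size();
    buf.resize(start + align4(nla_len), 0);
    std::memcpy(&buf[start], &nla_len, 2);
    std::memcpy(&buf[start + 2], &type, 2);
    if (n > 0) std::memcpy(&buf[start + kAttrHdrLen], data, n);
}
```

`put_attr` exists only so the parser can be exercised with hand-built buffers; the length and type fields are in host byte order, as they are for netlink messages on the local machine.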

3.3. SWSS tamOrch Data Flow
Image

In Orchagent, tamOrch is introduced to handle configuration requests. tamOrch parses TAM feature configurations from the command line and, taking the switch chip capabilities into account, uses the standard SAI TAM functions to deliver SAI TAM configurations through the sairedis interface, thereby configuring the TAM feature on the switch chip.
tamOrch responds to user configuration in the following order:
1. Configure flow table
2. Configure data format
3. Configure transmission mode
4. Configure collection information
5. Configure packet drop mode
6. Configure packet drop type
7. Bind flow table

3.4. SAI TAM interface implementation

Image

Genetlink SAI interface usage

// 1. Create the TAM report (genetlink report type)
sai_status_t sai_rc;
tam_attr_list[0].id = SAI_TAM_REPORT_ATTR_TYPE;
tam_attr_list[0].value.s32 = SAI_TAM_REPORT_TYPE_GENETLINK;
sai_rc = sai_tam_api->create_tam_report(&tam_report_id, gSwitchId, 1, tam_attr_list);

// 2. Create the TAM event action, referencing the report created above
tam_attr_list[0].id = SAI_TAM_EVENT_ACTION_ATTR_REPORT_TYPE;
tam_attr_list[0].value.oid = tam_report_id;
sai_rc = sai_tam_api->create_tam_event_action(&tam_event_action_id, gSwitchId, 1, tam_attr_list);

// 3. Create the TAM transport (one attribute is passed, so attr_count is 1)
tam_attr_list[0].id = SAI_TAM_TRANSPORT_ATTR_TRANSPORT_TYPE;
tam_attr_list[0].value.s32 = SAI_TAM_TRANSPORT_TYPE_NONE;
sai_rc = sai_tam_api->create_tam_transport(&tam_transport_id, gSwitchId, 1, tam_attr_list);

// 4. Create the policer that rate-limits MOD packets sent to the CPU
count = 0;
tam_attr_list[count].id = SAI_POLICER_ATTR_METER_TYPE;
tam_attr_list[count].value.s32 = SAI_METER_TYPE_PACKETS;
count++;
tam_attr_list[count].id = SAI_POLICER_ATTR_MODE;
tam_attr_list[count].value.s32 = SAI_POLICER_MODE_SR_TCM;
count++;
tam_attr_list[count].id = SAI_POLICER_ATTR_CBS;
tam_attr_list[count].value.u64 = 2000; // CBS is a 64-bit attribute
count++;
tam_attr_list[count].id = SAI_POLICER_ATTR_CIR;
tam_attr_list[count].value.u64 = 1000; // CIR is a 64-bit attribute; rate limit for MOD packets to CPU
count++;
tam_attr_list[count].id = SAI_POLICER_ATTR_RED_PACKET_ACTION;
tam_attr_list[count].value.s32 = SAI_PACKET_ACTION_DROP;
count++;
sai_rc = sai_policer_api->create_policer(&sai_policer_obj, gSwitchId, count, tam_attr_list);

// 5. Create the hostif trap
// 5.1 Create the genetlink hostif
count = 0;
tam_attr_list[count].id = SAI_HOSTIF_ATTR_TYPE;
tam_attr_list[count].value.s32 = SAI_HOSTIF_TYPE_GENETLINK;
count++;
tam_attr_list[count].id = SAI_HOSTIF_ATTR_NAME; // genetlink family name
strncpy(tam_attr_list[count].value.chardata, "NET_DM", 31);
tam_attr_list[count].value.chardata[31] = '\0';
count++;
tam_attr_list[count].id = SAI_HOSTIF_ATTR_GENETLINK_MCGRP_NAME; // genetlink multicast group name
strncpy(tam_attr_list[count].value.chardata, "events", 31);
tam_attr_list[count].value.chardata[31] = '\0';
count++;
sai_rc = sai_hostif_api->create_hostif(&sai_hostif_obj, gSwitchId, count, tam_attr_list);
// 5.2 Create the hostif trap group
count = 0;
tam_attr_list[count].id = SAI_HOSTIF_TRAP_GROUP_ATTR_QUEUE;
tam_attr_list[count].value.u32 = mod_queue; // queue is an unsigned 32-bit attribute
count++;
tam_attr_list[count].id = SAI_HOSTIF_TRAP_GROUP_ATTR_POLICER;
tam_attr_list[count].value.oid = sai_policer_obj;
count++;
sai_rc = sai_hostif_api->create_hostif_trap_group(&sai_hostif_trap_group_obj, gSwitchId, count, tam_attr_list);
// 5.3 Create the hostif user-defined trap
count = 0;
tam_attr_list[count].id = SAI_HOSTIF_USER_DEFINED_TRAP_ATTR_TYPE;
tam_attr_list[count].value.s32 = SAI_HOSTIF_USER_DEFINED_TRAP_TYPE_TAM;
count++;
tam_attr_list[count].id = SAI_HOSTIF_USER_DEFINED_TRAP_ATTR_TRAP_GROUP;
tam_attr_list[count].value.oid = sai_hostif_trap_group_obj;
count++;
sai_rc = sai_hostif_api->create_hostif_user_defined_trap(&sai_hostif_udt_obj, gSwitchId, count, tam_attr_list);
// 5.4 Create the hostif table entry that maps the trap to the genetlink hostif
count = 0;
tam_attr_list[count].id = SAI_HOSTIF_TABLE_ENTRY_ATTR_TYPE;
tam_attr_list[count].value.s32 = SAI_HOSTIF_TABLE_ENTRY_TYPE_TRAP_ID;
count++;
tam_attr_list[count].id = SAI_HOSTIF_TABLE_ENTRY_ATTR_TRAP_ID;
tam_attr_list[count].value.oid = sai_hostif_udt_obj;
count++;
tam_attr_list[count].id = SAI_HOSTIF_TABLE_ENTRY_ATTR_CHANNEL_TYPE;
tam_attr_list[count].value.s32 = SAI_HOSTIF_TABLE_ENTRY_CHANNEL_TYPE_GENETLINK;
count++;
tam_attr_list[count].id = SAI_HOSTIF_TABLE_ENTRY_ATTR_HOST_IF;
tam_attr_list[count].value.oid = sai_hostif_obj;
count++;
sai_rc = sai_hostif_api->create_hostif_table_entry(&sai_hostif_table_entry_obj, gSwitchId, count, tam_attr_list);

// 6. Create the TAM collector
count = 0;
tam_attr_list[count].id = SAI_TAM_COLLECTOR_ATTR_TRANSPORT;
tam_attr_list[count].value.oid = tam_transport_id;
count++;
tam_attr_list[count].id = SAI_TAM_COLLECTOR_ATTR_HOSTIF_TRAP;
tam_attr_list[count].value.oid = sai_hostif_udt_obj;
count++;
tam_attr_list[count].id = SAI_TAM_COLLECTOR_ATTR_DSCP_VALUE;
tam_attr_list[count].value.u8 = 0;
count++;
tam_attr_list[count].id = SAI_TAM_COLLECTOR_ATTR_SRC_IP;
tam_attr_list[count].value.ipaddr.addr_family = SAI_IP_ADDR_FAMILY_IPV4;
tam_attr_list[count].value.ipaddr.addr.ip4 = htonl(0x01010101); // 1.1.1.1
count++;
tam_attr_list[count].id = SAI_TAM_COLLECTOR_ATTR_DST_IP;
tam_attr_list[count].value.ipaddr.addr_family = SAI_IP_ADDR_FAMILY_IPV4;
tam_attr_list[count].value.ipaddr.addr.ip4 = htonl(0x01010102); // 1.1.1.2
count++;
sai_rc = sai_tam_api->create_tam_collector(&tam_collector_id, gSwitchId, count, tam_attr_list);

// 7. Create the TAM packet-drop event, linking the event action and the collector
tam_attr_list[0].id = SAI_TAM_EVENT_ATTR_TYPE;
tam_attr_list[0].value.s32 = SAI_TAM_EVENT_TYPE_PACKET_DROP;
tam_attr_list[1].id = SAI_TAM_EVENT_ATTR_ACTION_LIST;
tam_attr_list[1].value.objlist.count = 1;
tam_attr_list[1].value.objlist.list = (sai_object_id_t *)calloc(1, sizeof(sai_object_id_t));
tam_attr_list[1].value.objlist.list[0] = tam_event_action_id;
tam_attr_list[2].id = SAI_TAM_EVENT_ATTR_COLLECTOR_LIST;
tam_attr_list[2].value.objlist.count = 1;
tam_attr_list[2].value.objlist.list = (sai_object_id_t *)calloc(1, sizeof(sai_object_id_t));
tam_attr_list[2].value.objlist.list[0] = tam_collector_id;
count = 3;
sai_rc = sai_tam_api->create_tam_event(&tam_event_id, gSwitchId, count, tam_attr_list);

// 8. Create the TAM object, bound at switch level
count = 0;
tam_attr_list[count].id = SAI_TAM_ATTR_EVENT_OBJECTS_LIST;
tam_attr_list[count].value.objlist.list = (sai_object_id_t *)calloc(1, sizeof(sai_object_id_t));
tam_attr_list[count].value.objlist.list[0] = tam_event_id; // event created in step 7
tam_attr_list[count].value.objlist.count = 1;
count++;
tam_attr_list[count].id = SAI_TAM_ATTR_TAM_BIND_POINT_TYPE_LIST;
tam_attr_list[count].value.s32list.list = (sai_int32_t *)calloc(1, sizeof(sai_int32_t));
tam_attr_list[count].value.s32list.list[0] = SAI_TAM_BIND_POINT_TYPE_SWITCH;
tam_attr_list[count].value.s32list.count = 1;
count++;
sai_rc = sai_tam_api->create_tam(&tam_id, gSwitchId, count, tam_attr_list);
vector<sai_object_id_t> tam_oid_list;
tam_oid_list.push_back(tam_id);

// 9. Enable TAM on the switch
switch_attr.id = SAI_SWITCH_ATTR_TAM_OBJECT_ID;
switch_attr.value.objlist.count = (uint32_t)tam_oid_list.size();
switch_attr.value.objlist.list = tam_oid_list.data();
sai_rc = sai_switch_api->set_switch_attribute(gSwitchId, &switch_attr);
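
The sequence above creates objects that depend on one another, and the snippet does not show what happens when one call fails. A common pattern, sketched below under that assumption, is to run the steps in order and undo the completed ones in reverse when a later step fails. `Step` and `run_steps` are hypothetical helpers; the callbacks would wrap the real SAI create and remove calls.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One creation step with its matching cleanup.
struct Step {
    std::string name;
    std::function<bool()> create;  // returns true on success
    std::function<void()> remove;  // undoes a successful create
};

// Runs the steps in order. On the first failure, undoes the steps that
// already succeeded in reverse order. Returns the index of the failed
// step, or -1 when every step succeeded.
int run_steps(const std::vector<Step>& steps) {
    for (size_t i = 0; i < steps.size(); ++i) {
        if (!steps[i].create()) {
            for (size_t j = i; j-- > 0;)
                steps[j].remove();
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Reverse-order cleanup matters here because later objects hold references to earlier ones (for example, the collector references the transport), so they have to be removed first.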
