diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Build_ORT.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Build_ORT.md
new file mode 100644
index 000000000..a23b89236
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Build_ORT.md
@@ -0,0 +1,56 @@
+---
+title: Build ONNX Runtime with KleidiAI and SME2 for Android
+weight: 4
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Build ONNX Runtime and benchmark application with KleidiAI and SME2 support for Android
+
+To run ONNX Runtime (ORT) on an Android device, you must cross-compile it using the Android NDK.
+### Prerequisites
+- Android NDK: Version r26b or newer (r27+ recommended for latest SME2 toolchain support).
+- CMake & Ninja: Ensure these are in your system PATH.
+
+### Build Command
+Run the following from the root of the ONNX Runtime repository:
+```bash
+./build.sh --android --android_sdk_path $ANDROID_NDK_HOME --android_ndk_path $ANDROID_NDK_HOME --android_abi arm64-v8a --android_api 27 --config RelWithDebInfo --build_shared_lib --cmake_extra_defines onnxruntime_USE_KLEIDIAI=ON --cmake_generator Ninja --parallel
+```
+
+Note: The flag *onnxruntime_USE_KLEIDIAI=ON* builds the Arm KleidiAI kernels into the MLAS library.
+
+## Profiling Performance with onnxruntime_perf_test
+Once the build is complete, you will find the libonnxruntime.so shared library and the onnxruntime_perf_test binary in your build directory. The onnxruntime_perf_test tool measures inference latency and helps identify bottlenecks.
+### Step 1: Push files to Android Device
+```bash
+adb push /Android/RelWithDebInfo/onnxruntime_perf_test /data/local/tmp/
+adb push /Android/RelWithDebInfo/libonnxruntime.so /data/local/tmp/
+adb push your_model.onnx /data/local/tmp/
+```
+### Step 2: Run the Performance Test
+The onnxruntime_perf_test tool runs inference repeatedly and gathers statistics. For example:
+```bash
+# Run from the host; adb executes the command on the device
+adb shell "/data/local/tmp/onnxruntime_perf_test -e cpu -m times -r 20 -s -Z -x 1 /data/local/tmp/your_model.onnx"
+```
+The example command uses the following arguments:
+- *-e cpu* selects the CPU execution provider
+- *-m times* sets the test mode to "times" (run a fixed number of iterations)
+- *-r 20* repeats the inference 20 times
+- *-Z* disables thread spinning between runs to reduce CPU usage
+- *-s* prints the statistics
+- *-x 1* sets the number of threads used to parallelize execution within a node to 1
+
+You can experiment with other argument settings.
+
+### Step 3: Deep Dive into Operator Profiling
+To see how much time is spent in each operator, use the profiling flag -p.
+```bash
+adb shell "/data/local/tmp/onnxruntime_perf_test -p /data/local/tmp/profile -e cpu -m times -r 5 -s -Z -x 1 /data/local/tmp/your_model.onnx"
+adb pull /data/local/tmp/profile_<timestamp>.json  # replace <timestamp> with the value in the generated file name
+```
+The argument *-p* enables performance profiling during the benchmark run. ONNX Runtime treats the value as a file name prefix, appends a timestamp, and writes a JSON file containing a detailed trace of the model execution. An absolute prefix is used here because the default working directory of adb shell is usually not writable.
+You can view the results by opening the [Perfetto UI](https://ui.perfetto.dev/) and loading the generated JSON file. This shows a visual timeline of which operations took the most time.
+You can also convert the JSON file to a CSV sheet with a short Python script.
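A minimal sketch of such a conversion script is shown here. It assumes the profile is a JSON array of Chrome-trace-style events in which per-operator timings have "cat" equal to "Node", a "name" ending in "_kernel_time", an "op_name" entry under "args", and a "dur" value in microseconds. Check these field names against your own profile file, because they are assumptions rather than something stated in this learning path. The helper names and the sample events are hypothetical.

```python
import csv
import json
import sys

FIELDS = ["node", "op_type", "provider", "dur_us"]
SUFFIX = "_kernel_time"  # assumed suffix of per-operator timing events


def profile_to_rows(events):
    """Keep only per-operator kernel timing events and flatten them into rows."""
    rows = []
    for event in events:
        name = event.get("name", "")
        if event.get("cat") != "Node" or not name.endswith(SUFFIX):
            continue
        args = event.get("args", {})
        rows.append({
            "node": name[: -len(SUFFIX)],
            "op_type": args.get("op_name", ""),
            "provider": args.get("provider", ""),
            "dur_us": event.get("dur", 0),
        })
    return rows


def write_csv(rows, out):
    """Write the rows to an open text file object in CSV form."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical events, only to exercise the code; not real measurements.
SAMPLE_EVENTS = [
    {"cat": "Session", "name": "model_run", "dur": 900, "args": {}},
    {"cat": "Node", "name": "conv1_kernel_time", "dur": 500,
     "args": {"op_name": "Conv", "provider": "CPUExecutionProvider"}},
    {"cat": "Node", "name": "conv1_fence_before", "dur": 1,
     "args": {"op_name": "Conv"}},
]

if __name__ == "__main__":
    if len(sys.argv) == 3:
        # usage: python profile_to_csv.py <profile.json> <out.csv>
        with open(sys.argv[1]) as src:
            loaded = json.load(src)
        with open(sys.argv[2], "w", newline="") as dst:
            write_csv(profile_to_rows(loaded), dst)
    else:
        # No files given: show the CSV produced from the sample events.
        write_csv(profile_to_rows(SAMPLE_EVENTS), sys.stdout)
```

Keeping one row per node (rather than summing per operator type up front) leaves the CSV usable both for per-layer inspection and for later aggregation in a spreadsheet.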
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/KleidiAI_integration.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/KleidiAI_integration.md
new file mode 100644
index 000000000..bccd73c50
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/KleidiAI_integration.md
@@ -0,0 +1,105 @@
+---
+title: Integration of KleidiAI into ORT MLAS
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Integration of KleidiAI into ONNX Runtime MLAS
+When ONNX Runtime is built with KleidiAI support, MLAS selects its kernels in two steps:
+1. Detection: At runtime, MLAS checks the CPU capabilities for SME2 support.
+2. Dispatch: If SME2 is detected, MLAS overrides its default kernels. For example, a Gemm (General Matrix Multiplication) operation that would normally use standard vector instructions (such as NEON) is dispatched to a KleidiAI SME2 micro-kernel.
+
+Currently, KleidiAI in MLAS provides the ArmKleidiAI::MlasConv, ArmKleidiAI::MlasGemmBatch and ArmKleidiAI::MlasDynamicQGemmBatch kernels.
+
+### The ArmKleidiAI::MlasConv kernel
+Usually, 2D fp32 convolution operators with batch_size=1 and multiple filters (with a filter kernel size equal to or greater than (3,3)) are dispatched to the ArmKleidiAI::MlasConv kernel.
+
+For example, the figure below shows a (7,7) Conv node.
+
+![Diagram illustrating an example of 7x7 Conv alt-text#center](images/conv_nodes_7x7.jpg "An example of (7,7) Conv node")
+
+The ArmKleidiAI::MlasConv kernel uses KleidiAI’s indirect matrix multiplication (imatmul) micro-kernels to accelerate the convolution.
+
+The function call hierarchy is shown below.
+```text
+onnxruntime::InferenceSession::Run
+|--onnxruntime::utils::ExecuteGraph
+| |--onnxruntime::utils::ExecuteGraphImpl
+| | |--onnxruntime::ExecuteThePlan
+| | | |--onnxruntime::concurrency::ThreadPool::Schedule
+| | | | |--onnxruntime::RunSince
+| | | | | |--onnxruntime::LaunchKernelStep::Execute
+| | | | | | |--onnxruntime::ExecuteKernel
+| | | | | | | |--onnxruntime::Conv::Compute
+| | | | | | | | |--MlasConv
+| | | | | | | | | |--ArmKleidiAI::MlasConv
+| | | | | | | | | | |--ConvolveSme
+| | | | | | | | | | | |--MlasTrySimpleParallel
+| | | | | | | | | | | | |--kai_run_lhs_imatmul_pack_x32p2vlx1_x32p_sme
+| | | | | | | | | | | | | |--kai_kernel_lhs_imatmul_pack_x32p2vlx1_x32p_sme
+| | | | | | | | | | | | |--kai_run_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme
+| | | | | | | | | | | | | |--kai_kernel_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme
+| | | | | | | | | | | | |--kai_run_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa
+| | | | | | | | | | | | | |--kai_kernel_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa
+```
+
+### The ArmKleidiAI::MlasGemmBatch kernel
+This kernel performs a batched fp32 matrix multiplication (GEMM or GEMV) using KleidiAI matmul micro-kernels. fp32 Conv operators with (1,1) filter kernels also use this kernel.
+
+For example, the figure below shows a (1,1) Conv node.
+
+![Diagram illustrating an example of 1x1 Conv alt-text#center](images/conv_nodes_1x1.jpg "An example of (1,1) FusedConv node")
+
+The function calls of fp32 Conv operators with (1,1) filter kernels are shown below.
+
+```text
+onnxruntime::InferenceSession::Run
+|--onnxruntime::utils::ExecuteGraph
+| |--onnxruntime::utils::ExecuteGraphImpl
+| | |--onnxruntime::ExecuteThePlan
+| | | |--onnxruntime::concurrency::ThreadPool::Schedule
+| | | | |--onnxruntime::RunSince
+| | | | | |--onnxruntime::LaunchKernelStep::Execute
+| | | | | | |--onnxruntime::ExecuteKernel
+| | | | | | | |--onnxruntime::Conv::Compute
+| | | | | | | | |--MlasConv
+| | | | | | | | | |--MlasGemmBatch
+| | | | | | | | | | |--ArmKleidiAI::MlasGemmBatch
+| | | | | | | | | | | |--MlasTrySimpleParallel
+| | | | | | | | | | | | |--kai_run_lhs_pack_f32p2vlx1_f32_sme
+| | | | | | | | | | | | | |--kai_kernel_lhs_pack_f32p2vlx1_f32_sme
+| | | | | | | | | | | | |--ArmKleidiAI::MlasGemmPackB
+| | | | | | | | | | | | | |--kai_run_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
+| | | | | | | | | | | | | | |--kai_kernel_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
+| | | | | | | | | | | | |--kai_run_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
+| | | | | | | | | | | | | |--kai_kernel_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
+```
+
+As another example, the figure below shows a Gemm node.
+
+![Diagram illustrating an example of Gemm node alt-text#center](images/Gemm_node.jpg "An example of Gemm node")
+
+The function calls of fp32 Gemm operators are shown below.
+```text
+onnxruntime::InferenceSession::Run
+|--onnxruntime::utils::ExecuteGraph
+| |--onnxruntime::utils::ExecuteGraphImpl
+| | |--onnxruntime::ExecuteThePlan
+| | | |--onnxruntime::concurrency::ThreadPool::Schedule
+| | | | |--onnxruntime::RunSince
+| | | | | |--onnxruntime::LaunchKernelStep::Execute
+| | | | | | |--onnxruntime::ExecuteKernel
+| | | | | | | |--onnxruntime::Gemm::Compute
+| | | | | | | | |--MlasGemm
+| | | | | | | | | |--MlasGemmBatch
+| | | | | | | | | | |--ArmKleidiAI::MlasGemmBatch
+| | | | | | | | | | | |--MlasTrySimpleParallel
+| | | | | | | | | | | | |--kai_run_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
+| | | | | | | | | | | | | |--kai_kernel_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
+```
+
+### The ArmKleidiAI::MlasDynamicQGemmBatch kernel
+This kernel performs matrix multiplication with a dynamically quantized A matrix and a symmetrically quantized B matrix, producing float output.
+It uses the KleidiAI *kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa* micro-kernel.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Overview.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Overview.md
new file mode 100644
index 000000000..189dc3013
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Overview.md
@@ -0,0 +1,47 @@
+---
+title: ONNX Runtime overview
+weight: 2
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## ONNX Runtime overview
+With the rise of on-device AI, squeezing performance from CPUs has become critical. Arm’s Scalable Matrix Extension 2 (SME2) offers significant speedups for matrix-heavy workloads such as Transformers and CNNs.
+This learning path walks you through the technical steps to integrate KleidiAI, Arm's specialized micro-kernel library with SME2 support, into ONNX Runtime (ORT) and to profile its performance using onnxruntime_perf_test on Android devices.
+
+### Understanding the ONNX Runtime Software Stack
+First, look at the internal architecture of ONNX Runtime.
+![Diagram illustrating ONNX runtime components alt-text#center](images/ort_overview.jpg "The ONNX Runtime overview")
+
+#### 1. In-Memory Graph
+When loading an ONNX model, ORT parses the protobuf file and creates an In-Memory Graph. This is a live representation of the model’s structure, consisting of:
+- Nodes: Representing operations (e.g., MatMul, Conv, Add).
+- Edges: Representing the flow of data (tensors) between those operations.
+
+During this stage, ORT performs Graph Optimizations such as constant folding and node fusion.
+#### 2. Graph Partitioner
+The Graph Partitioner decides which part of the model runs on which hardware. It analyzes the computational graph and matches nodes to the registered Execution Providers (EPs).
+It clusters adjacent nodes assigned to the same EP into "Subgraphs".
+#### 3. Graph Runner
+Once the graph is partitioned, the Graph Runner executes the operators in the correct order. It manages the flow of data (Tensors) between nodes.
+In ORT, parallelism is split into two distinct levels to maximize hardware utilization: Intra-op (inside an operator/node, splitting a single heavy operation into smaller chunks) and Inter-op (between different operators, running multiple independent operators at the same time).
+
+#### 4. Execution Provider (EP)
+An Execution Provider is the abstraction layer that interfaces with specific hardware or libraries.
+Each EP provides a set of "Kernels" (optimized math functions) for specific operators.
+Examples:
+- CPU: Default CPU, Intel DNNL, XNNPACK, etc.
+- GPU: NVIDIA CUDA/TensorRT, AMD MIGraphX, DirectML, etc.
+- Others: NPU, Qualcomm QNN, etc.
+
+If a specialized EP doesn't support a specific operator, ORT automatically falls back to the CPU provider.
+
+The default CPU provider uses Microsoft Linear Algebra Subprograms (MLAS). MLAS is a minimal BLAS-like library that implements optimized linear algebra operations, such as general matrix multiply (GEMM), in low-level languages for a range of processors. On AArch64, MLAS already uses the dotprod, i8mm, fp16, and bf16 vector instructions for acceleration.
+
+The KleidiAI-optimized MLAS can delegate high-performance matrix operations to KleidiAI micro-kernels. KleidiAI provides micro-kernels specifically tuned for SME2, allowing ORT to take advantage of the latest hardware features.
+
+This learning path focuses on the Arm CPU Execution Provider.
+
+
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Profiling_example.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Profiling_example.md
new file mode 100644
index 000000000..da3828674
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/Profiling_example.md
@@ -0,0 +1,105 @@
+---
+title: Profiling – Use ResNet50v2 fp32 model as an example
+weight: 5
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Profile an ONNX model – Use ResNet50v2 as an example
+The ResNet50v2 fp32 ONNX model can be downloaded from Hugging Face or ModelScope.
+
+The Android device used here is a vivo X300 phone with a MediaTek Dimensity 9500 (MTK D9500) processor, which has Arm C1-Ultra, C1-Premium, and C1-Pro CPU cores with SME2 support. We chose a C1-Pro CPU core running at 2.0 GHz to run the onnxruntime_perf_test benchmark application. You can use any other Android device with SME2 support.
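Before running the benchmark on a different device, it can help to confirm which cores report SME2. The sketch below parses /proc/cpuinfo text and lists the cores whose Features line contains the sme2 flag. It assumes that the device kernel exposes that flag name in /proc/cpuinfo and that adb is on the host PATH; the function names and the sample text are hypothetical and only illustrate the parsing.

```python
import subprocess


def cpu_features(cpuinfo_text):
    """Return {cpu_index: set_of_feature_flags} parsed from /proc/cpuinfo text."""
    features = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        if key == "processor" and value.isdigit():
            current = int(value)
        elif key == "features" and current is not None:
            features[current] = set(value.split())
    return features


def cores_with(cpuinfo_text, flag="sme2"):
    """List the CPU indices whose feature set contains the given flag."""
    return sorted(cpu for cpu, flags in cpu_features(cpuinfo_text).items()
                  if flag in flags)


def read_device_cpuinfo():
    """Fetch /proc/cpuinfo from the connected device (requires adb)."""
    result = subprocess.run(["adb", "shell", "cat", "/proc/cpuinfo"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical cpuinfo excerpt, used only to exercise the parser.
SAMPLE = (
    "processor\t: 0\n"
    "Features\t: fp asimd sve sme sme2\n"
    "\n"
    "processor\t: 1\n"
    "Features\t: fp asimd\n"
)

if __name__ == "__main__":
    # Swap SAMPLE for read_device_cpuinfo() when a device is attached.
    print(cores_with(SAMPLE))
```

The returned core indices can then be used to pick the CPU affinity mask for the benchmark run.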
+
+To compare the performance of running ResNet50v2 on ORT with and without SME2 support, we built two versions of ORT: one with SME2 support (*onnxruntime_USE_KLEIDIAI=ON* when building ORT) and one without SME2 support (*onnxruntime_USE_KLEIDIAI=OFF* when building ORT).
+
+Run the following command on the device:
+```bash
+taskset 1 ./onnxruntime_perf_test -e cpu -r 5 -m times -s -Z -x 1 -p "resnet50v2.onnx_1xC1-Pro_profile" ./resnet50v2.onnx
+```
+
+The *taskset 1* in the command sets the CPU affinity of the *onnxruntime_perf_test* benchmark to CPU core 0, which is a C1-Pro CPU core.
+*-x 1* in the command sets the number of threads used to parallelize the execution within nodes to 1 (single thread).
+
+The output from running onnxruntime_perf_test on ORT with SME2 support is shown below.
+```text
+Setting intra_op_num_threads to 1
+Disabling intra-op thread spinning between runs
+Session creation time cost: 0.217932 s
+First inference time cost: 196 ms
+Total inference time cost: 0.49481 s
+Total inference requests: 5
+Average inference time cost total: 98.961997 ms
+Total inference run time: 0.494854 s
+Number of inferences per second: 10.104
+Avg CPU usage: 11 %
+Peak working set size: 271122432 bytes
+Avg CPU usage:11
+Peak working set size:271122432
+Runs:5
+Min Latency: 0.0958204 s
+Max Latency: 0.101519 s
+P50 Latency: 0.0995086 s
+P90 Latency: 0.101519 s
+P95 Latency: 0.101519 s
+P99 Latency: 0.101519 s
+P999 Latency: 0.101519 s
+```
+
+The output from running onnxruntime_perf_test on ORT without SME2 support is shown below.
+```text
+Setting intra_op_num_threads to 1
+Disabling intra-op thread spinning between runs
+Session creation time cost: 0.227282 s
+First inference time cost: 343 ms
+Total inference time cost: 1.69691 s
+Total inference requests: 5
+Average inference time cost total: 339.381120 ms
+Total inference run time: 1.69697 s
+Number of inferences per second: 2.94642
+Avg CPU usage: 11 %
+Peak working set size: 241426432 bytes
+Avg CPU usage:11
+Peak working set size:241426432
+Runs:5
+Min Latency: 0.333323 s
+Max Latency: 0.34682 s
+P50 Latency: 0.336476 s
+P90 Latency: 0.34682 s
+P95 Latency: 0.34682 s
+P99 Latency: 0.34682 s
+P999 Latency: 0.34682 s
+```
+### Performance Indicators
+| Metric | Non-KleidiAI | KleidiAI (with SME2) | Speedup |
+|----------------------------|--------------|----------------------|---------|
+| Latency per inference (ms) | 339 | 99 | >3.4x |
+
+We can use the [Perfetto UI](https://ui.perfetto.dev/) to view the two JSON profile files.
+
+The figure below is a screenshot of the Non-KleidiAI version of the JSON profile file.
+The selected part (one model_run/SequentialExecutor) in the figure covers one inference execution.
+
+![Figure showing profile file of Non-KleidiAI version alt-text#center](images/resnet50v2_no_sme_prefetto.png "Perfetto view of the Non-KleidiAI version of ORT")
+
+The figure below is a screenshot of the KleidiAI (with SME2) version of the JSON profile file.
+The selected part (one model_run/SequentialExecutor) in the figure covers one inference execution.
+![Figure showing profile file of KleidiAI with SME2 version alt-text#center](images/resnet50v2_sme_prefetto.png "Perfetto view of the KleidiAI with SME2 version of ORT")
+
+We also converted the two JSON profile files to CSV sheets and combined the per-operator execution times of the Non-KleidiAI and KleidiAI (with SME2) versions into a single chart.
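The combination step can be sketched roughly as follows. The code sums the durations per operator type for each build and puts the two totals side by side with their ratio. It assumes each profile has already been reduced to (operator type, duration in microseconds) pairs; the function names are hypothetical, and the numbers in the sample lists are made up for illustration and are not measurements from this learning path.

```python
from collections import defaultdict


def total_by_op(rows):
    """Sum durations (microseconds) per operator type."""
    totals = defaultdict(int)
    for op_type, dur_us in rows:
        totals[op_type] += dur_us
    return dict(totals)


def compare(baseline_rows, kleidiai_rows):
    """Return (op_type, baseline_us, kleidiai_us, speedup) tuples sorted by op type."""
    base = total_by_op(baseline_rows)
    opt = total_by_op(kleidiai_rows)
    table = []
    for op_type in sorted(set(base) | set(opt)):
        b = base.get(op_type, 0)
        k = opt.get(op_type, 0)
        speedup = b / k if k else None  # None when the op is absent in the KleidiAI run
        table.append((op_type, b, k, speedup))
    return table


# Made-up durations, only to exercise the code.
BASELINE = [("Conv", 3000), ("Conv", 1000), ("Add", 100)]
KLEIDIAI = [("Conv", 1000), ("Add", 100)]

if __name__ == "__main__":
    for op_type, b, k, speedup in compare(BASELINE, KLEIDIAI):
        print(op_type, b, k, speedup)
```

Aggregating by operator type keeps the comparison readable even when the two builds fuse or name individual nodes differently.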
+![Figure showing operator time of both versions of ORT alt-text#center](images/resnet50v2_with_sme_without_sme_2.png "Operator execution time comparison")
+
+The chart shows that ORT with KleidiAI (with SME2) kernels improves performance significantly, especially for convolution operators.
+
+For further investigation, we can use Arm Streamline with PMU counters. The Streamline timeline view shows that the SME2 floating-point Outer Product and Accumulate (MOPA) instruction is used intensively during inference.
+
+![Figure showing SME2 instructions and cycles alt-text#center](images/resnet50v2_sme_onnx_streamline_1xgelas_annotation.png "SME2 instructions and cycles shown in Streamline")
+
+We then combined the Streamline function call views of ORT without KleidiAI and with KleidiAI (with SME2) into a single figure:
+
+![Figure showing function call percentage of both versions of ORT alt-text#center](images/function_call_compare.png "Function call percentage of both versions of ORT in Streamline")
+
+The figure shows that KleidiAI kernels provide a significant performance uplift for convolution operators compared to the default MLAS kernels (*MlasSgemmKernelAdd* and *MlasSgemmKernelZero*).
+
+## Summary
+By integrating KleidiAI (SME2) into ONNX Runtime, you unlock the parallel matrix processing power of Arm SME2. This turns the Arm CPU from a "fallback" into a high-performance AI engine capable of running LLMs and complex vision models locally on devices.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_index.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_index.md
new file mode 100644
index 000000000..683c8dd05
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_index.md
@@ -0,0 +1,55 @@
+---
+title: Unleashing SME2 Performance - Profile ONNX models with KleidiAI-Optimized ONNX Runtime
+
+minutes_to_complete: 40
+
+who_is_this_for: This is an advanced topic for software developers, performance engineers, and AI practitioners
+
+learning_objectives:
+    - Build the ONNX Runtime library with KleidiAI and SME2 support
+    - Profile the performance of ONNX models
+    - Learn how KleidiAI and SME2 accelerate ONNX operators
+
+prerequisites:
+    - Knowledge of KleidiAI and SME2
+    - An Android device with Arm SME2 support
+
+author: Zenon Zhilong Xiu
+
+### Tags
+skilllevels: Advanced
+subjects: ML
+armips:
+    - Arm C1 CPU
+    - Arm SME2 unit
+tools_software_languages:
+    - C++
+    - ONNX runtime
+operatingsystems:
+    - Android
+    - Linux
+
+
+
+further_reading:
+    - resource:
+        title: Part 1 Arm Scalable Matrix Extension Introduction
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
+        type: blog
+    - resource:
+        title: Part 2 Arm Scalable Matrix Extension Instructions
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
+        type: blog
+    - resource:
+        title: Part 4 Arm SME2 Introduction
+        link: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/part4-arm-sme2-introduction
+        type: blog
+
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always
has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_next-steps.md
new file mode 100644
index 000000000..727b395dd
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # The weight controls the order of the pages. _index.md always has weight 1.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/Gemm_node.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/Gemm_node.jpg
new file mode 100644
index 000000000..5e642b6bb
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/Gemm_node.jpg differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_1x1.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_1x1.jpg
new file mode 100644
index 000000000..a1ac91d9e
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_1x1.jpg differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_7x7.jpg b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_7x7.jpg
new file mode 100644
index 000000000..59f422f45
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/conv_nodes_7x7.jpg differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/function_call_compare.png b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/function_call_compare.png
new file mode 100644
index 000000000..9bb73a3ef
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/function_call_compare.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/ort_overview.jpg
b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/ort_overview.jpg
new file mode 100644
index 000000000..141343d86
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/ort_overview.jpg differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_no_sme_prefetto.png b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_no_sme_prefetto.png
new file mode 100644
index 000000000..049c79c05
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_no_sme_prefetto.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_onnx_streamline_1xgelas_annotation.png b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_onnx_streamline_1xgelas_annotation.png
new file mode 100644
index 000000000..282c40bfe
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_onnx_streamline_1xgelas_annotation.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_prefetto.png b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_prefetto.png
new file mode 100644
index 000000000..e188bf761
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_sme_prefetto.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_with_sme_without_sme_2.png
b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_with_sme_without_sme_2.png
new file mode 100644
index 000000000..9d4a295f4
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/performance_onnxruntime_kleidiai_sme2/images/resnet50v2_with_sme_without_sme_2.png differ