Recording some materials about LLMs.
- [Arxiv 2024] LoRA+: Efficient Low Rank Adaptation of Large Models. Applying different learning rates to the adapter matrices A and B in LoRA. Code
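A minimal sketch of the LoRA+ idea, assuming PyTorch; the parameter names and the learning-rate ratio below are illustrative, not taken from the paper's code:

```python
import torch

# Hypothetical LoRA adapter parameters (names are illustrative).
lora_A = torch.nn.Parameter(torch.randn(8, 768) * 0.01)   # A: random init
lora_B = torch.nn.Parameter(torch.zeros(768, 8))           # B: zero init

# LoRA+ idea: give B a learning rate that is a fixed ratio larger than A's,
# implemented here via optimizer parameter groups.
base_lr, lr_ratio = 1e-4, 16
optimizer = torch.optim.AdamW([
    {"params": [lora_A], "lr": base_lr},
    {"params": [lora_B], "lr": base_lr * lr_ratio},
])
```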
- [Arxiv 2024] LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
- [Arxiv 2021] LoRA: Low-Rank Adaptation of Large Language Models. Training large models with a small number of trainable parameters by using low-rank adapters. Code
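A minimal sketch of a LoRA-style adapter layer, assuming PyTorch; `LoRALinear` and its hyperparameters are hypothetical names for illustration:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight plus a trainable low-rank update:
    y = x W^T + scale * (x A^T) B^T."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                           # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable, random init
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))         # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scale

layer = LoRALinear(768, 768)
print(layer(torch.randn(2, 768)).shape)   # only A and B receive gradients
```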
- Some open Hugging Face datasets. Mainly used for pretraining LLMs.
- Stanford Alpaca dataset. 52K samples for instruction fine-tuning LLMs.
- Alpaca-cleaned fine-tuning dataset. A cleaned and curated version of the Stanford Alpaca dataset.
- Open LLM Leaderboard. Evaluating LLMs on six different tasks.
- tinyBenchmarks, paper. A small version of the Open LLM Leaderboard; it contains only 100 samples per task.
- lm-evaluation-harness. An evaluation tool that can evaluate LLMs on different tasks with simple commands.
- [Arxiv 2024] SADMoE: Exploiting Activation Sparsity with Dynamic-k Gating. Proposing a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis, whereas standard MoE uses fixed-k expert selection for all tokens. Code
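The exact selection rule is not spelled out here, so the sketch below only illustrates one plausible per-token dynamic-k gate (keep experts until a cumulative router-probability threshold is reached), assuming PyTorch:

```python
import torch
import torch.nn.functional as F

def dynamic_k_select(router_logits, p=0.9, max_k=4):
    """Per-token dynamic-k gating sketch: keep the smallest set of experts whose
    cumulative routing probability reaches p (capped at max_k). The exact rule in
    SADMoE may differ; this only illustrates per-token variable expert counts."""
    probs = torch.softmax(router_logits, dim=-1)              # (tokens, experts)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # expert j is kept if the cumulative mass *before* it is still below p
    keep = F.pad(cum[:, :-1], (1, 0)) < p
    keep[:, max_k:] = False
    return sorted_idx, sorted_p, keep                          # mask differs per token

logits = torch.randn(5, 8)                                     # 5 tokens, 8 experts
idx, w, mask = dynamic_k_select(logits)
print(mask.sum(dim=-1))                                        # experts executed per token
```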
- [Arxiv 2024] Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts. Presenting a novel shortcut-connected MoE architecture with an overlapping parallel strategy, designated ScMoE, which effectively decouples communication from its conventional sequence, allowing a substantial 70% to 100% overlap with computation.
- [Arxiv 2024] Approximating Two-Layer Feedforward Networks for Efficient Transformers. Introducing several novel perspectives on MoEs, presenting a general framework that unifies various methods to approximate two-layer NNs (e.g., feedforward blocks of Transformers), including product-key memories (PKMs). Code
- [Arxiv 2024] A Survey of Resource-efficient LLM and Multimodal Foundation Models. This survey delves into the critical importance of resource-efficient research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations.
- [Arxiv 2024] LLM Inference Unveiled: Survey and Roofline Model Insights. Not only summarizing the current state of research but also introducing a framework based on the roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. Code
- [Arxiv 2023] A Survey on Model Compression for Large Language Models.
- [Arxiv 2024] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy. Grouping experts in a layer, merging them into a single expert, and then decomposing the merged expert with Singular Value Decomposition (SVD). Code
- [Arxiv 2024] SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression. Proposing a truncation-aware data whitening strategy to ensure a direct mapping between singular values and compression loss and adopting a layer-wise closed-form model parameter update strategy to compensate for accuracy degradation caused by SVD truncation. Code
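Both SVD-LLM and ASVD build on truncating the SVD of a weight matrix into two smaller factors; the sketch below shows only that common building block (no whitening or activation-aware scaling), assuming PyTorch:

```python
import torch

def svd_compress(W, rank):
    """Replace W (out x in) with two factors so that W @ x ~= B @ (A @ x).
    Plain truncated SVD; SVD-LLM additionally whitens W with activation statistics
    and ASVD rescales it by activation magnitudes before decomposing."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = torch.diag(S[:rank].sqrt()) @ Vh[:rank]     # (rank, in)
    B = U[:, :rank] @ torch.diag(S[:rank].sqrt())   # (out, rank)
    return A, B

W = torch.randn(4096, 4096)
A, B = svd_compress(W, rank=256)
x = torch.randn(4096)
print(torch.dist(W @ x, B @ (A @ x)))               # approximation error
```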
- [Arxiv 2023] ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models. Incorporating activation statistics into the SVD process to mitigate the activation outlier problem. Code
- [Arxiv 2023] LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. This method combines the advantages of low-rank approximation and pruning, using low-rank SVD to decompose the (W - S) matrix, where W is the weight matrix and S is the sparse matrix obtained by pruning. Code
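A rough sketch of a low-rank plus sparse split in PyTorch; the pruning rule (keep the largest-magnitude residuals) and the decomposition order are illustrative simplifications of LoSparse:

```python
import torch

def losparse_split(W, rank, sparsity=0.95):
    """Approximate W as U @ V (low rank) + S (sparse residual). The criterion used
    here for choosing which residual entries to keep is only illustrative."""
    Uf, Sv, Vh = torch.linalg.svd(W, full_matrices=False)
    U = Uf[:, :rank] * Sv[:rank]          # (out, rank), columns scaled by singular values
    V = Vh[:rank]                         # (rank, in)
    residual = W - U @ V
    k = int(residual.numel() * (1 - sparsity))            # number of entries to keep
    thresh = residual.abs().flatten().kthvalue(residual.numel() - k).values
    S = torch.where(residual.abs() > thresh, residual, torch.zeros_like(residual))
    return U, V, S

W = torch.randn(1024, 1024)
U, V, S = losparse_split(W, rank=64)
print(torch.dist(W, U @ V + S) / W.norm())                 # relative reconstruction error
```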
- [Arxiv 2024] ShortGPT: Layers in Large Language Models are More Redundant Than You Expect. Discovering that many layers of LLMs exhibit high similarity and some layers play a negligible role in network functionality. Based on this observation, defining a metric called Block Influence (BI) to gauge the significance of each layer, then directly deleting redundant layers based on their BI scores.
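A sketch of a Block-Influence-style score, assuming it is computed as one minus the cosine similarity between a block's input and output hidden states (an assumption consistent with the description above), in PyTorch:

```python
import torch
import torch.nn.functional as F

def block_influence(hidden_in, hidden_out, eps=1e-8):
    """Block Influence style score: 1 - cosine similarity between a transformer
    block's input and output hidden states, averaged over tokens. A block with a
    low score barely transforms its input and is a candidate for removal."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1, eps=eps)
    return (1.0 - cos).mean().item()

h_in = torch.randn(4, 128, 4096)               # (batch, seq, hidden)
h_out = h_in + 0.01 * torch.randn_like(h_in)   # a nearly-identity block -> low BI
print(block_influence(h_in, h_out))
```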
- [Arxiv 2024] Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters. Simply replacing non-ReLU activation functions with ReLU fails to achieve sufficient sparsity, and inadequate training data can further increase the risk of performance degradation. To address these challenges, this work proposes a novel dReLU function (applying ReLU to both the gate and up projections of the FFN, rather than only the gate), designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Code
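A sketch of a gated FFN with ReLU applied to both the gate and up projections, which is how the dReLU idea is summarized above; module names and sizes are illustrative (PyTorch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DReLUFFN(nn.Module):
    """Gated FFN sketch where ReLU is applied to BOTH the gate and up projections
    (the dReLU idea as summarized above), instead of SiLU on the gate branch only."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        h = F.relu(self.gate(x)) * F.relu(self.up(x))   # product of two sparse branches
        return self.down(h)

ffn = DReLUFFN(d_model=512, d_ff=2048)
print(ffn(torch.randn(2, 16, 512)).shape)
```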
- [Arxiv 2024] ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs. Introducing a general method that defines neuron activation through neuron output magnitudes and a tailored magnitude threshold, demonstrating that non-ReLU LLMs also exhibit sparse activation. To find the most efficient activation function for sparse computation, proposing a systematic framework to examine the sparsity of LLMs from three aspects: the trade-off between sparsity and performance, the predictivity of sparsity, and the hardware affinity.
- [Arxiv 2024] ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models. This paper introduces an effective sparsification method named "ProSparse" to push LLMs toward higher activation sparsity without decreasing model performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor that smoothly increases along sine curves in multiple stages. ProSparse obtains high sparsity of 89.32% and 88.80% for LLaMA2-7B and LLaMA2-13B, respectively, achieving comparable performance to their original Swish-activated versions. Code1 Code2
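A sketch of a sparsity-regularization factor that rises along a sine curve within one stage, as the description above suggests; the schedule constants and the plain L1 penalty are assumptions (PyTorch):

```python
import math
import torch

def prosparse_reg_factor(step, stage_start, stage_end, lam_start, lam_end):
    """Regularization factor that rises smoothly along a sine curve within one
    training stage (the multi-stage schedule described above); constants are illustrative."""
    t = min(max((step - stage_start) / (stage_end - stage_start), 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * math.sin(0.5 * math.pi * t)

def sparsity_loss(ffn_activations, lam):
    """L1 penalty on intermediate FFN activations to encourage more zeros after ReLU."""
    return lam * ffn_activations.abs().mean()

acts = torch.relu(torch.randn(8, 2048))
print(sparsity_loss(acts, prosparse_reg_factor(step=500, stage_start=0, stage_end=1000,
                                               lam_start=0.0, lam_end=1e-4)))
```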
- [Arxiv 2024] CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models. Introducing a new framework for sparsifying the activations of base LLMs and reducing inference costs, dubbed Contextually Aware Thresholding for Sparsity (CATS). It mainly sets thresholds on non-ReLU activation functions to obtain sparsity.
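A sketch of magnitude thresholding on a SiLU gate output; the percentile-based calibration below is a simple stand-in for CATS' contextual threshold selection (PyTorch):

```python
import torch
import torch.nn.functional as F

def calibrate_threshold(gate_outputs, target_sparsity=0.7):
    """Pick a magnitude threshold so that roughly `target_sparsity` of the SiLU gate
    outputs fall below it (a simple percentile stand-in for the paper's calibration)."""
    return gate_outputs.abs().flatten().quantile(target_sparsity)

def thresholded_activation(x, threshold):
    """Zero out gate activations whose magnitude is below the threshold."""
    a = F.silu(x)
    return torch.where(a.abs() >= threshold, a, torch.zeros_like(a))

calib = F.silu(torch.randn(10000))              # calibration activations
thr = calibrate_threshold(calib, 0.7)
y = thresholded_activation(torch.randn(4, 2048), thr)
print((y == 0).float().mean())                  # achieved sparsity, roughly 0.7
```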
- [Arxiv 2024] HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference. To improve sparsity prediction, HiRE uses a compression scheme to cheaply predict top-k rows/columns with high recall, followed by full computation restricted to the predicted subset.
- [Arxiv 2024] SADMoE: Exploiting Activation Sparsity with Dynamic-k Gating. Proposing a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis, whereas standard MoE uses fixed-k expert selection for all tokens. Code
- [Arxiv 2023] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. DEJAVU uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. Code
- [Arxiv 2023] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. The majority of neurons in the weight matrices of LLMs are cold (rarely activated). PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. Code
- [Arxiv 2023] LLM in a flash: Efficient Large Language Model Inference with Limited Memory. Reducing the number of parameters that must be loaded by exploiting activation sparsity in the FFN module.
- [Arxiv 2023] ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models. The ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer, compared with other activation functions such as GELU and SiLU.
- [Arxiv 2022] The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers. Studying the curious phenomenon that the activation maps of Transformer-based models are sparse.
- [Arxiv 2024] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. Introducing a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.
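A sketch of absmean ternary weight quantization in the spirit of BitNet b1.58; per-tensor scaling and the rounding rule follow the common description, but details of the full training recipe are omitted (PyTorch):

```python
import torch

def ternary_quantize(W, eps=1e-5):
    """BitNet b1.58-style weight quantization sketch: scale by the mean absolute
    value, then round each weight to the nearest value in {-1, 0, 1}."""
    gamma = W.abs().mean().clamp(min=eps)       # per-tensor absmean scale
    Wq = (W / gamma).round().clamp(-1, 1)       # ternary weights
    return Wq, gamma                            # effective weight is Wq * gamma

W = torch.randn(256, 256)
Wq, gamma = ternary_quantize(W)
print(Wq.unique())                              # tensor([-1., 0., 1.])
```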
- [Arxiv 2023] BitNet: Scaling 1-bit Transformers for Large Language Models. Introducing BitLinear as a drop-in replacement for the nn.Linear layer in order to train 1-bit weights (-1 or 1) from scratch.
- [Arxiv 2024] A Survey of Resource-efficient LLM and Multimodal Foundation Models. This survey delves into the critical importance of resource-efficient research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations.
- [Arxiv 2024] LLM Inference Unveiled: Survey and Roofline Model Insights. Not only summarizing the current state of research but also introducing a framework based on the roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. Code
- [ATC 2023] SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization. Dynamic runtime selection of the optimal parallel strategy is enabled by an efficient searching algorithm. Code
- [ATC 2023] Accelerating Distributed MoE Training and Inference with Lina. First, systematically analyzing the all-to-all overhead in distributed MoE and presenting the main causes for it to be the bottleneck in training and inference, respectively. Second, designing and building Lina to address the all-to-all bottleneck head-on. Lina opportunistically prioritizes all-to-all over the concurrent all-reduce whenever feasible using tensor partitioning, improving all-to-all latency and training step time.
- [IPDPS 2023] MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism. A high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Inspired by the observation that the MoE training procedure can be divided into multiple independent sub-stages, designing adaptive pipeline parallelism with an online algorithm to configure the granularity of pipelining. Code
- [Arxiv 2023] SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System. SE-MoE proposes Elastic MoE training with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelism of various types. For scalable inference on a single node, especially when the model size is larger than GPU memory, SE-MoE forms CPU and GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. Code
- [SIGCOMM 2023] Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models. All-to-All communication originates from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed the experts. Proposing a novel data-centric paradigm: keeping data in place and moving experts between GPUs.
- [Arxiv 2023] TUTEL: Adaptive Mixture-of-Experts at Scale. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration overhead. Code
- [PPoPP 2022] FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models. Proposing a performance model that can both accurately predict the latency of different operations of a specific training task and intuitively analyze its end-to-end performance via a novel roofline-like model. Then, guided by this model, inventing a dynamic shadowing approach to cope with load imbalance, and a smart fine-grained schedule that splits different operations and executes them concurrently. Code
- [OSDI 2020] A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters Code
- [Arxiv 2022] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. An end-to-end MoE training and inference system. Code
- [Arxiv 2024] Splitwise: Efficient generative LLM inference using phase splitting. Splitting LLM inference into a compute-intensive prompt computation phase and a memory-intensive token generation phase. The prompt computation phase can run on GPUs with high compute power while the token generation phase can run on lower-power devices, which can reduce the cost of cloud serving.
- [Arxiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone. PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Project
- [Arxiv 2024] MOE-INFINITY: Activation-Aware Expert Offloading for Efficient MoE Serving. Designing a cost-efficient mixture-of-expert (MoE) serving system that realizes activation-aware expert offloading, including sequence-level tracing, activation-aware expert prefetching and caching techniques. Code
- [Arxiv 2024] Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference. Finding that MoE models implicitly exhibit a strong inter-layer expert affinity, and using only one All-to-All communication to deliver the same functionality, whereas previous methods all require two All-to-Alls. Code
- [Arxiv 2024] Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models. Fiddler is a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU, moving activations from GPU to CPU rather than moving expert weights from CPU to GPU. Code
- [Arxiv 2024] LLM as a System Service on Mobile Devices. By fully leveraging KV cache’s unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: it compresses chunks based on their measured accuracy tolerance to compression. (2) IO-Recompute Pipelined Loading: it introduces recompute to swapping-in for acceleration. (3) Chunk Lifecycle Management: it optimizes the memory activities of chunks with an ahead-of-time swapping-out and an LCTRU (Least Compression-Tolerable and Recently-Used) queue based eviction.
- [Arxiv 2023] Fast Inference of Mixture-of-Experts Language Models with Offloading. Feeding the current layer's router inputs to the next layer's router to predict the needed experts, which helps overlap expert weight transfer with computation. Code
- [Arxiv 2023] Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference. Changing the structure of a standard MoE by moving the next layer's router into the current layer, so the model knows in the current layer which experts the next layer will need. This pre-gate structure helps overlap expert weight transfer with computation.
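A sketch of the pre-gate structure: the current layer's router scores the next layer's experts so their weights can be prefetched early; the prefetch itself is only simulated, and class/argument names are hypothetical (PyTorch):

```python
import torch
import torch.nn as nn

class PreGatedMoELayer(nn.Module):
    """Pre-gate sketch: this layer's router scores the NEXT layer's experts, so
    their weights can be prefetched while this layer still computes. Expert loading
    is only simulated here; a real system would issue asynchronous copies."""
    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.next_layer_router = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, experts_for_this_layer):
        # 1) Decide which experts the NEXT layer will need, from current hidden states.
        scores = self.next_layer_router(x.mean(dim=1))           # (batch, num_experts)
        next_experts = scores.topk(self.top_k, dim=-1).indices   # prefetch these now
        # 2) Run this layer's experts (already resident, chosen by the previous layer).
        out = sum(expert(x) for expert in experts_for_this_layer)
        return out, next_experts

layer = PreGatedMoELayer(d_model=64, num_experts=8)
experts = [nn.Linear(64, 64) for _ in range(2)]
y, prefetch_ids = layer(torch.randn(4, 16, 64), experts)
print(prefetch_ids)                                              # experts to load for the next layer
```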
- [Arxiv 2023] Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference. Proposing three optimization techniques to mitigate sources of inefficiencies, namely (1) Dynamic gating, (2) Expert Buffering, and (3) Expert load balancing.
- [Arxiv 2023] SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models. Proposing an offline training strategy to build a data-aware hash function deployed in SiDA that replaces the router function in MoE layers. The hash function can predict the experts needed by all layers in the current iteration.
- [Arxiv 2023] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models. This design is underpinned by a crucial insight that expert weights, though voluminous, are infrequently accessed due to sparse activation patterns. To further mitigate the overhead associated with expert I/O swapping, EdgeMoE incorporates two innovative techniques: (1) Expert-wise bitwidth adaptation: This method reduces the size of expert weights with an acceptable level of accuracy loss. (2) Expert management: It predicts the experts that will be activated in advance and preloads them into the compute-I/O pipeline, thus further optimizing the process.
- [Arxiv 2023] SwapMoE: Efficient Memory-Constrained Serving of Large Sparse MoE Models via Dynamic Expert Pruning and Swapping. When experts are missing from GPU memory, skipping those experts or replacing them with other experts already resident in GPU memory.
- [Arxiv 2023] STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. STI identifies that parameter loading time is much longer than computation time. To address this problem, STI dynamically adapts weight bit-widths during loading according to parameter importance, minimizing loading overhead while maximizing inference accuracy.
- [Arxiv 2023] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. The majority of neurons in the weight matrices of LLMs are cold (rarely activated). PowerInfer exploits this insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. Code
- [Arxiv 2022] DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to the GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. Code
- https://github.com/Shenggan/awesome-distributed-ml
- https://github.com/byungsoo-oh/ml-systems-papers
- https://github.com/UbiquitousLearning/Paper-list-resource-efficient-large-language-model
- https://github.com/inpluslab-wuhui/Systems-for-Foundation-Models
- [Arxiv 2024] Efficient Multimodal Large Language Models: A Survey. Providing a comprehensive and systematic review of the current state of efficient MLLMs. Code