• MaskLLM introduces Learnable Semi-Structured Sparsity, a technique for improving the inference efficiency of Large Language Models (LLMs). LLMs carry substantial parameter redundancy, and MaskLLM exploits it with a learnable pruning strategy that reduces the computational cost of inference without compromising performance.
• The core innovation is to model sparsity patterns as a learnable distribution over candidate masks, sampled with Gumbel Softmax. This keeps mask selection differentiable and allows end-to-end training on large datasets, yielding two key advantages: masks become more accurate as training data scales, and the learned sparsity transfers across domains and downstream tasks.
• Empirical evaluations cover LLaMA-2, Nemotron-4, and GPT-3 models with 843 million to 15 billion parameters. Under 2:4 sparsity on Wikitext, leading approaches reach a perplexity (PPL) of 10 or higher, whereas MaskLLM reaches 6.72 by learning masks alone, with the model weights kept frozen.
• Methodologically, each group of parameters is assigned a learnable categorical distribution over its candidate masks (for 2:4 sparsity, the six ways of keeping 2 of every 4 weights). Because the selection is differentiable, mask learning integrates directly into the standard training pipeline; a minimal sketch follows this list.
• The research also shows that weight regularization improves mask learning, and that prior masks from existing pruning methods can be used for transfer learning and then refined through end-to-end training (see the second sketch below).
• Overall, MaskLLM marks a significant step toward lossless compression and more efficient deployment of LLMs, underscoring the potential of learnable sparsity techniques for optimizing large-scale models in practical applications.
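To make the differentiable mask selection concrete, here is a minimal PyTorch sketch of the idea described above: each group of 4 weights gets a learnable categorical distribution over the 6 candidate 2:4 masks, and Gumbel Softmax sampling keeps the choice differentiable. This is an illustrative sketch, not the authors' implementation; the names (`CANDIDATE_MASKS`, `sample_masks`) and hyperparameters such as the temperature are assumptions.

```python
import itertools

import torch
import torch.nn.functional as F

# All C(4, 2) = 6 candidate 2:4 masks: exactly 2 of every 4 weights are kept.
CANDIDATE_MASKS = torch.tensor(
    [[1.0 if i in keep else 0.0 for i in range(4)]
     for keep in itertools.combinations(range(4), 2)]
)  # shape (6, 4)


def sample_masks(logits: torch.Tensor, tau: float = 4.0, hard: bool = True) -> torch.Tensor:
    """Draw a differentiable 2:4 mask for each group of 4 weights.

    logits: (num_groups, 6) learnable scores, i.e. one categorical
            distribution over the 6 candidate masks per group.
    Returns a (num_groups, 4) mask; with hard=True the forward pass uses an
    exact one-hot choice while gradients flow through the soft probabilities.
    """
    probs = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)  # (num_groups, 6)
    return probs @ CANDIDATE_MASKS                                # (num_groups, 4)


# Usage: sparsify a frozen weight matrix with learned masks.
num_groups = 8
logits = torch.zeros(num_groups, 6, requires_grad=True)   # learnable mask logits
weight = torch.randn(num_groups, 4)                        # frozen weights
sparse_weight = weight * sample_masks(logits)              # gradients reach `logits`
```

The straight-through option (hard=True) applies an exact 2:4 mask in the forward pass while the categorical distribution is still trained end to end, which corresponds to the "learning masks with frozen weights" setting reported above.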
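For the transfer-learning aspect, one plausible way to start from a prior mask (for example, one produced by an existing one-shot pruning method) is to initialize each group's logits toward the candidate mask that best matches the prior, and then refine the logits end to end. The similarity scoring and `scale` factor below are illustrative assumptions, not the paper's exact initialization scheme; `candidate_masks` is the same (6, 4) enumeration used in the previous sketch.

```python
import torch


def logits_from_prior_mask(prior_mask: torch.Tensor,
                           candidate_masks: torch.Tensor,
                           scale: float = 3.0) -> torch.Tensor:
    """Initialize per-group mask logits from a prior 2:4 mask.

    prior_mask:      (num_groups, 4) binary mask from a one-shot method.
    candidate_masks: (6, 4) enumeration of all 2:4 masks.
    Each group starts biased toward the candidate most similar to the prior,
    so end-to-end training refines the prior instead of learning from scratch.
    """
    # Dot-product similarity between the prior and every candidate mask.
    similarity = prior_mask.float() @ candidate_masks.T    # (num_groups, 6)
    logits = scale * similarity
    return logits.detach().requires_grad_()                # leaf tensor, ready to optimize
```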