
The defense will be a virtual event; the Zoom link will be shared one day before the event. An electronic copy of the proposal is available through UT Box at: https://utexas.box.com/s/zf4v3kzvbbsarpvd5k020626tabcnbas. The title and abstract are below.
Title: Chasing Efficiency in the Era of Scaling Large Language Models
Abstract: Advancements in efficient ML techniques have demonstrated immense potential, not only in producing cost-effective and environmentally friendly models, but also in notably democratizing AI technologies. Model compression algorithms are instrumental in addressing two seemingly conflicting goals of modern AI: scaling up deep networks while satisfying the energy-consumption restrictions and demanding constraints of production systems. This dissertation contributes to efficient ML research by developing task-centric evaluation protocols tailored for compressed LLMs and by designing three novel LLM compression algorithms that improve LLM efficiency while minimizing performance degradation. Specifically, this two-part dissertation focuses on two complementary aspects of advancing efficient ML.
First, Part I (Chapters 2 and 3) focuses on existing challenges and limitations in the evaluation settings of state-of-the-art (SoTA) compression algorithms. Chapter 2 develops LLM-KICK, an automated evaluation framework that defines a realistic evaluation setup for compressed LLMs and facilitates a comprehensive assessment of the merits and limitations of SoTA compression algorithms. In addition, Chapter 3 investigates how different compression techniques affect LLMs' ability to handle tasks of increasing difficulty.
Second, motivated by the findings from LLM-KICK in Part I, this dissertation reassesses the design choices of existing compression techniques, such as their missing connection to LLM pre-training dynamics and their uniform compression across all LLM components. Part II presents three novel LLM compression algorithms: (1) Chapter 4 builds WeLore, which unifies a novel non-uniform layer-wise LLM compression technique (WeLore-COMP) with a parameter-efficient recovery mechanism (WeLore-PEFT) for GPT-style architectures; (2) Chapter 5 designs MC-Suite, a collection of expert importance estimation strategies for expert-level sparsification in MoE-style architectures; and (3) Chapter 6 develops FFN-SkipLLM, which achieves inference-time efficiency by skipping feed-forward components according to the non-uniform computational demands of decoded tokens.
By addressing the complementary objectives of evaluation and novel algorithm development, this dissertation draws the attention of the efficient ML community to the potential of improving existing SoTA compression algorithms by equipping them with better evaluation setups, and establishes that cues from LLM pre-training can be vital for developing robust new algorithms.
Dissertation Committee: Ying Ding (Co-Chair), Atlas Wang (Co-Chair), Matt Lease, Hanlin Li