
An electronic copy of the proposal is available through UT Box at: https://utexas.app.box.com/s/9mjsz62teyvixn232s1zc73nv4svoes7. The title and abstract are below.
Title: Chasing Efficiency in the Era of Scaling Large Language Models
Abstract: Model compression techniques (e.g., pruning, quantization) aim to reduce model size and memory requirements, facilitating the production of more efficient and cost-effective models and enabling their deployment in resource-constrained environments such as edge devices and cloud services. Recent research efforts have focused on developing increasingly sophisticated compression methods, while evaluation protocols for compressed LLMs remain surprisingly under-explored and rely primarily on perplexity. Unlike comparisons between dense LLMs, the comparative evaluation of two compressed LLMs derived from the same dense counterpart is more challenging because their architectures, training recipes, and parameters are highly similar and aligned. To address this gap, this dissertation develops an automated evaluation framework, named LLM-KICK, tailored for compressed LLMs to identify the merits and limitations of SoTA compression algorithms. Motivated by the failure of SoTA compression methods on LLM-KICK, the second part of the dissertation reassesses the design choices of existing compression techniques, such as their missing connection to LLM pre-training dynamics and their use of uniform compression across all LLM components. In the proposed work, I outline the development plan for a novel layer-wise non-uniform compression algorithm grounded in LLM pre-training dynamics and coupled with a resource-constrained finetuning strategy. The broader aim of this dissertation is to draw the LLM compression community's attention to the potential for improving existing SoTA compression algorithms and to convey that cues from LLM pre-training can be vital for developing robust new algorithms.
Dissertation Committee: Ying Ding (Co-Chair), Atlas Wang (Co-Chair), Matt Lease, Hanlin Li