An electronic copy of the proposal is available through UT Box at: https://utexas.app.box.com/s/khc5qzgqyjlo1k0cvuf90szq899bhmsx. The title and abstract are below.
Title: Chasing Efficiency in the Era of Scaling Deep Neural Networks
Abstract: With datasets and models growing exponentially, new computing paradigms must be explored to address two seemingly conflicting goals: scaling deep networks while satisfying the energy budgets and demanding constraints of production systems. Deep neural network compression techniques (e.g., sparsity, quantization, distillation, matrix factorization) reduce model size and memory requirements, enabling more efficient and cost-effective models that can be deployed in a variety of environments, from edge devices to cloud services. In the first part of this dissertation, I aim to develop empirical foundations, using rigorous controlled experiments, to understand how compression opportunities emerge and what roles significant and insignificant components play in LLMs. In the second part, I investigate the subtle challenges in evaluating compressed LLMs and develop a novel benchmark (LLM-KICK) to identify the true merits of compression. Lastly, I propose three novel LLM compression algorithms that enable computationally efficient inference and fine-tuning while reducing extensive memory requirements.
Dissertation Committee: Ying Ding (Co-Chair), Atlas Wang (Co-Chair), Matt Lease, Hanlin Li