NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

  • Semyon Savkin*
  • Eitan Porat
  • Or Ordentlich
  • Yury Polyanskiy

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on WikiText-2. Relative to the unquantized model (perplexity 6.14), this represents more than a 55% reduction in the perplexity gap compared to the state-of-the-art methods Meta's SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B parameters) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
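The quantizer described in the abstract is built from the 8-dimensional Gosset lattice (E8) nested inside a scaled copy of itself. As a rough illustrative sketch, and not the authors' implementation, the Python snippet below quantizes an 8-dimensional block to a fine lattice beta*E8 and reduces it modulo a self-similar coarse lattice q*beta*E8, using the standard Conway-Sloane nearest-point algorithm for E8. The function names and the parameters beta and q are hypothetical choices made only for illustration.

```python
import numpy as np

def closest_point_D8(y):
    """Nearest point of D8 = {x in Z^8 : sum(x) even} (Conway & Sloane)."""
    f = np.round(y)
    if int(f.sum()) % 2 != 0:
        # Move the coordinate with the largest rounding error to its
        # second-nearest integer, restoring even parity.
        i = np.argmax(np.abs(y - f))
        f[i] += np.sign(y[i] - f[i]) if y[i] != f[i] else 1.0
    return f

def closest_point_E8(y):
    """Nearest point of the Gosset lattice E8 = D8 ∪ (D8 + 1/2)."""
    c0 = closest_point_D8(y)
    c1 = closest_point_D8(y - 0.5) + 0.5
    return c0 if np.sum((y - c0) ** 2) <= np.sum((y - c1) ** 2) else c1

def nested_lattice_quantize(x, beta=0.1, q=16):
    """Sketch of nested-lattice quantization of one 8-dim block:
    quantize to the fine lattice beta*E8, then reduce modulo the
    self-similar coarse lattice (q*beta)*E8. beta and q are
    illustrative, not the paper's tuned values."""
    fine = beta * closest_point_E8(x / beta)
    coarse = q * beta * closest_point_E8(fine / (q * beta))
    return fine - coarse  # representative inside the coarse Voronoi cell

# Example: quantize a random 8-dimensional block of weights/activations.
x_hat = nested_lattice_quantize(np.random.randn(8))
```

In such a scheme a weight or activation matrix would be split into 8-dimensional blocks and each block quantized independently; the nesting ratio q controls the number of representable cosets and hence the rate, e.g. q = 16 corresponds to log2(16) = 4 bits per entry, consistent with the 4-bit setting reported in the abstract (ignoring any side information the full method may use).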

Original language: English
Pages (from-to): 53042-53062
Number of pages: 21
Journal: Proceedings of Machine Learning Research
Volume: 267
State: Published - 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver, Canada
Duration: 13 Jul 2025 - 19 Jul 2025

Bibliographical note

Publisher Copyright:
© 2025, by the authors.
