Наталя Хандусенко
25 March 2026, 14:01
Google releases TurboQuant, an algorithm that reduces AI memory usage without losing accuracy
Large language models face a persistent scaling problem. As the context window grows, the memory required to store the key-value (KV) cache grows proportionally, which exhausts GPU memory and slows inference. The Google Research team has developed three compression algorithms: TurboQuant, PolarQuant, and Quantized Johnson-Lindenstrauss (QJL). According to the team, all three compress the cache significantly without degrading the quality of the model's output.
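To make the proportional growth concrete, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16-bit baseline are hypothetical values chosen for illustration, not figures from Google Research, and the function name `kv_cache_bytes` is likewise made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value=16, batch_size=1):
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor 2: one key vector and one value vector per token, head and layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(seq_len=seq_len, **shape)
    print(f"{seq_len:>7} tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache takes 0.5 GiB at 4,096 tokens, 4 GiB at 32,768 tokens and 16 GiB at 131,072 tokens, so every doubling of the context doubles the memory.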
How TurboQuant provides scalable compression of large language models
Vector quantization has long been used to compress the high-dimensional numerical data that AI models work with. The idea is to replace a wide range of values with a limited set of compact codes. Classical methods have a significant drawback, however: for each block of data, quantization constants must be stored separately in full precision. This adds 1–2 extra bits per number, which noticeably reduces the real compression ratio, especially when memory is already limited.
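The overhead can be illustrated with a generic block-wise min-max quantizer. This is a minimal sketch of a classical scheme, not the baseline Google Research compared against; the block sizes, the two 16-bit constants per block, and all function names are assumptions made for the example.

```python
def quantize_block(values, bits):
    """Min-max quantize one block; returns integer codes plus two constants."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def effective_bits(payload_bits, block_size,
                   constants_per_block=2, constant_bits=16):
    """Bits per value once the per-block constants are counted."""
    return payload_bits + constants_per_block * constant_bits / block_size


block = [0.12, -0.80, 0.45, 0.03, -0.27, 0.91, -0.55, 0.30]
codes, scale, lo = quantize_block(block, bits=3)
print(codes, dequantize_block(codes, scale, lo))
print(effective_bits(3, block_size=32))  # 3 + 2*16/32 = 4.0
print(effective_bits(3, block_size=16))  # 3 + 2*16/16 = 5.0
```

With these assumed settings, a nominal 3-bit code really costs 4 to 5 bits per value, because `scale` and `lo` have to be kept for every block.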
TurboQuant solves this problem by combining two basic methods.
PolarQuant performs the main compression step by converting vectors from standard Cartesian coordinates into polar coordinates. A conventional quantizer encodes the position along each axis independently, which requires normalization steps that vary with the data. PolarQuant instead maps pairs of coordinates to a polar system, expressing them as a radius and an angle. Because the distribution of angles is predictable and concentrated, this method eliminates the need for normalization and the overhead associated with it.
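The pairing idea can be sketched as follows. This is a deliberately simplified illustration and not the actual PolarQuant algorithm: it assumes a single level of coordinate pairs, a uniform angle grid, and a radius kept in full precision, and the names `encode_pairs` and `decode_pairs` are hypothetical. The point it shows is that the angle grid is fixed in advance, so no per-block scale or offset has to be stored.

```python
import math


def encode_pairs(vec, angle_bits):
    """Turn consecutive (x, y) pairs into (radius, quantized angle code)."""
    n_bins = 1 << angle_bits
    encoded = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        radius = math.hypot(x, y)
        theta = math.atan2(y, x)  # angle in [-pi, pi]
        # Fixed uniform grid over the full circle: no data-dependent constants.
        code = int((theta + math.pi) / (2 * math.pi) * n_bins) % n_bins
        encoded.append((radius, code))
    return encoded


def decode_pairs(encoded, angle_bits):
    """Rebuild Cartesian pairs from the radius and the center of each angle bin."""
    n_bins = 1 << angle_bits
    vec = []
    for radius, code in encoded:
        theta = (code + 0.5) / n_bins * 2 * math.pi - math.pi
        vec.extend([radius * math.cos(theta), radius * math.sin(theta)])
    return vec


original = [3.0, 4.0, -1.0, 2.0]
restored = decode_pairs(encode_pairs(original, angle_bits=4), angle_bits=4)
print(restored)
```

In this sketch the reconstruction error of each pair is bounded by the radius times half the angular bin width, regardless of the data that came before or after it.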
QJL handles the residual error left after the first step. Using the Johnson-Lindenstrauss transform, QJL reduces each value of the residual vector to a single sign bit, either positive or negative. This step incurs no memory overhead. To maintain accuracy with single-bit representations, QJL uses an estimator that combines high-precision query vectors with this simplified stored data when computing attention scores.
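A sign-bit estimator of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions rather than Google's implementation: it uses a dense Gaussian random projection, applies it to a plain vector instead of a quantization residual, and keeps the vector's norm as one extra scalar. The scaling factor sqrt(pi/2) follows from the identity E[(s·q)·sign(s·k)] = sqrt(2/pi)·⟨q, k⟩/‖k‖ for a standard Gaussian vector s. All function names are hypothetical.

```python
import math
import random


def make_projection(m, dim, seed=0):
    """Random Gaussian projection matrix with m rows (assumed construction)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_key(S, key):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in S]
    return signs, math.sqrt(dot(key, key))


def estimate_inner_product(S, query, signs, key_norm):
    """Asymmetric estimate of <query, key>: full-precision query, 1-bit key."""
    m = len(S)
    total = sum(dot(row, query) * s for row, s in zip(S, signs))
    return math.sqrt(math.pi / 2) * key_norm * total / m


rng = random.Random(1)
key = [rng.gauss(0.0, 1.0) for _ in range(16)]
query = [rng.gauss(0.0, 1.0) for _ in range(16)]
S = make_projection(m=2048, dim=16)
signs, key_norm = encode_key(S, key)
print("exact:   ", dot(query, key))
print("estimate:", estimate_inner_product(S, query, signs, key_norm))
```

The query is never quantized here, which is what keeps the estimate unbiased even though the stored side holds only signs. Its error shrinks roughly with the square root of the number of projection rows.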
Google Research directly describes the combined result: "TurboQuant is a compression method that achieves significant model size reduction with zero loss of accuracy, making it ideal for supporting both key-value (KV) cache compression and vector search."
Test results across five benchmarks
Google Research evaluated all three algorithms on five long-context benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. The test models were Gemma and Mistral. TurboQuant compressed KV caches to 3 bits per value without the need for retraining or fine-tuning the models. There was no measurable loss of accuracy in the question-answering, code generation, and text summarization tasks.
Memory use fell by at least 6x compared to uncompressed KV storage. On NVIDIA H100 GPUs, 4-bit TurboQuant computed attention logits up to 8x faster than with 32-bit unquantized keys. PolarQuant showed near-lossless performance in needle-in-a-haystack tests.
The algorithms were also evaluated against state-of-the-art vector search baselines, including Product Quantization (PQ) and RaBitQ. TurboQuant achieved high recall on the GloVe dataset (d=200) for top-k search tasks, without the large codebooks and dataset-specific setup that these baseline methods require.
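For readers unfamiliar with the metric, recall for top-k search is commonly computed as the share of the true top-k neighbours that the approximate search also returns. The sketch below shows that common definition. The exact recall variant used in the evaluation is not stated here, and the scores are made-up toy numbers.

```python
def top_k(scores, k):
    """Indices of the k highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k items that also appear in the approximate top-k."""
    exact = set(top_k(exact_scores, k))
    approx = set(top_k(approx_scores, k))
    return len(exact & approx) / k


# Toy similarity scores for five database vectors (illustrative only).
exact_scores = [0.90, 0.10, 0.80, 0.40, 0.70]
approx_scores = [0.85, 0.15, 0.50, 0.60, 0.75]  # e.g. computed from compressed vectors
print(recall_at_k(exact_scores, approx_scores, k=3))
```

In this toy case the exact top-3 is items 0, 2 and 4, while the approximate top-3 is items 0, 4 and 3, so two of three are recovered.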
Google Research notes that TurboQuant is data-agnostic, meaning it doesn’t require any dataset-specific calibration. This feature simplifies integration into inference systems and reduces the preprocessing pipeline required before deployment.