HIGGS and the new quantization could bring ChatGPT-level AI straight to your cell phone
Can you imagine using ChatGPT-level artificial intelligence directly on your cell phone, without a cloud connection or excessive battery consumption? That future is much closer thanks to HIGGS, a new quantization technique that makes it possible to compress huge language models without calibration data or loss of quality. Developed by researchers from institutions such as MIT and KAUST, this innovation promises to bring the power of AI to any device, from ultralight laptops to smartphones. It is a first real step towards truly portable artificial intelligence.
***
- A new theoretical framework mathematically connects the per-layer error with the global performance of the model
- The HIGGS method quantizes models without calibration data, outperforming techniques such as GPTQ and AWQ
- Results show advantages in accuracy, speed and support for efficient inference on GPUs
In an advance that could redefine the development and deployment of large language models (LLMs), an international group of researchers has shown that it is possible to quantize these models more efficiently, and even more accurately, without resorting to calibration data.
The finding is based on a new theoretical framework called “Linearity Theorem”, which establishes a direct and measurable relationship between the error induced by quantization in each layer of the model and the degradation of its global performance, commonly measured by perplexity.
The work, entitled “Pushing the Limits of Large Language Model Quantization via the Linearity Theorem”, was led by researchers from Yandex, MIT, KAUST and the Institute of Science and Technology Austria (ISTA). It proposes a solid alternative to popular approaches such as GPTQ or AWQ, which depend heavily on calibration data to adjust the model weights during compression.
The quantization problem in LLMs
Current language models, such as Llama 3, GPT or Claude, contain hundreds of millions or even billions of parameters.
Because they are trained in FP16 or FP32 precision, deploying them requires large amounts of memory and computing power.
To deploy them on more accessible devices or to accelerate their inference on servers, quantization techniques are used: processes that reduce the numerical precision of the weights (for example, from 16 to 4 bits) to save resources without excessively compromising the accuracy of the model.
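As a rough illustration of what "from 16 to 4 bits" means, the sketch below applies plain round-to-nearest quantization on an evenly spaced grid. This is the simplest baseline, not the method described in this article; the function name `quantize_uniform` and the random weights are assumptions made for the example.

```python
import numpy as np

def quantize_uniform(weights, bits):
    """Round-to-nearest quantization onto 2**bits evenly spaced levels."""
    levels = 2 ** bits
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (levels - 1)
    # Integer codes in [0, levels - 1]; only these (plus w_min and scale) need storing
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    dequantized = codes * scale + w_min
    return codes, dequantized, scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(4096)  # hypothetical stand-in for one layer's weights

codes, approx, scale = quantize_uniform(weights, bits=4)
mse = float(np.mean((weights - approx) ** 2))

fp16_bytes = weights.size * 2       # 16 bits per weight
int4_bytes = weights.size * 4 // 8  # 4 bits per weight
print(f"MSE introduced by quantization: {mse:.5f}")
print(f"storage: {fp16_bytes} bytes at FP16 vs {int4_bytes} bytes at 4 bits")
```

The storage drops by a factor of four, and the price is the reconstruction error reported as `mse`. That error is the quantity the rest of the article is concerned with.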
However, the technical challenge is huge. Quantization introduces error and, if performed poorly, it can significantly degrade the quality of the model's output.
For years, therefore, researchers have relied on calibration-based methods, which require a specific data set to fine-tune the weights after training. Although effective, these methods add complexity, data dependence and, often, long processing times.
Linearity theorem: a new theoretical framework
The core of the new approach proposed by Malinovskii, Panferov, Guo and colleagues is a mathematical formulation that precisely quantifies how the mean squared error (MSE) introduced by quantizing a layer affects the global performance of the model, expressed as an increase in perplexity (a standard metric for language models).
The theorem states that, at reasonable bit widths (between 3 and 8 bits), there is a linear relationship between the error induced in a layer and the total increase in the model's perplexity.
This finding makes it possible to optimize quantization by directly minimizing the MSE of each layer, without needing to observe the behavior of the model on a specific data set.
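One way to read the linear relationship is as a simple additive model: the perplexity after quantization is roughly the original perplexity plus a weighted sum of the per-layer relative errors. The sketch below encodes that reading. The function names, the per-layer coefficients and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def relative_mse(original, quantized):
    """Squared error of a layer, normalized by the squared norm of its weights."""
    return float(np.sum((original - quantized) ** 2) / np.sum(original ** 2))

def predicted_perplexity(base_ppl, alphas, rel_errors):
    """Linear model: base perplexity plus sensitivity-weighted layer errors."""
    return base_ppl + sum(a * t for a, t in zip(alphas, rel_errors))

# Hypothetical values for a three-layer model
base_ppl = 5.6                    # perplexity of the unquantized model
alphas = [2.0, 1.0, 4.0]          # per-layer sensitivity coefficients
rel_errors = [0.01, 0.02, 0.005]  # relative MSE introduced in each layer

estimate = predicted_perplexity(base_ppl, alphas, rel_errors)
print(f"estimated perplexity after quantization: {estimate:.3f}")
```

Under such a model, lowering any layer's error lowers the estimate by a fixed, layer-specific amount, which is why minimizing each layer's MSE independently is enough.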
“The finding is counterintuitive: it is enough to reduce the MSE of each layer, and the final perplexity is thereby controlled. No data and no model- or task-specific adjustments are needed,” say the authors.
HIGGS: data-free quantization, more accurate and efficient
On this theoretical basis, the researchers developed a new method called HIGGS (Hadamard Incoherence with Gaussian MSE-optimal GridS). The procedure applies a Hadamard transform to the weights of the model, which “scrambles” their distribution and brings it close to a standard Gaussian.
The weights are then quantized using MSE-optimal grids, precomputed with vector quantization algorithms.
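To give a feel for what an "MSE-optimal grid" is, the sketch below fits a 16-level (4-bit) grid to Gaussian samples with Lloyd's algorithm and compares it with an evenly spaced grid. It is a simplified scalar version written for this article: the function names are hypothetical, and the grids described above may be multi-dimensional (vector) grids rather than scalar ones.

```python
import numpy as np

def lloyd_max_grid(samples, num_levels, iterations=100):
    """1-D Lloyd's algorithm: alternate nearest-level assignment and centroid update."""
    # Start from evenly spaced quantiles of the data
    grid = np.quantile(samples, (np.arange(num_levels) + 0.5) / num_levels)
    for _ in range(iterations):
        codes = np.abs(samples[:, None] - grid[None, :]).argmin(axis=1)
        for k in range(num_levels):
            members = samples[codes == k]
            if members.size:
                grid[k] = members.mean()  # the centroid minimizes squared error
    return np.sort(grid)

def quantize_to_grid(values, grid):
    """Map each value to its nearest grid point; return codes and reconstruction."""
    codes = np.abs(values[:, None] - grid[None, :]).argmin(axis=1)
    return codes, grid[codes]

rng = np.random.default_rng(0)
samples = rng.standard_normal(20000)  # stand-in for Gaussian-like weights

gaussian_grid = lloyd_max_grid(samples, num_levels=16)
uniform_grid = np.linspace(samples.min(), samples.max(), 16)

_, rec_gaussian = quantize_to_grid(samples, gaussian_grid)
_, rec_uniform = quantize_to_grid(samples, uniform_grid)
mse_gaussian = float(np.mean((samples - rec_gaussian) ** 2))
mse_uniform = float(np.mean((samples - rec_uniform) ** 2))
print(f"MSE with fitted grid: {mse_gaussian:.4f}, with uniform grid: {mse_uniform:.4f}")
```

Because the grid depends only on the Gaussian shape and not on any particular model, it can be computed once and reused, which is what makes the approach data-free.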
A Hadamard transform is a mathematical operation that reorganizes the data of a model (such as the weights of a neural network) to make them more “messy” or “incoherent.” This means that their values become less correlated with each other, which makes them easier to compress without losing too much information.
In the context of quantization, applying a Hadamard transform helps the weights of the model behave as if they followed a Gaussian (normal) distribution, which is ideal for using more efficient quantization grids. Most interesting of all, this operation is very fast and requires no input data, which makes it well suited to quantization without calibration.
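The effect can be checked numerically. The sketch below builds an orthonormal Hadamard matrix, applies it (with random sign flips, an assumption of this example) to heavy-tailed weights, and measures how far the result is from Gaussian using excess kurtosis, which is 0 for a normal distribution. The weights, sizes and helper names are hypothetical.

```python
import numpy as np

def hadamard_matrix(n):
    """Sylvester construction of an orthonormal Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def excess_kurtosis(x):
    """0 for a Gaussian; positive values indicate heavier tails."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 1024
weights = rng.laplace(scale=0.02, size=n)  # hypothetical heavy-tailed weights

signs = rng.choice([-1.0, 1.0], size=n)    # random sign flips before the transform
H = hadamard_matrix(n)
rotated = H @ (signs * weights)            # transformed weights, to be quantized
recovered = signs * (H.T @ rotated)        # the transform is exactly invertible

kurt_before = excess_kurtosis(weights)
kurt_after = excess_kurtosis(rotated)
print(f"excess kurtosis before: {kurt_before:.2f}, after: {kurt_after:.2f}")
```

Since the matrix is orthonormal, the transform preserves the norm of the weights and can be undone exactly, so any error comes only from the quantization step that follows it.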
Notably, this process is entirely data-free. It does not require a single input sample to work, which makes it an ideal tool for environments where calibration data is expensive, private or nonexistent.
Experiments with Llama 3.1 and Qwen models show that HIGGS systematically outperforms existing methods in the 3- to 4-bit range, both in accuracy (measured by perplexity and by zero-shot and few-shot accuracy) and in implementation efficiency.
Faster inference: FLUTE integration
HIGGS not only offers theoretical improvements; it also has considerable practical advantages.
Its design allows it to integrate with FLUTE, an optimized kernel for GPU inference that fuses dequantization and matrix multiplication. This translates into much faster execution in real environments, such as inference servers.
FLUTE (Flexible Lookup Table Engine) is a kernel optimized to run quantized artificial intelligence models on GPUs very efficiently.
Its main advantage is that it fuses two key steps, dequantization (converting the compressed values back to a usable form) and matrix multiplication, into a single highly optimized operation.
This reduces inference time and memory consumption, especially in low-latency scenarios such as mobile applications or real-time responses.
In the case of HIGGS, FLUTE makes it possible to run models quantized with advanced grids without redesigning the hardware or sacrificing accuracy.
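The idea of fusing the two steps can be shown with a CPU reference in NumPy. This is only a sketch of the semantics, not the FLUTE kernel or its API: the first function materializes the full weight matrix before multiplying, while the second looks weights up tile by tile and never stores them in full. All shapes, names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features = 8, 16
grid = np.sort(rng.standard_normal(16))  # hypothetical 16-entry (4-bit) lookup table
codes = rng.integers(0, 16, size=(out_features, in_features), dtype=np.uint8)
scales = rng.uniform(0.5, 1.5, size=(out_features, 1))  # one scale per output row
x = rng.standard_normal((2, in_features))               # a small batch of activations

def matmul_two_step(x, codes, grid, scales):
    """Dequantize the whole matrix first, then multiply."""
    W = grid[codes] * scales  # full-precision copy of the weights
    return x @ W.T

def matmul_fused_reference(x, codes, grid, scales, tile=4):
    """Same result, but weights are looked up one tile at a time during the product."""
    out = np.zeros((x.shape[0], codes.shape[0]))
    for start in range(0, codes.shape[1], tile):
        cols = slice(start, start + tile)
        w_tile = grid[codes[:, cols]]  # table lookup for this tile only
        out += x[:, cols] @ w_tile.T
    return out * scales.T              # apply the per-row scales once at the end

y_two_step = matmul_two_step(x, codes, grid, scales)
y_fused = matmul_fused_reference(x, codes, grid, scales)
print("outputs match:", bool(np.allclose(y_two_step, y_fused)))
```

On a GPU, avoiding the intermediate full-precision matrix means far less data moves through memory, which is where most of the time goes when generating tokens.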
Tests on an RTX 4090 GPU show that HIGGS reaches up to 3 times more tokens per second than FP16, while maintaining similar or even better accuracy.
In addition, HIGGS can be dynamically configured to assign a different bit width to each layer according to its sensitivity, using an allocation strategy optimized by dynamic programming and based on the same linearity theorem.
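A per-layer allocation of this kind can be sketched as a small budgeted optimization: pick a bit width for each layer so that the sensitivity-weighted error is as low as possible while the total size stays within a budget. The example below solves it with a simple dynamic program, a choice made for this sketch. The function name, sensitivities, layer sizes and error values are all hypothetical.

```python
def allocate_bits(alphas, sizes, errors, budget):
    """Choose a bit width per layer to minimize sum(alpha * error) within a size budget.

    alphas: per-layer sensitivity coefficients
    sizes:  per-layer parameter counts (in arbitrary integer units)
    errors: dict mapping bit width -> relative MSE at that width
    budget: maximum allowed sum of size * bits
    """
    # best[used] = (cost, choices) over the layers processed so far
    best = {0: (0.0, [])}
    for alpha, size in zip(alphas, sizes):
        nxt = {}
        for used, (cost, choices) in best.items():
            for bits, err in errors.items():
                new_used = used + size * bits
                if new_used > budget:
                    continue
                candidate = (cost + alpha * err, choices + [bits])
                if new_used not in nxt or candidate[0] < nxt[new_used][0]:
                    nxt[new_used] = candidate
        best = nxt
    if not best:
        raise ValueError("budget too small for any assignment")
    return min(best.values(), key=lambda item: item[0])

# Hypothetical four-layer model with an average budget of 3 bits per weight
alphas = [8.0, 1.0, 3.0, 0.5]
sizes = [1, 1, 2, 2]
errors = {2: 0.12, 3: 0.035, 4: 0.01}  # illustrative error at each bit width
budget = 3 * sum(sizes)

cost, bits_per_layer = allocate_bits(alphas, sizes, errors, budget)
print("bits per layer:", bits_per_layer, "weighted error:", round(cost, 4))
```

The additive structure of the linearity theorem is what makes this tractable: each layer contributes to the objective independently, so the only coupling between layers is the shared budget.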
How much better is HIGGS?
The results are striking. In tests on the Llama 3.1 8B model, HIGGS achieved:
- A perplexity of 5.91 at 4 bits, better than GPTQ (6.23) and AWQ (6.22).
- The best average score on tasks such as ARC, WinoGrande, HellaSwag and MMLU.
- Accuracy comparable or superior to calibrated methods, even in completely data-free configurations.
Even in mixed configurations where HIGGS is combined with GPTQ, additional improvements are observed, suggesting that the method can be integrated into existing pipelines to obtain the best of both worlds.
Implications for the AI ecosystem
The impact of this advance could be broad. By eliminating the need for calibration, HIGGS simplifies the quantization process, accelerates the adoption of models and democratizes access to high-performance AI, especially in scenarios where calibration data is not available.
It also opens the door to new forms of adaptive quantization, in which computational resources can be adjusted dynamically according to the needs of the model or the environment, without severe penalties in inference quality.
Limitations and future
The authors acknowledge that there are still challenges to solve. For example, the mandatory use of the Hadamard transform adds a slight computational overhead, although it can be mitigated with “folding” techniques in the model architecture.
In addition, the method has not yet been exhaustively evaluated on architectures such as Mixture-of-Experts or on extensive generative tasks.
However, the linearity theorem offers a firm basis for further exploring compression techniques, not only for LLMs but also for other deep learning models where efficiency and performance must be carefully balanced.
This article was written by an AI content editor and reviewed by a human editor to guarantee quality and precision.
