Quantization, one of the most widely used techniques for making AI models more efficient, has limits, and the industry may soon run into them.
In the context of AI, quantization means reducing the number of bits needed to represent information. Simply put, it's like being asked the time and answering "noon" instead of "12:00:00.004." The answer is still accurate, but it's less detailed.
AI models consist of many parameters, the internal values a model uses to make predictions, and these can be quantized to reduce computational costs. A model that stores its parameters in fewer bits takes up less memory and is cheaper to run, which makes it easier to use in real-world scenarios. Quantization isn't always the perfect solution, however, especially when the original model has been trained on vast amounts of data.
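To make the idea concrete, here is a minimal sketch of one common scheme, symmetric 8-bit quantization of a list of weights. The function names `quantize` and `dequantize` and the sample weights are illustrative assumptions, not taken from the research discussed here, and production libraries use more elaborate variants.

```python
def quantize(values, bits=8):
    """Map floats to signed integers of the given bit width (symmetric, one scale for all values)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [x * scale for x in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize(weights)
restored = dequantize(q, scale)

for w, r in zip(weights, restored):
    print(f"original={w:+.4f}  restored={r:+.4f}  error={abs(w - r):.4f}")
```

Each weight now fits in one byte instead of four, but detail is lost: the small weight 0.003 rounds to zero, much as "12:00:00.004" becomes "noon."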
Research by experts at Harvard, Stanford, MIT, and other institutions found that quantized models perform worse when the original model was trained for a long time on large datasets. This challenges the commonly held belief that large models can simply be made more efficient through quantization.
For example, Meta's Llama 3 model showed a decline in performance after quantization, which has been attributed to the way it was trained. The finding serves as a warning to AI companies that hope to achieve significant savings by quantizing large models.
Kumar and his colleagues, the authors of the research, suggest an alternative approach: training models in lower precision. This could make models more resilient to the losses incurred during quantization. For instance, using 8-bit precision (instead of a higher bit width) reduces model size and computational requirements while still maintaining the quality of outputs.
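The effect of bit width on model size is simple arithmetic: bits per parameter times the number of parameters. The sketch below assumes a hypothetical model with 8 billion parameters and counts only the weights, ignoring activations and other runtime memory; the function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical parameter count, used only for illustration.
NUM_PARAMS = 8_000_000_000

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Halving the bit width halves the storage for the weights, which is why lower precision is attractive wherever memory is the bottleneck.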
For hardware providers like Nvidia, which supports 4-bit quantization, lower precision is a way to use memory more efficiently in data centers. Reducing precision too far, however, can hurt model performance.
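A rough way to see why precision cannot be cut indefinitely is to measure how rounding error grows as bits are removed. The sketch below uses randomly generated stand-in weights and a simple symmetric rounding scheme, both assumptions made for illustration. It measures error in the stored numbers only, not the drop in model quality that the research examines.

```python
import random


def mean_rounding_error(values, bits):
    """Average absolute error after symmetric quantization to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    restored = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


# Stand-in "weights": 1,000 samples from a normal distribution, seeded for repeatability.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]

errors = {bits: mean_rounding_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute rounding error = {err:.4f}")
```

Each step down in bit width leaves fewer representable values, so the error per weight rises quickly; at 2 bits only three values remain (negative, zero, positive).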
The key takeaway from Kumar’s research is that even if it seems possible to keep reducing precision, there are limits beyond which you cannot go without damaging the model. This opens the door to new approaches and architectures designed to train stably at lower precision and with greater efficiency.