Most Large Language Models (LLMs) require 30GB+ of GPU memory or RAM to run inference. This raises a question: how are we able to fine-tune LLMs like Llama 3 8B on consumer GPUs, or even run them locally? Not everyone has access to multiple expensive GPUs. The answer lies in model quantization: storing a model's weights at lower numerical precision so it fits in far less memory.
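To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in NumPy. This illustrates the general principle only; it is not the actual quantization scheme llama.cpp uses for its GGUF formats, and the function names are our own.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 values plus one per-tensor scale."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(weights)

# int8 storage is 4x smaller than float32 (1 byte vs. 4 bytes per weight)...
print(weights.nbytes / q.nbytes)  # 4.0
# ...at the cost of a small rounding error, bounded by half the scale.
print(np.abs(weights - dequantize_int8(q, scale)).max() <= scale)  # True
```

The same trade-off drives real quantization formats: each weight costs fewer bits, at the price of a bounded approximation error. Practical schemes refine this by quantizing in small blocks with one scale per block rather than per tensor.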
In this tutorial, we will learn about quantization in LLMs and convert Google’s Gemma model into a quantized model using llama.cpp.