Gemma 4 models use a training trick to reduce their memory footprint


Presentation timeline for Gemma 4 QAT models.

TL;DR

  • Gemma 4 models are now available for download with quantization-aware training (QAT), which reduces the size and memory footprint of the models.
  • Thanks to QAT, these open-source models retain quality better than those using post-training quantization (PTQ).
  • The QAT-optimized Gemma 4 models are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B.

After Google introduced the laptop-grade Gemma 4 12B model earlier this week, the company released new Gemma 4 model checkpoints with quantization-aware training. Quantization is necessary to reduce the amount of memory required to run lightweight models. The standard method is post-training quantization (PTQ), which quantizes the model after training but can result in poorer performance. According to Google's blog post, the latest Gemma 4 versions use quantization-aware training (QAT) instead to reduce model quality loss and speed up decoding.
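To make the difference between PTQ and QAT concrete, here is a minimal sketch of the rounding step both approaches share. It is a generic illustration of symmetric integer quantization, not Google's actual scheme; the function name `fake_quantize`, the 4-bit setting, and the sample weights are all assumptions chosen for the example.

```python
def fake_quantize(weights, bits=4):
    """Round weights onto a symmetric signed-integer grid, then map them back to floats.

    Generic illustration only; the scheme used for Gemma 4 is not described here.
    """
    qmax = 2 ** (bits - 1) - 1                      # largest integer level, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in levels], scale


# Hypothetical weights, not taken from any real model.
weights = [0.82, -0.31, 0.05, -0.77, 0.44]
dequantized, scale = fake_quantize(weights, bits=4)
errors = [abs(w - d) for w, d in zip(weights, dequantized)]

print("step size:", scale)
print("largest rounding error:", max(errors))
```

With PTQ, a rounding step like this is applied once to a finished model, so the network never gets a chance to compensate for the error. With QAT, the same rounding is simulated during training, which lets the weights settle into values that survive it.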

Google says that incorporating quantization into the training process results in checkpoints that perform better than models quantized with PTQ. The compressed models work well on phones and laptops thanks to a special mobile quantization scheme. This involves the use of pre-calculated parameters, 2-bit compression on certain parts of the model, and compression of the dictionary list and short-term memory. For the user, this results in a smaller model that consumes less system memory.

Multiple model sizes are available with QAT optimization, including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. The smallest versions, such as the text-only Gemma 4 E2B model, require less than one gigabyte of memory to run. Because they are not resource-intensive, these small Gemma 4 checkpoints are ideal for running on phones.

Google has shared the approximate memory requirements for loading the QAT versions of the new Gemma 4 models at each size:

Memory requirements of Gemma 4 model sizes.

Four different formats of Gemma 4 QAT models are available for download: unquantized QAT checkpoints, GPT-Generated Unified Format (GGUF), mobile-optimized, and Compressed Tensors. According to Google, these models “maintain similar quality to bfloat16, while dramatically reducing the memory requirements for loading the model.”
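For a rough sense of why lower precision cuts memory so sharply compared with bfloat16, the back-of-the-envelope calculation below multiplies a parameter count by the bits stored per parameter. It is a sketch under stated assumptions: the 2-billion-parameter figure is hypothetical rather than an official Gemma number, the helper `weight_memory_gb` is invented for this example, and the estimate covers weights only, ignoring activations and other runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical parameter count, not an official figure for any Gemma 4 model.
num_params = 2_000_000_000

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per parameter cuts weight memory to a quarter. Actual requirements depend on the format and runtime, so Google's published figures remain the reference.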

After downloading the Gemma 4 QAT model weights, users can run the checkpoints on their phones, laptops, or desktops. The mobile and desktop models are available on Hugging Face as well as in LM Studio.
