
GLM-5.2 runs on an RTX 4090 or a 32 GB CPU-only system
Unsloth.ai's GLM-5.2 model can be quantized to 4 bits and run on a single RTX 4090 or a CPU-only system with 32 GB of RAM, enabling on-premise LLM deployment [Unsloth Docs].
Unsloth.ai released the GLM-5.2 model, which has 5.2 billion parameters and occupies roughly 10 GB in FP16 format [Unsloth Docs]. The documentation provides a 4-bit quantized checkpoint that fits in 5 GB of VRAM, enabling inference on a single NVIDIA RTX 4090 or on a CPU-only workstation with 32 GB of RAM. The cited benchmarks on an RTX 4090 report 2.8 tokens/second per GPU core and a full 2048-token generation in under 12 seconds [Unsloth Docs]. On a 32-core AMD Threadripper CPU, the same generation completes in 45 seconds. Running GLM-5.2 locally costs roughly $0.10 per hour of GPU time, compared with $0.30-$0.45 per hour for comparable OpenAI or Anthropic API calls [Unsloth Docs]. This cost advantage, combined with the ability to keep sensitive prompts on-premise, makes local deployment viable for startups and regulated industries. Developers can also tweak prompts, fine-tune on proprietary data, and test changes in seconds, shortening the feedback loop for product teams building AI-augmented features.
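The figures above can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check and not part of the Unsloth documentation: it assumes decimal gigabytes, counts raw weights only (no KV cache, activations, or quantization metadata), and the helper names `weight_gb` and `tokens_per_second` are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumptions: decimal GB (1e9 bytes) and raw weights only; KV cache,
# activations and quantization metadata are ignored.

PARAMS = 5.2e9  # parameter count quoted in the article


def weight_gb(params: float, bits_per_param: int) -> float:
    """Raw weight size in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Aggregate throughput implied by a full generation time."""
    return tokens / seconds


fp16_gb = weight_gb(PARAMS, 16)        # FP16 footprint
int4_gb = weight_gb(PARAMS, 4)         # 4-bit footprint
gpu_tps = tokens_per_second(2048, 12)  # RTX 4090: 2048 tokens in 12 s
cpu_tps = tokens_per_second(2048, 45)  # Threadripper: 2048 tokens in 45 s

# Ratio of the quoted API cost range to the quoted local cost per hour.
local_cost = 0.10
api_low, api_high = 0.30, 0.45
cost_ratio_low = api_low / local_cost
cost_ratio_high = api_high / local_cost

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"GPU throughput: {gpu_tps:.1f} tokens/s")
print(f"CPU throughput: {cpu_tps:.1f} tokens/s")
print(f"API vs local cost: {cost_ratio_low:.1f}x to {cost_ratio_high:.1f}x")
```

Under these assumptions the raw weights come to about 10.4 GB in FP16 and 2.6 GB at 4 bits, which is consistent with the quoted 10 GB and leaves headroom within the quoted 5 GB of VRAM. The generation times imply at least about 171 tokens/second on the GPU (the article says "under 12 seconds") and about 45.5 tokens/second on the CPU, and the quoted API prices are 3x to 4.5x the local hourly cost.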


