LLMs are smaller than you think

It’s easier than ever to run an AI model at home.

01 November 2024

Only a handful of companies have the resources to train large language models like GPT, Claude, Gemini or Llama. Training just one powerful LLM can take tens of thousands of GPUs running for weeks at a time.

Running inference at scale is a big job too, with huge data centres required to generate results for apps with hundreds of millions of users.

But the models themselves aren’t as big as you might think.

LLMs are typically measured by how many parameters they have, with more parameters generally delivering stronger performance. The very largest models have hundreds of billions of parameters and are impractical for most people to run locally. Smaller models, however, are a different story.

As a rule of thumb, one billion parameters weighs in at around two gigabytes.¹

This means a model with eight billion parameters takes up around 16GB of disk space. Running inference for a model that size requires a bit more RAM than that – because the computer needs to be able to hold all the model parameters in memory alongside the software for running it – but is doable on many newer Macs and PCs.
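The rule of thumb is easy to check for yourself. Here is a minimal sketch of the arithmetic, assuming two bytes per parameter and decimal gigabytes (1GB = 10^9 bytes). The function name and the list of model sizes are made up for illustration, and the result covers the weights only, not the extra memory needed at run time.

```python
def model_size_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of a model's weights in decimal gigabytes.

    Assumes every parameter is stored in the same format
    (two bytes each by default, as with FP16 or BF16).
    """
    return num_params * bytes_per_param / 1e9


# Illustrative parameter counts, in billions (not tied to specific models).
for billions in (1, 8, 70):
    size = model_size_gb(billions * 1e9)
    print(f"{billions}B parameters: about {size:.0f}GB")
```

For eight billion parameters this works out to 16GB, matching the figure above.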

Recent advances have led to even greater reductions in size while maintaining good performance. There are now quantized Llama models, which store each parameter in fewer bits, that take up as little as 1GB of disk space (less than two CD-ROMs) and need just 2GB of memory to run, which is small enough for many modern smartphones.
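To see where a figure like 1GB could come from, the same arithmetic can be run at lower bit widths. This is only a sketch: the one-billion-parameter model and the bit widths below are assumptions chosen for illustration, not the details of any particular Llama release, and real quantized files tend to be a little larger because they carry extra scaling data.

```python
def quantized_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight size in decimal gigabytes at a given bit width."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Hypothetical one-billion-parameter model (an assumption for illustration).
NUM_PARAMS = 1e9

for bits in (16, 8, 4):
    size = quantized_size_gb(NUM_PARAMS, bits)
    print(f"{bits}-bit: about {size:.1f}GB")
```

At 8 bits per parameter the hypothetical model comes to about 1GB, and at 4 bits about 0.5GB.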

So while you probably won’t be training an LLM at home anytime soon, it’s pretty simple to download and run one on your own computer – and as always, that’s the best way to learn.

¹ Model parameters are typically represented in a 16-bit floating-point format (FP16 or BF16), so one parameter occupies two bytes.