Input Variables
The number of parameters defines the base memory requirement for storing the model. At 16-bit precision, each parameter requires 2 bytes.
Thus we'll need
(number_of_parameters * precision / 8) / (1024 ** 3)
GB of memory for the parameters, where precision is the number of bits per parameter. During inference, additional memory is needed to store activations (intermediate results during forward propagation). Activation memory depends on the sequence length and batch size, and it scales linearly with batch size. To simplify things, I will assume
25% * model size
for activations. There is also a memory overhead, which includes optimizer state (if training) or additional memory buffers (for inference). To simplify things, I will assume
10% * model size
for this overhead. The total memory, accounting for concurrent users, is then
total_memory_gb = model_memory_gb + (activation_memory_per_user * concurrent_users) + overhead_memory_gb
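The memory estimate above can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (25% of model size for activations per user, 10% for overhead). The function name `estimate_memory_gb` and its parameter names are hypothetical, chosen for this example.

```python
def estimate_memory_gb(number_of_parameters, precision_bits=16, concurrent_users=1,
                       activation_fraction=0.25, overhead_fraction=0.10):
    """Rough memory estimate in GB, following the simplifications in the text."""
    # Parameter storage: bits -> bytes -> GB (1 GB taken as 1024 ** 3 bytes).
    model_memory_gb = (number_of_parameters * precision_bits / 8) / (1024 ** 3)
    # Assumption from the text: activations cost 25% of model size per user.
    activation_memory_per_user = activation_fraction * model_memory_gb
    # Assumption from the text: overhead is 10% of model size, counted once.
    overhead_memory_gb = overhead_fraction * model_memory_gb
    total_memory_gb = (model_memory_gb
                       + activation_memory_per_user * concurrent_users
                       + overhead_memory_gb)
    return {
        "model_memory_gb": model_memory_gb,
        "activation_memory_per_user": activation_memory_per_user,
        "overhead_memory_gb": overhead_memory_gb,
        "total_memory_gb": total_memory_gb,
    }


# Example: an 8-billion-parameter model at 16-bit precision with 4 concurrent users.
estimate = estimate_memory_gb(8_000_000_000, precision_bits=16, concurrent_users=4)
print(f"model: {estimate['model_memory_gb']:.2f} GB, "
      f"total: {estimate['total_memory_gb']:.2f} GB")
```

Only the activation term grows with the number of users. The model weights and the overhead are counted once.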
A transformer model (like LLaMA) requires approximately:
Compute per token ≈ 360 × Number of Parameters
Here, 360 is a rough estimate of the floating-point operations required per parameter for a single forward pass (one token of inference), and Number of Parameters is the total model size (e.g., 90 billion for LLaMA 3.2). For multiple users, each generating a given number of tokens per second:
Total FLOPS = 360 × Parameters × Users × Tokens per second
Since FLOPS is typically measured in teraflops (TFLOPS, trillions of floating-point operations per second):
Required TFLOPS = (360 × Parameters × Users × Tokens per second) / 10^12
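As a quick check of the compute formula, here is a minimal Python sketch. The function name `required_tflops` and its parameter names are hypothetical. The default factor of 360 operations per parameter per token is the rough estimate used in the text, not a measured value, and tokens per second is assumed to be per user.

```python
def required_tflops(number_of_parameters, concurrent_users, tokens_per_second,
                    flops_per_parameter=360):
    """Required compute in TFLOPS, using the text's per-token estimate."""
    # Total operations per second across all users.
    total_flops = (flops_per_parameter * number_of_parameters
                   * concurrent_users * tokens_per_second)
    # Convert to teraflops (10 ** 12 operations per second).
    return total_flops / 10 ** 12


# Example: a 90-billion-parameter model, 10 users, 20 tokens per second each.
print(required_tflops(90_000_000_000, concurrent_users=10, tokens_per_second=20))
# 6480.0
```

Keeping the per-parameter factor as an argument makes it easy to see how sensitive the result is to that one assumption.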