Input Variables

    The number of parameters defines the base memory requirement for storing the model. At 16-bit precision, each parameter requires 2 bytes. Storing the parameters therefore needs (number_of_parameters * precision / 8) / (1024 ** 3) GB of memory.
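    As a quick sketch of this formula (the function name and the 90-billion-parameter example are mine, chosen for illustration):

    # Memory needed just to hold the model weights.
    def model_memory_gb(number_of_parameters: float, precision: int = 16) -> float:
        bytes_per_parameter = precision / 8
        return (number_of_parameters * bytes_per_parameter) / (1024 ** 3)

    # Example: a 90-billion-parameter model at 16-bit precision.
    print(f"{model_memory_gb(90e9):.1f} GB")  # roughly 167.6 GB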

    During inference, additional memory is needed to store activations (the intermediate results of forward propagation). Activation memory depends on the sequence length and batch size, and batch size affects it linearly. To keep things simple, I will assume activations take 25% of the model size per user.

    There is also a memory overhead, which covers the optimizer state (if training) or additional memory buffers for inference. To keep things simple, I will assume this overhead is 10% of the model size.

    The total memory, accounting for concurrent users, is then: total_memory_gb = model_memory_gb + (activation_memory_per_user * concurrent_users) + overhead_memory_gb.
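    Putting the pieces together, here is a minimal sketch of the total-memory estimate under the 25% activation and 10% overhead assumptions above (function names and the example workload are illustrative):

    def total_memory_gb(number_of_parameters: float,
                        precision: int = 16,
                        concurrent_users: int = 1) -> float:
        # Base memory for the model weights.
        model_memory = (number_of_parameters * precision / 8) / (1024 ** 3)
        # Assumption: activations take ~25% of the model size, per concurrent user.
        activation_memory_per_user = 0.25 * model_memory
        # Assumption: ~10% of the model size for buffers and other overhead.
        overhead_memory = 0.10 * model_memory
        return model_memory + activation_memory_per_user * concurrent_users + overhead_memory

    # Example: 90B parameters, 16-bit precision, 10 concurrent users.
    print(f"{total_memory_gb(90e9, 16, 10):.1f} GB")  # roughly 603.5 GB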

    A transformer model (like LLaMA) requires approximately: compute per token ≈ 360 × number of parameters, where 360 is a rough estimate of the floating-point operations needed per parameter for a single forward pass (one generated token), and the number of parameters is the total model size (e.g., 90 billion for LLaMA 3.2). Scaling this to multiple users, each generating a number of tokens per second: Total FLOPS = 360 × Parameters × Users × Tokens per second.

    Since compute budgets are typically quoted in teraflops (TFLOPS, i.e., trillions of FLOPS): Required TFLOPS = (360 × Parameters × Users × Tokens per second) / 10^12.
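    A sketch of the compute estimate using the same 360 FLOPs-per-parameter-per-token figure (the names and the example workload are illustrative, not measurements):

    def required_tflops(number_of_parameters: float,
                        users: int,
                        tokens_per_second: float) -> float:
        # Rough estimate: 360 floating-point operations per parameter per generated token.
        total_flops = 360 * number_of_parameters * users * tokens_per_second
        return total_flops / 1e12  # convert FLOPS to TFLOPS

    # Example: 90B parameters, 10 users, each generating 20 tokens per second.
    print(f"{required_tflops(90e9, 10, 20):,.0f} TFLOPS")  # 6,480 TFLOPS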
