
Large language models have demonstrated remarkable capabilities, but deploying them in production environments presents substantial engineering challenges. Over the past year, our team has gained significant experience running transformer models at scale. In this article, I'll share the key lessons we've learned about infrastructure, optimisation, and cost management.
The Scale of the Challenge
Modern transformer models are computationally demanding. A mid-sized model with 7 billion parameters requires:
**Memory**: Roughly 14 GB of GPU memory just to hold the weights in FP16 (7 billion parameters × 2 bytes), before accounting for the KV cache and activations.
**Compute**: On the order of 14 GFLOPs per generated token (roughly 2 FLOPs per parameter per forward pass).
**Memory Bandwidth**: At small batch sizes each decoding step streams the entire weight set from memory, so bandwidth rather than raw compute typically sets the token rate.
Serving such models to thousands of concurrent users requires careful architecture.
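As a back-of-the-envelope check on those numbers, the sketch below estimates weight and KV-cache memory for a 7-billion-parameter model. The architecture constants (32 layers, 32 heads, head dimension 128) match common 7B configurations such as Llama 2 7B, but treat them as illustrative assumptions.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_memory_gb(n_layers: int, n_heads: int, head_dim: int,
                       seq_len: int, batch_size: int,
                       bytes_per_value: int = 2) -> float:
    """KV cache: K and V tensors per layer, per head, per token (FP16)."""
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size / 1e9

# Illustrative Llama-2-7B-like shape: 32 layers, 32 heads, head dim 128.
print(f"Weights:  {weight_memory_gb(7e9):.1f} GB")   # ~14 GB in FP16
print(f"KV cache: {kv_cache_memory_gb(32, 32, 128, seq_len=4096, batch_size=8):.1f} GB")
```

Note that at realistic batch sizes the KV cache can rival or exceed the weights themselves, which is why the cache optimisations discussed later matter so much.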
Infrastructure Architecture
GPU Selection and Allocation
Not all GPUs are created equal for transformer inference:
**Memory Bandwidth**: Often the bottleneck for inference. A100s and H100s significantly outperform older generations.
**Multi-GPU Serving**: For larger models, we use tensor parallelism to spread the model across multiple GPUs, with careful attention to inter-GPU communication.
**Dynamic Batching**: Grouping requests into batches dramatically improves throughput, but requires balancing batch size against latency requirements.
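To make the latency/throughput trade-off concrete, here is a minimal sketch of the batching idea: collect requests until the batch is full or a deadline passes, whichever comes first. The queue type, batch size, and timeout are illustrative assumptions, not our production values.

```python
import queue
import time

def collect_batch(pending: queue.Queue, max_batch: int = 16,
                  max_wait_s: float = 0.02) -> list:
    """Group pending requests into one batch.

    A larger max_batch raises GPU utilisation and throughput; a shorter
    max_wait_s bounds the extra latency a request can accrue while waiting.
    """
    deadline = time.monotonic() + max_wait_s
    batch = [pending.get()]                 # block until one request arrives
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```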
Model Serving Infrastructure
We've evaluated several serving frameworks and settled on a combination approach:
**vLLM**: Excellent for batch inference with its PagedAttention implementation (a minimal usage example follows this list).
**TensorRT-LLM**: Best raw performance for NVIDIA GPUs when you can invest in optimisation.
**Custom Solutions**: For specific use cases, custom serving code can outperform general-purpose frameworks.
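For reference, a minimal vLLM batch-inference example is below. The model name is a placeholder and the sampling settings are illustrative; running it requires a CUDA GPU with the vllm package installed.

```python
from vllm import LLM, SamplingParams

# Placeholder model; swap in whichever weights you actually serve.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "What is tensor parallelism?",
]
# vLLM batches and schedules these requests internally.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```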
Optimisation Strategies
Quantisation
Reducing numerical precision is one of the most effective optimisation techniques:
**INT8 Quantisation**: 2x memory reduction with minimal accuracy loss for most models (a worked sketch follows this list).
**INT4 Quantisation**: 4x memory reduction, but requires careful attention to accuracy. We use calibration datasets to maintain quality.
**Mixed Precision**: Critical layers (attention, final projections) can use higher precision whilst less sensitive layers use lower precision.
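To illustrate the INT8 case, the sketch below applies symmetric per-tensor quantisation to a weight matrix with NumPy. Production pipelines use per-channel scales and calibration data; treat this as a minimal sketch of the underlying idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, np.floating]:
    """Symmetric per-tensor INT8 quantisation: w ~= q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w16 = np.random.randn(4096, 4096).astype(np.float16)  # FP16 baseline weights
q, scale = quantize_int8(w16.astype(np.float32))
w_hat = q.astype(np.float32) * scale                  # dequantised weights

print(f"Memory reduction vs FP16: {w16.nbytes / q.nbytes:.0f}x")  # 2x
print(f"Mean abs error: {np.abs(w16.astype(np.float32) - w_hat).mean():.5f}")
```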
Speculative Decoding
For autoregressive generation, speculative decoding can significantly improve throughput: a small draft model proposes several tokens ahead, and the large target model verifies them all in a single forward pass, accepting the longest prefix on which the two agree. Each target pass can therefore yield several tokens instead of one.
In our benchmarks, speculative decoding improves throughput by 2-3x for long-form generation tasks.
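A minimal sketch of the greedy-acceptance variant is below. draft_next and target_all are hypothetical stand-ins for the two models' decoding functions; a real implementation verifies the draft tokens with one batched forward pass of the target model, which is what target_all represents here.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_all: Callable[[List[int]], List[int]],
                     k: int = 4) -> List[int]:
    """One greedy speculative-decoding step; returns the tokens accepted.

    draft_next(seq)  -> draft model's next token for seq (hypothetical).
    target_all(seq)  -> target model's greedy prediction at every position
                        of seq, computed in one forward pass (hypothetical).
    """
    # 1. Draft k tokens autoregressively with the cheap model.
    seq, draft = list(prefix), []
    for _ in range(k):
        tok = draft_next(seq)
        draft.append(tok)
        seq.append(tok)

    # 2. Verify all k draft tokens with a single target forward pass.
    preds = target_all(prefix + draft)   # preds[i] = token after seq[: i + 1]
    base = len(prefix)

    # 3. Accept the longest prefix the target agrees with, then correct.
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if preds[base - 1 + i] == tok:
            accepted.append(tok)
        else:
            accepted.append(preds[base - 1 + i])  # target's own choice
            return accepted
    accepted.append(preds[base - 1 + k])          # bonus token: all drafts hit
    return accepted
```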
KV Cache Optimisation
The key-value cache grows with sequence length and can become a bottleneck:
**PagedAttention**: Allocates KV cache in non-contiguous blocks, reducing memory waste (see the allocator sketch after this list).
**Sliding Window**: For some applications, maintaining only recent context is acceptable.
**Compression**: Experimental techniques for compressing older KV entries show promise.
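To illustrate the PagedAttention idea, here is a toy block allocator that maps each sequence to fixed-size KV blocks drawn from a shared pool. It is a sketch of the bookkeeping only, not vLLM's actual implementation.

```python
class PagedKVAllocator:
    """Toy paged KV-cache bookkeeping: fixed-size blocks from a shared pool.

    Each sequence holds only the blocks it uses, so waste is bounded by one
    partially filled block per sequence rather than a full pre-allocation.
    """
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.tables = {}                      # seq_id -> [block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve cache space for one new token; return its block id."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: grab another
            if not self.free:
                raise MemoryError("KV cache pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1]

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```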
Cost Management
Running large models at scale is expensive. We employ several strategies to manage costs:
**Spot/Preemptible Instances**: For batch workloads, spot instances can reduce costs by 60-90%.
**Auto-scaling**: Scale GPU allocation based on demand, with aggressive scale-down during low-traffic periods.
**Model Routing**: Not every request requires the largest model. Smart routing to appropriately-sized models reduces costs.
**Caching**: For common queries, caching results can dramatically reduce compute requirements.
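As a concrete example of the caching strategy, here is a sketch of an exact-match LRU cache keyed on a hash of the normalised prompt plus generation settings. The normalisation and eviction policy are illustrative assumptions; it only pays off when identical or near-identical queries recur and generation is deterministic for a given key.

```python
import hashlib
import json
from collections import OrderedDict

class ResponseCache:
    """Exact-match LRU cache for generated responses (illustrative sketch)."""
    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        # Normalise case/whitespace so trivially different prompts still hit.
        payload = json.dumps({"prompt": " ".join(prompt.lower().split()),
                              "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, params: dict):
        key = self._key(prompt, params)
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        return None

    def put(self, prompt: str, params: dict, response: str) -> None:
        self._store[self._key(prompt, params)] = response
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # evict least recently used
```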
Practical Recommendations
For teams beginning their scaling journey, I recommend:
**Start with an established framework**: vLLM or TensorRT-LLM will take you further, faster, than custom serving code; reserve custom solutions for proven special cases.
**Quantise early**: INT8 is usually a safe first step and halves your memory footprint relative to FP16.
**Measure before optimising**: Instrument latency and throughput so you know whether memory bandwidth, batching, or the KV cache is the actual bottleneck.
**Manage costs from day one**: Model routing, caching, and auto-scaling are far easier to design in than to retrofit.
Marcus Webb
Director of Research

