AI engineers, researchers, and tech enthusiasts interested in the underlying infrastructure and economics of large language models.
This video explores how AI models like GPT, Claude, and Gemini are trained and served, focusing on underlying infrastructure and economics.
Companies offer 'fast mode' for higher prices and faster token streaming. This section questions the mechanics and economics behind it.
The primary driver for latency and cost trade-offs is batch size. Speculative decoding is another optimization technique discussed.
Analysis involves looking at memory bandwidth and compute performance on a GPU cluster, considering weight and KV cache operations.
Compute time is estimated by counting the FLOPs in the matrix multiplications against the active parameters, which scale with batch size, and dividing by the chip's FLOP/s.
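A minimal sketch of that compute-time estimate, assuming the common rule of thumb of about 2 FLOPs (one multiply, one add) per active parameter per token and ignoring attention FLOPs. The function name and all numeric values are hypothetical, chosen only for illustration.

```python
def compute_time_s(batch_size: int, active_params: float, chip_flops_per_s: float) -> float:
    """Estimated compute time for one decode step across the whole batch.

    Assumption: ~2 FLOPs per active parameter per token (rule of thumb).
    """
    flops_per_step = 2.0 * active_params * batch_size
    return flops_per_step / chip_flops_per_s


# Hypothetical values, not taken from the video.
ACTIVE_PARAMS = 30e9   # assumed active parameters per token
CHIP_FLOPS = 1e15      # assumed usable compute, in FLOP/s

t1 = compute_time_s(1, ACTIVE_PARAMS, CHIP_FLOPS)
t256 = compute_time_s(256, ACTIVE_PARAMS, CHIP_FLOPS)
print(f"batch=1: {t1 * 1e6:.1f} us, batch=256: {t256 * 1e3:.2f} ms")
```

Under this model compute time grows linearly with batch size, so on its own it gives no reason to batch; the benefit shows up only once memory time is considered.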
Memory time is the time to fetch all model weights plus the KV cache from memory. The weight fetch is shared by the whole batch, while the KV cache scales with batch size and context length.
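The memory-time estimate can be sketched the same way, as bytes moved divided by memory bandwidth. This assumes every sequence in the batch has the same context length and that the KV cache has a fixed size per token; the function name and the numbers below are hypothetical.

```python
def memory_time_s(
    batch_size: int,
    context_len: int,
    weight_bytes: float,
    kv_bytes_per_token: float,
    mem_bw_bytes_per_s: float,
) -> float:
    """Estimated time to stream weights and KV cache for one decode step."""
    # Weights are read once per step, regardless of how many users are batched.
    # The KV cache is per sequence, so it grows with batch size and context length.
    kv_bytes = batch_size * context_len * kv_bytes_per_token
    return (weight_bytes + kv_bytes) / mem_bw_bytes_per_s


# Hypothetical values, not taken from the video.
WEIGHT_BYTES = 100e9        # assumed total weight footprint
KV_BYTES_PER_TOKEN = 100e3  # assumed KV cache size per context token
MEM_BW = 3e12               # assumed memory bandwidth, in bytes/s

for batch in (1, 16, 256):
    t = memory_time_s(batch, 8192, WEIGHT_BYTES, KV_BYTES_PER_TOKEN, MEM_BW)
    print(f"batch={batch:4d}: {t * 1e3:.1f} ms")
```

At small batch sizes the weight term dominates and memory time is nearly flat; at large batch sizes or long contexts the KV cache term takes over.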
Batching users significantly improves economics, potentially by 1000x, because the cost of fetching the weights is shared across everyone in the batch. Graphs illustrate the trade-offs between batch size, compute time, and memory time.
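The trade-off can be illustrated with a small batch-size sweep. This is a rough sketch under stated assumptions: step time is taken as the larger of compute time and memory time (perfect overlap of the two), compute uses about 2 FLOPs per active parameter per token, and every value in `Setup` is hypothetical. It is not meant to reproduce the 1000x figure, only the shape of the curves.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    # All values are hypothetical, chosen only to make the trend visible.
    active_params: float = 30e9        # parameters used per token
    weight_bytes: float = 100e9        # bytes fetched for all weights
    kv_bytes_per_token: float = 100e3  # KV cache bytes per context token
    context_len: int = 8192            # same context length for every user
    flops_per_s: float = 1e15          # usable compute
    mem_bw_bytes_per_s: float = 3e12   # memory bandwidth


def step_time_s(batch_size: int, s: Setup) -> float:
    """Time for one decode step: the slower of compute and memory (assumed to overlap)."""
    compute = 2.0 * s.active_params * batch_size / s.flops_per_s
    kv_bytes = batch_size * s.context_len * s.kv_bytes_per_token
    memory = (s.weight_bytes + kv_bytes) / s.mem_bw_bytes_per_s
    return max(compute, memory)


def time_per_token_s(batch_size: int, s: Setup) -> float:
    """Hardware time spent per generated token, a proxy for cost per token."""
    return step_time_s(batch_size, s) / batch_size


setup = Setup()
baseline = time_per_token_s(1, setup)
for batch in (1, 8, 64, 512, 4096):
    latency_ms = step_time_s(batch, setup) * 1e3
    gain = baseline / time_per_token_s(batch, setup)
    print(f"batch={batch:5d}  step latency={latency_ms:8.1f} ms  cost gain vs batch=1: {gain:6.1f}x")
```

In this sketch each user sees a longer wait per token as the batch grows, while the hardware time per token falls. That is the latency versus cost trade-off a 'fast mode' would be pricing.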