Continuous Batching
Continuous batching (also called iteration-level batching) is an LLM serving optimization that dramatically improves throughput by dynamically adding new requests to in-progress batches at every decoding step. Unlike traditional static batching, which waits for every sequence in a batch to complete, continuous batching fills freed slots immediately as sequences finish. Pioneered by Orca (Yu et al., 2022) and implemented in vLLM, TGI, and TensorRT-LLM, it achieves 2-10x higher throughput than static batching. As of October 2025, continuous batching is standard in production LLM serving, enabling cost-efficient deployment of models like GPT-4, Claude, and Llama 3.
Overview
Traditional static batching processes fixed-size batches: wait for N requests, generate until all N complete, then start the next batch. The problem: sequences finish at different times, leaving the GPU underutilized. A batch of 32 sequences where one generates 2,000 tokens and the rest finish around 100 tokens leaves over 90% of the batch's decode slot-time idle, waiting on the slow sequence. Continuous batching solves this by (1) starting generation as soon as any request arrives, (2) adding new requests to the batch at every decoding iteration when slots free up, and (3) removing completed sequences immediately. The result: the GPU stays at maximum utilization (near 100% batch occupancy). Throughput improvement: 2-3x for typical workloads, up to 10x for variable-length sequences. Continuous batching is now standard in vLLM (up to 24x higher throughput than naive HuggingFace Transformers serving), TGI, TensorRT-LLM, and all major production serving engines.
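To make the waste concrete, here is the arithmetic for that example as a small Python snippet (the token counts are the illustrative numbers from the paragraph above, not measurements):
# Slot-time accounting for a static batch of 32 with one slow sequence
batch_size = 32
slow_tokens = 2000   # one sequence generates 2,000 tokens
fast_tokens = 100    # the other 31 finish after ~100 tokens
# Static batching keeps all 32 slots reserved until the slowest sequence finishes
total_slot_steps = batch_size * slow_tokens                       # 64,000 slot-steps
useful_slot_steps = (batch_size - 1) * fast_tokens + slow_tokens  # 5,100 slot-steps
print(f"Slot utilization: {useful_slot_steps / total_slot_steps:.0%}")  # ~8%, i.e. 90%+ wasted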
Key Implementations (October 2025)
- vLLM: PagedAttention + continuous batching, up to 24x throughput vs naive HuggingFace Transformers serving
- TGI (Text Generation Inference): Hugging Face's serving engine with continuous batching
- TensorRT-LLM: NVIDIA's optimized engine with iteration-level scheduling
- Ray Serve: Anyscale's framework with Orca-style batching
- DeepSpeed-FastGen: Microsoft's serving engine with continuous batching (Dynamic SplitFuse scheduling)
- NVIDIA Triton Inference Server: continuous (in-flight) batching for production models via the TensorRT-LLM backend
- Ollama: Local serving with continuous batching support
- Together AI: Cloud serving platform using continuous batching
Performance Improvements
vLLM benchmarks (Llama 3 70B, A100 80GB):
- Static batching: 12 tokens/sec/request, ~40% GPU utilization
- Continuous batching: 35 tokens/sec/request, ~85% GPU utilization (2.9x throughput improvement)
- Variable-length workloads (50-500 token outputs): 8 req/sec static vs 42 req/sec continuous (5.2x improvement)
- Cost impact: a $1/hour GPU serving 42 requests/sec instead of 8 cuts cost per request by roughly 5x
- Latency: time-to-first-token is unchanged (~50ms), while the throughput increase reduces queue wait times
The key benefit: higher GPU utilization (60-85% vs 20-40%) translates directly to cost savings.
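Spelled out, the cost arithmetic from the figures above (the $1/hour price and request rates are the illustrative numbers from this section):
# Cost per million requests at a hypothetical $1/hour GPU price
gpu_cost_per_hour = 1.00
static_req_per_sec = 8
continuous_req_per_sec = 42
def cost_per_million(req_per_sec: float) -> float:
    requests_per_hour = req_per_sec * 3600
    return gpu_cost_per_hour / requests_per_hour * 1_000_000
print(f"Static:     ${cost_per_million(static_req_per_sec):.2f} per 1M requests")      # ~$34.72
print(f"Continuous: ${cost_per_million(continuous_req_per_sec):.2f} per 1M requests")  # ~$6.61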
How It Works
- Request arrival: Add to waiting queue immediately
- Every decode step: Check for free batch slots (completed sequences)
- Dynamic insertion: Add waiting requests to fill empty slots
- Sequence completion: Remove from batch instantly, free memory
- Batch size: Varies from 1 to max (e.g., 128) based on demand
- Memory management: PagedAttention enables efficient KV cache sharing
- Preemption: long-running sequences can be paused (their KV cache swapped out or recomputed) to free memory or honor scheduling priorities
- No waiting: new requests start generating within one decode iteration (see the scheduling sketch below)
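A minimal, self-contained sketch of this iteration-level loop (toy Python only; the Sequence class, the decode_one_step stand-in, and the batch size are illustrative, not any real engine's scheduler):
import random
from collections import deque
from dataclasses import dataclass, field
MAX_BATCH_SIZE = 4  # toy value; plays the role of max_num_seqs in vLLM
@dataclass
class Sequence:
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)
    def is_finished(self) -> bool:
        return len(self.tokens) >= self.max_tokens
def decode_one_step(batch):
    """Stand-in for the engine's batched forward pass: one new token per running sequence."""
    for seq in batch:
        seq.tokens.append(random.randint(0, 31999))
def serve(requests):
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # 1. Iteration-level admission: fill freed slots before every decode step
        while waiting and len(running) < MAX_BATCH_SIZE:
            running.append(waiting.popleft())
        # 2. One decode iteration for every sequence currently in the batch
        decode_one_step(running)
        step += 1
        # 3. Evict finished sequences immediately so waiting requests can take their slots
        for seq in [s for s in running if s.is_finished()]:
            print(f"step {step}: finished {seq.prompt!r} after {len(seq.tokens)} tokens")
        running = [s for s in running if not s.is_finished()]
random.seed(0)
serve([Sequence(f"request {i}", max_tokens=random.randint(2, 10)) for i in range(8)])
Running the sketch shows short requests leaving the batch early and queued requests taking their slots on the very next iteration, which is the core of the technique.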
Code Example
# vLLM with continuous batching (automatic)
from vllm import LLM, SamplingParams
# Initialize LLM - continuous batching enabled by default
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # 4 GPUs
    max_num_seqs=128,        # Maximum batch size
    dtype="float16"
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
# Simulate continuous request stream
prompts = [
    "Write a short story:",
    "Explain quantum computing:",
    "Generate Python code for:",
    # ... many more requests
]
# vLLM automatically uses continuous batching
# Requests are processed as they arrive, filling batch slots dynamically
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("-" * 80)
# Online serving with continuous batching
from vllm import AsyncLLMEngine, AsyncEngineArgs
from vllm.sampling_params import SamplingParams
import asyncio
async def generate_stream(engine, prompt):
    """Yield the generated text as it grows; other requests decode in the same batch."""
    sampling_params = SamplingParams(max_tokens=200, temperature=0.7)
    # The request joins the running batch at the next decode iteration
    request_id = f"req_{id(prompt)}"
    results_generator = engine.generate(prompt, sampling_params, request_id)
    async for output in results_generator:
        # output.outputs[0].text is the cumulative text generated so far
        yield output.outputs[0].text
async def consume(stream):
    printed = 0
    async for text in stream:
        # Print only the newly generated portion of the cumulative text
        print(text[printed:], end="", flush=True)
        printed = len(text)
    print()
async def main():
    # Initialize async engine with continuous batching
    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B",
        max_num_seqs=64,  # Up to 64 concurrent requests
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    # Simulate multiple concurrent requests sharing one continuously batched engine
    streams = [
        generate_stream(engine, f"Request {i}: Explain AI")
        for i in range(10)
    ]
    # Consume all streams concurrently; the engine interleaves them at each decode step
    await asyncio.gather(*(consume(s) for s in streams))
asyncio.run(main())
# TGI (Text Generation Inference) with continuous batching
# Docker: docker run --gpus all --shm-size 1g -p 8080:80 \
# ghcr.io/huggingface/text-generation-inference:latest \
# --model-id meta-llama/Meta-Llama-3-70B --max-batch-total-tokens 32768
# Client code
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")
# Requests are batched continuously on server side
response = client.text_generation(
    "Explain machine learning:",
    max_new_tokens=200,
    temperature=0.7
)
print(response)
Continuous vs Static Batching
Static batching: wait for N requests → process the batch → wait for the slowest sequence to finish → start the next batch. GPU utilization: 20-40% (waiting time dominates); throughput is limited by the slowest sequence in each batch. Continuous batching: process requests as they arrive → add/remove sequences from the batch at each iteration → the GPU stays busy. GPU utilization: 60-85% (near-optimal); throughput: 2-10x higher. Memory: both approaches use similar total memory, but continuous batching combined with PagedAttention achieves better memory efficiency. Latency: static batching incurs long queue wait times; continuous batching starts generating almost immediately. For production serving, continuous batching is effectively strictly superior: comparable per-request latency with dramatically higher throughput.
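A small simulation makes the comparison concrete (illustrative Python; the output-length distribution and the equal-cost-per-step assumption are made up, not benchmark data):
import random
random.seed(0)
BATCH_SIZE = 32
# Made-up skewed workload: most outputs are short, a few are very long
lengths = [random.randint(50, 150) if random.random() < 0.9 else random.randint(1000, 2000)
           for _ in range(1024)]
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes; freed slots sit idle."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))
def continuous_batching_steps(lengths, batch_size):
    """Finished sequences are replaced immediately, so slots are almost never idle."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        steps += 1
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps
static = static_batching_steps(lengths, BATCH_SIZE)
continuous = continuous_batching_steps(lengths, BATCH_SIZE)
print(f"Static decode steps:     {static}")
print(f"Continuous decode steps: {continuous}")
print(f"Speedup (assuming equal per-step cost): {static / continuous:.1f}x")
The simulation counts decode iterations only; it ignores prefill, request arrival times, and memory effects, which in practice tend to favor continuous batching further.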
When to Use
- Production LLM serving: Essential for cost-efficient deployment
- Variable-length outputs: Maximum benefit when sequence lengths vary 2-10x
- High-throughput applications: Chatbots, API services, batch processing
- Cost optimization: 2-5x cost reduction per request vs static batching
- Real-time serving: Lower queue wait times for better user experience
- Multi-tenant serving: Efficiently share GPU across multiple users
- Cloud deployment: Maximize requests per GPU-hour
- Any LLM inference workload: essentially no downside vs static batching
Professional Integration Services by 21medien
21medien offers LLM serving optimization services including vLLM deployment, TGI configuration, continuous batching tuning, and production infrastructure setup. Our team specializes in maximizing throughput through optimal batch size selection, memory configuration, and multi-GPU serving strategies. We help organizations reduce serving costs by 2-5x through continuous batching and related optimizations. Contact us for custom LLM serving solutions.
Resources
Orca paper (OSDI 2022): https://www.usenix.org/conference/osdi22/presentation/yu | vLLM blog: https://blog.vllm.ai/2023/06/20/vllm.html | TGI docs: https://huggingface.co/docs/text-generation-inference | vLLM GitHub: https://github.com/vllm-project/vllm