Continuous Batching
Continuous batching (also called iteration-level batching) is an LLM serving optimization that dramatically improves throughput by dynamically adding new requests to in-progress batches at every decoding step. Unlike traditional static batching, which waits for every sequence in a batch to complete, continuous batching fills freed slots immediately as sequences finish. Pioneered by Orca (Yu et al., 2022) and implemented in vLLM, TGI, and TensorRT-LLM, it achieves 2-10x higher throughput than static batching. As of October 2025, continuous batching is standard in production LLM serving, enabling cost-efficient deployment of models like GPT-4, Claude, and Llama 3.
Overview
Traditional static batching processes fixed-size batches: wait for N requests, generate until all N complete, then start the next batch. The problem: sequences finish at different times, leaving the GPU underutilized. A batch of 32 sequences where one generates 2,000 tokens and the rest finish around 100 tokens leaves over 90% of the batch's decode slot-time idle, waiting on the slow sequence. Continuous batching solves this by (1) starting generation as soon as any request arrives, (2) adding new requests to the batch at every decoding iteration when slots free up, and (3) removing completed sequences immediately. The result: the GPU stays at maximum utilization (near 100% batch occupancy). Throughput improvement: 2-3x for typical workloads, up to 10x for variable-length sequences. Continuous batching is now standard in vLLM (up to 24x higher throughput than naive HuggingFace Transformers serving), TGI, TensorRT-LLM, and all major production serving engines.
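To make the waste concrete, here is the arithmetic for that example as a small Python snippet (the token counts are the illustrative numbers from the paragraph above, not measurements):
# Slot-time accounting for a static batch of 32 with one slow sequence
batch_size = 32
slow_tokens = 2000   # one sequence generates 2,000 tokens
fast_tokens = 100    # the other 31 finish after ~100 tokens
# Static batching keeps all 32 slots reserved until the slowest sequence finishes
total_slot_steps = batch_size * slow_tokens                       # 64,000 slot-steps
useful_slot_steps = (batch_size - 1) * fast_tokens + slow_tokens  # 5,100 slot-steps
print(f"Slot utilization: {useful_slot_steps / total_slot_steps:.0%}")  # ~8%, i.e. 90%+ wasted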
Key Implementations (October 2025)
- vLLM: PagedAttention + continuous batching, up to 24x throughput vs naive HuggingFace Transformers serving
- TGI (Text Generation Inference): Hugging Face's serving engine with continuous batching
- TensorRT-LLM: NVIDIA's optimized engine with iteration-level scheduling
- Ray Serve: Anyscale's framework with Orca-style batching
- DeepSpeed-FastGen: Microsoft's serving engine with continuous batching (Dynamic SplitFuse scheduling)
- NVIDIA Triton Inference Server: continuous (in-flight) batching for production models via the TensorRT-LLM backend
- Ollama: Local serving with continuous batching support
- Together AI: Cloud serving platform using continuous batching
Performance Improvements
vLLM benchmarks (Llama 3 70B, A100 80GB):
- Static batching: 12 tokens/sec/request, ~40% GPU utilization
- Continuous batching: 35 tokens/sec/request, ~85% GPU utilization (2.9x throughput improvement)
- Variable-length workloads (50-500 token outputs): 8 req/sec static vs 42 req/sec continuous (5.2x improvement)
- Cost impact: a $1/hour GPU serving 42 requests/sec instead of 8 cuts cost per request by roughly 5x
- Latency: time-to-first-token is unchanged (~50ms), while the throughput increase reduces queue wait times
The key benefit: higher GPU utilization (60-85% vs 20-40%) translates directly to cost savings.
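Spelled out, the cost arithmetic from the figures above (the $1/hour price and request rates are the illustrative numbers from this section):
# Cost per million requests at a hypothetical $1/hour GPU price
gpu_cost_per_hour = 1.00
static_req_per_sec = 8
continuous_req_per_sec = 42
def cost_per_million(req_per_sec: float) -> float:
    requests_per_hour = req_per_sec * 3600
    return gpu_cost_per_hour / requests_per_hour * 1_000_000
print(f"Static:     ${cost_per_million(static_req_per_sec):.2f} per 1M requests")      # ~$34.72
print(f"Continuous: ${cost_per_million(continuous_req_per_sec):.2f} per 1M requests")  # ~$6.61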
How It Works
- Request arrival: Add to waiting queue immediately
- Every decode step: Check for free batch slots (completed sequences)
- Dynamic insertion: Add waiting requests to fill empty slots
- Sequence completion: Remove from batch instantly, free memory
- Batch size: Varies from 1 to max (e.g., 128) based on demand
- Memory management: PagedAttention enables efficient KV cache sharing
- Preemption: long-running sequences can be paused (their KV cache swapped out or recomputed) to free memory or honor scheduling priorities
- No waiting: new requests start generating within one decode iteration (see the scheduling sketch below)
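A minimal, self-contained sketch of this iteration-level loop (toy Python only; the Sequence class, the decode_one_step stand-in, and the batch size are illustrative, not any real engine's scheduler):
import random
from collections import deque
from dataclasses import dataclass, field
MAX_BATCH_SIZE = 4  # toy value; plays the role of max_num_seqs in vLLM
@dataclass
class Sequence:
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)
    def is_finished(self) -> bool:
        return len(self.tokens) >= self.max_tokens
def decode_one_step(batch):
    """Stand-in for the engine's batched forward pass: one new token per running sequence."""
    for seq in batch:
        seq.tokens.append(random.randint(0, 31999))
def serve(requests):
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # 1. Iteration-level admission: fill freed slots before every decode step
        while waiting and len(running) < MAX_BATCH_SIZE:
            running.append(waiting.popleft())
        # 2. One decode iteration for every sequence currently in the batch
        decode_one_step(running)
        step += 1
        # 3. Evict finished sequences immediately so waiting requests can take their slots
        for seq in [s for s in running if s.is_finished()]:
            print(f"step {step}: finished {seq.prompt!r} after {len(seq.tokens)} tokens")
        running = [s for s in running if not s.is_finished()]
random.seed(0)
serve([Sequence(f"request {i}", max_tokens=random.randint(2, 10)) for i in range(8)])
Running the sketch shows short requests leaving the batch early and queued requests taking their slots on the very next iteration, which is the core of the technique.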
Code Example
# vLLM with continuous batching (automatic)
from vllm import LLM, SamplingParams
# Initialize LLM - continuous batching enabled by default
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # 4 GPUs
    max_num_seqs=128,        # Maximum batch size
    dtype="float16"
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
# Simulate continuous request stream
prompts = [
    "Write a short story:",
    "Explain quantum computing:",
    "Generate Python code for:",
    # ... many more requests
]
# vLLM automatically uses continuous batching
# Requests are processed as they arrive, filling batch slots dynamically
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("-" * 80)
# Online serving with continuous batching
from vllm import AsyncLLMEngine, AsyncEngineArgs
from vllm.sampling_params import SamplingParams
import asyncio
async def generate_stream(engine, prompt):
    """Yield the generated text as it grows; other requests decode in the same batch."""
    sampling_params = SamplingParams(max_tokens=200, temperature=0.7)
    # The request joins the running batch at the next decode iteration
    request_id = f"req_{id(prompt)}"
    results_generator = engine.generate(prompt, sampling_params, request_id)
    async for output in results_generator:
        # output.outputs[0].text is the cumulative text generated so far
        yield output.outputs[0].text
async def consume(stream):
    printed = 0
    async for text in stream:
        # Print only the newly generated portion of the cumulative text
        print(text[printed:], end="", flush=True)
        printed = len(text)
    print()
async def main():
    # Initialize async engine with continuous batching
    engine_args = AsyncEngineArgs(
        model="meta-llama/Meta-Llama-3-8B",
        max_num_seqs=64,  # Up to 64 concurrent requests
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    # Simulate multiple concurrent requests sharing one continuously batched engine
    streams = [
        generate_stream(engine, f"Request {i}: Explain AI")
        for i in range(10)
    ]
    # Consume all streams concurrently; the engine interleaves them at each decode step
    await asyncio.gather(*(consume(s) for s in streams))
asyncio.run(main())
# TGI (Text Generation Inference) with continuous batching
# Docker: docker run --gpus all --shm-size 1g -p 8080:80 \
# ghcr.io/huggingface/text-generation-inference:latest \
# --model-id meta-llama/Meta-Llama-3-70B --max-batch-total-tokens 32768
# Client code
from huggingface_hub import InferenceClient
client = InferenceClient("http://localhost:8080")
# Requests are batched continuously on server side
response = client.text_generation(
    "Explain machine learning:",
    max_new_tokens=200,
    temperature=0.7
)
print(response)
Continuous vs Static Batching
Static batching: wait for N requests → process the batch → wait for the slowest sequence to finish → start the next batch. GPU utilization: 20-40% (waiting time dominates); throughput is limited by the slowest sequence in each batch. Continuous batching: process requests as they arrive → add/remove sequences from the batch at each iteration → the GPU stays busy. GPU utilization: 60-85% (near-optimal); throughput: 2-10x higher. Memory: both approaches use similar total memory, but continuous batching combined with PagedAttention achieves better memory efficiency. Latency: static batching incurs long queue wait times; continuous batching starts generating almost immediately. For production serving, continuous batching is effectively strictly superior: comparable per-request latency with dramatically higher throughput.
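A small simulation makes the comparison concrete (illustrative Python; the output-length distribution and the equal-cost-per-step assumption are made up, not benchmark data):
import random
random.seed(0)
BATCH_SIZE = 32
# Made-up skewed workload: most outputs are short, a few are very long
lengths = [random.randint(50, 150) if random.random() < 0.9 else random.randint(1000, 2000)
           for _ in range(1024)]
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes; freed slots sit idle."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))
def continuous_batching_steps(lengths, batch_size):
    """Finished sequences are replaced immediately, so slots are almost never idle."""
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop())
        steps += 1
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps
static = static_batching_steps(lengths, BATCH_SIZE)
continuous = continuous_batching_steps(lengths, BATCH_SIZE)
print(f"Static decode steps:     {static}")
print(f"Continuous decode steps: {continuous}")
print(f"Speedup (assuming equal per-step cost): {static / continuous:.1f}x")
The simulation counts decode iterations only; it ignores prefill, request arrival times, and memory effects, which in practice tend to favor continuous batching further.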
When to Use
- Production LLM serving: Essential for cost-efficient deployment
- Variable-length outputs: Maximum benefit when sequence lengths vary 2-10x
- High-throughput applications: Chatbots, API services, batch processing
- Cost optimization: 2-5x cost reduction per request vs static batching
- Real-time serving: Lower queue wait times for better user experience
- Multi-tenant serving: Efficiently share GPU across multiple users
- Cloud deployment: Maximize requests per GPU-hour
- Any LLM inference workload: essentially no downside vs static batching
Professional Integration Services by 21medien
21medien offers LLM serving optimization services including vLLM deployment, TGI configuration, continuous batching tuning, and production infrastructure setup. Our team specializes in maximizing throughput through optimal batch size selection, memory configuration, and multi-GPU serving strategies. We help organizations reduce serving costs by 2-5x through continuous batching and related optimizations. Contact us for custom LLM serving solutions.
Resources
Orca paper (OSDI 2022): https://www.usenix.org/conference/osdi22/presentation/yu | vLLM blog: https://blog.vllm.ai/2023/06/20/vllm.html | TGI docs: https://huggingface.co/docs/text-generation-inference | vLLM GitHub: https://github.com/vllm-project/vllm