LLM inference serving: Deploy open-source LLMs (e.g., Qwen, GPT-OSS) via vLLM or SGLang with high-throughput REST or streaming endpoints
Real-time voice agent pipelines: Build voice agents with sub-500 ms latency using Pipecat or LiveKit, integrated with Twilio for inbound and outbound calling
Image & video model inference: Serve Stable Diffusion XL, video generation models, and vision-language models (VLMs) with autoscaling
Model training & hyperparameter sweeps: Run distributed training jobs on multi-GPU H100 clusters with Weights & Biases (W&B) integration for experiment tracking
Embedding & reranking APIs: Host high-throughput, low-latency REST servers for text embeddings and reranking models
Custom container inference: Wrap any model in a Dockerfile and deploy it with CI/CD pipelines, gradual rollouts, and secrets management

Cerebrium