Ollama provides a lightweight runtime for running large language models (LLMs) locally, exposing an OpenAI-compatible REST API (http://localhost:11434) that makes it a drop-in backend for many existing tools and frameworks. It handles model quantization, hardware acceleration (CPU, Apple Silicon via Metal, NVIDIA/AMD GPUs via CUDA/ROCm), and lifecycle management through a simple CLI and API surface. The ollama binary bundles everything needed — model weights are pulled from the Ollama model registry with a single command (ollama pull <model>). For scaling beyond local hardware, Ollama Cloud offers datacenter-grade inference with parallel request handling and real-time web access, accessible via the same API interface.