DVC (Data Version Control)
Manage data the way code is managed: Git-like version control for AI/ML and data science.
Free tier available·Technical·API available·Open source
Key strengths
- Git-like versioning for datasets and ML models
- Open source with a large, active community
- Seamlessly integrates with existing Git workflows
- Supports petabyte-scale data lakes and object stores via lakeFS
- Works with major cloud storage providers and local filesystems
Free tier + paid plans
San Francisco, USA
Founded 2017
Self-hostable
Developer & Technical Documentation
DVC exposes a rich CLI and Python API for integrating into ML pipelines and CI/CD workflows:
- Pipeline DAGs: Define stages with inputs, outputs, and commands in `dvc.yaml`. DVC builds a dependency graph and only re-runs stages whose dependencies have changed, enabling efficient, reproducible pipelines.
- Experiment Tracking: Use `dvc exp run`, `dvc exp show`, and `dvc exp diff` to branch, run, and compare experiments without cluttering your Git history.
- Remote Storage Backends: Out-of-the-box support for AWS S3, Google Cloud Storage, Azure Blob Storage, SSH/SFTP, HDFS, HTTP, and local paths. Configure via `dvc remote add` and `dvc remote modify`.
- Python API: Access DVC programmatically via `import dvc.api` to open versioned data files directly in your scripts, enabling clean integration with training code.
- VS Code Extension: Provides a GUI for managing experiments, visualizing pipeline DAGs, and comparing metrics without leaving the editor.
- lakeFS Integration: For enterprise-scale needs, lakeFS layers a full Git branching model on top of S3-compatible object stores, enabling atomic commits, zero-copy branching, and data CI/CD at petabyte scale.
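
As a minimal sketch of the Python API item above, the helper below reads a DVC-tracked CSV file at a chosen Git revision using `dvc.api.open`. The function name `load_versioned_rows`, the file path, and the tag in the usage note are hypothetical, and the sketch assumes DVC is installed and the file is tracked in the target repository.

```python
import csv


def load_versioned_rows(path, repo=None, rev=None):
    """Read a DVC-tracked CSV file at a given Git revision into a list of dicts.

    `path` is relative to the repository root; `repo` may be a local path or a
    Git URL (None means the current repository); `rev` may be a branch, tag,
    or commit (None means the current workspace).
    """
    # Imported lazily so this module can be loaded where DVC is not installed.
    import dvc.api

    # dvc.api.open yields a file-like object for the requested revision.
    with dvc.api.open(path, repo=repo, rev=rev, mode="r") as f:
        return list(csv.DictReader(f))
```

A call might look like `rows = load_versioned_rows("data/train.csv", rev="v1.0")`, where both the path and the tag are placeholders for whatever your own repository tracks.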

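The idea behind re-running only stages whose dependencies have changed can be illustrated with a toy sketch. This is not DVC's implementation: the `Stage` class and the helper functions are hypothetical, SHA-256 is chosen purely for illustration, and the sketch checks a single stage's dependency files without walking a full dependency graph.

```python
import hashlib
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Stage:
    """Hypothetical stand-in for one pipeline stage and its dependency files."""
    name: str
    deps: list
    recorded: dict = field(default_factory=dict)  # path -> hash at last run


def file_hash(path):
    """Return a content hash of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_stale(stage):
    """True if any dependency's current hash differs from the recorded one."""
    return any(stage.recorded.get(dep) != file_hash(dep) for dep in stage.deps)


def mark_run(stage):
    """Record current dependency hashes, as if the stage had just executed."""
    stage.recorded = {dep: file_hash(dep) for dep in stage.deps}
```

A stage that has never run is reported as stale, becomes up to date after `mark_run`, and turns stale again as soon as the contents of any dependency file change.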