Kubeflow is an umbrella project for running the end-to-end machine learning lifecycle on Kubernetes. It is not a single binary but a collection of independently installable components that share the same CRD-driven, Kubernetes-native model.
The main components are as follows. Kubeflow Pipelines is a DAG engine for ML workflows, backed by Argo Workflows, with a Python SDK, artifact tracking, and a UI for experiments and runs. The Training Operator provides the TFJob, PyTorchJob, MPIJob, XGBoostJob, and PaddleJob CRDs for distributed training, with optional gang scheduling through a batch scheduler such as Volcano. Katib is a CRD-driven controller for hyperparameter tuning and neural architecture search; its search algorithms include Bayesian optimization, TPE, Hyperband, and random search. Notebooks provides managed per-user JupyterLab and VS Code (code-server) instances, backed by PVCs and able to request GPUs. The Kubeflow Central Dashboard ties these together, with multi-tenant Profiles that use Istio and Dex for authentication and per-namespace isolation. KServe used to be in-tree as KFServing and is now a separate CNCF incubating project that you compose with Kubeflow for model serving.
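To make the CRD-driven model concrete, here is a minimal sketch of what a PyTorchJob resource for the Training Operator might look like, built as a plain Python dict and printed as JSON. The job name, namespace, image, and the `pytorch_job` helper are hypothetical and chosen only for illustration; the sketch assumes the `kubeflow.org/v1` API version and NVIDIA GPUs exposed as `nvidia.com/gpu`, and it only builds the manifest without contacting a cluster.

```python
import json


def pytorch_job(name, namespace, image, workers, gpus_per_replica=1):
    """Build a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica(count):
        # Each call returns a fresh dict so Master and Worker specs are independent.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The Training Operator looks for a container named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            # Assumes the NVIDIA device plugin is installed on the cluster.
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_replica}
                            },
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical values: one master plus three workers in a tenant namespace.
manifest = pytorch_job(
    name="mnist-ddp",
    namespace="team-a",
    image="registry.example.com/mnist:latest",
    workers=3,
)
print(json.dumps(manifest, indent=2))
```

Because JSON is accepted wherever Kubernetes expects YAML, the printed output could be piped to `kubectl apply -f -` on a cluster where the Training Operator is installed. The other job kinds follow the same pattern with their own replica-spec field and container name.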
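It originated at Google as a way to run TensorFlow workloads, including TensorFlow Extended pipelines, on Kubernetes. Its influence now shows up in managed and vendor offerings rather than only in self-hosted installs: Vertex AI Pipelines runs pipelines authored with the Kubeflow Pipelines SDK, AWS has published a Kubeflow distribution for EKS that integrates with SageMaker (SageMaker itself is not built on Kubeflow), and IBM has built on Kubeflow Pipelines in products such as watsonx.ai. Its main alternatives are MLflow plus raw Kubernetes jobs for smaller setups, and SageMaker or Vertex AI for a fully managed path. Kubeflow is what you pick when you want to own the full stack on your own clusters.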