---
sidebar: false
title: "Optimizing ML Development at Kumo.ai with Velda"
description: "A case study: how Kumo.ai reduced iteration time, improved GPU utilization, and sped up onboarding by using Velda for development and experimentation."
date: 2025-10-21
author: Hema Raghavan
tags: [machine-learning, ml-workflow, cloud-computing, gpu-computing, developer-productivity, velda]
keywords: ["ML development", "GPU utilization", "Velda", "cloud development", "developer productivity", "MLOps"]
excerpt: "How Kumo.ai used Velda to accelerate experiments, cut dependency update times, and increase GPU utilization across the team."
image: "https://cdn-images-1.medium.com/max/2400/1*72Jo87jR0xg_3zzBkwkNLQ.png"
readingTime: "6 min"
category: "Case Study"
---

# Optimizing ML Development at Kumo.ai

## Background and Challenges

At [Kumo.ai](https://kumo.ai), our ML engineers originally developed on **per-developer AWS GPU VMs (T4)**. That setup worked in the early stages but became a bottleneck as [**KumoRFM**](https://kumorfm.ai/), our relational foundation model, grew in scale and complexity.

* Training required **L40S, A100, and sometimes multi-GPU nodes**, which quickly drove up costs.

* Many GPUs sat idle between runs, leading to low utilization.

* Our **production Kubernetes cluster** was designed for **end-to-end (E2E) pipelines**, not iterative development. Running just one step meant repeating full data-prep and orchestration each time.

* Our dependency stack—custom pip packages, private builds, and internal resolvers—made Docker rebuilds slow, often taking **10 minutes or more** for minor updates.

We needed a way to **iterate faster** and **experiment flexibly** with GPUs, without disturbing production or waiting on heavy image builds.

## The Solution: A Developer Journey with Velda

### Onboarding & Environment Setup

With Velda, onboarding became instant. Each engineer received a **pre-configured development environment** ready to run out of the box—no image builds or manual setup.

We reused our internal upgrade script, originally written for local installs; it now runs directly in Velda, so there is no need to build a separate Docker image just to run jobs in the cluster. The only change was forcing a GPU-enabled setup, even when the environment is started from a CPU node.

As a result, dependency updates that once took over ten minutes now complete in under a minute.
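
As a minimal sketch of what this looks like in practice (the script path `./scripts/upgrade_env.sh` and the `FORCE_CUDA` variable are our illustration, not actual Kumo or Velda interfaces):

```bash
# Run the same upgrade script we use locally, directly in the Velda instance.
# FORCE_CUDA=1 is a hypothetical switch telling the script to install
# GPU-enabled wheels even when the current session has no GPU attached.
FORCE_CUDA=1 ./scripts/upgrade_env.sh

# The environment persists, so subsequent runs pick up the new packages
# with no image rebuild.
vrun -P gpu-a100-8 python -c "import torch; print(torch.cuda.is_available())"
```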

### IDE Integration

Most of our engineers use **VS Code**, several use **Cursor**, and some connect over plain SSH. Velda’s IDE plugin and CLI connect these editors directly to each developer’s Velda instance, with **no change to existing workflows**.

We continue coding, running, and debugging exactly as before—just with on-demand GPUs available whenever needed.

### Experimentation

When training, we prefix commands with `vrun`, select the GPU type (T4, L4, L40S, A100, or multi-GPU), and start. Everything spins up within a few seconds, and with logs streamed back live, it’s hardly noticeable that the command is running remotely.

```diff
- python train.py
+ vrun -P gpu-a100-8 python train.py
```

Running multiple experiments in parallel is as simple as launching the same command several times with different flags, which we often do to search for the best hyperparameters (see the sketch below). This accelerates our development significantly.
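
For instance, a small parallel sweep could look like the following; the `--lr` flag belongs to our own (hypothetical) training script, while the `vrun -P` pattern is the same one shown above:

```bash
# Launch one training run per learning rate; each gets its own remote GPU.
# train.py and its --lr flag stand in for your own training entrypoint.
for lr in 0.0003 0.001 0.003; do
  vrun -P gpu-a100-8 python train.py --lr "$lr" > "train_lr_${lr}.log" 2>&1 &
done
wait  # block until every run has finished
```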

### Keeping Production Separate

Our **production pipeline remains on Kubernetes** because it’s **customer-facing** and built around **many existing integrations**—data ingestion, observability, and internal orchestration systems.

We continue to rely on that cluster for end-to-end jobs, while **Velda powers development and experimentation**.
This separation allows rapid iteration without touching production, while maintaining reliability for customer workloads.

---

## Key Highlights

* Development environments ready in minutes—no manual setup or image builds.
* Existing dependency scripts run natively inside Velda.
* Full IDE integration for VS Code and Cursor; no workflow change.
* GPU types selected per run for maximum flexibility.
* Production workloads stay on Kubernetes for stability and customer integration.

## Impact

| Metric | Before Velda | With Velda |
|---|---|---|
| Environment setup | Manual VM / container build | Instant, pre-configured |
| Dependency updates | Docker rebuilds (10–15 min) | Direct install (< 1 min) |
| GPU utilization | ~15% | ~90% |
| Experiment throughput | 1–2 runs/day | 10+ runs/day |
| Production infrastructure | K8s E2E pipelines | Still K8s (customer-facing) |

## Improvements Identified & Ongoing Collaboration

During the proof-of-concept period, Velda and Kumo.ai collaborated closely to enhance reliability, performance, and enterprise readiness.

The following examples highlight improvements implemented during the process.

### Performance Improvements

1. **Worker Startup Optimization**

   Reduced the time to start a workload on a new on-demand GPU instance to **approximately 20 seconds**, significantly improving responsiveness during development.

2. **Health and Reliability Enhancements**

   Automated **instance health checks** and **reset-session commands** were introduced to detect and recover from non-responsive sessions, improving overall stability for development workloads.

### Enterprise & DevOps Readiness

1. **Custom Security Configuration**

   Velda environments now support **data encryption, IAM integration, and firewall customization** aligned with Kumo’s internal compliance and network requirements.

2. **Infrastructure as Code**

   An **easy-to-use Terraform module** was developed so that ML engineers and DevOps teams can provision and adjust GPU pools independently, simplifying scaling and configuration management (see the sketch after this list).

3. **Usage Tracking and Audit Integration**

   To align with enterprise governance requirements, **usage tracking and audit logging** were added. This enables visibility into resource usage and session history for cost monitoring and compliance.

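As a loose illustration of the Infrastructure-as-Code workflow above, provisioning a pool from the command line might look like this; the variable names (`pool_name`, `max_nodes`) and values are hypothetical, not the module’s actual interface:

```bash
# Hypothetical use of the GPU-pool Terraform module; variable names
# and values are illustrative only.
terraform init
terraform plan  -var="pool_name=gpu-l40s" -var="max_nodes=8"
terraform apply -var="pool_name=gpu-l40s" -var="max_nodes=8"
```
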
### Ongoing Exploration

1. **Batch Mode for Long-Running Jobs**

   A new **batch execution mode** with live log streaming is under active testing. It allows jobs to continue running even if a session disconnects, so developers can detach and later reattach to view logs or results (conceptually similar to the plain-shell sketch after this list).

2. **Extended Use Cases**

   Both teams are exploring how Velda can support **CI/CD integration and pipeline automation**, extending its capabilities beyond interactive experimentation.

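For the batch mode in item 1, the detach-and-reattach pattern it targets is conceptually similar to the classic `nohup` workflow sketched below; this is a plain-shell analogy, not Velda’s actual batch syntax, which is still being tested:

```bash
# Plain-shell analogy for detach/reattach (not Velda's batch syntax):
# start a long run that survives the terminal closing...
nohup vrun -P gpu-a100-8 python train.py > run.log 2>&1 &

# ...then "reattach" later from another session by streaming the log.
tail -f run.log
```
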
## Conclusion

The proof-of-concept process between Kumo.ai and Velda has been an ongoing, collaborative effort focused on improving reliability, security, and developer productivity.

The work to date has already delivered tangible results—faster iteration, higher GPU utilization, and smoother onboarding—while preserving Kumo’s existing production setup and integrations.

Both teams continue to collaborate on new capabilities and deployment patterns, using practical feedback from daily use to guide Velda’s evolution into a more flexible and enterprise-ready development platform.