
Commit ebed713

Add kumo case study
1 parent d8f8923 commit ebed713

3 files changed: +152 −1 lines changed


.vitepress/data/blogPosts.ts

Lines changed: 20 additions & 1 deletion
```diff
@@ -12,6 +12,25 @@ export interface BlogPost {
 }
 
 export const blogPosts: BlogPost[] = [
+  {
+    "title": "Optimizing ML Development at Kumo.ai with Velda",
+    "slug": "kumo-optimizing-ml-development",
+    "description": "A case study: how Kumo.ai reduced iteration time, improved GPU utilization, and sped up onboarding by using Velda for development and experimentation.",
+    "excerpt": "How Kumo.ai used Velda to accelerate experiments, cut dependency update times, and increase GPU utilization across the team.",
+    "date": "2025-10-21",
+    "author": "Hema Raghavan",
+    "readingTime": "6 min",
+    "category": "Case Study",
+    "image": "https://cdn-images-1.medium.com/max/2400/1*72Jo87jR0xg_3zzBkwkNLQ.png",
+    "tags": [
+      "machine-learning",
+      "ml-workflow",
+      "cloud-computing",
+      "gpu-computing",
+      "developer-productivity",
+      "velda"
+    ]
+  },
   {
     "title": "Building a Scalable ML Workflow with Velda",
     "slug": "build-machine-learning-workflow",
@@ -103,4 +122,4 @@ export const blogPosts: BlogPost[] = [
       "machine-learning"
     ]
   }
-];
+];
```
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@

---
sidebar: false
title: "Optimizing ML Development at Kumo.ai with Velda"
description: "A case study: how Kumo.ai reduced iteration time, improved GPU utilization, and sped up onboarding by using Velda for development and experimentation."
date: 2025-10-21
author: Hema Raghavan
tags: [machine-learning, ml-workflow, cloud-computing, gpu-computing, developer-productivity, velda]
keywords: ["ML development", "GPU utilization", "Velda", "cloud development", "developer productivity", "MLOps"]
excerpt: "How Kumo.ai used Velda to accelerate experiments, cut dependency update times, and increase GPU utilization across the team."
image: "https://cdn-images-1.medium.com/max/2400/1*72Jo87jR0xg_3zzBkwkNLQ.png"
readingTime: "6 min"
category: "Case Study"
---

# Optimizing ML Development at Kumo.ai

## Background and Challenges

At [Kumo.ai](https://kumo.ai), our ML engineers originally developed on **per-developer AWS GPU VMs (T4)**. That setup worked in the early stages but became a bottleneck as [**KumoRFM**](https://kumorfm.ai/), our relational foundation model, grew in scale and complexity.

* Training required **L40S, A100, and sometimes multi-GPU nodes**, quickly increasing costs.

* Many GPUs sat idle between runs, leading to low utilization.

* Our **production Kubernetes cluster** was designed for **end-to-end (E2E) pipelines**, not iterative development. Running just one step meant repeating the full data-prep and orchestration each time.

* Our dependency stack (custom pip packages, private builds, and internal resolvers) made Docker rebuilds slow, often taking **10 minutes or more** for minor updates.

We needed a way to **iterate faster** and **experiment more flexibly** with GPUs, without disturbing production or waiting on heavy image builds.

## The Solution: A Developer Journey with Velda

### Onboarding & Environment Setup

With Velda, onboarding became instant. Each engineer received a **pre-configured development environment** ready to run out of the box, with no image builds or manual setup.

We reused our internal upgrade script, originally written for local installs; it now runs directly in Velda, so there is no need to build a separate Docker image just to run jobs in the cluster. The only change was forcing a GPU-enabled setup, even when starting from a CPU node.

Dependency updates that once took over ten minutes now complete in under a minute. A minimal sketch of the flow, assuming a hypothetical script name and flag (Kumo's actual upgrade script is internal):
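
```sh
# Hedged sketch: refresh dependencies inside the Velda instance directly,
# rather than rebuilding a Docker image. The script path and --force-gpu
# flag are illustrative assumptions, not Kumo's real tooling.
./scripts/upgrade_env.sh --force-gpu   # install GPU builds even on a CPU node
```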

### IDE Integration

Most of our engineers use **VS Code**, several use **Cursor**, and some connect over plain SSH. Velda's IDE plugin and CLI connect these editors directly to each developer's Velda instance, with **no change to existing workflows**.

We continue coding, running, and debugging exactly as before, just with on-demand GPUs available whenever needed.

### Experimentation

When training, we prefix commands with `vrun`, select the GPU type (T4, L4, L40S, A100, or multi-GPU), and start. Everything launches within a few seconds, and with logs streamed back live, it is hardly noticeable that the command is running remotely.

```diff
- python train.py
+ vrun -P gpu-a100-8 python train.py
```

Running multiple experiments in parallel is as simple as launching the command several times with different flags, which we often do when searching for the best hyperparameters. A minimal sketch (the pool name and the `--lr` flag are illustrative assumptions about our `train.py`, not `vrun` options):
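
```sh
# Hedged sketch: fan out one training script across several parallel GPU runs.
# The gpu-l40s pool name mirrors the post's gpu-a100-8 example; --lr is an
# assumed train.py flag, not part of vrun.
for lr in 1e-4 3e-4 1e-3; do
  vrun -P gpu-l40s python train.py --lr "$lr" &
done
wait   # block until all remote runs finish
```

This accelerates our development significantly.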

### Keeping Production Separate

Our **production pipeline remains on Kubernetes** because it is **customer-facing** and built around **many existing integrations**: data ingestion, observability, and internal orchestration systems.

We continue to rely on that cluster for end-to-end jobs, while **Velda powers development and experimentation**. This separation allows rapid iteration without touching production, while maintaining reliability for customer workloads.

---

## Key Highlights

* Development environments ready in minutes, with no manual setup or image builds.
* Existing dependency scripts run natively inside Velda.
* Full IDE integration for VS Code and Cursor, with no workflow changes.
* GPU types selected per run for maximum flexibility.
* Production workloads stay on Kubernetes for stability and customer integrations.
## Impact
77+
78+
| Metric | Before Velda | With Velda |
79+
|---|---|---|
80+
| Environment setup | Manual VM / container build | Instant, pre-configured |
81+
| Dependency updates | Docker rebuilds (10–15 min) | Direct install (< 1 min) |
82+
| GPU utilization | ~15% | ~90% |
83+
| Experiment throughput | 1–2 runs/day | 10+ runs/day |
84+
| Production infrastructure | K8s E2E pipelines | Still K8s (customer-facing) |
85+

## Improvements Identified & Ongoing Collaboration

During the proof-of-concept period, Velda and Kumo.ai collaborated closely to enhance reliability, performance, and enterprise readiness.

The following examples highlight improvements implemented during the process.

### Performance Improvements

1. **Worker Startup Optimization**

   Reduced the time to start a workload on a new on-demand GPU instance to **approximately 20 seconds**, significantly improving responsiveness during development.

2. **Health and Reliability Enhancements**

   Automated **instance health checks** and **reset-session commands** were introduced to detect and recover from unresponsive sessions, improving overall stability for development workloads.
### Enterprise & DevOps Readiness
103+
104+
1. **Custom Security Configuration**
105+
106+
Velda environments now support **data encryption, IAM integration and firewall customization** aligned with Kumo’s internal compliance and network requirements.
107+
108+
2. **Infrastructure as Code**
109+
An **easy-to-use Terraform module** was developed to allow ML engineers and DevOps teams to provision and adjust GPU pools independently, simplifying scaling and configuration management.

3. **Usage Tracking and Audit Integration**

   To align with enterprise governance requirements, **usage tracking and audit logging** were added, enabling visibility into resource usage and session history for cost monitoring and compliance.
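
A hedged sketch of the provisioning flow mentioned under *Infrastructure as Code* above; the variable names are assumptions for illustration, and the module's real inputs may differ:

```sh
# Hedged sketch: adjusting a GPU pool through the Terraform module.
# The pool_name and max_instances variables are illustrative assumptions;
# consult the module's documentation for its actual inputs.
terraform init
terraform apply -var="pool_name=gpu-l40s" -var="max_instances=8"
```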

### Ongoing Exploration

1. **Batch Mode for Long-Running Jobs**

   A new **batch execution mode** with live log streaming is under active testing. It allows jobs to keep running even if a session disconnects, so developers can detach and later reattach to view logs or results.

2. **Extended Use Cases**

   Both teams are exploring how Velda can support **CI/CD integration and pipeline automation**, extending its capabilities beyond interactive experimentation.

## Conclusion

The proof of concept between Kumo.ai and Velda has been an ongoing, collaborative effort focused on improving reliability, security, and developer productivity.

The work to date has already delivered tangible results: faster iteration, higher GPU utilization, and smoother onboarding, all while preserving Kumo's existing production setup and integrations.

Both teams continue to collaborate on new capabilities and deployment patterns, using practical feedback from daily use to guide Velda's evolution into a more flexible, enterprise-ready development platform.

sitemap.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -23,6 +23,7 @@ description: Complete site navigation for Velda - find all pages, blog posts, an
 
 ## 📰 Blog
 - [Blog Home](/blog/) - Latest posts and updates
+- [Optimizing ML Development at Kumo.ai with Velda](/blog/kumo-optimizing-ml-development) - *October 21, 2025*
 - [Velda Blog - Cloud Development Insights & Updates](/blog/) - *October 19, 2025*
 - [Building a Scalable ML Workflow with Velda](/blog/build-machine-learning-workflow) - *September 24, 2025*
 - [vrun is All You Need: Revolutionizing Development with One Command](/blog/vrun-is-all-you-need) - *September 14, 2025*
```
