
Releases: dstackai/dstack

0.19.15

19 Jun 20:49
c10adfb


Services

Rolling deployments

This update introduces rolling deployments, which help avoid downtime when deploying new versions of your services.

When you apply an updated service configuration, dstack will gradually replace old service replicas with new ones. You can track the progress in the dstack apply output — the deployment number will be lower for old replicas and higher for new ones.

> dstack apply -f my-service.dstack.yml

Active run my-service already exists. Detected configuration changes that can be updated in-place: ['image', 'env', 'commands']
Update the run? [y/n]: y

⠋ Launching my-service...
 NAME                            BACKEND          RESOURCES                        PRICE    STATUS       SUBMITTED
 my-service deployment=1                                                                    running      11 mins ago
   replica=0 job=0 deployment=0  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  terminating  11 mins ago
   replica=1 job=0 deployment=1  aws (us-west-2)  cpu=2 mem=1GB disk=100GB (spot)  $0.0026  running      1 min ago

Currently, the following service configuration properties can be updated using rolling deployments: resources, volumes, image, user, privileged, entrypoint, python, nvcc, single_branch, env, shell, and commands.

Future releases will allow updating more properties and deploying new git repo commits.
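As a sketch, updating any of the in-place-updatable properties in a service configuration like the one below would trigger a rolling deployment (the name, image, env value, command, and port here are placeholders, not taken from this release):

```yaml
type: service
name: my-service

# Changing image, env, or commands triggers a rolling deployment
image: ghcr.io/example/my-app:v2   # hypothetical image
env:
  - APP_VERSION=v2
commands:
  - python serve.py                # hypothetical entrypoint
port: 8000

replicas: 2
```

Applying this configuration over a running service replaces replicas one by one, as shown in the `dstack apply` output above.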

Clusters

Updated default Docker images

If you don't specify a custom image in the run configuration, dstack uses its default images. These images have been improved for cluster environments and now include mpirun and NCCL tests. Additionally, if you are running on AWS EFA-capable instances, dstack will now automatically select an image with the appropriate EFA drivers. See our new AWS EFA guide for more details.

Server

Health metrics

The dstack server now exports operational Prometheus metrics that allow you to monitor its health. If you are running your own production-grade dstack server installation, refer to the metrics docs for details.

What's changed

New Contributors

Full Changelog: 0.19.13...0.19.15

0.19.13

11 Jun 10:33
0e31236


Clusters

Built-in InfiniBand support in dstack Docker images

The dstack default Docker images now come with built-in InfiniBand support, including the necessary libibverbs library and the InfiniBand utilities from rdma-core. This means you can run torch.distributed and other workloads that use NCCL, and they'll take full advantage of InfiniBand without custom Docker images.
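As a quick sanity check (a sketch; the task name and node count are assumptions), a task can list the InfiniBand devices visible inside the default image using the rdma-core utilities:

```yaml
type: task
name: ib-check
nodes: 2

commands:
  - ibv_devices   # short summary of RDMA devices, from rdma-core
  - ibv_devinfo   # detailed per-device information
```

If the devices show up here, NCCL-based workloads should be able to use them without further image customization.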

You can try InfiniBand clusters with dstack on Nebius.

Built-in EFA support in dstack VM images

dstack now uses DLAMI as the default AWS GPU VM image instead of a custom one. DLAMI supports EFA out of the box, so you no longer need a custom VM image to take advantage of EFA.

Server

GCS support for code uploads

It's now possible to configure the dstack server to use GCP Cloud Storage for code uploads. Previously, only the database and S3 were supported as storage. Learn more in the Server deployment guide.

What's Changed

Full Changelog: 0.19.12...0.19.13

0.19.12

04 Jun 11:22
8732138


Clusters

Simplified use of MPI

startup_order and stop_criteria

New run configuration properties are introduced:

  • startup_order: any/master-first/workers-first specifies the order in which the master and worker jobs are started.
  • stop_criteria: all-done/master-done specifies the criteria when a multi-node run should be considered finished.

These properties simplify running certain multi-node workloads. For example, MPI requires that the workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done and dstack won't wait for the workers to exit.

DSTACK_MPI_HOSTFILE

dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.
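Putting the new properties and the hostfile together, an MPI task might look like the following sketch (the binary name, process count, and the use of DSTACK_NODE_RANK to branch on the master node are illustrative):

```yaml
type: task
name: mpi-task
nodes: 2

startup_order: workers-first  # mpirun needs the workers up before the master starts
stop_criteria: master-done    # the run finishes as soon as the master job exits

commands:
  - |
    if [ "$DSTACK_NODE_RANK" = "0" ]; then
      # Master: launch MPI across all nodes via the generated hostfile
      mpirun --hostfile "$DSTACK_MPI_HOSTFILE" -n 16 ./my_mpi_program  # hypothetical binary
    else
      # Workers: stay up; dstack stops them once the master is done
      sleep infinity
    fi
```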

See also the updated NCCL tests example.

CLI

We've also updated how the CLI displays run and job statuses. Previously, the CLI displayed an internal status code that was hard to interpret. Now, the STATUS column in dstack ps and dstack apply shows a status that makes it easy to understand why a run or job was terminated.

Examples

Distributed training

TRL

The new TRL example walks you through running distributed fine-tuning with TRL, Accelerate, and DeepSpeed.

Axolotl

The new Axolotl example walks you through running distributed fine-tuning with Axolotl and dstack.

What's changed

  • [Feature] Update .gitignore logic to catch more cases by @colinjc in #2695
  • [Bug] Increase upload_code client timeout by @r4victor in #2709
  • [Bug] Fix missing apt-get update by @r4victor in #2710
  • [Internal]: Update git hooks and package.json by @olgenn in #2706
  • [Examples] Add distributed Axolotl and TRL example by @Bihan in #2703
  • [Docs] Update dstack-proxy contributing guide by @jvstme in #2683
  • [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in #2718
  • [Feature] Implement startup_order and stop_criteria by @r4victor in #2714
  • [Bug] Fix CLI exiting while master starting by @r4victor in #2720
  • [Examples] Simplify NCCL tests example by @r4victor in #2723
  • [Examples] Update TRL Single Node example to uv by @Bihan in #2715
  • [Bug] Fix backward compatibility when creating fleets by @jvstme in #2727
  • [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in #2716
  • [Bug] Fix relative paths in dstack apply --repo by @jvstme in #2733
  • [Internal]: Drop hardcoded regions from the backend template by @jvstme in #2734
  • [Internal]: Update backend template to match ruff formatting by @jvstme in #2735

Full changelog: 0.19.11...0.19.12

0.19.12rc1

28 May 21:41
5d57e19


0.19.12rc1 Pre-release

What's Changed

  • Update gitignore logic to catch more cases by @colinjc in #2695

Full Changelog: 0.19.11...0.19.12rc1

0.19.11

28 May 09:52
25b977e


Runs

Replacing conda with uv

dstack's default Docker images now come with uv installed. Installing Python packages with uv can be significantly faster than with pip or conda. Here are, for example, uv vs pip times for installing torch on GCP VMs:

# time uv pip install torch
...
real    0m32.771s
user    0m29.070s
sys     0m8.300s
# time pip install torch
...
real    2m26.338s
user    1m37.514s
sys     0m16.711s

To continue supporting pip, dstack now automatically activates a virtual environment with pip available.
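For example (a sketch; the script name is a placeholder), a run configuration can use uv directly in its commands, since packages are installed into the pre-activated virtual environment:

```yaml
type: task
name: train

python: "3.12"

commands:
  - uv pip install torch   # typically much faster than pip install torch
  - python train.py        # hypothetical training script
```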

conda is no longer included in dstack's default Docker images. If you need to use conda, it should be installed manually:

commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"

Plugins

Built-in rest_plugin

dstack now supports a built-in rest_plugin that lets you write custom plugins as API servers, so you don't need to install plugins as Python packages.

Plugins implemented as API servers have advantages over plugins implemented as Python packages in some cases:

  • No dependency conflicts with dstack.
  • You can use any programming language.
  • If you run the dstack server via Docker, you don't need to extend the dstack server image with plugins or map them via volumes.

To get started, check out the plugin server example. The rest_plugin server API is documented here.

AWS

New CPU series

dstack now supports the most recent AWS CPU VMs based on Intel Xeon Sapphire Rapids: M7i, C7i, and R7i. It also adds support for the burstable T3 family. Previously, only M5, C5, and t2.small CPU instances were supported.

Azure

New CPU series

dstack now supports the most recent Azure CPU VMs based on Intel Xeon Sapphire Rapids: the general-purpose Dsv6 and memory-optimized Esv6 series. Previously, only the Dsv3, Esv4, and Fsv2 series were supported.

GCP

New CPU series

dstack now supports the most recent GCP CPU VMs: C4, M4, H3, N4, and N2. Previously, only E2 and M1 were supported.

Note that C4, M4, H3, and N4 instances do not currently support volumes, since they require Hyperdisk support.

Examples

Ray+RAGEN

The new Ray+RAGEN example shows how to use dstack and RAGEN to fine-tune an agent across multiple nodes.

Breaking changes

  • conda is no longer included in dstack's default Docker images.

Deprecations

  • Azure VM series Dsv3 and Esv4 are deprecated.

What's Changed

New Contributors

Full Changelog: 0.19.10...0.19.11

0.19.11rc2

27 May 09:29
3caddcb


0.19.11rc2 Pre-release

What's Changed

New Contributors

Full Changelog: 0.19.10...0.19.11rc2

0.19.10

21 May 09:45


Runs

Priorities

Run configurations now support a priority field. This is a new property that accepts a number between 0 and 100. The higher the number, the higher the priority of the run. This influences the order in which runs are provisioned and executed in the queue.

type: task
name: train

# Can be 0–100; higher means higher priority
priority: 50

python: "3.10"

# Commands of the task
commands:
  - pip install -r fine-tuning/qlora/requirements.txt
  - python fine-tuning/qlora/train.py

# Retry policy to queue tasks
retry:
  on_events: [no-capacity]
  duration: 1d

Previously, submitted jobs were processed in a FIFO manner, with older jobs handled first. Now, jobs are first sorted by descending priority. Note that if a high-priority run cannot be scheduled, it does not block lower-priority runs from being scheduled (i.e., best-effort FIFO).

Note

It can also be useful to combine priority with retry to ensure tasks remain queued in case of capacity limits.

The priority field is updatable, so it can be modified for already submitted runs and will take effect.

CLI

dstack project command

The new dstack project command replaces the existing dstack config command.

  1. dstack project (same as dstack project list):

$ dstack project

 PROJECT         URL                    USER            DEFAULT
 peterschmidt85  https://sky.dstack.ai  peterschmidt85
 main            http://127.0.0.1:3000  admin              ✓

  2. dstack project set-default:

$ dstack project set-default peterschmidt85
OK

  3. dstack project add (similar to the old dstack config, but --project is renamed to --name):

$ dstack project add --name peterschmidt85 --url https://sky.dstack.ai --token 76d8dd51-0470-74a7-24ed9ec18-fb7d341
OK

dstack ps -n/--last

The dstack ps command now supports a new -n/--last parameter to show last N runs:

✗ dstack ps -n 3
 NAME             BACKEND             RESOURCES                                    PRICE    STATUS      SUBMITTED    
 good-panther-2   gcp (europe-west4)  cpu=2 mem=8GB disk=100GB                     $0.0738  terminated  49 mins ago  
 new-chipmunk-1   azure (westeurope)  cpu=2 mem=8GB disk=100GB (spot)              $0.0158  terminated  23 hours ago 
 fuzzy-panther-1  runpod (EU-RO-1)    cpu=6 mem=31GB disk=100GB RTX2000Ada:16GB:1  $0.28    terminated  yesterday

Azure

Fsv2 series

The Azure backend now supports compute-optimized Fsv2 series:

✗ dstack apply -b azure
 Project              main                           
 User                 admin                          
 Configuration        .dstack.yml                    
 Type                 dev-environment                
 Resources            cpu=4.. mem=8GB.. disk=100GB.. 
 Spot policy          auto                           
 Max price            -                              
 Retry policy         -                              
 Creation policy      reuse-or-create                
 Idle duration        5m                             
 Max duration         -                              
 Inactivity duration  -                              
 Reservation          -                              

 #  BACKEND             RESOURCES                         INSTANCE TYPE      PRICE     
 1  azure (westeurope)  cpu=4 mem=8GB disk=100GB (spot)   Standard_F4s_v2    $0.0278   
 2  azure (westeurope)  cpu=4 mem=16GB disk=100GB (spot)  Standard_D4s_v3    $0.0312   
 3  azure (westeurope)  cpu=4 mem=32GB disk=100GB (spot)  Standard_E4-2s_v4  $0.0416   
    ...                                                                                
 Shown 3 of 98 offers, $40.962 max

Major bugfixes

  • [Bug]: Instances with blocks feature cannot be used for multi-node runs #2650

Deprecations

  • The dstack config CLI command is deprecated in favor of dstack project add.

What's changed

Full changelog: 0.19.9...0.19.10

0.19.9

15 May 09:51
2f96871


Metrics

Previously, dstack stored and displayed only the metrics from the last hour, so metrics for finished runs and jobs eventually disappeared. Now, dstack keeps the last-hour window of metrics for all finished runs.

AMD

On AMD, a wider range of ROCm/AMD SMI versions is now supported. Previously, for certain versions, metrics were not shown properly.

CLI

Container exit status

The CLI now displays the container exit status of each failed run or job.

This information is shown by dstack ps when you pass -v.

Server

Robust handling of networking issues

It sometimes happens that the dstack server cannot establish connections to running instances due to networking problems, or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server applies a grace period of 2 minutes before considering jobs failed when instances are unreachable.

Environment variables

Two new environment variables are now available within runs:

  • DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it is unique for every run.
  • DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
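These variables are handy for per-run artifacts. As a sketch (the paths and script name are arbitrary), a task could key its checkpoint directory by the run UUID, while logging each job submission separately:

```yaml
type: task
name: train

commands:
  - mkdir -p "/checkpoints/$DSTACK_RUN_ID"
  - echo "job $DSTACK_JOB_ID started" >> "/checkpoints/$DSTACK_RUN_ID/jobs.log"
  - python train.py --output-dir "/checkpoints/$DSTACK_RUN_ID"  # hypothetical script
```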

What's changed

New contributors

Full Changelog: 0.19.8...0.19.9

0.19.8

07 May 15:46
2e3da2c


Nebius

InfiniBand clusters

The nebius backend now supports InfiniBand clusters. A cluster is automatically created when you apply a fleet configuration with placement: cluster and supported GPUs: e.g. 8xH100 or 8xH200.

type: fleet
name: my-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100,H200:8

A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings.

Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.

ARM

dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:

resources:
  cpu: arm:4..  # 4 or more ARM cores

If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.

To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.

Lambda

GH200

With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

If Lambda has GH200 on-demand instances available, you'll see them when you run dstack apply:

$ dstack apply -f .dstack.yml

 #   BACKEND             RESOURCES                                      INSTANCE TYPE  PRICE
 1   lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB GH200:96GB:1  gpu_1x_gh200   $1.49

Note that if no GH200 is available at the moment, you can specify a retry policy in your run configuration so that dstack runs the configuration once the GPU becomes available.

Azure

Managed identities

The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.

projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity

Make sure that dstack has the required permissions for managed identities to work.

What's changed

  • Fix: handle OSError from os.get_terminal_size() in CLI table rendering for non-TTY environments by @vuyelwadr in #2599
  • Clarify how retry works for tasks and services by @r4victor in #2600
  • [Docs] Added Tenstorrent example by @peterschmidt85 in #2596
  • Lambda: Docker: use cgroupfs driver by @un-def in #2603
  • Don't collect Prometheus metrics on container-based backends by @un-def in #2605
  • Support Nebius InfiniBand clusters by @jvstme in #2604
  • Add ARM64 support by @un-def in #2595
  • Allow to configure Nebius InfiniBand fabrics by @jvstme in #2607
  • Support vm_managed_identity for Azure by @r4victor in #2608
  • Fix API quota hitting when provisioning many A3 instances by @r4victor in #2610

New contributors

Full changelog: 0.19.7...0.19.8

0.19.7

01 May 14:05
1321113


This update fixes multi-node fleet provisioning on GCP.

What's changed

  • Revert "Use AS_COMPACT collocation for gcp placement groups (#2587)" by @un-def in #2592

Full changelog: 0.19.6...0.19.7