35 changes: 29 additions & 6 deletions examples/azure/aks-new_cluster/README.md
@@ -7,6 +7,16 @@ This example creates the resources to run Anyscale on Azure AKS with either publ
The content of this module should be used as a starting point and modified to your own security and infrastructure
requirements.

## Available Node Pools

| Pool | VM Size | Capacity Type | Default Scale | Notes |
|------|---------|---------------|---------------|-------|
| `sys` | `Standard_D4s_v5` | On-demand | 3-5 nodes | Hosts system workloads and operator |
| `cpu8` | `Standard_D8s_v5` | On-demand | 0-10 nodes | Balanced CPU pool for moderate workloads |
| `cpu16` | `Standard_D16s_v5` | On-demand | 0-10 nodes | High-capacity CPU pool for larger workloads |

An accompanying Helm values file (`values/anyscale-operator.yaml`) maps these node pools to Anyscale instance types and pins the operator to the `cpu8` nodes.
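
A rough sketch of what that mapping could look like is below. The key names are hypothetical and only illustrate the idea; the authoritative schema is defined by the Anyscale operator chart, and the real content lives in `values/anyscale-operator.yaml` in this example. Only the `nodepool.anyscale.com/name` label is taken from the Terraform in this PR.

```yaml
# Hypothetical sketch -- key names are illustrative, not the chart's real schema.
operator:
  nodeSelector:
    nodepool.anyscale.com/name: cpu8      # pin the operator to the cpu8 pool
instanceTypes:
  cpu-8:
    nodeSelector:
      nodepool.anyscale.com/name: cpu8    # 8 vCPU on-demand pool
  cpu-16:
    nodeSelector:
      nodepool.anyscale.com/name: cpu16   # 16 vCPU on-demand pool
```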

## Getting Started

### Prerequisites
@@ -107,6 +117,22 @@ Ensure that you are logged into Anyscale with valid CLI credentials. (`anyscale

You will need an Anyscale platform API Key for the helm chart installation. You can generate one from the [Anyscale Web UI](https://console.anyscale.com/api-keys).

If you prefer to issue a key programmatically for a service account, use the Anyscale CLI:

```shell
anyscale service-account create-api-key --name <service-account-name>
```

The command returns the new token once; store it securely and then export it before running the Helm install (e.g. `export ANYSCALE_CLI_TOKEN=...`).

To rotate keys later, run:

```shell
anyscale service-account rotate-api-keys --name <service-account-name>
```

After rotation, update any deployments that rely on the token (for example, rerun the Helm upgrade with the new `anyscaleCliToken` value).

1. Using the output from the Terraform modules, register the Anyscale Cloud. It should look something like:

```shell
@@ -140,6 +166,7 @@ helm upgrade anyscale-operator anyscale/anyscale-operator \
--set-string region=<region> \
--set-string operatorIamIdentity=<anyscale_operator_client_id> \
--set-string workloadServiceAccountName=anyscale-operator \
-f values/anyscale-operator.yaml \
--namespace anyscale-operator \
--create-namespace \
-i
@@ -169,10 +196,7 @@ No modules.
|------|------|
| [azurerm_federated_identity_credential.anyscale_operator_fic](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/federated_identity_credential) | resource |
| [azurerm_kubernetes_cluster.aks](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster) | resource |
| [azurerm_kubernetes_cluster_node_pool.gpu_ondemand](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster_node_pool) | resource |
| [azurerm_kubernetes_cluster_node_pool.gpu_spot](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster_node_pool) | resource |
| [azurerm_kubernetes_cluster_node_pool.ondemand_cpu](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster_node_pool) | resource |
| [azurerm_kubernetes_cluster_node_pool.spot_cpu](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster_node_pool) | resource |
| [azurerm_kubernetes_cluster_node_pool.user](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/kubernetes_cluster_node_pool) | resource |
| [azurerm_resource_group.rg](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/resource_group) | resource |
| [azurerm_role_assignment.anyscale_blob_contrib](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/role_assignment) | resource |
| [azurerm_storage_account.sa](https://registry.terraform.io/providers/hashicorp/azurerm/4.26.0/docs/resources/storage_account) | resource |
@@ -187,10 +211,9 @@ No modules.
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_azure_subscription_id"></a> [azure\_subscription\_id](#input\_azure\_subscription\_id) | (Required) Azure subscription ID | `string` | n/a | yes |
| <a name="input_aks_cluster_name"></a> [aks\_cluster\_name](#input\_aks\_cluster\_name) | (Optional) Name of the AKS cluster (and related resources). | `string` | `"anyscale-demo"` | no |
| <a name="input_aks_cluster_name"></a> [aks\_cluster\_name](#input\_aks\_cluster\_name) | (Optional) Name of the AKS cluster (and related resources). | `string` | `"anyscale-aks-k8s"` | no |
| <a name="input_anyscale_operator_namespace"></a> [anyscale\_operator\_namespace](#input\_anyscale\_operator\_namespace) | (Optional) Kubernetes namespace for the Anyscale operator. | `string` | `"anyscale-operator"` | no |
| <a name="input_azure_location"></a> [azure\_location](#input\_azure\_location) | (Optional) Azure region for all resources. | `string` | `"West US"` | no |
| <a name="input_node_group_gpu_types"></a> [node\_group\_gpu\_types](#input\_node\_group\_gpu\_types) | (Optional) The GPU types of the AKS nodes.<br/>Possible values: ["T4", "A10", "A100", "H100"] | `list(string)` | <pre>[<br/> "T4"<br/>]</pre> | no |
| <a name="input_tags"></a> [tags](#input\_tags) | (Optional) Tags applied to all taggable resources. | `map(string)` | <pre>{<br/> "Environment": "dev",<br/> "Test": "true"<br/>}</pre> | no |

## Outputs
204 changes: 49 additions & 155 deletions examples/azure/aks-new_cluster/aks.tf
@@ -20,12 +20,12 @@ resource "azurerm_kubernetes_cluster" "aks" {
#checkov:skip=CKV_AZURE_4: "Ensure AKS logging to Azure Monitoring is Configured"
#checkov:skip=CKV_AZURE_227: "Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources"

name = var.aks_cluster_name
name = local.cluster_name
> **Member:** Can you keep the original name unchanged?

location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name

# lets kubectl talk to the API over the public FQDN
dns_prefix = "${var.aks_cluster_name}-dns"
dns_prefix = "${local.cluster_name}-dns"

# workload identity federation
oidc_issuer_enabled = true # publishes an OIDC issuer URL
Expand All @@ -36,16 +36,19 @@ resource "azurerm_kubernetes_cluster" "aks" {
#########################################################################
default_node_pool {
name = "sys"
vm_size = "Standard_D2s_v5"
vm_size = "Standard_D4s_v5"
vnet_subnet_id = azurerm_subnet.nodes.id
os_disk_size_gb = 64
type = "VirtualMachineScaleSets"

# autoscaler
# autoscaler tuned for resilient system services
auto_scaling_enabled = true
min_count = 1
max_count = 3
min_count = 3
> **Member:** Why change the min_count to 3? We should keep it as minimal as possible.

max_count = 5
> **Member:** Why change it to 5?


upgrade_settings {
> **Member:** Can you remove this upgrade_settings change? We prefer not to maintain any optional settings here. Users can customize on their own.
>
> **Author:** With this one, if we don't provide it, I think AKS rejects it - as in it has to have `upgrade_settings`.

max_surge = "33%"
}
}

#########################################################################
@@ -63,107 +66,45 @@ resource "azurerm_kubernetes_cluster" "aks" {
tags = var.tags
}

###############################################################################
# CPU NODE POOL (Standard_D16s_v5) OnDemand
###############################################################################
resource "azurerm_kubernetes_cluster_node_pool" "ondemand_cpu" {

#checkov:skip=CKV_AZURE_168: "Ensure Azure Kubernetes Cluster (AKS) nodes should use a minimum number of 50 pods"
#checkov:skip=CKV_AZURE_227: "Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources"

name = "cpu16"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id

vm_size = "Standard_D16s_v5"
mode = "User"
vnet_subnet_id = azurerm_subnet.nodes.id

auto_scaling_enabled = true
min_count = 0
max_count = 10

node_taints = [
"node.anyscale.com/capacity-type=ON_DEMAND:NoSchedule"
]

tags = var.tags
}

###############################################################################
# CPU NODE POOL (Standard_D16s_v5) Spot
###############################################################################
resource "azurerm_kubernetes_cluster_node_pool" "spot_cpu" {

#checkov:skip=CKV_AZURE_168: "Ensure Azure Kubernetes Cluster (AKS) nodes should use a minimum number of 50 pods"
#checkov:skip=CKV_AZURE_227: "Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources"

name = "cpu16spot"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id

vm_size = "Standard_D16s_v5"
mode = "User"
vnet_subnet_id = azurerm_subnet.nodes.id

auto_scaling_enabled = true
min_count = 0
max_count = 10

node_taints = [
"node.anyscale.com/capacity-type=SPOT:NoSchedule"
]

priority = "Spot"
eviction_policy = "Delete"

tags = var.tags
}

# USER NODE POOLS (CPU)
# Opinionated CPU node pools exposed to Anyscale users
locals {
gpu_pool_configs = {
T4 = {
name = "gput4"
vm_size = "Standard_NC16as_T4_v3"
product_name = "NVIDIA-T4"
gpu_count = "1"
user_node_pools = {
> **Member:** Where are those GPU node pools? They are necessary.
>
> **Author:** I will add them back once I can test them; currently I don't have GPU quota so couldn't test (WIP).

cpu8 = {
name = "cpu8"
vm_size = "Standard_D8s_v5"
min_count = 0
max_count = 10
node_labels = {
"node.anyscale.com/capacity-type" = "ON_DEMAND"
"nodepool.anyscale.com/name" = "cpu8"
}
node_taints = [
"node.anyscale.com/capacity-type=ON_DEMAND:NoSchedule"
]
}
A10 = {
name = "gpua10"
vm_size = "Standard_NV36ads_A10_v5"
product_name = "NVIDIA-A10"
gpu_count = "1"
cpu16 = {
name = "cpu16"
vm_size = "Standard_D16s_v5"
min_count = 0
max_count = 10
node_labels = {
"node.anyscale.com/capacity-type" = "ON_DEMAND"
"nodepool.anyscale.com/name" = "cpu16"
}
node_taints = [
"node.anyscale.com/capacity-type=ON_DEMAND:NoSchedule"
]
}
A100 = {
name = "gpua100"
vm_size = "Standard_NC24ads_A100_v4"
product_name = "NVIDIA-A100"
gpu_count = "1"
}
H100 = {
name = "gpuh100x8"
vm_size = "Standard_ND96isr_H100_v5"
product_name = "NVIDIA-H100"
gpu_count = "8"
}
}

# keep only the types the caller asked for
selected_gpu_pools = {
for k, v in local.gpu_pool_configs :
k => v if contains(var.node_group_gpu_types, k)
}
}

###############################################################################
# GPU Node POOL (Standard_NC16as_T4_v3) OnDemand
###############################################################################
resource "azurerm_kubernetes_cluster_node_pool" "user" {

#trivy:ignore:avd-azu-0168
#trivy:ignore:avd-azu-0227
resource "azurerm_kubernetes_cluster_node_pool" "gpu_ondemand" {
#checkov:skip=CKV_AZURE_168
#checkov:skip=CKV_AZURE_227
#checkov:skip=CKV_AZURE_168: "Ensure Azure Kubernetes Cluster (AKS) nodes should use a minimum number of 50 pods"
#checkov:skip=CKV_AZURE_227: "Ensure that the AKS cluster encrypt temp disks, caches, and data flows between Compute and Storage resources"

for_each = local.selected_gpu_pools
for_each = local.user_node_pools

name = each.value.name
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
@@ -172,74 +113,27 @@ resource "azurerm_kubernetes_cluster_node_pool" "user" {
mode = "User"
vnet_subnet_id = azurerm_subnet.nodes.id

# ── autoscaling (shared across all pools) ───────────────────────────────────
auto_scaling_enabled = true
min_count = 0
max_count = 10

upgrade_settings { max_surge = "1" }

# ── labels & taints ────────────────────────────────────────────────────────
node_labels = {
"nvidia.com/gpu.product" = each.value.product_name
"nvidia.com/gpu.count" = each.value.gpu_count
}

node_taints = [
"node.anyscale.com/capacity-type=ON_DEMAND:NoSchedule",
"nvidia.com/gpu=present:NoSchedule",
"node.anyscale.com/accelerator-type=GPU:NoSchedule",
]
min_count = each.value.min_count
max_count = each.value.max_count

tags = var.tags
}

###############################################################################
# GPU Node POOL (Standard_NC16as_T4_v3) Spot
###############################################################################
#trivy:ignore:avd-azu-0168
#trivy:ignore:avd-azu-0227
resource "azurerm_kubernetes_cluster_node_pool" "gpu_spot" {
#checkov:skip=CKV_AZURE_168
#checkov:skip=CKV_AZURE_227

for_each = local.selected_gpu_pools

name = "${each.value.name}spot"
kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
node_taints = each.value.node_taints
node_labels = merge(each.value.node_labels, {
"nodepool.anyscale.com/type" = "cpu"
})

vm_size = each.value.vm_size
mode = "User"
vnet_subnet_id = azurerm_subnet.nodes.id

# ── autoscaling (shared across all pools) ───────────────────────────────────
auto_scaling_enabled = true
min_count = 0
max_count = 10

# ── labels & taints ────────────────────────────────────────────────────────
node_labels = {
"nvidia.com/gpu.product" = each.value.product_name
"nvidia.com/gpu.count" = each.value.gpu_count
upgrade_settings {
max_surge = "1"
}

node_taints = [
"node.anyscale.com/capacity-type=ON_DEMAND:NoSchedule",
"nvidia.com/gpu=present:NoSchedule",
"node.anyscale.com/accelerator-type=GPU:NoSchedule",
]

priority = "Spot"
eviction_policy = "Delete"

tags = var.tags
}

##############################################################################
# MANAGED IDENTITY FOR ANYSCALE OPERATOR
###############################################################################
resource "azurerm_user_assigned_identity" "anyscale_operator" {
name = "${var.aks_cluster_name}-anyscale-operator-mi"
name = "${local.cluster_name}-anyscale-operator-mi"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
}
24 changes: 18 additions & 6 deletions examples/azure/aks-new_cluster/main.tf
@@ -1,13 +1,25 @@
resource "random_string" "storage_suffix" {
> **Member:** Can you revert this change? We prefer not to have any randomness here. Otherwise debugging issues would be harder.

length = 6
upper = false
lower = true
numeric = true
special = false
}

locals {
vnet_cidr = "10.42.0.0/16"
nodes_subnet_cidr = "10.42.1.0/24"
vnet_cidr = "10.42.0.0/16"
nodes_subnet_cidr = "10.42.1.0/24"
cluster_name = var.aks_cluster_name
cluster_name_sanitized = join("", regexall("[a-z0-9]", lower(local.cluster_name)))
storage_account_name = substr("${local.cluster_name_sanitized}${random_string.storage_suffix.result}", 0, 24)
storage_container_name = "${local.cluster_name}-blob"
}
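
As a sanity check on the naming logic in these locals, the sanitization and truncation can be mirrored in Python. This is an illustrative sketch, not part of the example; the helper name is invented here.

```python
import re


def storage_account_name(cluster_name: str, suffix: str = "") -> str:
    """Mirror of the Terraform locals above (illustrative only).

    join("", regexall("[a-z0-9]", lower(name))) keeps only lowercase
    letters and digits; substr(..., 0, 24) respects Azure's 24-character
    limit on storage account names.
    """
    sanitized = "".join(re.findall(r"[a-z0-9]", cluster_name.lower()))
    return (sanitized + suffix)[:24]


print(storage_account_name("anyscale-aks-k8s", "x1y2z3"))  # anyscaleaksk8sx1y2z3
```

Note that the random suffix is appended before truncation, so a very long cluster name can still produce a valid 24-character account name.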

############################################
# resource group
############################################
resource "azurerm_resource_group" "rg" {
name = "${var.aks_cluster_name}-rg"
name = "${local.cluster_name}-rg"
location = var.azure_location
tags = var.tags
}
@@ -30,7 +42,7 @@ resource "azurerm_storage_account" "sa" {
#checkov:skip=CKV2_AZURE_21: "Ensure Storage logging is enabled for Blob service for read requests"
#checkov:skip=CKV2_AZURE_31: "Ensure VNET subnet is configured with a Network Security Group (NSG)"

name = replace("${var.aks_cluster_name}sa", "-", "") # demo-aks --> demoakssa
name = local.storage_account_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
account_tier = "Standard"
@@ -46,7 +58,7 @@ resource "azurerm_storage_container" "blob" {

#checkov:skip=CKV2_AZURE_21: "Ensure Storage logging is enabled for Blob service for read requests"

name = "${var.aks_cluster_name}-blob"
name = local.storage_container_name
storage_account_id = azurerm_storage_account.sa.id
container_access_type = "private" # blobs are private but reachable via the public endpoint
}
@@ -55,7 +67,7 @@ resource "azurerm_storage_container" "blob" {
# networking (vnet and subnet)
############################################
resource "azurerm_virtual_network" "vnet" {
name = "${var.aks_cluster_name}-vnet"
name = "${local.cluster_name}-vnet"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
address_space = [local.vnet_cidr]