Proposal: Query, Signal, and Ranking Pipeline for Drop Discovery #55

@Breee

Goal

Extend Drop's DiscoveryPolicy so image discovery can rank images with more than a single usage-count score.

The proposed model is:

queries -> signals -> ranking -> selected images

This separates data collection from scoring:

  • Queries fetch raw data from systems such as Prometheus or Loki.
  • Signals derive named per-image metrics from query results.
  • Ranking strategies combine one or more signals into the final ordered image list.

The goal is to support practical image prewarming strategies for Kubernetes CI/CD workloads, especially GitLab Kubernetes executor node pools.


Problem

A simple count-based discovery strategy answers:

Which images appeared most often?

That is useful, but incomplete.

CI workloads have different shapes:

  • some images are used steadily throughout the day,
  • some images are used mainly during developer feedback hours,
  • some images appear in short high-concurrency bursts,
  • some images are used in nightly validation jobs,
  • some images are not frequent but are expensive when cold,
  • some images matter because node rotation leaves many nodes cold for them.

To support these cases, Drop needs named input data, reusable derived signals, and explicit ranking logic.


Design Overview

A DiscoveryPolicy should define:

spec:
  queries: []
  signals: []
  ranking: {}

Query

A query fetches raw observations.

Examples:

  • Prometheus range query for image usage.
  • Loki range query for Kubernetes image pull events.
  • Future external pull-cost profile.

Signal

A signal derives a named per-image value from query results.

Examples:

  • total-usage
  • peak-concurrency
  • developer-weighted-usage
  • recent-usage
  • p50-cold-pull-time

Ranking

A ranking strategy combines signals into the final score.

Examples:

  • rank by one signal,
  • weighted sum of normalized signals,
  • model-aware exposure score.

Discovery Strategies

1. Total Usage

Ranks images by total observed usage over a lookback window W, where count_I(t) is the number of containers running image I at sample time t.

score(I) = sum(count_I(t) for t in W)

Required signal:

total-usage

Required query:

Prometheus image-usage range query

Use when:

  • the workload is stable,
  • the goal is a simple hot-image baseline,
  • the user wants the most commonly observed images.

Limitation:

  • May miss images that are not globally frequent but appear in large bursts.

2. Peak Same-Image Concurrency

Ranks images by maximum observed concurrent usage.

score(I) = max(count_I(t) for t in W)

Required signal:

peak-concurrency

Required query:

Prometheus image-usage range query

Use when:

  • CI has fan-out stages,
  • CI has scheduled high-volume jobs,
  • nightly validation jobs create many Pods using the same image,
  • registry pressure from synchronized cold pulls is a concern.

Limitation:

  • A rare spike can dominate if this is used alone.

3. Developer-Time Weighted Usage

Ranks images by usage during configured developer feedback windows.

score(I) = sum(weight(t) * count_I(t) for t in W)

Example weighting:

Time window    Weight
07:00-09:00    0.3
09:00-17:00    1.0
17:00-20:00    0.3
otherwise      0.0

Required signal:

developer-weighted-usage

Required query:

Prometheus image-usage range query

Use when:

  • optimizing developer feedback time,
  • the team has known working-hour patterns,
  • interactive CI matters more than background/nightly work.

Limitation:

  • Requires timezone and window configuration.
  • May not fit globally distributed teams without multiple windows or broader policies.

4. Recent Usage

Ranks images by usage in a short recent window.

score(I) = sum(count_I(t) for t in recent window)

Required signal:

recent-usage

Required query:

Prometheus image-usage range query

Use when:

  • image usage changes quickly,
  • new images are introduced often,
  • short-lived project activity should influence prewarming.

Limitation:

  • Can overreact to temporary spikes.

5. Hybrid Usage + Peak Concurrency

Balances generally hot images and burst-heavy images.

score(I) =
  alpha * normalize(total_usage(I))
  + (1 - alpha) * normalize(peak_concurrency(I))

Example:

alpha = 0.7

Meaning:

70% total usage
30% peak concurrency

Required signals:

total-usage
peak-concurrency

Required query:

Prometheus image-usage range query

Use when:

  • the cluster has mixed workloads,
  • both steady hot images and bursty images matter,
  • pure count and pure max are both too narrow.

Limitation:

  • Requires normalization and explainable status output.

6. Hybrid Developer-Time Usage + Peak Concurrency

Balances developer-feedback relevance with burst detection.

score(I) =
  alpha * normalize(developer_weighted_usage(I))
  + (1 - alpha) * normalize(peak_concurrency(I))

Required signals:

developer-weighted-usage
peak-concurrency

Required query:

Prometheus image-usage range query

Use when:

  • developer feedback is the primary goal,
  • but off-hour bursts still matter operationally.

Limitation:

  • Requires both time-window weighting and normalization.

7. Count × Pull Time

Ranks images by total usage multiplied by p_hat(I), the measured or estimated cold pull time of image I (for example the p50 or p95 of observed pull durations).

score(I) = total_usage(I) * p_hat(I)

Required signals:

total-usage
p50-cold-pull-time

or:

total-usage
p95-cold-pull-time

Required queries:

Prometheus image-usage query
Loki pull-event query or external pull-cost profile

Use when:

  • image pull costs vary significantly,
  • a medium-frequency but expensive image should outrank a tiny frequent image.

Limitation:

  • Requires per-image pull-time estimates.
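
A minimal sketch of the Count × Pull Time score, using made-up usage and pull-time numbers. The function name `count_times_pull_time` is hypothetical, and skipping images without a pull-time estimate is an assumption, not a decision made in this proposal.

```python
def count_times_pull_time(total_usage: dict, pull_seconds: dict) -> dict:
    """score(I) = total_usage(I) * p_hat(I).

    Assumption: images without a pull-time estimate are skipped.
    """
    return {
        image: usage * pull_seconds[image]
        for image, usage in total_usage.items()
        if image in pull_seconds
    }


# Hypothetical numbers for illustration only.
total_usage = {"ci/alpine-tools:3": 900, "ci/java-gradle:21": 300}
p50_cold_pull_seconds = {"ci/alpine-tools:3": 2.0, "ci/java-gradle:21": 42.3}

scores = count_times_pull_time(total_usage, p50_cold_pull_seconds)
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these inputs the medium-frequency but expensive image ranks first, even though the small image is used three times as often.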

8. Developer-Weighted Count × Pull Time

Ranks developer-relevant images by estimated cold-start cost.

score(I) = developer_weighted_usage(I) * p_hat(I)

Required signals:

developer-weighted-usage
p50-cold-pull-time

Required queries:

Prometheus image-usage query
Loki pull-event query or external pull-cost profile

Use when:

  • the goal is reducing developer-facing affected job-minutes.

Limitation:

  • Requires time-window configuration and pull-time estimates.

9. Model-Aware Exposure

Ranks images by estimated post-rotation cold-node exposure.

score(I) =
  J_target(I)
  * cold_fraction_hat(I)
  * p_hat(I)

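with:

cold_fraction_hat(I) = (1 - 1/N) ^ J_pre(I)

This is the probability that a given node received none of the J_pre(I) pre-window uses, assuming each use lands on one of the N nodes uniformly and independently.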

Where:

  • N is the number of eligible CI nodes,
  • J_pre(I) is usage before the target window,
  • J_target(I) is usage during the target window,
  • p_hat(I) is measured or estimated image availability time.
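
The formula above can be sketched as follows. The function names are hypothetical, the inputs are invented, and J_pre(I) is treated as a count of independent, uniformly placed uses, which is a modelling assumption rather than something the usage query guarantees.

```python
def cold_fraction(node_count: int, pre_window_usage: float) -> float:
    """(1 - 1/N) ^ J_pre: expected share of nodes that did not see the image before the window."""
    return (1.0 - 1.0 / node_count) ** pre_window_usage


def exposure_score(node_count, pre_window_usage, target_window_usage, pull_seconds):
    """score(I) = J_target(I) * cold_fraction_hat(I) * p_hat(I)."""
    return target_window_usage * cold_fraction(node_count, pre_window_usage) * pull_seconds


# Hypothetical inputs: 100 nodes, 42.3 s cold pull, 500 uses in the target window.
rarely_used_before = exposure_score(100, pre_window_usage=10, target_window_usage=500, pull_seconds=42.3)
already_warm = exposure_score(100, pre_window_usage=400, target_window_usage=500, pull_seconds=42.3)
```

An image that was barely used before the target window leaves most nodes cold, so it scores far higher than an image with the same target-window usage that already ran many times overnight.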

Required signals:

pre-window-usage
target-window-usage
p50-cold-pull-time

Required configuration:

nodeCount

Required queries:

Prometheus image-usage query
Loki pull-event query or external pull-cost profile

Use when:

  • prewarming should be node-rotation-aware,
  • enough observability exists to estimate pull time,
  • the user wants a closer approximation of affected job-minutes.

Limitation:

  • More assumptions than usage-only strategies.
  • Should be implemented as a typed ranking strategy.

Required Pipeline Capabilities

Query Types

Prometheus

Used for:

  • total usage,
  • peak concurrency,
  • developer-time usage,
  • recent usage,
  • pre-window usage,
  • target-window usage.

Normalized output:

timestamp,image,value

Loki

Used for Kubernetes image-pull event analysis when Prometheus does not expose useful per-image pull durations.

Normalized output:

timestamp,pod,image,reason,message

Pull Cost Profile

Optional future alternative to Loki.

Normalized output:

image,p50ColdPullSeconds,p95ColdPullSeconds,sampleCount

This can be generated by an external analyzer if pull-time parsing should not live inside the Drop controller.


Signal Types

Signal type            Purpose                                       Example signals
aggregate              Aggregate all samples per image               total-usage, peak-concurrency
timeWeightedAggregate  Apply time-window weights before aggregation  developer-weighted-usage
windowAggregate        Aggregate a specific sub-window               recent-usage, pre-window-usage, target-window-usage
eventPullTime          Derive pull-time stats from events            p50-cold-pull-time, p95-cold-pull-time

Ranking Strategies

Ranking strategy  Purpose
signal            Rank directly by one signal
weightedSum       Combine normalized signals
modelExposure     Rank by expected post-rotation exposure

Proposed CRD Shape

Overview

apiVersion: drop.corewire.io/v1alpha1
kind: DiscoveryPolicy
metadata:
  name: gitlab-runner-discovery
spec:
  syncInterval: 1h
  maxImages: 30

  queries: []
  signals: []
  ranking: {}

Queries

Prometheus Image Usage Query

queries:
  - name: runner-image-usage
    type: prometheus
    prometheus:
      endpoint: https://mimir.example.com
      queryType: range
      lookback: 168h
      step: 1m
      query: |
        count(
          container_memory_working_set_bytes{
            container!="",
            container!="POD",
            namespace="gitlab-runner",
            pod=~"runner-.*"
          }
        ) by (image)

Every series returned by the query must carry an image label.

Normalized result:

timestamp,image,value

Example:

2026-06-18T09:00:00Z,registry.example.com/ci/node-build:22,18
2026-06-18T09:01:00Z,registry.example.com/ci/node-build:22,21
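
A minimal sketch of this normalization step, assuming the standard Prometheus HTTP API range-query response (`resultType: matrix`). The `Sample` tuple and `normalize_matrix` are hypothetical names, and skipping series without an image label is an assumption, since the proposal leaves "reject or ignore" open.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Sample(NamedTuple):
    timestamp: datetime
    image: str
    value: float


def normalize_matrix(response: dict) -> list[Sample]:
    """Flatten a Prometheus range-query response into (timestamp, image, value) rows."""
    data = response["data"]
    if data["resultType"] != "matrix":
        raise ValueError("expected a range-query (matrix) result")
    rows = []
    for series in data["result"]:
        image = series["metric"].get("image")
        if not image:
            # Assumption: series without an image label are skipped.
            continue
        for ts, raw in series["values"]:
            # Prometheus encodes sample values as strings.
            when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
            rows.append(Sample(when, image, float(raw)))
    return rows


response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"image": "registry.example.com/ci/node-build:22"},
                "values": [[1781773200, "18"], [1781773260, "21"]],
            },
            {"metric": {}, "values": [[1781773200, "3"]]},
        ],
    },
}
rows = normalize_matrix(response)
```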

Loki Image Pull Event Query

queries:
  - name: image-pull-events
    type: loki
    loki:
      endpoint: https://loki.example.com
      queryType: range
      lookback: 168h
      query: |
        {job="kubernetes-events", namespace="gitlab-runner"}
        | json
        | involvedObject_name =~ "runner-.*"
        | reason =~ "Pulling|Pulled|Failed|BackOff"
      parser:
        type: kubernetesEvents
        podField: involvedObject_name
        reasonField: reason
        messageField: message
        imageField: message

Normalized result:

timestamp,pod,image,reason,message

Expected event messages include:

Pulling image "registry.example.com/ci/java-gradle:21"
Successfully pulled image "registry.example.com/ci/java-gradle:21" in 42.3s
Container image "registry.example.com/ci/java-gradle:21" already present on machine
Failed to pull image "registry.example.com/ci/java-gradle:21"
Back-off pulling image "registry.example.com/ci/java-gradle:21"
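
A sketch of how a kubernetesEvents parser might extract the image and the reported pull duration from these messages. The regular expressions and function names are hypothetical, and the duration parser only covers Go-style components made of h, m, s, and ms.

```python
import re
from typing import Optional

IMAGE_RE = re.compile(r'image "([^"]+)"')
PULLED_RE = re.compile(r'^Successfully pulled image "[^"]+" in (\S+)')
# Go-style duration components such as "1m2.5s" or "850ms".
PART_RE = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def image_from_message(message: str) -> Optional[str]:
    """Return the quoted image reference in an event message, if present."""
    match = IMAGE_RE.search(message)
    return match.group(1) if match else None


def pull_seconds(message: str) -> Optional[float]:
    """Return the pull duration reported in a Pulled message, if any."""
    match = PULLED_RE.match(message)
    if not match:
        return None
    parts = PART_RE.findall(match.group(1))
    if not parts:
        return None
    return sum(float(number) * UNIT_SECONDS[unit] for number, unit in parts)


def is_cache_hit(message: str) -> bool:
    """True for events that report the image as already on the node."""
    return "already present on machine" in message
```

Keeping cache-hit detection separate from duration parsing makes it straightforward to exclude those events from cold-pull statistics.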

Signals

aggregate

Aggregates all samples per image.

Supported methods:

sum
max
avg
count
min

Total usage:

signals:
  - name: total-usage
    queryRef: runner-image-usage
    type: aggregate
    aggregate:
      method: sum

Peak concurrency:

signals:
  - name: peak-concurrency
    queryRef: runner-image-usage
    type: aggregate
    aggregate:
      method: max
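
A minimal sketch of the aggregate signal over normalized timestamp,image,value rows. The `aggregate` function and the `METHODS` table are hypothetical names. It shows one query result feeding both total-usage and peak-concurrency.

```python
from collections import defaultdict
from statistics import fmean

# Maps the proposed method names to Python reducers.
METHODS = {
    "sum": sum,
    "max": max,
    "min": min,
    "avg": fmean,
    "count": len,
}


def aggregate(rows, method):
    """Reduce (timestamp, image, value) rows to one value per image."""
    reducer = METHODS[method]
    per_image = defaultdict(list)
    for _timestamp, image, value in rows:
        per_image[image].append(value)
    return {image: reducer(values) for image, values in per_image.items()}


rows = [
    ("2026-06-18T09:00:00Z", "registry.example.com/ci/node-build:22", 18),
    ("2026-06-18T09:01:00Z", "registry.example.com/ci/node-build:22", 21),
    ("2026-06-18T09:00:00Z", "registry.example.com/ci/java-gradle:21", 40),
]
total_usage = aggregate(rows, "sum")
peak_concurrency = aggregate(rows, "max")
```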

timeWeightedAggregate

Applies configured time weights before aggregation.

signals:
  - name: developer-weighted-usage
    queryRef: runner-image-usage
    type: timeWeightedAggregate
    timeWeightedAggregate:
      method: sum
      timezone: Europe/Berlin
      defaultWeight: "0"
      windows:
        - startHour: 7
          endHour: 9
          weight: "0.3"
        - startHour: 9
          endHour: 17
          weight: "1.0"
        - startHour: 17
          endHour: 20
          weight: "0.3"
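
A sketch of how these windows could be applied, under two assumptions that the proposal does not state: endHour is exclusive, and the weight is chosen from the local hour of each sample. To stay independent of a tz database, the example uses a fixed UTC+2 offset as a stand-in for Europe/Berlin in June; a real implementation would resolve the IANA zone. All names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone, tzinfo

# (startHour, endHour, weight), mirroring the YAML above. Assumption: endHour is exclusive.
WINDOWS = [(7, 9, 0.3), (9, 17, 1.0), (17, 20, 0.3)]
DEFAULT_WEIGHT = 0.0


def weight_at(ts: datetime, tz: tzinfo) -> float:
    """Look up the weight for the local hour of a sample."""
    hour = ts.astimezone(tz).hour
    for start, end, weight in WINDOWS:
        if start <= hour < end:
            return weight
    return DEFAULT_WEIGHT


def time_weighted_sum(rows, tz: tzinfo) -> dict:
    """Sum weight(t) * value per image."""
    totals = defaultdict(float)
    for ts, image, value in rows:
        totals[image] += weight_at(ts, tz) * value
    return dict(totals)


# Stand-in for Europe/Berlin during summer time (UTC+2).
berlin_summer = timezone(timedelta(hours=2))
image = "registry.example.com/ci/node-build:22"
rows = [
    (datetime(2026, 6, 18, 6, 30, tzinfo=timezone.utc), image, 10),  # 08:30 local, weight 0.3
    (datetime(2026, 6, 18, 9, 0, tzinfo=timezone.utc), image, 18),   # 11:00 local, weight 1.0
    (datetime(2026, 6, 18, 20, 0, tzinfo=timezone.utc), image, 50),  # 22:00 local, weight 0.0
]
weighted = time_weighted_sum(rows, berlin_summer)
```

The large 22:00 sample contributes nothing, which is the intended effect of defaultWeight: "0".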

windowAggregate

Aggregates a specific time window.

Recent usage:

signals:
  - name: recent-usage
    queryRef: runner-image-usage
    type: windowAggregate
    windowAggregate:
      method: sum
      relativeWindow: 2h

Pre-window usage:

signals:
  - name: pre-window-usage
    queryRef: runner-image-usage
    type: windowAggregate
    windowAggregate:
      method: sum
      timezone: Europe/Berlin
      window:
        start: "00:00"
        end: "09:00"

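Target-window usage (this example names the signal developer-window-usage; it plays the target-window-usage role required by the model-aware exposure strategy):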

signals:
  - name: developer-window-usage
    queryRef: runner-image-usage
    type: windowAggregate
    windowAggregate:
      method: sum
      timezone: Europe/Berlin
      window:
        start: "09:00"
        end: "17:00"
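
A sketch of the two window forms, assuming a daily window matches the local time of day on every day in the lookback and that the end bound is exclusive. Neither rule is fixed by this proposal. The function names are hypothetical, and a fixed UTC+2 offset stands in for Europe/Berlin in June.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta, timezone, tzinfo


def daily_window_sum(rows, start: time, end: time, tz: tzinfo) -> dict:
    """Sum samples whose local time of day falls in [start, end) on any day."""
    totals = defaultdict(float)
    for ts, image, value in rows:
        if start <= ts.astimezone(tz).time() < end:
            totals[image] += value
    return dict(totals)


def relative_window_sum(rows, now: datetime, window: timedelta) -> dict:
    """Sum samples from the last `window` before `now` (relativeWindow)."""
    totals = defaultdict(float)
    for ts, image, value in rows:
        if now - window <= ts <= now:
            totals[image] += value
    return dict(totals)


berlin_summer = timezone(timedelta(hours=2))  # stand-in for Europe/Berlin (CEST)
image = "registry.example.com/ci/node-build:22"
rows = [
    (datetime(2026, 6, 18, 5, 0, tzinfo=timezone.utc), image, 4),    # 07:00 local
    (datetime(2026, 6, 18, 8, 0, tzinfo=timezone.utc), image, 12),   # 10:00 local
    (datetime(2026, 6, 18, 9, 30, tzinfo=timezone.utc), image, 7),   # 11:30 local
]
pre_window = daily_window_sum(rows, time(0, 0), time(9, 0), berlin_summer)
target_window = daily_window_sum(rows, time(9, 0), time(17, 0), berlin_summer)
recent = relative_window_sum(
    rows,
    now=datetime(2026, 6, 18, 10, 30, tzinfo=timezone.utc),
    window=timedelta(hours=2),
)
```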

eventPullTime

Derives image pull-time statistics from event records.

signals:
  - name: p50-cold-pull-time
    queryRef: image-pull-events
    type: eventPullTime
    eventPullTime:
      statistic: p50
      includeCacheHits: false
      durationMode: eventPair

Supported statistics:

p50
p90
p95
avg
max
count
failureCount
cacheHitCount

Supported duration modes:

Mode             Meaning
eventPair        Pulled.timestamp - Pulling.timestamp for the same Pod/image
messageDuration  Duration parsed from the Pulled event message

Cache hits (events reporting the image as "already present on machine") should be detected separately and excluded from cold-pull duration when:

includeCacheHits: false
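
A sketch of the eventPair mode with cache hits excluded, working on normalized timestamp,pod,image,reason,message records. The function names are hypothetical, and pairing each Pulling event with the next Pulled event for the same Pod and image is an assumption about how pairs are matched.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def event_pair_durations(events) -> dict:
    """Pair each Pulling event with the next Pulled event for the same (pod, image)."""
    started = {}
    durations = defaultdict(list)
    for ts, pod, image, reason, message in sorted(events, key=lambda e: e[0]):
        key = (pod, image)
        if reason == "Pulling":
            started[key] = ts
        elif reason == "Pulled":
            if "already present on machine" in message:
                continue  # cache hit: no cold pull happened
            begin = started.pop(key, None)
            if begin is not None:
                durations[image].append((ts - begin).total_seconds())
    return dict(durations)


def p50_cold_pull_time(events) -> dict:
    """Median cold-pull duration in seconds per image."""
    return {image: median(values) for image, values in event_pair_durations(events).items()}


t0 = datetime(2026, 6, 18, 9, 0, 0)
img = "registry.example.com/ci/java-gradle:21"
events = [
    (t0, "runner-1", img, "Pulling", f'Pulling image "{img}"'),
    (t0 + timedelta(seconds=40), "runner-1", img, "Pulled", f'Successfully pulled image "{img}" in 39.8s'),
    (t0 + timedelta(seconds=60), "runner-2", img, "Pulling", f'Pulling image "{img}"'),
    (t0 + timedelta(seconds=80), "runner-2", img, "Pulled", f'Successfully pulled image "{img}" in 19.9s'),
    (t0 + timedelta(seconds=90), "runner-3", img, "Pulled", f'Container image "{img}" already present on machine'),
]
p50 = p50_cold_pull_time(events)
```

Event timestamps are often coarser than the duration reported in the message, so eventPair is best read as an approximation. messageDuration may be more precise where the message includes a duration.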

Ranking Strategies

signal

Ranks directly by one signal.

ranking:
  strategy: signal
  signal:
    signalRef: total-usage

weightedSum

Combines normalized signals.

ranking:
  strategy: weightedSum
  weightedSum:
    normalize: minMax
    missingSignal: zero
    terms:
      - signalRef: total-usage
        weight: "0.7"
      - signalRef: peak-concurrency
        weight: "0.3"

Formula:

final_score(I) =
  0.7 * normalize(total_usage(I))
  + 0.3 * normalize(peak_concurrency(I))

Initial normalization method:

minMax

Formula:

normalized(x) = (x - min) / (max - min)

If all values are equal:

normalized(x) = 1
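
A minimal sketch of minMax normalization and the weighted sum, including the all-equal rule above. The function names are hypothetical. One assumption to note: missingSignal: zero is applied after normalization, so an image missing from a signal contributes 0 for that term without lowering the minimum used to scale other images.

```python
def min_max(values: dict) -> dict:
    """Scale values to [0, 1]; if all values are equal, every image gets 1."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {image: 1.0 for image in values}
    return {image: (value - low) / (high - low) for image, value in values.items()}


def weighted_sum(signals: dict, weights: dict) -> dict:
    """Combine normalized signals; a missing signal counts as 0 (missingSignal: zero)."""
    normalized = {name: min_max(values) for name, values in signals.items()}
    images = set().union(*(values.keys() for values in signals.values()))
    return {
        image: sum(weights[name] * normalized[name].get(image, 0.0) for name in weights)
        for image in images
    }


# Hypothetical signal values for three images; "c" has no peak-concurrency value.
signals = {
    "total-usage": {"a": 100, "b": 50, "c": 0},
    "peak-concurrency": {"a": 10, "b": 40},
}
scores = weighted_sum(signals, {"total-usage": 0.7, "peak-concurrency": 0.3})
```

With minMax, the lowest image in each signal is normalized to 0, so image "a" receives no credit for peak concurrency here even though its raw value is 10.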

modelExposure

Ranks by expected post-rotation exposure.

ranking:
  strategy: modelExposure
  modelExposure:
    nodeCount: 100
    preWindowUsageSignalRef: pre-window-usage
    targetWindowUsageSignalRef: developer-window-usage
    pullTimeSignalRef: p50-cold-pull-time

Formula:

score(I) =
  J_target(I)
  * (1 - 1/N) ^ J_pre(I)
  * p_hat(I)

Complete Examples

Example 1: Hybrid Usage and Peak Concurrency

apiVersion: drop.corewire.io/v1alpha1
kind: DiscoveryPolicy
metadata:
  name: gitlab-hybrid-usage-concurrency
spec:
  syncInterval: 1h
  maxImages: 30

  queries:
    - name: runner-image-usage
      type: prometheus
      prometheus:
        endpoint: https://mimir.example.com
        queryType: range
        lookback: 168h
        step: 1m
        query: |
          count(
            container_memory_working_set_bytes{
              container!="",
              container!="POD",
              namespace="gitlab-runner",
              pod=~"runner-.*"
            }
          ) by (image)

  signals:
    - name: total-usage
      queryRef: runner-image-usage
      type: aggregate
      aggregate:
        method: sum

    - name: peak-concurrency
      queryRef: runner-image-usage
      type: aggregate
      aggregate:
        method: max

  ranking:
    strategy: weightedSum
    weightedSum:
      normalize: minMax
      missingSignal: zero
      terms:
        - signalRef: total-usage
          weight: "0.7"
        - signalRef: peak-concurrency
          weight: "0.3"

Example 2: Developer-Time Usage and Peak Concurrency

apiVersion: drop.corewire.io/v1alpha1
kind: DiscoveryPolicy
metadata:
  name: gitlab-developer-and-burst
spec:
  syncInterval: 1h
  maxImages: 30

  queries:
    - name: runner-image-usage
      type: prometheus
      prometheus:
        endpoint: https://mimir.example.com
        queryType: range
        lookback: 168h
        step: 1m
        query: |
          count(
            container_memory_working_set_bytes{
              container!="",
              container!="POD",
              namespace="gitlab-runner",
              pod=~"runner-.*"
            }
          ) by (image)

  signals:
    - name: developer-weighted-usage
      queryRef: runner-image-usage
      type: timeWeightedAggregate
      timeWeightedAggregate:
        method: sum
        timezone: Europe/Berlin
        defaultWeight: "0"
        windows:
          - startHour: 7
            endHour: 9
            weight: "0.3"
          - startHour: 9
            endHour: 17
            weight: "1.0"
          - startHour: 17
            endHour: 20
            weight: "0.3"

    - name: peak-concurrency
      queryRef: runner-image-usage
      type: aggregate
      aggregate:
        method: max

  ranking:
    strategy: weightedSum
    weightedSum:
      normalize: minMax
      missingSignal: zero
      terms:
        - signalRef: developer-weighted-usage
          weight: "0.7"
        - signalRef: peak-concurrency
          weight: "0.3"

Example 3: Model-Aware Exposure

apiVersion: drop.corewire.io/v1alpha1
kind: DiscoveryPolicy
metadata:
  name: gitlab-model-aware-exposure
spec:
  syncInterval: 1h
  maxImages: 30

  queries:
    - name: runner-image-usage
      type: prometheus
      prometheus:
        endpoint: https://mimir.example.com
        queryType: range
        lookback: 168h
        step: 5m
        query: |
          count(
            container_memory_working_set_bytes{
              container!="",
              container!="POD",
              namespace="gitlab-runner",
              pod=~"runner-.*"
            }
          ) by (image)

    - name: image-pull-events
      type: loki
      loki:
        endpoint: https://loki.example.com
        queryType: range
        lookback: 168h
        query: |
          {job="kubernetes-events", namespace="gitlab-runner"}
          | json
          | involvedObject_name =~ "runner-.*"
          | reason =~ "Pulling|Pulled|Failed|BackOff"
        parser:
          type: kubernetesEvents
          podField: involvedObject_name
          reasonField: reason
          messageField: message
          imageField: message

  signals:
    - name: pre-window-usage
      queryRef: runner-image-usage
      type: windowAggregate
      windowAggregate:
        method: sum
        timezone: Europe/Berlin
        window:
          start: "00:00"
          end: "09:00"

    - name: developer-window-usage
      queryRef: runner-image-usage
      type: windowAggregate
      windowAggregate:
        method: sum
        timezone: Europe/Berlin
        window:
          start: "09:00"
          end: "17:00"

    - name: p50-cold-pull-time
      queryRef: image-pull-events
      type: eventPullTime
      eventPullTime:
        statistic: p50
        includeCacheHits: false
        durationMode: eventPair

  ranking:
    strategy: modelExposure
    modelExposure:
      nodeCount: 100
      preWindowUsageSignalRef: pre-window-usage
      targetWindowUsageSignalRef: developer-window-usage
      pullTimeSignalRef: p50-cold-pull-time

Status and Observability

The controller should expose enough status to explain every selected image.

Example:

status:
  lastRunTime: "2026-06-18T10:00:00Z"
  observedGeneration: 4

  queryResults:
    - name: runner-image-usage
      type: prometheus
      series: 30
      samples: 60480
      status: success

    - name: image-pull-events
      type: loki
      records: 1820
      status: success

  signalResults:
    - name: total-usage
      images: 30
      status: success

    - name: peak-concurrency
      images: 30
      status: success

  discoveredImages:
    - image: registry.example.com/ci/java-gradle:21
      rank: 1
      finalScore: "0.8768"
      selected: true
      signals:
        - name: total-usage
          rawValue: "8210"
          normalizedValue: "0.824"
        - name: peak-concurrency
          rawValue: "96"
          normalizedValue: "1.0"
      ranking:
        strategy: weightedSum
        terms:
          - signal: total-usage
            weight: "0.7"
            contribution: "0.5768"
          - signal: peak-concurrency
            weight: "0.3"
            contribution: "0.3"
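
The contributions in this status example can be reproduced with decimal arithmetic, which also illustrates why string-encoded decimals are convenient for explainable output. This is only a check of the example values above, not controller code.

```python
from decimal import Decimal

# (signal, weight, normalizedValue) taken from the status example above.
terms = [
    ("total-usage", Decimal("0.7"), Decimal("0.824")),
    ("peak-concurrency", Decimal("0.3"), Decimal("1.0")),
]

# contribution = weight * normalizedValue; finalScore is the sum of contributions.
contributions = {name: weight * normalized for name, weight, normalized in terms}
final_score = sum(contributions.values())
```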

Status output should support debugging:

  • query failures,
  • missing labels,
  • missing signals,
  • normalization values,
  • ranking contributions,
  • final selected images.

Validation Plan

Query Tests

  • Prometheus query results are normalized into timestamp,image,value.
  • Loki query results are normalized into timestamp,pod,image,reason,message.
  • Missing image labels are rejected or ignored according to defined behavior.
  • Query failures are surfaced in status.

Signal Tests

  • aggregate.sum
  • aggregate.max
  • aggregate.avg
  • aggregate.count
  • aggregate.min
  • timeWeightedAggregate
  • windowAggregate
  • eventPullTime

Ranking Tests

  • signal
  • weightedSum
  • modelExposure
  • missing signal handling,
  • normalization behavior,
  • deterministic tie-breaking.
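
One way to make tie-breaking deterministic is to sort by descending score and then by image reference. The secondary key is an assumption, since this proposal does not specify a tie-break rule, and `select_images` is a hypothetical name.

```python
def select_images(scores: dict, max_images: int) -> list:
    """Order by descending score, then by image name, and keep the top max_images."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [image for image, _score in ordered[:max_images]]


# Two images share the same score; the name decides their order.
scores = {"ci/b:1": 0.5, "ci/a:1": 0.5, "ci/c:1": 0.9}
selected = select_images(scores, max_images=2)
```

A test can then shuffle the input order and assert that the selected list never changes.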

Integration Tests

Use fake Prometheus and Loki responses to verify:

  • one query can feed multiple signals,
  • multiple signals can feed one ranking,
  • selected image order is deterministic,
  • status contains query, signal, and ranking details.

Implementation Split

Issue 1: CRD for Query, Signal, and Ranking Pipeline

Define the queries, signals, and ranking API.

Issue 2: Prometheus Query Execution

Implement named Prometheus range queries and normalized sample output.

Issue 3: Aggregate Signals

Implement:

aggregate.sum
aggregate.max
aggregate.avg
aggregate.count
aggregate.min

Issue 4: Basic Ranking

Implement signal ranking.

Issue 5: Weighted Ranking

Implement weightedSum ranking with minMax normalization.

Issue 6: Status Output

Expose query results, signal results, ranking contributions, and selected images.

Issue 7: Time-Based Signals

Implement:

timeWeightedAggregate
windowAggregate

Issue 8: Loki Query Source

Implement Loki range query support.

Issue 9: Event Pull-Time Signal

Implement eventPullTime.

Issue 10: Model-Aware Exposure Ranking

Implement typed modelExposure.

Issue 11: Documentation

Document:

  • total usage,
  • peak concurrency,
  • developer-time usage,
  • hybrid usage/concurrency,
  • pull-time-aware ranking,
  • model-aware exposure.

Design Decisions to Resolve

Missing signal behavior

Initial proposal:

missingSignal: zero

Alternative:

drop image from ranking if a required signal is missing
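
The two behaviors can be compared with a small sketch. The policy name "drop" and the function `apply_missing_policy` are hypothetical; only "zero" appears in the proposed API.

```python
def apply_missing_policy(signals: dict, policy: str) -> dict:
    """Return per-signal values over a common image set.

    "zero" keeps every image and fills gaps with 0; "drop" keeps only
    images that have a value for every signal.
    """
    images = set().union(*(values.keys() for values in signals.values()))
    if policy == "drop":
        images = {image for image in images if all(image in values for values in signals.values())}
    elif policy != "zero":
        raise ValueError(f"unknown policy: {policy}")
    return {
        name: {image: values.get(image, 0.0) for image in images}
        for name, values in signals.items()
    }


# Hypothetical values; image "c" has no pull-time estimate.
signals = {
    "total-usage": {"a": 100, "b": 50, "c": 5},
    "p50-cold-pull-time": {"a": 42.3, "b": 3.1},
}
zero_filled = apply_missing_policy(signals, "zero")
dropped = apply_missing_policy(signals, "drop")
```

For multiplicative strategies such as Count × Pull Time, a zero-filled pull time yields a score of 0, so the image stays in the ranking but at the bottom. In a weighted sum the image keeps the contribution of its other signals.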

Pull-time statistic

Initial proposal:

p50-cold-pull-time

Alternative:

p95-cold-pull-time

The choice should be configurable.

Pull-time source

Two options:

  1. Native Loki query and eventPullTime.
  2. External ImagePullCostProfile produced by a separate analyzer.

A native Loki source is convenient. An external profile may keep the controller simpler.


Recommendation

Adopt the queries -> signals -> ranking pipeline for Drop discovery.

This design supports:

  • multiple signals from one query,
  • true hybrid ranking,
  • Prometheus and Loki inputs,
  • pull-time-aware ranking,
  • model-aware exposure scoring,
  • explainable status output,
  • and a clean split into implementation PRs.

The first production-ready strategies should be:

signal(total-usage)
signal(peak-concurrency)
weightedSum(total-usage, peak-concurrency)
signal(developer-weighted-usage)
weightedSum(developer-weighted-usage, peak-concurrency)

The advanced strategy should be:

modelExposure(pre-window-usage, target-window-usage, p50-cold-pull-time)
