Skip to content

Conversation

@dblinkhorn
Copy link

This PR introduces a singleton cache for Octokit clients, authentication objects, and GitHub App configuration, so we can avoid making redundant SSM lookups and client creation.

This PR represents the first step in what will be a broader effort to modularize GitHub interactions. Future PRs will build on this one by adding runner management, job operations, advanced caching for runner and job metadata, etc.

@dblinkhorn dblinkhorn requested review from a team as code owners November 18, 2025 22:47
@dblinkhorn dblinkhorn changed the title feat: Add centralized GitHub service module with caching for clients, auth, and config feat(control-plane): Centralize GitHub client/auth/config caching Nov 18, 2025
@npalm
Copy link
Member

npalm commented Nov 19, 2025

@dblinkhorn thx, not digged in the PR it self. But please can you run yarn format to fix the prettier errors in CI.

@dblinkhorn dblinkhorn force-pushed the dblinkhorn/github-service-module branch 3 times, most recently from 14bd0be to 5f34e10 Compare November 20, 2025 22:43
@npalm npalm requested a review from Copilot November 21, 2025 21:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a singleton caching layer for GitHub API clients, authentication objects, and app configuration to optimize performance by avoiding redundant SSM parameter lookups and client instantiations. The caching implementation uses in-memory Maps for storing Octokit clients, app/installation authentication objects, and GitHub App configuration.

Key changes:

  • Created a new cache module (github/cache.ts) with functions to cache and retrieve Octokit clients, authentication objects, and app configs
  • Refactored existing authentication functions in auth.ts to utilize the new caching layer
  • Added cache reset functionality in test setup to ensure test isolation

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
lambdas/functions/control-plane/src/github/types.ts Defines GithubAppConfig interface for GitHub App configuration with appId, privateKey, and optional installationId
lambdas/functions/control-plane/src/github/index.ts Exports cache and types modules to make them available to other parts of the application
lambdas/functions/control-plane/src/github/cache.ts Implements singleton cache using Map structures with cache key generators and getter functions for clients, auth objects, and configs
lambdas/functions/control-plane/src/github/auth.ts Integrates caching into existing authentication and client creation functions to avoid redundant SSM lookups
lambdas/functions/control-plane/src/github/auth.test.ts Adds cache reset call in test setup and updates import statement to use type imports
lambdas/functions/control-plane/src/local.ts Removes an eslint-disable comment that is no longer needed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 10 to 12
export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
return `octokit-${token}-${ghesApiUrl}`;
}
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing tokens directly in cache keys is a security concern as it could expose sensitive credentials in memory dumps or logs. The full token value should not be part of the cache key.

Consider using a hash of the token instead:

import { createHash } from 'crypto';

export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
  const tokenHash = createHash('sha256').update(token).digest('hex').substring(0, 16);
  return `octokit-${tokenHash}-${ghesApiUrl}`;
}

Copilot uses AI. Check for mistakes.
export interface GithubAppConfig {
appId: number;
privateKey: string;
installationId?: number;
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The installationId field is defined in the GithubAppConfig interface but is never populated or used. It should either be:

  1. Removed from the interface if not needed, or
  2. Populated in the createAuth function (line 109-112) if it's intended to be part of the config

Currently, the installationId is handled separately in createAuth (line 119) rather than being part of the cached config.

Suggested change
installationId?: number;

Copilot uses AI. Check for mistakes.
Comment on lines 10 to 67
export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
return `octokit-${token}-${ghesApiUrl}`;
}

export function createAuthCacheKey(type: 'app' | 'installation', installationId?: number, ghesApiUrl: string = '') {
const id = installationId || 'none';
return `${type}-auth-${id}-${ghesApiUrl}`;
}

export function createAuthConfigCacheKey(ghesApiUrl: string = '') {
return `auth-config-${ghesApiUrl}`;
}

export async function getClient(key: string, creator: () => Promise<Octokit>): Promise<Octokit> {
if (clients.has(key)) {
return clients.get(key)!;
}

const client = await creator();
clients.set(key, client);
return client;
}

export async function getAppAuthObject(
key: string,
creator: () => Promise<AppAuthentication>,
): Promise<AppAuthentication> {
if (appAuthObjects.has(key)) {
return appAuthObjects.get(key)!;
}

const authObj = await creator();
appAuthObjects.set(key, authObj);
return authObj;
}

export async function getInstallationAuthObject(
key: string,
creator: () => Promise<InstallationAccessTokenAuthentication>,
): Promise<InstallationAccessTokenAuthentication> {
if (installationAuthObjects.has(key)) {
return installationAuthObjects.get(key)!;
}

const authObj = await creator();
installationAuthObjects.set(key, authObj);
return authObj;
}

export async function getAuthConfig(key: string, creator: () => Promise<GithubAppConfig>) {
if (authConfigs.has(key)) {
return authConfigs.get(key)!;
}

const config = await creator();
authConfigs.set(key, config);
return config;
}
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache functions and key generation functions lack JSDoc documentation. Adding documentation would help other developers understand:

  • The caching strategy and behavior
  • When cached values are returned vs when the creator function is called
  • The cache key format and what parameters affect it
  • The purpose of the reset() function

Example:

/**
 * Creates a cache key for Octokit clients based on token and API URL.
 * @param token - The authentication token
 * @param ghesApiUrl - Optional GitHub Enterprise Server API URL
 * @returns A unique cache key for this token/URL combination
 */
export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
  return `octokit-${token}-${ghesApiUrl}`;
}

Copilot uses AI. Check for mistakes.
}

export function createAuthCacheKey(type: 'app' | 'installation', installationId?: number, ghesApiUrl: string = '') {
const id = installationId || 'none';
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache key generation uses installationId || 'none' which treats falsy values (0, null, undefined) the same way. While 0 is likely not a valid GitHub installation ID, this could lead to cache key collisions for different falsy values.

Consider using nullish coalescing instead for more precise handling:

const id = installationId ?? 'none';

This will only treat null and undefined as 'none', while preserving 0 if it were ever a valid value.

Suggested change
const id = installationId || 'none';
const id = installationId ?? 'none';

Copilot uses AI. Check for mistakes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems valid

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@dblinkhorn dblinkhorn force-pushed the dblinkhorn/github-service-module branch 2 times, most recently from 7e9cae3 to 8a5d1d0 Compare November 22, 2025 01:50
@npalm npalm requested a review from Copilot November 23, 2025 09:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

lambdas/functions/control-plane/src/github/auth.test.ts:38

  • Consider adding test cases to verify that the caching behavior works correctly. For example:
  • Verify that calling createGithubInstallationAuth twice with the same parameters only calls the underlying createAuth once (due to caching)
  • Verify that calling createGithubInstallationAuth with different installationId or ghesApiUrl values creates separate cache entries
  • Verify that the reset() function properly clears the cache so subsequent calls fetch fresh values

This would ensure the caching logic introduced in this PR works as intended.

beforeEach(() => {
  reset(); // clear all caches before each test
  vi.resetModules();
  vi.clearAllMocks();
  process.env = { ...cleanEnv };
  process.env.PARAMETER_GITHUB_APP_ID_NAME = PARAMETER_GITHUB_APP_ID_NAME;
  process.env.PARAMETER_GITHUB_APP_KEY_BASE64_NAME = PARAMETER_GITHUB_APP_KEY_BASE64_NAME;
  nock.disableNetConnect();
});

lambdas/functions/control-plane/src/github/auth.ts:91

  • String argument /[\\n]/g looks like a regular expression, but it will be interpreted as a string.
    const privateKey = Buffer.from(
      await getParameter(process.env.PARAMETER_GITHUB_APP_KEY_BASE64_NAME),
      'base64',
      // replace literal \n characters with new lines to allow the key to be stored as a
      // single line variable. This logic should match how the GitHub Terraform provider
      // processes private keys to retain compatibility between the projects
    )
      .toString()
      .replace('/[\\n]/g', String.fromCharCode(10));

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

)
.toString()
.replace('/[\\n]/g', String.fromCharCode(10)),
.replace('/[\\n]/g', String.fromCharCode(10));
Copy link

Copilot AI Nov 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern is incorrectly written as a string literal '/[\\n]/g' instead of an actual regex /\\n/g. This means the replace operation will look for the literal string /[\\n]/g instead of matching newline characters. The code should use .replace(/\\n/g, String.fromCharCode(10)) to properly replace literal \n sequences with actual newline characters.

Note: While this bug existed before this PR, it's being surfaced here because the line appears in the diff (due to the comma → semicolon change).

Suggested change
.replace('/[\\n]/g', String.fromCharCode(10));
.replace(/\\n/g, String.fromCharCode(10));

Copilot uses AI. Check for mistakes.
Copy link
Member

@npalm npalm Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dblinkhorn although this seems to be a valid comment. I tihink it is better to leave and create a seperate issue.

Alternative is indeed to fix it, but in that case I prefer to have a unit test.

Since potential problem is not introduced here. Leave it up to you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just went ahead and fixed it and added a new test.

@dblinkhorn
Copy link
Author

@npalm lemme know which of these copilot suggestions you'd like me to implement.

@dblinkhorn dblinkhorn force-pushed the dblinkhorn/github-service-module branch from 8a5d1d0 to e9421a4 Compare November 24, 2025 20:45
@npalm
Copy link
Member

npalm commented Nov 24, 2025

@npalm lemme know which of these copilot suggestions you'd like me to implement.

I hope to find later the week a bit of time to go through. Feel free to judge yourself and leave some comments. Copilot is not alwasy very useful. Just giving it a shot.

@dblinkhorn dblinkhorn force-pushed the dblinkhorn/github-service-module branch 2 times, most recently from 62ac34d to 412edec Compare November 25, 2025 22:58
@dblinkhorn dblinkhorn changed the title feat(control-plane): Centralize GitHub client/auth/config caching feat(control-plane): cache GitHub auth tokens and app credentials Nov 25, 2025
Introduces a module-level, singleton cache that stores auth objects to reduce repeated GitHub API and SSM calls for high-frequency workloads.

- Create new control-plane cache module
- Cache GitHub installation access tokens
- Cache GitHub App ID and private key from SSM
- Fix regex bug in private key newline replacement
- Add unit test for escaped newline sequence conversion
@dblinkhorn dblinkhorn force-pushed the dblinkhorn/github-service-module branch from 412edec to cc2cbef Compare November 25, 2025 22:59
dblinkhorn and others added 5 commits December 5, 2025 12:45
…ingle invocation (github-aws-runners#4603)

Currently we restrict the `scale-up` Lambda to only handle a single
event at a time. In very busy environments this can prove to be a
bottleneck: there are calls to GitHub and AWS APIs that happen each
time, and they can end up taking long enough that we can't process job
queued events faster than they arrive.

In our environment we are also using a pool, and typically we have
responded to the alerts generated by this (SQS queue length growing) by
expanding the size of the pool. This helps because we will more
frequently find that we don't need to scale up, which allows the lambdas
to exit a bit earlier, so we can get through the queue faster. But it
makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 `CreateFleet`
call to create instances, after working out how many are needed. This is
a job that can be batched. We can take any number of events, calculate
the diff between our current state and the number of jobs we have,
capping at the maximum, and then issue a single call.

The thing to be careful about is how to handle partial failures, if EC2
creates some of the instances we wanted but not all of them. Lambda has
a configurable function response type which can be set to
`ReportBatchItemFailures`. In this mode, we return a list of failed
messages from our handler and those are retried. We can make use of this
to give back as many events as we failed to process.

Now we're potentially processing multiple events in a single Lambda, one
thing we should optimise for is not recreating GitHub API clients. We
need one client for the app itself, which we use to find out
installation IDs, and then one client for each installation which is
relevant to the batch of events we are processing. This is done by
creating a new client the first time we see an event for a given
installation.

We also remove the same `batch_size = 1` constraint from the `job-retry`
Lambda. This Lambda is used to retry events that previously failed.
However, instead of reporting failures to be retried, here we maintain
the pre-existing fault-tolerant behaviour where errors are logged but
explicitly do not cause message retries, avoiding infinite loops from
persistent GitHub API issues or malformed events.

Tests are added for all of this.

Tests in a private repo (sorry) look good. This was running ephemeral
runners with no pool, SSM high throughput enabled, the job queued check
\_dis_abled, batch size of 200, wait time of 10 seconds. The workflow
runs are each a matrix with 250 jobs.


![image](https://github.com/user-attachments/assets/0a656e99-8f1e-45e2-924b-0d5c1b6d6afb)

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Upgrade to Lambda runtime Node24.x
- Upgrade minimal AWS Terraform provider
- Upgrade all lambda runtimes by default to 24x
- Breaking change!


## Dependency and environment upgrades:

* Updated all references to Node.js from version 22 to version 24 in
GitHub Actions workflows (`.github/workflows/lambda.yml`,
`.github/workflows/release.yml`) and Dockerfiles (`.ci/Dockerfile`,
`.devcontainer/Dockerfile`).
[[1]](diffhunk://#diff-b0732b88b9e5a3df48561602571a10179d2b28cbb21ba8032025932bc9606426L23-R23)
[[2]](diffhunk://#diff-87db21a973eed4fef5f32b267aa60fcee5cbdf03c67fafdc2a9b553bb0b15f34L33-R33)
[[3]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2)
[[4]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1)
* Upgraded the base Docker images for both the CI and devcontainer
environments to use newer Node.js image digests.
[[1]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2)
[[2]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1)

Terraform provider updates:

* Increased the minimum required version for the AWS Terraform provider
to `>= 6.21` in all example `versions.tf` files.
[[1]](diffhunk://#diff-61160e0ae9e70de675b98889710708e1a9edcccd5194a4a580aa234caa5103aeL5-R5)
[[2]](diffhunk://#diff-debb96ea7aef889f9deb4de51c61ca44a7e23832098e1c9d8b09fa54b1a96582L5-R5)
* Updated the `.terraform.lock.hcl` files in all examples to lock the
AWS provider at version `6.22.1`, the local provider at `2.6.1`, and the
null provider at `3.2.4` where applicable, along with updated hash
values and constraints.
[[1]](diffhunk://#diff-101dfb4a445c2ab4a46c53cbc92db3a43f3423ba1e8ee7b8a11b393ebe835539L5-R43)
[[2]](diffhunk://#diff-2a8b3082767f86cfdb88b00e963894a8cdd2ebcf705c8d757d46b55a98452a6cL5-R43)

---------

Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
Co-authored-by: Guilherme Caulada <[email protected]>
This pull request removes support for several deprecated AMI-related
variables across all modules, fully migrating the configuration to the
consolidated `ami` object. This change simplifies how AMI settings are
managed, improves consistency, and reduces confusion for users. All
references to the old variables (`ami_filter`, `ami_owners`,
`ami_id_ssm_parameter_name`, `ami_kms_key_arn`) have been removed from
module inputs, outputs, templates, documentation, and internal logic.

**Migration to consolidated AMI configuration:**

* Removed all deprecated AMI variables (`ami_filter`, `ami_owners`,
`ami_id_ssm_parameter_name`, `ami_kms_key_arn`) from module variable
definitions, outputs, and internal usage in `variables.tf`,
`outputs.tf`, and related files.
[[1]](diffhunk://#diff-05b5a57c136b6ff596500bcbfdcff145ef6cddea2a0e86d184d9daa9a65a288eL396-L424)
[[2]](diffhunk://#diff-23e8f44c0f21971190244acdb8a35eaa21af7578ed5f1b97bef83f1a566d979cL138-L165)
[[3]](diffhunk://#diff-de6c47c2496bd028a84d55ab12d8a4f90174ebfb6544b8b5c7b07a7ee4f27ec7L78-L90)
[[4]](diffhunk://#diff-2daea3e8167ce5d859f6f1bee08138dbe216003262325490e8b90477277c104aL70-L89)
[[5]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L73-R73)
[[6]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L186-L187)
[[7]](diffhunk://#diff-951f6bd1e32c3d27dd90e2dfb1f5232a704ef01fd925f3ee4323d6adc2dcdf5aL15-L20)
[[8]](diffhunk://#diff-3937b99021390c0192952207dd2e26a409e0c03446478fb09ac3cd360bb60ee5L9-L14)
* Updated example and documentation files to use the new `ami` object
structure, replacing previous usage of the deprecated variables.
[[1]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL36-R40)
[[2]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL52-R57)
[[3]](diffhunk://#diff-0a0d2ecd774e69a1397a913b6230f45692b49c0b17ccb103d318f6ab078353e2L48-R51)
[[4]](diffhunk://#diff-b2b9df08c45240d599f6260d246bad6e67129932174131db209341d8464247a8L18-R19)
[[5]](diffhunk://#diff-61032a0bb5f9d7ae65ba5155b2e58e12901f39bb7068f16b419d94c6f7a5b922L86-R89)
* Refactored module runner logic to only use the new `ami` object,
removing all fallback and compatibility code for the old variables.
[[1]](diffhunk://#diff-dc46acf24afd63ef8c556b77c126ccc6e578bc87e3aa09a931f33d9bf2532fbbL182-L185)
[[2]](diffhunk://#diff-57f00cdf57feef92ffcb35d6618e62e14ad652d5d520f331068332d5f3cade51L30-L32)
[[3]](diffhunk://#diff-e9624b388e62ca51cf1fe073eb0919588e8c36a1143ecdb49580996a89f13bebL40-R51)
* Updated internal references to AMI SSM parameter names and related
policies to use the new configuration, ensuring all resource and
environment variable logic is aligned with the consolidated approach.
[[1]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL13-R13)
[[2]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL28-R28)
[[3]](diffhunk://#diff-5921c9e3315946068538b290966e7e4e51b6e49d04c2466b0bdd4b298629b29dL56-R57)
[[4]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L28-R28)
[[5]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L151-R151)

With these updates, all AMI configuration is now handled through the
unified `ami` object, making runner setup more straightforward and
future-proof.


## Tested

- [x] default example
- [x] multi runner example

---------

Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com>
Co-authored-by: Copilot <[email protected]>
…ws-runners#4929)

## Description

This PR adds support for configuring EC2 placement groups for GitHub
Actions runners in the multi-runner module. It plumbs a new placement
option from the runner configuration through to the underlying EC2
runner module.

## Details

Updated modules/multi-runner/runners.tf to pass placement =
each.value.runner_config.placement into the runners module.
This allows specifying AWS placement groups for EC2 runners, enabling
tighter control over instance placement.
The change is backwards compatible: if placement is unset in
runner_config, behavior remains unchanged.

## Motivation / Future work

Placement groups are a prerequisite for supporting macOS runners, which
require a host_id.
A follow-up PR will add explicit macOS support leveraging this new
placement wiring.

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niek Palm <[email protected]>
@npalm npalm changed the base branch from main to next December 9, 2025 17:56
@npalm
Copy link
Member

npalm commented Dec 9, 2025

@dblinkhorn I have changed the target branch. Since we are busy to get a a new release out. I just want to see if there are any conflicts.

@npalm npalm requested a review from guicaulada December 9, 2025 17:58
@guicaulada
Copy link
Contributor

guicaulada commented Dec 11, 2025

I have confirmed this works, however I'm testing further to verify the impact of the cache on the amount of requests.

The first thing that comes to mind, is that this implementation is very specific for GitHub Authentication while we could implement a more generic cache library that could be used by all lambdas in different areas with ease.

@npalm @dblinkhorn what do you think about converting this to a more generic cache implementation that can be used in more places in the future?

@npalm npalm changed the base branch from next to main December 13, 2025 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants