-
Notifications
You must be signed in to change notification settings - Fork 694
feat(control-plane): cache GitHub auth tokens and app credentials #4899
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(control-plane): cache GitHub auth tokens and app credentials #4899
Conversation
|
@dblinkhorn thx, not digged in the PR it self. But please can you run |
14bd0be to
5f34e10
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a singleton caching layer for GitHub API clients, authentication objects, and app configuration to optimize performance by avoiding redundant SSM parameter lookups and client instantiations. The caching implementation uses in-memory Maps for storing Octokit clients, app/installation authentication objects, and GitHub App configuration.
Key changes:
- Created a new cache module (
github/cache.ts) with functions to cache and retrieve Octokit clients, authentication objects, and app configs - Refactored existing authentication functions in
auth.tsto utilize the new caching layer - Added cache reset functionality in test setup to ensure test isolation
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
lambdas/functions/control-plane/src/github/types.ts |
Defines GithubAppConfig interface for GitHub App configuration with appId, privateKey, and optional installationId |
lambdas/functions/control-plane/src/github/index.ts |
Exports cache and types modules to make them available to other parts of the application |
lambdas/functions/control-plane/src/github/cache.ts |
Implements singleton cache using Map structures with cache key generators and getter functions for clients, auth objects, and configs |
lambdas/functions/control-plane/src/github/auth.ts |
Integrates caching into existing authentication and client creation functions to avoid redundant SSM lookups |
lambdas/functions/control-plane/src/github/auth.test.ts |
Adds cache reset call in test setup and updates import statement to use type imports |
lambdas/functions/control-plane/src/local.ts |
Removes an eslint-disable comment that is no longer needed |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export function createTokenCacheKey(token: string, ghesApiUrl: string = '') { | ||
| return `octokit-${token}-${ghesApiUrl}`; | ||
| } |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing tokens directly in cache keys is a security concern as it could expose sensitive credentials in memory dumps or logs. The full token value should not be part of the cache key.
Consider using a hash of the token instead:
import { createHash } from 'crypto';
export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
const tokenHash = createHash('sha256').update(token).digest('hex').substring(0, 16);
return `octokit-${tokenHash}-${ghesApiUrl}`;
}| export interface GithubAppConfig { | ||
| appId: number; | ||
| privateKey: string; | ||
| installationId?: number; |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The installationId field is defined in the GithubAppConfig interface but is never populated or used. It should either be:
- Removed from the interface if not needed, or
- Populated in the
createAuthfunction (line 109-112) if it's intended to be part of the config
Currently, the installationId is handled separately in createAuth (line 119) rather than being part of the cached config.
| installationId?: number; |
| export function createTokenCacheKey(token: string, ghesApiUrl: string = '') { | ||
| return `octokit-${token}-${ghesApiUrl}`; | ||
| } | ||
|
|
||
| export function createAuthCacheKey(type: 'app' | 'installation', installationId?: number, ghesApiUrl: string = '') { | ||
| const id = installationId || 'none'; | ||
| return `${type}-auth-${id}-${ghesApiUrl}`; | ||
| } | ||
|
|
||
| export function createAuthConfigCacheKey(ghesApiUrl: string = '') { | ||
| return `auth-config-${ghesApiUrl}`; | ||
| } | ||
|
|
||
| export async function getClient(key: string, creator: () => Promise<Octokit>): Promise<Octokit> { | ||
| if (clients.has(key)) { | ||
| return clients.get(key)!; | ||
| } | ||
|
|
||
| const client = await creator(); | ||
| clients.set(key, client); | ||
| return client; | ||
| } | ||
|
|
||
| export async function getAppAuthObject( | ||
| key: string, | ||
| creator: () => Promise<AppAuthentication>, | ||
| ): Promise<AppAuthentication> { | ||
| if (appAuthObjects.has(key)) { | ||
| return appAuthObjects.get(key)!; | ||
| } | ||
|
|
||
| const authObj = await creator(); | ||
| appAuthObjects.set(key, authObj); | ||
| return authObj; | ||
| } | ||
|
|
||
| export async function getInstallationAuthObject( | ||
| key: string, | ||
| creator: () => Promise<InstallationAccessTokenAuthentication>, | ||
| ): Promise<InstallationAccessTokenAuthentication> { | ||
| if (installationAuthObjects.has(key)) { | ||
| return installationAuthObjects.get(key)!; | ||
| } | ||
|
|
||
| const authObj = await creator(); | ||
| installationAuthObjects.set(key, authObj); | ||
| return authObj; | ||
| } | ||
|
|
||
| export async function getAuthConfig(key: string, creator: () => Promise<GithubAppConfig>) { | ||
| if (authConfigs.has(key)) { | ||
| return authConfigs.get(key)!; | ||
| } | ||
|
|
||
| const config = await creator(); | ||
| authConfigs.set(key, config); | ||
| return config; | ||
| } |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache functions and key generation functions lack JSDoc documentation. Adding documentation would help other developers understand:
- The caching strategy and behavior
- When cached values are returned vs when the creator function is called
- The cache key format and what parameters affect it
- The purpose of the
reset()function
Example:
/**
* Creates a cache key for Octokit clients based on token and API URL.
* @param token - The authentication token
* @param ghesApiUrl - Optional GitHub Enterprise Server API URL
* @returns A unique cache key for this token/URL combination
*/
export function createTokenCacheKey(token: string, ghesApiUrl: string = '') {
return `octokit-${token}-${ghesApiUrl}`;
}| } | ||
|
|
||
| export function createAuthCacheKey(type: 'app' | 'installation', installationId?: number, ghesApiUrl: string = '') { | ||
| const id = installationId || 'none'; |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache key generation uses installationId || 'none' which treats falsy values (0, null, undefined) the same way. While 0 is likely not a valid GitHub installation ID, this could lead to cache key collisions for different falsy values.
Consider using nullish coalescing instead for more precise handling:
const id = installationId ?? 'none';This will only treat null and undefined as 'none', while preserving 0 if it were ever a valid value.
| const id = installationId || 'none'; | |
| const id = installationId ?? 'none'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
7e9cae3 to
8a5d1d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
lambdas/functions/control-plane/src/github/auth.test.ts:38
- Consider adding test cases to verify that the caching behavior works correctly. For example:
- Verify that calling
createGithubInstallationAuthtwice with the same parameters only calls the underlyingcreateAuthonce (due to caching) - Verify that calling
createGithubInstallationAuthwith differentinstallationIdorghesApiUrlvalues creates separate cache entries - Verify that the
reset()function properly clears the cache so subsequent calls fetch fresh values
This would ensure the caching logic introduced in this PR works as intended.
beforeEach(() => {
reset(); // clear all caches before each test
vi.resetModules();
vi.clearAllMocks();
process.env = { ...cleanEnv };
process.env.PARAMETER_GITHUB_APP_ID_NAME = PARAMETER_GITHUB_APP_ID_NAME;
process.env.PARAMETER_GITHUB_APP_KEY_BASE64_NAME = PARAMETER_GITHUB_APP_KEY_BASE64_NAME;
nock.disableNetConnect();
});
lambdas/functions/control-plane/src/github/auth.ts:91
- String argument /[\\n]/g looks like a regular expression, but it will be interpreted as a string.
const privateKey = Buffer.from(
await getParameter(process.env.PARAMETER_GITHUB_APP_KEY_BASE64_NAME),
'base64',
// replace literal \n characters with new lines to allow the key to be stored as a
// single line variable. This logic should match how the GitHub Terraform provider
// processes private keys to retain compatibility between the projects
)
.toString()
.replace('/[\\n]/g', String.fromCharCode(10));
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) | ||
| .toString() | ||
| .replace('/[\\n]/g', String.fromCharCode(10)), | ||
| .replace('/[\\n]/g', String.fromCharCode(10)); |
Copilot
AI
Nov 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex pattern is incorrectly written as a string literal '/[\\n]/g' instead of an actual regex /\\n/g. This means the replace operation will look for the literal string /[\\n]/g instead of matching newline characters. The code should use .replace(/\\n/g, String.fromCharCode(10)) to properly replace literal \n sequences with actual newline characters.
Note: While this bug existed before this PR, it's being surfaced here because the line appears in the diff (due to the comma → semicolon change).
| .replace('/[\\n]/g', String.fromCharCode(10)); | |
| .replace(/\\n/g, String.fromCharCode(10)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dblinkhorn although this seems to be a valid comment. I tihink it is better to leave and create a seperate issue.
Alternative is indeed to fix it, but in that case I prefer to have a unit test.
Since potential problem is not introduced here. Leave it up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just went ahead and fixed it and added a new test.
|
@npalm lemme know which of these copilot suggestions you'd like me to implement. |
8a5d1d0 to
e9421a4
Compare
I hope to find later the week a bit of time to go through. Feel free to judge yourself and leave some comments. Copilot is not alwasy very useful. Just giving it a shot. |
62ac34d to
412edec
Compare
Introduces a module-level, singleton cache that stores auth objects to reduce repeated GitHub API and SSM calls for high-frequency workloads. - Create new control-plane cache module - Cache GitHub installation access tokens - Cache GitHub App ID and private key from SSM - Fix regex bug in private key newline replacement - Add unit test for escaped newline sequence conversion
412edec to
cc2cbef
Compare
…ingle invocation (github-aws-runners#4603) Currently we restrict the `scale-up` Lambda to only handle a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs that happen each time, and they can end up taking long enough that we can't process job queued events faster than they arrive. In our environment we are also using a pool, and typically we have responded to the alerts generated by this (SQS queue length growing) by expanding the size of the pool. This helps because we will more frequently find that we don't need to scale up, which allows the lambdas to exit a bit earlier, so we can get through the queue faster. But it makes the environment much less responsive to changes in usage patterns. At its core, this Lambda's task is to construct an EC2 `CreateFleet` call to create instances, after working out how many are needed. This is a job that can be batched. We can take any number of events, calculate the diff between our current state and the number of jobs we have, capping at the maximum, and then issue a single call. The thing to be careful about is how to handle partial failures, if EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to `ReportBatchItemFailures`. In this mode, we return a list of failed messages from our handler and those are retried. We can make use of this to give back as many events as we failed to process. Now we're potentially processing multiple events in a single Lambda, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation which is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation. We also remove the same `batch_size = 1` constraint from the `job-retry` Lambda. This Lambda is used to retry events that previously failed. However, instead of reporting failures to be retried, here we maintain the pre-existing fault-tolerant behaviour where errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events. Tests are added for all of this. Tests in a private repo (sorry) look good. This was running ephemeral runners with no pool, SSM high throughput enabled, the job queued check \_dis_abled, batch size of 200, wait time of 10 seconds. The workflow runs are each a matrix with 250 jobs.  --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Niek Palm <[email protected]> Co-authored-by: Niek Palm <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Upgrade to Lambda runtime Node24.x - Upgrade minimal AWS Terraform provider - Upgrade all lambda runtimes by default to 24x - Breaking change! ## Dependency and environment upgrades: * Updated all references to Node.js from version 22 to version 24 in GitHub Actions workflows (`.github/workflows/lambda.yml`, `.github/workflows/release.yml`) and Dockerfiles (`.ci/Dockerfile`, `.devcontainer/Dockerfile`). [[1]](diffhunk://#diff-b0732b88b9e5a3df48561602571a10179d2b28cbb21ba8032025932bc9606426L23-R23) [[2]](diffhunk://#diff-87db21a973eed4fef5f32b267aa60fcee5cbdf03c67fafdc2a9b553bb0b15f34L33-R33) [[3]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2) [[4]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1) * Upgraded the base Docker images for both the CI and devcontainer environments to use newer Node.js image digests. [[1]](diffhunk://#diff-fd0c8401badda82156f9e7bd621fa3a0e586d8128e4a80af17c7cbff70bee11eL2-R2) [[2]](diffhunk://#diff-13bd9d7a30bf46656bc81f1ad5b908a627f9247be3f7d76df862b0578b534fc6L1-R1) Terraform provider updates: * Increased the minimum required version for the AWS Terraform provider to `>= 6.21` in all example `versions.tf` files. [[1]](diffhunk://#diff-61160e0ae9e70de675b98889710708e1a9edcccd5194a4a580aa234caa5103aeL5-R5) [[2]](diffhunk://#diff-debb96ea7aef889f9deb4de51c61ca44a7e23832098e1c9d8b09fa54b1a96582L5-R5) * Updated the `.terraform.lock.hcl` files in all examples to lock the AWS provider at version `6.22.1`, the local provider at `2.6.1`, and the null provider at `3.2.4` where applicable, along with updated hash values and constraints. [[1]](diffhunk://#diff-101dfb4a445c2ab4a46c53cbc92db3a43f3423ba1e8ee7b8a11b393ebe835539L5-R43) [[2]](diffhunk://#diff-2a8b3082767f86cfdb88b00e963894a8cdd2ebcf705c8d757d46b55a98452a6cL5-R43) --------- Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com> Co-authored-by: Guilherme Caulada <[email protected]>
This pull request removes support for several deprecated AMI-related variables across all modules, fully migrating the configuration to the consolidated `ami` object. This change simplifies how AMI settings are managed, improves consistency, and reduces confusion for users. All references to the old variables (`ami_filter`, `ami_owners`, `ami_id_ssm_parameter_name`, `ami_kms_key_arn`) have been removed from module inputs, outputs, templates, documentation, and internal logic. **Migration to consolidated AMI configuration:** * Removed all deprecated AMI variables (`ami_filter`, `ami_owners`, `ami_id_ssm_parameter_name`, `ami_kms_key_arn`) from module variable definitions, outputs, and internal usage in `variables.tf`, `outputs.tf`, and related files. [[1]](diffhunk://#diff-05b5a57c136b6ff596500bcbfdcff145ef6cddea2a0e86d184d9daa9a65a288eL396-L424) [[2]](diffhunk://#diff-23e8f44c0f21971190244acdb8a35eaa21af7578ed5f1b97bef83f1a566d979cL138-L165) [[3]](diffhunk://#diff-de6c47c2496bd028a84d55ab12d8a4f90174ebfb6544b8b5c7b07a7ee4f27ec7L78-L90) [[4]](diffhunk://#diff-2daea3e8167ce5d859f6f1bee08138dbe216003262325490e8b90477277c104aL70-L89) [[5]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L73-R73) [[6]](diffhunk://#diff-52d0673ff466b6445542e17038ea73a1cf41b8112f49ee57da4cebf8f0cb99c5L186-L187) [[7]](diffhunk://#diff-951f6bd1e32c3d27dd90e2dfb1f5232a704ef01fd925f3ee4323d6adc2dcdf5aL15-L20) [[8]](diffhunk://#diff-3937b99021390c0192952207dd2e26a409e0c03446478fb09ac3cd360bb60ee5L9-L14) * Updated example and documentation files to use the new `ami` object structure, replacing previous usage of the deprecated variables. [[1]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL36-R40) [[2]](diffhunk://#diff-ef2038e9f8d807236d2acebe3c3a191039f8021cc4a0188f4778de908f0d453bL52-R57) [[3]](diffhunk://#diff-0a0d2ecd774e69a1397a913b6230f45692b49c0b17ccb103d318f6ab078353e2L48-R51) [[4]](diffhunk://#diff-b2b9df08c45240d599f6260d246bad6e67129932174131db209341d8464247a8L18-R19) [[5]](diffhunk://#diff-61032a0bb5f9d7ae65ba5155b2e58e12901f39bb7068f16b419d94c6f7a5b922L86-R89) * Refactored module runner logic to only use the new `ami` object, removing all fallback and compatibility code for the old variables. [[1]](diffhunk://#diff-dc46acf24afd63ef8c556b77c126ccc6e578bc87e3aa09a931f33d9bf2532fbbL182-L185) [[2]](diffhunk://#diff-57f00cdf57feef92ffcb35d6618e62e14ad652d5d520f331068332d5f3cade51L30-L32) [[3]](diffhunk://#diff-e9624b388e62ca51cf1fe073eb0919588e8c36a1143ecdb49580996a89f13bebL40-R51) * Updated internal references to AMI SSM parameter names and related policies to use the new configuration, ensuring all resource and environment variable logic is aligned with the consolidated approach. [[1]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL13-R13) [[2]](diffhunk://#diff-bc00c0efa92f360635d026350da2fb775718514e2b1ae718281400e661b7469bL28-R28) [[3]](diffhunk://#diff-5921c9e3315946068538b290966e7e4e51b6e49d04c2466b0bdd4b298629b29dL56-R57) [[4]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L28-R28) [[5]](diffhunk://#diff-a598ba79d09e4770d55ed09e6b1d51e68c4a54562a3e3cbb46619d625a609d23L151-R151) With these updates, all AMI configuration is now handled through the unified `ami` object, making runner setup more straightforward and future-proof. ## Tested - [x] default example - [x] multi runner example --------- Co-authored-by: github-aws-runners-pr|bot <github-aws-runners-pr[bot]@users.noreply.github.com> Co-authored-by: Copilot <[email protected]>
…ws-runners#4929) ## Description This PR adds support for configuring EC2 placement groups for GitHub Actions runners in the multi-runner module. It plumbs a new placement option from the runner configuration through to the underlying EC2 runner module. ## Details Updated modules/multi-runner/runners.tf to pass placement = each.value.runner_config.placement into the runners module. This allows specifying AWS placement groups for EC2 runners, enabling tighter control over instance placement. The change is backwards compatible: if placement is unset in runner_config, behavior remains unchanged. ## Motivation / Future work Placement groups are a prerequisite for supporting macOS runners, which require a host_id. A follow-up PR will add explicit macOS support leveraging this new placement wiring. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niek Palm <[email protected]>
|
@dblinkhorn I have changed the target branch. Since we are busy to get a a new release out. I just want to see if there are any conflicts. |
|
I have confirmed this works, however I'm testing further to verify the impact of the cache on the amount of requests. The first thing that comes to mind, is that this implementation is very specific for GitHub Authentication while we could implement a more generic cache library that could be used by all lambdas in different areas with ease. @npalm @dblinkhorn what do you think about converting this to a more generic cache implementation that can be used in more places in the future? |
This PR introduces a singleton cache for Octokit clients, authentication objects, and GitHub App configuration, so we can avoid making redundant SSM lookups and client creation.
This PR represents the first step in what will be a broader effort to modularize GitHub interactions. Future PRs will build on this one by adding runner management, job operations, advanced caching for runner and job metadata, etc.