Skip to content

server/kv: Fix LRU list corruption under concurrent promotion#47

Merged
DwyaneShi merged 1 commit into
aibrix:devfrom
yapple:fix-lru-access-race
Mar 31, 2026
Merged

server/kv: Fix LRU list corruption under concurrent promotion#47
DwyaneShi merged 1 commit into
aibrix:devfrom
yapple:fix-lru-access-race

Conversation

@yapple
Copy link
Copy Markdown
Contributor

@yapple yapple commented Mar 27, 2026

  • priskv_lru_access: make promotion idempotent by always performing list_del_init then list_add under the LRU lock; ignore is_in_list to avoid duplicate inserts when publish/SEAL and GET race on the same key.
  • priskv_lru_del_key: switch to list_del_init so repeated deletions/touches remain safe.
  • Comments: add concurrency notes and fix a typo (insert).

Impact: prevents doubly-linked list corruption and potential crashes; improves stability of LRU operations under concurrency.

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to PrisKV! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to PrisKV's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Comment thread server/kv.c Outdated
static void priskv_lru_access(priskv_key *keynode, bool is_in_list)
{
priskv_kv *kv = keynode->kv;
(void)is_in_list; /* Kept for backward compatibility; logic no longer depends on it */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is removing is_in_list necessary to fix the issue?
  2. let's keep it if not. otherwise, let's remove it from all the usages, keeping compatibility is not needed since it's a internal API.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current test case only covers concurrent seal, and the following fixes the issue of concurrent seal UAF. The problem here is that when Acquire/Seal is concurrent, list_add for the same node may have issues, but there is currently no such test case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will supplement more comprehensive concurrent stress testing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing is_in_list is not to fix the issue, so I have reverted the commit about priskv_lru_access change.

@yapple yapple force-pushed the fix-lru-access-race branch from 518a00d to 28eec0b Compare March 28, 2026 03:48
…ger transport UTs

Fix a publish-window race where a new node becomes visible before LRU integration and could be observed by concurrent GET/UNPIN and reclaimed prematurely. Hold a temporary reference across the publish window and drop it after LRU join and old-node cleanup.

Observability and robustness:
- Count PIN on publish for parity with ACQUIRE+PIN.
- Update pin/unpin/pin_failed counters via atomic increments.
- Add warnings for potential stale handles and non-OK UNPIN statuses.

Unit tests (transport):
- Add a multithreaded combo test covering ALLOC+SEAL(PIN), concurrent ACQUIRE(PIN)/RELEASE(UNPIN), and verify both pin_count and delta counters. Use TLS slots to reliably capture per-thread status/token.
- Fix concurrent SEAL+PIN test to unpin via the latest handle to avoid stale-handle issues.
- Make transport tests accept nthreads/iters; run stress as "test-transport 8 200" from the test runner.

Files: server/kv.c, server/test/test_transport.c, run_unit_test.py
@yapple yapple force-pushed the fix-lru-access-race branch from 28eec0b to b1593ef Compare March 28, 2026 06:50
Copy link
Copy Markdown
Collaborator

@DwyaneShi DwyaneShi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DwyaneShi DwyaneShi merged commit 8530eee into aibrix:dev Mar 31, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants