fix(keys): avoid duplicate metrics from KeyIndex desync#14
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthrough
ChangesDuplicate Key Emission Fix
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
KeyIndex could emit the same key more than once, surfacing as duplicate metrics (apache/apisix#11934). Two issues combine to cause this: 1. KeyIndex:list() iterated the stored key values (self.keys), which can transiently hold the same key value at two different slots when an expired metric is re-added at a new slot before the old slot is reclaimed. The duplicated value was then emitted twice. list() now walks the slots but only emits a key at the slot the index points at (self.index[key] == idx), so each key is listed exactly once while preserving ordering. 2. When add() re-adds a key whose shared-dict entry had already expired, it cleared the local index/keys but did not bump delete_count or clear the dict slot. Other workers therefore never did a full sync and kept the stale slot in their local self.keys, desynchronizing the index. add() now mirrors remove(): it clears the dict slot, increments the local deleted counter and bumps delete_count. Adds a regression test (TestKeyIndex:testExpiredReAddNoDuplicate) that reproduces the duplicate emission before the fix and passes after.
5cbbc18 to
57ec06c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@prometheus_test.lua`:
- Line 817: The KeyIndex:add() method call on self.key_index is not validating
the return value directly. Instead of relying on checking ngx.logs which is
global state and can be stale, capture the return value from
self.key_index:add("expkey", "eviction_err", 1) and assert it directly to verify
the operation succeeded or failed as expected. Apply the same fix to the similar
calls around lines 831-832.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c3b895e6-734f-4c98-bdd8-150cbbf6348d
📒 Files selected for processing (2)
prometheus_keys.luaprometheus_test.lua
Address review: KeyIndex:add() reports failures via its return value, so assert it directly instead of checking the global ngx.logs which can be stale/unrelated.
…t rc Address review feedback on the duplicate-metrics fix: - list(): iterate self.keys via pairs() instead of scanning 0..self.last. self.last grows monotonically with every add and is never reclaimed, so the slot-range scan was O(total slots ever created) on long-lived, high-churn workers. pairs() keeps it O(live keys) with the same index-based de-duplication. It also avoids a latent miss: sync() can reset self.last to key_count, below the highest populated local slot. - add() expired path: only treat the key as expired when expire() returns err == "not found". A non-"not found" failure no longer deletes a slot that may still be live (which would now also drop the metric); it is logged and the existing slot is kept to avoid creating a duplicate. - add() expired path: check the delete_count incr() return value and propagate eviction/forcible like remove() does, per the shared-dict error-handling convention. Drop the redundant dict:set(slot, nil): when expire() reports "not found" the slot is already gone. - tests: fix SimpleDict:expire to return true on success (matches ngx.shared.DICT); add a second-worker convergence assertion to the regression test; add testListDeduplicatesStaleSlot to isolate the list() guard.
membphis
left a comment
There was a problem hiding this comment.
Approved. No blocking issues found.
Problem
KeyIndexcould emit the same key more than once, which surfaces downstream as duplicate Prometheus metrics. This fixes the upstream report apache/apisix#11934 ("apisix reports duplicate metrics"), whose root cause lives in this fork rather than in apisix itself.Root cause
Two issues combine:
KeyIndex:list()iterated the stored key values rather than the index. It looped overself.keys(keyed by slot number) and copied every value.self.keyscan transiently hold the same key value at two different slots — e.g. when an expired metric is re-added at a new slot before the old slot is reclaimed — so the duplicated value was emitted twice.delete_countwas not bumped when an expired metric was re-added. Inadd(), when a key whose shared-dict entry had already expired is re-added, the code cleared the localindex/keys/expire_keysbut did not clear the dict slot or bumpdelete_count:Because
delete_countwas unchanged, other workers never triggered a full sync (KeyIndex:synconly does a full re-sync whenself.deleted ~= delete_count). They kept the stale slot in their localself.keyswhile the metric was re-added at a new slot — desynchronizing the index and causing the duplicate emission described above.Fix
list()now walks the slots but only emits a key at the slot the index currently points at (self.index[key] == idx), so each key is listed exactly once while preserving ordering.add()now mirrorsremove()on the expired-re-add path: it clears the dict slot, increments the localdeletedcounter and bumpsdelete_count, so other workers do a full sync and reclaim the stale slot.Public API and observable behavior are unchanged.
Test
Adds
TestKeyIndex:testExpiredReAddNoDuplicate, which:remove()(sokey_count/delete_countare untouched and the nextsync()is a no-op, leaving the index pointing at the vanished slot),delete_countwas bumped and thatlist()returns the key exactly once.Verified fail-before / pass-after:
delete_count == nilandlist()returns["expkey", "expkey"](test fails).delete_count == 1andlist()returns["expkey"](test passes).Full unit suite and
luacheck(per CI config) pass; the only remaining failure is the pre-existing, unrelatedTestPrometheus.testPrintfTable(a luaunit table-formatting difference present onmainbefore this change).Follow-up
Once a release is cut, an apisix-side rockspec bump will follow to pull in the fixed version.
Summary by CodeRabbit