chore: Refactor cache usage and code flow #60

Open
adwk67 wants to merge 22 commits into main from chore/refactor-internal-cache

adwk67 (Member) commented Dec 12, 2025

Description

Part of #59.

Note

Once merged, we will need to release version 0.4.3 as a separate PR.

Implementation Details

The refactoring is somewhat extensive - which makes the changes difficult to follow - but the changes can be summarised as follows:

  • multiple caches are now handled independently of one another and have been moved out into TopologyCache
  • function buildNodeToDatanodeMap has been introduced to cache a map of dataNode names to IPs: this enables O(1) lookup when finding co-located dataNodes for client pods
  • program flow has been restructured to make it more readable. The resolution flow consists of:
    • return topology from cache if possible, otherwise:
    • fetch dataNodes and prepare maps of their labels
    • build node-to-datanode map to use for O(1) co-located lookups
    • for each name, try the following in this order (enabling us to short-circuit on the first match):
      • attempt a direct dataNode lookup
      • fetch listeners, resolve them to their dataNodes and see if we have a match
      • fetch pods, find the co-located dataNode for the calling Pod
      • build and cache the topology
  • a podInformer is used to watch the namespace for pod changes and thereby reduce the chance of cache misses
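
The resolution flow listed above can be sketched roughly as follows. This is only an illustration of the short-circuiting order (cache, direct dataNode lookup, listener, co-located pod), not the code in this PR: the `DataNode` and `ClientPod` records, the `resolve` signature and the plain `Map` used as a cache are all assumptions made for the example, and the data is passed in rather than fetched from the cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: all types and method signatures are hypothetical.
class TopologyResolutionSketch {

    // Simplified stand-ins for the Kubernetes objects involved.
    record DataNode(String name, String ip, String nodeName, String rack) {}
    record ClientPod(String ip, String nodeName) {}

    static final String NOT_FOUND = "/NotFound";

    // Index dataNodes by the Kubernetes node they run on, so that finding the
    // dataNode co-located with a client pod is a single hash lookup.
    static Map<String, DataNode> buildNodeToDataNodeMap(List<DataNode> dataNodes) {
        Map<String, DataNode> byNode = new HashMap<>();
        for (DataNode dn : dataNodes) {
            byNode.putIfAbsent(dn.nodeName(), dn);
        }
        return byNode;
    }

    static List<String> resolve(
            List<String> names,
            List<DataNode> dataNodes,
            Map<String, String> listenerToDataNodeName,
            List<ClientPod> pods,
            Map<String, String> topologyCache) {
        // Direct lookups by dataNode name or IP.
        Map<String, DataNode> direct = new HashMap<>();
        for (DataNode dn : dataNodes) {
            direct.put(dn.name(), dn);
            direct.put(dn.ip(), dn);
        }
        Map<String, DataNode> byNode = buildNodeToDataNodeMap(dataNodes);

        List<String> result = new ArrayList<>();
        for (String name : names) {
            // 1. Return the topology from the cache if possible.
            String cached = topologyCache.get(name);
            if (cached != null) {
                result.add(cached);
                continue;
            }
            // 2.-4. Optional.or is lazy, so later steps only run if earlier ones miss.
            Optional<DataNode> match = Optional.ofNullable(direct.get(name))
                    .or(() -> Optional.ofNullable(listenerToDataNodeName.get(name))
                            .map(direct::get))
                    .or(() -> pods.stream()
                            .filter(p -> p.ip().equals(name))
                            .findFirst()
                            .map(p -> byNode.get(p.nodeName())));

            if (match.isPresent()) {
                // 5. Cache resolved topologies (caching misses is left out of this sketch).
                String topology = match.get().rack();
                topologyCache.put(name, topology);
                result.add(topology);
            } else {
                result.add(NOT_FOUND);
            }
        }
        return result;
    }
}
```

Keying the map by node name is what turns the co-location step into a hash lookup instead of a scan over all dataNodes for every client pod.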

How to test

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes

Reviewer

  • Code contains useful comments
  • (Integration-)Test cases added
  • Documentation added or updated
  • Changelog updated

Acceptance

  • Proper release label has been added

adwk67 requested a review from soenkeliebau December 17, 2025 16:37
adwk67 moved this to Development: In Review in Stackable Engineering Dec 17, 2025
adwk67 marked this pull request as ready for review December 17, 2025 16:37
adwk67 self-assigned this Dec 17, 2025

soenkeliebau (Member) left a comment

Left a few questions - might be me misunderstanding the code though

@Override
public void onUpdate(Pod oldPod, Pod newPod) {
    cache.putPod(oldPod.getMetadata().getName(), newPod);
    for (PodIP ip : oldPod.getStatus().getPodIPs()) {

Member

Should we refer to newPod here instead of oldPod, or to both?

Member Author

Yes, I think if it is the IPs that have changed, we need to delete the old ones and add the new ones. I'll change that.

Member Author

Changed in f0b5381
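
The fix discussed in this thread (remove the old IPs, then add the new ones) could look something like the sketch below. It is not the code from f0b5381: `PodSnapshot` and the two maps are hypothetical simplifications of the Kubernetes `Pod` object and the PR's cache, and a real handler would also need to cope with a missing status or IP list.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a simplified pod cache indexed by name and by IP.
class PodIpIndexSketch {

    // Hypothetical stand-in for a Kubernetes Pod (name plus its pod IPs).
    record PodSnapshot(String name, List<String> ips) {}

    private final Map<String, PodSnapshot> podsByName = new HashMap<>();
    private final Map<String, PodSnapshot> podsByIp = new HashMap<>();

    void onAdd(PodSnapshot pod) {
        podsByName.put(pod.name(), pod);
        for (String ip : pod.ips()) {
            podsByIp.put(ip, pod);
        }
    }

    void onUpdate(PodSnapshot oldPod, PodSnapshot newPod) {
        // Drop the entries for the old IPs first, so stale addresses do not
        // keep pointing at this pod after its IPs have changed.
        for (String ip : oldPod.ips()) {
            podsByIp.remove(ip);
        }
        // Then register the new state under the name and the new IPs.
        onAdd(newPod);
    }

    PodSnapshot findByIp(String ip) {
        return podsByIp.get(ip);
    }
}
```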

for (String name : names) {
String resolvedName = resolveDataNodesFromListeners(name, listeners);
listenerToDataNodeNames.add(resolvedName);
if (endpoint.getSubsets().isEmpty()) {

Member

'endpoint' can be null here, I think?

Member Author

Good point - I'll check for that.

Member Author

I've added in a few more null checks: f0b5381
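
As a rough illustration of the kind of guard being discussed (not the actual change in f0b5381), a null endpoint can be treated the same way as an endpoint without subsets. The `Endpoint` record below is a hypothetical simplification in which subsets are plain dataNode names, and returning the listener name for the null case is an assumption that mirrors the existing empty-subsets branch.

```java
import java.util.List;

// Illustrative sketch only: guard against a missing endpoint before reading its subsets.
class EndpointGuardSketch {

    // Hypothetical stand-in for a Kubernetes Endpoints object.
    record Endpoint(List<String> subsets) {}

    static String resolveListener(String listenerName, Endpoint endpoint) {
        // Treat "no endpoint", "no subsets list" and "empty subsets" alike:
        // fall back to the listener name, as the empty-subsets branch does.
        if (endpoint == null || endpoint.subsets() == null || endpoint.subsets().isEmpty()) {
            return listenerName;
        }
        return endpoint.subsets().get(0);
    }
}
```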

}
}
LOG.error("Unable to fetch CRD version for listeners. Returning default value.");
return "v1alpha1";

Member

What happens downstream if we do this? Do we potentially expose ourselves to worse parse errors because "someone" thinks they are handling a v1alpha1 when that is not actually the case?
Might it be better to fail "visibly" here?

Member Author

This is just an internal check and if the fallback value of v1alpha1 is incorrect, then no listeners will be found and they won't be considered in the endpoint resolution.
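
To make the two options in this thread concrete, here is a small sketch contrasting a silent fallback with a visible failure. It does not reflect how the PR actually discovers the CRD version: the `CrdVersion` record and all method names are hypothetical, and only the `v1alpha1` default comes from the snippet above.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch only: two ways to handle a failed CRD version lookup.
class ListenerVersionSketch {

    static final String DEFAULT_VERSION = "v1alpha1";

    // Hypothetical stand-in for one version entry of a CRD.
    record CrdVersion(String name, boolean served) {}

    static Optional<String> servedVersion(List<CrdVersion> versions) {
        if (versions == null) {
            return Optional.empty();
        }
        return versions.stream()
                .filter(CrdVersion::served)
                .map(CrdVersion::name)
                .findFirst();
    }

    // Option taken in the PR: fall back to a default and let later lookups
    // simply find no listeners if the default is wrong.
    static String versionOrDefault(List<CrdVersion> versions) {
        return servedVersion(versions).orElse(DEFAULT_VERSION);
    }

    // Reviewer's alternative: fail visibly instead of guessing.
    static String versionOrFail(List<CrdVersion> versions) {
        return servedVersion(versions).orElseThrow(
                () -> new IllegalStateException("Unable to fetch CRD version for listeners"));
    }
}
```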

listenerToDataNodeNames.add(resolvedName);
if (endpoint.getSubsets().isEmpty()) {
    LOG.warn("Endpoint {} has no subsets - pod may be restarting", listenerName);
    return listenerName;

Member

This will be mapped to /NotFound down the line - probably correct and intended, just wanted to mention it and see if it sparks a discussion :)

Member Author

Yes, that's intended.
