Add finalizer management and EDPM-aware cleanup for immutable AC secrets#685
Conversation
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/61090ad340f246b2a571069f7190fad7 ❌ openstack-k8s-operators-content-provider FAILURE in 8m 49s |
| if k8s_errors.IsNotFound(err) { | ||
| // If the ACID is set but the secret is not found, requeue to let the cache sync | ||
| if instance.Status.ACID != "" { | ||
| return ctrl.Result{RequeueAfter: time.Second * 5}, nil |
There was a problem hiding this comment.
If the ACID is set in status but the Secret is not found, the controller assumes an informer cache lag and requeues every 5 seconds. If the Secret was genuinely deleted (e.g. manual kubectl delete secret), this becomes an infinite requeue loop.
Consider adding a bounded retry - for example, an annotation counter or a timestamp check - and after a threshold (e.g. 30 seconds or N retries), fall through to doRotate = true so the controller self-heals instead of spinning indefinitely.
There was a problem hiding this comment.
We don't actually need the requeue anymore - we suppress reconcile trigger on create events, so this scenario can't happen anymore. I will remove the guard and that fixes the infinite loop possibility, which is valid and I replicated manually the problem when I manually deleted the finalziers.
| // finalizer should only be removed once all nodes across all NodeSets have | ||
| // been redeployed with the new credentials. This depends on per-node secret | ||
| // rotation tracking: https://github.com/openstack-k8s-operators/openstack-operator/pull/1781 | ||
| func hasConsumerFinalizer(secret *corev1.Secret) bool { |
There was a problem hiding this comment.
The substring check strings.Contains(f, "-ac-consumer") could match unrelated finalizers that happen to contain that substring. A tighter check would reduce false-positive risk:
if strings.HasPrefix(f, "openstack.org/") && strings.HasSuffix(f, "-ac-consumer") {
return true
}This still matches all service-operator consumer finalizers (openstack.org/barbican-ac-consumer, openstack.org/cinder-ac-consumer, etc.) while being robust against accidental collisions.
|
|
||
| for i := range secretList.Items { | ||
| s := &secretList.Items[i] | ||
| if protected[s.Name] || hasConsumerFinalizer(s) || !controllerutil.ContainsFinalizer(s, acSecretFinalizer) { |
There was a problem hiding this comment.
The !controllerutil.ContainsFinalizer(s, acSecretFinalizer) guard creates a permanent orphan scenario. If the controller crashes between line 625 (removing the protection finalizer via Update) and line 631 (deleting the Secret), the Secret is left without a finalizer. On the next reconcile, this guard causes the loop to skip it, and it is never cleaned up.
One fix is to remove the ContainsFinalizer pre-condition entirely - if the Secret is not protected by name and has no consumer finalizer, it should be deleted regardless of whether it still carries the protection finalizer:
if protected[s.Name] || hasConsumerFinalizer(s) {
continue
}Alternatively, invert the order: issue the Delete first (the API server won't actually remove the object while the finalizer exists), then remove the finalizer. That way, even on crash between the two operations, the object is already marked for deletion and the next reconcile just removes the finalizer.
There was a problem hiding this comment.
Yes, we don't need that additional check. Old mutable secrets with only the protection finalizers should be safe since the protected is checked before hasConsumerFinalizer if service operator would be late to add consumer finalizer to the secret.
I don't think inversing the order is a good way to go, we would need to handle the Terminating state, because deletion would be blocked.
| @@ -455,26 +728,25 @@ func (r *ApplicationCredentialReconciler) storeACSecret( | |||
|
|
|||
| op, err := controllerutil.CreateOrPatch(ctx, helperObj.GetClient(), secret, func() error { | |||
There was a problem hiding this comment.
Since each rotation produces a unique Secret name (ac-<svc>-<first5>-secret), CreateOrPatch will effectively always Create. However, if the controller crashes after the Create succeeds but before the status is patched, the retry will find the Secret already exists and attempt a Patch - which will try to update .data on an immutable Secret and fail.
Consider replacing this with a plain Create + IsAlreadyExists check:
err := helperObj.GetClient().Create(ctx, secret)
if k8s_errors.IsAlreadyExists(err) {
logger.Info("Immutable AC secret already exists (likely retry), proceeding", "secret", secretName)
return secretName, nil
}
if err != nil {
return "", fmt.Errorf("failed to create immutable AC secret %s: %w", secretName, err)
}| userID string, | ||
| ) error { | ||
| logger := r.GetLogger(ctx) | ||
| serviceName := strings.TrimPrefix(instance.Name, "ac-") |
There was a problem hiding this comment.
This derivation assumes every KeystoneApplicationCredential CR name follows the ac-<service> convention. If a CR is created with a name that doesn't start with ac-, TrimPrefix returns the full name unchanged and the resulting service label / label-based queries silently target the wrong set of Secrets.
Consider either:
- Validating the naming convention via a webhook at admission time, or
- Adding an explicit
serviceNamefield to the Spec so the derivation is unnecessary and the contract is explicit.
There was a problem hiding this comment.
openstack-operator creates AC CRs with a name convention defined in keystone-operator -https://github.com/openstack-k8s-operators/openstack-operator/blob/main/internal/openstack/applicationcredential.go#L113
It's still possible to manually create the AC CR ( we do that for tests bypassing openstack-op), but we want openstack-operator to be the sole creator of AC CR based on config in controlplane CR.
We could create additional api/v1beta1/keystoneapplicationcredential_webhook.go , let's see what other reviewers comment.
mauricioharley
left a comment
There was a problem hiding this comment.
Thanks for the work on this, Veronika. The overall architecture is solid - immutable per-rotation secrets, Keystone-side revocation, and consumer finalizer coordination are the right design choices.
I left five inline comments on areas that could lead to subtle issues in production:
- Infinite requeue loop (line 223) - unbounded retry when a Secret is genuinely deleted
- Fragile substring match (line 564) -
hasConsumerFinalizercould match unrelated finalizers - Orphaned secrets (line 606) - crash between finalizer removal and delete leaves permanently uncleaned secrets
- Immutable secret retry failure (line 729) -
CreateOrPatchfails on retry against an immutable Secret - Implicit naming convention (line 584) -
TrimPrefixsilently misbehaves if CR name doesn't followac-<service>
Please address these before merging.
45357af to
9a3a2ad
Compare
|
Note: I used the |
|
/retest |
9a3a2ad to
f70bc06
Compare
Depends-On: openstack-k8s-operators/keystone-operator#685 Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
| } | ||
|
|
||
| secretList := &corev1.SecretList{} | ||
| if err := helperObj.GetClient().List(ctx, secretList, |
There was a problem hiding this comment.
Here we do the same iteration (by label) that can be realized through a lib-common helper (see my previous comment on reconcileDelete).
| } | ||
|
|
||
| fresh := &corev1.Secret{} | ||
| if err := helperObj.GetClient().Get(ctx, types.NamespacedName{Namespace: s.Namespace, Name: s.Name}, fresh); err != nil { |
There was a problem hiding this comment.
why do we .Get( a secret (stored in s) that we already have? I'm not sure this logic is required here (L630 - L635)
There was a problem hiding this comment.
At first I thought that we should re Get to guard against stale resourceVersion on Update, because between List on #604 and Update on #636 we call revokeKeystoneAC and in the meantime of that external call some service operator could modify the secret changint the resourceVersion.
However if we already simplify it to use GetSecrets() the List result is going to be fresh, so I don't think it's necessary with the other agreed changes.
There was a problem hiding this comment.
Actually, I think that even if the conflict would happen the reconcile will retry again, because if conflict happens then it means there's some change on the resource and Owns() watch is triggered and enqueue new recopncile
| } | ||
| logger.Info("Removed protection finalizer from AC secret", "secret", s.Name) | ||
| } | ||
| if err := helperObj.GetClient().Delete(ctx, fresh); err != nil && !k8s_errors.IsNotFound(err) { |
There was a problem hiding this comment.
Do we need the Delete? I'm wondering if RemoveFinalizer + Update is enough to see the Secret deleted.
There was a problem hiding this comment.
The secret still has owner reference to the AC CR that exists, so it wouldn't be collected unless the CR is deleted. For security reasons we don't want to keep unused AC secrets laying around, so we want to revoke AC in Keystone when these are fulfilled:
- it's not current
SecretName - it's not
PreviousSecretName - it doesn't have consumer finalizer
f70bc06 to
da55e5f
Compare
Depends-On: openstack-k8s-operators/keystone-operator#685 Signed-off-by: Veronika Fisarova <vfisarov@redhat.com>
7168633 to
b8993f6
Compare
|
In latest chjanges addressed two found issues:
|
|
/test functional All passed locally |
|
|
||
| // Migrate old mutable secrets: add the application-credential-service label | ||
| // if missing, so they become visible to label-based cleanup and deletion queries | ||
| serviceName := strings.TrimPrefix(instance.Name, "ac-") |
There was a problem hiding this comment.
this is used in 5 places. I think it would be good to make a func next to GetACCRName to do the reverse, like GetServiceNameFromAC or so.
|
|
||
| // The controller relies on the NodeSet watch to trigger the next reconcile | ||
| // when EDPM deploys complete — no RequeueAfter needed | ||
| if result != (ctrl.Result{}) { |
There was a problem hiding this comment.
but isn't it returning ctrl.Result{RequeueAfter: time.Minute} if we wait for a NodeSet ?
|
as discussed, the AC or AC secret consumed for edpm should get a label or annotation to reflect this. with this we could remove the global wait for edpm nodes to be update before cleanup can happen and ctlplane only services ac secret can be deleted as soon it is released. |
b8993f6 to
bebe19a
Compare
|
@stuggi addressed those two comments and added check for edpm-service annotation, so we don't block non-edpm services. Thanks! |
|
This change depends on a change that failed to merge. Change openstack-k8s-operators/lib-common#693 is needed. |
bebe19a to
13a7e6c
Compare
| func GetApplicationCredentialFromSecret( | ||
| // IsEDPMService returns true if this AC CR is annotated as an EDPM service. | ||
| func (ac *KeystoneApplicationCredential) IsEDPMService() bool { | ||
| return ac.GetAnnotations()[EDPMServiceAnnotation] == "true" |
There was a problem hiding this comment.
I am wondering for saftey mechanism, if we should return true if the annotation is set and true, or when the annotation is missing. so only return false if the annotation is there and false. what you think?
There was a problem hiding this comment.
As a good guard mechanism to not revoke EDPM related AC when the annotation is deleted or missing for some reason. So, we will revoke only when explicitly set to false.
That's good, I will update the openstack-k8s-operators/openstack-operator#1922 because right now we were setting true only on edpm services and for rest the annotation was missing.
| } | ||
|
|
||
| acID := string(s.Data[keystonev1.ACIDSecretKey]) | ||
| if acID != "" { |
There was a problem hiding this comment.
should be if acID != "" && identClient != nil I think?
| - If the NodeSet CRD is not installed or no NodeSets exist, cleanup and deletion proceed normally. | ||
|
|
||
| **Why all AC CRs are checked (not just EDPM services):** | ||
| The controller checks all `KeystoneApplicationCredential` CRs uniformly against NodeSet hashes rather than maintaining a hardcoded mapping of which AC CRs correspond to EDPM services. This keeps keystone-operator generic and avoids coupling to specific service names. |
There was a problem hiding this comment.
with the opt-in annotation this is no longer accurate? it should be either if the annotation is true, or with the above safety fallback if the annotation is true or not present?
There was a problem hiding this comment.
Yes, I forgot to update the docs, thanks!
13a7e6c to
54ab9c3
Compare
54ab9c3 to
13876c2
Compare
13876c2 to
eaea953
Compare
|
Did a version mistake when rebasing, now the dependencies from lib-common to test with dataplane CRD should be correct. |
|
Build failed (check pipeline). Post ❌ openstack-k8s-operators-content-provider FAILURE in 7m 38s |
Introduce immutable per-rotation AC secrets with deterministic names, Keystone-side revocation of unused rotated ACs, and EDPM-aware guards on both cleanup and deletion paths to defer action until dataplane nodes are redeployed. Remove unused GetApplicationCredentialFromSecret function and suppress Owns() create events on the secret watch to prevent a race condition caused by stale informer cache. EDPM awareness: Before this change, keystone-operator could revoke and delete rotated AC secrets immediately after rotation, or during AC CR deletion, even when EDPM dataplane nodes still reference the old credentials. This adds guards to both cleanupUnusedRotatedSecrets and reconcileDelete that check whether all OpenStackDataPlaneNodeSet resources have been redeployed with the current config secrets before proceeding. The check uses edpm.AreSecretHashesInSync() from the lib-common edpm module, which compares each NodeSet's status.secretHashes against the live secret hashes in the namespace. Key design decisions: - The controller watches OpenStackDataPlaneNodeSet via unstructured objects (no typed import) to avoid circular dependencies between service operators and openstack-operator. - All AC CRs are checked uniformly against NodeSet hashes rather than maintaining a hardcoded mapping of which AC CRs correspond to EDPM services. This keeps keystone-operator generic and future-proof. - Both cleanup (rotated secret revocation) and deletion (AC CR removal) are deferred when NodeSet hashes are out of sync. The NodeSet watch triggers re-evaluation when EDPM deploys complete. - If NodeSet hashes cannot be checked (e.g. CRD not installed, no NodeSets), cleanup and deletion proceed normally. - Controlplane consumer protection remains unchanged via the existing per-service finalizer convention (suffix: -ac-consumer). This approach is aligned with the RabbitMQ user deletion design in infra-operator, which uses the same lib-common edpm module to gate resource cleanup on NodeSet deployment status. Signed-off-by: Veronika Fisarova <vfisarov@redhat.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
eaea953 to
719e4af
Compare
|
Build failed (check pipeline). Post ❌ openstack-k8s-operators-content-provider FAILURE in 6m 15s |
|
Build failed (check pipeline). Post ❌ openstack-k8s-operators-content-provider FAILURE in 12m 09s |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Deydra71, mauricioharley The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
4d7b7a0
into
openstack-k8s-operators:main
Jira: OSPRH-28176, OSPRH-27512, OSPRH-27519
Application Credential dev-doc: https://github.com/openstack-k8s-operators/dev-docs/blob/main/application_credentials.md
GetApplicationCredentialFromSecretfunctionapplication-credential-service: barbicanpreviousSecretNameused for tracking previously used AC secretapplication-credential-servicelabel is added to any existing secret that lacks it, making it visible to the label-based deletion. This ensures old secrets are properly revoked and cleaned up on the next rotation or CR deletion, and prevents orphaned protection finalizersedpm.AreSecretHashesInSync()from https://github.com/openstack-k8s-operators/lib-common/tree/main/modules/edpm. Before revoking rotated AC secrets or proceeding with AC CR deletion, the controller checks that allOpenStackDataPlaneNodeSetresources have been redeployed with the current config secrets. If any NodeSet has stale hashes, action is deferred until the next EDPM deploy. The controller watches NodeSet resources (via unstructured access to avoid circular dependencies) and re-evaluates when a NodeSet status changes. All AC CRs are checked uniformly - no hardcoded EDPM service mapping. This approach is aligned with the RabbitMQ user deletion design in infra-operator.Each service operator that consumes an AC secret now places a
openstack.org/<service>-ac-consumerfinalizer on the AC secret it is actively using. This ensures the keystone-operator cannot revoke or clean up secret while a service is still holding a reference to it.Assisted-by: Claude Opus 4.6 noreply@anthropic.com
Depends-On: openstack-k8s-operators/lib-common#693