Build: [AEA-0000] - Add retry logic to the deploy-api stage#2986
Build: [AEA-0000] - Add retry logic to the deploy-api stage#2986
Conversation
There was a problem hiding this comment.
Pull request overview
Adds retry logic around Lambda invocations in the deploy_api.sh GitHub Actions deploy stage to reduce transient deployment failures (e.g., intermittent 401s), while also cleaning up a couple of formatting issues.
Changes:
- Introduced
invoke_lambda_with_retry()to retry Lambda invocations up to 3 times with a delay. - Replaced direct
aws lambda invokecalls (secret put, instance put, spec publish) with the retry wrapper. - Minor whitespace/formatting fixes in the script.
|
|
||
| for attempt in $(seq 1 "$max_attempts"); do | ||
| echo "Invoking ${function_name} (attempt ${attempt} of ${max_attempts})" | ||
| aws lambda invoke --function-name "${function_name}" --cli-binary-format raw-in-base64-out --payload "file://${payload_file}" out.txt > response.json | ||
|
|
||
| if ! eval "cat response.json | jq -e '.FunctionError' >/dev/null"; then | ||
| return 0 | ||
| fi | ||
|
|
||
| echo "Error calling lambda (attempt ${attempt} of ${max_attempts})" | ||
| cat out.txt |
There was a problem hiding this comment.
out.txt and response.json are overwritten on every attempt and on every lambda call, which makes post-failure debugging harder (especially with retries). Consider incorporating the function name/attempt into the filenames, or at least printing both the lambda payload output and the invoke metadata for each failed attempt.
| for attempt in $(seq 1 "$max_attempts"); do | |
| echo "Invoking ${function_name} (attempt ${attempt} of ${max_attempts})" | |
| aws lambda invoke --function-name "${function_name}" --cli-binary-format raw-in-base64-out --payload "file://${payload_file}" out.txt > response.json | |
| if ! eval "cat response.json | jq -e '.FunctionError' >/dev/null"; then | |
| return 0 | |
| fi | |
| echo "Error calling lambda (attempt ${attempt} of ${max_attempts})" | |
| cat out.txt | |
| local safe_function_name | |
| safe_function_name=$(printf '%s' "${function_name}" | tr -c '[:alnum:]._-' '_') | |
| for attempt in $(seq 1 "$max_attempts"); do | |
| local payload_output_file="${safe_function_name}.attempt_${attempt}.out.txt" | |
| local invoke_metadata_file="${safe_function_name}.attempt_${attempt}.response.json" | |
| echo "Invoking ${function_name} (attempt ${attempt} of ${max_attempts})" | |
| aws lambda invoke --function-name "${function_name}" --cli-binary-format raw-in-base64-out --payload "file://${payload_file}" "${payload_output_file}" > "${invoke_metadata_file}" | |
| if ! jq -e '.FunctionError' "${invoke_metadata_file}" >/dev/null; then | |
| return 0 | |
| fi | |
| echo "Error calling lambda (attempt ${attempt} of ${max_attempts})" | |
| echo "Lambda payload output (${payload_output_file}):" | |
| cat "${payload_output_file}" | |
| echo "Lambda invoke metadata (${invoke_metadata_file}):" | |
| cat "${invoke_metadata_file}" |
| @@ -138,12 +165,7 @@ if [[ "${ENABLE_MUTUAL_TLS}" == "true" ]]; then | |||
| --arg proxygenSecretName "${proxygen_private_key_arn}" \ | |||
| '{apiName: $apiName, environment: $environment, secretName: $secretName, secretKey: $secretKey, secretCert: $secretCert, kid, $kid, proxygenSecretName: $proxygenSecretName}' > payload.json | |||
There was a problem hiding this comment.
The jq payload object contains kid, $kid, which is not valid key/value construction and will either set kid to null or cause a jq syntax error. This should be a proper field assignment (e.g. kid: $kid) so the lambda receives the expected KID value.
| '{apiName: $apiName, environment: $environment, secretName: $secretName, secretKey: $secretKey, secretCert: $secretCert, kid, $kid, proxygenSecretName: $proxygenSecretName}' > payload.json | |
| '{apiName: $apiName, environment: $environment, secretName: $secretName, secretKey: $secretKey, secretCert: $secretCert, kid: $kid, proxygenSecretName: $proxygenSecretName}' > payload.json |
There was a problem hiding this comment.
god damn, that might be the problem. Why does this work most of the time, then?
|



Summary
Details
Occasionally, the
deploy_api.shscript fails to get credentials and produces a 401 error. For example, here. Typically, manually re-running the workflow fixes this.This is a stab in the dark, but I've added a retrying function to the script, that will take three attempts at talking to the lambdas (secret put, instance put, and spec publish). The error is transient and I don't know what triggers it (though I think it's most likely to be external) so I don't have evidence that this will work, but it shouldn't hurt.