Update STONITH lambda with the latest changes from KCB

Closes #32803

Signed-off-by: Michal Hajas <mhajas@redhat.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Ryan Emerson <remerson@redhat.com>
Michal Hajas authored 2024-09-11 13:48:26 +02:00, committed by GitHub
parent e140e71a52
commit d85ce41377
2 changed files with 128 additions and 32 deletions


@@ -19,13 +19,12 @@ longer possible for the two sites to continue to replicate data between themselv
will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
-In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site
-deployments only consist of two sites, this is not possible. Instead, we leverage "`fencing`" to ensure that when one of the
-sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this
-site is able to serve subsequent users requests.
+In such scenarios a quorum is commonly used to determine which sites are marked as online or offline; however, as multi-site deployments consist of only two sites, this is not possible.
+Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent user requests.
-As the state stored in {jdgserver_name} will be out-of-sync once the connectivity has been lost, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
-This is why a site which is removed via fencing will not be re-added automatically, but only after such a synchronisation using the mual procedure <@links.ha id="operate-site-online" />.
+Once the fencing procedure is triggered, the replication between the two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
+To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
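When the `SiteOffline` alert fires, the Lambda removes the unreachable site from the Global Accelerator EndpointGroup and tells the surviving {jdgserver_name} cluster to take its backup offline. For illustration, the REST request issued by `take_infinispan_site_offline` (shown later in this commit) is roughly equivalent to the following curl sketch; all variable names here are placeholders:

[source,bash]
----
# Placeholder values, for illustration only
curl -k -u ${INFINISPAN_USER}:${INFINISPAN_PASSWORD} -X POST \
  "https://${ISPN_ENDPOINT}/rest/v2/container/x-site/backups/${OFFLINE_SITE}?action=take-offline"
----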
@@ -41,6 +40,7 @@ a given time.
* ROSA HCP based multi-site Keycloak deployment
* AWS CLI Installed
* AWS Global Accelerator loadbalancer
+* `jq` tool installed
== Procedure
. Enable Openshift user alert routing
@@ -171,19 +171,12 @@ aws iam attach-role-policy \
LAMBDA_ZIP=/tmp/lambda.zip
cat << EOF > /tmp/lambda.py
-include::examples/generated/fencing_lambda.py[tag=fencing-start]
-expected_user = 'keycloak' # <1>
-secret_name = 'webhook-password' # <2>
-secret_region = 'eu-west-1' # <3>
-include::examples/generated/fencing_lambda.py[tag=fencing-end]
+include::examples/generated/fencing_lambda.py[]
EOF
zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
</#noparse>
----
-<1> The username required to authenticate Lambda requests
-<2> The AWS secret containing the password <<aws-secret,defined earlier>>
-<3> The AWS region which stores the password secret
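The `zip -FS --junk-paths` invocation above recreates the archive from scratch and strips the `/tmp/` directory prefix, keeping `lambda.py` at the root of the archive where the Lambda runtime expects it. A quick sanity check of the archive layout:

[source,bash]
----
# Expect a single root-level entry named lambda.py
unzip -l ${LAMBDA_ZIP}
----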
+
. Create the Lambda function.
+
@@ -233,7 +226,63 @@ aws lambda add-permission \
----
<1> The AWS Region hosting your Kubernetes clusters
+
-. Retieve the Lambda Function URL
+. Configure the Lambda's Environment variables:
++
+.. In each Kubernetes cluster, retrieve the exposed {jdgserver_name} URL endpoint:
++
+[source,bash]
+----
+<#noparse>
+kubectl -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}' # <1>
+</#noparse>
+----
+<1> Replace `$\{NAMESPACE}` with the namespace containing your {jdgserver_name} server
++
+.. Upload the desired Environment variables
++
+[source,bash]
+----
+<#noparse>
+ACCELERATOR_NAME= # <1>
+LAMBDA_REGION= # <2>
+CLUSTER_1_NAME= # <3>
+CLUSTER_1_ISPN_ENDPOINT= # <4>
+CLUSTER_2_NAME= # <5>
+CLUSTER_2_ISPN_ENDPOINT= # <6>
+INFINISPAN_USER= # <7>
+INFINISPAN_USER_SECRET= # <8>
+WEBHOOK_USER= # <9>
+WEBHOOK_USER_SECRET= # <10>
+INFINISPAN_SITE_ENDPOINTS=$(echo "{\"${CLUSTER_1_NAME}\":\"${CLUSTER_1_ISPN_ENDPOINT}\",\"${CLUSTER_2_NAME}\":\"${CLUSTER_2_ISPN_ENDPOINT}\"}" | jq tostring)
+aws lambda update-function-configuration \
+  --function-name ${ACCELERATOR_NAME} \
+  --region ${LAMBDA_REGION} \
+  --environment "{
+      \"Variables\": {
+        \"INFINISPAN_USER\" : \"${INFINISPAN_USER}\",
+        \"INFINISPAN_USER_SECRET\" : \"${INFINISPAN_USER_SECRET}\",
+        \"INFINISPAN_SITE_ENDPOINTS\" : ${INFINISPAN_SITE_ENDPOINTS},
+        \"WEBHOOK_USER\" : \"${WEBHOOK_USER}\",
+        \"WEBHOOK_USER_SECRET\" : \"${WEBHOOK_USER_SECRET}\",
+        \"SECRETS_REGION\" : \"eu-central-1\"
+      }
+  }"
+</#noparse>
+----
++
+<1> The name of the AWS Global Accelerator used by your deployment
+<2> The AWS Region hosting your Kubernetes cluster and Lambda function
+<3> The name of one of your {jdgserver_name} sites as defined in <@links.ha id="deploy-infinispan-kubernetes-crossdc" />
+<4> The {jdgserver_name} endpoint URL associated with the CLUSTER_1_NAME site
+<5> The name of the second {jdgserver_name} site
+<6> The {jdgserver_name} endpoint URL associated with the CLUSTER_2_NAME site
+<7> The username of a {jdgserver_name} user which has sufficient privileges to perform REST requests on the server
+<8> The name of the AWS secret containing the password associated with the {jdgserver_name} user
+<9> The username used to authenticate requests to the Lambda Function
+<10> The name of the AWS secret containing the password used to authenticate requests to the Lambda function
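The `jq tostring` call above serialises the endpoint map into an escaped JSON string, which is why `$\{INFINISPAN_SITE_ENDPOINTS}` can be spliced into the `--environment` JSON without surrounding quotes. A minimal sketch with a hypothetical site name:

[source,bash]
----
# Hypothetical values, for illustration only
echo '{"site-a":"infinispan-external.example.com"}' | jq tostring
# => "{\"site-a\":\"infinispan-external.example.com\"}"
----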
++
+. Retrieve the Lambda Function URL
+
.Command:
[source,bash]
@@ -260,11 +309,7 @@ https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
[source,bash]
----
<#noparse>
-ACCELERATOR_NAME= # <1>
-NAMESPACE= # <2>
-LOCAL_SITE= # <3>
-REMOTE_SITE= # <4>
+NAMESPACE= # The namespace containing your deployments
kubectl apply -n ${NAMESPACE} -f - << EOF
include::examples/generated/ispn-site-a.yaml[tag=fencing-secret]
</#noparse>

examples/generated/fencing_lambda.py

@@ -1,17 +1,37 @@
-# tag::fencing-start[]
+from urllib3.exceptions import HTTPError
import boto3
import jmespath
import json
+import os
+import urllib3
from base64 import b64decode
from urllib.parse import unquote

+# Prevent unverified HTTPS connection warning
+urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

+class MissingEnvironmentVariable(Exception):
+    pass

+class MissingSiteUrl(Exception):
+    pass

+def env(name):
+    if name in os.environ:
+        return os.environ[name]
+    raise MissingEnvironmentVariable(f"Environment Variable '{name}' must be set")
def handle_site_offline(labels):
    a_client = boto3.client('globalaccelerator', region_name='us-west-2')

    acceleratorDNS = labels['accelerator']
-    accelerator = jmespath.search(f"Accelerators[?DnsName=='{acceleratorDNS}']", a_client.list_accelerators())
+    accelerator = jmespath.search(f"Accelerators[?(DnsName=='{acceleratorDNS}'|| DualStackDnsName=='{acceleratorDNS}')]", a_client.list_accelerators())
    if not accelerator:
        print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found")
        return
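The updated filter matches an accelerator by either its IPv4 or dual-stack DNS name. Because the AWS CLI's `--query` option also evaluates JMESPath, the same expression can be tested outside the Lambda; `ACCELERATOR_DNS` is a placeholder and `us-west-2` mirrors the region hard-coded above:

[source,bash]
----
ACCELERATOR_DNS= # DNS name reported in the alert (placeholder)
aws globalaccelerator list-accelerators \
  --region us-west-2 \
  --query "Accelerators[?(DnsName=='${ACCELERATOR_DNS}' || DualStackDnsName=='${ACCELERATOR_DNS}')]"
----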
@@ -40,6 +60,9 @@ def handle_site_offline(labels):
            EndpointConfigurations=endpoints
        )
        print(f"Removed site={offline_site} from Accelerator EndpointGroup")

+        take_infinispan_site_offline(reporter, offline_site)
+        print(f"Backup site={offline_site} caches taken offline")
    else:
        print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup")
@@ -55,11 +78,30 @@ def endpoint_belongs_to_site(endpoint, site):
    return False
-def get_secret(secret_name, region_name):
+def take_infinispan_site_offline(reporter, offlinesite):
+    endpoints = json.loads(INFINISPAN_SITE_ENDPOINTS)
+    if reporter not in endpoints:
+        raise MissingSiteUrl(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")
+    endpoint = endpoints[reporter]
+
+    password = get_secret(INFINISPAN_USER_SECRET)
+    url = f"https://{endpoint}/rest/v2/container/x-site/backups/{offlinesite}?action=take-offline"
+    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
+    headers = urllib3.make_headers(basic_auth=f"{INFINISPAN_USER}:{password}")
+    try:
+        rsp = http.request("POST", url, headers=headers)
+        if rsp.status >= 400:
+            raise HTTPError(f"Unexpected response status '{rsp.status}' when taking site offline")
+        rsp.release_conn()
+    except HTTPError as e:
+        print(f"HTTP error encountered: {e}")

+def get_secret(secret_name):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
-        region_name=region_name
+        region_name=SECRETS_REGION
    )
    return client.get_secret_value(SecretId=secret_name)['SecretString']
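`get_secret` reads the `SecretString` field of an AWS Secrets Manager secret in `SECRETS_REGION`. The equivalent CLI call, handy for verifying that the secret names configured for the Lambda resolve as expected, is roughly:

[source,bash]
----
aws secretsmanager get-secret-value \
  --secret-id ${INFINISPAN_USER_SECRET} \
  --region ${SECRETS_REGION} \
  --query SecretString \
  --output text
----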
@@ -90,14 +132,9 @@ def handler(event, context):
"statusCode": 401
}
# end::fencing-start[]
expected_user = 'keycloak'
secret_name = 'keycloak-master-password'
secret_region = 'eu-central-1'
# tag::fencing-end[]
expectedPass = get_secret(secret_name, secret_region)
expectedPass = get_secret(WEBHOOK_USER_SECRET)
username, password = decode_basic_auth_header(authorization)
if username != expected_user and password != expectedPass:
if username != WEBHOOK_USER and password != expectedPass:
print('Invalid username/password combination')
return {
"statusCode": 403
@@ -109,6 +146,13 @@ def handler(event, context):
    body = json.loads(body)
    print(json.dumps(body))

+    if body['status'] != 'firing':
+        print("Ignoring alert as status is not 'firing', status was: '%s'" % body['status'])
+        return {
+            "statusCode": 204
+        }
+
    for alert in body['alerts']:
        labels = alert['labels']
        if labels['alertname'] == 'SiteOffline':
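With the new guard, notifications whose status is not `firing` (for example `resolved`) are acknowledged with a 204 and otherwise ignored, so only firing alerts reach the per-alert loop. A hand-rolled smoke test against the Function URL might look like the sketch below; the exact label set comes from your Prometheus alert rule, so the labels here are illustrative only:

[source,bash]
----
# Illustrative payload; labels other than 'alertname' depend on the alert rule
curl -u ${WEBHOOK_USER}:${WEBHOOK_PASSWORD} \
  -H 'Content-Type: application/json' \
  -d '{"status":"firing","alerts":[{"labels":{"alertname":"SiteOffline","accelerator":"example.awsglobalaccelerator.com"}}]}' \
  ${LAMBDA_URL}
----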
@@ -117,4 +161,11 @@ def handler(event, context):
    return {
        "statusCode": 204
    }

-# end::fencing-end[]
+INFINISPAN_USER = env('INFINISPAN_USER')
+INFINISPAN_USER_SECRET = env('INFINISPAN_USER_SECRET')
+INFINISPAN_SITE_ENDPOINTS = env('INFINISPAN_SITE_ENDPOINTS')
+SECRETS_REGION = env('SECRETS_REGION')
+WEBHOOK_USER = env('WEBHOOK_USER')
+WEBHOOK_USER_SECRET = env('WEBHOOK_USER_SECRET')
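Because these values are resolved through `env()` at module load, a missing variable fails the function at cold start with `MissingEnvironmentVariable` instead of midway through handling an alert. The variables actually visible to the deployed function can be checked with:

[source,bash]
----
aws lambda get-function-configuration \
  --function-name ${ACCELERATOR_NAME} \
  --region ${LAMBDA_REGION} \
  --query 'Environment.Variables'
----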