Update STONITH lambda with the latest changes from KCB

Closes #32803

Signed-off-by: Michal Hajas <mhajas@redhat.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Ryan Emerson <remerson@redhat.com>
Michal Hajas authored 2024-09-11 13:48:26 +02:00, committed by GitHub
parent e140e71a52
commit d85ce41377
2 changed files with 128 additions and 32 deletions


@@ -19,13 +19,12 @@ longer possible for the two sites to continue to replicate data between themselv
will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
-In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site
-deployments only consist of two sites, this is not possible. Instead, we leverage "`fencing`" to ensure that when one of the
-sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this
-site is able to serve subsequent users requests.
+In such scenarios a quorum is commonly used to determine which sites are marked as online or offline; however, as multi-site deployments consist of only two sites, this is not possible.
+Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent user requests.
-As the state stored in {jdgserver_name} will be out-of-sync once the connectivity has been lost, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
-This is why a site which is removed via fencing will not be re-added automatically, but only after such a synchronisation using the mual procedure <@links.ha id="operate-site-online" />.
+Once the fencing procedure is triggered, the replication between the two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
+To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
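When the `SiteOffline` alert fires, the Lambda removes the unreachable site from the Global Accelerator EndpointGroup and tells the surviving {jdgserver_name} cluster to take its backup offline. For illustration, the REST request issued by `take_infinispan_site_offline` (shown later in this commit) is roughly equivalent to the following curl sketch; all variable names here are placeholders:

[source,bash]
----
# Placeholder values, for illustration only
curl -k -u ${INFINISPAN_USER}:${INFINISPAN_PASSWORD} -X POST \
  "https://${ISPN_ENDPOINT}/rest/v2/container/x-site/backups/${OFFLINE_SITE}?action=take-offline"
----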
@@ -41,6 +40,7 @@ a given time.
* ROSA HCP based multi-site Keycloak deployment
* AWS CLI Installed
* AWS Global Accelerator loadbalancer
+* `jq` tool installed
== Procedure
. Enable Openshift user alert routing
@@ -171,19 +171,12 @@ aws iam attach-role-policy \
LAMBDA_ZIP=/tmp/lambda.zip
cat << EOF > /tmp/lambda.py
-include::examples/generated/fencing_lambda.py[tag=fencing-start]
-expected_user = 'keycloak' # <1>
-secret_name = 'webhook-password' # <2>
-secret_region = 'eu-west-1' # <3>
-include::examples/generated/fencing_lambda.py[tag=fencing-end]
+include::examples/generated/fencing_lambda.py[]
EOF
zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
</#noparse>
----
-<1> The username required to authenticate Lambda requests
-<2> The AWS secret containing the password <<aws-secret,defined earlier>>
-<3> The AWS region which stores the password secret
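The `zip -FS --junk-paths` invocation above recreates the archive from scratch and strips the `/tmp/` directory prefix, keeping `lambda.py` at the root of the archive where the Lambda runtime expects it. A quick sanity check of the archive layout:

[source,bash]
----
# Expect a single root-level entry named lambda.py
unzip -l ${LAMBDA_ZIP}
----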
+
. Create the Lambda function.
+
@@ -233,7 +226,63 @@ aws lambda add-permission \
----
<1> The AWS Region hosting your Kubernetes clusters
+
-. Retieve the Lambda Function URL
+. Configure the Lambda's Environment variables:
++
+.. In each Kubernetes cluster, retrieve the exposed {jdgserver_name} URL endpoint:
++
+[source,bash]
+----
+<#noparse>
+kubectl -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}' # <1>
+</#noparse>
+----
+<1> Replace `$\{NAMESPACE}` with the namespace containing your {jdgserver_name} server
++
+.. Upload the desired Environment variables
++
+[source,bash]
+----
+<#noparse>
+ACCELERATOR_NAME= # <1>
+LAMBDA_REGION= # <2>
+CLUSTER_1_NAME= # <3>
+CLUSTER_1_ISPN_ENDPOINT= # <4>
+CLUSTER_2_NAME= # <5>
+CLUSTER_2_ISPN_ENDPOINT= # <6>
+INFINISPAN_USER= # <7>
+INFINISPAN_USER_SECRET= # <8>
+WEBHOOK_USER= # <9>
+WEBHOOK_USER_SECRET= # <10>
+INFINISPAN_SITE_ENDPOINTS=$(echo "{\"${CLUSTER_1_NAME}\":\"${CLUSTER_1_ISPN_ENDPOINT}\",\"${CLUSTER_2_NAME}\":\"${CLUSTER_2_ISPN_ENDPOINT}\"}" | jq tostring)
+aws lambda update-function-configuration \
+  --function-name ${ACCELERATOR_NAME} \
+  --region ${LAMBDA_REGION} \
+  --environment "{
+      \"Variables\": {
+        \"INFINISPAN_USER\" : \"${INFINISPAN_USER}\",
+        \"INFINISPAN_USER_SECRET\" : \"${INFINISPAN_USER_SECRET}\",
+        \"INFINISPAN_SITE_ENDPOINTS\" : ${INFINISPAN_SITE_ENDPOINTS},
+        \"WEBHOOK_USER\" : \"${WEBHOOK_USER}\",
+        \"WEBHOOK_USER_SECRET\" : \"${WEBHOOK_USER_SECRET}\",
+        \"SECRETS_REGION\" : \"eu-central-1\"
+      }
+  }"
+</#noparse>
+----
++
+<1> The name of the AWS Global Accelerator used by your deployment
+<2> The AWS Region hosting your Kubernetes cluster and Lambda function
+<3> The name of one of your {jdgserver_name} sites as defined in <@links.ha id="deploy-infinispan-kubernetes-crossdc" />
+<4> The {jdgserver_name} endpoint URL associated with the CLUSTER_1_NAME site
+<5> The name of the second {jdgserver_name} site
+<6> The {jdgserver_name} endpoint URL associated with the CLUSTER_2_NAME site
+<7> The username of a {jdgserver_name} user which has sufficient privileges to perform REST requests on the server
+<8> The name of the AWS secret containing the password associated with the {jdgserver_name} user
+<9> The username used to authenticate requests to the Lambda Function
+<10> The name of the AWS secret containing the password used to authenticate requests to the Lambda function
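The `jq tostring` call above serialises the endpoint map into an escaped JSON string, which is why `$\{INFINISPAN_SITE_ENDPOINTS}` can be spliced into the `--environment` JSON without surrounding quotes. A minimal sketch with a hypothetical site name:

[source,bash]
----
# Hypothetical values, for illustration only
echo '{"site-a":"infinispan-external.example.com"}' | jq tostring
# => "{\"site-a\":\"infinispan-external.example.com\"}"
----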
++
+. Retrieve the Lambda Function URL
+
.Command:
[source,bash]
@@ -260,11 +309,7 @@ https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
[source,bash]
----
<#noparse>
-ACCELERATOR_NAME= # <1>
-NAMESPACE= # <2>
-LOCAL_SITE= # <3>
-REMOTE_SITE= # <4>
+NAMESPACE= # The namespace containing your deployments
kubectl apply -n ${NAMESPACE} -f - << EOF
include::examples/generated/ispn-site-a.yaml[tag=fencing-secret]
</#noparse>

examples/generated/fencing_lambda.py

@@ -1,17 +1,37 @@
-# tag::fencing-start[]
+from urllib3.exceptions import HTTPError
import boto3
import jmespath
import json
+import os
+import urllib3
from base64 import b64decode
from urllib.parse import unquote

+# Prevent unverified HTTPS connection warning
+urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

+class MissingEnvironmentVariable(Exception):
+    pass

+class MissingSiteUrl(Exception):
+    pass

+def env(name):
+    if name in os.environ:
+        return os.environ[name]
+    raise MissingEnvironmentVariable(f"Environment Variable '{name}' must be set")
def handle_site_offline(labels):
    a_client = boto3.client('globalaccelerator', region_name='us-west-2')

    acceleratorDNS = labels['accelerator']
-    accelerator = jmespath.search(f"Accelerators[?DnsName=='{acceleratorDNS}']", a_client.list_accelerators())
+    accelerator = jmespath.search(f"Accelerators[?(DnsName=='{acceleratorDNS}'|| DualStackDnsName=='{acceleratorDNS}')]", a_client.list_accelerators())
    if not accelerator:
        print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found")
        return
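The updated filter matches an accelerator by either its IPv4 or dual-stack DNS name. Because the AWS CLI's `--query` option also evaluates JMESPath, the same expression can be tested outside the Lambda; `ACCELERATOR_DNS` is a placeholder and `us-west-2` mirrors the region hard-coded above:

[source,bash]
----
ACCELERATOR_DNS= # DNS name reported in the alert (placeholder)
aws globalaccelerator list-accelerators \
  --region us-west-2 \
  --query "Accelerators[?(DnsName=='${ACCELERATOR_DNS}' || DualStackDnsName=='${ACCELERATOR_DNS}')]"
----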
@@ -40,6 +60,9 @@ def handle_site_offline(labels):
            EndpointConfigurations=endpoints
        )
        print(f"Removed site={offline_site} from Accelerator EndpointGroup")

+        take_infinispan_site_offline(reporter, offline_site)
+        print(f"Backup site={offline_site} caches taken offline")
    else:
        print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup")
@@ -55,11 +78,30 @@ def endpoint_belongs_to_site(endpoint, site):
    return False
-def get_secret(secret_name, region_name):
+def take_infinispan_site_offline(reporter, offlinesite):
+    endpoints = json.loads(INFINISPAN_SITE_ENDPOINTS)
+    if reporter not in endpoints:
+        raise MissingSiteUrl(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")
+    endpoint = endpoints[reporter]
+
+    password = get_secret(INFINISPAN_USER_SECRET)
+    url = f"https://{endpoint}/rest/v2/container/x-site/backups/{offlinesite}?action=take-offline"
+    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
+    headers = urllib3.make_headers(basic_auth=f"{INFINISPAN_USER}:{password}")
+    try:
+        rsp = http.request("POST", url, headers=headers)
+        if rsp.status >= 400:
+            raise HTTPError(f"Unexpected response status '{rsp.status}' when taking site offline")
+        rsp.release_conn()
+    except HTTPError as e:
+        print(f"HTTP error encountered: {e}")

+def get_secret(secret_name):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
-        region_name=region_name
+        region_name=SECRETS_REGION
    )
    return client.get_secret_value(SecretId=secret_name)['SecretString']
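`get_secret` reads the `SecretString` field of an AWS Secrets Manager secret in `SECRETS_REGION`. The equivalent CLI call, handy for verifying that the secret names configured for the Lambda resolve as expected, is roughly:

[source,bash]
----
aws secretsmanager get-secret-value \
  --secret-id ${INFINISPAN_USER_SECRET} \
  --region ${SECRETS_REGION} \
  --query SecretString \
  --output text
----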
@@ -90,14 +132,9 @@ def handler(event, context):
"statusCode": 401
}
# end::fencing-start[]
expected_user = 'keycloak'
secret_name = 'keycloak-master-password'
secret_region = 'eu-central-1'
# tag::fencing-end[]
expectedPass = get_secret(secret_name, secret_region)
expectedPass = get_secret(WEBHOOK_USER_SECRET)
username, password = decode_basic_auth_header(authorization)
if username != expected_user and password != expectedPass:
if username != WEBHOOK_USER and password != expectedPass:
print('Invalid username/password combination')
return {
"statusCode": 403
@@ -109,6 +146,13 @@ def handler(event, context):
    body = json.loads(body)
    print(json.dumps(body))

+    if body['status'] != 'firing':
+        print("Ignoring alert as status is not 'firing', status was: '%s'" % body['status'])
+        return {
+            "statusCode": 204
+        }
+
    for alert in body['alerts']:
        labels = alert['labels']
        if labels['alertname'] == 'SiteOffline':
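With the new guard, notifications whose status is not `firing` (for example `resolved`) are acknowledged with a 204 and otherwise ignored, so only firing alerts reach the per-alert loop. A hand-rolled smoke test against the Function URL might look like the sketch below; the exact label set comes from your Prometheus alert rule, so the labels here are illustrative only:

[source,bash]
----
# Illustrative payload; labels other than 'alertname' depend on the alert rule
curl -u ${WEBHOOK_USER}:${WEBHOOK_PASSWORD} \
  -H 'Content-Type: application/json' \
  -d '{"status":"firing","alerts":[{"labels":{"alertname":"SiteOffline","accelerator":"example.awsglobalaccelerator.com"}}]}' \
  ${LAMBDA_URL}
----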
@@ -117,4 +161,11 @@ def handler(event, context):
    return {
        "statusCode": 204
    }

-# end::fencing-end[]
+INFINISPAN_USER = env('INFINISPAN_USER')
+INFINISPAN_USER_SECRET = env('INFINISPAN_USER_SECRET')
+INFINISPAN_SITE_ENDPOINTS = env('INFINISPAN_SITE_ENDPOINTS')
+SECRETS_REGION = env('SECRETS_REGION')
+WEBHOOK_USER = env('WEBHOOK_USER')
+WEBHOOK_USER_SECRET = env('WEBHOOK_USER_SECRET')
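Because these values are resolved through `env()` at module load, a missing variable fails the function at cold start with `MissingEnvironmentVariable` instead of midway through handling an alert. The variables actually visible to the deployed function can be checked with:

[source,bash]
----
aws lambda get-function-configuration \
  --function-name ${ACCELERATOR_NAME} \
  --region ${LAMBDA_REGION} \
  --query 'Environment.Variables'
----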