Update STONITH lambda with the latest changes from KCB

Closes #32803

Signed-off-by: Michal Hajas <mhajas@redhat.com>
Signed-off-by: Ryan Emerson <remerson@redhat.com>
Co-authored-by: Ryan Emerson <remerson@redhat.com>
Michal Hajas 2024-09-11 13:48:26 +02:00 committed by GitHub
parent e140e71a52
commit d85ce41377
GPG key ID: B5690EEEBB952194
2 changed files with 128 additions and 32 deletions


@ -19,13 +19,12 @@ longer possible for the two sites to continue to replicate data between themselv
will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site deployments only consist of two sites, this is not possible.
Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent user requests.

Once the fencing procedure is triggered the replication between the two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
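The alert reaches the Lambda as a standard Prometheus Alertmanager webhook payload. As a rough, hypothetical sketch of the payload shape the Lambda handler parses (the `status`, `alerts` and `labels` fields; the label values below are invented for illustration):

```python
import json

# Hypothetical Alertmanager webhook body, shaped like the one the
# fencing Lambda handler parses. Accelerator DNS name and site names
# are illustrative only.
raw_body = json.dumps({
    "status": "firing",
    "alerts": [{
        "labels": {
            "alertname": "SiteOffline",
            "accelerator": "abcd1234.awsglobalaccelerator.com",
            "reporter": "site-a",
            "site": "site-b"
        }
    }]
})

# Non-firing notifications are ignored; each firing SiteOffline alert
# names the site that should be fenced.
body = json.loads(raw_body)
sites_to_fence = []
if body["status"] == "firing":
    for alert in body["alerts"]:
        labels = alert["labels"]
        if labels["alertname"] == "SiteOffline":
            sites_to_fence.append(labels["site"])

print(sites_to_fence)
```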
@ -41,6 +40,7 @@ a given time.
* ROSA HCP based multi-site Keycloak deployment
* AWS CLI Installed
* AWS Global Accelerator loadbalancer
* `jq` tool installed
== Procedure
. Enable OpenShift user alert routing
@ -171,19 +171,12 @@ aws iam attach-role-policy \
LAMBDA_ZIP=/tmp/lambda.zip
cat << EOF > /tmp/lambda.py
include::examples/generated/fencing_lambda.py[]
EOF
zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
</#noparse>
----
+
. Create the Lambda function.
+
@ -233,7 +226,63 @@ aws lambda add-permission \
----
<1> The AWS Region hosting your Kubernetes clusters
+
. Configure the Lambda's Environment variables:
+
.. In each Kubernetes cluster, retrieve the exposed {jdgserver_name} URL endpoint:
+
[source,bash]
----
<#noparse>
kubectl -n ${NAMESPACE} get route infinispan-external -o jsonpath='{.status.ingress[].host}' # <1>
</#noparse>
----
<1> Replace `$\{NAMESPACE}` with the namespace containing your {jdgserver_name} server
+
.. Upload the desired Environment variables
+
[source,bash]
----
<#noparse>
ACCELERATOR_NAME= # <1>
LAMBDA_REGION= # <2>
CLUSTER_1_NAME= # <3>
CLUSTER_1_ISPN_ENDPOINT= # <4>
CLUSTER_2_NAME= # <5>
CLUSTER_2_ISPN_ENDPOINT= # <6>
INFINISPAN_USER= # <7>
INFINISPAN_USER_SECRET= # <8>
WEBHOOK_USER= # <9>
WEBHOOK_USER_SECRET= # <10>
INFINISPAN_SITE_ENDPOINTS=$(echo "{\"${CLUSTER_1_NAME}\":\"${CLUSTER_1_ISPN_ENDPOINT}\",\"${CLUSTER_2_NAME}\":\"${CLUSTER_2_ISPN_ENDPOINT}\"}" | jq tostring)
aws lambda update-function-configuration \
--function-name ${ACCELERATOR_NAME} \
--region ${LAMBDA_REGION} \
--environment "{
\"Variables\": {
\"INFINISPAN_USER\" : \"${INFINISPAN_USER}\",
\"INFINISPAN_USER_SECRET\" : \"${INFINISPAN_USER_SECRET}\",
\"INFINISPAN_SITE_ENDPOINTS\" : ${INFINISPAN_SITE_ENDPOINTS},
\"WEBHOOK_USER\" : \"${WEBHOOK_USER}\",
\"WEBHOOK_USER_SECRET\" : \"${WEBHOOK_USER_SECRET}\",
\"SECRETS_REGION\" : \"eu-central-1\"
}
}"
</#noparse>
----
+
<1> The name of the AWS Global Accelerator used by your deployment
<2> The AWS Region hosting your Kubernetes cluster and Lambda function
<3> The name of one of your {jdgserver_name} sites as defined in <@links.ha id="deploy-infinispan-kubernetes-crossdc" />
<4> The {jdgserver_name} endpoint URL associated with the CLUSTER_1_NAME site
<5> The name of the second {jdgserver_name} site
<6> The {jdgserver_name} endpoint URL associated with the CLUSTER_2_NAME site
<7> The username of a {jdgserver_name} user which has sufficient privileges to perform REST requests on the server
<8> The name of the AWS secret containing the password associated with the {jdgserver_name} user
<9> The username used to authenticate requests to the Lambda Function
<10> The name of the AWS secret containing the password used to authenticate requests to the Lambda function
+
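The `INFINISPAN_SITE_ENDPOINTS` variable built above must contain a JSON object mapping each site name to its exposed {jdgserver_name} endpoint. A minimal sketch of that shape and of the lookup the Lambda performs against it (site names and hostnames below are invented examples):

```python
import json

# Hypothetical value for INFINISPAN_SITE_ENDPOINTS: a JSON object
# mapping site name -> exposed Infinispan endpoint host.
endpoints_json = ('{"site-a": "infinispan-external-site-a.apps.example.com",'
                  ' "site-b": "infinispan-external-site-b.apps.example.com"}')

# The Lambda parses the JSON and looks up the reporting site's endpoint,
# failing loudly if the site name is not present in the map.
endpoints = json.loads(endpoints_json)
reporter = "site-a"  # example reporting site
if reporter not in endpoints:
    raise KeyError(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")
endpoint = endpoints[reporter]

# The endpoint is then used to build the x-site take-offline REST URL.
url = f"https://{endpoint}/rest/v2/container/x-site/backups/site-b?action=take-offline"
print(url)
```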
. Retrieve the Lambda Function URL
+
.Command:
[source,bash]
@ -260,11 +309,7 @@ https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
[source,bash]
----
<#noparse>
NAMESPACE= # The namespace containing your deployments
kubectl apply -n ${NAMESPACE} -f - << EOF
include::examples/generated/ispn-site-a.yaml[tag=fencing-secret]
</#noparse>


@ -1,17 +1,37 @@
from urllib.error import HTTPError

import boto3
import jmespath
import json
import os
import urllib3

from base64 import b64decode
from urllib.parse import unquote

# Prevent unverified HTTPS connection warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

class MissingEnvironmentVariable(Exception):
    pass

class MissingSiteUrl(Exception):
    pass

def env(name):
    if name in os.environ:
        return os.environ[name]
    raise MissingEnvironmentVariable(f"Environment Variable '{name}' must be set")

def handle_site_offline(labels):
    a_client = boto3.client('globalaccelerator', region_name='us-west-2')

    acceleratorDNS = labels['accelerator']
    accelerator = jmespath.search(f"Accelerators[?(DnsName=='{acceleratorDNS}'|| DualStackDnsName=='{acceleratorDNS}')]", a_client.list_accelerators())
    if not accelerator:
        print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found")
        return
@ -40,6 +60,9 @@ def handle_site_offline(labels):
            EndpointConfigurations=endpoints
        )
        print(f"Removed site={offline_site} from Accelerator EndpointGroup")

        take_infinispan_site_offline(reporter, offline_site)
        print(f"Backup site={offline_site} caches taken offline")
    else:
        print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup")
@ -55,11 +78,30 @@ def endpoint_belongs_to_site(endpoint, site):
    return False
def take_infinispan_site_offline(reporter, offlinesite):
    endpoints = json.loads(INFINISPAN_SITE_ENDPOINTS)
    if reporter not in endpoints:
        raise MissingSiteUrl(f"Missing URL for site '{reporter}' in 'INFINISPAN_SITE_ENDPOINTS' json")
    endpoint = endpoints[reporter]

    password = get_secret(INFINISPAN_USER_SECRET)
    url = f"https://{endpoint}/rest/v2/container/x-site/backups/{offlinesite}?action=take-offline"
    http = urllib3.PoolManager(cert_reqs='CERT_NONE')
    headers = urllib3.make_headers(basic_auth=f"{INFINISPAN_USER}:{password}")
    try:
        rsp = http.request("POST", url, headers=headers)
        if rsp.status >= 400:
            raise HTTPError(url, rsp.status, f"Unexpected response status '{rsp.status}' when taking site offline", rsp.headers, None)
        rsp.release_conn()
    except HTTPError as e:
        print(f"HTTP error encountered: {e}")

def get_secret(secret_name):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=SECRETS_REGION
    )
    return client.get_secret_value(SecretId=secret_name)['SecretString']
@ -90,14 +132,9 @@ def handler(event, context):
            "statusCode": 401
        }

    expectedPass = get_secret(WEBHOOK_USER_SECRET)
    username, password = decode_basic_auth_header(authorization)
    if username != WEBHOOK_USER or password != expectedPass:
        print('Invalid username/password combination')
        return {
            "statusCode": 403
@ -109,6 +146,13 @@ def handler(event, context):
    body = json.loads(body)
    print(json.dumps(body))

    if body['status'] != 'firing':
        print("Ignoring alert as status is not 'firing', status was: '%s'" % body['status'])
        return {
            "statusCode": 204
        }

    for alert in body['alerts']:
        labels = alert['labels']
        if labels['alertname'] == 'SiteOffline':
@ -117,4 +161,11 @@ def handler(event, context):
    return {
        "statusCode": 204
    }
INFINISPAN_USER = env('INFINISPAN_USER')
INFINISPAN_USER_SECRET = env('INFINISPAN_USER_SECRET')
INFINISPAN_SITE_ENDPOINTS = env('INFINISPAN_SITE_ENDPOINTS')
SECRETS_REGION = env('SECRETS_REGION')
WEBHOOK_USER = env('WEBHOOK_USER')
WEBHOOK_USER_SECRET = env('WEBHOOK_USER_SECRET')