If network communication fails between the sites in a multi-site deployment, the two sites can no longer replicate data between them.
The {jdgserver_name} is configured with a `FAIL` failure policy, which favors consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.
In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline.
However, because a multi-site deployment consists of only two sites, a quorum cannot be established.
Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration, and hence only this site is able to serve subsequent user requests.
In addition to the loadbalancer configuration, the fencing procedure disables replication between the two {jdgserver_name} clusters to allow serving user requests from the site that remains in the loadbalancer configuration.
As a result, the sites will be out-of-sync once the replication has been disabled.
To recover from the out-of-sync state, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
This is why a site that is removed via fencing is not re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized, using the procedure outlined in <@links.ha id="operate-site-online" />.
A Prometheus Alert is triggered when the {jdgserver_name} server metrics indicate a split-brain, which results in the Prometheus AlertManager calling the AWS Lambda-based webhook.
The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline.
In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously.
We guard against this by ensuring that only a single Lambda instance can be executed at a given time.
The logic in the AWS Lambda ensures that one site entry always remains in the loadbalancer configuration.
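The guard described above can be illustrated with a minimal sketch of the decision only. It makes no AWS calls, and the `remaining_endpoints` helper and the `site-a`/`site-b` names are hypothetical and not taken from the actual Lambda. It assumes site names contain no spaces and that webhook calls are processed one at a time (on AWS, a reserved concurrency of 1 is one possible way to achieve this, which is an assumption here).

```shell
# Sketch of the fencing decision only; not the actual Lambda implementation.
# remaining_endpoints is a hypothetical helper:
#   $1     site reported offline by the alert
#   $2...  site entries currently present in the loadbalancer configuration
# The body runs in a subshell so that its variables do not leak to the caller.
remaining_endpoints() (
  offline="$1"
  shift
  kept=""
  for endpoint in "$@"; do
    if [ "$endpoint" != "$offline" ]; then
      kept="${kept}${kept:+ }${endpoint}"
    fi
  done
  # Removing the reported site would leave no entry: keep the configuration as-is.
  if [ -z "$kept" ]; then
    kept="$*"
  fi
  printf '%s\n' "$kept"
)

# Split-brain: both sites call the webhook, and the calls are handled in sequence.
config="site-a site-b"
# First call: site-a reports site-b as offline (word splitting of $config is intended).
config=$(remaining_endpoints site-b $config)
# Second call: site-b reports site-a as offline, but site-a is the last entry.
config=$(remaining_endpoints site-a $config)
echo "$config"   # prints: site-a
```

Because the second call sees the configuration already updated by the first, the site that is fenced is simply the one whose report is processed second, and the configuration never ends up empty.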
kubectl -n openshift-user-workload-monitoring rollout status --watch statefulset.apps/alertmanager-user-workload
. [[aws-secret]]Decide upon a username/password combination which will be used to authenticate the Lambda webhook, and create an AWS Secret storing the password:
aws secretsmanager create-secret \
--name webhook-password \ # <1>
--secret-string changeme \ # <2>
--region eu-west-1 # <3>
<1> The name of the secret
<2> The password to be used for authentication
<3> The AWS region that hosts the secret
. Create the Role used to execute the Lambda.
ROLE_ARN=$(aws iam create-role \
  --role-name ${FUNCTION_NAME} \
  --assume-role-policy-document \
  '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": ""
        },
        "Action": "sts:AssumeRole"
      }
    ]
  }' \
  --query 'Role.Arn' \
  --region eu-west-1 \ #<2>
  --output text
)
<1> A name of your choice to associate with the Lambda and related resources
<2> The AWS Region hosting your Kubernetes clusters
. Create and attach the `LambdaSecretManager` Policy so that the Lambda can access AWS Secrets:
POLICY_ARN=$(aws iam create-policy \
  --policy-name LambdaSecretManager \
  --policy-document \
  '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
        ],
        "Resource": "*"
      }
    ]
  }' \
  --query 'Policy.Arn' \
  --output text
)
aws iam attach-role-policy \
--role-name ${FUNCTION_NAME} \
--policy-arn ${POLICY_ARN}
. Attach the `ElasticLoadBalancingReadOnly` policy so that the Lambda can query the provisioned Network Load Balancers