Rework AWS Lambda doc to show it is required (#33462)
Closes #33461
Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
Parent: c1653448f3
Commit: cb12f03003
3 changed files with 28 additions and 22 deletions
@@ -81,7 +81,9 @@ The new `footer.ftl` template provides a `content` macro that is rendered at the
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:

-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active setups are now supported.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites.
+
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure.

 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times caused by DNS caching by clients.

@@ -105,8 +105,9 @@ This is enforced by default, and can be disabled using the SPI option `spi-singl
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:

-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active
-setups are now supported, while previous configurations which leveraged active/passive loadbalancer will continue to work.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites. Previous load balancer configurations handling requests only in one site at a time will continue to work.
+
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure. The blueprints describe a setup with Alertmanager and AWS Lambda.

 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times
 caused by DNS caching by clients.

@@ -127,8 +128,8 @@ While previous versions of the cache configurations only logged warnings when th
 Due to that, you need to set up monitoring to disconnect the two sites in case of a site failure.
 The Keycloak High Availability Guide contains a blueprint on how to set this up.

-. While previous LoadBalancer configurations will continue to work with {project_name}, consider upgrading
-an existing Route53 configurations to avoid prolonged failover times due to client side DNS caching.
+. While previous load balancer configurations will continue to work with {project_name}, consider upgrading
+an existing Route53 configuration to avoid prolonged failover times due to client side DNS caching.

 . If you have updated your cache configuration XML file with remote-store configurations, those will no longer work.
 Instead, enable the `multi-site` feature and use the `cache-remote-*` options.

@@ -2,11 +2,12 @@
 <#import "/templates/links.adoc" as links>

 <@tmpl.guide
-title="Deploy an AWS Lambda to guard against Split-Brain"
+title="Deploy an AWS Lambda to disable a non-responding site"
 summary="Building block for loadbalancer resilience"
 tileVisible="false" >

-This {section} explains how to reduce the impact when split-brain scenarios occur between two sites in a multi-site deployment.
+This {section} explains how to resolve a split-brain scenario between two sites in a multi-site deployment.
+It also disables replication if one site fails, so the other site can continue to serve requests.

 This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
 Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.

@@ -14,26 +15,28 @@ Use this deployment with the other building blocks outlined in the <@links.ha id
 include::partials/blueprint-disclaimer.adoc[]

 == Architecture
-In the event of a network communication failure between the two sites in a multi-site deployment, it is no
-longer possible for the two sites to continue to replicate data between themselves and the two sites
-will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
-sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.

-In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site deployments only consist of two sites, this is not possible.
-Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent users requests.
+In the event of a network communication failure between sites in a multi-site deployment, it is no longer possible for the two sites to continue to replicate data between them.
+The {jdgserver_name} is configured with a `FAIL` failure policy, which ensures consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.

-Once the fencing procedure is triggered the replication between two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
-To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
-This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
+In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline.
+However, as multi-site deployments only consist of two sites, this is not possible.
+Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration, and hence only this site is able to serve subsequent user requests.
+
+In addition to the loadbalancer configuration, the fencing procedure disables replication between the two {jdgserver_name} clusters to allow serving user requests from the site that remains in the loadbalancer configuration.
+As a result, the sites will be out-of-sync once the replication has been disabled.
+
+To recover from the out-of-sync state, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized using the outlined procedure <@links.ha id="operate-site-online" />.

 In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
-and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
-which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects
-the current Global Accelerator configuration and removes the site reported to be offline.
+and AWS Lambda functions.
+A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, which results in the Prometheus AlertManager calling the AWS Lambda based webhook.
+The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline.
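
As a rough illustration of the webhook behaviour described in the paragraph above, a minimal Lambda handler could look like the sketch below. The handler name, the `ACCELERATOR_ARN` environment variable, and the Alertmanager label carrying the offline endpoint are assumptions made for this sketch, not values taken from the commit or the blueprint.

[source,python]
----
# Sketch of a fencing webhook: inspect the Global Accelerator configuration and
# remove the endpoint of the site reported to be offline, keeping at least one site.
import json
import os

import boto3

# Global Accelerator is a global service; its API is served from us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")


def handler(event, context):
    # Alertmanager webhooks POST a JSON document; assume a common label names the
    # load balancer endpoint of the offline site (the label name is a placeholder).
    alert = json.loads(event["body"])
    offline_endpoint = alert["commonLabels"]["site_endpoint_id"]

    accelerator_arn = os.environ["ACCELERATOR_ARN"]  # placeholder configuration

    for listener in ga.list_listeners(AcceleratorArn=accelerator_arn)["Listeners"]:
        groups = ga.list_endpoint_groups(ListenerArn=listener["ListenerArn"])["EndpointGroups"]
        for group in groups:
            endpoints = group["EndpointDescriptions"]
            remaining = [e for e in endpoints if e["EndpointId"] != offline_endpoint]
            # Never remove the last remaining site from the loadbalancer configuration.
            if remaining and len(remaining) < len(endpoints):
                ga.update_endpoint_group(
                    EndpointGroupArn=group["EndpointGroupArn"],
                    EndpointConfigurations=[
                        {"EndpointId": e["EndpointId"], "Weight": e.get("Weight", 128)}
                        for e in remaining
                    ],
                )

    return {"statusCode": 200, "body": "ok"}
----
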
-In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both
-sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at
-a given time.
+In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously.
+We guard against this by ensuring that only a single Lambda instance can be executed at a given time.
+The logic in the AWS Lambda ensures that one site entry always remains in the loadbalancer configuration.
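
One way to realise this single-execution guarantee is to cap the function's reserved concurrency at one; the boto3 sketch below shows that idea with a placeholder function name, not the name used by the blueprint.

[source,python]
----
# Sketch: limit the fencing Lambda to a single concurrent execution so that
# simultaneous webhooks from both sites cannot run the fencing logic in parallel.
import boto3

lambda_client = boto3.client("lambda", region_name="us-west-2")

lambda_client.put_function_concurrency(
    FunctionName="site-fencing-webhook",   # placeholder function name
    ReservedConcurrentExecutions=1,
)
----
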
== Prerequisites