Rework AWS Lambda doc to show it is required (#33462)

Closes #33461
Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
Alexander Schwartz 2024-10-02 12:42:11 +02:00 committed by GitHub
parent c1653448f3
commit cb12f03003
3 changed files with 28 additions and 22 deletions


@@ -81,7 +81,9 @@ The new `footer.ftl` template provides a `content` macro that is rendered at the
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:
-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active setups are now supported.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites.
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure.
 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times caused by DNS caching by clients.


@@ -105,8 +105,9 @@ This is enforced by default, and can be disabled using the SPI option `spi-singl
 {project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably:
-- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active
-setups are now supported, while previous configurations which leveraged active/passive loadbalancer will continue to work.
+- {project_name} deployments are now able to handle user requests simultaneously in both sites. Previous load balancer configurations handling requests only in one site at a time will continue to work.
+- Active monitoring of the connectivity between the sites is now required to update the replication between the sites in case of a failure. The blueprints describe a setup with Alertmanager and AWS Lambda.
 - The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times
 caused by DNS caching by clients.
@@ -127,8 +128,8 @@ While previous versions of the cache configurations only logged warnings when th
 Due to that, you need to set up monitoring to disconnect the two sites in case of a site failure.
 The Keycloak High Availability Guide contains a blueprint on how to set this up.
-. While previous LoadBalancer configurations will continue to work with {project_name}, consider upgrading
-an existing Route53 configurations to avoid prolonged failover times due to client side DNS caching.
+. While previous load balancer configurations will continue to work with {project_name}, consider upgrading
+an existing Route53 configuration to avoid prolonged failover times due to client side DNS caching.
 . If you have updated your cache configuration XML file with remote-store configurations, those will no longer work.
 Instead, enable the `multi-site` feature and use the `cache-remote-*` options.


@@ -2,11 +2,12 @@
 <#import "/templates/links.adoc" as links>
 <@tmpl.guide
-title="Deploy an AWS Lambda to guard against Split-Brain"
+title="Deploy an AWS Lambda to disable a non-responding site"
 summary="Building block for loadbalancer resilience"
 tileVisible="false" >
-This {section} explains how to reduce the impact when split-brain scenarios occur between two sites in a multi-site deployment.
+This {section} explains how to resolve a split-brain scenario between two sites in a multi-site deployment.
+It also disables replication if one site fails, so the other site can continue to serve requests.
 This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
 Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.
@@ -14,26 +15,28 @@ Use this deployment with the other building blocks outlined in the <@links.ha id
 include::partials/blueprint-disclaimer.adoc[]
 == Architecture
-In the event of a network communication failure between the two sites in a multi-site deployment, it is no
-longer possible for the two sites to continue to replicate data between themselves and the two sites
-will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
-sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
-In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site deployments only consist of two sites, this is not possible.
-Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this site is able to serve subsequent users requests.
-Once the fencing procedure is triggered the replication between two {jdgserver_name} clusters in each site is no longer enabled and as a result the sites will be out-of-sync.
-To recover from the out-of-sync state a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
-This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
+In the event of a network communication failure between sites in a multi-site deployment, it is no longer possible for the two sites to continue to replicate data between them.
+The {jdgserver_name} is configured with a `FAIL` failure policy, which ensures consistency over availability. Consequently, all user requests are served with an error message until the failure is resolved, either by restoring the network connection or by disabling cross-site replication.
+In such scenarios, a quorum is commonly used to determine which sites are marked as online or offline.
+However, as multi-site deployments only consist of two sites, this is not possible.
+Instead, we leverage "`fencing`" to ensure that when one of the sites is unable to connect to the other site, only one site remains in the loadbalancer configuration, and hence only this site is able to serve subsequent user requests.
+In addition to the loadbalancer configuration, the fencing procedure disables replication between the two {jdgserver_name} clusters to allow serving user requests from the site that remains in the loadbalancer configuration.
+As a result, the sites will be out-of-sync once the replication has been disabled.
+To recover from the out-of-sync state, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically when the network communication failure is resolved. The removed site should only be re-added once the two sites have been synchronized using the outlined procedure <@links.ha id="operate-site-online" />.
 In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
-and AWS Lambda functions. A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics,
-which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects
-the current Global Accelerator configuration and removes the site reported to be offline.
+and AWS Lambda functions.
+A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, which results in the Prometheus AlertManager calling the AWS Lambda based webhook.
+The triggered Lambda function inspects the current Global Accelerator configuration and removes the site reported to be offline.
-In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both
-sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at
-a given time.
+In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both sites will trigger the webhook simultaneously.
+We guard against this by ensuring that only a single Lambda instance can be executed at a given time.
+The logic in the AWS Lambda ensures that one site entry always remains in the loadbalancer configuration.
 == Prerequisites