From db14ab13650885da746dff959e953fc6e2c1cf52 Mon Sep 17 00:00:00 2001 From: Ryan Emerson Date: Wed, 7 Aug 2024 09:22:59 +0100 Subject: [PATCH] Refactor HA guide to refer to generic multi-site deployments Old Active/Passive guides replaced with Active/Active architecture, but A/P vs A/A distinction hidden from users in favour of generic multi-site docs. Closes #31029 Signed-off-by: Ryan Emerson Signed-off-by: Alexander Schwartz Co-authored-by: Alexander Schwartz --- .../release_notes/topics/26_0_0.adoc | 13 + .../topics/changes/changes-26_0_0.adoc | 27 ++ ...sive-sync.adoc => bblocks-multi-site.adoc} | 16 +- .../concepts-infinispan-cli-batch.adoc | 2 +- ...ive-sync.adoc => concepts-multi-site.adoc} | 79 ++--- .../deploy-aurora-multi-az.adoc | 4 +- ...deploy-aws-accelerator-fencing-lambda.adoc | 330 ++++++++++++++++++ .../deploy-aws-accelerator-loadbalancer.adoc | 304 ++++++++++++++++ .../deploy-aws-route53-failover-lambda.adoc | 252 ------------- .../deploy-aws-route53-loadbalancer.adoc | 281 --------------- .../deploy-infinispan-kubernetes-crossdc.adoc | 27 +- .../deploy-keycloak-kubernetes.adoc | 4 +- .../examples/generated/fencing_lambda.py | 120 +++++++ .../examples/generated/ispn-site-a.yaml | 138 ++++++-- .../examples/generated/ispn-site-b.yaml | 79 +++-- .../high-availability/introduction.adoc | 15 +- .../high-availability/operate-failover.adoc | 36 -- .../operate-network-partition-recovery.adoc | 79 ----- .../operate-site-offline.adoc | 79 +++++ .../operate-site-online.adoc | 77 ++++ .../operate-switch-back.adoc | 84 ----- .../operate-switch-over.adoc | 95 ----- .../operate-synchronize.adoc | 68 ++++ .../partials/accelerator/endpoint-group.adoc | 25 ++ .../partials/accelerator/nlb-arn.adoc | 21 ++ .../infinispan/infinispan-attributes.adoc | 4 + docs/guides/high-availability/pinned-guides | 4 +- .../accelerator-multi-az.dio.svg | 4 + .../active-active-sync.dio.svg | 4 + .../active-passive-sync.dio.svg | 21 -- .../infinispan-crossdc-az.dio.svg | 2 +- .../route53-multi-az-failover.svg | 4 - 32 files changed, 1299 insertions(+), 999 deletions(-) rename docs/guides/high-availability/{bblocks-active-passive-sync.adoc => bblocks-multi-site.adoc} (75%) rename docs/guides/high-availability/{concepts-active-passive-sync.adoc => concepts-multi-site.adoc} (60%) create mode 100644 docs/guides/high-availability/deploy-aws-accelerator-fencing-lambda.adoc create mode 100644 docs/guides/high-availability/deploy-aws-accelerator-loadbalancer.adoc delete mode 100644 docs/guides/high-availability/deploy-aws-route53-failover-lambda.adoc delete mode 100644 docs/guides/high-availability/deploy-aws-route53-loadbalancer.adoc create mode 100644 docs/guides/high-availability/examples/generated/fencing_lambda.py delete mode 100644 docs/guides/high-availability/operate-failover.adoc delete mode 100644 docs/guides/high-availability/operate-network-partition-recovery.adoc create mode 100644 docs/guides/high-availability/operate-site-offline.adoc create mode 100644 docs/guides/high-availability/operate-site-online.adoc delete mode 100644 docs/guides/high-availability/operate-switch-back.adoc delete mode 100644 docs/guides/high-availability/operate-switch-over.adoc create mode 100644 docs/guides/high-availability/operate-synchronize.adoc create mode 100644 docs/guides/high-availability/partials/accelerator/endpoint-group.adoc create mode 100644 docs/guides/high-availability/partials/accelerator/nlb-arn.adoc create mode 100644 docs/guides/images/high-availability/accelerator-multi-az.dio.svg create mode 100644 
docs/guides/images/high-availability/active-active-sync.dio.svg delete mode 100644 docs/guides/images/high-availability/active-passive-sync.dio.svg delete mode 100644 docs/guides/images/high-availability/route53-multi-az-failover.svg diff --git a/docs/documentation/release_notes/topics/26_0_0.adoc b/docs/documentation/release_notes/topics/26_0_0.adoc index 1c3180e63f..757727da0e 100644 --- a/docs/documentation/release_notes/topics/26_0_0.adoc +++ b/docs/documentation/release_notes/topics/26_0_0.adoc @@ -75,6 +75,19 @@ The `keycloak` login theme has been deprecated in favour of the new `keycloak.v2 While it remains the default for the new realms for compatibility reasons, it is strongly recommended to switch all the realm themes to `keycloak.v2`. += Highly available multi-site deployments + +{project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably: + +- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active setups are now supported. + +- The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times caused by DNS caching by clients. + +- Persistent user sessions are now a requirement of the architecture. Consequently, user sessions will be kept +on {project_name} or {jdgserver_name} upgrades. + +For information on how to migrate, see the link:{upgradingguide_link}[{upgradingguide_name}]. + = Admin Bootstrapping and Recovery In the past, regaining access to a {project_name} instance when all admin users were locked out was a challenging and complex process. Recognizing these challenges and aiming to significantly enhance the user experience, {project_name} now offers several straightforward methods to bootstrap a temporary admin account and recover lost admin access. diff --git a/docs/documentation/upgrading/topics/changes/changes-26_0_0.adoc b/docs/documentation/upgrading/topics/changes/changes-26_0_0.adoc index 7c6f5e69b4..c8c7b6a7ac 100644 --- a/docs/documentation/upgrading/topics/changes/changes-26_0_0.adoc +++ b/docs/documentation/upgrading/topics/changes/changes-26_0_0.adoc @@ -92,6 +92,33 @@ To disable this behavior, use the SPI option `spi-single-use-object-infinispan-p The SPI behavior of `SingleUseObjectProvider` has changed that for revoked tokens only the methods `put` and `contains` must be used. This is enforced by default, and can be disabled using the SPI option `spi-single-use-object-infinispan-persist-revoked-tokens`. += Highly available multi-site deployments + +{project_name} 26 introduces significant improvements to the recommended HA multi-site architecture, most notably: + +- {project_name} deployments on each site are now able to handle user requests simultaneously, therefore active/active +setups are now supported, while previous configurations which leveraged active/passive loadbalancer will continue to work. + +- The loadbalancer blueprint has been updated to use the AWS Global Accelerator as this avoids prolonged fail-over times +caused by DNS caching by clients. + +- Persistent user sessions are now a requirement of the architecture. Consequently, user sessions will be kept +on {project_name} or {jdgserver_name} upgrades. + +- External {jdgserver_name} request handling has been improved to reduce memory usage and request latency. + +As a consequence of the above changes, the following changes are required to your existing {project_name} deployments. + +. 
`distributed-cache` definitions provided by a cache configuration file are ignored when the `multi-site` feature is enabled, +so you must configure the connection to the external {jdgserver_name} deployment via the `cache-remote-*` command line arguments +or Keycloak CR as outlined in the blueprints. If a `remote-store` configuration is detected in the cache configuration file, +then a warning will be raised in the {project_name} logs. + +. Review your current cache configurations in the external {jdgserver_name} and update them with those outlined in the latest version of the {project_name}'s documentation. + +. While previous LoadBalancer configurations will continue to work with {project_name}, consider upgrading +an existing Route53 configurations to avoid prolonged failover times due to client side DNS caching. + = Admin Bootstrapping and Recovery It used to be difficult to regain access to a {project_name} instance when all admin users were locked out. The process required multiple advanced steps, including direct database access and manual changes. In an effort to improve the user experience, {project_name} now provides multiple ways to bootstrap a new admin account, which can be used to recover from such situations. diff --git a/docs/guides/high-availability/bblocks-active-passive-sync.adoc b/docs/guides/high-availability/bblocks-multi-site.adoc similarity index 75% rename from docs/guides/high-availability/bblocks-active-passive-sync.adoc rename to docs/guides/high-availability/bblocks-multi-site.adoc index a499f6a61b..5e5c0c8ce5 100644 --- a/docs/guides/high-availability/bblocks-active-passive-sync.adoc +++ b/docs/guides/high-availability/bblocks-multi-site.adoc @@ -2,10 +2,10 @@ <#import "/templates/links.adoc" as links> <@tmpl.guide -title="Building blocks active-passive deployments" +title="Building blocks multi-site deployments" summary="Overview of building blocks, alternatives and not considered options" > -The following building blocks are needed to set up an active-passive deployment with synchronous replication. +The following building blocks are needed to set up a multi-site deployment with synchronous replication. The building blocks link to a blueprint with an example configuration. They are listed in the order in which they need to be installed. @@ -14,7 +14,7 @@ include::partials/blueprint-disclaimer.adoc[] == Prerequisites -* Understanding the concepts laid out in the <@links.ha id="concepts-active-passive-sync"/> {section}. +* Understanding the concepts laid out in the <@links.ha id="concepts-multi-site"/> {section}. == Two sites with low-latency connection @@ -50,7 +50,7 @@ It might be considered in the future. [IMPORTANT] ==== -Only {jdgserver_name} server versions 15.0.0 or greater are supported in Active/Passive deployments. +Only {jdgserver_name} server versions 15.0.0 or greater are supported in multi-site deployments. ==== == {project_name} @@ -63,10 +63,6 @@ A clustered deployment of {project_name} in each site, connected to an external == Load balancer -A load balancer which checks the `/lb-check` URL of the {project_name} deployment in each site. - -*Blueprint:* <@links.ha id="deploy-aws-route53-loadbalancer"/>, optionally enhanced with <@links.ha id="deploy-aws-route53-failover-lambda"/> - -*Not considered:* AWS Global Accelerator as it supports only weighted traffic routing and not active-passive failover. 
-To support active-passive failover, additional logic using, for example, AWS CloudWatch and AWS Lambda would be necessary to simulate the active-passive handling by adjusting the weights when the probes fail. +A load balancer which checks the `/lb-check` URL of the {project_name} deployment in each site, plus an automation to detect {jdgserver_name} connectivity problems between the two sites. +*Blueprint:* <@links.ha id="deploy-aws-accelerator-loadbalancer"/> together with <@links.ha id="deploy-aws-accelerator-fencing-lambda"/>. diff --git a/docs/guides/high-availability/concepts-infinispan-cli-batch.adoc b/docs/guides/high-availability/concepts-infinispan-cli-batch.adoc index 5e64974350..01f409a867 100644 --- a/docs/guides/high-availability/concepts-infinispan-cli-batch.adoc +++ b/docs/guides/high-availability/concepts-infinispan-cli-batch.adoc @@ -19,7 +19,7 @@ For human interactions, the CLI shell might still be a better fit. == Example -The following `Batch` CR takes a site offline as described in the operational procedure <@links.ha id="operate-switch-over" />. +The following `Batch` CR takes a site offline as described in the operational procedure <@links.ha id="operate-site-offline" />. [source,yaml,subs="+attributes"] ---- diff --git a/docs/guides/high-availability/concepts-active-passive-sync.adoc b/docs/guides/high-availability/concepts-multi-site.adoc similarity index 60% rename from docs/guides/high-availability/concepts-active-passive-sync.adoc rename to docs/guides/high-availability/concepts-multi-site.adoc index 08caba5c10..3aeab944eb 100644 --- a/docs/guides/high-availability/concepts-active-passive-sync.adoc +++ b/docs/guides/high-availability/concepts-multi-site.adoc @@ -2,14 +2,14 @@ <#import "/templates/links.adoc" as links> <@tmpl.guide -title="Concepts for active-passive deployments" -summary="Understanding an active-passive deployment with synchronous replication" > +title="Concepts for multi-site deployments" +summary="Understanding a multi-site deployment with synchronous replication" > -This topic describes a highly available active/passive setup and the behavior to expect. It outlines the requirements of the high availability active/passive architecture and describes the benefits and tradeoffs. +This topic describes a highly available multi-site setup and the behavior to expect. It outlines the requirements of the high availability architecture and describes the benefits and tradeoffs. == When to use this setup -Use this setup to be able to fail over automatically in the event of a site failure, which reduces the likelihood of losing data or sessions. Manual interactions are usually required to restore the redundancy after the failover. +Use this setup to provide {project_name} deployments that are able to tolerate site failures, reducing the likelihood of downtime. == Deployment, data storage and caching @@ -23,15 +23,14 @@ As session data of the external {jdgserver_name} is also cached in the Infinispa In the following paragraphs and diagrams, references to deploying {jdgserver_name} apply to the external {jdgserver_name}. -image::high-availability/active-passive-sync.dio.svg[] +image::high-availability/active-active-sync.dio.svg[] == Causes of data and service loss While this setup aims for high availability, the following situations can still lead to service or data loss: -* Network failures between the sites or failures of components can lead to short service downtimes while those failures are detected. -The service will be restored automatically. 
-The system is degraded until the failures are detected and the backup cluster is promoted to service requests. +* {project_name} site failure may result in requests failing in the period between the failure and the loadbalancer detecting +it, as requests may still be routed to the failed site. * Once failures occur in the communication between the sites, manual steps are necessary to re-synchronize a degraded setup. @@ -60,39 +59,32 @@ Monitoring is necessary to detect degraded setups. | Less than one minute | {jdgserver_name} cluster failure -| If the {jdgserver_name} cluster fails in the active site, {project_name} will not be able to communicate with the external {jdgserver_name}, and the {project_name} service will be unavailable. -The loadbalancer will detect the situation as `/lb-check` returns an error, and will fail over to the other site. +| If the {jdgserver_name} cluster fails in one of the sites, {project_name} will not be able to communicate with the external {jdgserver_name} on that site, and the {project_name} service will be unavailable. +The loadbalancer will detect the situation as `/lb-check` returns an error, and will direct all traffic to the other site. -The setup is degraded until the {jdgserver_name} cluster is restored and the session data is re-synchronized to the primary. +The setup is degraded until the {jdgserver_name} cluster is restored and the session data is re-synchronized. | No data loss^3^ | Seconds to minutes (depending on load balancer setup) | Connectivity {jdgserver_name} | If the connectivity between the two sites is lost, session information cannot be sent to the other site. Incoming requests might receive an error message or are delayed for some seconds. -The primary site marks the secondary site offline, and will stop sending data to the secondary. -The setup is degraded until the connection is restored and the session data is re-synchronized to the secondary site. +The {jdgserver_name} will mark the other site offline, and will stop sending data. +One of the sites needs to be taken offline in the load balancer until the connection is restored and the session data is re-synchronized between the two sites. +In the blueprints, we show how this can be automated. | No data loss^3^ -| Less than one minute +| Seconds to minutes (depending on load balancer setup) | Connectivity database -| If the connectivity between the two sites is lost, the synchronous replication will fail, and it might take some time for the primary site to mark the secondary offline. +| If the connectivity between the two sites is lost, the synchronous replication will fail. Some requests might receive an error message or be delayed for a few seconds. Manual operations might be necessary depending on the database. | No data loss^3^ | Seconds to minutes (depending on the database) -| Primary site -| If none of the {project_name} nodes are available, the loadbalancer will detect the outage and redirect the traffic to the secondary site. -Some requests might receive an error message while the loadbalancer has not detected the primary site failure. -The setup will be degraded until the primary site is back up and the session state has been manually synchronized from the secondary to the primary site. -| No data loss^3^ -| Less than one minute - -| Secondary site -| If the secondary site is not available, it will take a moment for the primary {jdgserver_name} and database to mark the secondary site offline. -Some requests might receive an error message while the detection takes place. 
-Once the secondary site is up again, the session state needs to be manually synced from the primary site to the secondary site. +| Site failure +| If none of the {project_name} nodes are available, the loadbalancer will detect the outage and redirect the traffic to the other site. +Some requests might receive an error message until the loadbalancer detects the failure. | No data loss^3^ | Less than one minute @@ -107,19 +99,11 @@ The statement "`No data loss`" depends on the setup not being degraded from prev == Known limitations -Upgrades:: -* On {project_name} or {jdgserver_name} version upgrades (major, minor and patch), all session data (except offline session) will be lost as neither supports zero downtime upgrades. - -Failovers:: +Site Failure:: * A successful failover requires a setup not degraded from previous failures. All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss. Use monitoring to ensure degradations are detected and handled in a timely manner. -Switchovers:: -* A successful switchover requires a setup not degraded from previous failures. -All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss. -Use monitoring to ensure degradations are detected and handled in a timely manner. - Out-of-sync sites:: * The sites can become out of sync when a synchronous {jdgserver_name} request fails. This situation is currently difficult to monitor, and it would need a full manual re-sync of {jdgserver_name} to recover. @@ -131,20 +115,17 @@ Manual operations:: == Questions and answers Why synchronous database replication?:: -A synchronously replicated database ensures that data written in the primary site is always available in the secondary site on failover and no data is lost. +A synchronously replicated database ensures that data written in one site is always available in the other site after site failures and no data is lost. +It also ensures that the next request will not return stale data, independent on which site it is served. Why synchronous {jdgserver_name} replication?:: -A synchronously replicated {jdgserver_name} ensures that sessions created, updated and deleted in the primary site are always available in the secondary site on failover and no data is lost. +A synchronously replicated {jdgserver_name} ensures that sessions created, updated and deleted in one site are always available on the other site after a site failure and no data is lost. +It also ensures that the next request will not return stale data, independent on which site it is served. Why is a low-latency network between sites needed?:: -Synchronous replication defers the response to the caller until the data is received at the secondary site. +Synchronous replication defers the response to the caller until the data is received at the other site. For synchronous database replication and synchronous {jdgserver_name} replication, a low latency is necessary as each request can have potentially multiple interactions between the sites when data is updated which would amplify the latency. -Why active-passive?:: -Some databases support a single writer instance with a reader instance which is then promoted to be the new writer once the original writer fails. -In such a setup, it is beneficial for the latency to have the writer instance in the same site as the currently active {project_name}. -Synchronous {jdgserver_name} replication can lead to deadlocks when entries in both sites are modified concurrently. 
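+
+NOTE: As a rough, illustrative calculation of this latency amplification (the numbers below are assumptions chosen for illustration, not measurements of any particular deployment): with an inter-site round-trip time of 5 ms, a request that triggers four synchronous cross-site interactions waits roughly 4 x 5 ms = 20 ms on replication alone,
+whereas the same request across a 50 ms link would wait roughly 200 ms.
+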
- Is this setup limited to two sites?:: This setup could be extended to multiple sites, and there are no fundamental changes necessary to have, for example, three sites. Once more sites are added, the overall latency between the sites increases, and the likeliness of network failures, and therefore short downtimes, increases as well. @@ -152,20 +133,20 @@ Therefore, such a deployment is expected to have worse performance and an inferi For now, it has been tested and documented with blueprints only for two sites. Is a synchronous cluster less stable than an asynchronous cluster?:: -An asynchronous setup would handle network failures between the sites gracefully, while the synchronous setup would delay requests and will throw errors to the caller where the asynchronous setup would have deferred the writes to {jdgserver_name} or the database to the secondary site. -However, as the secondary site would never be fully up-to-date with the primary site, this setup could lead to data loss during failover. +An asynchronous setup would handle network failures between the sites gracefully, while the synchronous setup would delay requests and will throw errors to the caller where the asynchronous setup would have deferred the writes to {jdgserver_name} or the database on the other site. +However, as the two sites would never be fully up-to-date, this setup could lead to data loss during failures. This would include: + -- -* Lost logouts, meaning sessions are logged in the secondary site although they are logged out in to the primary site at the point of failover when using an asynchronous {jdgserver_name} replication of sessions. -* Lost changes leading to users being able to log in with an old password because database changes are not replicated to the secondary site at the point of failover when using an asynchronous database. -* Invalid caches leading to users being able to log in with an old password because invalidating caches are not propagated at the point of failover to the secondary site when using an asynchronous {jdgserver_name} replication. +* Lost logouts, meaning sessions are logged in one site although they are logged out in the other site at the point of failure when using an asynchronous {jdgserver_name} replication of sessions. +* Lost changes leading to users being able to log in with an old password because database changes are not replicated to the other site at the point of failure when using an asynchronous database. +* Invalid caches leading to users being able to log in with an old password because invalidating caches are not propagated at the point of failure to the other site when using an asynchronous {jdgserver_name} replication. -- + Therefore, tradeoffs exist between high availability and consistency. The focus of this topic is to prioritize consistency over availability with {project_name}. == Next steps -Continue reading in the <@links.ha id="bblocks-active-passive-sync" /> {section} to find blueprints for the different building blocks. +Continue reading in the <@links.ha id="bblocks-multi-site" /> {section} to find blueprints for the different building blocks. 
diff --git a/docs/guides/high-availability/deploy-aurora-multi-az.adoc b/docs/guides/high-availability/deploy-aurora-multi-az.adoc
index c8f0eb0625..0f81b3f73b 100644
--- a/docs/guides/high-availability/deploy-aurora-multi-az.adoc
+++ b/docs/guides/high-availability/deploy-aurora-multi-az.adoc
@@ -8,8 +8,8 @@ tileVisible="false" >
 This topic describes how to deploy an Aurora regional deployment of a PostgreSQL instance across multiple availability zones to tolerate one or more availability zone failures in a given AWS region.
-This deployment is intended to be used with the setup described in the <@links.ha id="concepts-active-passive-sync"/> {section}.
-Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-active-passive-sync"/> {section}.
+This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
+Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.
 include::partials/blueprint-disclaimer.adoc[]
diff --git a/docs/guides/high-availability/deploy-aws-accelerator-fencing-lambda.adoc b/docs/guides/high-availability/deploy-aws-accelerator-fencing-lambda.adoc
new file mode 100644
index 0000000000..cc6be6e087
--- /dev/null
+++ b/docs/guides/high-availability/deploy-aws-accelerator-fencing-lambda.adoc
@@ -0,0 +1,330 @@
+<#import "/templates/guide.adoc" as tmpl>
+<#import "/templates/links.adoc" as links>
+
+<@tmpl.guide
+title="Deploy an AWS Lambda to guard against Split-Brain"
+summary="Building block for loadbalancer resilience"
+tileVisible="false" >
+
+This {section} explains how to reduce the impact when split-brain scenarios occur between two sites in a multi-site deployment.
+
+This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
+Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.
+
+include::partials/blueprint-disclaimer.adoc[]
+
+== Architecture
+In the event of a network communication failure between the two sites in a multi-site deployment, it is no
+longer possible for the two sites to continue to replicate session state between themselves and the two sites
+will become increasingly out-of-sync. As it is possible for subsequent Keycloak requests to be routed to different
+sites, this may lead to unexpected behaviour as previous updates will not have been applied to both sites.
+
+In such scenarios a quorum is commonly used to determine which sites are marked as online or offline, however as multi-site
+deployments only consist of two sites, this is not possible. Instead, we leverage "`fencing`" to ensure that when one of the
+sites is unable to connect to the other site, only one site remains in the loadbalancer configuration and hence only this
+site is able to serve subsequent user requests.
+
+As the state stored in {jdgserver_name} will be out-of-sync once the connectivity has been lost, a manual re-sync is necessary as described in <@links.ha id="operate-synchronize" />.
+This is why a site which is removed via fencing will not be re-added automatically, but only after such a synchronisation using the manual procedure <@links.ha id="operate-site-online" />.
+
+In this {section} we describe how to implement fencing using a combination of https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerts]
+and AWS Lambda functions.
A Prometheus Alert is triggered when split-brain is detected by the {jdgserver_name} server metrics, +which results in the Prometheus AlertManager calling the AWS Lambda based webhook. The triggered Lambda function inspects +the current Global Accelerator configuration and removes the site reported to be offline. + +In a true split-brain scenario, where both sites are still up but network communication is down, it is possible that both +sites will trigger the webhook simultaneously. We guard against this by ensuring that only a single Lambda instance can be executed at +a given time. + +== Prerequisites + +* ROSA HCP based multi-site Keycloak deployment +* AWS CLI Installed +* AWS Global Accelerator loadbalancer + +== Procedure +. Enable Openshift user alert routing ++ +.Command: +[source,bash] +---- +kubectl apply -f - << EOF +apiVersion: v1 +kind: ConfigMap +metadata: + name: user-workload-monitoring-config + namespace: openshift-user-workload-monitoring +data: + config.yaml: | + alertmanager: + enabled: true + enableAlertmanagerConfig: true +EOF +kubectl -n openshift-user-workload-monitoring rollout status --watch statefulset.apps/alertmanager-user-workload +---- ++ +. [[aws-secret]]Decide upon a username/password combination which will be used to authenticate the Lambda webhook and create an AWS Secret storing the password ++ +.Command: +[source,bash] +---- +aws secretsmanager create-secret \ + --name webhook-password \ # <1> + --secret-string changeme \ # <2> + --region eu-west-1 # <3> +---- +<1> The name of the secret +<2> The password to be used for authentication +<3> The AWS region that hosts the secret ++ +. Create the Role used to execute the Lambda. ++ +.Command: +[source,bash] +---- +<#noparse> +FUNCTION_NAME= # <1> +ROLE_ARN=$(aws iam create-role \ + --role-name ${FUNCTION_NAME} \ + --assume-role-policy-document \ + '{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "lambda.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] + }' \ + --query 'Role.Arn' \ + --region eu-west-1 \ #<2> + --output text +) + +---- +<1> A name of your choice to associate with the Lambda and related resources +<2> The AWS Region hosting your Kubernetes clusters ++ +. Create and attach the 'LambdaSecretManager' Policy so that the Lambda can access AWS Secrets ++ +.Command: +[source,bash] +---- +<#noparse> +POLICY_ARN=$(aws iam create-policy \ + --policy-name LambdaSecretManager \ + --policy-document \ + '{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "secretsmanager:GetSecretValue" + ], + "Resource": "*" + } + ] + }' \ + --query 'Policy.Arn' \ + --output text +) +aws iam attach-role-policy \ + --role-name ${FUNCTION_NAME} \ + --policy-arn ${POLICY_ARN} + +---- ++ +. Attach the `ElasticLoadBalancingReadOnly` policy so that the Lambda can query the provisioned Network Load Balancers ++ +.Command: +[source,bash] +---- +<#noparse> +aws iam attach-role-policy \ + --role-name ${FUNCTION_NAME} \ + --policy-arn arn:aws:iam::aws:policy/ElasticLoadBalancingReadOnly + +---- ++ +. Attach the `GlobalAcceleratorFullAccess` policy so that the Lambda can update the Global Accelerator EndpointGroup ++ +.Command: +[source,bash] +---- +<#noparse> +aws iam attach-role-policy \ + --role-name ${FUNCTION_NAME} \ + --policy-arn arn:aws:iam::aws:policy/GlobalAcceleratorFullAccess + +---- ++ +. 
Create a Lambda ZIP file containing the required fencing logic
++
+.Command:
+[source,bash]
+----
+<#noparse>
+LAMBDA_ZIP=/tmp/lambda.zip
+cat << EOF > /tmp/lambda.py
+
+include::examples/generated/fencing_lambda.py[tag=fencing-start]
+    expected_user = 'keycloak' # <1>
+    secret_name = 'webhook-password' # <2>
+    secret_region = 'eu-west-1' # <3>
+include::examples/generated/fencing_lambda.py[tag=fencing-end]
+
+EOF
+zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py
+
+----
+<1> The username required to authenticate Lambda requests
+<2> The AWS secret containing the password <<aws-secret>>
+<3> The AWS region which stores the password secret
++
+. Create the Lambda function.
++
+.Command:
+[source,bash]
+----
+<#noparse>
+aws lambda create-function \
+  --function-name ${FUNCTION_NAME} \
+  --zip-file fileb://${LAMBDA_ZIP} \
+  --handler lambda.handler \
+  --runtime python3.12 \
+  --role ${ROLE_ARN} \
+  --region eu-west-1 #<1>
+
+----
+<1> The AWS Region hosting your Kubernetes clusters
++
+. Expose a Function URL so the Lambda can be triggered as webhook
++
+.Command:
+[source,bash]
+----
+<#noparse>
+aws lambda create-function-url-config \
+  --function-name ${FUNCTION_NAME} \
+  --auth-type NONE \
+  --region eu-west-1 #<1>
+
+----
+<1> The AWS Region hosting your Kubernetes clusters
++
+. Allow public invocations of the Function URL
++
+.Command:
+[source,bash]
+----
+<#noparse>
+aws lambda add-permission \
+  --action "lambda:InvokeFunctionUrl" \
+  --function-name ${FUNCTION_NAME} \
+  --principal "*" \
+  --statement-id FunctionURLAllowPublicAccess \
+  --function-url-auth-type NONE \
+  --region eu-west-1 # <1>
+
+----
+<1> The AWS Region hosting your Kubernetes clusters
++
+. Retrieve the Lambda Function URL
++
+.Command:
+[source,bash]
+----
+<#noparse>
+aws lambda get-function-url-config \
+  --function-name ${FUNCTION_NAME} \
+  --query "FunctionUrl" \
+  --region eu-west-1 \#<1>
+  --output text
+
+----
+<1> The AWS region where the Lambda was created
++
+.Output:
+[source,bash]
+----
+https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws
+----
+. In each Kubernetes cluster, configure a Prometheus Alert routing to trigger the Lambda on split-brain
++
+.Command:
+[source,bash]
+----
+<#noparse>
+ACCELERATOR_NAME= # <1>
+NAMESPACE= # <2>
+LOCAL_SITE= # <3>
+REMOTE_SITE= # <4>
+
+kubectl apply -n ${NAMESPACE} -f - << EOF
+include::examples/generated/ispn-site-a.yaml[tag=fencing-secret]
+
+---
+include::examples/generated/ispn-site-a.yaml[tag=fencing-alert-manager-config]
+---
+include::examples/generated/ispn-site-a.yaml[tag=fencing-prometheus-rule]
+EOF
+----
+<1> The username required to authenticate Lambda requests
+<2> The password required to authenticate Lambda requests
+<3> The Lambda Function URL
+<4> The namespace value should be the namespace hosting the Infinispan CR and the site should be the remote site defined
+by `spec.service.sites.locations[0].name` in your Infinispan CR
+<5> The name of your local site defined by `spec.service.sites.local.name` in your Infinispan CR
+<6> The DNS of your Global Accelerator
+
+== Verify
+
+To test that the Prometheus alert triggers the webhook as expected, perform the following steps to simulate a split-brain:
+
+.
In each of your clusters execute the following:
++
+.Command:
+[source,bash]
+----
+<#noparse>
+kubectl -n openshift-operators scale --replicas=0 deployment/infinispan-operator-controller-manager #<1>
+kubectl -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
+kubectl -n ${NAMESPACE} scale --replicas=0 deployment/infinispan-router #<2>
+kubectl -n ${NAMESPACE} rollout status -w deployment/infinispan-router
+
+----
+<1> Scale down the {jdgserver_name} Operator so that the next step does not result in the deployment being recreated by the operator
+<2> Scale down the Gossip Router deployment. Replace `$\{NAMESPACE}` with the namespace containing your {jdgserver_name} server
++
+. Verify the `SiteOffline` event has been fired on a cluster by inspecting the *Observe* -> *Alerting* menu in the Openshift
+console
++
+. Inspect the Global Accelerator EndpointGroup in the AWS console and there should only be a single endpoint present
++
+. Scale up the {jdgserver_name} Operator and Gossip Router to re-establish a connection between sites:
++
+.Command:
+[source,bash]
+----
+<#noparse>
+kubectl -n openshift-operators scale --replicas=1 deployment/infinispan-operator-controller-manager
+kubectl -n openshift-operators rollout status -w deployment/infinispan-operator-controller-manager
+kubectl -n ${NAMESPACE} scale --replicas=1 deployment/infinispan-router #<1>
+kubectl -n ${NAMESPACE} rollout status -w deployment/infinispan-router
+
+----
+<1> Replace `$\{NAMESPACE}` with the namespace containing your {jdgserver_name} server
++
+. Inspect the `vendor_jgroups_site_view_status` metric in each site. A value of `1` indicates that the site is reachable.
++
+. Update the Accelerator EndpointGroup to contain both Endpoints. See the <@links.ha id="operate-site-online" /> {section} for details.
+
+== Further reading
+
+* <@links.ha id="operate-site-online" />
+* <@links.ha id="operate-site-offline" />
+
+
diff --git a/docs/guides/high-availability/deploy-aws-accelerator-loadbalancer.adoc b/docs/guides/high-availability/deploy-aws-accelerator-loadbalancer.adoc
new file mode 100644
index 0000000000..0783431479
--- /dev/null
+++ b/docs/guides/high-availability/deploy-aws-accelerator-loadbalancer.adoc
@@ -0,0 +1,304 @@
+<#import "/templates/guide.adoc" as tmpl>
+<#import "/templates/links.adoc" as links>
+
+<@tmpl.guide
+title="Deploy an AWS Global Accelerator loadbalancer"
+summary="Building block for a loadbalancer"
+tileVisible="false" >
+
+This topic describes the procedure required to deploy an AWS Global Accelerator to route traffic between multi-site {project_name} deployments.
+
+This deployment is intended to be used with the setup described in the <@links.ha id="concepts-multi-site"/> {section}.
+Use this deployment with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}.
+
+include::partials/blueprint-disclaimer.adoc[]
+
+== Audience
+
+This {section} describes how to deploy an AWS Global Accelerator instance to handle {project_name} client connection failover for multiple
+availability-zone {project_name} deployments.
+
+== Architecture
+
+To ensure user requests are routed to each {project_name} site we need to utilise a loadbalancer. To prevent issues with
+DNS caching on the client-side, the implementation should use a static IP address that remains the same
+when routing clients to both availability-zones.
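+
+For example, once the accelerator has been created, resolving its DNS name always returns the same two static anycast IP addresses, regardless of which site is currently healthy.
+As an illustration, the lookup below uses the example DNS name returned when the accelerator is created later in this {section} and assumes the `dig` utility is available on the client; substitute the DNS name of your own accelerator.
+
+.Command:
+[source,bash]
+----
+dig +short a099f799900e5b10d.awsglobalaccelerator.com
+----
+
+.Output:
+[source,bash]
+----
+75.2.42.125
+99.83.132.135
+----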
+ +In this {section} we describe how to route all {project_name} client requests via an AWS Global Accelerator loadbalancer. +In the event of a {project_name} site failing, the Accelerator ensures that all client requests are routed to the remaining +healthy site. If both sites are marked as unhealthy, then the Accelerator will "`fail-open`" and forward requests to a site +chosen at random. + +.AWS Global Accelerator Failover +image::high-availability/accelerator-multi-az.dio.svg[] + +An AWS Network Load Balancer (NLB) is created on both ROSA clusters in order to make the Keycloak +pods available as Endpoints to an AWS Global Accelerator instance. Each cluster endpoint is assigned a weight of +128 (half of the maximum weight 255) to ensure that accelerator traffic is routed equally to both availability-zones +when both clusters are healthy. + +== Prerequisites + +* ROSA based Multi-AZ {project_name} deployment + +== Procedure +. Create Network Load Balancers ++ +Perform the following on each of the {project_name} clusters: ++ +.. Login to the ROSA cluster ++ +.. Create a Kubernetes loadbalancer service ++ +.Command: +[source,bash] +---- +<#noparse> +cat < + apiVersion: v1 + kind: Service + metadata: + name: accelerator-loadbalancer + annotations: + service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags: accelerator=${ACCELERATOR_NAME},site=${CLUSTER_NAME},namespace=${NAMESPACE} # <2> + service.beta.kubernetes.io/aws-load-balancer-type: "nlb" + service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/lb-check" + service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "https" + service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10" # <3> + service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "3" # <4> + service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3" # <5> + spec: + ports: + - name: https + port: 443 + protocol: TCP + targetPort: 8443 + selector: + app: keycloak + app.kubernetes.io/instance: keycloak + app.kubernetes.io/managed-by: keycloak-operator + sessionAffinity: None + type: LoadBalancer +EOF + +---- +<1> `$NAMESPACE` should be replaced with the namespace of your {project_name} deployment +<2> Add additional Tags to the resources created by AWS so that we can retrieve them later. `ACCELERATOR_NAME` should be +the name of the Global Accelerator created in subsequent steps and `CLUSTER_NAME` should be the name of the current site. +<3> How frequently the healthcheck probe is executed in seconds +<4> How many healthchecks must pass for the NLB to be considered healthy +<5> How many healthchecks must fail for the NLB to be considered unhealthy ++ +.. Take note of the DNS hostname as this will be required later: ++ +.Command: +[source,bash] +---- +kubectl -n $NAMESPACE get svc accelerator-loadbalancer --template="{{range .status.loadBalancer.ingress}}{{.hostname}}{{end}}" +---- ++ +.Output: +[source,bash] +---- +abab80a363ce8479ea9c4349d116bce2-6b65e8b4272fa4b5.elb.eu-west-1.amazonaws.com +---- ++ +. 
Create a Global Accelerator instance ++ +.Command: +[source,bash] +---- +aws globalaccelerator create-accelerator \ + --name example-accelerator \ #<1> + --ip-address-type DUAL_STACK \ #<2> + --region us-west-2 #<3> +---- +<1> The name of the accelerator to be created, update as required +<2> Can be 'DUAL_STACK' or 'IPV4' +<3> All `globalaccelerator` commands must use the region 'us-west-2' ++ +.Output: +[source,json] +---- +{ + "Accelerator": { + "AcceleratorArn": "arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71", #<1> + "Name": "example-accelerator", + "IpAddressType": "DUAL_STACK", + "Enabled": true, + "IpSets": [ + { + "IpFamily": "IPv4", + "IpAddresses": [ + "75.2.42.125", + "99.83.132.135" + ], + "IpAddressFamily": "IPv4" + }, + { + "IpFamily": "IPv6", + "IpAddresses": [ + "2600:9000:a400:4092:88f3:82e2:e5b2:e686", + "2600:9000:a516:b4ef:157e:4cbd:7b48:20f1" + ], + "IpAddressFamily": "IPv6" + } + ], + "DnsName": "a099f799900e5b10d.awsglobalaccelerator.com", #<2> + "Status": "IN_PROGRESS", + "CreatedTime": "2023-11-13T15:46:40+00:00", + "LastModifiedTime": "2023-11-13T15:46:42+00:00", + "DualStackDnsName": "ac86191ca5121e885.dualstack.awsglobalaccelerator.com" #<3> + } +} + +---- +<1> The ARN associated with the created Accelerator instance, this will be used in subsequent commands +<2> The DNS name which IPv4 {project_name} clients should connect to +<3> The DNS name which IPv6 {project_name} clients should connect to ++ +. Create a Listener for the accelerator ++ +.Command: +[source,bash] +---- +aws globalaccelerator create-listener \ + --accelerator-arn 'arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71' \ + --port-ranges '[{"FromPort":443,"ToPort":443}]' \ + --protocol TCP \ + --region us-west-2 +---- ++ +.Output: +[source,json] +---- +{ + "Listener": { + "ListenerArn": "arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71/listener/1f396d40", + "PortRanges": [ + { + "FromPort": 443, + "ToPort": 443 + } + ], + "Protocol": "TCP", + "ClientAffinity": "NONE" + } +} +---- ++ +. 
Create an Endpoint Group for the Listener ++ +.Command: +[source,bash] +---- +<#noparse> +CLUSTER_1_ENDPOINT_ARN=$(aws elbv2 describe-load-balancers \ + --query "LoadBalancers[?DNSName=='abab80a363ce8479ea9c4349d116bce2-6b65e8b4272fa4b5.elb.eu-west-1.amazonaws.com'].LoadBalancerArn" \ #<1> + --region eu-west-1 \ #<2> + --output text +) +CLUSTER_2_ENDPOINT_ARN=$(aws elbv2 describe-load-balancers \ + --query "LoadBalancers[?DNSName=='a1c76566e3c334e4ab7b762d9f8dcbcf-985941f9c8d108d4.elb.eu-west-1.amazonaws.com'].LoadBalancerArn" \ #<1> + --region eu-west-1 \ #<2> + --output text +) +ENDPOINTS='[ + { + "EndpointId": "'${CLUSTER_1_ENDPOINT_ARN}'", + "Weight": 128, + "ClientIPPreservationEnabled": false + }, + { + "EndpointId": "'${CLUSTER_2_ENDPOINT_ARN}'", + "Weight": 128, + "ClientIPPreservationEnabled": false + } +]' +aws globalaccelerator create-endpoint-group \ + --listener-arn 'arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71/listener/1f396d40' \ #<2> + --traffic-dial-percentage 100 \ + --endpoint-configurations ${ENDPOINTS} \ + --endpoint-group-region eu-west-1 \ #<3> + --region us-west-2 + +---- +<1> The DNS hostname of the Cluster's NLB +<2> The ARN of the Listener created in the previous step +<3> This should be the AWS region that hosts the clusters ++ +.Output: +[source,json] +---- +<#noparse> +{ + "EndpointGroup": { + "EndpointGroupArn": "arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71/listener/1f396d40/endpoint-group/2581af0dc700", + "EndpointGroupRegion": "eu-west-1", + "EndpointDescriptions": [ + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/abab80a363ce8479ea9c4349d116bce2/6b65e8b4272fa4b5", + "Weight": 128, + "HealthState": "HEALTHY", + "ClientIPPreservationEnabled": false + }, + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a1c76566e3c334e4ab7b762d9f8dcbcf/985941f9c8d108d4", + "Weight": 128, + "HealthState": "HEALTHY", + "ClientIPPreservationEnabled": false + } + ], + "TrafficDialPercentage": 100.0, + "HealthCheckPort": 443, + "HealthCheckProtocol": "TCP", + "HealthCheckPath": "undefined", + "HealthCheckIntervalSeconds": 30, + "ThresholdCount": 3 + } +} + +---- +. Optional: Configure your custom domain ++ +If you are using a custom domain, pointed your custom domain to the AWS Global Loadbalancer by configuring an Alias or CNAME in your custom domain. ++ +. Create or update the {project_name} Deployment ++ +Perform the following on each of the {project_name} clusters: ++ +.. Login to the ROSA cluster ++ +.. Ensure the Keycloak CR has the following configuration ++ +[source,yaml] +---- +<#noparse> +apiVersion: k8s.keycloak.org/v2alpha1 +kind: Keycloak +metadata: + name: keycloak +spec: + hostname: + hostname: $HOSTNAME # <1> + ingress: + enabled: false # <2> + +---- +<1> The hostname clients use to connect to Keycloak +<2> Disable the default ingress as all {project_name} access should be via the provisioned NLB ++ +To ensure that request forwarding works as expected, it is necessary for the Keycloak CR to specify the hostname through +which clients will access the {project_name} instances. This can either be the `DualStackDnsName` or `DnsName` hostname associated +with the Global Accelerator. If you are using a custom domain and pointed your custom domain to the AWS Global Loadbalancer, use your custom domain here. 
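+
+If you need to look up these hostnames again later, one way to do so is to query the accelerator directly.
+The example below uses the example accelerator ARN and example output shown earlier in this {section}; substitute the values from your own environment.
+
+.Command:
+[source,bash]
+----
+aws globalaccelerator describe-accelerator \
+  --accelerator-arn 'arn:aws:globalaccelerator::606671647913:accelerator/e35a94dd-391f-4e3e-9a3d-d5ad22a78c71' \
+  --query 'Accelerator.{DnsName:DnsName,DualStackDnsName:DualStackDnsName}' \
+  --region us-west-2
+----
+
+.Output:
+[source,json]
+----
+{
+    "DnsName": "a099f799900e5b10d.awsglobalaccelerator.com",
+    "DualStackDnsName": "ac86191ca5121e885.dualstack.awsglobalaccelerator.com"
+}
+----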
+ +== Verify +To verify that the Global Accelerator is correctly configured to connect to the clusters, navigate to hostname configured above, and you should be presented with the {project_name} admin console. + + +== Further reading + +* <@links.ha id="operate-site-online" /> +* <@links.ha id="operate-site-offline" /> + + diff --git a/docs/guides/high-availability/deploy-aws-route53-failover-lambda.adoc b/docs/guides/high-availability/deploy-aws-route53-failover-lambda.adoc deleted file mode 100644 index 5bb091149b..0000000000 --- a/docs/guides/high-availability/deploy-aws-route53-failover-lambda.adoc +++ /dev/null @@ -1,252 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Deploy an AWS Route 53 Failover Lambda" -summary="Building block for loadbalancer resilience" -tileVisible="false" > - -After a Primary cluster has failed over to a Backup cluster due to a health check failure, the Primary must only serve requests -again after the SRE team has synchronized the two sites first as outlined in the <@links.ha id="operate-switch-back" /> {section}. - -If the Primary site would be marked as healthy by the Route 53 Health Check before the sites are synchronized, the Primary Site would start serving requests with outdated session and realm data. - -This {section} shows how an automatic fallback to a not-yet synchronized Primary site can be prevented with the help of AWS CloudWatch, SNS, and Lambda. - -== Architecture - -In the event of a Primary cluster failure, an https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html[AWS CloudWatch] -alarm sends a message to an https://aws.amazon.com/sns[AWS SNS] topic, which then triggers an https://aws.amazon.com/lambda/[AWS Lambda] function. -The Lambda function updates the Route53 health check of the Primary cluster so that it points to a non-existent path -`/lb-check-failed-over`, thus ensuring that it is impossible for the Primary to be marked as healthy until the path is -manually changed back to `/lb-check`. - -== Prerequisites - -* Deployment of {project_name} as described in the <@links.ha id="deploy-keycloak-kubernetes" /> {section} on a ROSA cluster running OpenShift 4.14 or later in two AWS availability zones in one AWS region. -* A Route53 configuration as described in the <@links.ha id="deploy-aws-route53-loadbalancer" /> {section}. - -== Procedure - -. Create an SNS topic to trigger a Lambda. -+ -.Command: -[source,bash] ----- -<#noparse> -PRIMARY_HEALTH_ID=233e180f-f023-45a3-954e-415303f21eab #<1> -ALARM_NAME=${PRIMARY_HEALTH_ID} -TOPIC_NAME=${PRIMARY_HEALTH_ID} -FUNCTION_NAME=${PRIMARY_HEALTH_ID} -TOPIC_ARN=$(aws sns create-topic --name ${TOPIC_NAME} \ - --query "TopicArn" \ - --tags "Key=HealthCheckId,Value=${PRIMARY_HEALTH_ID}" \ - --region us-east-1 \ - --output text -) - ----- -<1> Replace this with the ID of the xref:create-health-checks[Health Check] associated with your Primary cluster -+ -. Create a CloudWatch alarm to a send message to the SNS topic. 
-+ -.Command: -[source,bash] ----- -<#noparse> -aws cloudwatch put-metric-alarm \ - --alarm-actions ${TOPIC_ARN} \ - --actions-enabled \ - --alarm-name ${ALARM_NAME} \ - --dimensions "Name=HealthCheckId,Value=${PRIMARY_HEALTH_ID}" \ - --comparison-operator LessThanThreshold \ - --evaluation-periods 1 \ - --metric-name HealthCheckStatus \ - --namespace AWS/Route53 \ - --period 60 \ - --statistic Minimum \ - --threshold 1.0 \ - --treat-missing-data notBreaching \ - --region us-east-1 - ----- -+ -. Create the Role used to execute the Lambda. -+ -.Command: -[source,bash] ----- -<#noparse> -ROLE_ARN=$(aws iam create-role \ - --role-name ${FUNCTION_NAME} \ - --assume-role-policy-document \ - '{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Principal": { - "Service": "lambda.amazonaws.com" - }, - "Action": "sts:AssumeRole" - } - ] - }' \ - --query 'Role.Arn' \ - --region us-east-1 \ - --output text -) - ----- -+ -. Create a policy with the permissions required by the Lambda. -+ -.Command: -[source,bash] ----- -<#noparse> -POLICY_ARN=$(aws iam create-policy \ - --policy-name ${FUNCTION_NAME} \ - --policy-document \ - '{ - "Version": "2012-10-17", - "Statement": [ - { - "Effect": "Allow", - "Action": [ - "route53:UpdateHealthCheck" - ], - "Resource": "*" - } - ] - }' \ - --query 'Policy.Arn' \ - --region us-east-1 \ - --output text -) - ----- -+ -. Attach the custom policy to the Lambda role. -+ -.Command: -[source,bash] ----- -<#noparse> -aws iam attach-role-policy \ - --role-name ${FUNCTION_NAME} \ - --policy-arn ${POLICY_ARN} \ - --region us-east-1 - ----- -+ -. Attach the `AWSLambdaBasicExecutionRole` policy so that the Lambda logs can be written to CloudWatch -+ -.Command: -[source,bash] ----- -<#noparse> -aws iam attach-role-policy \ - --role-name ${FUNCTION_NAME} \ - --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \ - --region us-east-1 - ----- -+ -. Create a Lambda ZIP file. -+ -.Command: -[source,bash] ----- -<#noparse> -LAMBDA_ZIP=/tmp/lambda.zip -cat << EOF > /tmp/lambda.py -import boto3 -import json - - -def handler(event, context): - print(json.dumps(event, indent=4)) - - msg = json.loads(event['Records'][0]['Sns']['Message']) - healthCheckId = msg['Trigger']['Dimensions'][0]['value'] - - r53Client = boto3.client("route53") - response = r53Client.update_health_check( - HealthCheckId=healthCheckId, - ResourcePath="/lb-check-failed-over" - ) - - print(json.dumps(response, indent=4, default=str)) - statusCode = response['ResponseMetadata']['HTTPStatusCode'] - if statusCode != 200: - raise Exception("Route 53 Unexpected status code %d" + statusCode) - -EOF -zip -FS --junk-paths ${LAMBDA_ZIP} /tmp/lambda.py - ----- -+ -. Create the Lambda function. -+ -.Command: -[source,bash] ----- -<#noparse> -FUNCTION_ARN=$(aws lambda create-function \ - --function-name ${FUNCTION_NAME} \ - --zip-file fileb://${LAMBDA_ZIP} \ - --handler lambda.handler \ - --runtime python3.11 \ - --role ${ROLE_ARN} \ - --query 'FunctionArn' \ - --region eu-west-1 \#<1> - --output text -) - ----- -<1> Replace with the AWS region hosting your ROSA cluster - -. Allow the SNS to trigger the Lambda. -+ -.Command: -[source,bash] ----- -<#noparse> -aws lambda add-permission \ - --function-name ${FUNCTION_NAME} \ - --statement-id function-with-sns \ - --action 'lambda:InvokeFunction' \ - --principal 'sns.amazonaws.com' \ - --source-arn ${TOPIC_ARN} \ - --region eu-west-1 #<1> - ----- -<1> Replace with the AWS region hosting your ROSA cluster - -. 
Invoke the Lambda when the SNS message is received. -+ -.Command: -[source,bash] ----- -<#noparse> -aws sns subscribe --protocol lambda \ - --topic-arn ${TOPIC_ARN} \ - --notification-endpoint ${FUNCTION_ARN} \ - --region us-east-1 - ----- - -== Verify - -To test the Lambda is triggered as expected, log in to the Primary cluster and scale the {project_name} deployment to zero Pods. -Scaling will cause the Primary's health checks to fail and the following should occur: - -* Route53 should start routing traffic to the {project_name} Pods on the Backup cluster. -* The Route53 health check for the Primary cluster should have `ResourcePath=/lb-check-failed-over` - -To direct traffic back to the Primary site, scale up the {project_name} deployment and manually revert the changes to the Route53 health check the Lambda has performed. - -For more information, see the <@links.ha id="operate-switch-back" /> {section}. - - \ No newline at end of file diff --git a/docs/guides/high-availability/deploy-aws-route53-loadbalancer.adoc b/docs/guides/high-availability/deploy-aws-route53-loadbalancer.adoc deleted file mode 100644 index ebb927aebb..0000000000 --- a/docs/guides/high-availability/deploy-aws-route53-loadbalancer.adoc +++ /dev/null @@ -1,281 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Deploy an AWS Route 53 loadbalancer" -summary="Building block for a loadbalancer" -tileVisible="false" > - -This topic describes the procedure required to configure DNS based failover for Multi-AZ {project_name} clusters using AWS Route53 for an active/passive setup. These instructions are intended to be used with the setup described in the <@links.ha id="concepts-active-passive-sync"/> {section}. -Use it together with the other building blocks outlined in the <@links.ha id="bblocks-active-passive-sync"/> {section}. - -include::partials/blueprint-disclaimer.adoc[] - -== Architecture - -All {project_name} client requests are routed by a DNS name managed by Route53 records. -Route53 is responsible to ensure that all client requests are routed to the Primary cluster when it is available and healthy, or to the backup cluster in the event of the primary availability-zone or {project_name} deployment failing. - -If the primary site fails, the DNS changes will need to propagate to the clients. -Depending on the client's settings, the propagation may take some minutes based on the client's configuration. -When using mobile connections, some internet providers might not respect the TTL of the DNS entries, which can lead to an extended time before the clients can connect to the new site. - -.AWS Route53 Failover -image::high-availability/route53-multi-az-failover.svg[] - -Two Openshift Routes are exposed on both the Primary and Backup ROSA cluster. -The first Route uses the Route53 DNS name to service client requests, whereas the second Route is used by Route53 to monitor the health of the {project_name} cluster. - -== Prerequisites - -* Deployment of {project_name} as described in <@links.ha id="deploy-keycloak-kubernetes" /> on a ROSA cluster running OpenShift 4.14 or later in two AWS availability zones in AWS one region. -* An owned domain for client requests to be routed through. - -== Procedure - -. [[create-hosted-zone]]Create a https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/CreatingHostedZone.html[Route53 Hosted Zone] using the root domain name through which you want all {project_name} clients to connect. 
-+ -Take note of the "Hosted zone ID", because this ID is required in later steps. - -. Retrieve the "Hosted zone ID" and DNS name associated with each ROSA cluster. -+ -For both the Primary and Backup cluster, perform the following steps: -+ -.. Log in to the ROSA cluster. -+ -.. Retrieve the cluster LoadBalancer Hosted Zone ID and DNS hostname -+ -.Command: -[source,bash] ----- -<#noparse> -HOSTNAME=$(oc -n openshift-ingress get svc router-default \ --o jsonpath='{.status.loadBalancer.ingress[].hostname}' -) -aws elbv2 describe-load-balancers \ ---query "LoadBalancers[?DNSName=='${HOSTNAME}'].{CanonicalHostedZoneId:CanonicalHostedZoneId,DNSName:DNSName}" \ ---region eu-west-1 \#<1> ---output json - ----- -<1> The AWS region hosting your ROSA cluster -+ -.Output: -[source,json] ----- -[ - { - "CanonicalHostedZoneId": "Z2IFOLAFXWLO4F", - "DNSName": "ad62c8d2fcffa4d54aec7ffff902c925-61f5d3e1cbdc5d42.elb.eu-west-1.amazonaws.com" - } -] ----- -+ -NOTE: ROSA clusters running OpenShift 4.13 and earlier use classic load balancers instead of application load balancers. Use the `aws elb describe-load-balancers` command and an updated query string instead. - -. [[create-health-checks]]Create Route53 health checks -+ -.Command: -[source,bash] ----- -<#noparse> -function createHealthCheck() { - # Creating a hash of the caller reference to allow for names longer than 64 characters - REF=($(echo $1 | sha1sum )) - aws route53 create-health-check \ - --caller-reference "$REF" \ - --query "HealthCheck.Id" \ - --no-cli-pager \ - --output text \ - --region us-east-1 \ - --health-check-config ' - { - "Type": "HTTPS", - "ResourcePath": "/lb-check", - "FullyQualifiedDomainName": "'$1'", - "Port": 443, - "RequestInterval": 30, - "FailureThreshold": 1, - "EnableSNI": true - } - ' -} -CLIENT_DOMAIN="client.keycloak-benchmark.com" #<1> -PRIMARY_DOMAIN="primary.${CLIENT_DOMAIN}" #<2> -BACKUP_DOMAIN="backup.${CLIENT_DOMAIN}" #<3> -createHealthCheck ${PRIMARY_DOMAIN} -createHealthCheck ${BACKUP_DOMAIN} - ----- -<1> The domain which {project_name} clients should connect to. -This should be the same, or a subdomain, of the root domain used to create the xref:create-hosted-zone[Hosted Zone]. -<2> The subdomain that will be used for health probes on the Primary cluster -<3> The subdomain that will be used for health probes on the Backup cluster -+ -.Output: -[source,bash] ----- -233e180f-f023-45a3-954e-415303f21eab #<1> -799e2cbb-43ae-4848-9b72-0d9173f04912 #<2> ----- -<1> The ID of the Primary Health check -<2> The ID of the Backup Health check -+ -. 
Create the Route53 record set -+ -.Command: -[source,bash] ----- -<#noparse> -HOSTED_ZONE_ID="Z09084361B6LKQQRCVBEY" #<1> -PRIMARY_LB_HOSTED_ZONE_ID="Z2IFOLAFXWLO4F" -PRIMARY_LB_DNS=ad62c8d2fcffa4d54aec7ffff902c925-61f5d3e1cbdc5d42.elb.eu-west-1.amazonaws.com -PRIMARY_HEALTH_ID=233e180f-f023-45a3-954e-415303f21eab -BACKUP_LB_HOSTED_ZONE_ID="Z2IFOLAFXWLO4F" -BACKUP_LB_DNS=a184a0e02a5d44a9194e517c12c2b0ec-1203036292.elb.eu-west-1.amazonaws.com -BACKUP_HEALTH_ID=799e2cbb-43ae-4848-9b72-0d9173f04912 -aws route53 change-resource-record-sets \ - --hosted-zone-id Z09084361B6LKQQRCVBEY \ - --query "ChangeInfo.Id" \ - --region us-east-1 \ - --output text \ - --change-batch ' - { - "Comment": "Creating Record Set for '${CLIENT_DOMAIN}'", - "Changes": [{ - "Action": "CREATE", - "ResourceRecordSet": { - "Name": "'${PRIMARY_DOMAIN}'", - "Type": "A", - "AliasTarget": { - "HostedZoneId": "'${PRIMARY_LB_HOSTED_ZONE_ID}'", - "DNSName": "'${PRIMARY_LB_DNS}'", - "EvaluateTargetHealth": true - } - } - }, { - "Action": "CREATE", - "ResourceRecordSet": { - "Name": "'${BACKUP_DOMAIN}'", - "Type": "A", - "AliasTarget": { - "HostedZoneId": "'${BACKUP_LB_HOSTED_ZONE_ID}'", - "DNSName": "'${BACKUP_LB_DNS}'", - "EvaluateTargetHealth": true - } - } - }, { - "Action": "CREATE", - "ResourceRecordSet": { - "Name": "'${CLIENT_DOMAIN}'", - "Type": "A", - "SetIdentifier": "client-failover-primary-'${SUBDOMAIN}'", - "Failover": "PRIMARY", - "HealthCheckId": "'${PRIMARY_HEALTH_ID}'", - "AliasTarget": { - "HostedZoneId": "'${HOSTED_ZONE_ID}'", - "DNSName": "'${PRIMARY_DOMAIN}'", - "EvaluateTargetHealth": true - } - } - }, { - "Action": "CREATE", - "ResourceRecordSet": { - "Name": "'${CLIENT_DOMAIN}'", - "Type": "A", - "SetIdentifier": "client-failover-backup-'${SUBDOMAIN}'", - "Failover": "SECONDARY", - "HealthCheckId": "'${BACKUP_HEALTH_ID}'", - "AliasTarget": { - "HostedZoneId": "'${HOSTED_ZONE_ID}'", - "DNSName": "'${BACKUP_DOMAIN}'", - "EvaluateTargetHealth": true - } - } - }] - } - ' - ----- -<1> The ID of the xref:create-hosted-zone[Hosted Zone] created earlier -+ -.Output: -[source] ----- -/change/C053410633T95FR9WN3YI ----- -+ -. Wait for the Route53 records to be updated -+ -.Command: -[source,bash] ----- -aws route53 wait resource-record-sets-changed --id /change/C053410633T95FR9WN3YI --region us-east-1 ----- -+ -. Update or create the {project_name} deployment -+ -For both the Primary and Backup cluster, perform the following steps: -+ -.. Log in to the ROSA cluster -+ -.. Ensure the `Keycloak` CR has the following configuration -+ -[source,yaml] ----- -<#noparse> -apiVersion: k8s.keycloak.org/v2alpha1 -kind: Keycloak -metadata: - name: keycloak -spec: - hostname: - hostname: ${CLIENT_DOMAIN} # <1> - ----- -<1> The domain clients used to connect to {project_name} -+ -To ensure that request forwarding works, edit the {project_name} CR to specify the hostname through which clients will access the {project_name} instances. -This hostname must be the `$CLIENT_DOMAIN` used in the Route53 configuration. -+ -.. 
Create health check Route -+ -.Command: -[source,bash] ----- -cat < -apiVersion: route.openshift.io/v1 -kind: Route -metadata: - name: aws-health-route -spec: - host: $DOMAIN #<2> - port: - targetPort: https - tls: - insecureEdgeTerminationPolicy: Redirect - termination: passthrough - to: - kind: Service - name: keycloak-service - weight: 100 - wildcardPolicy: None - -EOF ----- -<1> `$NAMESPACE` should be replaced with the namespace of your {project_name} deployment -<2> `$DOMAIN` should be replaced with either the `PRIMARY_DOMAIN` or `BACKUP_DOMAIN`, if the current cluster is the Primary of Backup cluster, respectively. - -== Verify - -Navigate to the chosen CLIENT_DOMAIN in your local browser and log in to the {project_name} console. - -To test failover works as expected, log in to the Primary cluster and scale the {project_name} deployment to zero Pods. -Scaling will cause the Primary's health checks to fail and Route53 should start routing traffic to the {project_name} Pods on the Backup cluster. - -== Optional: Failover Lambda - -To prevent a failed Primary cluster from becoming active without SRE input, follow the steps outlined in the -guide <@links.ha id="deploy-aws-route53-failover-lambda" /> - - \ No newline at end of file diff --git a/docs/guides/high-availability/deploy-infinispan-kubernetes-crossdc.adoc b/docs/guides/high-availability/deploy-infinispan-kubernetes-crossdc.adoc index ab6506551e..34a448959b 100644 --- a/docs/guides/high-availability/deploy-infinispan-kubernetes-crossdc.adoc +++ b/docs/guides/high-availability/deploy-infinispan-kubernetes-crossdc.adoc @@ -13,7 +13,7 @@ For simplicity, this topic uses the minimum configuration possible that allows { This {section} assumes two {ocp} clusters named `{site-a}` and `{site-b}`. -This is a building block following the concepts described in the <@links.ha id="concepts-active-passive-sync" /> {section}. +This is a building block following the concepts described in the <@links.ha id="concepts-multi-site" /> {section}. See the <@links.ha id="introduction" /> {section} for an overview. @@ -111,7 +111,7 @@ For more information, see the {infinispan-operator-docs}#securing-cross-site-con + Upload the Keystore and the Truststore in an {ocp} Secret. The secret contains the file content, the password to access it, and the type of the store. -Instructions for creating the certificates and the stores are beyond the scope of this guide. +Instructions for creating the certificates and the stores are beyond the scope of this {section}. + To upload the Keystore as a Secret, use the following command: + @@ -172,7 +172,7 @@ include::examples/generated/ispn-site-a.yaml[tag=infinispan-crossdc] -- + When using persistent sessions, limit the cache size limit for `sessions`, `offlineSessions`, `clientSessions`, and `offlineClientSessions` by extending the configuration as follows: - ++ [source,yaml] ---- distributedCache: @@ -195,24 +195,39 @@ include::examples/generated/ispn-site-b.yaml[tag=infinispan-crossdc] + {project_name} requires the following caches to be present: `sessions`, `actionTokens`, `authenticationSessions`, `offlineSessions`, `clientSessions`, `offlineClientSessions`, `loginFailures`, and `work`. + +Due to the use of persistent-user-sessions, the configuration of the caches `sessions`, `offlineSessions`, `clientSessions` and `offlineClientSessions` is different to the `authenticationSessions`, `actionTokens`, +`loginFailures` and `work` caches. 
In this section we detail how to configure both types of cache for multi-site deployments. ++ The {jdgserver_name} {infinispan-operator-docs}#creating-caches[Cache CR] allows deploying the caches in the {jdgserver_name} cluster. Cross-site needs to be enabled per cache as documented by {infinispan-xsite-docs}[Cross Site Documentation]. The documentation contains more details about the options used by this {section}. The following example shows the `Cache` CR for `{site-a}`. + -- -.sessions in `{site-a}` +.`sessions`, `offlineSessions`, `clientSessions` and `offlineClientSessions` caches in `{site-a}` [source,yaml] ---- include::examples/generated/ispn-site-a.yaml[tag=infinispan-cache-sessions] ---- +<1> Number of owners is only 1 as sessions are stored in the DB and so there is no need to hold redundant copies in memory. +<2> The maximum number of entries which will be cached in memory. +<3> The remote site name. +<4> The cross-site communication, in this case, `SYNC`. +-- ++ +-- +.Remaining caches in `{site-a}` +[source,yaml] +---- +include::examples/generated/ispn-site-a.yaml[tag=infinispan-cache-actionTokens] +---- <1> The cross-site merge policy, invoked when there is a write-write conflict. -Set this for the caches `sessions`, `authenticationSessions`, `offlineSessions`, `clientSessions` and `offlineClientSessions`, and do not set it for all other caches. +Set this for the caches `sessions`, `offlineSessions`, `clientSessions` and `offlineClientSessions`, and do not set it for all other caches. <2> The remote site name. <3> The cross-site communication, in this case, `SYNC`. -- + -For `{site-b}`, the `Cache` CR is similar except in point 2. +For `{site-b}`, the `Cache` CR is similar for both `*sessions` and all other caches, except for the `backups.` outlined in point 2 of the above diagram. + .session in `{site-b}` [source,yaml] diff --git a/docs/guides/high-availability/deploy-keycloak-kubernetes.adoc b/docs/guides/high-availability/deploy-keycloak-kubernetes.adoc index 8d9488b36c..18c9dc8e73 100644 --- a/docs/guides/high-availability/deploy-keycloak-kubernetes.adoc +++ b/docs/guides/high-availability/deploy-keycloak-kubernetes.adoc @@ -8,8 +8,8 @@ tileVisible="false" > This guide describes advanced {project_name} configurations for Kubernetes which are load tested and will recover from single Pod failures. -These instructions are intended for use with the setup described in the <@links.ha id="concepts-active-passive-sync"/> {section}. -Use it together with the other building blocks outlined in the <@links.ha id="bblocks-active-passive-sync"/> {section}. +These instructions are intended for use with the setup described in the <@links.ha id="concepts-multi-site"/> {section}. +Use it together with the other building blocks outlined in the <@links.ha id="bblocks-multi-site"/> {section}. 
== Prerequisites diff --git a/docs/guides/high-availability/examples/generated/fencing_lambda.py b/docs/guides/high-availability/examples/generated/fencing_lambda.py new file mode 100644 index 0000000000..1544bf6a09 --- /dev/null +++ b/docs/guides/high-availability/examples/generated/fencing_lambda.py @@ -0,0 +1,120 @@ +# tag::fencing-start[] +import boto3 +import jmespath +import json + +from base64 import b64decode +from urllib.parse import unquote + + +def handle_site_offline(labels): + a_client = boto3.client('globalaccelerator', region_name='us-west-2') + + acceleratorDNS = labels['accelerator'] + accelerator = jmespath.search(f"Accelerators[?DnsName=='{acceleratorDNS}']", a_client.list_accelerators()) + if not accelerator: + print(f"Ignoring SiteOffline alert as accelerator with DnsName '{acceleratorDNS}' not found") + return + + accelerator_arn = accelerator[0]['AcceleratorArn'] + listener_arn = a_client.list_listeners(AcceleratorArn=accelerator_arn)['Listeners'][0]['ListenerArn'] + + endpoint_group = a_client.list_endpoint_groups(ListenerArn=listener_arn)['EndpointGroups'][0] + endpoints = endpoint_group['EndpointDescriptions'] + + # Only update accelerator endpoints if two entries exist + if len(endpoints) > 1: + # If the reporter endpoint is not healthy then do nothing for now + # A Lambda will eventually be triggered by the other offline site for this reporter + reporter = labels['reporter'] + reporter_endpoint = [e for e in endpoints if endpoint_belongs_to_site(e, reporter)][0] + if reporter_endpoint['HealthState'] == 'UNHEALTHY': + print(f"Ignoring SiteOffline alert as reporter '{reporter}' endpoint is marked UNHEALTHY") + return + + offline_site = labels['site'] + endpoints = [e for e in endpoints if not endpoint_belongs_to_site(e, offline_site)] + del reporter_endpoint['HealthState'] + a_client.update_endpoint_group( + EndpointGroupArn=endpoint_group['EndpointGroupArn'], + EndpointConfigurations=endpoints + ) + print(f"Removed site={offline_site} from Accelerator EndpointGroup") + else: + print("Ignoring SiteOffline alert only one Endpoint defined in the EndpointGroup") + + +def endpoint_belongs_to_site(endpoint, site): + lb_arn = endpoint['EndpointId'] + region = lb_arn.split(':')[3] + client = boto3.client('elbv2', region_name=region) + tags = client.describe_tags(ResourceArns=[lb_arn])['TagDescriptions'][0]['Tags'] + for tag in tags: + if tag['Key'] == 'site': + return tag['Value'] == site + return false + + +def get_secret(secret_name, region_name): + session = boto3.session.Session() + client = session.client( + service_name='secretsmanager', + region_name=region_name + ) + return client.get_secret_value(SecretId=secret_name)['SecretString'] + + +def decode_basic_auth_header(encoded_str): + split = encoded_str.strip().split(' ') + if len(split) == 2: + if split[0].strip().lower() == 'basic': + try: + username, password = b64decode(split[1]).decode().split(':', 1) + except: + raise DecodeError + else: + raise DecodeError + else: + raise DecodeError + + return unquote(username), unquote(password) + + +def handler(event, context): + print(json.dumps(event)) + + authorization = event['headers'].get('authorization') + if authorization is None: + print("'Authorization' header missing from request") + return { + "statusCode": 401 + } + +# end::fencing-start[] + expected_user = 'keycloak' + secret_name = 'keycloak-master-password' + secret_region = 'eu-central-1' +# tag::fencing-end[] + expectedPass = get_secret(secret_name, secret_region) + username, password = 
decode_basic_auth_header(authorization) + if username != expected_user and password != expectedPass: + print('Invalid username/password combination') + return { + "statusCode": 403 + } + + body = event.get('body') + if body is None: + raise Exception('Empty request body') + + body = json.loads(body) + print(json.dumps(body)) + for alert in body['alerts']: + labels = alert['labels'] + if labels['alertname'] == 'SiteOffline': + handle_site_offline(labels) + + return { + "statusCode": 204 + } +# end::fencing-end[] diff --git a/docs/guides/high-availability/examples/generated/ispn-site-a.yaml b/docs/guides/high-availability/examples/generated/ispn-site-a.yaml index daa88e8912..a46130de6a 100644 --- a/docs/guides/high-availability/examples/generated/ispn-site-a.yaml +++ b/docs/guides/high-availability/examples/generated/ispn-site-a.yaml @@ -1,4 +1,16 @@ --- +# Source: ispn-helm/templates/infinispan-alerts.yaml +# tag::fencing-secret[] +apiVersion: v1 +kind: Secret +type: kubernetes.io/basic-auth +metadata: + name: webhook-credentials +stringData: + username: 'keycloak' # <1> + password: 'changme' # <2> +# end::fencing-secret[] +--- # Source: ispn-helm/templates/infinispan.yaml # There are several callouts in this YAML marked with `# <1>' etc. See 'running/infinispan-deployment.adoc` for the details.# tag::infinispan-credentials[] apiVersion: v1 @@ -120,9 +132,38 @@ data: clearcache offlineSessions clearcache sessions clearcache work - + # end::infinispan-crossdc-clear-caches[] --- +# Source: ispn-helm/templates/infinispan-alerts.yaml +# tag::fencing-alert-manager-config[] +apiVersion: monitoring.coreos.com/v1beta1 +kind: AlertmanagerConfig +metadata: + name: example-routing +spec: + route: + receiver: default + matchers: + - matchType: = + name: alertname + value: SiteOffline + receivers: + - name: default + webhookConfigs: + - url: 'https://tjqr2vgc664b6noj6vugprakoq0oausj.lambda-url.eu-west-1.on.aws/' # <3> + httpConfig: + basicAuth: + username: + key: username + name: webhook-credentials + password: + key: password + name: webhook-credentials + tlsConfig: + insecureSkipVerify: true +# end::fencing-alert-manager-config[] +--- # Source: ispn-helm/templates/infinispan.yaml # tag::infinispan-cache-actionTokens[] apiVersion: infinispan.org/v2alpha1 @@ -138,14 +179,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-b: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-actionTokens[] @@ -165,15 +208,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-b: # <2> + site-b: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-authenticationSessions[] @@ -193,15 +237,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-b: # <2> + site-b: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-clientSessions[] @@ -221,14 +266,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 
+ remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-b: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-loginFailures[] @@ -248,15 +295,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-b: # <2> + site-b: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-offlineClientSessions[] @@ -276,15 +324,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-b: # <2> + site-b: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-offlineSessions[] @@ -302,16 +351,19 @@ spec: template: |- distributedCache: mode: "SYNC" - owners: "2" + owners: "1" # <1> statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 + memory: + maxCount: 10000 # <2> backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-b: # <2> + site-b: # <3> backup: - strategy: "SYNC" # <3> + strategy: "SYNC" # <4> timeout: 13000 stateTransfer: chunkSize: 16 @@ -332,14 +384,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-b: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-work[] @@ -410,5 +464,23 @@ spec: namespace: keycloak # <12> url: openshift://api.site-b # <13> secretName: xsite-token-secret # <14> - + # end::infinispan-crossdc[] +--- +# Source: ispn-helm/templates/infinispan-alerts.yaml +# tag::fencing-prometheus-rule[] +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: xsite-status +spec: + groups: + - name: xsite-status + rules: + - alert: SiteOffline + expr: 'min by (namespace, site) (vendor_jgroups_site_view_status{namespace="default",site="site-b"}) == 0' # <4> + labels: + severity: critical + reporter: site-a # <5> + accelerator: a3da6a6cbd4e27b02.awsglobalaccelerator.com # <6> +# end::fencing-prometheus-rule[] diff --git a/docs/guides/high-availability/examples/generated/ispn-site-b.yaml b/docs/guides/high-availability/examples/generated/ispn-site-b.yaml index eb2b36fc0e..ddb8c96471 100644 --- a/docs/guides/high-availability/examples/generated/ispn-site-b.yaml +++ b/docs/guides/high-availability/examples/generated/ispn-site-b.yaml @@ -120,7 +120,7 @@ data: clearcache offlineSessions clearcache sessions clearcache work - + # end::infinispan-crossdc-clear-caches[] --- # Source: ispn-helm/templates/infinispan.yaml @@ -138,14 +138,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-a: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-actionTokens[] @@ -165,15 +167,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: 
chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-a: # <2> + site-a: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-authenticationSessions[] @@ -193,15 +196,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-a: # <2> + site-a: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-clientSessions[] @@ -221,14 +225,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-a: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-loginFailures[] @@ -248,15 +254,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-a: # <2> + site-a: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-offlineClientSessions[] @@ -276,15 +283,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-a: # <2> + site-a: # <1> backup: - strategy: "SYNC" # <3> - timeout: 13000 + strategy: "SYNC" # <2> + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-offlineSessions[] @@ -302,16 +310,19 @@ spec: template: |- distributedCache: mode: "SYNC" - owners: "2" + owners: "1" # <1> statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 + memory: + maxCount: 10000 # <2> backups: - mergePolicy: ALWAYS_REMOVE # <1> - site-a: # <2> + site-a: # <3> backup: - strategy: "SYNC" # <3> + strategy: "SYNC" # <4> timeout: 13000 stateTransfer: chunkSize: 16 @@ -332,14 +343,16 @@ spec: mode: "SYNC" owners: "2" statistics: "true" - remoteTimeout: 14000 + remoteTimeout: 5000 + locking: + acquireTimeout: 4000 stateTransfer: chunkSize: 16 backups: site-a: # <2> backup: strategy: "SYNC" # <3> - timeout: 13000 + timeout: 4500 stateTransfer: chunkSize: 16 # end::infinispan-cache-work[] @@ -410,5 +423,5 @@ spec: namespace: keycloak # <12> url: openshift://api.site-a # <13> secretName: xsite-token-secret # <14> - + # end::infinispan-crossdc[] diff --git a/docs/guides/high-availability/introduction.adoc b/docs/guides/high-availability/introduction.adoc index 6bf9300d1a..163a4d2940 100644 --- a/docs/guides/high-availability/introduction.adoc +++ b/docs/guides/high-availability/introduction.adoc @@ -19,8 +19,8 @@ Additional performance tuning and security hardening are still recommended when <@profile.ifCommunity> == Concept and building block overview -* <@links.ha id="concepts-active-passive-sync" /> -* <@links.ha id="bblocks-active-passive-sync" /> +* <@links.ha id="concepts-multi-site" /> +* <@links.ha id="bblocks-multi-site" /> * <@links.ha id="concepts-database-connections" /> * <@links.ha id="concepts-threads" /> * <@links.ha id="concepts-memory-and-cpu-sizing" /> @@ -32,15 +32,14 @@ Additional 
performance tuning and security hardening are still recommended when * <@links.ha id="deploy-keycloak-kubernetes" /> * <@links.ha id="deploy-infinispan-kubernetes-crossdc" /> * <@links.ha id="connect-keycloak-to-external-infinispan" /> -* <@links.ha id="deploy-aws-route53-loadbalancer" /> -* <@links.ha id="deploy-aws-route53-failover-lambda" /> +* <@links.ha id="deploy-aws-accelerator-loadbalancer" /> +* <@links.ha id="deploy-aws-accelerator-fencing-lambda" /> == Operational procedures -* <@links.ha id="operate-failover" /> -* <@links.ha id="operate-switch-over" /> -* <@links.ha id="operate-network-partition-recovery" /> -* <@links.ha id="operate-switch-back" /> +* <@links.ha id="operate-synchronize" /> +* <@links.ha id="operate-site-offline" /> +* <@links.ha id="operate-site-online" /> diff --git a/docs/guides/high-availability/operate-failover.adoc b/docs/guides/high-availability/operate-failover.adoc deleted file mode 100644 index cedebec91d..0000000000 --- a/docs/guides/high-availability/operate-failover.adoc +++ /dev/null @@ -1,36 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Fail over to the secondary site" -summary="This describes the automatic and operational procedures necessary" > - -This {section} describes the steps to fail over from primary site to secondary site in a setup as outlined in <@links.ha id="concepts-active-passive-sync" /> together with the blueprints outlined in <@links.ha id="bblocks-active-passive-sync" />. - -== When to use procedure - -A failover from the primary site to the secondary site will happen automatically based on the checks configured in the loadbalancer. - -When the primary site loses its state in {jdgserver_name} or a network partition occurs that prevents the synchronization, manual procedures are necessary to recover the primary site before it can handle traffic again, see the <@links.ha id="operate-switch-back" /> {section}. - -To prevent fallback to the primary site before those manual steps have been performed, follow the procedure outlined in this guide. - -For a graceful switch to the secondary site, follow the instructions in the <@links.ha id="operate-switch-over" /> {section}. - -See the <@links.ha id="introduction" /> {section} for different operational procedures. - -== Procedure - -Follow these steps to prevent an automatic failover back to the Primary site or to manually force a failover. - -=== Route53 - -To force Route53 to mark the primary site as permanently not available and prevent an automatic fallback, edit the health check in AWS to point to a non-existent route (`/lb-check-failed-over`). 
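For reference, a minimal sketch of that health check edit using the AWS CLI — `<primary-health-check-id>` is a placeholder for the ID returned when the Primary health check was created, and Route 53 health check calls are issued against `us-east-1` as in the earlier commands:

.Command:
[source,bash]
----
aws route53 update-health-check \
  --health-check-id <primary-health-check-id> \
  --resource-path /lb-check-failed-over \
  --region us-east-1
----

Reverting `--resource-path` to `/lb-check` restores the original behaviour once the primary site may receive traffic again.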
- -== Optional: Failover Lambda - -To prevent a failed Primary cluster from becoming active without SRE input, follow the steps outlined in the -guide <@links.ha id="deploy-aws-route53-failover-lambda" /> - - - diff --git a/docs/guides/high-availability/operate-network-partition-recovery.adoc b/docs/guides/high-availability/operate-network-partition-recovery.adoc deleted file mode 100644 index e6934a17b4..0000000000 --- a/docs/guides/high-availability/operate-network-partition-recovery.adoc +++ /dev/null @@ -1,79 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Recover from an out-of-sync passive site" -summary="This describes the automatic and operational procedures necessary" > - -This {section} describes the procedures required to synchronize the secondary site with the primary site in a setup as outlined in <@links.ha id="concepts-active-passive-sync" /> together with the blueprints outlined in <@links.ha id="bblocks-active-passive-sync" />. - -include::partials/infinispan/infinispan-attributes.adoc[] - -// used by the CLI commands to avoid duplicating the code. -:stale-site: secondary -:keep-site: primary -:keep-site-name: {site-a-cr} -:stale-site-name: {site-b-cr} - -== When to use procedure - -Use this after a temporary disconnection between sites where {jdgserver_name} was disconnected and the contents of the caches are out-of-sync. - -At the end of the procedure, the session contents on the secondary site have been discarded and replaced by the session contents of the primary site. -All caches in the secondary site have been cleared to prevent invalid cached contents. - -See the <@links.ha id="introduction" /> {section} for different operational procedures. - -== Procedures - -=== {jdgserver_name} Cluster - -For the context of this {section}, `{site-a}` is the primary site and is active, and `{site-b}` is the secondary site and is passive. - -Network partitions may happen between the site and the replication between the {jdgserver_name} cluster will stop. -These procedures bring both sites back in sync. - -WARNING: Transferring the full state may impact the {jdgserver_name} cluster performance by increasing the response time and/or resources usage. - -The first procedure is to delete the stale data from the secondary site. - -. Login into your secondary site. - -. Shutdown {project_name}. -This will clear all {project_name} caches, and it prevents the state of {project_name} from being out-of-sync with {jdgserver_name}. -+ -When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to 0. - -<#include "partials/infinispan/infinispan-cli-connect.adoc" /> - -include::partials/infinispan/infinispan-cli-clear-caches.adoc[] - -Now we are ready to transfer the state from the primary site to the secondary site. - -. Login into your primary site - -<#include "partials/infinispan/infinispan-cli-connect.adoc" /> - -include::partials/infinispan/infinispan-cli-state-transfer.adoc[] - -As now the state is available in the secondary site, {project_name} can be started again: - -. Login into your secondary site. - -. Startup {project_name}. -+ -When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to the original value. - -=== AWS Aurora Database - -No action required. - -=== Route53 - -No action required. 
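Before starting {project_name} again, it can be worth confirming from the primary site's CLI session that the caches replicate to the secondary site once more — a minimal sketch, assuming the secondary site is named `{site-b-cr}` as defined in the attributes included above:

.Command:
[source,bash,subs="+attributes"]
----
site status --all-caches --site={site-b-cr}
----

All caches should report the site status as `online`.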
- -== Further reading - -See <@links.ha id="concepts-infinispan-cli-batch" /> on how to automate Infinispan CLI commands. - - diff --git a/docs/guides/high-availability/operate-site-offline.adoc b/docs/guides/high-availability/operate-site-offline.adoc new file mode 100644 index 0000000000..335e9c8e55 --- /dev/null +++ b/docs/guides/high-availability/operate-site-offline.adoc @@ -0,0 +1,79 @@ +<#import "/templates/guide.adoc" as tmpl> +<#import "/templates/links.adoc" as links> + +<@tmpl.guide +title="Take site offline" +summary="This describes how to take a site offline so that it no longer processes client requests" > + +== When to use this procedure + +During the deployment lifecycle it might be required that one of the sites is temporarily taken offline +for maintenance or to allow for software upgrades. To ensure that no user requests are routed to the site requiring +maintenance, it is necessary for the site to be removed from your loadbalancer configuration. + +== Procedure + +Follow these steps to remove a site from the loadbalancer so that no traffic can be routed to it. + +=== Global Accelerator + +. Determine the ARN of the Network Load Balancer (NLB) associated with the site to be kept online ++ +<#include "partials/accelerator/nlb-arn.adoc" /> ++ +. Update the Accelerator EndpointGroup to only include a single site ++ +<#include "partials/accelerator/endpoint-group.adoc" /> ++ +.Output: +[source,bash] +---- +{ + "EndpointGroups": [ + { + "EndpointGroupArn": "arn:aws:globalaccelerator::606671647913:accelerator/d280fc09-3057-4ab6-9330-6cbf1f450748/listener/8769072f/endpoint-group/a30b64ec1700", + "EndpointGroupRegion": "eu-west-1", + "EndpointDescriptions": [ + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a49e56e51e16843b9a3bc686327c907b/9b786f80ed4eba3d", + "Weight": 128, + "HealthState": "HEALTHY", + "ClientIPPreservationEnabled": false + }, + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a3c75f239541c4a6e9c48cf8d48d602f/5ba333e87019ccf0", + "Weight": 128, + "HealthState": "HEALTHY", + "ClientIPPreservationEnabled": false + } + ], + "TrafficDialPercentage": 100.0, + "HealthCheckPort": 443, + "HealthCheckProtocol": "TCP", + "HealthCheckIntervalSeconds": 30, + "ThresholdCount": 3 + } + ] +} +---- ++ +.. Update the EndpointGroup to only include the NLB retrieved in step 1. ++ +.Command: +[source,bash] +---- +aws globalaccelerator update-endpoint-group \ + --endpoint-group-arn arn:aws:globalaccelerator::606671647913:accelerator/d280fc09-3057-4ab6-9330-6cbf1f450748/listener/8769072f/endpoint-group/a30b64ec1700 \ + --region us-west-2 \ + --endpoint-configurations ' + [ + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a49e56e51e16843b9a3bc686327c907b/9b786f80ed4eba3d", + "Weight": 128, + "ClientIPPreservationEnabled": false + } + ] +' +---- + + diff --git a/docs/guides/high-availability/operate-site-online.adoc b/docs/guides/high-availability/operate-site-online.adoc new file mode 100644 index 0000000000..2e8dfe903c --- /dev/null +++ b/docs/guides/high-availability/operate-site-online.adoc @@ -0,0 +1,77 @@ +<#import "/templates/guide.adoc" as tmpl> +<#import "/templates/links.adoc" as links> + +<@tmpl.guide +title="Bring site online" +summary="This guide describes how to bring a site online so that it can process client requests." 
> + +== When to use this procedure + +This procedure describes how to re-add a Keycloak site to the Global Accelerator, after it has previously been taken offline, +so that it can once again service client requests. + +== Procedure + +Follow these steps to re-add a Keycloak site to the AWS Global Accelerator so that it can handle client requests. + +=== Global Accelerator + +. Determine the ARN of the Network Load Balancer (NLB) associated with the site to be brought online ++ +<#include "partials/accelerator/nlb-arn.adoc" /> ++ +. Update the Accelerator EndpointGroup to include both sites + +<#include "partials/accelerator/endpoint-group.adoc" /> ++ +.Output: +[source,bash] +---- +{ + "EndpointGroups": [ + { + "EndpointGroupArn": "arn:aws:globalaccelerator::606671647913:accelerator/d280fc09-3057-4ab6-9330-6cbf1f450748/listener/8769072f/endpoint-group/a30b64ec1700", + "EndpointGroupRegion": "eu-west-1", + "EndpointDescriptions": [ + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a3c75f239541c4a6e9c48cf8d48d602f/5ba333e87019ccf0", + "Weight": 128, + "HealthState": "HEALTHY", + "ClientIPPreservationEnabled": false + } + ], + "TrafficDialPercentage": 100.0, + "HealthCheckPort": 443, + "HealthCheckProtocol": "TCP", + "HealthCheckIntervalSeconds": 30, + "ThresholdCount": 3 + } + ] +} +---- ++ +.. Update the EndpointGroup to include the existing Endpoint and the NLB retrieved in step 1. ++ +.Command: +[source,bash] +---- +aws globalaccelerator update-endpoint-group \ + --endpoint-group-arn arn:aws:globalaccelerator::606671647913:accelerator/d280fc09-3057-4ab6-9330-6cbf1f450748/listener/8769072f/endpoint-group/a30b64ec1700 \ + --region us-west-2 \ + --endpoint-configurations ' + [ + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a3c75f239541c4a6e9c48cf8d48d602f/5ba333e87019ccf0", + "Weight": 128, + "ClientIPPreservationEnabled": false + }, + { + "EndpointId": "arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a49e56e51e16843b9a3bc686327c907b/9b786f80ed4eba3d", + "Weight": 128, + "ClientIPPreservationEnabled": false + } + ] +' +---- + + diff --git a/docs/guides/high-availability/operate-switch-back.adoc b/docs/guides/high-availability/operate-switch-back.adoc deleted file mode 100644 index 4d520abb5e..0000000000 --- a/docs/guides/high-availability/operate-switch-back.adoc +++ /dev/null @@ -1,84 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Switch back to the primary site" -summary="This describes the operational procedures necessary" > - -These procedures switch back to the primary site back after a failover or switchover to the secondary site. -In a setup as outlined in <@links.ha id="concepts-active-passive-sync" /> together with the blueprints outlined in <@links.ha id="bblocks-active-passive-sync" />. - -include::partials/infinispan/infinispan-attributes.adoc[] - -// used by the CLI commands to avoid duplicating the code. -:stale-site: primary -:keep-site: secondary -:keep-site-name: {site-b-cr} -:stale-site-name: {site-a-cr} - -== When to use this procedure - -These procedures bring the primary site back to operation when the secondary site is handling all the traffic. -At the end of the {section}, the primary site is online again and handles the traffic. 
- -This procedure is necessary when the primary site has lost its state in {jdgserver_name}, a network partition occurred between the primary and the secondary site while the secondary site was active, or the replication was disabled as described in the <@links.ha id="operate-switch-over"/> {section}. - -If the data in {jdgserver_name} on both sites is still in sync, the procedure for {jdgserver_name} can be skipped. - -See the <@links.ha id="introduction" /> {section} for different operational procedures. - -== Procedures - -=== {jdgserver_name} Cluster - -For the context of this {section}, `{site-a}` is the primary site, recovering back to operation, and `{site-b}` is the secondary site, running in production. - -After the {jdgserver_name} in the primary site is back online and has joined the cross-site channel (see <@links.ha id="deploy-infinispan-kubernetes-crossdc" />#verifying-the-deployment on how to verify the {jdgserver_name} deployment), the state transfer must be manually started from the secondary site. - -After clearing the state in the primary site, it transfers the full state from the secondary site to the primary site, and it must be completed before the primary site can start handling incoming requests. - -WARNING: Transferring the full state may impact the {jdgserver_name} cluster perform by increasing the response time and/or resources usage. - -The first procedure is to delete any stale data from the primary site. - -. Log in to the primary site. - -. Shutdown {project_name}. -This action will clear all {project_name} caches and prevents the state of {project_name} from being out-of-sync with {jdgserver_name}. -+ -When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to 0. - -<#include "partials/infinispan/infinispan-cli-connect.adoc" /> - -include::partials/infinispan/infinispan-cli-clear-caches.adoc[] - -Now we are ready to transfer the state from the secondary site to the primary site. - -. Log in into your secondary site. - -<#include "partials/infinispan/infinispan-cli-connect.adoc" /> - -include::partials/infinispan/infinispan-cli-state-transfer.adoc[] - -. Log in to the primary site. - -. Start {project_name}. -+ -When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to the original value. - -Both {jdgserver_name} clusters are in sync and the switchover from secondary back to the primary site can be performed. - -=== AWS Aurora Database - -include::partials/aurora/aurora-failover.adoc[] - -=== Route53 - -If switching over to the secondary site has been triggered by changing the health endpoint, edit the health check in AWS to point to a correct endpoint (`/lb-check`). -After some minutes, the clients will notice the change and traffic will gradually move over to the secondary site. - -== Further reading - -See <@links.ha id="concepts-infinispan-cli-batch" /> on how to automate Infinispan CLI commands. 
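The steps above ask several times to change the number of {project_name} instances in the {project_name} Custom Resource. A minimal sketch of doing so with `kubectl`, assuming the CR is named `keycloak`, lives in the `keycloak` namespace, and normally runs three instances:

.Command:
[source,bash]
----
# Scale {project_name} down before clearing the caches
kubectl -n keycloak patch keycloak/keycloak --type merge -p '{"spec":{"instances":0}}'

# Restore the original instance count once the state transfer has completed
kubectl -n keycloak patch keycloak/keycloak --type merge -p '{"spec":{"instances":3}}'
----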
- - diff --git a/docs/guides/high-availability/operate-switch-over.adoc b/docs/guides/high-availability/operate-switch-over.adoc deleted file mode 100644 index e510fb3636..0000000000 --- a/docs/guides/high-availability/operate-switch-over.adoc +++ /dev/null @@ -1,95 +0,0 @@ -<#import "/templates/guide.adoc" as tmpl> -<#import "/templates/links.adoc" as links> - -<@tmpl.guide -title="Switch over to the secondary site" -summary="This topic describes the operational procedures necessary" > - -This procedure switches from the primary site to the secondary site when using a setup as outlined in <@links.ha id="concepts-active-passive-sync" /> together with the blueprints outlined in <@links.ha id="bblocks-active-passive-sync" />. - -include::partials/infinispan/infinispan-attributes.adoc[] - -== When to use this procedure - -Use this procedure to gracefully take the primary offline. - -Once the primary site is back online, use the {sections} <@links.ha id="operate-network-partition-recovery" /> and <@links.ha id="operate-switch-back" /> to return to the original state with the primary site being active. - -See the <@links.ha id="introduction" /> {section} for different operational procedures. - -== Procedures - -=== {jdgserver_name} Cluster - -For the context of this {section}, `{site-a}` is the primary site and `{site-b}` is the secondary site. - -When you are ready to take a site offline, a good practice is to disable the replication towards it. -This action prevents errors or delays when the channels are disconnected between the primary and the secondary site. - -==== Procedures to transfer state from secondary to primary site - -. Log in into your secondary site - -<#include "partials/infinispan/infinispan-cli-connect.adoc" /> - -. Disable the replication to the primary site by running the following command: -+ -.Command: -[source,bash,subs="+attributes"] ----- -site take-offline --all-caches --site={site-a-cr} ----- -+ -.Output: -[source,bash,subs="+attributes"] ----- -{ - "offlineClientSessions" : "ok", - "authenticationSessions" : "ok", - "sessions" : "ok", - "clientSessions" : "ok", - "work" : "ok", - "offlineSessions" : "ok", - "loginFailures" : "ok", - "actionTokens" : "ok" -} ----- - -. Check the replication status is `offline`. -+ -.Command: -[source,bash,subs="+attributes"] ----- -site status --all-caches --site={site-a-cr} ----- -+ -.Output: -[source,bash,subs="+attributes"] ----- -{ - "status" : "offline" -} ----- -+ -If the status is not `offline`, repeat the previous step. - -The {jdgserver_name} cluster in the secondary site is ready to handle requests without trying to replicate to the primary site. - -=== AWS Aurora Database - -include::partials/aurora/aurora-failover.adoc[] - -=== {project_name} Cluster - -No action required. - -=== Route53 - -To force Route53 to mark the primary site as not available, edit the health check in AWS to point to a non-existent route (`/lb-check-switched-over`). -After some minutes, the clients will notice the change and traffic will gradually move over to the secondary site. - -== Further reading - -See <@links.ha id="concepts-infinispan-cli-batch" /> on how to automate Infinispan CLI commands. 
- - diff --git a/docs/guides/high-availability/operate-synchronize.adoc b/docs/guides/high-availability/operate-synchronize.adoc new file mode 100644 index 0000000000..4b35761c7a --- /dev/null +++ b/docs/guides/high-availability/operate-synchronize.adoc @@ -0,0 +1,68 @@ +<#import "/templates/guide.adoc" as tmpl> +<#import "/templates/links.adoc" as links> + +<@tmpl.guide +title="Synchronize Sites" +summary="This describes the procedures required to synchronize an offline site with an online site" > + +include::partials/infinispan/infinispan-attributes.adoc[] + +== When to use this procedure + +Use this when the state of {jdgserver_name} clusters of two sites become disconnected and the contents of the caches are out-of-sync. +Perform this for example after a split-brain or when one site has been taken offline for maintenance. + +At the end of the procedure, the session contents on the secondary site have been discarded and replaced by the session +contents of the active site. All caches in the offline site are cleared to prevent invalid cache contents. + +== Procedures + +=== {jdgserver_name} Cluster + +For the context of this {section}, `{keep-site-name}` is the currently active site and `{stale-site-name}` is an offline site that is not part +of the AWS Global Accelerator EndpointGroup and is therefore not receiving user requests. + +WARNING: Transferring state may impact {jdgserver_name} cluster performance by increasing the response time and/or resources usage. + +The first procedure is to delete the stale data from the offline site. + +. Login into the offline site. + +. Shutdown {project_name}. +This will clear all {project_name} caches and prevents the {project_name} state from being out-of-sync with {jdgserver_name}. ++ +When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to 0. + +<#include "partials/infinispan/infinispan-cli-connect.adoc" /> +<#include "partials/infinispan/infinispan-cli-clear-caches.adoc" /> + +Now we are ready to transfer the state from the active site to the offline site. + +. Login into your Active site + +<#include "partials/infinispan/infinispan-cli-connect.adoc" /> + +<#include "partials/infinispan/infinispan-cli-state-transfer.adoc" /> + +Now the state is available in the offline site, {project_name} can be started again: + +. Login into your secondary site. + +. Startup {project_name}. ++ +When deploying {project_name} using the {project_name} Operator, change the number of {project_name} instances in the {project_name} Custom Resource to the original value. + +=== AWS Aurora Database + +No action required. + +=== AWS Global Accelerator + +Once the two sites have been synchronized, it is safe to add the previously offline site back to the Global Accelerator +EndpointGroup following the steps in the <@links.ha id="operate-site-online" /> {section}. + +== Further reading + +See <@links.ha id="concepts-infinispan-cli-batch" />. + + diff --git a/docs/guides/high-availability/partials/accelerator/endpoint-group.adoc b/docs/guides/high-availability/partials/accelerator/endpoint-group.adoc new file mode 100644 index 0000000000..3d512f3218 --- /dev/null +++ b/docs/guides/high-availability/partials/accelerator/endpoint-group.adoc @@ -0,0 +1,25 @@ +.. 
List the current endpoints in the Global Accelerator's EndpointGroup ++ +.Command: +[source,bash] +---- +<#noparse> +ACCELERATOR_NAME= # <1> +ACCELERATOR_ARN=$(aws globalaccelerator list-accelerators \ + --query "Accelerators[?Name=='${ACCELERATOR_NAME}'].AcceleratorArn" \ + --region us-west-2 \ # <2> + --output text +) +LISTENER_ARN=$(aws globalaccelerator list-listeners \ + --accelerator-arn ${ACCELERATOR_ARN} \ + --query "Listeners[*].ListenerArn" \ + --region us-west-2 \ + --output text +) +aws globalaccelerator list-endpoint-groups \ + --listener-arn ${LISTENER_ARN} \ + --region us-west-2 + +---- +<1> The name of the Accelerator to be updated +<2> The region must always be set to us-west-2 when querying AWS Global Accelerators diff --git a/docs/guides/high-availability/partials/accelerator/nlb-arn.adoc b/docs/guides/high-availability/partials/accelerator/nlb-arn.adoc new file mode 100644 index 0000000000..7a44ab05fc --- /dev/null +++ b/docs/guides/high-availability/partials/accelerator/nlb-arn.adoc @@ -0,0 +1,21 @@ +.Command: +[source,bash] +---- +<#noparse> +NAMESPACE= # <1> +REGION= # <2> +HOSTNAME=$(kubectl -n $NAMESPACE get svc accelerator-loadbalancer --template="{{range .status.loadBalancer.ingress}}{{.hostname}}{{end}}") +aws elbv2 describe-load-balancers \ + --query "LoadBalancers[?DNSName=='${HOSTNAME}'].LoadBalancerArn" \ + --region ${REGION} \ + --output text + +---- +<1> The Kubernetes namespace containing the Keycloak deployment +<2> The AWS Region hosting the Kubernetes cluster ++ +.Output: +[source,bash] +---- +arn:aws:elasticloadbalancing:eu-west-1:606671647913:loadbalancer/net/a49e56e51e16843b9a3bc686327c907b/9b786f80ed4eba3d +---- diff --git a/docs/guides/high-availability/partials/infinispan/infinispan-attributes.adoc b/docs/guides/high-availability/partials/infinispan/infinispan-attributes.adoc index 686f3bda57..dde7145b3f 100644 --- a/docs/guides/high-availability/partials/infinispan/infinispan-attributes.adoc +++ b/docs/guides/high-availability/partials/infinispan/infinispan-attributes.adoc @@ -23,3 +23,7 @@ :ispn-operator: {jdgserver_name} Operator :site-a: Site-A :site-b: Site-B +:stale-site: offline +:keep-site: active +:keep-site-name: {site-a-cr} +:stale-site-name: {site-b-cr} diff --git a/docs/guides/high-availability/pinned-guides b/docs/guides/high-availability/pinned-guides index 2ad53d1e3a..535537be41 100644 --- a/docs/guides/high-availability/pinned-guides +++ b/docs/guides/high-availability/pinned-guides @@ -1,6 +1,6 @@ introduction -concepts-active-passive-sync -bblocks-active-passive-sync +concepts-multi-site +bblocks-multi-site deploy-aurora-multi-az deploy-keycloak-kubernetes deploy-infinispan-kubernetes-crossdc diff --git a/docs/guides/images/high-availability/accelerator-multi-az.dio.svg b/docs/guides/images/high-availability/accelerator-multi-az.dio.svg new file mode 100644 index 0000000000..8a306387d9 --- /dev/null +++ b/docs/guides/images/high-availability/accelerator-multi-az.dio.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG, text labels only: a Browser («Local») connects to an AWS Global Accelerator, which forwards to two endpoints (Weight: 50 each) — a Network Load Balancer in front of Keycloak Pods (SVC) on ROSA Cluster 1 and ROSA Cluster 2, each in its own Availability Zone within the AWS Region.]
\ No newline at end of file diff --git a/docs/guides/images/high-availability/active-active-sync.dio.svg b/docs/guides/images/high-availability/active-active-sync.dio.svg new file mode 100644 index 0000000000..2a17651517 --- /dev/null +++ b/docs/guides/images/high-availability/active-active-sync.dio.svg @@ -0,0 +1,4 @@ + + + +
[drawio SVG, text labels only: a Browser connects through a Load Balancer to Keycloak and Infinispan on Site A and Site B; the Infinispan clusters replicate with «sync» and both sites share a synchronously replicated Database.]
\ No newline at end of file diff --git a/docs/guides/images/high-availability/active-passive-sync.dio.svg b/docs/guides/images/high-availability/active-passive-sync.dio.svg deleted file mode 100644 index cf373f638b..0000000000 --- a/docs/guides/images/high-availability/active-passive-sync.dio.svg +++ /dev/null @@ -1,21 +0,0 @@ - - - - - -
[drawio SVG, text labels only: a Browser connects through a Load Balancer to Keycloak and Infinispan on the Primary site (active) and the Secondary Datacenter (passive); a separate communication path is used after failover/switchover, Infinispan replicates with «sync», and both sites share a synchronously replicated Database.]
\ No newline at end of file diff --git a/docs/guides/images/high-availability/infinispan-crossdc-az.dio.svg b/docs/guides/images/high-availability/infinispan-crossdc-az.dio.svg index 571bd88b3b..2340277dc2 100644 --- a/docs/guides/images/high-availability/infinispan-crossdc-az.dio.svg +++ b/docs/guides/images/high-availability/infinispan-crossdc-az.dio.svg @@ -1,4 +1,4 @@ -
[drawio SVG, text labels only (previous version): Primary site (active) and Secondary site (passive), each a Kubernetes Cluster containing an Infinispan «Pod» and a GossipRouter «Pod».]
\ No newline at end of file +
[drawio SVG, text labels only (updated version): Site A and Site B, each a Kubernetes Cluster containing an Infinispan «Pod» and a GossipRouter «Pod».]
\ No newline at end of file diff --git a/docs/guides/images/high-availability/route53-multi-az-failover.svg b/docs/guides/images/high-availability/route53-multi-az-failover.svg deleted file mode 100644 index 822fc80de6..0000000000 --- a/docs/guides/images/high-availability/route53-multi-az-failover.svg +++ /dev/null @@ -1,4 +0,0 @@ - - - -
[SVG, text labels only: a Browser («Local») sends Client Requests via AWS Route53 to the Client Route of either the ROSA Primary Cluster or the ROSA Backup Cluster, each in its own Availability Zone of the AWS Region and exposing a Health Route in front of its Keycloak Pods.]
\ No newline at end of file