diff --git a/docs/guides/high-availability/concepts-memory-and-cpu-sizing.adoc b/docs/guides/high-availability/concepts-memory-and-cpu-sizing.adoc index a3d68d1289..ff302b6d98 100644 --- a/docs/guides/high-availability/concepts-memory-and-cpu-sizing.adoc +++ b/docs/guides/high-availability/concepts-memory-and-cpu-sizing.adoc @@ -25,7 +25,6 @@ Still, those caches need to be filled when an instance is restarted, so do not s Summary: * The used CPU scales linearly with the number of requests up to the tested limit below. -* The used memory scales linearly with the number of active sessions up to the tested limit below. Recommendations: @@ -120,14 +119,14 @@ Slower failover and more cost-effective:: Reduce the CPU requests and limits as above by 50% for the second site. When one of the sites fails, scale the remaining site from 3 Pod to 6 Pods either manually, automated, or using a Horizontal Pod Autoscaler. This requires enough spare capacity on the cluster or cluster auto-scaling capabilities. Alternative setup for some environments:: -Reduce the CPU requests by 50% for the second site, but keep the CPU limits as above. This way the remaining site can take the traffic but only at the downside the Nodes will experience a CPU pressure and therefore slower response times during peak traffic. -The benefit of this setup is that the number of Pod do not need to scale during failovers which is simpler to set up. +Reduce the CPU requests by 50% for the second site, but keep the CPU limits as above. This way, the remaining site can take the traffic, but only at the downside that the Nodes will experience CPU pressure and therefore slower response times during peak traffic. +The benefit of this setup is that the number of Pods does not need to scale during failovers which is simpler to set up. 
== Reference architecture The following setup was used to retrieve the settings above to run tests of about 10 minutes for different scenarios: -* OpenShift 4.15.x deployed on AWS via ROSA. +* OpenShift 4.16.x deployed on AWS via ROSA. * Machinepool with `m5.4xlarge` instances. * {project_name} deployed with the Operator and 3 pods in a high-availability setup with two sites in active/active mode. * OpenShift's reverse proxy running in passthrough mode were the TLS connection of the client is terminated at the Pod. diff --git a/docs/guides/high-availability/introduction.adoc b/docs/guides/high-availability/introduction.adoc index 40f2117eaa..537480879a 100644 --- a/docs/guides/high-availability/introduction.adoc +++ b/docs/guides/high-availability/introduction.adoc @@ -12,6 +12,101 @@ Those setups are intended for a transparent network on a single site. The {project_name} high-availability guide goes one step further to describe setups across multiple sites. While this setup adds additional complexity, that extra amount of high availability may be needed for some environments. +== When to use a multi-site setup + +The multi-site deployment capabilities of {project_name} are targeted at use cases that: + +* Are constrained to a single +<@profile.ifProduct> +AWS Region. + +<@profile.ifCommunity> +AWS Region or an equivalent low-latency setup. + +* Permit planned outages for maintenance. +* Fit within a defined user and request count. +* Can accept the impact of periodic outages. + +<@profile.ifCommunity> +== Tested Configuration + +We regularly test {project_name} with the following configuration: + +<@profile.ifProduct> +== Supported Configuration + + +* Two OpenShift single-AZ clusters, in the same AWS Region +** Provisioned with https://www.redhat.com/en/technologies/cloud-computing/openshift/aws[Red Hat OpenShift Service on AWS] (ROSA), +<@profile.ifProduct> +either ROSA HCP or ROSA classic. + +<@profile.ifCommunity> +using ROSA HCP. 
+ + +** Each OpenShift cluster has all its workers in a single Availability Zone. +** OpenShift version +<@profile.ifProduct> +4.16 (or later). + +<@profile.ifCommunity> +4.16. + + +* Amazon Aurora PostgreSQL database +** High availability with a primary DB instance in one Availability Zone, and a synchronously replicated reader in the second Availability Zone +** Version ${properties["aurora-postgresql.version"]} +* AWS Global Accelerator, sending traffic to both ROSA clusters +* AWS Lambda +<@profile.ifCommunity> +triggered by ROSA's Prometheus and Alert Manager + +to automate failover + +<@profile.ifProduct> +Any deviation from the configuration above is not supported and any issue must be replicated in that environment for support. + +<@profile.ifCommunity> +While equivalent setups should work, you will need to verify the performance and failure behavior of your environment. +We provide functional tests, failure tests and load tests in the https://github.com/keycloak/keycloak-benchmark[Keycloak Benchmark Project]. + + +Read more on each item in the <@links.ha id="bblocks-multi-site" /> {section}. + +<@profile.ifProduct> +== Maximum load + +<@profile.ifCommunity> +== Tested load + +We regularly test {project_name} with the following load: + + +* 100,000 users +* 300 requests per second + +<@profile.ifCommunity> +While we did not see a hard limit in our tests with these values, we ask you to test for higher volumes with horizontally and vertically scaled {project_name} instances and databases. + + +See the <@links.ha id="concepts-memory-and-cpu-sizing" /> {section} for more information. + +== Limitations + +<@profile.ifCommunity> +Even with the additional redundancy of the two sites, downtimes can still occur: + + +* During upgrades of {project_name} or {jdgserver_name}, both sites need to be taken offline for the duration of the upgrade. +* During certain failure scenarios, there may be downtime of up to 5 minutes. 
+* After certain failure scenarios, manual intervention may be required to restore redundancy by bringing the failed site back online. +* During certain switchover scenarios, there may be downtime of up to 5 minutes. + +For more details on limitations see the <@links.ha id="concepts-multi-site" /> {section}. + +== Next steps + The different {sections} introduce the necessary concepts and building blocks. For each building block, a blueprint shows how to set a fully functional example. Additional performance tuning and security hardening are still recommended when preparing a production setup.