<#import "/templates/guide.adoc" as tmpl>
<#import "/templates/links.adoc" as links>
<@tmpl.guide
title="Concepts for sizing CPU and memory resources"
summary="Understand these concepts to avoid resource exhaustion and congestion"
preview="true"
previewDiscussionLink="https://github.com/keycloak/keycloak/discussions/25269"
tileVisible="false" >

Use this as a starting point to size a production environment.
Adjust the values for your environment as needed based on your load tests.

== Performance recommendations

[WARNING]
====
* Performance will be lowered when scaling to more Pods (due to additional overhead) and when using a cross-datacenter setup (due to additional traffic and operations).

* Increased cache sizes can improve performance when {project_name} instances run for a longer time.
This will decrease response times and reduce IOPS on the database.
Still, those caches need to be filled when an instance is restarted, so do not set resources too tightly based on the stable state measured once the caches have been filled.

* Use these values as a starting point and perform your own load tests before going into production.
====

Summary:

* CPU usage scales linearly with the number of requests up to the tested limit below.
* Memory usage scales linearly with the number of active sessions up to the tested limit below.

Recommendations:

* The base memory usage for an inactive Pod is 1 GB of RAM.

* Leave 1 GB extra head-room for spikes in RAM usage.

* For each 100,000 active user sessions, add 500 MB per Pod in a three-node cluster (tested with up to 200,000 sessions).
+
This assumes that each user connects to only one client.
Memory requirements increase with the number of client sessions per user session (not tested yet).

* For each 30 user logins per second, add 1 vCPU per Pod in a three-node cluster (tested with up to 300 per second).
+
{project_name} spends most of the CPU time hashing the password provided by the user.

* For each 450 client credential grants per second, add 1 vCPU per Pod in a three-node cluster (tested with up to 2000 per second).
+
Most CPU time goes into creating new TLS connections, as each client runs only a single request.

* For each 350 refresh token requests per second, add 1 vCPU per Pod in a three-node cluster (tested with up to 435 refresh token requests per second).

* Leave 200% extra head-room for CPU usage to handle spikes in the load.
This ensures a fast startup of the node and sufficient capacity to handle failover tasks, such as re-balancing Infinispan caches, when one node fails.
Performance of {project_name} dropped significantly when its Pods were throttled in our tests.

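
These rules can also be written down as a small calculation.
The following Python sketch is only an illustration of the numbers in this guide, not a {project_name} tool; the function names are made up for this example, the additive CPU model follows the calculation example in the next section (for example, `requested_cpu_vcpus(30, 450, 350)` evaluates to 3 vCPUs), and the results still need to be validated with your own load tests.

[source,python]
----
# Unofficial sizing sketch based on the recommendations above.
# All constants come from this guide; values are per Pod in a three-node cluster.

def requested_memory_gb(active_user_sessions: int) -> float:
    """1 GB base memory plus 500 MB per 100,000 active user sessions."""
    return 1.0 + 0.5 * (active_user_sessions / 100_000)


def memory_limit_gb(active_user_sessions: int) -> float:
    """Add 1 GB extra head-room for spikes in RAM usage."""
    return requested_memory_gb(active_user_sessions) + 1.0


def requested_cpu_vcpus(logins_per_sec: float,
                        client_credential_grants_per_sec: float,
                        refresh_token_requests_per_sec: float) -> float:
    """1 vCPU per 30 logins/s, 450 client credential grants/s, or 350 refresh token requests/s."""
    return (logins_per_sec / 30
            + client_credential_grants_per_sec / 450
            + refresh_token_requests_per_sec / 350)


def cpu_limit_vcpus(requested_vcpus: float) -> float:
    """Leave 200% extra head-room, so three times the requested CPU."""
    return 3 * requested_vcpus
----
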

=== Calculation example

Target size:

* 50,000 active user sessions
* 30 logins per second
* 450 client credential grants per second
* 350 refresh token requests per second

Limits calculated:

* CPU requested: 3 vCPU
+
(30 logins per second = 1 vCPU, 450 client credential grants per second = 1 vCPU, 350 refresh token requests per second = 1 vCPU)

* CPU limit: 9 vCPU
+
(Allow for three times the CPU requested to handle peaks, startups, and failover tasks)

* Memory requested: 1.25 GB
+
(1 GB base memory plus 250 MB RAM for 50,000 active sessions)

* Memory limit: 2.25 GB
+
(Adding 1 GB to the memory requested)

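
As a cross-check, the numbers in this example can be reproduced with a few lines of Python.
This is only an illustration of the arithmetic above with made-up variable names, not part of any {project_name} tooling.

[source,python]
----
# Recompute the calculation example above (values per Pod in a three-node cluster).
active_sessions = 50_000
logins_per_sec = 30
client_credential_grants_per_sec = 450
refresh_token_requests_per_sec = 350

cpu_requested = (logins_per_sec / 30
                 + client_credential_grants_per_sec / 450
                 + refresh_token_requests_per_sec / 350)         # 3.0 vCPU
cpu_limit = 3 * cpu_requested                                    # 9.0 vCPU

memory_requested_gb = 1.0 + 0.5 * (active_sessions / 100_000)    # 1.25 GB
memory_limit_gb = memory_requested_gb + 1.0                      # 2.25 GB

print(cpu_requested, cpu_limit, memory_requested_gb, memory_limit_gb)
----
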

== Reference architecture

The following setup was used to derive the settings above by running tests of about 10 minutes for different scenarios:

* OpenShift 4.13.x deployed on AWS via ROSA.
* Machinepool with `m5.4xlarge` instances.
* {project_name} deployed with the Operator and 3 pods.
* User password hashing with PBKDF2(SHA512) and 210,000 hash iterations (which is the default).
* Client credential grants don't use refresh tokens (which is the default).
* Database seeded with 100,000 users and 100,000 clients.
* Infinispan caches at default of 10,000 entries, so not all clients and users fit into the cache, and some requests will need to fetch the data from the database.
* All sessions in distributed caches as per default, with two owners per entry, allowing one failing Pod without losing data.
* OpenShift's reverse proxy running in passthrough mode, where the TLS connection of the client is terminated at the Pod.
* PostgreSQL deployed inside the same OpenShift with ephemeral storage.
+
Using a database with persistent storage will result in longer database latencies, which might lead to longer response times; still, the throughput should be similar.

</@tmpl.guide>