Add Health Checks for Multi-Site deployment to HA Docs

Closes #33143

Signed-off-by: Kamesh Akella <kamesh.asp@gmail.com>
Signed-off-by: Alexander Schwartz <aschwart@redhat.com>
Signed-off-by: Kamesh Akella <kakella@redhat.com>
Co-authored-by: Alexander Schwartz <aschwart@redhat.com>
Co-authored-by: andymunro <48995441+andymunro@users.noreply.github.com>
This commit is contained in:
Kamesh Akella 2024-10-02 04:43:25 -04:00 committed by GitHub
parent cd83f628db
commit d07d5ebf1a
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 115 additions and 0 deletions

View file

@ -0,0 +1,113 @@
<#import "/templates/guide.adoc" as tmpl>
<#import "/templates/links.adoc" as links>
<@tmpl.guide
title="Health checks for multi-site deployments"
summary="Validating the health of a multi-site deployment" >
When running the <@links.ha id="introduction" /> in a Kubernetes environment,
you should automate checks to see if everything is up and running as expected.
This page provides an overview of URLs,
Kubernetes resources, and Healthcheck endpoints available to verify a multi-site setup of {project_name}.
== Overview
A proactive monitoring strategy aims to detect and alert about issues before they impact users. This strategy is the key for a highly resilient and highly available {project_name} application.
Health checks across various architectural components (such as application health, load balancing, caching, and overall system status) are critical for:
Ensuring high availability:: Verifying that all sites and the load balancer are operational is a key to ensure that a system can handle requests even if one site goes down.
Maintaining performance:: Checking the health and distribution of the {jdgserver_name} cache ensures that {project_name} can maintain optimal performance by efficiently handling sessions and other temporary data.
Operational resilience:: By continuously monitoring the health of both {project_name} and its dependencies within the Kubernetes environment, the system can quickly identify and possibly auto-remediate issues, reducing downtime.
== Prerequisites
. https://kubernetes.io/docs/tasks/tools/#kubectl[Kubectl CLI is installed and configured].
. Install https://jqlang.github.io/jq/download/[jq] if it is not already installed on your operating system.
== Specific health checks
=== {project_name} load balancer and sites
Verifies the health of the {project_name} application through its load balancer and both primary and backup sites. This ensures that {project_name} is accessible and that the load balancing mechanism is functioning correctly across different geographical or network locations.
This command returns the health status of the {project_name} application's connection to its configured database, thus confirming the reliability of database connections.
This command is available only on the management port and not from the external URL.
In a Kubernetes setup, the sub-status `health/ready` is checked periodically to make the Pod as ready.
[source,bash]
----
curl -s https://keycloak:managementport/health
----
This command verifies the `lb-check` endpoint of the load balancer and ensures the {project_name} application cluster is up and running.
[source,bash]
----
curl -s https://keycloak-load-balancer-url/lb-check
----
These commands will return the running status of the Site A and Site B of the {project_name} in a multi-site setup.
[source,bash]
----
curl -s https://keycloak_site_a_url/lb-check
curl -s https://keycloak_site_b_url/lb-check
----
=== {jdgserver_name} Cache health
Check the health of the default cache manager and individual caches in an external {jdgserver_name} cluster.
This check is vital for {project_name} performance and reliability,
as {jdgserver_name} is often used for distributed caching and session clustering in {project_name} deployments.
This command returns the overall health of the {jdgserver_name} cache manager, which is useful as the Admin user does not need to provide user credentials to get the health status.
[source,bash]
----
curl -s https://infinispan_rest_url/rest/v2/cache-managers/default/health/status
----
In contrast to the preceding health checks, the following health checks require the Admin user to provide the {jdgserver_name} user credentials as part of the request to peek into the overall health of the external {jdgserver_name} cluster caches.
[source,bash]
----
curl -u <infinispan_user>:<infinispan_pwd> -s https://infinispan_rest_url/rest/v2/cache-managers/default/health \
| jq 'if .cluster_health.health_status == "HEALTHY" and (all(.cache_health[].status; . == "HEALTHY")) then "HEALTHY" else "UNHEALTHY" end'
----
The `jq` filter is a convenience to compute the overall health based on the individual cache health.
You can also choose to run the above command without the `jq` filter to see the full details.
=== {jdgserver_name} Cluster distribution
Assesses the distribution health of the {jdgserver_name} cluster, ensuring that the cluster's nodes are correctly distributing data. This step is essential for the scalability and fault tolerance of the caching layer.
You can modify the `expectedCount 3` argument to match the total nodes in the cluster and validate if they are healthy or not.
[source,bash]
----
curl <infinispan_user>:<infinispan_pwd> -s https://infinispan_rest_url/rest/v2/cluster\?action\=distribution \
| jq --argjson expectedCount 3 'if map(select(.node_addresses | length > 0)) | length == $expectedCount then "HEALTHY" else "UNHEALTHY" end'
----
=== Overall, {jdgserver_name} system health
Uses the `kubectl` CLI tool to query the health status of {jdgserver_name} clusters and the {project_name} service in the specified namespace. This comprehensive check ensures that all components of the {project_name} deployment are operational and correctly configured within the Kubernetes environment.
[source,bash]
----
kubectl get infinispan -n <NAMESPACE> -o json \
| jq '.items[].status.conditions' \
| jq 'map({(.type): .status})' \
| jq 'reduce .[] as $item ([]; . + [keys[] | select($item[.] != "True")]) | if length == 0 then "HEALTHY" else "UNHEALTHY: " + (join(", ")) end'
----
=== {project_name} readiness in Kubernetes
Specifically, checks for the readiness and rolling update conditions of {project_name} deployments in Kubernetes,
ensuring that the {project_name} instances are fully operational and not undergoing updates that could impact availability.
[source,bash]
----
kubectl wait --for=condition=Ready --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n <NAMESPACE>
kubectl wait --for=condition=RollingUpdate=False --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n <NAMESPACE>
----
</@tmpl.guide>

View file

@ -39,6 +39,7 @@ Additional performance tuning and security hardening are still recommended when
* <@links.ha id="operate-synchronize" />
* <@links.ha id="operate-site-offline" />
* <@links.ha id="operate-site-online" />
* <@links.ha id="health-checks-multi-site" />
</@profile.ifCommunity>

View file

@ -6,6 +6,7 @@ deploy-infinispan-kubernetes-crossdc
deploy-keycloak-kubernetes
deploy-aws-route53-loadbalancer
deploy-aws-route53-failover-lambda
health-checks-multi-site
operate-failover
operate-switch-over
operate-network-partition-recovery