Add Health Checks for Multi-Site deployment to HA Docs
Closes #33143 Signed-off-by: Kamesh Akella <kamesh.asp@gmail.com> Signed-off-by: Alexander Schwartz <aschwart@redhat.com> Signed-off-by: Kamesh Akella <kakella@redhat.com> Co-authored-by: Alexander Schwartz <aschwart@redhat.com> Co-authored-by: andymunro <48995441+andymunro@users.noreply.github.com>
This commit is contained in:
parent
cd83f628db
commit
d07d5ebf1a
3 changed files with 115 additions and 0 deletions
113
docs/guides/high-availability/health-checks-multi-site.adoc
Normal file
113
docs/guides/high-availability/health-checks-multi-site.adoc
Normal file
|
@ -0,0 +1,113 @@
|
|||
<#import "/templates/guide.adoc" as tmpl>
|
||||
<#import "/templates/links.adoc" as links>
|
||||
|
||||
<@tmpl.guide
|
||||
title="Health checks for multi-site deployments"
|
||||
summary="Validating the health of a multi-site deployment" >
|
||||
|
||||
When running the <@links.ha id="introduction" /> in a Kubernetes environment,
|
||||
you should automate checks to see if everything is up and running as expected.
|
||||
|
||||
This page provides an overview of URLs,
|
||||
Kubernetes resources, and Healthcheck endpoints available to verify a multi-site setup of {project_name}.
|
||||
|
||||
== Overview
|
||||
|
||||
A proactive monitoring strategy aims to detect and alert about issues before they impact users. This strategy is the key for a highly resilient and highly available {project_name} application.
|
||||
|
||||
Health checks across various architectural components (such as application health, load balancing, caching, and overall system status) are critical for:
|
||||
|
||||
Ensuring high availability:: Verifying that all sites and the load balancer are operational is a key to ensure that a system can handle requests even if one site goes down.
|
||||
|
||||
Maintaining performance:: Checking the health and distribution of the {jdgserver_name} cache ensures that {project_name} can maintain optimal performance by efficiently handling sessions and other temporary data.
|
||||
|
||||
Operational resilience:: By continuously monitoring the health of both {project_name} and its dependencies within the Kubernetes environment, the system can quickly identify and possibly auto-remediate issues, reducing downtime.
|
||||
|
||||
== Prerequisites
|
||||
|
||||
. https://kubernetes.io/docs/tasks/tools/#kubectl[Kubectl CLI is installed and configured].
|
||||
|
||||
. Install https://jqlang.github.io/jq/download/[jq] if it is not already installed on your operating system.
|
||||
|
||||
== Specific health checks
|
||||
|
||||
=== {project_name} load balancer and sites
|
||||
|
||||
Verifies the health of the {project_name} application through its load balancer and both primary and backup sites. This ensures that {project_name} is accessible and that the load balancing mechanism is functioning correctly across different geographical or network locations.
|
||||
|
||||
This command returns the health status of the {project_name} application's connection to its configured database, thus confirming the reliability of database connections.
|
||||
This command is available only on the management port and not from the external URL.
|
||||
In a Kubernetes setup, the sub-status `health/ready` is checked periodically to make the Pod as ready.
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
curl -s https://keycloak:managementport/health
|
||||
----
|
||||
|
||||
This command verifies the `lb-check` endpoint of the load balancer and ensures the {project_name} application cluster is up and running.
|
||||
[source,bash]
|
||||
----
|
||||
curl -s https://keycloak-load-balancer-url/lb-check
|
||||
----
|
||||
|
||||
These commands will return the running status of the Site A and Site B of the {project_name} in a multi-site setup.
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
curl -s https://keycloak_site_a_url/lb-check
|
||||
curl -s https://keycloak_site_b_url/lb-check
|
||||
----
|
||||
|
||||
=== {jdgserver_name} Cache health
|
||||
Check the health of the default cache manager and individual caches in an external {jdgserver_name} cluster.
|
||||
This check is vital for {project_name} performance and reliability,
|
||||
as {jdgserver_name} is often used for distributed caching and session clustering in {project_name} deployments.
|
||||
|
||||
This command returns the overall health of the {jdgserver_name} cache manager, which is useful as the Admin user does not need to provide user credentials to get the health status.
|
||||
[source,bash]
|
||||
----
|
||||
curl -s https://infinispan_rest_url/rest/v2/cache-managers/default/health/status
|
||||
----
|
||||
|
||||
In contrast to the preceding health checks, the following health checks require the Admin user to provide the {jdgserver_name} user credentials as part of the request to peek into the overall health of the external {jdgserver_name} cluster caches.
|
||||
[source,bash]
|
||||
----
|
||||
curl -u <infinispan_user>:<infinispan_pwd> -s https://infinispan_rest_url/rest/v2/cache-managers/default/health \
|
||||
| jq 'if .cluster_health.health_status == "HEALTHY" and (all(.cache_health[].status; . == "HEALTHY")) then "HEALTHY" else "UNHEALTHY" end'
|
||||
----
|
||||
|
||||
The `jq` filter is a convenience to compute the overall health based on the individual cache health.
|
||||
You can also choose to run the above command without the `jq` filter to see the full details.
|
||||
|
||||
=== {jdgserver_name} Cluster distribution
|
||||
Assesses the distribution health of the {jdgserver_name} cluster, ensuring that the cluster's nodes are correctly distributing data. This step is essential for the scalability and fault tolerance of the caching layer.
|
||||
|
||||
You can modify the `expectedCount 3` argument to match the total nodes in the cluster and validate if they are healthy or not.
|
||||
[source,bash]
|
||||
----
|
||||
curl <infinispan_user>:<infinispan_pwd> -s https://infinispan_rest_url/rest/v2/cluster\?action\=distribution \
|
||||
| jq --argjson expectedCount 3 'if map(select(.node_addresses | length > 0)) | length == $expectedCount then "HEALTHY" else "UNHEALTHY" end'
|
||||
----
|
||||
|
||||
=== Overall, {jdgserver_name} system health
|
||||
Uses the `kubectl` CLI tool to query the health status of {jdgserver_name} clusters and the {project_name} service in the specified namespace. This comprehensive check ensures that all components of the {project_name} deployment are operational and correctly configured within the Kubernetes environment.
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
kubectl get infinispan -n <NAMESPACE> -o json \
|
||||
| jq '.items[].status.conditions' \
|
||||
| jq 'map({(.type): .status})' \
|
||||
| jq 'reduce .[] as $item ([]; . + [keys[] | select($item[.] != "True")]) | if length == 0 then "HEALTHY" else "UNHEALTHY: " + (join(", ")) end'
|
||||
----
|
||||
|
||||
=== {project_name} readiness in Kubernetes
|
||||
Specifically, checks for the readiness and rolling update conditions of {project_name} deployments in Kubernetes,
|
||||
ensuring that the {project_name} instances are fully operational and not undergoing updates that could impact availability.
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
kubectl wait --for=condition=Ready --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n <NAMESPACE>
|
||||
kubectl wait --for=condition=RollingUpdate=False --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n <NAMESPACE>
|
||||
----
|
||||
|
||||
</@tmpl.guide>
|
|
@ -39,6 +39,7 @@ Additional performance tuning and security hardening are still recommended when
|
|||
* <@links.ha id="operate-synchronize" />
|
||||
* <@links.ha id="operate-site-offline" />
|
||||
* <@links.ha id="operate-site-online" />
|
||||
* <@links.ha id="health-checks-multi-site" />
|
||||
|
||||
</@profile.ifCommunity>
|
||||
|
||||
|
|
|
@ -6,6 +6,7 @@ deploy-infinispan-kubernetes-crossdc
|
|||
deploy-keycloak-kubernetes
|
||||
deploy-aws-route53-loadbalancer
|
||||
deploy-aws-route53-failover-lambda
|
||||
health-checks-multi-site
|
||||
operate-failover
|
||||
operate-switch-over
|
||||
operate-network-partition-recovery
|
||||
|
|
Loading…
Reference in a new issue