This topic describes a highly available multi-site setup and the behavior to expect. It outlines the requirements of the high availability architecture and describes the benefits and tradeoffs.
When data is changed in one {project_name} instance, that data is updated in the database, and an invalidation message is sent to the other site using the replicated `work` cache.
Session-related data is stored in the replicated Infinispan caches of {project_name}, and forwarded to the external {jdgserver_name}, which synchronously forwards the information to the external {jdgserver_name} running in the other site.
As session data of the external {jdgserver_name} is also cached in the Infinispan caches, invalidation messages of the replicated `work` cache are needed for invalidation.
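
The invalidation flow described above can be illustrated with a toy model. This is only a sketch and not how {project_name} is implemented: the `Database` and `Instance` classes and the `user-1:email` key are hypothetical, the shared database is modeled as a single object, and invalidation messages are delivered as direct method calls instead of through a replicated cache.

```python
class Database:
    """Stands in for the shared, synchronously replicated database."""

    def __init__(self):
        self.rows = {}


class Instance:
    """One node with a local cache in front of the shared database."""

    def __init__(self, db):
        self.db = db
        self.local_cache = {}
        self.peers = []  # instances that receive invalidation messages

    def read(self, key):
        # Serve from the local cache, loading from the database on a miss.
        if key not in self.local_cache:
            self.local_cache[key] = self.db.rows[key]
        return self.local_cache[key]

    def write(self, key, value):
        # Update the database first, then tell the peers to drop their copy.
        self.db.rows[key] = value
        self.local_cache[key] = value
        for peer in self.peers:
            peer.invalidate(key)

    def invalidate(self, key):
        # Drop the cached copy so the next read goes back to the database.
        self.local_cache.pop(key, None)


db = Database()
site_a, site_b = Instance(db), Instance(db)
site_a.peers.append(site_b)
site_b.peers.append(site_a)

site_a.write("user-1:email", "old@example.com")
assert site_b.read("user-1:email") == "old@example.com"  # now cached on site B
site_a.write("user-1:email", "new@example.com")          # invalidates site B's copy
assert site_b.read("user-1:email") == "new@example.com"  # reloaded from the database
```

Without the `invalidate` call, the last read on site B would still return the old value from its local cache, which is the stale-data case the invalidation messages are meant to prevent.
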
* Once failures occur in the communication between the sites, manual steps are necessary to re-synchronize a degraded setup.
* Degraded setups can lead to service or data loss if additional components fail.
Monitoring is necessary to detect degraded setups.
== Failures which this setup can survive
[%autowidth]
|===
| Failure | Recovery | RPO^1^ | RTO^2^
| Database node
| If the writer instance fails, the database can promote a reader instance in the same or other site to be the new writer.
| No data loss
| Seconds to minutes (depending on the database)
| {project_name} node
| Multiple {project_name} instances run in each site. If one instance fails, it takes a few seconds for the other nodes to notice the change, and some incoming requests might receive an error message or be delayed for a few seconds.
| No data loss
| Less than one minute
| {jdgserver_name} node
| Multiple {jdgserver_name} instances run in each site. If one instance fails, it takes a few seconds for the other nodes to notice the change. Sessions are stored in at least two {jdgserver_name} nodes, so a single node failure does not lead to data loss.
| If the {jdgserver_name} cluster fails in one of the sites, {project_name} will not be able to communicate with the external {jdgserver_name} on that site, and the {project_name} service will be unavailable.
The load balancer will detect the situation as `/lb-check` returns an error, and will direct all traffic to the other site.
The {jdgserver_name} will mark the other site offline, and will stop sending data.
One of the sites needs to be taken offline in the load balancer until the connection is restored and the session data is re-synchronized between the two sites.
In the blueprints, we show how this can be automated.
|===
^1^ Recovery point objective, assuming all parts of the setup were healthy at the time this occurred. +
^2^ Recovery time objective. +
^3^ Manual operations needed to restore the degraded setup.
The statement "`No data loss`" depends on the setup not being degraded from previous failures, which includes completing any pending manual operations to resynchronize the state between the sites.
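
The load balancer behavior mentioned in the table can be sketched as a small routing decision. This is a simplified illustration, not a load balancer configuration: the function name and the site names are hypothetical, and treating only an HTTP 200 response from `/lb-check` as healthy is an assumption made for this example.

```python
def sites_receiving_traffic(health_status):
    """Return the sites a load balancer would keep in rotation.

    health_status maps a site name to the HTTP status code returned by that
    site's /lb-check endpoint, or None if the request failed entirely.
    Assumption: only a 200 response counts as healthy.
    """
    return sorted(site for site, status in health_status.items() if status == 200)


# Both sites healthy: traffic goes to both.
print(sites_receiving_traffic({"site-a": 200, "site-b": 200}))  # ['site-a', 'site-b']

# The check fails on site A: all traffic goes to site B.
print(sites_receiving_traffic({"site-a": 503, "site-b": 200}))  # ['site-b']
```

An empty result means no site passes the check, which corresponds to a full outage instead of a failover.
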
* A successful failover requires a setup not degraded from previous failures.
All manual operations like a re-synchronization after a previous failure must be complete to prevent data loss.
Use monitoring to ensure degradations are detected and handled in a timely manner.
Out-of-sync sites::
* The sites can become out of sync when a synchronous {jdgserver_name} request fails.
This situation is currently difficult to monitor, and recovering from it requires a full manual re-synchronization of {jdgserver_name}.
Monitoring the number of cache entries in both sites and the {project_name} log file can show when a re-synchronization becomes necessary.
Manual operations::
* Manual operations that re-synchronize the {jdgserver_name} state between the sites issue a full state transfer, which puts stress on the system (network, CPU, Java heap in {jdgserver_name} and {project_name}).
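
The entry-count monitoring mentioned under out-of-sync sites could look like the following minimal sketch. It assumes the per-cache entry counts of each site have already been collected by some monitoring system; how they are obtained is left open. The function name, the cache names, the numbers, and the `tolerance` parameter are illustrative assumptions.

```python
def caches_out_of_sync(counts_site_a, counts_site_b, tolerance=0):
    """Return the names of caches whose entry counts differ between sites.

    Each argument maps a cache name to the number of entries observed in
    that site. A cache missing from one site is counted as having 0 entries.
    tolerance is the largest difference that is still treated as in sync.
    """
    names = set(counts_site_a) | set(counts_site_b)
    return sorted(
        name
        for name in names
        if abs(counts_site_a.get(name, 0) - counts_site_b.get(name, 0)) > tolerance
    )


# Illustrative numbers only.
site_a = {"sessions": 1200, "offlineSessions": 40}
site_b = {"sessions": 1187, "offlineSessions": 40}
print(caches_out_of_sync(site_a, site_b, tolerance=5))  # ['sessions']
```

A non-zero tolerance might be useful because counts sampled at slightly different times can differ briefly even when the sites are healthy. A persistent difference is the signal that a re-synchronization might be needed.
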
A synchronously replicated database ensures that data written in one site is always available in the other site after a site failure and that no data is lost.
It also ensures that the next request will not return stale data, regardless of which site serves it.
A synchronously replicated {jdgserver_name} ensures that sessions created, updated and deleted in one site are always available on the other site after a site failure and that no data is lost.
It also ensures that the next request will not return stale data, regardless of which site serves it.
Synchronous database replication and synchronous {jdgserver_name} replication both require low latency between the sites: when data is updated, a single request can involve multiple interactions between the sites, and each of them adds to the latency of that request.
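
The amplification effect can be made concrete with some simple arithmetic. The sketch below assumes that the interactions between the sites happen one after the other and that each costs one full round trip. The function name and all numbers are illustrative and not measurements.

```python
def added_latency_ms(round_trip_ms, cross_site_interactions):
    """Latency added to one request by sequential cross-site round trips."""
    return round_trip_ms * cross_site_interactions


# One request that triggers four sequential interactions between the sites.
for round_trip_ms in (5, 50):
    extra = added_latency_ms(round_trip_ms, cross_site_interactions=4)
    print(f"{round_trip_ms} ms round trip -> {extra} ms added per request")
```

Under these assumptions, a 5 ms round trip adds 20 ms to the request, while a 50 ms round trip adds 200 ms.
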
Is this setup limited to two sites?::
This setup could be extended to multiple sites, and there are no fundamental changes necessary to have, for example, three sites.
Once more sites are added, the overall latency between the sites increases, and the likelihood of network failures, and therefore short downtimes, increases as well.
Therefore, such a deployment is expected to have worse performance and inferior availability.
For now, it has been tested and documented with blueprints only for two sites.
Is a synchronous cluster less stable than an asynchronous cluster?::
An asynchronous setup would handle network failures between the sites gracefully, while the synchronous setup would delay requests and throw errors to the caller in situations where the asynchronous setup would have deferred the writes to {jdgserver_name} or the database on the other site.
However, as the two sites would never be fully up-to-date, an asynchronous setup could lead to data loss during failures. This would include:
* Lost logouts, meaning sessions remain logged in on one site although they are logged out on the other site at the point of failure, when using asynchronous {jdgserver_name} replication of sessions.
* Lost changes, leading to users being able to log in with an old password because database changes are not replicated to the other site at the point of failure, when using an asynchronous database.
* Invalid caches, leading to users being able to log in with an old password because cache invalidations are not propagated to the other site at the point of failure, when using asynchronous {jdgserver_name} replication.
Therefore, tradeoffs exist between high availability and consistency. The focus of this topic is to prioritize consistency over availability with {project_name}.
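
The lost-logout scenario can be illustrated with a toy simulation. This is a sketch under strong simplifications and does not reflect how {jdgserver_name} replicates data: the `Site` and `Replicator` classes are hypothetical, replication runs in one direction only, and a site failure is modeled by discarding the changes that were still queued.

```python
class Site:
    """A site holding the set of sessions it considers logged in."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()


class Replicator:
    """Copies session changes from a primary site to a backup site."""

    def __init__(self, primary, backup, synchronous):
        self.primary = primary
        self.backup = backup
        self.synchronous = synchronous
        self.pending = []  # changes not yet applied to the backup

    def login(self, session_id):
        self.primary.sessions.add(session_id)
        self._replicate(("add", session_id))

    def logout(self, session_id):
        self.primary.sessions.discard(session_id)
        self._replicate(("remove", session_id))

    def _replicate(self, change):
        if self.synchronous:
            self._apply(change)          # applied before the request returns
        else:
            self.pending.append(change)  # deferred until the next flush

    def _apply(self, change):
        operation, session_id = change
        if operation == "add":
            self.backup.sessions.add(session_id)
        else:
            self.backup.sessions.discard(session_id)

    def flush(self):
        """Apply all deferred changes to the backup site."""
        for change in self.pending:
            self._apply(change)
        self.pending.clear()

    def primary_fails(self):
        """The primary is lost; deferred changes never reach the backup."""
        self.pending.clear()


def session_survives_logout(synchronous):
    """Return True if the backup still has the session after a logout."""
    site_a, site_b = Site("a"), Site("b")
    replicator = Replicator(site_a, site_b, synchronous)
    replicator.login("s1")
    replicator.flush()          # the login has reached the backup
    replicator.logout("s1")
    replicator.primary_fails()  # failure before any deferred change is sent
    return "s1" in site_b.sessions


print(session_survives_logout(synchronous=True))   # False: the logout was replicated
print(session_survives_logout(synchronous=False))  # True: the logout was lost
```

The model leaves out the cost of the synchronous variant: when the other site cannot be reached, a synchronous write would be delayed or fail instead of being queued.
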