Merge pull request #251 from hmlnarik/KEYCLOAK-4188-Documentation-for-Cross-DC

KEYCLOAK-4188 Fixes
Hynek Mlnařík 2017-11-30 13:23:39 +01:00 committed by GitHub
commit 5f7f288bf5

@@ -196,8 +196,8 @@ Follow these steps to set up the {jdgserver_name} server:
...
</stack>
```
-NOTE: This is just an example setup to have things quickly running. In production, you are not required to use `tcp` stack for the JGroups `RELAY2`, but you can configure any other stack. For example, you could use the UDP protocol, if the network between your data centers is able to support multicast. Similarly, you are not required to use `TCPPING` as discovery protocol. And in production, you probably won't use `TCPPING` due it's static nature. Finally, site names are also configurable.
-Details of this more-detailed setup are out-of-scope of the {project_name} documentation. See the JDG documentation and JGroups documentation for more details.
+NOTE: This is just an example setup to get things running quickly. In production, you are not required to use the `tcp` stack for the JGroups `RELAY2`; you can configure any other stack. For example, you could use the default `udp` stack if the network between your data centers supports multicast. Just make sure that the {jdgserver_name} and {project_name} clusters are mutually indiscoverable. Similarly, you are not required to use `TCPPING` as the discovery protocol, and in production you probably won't use `TCPPING` due to its static nature. Finally, site names are also configurable.
+Details of such a setup are out of scope of the {project_name} documentation. See the {jdgserver_name} and JGroups documentation for more details.
+
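To make the NOTE above concrete, here is a minimal sketch of what a `tcp` stack with `TCPPING` discovery could look like in the {jdgserver_name} `clustered.xml`. The hostnames and ports are illustrative assumptions, not values from the example setup:

```xml
<stack name="tcp">
    <transport type="TCP" socket-binding="jgroups-tcp"/>
    <!-- TCPPING discovers members from a static list of hosts;
         this static nature is why it is rarely used in production -->
    <protocol type="TCPPING">
        <property name="initial_hosts">jdg1.example.com[7600],jdg2.example.com[7600]</property>
        <property name="port_range">0</property>
    </protocol>
    ...
</stack>
```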
@@ -238,6 +238,7 @@ NOTE: Details about the configuration options inside `replicated-cache-configura
```xml
<relay site="site2">
<remote-site name="site1" channel="xsite"/>
+<property name="relay_multicasts">false</property>
</relay>
```
+
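By symmetry, the `jdg1` server on `site1` would carry the mirror-image `relay` element; this counterpart is implied by the snippet above rather than quoted from the guide:

```xml
<relay site="site1">
    <remote-site name="site2" channel="xsite"/>
    <property name="relay_multicasts">false</property>
</relay>
```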
@@ -266,7 +267,8 @@ cd JDG1_HOME/bin
+
```
cd JDG2_HOME/bin
-./standalone.sh -c clustered.xml -Djava.net.preferIPv4Stack=true \ -Djboss.default.multicast.address=234.56.78.100 \
+./standalone.sh -c clustered.xml -Djava.net.preferIPv4Stack=true \
+-Djboss.default.multicast.address=234.56.78.100 \
-Djboss.node.name=jdg2
```
+
@@ -290,7 +292,7 @@ When you use the MBean `jgroups:type=protocol,cluster="cluster",protocol=GMS`, y
(1) jdg2
```
+
-NOTE: In production, you can have more JDG servers in every data center. You just need to ensure that JDG servers in same data center are using the same multicast address (In other words, the same `jboss.default.multicast.address` during startup). Then in jconsole in `GMS` protocol view, you will see all the members of current cluster.
+NOTE: In production, you can have more {jdgserver_name} servers in every data center. You just need to ensure that {jdgserver_name} servers in the same data center use the same multicast address (in other words, the same `jboss.default.multicast.address` during startup). Then, in the `GMS` protocol view in jconsole, you will see all the members of the current cluster.
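For instance, a hypothetical third server joining the same data center as `jdg2` would reuse the multicast address from above but get its own node name; the path, node name, and port offset below are illustrative assumptions:

```
cd JDG3_HOME/bin
./standalone.sh -c clustered.xml -Djava.net.preferIPv4Stack=true \
  -Djboss.default.multicast.address=234.56.78.100 \
  -Djboss.node.name=jdg3 \
  -Djboss.socket.binding.port-offset=100  # only needed when sharing a host with jdg2
```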
[[serversetup]]
@@ -498,13 +500,13 @@ Event 'CLIENT_CACHE_ENTRY_REMOVED', key '193489e7-e2bc-4069-afe8-f1dfa73084ea',
This section contains some tips and options related to Cross-Datacenter Replication.
-* When you run the {project_name} server inside a data center, it is required that the database referenced in `KeycloakDS` datasource is already running and available in that data center. It is also necessary that the JDG server referenced by the `outbound-socket-binding`, which is referenced from the Infinispan cache `remote-store` element, is already running. Otherwise the {project_name} server will fail to start.
+* When you run the {project_name} server inside a data center, it is required that the database referenced in the `KeycloakDS` datasource is already running and available in that data center. It is also necessary that the {jdgserver_name} server referenced by the `outbound-socket-binding`, which is referenced from the Infinispan cache `remote-store` element, is already running (a sketch of this wiring follows this list). Otherwise the {project_name} server will fail to start.
* Every data center can have more database nodes if you want to support database failover and better reliability. Refer to the documentation of your database and JDBC driver for details on how to set this up on the database side and how the `KeycloakDS` datasource on the Keycloak side needs to be configured.
-* Every datacenter can have more JDG servers running in the cluster. This is useful if you want some failover and better fault tolerance. The HotRod protocol used for communication between JDG servers and {project_name} servers has a feature that JDG servers will automatically send new topology to the {project_name} servers about the change in the JDG cluster, so the remote store on {project_name} side will know to which JDG servers it can connect. Read the JDG/Infinispan and Wildfly documentation for more details.
+* Every data center can have more {jdgserver_name} servers running in the cluster. This is useful if you want some failover and better fault tolerance. The HotRod protocol used for communication between {jdgserver_name} servers and {project_name} servers has a feature by which {jdgserver_name} servers automatically send the new topology to the {project_name} servers when the {jdgserver_name} cluster changes, so the remote store on the {project_name} side will know which {jdgserver_name} servers it can connect to. Read the {jdgserver_name} and WildFly documentation for more details.
-* It is highly recommended that a master JDG server is running in every site before the {project_name} servers in **any** site are started. As in our example, we started both `jdg1` and `jdg2` first, before all {project_name} servers. If you still need to run the {project_name} server and the backup site is offline, it is recommended to manually switch the backup site offline on the JDG servers on your site, as described in link:onoffline[Bringing sites offline and online]. If you do not manually switch the unavailable site offline, the first startup may fail or they may be some exceptions during startup until the backup site is taken offline automatically due the configured count of failed operations.
+* It is highly recommended that a master {jdgserver_name} server is running in every site before the {project_name} servers in **any** site are started. As in our example, we started both `jdg1` and `jdg2` first, before all {project_name} servers. If you still need to run the {project_name} server while the backup site is offline, it is recommended to manually switch the backup site offline on the {jdgserver_name} servers on your site, as described in link:onoffline[Bringing sites offline and online]. If you do not manually switch the unavailable site offline, the first startup may fail or there may be some exceptions during startup until the backup site is taken offline automatically due to the configured count of failed operations.
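The wiring referenced in the first tip of this list looks roughly as follows in the {project_name} `standalone-ha.xml`: an `outbound-socket-binding` pointing at the remote JDG server, referenced by a `remote-store` inside the Infinispan cache. This sketch uses the `sessions` cache and an illustrative host; treat the exact attribute values as assumptions rather than the guide's definitive configuration:

```xml
<!-- in the socket-binding-group: where to reach the remote JDG server -->
<outbound-socket-binding name="remote-cache">
    <remote-destination host="jdg1.example.com" port="11222"/>
</outbound-socket-binding>

<!-- in the infinispan subsystem: the cache writes through to the remote cache -->
<replicated-cache name="sessions" mode="SYNC">
    <remote-store cache="sessions" remote-servers="remote-cache"
            passivation="false" fetch-state="false" purge="false"
            preload="false" shared="true">
        <property name="rawValues">true</property>
        <property name="marshaller">org.keycloak.cluster.infinispan.KeycloakHotRodMarshaller</property>
    </remote-store>
</replicated-cache>
```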
[[onoffline]]
@@ -512,15 +514,15 @@ This section contains some tips and options related to Cross-Datacenter Replicat
For example, assume this scenario:
-. Site `site2` is entirely offline from the `site1` perspective. This means that all JDG servers on `site2` are off *or* the network between `site1` and `site2` is broken.
-. You run {project_name} servers and JDG server `jdg1` in site `site1`
+. Site `site2` is entirely offline from the `site1` perspective. This means that all {jdgserver_name} servers on `site2` are off *or* the network between `site1` and `site2` is broken.
+. You run {project_name} servers and the {jdgserver_name} server `jdg1` in site `site1`.
. Someone logs in on a {project_name} server on `site1`.
. The {project_name} server from `site1` will try to write the session to the remote cache on the `jdg1` server, which is supposed to back up data to the `jdg2` server in `site2`. See link:communication[Communication details] for more information.
. Server `jdg2` is offline or unreachable from `jdg1`, so the backup from `jdg1` to `jdg2` will fail.
. An exception is thrown in the `jdg1` log, and the failure is propagated from the `jdg1` server to the {project_name} servers as well, because the default `FAIL` backup failure policy is configured. See link:backupfailure[Backup failure policy] for details around the backup policies.
. The error will happen on the {project_name} side too, and the user may not be able to finish the login.
-According to your environment, it may be more or less probable that the network between sites is unavailable or temporarily broken (split-brain). In case this happens, it is good that JDG servers on `site1` are aware of the fact that JDG servers on `site2` are unavailable, so they will stop trying to reach the servers in the `jdg2` site and the backup failures won't happen. This is called `Take site offline` .
+Depending on your environment, it may be more or less likely that the network between sites becomes unavailable or is temporarily broken (split-brain). If this happens, it is good for the {jdgserver_name} servers on `site1` to be aware that the {jdgserver_name} servers on `site2` are unavailable, so that they stop trying to reach the servers in `site2` and the backup failures won't happen. This is called `Take site offline`.
.Take site offline
@@ -564,7 +566,7 @@ State transfer is a required, manual step. {jdgserver_name} does not do this aut
A bidirectional state transfer will ensure that entities which were created *after* the split-brain on `site1` will be transferred to `site2`. This is not an issue, as they do not yet exist on `site2`. Similarly, entities created *after* the split-brain on `site2` will be transferred to `site1`. The potentially problematic parts are those entities which existed on both sites *before* the split-brain and which were updated on both sites during the split-brain. When this happens, one of the sites will *win* and will overwrite the updates done during the split-brain by the second site.
-Unfortunately, there is no any universal solution to this. Split-brains and network outages are just state, which is usually impossible to be handled 100% correctly with 100% consistent data between sites. In the case of {project_name}, it typically is not a critical issue. In the worst case, users will need to re-login again to their clients, or have the improper count of loginFailures tracked for brute force protection. See the JDG/JGroups/Infinispan documentation for more tips how to deal with split-brain.
+Unfortunately, there is no universal solution to this. Split-brains and network outages are simply situations that are usually impossible to handle 100% correctly with 100% consistent data between sites. In the case of {project_name}, this typically is not a critical issue. In the worst case, users will need to log in to their clients again, or an incorrect count of loginFailures may be tracked for brute force protection. See the {jdgserver_name}/JGroups documentation for more tips on how to deal with split-brain.
The state transfer can also be done on the {jdgserver_name} side through JMX. The operation name is `pushState`. There are a few other operations to monitor status, cancel the push state, and so on.
More info about state transfer is available in the link:https://access.redhat.com/documentation/en-us/red_hat_jboss_data_grid/7.1/html/administration_and_configuration_guide/set_up_cross_datacenter_replication#state_transfer_between_sites[{jdgserver_name} docs].
@@ -632,11 +634,11 @@ An important part of the `backup` element is the `strategy` attribute. You must
If the `SYNC` backup is used, then the backup is synchronous and the operation is considered finished on the caller ({project_name} server) side once the backup is processed on the second site. This has worse performance than `ASYNC`, but on the other hand, you are sure that subsequent reads of the particular entity, such as a user session, on `site2` will see the updates from `site1`. It is also needed if you want data consistency, as with `ASYNC` the caller is not notified at all if the backup to the other site failed.
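For reference, the strategy is set per cache in the `backup` element of the `replicated-cache-configuration` on the {jdgserver_name} side. A minimal sketch for the `sessions` cache, assuming `site1` backing up to `site2` (attribute values other than `strategy` follow the defaults discussed here and should be treated as assumptions):

```xml
<replicated-cache-configuration name="sessions-cfg" mode="SYNC" start="EAGER" batching="false">
    <backups>
        <!-- change strategy to "ASYNC" to trade consistency for performance -->
        <backup site="site2" failure-policy="FAIL" strategy="SYNC" enabled="true"/>
    </backups>
</replicated-cache-configuration>
```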
-For some caches, it is even possible to not backup at all and completely skip writing data to the JDG server. To set this up, do not use the `remote-store` element for the particular cache on the {project_name} side (file `KEYCLOAK_HOME/standalone/configuration/standalone-ha.xml`) and then the particular `replicated-cache` element is also not needed on the JDG side.
+For some caches, it is even possible to not back up at all and completely skip writing data to the {jdgserver_name} server. To set this up, do not use the `remote-store` element for the particular cache on the {project_name} side (file `KEYCLOAK_HOME/standalone/configuration/standalone-ha.xml`); the particular `replicated-cache` element is then also not needed on the {jdgserver_name} side.
By default, all 7 caches are configured with `SYNC` backup, which is the safest option. Here are a few things to consider:
-* If you are using active/passive mode (all {project_name} servers are in single site `site1` and the JDG server in `site2` is used purely as backup. More details [here](#modes)), then it is usually fine to use `ASYNC` strategy for all the caches to save the performance.
+* If you are using active/passive mode (all {project_name} servers are in a single site, `site1`, and the {jdgserver_name} server in `site2` is used purely as a backup; more details link:modes[here]), then it is usually fine to use the `ASYNC` strategy for all the caches to improve performance.
* The `work` cache is used mainly to send some messages, such as cache invalidation events, to the other site. It is also used to ensure that some special events, such as userStorage synchronizations, happen only on a single site. It is recommended to keep this cache set to `SYNC`.
@@ -672,7 +674,7 @@ The following tips are intended to assist you should you need to troubleshoot:
* It is recommended to go through the link:setup[example setup] and get it working first, so that you have some understanding of how things work. It is also wise to read this entire document to gain an overall understanding of things.
-* Check in jconsole cluster status (GMS) and the JGroups status (RELAY) of JDG as described in link:jdgsetup[JDG server setup]. If things do not look as expected, then the issue is likely in the setup of {jdgserver_name} servers.
+* Check the cluster status (GMS) and the JGroups status (RELAY) of {jdgserver_name} in jconsole, as described in link:jdgsetup[JDG server setup]. If things do not look as expected, then the issue is likely in the setup of the {jdgserver_name} servers.
* For the {project_name} servers, you should see a message like this during the server startup:
+
@@ -707,13 +709,13 @@ it usually means that {project_name} server is not able to reach the {jdgserver_
+
then check the log of the corresponding {jdgserver_name} server of your site and see whether it has failed to back up to the other site. If the backup site is unavailable, it is recommended to switch it offline, so that the {jdgserver_name} server won't try to back up to the offline site, allowing the operations to pass successfully on the {project_name} server side as well. See link:administration[Administration of Cross-DC Deployment] for more.
-* Check the Infinispan statistics, which are available through JMX. For example, try to login and then see if the new session was successfully written to both JDG servers and is available in the `sessions` cache there. This can be done indirectly by checking the count of elements in the `sessions` cache for the MBean `jboss.datagrid-infinispan:type=Cache,name="sessions(repl_sync)",manager="clustered",component=Statistics` and attribute `numberOfEntries`. After login, there should be one more entry for `numberOfEntries` on both JDG servers on both sites.
+* Check the Infinispan statistics, which are available through JMX. For example, try to log in and then see whether the new session was successfully written to both {jdgserver_name} servers and is available in the `sessions` cache there. This can be checked indirectly via the count of elements in the `sessions` cache, using the MBean `jboss.datagrid-infinispan:type=Cache,name="sessions(repl_sync)",manager="clustered",component=Statistics` and its attribute `numberOfEntries`. After a login, `numberOfEntries` should increase by one on both {jdgserver_name} servers on both sites.
* Enable DEBUG logging as described link:serversetup[here]. For example, if you log in and you think that the new session is not available on the second site, it is good to check the {project_name} server logs and verify that the listeners were triggered as described in link:serversetup[the setup section]. If you are unsure and want to ask on the keycloak-user mailing list, it is helpful to send the log files from the {project_name} servers in both data centers with your email. Either add the log snippets to the email or put the logs somewhere and reference them in the email.
* If you update an entity, such as a `user`, on the {project_name} server on `site1` and you do not see that entity updated on the {project_name} server on `site2`, then the issue can be either in the replication of the synchronous database itself or in {project_name} caches not being properly invalidated. You may try to temporarily disable the {project_name} caches as described link:{installguide_disablingcaching_link}[here] to narrow down whether the issue is at the database replication level. Also, it may help to manually connect to the database and check whether the data are updated as expected. This is specific to every database, so you will need to consult the documentation for your database.
-* Sometimes you may see the exceptions related to locks like this in JDG log:
+* Sometimes you may see exceptions related to locks like this in the {jdgserver_name} log:
+
```
(HotRodServerHandler-6-35) ISPN000136: Error executing command ReplaceCommand,
@@ -727,7 +729,7 @@ Those exceptions are not necessarily an issue. They may happen anytime when conc
of redirects in your browser and you see errors like this in the {project_name} server log:
+
```
-2017-11-27 14:50:31,587 WARN [org.keycloak.events] (default task-17) type=LOGIN_ERROR, realmId=master, clientId=null, userId=null, ipAddress=37.188.148.81, error=expired_code, restart_after_timeout=true
+2017-11-27 14:50:31,587 WARN [org.keycloak.events] (default task-17) type=LOGIN_ERROR, realmId=master, clientId=null, userId=null, ipAddress=aa.bb.cc.dd, error=expired_code, restart_after_timeout=true
```
+
it probably means that your load balancer needs to be configured to support sticky sessions. Make sure that the route name provided during startup of the {project_name} server (property `jboss.node.name`) contains the correct name used by the load balancer server to identify the current server.
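For example, if the load balancer identifies this back end by the route `node1` (an illustrative name, not from the guide), the matching {project_name} server startup would be:

```
cd KEYCLOAK_HOME/bin
./standalone.sh -c standalone-ha.xml -Djboss.node.name=node1
```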