Clustering

Basics

This section does not apply if the SQL RDBMS storage premium feature is used. In that case, please refer to the feature documentation as well as the clustering documentation of the chosen RDBMS.

It is possible to run Gentics Mesh in cluster mode. When clustering is enabled, all changes to your data, with one exception, are automatically distributed to the other instances. Binary data (uploads) is currently not handled by the cluster implementation and needs dedicated handling.

At its core, Gentics Mesh makes use of the OrientDB clustering mode, which allows running multiple instances in a master-master setup. This means that each instance is able to receive data, which is in turn automatically distributed to the other instances.

Clustering is also a way to increase redundancy and fault tolerance since each instance can still run independently in case of a network failure or hardware fault of other instances.

Configuration

Configuration       | Type    | Default    | Description
cluster.enabled     | Boolean | false      | Enables the clustering mode for this instance.
cluster.networkHost | String  | -          | Host which will be used to bind clustering related ports. This setting is optional and can be used to configure the network which the cluster daemons bind to. Gentics Mesh will try to determine the network automatically if no setting has been provided. The value of this setting will currently also be used by other instances to connect to this instance, so make sure that the IP/host can be reached from other potential instances in your network.
cluster.clusterName | String  | -          | Name of the cluster which is used to group multiple instances together. The setting is required when clustering is enabled. It can also be used to form a new cluster next to an existing cluster. Only instances with the same clusterName will be able to find each other.
cluster.vertxPort   | Number  | 0 (random) | Port used by the Vert.x eventbus message service. By default Vert.x will choose any free port and use it for the service.
nodeName            | String  | -          | Name used to identify the instance in the cluster. The name must be unique to a single instance and should not be changed.

mesh.yml
nodeName: "nodeA"
cluster:
  networkHost: "192.168.10.20"
  clusterName: "mesh.testing"
  vertxPort: 0
  enabled: true

Port Mapping

Clustering involves the following components: Vert.x, OrientDB and Hazelcast. Each component utilizes different ports.

Table 1. Port Mappings

Component | Default Port                         | Protocol | Configuration file
Vert.x    | 0 (random, eventbus server)          | TCP      | mesh.yml - cluster/vertxPort
OrientDB  | 2424-2430 (binary), 2480-2490 (HTTP) | TCP      | orientdb-server-config.xml - network/listeners/listener
Hazelcast | 2434 (dynamic)                       | TCP, UDP | hazelcast.xml - network/port
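
When instances are separated by firewalls, the random default for the Vert.x eventbus port is impractical to open selectively. It can instead be pinned to a fixed value via the documented cluster/vertxPort setting; a minimal mesh.yml fragment (4848 is an arbitrary example port, not a Gentics Mesh default):

mesh.yml
cluster:
  vertxPort: 4848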

Setup

  • Initial setup

If you have not yet run Gentics Mesh with clustering mode disabled, you need to set up a database first. You can either start Gentics Mesh in single mode, stop it and start it again in clustering mode, or start Gentics Mesh directly using the -initCluster command line argument. As with the first start in single mode, a new data directory will be created for you. The only difference is that new instances will be able to connect to your instance right away.

  • Adding slaves

If you have not yet set up a database and just start Gentics Mesh with clustering enabled, not much will happen. It will wait for other instances in the cluster which can provide a database for it.

You can start up additional instances once your initial cluster instance is up and running; a configuration sketch for such an additional instance follows after this list.

  • Enable clustering on a non-clustered instance

Just set the cluster.enabled flag and specify a cluster.clusterName within your mesh.yml.
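
As referenced in the setup steps above, an additional instance joining the existing cluster only needs a unique nodeName and the same clusterName. A sketch, where nodeB and the IP address are placeholders for your environment:

mesh.yml
nodeName: "nodeB"
cluster:
  networkHost: "192.168.10.21"
  clusterName: "mesh.testing"
  vertxPort: 0
  enabled: true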

Node discovery

By default all nodes will be discovered using multicast discovery. In that configuration all instances must share the same network and be able to receive multicast broadcast messages.

Alternatively it is also possible to hardcode the IP addresses of the cluster instances within the hazelcast.xml file. Just replace the existing join section with the following one:

hazelcast.xml
...
  <join>
    <multicast enabled="false"/>
    <tcp-ip enabled="true">
      <member>192.168.10.100</member> <!-- instance A -->
      <member>192.168.10.101</member> <!-- instance B -->
    </tcp-ip>
  </join>
...

Kubernetes

The hazelcast-kubernetes plugin can be used to enable autodiscovery of nodes in a Kubernetes environment. The plugin itself is already part of Gentics Mesh and just needs to be configured in the hazelcast.xml file.

Example configuration:

hazelcast.xml
...
<network>
        <port auto-increment="true">5701</port>
        <join>
            <!-- deactivate normal discovery -->
            <multicast enabled="false"/>
            <tcp-ip enabled="false" />

            <!-- activate the Kubernetes plugin -->
            <discovery-strategies>
                <discovery-strategy enabled="true"
                        class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
                    <properties>
                        <property name="service-dns">MY-SERVICE-DNS-NAME</property>
                        <property name="service-dns-timeout">10</property>
                    </properties>
                </discovery-strategy>
            </discovery-strategies>
        </join>
  </network>
...
For more details on the setup of the plugin, please take a look at the hazelcast-kubernetes documentation.
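
The service-dns property refers to the DNS name of a headless Kubernetes service that selects the Gentics Mesh pods. A minimal sketch of such a service, where the name, namespace and selector labels are placeholders for your deployment:

service.yaml
apiVersion: v1
kind: Service
metadata:
  name: mesh-cluster        # referenced via service-dns, e.g. mesh-cluster.default.svc.cluster.local
spec:
  clusterIP: None           # headless service: DNS resolves directly to the pod IPs
  selector:
    app: gentics-mesh       # must match the labels of your Gentics Mesh pods
  ports:
    - name: hazelcast
      port: 5701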

Session distribution

Since Gentics Mesh does not use sessions, there is no need to distribute sessions across the cluster. A JWT is used to authenticate the user. Each instance of Gentics Mesh is able to validate the JWT by using the cryptographic key which is stored in the config/keystore.jceks file. This means the same keystore.jceks must be available on each instance.
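
One way to achieve this is to keep the keystore on shared storage and point each instance at it. A sketch, assuming the keystore location in mesh.yml is configured via security/keystorePath (verify the exact option name against the settings reference of your version):

mesh.yml
security:
  keystorePath: "/shared/mesh/keystore.jceks"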

Elasticsearch

Note that Elasticsearch needs to be clustered separately. This can be achieved by running a dedicated, already clustered Elasticsearch instance or by configuring the instance that is started by Gentics Mesh. Please refer to the Elasticsearch documentation.

Handling binary data

File uploads are currently not automatically distributed. This means that the data of the upload directory has to be shared manually, for example via a distributed or shared file system such as GlusterFS, NFS or S3-backed storage. The only directory which needs to be shared across the cluster is the upload directory, which can be configured in the mesh.yml file.

mesh.yml
upload:
  directory: "data/binaryFiles"

Cluster Coordination

A cluster request coordinator can be enabled to add a coordination layer in each Gentics Mesh instance.

This layer will automatically delegate requests to an elected master instance.

This process is useful when running a multi-master or master-replica setup.

In a multi-master setup the write requests on all instances will be directed to an elected master instance. This is useful in order to reduce contention and to avoid synchronization issues.

In a master-replica setup this can be useful to delegate write requests on replica instances to the elected master instance.

Configuration

By default the coordination layer is disabled and first needs to be enabled in the cluster settings.

Modes

The cluster.coordinatorMode setting or MESH_CLUSTER_COORDINATOR_MODE environment variable can control how the coordinator should behave.

  • DISABLED - In this mode no coordination of requests will be done.

  • CUD - In this mode only write requests (Create, Update, Delete) will be delegated to the elected master. All other requests will be directly processed by the instance.

  • ALL - In this mode all requests (CRUD) will be delegated to the elected master.

Topology management

The cluster.coordinatorTopology setting or MESH_CLUSTER_COORDINATOR_TOPOLOGY environment variable controls whether the coordinator will also manage the current cluster topology. By default no cluster topology management will be done.

  • UNMANAGED - Don’t manage cluster topology.

  • MASTER_REPLICA - Use the elected master also as database master and make other nodes in the cluster replicas.

The topology can only be managed when the coordination mode is not disabled.

Master Election

The cluster.coordinatorRegex setting or MESH_CLUSTER_COORDINATOR_REGEX environment variable controls which nodes in the cluster can be elected as master instances.

Example: The regex gentics-mesh-[0-9] would only match nodes gentics-mesh-0, gentics-mesh-1 but not gentics-mesh-backup.

A new master will automatically be elected whenever the master instance is no longer online.

Only instances which provide a master database can be elected. Replica servers are not eligible.
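
Putting the coordination settings together, a coordinated instance could be configured as follows (the clusterName and the regex are examples; adjust them to your node names):

mesh.yml
cluster:
  enabled: true
  clusterName: "mesh.testing"
  coordinatorMode: "CUD"
  coordinatorTopology: "MASTER_REPLICA"
  coordinatorRegex: "gentics-mesh-[0-9]"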

Headers

Delegation can be disabled for individual requests by adding the X-Mesh-Direct: true header to the request.

If a request was delegated, the X-Mesh-Forwarded-From header will be added to the response. The value of this header is the name of the node from which the request was forwarded.

Transaction handling / Change propagation

Transactions are directly being handled by OrientDB.

The distribution of write operations is executed synchronously, and changes will be directly visible on other instances. Operations which modify permissions are handled asynchronously and may take a few milliseconds to propagate throughout the cluster. Operations which invoke node, micronode or branch migrations can only be executed one at a time throughout the cluster: a running migration on one instance will cause the other instances of the cluster to reject any migration request as long as that migration is active.

Upgrading a cluster

There are two types of upgrades. An upgrade can either be handled online or offline.

Online Upgrade

Online upgrades can happen without any downtime. First you stop a single instance within your cluster. Next you start the instance up again using the new Gentics Mesh version. The instance will directly join the cluster and receive the changes from other instances. This is possible since the newly started Gentics Mesh version uses the same database revision as the older version.

Rollback: You can roll back by stopping the new Gentics Mesh version and starting the older version again.

Offline Upgrade

Offline upgrades are required when database changes need to be applied. These changes could negatively impact other (older) instances in the cluster, and thus an offline upgrade will only occur locally and not directly propagate throughout the cluster. Gentics Mesh will not join the cluster in such situations. A cluster will only form between versions which use the same database revision. You can check the list of database revisions in order to determine whether an online or offline upgrade can be done.

Data which has been written to other instances in the cluster can’t be merged back into the now upgraded instance. Instead, the upgraded instance will act as the new source of truth. Other instances in the cluster need to be stopped one by one and restarted with the new version. The database from the initially migrated instance will be replicated to these instances and replace their older database. The old database will be moved to a backup location and can later be deleted. Alternatively, you could also just start new instances which replicate the data and afterwards drop the now outdated cluster.

Rollback: You can roll back as long as you still have instances which have not yet been migrated, by just starting older Gentics Mesh instances. These instances will automatically join the corresponding cluster with the same database revision.

AWS / GCE

There is currently no built-in support for these platforms.

Scaling

When dynamic discovery via multicast is enabled, it is possible to just start additional Gentics Mesh instances in order to scale horizontally. The newly started instances must not have any data within the graph storage directory. The needed data will automatically be replicated when they join the cluster.

FAQ

  1. What happens if my initial instance crashes?

    The cluster automatically realigns itself and operation can continue normally.

  2. Can I add new instances at any time?

    Yes. New instances can be added at any time.

  3. Are my changes directly visible on other instances?

    The replication handles this as fast as the network allows. By default, replication happens synchronously until the writeQuorum is fulfilled and asynchronously once the quorum has been satisfied, which means that it could take a few moments until your changes are propagated throughout the cluster. This behaviour is configurable via the OrientDB writeQuorum setting. Take a look at the OrientDB distributed configuration if you want to know more. Our tests currently only cover a writeQuorum and readQuorum of 1.

  4. What happens if the network between my instances fails?

    The instances will continue to operate normally but will no longer be able to see each other’s changes. Once the network issue is resolved, the instances will update themselves and resume normal operation.

  5. I want to use a load balancer to distribute load across my instances. Do I need to handle sticky sessions?

    Gentics Mesh does not use sessions. Instead, a stateless JWT mechanism is used. This means you can direct your traffic to any of the clustered instances. There is no need for sticky sessions or any special setup.

  6. Can I use sharding to split up my data across multiple data centers?

    No. Sharding is not supported but you are still able to span a cluster across multiple datacenters.

  7. Can I split a single cluster into one or more clusters?

    Yes. This can be done by starting a new cluster using a different cluster.clusterName setting within the mesh.yml file, as sketched below.
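
    For instance, the instances that should form the new cluster could be reconfigured with a new name (mesh.testing.v2 is just a placeholder):

    mesh.yml
    cluster:
      clusterName: "mesh.testing.v2"
      enabled: true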

Monitoring

The /api/v2/admin/cluster/status endpoint can be used to retrieve information about the cluster topology and status of instances within the cluster.

Additionally it is possible to access the JMX beans of OrientDB.

Limitations

  • Binary data (uploads) is currently not automatically distributed to other nodes. You may use a clustered filesystem or NFS to share this data.

  • All cluster instances must use the same Gentics Mesh version. Checks have been added to prevent instances from joining a cluster if the Gentics Mesh version does not match up.

  • It is currently not possible to configure a network bind host and a different announce host. The node must currently bind to the same network which other instances also use to connect to it.