
The number of nodes in an EventStore cluster is fixed… no it’s not! (Clones and other node roles)

Clones can be used to temporarily add new nodes to an existing EventStore cluster

Imagine that we have a classic 3-node cluster that has been up and running for ages. Now someone on the business side of the company asks to add redundancy in a separate Data Center, in case of a catastrophic disaster. You have a few choices here:

  1. Spend the next 5 days building a custom Synchroniser (…that will never work as well as you need)
  2. Persuade the business that you can do a good enough backup/restore strategy
  3. Attach Clones in separate Data Centers or Availability Zones that will be part of the existing cluster as Warm copies.

Option #1 is interesting and poses good coding challenges. At EventStore we are actually finalising this as a new plugin called GeoReplica, which will allow replication ‘between’ different clusters. You could also build it on your own, but it is difficult to replicate your existing cluster exactly, so you can end up with a result that is far from identical. Plus you have to maintain and monitor another piece of software.

Option #2 is something that you must always provide. Even if you are replicating the data with one of the other options, a good backup/restore strategy is always one of the most important things to set up and verify. But it is not a good enough solution for this business requirement. We can consider it a cold copy of your data, separate from the hot (sync replicas) and warm (async replicas) copies.

Option #3 is to attach one or more Clones to the existing cluster; they will receive all the data, with no need to stop the cluster. This is replication ‘within’ the same cluster, as opposed to ‘between’ different clusters as in option #1. With the current version of Event Store a Clone can cause split brain (more on that below), and therefore it is not yet something to consider. As soon as the new Non Promotable Clone feature is out, this option will become achievable.

The (fixed) cluster size and the dynamic number of nodes

The nodes in an EventStore cluster must agree on who will be the Master. The ‘ClusterSize’ setting is where you decide how many nodes will be part of this election process. Once the Master is elected and accepted by all the other nodes, it will acknowledge writes only when a majority of the nodes have persisted the data. That majority is commonly called the quorum.
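As a quick illustration (a minimal sketch in Python, not EventStore code), the quorum is simply the integer majority of ‘ClusterSize’:

# quorum = majority of the configured cluster size
def quorum(cluster_size: int) -> int:
    return cluster_size // 2 + 1

print(quorum(3))  # 2 -> a 3-node cluster keeps accepting writes with one node down
print(quorum(5))  # 3 -> a 5-node cluster tolerates two nodes down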

The different Roles of nodes

At the time of writing, an EventStore cluster can have 3 types of nodes: Master, Slaves and Clones. A new role coming in the next few weeks is the Non Promotable Clone; I will write a separate article to introduce that new type.

Master: the Master ensures that the data is committed and persisted to disk before sending an acknowledgement back to the client. You can only have one Master at a time. If two Masters are detected, a new election is started and the Master with less data is shut down so that it can restart and re-join the cluster.

Slaves: in a cluster you need one or more Slaves, which together with the Master form the quorum. The quorum is the majority of nodes necessary to confirm that a write has been persisted. It is recommended to use an odd number of nodes to avoid the split-brain problem. Common values for the ‘ClusterSize’ setting are 3 or 5 (giving a majority of 2 and 3 nodes respectively).

Clones: when you want to add nodes, you can attach them without changing the ‘ClusterSize’ and they will automatically become Clones. A Clone differs from a Slave in that the data is replicated one way, asynchronously; there is no need to wait for an acknowledgement, as a Clone is not part of the quorum. For that reason Clones do not add much overhead to the other nodes.

Following is a simple example of a cluster formed by 4 nodes (1 Master, 2 Slaves, 1 Clone), all located in the same Data Center or Availability Zone.

EventStore cluster with Clones

I can imagine the questions that arise when looking at that figure:

Who decided that node #4 is a Clone?

All the nodes have similar settings. Two settings are particularly important for this configuration: ‘ClusterSize’, which in the above scenario is set to 3 and is the same on all the nodes, and ‘ClusterGossipPort’, which must also be the same on all the nodes. To get the cluster set up as above, you start the node that you want to be the Clone last. It will join the already existing cluster, and the ClusterController, with the help of the ElectionService, will set it as a Clone.
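If you want to verify which role each node has been assigned, you can ask any node for its gossip information over HTTP. As a small sketch (assuming the port scheme of the example configuration below, where node #1 exposes its external HTTP port on 1114):

curl http://localhost:1114/gossip
(the JSON response lists every cluster member together with its current state, e.g. Master, Slave or Clone)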

Following is an example configuration. If you want to run a cluster on your development machine to give it a try, you just need to replace the ‘x’ in the port numbers with the node number (for example ‘x113’ becomes ‘1113’ for node #1, ‘2113’ for node #2, and so on). Copy the EventStore files into separate folders and start all the nodes with the following configuration:
(example: EventStore.ClusterNode.exe --config=config.yaml)

IntTcpPort: x111
ExtTcpPort: x112
IntHttpPort: x113
ExtHttpPort: x114
IntTcpHeartbeatInterval: 1500
IntTcpHeartbeatTimeout: 3000
ExtTcpHeartbeatInterval: 1500
ExtTcpHeartbeatTimeout: 3000
GossipIntervalMs: 2000
GossipTimeoutMs: 4000
IntHttpPrefixes: http://*:x113/
ExtHttpPrefixes: http://*:x114/
DiscoverViaDns: true
ClusterDns: eventstore
ClusterGossipPort: 2113
ClusterSize: 3
AddInterfacePrefixes: false

If you want to use ClusterDns discovery as in my settings and you are on Windows, remember to add the DNS name ‘eventstore’ to the hosts file. An equivalent entry needs to be added on Linux.
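As a minimal sketch of a local setup (assuming all nodes run on 127.0.0.1, one folder per node, and the port substitution described above), the hosts entry and start commands could look like this:

# hosts file entry (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux)
127.0.0.1    eventstore

# start the three quorum nodes first (ports 1111-1114, 2111-2114, 3111-3114)
EventStore.ClusterNode.exe --config=node1\config.yaml
EventStore.ClusterNode.exe --config=node2\config.yaml
EventStore.ClusterNode.exe --config=node3\config.yaml
# start node #4 last (ports 4111-4114) so that it joins the full cluster as the Clone
EventStore.ClusterNode.exe --config=node4\config.yaml

The node1..node4 folder names are just an example; what matters is that each node has its own folder, its own ports and the same ‘ClusterSize’ and ‘ClusterGossipPort’.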

What happens if one of the other nodes goes down?

In that case the Clone lets the remaining nodes know that it is available to join the election process. It will be promoted to Master or Slave and the cluster will be kept up and running. In that scenario you must consider the implications this can have. If the ex-Clone is located in a separate Data Center or Availability Zone, performance can decrease. You probably also need to configure the DNS so that clients can connect to it. This is why, in the figure above, all the nodes are located in the same Data Center/Availability Zone.

What happens if the old node comes back up?

It will try to re-join the cluster. A new election process will start and eventually it will become the new Clone. If you would rather have a different node in that role, you can stop the node that you want to be the Clone, let the old node join the cluster and become Master or Slave, and then have your preferred node join the cluster again so that it is assigned the Clone role, as in the sketch below.
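A rough sketch of that sequence, reusing the hypothetical folder layout from the example above (say node #2 went down and node #4, the ex-Clone, was promoted in its place):

# 1. stop node #4, the ex-Clone that was promoted while node #2 was down
# 2. start node #2 again; it re-joins the cluster and becomes Master or Slave
EventStore.ClusterNode.exe --config=node2\config.yaml
# 3. start node #4 again; with the quorum already complete it is assigned the Clone role
EventStore.ClusterNode.exe --config=node4\config.yaml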

As you can see, the possibility of attaching Clones to an existing cluster opens up new ways to provide high availability beyond just having Warm Copies. At the same time, it can become tricky in a production environment when you want to control these roles and avoid having some of the nodes change role.

Use of ‘NodePriority’ to influence the Election Process

If you want to control the assignment of the Clone role to specific nodes and only allow them to be promoted in case of emergency, you can make use of the ‘NodePriority’ setting.

This setting defaults to 0. A node with a higher number has more chance of being promoted to Master during the election process. For the more interesting scenarios where you move the Clones to separate Data Centers or Availability Zones and let them be promoted only in edge cases, you can leave the default ‘NodePriority’ of 0 on the Clones and set a higher number on the original nodes (#1, #2 and #3 in the picture above).
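As a minimal sketch (the value 10 is just an example), the relevant config fragments could look like this:

# config.yaml on nodes #1, #2 and #3 (the original quorum members)
NodePriority: 10

# config.yaml on node #4 and any other Clone (the default value, shown here for clarity)
NodePriority: 0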

Bear in mind that the ‘NodePriority’ setting does not give you 100% certainty that a Clone, once it has joined the election process, will not be promoted. It is only one of the criteria that the Election Service considers.

Split Brain risk… be aware of it

While you run 4 instances of EventStore at the same time, there is a risk of 2 separate clusters forming if a network partition occurs between the 4 nodes. When they rejoin into a single group and an election takes place, the Master with the most up-to-date data will be elected and the other one will shut down. There will be a message in the log of that node about going down for offline truncation due to being a deposed master.

If your nodes are on the same network and your settings are ok, 99% of the time the truncated data is only stats or internal messages, not client data. The risk of losing data increases if you spread your nodes across different networks, for example in different Availability Zones.

This risk will disappear with Non Promotable Clones, as they cannot be part of an election.

If you are new to EventStore: it is a mature, solid, distributed NoSQL database for events.
It's free and Open Source. You can download the OSS version from here:
https://eventstore.org/downloads/

Commercial support is available in case you need it:
https://eventstore.org/support/

Enjoy!