Zookeeper Overview

Distributed highly available synchronization for large distributed systems. Zookeeper stores data in a hierarchical name space, much like a filesystem or a tree data structure. Clients can read from and write to nodes (like files/directories) and in this way have a shared configuration service, ask for a global mutex across all connected clients and many other. Zookeeper is a CP system with regard to CAP. It will not reply to a query unless Zookeeper is certain that the response is correct.

Zookeeper Server

Servers manage the stored data and the synchronization with each other and communicate with zookeeper clients. Several zookeeper servers form a cluster (or an ensemble). Running a single standalone zookeeper server does not bring any benefit but can be used for testing purposes. Its failure brings down the zookeeper service. In order to respond to a query, there needs to be a majority (or quorum) or running servers. For Zookeeper, that is at least ceil(N/2) connected servers, where N is a number of servers on a cluster. A reasonable amount of servers on a cluster is 3 (1 can fail), 5 (2 can fail), and so on. More about Zookeeper reliability, see Stackoverflow.

Zookeeper Client

Clients should have a list of several zookeeper servers on the cluster (preferably all). If connecting to one of the servers fails, the client will try to connect to the others. Clients communicate with each other by writing to and reading from the nodes on Zookeeper servers. Our Zookeeper client is implemented in the MDM server and communicates with other MDM Zookeeper clients. For more information, see link High Availability Component.

Zookeeper Nodes

Zookeeper nodes, also called znodes, store information written and consumed by clients. Clients can perform read or write operations or watch for changes. Nodes can be either persistent or ephemeral. Persistent nodes exist until deleted by any client. Ephemeral nodes are deleted when the creator (client) disconnects from the cluster. If there is a network failure between a client and the cluster, the ephemeral node is removed when the session expires. See high-availability:mdm-zookeeper-intercation.adoc for information on how MDM interacts with Zookeeper and its filesystem.

Leadership Election

Nodes can be created with a sequential flag (the node name is appended with a sequence number). Combining ephemeral and sequential node types enables the implementation of leadership election. When the client connects to a cluster, it creates an ephemeral and sequential node in a designated election folder. The leader is always the node with the lowest sequence number. When the leader wants to cancel its leadership, it needs to disconnect, which deletes its ephemeral node with the lowest sequence id, and connect again. The implementation of leader election is done by the Curator Framework.

Was this page useful?