|
@@ -23,6 +23,7 @@ limitations under the License.
|
|
* [Active Messaging](#sc_activeMessaging)
|
|
* [Active Messaging](#sc_activeMessaging)
|
|
* [Summary](#sc_summary)
|
|
* [Summary](#sc_summary)
|
|
* [Comparisons](#sc_comparisons)
|
|
* [Comparisons](#sc_comparisons)
|
|
|
|
+* [Consistency Guarantees](#sc_consistency)
|
|
* [Quorums](#sc_quorum)
|
|
* [Quorums](#sc_quorum)
|
|
* [Logging](#sc_logging)
|
|
* [Logging](#sc_logging)
|
|
* [Developer Guidelines](#sc_developerGuidelines)
|
|
* [Developer Guidelines](#sc_developerGuidelines)
|
|
@@ -34,9 +35,11 @@ limitations under the License.
|
|
## Introduction
|
|
## Introduction
|
|
|
|
|
|
This document contains information on the inner workings of ZooKeeper.
|
|
This document contains information on the inner workings of ZooKeeper.
|
|
-So far, it discusses these topics:
|
|
|
|
|
|
+It discusses the following topics:
|
|
|
|
|
|
* [Atomic Broadcast](#sc_atomicBroadcast)
|
|
* [Atomic Broadcast](#sc_atomicBroadcast)
|
|
|
|
+* [Consistency Guarantees](#sc_consistency)
|
|
|
|
+* [Quorums](#sc_quorum)
|
|
* [Logging](#sc_logging)
|
|
* [Logging](#sc_logging)
|
|
|
|
|
|
<a name="sc_atomicBroadcast"></a>
|
|
<a name="sc_atomicBroadcast"></a>
|
|
@@ -52,18 +55,17 @@ At the heart of ZooKeeper is an atomic messaging system that keeps all of the se
|
|
The specific guarantees provided by the messaging system used by ZooKeeper are the following:
|
|
The specific guarantees provided by the messaging system used by ZooKeeper are the following:
|
|
|
|
|
|
* *_Reliable delivery_* :
|
|
* *_Reliable delivery_* :
|
|
- If a message, m, is delivered
|
|
|
|
- by one server, it will be eventually delivered by all servers.
|
|
|
|
|
|
+ If a message `m`, is delivered
|
|
|
|
+ by one server, message `m` will be eventually delivered by all servers.
|
|
|
|
|
|
* *_Total order_* :
|
|
* *_Total order_* :
|
|
- If a message is
|
|
|
|
- delivered before message b by one server, a will be delivered before b by all
|
|
|
|
- servers. If a and b are delivered messages, either a will be delivered before b
|
|
|
|
- or b will be delivered before a.
|
|
|
|
|
|
+ If a message `a` is
|
|
|
|
+ delivered before message `b` by one server, message `a` will be delivered before `b` by all
|
|
|
|
+ servers.
|
|
|
|
|
|
* *_Causal order_* :
|
|
* *_Causal order_* :
|
|
- If a message b is sent after a message a has been delivered by the sender of b,
|
|
|
|
- a must be ordered before b. If a sender sends c after sending b, c must be ordered after b.
|
|
|
|
|
|
+ If a message `b` is sent after a message `a` has been delivered by the sender of `b`,
|
|
|
|
+ message `a` must be ordered before `b`. If a sender sends `c` after sending `b`, `c` must be ordered after `b`.
|
|
|
|
|
|
The ZooKeeper messaging system also needs to be efficient, reliable, and easy to
|
|
The ZooKeeper messaging system also needs to be efficient, reliable, and easy to
|
|
implement and maintain. We make heavy use of messaging, so we need the system to
|
|
implement and maintain. We make heavy use of messaging, so we need the system to
|
|
@@ -80,29 +82,29 @@ lose or reorder messages, our assumption of FIFO channels is very practical
|
|
given that we use TCP for communication. Specifically we rely on the following property of TCP:
|
|
given that we use TCP for communication. Specifically we rely on the following property of TCP:
|
|
|
|
|
|
* *_Ordered delivery_* :
|
|
* *_Ordered delivery_* :
|
|
- Data is delivered in the same order it is sent and a message m is
|
|
|
|
- delivered only after all messages sent before m have been delivered.
|
|
|
|
- (The corollary to this is that if message m is lost all messages after m will be lost.)
|
|
|
|
|
|
+ Data is delivered in the same order it is sent and a message `m` is
|
|
|
|
+ delivered only after all messages sent before `m` have been delivered.
|
|
|
|
+ (The corollary to this is that if message `m` is lost all messages after `m` will be lost.)
|
|
|
|
|
|
* *_No message after close_* :
|
|
* *_No message after close_* :
|
|
Once a FIFO channel is closed, no messages will be received from it.
|
|
Once a FIFO channel is closed, no messages will be received from it.
|
|
|
|
|
|
FLP proved that consensus cannot be achieved in asynchronous distributed systems
|
|
FLP proved that consensus cannot be achieved in asynchronous distributed systems
|
|
-if failures are possible. To ensure we achieve consensus in the presence of failures
|
|
|
|
-we use timeouts. However, we rely on times for liveness not for correctness. So,
|
|
|
|
-if timeouts stop working (clocks malfunction for example) the messaging system may
|
|
|
|
|
|
+if failures are possible. To ensure that we achieve consensus in the presence of failures
|
|
|
|
+we use timeouts. However, we rely on time for liveness not for correctness. So,
|
|
|
|
+if timeouts stop working (e.g., skewed clocks) the messaging system may
|
|
hang, but it will not violate its guarantees.
|
|
hang, but it will not violate its guarantees.
|
|
|
|
|
|
When describing the ZooKeeper messaging protocol we will talk of packets,
|
|
When describing the ZooKeeper messaging protocol we will talk of packets,
|
|
proposals, and messages:
|
|
proposals, and messages:
|
|
|
|
|
|
* *_Packet_* :
|
|
* *_Packet_* :
|
|
- a sequence of bytes sent through a FIFO channel
|
|
|
|
|
|
+ a sequence of bytes sent through a FIFO channel.
|
|
|
|
|
|
* *_Proposal_* :
|
|
* *_Proposal_* :
|
|
a unit of agreement. Proposals are agreed upon by exchanging packets
|
|
a unit of agreement. Proposals are agreed upon by exchanging packets
|
|
with a quorum of ZooKeeper servers. Most proposals contain messages, however the
|
|
with a quorum of ZooKeeper servers. Most proposals contain messages, however the
|
|
- NEW_LEADER proposal is an example of a proposal that does not correspond to a message.
|
|
|
|
|
|
+ NEW_LEADER proposal is an example of a proposal that does not contain to a message.
|
|
|
|
|
|
* *_Message_* :
|
|
* *_Message_* :
|
|
a sequence of bytes to be atomically broadcast to all ZooKeeper
|
|
a sequence of bytes to be atomically broadcast to all ZooKeeper
|
|
@@ -121,7 +123,7 @@ n is the number of servers that make up a ZooKeeper service.
|
|
|
|
|
|
The zxid has two parts: the epoch and a counter. In our implementation the zxid
|
|
The zxid has two parts: the epoch and a counter. In our implementation the zxid
|
|
is a 64-bit number. We use the high order 32-bits for the epoch and the low order
|
|
is a 64-bit number. We use the high order 32-bits for the epoch and the low order
|
|
-32-bits for the counter. Because it has two parts represent the zxid both as a
|
|
|
|
|
|
+32-bits for the counter. Because zxid consists of two parts, zxid can be represented both as a
|
|
number and as a pair of integers, (_epoch, count_). The epoch number represents a
|
|
number and as a pair of integers, (_epoch, count_). The epoch number represents a
|
|
change in leadership. Each time a new leader comes into power it will have its
|
|
change in leadership. Each time a new leader comes into power it will have its
|
|
own epoch number. We have a simple algorithm to assign a unique zxid to a proposal:
|
|
own epoch number. We have a simple algorithm to assign a unique zxid to a proposal:
|
|
@@ -146,18 +148,15 @@ up with the leader, they have the same state. This state consists of all of the
|
|
proposals that the leader believes have been committed and the proposal to follow
|
|
proposals that the leader believes have been committed and the proposal to follow
|
|
the leader, the NEW_LEADER proposal. (Hopefully you are thinking to
|
|
the leader, the NEW_LEADER proposal. (Hopefully you are thinking to
|
|
yourself, _Does the set of proposals that the leader believes has been committed
|
|
yourself, _Does the set of proposals that the leader believes has been committed
|
|
-included all the proposals that really have been committed?_ The answer is _yes_.
|
|
|
|
|
|
+include all the proposals that really have been committed?_ The answer is _yes_.
|
|
Below, we make clear why.)
|
|
Below, we make clear why.)
|
|
|
|
|
|
<a name="sc_leaderElection"></a>
|
|
<a name="sc_leaderElection"></a>
|
|
|
|
|
|
### Leader Activation
|
|
### Leader Activation
|
|
|
|
|
|
-Leader activation includes leader election. We currently have two leader election
|
|
|
|
-algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection
|
|
|
|
-is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple
|
|
|
|
-form of authentication to avoid IP spoofing). ZooKeeper messaging doesn't care about the
|
|
|
|
-exact method of electing a leader as long as the following holds:
|
|
|
|
|
|
+Leader activation includes leader election (`FastLeaderElection`).
|
|
|
|
+ZooKeeper messaging doesn't care about the exact method of electing a leader as long as the following holds:
|
|
|
|
|
|
* The leader has seen the highest zxid of all the followers.
|
|
* The leader has seen the highest zxid of all the followers.
|
|
* A quorum of servers have committed to following the leader.
|
|
* A quorum of servers have committed to following the leader.
|
|
@@ -170,18 +169,18 @@ we will recover by abandoning leader activation and running another election.
|
|
|
|
|
|
After leader election a single server will be designated as a leader and start
|
|
After leader election a single server will be designated as a leader and start
|
|
waiting for followers to connect. The rest of the servers will try to connect to
|
|
waiting for followers to connect. The rest of the servers will try to connect to
|
|
-the leader. The leader will sync up with followers by sending any proposals they
|
|
|
|
|
|
+the leader. The leader will sync up with the followers by sending any proposals they
|
|
are missing, or if a follower is missing too many proposals, it will send a full
|
|
are missing, or if a follower is missing too many proposals, it will send a full
|
|
snapshot of the state to the follower.
|
|
snapshot of the state to the follower.
|
|
|
|
|
|
-There is a corner case in which a follower that has proposals, U, not seen
|
|
|
|
-by a leader arrives. Proposals are seen in order, so the proposals of U will have a zxids
|
|
|
|
|
|
+There is a corner case in which a follower that has proposals, `U`, not seen
|
|
|
|
+by a leader arrives. Proposals are seen in order, so the proposals of `U` will have a zxids
|
|
higher than zxids seen by the leader. The follower must have arrived after the
|
|
higher than zxids seen by the leader. The follower must have arrived after the
|
|
leader election, otherwise the follower would have been elected leader given that
|
|
leader election, otherwise the follower would have been elected leader given that
|
|
it has seen a higher zxid. Since committed proposals must be seen by a quorum of
|
|
it has seen a higher zxid. Since committed proposals must be seen by a quorum of
|
|
-servers, and a quorum of servers that elected the leader did not see U, the proposals
|
|
|
|
-of you have not been committed, so they can be discarded. When the follower connects
|
|
|
|
-to the leader, the leader will tell the follower to discard U.
|
|
|
|
|
|
+servers, and a quorum of servers that elected the leader did not see `U`, the proposals
|
|
|
|
+of `U` have not been committed, so they can be discarded. When the follower connects
|
|
|
|
+to the leader, the leader will tell the follower to discard `U`.
|
|
|
|
|
|
A new leader establishes a zxid to start using for new proposals by getting the
|
|
A new leader establishes a zxid to start using for new proposals by getting the
|
|
epoch, e, of the highest zxid it has seen and setting the next zxid to use to be
|
|
epoch, e, of the highest zxid it has seen and setting the next zxid to use to be
|
|
@@ -226,9 +225,9 @@ the following operating constraints are observed:
|
|
received. Because we use FIFO channels this means that followers also receive proposals in order.
|
|
received. Because we use FIFO channels this means that followers also receive proposals in order.
|
|
* Followers process messages in the order they are received. This
|
|
* Followers process messages in the order they are received. This
|
|
means that messages will be ACKed in order and the leader will receive ACKs from
|
|
means that messages will be ACKed in order and the leader will receive ACKs from
|
|
- followers in order, due to the FIFO channels. It also means that if message $m$
|
|
|
|
|
|
+ followers in order, due to the FIFO channels. It also means that if message `m`
|
|
has been written to non-volatile storage, all messages that were proposed before
|
|
has been written to non-volatile storage, all messages that were proposed before
|
|
- $m$ have been written to non-volatile storage.
|
|
|
|
|
|
+ `m` have been written to non-volatile storage.
|
|
* The leader will issue a COMMIT to all followers as soon as a
|
|
* The leader will issue a COMMIT to all followers as soon as a
|
|
quorum of followers have ACKed a message. Since messages are ACKed in order,
|
|
quorum of followers have ACKed a message. Since messages are ACKed in order,
|
|
COMMITs will be sent by the leader as received by the followers in order.
|
|
COMMITs will be sent by the leader as received by the followers in order.
|
|
@@ -267,6 +266,26 @@ all packets, it all falls apart. Also, our leader activation phase is different
|
|
both of them. In particular, our use of epochs allows us to skip blocks of uncommitted
|
|
both of them. In particular, our use of epochs allows us to skip blocks of uncommitted
|
|
proposals and to not worry about duplicate proposals for a given zxid.
|
|
proposals and to not worry about duplicate proposals for a given zxid.
|
|
|
|
|
|
|
|
+<a name="sc_consistency"></a>
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+## Consistency Guarantees
|
|
|
|
+
|
|
|
|
+ZooKeeper [consistency](https://jepsen.io/consistency) guarantees lie between sequential consistency and linearizabiliy. Here, we explain the exact consistency guarantees that ZooKepeer provides.
|
|
|
|
+
|
|
|
|
+Write operations in ZooKeeper are linearizabile. In other words, each write appears to take effect atomically at some point between its invocation and its response. This means that the writes performed by all the clients in ZooKeeper can be totally ordered in such a way that respects the real-time ordering of these writes. However, note that just stating that writes are linearizable is meaningless unless we also talk about read operations.
|
|
|
|
+
|
|
|
|
+Read operations in ZooKeeper are not linearizable since they can return potentially stale data. This occurs since a read in ZooKeeper is not a quorum operation and a server responds immediately to a client that is performing a read.
|
|
|
|
+Nevertheless, ZooKeeper makes this choice because it chooses performance in the trade-off between performance and consistency. ZooKeeper read operations are sequentially-consistent, since read operations appear to take effect in some sequential order that furthermore respects the order of each client's operations.
|
|
|
|
+If a client wants to read the freshest data, it is generally assumed that the client should first perform a sync operation, and then a read.
|
|
|
|
+However, even with a sync before a read operation, a client might retrieve stale data.
|
|
|
|
+This can occur because `sync` is [not a quorum operation](https://issues.apache.org/jira/browse/ZOOKEEPER-1675). Such a scenario might appear if two servers think that they are the leaders at the same time, which may occur if the time it takes for a TCP connection to drop is smaller than `syncLimit * tickTime`, something that is [unlikely](https://www.amazon.com/ZooKeeper-Distributed-Coordination-Flavio-Junqueira/dp/1449361307) to occur in practice.
|
|
|
|
+
|
|
|
|
+
|
|
|
|
+This raises the question on what are the exact consistency guarantees of ZooKeeper?
|
|
|
|
+Formally, the ZooKeeper consistency guarantees are captured by the notion of [ordered sequential consistency](http://webee.technion.ac.il/people/idish/ftp/OSC-IPL17.pdf) or `OSC(U)` to be exact, that lies between sequential consistency and linearizability.
|
|
|
|
+Finally, note that the current version of ZooKeeper can provide linearizability for both reads and writes, if every read is preceded by a write to some dummy znode.
|
|
|
|
+
|
|
<a name="sc_quorum"></a>
|
|
<a name="sc_quorum"></a>
|
|
|
|
|
|
## Quorums
|
|
## Quorums
|
|
@@ -313,8 +332,7 @@ of the [ZooKeeper Administrator's Guide.](zookeeperAdmin.html)
|
|
### Developer Guidelines
|
|
### Developer Guidelines
|
|
|
|
|
|
Please follow the [slf4j manual](http://www.slf4j.org/manual.html) when creating log statements within code.
|
|
Please follow the [slf4j manual](http://www.slf4j.org/manual.html) when creating log statements within code.
|
|
-Also read the[FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance)
|
|
|
|
-, when creating log statements. Patch reviewers will look for the following:
|
|
|
|
|
|
+Also read the [FAQ on performance](http://www.slf4j.org/faq.html#logging\_performance), when creating log statements. Patch reviewers will look for the following:
|
|
|
|
|
|
<a name="sc_rightLevel"></a>
|
|
<a name="sc_rightLevel"></a>
|
|
|
|
|