
ZOOKEEPER-3140: Allow Followers to host Observers

Creates a new abstraction, LearnerMaster, to represent the portions of the Leader logic that are used in LearnerHandler. Leader implements LearnerMaster and a new class ObserverMaster implements LearnerMaster. Followers have the option of instantiating an ObserverMaster thread when they assume their role and so support Learner traffic.

A new parameter 'observerMasterPort' is used to control which Follower instances host Observers.

Author: Brian Nixon <nixon@fb.com>

Reviewers: fangmin@apache.org, hanm@apache.org, eolivelli@gmail.com

Closes #628 from enixon/learner-master
Brian Nixon 6 years ago
parent
commit
b2513c1149
30 changed files with 2459 additions and 385 deletions
  1. 7 0
      zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md
  2. 32 0
      zookeeper-docs/src/main/resources/markdown/zookeeperObservers.md
  3. 14 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/ObserverBean.java
  4. 15 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/admin/Commands.java
  5. 39 7
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Follower.java
  6. 10 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerBean.java
  7. 10 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerMXBean.java
  8. 24 17
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java
  9. 30 2
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerZooKeeperServer.java
  10. 155 11
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java
  11. 11 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LeaderZooKeeperServer.java
  12. 19 9
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Learner.java
  13. 75 91
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java
  14. 66 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandlerBean.java
  15. 29 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandlerMXBean.java
  16. 196 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerMaster.java
  17. 94 4
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Observer.java
  18. 12 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/ObserverMXBean.java
  19. 532 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/ObserverMaster.java
  20. 87 3
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java
  21. 12 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java
  22. 1 0
      zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerMain.java
  23. 26 0
      zookeeper-server/src/test/java/org/apache/zookeeper/ZKTestCase.java
  24. 77 0
      zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/DelayRequestProcessor.java
  25. 28 27
      zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java
  26. 57 48
      zookeeper-server/src/test/java/org/apache/zookeeper/server/util/PortForwarder.java
  27. 780 0
      zookeeper-server/src/test/java/org/apache/zookeeper/test/ObserverMasterTest.java
  28. 4 159
      zookeeper-server/src/test/java/org/apache/zookeeper/test/ObserverTest.java
  29. 9 7
      zookeeper-server/src/test/java/org/apache/zookeeper/test/ReconfigTest.java
  30. 8 0
      zookeeper-server/src/test/resources/findbugsExcludeFile.xml
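
The abstraction described in the commit message can be pictured with a minimal sketch. Every type below (`LearnerMasterSketch`, `LeaderSketch`, `ObserverMasterSketch`, `HandlerSketch`) is a hypothetical stand-in rather than the real class, and only two of the interface methods that appear in the diff are mirrored. The `tickTime * syncLimit` formula is taken from `Leader.syncTimeout()` in this commit; reusing it for the ObserverMaster stand-in is an assumption.

```java
// Hypothetical, heavily simplified stand-ins; none of these are the real ZooKeeper classes.
interface LearnerMasterSketch {
    long getLastProposed(); // most recent zxid this master knows to be proposed
    int syncTimeout();      // milliseconds a learner may lag
}

// Stand-in for Leader: it originates proposals itself.
class LeaderSketch implements LearnerMasterSketch {
    private final int tickTime;
    private final int syncLimit;
    private long lastProposed;

    LeaderSketch(int tickTime, int syncLimit) {
        this.tickTime = tickTime;
        this.syncLimit = syncLimit;
    }

    synchronized void propose(long zxid) { lastProposed = zxid; }

    @Override
    public synchronized long getLastProposed() { return lastProposed; }

    @Override
    public int syncTimeout() { return tickTime * syncLimit; } // same formula as Leader.syncTimeout()
}

// Stand-in for ObserverMaster: hosted by a Follower, it only relays what the Follower hears.
class ObserverMasterSketch implements LearnerMasterSketch {
    private final int tickTime;
    private final int syncLimit;
    private long lastProposed;

    ObserverMasterSketch(int tickTime, int syncLimit) {
        this.tickTime = tickTime;
        this.syncLimit = syncLimit;
    }

    synchronized void proposalReceived(long zxid) { lastProposed = zxid; }

    @Override
    public synchronized long getLastProposed() { return lastProposed; }

    @Override
    public int syncTimeout() { return tickTime * syncLimit; } // assumed to match the Leader formula
}

// Stand-in for LearnerHandler: it depends only on the shared interface.
class HandlerSketch {
    private final LearnerMasterSketch master;

    HandlerSketch(LearnerMasterSketch master) { this.master = master; }

    String describe() {
        return "lastProposed=0x" + Long.toHexString(master.getLastProposed())
                + " syncTimeoutMs=" + master.syncTimeout();
    }
}

class LearnerMasterDemo {
    public static void main(String[] args) {
        LeaderSketch leader = new LeaderSketch(2000, 5);
        leader.propose(0x100000001L);
        ObserverMasterSketch om = new ObserverMasterSketch(2000, 5);
        om.proposalReceived(0x100000001L);
        // The handler code is identical whichever kind of master it is given.
        System.out.println(new HandlerSketch(leader).describe());
        System.out.println(new HandlerSketch(om).describe());
    }
}
```

The point of the design is visible in `HandlerSketch`: because it talks only to the interface, the same handler logic can serve learners from either a Leader or a Follower-hosted ObserverMaster.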

+ 7 - 0
zookeeper-docs/src/main/resources/markdown/zookeeperAdmin.md

@@ -550,6 +550,13 @@ in the configuration file:
     Note that SSL feature will be enabled when user plugs-in
     zookeeper.serverCnxnFactory, zookeeper.clientCnxnSocket as Netty.
 
+* *observerMasterPort* :
+    the port to listen for observer connections; that is, the
+    port that observers attempt to connect to.
+    If the property is set then the server will host observer connections
+    when in follower mode in addition to when in leader mode and correspondingly
+    attempt to connect to any voting peer when in observer mode.
+
 * *dataDir* :
     the location where ZooKeeper will store the in-memory
     database snapshots and, unless specified otherwise, the

+ 32 - 0
zookeeper-docs/src/main/resources/markdown/zookeeperObservers.md

@@ -81,6 +81,38 @@ specified in every config file. You should see a command line prompt
 through which you can issue commands like _ls_ to query
 the ZooKeeper service.
 
+<a name="ch_ObserverMasters"></a>
+
+## How to use Observer Masters
+
+Observers function simply as non-voting members of the ensemble, sharing
+the Learner interface with Followers and holding only a slightly different
+internal pipeline. Both maintain connections along the quorum port with the
+Leader by which they learn of all new proposals on the ensemble.
+
+By default, Observers connect to the Leader of the quorum along its
+quorum port and this is how they learn of all new proposals on the
+ensemble. There are benefits to allowing Observers to connect to the
+Followers instead as a means of plugging in to the commit stream in place
+of connecting to the Leader. It shifts the burden of supporting Observers
+off the Leader and allows it to focus on coordinating the commit of writes.
+This means better performance when the Leader is under high load,
+particularly high network load such as can happen after a leader election
+when many Learners need to sync. It reduces the total network connections
+maintained on the Leader when there is a large number of observers.
+Activating Followers to support Observers allows the overall number of
+Observers to scale into the hundreds. On the other end, Observer
+availability is improved since it will take less time for a large
+number of Observers to finish syncing and start serving client traffic.
+
+This feature can be activated by letting all members of the ensemble know
+which port will be used by the Followers to listen for Observer
+connections. The following entry, when added to the server config file,
+will instruct Observers to connect to peers (Leaders and Followers) on
+port 2191 and instruct Followers to create an ObserverMaster thread to
+listen and serve on that port.
+
+    observerMasterPort=2191
 <a name="ch_UseCases"></a>
 
 ## Example use cases

+ 14 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/ObserverBean.java

@@ -20,6 +20,7 @@ package org.apache.zookeeper.server;
 
 import org.apache.zookeeper.server.quorum.Observer;
 import org.apache.zookeeper.server.quorum.ObserverMXBean;
+import org.apache.zookeeper.server.quorum.QuorumPeer;
 
 /**
  * ObserverBean
@@ -46,4 +47,17 @@ public class ObserverBean extends ZooKeeperServerBean implements ObserverMXBean{
         return observer.getSocket().toString();
     }
 
+    public String getLearnerMaster() {
+        QuorumPeer.QuorumServer learnerMaster = observer.getCurrentLearnerMaster();
+        if (learnerMaster == null || learnerMaster.addr == null) {
+            return "Unknown";
+        }
+        return learnerMaster.addr.getAddress().getHostAddress() + ":" + learnerMaster.addr.getPort();
+    }
+
+    public void setLearnerMaster(String learnerMaster) {
+        if (!observer.setLearnerMaster(learnerMaster)) {
+            throw new IllegalArgumentException("Not a valid learner master");
+        }
+    }
 }

+ 15 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/admin/Commands.java

@@ -37,10 +37,13 @@ import org.apache.zookeeper.server.ServerStats;
 import org.apache.zookeeper.server.ZKDatabase;
 import org.apache.zookeeper.server.ZooKeeperServer;
 import org.apache.zookeeper.server.ZooTrace;
+import org.apache.zookeeper.server.quorum.Follower;
+import org.apache.zookeeper.server.quorum.FollowerZooKeeperServer;
 import org.apache.zookeeper.server.quorum.Leader;
 import org.apache.zookeeper.server.quorum.LeaderZooKeeperServer;
 import org.apache.zookeeper.server.quorum.QuorumPeer;
 import org.apache.zookeeper.server.quorum.QuorumZooKeeperServer;
+import org.apache.zookeeper.server.quorum.ObserverZooKeeperServer;
 import org.apache.zookeeper.server.quorum.ReadOnlyZooKeeperServer;
 import org.apache.zookeeper.server.util.OSMXBean;
 import org.slf4j.Logger;
@@ -375,6 +378,18 @@ public class Commands {
                 response.put("min_proposal_size", leader.getProposalStats().getMinBufferSize());
             }
 
+            if (zkServer instanceof FollowerZooKeeperServer) {
+                Follower follower = ((FollowerZooKeeperServer) zkServer).getFollower();
+                Integer syncedObservers = follower.getSyncedObserverSize();
+                if (syncedObservers != null) {
+                    response.put("synced_observers", syncedObservers);
+                }
+            }
+
+            if (zkServer instanceof ObserverZooKeeperServer) {
+                response.put("observer_master_id", ((ObserverZooKeeperServer)zkServer).getObserver().getLearnerMasterId());
+            }
+
             response.putAll(ServerMetrics.getAllValues());
 
             return response;

+ 39 - 7
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Follower.java

@@ -42,7 +42,9 @@ public class Follower extends Learner{
     private long lastQueued;
     // This is the same object as this.zk, but we cache the downcast op
     final FollowerZooKeeperServer fzk;
-    
+
+    ObserverMaster om;
+
     Follower(QuorumPeer self,FollowerZooKeeperServer zk) {
         this.self = self;
         this.zk=zk;
@@ -96,6 +98,15 @@ public class Follower extends Learner{
                     long syncTime = Time.currentElapsedTime() - startTime;
                     ServerMetrics.FOLLOWER_SYNC_TIME.add(syncTime);
                 }
+                if (self.getObserverMasterPort() > 0) {
+                    LOG.info("Starting ObserverMaster");
+
+                    om = new ObserverMaster(self, fzk, self.getObserverMasterPort());
+                    om.start();
+                } else {
+                    om = null;
+                }
+                // create a reusable packet to reduce gc impact
                 QuorumPacket qp = new QuorumPacket();
                 while (this.isRunning()) {
                     readPacket(qp);
@@ -113,6 +124,9 @@ public class Follower extends Learner{
                 pendingRevalidations.clear();
             }
         } finally {
+            if (om != null) {
+                om.stop();
+            }
             zk.unregisterJMX((Learner)this);
         }
     }
@@ -145,9 +159,16 @@ public class Follower extends Learner{
             }
             
             fzk.logRequest(hdr, txn);
+
+            if (om != null) {
+                om.proposalReceived(qp);
+            }
             break;
         case Leader.COMMIT:
             fzk.commit(qp.getZxid());
+            if (om != null) {
+                om.proposalCommitted(qp.getZxid());
+            }
             break;
             
         case Leader.COMMITANDACTIVATE:
@@ -159,11 +180,16 @@ public class Follower extends Learner{
            // get new designated leader from (current) leader's message
            ByteBuffer buffer = ByteBuffer.wrap(qp.getData());    
            long suggestedLeaderId = buffer.getLong();
-            boolean majorChange = 
-                   self.processReconfig(qv, suggestedLeaderId, qp.getZxid(), true);
-           // commit (writes the new config to ZK tree (/zookeeper/config)                     
-           fzk.commit(qp.getZxid());
-            if (majorChange) {
+           final long zxid = qp.getZxid();
+           boolean majorChange =
+                   self.processReconfig(qv, suggestedLeaderId, zxid, true);
+           // commit (writes the new config to ZK tree (/zookeeper/config)
+           fzk.commit(zxid);
+
+           if (om != null) {
+               om.informAndActivate(zxid, suggestedLeaderId);
+           }
+           if (majorChange) {
                throw new Exception("changes proposed in reconfig");
            }
            break;
@@ -171,7 +197,9 @@ public class Follower extends Learner{
             LOG.error("Received an UPTODATE message after Follower started");
             break;
         case Leader.REVALIDATE:
-            revalidate(qp);
+            if (om == null || !om.revalidateLearnerSession(qp)) {
+                revalidate(qp);
+            }
             break;
         case Leader.SYNC:
             fzk.sync();
@@ -205,6 +233,10 @@ public class Follower extends Learner{
         return lastQueued;
     }
 
+    public Integer getSyncedObserverSize() {
+        return  om == null ? null : om.getNumActiveObservers();
+    }
+
     @Override
     public void shutdown() {    
         LOG.info("shutdown called", new Exception("shutdown Follower"));
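
The COMMITANDACTIVATE branch above reads the designated leader id with `buffer.getLong()`, and the `buildInformAndActivePacket` helper that this commit adds to `Leader.java` writes the same eight-byte prefix ahead of the proposal data. The sketch below shows that layout in isolation. The class `InformAndActivatePayload` and its method names are hypothetical and exist only for illustration; it works on raw byte arrays rather than on `QuorumPacket`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of ZooKeeper.
class InformAndActivatePayload {

    // Prefix the proposal bytes with the 8-byte designated leader id.
    static byte[] encode(long designatedLeader, byte[] proposalData) {
        byte[] data = new byte[proposalData.length + Long.BYTES];
        ByteBuffer buffer = ByteBuffer.wrap(data);
        buffer.putLong(designatedLeader);
        buffer.put(proposalData);
        return data;
    }

    // Read back the leader id from the first eight bytes.
    static long decodeLeader(byte[] data) {
        return ByteBuffer.wrap(data).getLong();
    }

    // Everything after the first eight bytes is the original proposal data.
    static byte[] decodeProposal(byte[] data) {
        return Arrays.copyOfRange(data, Long.BYTES, data.length);
    }

    public static void main(String[] args) {
        byte[] payload = encode(3L, new byte[] {1, 2, 3});
        System.out.println(decodeLeader(payload));
        System.out.println(Arrays.toString(decodeProposal(payload)));
    }
}
```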

+ 10 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerBean.java

@@ -52,4 +52,14 @@ public class FollowerBean extends ZooKeeperServerBean implements FollowerMXBean
     public long getElectionTimeTaken() {
         return follower.self.getElectionTimeTaken();
     }
+
+    @Override
+    public int getObserverMasterPacketSizeLimit() {
+        return follower.om == null ? -1 : follower.om.getPktsSizeLimit();
+    }
+
+    @Override
+    public void setObserverMasterPacketSizeLimit(int sizeLimit) {
+        ObserverMaster.setPktsSizeLimit(sizeLimit);
+    }
 }

+ 10 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerMXBean.java

@@ -43,4 +43,14 @@ public interface FollowerMXBean extends ZooKeeperServerMXBean {
      * @return time taken for leader election in milliseconds.
      */
     public long getElectionTimeTaken();
+
+    /**
+     * @return the size limit in bytes for the observer master commit packet queue
+     */
+    public int getObserverMasterPacketSizeLimit();
+
+    /**
+     * set the size limit in bytes for the observer master commit packet queue
+     */
+    public void setObserverMasterPacketSizeLimit(int sizeLimit);
 }

+ 24 - 17
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java

@@ -111,26 +111,33 @@ public class FollowerRequestProcessor extends ZooKeeperCriticalThread implements
     }
 
     public void processRequest(Request request) {
+        processRequest(request, true);
+    }
+
+    void processRequest(Request request, boolean checkForUpgrade) {
         if (!finished) {
-            // Before sending the request, check if the request requires a
-            // global session and what we have is a local session. If so do
-            // an upgrade.
-            Request upgradeRequest = null;
-            try {
-                upgradeRequest = zks.checkUpgradeSession(request);
-            } catch (KeeperException ke) {
-                if (request.getHdr() != null) {
-                    request.getHdr().setType(OpCode.error);
-                    request.setTxn(new ErrorTxn(ke.code().intValue()));
+            if (checkForUpgrade) {
+                // Before sending the request, check if the request requires a
+                // global session and what we have is a local session. If so do
+                // an upgrade.
+                Request upgradeRequest = null;
+                try {
+                    upgradeRequest = zks.checkUpgradeSession(request);
+                } catch (KeeperException ke) {
+                    if (request.getHdr() != null) {
+                        request.getHdr().setType(OpCode.error);
+                        request.setTxn(new ErrorTxn(ke.code().intValue()));
+                    }
+                    request.setException(ke);
+                    LOG.info("Error creating upgrade request", ke);
+                } catch (IOException ie) {
+                    LOG.error("Unexpected error in upgrade", ie);
+                }
+                if (upgradeRequest != null) {
+                    queuedRequests.add(upgradeRequest);
                 }
-                request.setException(ke);
-                LOG.info("Error creating upgrade request",  ke);
-            } catch (IOException ie) {
-                LOG.error("Unexpected error in upgrade", ie);
-            }
-            if (upgradeRequest != null) {
-                queuedRequests.add(upgradeRequest);
             }
+
             queuedRequests.add(request);
         }
     }

+ 30 - 2
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/FollowerZooKeeperServer.java

@@ -23,6 +23,7 @@ import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.LinkedBlockingQueue;
 
 import org.apache.jute.Record;
+import org.apache.zookeeper.jmx.MBeanRegistry;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.apache.zookeeper.server.ExitCode;
@@ -34,6 +35,8 @@ import org.apache.zookeeper.server.ZKDatabase;
 import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
 import org.apache.zookeeper.txn.TxnHeader;
 
+import javax.management.JMException;
+
 /**
  * Just like the standard ZooKeeperServer. We just replace the request
  * processors: FollowerRequestProcessor -> CommitProcessor ->
@@ -113,13 +116,17 @@ public class FollowerZooKeeperServer extends LearnerZooKeeperServer {
     }
 
     synchronized public void sync(){
-        if(pendingSyncs.size() ==0){
+        if(pendingSyncs.size() == 0) {
             LOG.warn("Not expecting a sync.");
             return;
         }
 
         Request r = pendingSyncs.remove();
-		commitProcessor.commit(r);
+        if (r instanceof LearnerSyncRequest) {
+            LearnerSyncRequest lsr = (LearnerSyncRequest)r;
+            lsr.fh.queuePacket(new QuorumPacket(Leader.SYNC, 0, null, null));
+        }
+        commitProcessor.commit(r);
     }
 
     @Override
@@ -139,4 +146,25 @@ public class FollowerZooKeeperServer extends LearnerZooKeeperServer {
     public Learner getLearner() {
         return getFollower();
     }
+
+    /**
+     * Process a request received from external Learner through the LearnerMaster
+     * These requests have already passed through validation and checks for
+     * session upgrade and can be injected into the middle of the pipeline.
+     *
+     * @param request received from external Learner
+     */
+    void processObserverRequest(Request request) {
+        ((FollowerRequestProcessor)firstProcessor).processRequest(request, false);
+    }
+
+    boolean registerJMX(LearnerHandlerBean handlerBean) {
+        try {
+            MBeanRegistry.getInstance().register(handlerBean, jmxServerBean);
+            return true;
+        } catch (JMException e) {
+            LOG.warn("Could not register connection", e);
+        }
+        return false;
+    }
 }

+ 155 - 11
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java

@@ -19,6 +19,10 @@
 package org.apache.zookeeper.server.quorum;
 
 import java.io.BufferedInputStream;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
 import java.io.IOException;
 import java.net.BindException;
 import java.net.ServerSocket;
@@ -41,15 +45,20 @@ import java.util.concurrent.ConcurrentMap;
 
 import javax.security.sasl.SaslException;
 
+import org.apache.zookeeper.KeeperException;
 import org.apache.zookeeper.ZooDefs.OpCode;
 import org.apache.zookeeper.common.Time;
 import org.apache.zookeeper.common.X509Exception;
+import org.apache.zookeeper.jmx.MBeanRegistry;
 import org.apache.zookeeper.server.FinalRequestProcessor;
 import org.apache.zookeeper.server.Request;
 import org.apache.zookeeper.server.RequestProcessor;
 import org.apache.zookeeper.server.ServerMetrics;
 import org.apache.zookeeper.server.ZooKeeperCriticalThread;
 import org.apache.zookeeper.server.quorum.QuorumPeer.LearnerType;
+import org.apache.zookeeper.server.ZKDatabase;
+import org.apache.zookeeper.server.ZooTrace;
+import org.apache.zookeeper.server.quorum.auth.QuorumAuthServer;
 import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
 import org.apache.zookeeper.server.util.SerializeUtils;
 import org.apache.zookeeper.server.util.ZxidUtils;
@@ -60,7 +69,7 @@ import org.slf4j.LoggerFactory;
 /**
  * This class has the control logic for the Leader.
  */
-public class Leader {
+public class Leader implements LearnerMaster {
     private static final Logger LOG = LoggerFactory.getLogger(Leader.class);
 
     static final private boolean nodelay = System.getProperty("leader.nodelay", "true").equals("true");
@@ -118,6 +127,9 @@ public class Leader {
                 maxConcurrentSnapshots, maxConcurrentSnapshotTimeout);
     }
 
+    // beans for all learners
+    private final ConcurrentHashMap<LearnerHandler, LearnerHandlerBean> connectionBeans = new ConcurrentHashMap<>();
+
     /**
      * Returns a copy of the current learner snapshot
      */
@@ -181,7 +193,8 @@ public class Leader {
      * @param learner
      *                instance of learner handle
      */
-    void addLearnerHandler(LearnerHandler learner) {
+    @Override
+    public void addLearnerHandler(LearnerHandler learner) {
         synchronized (learners) {
             learners.add(learner);
         }
@@ -192,7 +205,8 @@ public class Leader {
      *
      * @param peer
      */
-    void removeLearnerHandler(LearnerHandler peer) {
+    @Override
+    public void removeLearnerHandler(LearnerHandler peer) {
         synchronized (forwardingFollowers) {
             forwardingFollowers.remove(peer);
         }
@@ -866,6 +880,7 @@ public class Leader {
      * @param sid, the id of the server that sent the ack
      * @param followerAddr
      */
+    @Override
     synchronized public void processAck(long sid, long zxid, SocketAddress followerAddr) {
         if (!allowedToCommit) return; // last op committed was a leader change - from now on
                                      // the new leader should commit
@@ -1064,23 +1079,30 @@ public class Leader {
         sendObserverPacket(qp);
     }
 
+    public static QuorumPacket buildInformAndActivePacket(long zxid,
+            long designatedLeader, byte[] proposalData) {
+        byte[] data = new byte[proposalData.length + 8];
+        ByteBuffer buffer = ByteBuffer.wrap(data);
+        buffer.putLong(designatedLeader);
+        buffer.put(proposalData);
+
+        return new QuorumPacket(Leader.INFORMANDACTIVATE, zxid, data, null);
+    }
 
     /**
      * Create an inform&activate packet and send it to all observers.
      */
     public void informAndActivate(Proposal proposal, long designatedLeader) {
-       byte[] proposalData = proposal.packet.getData();
-        byte[] data = new byte[proposalData.length + 8];
-        ByteBuffer buffer = ByteBuffer.wrap(data);
-       buffer.putLong(designatedLeader);
-       buffer.put(proposalData);
-
-        QuorumPacket qp = new QuorumPacket(Leader.INFORMANDACTIVATE, proposal.request.zxid, data, null);
-        sendObserverPacket(qp);
+        sendObserverPacket(buildInformAndActivePacket(proposal.request.zxid,
+                designatedLeader, proposal.packet.getData()));
     }
 
     long lastProposed;
 
+    @Override
+    synchronized public long getLastProposed() {
+        return lastProposed;
+    }
 
     /**
      * Returns the current epoch of the leader.
@@ -1146,6 +1168,7 @@ public class Leader {
         return p;
     }
 
+    @Override
     public LearnerSnapshotThrottler getLearnerSnapshotThrottler() {
         return learnerSnapshotThrottler;
     }
@@ -1185,6 +1208,7 @@ public class Leader {
      * @return last proposed zxid
      * @throws InterruptedException
      */
+    @Override
     synchronized public long startForwarding(LearnerHandler handler,
             long lastSeenZxid) {
         // Queue up any outstanding requests enabling the receipt of
@@ -1221,6 +1245,16 @@ public class Leader {
 
         return lastProposed;
     }
+
+    @Override
+    public void waitForStartup() throws InterruptedException {
+        synchronized(zk) {
+            while(!zk.isRunning() && !Thread.currentThread().isInterrupted()) {
+                zk.wait(20);
+            }
+        }
+    }
+
     // VisibleForTesting
     protected final Set<Long> connectingFollowers = new HashSet<Long>();
 
@@ -1277,6 +1311,7 @@ public class Leader {
         }
     }
 
+    @Override
     public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
         synchronized(connectingFollowers) {
             if (!waitingForNewEpoch) {
@@ -1313,10 +1348,17 @@ public class Leader {
         }
     }
 
+    @Override
+    public ZKDatabase getZKDatabase() {
+        return zk.getZKDatabase();
+    }
+
     // VisibleForTesting
     protected final Set<Long> electingFollowers = new HashSet<Long>();
     // VisibleForTesting
     protected boolean electionFinished = false;
+
+    @Override
     public void waitForEpochAck(long id, StateSummary ss) throws IOException, InterruptedException {
         synchronized(electingFollowers) {
             if (electionFinished) {
@@ -1417,6 +1459,7 @@ public class Leader {
      * @param sid
      * @throws InterruptedException
      */
+    @Override
     public void waitForNewLeaderAck(long sid, long zxid)
             throws InterruptedException {
 
@@ -1517,4 +1560,105 @@ public class Leader {
     private boolean isParticipant(long sid) {
         return self.getQuorumVerifier().getVotingMembers().containsKey(sid);
     }
+
+    @Override
+    public int getCurrentTick() {
+        return self.tick.get();
+    }
+
+    @Override
+    public int syncTimeout() {
+        return self.tickTime * self.syncLimit;
+    }
+
+    @Override
+    public int getTickOfNextAckDeadline() {
+        return self.tick.get() + self.syncLimit;
+    }
+
+    @Override
+    public int getTickOfInitialAckDeadline() {
+        return self.tick.get() + self.initLimit + self.syncLimit;
+    }
+
+    @Override
+    public long getAndDecrementFollowerCounter() {
+        return followerCounter.getAndDecrement();
+    }
+
+    @Override
+    public void touch(long sess, int to) {
+        zk.touch(sess, to);
+    }
+
+    @Override
+    public void submitLearnerRequest(Request si) {
+        zk.submitLearnerRequest(si);
+    }
+
+    @Override
+    public long getQuorumVerifierVersion() {
+        return self.getQuorumVerifier().getVersion();
+    }
+
+    @Override
+    public String getPeerInfo(long sid) {
+        QuorumPeer.QuorumServer server = self.getView().get(sid);
+        return server == null ? "" : server.toString();
+    }
+
+    @Override
+    public byte[] getQuorumVerifierBytes() {
+        return self.getLastSeenQuorumVerifier().toString().getBytes();
+    }
+
+    @Override
+    public QuorumAuthServer getQuorumAuthServer() {
+        return (self == null) ? null : self.authServer;
+    }
+
+    @Override
+    public void revalidateSession(QuorumPacket qp, LearnerHandler learnerHandler) throws IOException {
+        ByteArrayInputStream bis = new ByteArrayInputStream(qp.getData());
+        DataInputStream dis = new DataInputStream(bis);
+        long id = dis.readLong();
+        int to = dis.readInt();
+        ByteArrayOutputStream bos = new ByteArrayOutputStream();
+        DataOutputStream dos = new DataOutputStream(bos);
+        dos.writeLong(id);
+        boolean valid = zk.checkIfValidGlobalSession(id, to);
+        if (valid) {
+            try {
+                // set the session owner as the follower that owns the session
+                zk.setOwner(id, learnerHandler);
+            } catch (KeeperException.SessionExpiredException e) {
+                LOG.error("Somehow session " + Long.toHexString(id) + " expired right after being renewed! (impossible)", e);
+            }
+        }
+        if (LOG.isTraceEnabled()) {
+            ZooTrace.logTraceMessage(LOG,
+                    ZooTrace.SESSION_TRACE_MASK,
+                    "Session 0x" + Long.toHexString(id)
+                            + " is valid: "+ valid);
+        }
+        dos.writeBoolean(valid);
+        qp.setData(bos.toByteArray());
+        learnerHandler.queuePacket(qp);
+    }
+
+    @Override
+    public void registerLearnerHandlerBean(final LearnerHandler learnerHandler, Socket socket) {
+        LearnerHandlerBean bean = new LearnerHandlerBean(learnerHandler, socket);
+        if (zk.registerJMX(bean)) {
+            connectionBeans.put(learnerHandler, bean);
+        }
+    }
+
+    @Override
+    public void unregisterLearnerHandlerBean(final LearnerHandler learnerHandler) {
+        LearnerHandlerBean bean = connectionBeans.remove(learnerHandler);
+        if (bean != null){
+            MBeanRegistry.getInstance().unregister(bean);
+        }
+    }
 }
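
The `revalidateSession` method above implies a small wire format: the request body is a session id (`long`) followed by a timeout (`int`), and the reply body is the same session id followed by a validity flag (`boolean`). The following sketch reproduces only that framing. `RevalidatePayload` is a hypothetical class, the validity decision is passed in as a parameter instead of calling `checkIfValidGlobalSession`, and I/O errors are rethrown unchecked purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of ZooKeeper.
class RevalidatePayload {

    // Request body: session id (8 bytes) followed by session timeout (4 bytes).
    static byte[] request(long sessionId, int timeout) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeLong(sessionId);
            dos.writeInt(timeout);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reply body: the echoed session id (8 bytes) followed by the validity flag (1 byte).
    static byte[] reply(byte[] requestData, boolean valid) {
        try {
            DataInputStream dis = new DataInputStream(new ByteArrayInputStream(requestData));
            long id = dis.readLong();
            dis.readInt(); // the timeout is consumed but not echoed back
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeLong(id);
            dos.writeBoolean(valid);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] req = request(0x1000000000001L, 30000);
        byte[] rep = reply(req, true);
        System.out.println(req.length + " request bytes, " + rep.length + " reply bytes");
    }
}
```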

+ 11 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LeaderZooKeeperServer.java

@@ -30,6 +30,7 @@ import org.apache.zookeeper.server.ServerCnxn;
 import org.apache.zookeeper.server.ZKDatabase;
 import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
 
+import javax.management.JMException;
 import java.io.IOException;
 import java.util.concurrent.TimeUnit;
 
@@ -186,6 +187,16 @@ public class LeaderZooKeeperServer extends QuorumZooKeeperServer {
         }
     }
 
+    boolean registerJMX(LearnerHandlerBean handlerBean) {
+        try {
+            MBeanRegistry.getInstance().register(handlerBean, jmxServerBean);
+            return true;
+        } catch (JMException e) {
+            LOG.warn("Could not register connection", e);
+        }
+        return false;
+    }
+
     @Override
     protected void unregisterJMX() {
         // unregister from JMX

+ 19 - 9
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Learner.java

@@ -87,9 +87,17 @@ public class Learner {
     
     protected static final Logger LOG = LoggerFactory.getLogger(Learner.class);
 
+    /**
+     * Time to wait after connection attempt with the Leader or LearnerMaster before this
+     * Learner tries to connect again.
+     */
+    private static final int leaderConnectDelayDuringRetryMs =
+            Integer.getInteger("zookeeper.leaderConnectDelayDuringRetryMs", 100);
+
     static final private boolean nodelay = System.getProperty("follower.nodelay", "true").equals("true");
     static {
-        LOG.info("TCP NoDelay set to: " + nodelay);
+        LOG.info("leaderConnectDelayDuringRetryMs: {}", leaderConnectDelayDuringRetryMs);
+        LOG.info("TCP NoDelay set to: {}", nodelay);
     }   
     
     final ConcurrentHashMap<Long, ServerCnxn> pendingRevalidations =
@@ -235,9 +243,10 @@ public class Learner {
     }
 
     /**
-     * Establish a connection with the Leader found by findLeader. Retries
-     * until either initLimit time has elapsed or 5 tries have happened. 
-     * @param addr - the address of the Leader to connect to.
+     * Establish a connection with the LearnerMaster found by findLearnerMaster.
+     * Followers only connect to Leaders, Observers can connect to any active LearnerMaster.
+     * Retries until either initLimit time has elapsed or 5 tries have happened.
+     * @param addr - the address of the Peer to connect to.
      * @throws IOException - if the socket connection fails on the 5th attempt
      * <li>if there is an authentication failure while connecting to leader</li>
      * @throws ConnectException
@@ -248,7 +257,7 @@ public class Learner {
         this.sock = createSocket();
 
         int initLimitTime = self.tickTime * self.initLimit;
-        int remainingInitLimitTime = initLimitTime;
+        int remainingInitLimitTime;
         long startNanoTime = nanoTime();
 
         for (int tries = 0; tries < 5; tries++) {
@@ -286,7 +295,7 @@ public class Learner {
                     this.sock = createSocket();
                 }
             }
-            Thread.sleep(1000);
+            Thread.sleep(leaderConnectDelayDuringRetryMs);
         }
 
         self.authLearner.authenticate(sock, hostname);
@@ -309,8 +318,8 @@ public class Learner {
     }
 
     /**
-     * Once connected to the leader, perform the handshake protocol to
-     * establish a following / observing connection. 
+     * Once connected to the leader or learner master, perform the handshake
+     * protocol to establish a following / observing connection.
      * @param pktType
      * @return the zxid the Leader sends for synchronization purposes.
      * @throws IOException
@@ -369,7 +378,8 @@ public class Learner {
     } 
     
     /**
-     * Finally, synchronize our history with the Leader. 
+     * Finally, synchronize our history with the Leader (if Follower)
+     * or the LearnerMaster (if Observer).
      * @param newLeaderZxid
      * @throws IOException
      * @throws InterruptedException
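
The retry change in this file can be illustrated with a minimal, standalone sketch. It assumes only what the hunks show: a delay read with `Integer.getInteger` (default 100 ms) and a loop of at most 5 tries that sleeps after a failed attempt. The class `ConnectRetrySketch`, the method `connectWithRetries`, and the `attempt` callback are hypothetical stand-ins for the real socket connect, and the initLimit budget check from `Learner` is left out.

```java
import java.util.function.IntPredicate;

public class ConnectRetrySketch {
    // Same lookup pattern as the patch: a JVM system property with a 100 ms default.
    static final int RETRY_DELAY_MS =
            Integer.getInteger("zookeeper.leaderConnectDelayDuringRetryMs", 100);

    /**
     * Calls attempt (a hypothetical stand-in for the socket connect) up to
     * maxTries times, sleeping delayMs after each failed attempt.
     *
     * @return the number of attempts used, or -1 if every attempt failed
     *         or the thread was interrupted while waiting
     */
    static int connectWithRetries(int maxTries, int delayMs, IntPredicate attempt) {
        for (int tries = 0; tries < maxTries; tries++) {
            if (attempt.test(tries)) {
                return tries + 1;
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller and give up.
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Pretend the third attempt (index 2) succeeds.
        int used = connectWithRetries(5, RETRY_DELAY_MS, t -> t == 2);
        System.out.println("connected after " + used + " attempts");
    }
}
```

Starting the JVM with `-Dzookeeper.leaderConnectDelayDuringRetryMs=500` would lengthen the pause between attempts in this sketch without a code change, which is the apparent motivation for replacing the hard-coded `Thread.sleep(1000)`.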

+ 75 - 91
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandler.java

@@ -21,9 +21,7 @@ package org.apache.zookeeper.server.quorum;
 import java.io.BufferedInputStream;
 import java.io.BufferedOutputStream;
 import java.io.ByteArrayInputStream;
-import java.io.ByteArrayOutputStream;
 import java.io.DataInputStream;
-import java.io.DataOutputStream;
 import java.io.IOException;
 import java.net.Socket;
 import java.nio.ByteBuffer;
@@ -37,8 +35,6 @@ import javax.security.sasl.SaslException;
 
 import org.apache.jute.BinaryInputArchive;
 import org.apache.jute.BinaryOutputArchive;
-import org.apache.jute.Record;
-import org.apache.zookeeper.KeeperException.SessionExpiredException;
 import org.apache.zookeeper.ZooDefs.OpCode;
 import org.apache.zookeeper.server.Request;
 import org.apache.zookeeper.server.ServerMetrics;
@@ -48,6 +44,7 @@ import org.apache.zookeeper.server.ZooKeeperThread;
 import org.apache.zookeeper.server.ZooTrace;
 import org.apache.zookeeper.server.quorum.Leader.Proposal;
 import org.apache.zookeeper.server.quorum.QuorumPeer.LearnerType;
+import org.apache.zookeeper.server.quorum.auth.QuorumAuthServer;
 import org.apache.zookeeper.server.util.SerializeUtils;
 import org.apache.zookeeper.server.util.ZxidUtils;
 import org.apache.zookeeper.txn.TxnHeader;
@@ -68,7 +65,7 @@ public class LearnerHandler extends ZooKeeperThread {
         return sock;
     }
 
-    final Leader leader;
+    final LearnerMaster learnerMaster;
 
     /** Deadline for receiving the next ack. If we are bootstrapping then
      * it's based on the initLimit, if we are done bootstrapping it's based
@@ -85,6 +82,10 @@ public class LearnerHandler extends ZooKeeperThread {
         return sid;
     }
 
+    String getRemoteAddress() {
+        return sock == null ? "<null>" : sock.getRemoteSocketAddress().toString();
+    }
+
     protected int version = 0x1;
 
     int getVersion() {
@@ -147,7 +148,7 @@ public class LearnerHandler extends ZooKeeperThread {
                 return true;
             } else {
                 long msDelay = (time - currentTime) / 1000000;
-                return (msDelay < (leader.self.tickTime * leader.self.syncLimit));
+                return (msDelay < learnerMaster.syncTimeout());
             }
         }
     };
@@ -167,7 +168,7 @@ public class LearnerHandler extends ZooKeeperThread {
     private volatile boolean sendingThreadStarted = false;
 
     /**
-     * For testing purpose, force leader to use snapshot to sync with followers
+     * For testing purposes, force learnerMaster to use a snapshot to sync with followers
      */
     public static final String FORCE_SNAP_SYNC = "zookeeper.forceSnapshotSync";
     private boolean forceSnapSync = false;
@@ -183,10 +184,10 @@ public class LearnerHandler extends ZooKeeperThread {
      */
     private long leaderLastZxid;
 
-    LearnerHandler(Socket sock, BufferedInputStream bufferedInput,Leader leader) throws IOException {
+    LearnerHandler(Socket sock, BufferedInputStream bufferedInput, LearnerMaster learnerMaster) throws IOException {
         super("LearnerHandler-" + sock.getRemoteSocketAddress());
         this.sock = sock;
-        this.leader = leader;
+        this.learnerMaster = learnerMaster;
         this.bufferedInput = bufferedInput;
 
         if (Boolean.getBoolean(FORCE_SNAP_SYNC)) {
@@ -195,9 +196,9 @@ public class LearnerHandler extends ZooKeeperThread {
         }
 
         try {
-            if (leader.self != null) {
-                leader.self.authServer.authenticate(sock,
-                        new DataInputStream(bufferedInput));
+            QuorumAuthServer authServer = learnerMaster.getQuorumAuthServer();
+            if (authServer != null) {
+                authServer.authenticate(sock, new DataInputStream(bufferedInput));
             }
         } catch (IOException e) {
             LOG.error("Server failed to authenticate quorum learner, addr: {}, closing connection",
@@ -368,9 +369,8 @@ public class LearnerHandler extends ZooKeeperThread {
     @Override
     public void run() {
         try {
-            leader.addLearnerHandler(this);
-            tickOfNextAckDeadline = leader.self.tick.get()
-                    + leader.self.initLimit + leader.self.syncLimit;
+            learnerMaster.addLearnerHandler(this);
+            tickOfNextAckDeadline = learnerMaster.getTickOfInitialAckDeadline();
 
             ia = BinaryInputArchive.getArchive(bufferedInput);
             bufferedOutput = new BufferedOutputStream(sock.getOutputStream());
@@ -384,6 +384,9 @@ public class LearnerHandler extends ZooKeeperThread {
                 return;
             }
 
+            if (learnerMaster instanceof ObserverMaster && qp.getType() != Leader.OBSERVERINFO) {
+                throw new IOException("Non observer attempting to connect to ObserverMaster. type = " + qp.getType());
+            }
             byte learnerInfoData[] = qp.getData();
             if (learnerInfoData != null) {
                 ByteBuffer bbsid = ByteBuffer.wrap(learnerInfoData);
@@ -395,31 +398,34 @@ public class LearnerHandler extends ZooKeeperThread {
                 }
                 if (learnerInfoData.length >= 20) {
                     long configVersion = bbsid.getLong();
-                    if (configVersion > leader.self.getQuorumVerifier().getVersion()) {
+                    if (configVersion > learnerMaster.getQuorumVerifierVersion()) {
                         throw new IOException("Follower is ahead of the leader (has a later activated configuration)");
                     }
                 }
             } else {
-                this.sid = leader.followerCounter.getAndDecrement();
+                this.sid = learnerMaster.getAndDecrementFollowerCounter();
             }
 
-            if (leader.self.getView().containsKey(this.sid)) {
-                LOG.info("Follower sid: " + this.sid + " : info : "
-                        + leader.self.getView().get(this.sid).toString());
+            String followerInfo = learnerMaster.getPeerInfo(this.sid);
+            if (followerInfo.isEmpty()) {
+                LOG.info("Follower sid: " + this.sid + " not in the current config "
+                        + Long.toHexString(learnerMaster.getQuorumVerifierVersion()));
             } else {
-                LOG.info("Follower sid: " + this.sid + " not in the current config " + Long.toHexString(leader.self.getQuorumVerifier().getVersion()));
+                LOG.info("Follower sid: " + this.sid + " : info : " + followerInfo);
             }
 
             if (qp.getType() == Leader.OBSERVERINFO) {
                   learnerType = LearnerType.OBSERVER;
             }
 
+            learnerMaster.registerLearnerHandlerBean(this, sock);
+
             long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
 
             long peerLastZxid;
             StateSummary ss = null;
             long zxid = qp.getZxid();
-            long newEpoch = leader.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
+            long newEpoch = learnerMaster.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
             long newLeaderZxid = ZxidUtils.makeZxid(newEpoch, 0);
 
             if (this.getVersion() < 0x10000) {
@@ -427,7 +433,7 @@ public class LearnerHandler extends ZooKeeperThread {
                 long epoch = ZxidUtils.getEpochFromZxid(zxid);
                 ss = new StateSummary(epoch, zxid);
                 // fake the message
-                leader.waitForEpochAck(this.getSid(), ss);
+                learnerMaster.waitForEpochAck(this.getSid(), ss);
             } else {
                 byte ver[] = new byte[4];
                 ByteBuffer.wrap(ver).putInt(0x10000);
@@ -443,21 +449,21 @@ public class LearnerHandler extends ZooKeeperThread {
 				}
                 ByteBuffer bbepoch = ByteBuffer.wrap(ackEpochPacket.getData());
                 ss = new StateSummary(bbepoch.getInt(), ackEpochPacket.getZxid());
-                leader.waitForEpochAck(this.getSid(), ss);
+                learnerMaster.waitForEpochAck(this.getSid(), ss);
             }
             peerLastZxid = ss.getLastZxid();
 
             // Take any necessary action if we need to send TRUNC or DIFF
             // startForwarding() will be called in all cases
-            boolean needSnap = syncFollower(peerLastZxid, leader.zk.getZKDatabase(), leader);
+            boolean needSnap = syncFollower(peerLastZxid, learnerMaster);
 
             /* if we are not truncating or sending a diff just send a snapshot */
             if (needSnap) {
                 boolean exemptFromThrottle = getLearnerType() != LearnerType.OBSERVER;
                 LearnerSnapshot snapshot =
-                        leader.getLearnerSnapshotThrottler().beginSnapshot(exemptFromThrottle);
+                        learnerMaster.getLearnerSnapshotThrottler().beginSnapshot(exemptFromThrottle);
                 try {
-                    long zxidToSend = leader.zk.getZKDatabase().getDataTreeLastProcessedZxid();
+                    long zxidToSend = learnerMaster.getZKDatabase().getDataTreeLastProcessedZxid();
                     oa.writeRecord(new QuorumPacket(Leader.SNAP, zxidToSend, null, null), "packet");
                     bufferedOutput.flush();
 
@@ -470,7 +476,7 @@ public class LearnerHandler extends ZooKeeperThread {
                             snapshot.getConcurrentSnapshotNumber(),
                             snapshot.isEssential() ? "exempt" : "not exempt");
                     // Dump data to peer
-                    leader.zk.getZKDatabase().serializeSnapshot(oa);
+                    learnerMaster.getZKDatabase().serializeSnapshot(oa);
                     oa.writeString("BenWasHere", "signature");
                     bufferedOutput.flush();
                 } finally {
@@ -492,8 +498,7 @@ public class LearnerHandler extends ZooKeeperThread {
                 oa.writeRecord(newLeaderQP, "packet");
             } else {
                 QuorumPacket newLeaderQP = new QuorumPacket(Leader.NEWLEADER,
-                        newLeaderZxid, leader.self.getLastSeenQuorumVerifier()
-                                .toString().getBytes(), null);
+                        newLeaderZxid, learnerMaster.getQuorumVerifierBytes(), null);
                 queuedPackets.add(newLeaderQP);
             }
             bufferedOutput.flush();
@@ -503,7 +508,7 @@ public class LearnerHandler extends ZooKeeperThread {
 
             /*
              * Have to wait for the first ACK, wait until
-             * the leader is ready, and only then we can
+             * the learnerMaster is ready, and only then we can
              * start processing messages.
              */
             qp = new QuorumPacket();
@@ -517,21 +522,18 @@ public class LearnerHandler extends ZooKeeperThread {
             if(LOG.isDebugEnabled()){
             	LOG.debug("Received NEWLEADER-ACK message from " + sid);
             }
-            leader.waitForNewLeaderAck(getSid(), qp.getZxid());
+            learnerMaster.waitForNewLeaderAck(getSid(), qp.getZxid());
 
             syncLimitCheck.start();
 
             // now that the ack has been processed expect the syncLimit
-            sock.setSoTimeout(leader.self.tickTime * leader.self.syncLimit);
+            sock.setSoTimeout(learnerMaster.syncTimeout());
 
             /*
-             * Wait until leader starts up
+             * Wait until learnerMaster starts up
              */
-            synchronized(leader.zk){
-                while(!leader.zk.isRunning() && !this.isInterrupted()){
-                    leader.zk.wait(20);
-                }
-            }
+            learnerMaster.waitForStartup();
+
             // Mutation packets will be queued during the serialize,
             // so we need to mark when the peer can actually start
             // using the data
@@ -550,7 +552,7 @@ public class LearnerHandler extends ZooKeeperThread {
                 if (LOG.isTraceEnabled()) {
                     ZooTrace.logQuorumPacket(LOG, traceMask, 'i', qp);
                 }
-                tickOfNextAckDeadline = leader.self.tick.get() + leader.self.syncLimit;
+                tickOfNextAckDeadline = learnerMaster.getTickOfNextAckDeadline();
 
 
                 ByteBuffer bb;
@@ -566,7 +568,7 @@ public class LearnerHandler extends ZooKeeperThread {
                         }
                     }
                     syncLimitCheck.updateAck(qp.getZxid());
-                    leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
+                    learnerMaster.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress());
                     break;
                 case Leader.PING:
                     // Process the touches
@@ -576,38 +578,11 @@ public class LearnerHandler extends ZooKeeperThread {
                     while (dis.available() > 0) {
                         long sess = dis.readLong();
                         int to = dis.readInt();
-                        leader.zk.touch(sess, to);
+                        learnerMaster.touch(sess, to);
                     }
                     break;
                 case Leader.REVALIDATE:
-                    bis = new ByteArrayInputStream(qp.getData());
-                    dis = new DataInputStream(bis);
-                    long id = dis.readLong();
-                    int to = dis.readInt();
-                    ByteArrayOutputStream bos = new ByteArrayOutputStream();
-                    DataOutputStream dos = new DataOutputStream(bos);
-                    dos.writeLong(id);
-                    boolean valid = leader.zk.checkIfValidGlobalSession(id, to);
-                    if (valid) {
-                        try {
-                            //set the session owner
-                            // as the follower that
-                            // owns the session
-                            leader.zk.setOwner(id, this);
-                        } catch (SessionExpiredException e) {
-                            LOG.error("Somehow session " + Long.toHexString(id) +
-                                    " expired right after being renewed! (impossible)", e);
-                        }
-                    }
-                    if (LOG.isTraceEnabled()) {
-                        ZooTrace.logTraceMessage(LOG,
-                                                 ZooTrace.SESSION_TRACE_MASK,
-                                                 "Session 0x" + Long.toHexString(id)
-                                                 + " is valid: "+ valid);
-                    }
-                    dos.writeBoolean(valid);
-                    qp.setData(bos.toByteArray());
-                    queuedPackets.add(qp);
+                    learnerMaster.revalidateSession(qp, this);
                     break;
                 case Leader.REQUEST:
                     bb = ByteBuffer.wrap(qp.getData());
@@ -622,7 +597,7 @@ public class LearnerHandler extends ZooKeeperThread {
                         si = new Request(null, sessionId, cxid, type, bb, qp.getAuthinfo());
                     }
                     si.setOwner(this);
-                    leader.zk.submitLearnerRequest(si);
+                    learnerMaster.submitLearnerRequest(si);
                     break;
                 default:
                     LOG.warn("unexpected quorum packet, type: {}", packetToString(qp));
@@ -646,9 +621,7 @@ public class LearnerHandler extends ZooKeeperThread {
         } catch (SnapshotThrottleException e) {
             LOG.error("too many concurrent snapshots: " + e);
         } finally {
-            LOG.warn("******* GOODBYE "
-                    + (sock != null ? sock.getRemoteSocketAddress() : "<null>")
-                    + " ********");
+            LOG.warn("******* GOODBYE {} ********", getRemoteAddress());
             shutdown();
         }
     }
@@ -681,19 +654,18 @@ public class LearnerHandler extends ZooKeeperThread {
      * and setup follower to receive packets from commit processor
      *
      * @param peerLastZxid
-     * @param db
-     * @param leader
+     * @param learnerMaster
      * @return true if snapshot transfer is needed.
      */
-    public boolean syncFollower(long peerLastZxid, ZKDatabase db, Leader leader) {
+    boolean syncFollower(long peerLastZxid, LearnerMaster learnerMaster) {
         /*
          * When leader election is completed, the leader will set its
          * lastProcessedZxid to be (epoch < 32). There will be no txn associated
          * with this zxid.
          *
          * The learner will set its lastProcessedZxid to the same value if
-         * it get DIFF or SNAP from the leader. If the same learner come
-         * back to sync with leader using this zxid, we will never find this
+         * it gets DIFF or SNAP from the learnerMaster. If the same learner comes
+         * back to sync with learnerMaster using this zxid, we will never find this
          * zxid in our history. In this case, we will ignore TRUNC logic and
          * always send DIFF if we have old enough history
          */
@@ -701,6 +673,7 @@ public class LearnerHandler extends ZooKeeperThread {
         // Keep track of the latest zxid which already queued
         long currentZxid = peerLastZxid;
         boolean needSnap = true;
+        ZKDatabase db = learnerMaster.getZKDatabase();
         boolean txnLogSyncEnabled = db.isTxnLogSyncEnabled();
         ReentrantReadWriteLock lock = db.getLogLock();
         ReadLock rl = lock.readLock();
@@ -721,7 +694,7 @@ public class LearnerHandler extends ZooKeeperThread {
             if (db.getCommittedLog().isEmpty()) {
                 /*
                  * It is possible that committedLog is empty. In that case
-                 * setting these value to the latest txn in leader db
+                 * setting these values to the latest txn in learnerMaster db
                  * will reduce the case that we need to handle
                  *
                  * Here is how each case handle by the if block below
@@ -737,7 +710,7 @@ public class LearnerHandler extends ZooKeeperThread {
              * Here are the cases that we want to handle
              *
              * 1. Force sending snapshot (for testing purpose)
-             * 2. Peer and leader is already sync, send empty diff
+             * 2. Peer and learnerMaster are already in sync, send empty diff
              * 3. Follower has txn that we haven't seen. This may be old leader
              *    so we need to send TRUNC. However, if peer has newEpochZxid,
              *    we cannot send TRUNC since the follower has no txnlog
@@ -750,7 +723,7 @@ public class LearnerHandler extends ZooKeeperThread {
              */
 
             if (forceSnapSync) {
-                // Force leader to use snapshot to sync with follower
+                // Force learnerMaster to use snapshot to sync with follower
                 LOG.warn("Forcing snapshot sync - should not see this in production");
             } else if (lastProcessedZxid == peerLastZxid) {
                 // Follower is already sync with us, send empty diff
@@ -811,9 +784,12 @@ public class LearnerHandler extends ZooKeeperThread {
                         Long.toHexString(peerLastZxid),
                         txnLogSyncEnabled);
             }
+            if (needSnap) {
+                currentZxid = db.getDataTreeLastProcessedZxid();
+            }
             LOG.debug("Start forwarding 0x" + Long.toHexString(currentZxid) +
                       " for peer sid: " +  getSid());
-            leaderLastZxid = leader.startForwarding(this, currentZxid);
+            leaderLastZxid = learnerMaster.startForwarding(this, currentZxid);
         } finally {
             rl.unlock();
         }
@@ -884,7 +860,7 @@ public class LearnerHandler extends ZooKeeperThread {
                    queueOpPacket(Leader.DIFF, lastCommittedZxid);
                    needOpPacket = false;
                 } else if (packetZxid > peerLastZxid  ) {
-                    // Peer have some proposals that the leader hasn't seen yet
+                    // Peer has some proposals that the learnerMaster hasn't seen yet
                     // it may used to be a leader
                     if (ZxidUtils.getEpochFromZxid(packetZxid) !=
                             ZxidUtils.getEpochFromZxid(peerLastZxid)) {
@@ -947,7 +923,8 @@ public class LearnerHandler extends ZooKeeperThread {
             LOG.warn("Ignoring unexpected exception during socket close", e);
         }
         this.interrupt();
-        leader.removeLearnerHandler(this);
+        learnerMaster.removeLearnerHandler(this);
+        learnerMaster.unregisterLearnerHandlerBean(this);
     }
 
     public long tickOfNextAckDeadline() {
@@ -955,7 +932,7 @@ public class LearnerHandler extends ZooKeeperThread {
     }
 
     /**
-     * ping calls from the leader to the peers
+     * ping calls from the learnerMaster to the peers
      */
     public void ping() {
         // If learner hasn't sync properly yet, don't send ping packet
@@ -965,9 +942,7 @@ public class LearnerHandler extends ZooKeeperThread {
         }
         long id;
         if (syncLimitCheck.check(System.nanoTime())) {
-            synchronized(leader) {
-                id = leader.lastProposed;
-            }
+            id = learnerMaster.getLastProposed();
             QuorumPacket ping = new QuorumPacket(Leader.PING, id, null, null);
             queuePacket(ping);
         } else {
@@ -990,9 +965,18 @@ public class LearnerHandler extends ZooKeeperThread {
         queuedPackets.add(p);
     }
 
+    static long packetSize(QuorumPacket p) {
+        /* Approximate base size of QuorumPacket: int + long + byte[] + List */
+        long size = 4 + 8 + 8 + 8;
+        byte[] data = p.getData();
+        if (data != null) {
+            size += data.length;
+        }
+        return size;
+    }
+
     public boolean synced() {
-        return isAlive()
-        && leader.self.tick.get() <= tickOfNextAckDeadline;
+        return isAlive() && learnerMaster.getCurrentTick() <= tickOfNextAckDeadline;
     }
 
     /**

+ 66 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandlerBean.java

@@ -0,0 +1,66 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.zookeeper.server.quorum;
+
+import org.apache.zookeeper.jmx.MBeanRegistry;
+import org.apache.zookeeper.jmx.ZKMBeanInfo;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.management.ObjectName;
+import java.net.InetSocketAddress;
+import java.net.Socket;
+
+public class LearnerHandlerBean implements LearnerHandlerMXBean, ZKMBeanInfo{
+    private static final Logger LOG = LoggerFactory.getLogger(LearnerHandlerBean.class);
+
+    private final LearnerHandler learnerHandler;
+    private final String remoteAddr;
+
+    public LearnerHandlerBean(final LearnerHandler learnerHandler, final Socket socket) {
+        this.learnerHandler = learnerHandler;
+        InetSocketAddress sockAddr = (InetSocketAddress) socket.getRemoteSocketAddress();
+        if (sockAddr == null) {
+            this.remoteAddr = "Unknown";
+        } else {
+            this.remoteAddr = sockAddr.getAddress().getHostAddress() + ":" + sockAddr.getPort();
+        }
+    }
+
+    @Override
+    public String getName() {
+        return MBeanRegistry.getInstance().makeFullPath("Learner_Connections", ObjectName.quote(remoteAddr),
+                String.format("\"id:%d\"", learnerHandler.getSid()));
+    }
+
+    @Override
+    public boolean isHidden() {
+        return false;
+    }
+
+    @Override
+    public void terminateConnection() {
+        LOG.info("terminating learner handler connection on demand " + toString());
+        learnerHandler.shutdown();
+    }
+
+    @Override
+    public String toString() {
+        return "LearnerHandlerBean{remoteIP=" + remoteAddr + ",ServerId=" + learnerHandler.getSid() + "}";
+    }
+}
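
The address formatting in the `LearnerHandlerBean` constructor can be tried in isolation with the sketch below. `RemoteAddrSketch` and `format` are hypothetical names. The fallback for an unresolved address (where `getAddress()` returns null) is an addition made for this sketch and is not part of the patch.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

public class RemoteAddrSketch {
    /**
     * Formats a remote socket address as "ip:port", mirroring the bean's
     * constructor. Returns "Unknown" when no address is available.
     */
    static String format(SocketAddress addr) {
        if (!(addr instanceof InetSocketAddress)) {
            // Covers null (socket not connected) and non-IP address types.
            return "Unknown";
        }
        InetSocketAddress inet = (InetSocketAddress) addr;
        if (inet.getAddress() == null) {
            // Unresolved address: fall back to the host string (sketch-only addition).
            return inet.getHostString() + ":" + inet.getPort();
        }
        return inet.getAddress().getHostAddress() + ":" + inet.getPort();
    }

    public static void main(String[] args) {
        System.out.println(format(null));
        System.out.println(format(InetSocketAddress.createUnresolved("zk1", 2888)));
    }
}
```

The formatted string is what ends up quoted in the JMX path built by `getName()`, so each learner connection gets its own bean name.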

+ 29 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerHandlerMXBean.java

@@ -0,0 +1,29 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.zookeeper.server.quorum;
+
+/**
+ * This MBean represents a server connection for a learner.
+ */
+public interface LearnerHandlerMXBean {
+    /**
+     * Terminate the connection. The learner will attempt to reconnect to
+     * the leader or to the next ObserverMaster if that feature is enabled.
+     */
+    public void terminateConnection();
+}

+ 196 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/LearnerMaster.java

@@ -0,0 +1,196 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.server.quorum;
+
+import org.apache.zookeeper.server.Request;
+import org.apache.zookeeper.server.ZKDatabase;
+import org.apache.zookeeper.server.quorum.auth.QuorumAuthServer;
+
+import java.io.IOException;
+import java.net.Socket;
+import java.net.SocketAddress;
+
+/**
+ * Interface for keeping learners (Followers and Observers) in sync; implemented by
+ * Leader and ObserverMaster.
+ */
+public interface LearnerMaster {
+    /**
+     * start tracking a learner handler
+     * @param learnerHandler to track
+     */
+    void addLearnerHandler(LearnerHandler learnerHandler);
+
+    /**
+     * stop tracking a learner handler
+     * @param learnerHandler to drop
+     */
+    void removeLearnerHandler(LearnerHandler learnerHandler);
+
+    /**
+     * wait for the leader of the new epoch to be confirmed by followers
+     * @param sid learner id
+     * @param ss
+     * @throws IOException
+     * @throws InterruptedException
+     */
+    void waitForEpochAck(long sid, StateSummary ss) throws IOException, InterruptedException;
+
+    /**
+     * snapshot throttler
+     * @return snapshot throttler
+     */
+    LearnerSnapshotThrottler getLearnerSnapshotThrottler();
+
+    /**
+     * wait for server to start
+     * @throws InterruptedException
+     */
+    void waitForStartup() throws InterruptedException;
+
+    /**
+     * get the first zxid of the next epoch
+     * @param sid learner id
+     * @param lastAcceptedEpoch
+     * @return
+     * @throws InterruptedException
+     * @throws IOException
+     */
+    long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException;
+
+    /**
+     * ZKDatabase
+     * @return ZKDatabase
+     */
+    ZKDatabase getZKDatabase();
+
+    /**
+     * wait for new leader to settle
+     * @param sid id of learner
+     * @param zxid zxid at learner
+     * @throws InterruptedException
+     */
+    void waitForNewLeaderAck(long sid, long zxid) throws InterruptedException;
+
+    /**
+     * last proposed zxid
+     * @return last proposed zxid
+     */
+    long getLastProposed();
+
+    /**
+     * the current tick
+     * @return the current tick
+     */
+    int getCurrentTick();
+
+    /**
+     * time allowed for sync response
+     * @return time allowed for sync response
+     */
+    int syncTimeout();
+
+    /**
+     * next deadline tick marking observer sync (steady state)
+     * @return next deadline tick marking observer sync (steady state)
+     */
+    int getTickOfNextAckDeadline();
+
+    /**
+     * deadline tick marking observer sync (initial)
+     * @return deadline tick marking observer sync (initial)
+     */
+    int getTickOfInitialAckDeadline();
+
+    /**
+     * decrement follower count
+     * @return previous follower count
+     */
+    long getAndDecrementFollowerCounter();
+
+    /**
+     * handle ack packet
+     * @param sid learner id
+     * @param zxid packet zxid
+     * @param localSocketAddress forwarder's address
+     */
+    void processAck(long sid, long zxid, SocketAddress localSocketAddress);
+
+    /**
+     * mark session as alive
+     * @param sess session id
+     * @param to timeout
+     */
+    void touch(long sess, int to);
+
+    /**
+     * handle revalidate packet
+     * @param qp session packet
+     * @param learnerHandler learner
+     * @throws IOException
+     */
+    void revalidateSession(QuorumPacket qp, LearnerHandler learnerHandler) throws IOException;
+
+    /**
+     * proxy request from learner to server
+     * @param si request
+     */
+    void submitLearnerRequest(Request si);
+
+    /**
+     * begin forwarding packets to learner handler
+     * @param learnerHandler learner
+     * @param lastSeenZxid zxid of learner
+     * @return last zxid forwarded
+     */
+    long startForwarding(LearnerHandler learnerHandler, long lastSeenZxid);
+
+    /**
+     * version of current quorum verifier
+     * @return version of current quorum verifier
+     */
+    long getQuorumVerifierVersion();
+
+    /**
+     * look up a server in the current view
+     * @param sid server id
+     * @return server information in the view
+     */
+    String getPeerInfo(long sid);
+
+    /**
+     * identifier of current quorum verifier for new leader
+     * @return identifier of current quorum verifier for new leader
+     */
+    byte[] getQuorumVerifierBytes();
+
+    QuorumAuthServer getQuorumAuthServer();
+
+    /**
+     * registers the handler's bean
+     * @param learnerHandler handler
+     * @param socket connection to learner
+     */
+    void registerLearnerHandlerBean(final LearnerHandler learnerHandler, Socket socket);
+
+    /**
+     * unregisters the handler's bean
+     * @param learnerHandler handler
+     */
+    void unregisterLearnerHandlerBean(final LearnerHandler learnerHandler);
+}
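
The two ack-deadline methods in the interface above differ only in how many ticks they grant. In the ObserverMaster implementation added by this commit, the steady-state deadline is `tick + syncLimit` and the initial deadline is `tick + initLimit + syncLimit`. A minimal worked sketch of that arithmetic follows; the class `TickDeadlines` and the example values are hypothetical and not part of ZooKeeper.

```java
// Hypothetical stand-alone sketch; TickDeadlines is not a ZooKeeper class.
public class TickDeadlines {
    // Steady-state deadline: the learner must ack within syncLimit ticks.
    static int nextAckDeadline(int currentTick, int syncLimit) {
        return currentTick + syncLimit;
    }

    // Initial deadline: a newly connected learner additionally gets initLimit ticks.
    static int initialAckDeadline(int currentTick, int initLimit, int syncLimit) {
        return currentTick + initLimit + syncLimit;
    }

    public static void main(String[] args) {
        int tick = 100, initLimit = 10, syncLimit = 5; // example values only
        System.out.println("steady-state deadline tick: " + nextAckDeadline(tick, syncLimit));
        System.out.println("initial deadline tick: " + initialAckDeadline(tick, initLimit, syncLimit));
    }
}
```

With these example values the steady-state deadline is 5 ticks away, while the initial deadline is 15 ticks away.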

+ 94 - 4
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Observer.java

@@ -20,6 +20,7 @@ package org.apache.zookeeper.server.quorum;
 
 import java.io.IOException;
 import java.nio.ByteBuffer;
+import java.util.concurrent.atomic.AtomicReference;
 
 import org.apache.jute.Record;
 import org.apache.zookeeper.server.ObserverBean;
@@ -29,6 +30,9 @@ import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
 import org.apache.zookeeper.server.util.SerializeUtils;
 import org.apache.zookeeper.txn.SetDataTxn;
 import org.apache.zookeeper.txn.TxnHeader;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
 
 /**
  * Observers are peers that do not take part in the atomic broadcast protocol.
@@ -41,6 +45,31 @@ import org.apache.zookeeper.txn.TxnHeader;
  */
 public class Observer extends Learner{
 
+    private static final Logger LOG = LoggerFactory.getLogger(Observer.class);
+
+    /**
+     * When an observer loses its connection to the leader, it waits a random
+     * interval between 0 and the specified value (in milliseconds) before
+     * trying to reconnect, so that the entire observer fleet does not run
+     * leader election and reconnect to the leader at once. Defaults to zero.
+     */
+    public static final String OBSERVER_RECONNECT_DELAY_MS =
+            "zookeeper.observer.reconnectDelayMs";
+
+    private static final long reconnectDelayMs;
+
+    static {
+        reconnectDelayMs = Long.getLong(OBSERVER_RECONNECT_DELAY_MS, 0);
+        LOG.info(OBSERVER_RECONNECT_DELAY_MS + " = " + reconnectDelayMs);
+    }
+
+    /**
+     * next learner master to try, when specified
+     */
+    private final static AtomicReference<QuorumPeer.QuorumServer> nextLearnerMaster = new AtomicReference<>();
+
+    private QuorumPeer.QuorumServer currentLearnerMaster = null;
+
     Observer(QuorumPeer self,ObserverZooKeeperServer observerZooKeeperServer) {
         this.self = self;
         this.zk=observerZooKeeperServer;
@@ -63,17 +92,16 @@ public class Observer extends Learner{
         zk.registerJMX(new ObserverBean(this, zk), self.jmxLocalPeerBean);
 
         try {
-            QuorumServer leaderServer = findLeader();
-            LOG.info("Observing " + leaderServer.addr);
+            QuorumServer master = findLearnerMaster();
             try {
-                connectToLeader(leaderServer.addr, leaderServer.hostname);
+                connectToLeader(master.addr, master.hostname);
                 long newLeaderZxid = registerWithLeader(Leader.OBSERVERINFO);
                 if (self.isReconfigStateChange())
                    throw new Exception("learned about role change");
  
                 syncWithLeader(newLeaderZxid);
                 QuorumPacket qp = new QuorumPacket();
-                while (this.isRunning()) {
+                while (this.isRunning() && nextLearnerMaster.get() == null) {
                     readPacket(qp);
                     processPacket(qp);
                 }
@@ -89,10 +117,29 @@ public class Observer extends Learner{
                 pendingRevalidations.clear();
             }
         } finally {
+            currentLearnerMaster = null;
             zk.unregisterJMX(this);
         }
     }
 
+    private QuorumServer findLearnerMaster() {
+        QuorumPeer.QuorumServer prescribedLearnerMaster =  nextLearnerMaster.getAndSet(null);
+        if (prescribedLearnerMaster != null && self.validateLearnerMaster(Long.toString(prescribedLearnerMaster.id)) == null) {
+            LOG.warn("requested next learner master {} is no longer valid", prescribedLearnerMaster);
+            prescribedLearnerMaster = null;
+        }
+        final QuorumPeer.QuorumServer master = (prescribedLearnerMaster == null) ?
+                self.findLearnerMaster(findLeader()) :
+                prescribedLearnerMaster;
+        currentLearnerMaster = master;
+        if (master == null) {
+            LOG.warn("No learner master found");
+        } else {
+            LOG.info("Observing new leader sid={} addr={}", master.id, master.addr);
+        }
+        return master;
+    }
+
     /**
      * Controls the response of an observer to the receipt of a quorumpacket
      * @param qp
@@ -162,5 +209,48 @@ public class Observer extends Learner{
         LOG.info("shutdown called", new Exception("shutdown Observer"));
         super.shutdown();
     }
+
+    static void waitForReconnectDelay() {
+        if (reconnectDelayMs > 0) {
+            long randomDelay = (long) (reconnectDelayMs * Math.random());
+            LOG.info("Waiting for " + randomDelay
+                    + " ms before reconnecting with the leader");
+            try {
+                Thread.sleep(randomDelay);
+            } catch (InterruptedException e) {
+                LOG.warn("Interrupted while waiting: " + e.getMessage());
+            }
+        }
+    }
+
+    public long getLearnerMasterId() {
+        QuorumPeer.QuorumServer current = currentLearnerMaster;
+        return current == null ? -1 : current.id;
+    }
+
+    /**
+     * Prompts the Observer to disconnect from its current learner master and reconnect
+     * to the specified server. If that connection attempt fails, the Observer will
+     * fail over to the next available learner master.
+     */
+    public boolean setLearnerMaster(String learnerMaster) {
+        final QuorumPeer.QuorumServer server = self.validateLearnerMaster(learnerMaster);
+        if (server == null) {
+            return false;
+        } else if (server.equals(currentLearnerMaster)) {
+            LOG.info("Already connected to requested learner master sid={} addr={}",
+                    server.id, server.addr);
+            return true;
+        } else {
+            LOG.info("Requesting disconnect and reconnect to new learner master sid={} addr={}",
+                    server.id, server.addr);
+            nextLearnerMaster.set(server);
+            return true;
+        }
+    }
+
+    public QuorumPeer.QuorumServer getCurrentLearnerMaster() {
+        return currentLearnerMaster;
+    }
 }
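
The reconnect delay in `waitForReconnectDelay()` is a uniform random jitter: each observer sleeps for a value drawn from zero up to (but not including) the configured maximum. The sketch below isolates that computation so it can be tested. It is an illustration under assumptions, not the commit's code: the class `ReconnectJitter` and its method are hypothetical, and it takes a `java.util.Random` argument in place of the `Math.random()` call used in the diff so the result is reproducible.

```java
import java.util.Random;

// Hypothetical sketch of the jitter computation; not a ZooKeeper class.
public class ReconnectJitter {
    // Returns a delay in [0, maxDelayMs), or 0 when jitter is disabled (maxDelayMs <= 0).
    static long jitteredDelayMs(long maxDelayMs, Random random) {
        if (maxDelayMs <= 0) {
            return 0L;
        }
        // nextDouble() is in [0.0, 1.0), so the product stays below maxDelayMs.
        return (long) (maxDelayMs * random.nextDouble());
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 3; i++) {
            System.out.println("would wait " + jitteredDelayMs(5000, random) + " ms");
        }
    }
}
```

Because every observer draws its own delay, a fleet that loses its learner master at the same moment spreads its reconnect attempts across the whole window.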
 

+ 12 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/ObserverMXBean.java

@@ -34,4 +34,16 @@ public interface ObserverMXBean extends ZooKeeperServerMXBean {
      * @return socket address
      */
     public String getQuorumAddress();
+
+    /**
+     * @return address of the current learner master
+     */
+    public String getLearnerMaster();
+
+    /**
+     * requests the Observer switch to a new learner master
+     *
+     * @param learnerMaster address of the desired learner master
+     */
+    public void setLearnerMaster(String learnerMaster);
 }
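
The switch requested through `setLearnerMaster` is handed to the Observer's packet loop through an `AtomicReference`: the management thread stores the target, the loop exits once the reference is non-null, and the next connection attempt consumes the request with `getAndSet(null)`. The sketch below shows that hand-off in isolation. The class `MasterHandoff` and its method names are hypothetical, and servers are represented as plain strings rather than `QuorumServer` objects.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the hand-off pattern; not a ZooKeeper class.
public class MasterHandoff {
    // Pending switch request; null means "stay on the current master".
    private final AtomicReference<String> next = new AtomicReference<>();
    private volatile String current;

    // Called from a management thread to request a switch.
    void requestSwitch(String target) {
        next.set(target);
    }

    // Loop condition for the packet-reading thread.
    boolean shouldKeepReading() {
        return next.get() == null;
    }

    // Called when (re)connecting: consume a pending request, else use the fallback.
    String pickMaster(String fallback) {
        String requested = next.getAndSet(null);
        current = (requested != null) ? requested : fallback;
        return current;
    }

    public static void main(String[] args) {
        MasterHandoff handoff = new MasterHandoff();
        System.out.println(handoff.pickMaster("server-1"));
        handoff.requestSwitch("server-2");
        System.out.println(handoff.shouldKeepReading());
        System.out.println(handoff.pickMaster("server-1"));
    }
}
```

Consuming the request with `getAndSet(null)` means a request is honored at most once. If the requested server turns out to be unreachable, the following attempt falls back to the normal selection.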

+ 532 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/ObserverMaster.java

@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.server.quorum;
+
+import org.apache.zookeeper.jmx.MBeanRegistry;
+import org.apache.zookeeper.server.Request;
+import org.apache.zookeeper.server.ZKDatabase;
+
+import java.io.BufferedInputStream;
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.net.ServerSocket;
+import java.net.Socket;
+import java.net.SocketAddress;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.Set;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.Executors;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.zookeeper.server.quorum.auth.QuorumAuthServer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Used by Followers to host Observers. This reduces the network load on the Leader process by pushing
+ * the responsibility for keeping Observers in sync off the leading peer.
+ *
+ * It is expected that Observers will continue to perform the initial vetting of clients and requests.
+ * Observers send the request to the follower where it is received by an ObserverMaster.
+ *
+ * The ObserverMaster forwards a copy of the request to the ensemble Leader and inserts it into its own
+ * request processor pipeline where it can be matched with the response when it comes back. All commits received
+ * from the Leader will be forwarded along to every Learner connected to the ObserverMaster.
+ *
+ * New Learners connecting to a Follower will receive a LearnerHandler object and be party to its syncing logic
+ * to be brought up to date.
+ *
+ * The logic is quite a bit simpler than the corresponding logic in Leader because it only hosts observers.
+ */
+public class ObserverMaster implements LearnerMaster, Runnable {
+    private static final Logger LOG = LoggerFactory.getLogger(ObserverMaster.class);
+
+    //Follower counter
+    private final AtomicLong followerCounter = new AtomicLong(-1);
+
+    private QuorumPeer self;
+    private FollowerZooKeeperServer zks;
+    private int port;
+
+    private Set<LearnerHandler> activeObservers = Collections.newSetFromMap(new ConcurrentHashMap<LearnerHandler,Boolean>());
+
+    private final ConcurrentHashMap<LearnerHandler, LearnerHandlerBean> connectionBeans = new ConcurrentHashMap<>();
+
+    /**
+     * we want to keep a log of past txns so that observers can sync up with us when they connect,
+     * but we can't keep everything in memory, so this limits how much memory will be dedicated
+     * to keeping recent txns.
+     */
+    private final static int PKTS_SIZE_LIMIT = 32 * 1024 * 1024;
+    private static volatile int pktsSizeLimit = Integer.getInteger("zookeeper.observerMaster.sizeLimit", PKTS_SIZE_LIMIT);
+    private ConcurrentLinkedQueue<QuorumPacket> proposedPkts = new ConcurrentLinkedQueue<>();
+    private ConcurrentLinkedQueue<QuorumPacket> committedPkts = new ConcurrentLinkedQueue<>();
+    private int pktsSize = 0;
+
+    private long lastProposedZxid;
+
+    // ensure ordering of revalidations returned to this learner
+    private final Object revalidateSessionLock = new Object();
+
+    // Throttle when there are too many concurrent snapshots being sent to observers
+    private static final String MAX_CONCURRENT_SNAPSHOTS = "zookeeper.leader.maxConcurrentSnapshots";
+    private static final int maxConcurrentSnapshots;
+
+    private static final String MAX_CONCURRENT_DIFFS = "zookeeper.leader.maxConcurrentDiffs";
+    private static final int maxConcurrentDiffs;
+    static {
+        maxConcurrentSnapshots = Integer.getInteger(MAX_CONCURRENT_SNAPSHOTS, 10);
+        LOG.info(MAX_CONCURRENT_SNAPSHOTS + " = " + maxConcurrentSnapshots);
+
+        maxConcurrentDiffs = Integer.getInteger(MAX_CONCURRENT_DIFFS, 100);
+        LOG.info(MAX_CONCURRENT_DIFFS + " = " + maxConcurrentDiffs);
+    }
+
+    private final ConcurrentLinkedQueue<Revalidation> pendingRevalidations = new ConcurrentLinkedQueue<>();
+    static class Revalidation {
+        public final long sessionId;
+        public final int timeout;
+        public final LearnerHandler handler;
+
+        Revalidation(final Long sessionId, final int timeout, final LearnerHandler handler) {
+            this.sessionId = sessionId;
+            this.timeout = timeout;
+            this.handler = handler;
+        }
+
+        @Override
+        public boolean equals(Object o) {
+            if (this == o) return true;
+            if (o == null || getClass() != o.getClass()) return false;
+
+            final Revalidation that = (Revalidation) o;
+            return sessionId == that.sessionId && timeout == that.timeout && handler.equals(that.handler);
+        }
+
+        @Override
+        public int hashCode() {
+            int result = (int) (sessionId ^ (sessionId >>> 32));
+            result = 31 * result + timeout;
+            result = 31 * result + handler.hashCode();
+            return result;
+        }
+    }
+
+    private final LearnerSnapshotThrottler learnerSnapshotThrottler =
+            new LearnerSnapshotThrottler(maxConcurrentSnapshots);
+
+    private Thread thread;
+    private ServerSocket ss;
+    private boolean listenerRunning;
+    private ScheduledExecutorService pinger;
+
+    Runnable ping = new Runnable() {
+        @Override
+        public void run() {
+            for (LearnerHandler lh: activeObservers) {
+                lh.ping();
+            }
+        }
+    };
+
+    ObserverMaster(QuorumPeer self, FollowerZooKeeperServer zks, int port) {
+        this.self = self;
+        this.zks = zks;
+        this.port = port;
+    }
+
+    @Override
+    public void addLearnerHandler(LearnerHandler learnerHandler) {
+        if (!listenerRunning) {
+            throw new RuntimeException("ObserverMaster is not running");
+        }
+    }
+
+    @Override
+    public void removeLearnerHandler(LearnerHandler learnerHandler) {
+        activeObservers.remove(learnerHandler);
+    }
+
+    @Override
+    public int syncTimeout() {
+        return self.getSyncLimit() * self.getTickTime();
+    }
+
+    @Override
+    public int getTickOfNextAckDeadline() {
+        return self.tick.get() + self.syncLimit;
+    }
+
+    @Override
+    public int getTickOfInitialAckDeadline() {
+        return self.tick.get() + self.initLimit + self.syncLimit;
+    }
+
+    @Override
+    public long getAndDecrementFollowerCounter() {
+        return followerCounter.getAndDecrement();
+    }
+
+    @Override
+    public void waitForEpochAck(long sid, StateSummary ss) throws IOException, InterruptedException {
+        // since this is done by an active follower, we don't need to wait for anything
+    }
+
+    @Override
+    public LearnerSnapshotThrottler getLearnerSnapshotThrottler() {
+        return learnerSnapshotThrottler;
+    }
+
+    @Override
+    public void waitForStartup() throws InterruptedException {
+        // since this is done by an active follower, we don't need to wait for anything
+    }
+
+    @Override
+    synchronized public long getLastProposed() {
+        return lastProposedZxid;
+    }
+
+    @Override
+    public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
+        return self.getCurrentEpoch();
+    }
+
+    @Override
+    public ZKDatabase getZKDatabase() {
+        return zks.getZKDatabase();
+    }
+
+    @Override
+    public void waitForNewLeaderAck(long sid, long zxid) throws InterruptedException {
+        // no need to wait since we are a follower
+    }
+
+    @Override
+    public int getCurrentTick() {
+        return self.tick.get();
+    }
+
+    @Override
+    public void processAck(long sid, long zxid, SocketAddress localSocketAddress) {
+        if ((zxid & 0xffffffffL) == 0) {
+            /*
+             * We no longer process NEWLEADER ack by this method. However,
+             * the learner sends ack back to the leader after it gets UPTODATE
+             * so we just ignore the message.
+             */
+            return;
+        }
+
+        throw new RuntimeException("Observers shouldn't send ACKS ack = " + Long.toHexString(zxid));
+    }
+
+    @Override
+    public void touch(long sess, int to) {
+        zks.getSessionTracker().touchSession(sess, to);
+    }
+
+    boolean revalidateLearnerSession(QuorumPacket qp) throws IOException {
+        ByteArrayInputStream bis = new ByteArrayInputStream(qp.getData());
+        DataInputStream dis = new DataInputStream(bis);
+        long id = dis.readLong();
+        boolean valid = dis.readBoolean();
+        Iterator<Revalidation> itr = pendingRevalidations.iterator();
+        if (!itr.hasNext()) {
+            // not a learner session, handle locally
+            return false;
+        }
+        Revalidation revalidation = itr.next();
+        if (revalidation.sessionId != id) {
+            // not a learner session, handle locally
+            return false;
+        }
+        itr.remove();
+        LearnerHandler learnerHandler = revalidation.handler;
+        // create a copy here as the qp object is reused by the Follower and may be mutated
+        QuorumPacket deepCopy = new QuorumPacket(qp.getType(), qp.getZxid(),
+                Arrays.copyOf(qp.getData(), qp.getData().length),
+                qp.getAuthinfo() == null ? null : new ArrayList<>(qp.getAuthinfo()));
+        learnerHandler.queuePacket(deepCopy);
+        // To keep consistent as leader, touch the session when it's
+        // revalidating the session, only update if it's a valid session.
+        if (valid) {
+            touch(revalidation.sessionId, revalidation.timeout);
+        }
+        return true;
+    }
+
+    @Override
+    public void revalidateSession(QuorumPacket qp, LearnerHandler learnerHandler) throws IOException {
+        ByteArrayInputStream bis = new ByteArrayInputStream(qp.getData());
+        DataInputStream dis = new DataInputStream(bis);
+        long id = dis.readLong();
+        int to = dis.readInt();
+        synchronized (revalidateSessionLock) {
+            pendingRevalidations.add(new Revalidation(id, to, learnerHandler));
+            Learner learner = zks.getLearner();
+            if (learner != null) {
+                learner.writePacket(qp, true);
+            }
+        }
+    }
+
+    @Override
+    public void submitLearnerRequest(Request si) {
+        zks.processObserverRequest(si);
+    }
+
+    @Override
+    synchronized public long startForwarding(LearnerHandler learnerHandler, long lastSeenZxid) {
+        Iterator<QuorumPacket> itr = committedPkts.iterator();
+        if (itr.hasNext()) {
+            QuorumPacket packet = itr.next();
+            if (packet.getZxid() > lastSeenZxid + 1) {
+                LOG.error("LearnerHandler is too far behind ({} < {}), disconnecting {} at {}", Long.toHexString(lastSeenZxid + 1),
+                        Long.toHexString(packet.getZxid()), learnerHandler.getSid(), learnerHandler.getRemoteAddress());
+                learnerHandler.shutdown();
+                return -1;
+            } else if (packet.getZxid() == lastSeenZxid + 1) {
+                learnerHandler.queuePacket(packet);
+            }
+            long queueHeadZxid = packet.getZxid();
+            long queueBytesUsed = LearnerHandler.packetSize(packet);
+            while (itr.hasNext()) {
+                packet = itr.next();
+                if (packet.getZxid() <= lastSeenZxid) {
+                    continue;
+                }
+                learnerHandler.queuePacket(packet);
+                queueBytesUsed += LearnerHandler.packetSize(packet);
+            }
+            LOG.info("finished syncing observer from retained commit queue: sid {}, " +
+                            "queue head 0x{}, queue tail 0x{}, sync position 0x{}, num packets used {}, " +
+                            "num bytes used {}",
+                    learnerHandler.getSid(),
+                    Long.toHexString(queueHeadZxid),
+                    Long.toHexString(packet.getZxid()),
+                    Long.toHexString(lastSeenZxid),
+                    packet.getZxid() - lastSeenZxid,
+                    queueBytesUsed);
+        }
+        activeObservers.add(learnerHandler);
+        return lastProposedZxid;
+    }
+
+    @Override
+    public long getQuorumVerifierVersion() {
+        return self.getQuorumVerifier().getVersion();
+    }
+
+    @Override
+    public String getPeerInfo(long sid) {
+        QuorumPeer.QuorumServer server = self.getView().get(sid);
+        return server == null ? "" : server.toString();
+    }
+
+    @Override
+    public byte[] getQuorumVerifierBytes() {
+        return self.getLastSeenQuorumVerifier().toString().getBytes();
+    }
+
+    @Override
+    public QuorumAuthServer getQuorumAuthServer() {
+        return (self == null) ? null : self.authServer;
+    }
+
+    void proposalReceived(QuorumPacket qp) {
+        proposedPkts.add(new QuorumPacket(Leader.INFORM, qp.getZxid(), qp.getData(), null));
+    }
+
+    private synchronized QuorumPacket removeProposedPacket(long zxid) {
+        QuorumPacket pkt = proposedPkts.peek();
+        if (pkt == null || pkt.getZxid() > zxid) {
+            LOG.debug("ignore missing proposal packet for {}", Long.toHexString(zxid));
+            return null;
+        }
+        if (pkt.getZxid() != zxid) {
+            final String m = String.format("Unexpected proposal packet on commit ack, expected zxid 0x%x got zxid 0x%x",
+                    zxid, pkt.getZxid());
+            LOG.error(m);
+            throw new RuntimeException(m);
+        }
+        proposedPkts.remove();
+        return pkt;
+    }
+
+    private synchronized void cacheCommittedPacket(final QuorumPacket pkt) {
+        committedPkts.add(pkt);
+        pktsSize += LearnerHandler.packetSize(pkt);
+        // remove 5 packets for every one added as we near the size limit
+        for (int i = 0; pktsSize > pktsSizeLimit * 0.8  && i < 5; i++) {
+            QuorumPacket oldPkt = committedPkts.poll();
+            if (oldPkt == null) {
+                pktsSize = 0;
+                break;
+            }
+            pktsSize -= LearnerHandler.packetSize(oldPkt);
+        }
+        // enforce the size limit as a hard cap
+        while (pktsSize > pktsSizeLimit) {
+            QuorumPacket oldPkt = committedPkts.poll();
+            if (oldPkt == null) {
+                pktsSize = 0;
+                break;
+            }
+            pktsSize -= LearnerHandler.packetSize(oldPkt);
+        }
+    }
+
+    private synchronized void sendPacket(final QuorumPacket pkt) {
+        for (LearnerHandler lh: activeObservers) {
+            lh.queuePacket(pkt);
+        }
+        lastProposedZxid = pkt.getZxid();
+    }
+
+    synchronized void proposalCommitted(long zxid) {
+        QuorumPacket pkt = removeProposedPacket(zxid);
+        if (pkt == null) {
+            return;
+        }
+        cacheCommittedPacket(pkt);
+        sendPacket(pkt);
+    }
+
+    synchronized void informAndActivate(long zxid, long suggestedLeaderId) {
+        QuorumPacket pkt = removeProposedPacket(zxid);
+        if (pkt == null) {
+            return;
+        }
+
+        // Build the INFORMANDACTIVATE packet
+        QuorumPacket informAndActivateQP = Leader.buildInformAndActivePacket(
+                zxid, suggestedLeaderId, pkt.getData());
+        cacheCommittedPacket(informAndActivateQP);
+        sendPacket(informAndActivateQP);
+    }
+
+    synchronized public void start() throws IOException {
+        if (thread != null && thread.isAlive()) {
+            return;
+        }
+        listenerRunning = true;
+        int backlog = 10; // dog science
+        if (self.shouldUsePortUnification() || self.isSslQuorum()) {
+            boolean allowInsecureConnection = self.shouldUsePortUnification();
+            if (self.getQuorumListenOnAllIPs()) {
+                ss = new UnifiedServerSocket(self.getX509Util(), allowInsecureConnection, port, backlog);
+            } else {
+                ss = new UnifiedServerSocket(
+                        self.getX509Util(),
+                        allowInsecureConnection,
+                        port,
+                        backlog,
+                        self.getQuorumAddress().getAddress());
+            }
+        } else {
+            if (self.getQuorumListenOnAllIPs()) {
+                ss = new ServerSocket(port, backlog);
+            } else {
+                ss = new ServerSocket(port, backlog, self.getQuorumAddress().getAddress());
+            }
+        }
+        thread = new Thread(this, "ObserverMaster");
+        thread.start();
+        pinger = Executors.newSingleThreadScheduledExecutor();
+        pinger.scheduleAtFixedRate(ping, self.tickTime /2, self.tickTime/2, TimeUnit.MILLISECONDS);
+    }
+
+    public void run() {
+        while (listenerRunning) {
+            try {
+                Socket s = ss.accept();
+                // start with the initLimit, once the ack is processed
+                // in LearnerHandler switch to the syncLimit
+                s.setSoTimeout(self.tickTime * self.initLimit);
+                BufferedInputStream is = new BufferedInputStream(s.getInputStream());
+                LearnerHandler lh = new LearnerHandler(s, is, this);
+                lh.start();
+            } catch (Exception e) {
+                if (listenerRunning) {
+                    LOG.debug("Ignoring accept exception (maybe client closed)", e);
+                } else {
+                    LOG.debug("Ignoring accept exception (maybe shutting down)", e);
+                }
+            }
+        }
+        /*
+         * we don't need to close ss because we only got here because listenerRunning is
+         * false and that is set and then ss is closed() in stop()
+         */
+    }
+
+    synchronized public void stop() {
+        listenerRunning = false;
+        if (pinger != null) {
+            pinger.shutdownNow();
+        }
+        if (ss != null) {
+            try {
+                ss.close();
+            } catch (IOException e) {
+                LOG.warn("Failed to close ObserverMaster server socket", e);
+            }
+        }
+        for (LearnerHandler lh: activeObservers) {
+            lh.shutdown();
+        }
+    }
+
+    int getNumActiveObservers() {
+        return activeObservers.size();
+    }
+
+    int getPktsSizeLimit() {
+        return pktsSizeLimit;
+    }
+
+    static void setPktsSizeLimit(final int sizeLimit) {
+        pktsSizeLimit = sizeLimit;
+    }
+
+
+    @Override
+    public void registerLearnerHandlerBean(final LearnerHandler learnerHandler, Socket socket) {
+        LearnerHandlerBean bean = new LearnerHandlerBean(learnerHandler, socket);
+        if (zks.registerJMX(bean)) {
+            connectionBeans.put(learnerHandler, bean);
+        }
+    }
+
+    @Override
+    public void unregisterLearnerHandlerBean(final LearnerHandler learnerHandler) {
+        LearnerHandlerBean bean = connectionBeans.remove(learnerHandler);
+        if (bean != null){
+            MBeanRegistry.getInstance().unregister(bean);
+        }
+    }
+}
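
The retention policy in `cacheCommittedPacket` has two stages: once the cache exceeds 80% of the size limit, each insert evicts up to five of the oldest packets, and after that a hard cap evicts until the total is back under the limit. The sketch below reproduces that policy on packet sizes alone. It is an illustration under assumptions: the class `BoundedPacketCache` is hypothetical, it stores only integer sizes instead of `QuorumPacket` objects, and it omits the synchronization used in the commit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the two-stage eviction policy; not a ZooKeeper class.
public class BoundedPacketCache {
    // Each entry is the size in bytes of one cached packet, oldest first.
    private final Deque<Integer> sizes = new ArrayDeque<>();
    private final int sizeLimit;
    private int totalSize = 0;

    BoundedPacketCache(int sizeLimit) {
        this.sizeLimit = sizeLimit;
    }

    void add(int packetSize) {
        sizes.addLast(packetSize);
        totalSize += packetSize;
        // Soft limit: above 80% of the cap, evict up to five old packets per insert.
        for (int i = 0; totalSize > sizeLimit * 0.8 && i < 5; i++) {
            evictOldest();
        }
        // Hard cap: keep evicting until the total is within the limit.
        while (totalSize > sizeLimit) {
            evictOldest();
        }
    }

    private void evictOldest() {
        Integer old = sizes.pollFirst();
        if (old == null) {
            totalSize = 0; // nothing left to evict; reset the counter
            return;
        }
        totalSize -= old;
    }

    int totalSize() {
        return totalSize;
    }

    int count() {
        return sizes.size();
    }

    public static void main(String[] args) {
        BoundedPacketCache cache = new BoundedPacketCache(100);
        for (int i = 0; i < 9; i++) {
            cache.add(10);
        }
        System.out.println(cache.count() + " packets, " + cache.totalSize() + " bytes");
    }
}
```

One consequence of the hard cap is that a single packet larger than the limit evicts everything, including itself. An observer connecting afterwards finds no retained commits to replay from this cache.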

+ 87 - 3
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeer.java

@@ -144,6 +144,16 @@ public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider
         }
     }
 
+    private int observerMasterPort;
+
+    public int getObserverMasterPort() {
+        return observerMasterPort;
+    }
+
+    public void setObserverMasterPort(int observerMasterPort) {
+        this.observerMasterPort = observerMasterPort;
+    }
+
     public static class QuorumServer {
         public InetSocketAddress addr = null;
 
@@ -1231,8 +1241,14 @@ public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider
                         LOG.warn("Unexpected exception",e );
                     } finally {
                         observer.shutdown();
-                        setObserver(null);  
-                       updateServerState();
+                        setObserver(null);
+                        updateServerState();
+
+                        // Add delay jitter before we switch to LOOKING
+                        // state to reduce the load of ObserverMaster
+                        if (isRunning()) {
+                            Observer.waitForReconnectDelay();
+                        }
                     }
                     break;
                 case FOLLOWING:
@@ -1651,6 +1667,7 @@ public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider
             if (qs != null) {
                 setAddrs(qs.addr, qs.electionAddr, qs.clientAddr);
             }
+            updateObserverMasterList();
             return prevQV;
         }
     }
@@ -1989,7 +2006,74 @@ public class QuorumPeer extends ZooKeeperThread implements QuorumStats.Provider
         }
     }
 
-   private boolean updateLearnerType(QuorumVerifier newQV) {        
+    private ArrayList<QuorumServer> observerMasters = new ArrayList<>();
+    private void updateObserverMasterList() {
+        if (observerMasterPort <= 0) {
+            return; // observer masters not enabled
+        }
+        observerMasters.clear();
+        StringBuilder sb = new StringBuilder();
+        for (QuorumServer server : quorumVerifier.getVotingMembers().values()) {
+            InetSocketAddress addr = new InetSocketAddress(server.addr.getAddress(), observerMasterPort);
+            observerMasters.add(new QuorumServer(server.id, addr));
+            sb.append(addr).append(",");
+        }
+        LOG.info("Updated learner master list to be {}", sb.toString());
+        Collections.shuffle(observerMasters);
+    }
+
+    private boolean useObserverMasters() {
+        return getLearnerType() == LearnerType.OBSERVER && observerMasters.size() > 0;
+    }
+
+    private int nextObserverMaster = 0;
+    private QuorumServer nextObserverMaster() {
+        if (nextObserverMaster >= observerMasters.size()) {
+            nextObserverMaster = 0;
+        }
+        return observerMasters.get(nextObserverMaster++);
+    }
+
+    QuorumServer findLearnerMaster(QuorumServer leader) {
+        return useObserverMasters() ? nextObserverMaster() : leader;
+    }
+
+    /**
+     * Vet a given learner master's information.
+     * Allows specification by server id, ip only, or ip and port
+     */
+    QuorumServer validateLearnerMaster(String desiredMaster) {
+        if (useObserverMasters()) {
+            Long sid;
+            try {
+                sid = Long.parseLong(desiredMaster);
+            } catch (NumberFormatException e) {
+                sid = null;
+            }
+            for (QuorumServer server : observerMasters) {
+                if (sid == null) {
+                    String serverAddr = server.addr.getAddress().getHostAddress() + ':' + server.addr.getPort();
+                    if (serverAddr.startsWith(desiredMaster)) {
+                        return server;
+                    }
+                } else {
+                    if (sid.equals(server.id)) {
+                        return server;
+                    }
+                }
+            }
+            if (sid == null) {
+                LOG.info("could not find learner master address={}", desiredMaster);
+            } else {
+                LOG.warn("could not find learner master sid={}", sid);
+            }
+        } else {
+            LOG.info("cannot validate request, observer masters not enabled");
+        }
+        return null;
+    }
+
+   private boolean updateLearnerType(QuorumVerifier newQV) {
        //check if I'm an observer in new config
        if (newQV.getObservingMembers().containsKey(getId())) {
            if (getLearnerType()!=LearnerType.OBSERVER){
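
The selection and validation logic added to QuorumPeer above can be sketched in isolation. This is a minimal illustration, not the committed implementation: `MasterSketch` and `LearnerMasterSelectorSketch` are hypothetical stand-ins for `QuorumServer` and the `QuorumPeer` methods, and the addresses and port are made up. It shows the round-robin wrap-around of `nextObserverMaster()` and the three accepted forms in `validateLearnerMaster()` (server id, IP only, IP and port).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for QuorumServer: only the fields this sketch needs.
final class MasterSketch {
    final long id;
    final String hostAddress;
    final int port;

    MasterSketch(long id, String hostAddress, int port) {
        this.id = id;
        this.hostAddress = hostAddress;
        this.port = port;
    }
}

// Hypothetical class mirroring nextObserverMaster() and validateLearnerMaster().
final class LearnerMasterSelectorSketch {
    private final List<MasterSketch> masters = new ArrayList<>();
    private int next = 0;

    LearnerMasterSelectorSketch(List<MasterSketch> masters) {
        this.masters.addAll(masters);
    }

    // Round-robin: wrap to the start once the index runs off the end.
    // Assumes the list is non-empty, as useObserverMasters() guarantees in the diff.
    MasterSketch nextMaster() {
        if (next >= masters.size()) {
            next = 0;
        }
        return masters.get(next++);
    }

    // Accepts a server id ("2"), an IP ("10.0.0.2"), or IP and port ("10.0.0.2:2191").
    MasterSketch validate(String desired) {
        Long sid;
        try {
            sid = Long.parseLong(desired);
        } catch (NumberFormatException e) {
            sid = null; // not numeric, so treat it as an address prefix
        }
        for (MasterSketch m : masters) {
            if (sid == null) {
                String addr = m.hostAddress + ":" + m.port;
                if (addr.startsWith(desired)) {
                    return m;
                }
            } else if (sid.longValue() == m.id) {
                return m;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LearnerMasterSelectorSketch s = new LearnerMasterSelectorSketch(Arrays.asList(
                new MasterSketch(1, "10.0.0.1", 2191),
                new MasterSketch(2, "10.0.0.2", 2191)));
        System.out.println(s.nextMaster().id);
        System.out.println(s.validate("10.0.0.2:2191").id);
    }
}
```

Because the address form is matched with `startsWith`, an IP-only request such as `10.0.0.1` would also match a server at `10.0.0.10`; whichever entry comes first in the shuffled list wins. Passing the full `ip:port` form avoids that ambiguity.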

+ 12 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerConfig.java

@@ -71,6 +71,7 @@ public class QuorumPeerConfig {
     protected InetSocketAddress secureClientPortAddress;
     protected boolean sslQuorum = false;
     protected boolean shouldUsePortUnification = false;
+    protected int observerMasterPort;
     protected File dataDir;
     protected File dataLogDir;
     protected String dynamicConfigFileStr = null;
@@ -239,6 +240,7 @@ public class QuorumPeerConfig {
     throws IOException, ConfigException {
         int clientPort = 0;
         int secureClientPort = 0;
+        int observerMasterPort = 0;
         String clientPortAddress = null;
         String secureClientPortAddress = null;
         VerifyingFileFactory vff = new VerifyingFileFactory.Builder(LOG).warnForRelativePath().build();
@@ -261,6 +263,8 @@ public class QuorumPeerConfig {
                 secureClientPort = Integer.parseInt(value);
             } else if (key.equals("secureClientPortAddress")){
                 secureClientPortAddress = value.trim();
+            } else if (key.equals("observerMasterPort")) {
+                observerMasterPort = Integer.parseInt(value);
             } else if (key.equals("tickTime")) {
                 tickTime = Integer.parseInt(value);
             } else if (key.equals("maxClientCnxns")) {
@@ -412,6 +416,13 @@ public class QuorumPeerConfig {
             configureSSLAuth();
         }
 
+        if (observerMasterPort <= 0) {
+            LOG.info("observerMasterPort is not set");
+        } else {
+            this.observerMasterPort = observerMasterPort;
+            LOG.info("observerMasterPort is {}", observerMasterPort);
+        }
+
         if (tickTime == 0) {
             throw new IllegalArgumentException("tickTime is not set");
         }
@@ -754,6 +765,7 @@ public class QuorumPeerConfig {
 
     public InetSocketAddress getClientPortAddress() { return clientPortAddress; }
     public InetSocketAddress getSecureClientPortAddress() { return secureClientPortAddress; }
+    public int getObserverMasterPort() { return observerMasterPort; }
     public File getDataDir() { return dataDir; }
     public File getDataLogDir() { return dataLogDir; }
     public int getTickTime() { return tickTime; }
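
The handling of the new key in QuorumPeerConfig can be summarised as: parse `observerMasterPort` as an integer, and treat an absent or non-positive value as "not set". The sketch below reproduces that rule with `java.util.Properties` only; `ObserverMasterPortSketch` is a hypothetical helper, the port 2191 is an arbitrary example value, and the real class parses many more keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper illustrating the observerMasterPort rule from the diff.
final class ObserverMasterPortSketch {

    // Returns the configured port, or 0 when the key is absent or not positive.
    static int parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            // A StringReader should not fail; rethrow unchecked to keep the sketch simple.
            throw new UncheckedIOException(e);
        }
        int port = Integer.parseInt(props.getProperty("observerMasterPort", "0").trim());
        return port > 0 ? port : 0;
    }

    public static void main(String[] args) {
        System.out.println(parse("tickTime=2000\nobserverMasterPort=2191\n"));
        System.out.println(parse("tickTime=2000\n"));
    }
}
```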

+ 1 - 0
zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumPeerMain.java

@@ -188,6 +188,7 @@ public class QuorumPeerMain {
           quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
           quorumPeer.setInitLimit(config.getInitLimit());
           quorumPeer.setSyncLimit(config.getSyncLimit());
+          quorumPeer.setObserverMasterPort(config.getObserverMasterPort());
           quorumPeer.setConfigFileName(config.getConfigFilename());
           quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
           quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);

+ 26 - 0
zookeeper-server/src/test/java/org/apache/zookeeper/ZKTestCase.java

@@ -18,6 +18,7 @@
 
 package org.apache.zookeeper;
 
+import org.junit.Assert;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.junit.Rule;
@@ -75,4 +76,29 @@ public class ZKTestCase {
 
     };
 
+    public interface WaitForCondition {
+        /**
+         * @return true when success
+         */
+        boolean evaluate();
+    }
+
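+    /**
+     * Wait for the condition to become true; fail the test if it does not
+     * within the allotted number of polls.
+     * @param msg       error message to print on failure
+     * @param condition condition to evaluate
+     * @param timeout   maximum number of polls, spaced 100 ms apart
+     *                  (a value of 10 waits roughly one second)
+     * @throws InterruptedException if interrupted while sleeping between polls
+     */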
+    public void waitFor(String msg, WaitForCondition condition, int timeout)
+            throws InterruptedException {
+        for (int i = 0; i < timeout; ++i) {
+            if (condition.evaluate()) {
+                return;
+            }
+            Thread.sleep(100);
+        }
+        Assert.fail(msg);
+    }
 }
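
The `waitFor` helper added to ZKTestCase is a plain polling loop: it evaluates the condition, sleeps 100 ms, and repeats up to `timeout` times, so the third argument counts polls rather than seconds. The standalone sketch below shows the same loop without JUnit; `WaitForSketch` and its `Condition` interface are hypothetical names, and unlike the original it returns a boolean and swallows interruption instead of calling `Assert.fail` and propagating `InterruptedException`.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical standalone version of the polling loop in ZKTestCase.waitFor.
final class WaitForSketch {

    interface Condition {
        boolean evaluate();
    }

    // Polls up to maxPolls times, sleeping 100 ms after each failed check.
    // Returns true as soon as the condition holds, false if it never does.
    static boolean waitFor(Condition condition, int maxPolls) {
        for (int i = 0; i < maxPolls; ++i) {
            if (condition.evaluate()) {
                return true;
            }
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The condition becomes true on the third evaluation.
        boolean ok = waitFor(() -> calls.incrementAndGet() >= 3, 10);
        System.out.println(ok + " after " + calls.get() + " polls");
    }
}
```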

+ 77 - 0
zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/DelayRequestProcessor.java

@@ -0,0 +1,77 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.server.quorum;
+
+import org.apache.zookeeper.server.Request;
+import org.apache.zookeeper.server.RequestProcessor;
+
+import java.util.concurrent.LinkedBlockingQueue;
+
+/**
+ * Allows the blocking of the request processor queue on a ZooKeeperServer.
+ *
+ * This is used to simulate arbitrary length delays or to produce delays
+ * in request processing that are maximally inconvenient for a given feature
+ * for the purposes of testing it.
+ */
+public class DelayRequestProcessor implements RequestProcessor {
+
+    private boolean blocking;
+    RequestProcessor next;
+
+    private LinkedBlockingQueue<Request> incomingRequests = new LinkedBlockingQueue<>();
+
+    private DelayRequestProcessor(RequestProcessor next) {
+        this.blocking = true;
+        this.next = next;
+    }
+
+    @Override
+    public void processRequest(Request request) throws RequestProcessorException {
+        if (blocking) {
+            incomingRequests.add(request);
+        } else {
+            next.processRequest(request);
+        }
+    }
+
+    public void submitRequest(Request request) throws RequestProcessorException {
+        next.processRequest(request);
+    }
+
+    @Override
+    public void shutdown() {
+    }
+
+    public void unblockQueue() throws RequestProcessorException {
+        if (blocking) {
+            for (Request request : incomingRequests) {
+                next.processRequest(request);
+            }
+            blocking = false;
+        }
+    }
+
+    public static DelayRequestProcessor injectDelayRequestProcessor(FollowerZooKeeperServer zooKeeperServer) {
+        RequestProcessor finalRequestProcessor = zooKeeperServer.commitProcessor.nextProcessor;
+        DelayRequestProcessor delayRequestProcessor = new DelayRequestProcessor(finalRequestProcessor);
+        zooKeeperServer.commitProcessor.nextProcessor = delayRequestProcessor;
+        return delayRequestProcessor;
+    }
+}
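
The idea behind DelayRequestProcessor is a gate in front of the next stage: while blocked, requests are parked in a queue; once unblocked, the parked requests are forwarded in arrival order and later requests pass straight through. The sketch below is a generic illustration under stated assumptions: `DelayGateSketch` is a hypothetical class, a `Consumer<T>` stands in for the next `RequestProcessor`, and both methods are `synchronized`, which the test class in the diff does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical generic version of the hold-then-release pattern.
final class DelayGateSketch<T> {
    private final LinkedBlockingQueue<T> held = new LinkedBlockingQueue<>();
    private final Consumer<T> next;
    private boolean blocking = true;

    DelayGateSketch(Consumer<T> next) {
        this.next = next;
    }

    // Park the item while blocked; otherwise hand it to the next stage.
    synchronized void process(T item) {
        if (blocking) {
            held.add(item);
        } else {
            next.accept(item);
        }
    }

    // Forward everything that was parked, in arrival order, then open the gate.
    synchronized void unblock() {
        if (blocking) {
            T item;
            while ((item = held.poll()) != null) {
                next.accept(item);
            }
            blocking = false;
        }
    }

    public static void main(String[] args) {
        List<String> delivered = new ArrayList<>();
        DelayGateSketch<String> gate = new DelayGateSketch<>(delivered::add);
        gate.process("a");
        gate.process("b");
        gate.unblock();
        gate.process("c");
        System.out.println(delivered);
    }
}
```

Synchronizing both methods closes the window in which a request could be parked after the drain but before the flag flips; for a single-threaded test harness that guard may be unnecessary.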

+ 28 - 27
zookeeper-server/src/test/java/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java

@@ -39,7 +39,6 @@ import org.apache.zookeeper.server.ZKDatabase;
 import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
 import org.apache.zookeeper.server.quorum.Leader.Proposal;
 import org.apache.zookeeper.server.util.ZxidUtils;
-import org.junit.After;
 import org.junit.Before;
 import org.junit.Test;
 import org.mockito.Matchers;
@@ -127,6 +126,9 @@ public class LearnerHandlerTest extends ZKTestCase {
 
     @Before
     public void setUp() throws Exception {
+        db = new MockZKDatabase(null);
+        sock = mock(Socket.class);
+
         // Intercept when startForwarding is called
         leader = mock(Leader.class);
         when(
@@ -137,10 +139,8 @@ public class LearnerHandlerTest extends ZKTestCase {
                 return 0;
             }
         });
+        when(leader.getZKDatabase()).thenReturn(db);
 
-        sock = mock(Socket.class);
-
-        db = new MockZKDatabase(null);
         learnerHandler = new MockLearnerHandler(sock, leader);
     }
 
@@ -204,7 +204,7 @@ public class LearnerHandlerTest extends ZKTestCase {
         peerZxid = 3;
         db.lastProcessedZxid = 1;
         db.committedLog.clear();
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send TRUNC and forward any packet starting lastProcessedZxid
         assertOpType(Leader.TRUNC, db.lastProcessedZxid, db.lastProcessedZxid);
         reset();
@@ -213,7 +213,7 @@ public class LearnerHandlerTest extends ZKTestCase {
         peerZxid = 1;
         db.lastProcessedZxid = 1;
         db.committedLog.clear();
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF and forward any packet starting lastProcessedZxid
         assertOpType(Leader.DIFF, db.lastProcessedZxid, db.lastProcessedZxid);
         assertEquals(1, learnerHandler.getQueuedPackets().size());
@@ -226,7 +226,7 @@ public class LearnerHandlerTest extends ZKTestCase {
         db.lastProcessedZxid = 1;
         db.committedLog.clear();
         // We send SNAP
-        assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertTrue(learnerHandler.syncFollower(peerZxid, leader));
         assertEquals(0, learnerHandler.getQueuedPackets().size());
         reset();
 
@@ -248,7 +248,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid that we have never seen
         peerZxid = 4;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send TRUNC to 3 and forward any packet starting 5
         assertOpType(Leader.TRUNC, 3, 5);
         // DIFF + 1 proposals + 1 commit
@@ -258,7 +258,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer is within committedLog range
         peerZxid = 2;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF and forward any packet starting lastProcessedZxid
         assertOpType(Leader.DIFF, db.getmaxCommittedLog(),
                 db.getmaxCommittedLog());
@@ -271,7 +271,7 @@ public class LearnerHandlerTest extends ZKTestCase {
         peerZxid = 1;
         db.setSnapshotSizeFactor(-1);
         // We send SNAP
-        assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertTrue(learnerHandler.syncFollower(peerZxid, leader));
         assertEquals(0, learnerHandler.getQueuedPackets().size());
         reset();
     }
@@ -297,7 +297,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid that we have never seen
         peerZxid = 4;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send TRUNC to 3 and forward any packet starting at maxCommittedLog
         assertOpType(Leader.TRUNC, 3, db.getmaxCommittedLog());
         // DIFF + 4 proposals + 4 commit
@@ -307,7 +307,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer zxid is in txnlog range
         peerZxid = 3;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF and forward any packet starting at maxCommittedLog
         assertOpType(Leader.DIFF, db.getmaxCommittedLog(),
                 db.getmaxCommittedLog());
@@ -336,11 +336,12 @@ public class LearnerHandlerTest extends ZKTestCase {
         db.lastProcessedZxid = 7;
         db.txnLog.add(createProposal(2));
         db.txnLog.add(createProposal(3));
+        when(leader.getZKDatabase()).thenReturn(db);
 
         // Peer zxid
         peerZxid = 4;
         assertTrue("Couldn't identify snapshot transfer!",
-                learnerHandler.syncFollower(peerZxid, db, leader));
+                learnerHandler.syncFollower(peerZxid, leader));
         reset();
     }
 
@@ -362,7 +363,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid that we have never seen
         peerZxid = 4;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send TRUNC to 3 and forward any packet starting at
         // lastProcessedZxid
         assertOpType(Leader.TRUNC, 3, db.lastProcessedZxid);
@@ -373,7 +374,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid in txnlog range
         peerZxid = 2;
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF and forward any packet starting at lastProcessedZxid
         assertOpType(Leader.DIFF, db.lastProcessedZxid, db.lastProcessedZxid);
         // DIFF + 4 proposals + 4 commit
@@ -383,7 +384,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer miss the txnlog
         peerZxid = 1;
-        assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertTrue(learnerHandler.syncFollower(peerZxid, leader));
         // We send snap
         assertEquals(0, learnerHandler.getQueuedPackets().size());
         reset();
@@ -414,7 +415,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid that we have never seen
         peerZxid = getZxid(0xf, 4);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send TRUNC to 3 and forward any packet starting at maxCommittedLog
         assertOpType(Leader.TRUNC, getZxid(0xf, 3), db.getmaxCommittedLog());
         // DIFF + 4 proposals + 4 commit
@@ -425,7 +426,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer zxid is in txnlog range
         peerZxid = getZxid(0xf, 3);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF and forward any packet starting at maxCommittedLog
         assertOpType(Leader.DIFF, db.getmaxCommittedLog(),
                 db.getmaxCommittedLog());
@@ -456,13 +457,13 @@ public class LearnerHandlerTest extends ZKTestCase {
         // We should get snap, we can do better here, but the main logic is
         // that we should never send diff if we have never seen any txn older
         // than peer zxid
-        assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertTrue(learnerHandler.syncFollower(peerZxid, leader));
         assertEquals(0, learnerHandler.getQueuedPackets().size());
         reset();
 
         // Peer has zxid of epoch 1
         peerZxid = getZxid(1, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (1, 2) and forward any packet starting at (1, 2)
         assertOpType(Leader.DIFF, getZxid(1, 2), getZxid(1, 2));
         // DIFF + 2 proposals + 2 commit
@@ -472,7 +473,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 2, so it is already sync
         peerZxid = getZxid(2, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (2, 0) and forward any packet starting at (2, 0)
         assertOpType(Leader.DIFF, getZxid(2, 0), getZxid(2, 0));
         // DIFF only
@@ -498,7 +499,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 3
         peerZxid = getZxid(3, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (6,0) and forward any packet starting at (4,1)
         assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(4, 1));
         // DIFF + 1 proposals + 1 commit
@@ -508,7 +509,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 4
         peerZxid = getZxid(4, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (6,0) and forward any packet starting at (4,1)
         assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(4, 1));
         // DIFF + 1 proposals + 1 commit
@@ -518,7 +519,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 5
         peerZxid = getZxid(5, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (6,0) and forward any packet starting at (5,0)
         assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(5, 0));
         // DIFF only
@@ -527,7 +528,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 6
         peerZxid = getZxid(6, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (6,0) and forward any packet starting at (6, 0)
         assertOpType(Leader.DIFF, getZxid(6, 0), getZxid(6, 0));
         // DIFF only
@@ -558,7 +559,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid of epoch 1
         peerZxid = getZxid(1, 0);
-        assertFalse(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertFalse(learnerHandler.syncFollower(peerZxid, leader));
         // We send DIFF to (1, 2) and forward any packet starting at (1, 2)
         assertOpType(Leader.DIFF, getZxid(1, 2), getZxid(1, 2));
         // DIFF + 2 proposals + 2 commit
@@ -585,7 +586,7 @@ public class LearnerHandlerTest extends ZKTestCase {
 
         // Peer has zxid (3, 1)
         peerZxid = getZxid(3, 1);
-        assertTrue(learnerHandler.syncFollower(peerZxid, db, leader));
+        assertTrue(learnerHandler.syncFollower(peerZxid, leader));
         assertEquals(0, learnerHandler.getQueuedPackets().size());
         reset();
     }

+ 57 - 48
zookeeper-server/src/test/java/org/apache/zookeeper/server/util/PortForwarder.java

@@ -16,9 +16,6 @@
  * limitations under the License.
  */
 
-/**
- * 
- */
 package org.apache.zookeeper.server.util;
 
 import java.io.IOException;
@@ -29,6 +26,8 @@ import java.net.ServerSocket;
 import java.net.Socket;
 import java.net.SocketException;
 import java.net.SocketTimeoutException;
+import java.util.ArrayList;
+import java.util.List;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.TimeUnit;
@@ -95,9 +94,10 @@ public class PortForwarder extends Thread {
         private final OutputStream out;
         private final Socket toClose;
         private final Socket toClose2;
+        private boolean isFinished = false;
 
         PortForwardWorker(Socket toClose, Socket toClose2, InputStream in,
-                OutputStream out) throws IOException {
+                OutputStream out) {
             this.toClose = toClose;
             this.toClose2 = toClose2;
             this.in = in;
@@ -118,50 +118,57 @@ public class PortForwarder extends Thread {
                                 this.out.write(buf, 0, read);
                             } catch (IOException e) {
                                 LOG.warn("exception during write", e);
-                                try {
-                                    toClose.close();
-                                } catch (IOException ex) {
-                                    // ignore
-                                }
-                                try {
-                                    toClose2.close();
-                                } catch (IOException ex) {
-                                    // ignore
-                                }
                                 break;
                             }
+                        } else if (read < 0) {
+                            throw new IOException("read " + read);
                         }
                     } catch (SocketTimeoutException e) {
                         LOG.error("socket timeout", e);
                     }
-                    Thread.sleep(1);
                 }
+                Thread.sleep(1);
             } catch (InterruptedException e) {
                 LOG.warn("Interrupted", e);
-                try {
-                    toClose.close();
-                } catch (IOException ex) {
-                    // ignore
-                }
-                try {
-                    toClose2.close();
-                } catch (IOException ex) {
-                    // ignore silently
-                }
             } catch (SocketException e) {
                 if (!"Socket closed".equals(e.getMessage())) {
                     LOG.error("Unexpected exception", e);
                 }
             } catch (IOException e) {
                 LOG.error("Unexpected exception", e);
+            } finally {
+                shutdown();
             }
             LOG.info("Shutting down forward for " + toClose);
+            isFinished = true;
         }
 
+        boolean waitForShutdown(long timeoutMs) throws InterruptedException {
+            synchronized (this) {
+                if (!isFinished) {
+                   this.wait(timeoutMs);
+                }
+            }
+            return isFinished;
+        }
+
+        public void shutdown() {
+            try {
+                toClose.close();
+            } catch (IOException ex) {
+                // ignore
+            }
+            try {
+                toClose2.close();
+            } catch (IOException ex) {
+                // ignore silently
+            }
+        }
     }
 
     private volatile boolean stopped = false;
-    private ExecutorService workers = Executors.newCachedThreadPool();
+    private ExecutorService workerExecutor = Executors.newCachedThreadPool();
+    private List<PortForwardWorker> workers = new ArrayList<>();
     private ServerSocket serverSocket;
     private final int to;
 
@@ -207,30 +214,31 @@ public class PortForwarder extends Thread {
                             + " to:" + to);
                     sock.setSoTimeout(30000);
                     target.setSoTimeout(30000);
-                    this.workers.execute(new PortForwardWorker(sock, target,
+
+
+                    workers.add(new PortForwardWorker(sock, target,
                             sock.getInputStream(), target.getOutputStream()));
-                    this.workers.execute(new PortForwardWorker(target, sock,
+                    workers.add(new PortForwardWorker(target, sock,
                             target.getInputStream(), sock.getOutputStream()));
-                } catch (SocketTimeoutException e) {               	
-                    LOG.warn("socket timed out local:" 
-                            + (sock != null ? sock.getLocalPort(): "")
-                            + " from:" + (sock != null ? sock.getPort(): "")
-                            + " to:" + to, e);
+                    for (PortForwardWorker worker: workers) {
+                        workerExecutor.submit(worker);
+                    }
+                } catch (SocketTimeoutException e) {
+                    LOG.warn("socket timed out", e);
                 } catch (ConnectException e) {
-                    LOG.warn("connection exception local:"
-                            + (sock != null ? sock.getLocalPort(): "")
-                            + " from:" + (sock != null ? sock.getPort(): "")
+                    LOG.warn("connection exception local:" + sock.getLocalPort()
+                            + " from:" + sock.getPort()
                             + " to:" + to, e);
                     sock.close();
                 } catch (IOException e) {
                     if (!"Socket closed".equals(e.getMessage())) {
-                        LOG.warn("unexpected exception local:" 
-                        		+ (sock != null ? sock.getLocalPort(): "")
-                                + " from:" + (sock != null ? sock.getPort(): "")
-                                + " to:" + to, e);
+                        LOG.warn("unexpected exception local:" + sock.getLocalPort()
+                            + " from:" + sock.getPort()
+                            + " to:" + to, e);
                         throw e;
                     }
                 }
+
             }
         } catch (IOException e) {
             LOG.error("Unexpected exception to:" + to, e);
@@ -242,15 +250,16 @@ public class PortForwarder extends Thread {
     public void shutdown() throws Exception {
         this.stopped = true;
         this.serverSocket.close();
-        this.workers.shutdownNow();
-        try {
-            if (!this.workers.awaitTermination(5, TimeUnit.SECONDS)) {
-                throw new Exception(
-                        "Failed to stop forwarding within 5 seconds");
+        this.join();
+        this.workerExecutor.shutdownNow();
+        for (PortForwardWorker worker: workers) {
+            worker.shutdown();
+        }
+
+        for (PortForwardWorker worker: workers) {
+            if (!worker.waitForShutdown(5000)) {
+                throw new Exception("Failed to stop forwarding within 5 seconds");
             }
-        } catch (InterruptedException e) {
-            throw new Exception("Failed to stop forwarding");
         }
-        this.join();
     }
 }
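
In the PortForwarder hunks above, `waitForShutdown` waits on the worker's monitor, but no matching `notifyAll()` call appears in the lines shown, and `isFinished` is written outside the synchronized block. A caller that arrives before the worker finishes may therefore wait for the full timeout. One common alternative, sketched here as an assumption rather than as the project's code, is to signal completion with a `CountDownLatch`; `WorkerShutdownSketch` is a hypothetical class and its `run()` body is a placeholder for the forwarding loop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical worker that signals completion through a latch.
final class WorkerShutdownSketch implements Runnable {
    private final CountDownLatch finished = new CountDownLatch(1);

    @Override
    public void run() {
        try {
            // Placeholder: the real worker would copy bytes between sockets here.
        } finally {
            // Always signal, even if the loop exits through an exception.
            finished.countDown();
        }
    }

    // Returns true if the worker finished within timeoutMs, false otherwise.
    boolean waitForShutdown(long timeoutMs) {
        try {
            return finished.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        WorkerShutdownSketch worker = new WorkerShutdownSketch();
        new Thread(worker).start();
        System.out.println(worker.waitForShutdown(5000));
    }
}
```

The latch returns immediately once `countDown()` has run, whether the waiter arrives before or after the worker finishes, which removes the need for a separate flag.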

+ 780 - 0
zookeeper-server/src/test/java/org/apache/zookeeper/test/ObserverMasterTest.java

@@ -0,0 +1,780 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.zookeeper.test;
+
+import static org.apache.zookeeper.test.ClientBase.CONNECTION_TIMEOUT;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.zookeeper.AsyncCallback;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.jmx.CommonNames;
+import org.apache.zookeeper.jmx.MBeanRegistry;
+import org.apache.zookeeper.jmx.ZKMBeanInfo;
+import org.apache.zookeeper.server.admin.Commands;
+import org.apache.zookeeper.server.quorum.DelayRequestProcessor;
+import org.apache.zookeeper.server.quorum.FollowerZooKeeperServer;
+import org.apache.zookeeper.server.quorum.QuorumPeerConfig;
+import org.apache.zookeeper.server.quorum.QuorumPeerTestBase;
+import org.apache.zookeeper.server.util.PortForwarder;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.apache.zookeeper.CreateMode;
+import org.apache.zookeeper.KeeperException.ConnectionLossException;
+import org.apache.zookeeper.PortAssignment;
+import org.apache.zookeeper.WatchedEvent;
+import org.apache.zookeeper.Watcher;
+import org.apache.zookeeper.Watcher.Event.KeeperState;
+import org.apache.zookeeper.ZooDefs.Ids;
+import org.apache.zookeeper.ZooKeeper;
+import org.apache.zookeeper.ZooKeeper.States;
+import org.apache.zookeeper.admin.ZooKeeperAdmin;
+import org.junit.Assert;
+import org.junit.Test;
+
+import javax.management.Attribute;
+import javax.management.AttributeNotFoundException;
+import javax.management.InstanceNotFoundException;
+import javax.management.InvalidAttributeValueException;
+import javax.management.MBeanException;
+import javax.management.MalformedObjectNameException;
+import javax.management.ObjectName;
+import javax.management.ReflectionException;
+import javax.management.RuntimeMBeanException;
+
+@RunWith(Parameterized.class)
+public class ObserverMasterTest extends QuorumPeerTestBase implements Watcher{
+    protected static final Logger LOG = LoggerFactory.getLogger(ObserverMasterTest.class);
+
+    public ObserverMasterTest(Boolean testObserverMaster) {
+        this.testObserverMaster = testObserverMaster;
+    }
+
+    @Parameterized.Parameters
+    public static List<Object []> data() { return Arrays.asList(new Object [][] {
+            {Boolean.TRUE},
+            {Boolean.FALSE}});
+    }
+
+    private Boolean testObserverMaster;
+
+    private CountDownLatch latch;
+    ZooKeeper zk;
+    private WatchedEvent lastEvent = null;
+
+    private int CLIENT_PORT_QP1;
+    private int CLIENT_PORT_QP2;
+    private int CLIENT_PORT_OBS;
+    private int OM_PORT;
+    private MainThread q1;
+    private MainThread q2;
+    private MainThread q3;
+
+    private PortForwarder setUp(final int omProxyPort) throws IOException {
+        ClientBase.setupTestEnv();
+
+        final int PORT_QP1 = PortAssignment.unique();
+        final int PORT_QP2 = PortAssignment.unique();
+        final int PORT_OBS = PortAssignment.unique();
+        final int PORT_QP_LE1 = PortAssignment.unique();
+        final int PORT_QP_LE2 = PortAssignment.unique();
+        final int PORT_OBS_LE = PortAssignment.unique();
+
+        CLIENT_PORT_QP1 = PortAssignment.unique();
+        CLIENT_PORT_QP2 = PortAssignment.unique();
+        CLIENT_PORT_OBS = PortAssignment.unique();
+
+        OM_PORT = PortAssignment.unique();
+
+        String quorumCfgSection =
+                "server.1=127.0.0.1:" + (PORT_QP1)
+                        + ":" + (PORT_QP_LE1) + ";" +  CLIENT_PORT_QP1
+                        + "\nserver.2=127.0.0.1:" + (PORT_QP2)
+                        + ":" + (PORT_QP_LE2) + ";" + CLIENT_PORT_QP2
+                        + "\nserver.3=127.0.0.1:"
+                        + (PORT_OBS)+ ":" + (PORT_OBS_LE) + ":observer" + ";" + CLIENT_PORT_OBS;
+        String extraCfgs = testObserverMaster ? String.format("observerMasterPort=%d%n", OM_PORT) : "";
+        String extraCfgsObs = testObserverMaster ? String.format("observerMasterPort=%d%n", omProxyPort <= 0 ? OM_PORT : omProxyPort) : "";
+
+        PortForwarder forwarder = null;
+        if (testObserverMaster && omProxyPort >= 0) {
+            forwarder = new PortForwarder(omProxyPort, OM_PORT);
+        }
+
+        q1 = new MainThread(1, CLIENT_PORT_QP1, quorumCfgSection, extraCfgs);
+        q2 = new MainThread(2, CLIENT_PORT_QP2, quorumCfgSection, extraCfgs);
+        q3 = new MainThread(3, CLIENT_PORT_OBS, quorumCfgSection, extraCfgsObs);
+        q1.start();
+        q2.start();
+        Assert.assertTrue("waiting for server 1 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP1,
+                        CONNECTION_TIMEOUT));
+        Assert.assertTrue("waiting for server 2 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP2,
+                        CONNECTION_TIMEOUT));
+        return forwarder;
+    }
+
+    private void shutdown() throws InterruptedException {
+        LOG.info("Shutting down all servers");
+        zk.close();
+
+        q1.shutdown();
+        q2.shutdown();
+        q3.shutdown();
+
+        Assert.assertTrue("Waiting for server 1 to shut down",
+                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP1,
+                        ClientBase.CONNECTION_TIMEOUT));
+        Assert.assertTrue("Waiting for server 2 to shut down",
+                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP2,
+                        ClientBase.CONNECTION_TIMEOUT));
+        Assert.assertTrue("Waiting for server 3 to shut down",
+                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_OBS,
+                        ClientBase.CONNECTION_TIMEOUT));
+    }
+
+    @Test
+    public void testLaggingObserverMaster() throws Exception {
+        final int OM_PROXY_PORT = PortAssignment.unique();
+        PortForwarder forwarder = setUp(OM_PROXY_PORT);
+
+        // find the leader and observer master
+        int leaderPort;
+        MainThread leader;
+        MainThread follower;
+        if (q1.getQuorumPeer().leader != null) {
+            leaderPort = CLIENT_PORT_QP1;
+            leader = q1;
+            follower = q2;
+        } else if (q2.getQuorumPeer().leader != null) {
+            leaderPort = CLIENT_PORT_QP2;
+            leader = q2;
+            follower = q1;
+        } else {
+            throw new RuntimeException("No leader");
+        }
+
+        // ensure the observer master has commits in the queue before observer sync
+        zk = new ZooKeeper("127.0.0.1:" + leaderPort,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        for (int i = 0; i < 10; i++) {
+            zk.create("/bulk" + i, ("initial data of some size").getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+        }
+        zk.close();
+
+        q3.start();
+        Assert.assertTrue("waiting for server 3 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+
+        latch = new CountDownLatch(1);
+        zk = new ZooKeeper("127.0.0.1:" + leaderPort,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        latch.await();
+        Assert.assertEquals(States.CONNECTED, zk.getState());
+
+        zk.create("/init", "first".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+        final long lastLoggedZxid = leader.getQuorumPeer().getLastLoggedZxid();
+
+        // wait for change to propagate
+        waitFor("Timeout waiting for observer sync", new WaitForCondition() {
+            public boolean evaluate() {
+                return lastLoggedZxid == q3.getQuorumPeer().getLastLoggedZxid();
+            }
+        }, 30);
+
+        // simulate network fault
+        if (forwarder != null) {
+            forwarder.shutdown();
+        }
+
+        for (int i = 0; i < 10; i++) {
+            zk.create("/basic" + i, "second".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
+        }
+
+        DelayRequestProcessor delayRequestProcessor = null;
+        if (testObserverMaster) {
+            FollowerZooKeeperServer followerZooKeeperServer = (FollowerZooKeeperServer) follower.getQuorumPeer().getActiveServer();
+            delayRequestProcessor = DelayRequestProcessor.injectDelayRequestProcessor(followerZooKeeperServer);
+        }
+
+        zk.create("/target1", "third".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+        zk.create("/target2", "third".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+
+        LOG.info("observer zxid " + Long.toHexString(q3.getQuorumPeer().getLastLoggedZxid()) +
+                (testObserverMaster ? " observer master zxid " +
+                        Long.toHexString(follower.getQuorumPeer().getLastLoggedZxid()) : "") +
+                " leader zxid " + Long.toHexString(leader.getQuorumPeer().getLastLoggedZxid()));
+
+        // restore network
+        forwarder = testObserverMaster ? new PortForwarder(OM_PROXY_PORT, OM_PORT) : null;
+
+        Assert.assertTrue("waiting for server 3 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+        Assert.assertNotNull("Leader switched", leader.getQuorumPeer().leader);
+
+        if (delayRequestProcessor != null) {
+            delayRequestProcessor.unblockQueue();
+        }
+
+        latch = new CountDownLatch(1);
+        ZooKeeper obsZk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        latch.await();
+        zk.create("/finalop", "fourth".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+
+        Assert.assertEquals("first", new String(obsZk.getData("/init", null, null)));
+        Assert.assertEquals("third", new String(obsZk.getData("/target1", null, null)));
+
+        obsZk.close();
+        shutdown();
+
+        try {
+            if (forwarder != null) {
+                forwarder.shutdown();
+            }
+        } catch (Exception e) {
+            // ignore
+        }
+    }
+
+    /**
+     * This test ensures two things:
+     * 1. That Observers can successfully proxy requests to the ensemble.
+     * 2. That Observers don't participate in leader elections.
+     * The second is tested by constructing an ensemble where a leader would
+     * be elected if and only if an Observer voted.
+     */
+    @Test
+    public void testObserver() throws Exception {
+        // We expect two notifications before we want to continue
+        latch = new CountDownLatch(2);
+        setUp(-1);
+        q3.start();
+        Assert.assertTrue("waiting for server 3 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+
+        if (testObserverMaster) {
+            int masterPort = q3.getQuorumPeer().observer.getSocket().getPort();
+            LOG.info("port " + masterPort + " " + OM_PORT);
+            Assert.assertEquals("observer failed to connect to observer master", OM_PORT, masterPort);
+        }
+
+        zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        zk.create("/obstest", "test".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+
+        // Assert that commands are getting forwarded correctly
+        Assert.assertEquals("test", new String(zk.getData("/obstest", null, null)));
+
+        // Now check that other commands don't blow everything up
+        zk.sync("/", null, null);
+        zk.setData("/obstest", "test2".getBytes(), -1);
+        zk.getChildren("/", false);
+
+        Assert.assertEquals(States.CONNECTED, zk.getState());
+
+        LOG.info("Shutting down server 2");
+        // Now kill one of the other real servers
+        q2.shutdown();
+
+        Assert.assertTrue("Waiting for server 2 to shut down",
+                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP2,
+                        ClientBase.CONNECTION_TIMEOUT));
+
+        LOG.info("Server 2 down");
+
+        // Now the resulting ensemble shouldn't be quorate
+        latch.await();
+        Assert.assertNotSame("Client is still connected to non-quorate cluster",
+                KeeperState.SyncConnected,lastEvent.getState());
+
+        LOG.info("Latch returned");
+
+        try {
+            Assert.assertNotEquals("Shouldn't get a response when cluster not quorate!",
+                    "test", new String(zk.getData("/obstest", null, null)));
+        }
+        catch (ConnectionLossException c) {
+            LOG.info("Connection loss exception caught - ensemble not quorate (this is expected)");
+        }
+
+        latch = new CountDownLatch(1);
+
+        LOG.info("Restarting server 2");
+
+        // Bring it back
+        //q2 = new MainThread(2, CLIENT_PORT_QP2, quorumCfgSection, extraCfgs);
+        q2.start();
+
+        LOG.info("Waiting for server 2 to come up");
+        Assert.assertTrue("waiting for server 2 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP2,
+                        CONNECTION_TIMEOUT));
+
+        LOG.info("Server 2 started, waiting for latch");
+
+        latch.await();
+        // It's possible our session expired - but this is ok, shows we
+        // were able to talk to the ensemble
+        Assert.assertTrue("Client didn't reconnect to quorate ensemble (state was " +
+                        lastEvent.getState() + ")",
+                (KeeperState.SyncConnected==lastEvent.getState() ||
+                        KeeperState.Expired==lastEvent.getState()));
+
+        LOG.info("perform a revalidation test");
+        int leaderProxyPort = PortAssignment.unique();
+        int obsProxyPort = PortAssignment.unique();
+        int leaderPort = q1.getQuorumPeer().leader == null ? CLIENT_PORT_QP2 : CLIENT_PORT_QP1;
+        PortForwarder leaderPF = new PortForwarder(leaderProxyPort, leaderPort);
+
+        latch = new CountDownLatch(1);
+        ZooKeeper client = new ZooKeeper(String.format("127.0.0.1:%d,127.0.0.1:%d", leaderProxyPort, obsProxyPort),
+                ClientBase.CONNECTION_TIMEOUT, this);
+        latch.await();
+        client.create("/revalidtest", "test".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.EPHEMERAL);
+        Assert.assertNotNull("Read-after-write failed", client.exists("/revalidtest", null));
+
+        latch = new CountDownLatch(2);
+        PortForwarder obsPF = new PortForwarder(obsProxyPort, CLIENT_PORT_OBS);
+        try {
+            leaderPF.shutdown();
+        } catch (Exception e) {
+            // ignore?
+        }
+        latch.await();
+        Assert.assertEquals("test", new String(client.getData("/revalidtest", null, null)));
+        client.close();
+        obsPF.shutdown();
+
+        shutdown();
+    }
+
+    @Test
+    public void testRevalidation() throws Exception {
+        setUp(-1);
+        q3.start();
+        Assert.assertTrue("waiting for server 3 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+        final int leaderProxyPort = PortAssignment.unique();
+        final int obsProxyPort = PortAssignment.unique();
+
+        int leaderPort = q1.getQuorumPeer().leader == null ? CLIENT_PORT_QP2 : CLIENT_PORT_QP1;
+        PortForwarder leaderPF = new PortForwarder(leaderProxyPort, leaderPort);
+
+        latch = new CountDownLatch(1);
+        zk = new ZooKeeper(String.format("127.0.0.1:%d,127.0.0.1:%d", leaderProxyPort, obsProxyPort),
+                ClientBase.CONNECTION_TIMEOUT, this);
+        latch.await();
+        zk.create("/revalidtest", "test".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.EPHEMERAL);
+        Assert.assertNotNull("Read-after-write failed", zk.exists("/revalidtest", null));
+
+        latch = new CountDownLatch(2);
+        PortForwarder obsPF = new PortForwarder(obsProxyPort, CLIENT_PORT_OBS);
+        try {
+            leaderPF.shutdown();
+        } catch (Exception e) {
+            // ignore?
+        }
+        latch.await();
+        Assert.assertEquals("test", new String(zk.getData("/revalidtest", null, null)));
+        obsPF.shutdown();
+
+        shutdown();
+    }
+
+    @Test
+    public void testInOrderCommits() throws Exception {
+        setUp(-1);
+
+        zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_QP1,
+                ClientBase.CONNECTION_TIMEOUT, null);
+        for (int i = 0; i < 10; i++) {
+            zk.create("/bulk" + i, ("Initial data of some size").getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+        }
+        zk.close();
+
+        q3.start();
+        Assert.assertTrue("waiting for observer to be up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+
+        latch = new CountDownLatch(1);
+        zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_QP1,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        latch.await();
+        Assert.assertEquals(States.CONNECTED, zk.getState());
+
+        zk.create("/init", "first".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
+        final long zxid = q1.getQuorumPeer().getLastLoggedZxid();
+
+        // wait for change to propagate
+        waitFor("Timeout waiting for observer sync", new WaitForCondition() {
+            public boolean evaluate() {
+                return zxid == q3.getQuorumPeer().getLastLoggedZxid();
+            }
+        }, 30);
+
+        ZooKeeper obsZk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        int followerPort = q1.getQuorumPeer().leader == null ? CLIENT_PORT_QP1 : CLIENT_PORT_QP2;
+        ZooKeeper fZk = new ZooKeeper("127.0.0.1:" + followerPort,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        final int numTransactions = 10001;
+        CountDownLatch gate = new CountDownLatch(1);
+        CountDownLatch oAsyncLatch = new CountDownLatch(numTransactions);
+        Thread oAsyncWriteThread = new Thread(new AsyncWriter(obsZk, numTransactions, true, oAsyncLatch, "/obs", gate));
+        CountDownLatch fAsyncLatch = new CountDownLatch(numTransactions);
+        Thread fAsyncWriteThread = new Thread(new AsyncWriter(fZk, numTransactions, true, fAsyncLatch, "/follower", gate));
+
+        LOG.info("ASYNC WRITES");
+        oAsyncWriteThread.start();
+        fAsyncWriteThread.start();
+        gate.countDown();
+
+        oAsyncLatch.await();
+        fAsyncLatch.await();
+
+        oAsyncWriteThread.join(ClientBase.CONNECTION_TIMEOUT);
+        if (oAsyncWriteThread.isAlive()) {
+            LOG.error("oAsyncWriteThread is still alive");
+        }
+        fAsyncWriteThread.join(ClientBase.CONNECTION_TIMEOUT);
+        if (fAsyncWriteThread.isAlive()) {
+            LOG.error("fAsyncWriteThread is still alive");
+        }
+
+        obsZk.close();
+        fZk.close();
+
+        shutdown();
+    }
+
+    @Test
+    public void testAdminCommands() throws IOException, MBeanException,
+            InstanceNotFoundException, ReflectionException, InterruptedException, MalformedObjectNameException,
+            AttributeNotFoundException, InvalidAttributeValueException, KeeperException {
+        // flush all beans, then start
+        for (ZKMBeanInfo beanInfo : MBeanRegistry.getInstance().getRegisteredBeans()) {
+            MBeanRegistry.getInstance().unregister(beanInfo);
+        }
+
+        JMXEnv.setUp();
+        setUp(-1);
+        q3.start();
+        Assert.assertTrue("waiting for observer to be up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+
+        // Assert that commands are getting forwarded correctly
+        zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
+                ClientBase.CONNECTION_TIMEOUT, this);
+        zk.create("/obstest", "test".getBytes(),Ids.OPEN_ACL_UNSAFE,
+                CreateMode.PERSISTENT);
+        Assert.assertEquals("test", new String(zk.getData("/obstest", null, null)));
+
+        // test stats collection
+        final Map<String, String> emptyMap = Collections.emptyMap();
+        Map<String, Object> stats = Commands.runCommand("mntr", q3.getQuorumPeer().getActiveServer(), emptyMap).toMap();
+        Assert.assertTrue("observer not emitting observer_master_id", stats.containsKey("observer_master_id"));
+
+        // check the stats for the first peer
+        stats = Commands.runCommand("mntr", q1.getQuorumPeer().getActiveServer(), emptyMap).toMap();
+        if (testObserverMaster) {
+            if (q1.getQuorumPeer().leader == null) {
+                Assert.assertEquals(1, stats.get("synced_observers"));
+            } else {
+                Assert.assertEquals(0, stats.get("synced_observers"));
+            }
+        } else {
+            if (q1.getQuorumPeer().leader == null) {
+                Assert.assertNull(stats.get("synced_observers"));
+            } else {
+                Assert.assertEquals(1, stats.get("synced_observers"));
+            }
+        }
+
+        // check the stats for the second peer
+        stats = Commands.runCommand("mntr", q2.getQuorumPeer().getActiveServer(), emptyMap).toMap();
+        if (testObserverMaster) {
+            if (q2.getQuorumPeer().leader == null) {
+                Assert.assertEquals(1, stats.get("synced_observers"));
+            } else {
+                Assert.assertEquals(0, stats.get("synced_observers"));
+            }
+        } else {
+            if (q2.getQuorumPeer().leader == null) {
+                Assert.assertNull(stats.get("synced_observers"));
+            } else {
+                Assert.assertEquals(1, stats.get("synced_observers"));
+            }
+        }
+
+        // test admin commands for disconnection
+        ObjectName connBean = null;
+        for (ObjectName bean : JMXEnv.conn().queryNames(new ObjectName(CommonNames.DOMAIN + ":*"), null)) {
+            if (bean.getCanonicalName().contains("Learner_Connections") &&
+                    bean.getCanonicalName().contains("id:" + q3.getQuorumPeer().getId())) {
+                connBean = bean;
+                break;
+            }
+        }
+        Assert.assertNotNull("could not find connection bean", connBean);
+
+        latch = new CountDownLatch(1);
+        JMXEnv.conn().invoke(connBean, "terminateConnection", new Object[0], null);
+        Assert.assertTrue("server failed to disconnect on terminate",
+                latch.await(CONNECTION_TIMEOUT/2, TimeUnit.MILLISECONDS));
+        Assert.assertTrue("waiting for server 3 being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                        CONNECTION_TIMEOUT));
+
+        final String obsBeanName =
+                String.format("org.apache.ZooKeeperService:name0=ReplicatedServer_id%d,name1=replica.%d,name2=Observer",
+                        q3.getQuorumPeer().getId(), q3.getQuorumPeer().getId());
+        Set<ObjectName> names = JMXEnv.conn().queryNames(new ObjectName(obsBeanName), null);
+        Assert.assertEquals("expecting singular observer bean", 1, names.size());
+        ObjectName obsBean = names.iterator().next();
+
+        if (testObserverMaster) {
+            // show we can move the observer using the id
+            long observerMasterId = q3.getQuorumPeer().observer.getLearnerMasterId();
+            latch = new CountDownLatch(1);
+            JMXEnv.conn().setAttribute(obsBean, new Attribute("LearnerMaster", Long.toString(3 - observerMasterId)));
+            Assert.assertTrue("observer failed to disconnect on learner master change",
+                    latch.await(CONNECTION_TIMEOUT, TimeUnit.MILLISECONDS));
+            Assert.assertTrue("waiting for server 3 being up",
+                    ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
+                            CONNECTION_TIMEOUT));
+        } else {
+            // show we get an error
+            final long leaderId = q1.getQuorumPeer().leader == null ? 2 : 1;
+            try {
+                JMXEnv.conn().setAttribute(obsBean, new Attribute("LearnerMaster", Long.toString(3 - leaderId)));
+                Assert.fail("should have seen an exception on previous command");
+            } catch (RuntimeMBeanException e) {
+                Assert.assertEquals("mbean failed for the wrong reason",
+                        IllegalArgumentException.class, e.getCause().getClass());
+            }
+        }
+
+        shutdown();
+        JMXEnv.tearDown();
+    }
+
+    private String createServerString(String type, long serverId, int clientPort) {
+        return "server." + serverId + "=127.0.0.1:" +
+                PortAssignment.unique() + ":" +
+                PortAssignment.unique() + ":" +
+                type + ";" + clientPort;
+    }
+
+    private void waitServerUp(int clientPort) {
+        Assert.assertTrue("waiting for server being up",
+                ClientBase.waitForServerUp("127.0.0.1:" + clientPort,
+                        CONNECTION_TIMEOUT));
+    }
+
+    private ZooKeeperAdmin createAdmin(int clientPort) throws IOException {
+        System.setProperty("zookeeper.DigestAuthenticationProvider.superDigest",
+                "super:D/InIHSb7yEEbrWz8b9l71RjZJU="/* password is 'test'*/);
+        QuorumPeerConfig.setReconfigEnabled(true);
+        ZooKeeperAdmin admin = new ZooKeeperAdmin("127.0.0.1:" + clientPort,
+                ClientBase.CONNECTION_TIMEOUT, new Watcher() {
+                    public void process(WatchedEvent event) {}
+                });
+        admin.addAuthInfo("digest", "super:test".getBytes());
+        return admin;
+    }
+
+    // This test is known to be flaky and fail due to "reconfig already in progress".
+    // TODO: Investigate intermittent testDynamicReconfig failures.
+    // @Test
+    public void testDynamicReconfig() throws InterruptedException, IOException,
+              KeeperException {
+        if (!testObserverMaster) {
+            return;
+        }
+
+        ClientBase.setupTestEnv();
+
+        // create a quorum whose servers use different observer master
+        // ports, which makes it easy to choose which server the
+        // observer follows
+        //
+        // a setObserverMaster function exists but is broken; use this
+        // workaround until it is fixed
+        int clientPort1 = PortAssignment.unique();
+        int clientPort2 = PortAssignment.unique();
+        int omPort1 = PortAssignment.unique();
+        int omPort2 = PortAssignment.unique();
+        String quorumCfgSection =
+                createServerString("participant", 1, clientPort1) + "\n" +
+                createServerString("participant", 2, clientPort2);
+
+        MainThread s1 = new MainThread(1, clientPort1, quorumCfgSection,
+                String.format("observerMasterPort=%d%n",omPort1));
+        MainThread s2 = new MainThread(2, clientPort2, quorumCfgSection,
+                String.format("observerMasterPort=%d%n", omPort2));
+        s1.start();
+        s2.start();
+        waitServerUp(clientPort1);
+        waitServerUp(clientPort2);
+
+        // create observer to follow non-leader observer master
+        long nonLeaderOMPort = s1.getQuorumPeer().leader == null ? omPort1
+                                                                 : omPort2;
+        int observerClientPort = PortAssignment.unique();
+        int observerId = 10;
+        MainThread observer = new MainThread(
+                observerId,
+                observerClientPort, quorumCfgSection + "\n" +
+                createServerString("observer", observerId,
+                        observerClientPort),
+                String.format("observerMasterPort=%d%n", nonLeaderOMPort));
+        LOG.info("starting observer");
+        observer.start();
+        waitServerUp(observerClientPort);
+
+        // create a client to the observer
+        final LinkedBlockingQueue<KeeperState> states =
+            new LinkedBlockingQueue<KeeperState>();
+        ZooKeeper observerClient = new ZooKeeper(
+                "127.0.0.1:" + observerClientPort,
+                ClientBase.CONNECTION_TIMEOUT, new Watcher() {
+                    @Override
+                    public void process(WatchedEvent event) {
+                        try {
+                            states.put(event.getState());
+                        } catch (InterruptedException e) {
+                            // preserve the interrupt status for the caller
+                            Thread.currentThread().interrupt();
+                        }
+                    }
+                });
+
+        // wait for connected
+        KeeperState state = states.poll(1000, TimeUnit.MILLISECONDS);
+        Assert.assertEquals(KeeperState.SyncConnected, state);
+
+        // issue reconfig command
+        ArrayList<String> newServers = new ArrayList<String>();
+        String server = "server.3=127.0.0.1:" + PortAssignment.unique()
+                + ":" + PortAssignment.unique() + ":participant;localhost:"
+                + PortAssignment.unique();
+        newServers.add(server);
+        ZooKeeperAdmin admin = createAdmin(clientPort1);
+        ReconfigTest.reconfig(admin, newServers, null, null, -1);
+
+        // make sure the observer has the new config
+        ReconfigTest.testServerHasConfig(observerClient, newServers, null);
+
+        // shouldn't be disconnected during reconfig, so expect to not
+        // receive any new event
+        state = states.poll(1000, TimeUnit.MILLISECONDS);
+        Assert.assertNull(state);
+
+        admin.close();
+        observerClient.close();
+        observer.shutdown();
+        s2.shutdown();
+        s1.shutdown();
+    }
+
+    /**
+     * Implementation of watcher interface.
+     */
+    public void process(WatchedEvent event) {
+        lastEvent = event;
+        if (latch != null) {
+            latch.countDown();
+        }
+        LOG.info("Latch got event :: " + event);
+    }
+
+    class AsyncWriter implements Runnable {
+        private final ZooKeeper client;
+        private final int numTransactions;
+        private final boolean issueSync;
+        private final CountDownLatch writerLatch;
+        private final String root;
+        private final CountDownLatch gate;
+
+        AsyncWriter(ZooKeeper client, int numTransactions, boolean issueSync, CountDownLatch writerLatch,
+                    String root, CountDownLatch gate) {
+            this.client = client;
+            this.numTransactions = numTransactions;
+            this.issueSync = issueSync;
+            this.writerLatch = writerLatch;
+            this.root = root;
+            this.gate = gate;
+        }
+
+        @Override
+        public void run() {
+            if (gate != null) {
+                try {
+                    gate.await();
+                } catch (InterruptedException e) {
+                    LOG.error("Gate interrupted");
+                    return;
+                }
+            }
+            for (int i = 0; i < numTransactions; i++) {
+                final boolean pleaseLog = i % 100 == 0;
+                client.create(root + i, "inner thread".getBytes(), Ids.OPEN_ACL_UNSAFE,
+                        CreateMode.PERSISTENT, new AsyncCallback.StringCallback() {
+                            @Override
+                            public void processResult(int rc, String path,
+                                                      Object ctx, String name) {
+                                writerLatch.countDown();
+                                if (pleaseLog) {
+                                    LOG.info("wrote {}", path);
+                                }
+                            }
+                        }, null);
+                if (pleaseLog) {
+                    LOG.info("async wrote {}{}", root, i);
+                    if (issueSync) {
+                        client.sync(root + "0", null, null);
+                    }
+                }
+            }
+        }
+    }
+}

+ 4 - 159
zookeeper-server/src/test/java/org/apache/zookeeper/test/ObserverTest.java

@@ -18,176 +18,21 @@
 
 package org.apache.zookeeper.test;
 
-import static org.apache.zookeeper.test.ClientBase.CONNECTION_TIMEOUT;
-
-import java.util.concurrent.CountDownLatch;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-import org.apache.zookeeper.CreateMode;
-import org.apache.zookeeper.KeeperException.ConnectionLossException;
 import org.apache.zookeeper.PortAssignment;
-import org.apache.zookeeper.WatchedEvent;
 import org.apache.zookeeper.Watcher;
-import org.apache.zookeeper.Watcher.Event.KeeperState;
-import org.apache.zookeeper.ZooDefs.Ids;
 import org.apache.zookeeper.ZooKeeper;
-import org.apache.zookeeper.ZooKeeper.States;
 import org.apache.zookeeper.server.quorum.QuorumPeerTestBase;
 import org.junit.Assert;
 import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 public class ObserverTest extends QuorumPeerTestBase implements Watcher{
     protected static final Logger LOG =
-        LoggerFactory.getLogger(ObserverTest.class);    
-      
-    CountDownLatch latch;
-    ZooKeeper zk;
-    WatchedEvent lastEvent = null;
-          
-    /**
-     * This test ensures two things:
-     * 1. That Observers can successfully proxy requests to the ensemble.
-     * 2. That Observers don't participate in leader elections.
-     * The second is tested by constructing an ensemble where a leader would
-     * be elected if and only if an Observer voted. 
-     * @throws Exception
-     */
-    @Test
-    public void testObserver() throws Exception {
-        ClientBase.setupTestEnv();
-        // We expect two notifications before we want to continue        
-        latch = new CountDownLatch(2);
-        
-        final int PORT_QP1 = PortAssignment.unique();
-        final int PORT_QP2 = PortAssignment.unique();
-        final int PORT_OBS = PortAssignment.unique();
-        final int PORT_QP_LE1 = PortAssignment.unique();
-        final int PORT_QP_LE2 = PortAssignment.unique();
-        final int PORT_OBS_LE = PortAssignment.unique();
-
-        final int CLIENT_PORT_QP1 = PortAssignment.unique();
-        final int CLIENT_PORT_QP2 = PortAssignment.unique();
-        final int CLIENT_PORT_OBS = PortAssignment.unique();
+        LoggerFactory.getLogger(ObserverTest.class);
 
-        
-        String quorumCfgSection = 
-            "server.1=127.0.0.1:" + (PORT_QP1)
-            + ":" + (PORT_QP_LE1) + ";" +  CLIENT_PORT_QP1 
-            + "\nserver.2=127.0.0.1:" + (PORT_QP2)
-            + ":" + (PORT_QP_LE2) + ";" + CLIENT_PORT_QP2  
-            + "\nserver.3=127.0.0.1:" 
-            + (PORT_OBS)+ ":" + (PORT_OBS_LE) + ":observer" + ";" + CLIENT_PORT_OBS;
-        
-        MainThread q1 = new MainThread(1, CLIENT_PORT_QP1, quorumCfgSection);
-        MainThread q2 = new MainThread(2, CLIENT_PORT_QP2, quorumCfgSection);
-        MainThread q3 = new MainThread(3, CLIENT_PORT_OBS, quorumCfgSection);
-        q1.start();
-        q2.start();
-        q3.start();
-        Assert.assertTrue("waiting for server 1 being up",
-                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP1,
-                        CONNECTION_TIMEOUT));
-        Assert.assertTrue("waiting for server 2 being up",
-                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP2,
-                        CONNECTION_TIMEOUT));
-        Assert.assertTrue("waiting for server 3 being up",
-                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_OBS,
-                        CONNECTION_TIMEOUT));        
-        
-        zk = new ZooKeeper("127.0.0.1:" + CLIENT_PORT_OBS,
-                ClientBase.CONNECTION_TIMEOUT, this);
-        zk.create("/obstest", "test".getBytes(),Ids.OPEN_ACL_UNSAFE,
-                CreateMode.PERSISTENT);
-        
-        // Assert that commands are getting forwarded correctly
-        Assert.assertEquals(new String(zk.getData("/obstest", null, null)), "test");
-        
-        // Now check that other commands don't blow everything up
-        zk.sync("/", null, null);
-        zk.setData("/obstest", "test2".getBytes(), -1);
-        zk.getChildren("/", false);
-        
-        Assert.assertEquals(zk.getState(), States.CONNECTED);
-        
-        LOG.info("Shutting down server 2");
-        // Now kill one of the other real servers        
-        q2.shutdown();
-                
-        Assert.assertTrue("Waiting for server 2 to shut down",
-                    ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP2, 
-                                    ClientBase.CONNECTION_TIMEOUT));
-
-        LOG.info("Server 2 down");
-
-        // Now the resulting ensemble shouldn't be quorate         
-        latch.await();        
-        Assert.assertNotSame("Client is still connected to non-quorate cluster", 
-                KeeperState.SyncConnected,lastEvent.getState());
-
-        LOG.info("Latch returned");
-
-        try {
-            Assert.assertFalse("Shouldn't get a response when cluster not quorate!",
-                    new String(zk.getData("/obstest", null, null)).equals("test"));
-        }
-        catch (ConnectionLossException c) {
-            LOG.info("Connection loss exception caught - ensemble not quorate (this is expected)");
-        }
-        
-        latch = new CountDownLatch(1);
-
-        LOG.info("Restarting server 2");
-
-        // Bring it back
-        q2 = new MainThread(2, CLIENT_PORT_QP2, quorumCfgSection);
-        q2.start();
-        
-        LOG.info("Waiting for server 2 to come up");
-        Assert.assertTrue("waiting for server 2 being up",
-                ClientBase.waitForServerUp("127.0.0.1:" + CLIENT_PORT_QP2,
-                        CONNECTION_TIMEOUT));
-        
-        LOG.info("Server 2 started, waiting for latch");
-
-        latch.await();
-        // It's possible our session expired - but this is ok, shows we 
-        // were able to talk to the ensemble
-        Assert.assertTrue("Client didn't reconnect to quorate ensemble (state was" +
-                lastEvent.getState() + ")",
-                (KeeperState.SyncConnected==lastEvent.getState() ||
-                KeeperState.Expired==lastEvent.getState())); 
-
-        LOG.info("Shutting down all servers");
-
-        q1.shutdown();
-        q2.shutdown();
-        q3.shutdown();
-        
-        LOG.info("Closing zk client");
+    ZooKeeper zk;
 
-        zk.close();        
-        Assert.assertTrue("Waiting for server 1 to shut down",
-                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP1, 
-                                ClientBase.CONNECTION_TIMEOUT));
-        Assert.assertTrue("Waiting for server 2 to shut down",
-                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_QP2, 
-                                ClientBase.CONNECTION_TIMEOUT));
-        Assert.assertTrue("Waiting for server 3 to shut down",
-                ClientBase.waitForServerDown("127.0.0.1:"+CLIENT_PORT_OBS, 
-                                ClientBase.CONNECTION_TIMEOUT));
-    
-    }
-    
-    /**
-     * Implementation of watcher interface.
-     */
-    public void process(WatchedEvent event) {
-        lastEvent = event;
-        latch.countDown();
-        LOG.info("Latch got event :: " + event);        
-    }    
-    
     /**
      * This test ensures that an Observer does not elect itself as a leader, or
      * indeed come up properly, if it is the lone member of an ensemble.

+ 9 - 7
zookeeper-server/src/test/java/org/apache/zookeeper/test/ReconfigTest.java

@@ -82,20 +82,22 @@ public class ReconfigTest extends ZKTestCase implements DataCallback{
                                   List<String> leavingServers, List<String> newMembers, long fromConfig)
             throws KeeperException, InterruptedException {
         byte[] config = null;
+        String failure = null;
         for (int j = 0; j < 30; j++) {
             try {
                 config = zkAdmin.reconfigure(joiningServers, leavingServers,
                         newMembers, fromConfig, new Stat());
+                failure = null;
                 break;
             } catch (KeeperException.ConnectionLossException e) {
-                if (j < 29) {
-                    Thread.sleep(1000);
-                } else {
-                    // test fails if we still can't connect to the quorum after
-                    // 30 seconds.
-                    Assert.fail("client could not connect to reestablished quorum: giving up after 30+ seconds.");
-                }
+                failure = "client could not connect to reestablished quorum: giving up after 30+ seconds.";
+            } catch (KeeperException.ReconfigInProgress e) {
+                failure = "reconfig still in progress: giving up after 30+ seconds.";
             }
+            Thread.sleep(1000);
+        }
+        if (failure != null) {
+            Assert.fail(failure);
         }
 
         String configStr = new String(config);
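
Note on the hunk above: the retry loop now records a failure message on each retriable exception, sleeps after every failed attempt, and fails once after the loop instead of inside the `catch`. The following is a minimal standalone sketch of that pattern, not code from this commit. `RetrySketch`, `retry`, `ConnectionLoss` and `InProgress` are hypothetical names, with the two exception classes standing in for `KeeperException.ConnectionLossException` and `KeeperException.ReconfigInProgress`.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in ZooKeeper.
final class RetrySketch {
    // Stand-ins for the two retriable exceptions caught in the test.
    static class ConnectionLoss extends RuntimeException {}
    static class InProgress extends RuntimeException {}

    // Runs op up to `attempts` times, sleeping between failed attempts.
    // The most recent failure reason is reported once, after the loop.
    static <T> T retry(int attempts, long sleepMillis, Supplier<T> op) {
        String failure = "no attempts were made";
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get(); // success: leave the loop immediately
            } catch (ConnectionLoss e) {
                failure = "could not connect: giving up after " + attempts + " attempts";
            } catch (InProgress e) {
                failure = "operation still in progress: giving up after " + attempts + " attempts";
            }
            try {
                Thread.sleep(sleepMillis); // reached only after a failed attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying", ie);
            }
        }
        throw new AssertionError(failure);
    }
}
```

Keeping the failure decision outside the `catch` blocks lets several exception types share one retry budget, which is what allows the added `ReconfigInProgress` case to reuse the existing 30-iteration loop.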

+ 8 - 0
zookeeper-server/src/test/resources/findbugsExcludeFile.xml

@@ -199,4 +199,12 @@
     <Class name="org.apache.zookeeper.server.EphemeralType"/>
       <Bug pattern="RV_RETURN_VALUE_IGNORED_NO_SIDE_EFFECT" />
   </Match>
+
+  <!-- Disable 'Inconsistent synchronization' to allow the ServerSocket to listen without
+       locking the class -->
+  <Match>
+    <Class name="org.apache.zookeeper.server.quorum.ObserverMaster"/>
+    <Field name="ss"/>
+    <Bug code="IS"/>
+  </Match>
 </FindBugsFilter>