|
@@ -0,0 +1,371 @@
|
|
|
+<?xml version="1.0"?>
|
|
|
+ <!--
|
|
|
+ Licensed to the Apache Software Foundation (ASF) under one or more
|
|
|
+ contributor license agreements. See the NOTICE file distributed with
|
|
|
+ this work for additional information regarding copyright ownership.
|
|
|
+ The ASF licenses this file to You under the Apache License, Version
|
|
|
+ 2.0 (the "License"); you may not use this file except in compliance
|
|
|
+ with the License. You may obtain a copy of the License at
|
|
|
+
|
|
|
+ http://www.apache.org/licenses/LICENSE-2.0 Unless required by
|
|
|
+ applicable law or agreed to in writing, software distributed under
|
|
|
+ the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
|
|
|
+ OR CONDITIONS OF ANY KIND, either express or implied. See the
|
|
|
+ License for the specific language governing permissions and
|
|
|
+ limitations under the License.
|
|
|
+ -->
|
|
|
+
|
|
|
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
|
|
|
+<document>
|
|
|
+ <header>
|
|
|
+ <title>Fair Scheduler</title>
|
|
|
+ </header>
|
|
|
+ <body>
|
|
|
+
|
|
|
+ <section>
|
|
|
+ <title>Purpose</title>
|
|
|
+
|
|
|
+ <p>This document describes the Fair Scheduler, a pluggable
|
|
|
+ Map/Reduce scheduler for Hadoop which provides a way to share
|
|
|
+ large clusters.</p>
|
|
|
+ </section>
|
|
|
+
|
|
|
+ <section>
|
|
|
+ <title>Introduction</title>
|
|
|
+ <p>Fair scheduling is a method of assigning resources to jobs
|
|
|
+ such that all jobs get, on average, an equal share of resources
|
|
|
+ over time. When there is a single job running, that job uses the
|
|
|
+ entire cluster. When other jobs are submitted, tasks slots that
|
|
|
+ free up are assigned to the new jobs, so that each job gets
|
|
|
+ roughly the same amount of CPU time. Unlike the default Hadoop
|
|
|
+ scheduler, which forms a queue of jobs, this lets short jobs finish
|
|
|
+ in reasonable time while not starving long jobs. It is also a
|
|
|
+ reasonable way to share a cluster between a number of users. Finally,
|
|
|
+ fair sharing can also work with job priorities - the priorities are
|
|
|
+ used as weights to determine the fraction of total compute time that
|
|
|
+ each job should get.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ The scheduler actually organizes jobs further into "pools", and
|
|
|
+ shares resources fairly between these pools. By default, there is a
|
|
|
+ separate pool for each user, so that each user gets the same share
|
|
|
+ of the cluster no matter how many jobs they submit. However, it is
|
|
|
+ also possible to set a job's pool based on the user's Unix group or
|
|
|
+ any other jobconf property, such as the queue name property used by
|
|
|
+ <a href="capacity_scheduler.html">Capacity Scheduler</a>.
|
|
|
+ Within each pool, fair sharing is used to share capacity between
|
|
|
+ the running jobs. Pools can also be given weights to share the
|
|
|
+ cluster non-proportionally in the config file.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ In addition to providing fair sharing, the Fair Scheduler allows
|
|
|
+ assigning guaranteed minimum shares to pools, which is useful for
|
|
|
+ ensuring that certain users, groups or production applications
|
|
|
+ always get sufficient resources. When a pool contains jobs, it gets
|
|
|
+ at least its minimum share, but when the pool does not need its full
|
|
|
+ guaranteed share, the excess is split between other running jobs.
|
|
|
+ This lets the scheduler guarantee capacity for pools while utilizing
|
|
|
+ resources efficiently when these pools don't contain jobs.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ The Fair Scheduler lets all jobs run by default, but it is also
|
|
|
+ possible to limit the number of running jobs per user and per pool
|
|
|
+ through the config file. This can be useful when a user must submit
|
|
|
+ hundreds of jobs at once, or in general to improve performance if
|
|
|
+ running too many jobs at once would cause too much intermediate data
|
|
|
+ to be created or too much context-switching. Limiting the jobs does
|
|
|
+ not cause any subsequently submitted jobs to fail, only to wait in the
|
|
|
+ sheduler's queue until some of the user's earlier jobs finish. Jobs to
|
|
|
+ run from each user/pool are chosen in order of priority and then
|
|
|
+ submit time, as in the default FIFO scheduler in Hadoop.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Finally, the fair scheduler provides several extension points where
|
|
|
+ the basic functionality can be extended. For example, the weight
|
|
|
+ calculation can be modified to give a priority boost to new jobs,
|
|
|
+ implementing a "shortest job first" policy which reduces response
|
|
|
+ times for interactive jobs even further.
|
|
|
+ </p>
|
|
|
+ </section>
|
|
|
+
|
|
|
+ <section>
|
|
|
+ <title>Installation</title>
|
|
|
+ <p>
|
|
|
+ To run the fair scheduler in your Hadoop installation, you need to put
|
|
|
+ it on the CLASSPATH. The easiest way is to copy the
|
|
|
+ <em>hadoop-*-fairscheduler.jar</em> from
|
|
|
+ <em>HADOOP_HOME/contrib/fairscheduler</em> to <em>HADOOP_HOME/lib</em>.
|
|
|
+ Alternatively you can modify <em>HADOOP_CLASSPATH</em> to include this jar, in
|
|
|
+ <em>HADOOP_CONF_DIR/hadoop-env.sh</em>
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ In order to compile fair scheduler, from sources execute <em> ant
|
|
|
+ package</em> in source folder and copy the
|
|
|
+ <em>build/contrib/fair-scheduler/hadoop-*-fairscheduler.jar</em>
|
|
|
+ to <em>HADOOP_HOME/lib</em>
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ You will also need to set the following property in the Hadoop config
|
|
|
+ file <em>HADOOP_CONF_DIR/mapred-site.xml</em> to have Hadoop use
|
|
|
+ the fair scheduler: <br/>
|
|
|
+ <code><property></code><br/>
|
|
|
+ <code> <name>mapred.jobtracker.taskScheduler</name></code><br/>
|
|
|
+ <code> <value>org.apache.hadoop.mapred.FairScheduler</value></code><br/>
|
|
|
+ <code></property></code>
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Once you restart the cluster, you can check that the fair scheduler
|
|
|
+ is running by going to http://<jobtracker URL>/scheduler
|
|
|
+ on the JobTracker's web UI. A "job scheduler administration" page should
|
|
|
+ be visible there. This page is described in the Administration section.
|
|
|
+ </p>
|
|
|
+ </section>
|
|
|
+
|
|
|
+ <section>
|
|
|
+ <title>Configuring the Fair scheduler</title>
|
|
|
+ <p>
|
|
|
+ The following properties can be set in mapred-site.xml to configure
|
|
|
+ the fair scheduler:
|
|
|
+ </p>
|
|
|
+ <table>
|
|
|
+ <tr>
|
|
|
+ <th>Name</th><th>Description</th>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.allocation.file
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Specifies an absolute path to an XML file which contains the
|
|
|
+ allocations for each pool, as well as the per-pool and per-user
|
|
|
+ limits on number of running jobs. If this property is not
|
|
|
+ provided, allocations are not used.<br/>
|
|
|
+ This file must be in XML format, and can contain three types of
|
|
|
+ elements:
|
|
|
+ <ul>
|
|
|
+ <li>pool elements, which may contain elements for minMaps,
|
|
|
+ minReduces, maxRunningJobs (limit the number of jobs from the
|
|
|
+ pool to run at once),and weight (to share the cluster
|
|
|
+ non-proportionally with other pools).
|
|
|
+ </li>
|
|
|
+ <li>user elements, which may contain a maxRunningJobs to limit
|
|
|
+ jobs. Note that by default, there is a separate pool for each
|
|
|
+ user, so these may not be necessary; they are useful, however,
|
|
|
+ if you create a pool per user group or manually assign jobs
|
|
|
+ to pools.</li>
|
|
|
+ <li>A userMaxJobsDefault element, which sets the default running
|
|
|
+ job limit for any users whose limit is not specified.</li>
|
|
|
+ </ul>
|
|
|
+ <br/>
|
|
|
+ Example Allocation file is listed below :<br/>
|
|
|
+ <code><?xml version="1.0"?> </code> <br/>
|
|
|
+ <code><allocations></code> <br/>
|
|
|
+ <code> <pool name="sample_pool"></code><br/>
|
|
|
+ <code> <minMaps>5</minMaps></code><br/>
|
|
|
+ <code> <minReduces>5</minReduces></code><br/>
|
|
|
+ <code> <weight>2.0</weight></code><br/>
|
|
|
+ <code> </pool></code><br/>
|
|
|
+ <code> <user name="sample_user"></code><br/>
|
|
|
+ <code> <maxRunningJobs>6</maxRunningJobs></code><br/>
|
|
|
+ <code> </user></code><br/>
|
|
|
+ <code> <userMaxJobsDefault>3</userMaxJobsDefault></code><br/>
|
|
|
+ <code></allocations></code>
|
|
|
+ <br/>
|
|
|
+ This example creates a pool sample_pool with a guarantee of 5 map
|
|
|
+ slots and 5 reduce slots. The pool also has a weight of 2.0, meaning
|
|
|
+ it has a 2x higher share of the cluster than other pools (the default
|
|
|
+ weight is 1). Finally, the example limits the number of running jobs
|
|
|
+ per user to 3, except for sample_user, who can run 6 jobs concurrently.
|
|
|
+ Any pool not defined in the allocations file will have no guaranteed
|
|
|
+ capacity and a weight of 1.0. Also, any pool or user with no max
|
|
|
+ running jobs set in the file will be allowed to run an unlimited
|
|
|
+ number of jobs.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.assignmultiple
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Allows the scheduler to assign both a map task and a reduce task
|
|
|
+ on each heartbeat, which improves cluster throughput when there
|
|
|
+ are many small tasks to run. Boolean value, default: false.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.sizebasedweight
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Take into account job sizes in calculating their weights for fair
|
|
|
+ sharing.By default, weights are only based on job priorities.
|
|
|
+ Setting this flag to true will make them based on the size of the
|
|
|
+ job (number of tasks needed) as well,though not linearly
|
|
|
+ (the weight will be proportional to the log of the number of tasks
|
|
|
+ needed). This lets larger jobs get larger fair shares while still
|
|
|
+ providing enough of a share to small jobs to let them finish fast.
|
|
|
+ Boolean value, default: false.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.poolnameproperty
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Specify which jobconf property is used to determine the pool that a
|
|
|
+ job belongs in. String, default: user.name (i.e. one pool for each
|
|
|
+ user). Some other useful values to set this to are: <br/>
|
|
|
+ <ul>
|
|
|
+ <li> group.name (to create a pool per Unix group).</li>
|
|
|
+ <li>mapred.job.queue.name (the same property as the queue name in
|
|
|
+ <a href="capacity_scheduler.html">Capacity Scheduler</a>).</li>
|
|
|
+ </ul>
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.weightadjuster
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extensibility point that lets you specify a class to adjust the
|
|
|
+ weights of running jobs. This class should implement the
|
|
|
+ <em>WeightAdjuster</em> interface. There is currently one example
|
|
|
+ implementation - <em>NewJobWeightBooster</em>, which increases the
|
|
|
+ weight of jobs for the first 5 minutes of their lifetime to let
|
|
|
+ short jobs finish faster. To use it, set the weightadjuster
|
|
|
+ property to the full class name,
|
|
|
+ <code>org.apache.hadoop.mapred.NewJobWeightBooster</code>
|
|
|
+ NewJobWeightBooster itself provides two parameters for setting the
|
|
|
+ duration and boost factor. <br/>
|
|
|
+ <ol>
|
|
|
+ <li> <em>mapred.newjobweightbooster.factor</em>
|
|
|
+ Factor by which new jobs weight should be boosted. Default is 3</li>
|
|
|
+ <li><em>mapred.newjobweightbooster.duration</em>
|
|
|
+ Duration in milliseconds, default 300000 for 5 minutes</li>
|
|
|
+ </ol>
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.loadmanager
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extensibility point that lets you specify a class that determines
|
|
|
+ how many maps and reduces can run on a given TaskTracker. This class
|
|
|
+ should implement the LoadManager interface. By default the task caps
|
|
|
+ in the Hadoop config file are used, but this option could be used to
|
|
|
+ make the load based on available memory and CPU utilization for example.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.taskselector:
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extensibility point that lets you specify a class that determines
|
|
|
+ which task from within a job to launch on a given tracker. This can be
|
|
|
+ used to change either the locality policy (e.g. keep some jobs within
|
|
|
+ a particular rack) or the speculative execution algorithm (select
|
|
|
+ when to launch speculative tasks). The default implementation uses
|
|
|
+ Hadoop's default algorithms from JobInProgress.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ </table>
|
|
|
+ </section>
|
|
|
+ <section>
|
|
|
+ <title> Administration</title>
|
|
|
+ <p>
|
|
|
+ The fair scheduler provides support for administration at runtime
|
|
|
+ through two mechanisms:
|
|
|
+ </p>
|
|
|
+ <ol>
|
|
|
+ <li>
|
|
|
+ It is possible to modify pools' allocations
|
|
|
+ and user and pool running job limits at runtime by editing the allocation
|
|
|
+ config file. The scheduler will reload this file 10-15 seconds after it
|
|
|
+ sees that it was modified.
|
|
|
+ </li>
|
|
|
+ <li>
|
|
|
+ Current jobs, pools, and fair shares can be examined through the
|
|
|
+ JobTracker's web interface, at http://<jobtracker URL>/scheduler.
|
|
|
+ On this interface, it is also possible to modify jobs' priorities or
|
|
|
+ move jobs from one pool to another and see the effects on the fair
|
|
|
+ shares (this requires JavaScript).
|
|
|
+ </li>
|
|
|
+ </ol>
|
|
|
+ <p>
|
|
|
+ The following fields can be seen for each job on the web interface:
|
|
|
+ </p>
|
|
|
+ <ul>
|
|
|
+ <li><em>Submitted</em> - Date and time job was submitted.</li>
|
|
|
+ <li><em>JobID, User, Name</em> - Job identifiers as on the standard
|
|
|
+ web UI.</li>
|
|
|
+ <li><em>Pool</em> - Current pool of job. Select another value to move job to
|
|
|
+ another pool.</li>
|
|
|
+ <li><em>Priority</em> - Current priority. Select another value to change the
|
|
|
+ job's priority</li>
|
|
|
+ <li><em>Maps/Reduces Finished</em>: Number of tasks finished / total tasks.</li>
|
|
|
+ <li><em>Maps/Reduces Running</em>: Tasks currently running.</li>
|
|
|
+ <li><em>Map/Reduce Fair Share</em>: The average number of task slots that this
|
|
|
+ job should have at any given time according to fair sharing. The actual
|
|
|
+ number of tasks will go up and down depending on how much compute time
|
|
|
+ the job has had, but on average it will get its fair share amount.</li>
|
|
|
+ </ul>
|
|
|
+ <p>
|
|
|
+ In addition, it is possible to turn on an "advanced" view for the web UI,
|
|
|
+ by going to http://<jobtracker URL>/scheduler?advanced. This view shows
|
|
|
+ four more columns used for calculations internally:
|
|
|
+ </p>
|
|
|
+ <ul>
|
|
|
+ <li><em>Maps/Reduce Weight</em>: Weight of the job in the fair sharing
|
|
|
+ calculations. This depends on priority and potentially also on
|
|
|
+ job size and job age if the <em>sizebasedweight</em> and
|
|
|
+ <em>NewJobWeightBooster</em> are enabled.</li>
|
|
|
+ <li><em>Map/Reduce Deficit</em>: The job's scheduling deficit in machine-
|
|
|
+ seconds - the amount of resources it should have gotten according to
|
|
|
+ its fair share, minus how many it actually got. Positive deficit means
|
|
|
+ the job will be scheduled again in the near future because it needs to
|
|
|
+ catch up to its fair share. The scheduler schedules jobs with higher
|
|
|
+ deficit ahead of others. Please see the Implementation section of
|
|
|
+ this document for details.</li>
|
|
|
+ </ul>
|
|
|
+ </section>
|
|
|
+ <section>
|
|
|
+ <title>Implementation</title>
|
|
|
+ <p>There are two aspects to implementing fair scheduling: Calculating
|
|
|
+ each job's fair share, and choosing which job to run when a task slot
|
|
|
+ becomes available.</p>
|
|
|
+ <p>To select jobs to run, the scheduler then keeps track of a
|
|
|
+ "deficit" for each job - the difference between the amount of
|
|
|
+ compute time it should have gotten on an ideal scheduler, and the amount
|
|
|
+ of compute time it actually got. This is a measure of how
|
|
|
+ "unfair" we've been to the job. Every few hundred
|
|
|
+ milliseconds, the scheduler updates the deficit of each job by looking
|
|
|
+ at how many tasks each job had running during this interval vs. its
|
|
|
+ fair share. Whenever a task slot becomes available, it is assigned to
|
|
|
+ the job with the highest deficit. There is one exception - if there
|
|
|
+ were one or more jobs who were not meeting their pool capacity
|
|
|
+ guarantees, we only choose among these "needy" jobs (based
|
|
|
+ again on their deficit), to ensure that the scheduler meets pool
|
|
|
+ guarantees as soon as possible.</p>
|
|
|
+ <p>
|
|
|
+ The fair shares are calculated by dividing the capacity of the cluster
|
|
|
+ among runnable jobs according to a "weight" for each job. By
|
|
|
+ default the weight is based on priority, with each level of priority
|
|
|
+ having 2x higher weight than the next (for example, VERY_HIGH has 4x the
|
|
|
+ weight of NORMAL). However, weights can also be based on job sizes and ages,
|
|
|
+ as described in the Configuring section. For jobs that are in a pool,
|
|
|
+ fair shares also take into account the minimum guarantee for that pool.
|
|
|
+ This capacity is divided among the jobs in that pool according again to
|
|
|
+ their weights.
|
|
|
+ </p>
|
|
|
+ <p>Finally, when limits on a user's running jobs or a pool's running jobs
|
|
|
+ are in place, we choose which jobs get to run by sorting all jobs in order
|
|
|
+ of priority and then submit time, as in the standard Hadoop scheduler. Any
|
|
|
+ jobs that fall after the user/pool's limit in this ordering are queued up
|
|
|
+ and wait idle until they can be run. During this time, they are ignored
|
|
|
+ from the fair sharing calculations and do not gain or lose deficit (their
|
|
|
+ fair share is set to zero).</p>
|
|
|
+ </section>
|
|
|
+ </body>
|
|
|
+</document>
|