|
@@ -18,16 +18,15 @@
|
|
|
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
|
|
|
<document>
|
|
|
<header>
|
|
|
- <title>Fair Scheduler Guide</title>
|
|
|
+ <title>Fair Scheduler</title>
|
|
|
</header>
|
|
|
<body>
|
|
|
|
|
|
<section>
|
|
|
<title>Purpose</title>
|
|
|
|
|
|
- <p>This document describes the Fair Scheduler, a pluggable
|
|
|
- MapReduce scheduler for Hadoop which provides a way to share
|
|
|
- large clusters.</p>
|
|
|
+ <p>This document describes the Fair Scheduler, a pluggable MapReduce
|
|
|
+ scheduler that provides a way to share large clusters.</p>
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
@@ -39,52 +38,62 @@
|
|
|
free up are assigned to the new jobs, so that each job gets
|
|
|
roughly the same amount of CPU time. Unlike the default Hadoop
|
|
|
scheduler, which forms a queue of jobs, this lets short jobs finish
|
|
|
- in reasonable time while not starving long jobs. It is also a
|
|
|
- reasonable way to share a cluster between a number of users. Finally,
|
|
|
- fair sharing can also work with job priorities - the priorities are
|
|
|
+ in reasonable time while not starving long jobs. It is also an easy
|
|
|
+ way to share a cluster between multiple of users.
|
|
|
+ Fair sharing can also work with job priorities - the priorities are
|
|
|
used as weights to determine the fraction of total compute time that
|
|
|
- each job should get.
|
|
|
+ each job gets.
|
|
|
</p>
|
|
|
<p>
|
|
|
- The scheduler actually organizes jobs further into "pools", and
|
|
|
- shares resources fairly between these pools. By default, there is a
|
|
|
- separate pool for each user, so that each user gets the same share
|
|
|
- of the cluster no matter how many jobs they submit. However, it is
|
|
|
- also possible to set a job's pool based on the user's Unix group or
|
|
|
- any other jobconf property, such as the queue name property used by
|
|
|
- <a href="capacity_scheduler.html">Capacity Scheduler</a>.
|
|
|
- Within each pool, fair sharing is used to share capacity between
|
|
|
- the running jobs. Pools can also be given weights to share the
|
|
|
- cluster non-proportionally in the config file.
|
|
|
+ The fair scheduler organizes jobs into <em>pools</em>, and
|
|
|
+ divides resources fairly between these pools. By default, there is a
|
|
|
+ separate pool for each user, so that each user gets an equal share
|
|
|
+ of the cluster. It is also possible to set a job's pool based on the
|
|
|
+ user's Unix group or any jobconf property.
|
|
|
+ Within each pool, jobs can be scheduled using either fair sharing or
|
|
|
+ first-in-first-out (FIFO) scheduling.
|
|
|
</p>
|
|
|
<p>
|
|
|
In addition to providing fair sharing, the Fair Scheduler allows
|
|
|
- assigning guaranteed minimum shares to pools, which is useful for
|
|
|
- ensuring that certain users, groups or production applications
|
|
|
+ assigning guaranteed <em>minimum shares</em> to pools, which is useful
|
|
|
+ for ensuring that certain users, groups or production applications
|
|
|
always get sufficient resources. When a pool contains jobs, it gets
|
|
|
at least its minimum share, but when the pool does not need its full
|
|
|
- guaranteed share, the excess is split between other running jobs.
|
|
|
- This lets the scheduler guarantee capacity for pools while utilizing
|
|
|
- resources efficiently when these pools don't contain jobs.
|
|
|
+ guaranteed share, the excess is split between other pools.
|
|
|
</p>
|
|
|
<p>
|
|
|
- The Fair Scheduler lets all jobs run by default, but it is also
|
|
|
- possible to limit the number of running jobs per user and per pool
|
|
|
- through the config file. This can be useful when a user must submit
|
|
|
- hundreds of jobs at once, or in general to improve performance if
|
|
|
- running too many jobs at once would cause too much intermediate data
|
|
|
- to be created or too much context-switching. Limiting the jobs does
|
|
|
- not cause any subsequently submitted jobs to fail, only to wait in the
|
|
|
- sheduler's queue until some of the user's earlier jobs finish. Jobs to
|
|
|
- run from each user/pool are chosen in order of priority and then
|
|
|
- submit time, as in the default FIFO scheduler in Hadoop.
|
|
|
+ If a pool's minimum share is not met for some period of time, the
|
|
|
+ scheduler optionally supports <em>preemption</em> of jobs in other
|
|
|
+ pools. The pool will be allowed to kill tasks from other pools to make
|
|
|
+ room to run. Preemption can be used to guarantee
|
|
|
+ that "production" jobs are not starved while also allowing
|
|
|
+ the Hadoop cluster to also be used for experimental and research jobs.
|
|
|
+ In addition, a pool can also be allowed to preempt tasks if it is
|
|
|
+ below half of its fair share for a configurable timeout (generally
|
|
|
+ set larger than the minimum share preemption timeout).
|
|
|
+ When choosing tasks to kill, the fair scheduler picks the
|
|
|
+ most-recently-launched tasks from over-allocated jobs,
|
|
|
+ to minimize wasted computation.
|
|
|
+ Preemption does not cause the preempted jobs to fail, because Hadoop
|
|
|
+ jobs tolerate losing tasks; it only makes them take longer to finish.
|
|
|
</p>
|
|
|
<p>
|
|
|
- Finally, the fair scheduler provides several extension points where
|
|
|
- the basic functionality can be extended. For example, the weight
|
|
|
- calculation can be modified to give a priority boost to new jobs,
|
|
|
- implementing a "shortest job first" policy which reduces response
|
|
|
- times for interactive jobs even further.
|
|
|
+ The Fair Scheduler can limit the number of concurrent
|
|
|
+ running jobs per user and per pool. This can be useful when a
|
|
|
+ user must submit hundreds of jobs at once, or for ensuring that
|
|
|
+ intermediate data does not fill up disk space on a cluster when too many
|
|
|
+ concurrent jobs are running.
|
|
|
+ Setting job limits causes jobs submitted beyond the limit to wait
|
|
|
+ until some of the user/pool's earlier jobs finish.
|
|
|
+ Jobs to run from each user/pool are chosen in order of priority and then
|
|
|
+ submit time.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Finally, the Fair Scheduler can limit the number of concurrent
|
|
|
+ running tasks per pool. This can be useful when jobs have a
|
|
|
+ dependency on an external service like a database or web
|
|
|
+ service that could be overloaded if too many map or reduce
|
|
|
+ tasks are run at once.
|
|
|
</p>
|
|
|
</section>
|
|
|
|
|
@@ -93,184 +102,367 @@
|
|
|
<p>
|
|
|
To run the fair scheduler in your Hadoop installation, you need to put
|
|
|
it on the CLASSPATH. The easiest way is to copy the
|
|
|
- <em>hadoop-fairscheduler-*.jar</em> from
|
|
|
- <em>HADOOP_HOME/contrib/fairscheduler</em> to <em>HADOOP_HOME/lib</em>.
|
|
|
+ <em>hadoop-*-fairscheduler.jar</em> from
|
|
|
+ <em>HADOOP_HOME/build/contrib/fairscheduler</em> to <em>HADOOP_HOME/lib</em>.
|
|
|
Alternatively you can modify <em>HADOOP_CLASSPATH</em> to include this jar, in
|
|
|
<em>HADOOP_CONF_DIR/hadoop-env.sh</em>
|
|
|
</p>
|
|
|
- <p>
|
|
|
- In order to compile fair scheduler, from sources execute <em> ant
|
|
|
- package</em> in source folder and copy the
|
|
|
- <em>build/contrib/fair-scheduler/hadoop-fairscheduler-*.jar</em>
|
|
|
- to <em>HADOOP_HOME/lib</em>
|
|
|
- </p>
|
|
|
<p>
|
|
|
You will also need to set the following property in the Hadoop config
|
|
|
file <em>HADOOP_CONF_DIR/mapred-site.xml</em> to have Hadoop use
|
|
|
- the fair scheduler: <br/>
|
|
|
- <code><property></code><br/>
|
|
|
- <code> <name>mapred.jobtracker.taskScheduler</name></code><br/>
|
|
|
- <code> <value>org.apache.hadoop.mapred.FairScheduler</value></code><br/>
|
|
|
- <code></property></code>
|
|
|
+ the fair scheduler:
|
|
|
</p>
|
|
|
+<source>
|
|
|
+<property>
|
|
|
+ <name>mapred.jobtracker.taskScheduler</name>
|
|
|
+ <value>org.apache.hadoop.mapred.FairScheduler</value>
|
|
|
+</property>
|
|
|
+</source>
|
|
|
<p>
|
|
|
Once you restart the cluster, you can check that the fair scheduler
|
|
|
- is running by going to http://<jobtracker URL>/scheduler
|
|
|
+ is running by going to <em>http://<jobtracker URL>/scheduler</em>
|
|
|
on the JobTracker's web UI. A "job scheduler administration" page should
|
|
|
be visible there. This page is described in the Administration section.
|
|
|
</p>
|
|
|
+ <p>
|
|
|
+ If you wish to compile the fair scheduler from source, run <em> ant
|
|
|
+ package</em> in your HADOOP_HOME directory. This will build
|
|
|
+ <em>build/contrib/fair-scheduler/hadoop-*-fairscheduler.jar</em>.
|
|
|
+ </p>
|
|
|
</section>
|
|
|
|
|
|
<section>
|
|
|
- <title>Configuring the Fair scheduler</title>
|
|
|
+ <title>Configuration</title>
|
|
|
<p>
|
|
|
- The following properties can be set in mapred-site.xml to configure
|
|
|
- the fair scheduler:
|
|
|
+ The Fair Scheduler contains configuration in two places -- algorithm
|
|
|
+ parameters are set in <em>HADOOP_CONF_DIR/mapred-site.xml</em>, while
|
|
|
+ a separate XML file called the <em>allocation file</em>,
|
|
|
+ located by default in
|
|
|
+ <em>HADOOP_CONF_DIR/fair-scheduler.xml</em>, is used to configure
|
|
|
+ pools, minimum shares, running job limits and preemption timeouts.
|
|
|
+ The allocation file is reloaded periodically at runtime,
|
|
|
+ allowing you to change pool settings without restarting
|
|
|
+ your Hadoop cluster.
|
|
|
</p>
|
|
|
- <table>
|
|
|
- <tr>
|
|
|
- <th>Name</th><th>Description</th>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.allocation.file
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- Specifies an absolute path to an XML file which contains the
|
|
|
- allocations for each pool, as well as the per-pool and per-user
|
|
|
- limits on number of running jobs. If this property is not
|
|
|
- provided, allocations are not used.<br/>
|
|
|
- This file must be in XML format, and can contain three types of
|
|
|
- elements:
|
|
|
+ <p>
|
|
|
+ For a minimal installation, to just get equal sharing between users,
|
|
|
+ you will not need to edit the allocation file.
|
|
|
+ </p>
|
|
|
+ <section>
|
|
|
+ <title>Scheduler Parameters in mapred-site.xml</title>
|
|
|
+ <p>
|
|
|
+ The following parameters can be set in <em>mapred-site.xml</em>
|
|
|
+ to affect the behavior of the fair scheduler:
|
|
|
+ </p>
|
|
|
+ <p><strong>Basic Parameters</strong></p>
|
|
|
+ <table>
|
|
|
+ <tr>
|
|
|
+ <th>Name</th><th>Description</th>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.preemption
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Boolean property for enabling preemption. Default: false.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.pool
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Specify the pool that a job belongs in.
|
|
|
+ If this is specified then mapred.fairscheduler.poolnameproperty is ignored.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.poolnameproperty
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Specify which jobconf property is used to determine the pool that a
|
|
|
+ job belongs in. String, default: <em>user.name</em>
|
|
|
+ (i.e. one pool for each user).
|
|
|
+ Another useful value is <em>mapred.job.queue.name</em> to use MapReduce's "queue"
|
|
|
+ system for access control lists (see below).
|
|
|
+ mapred.fairscheduler.poolnameproperty is used only for jobs in which
|
|
|
+ mapred.fairscheduler.pool is not explicitly set.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.allocation.file
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Can be used to have the scheduler use a different allocation file
|
|
|
+ than the default one (<em>HADOOP_CONF_DIR/fair-scheduler.xml</em>).
|
|
|
+ Must be an absolute path to the allocation file.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ </table>
|
|
|
+ <p> <br></br></p>
|
|
|
+ <p><strong>Advanced Parameters</strong> </p>
|
|
|
+ <table>
|
|
|
+ <tr>
|
|
|
+ <th>Name</th><th>Description</th>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.sizebasedweight
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Take into account job sizes in calculating their weights for fair
|
|
|
+ sharing. By default, weights are only based on job priorities.
|
|
|
+ Setting this flag to true will make them based on the size of the
|
|
|
+ job (number of tasks needed) as well,though not linearly
|
|
|
+ (the weight will be proportional to the log of the number of tasks
|
|
|
+ needed). This lets larger jobs get larger fair shares while still
|
|
|
+ providing enough of a share to small jobs to let them finish fast.
|
|
|
+ Boolean value, default: false.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.preemption.only.log
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ This flag will cause the scheduler to run through the preemption
|
|
|
+ calculations but simply log when it wishes to preempt a task,
|
|
|
+ without actually preempting the task.
|
|
|
+ Boolean property, default: false.
|
|
|
+ This property can be useful for
|
|
|
+ doing a "dry run" of preemption before enabling it to make sure
|
|
|
+ that you have not set timeouts too aggressively.
|
|
|
+ You will see preemption log messages in your JobTracker's output
|
|
|
+ log (<em>HADOOP_LOG_DIR/hadoop-jobtracker-*.log</em>).
|
|
|
+ The messages look as follows:<br/>
|
|
|
+ <code>Should preempt 2 tasks for job_20090101337_0001: tasksDueToMinShare = 2, tasksDueToFairShare = 0</code>
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.update.interval
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Interval at which to update fair share calculations. The default
|
|
|
+ of 500ms works well for clusters with fewer than 500 nodes,
|
|
|
+ but larger values reduce load on the JobTracker for larger clusters.
|
|
|
+ Integer value in milliseconds, default: 500.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.preemption.interval
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Interval at which to check for tasks to preempt. The default
|
|
|
+ of 15s works well for timeouts on the order of minutes.
|
|
|
+ It is not recommended to set timeouts much smaller than this
|
|
|
+ amount, but you can use this value to make preemption computations
|
|
|
+ run more often if you do set such timeouts. A value of less than
|
|
|
+ 5s will probably be too small, however, as it becomes less than
|
|
|
+ the inter-heartbeat interval.
|
|
|
+ Integer value in milliseconds, default: 15000.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.weightadjuster
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extension point that lets you specify a class to adjust the
|
|
|
+ weights of running jobs. This class should implement the
|
|
|
+ <em>WeightAdjuster</em> interface. There is currently one example
|
|
|
+ implementation - <em>NewJobWeightBooster</em>, which increases the
|
|
|
+ weight of jobs for the first 5 minutes of their lifetime to let
|
|
|
+ short jobs finish faster. To use it, set the weightadjuster
|
|
|
+ property to the full class name,
|
|
|
+ <code>org.apache.hadoop.mapred.NewJobWeightBooster</code>.
|
|
|
+ NewJobWeightBooster itself provides two parameters for setting the
|
|
|
+ duration and boost factor.
|
|
|
<ul>
|
|
|
- <li>pool elements, which may contain elements for minMaps,
|
|
|
- minReduces, maxRunningJobs (limit the number of jobs from the
|
|
|
- pool to run at once),and weight (to share the cluster
|
|
|
- non-proportionally with other pools).
|
|
|
- </li>
|
|
|
- <li>user elements, which may contain a maxRunningJobs to limit
|
|
|
- jobs. Note that by default, there is a separate pool for each
|
|
|
- user, so these may not be necessary; they are useful, however,
|
|
|
- if you create a pool per user group or manually assign jobs
|
|
|
- to pools.</li>
|
|
|
- <li>A userMaxJobsDefault element, which sets the default running
|
|
|
- job limit for any users whose limit is not specified.</li>
|
|
|
+ <li><em>mapred.newjobweightbooster.factor</em>
|
|
|
+ Factor by which new jobs weight should be boosted.
|
|
|
+ Default is 3.</li>
|
|
|
+ <li><em>mapred.newjobweightbooster.duration</em>
|
|
|
+ Boost duration in milliseconds. Default is 300000 for 5 minutes.</li>
|
|
|
</ul>
|
|
|
- <br/>
|
|
|
- Example Allocation file is listed below :<br/>
|
|
|
- <code><?xml version="1.0"?> </code> <br/>
|
|
|
- <code><allocations></code> <br/>
|
|
|
- <code> <pool name="sample_pool"></code><br/>
|
|
|
- <code> <minMaps>5</minMaps></code><br/>
|
|
|
- <code> <minReduces>5</minReduces></code><br/>
|
|
|
- <code> <weight>2.0</weight></code><br/>
|
|
|
- <code> </pool></code><br/>
|
|
|
- <code> <user name="sample_user"></code><br/>
|
|
|
- <code> <maxRunningJobs>6</maxRunningJobs></code><br/>
|
|
|
- <code> </user></code><br/>
|
|
|
- <code> <userMaxJobsDefault>3</userMaxJobsDefault></code><br/>
|
|
|
- <code></allocations></code>
|
|
|
- <br/>
|
|
|
- This example creates a pool sample_pool with a guarantee of 5 map
|
|
|
- slots and 5 reduce slots. The pool also has a weight of 2.0, meaning
|
|
|
- it has a 2x higher share of the cluster than other pools (the default
|
|
|
- weight is 1). Finally, the example limits the number of running jobs
|
|
|
- per user to 3, except for sample_user, who can run 6 jobs concurrently.
|
|
|
- Any pool not defined in the allocations file will have no guaranteed
|
|
|
- capacity and a weight of 1.0. Also, any pool or user with no max
|
|
|
- running jobs set in the file will be allowed to run an unlimited
|
|
|
- number of jobs.
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.assignmultiple
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- Allows the scheduler to assign both a map task and a reduce task
|
|
|
- on each heartbeat, which improves cluster throughput when there
|
|
|
- are many small tasks to run. Boolean value, default: false.
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.sizebasedweight
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- Take into account job sizes in calculating their weights for fair
|
|
|
- sharing.By default, weights are only based on job priorities.
|
|
|
- Setting this flag to true will make them based on the size of the
|
|
|
- job (number of tasks needed) as well,though not linearly
|
|
|
- (the weight will be proportional to the log of the number of tasks
|
|
|
- needed). This lets larger jobs get larger fair shares while still
|
|
|
- providing enough of a share to small jobs to let them finish fast.
|
|
|
- Boolean value, default: false.
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.poolnameproperty
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- Specify which jobconf property is used to determine the pool that a
|
|
|
- job belongs in. String, default: user.name (i.e. one pool for each
|
|
|
- user). Some other useful values to set this to are: <br/>
|
|
|
- <ul>
|
|
|
- <li> group.name (to create a pool per Unix group).</li>
|
|
|
- <li>mapred.job.queue.name (the same property as the queue name in
|
|
|
- <a href="capacity_scheduler.html">Capacity Scheduler</a>).</li>
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.loadmanager
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extension point that lets you specify a class that determines
|
|
|
+ how many maps and reduces can run on a given TaskTracker. This class
|
|
|
+ should implement the LoadManager interface. By default the task caps
|
|
|
+ in the Hadoop config file are used, but this option could be used to
|
|
|
+ make the load based on available memory and CPU utilization for example.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.taskselector
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ An extension point that lets you specify a class that determines
|
|
|
+ which task from within a job to launch on a given tracker. This can be
|
|
|
+ used to change either the locality policy (e.g. keep some jobs within
|
|
|
+ a particular rack) or the speculative execution algorithm (select
|
|
|
+ when to launch speculative tasks). The default implementation uses
|
|
|
+ Hadoop's default algorithms from JobInProgress.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <!--
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.eventlog.enabled
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ Enable a detailed log of fair scheduler events, useful for
|
|
|
+ debugging.
|
|
|
+ This log is stored in <em>HADOOP_LOG_DIR/fairscheduler</em>.
|
|
|
+ Boolean value, default: false.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ <tr>
|
|
|
+ <td>
|
|
|
+ mapred.fairscheduler.dump.interval
|
|
|
+ </td>
|
|
|
+ <td>
|
|
|
+ If using the event log, this is the interval at which to dump
|
|
|
+ complete scheduler state (list of pools and jobs) to the log.
|
|
|
+ Integer value in milliseconds, default: 10000.
|
|
|
+ </td>
|
|
|
+ </tr>
|
|
|
+ -->
|
|
|
+ </table>
|
|
|
+ </section>
|
|
|
+ <section>
|
|
|
+ <title>Allocation File (fair-scheduler.xml)</title>
|
|
|
+ <p>
|
|
|
+ The allocation file configures minimum shares, running job
|
|
|
+ limits, weights and preemption timeouts for each pool.
|
|
|
+ Only users/pools whose values differ from the defaults need to be
|
|
|
+ explicitly configured in this file.
|
|
|
+ The allocation file is located in
|
|
|
+ <em>HADOOP_HOME/conf/fair-scheduler.xml</em>.
|
|
|
+ It can contain the following types of elements:
|
|
|
+ </p>
|
|
|
+ <ul>
|
|
|
+ <li><em>pool</em> elements, which configure each pool.
|
|
|
+ These may contain the following sub-elements:
|
|
|
+ <ul>
|
|
|
+ <li><em>minMaps</em> and <em>minReduces</em>,
|
|
|
+ to set the pool's minimum share of task slots.</li>
|
|
|
+ <li><em>maxMaps</em> and <em>maxReduces</em>, to set the
|
|
|
+ pool's maximum concurrent task slots.</li>
|
|
|
+ <li><em>schedulingMode</em>, the pool's internal scheduling mode,
|
|
|
+ which can be <em>fair</em> for fair sharing or <em>fifo</em> for
|
|
|
+ first-in-first-out.</li>
|
|
|
+ <li><em>maxRunningJobs</em>,
|
|
|
+ to limit the number of jobs from the
|
|
|
+ pool to run at once (defaults to infinite).</li>
|
|
|
+ <li><em>weight</em>, to share the cluster
|
|
|
+ non-proportionally with other pools. For example, a pool with weight 2.0 will get a 2x higher share than other pools. The default weight is 1.0.</li>
|
|
|
+ <li><em>minSharePreemptionTimeout</em>, the
|
|
|
+ number of seconds the pool will wait before
|
|
|
+ killing other pools' tasks if it is below its minimum share
|
|
|
+ (defaults to infinite).</li>
|
|
|
</ul>
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.weightadjuster
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- An extensibility point that lets you specify a class to adjust the
|
|
|
- weights of running jobs. This class should implement the
|
|
|
- <em>WeightAdjuster</em> interface. There is currently one example
|
|
|
- implementation - <em>NewJobWeightBooster</em>, which increases the
|
|
|
- weight of jobs for the first 5 minutes of their lifetime to let
|
|
|
- short jobs finish faster. To use it, set the weightadjuster
|
|
|
- property to the full class name,
|
|
|
- <code>org.apache.hadoop.mapred.NewJobWeightBooster</code>
|
|
|
- NewJobWeightBooster itself provides two parameters for setting the
|
|
|
- duration and boost factor. <br/>
|
|
|
- <ol>
|
|
|
- <li> <em>mapred.newjobweightbooster.factor</em>
|
|
|
- Factor by which new jobs weight should be boosted. Default is 3</li>
|
|
|
- <li><em>mapred.newjobweightbooster.duration</em>
|
|
|
- Duration in milliseconds, default 300000 for 5 minutes</li>
|
|
|
- </ol>
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.loadmanager
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- An extensibility point that lets you specify a class that determines
|
|
|
- how many maps and reduces can run on a given TaskTracker. This class
|
|
|
- should implement the LoadManager interface. By default the task caps
|
|
|
- in the Hadoop config file are used, but this option could be used to
|
|
|
- make the load based on available memory and CPU utilization for example.
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- <tr>
|
|
|
- <td>
|
|
|
- mapred.fairscheduler.taskselector:
|
|
|
- </td>
|
|
|
- <td>
|
|
|
- An extensibility point that lets you specify a class that determines
|
|
|
- which task from within a job to launch on a given tracker. This can be
|
|
|
- used to change either the locality policy (e.g. keep some jobs within
|
|
|
- a particular rack) or the speculative execution algorithm (select
|
|
|
- when to launch speculative tasks). The default implementation uses
|
|
|
- Hadoop's default algorithms from JobInProgress.
|
|
|
- </td>
|
|
|
- </tr>
|
|
|
- </table>
|
|
|
+ </li>
|
|
|
+ <li><em>user</em> elements, which may contain a
|
|
|
+ <em>maxRunningJobs</em> element to limit
|
|
|
+ jobs. Note that by default, there is a pool for each
|
|
|
+ user, so per-user limits are not necessary.</li>
|
|
|
+ <li><em>poolMaxJobsDefault</em>, which sets the default running
|
|
|
+ job limit for any pools whose limit is not specified.</li>
|
|
|
+ <li><em>userMaxJobsDefault</em>, which sets the default running
|
|
|
+ job limit for any users whose limit is not specified.</li>
|
|
|
+ <li><em>defaultMinSharePreemptionTimeout</em>,
|
|
|
+ which sets the default minimum share preemption timeout
|
|
|
+ for any pools where it is not specified.</li>
|
|
|
+ <li><em>fairSharePreemptionTimeout</em>,
|
|
|
+ which sets the preemption timeout used when jobs are below half
|
|
|
+ their fair share.</li>
|
|
|
+ <li><em>defaultPoolSchedulingMode</em>, which sets the default scheduling
|
|
|
+ mode (<em>fair</em> or <em>fifo</em>) for pools whose mode is
|
|
|
+ not specified.</li>
|
|
|
+ </ul>
|
|
|
+ <p>
|
|
|
+ Pool and user elements only required if you are setting
|
|
|
+ non-default values for the pool/user. That is, you do not need to
|
|
|
+ declare all users and all pools in your config file before running
|
|
|
+ the fair scheduler. If a user or pool is not listed in the config file,
|
|
|
+ the default values for limits, preemption timeouts, etc will be used.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ An example allocation file is given below : </p>
|
|
|
+<source>
|
|
|
+<?xml version="1.0"?>
|
|
|
+<allocations>
|
|
|
+ <pool name="sample_pool">
|
|
|
+ <minMaps>5</minMaps>
|
|
|
+ <minReduces>5</minReduces>
|
|
|
+ <maxMaps>25</maxMaps>
|
|
|
+ <maxReduces>25</maxReduces>
|
|
|
+ <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
|
|
|
+ </pool>
|
|
|
+ <user name="sample_user">
|
|
|
+ <maxRunningJobs>6</maxRunningJobs>
|
|
|
+ </user>
|
|
|
+ <userMaxJobsDefault>3</userMaxJobsDefault>
|
|
|
+ <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
|
|
|
+</allocations>
|
|
|
+</source>
|
|
|
+ <p>
|
|
|
+ This example creates a pool sample_pool with a guarantee of 5 map
|
|
|
+ slots and 5 reduce slots. The pool also has a minimum share preemption
|
|
|
+ timeout of 300 seconds (5 minutes), meaning that if it does not get its
|
|
|
+ guaranteed share within this time, it is allowed to kill tasks from
|
|
|
+ other pools to achieve its share. The pool has a cap of 25 map and 25
|
|
|
+ reduce slots, which means that once 25 tasks are running, no more will
|
|
|
+ be scheduled even if the pool's fair share is higher.
|
|
|
+ The example also limits the number of running jobs
|
|
|
+ per user to 3, except for sample_user, who can run 6 jobs concurrently.
|
|
|
+ Finally, the example sets a fair share preemption timeout of 600 seconds
|
|
|
+ (10 minutes). If a job is below half its fair share for 10 minutes, it
|
|
|
+ will be allowed to kill tasks from other jobs to achieve its share.
|
|
|
+ Note that the preemption settings require preemption to be
|
|
|
+ enabled in <em>mapred-site.xml</em> as described earlier.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Any pool not defined in the allocation file will have no guaranteed
|
|
|
+ capacity and no preemption timeout. Also, any pool or user with no max
|
|
|
+ running jobs set in the file will be allowed to run an unlimited
|
|
|
+ number of jobs.
|
|
|
+ </p>
|
|
|
+ </section>
|
|
|
+ <section>
|
|
|
+ <title>Access Control Lists (ACLs)</title>
|
|
|
+ <p>
|
|
|
+ The fair scheduler can be used in tandem with the "queue" based access
|
|
|
+ control system in MapReduce to restrict which pools each user can access.
|
|
|
+ To do this, first enable ACLs and set up some queues as described in the
|
|
|
+ <a href="mapred_tutorial.html#Job+Authorization">MapReduce usage guide</a>,
|
|
|
+ then set the fair scheduler to use one pool per queue by adding
|
|
|
+ the following property in <em>HADOOP_CONF_DIR/mapred-site.xml</em>:
|
|
|
+ </p>
|
|
|
+<source>
|
|
|
+<property>
|
|
|
+ <name>mapred.fairscheduler.poolnameproperty</name>
|
|
|
+ <value>mapred.job.queue.name</value>
|
|
|
+</property>
|
|
|
+</source>
|
|
|
+ <p>
|
|
|
+ You can then set the minimum share, weight, and internal scheduling mode
|
|
|
+ for each pool as described earlier.
|
|
|
+ In addition, make sure that users submit jobs to the right queue by setting
|
|
|
+ the <em>mapred.job.queue.name</em> property in their jobs.
|
|
|
+ </p>
|
|
|
+ </section>
|
|
|
</section>
|
|
|
<section>
|
|
|
<title> Administration</title>
|
|
@@ -280,14 +472,15 @@
|
|
|
</p>
|
|
|
<ol>
|
|
|
<li>
|
|
|
- It is possible to modify pools' allocations
|
|
|
- and user and pool running job limits at runtime by editing the allocation
|
|
|
- config file. The scheduler will reload this file 10-15 seconds after it
|
|
|
+ It is possible to modify minimum shares, limits, weights, preemption
|
|
|
+ timeouts and pool scheduling modes at runtime by editing the allocation
|
|
|
+ file. The scheduler will reload this file 10-15 seconds after it
|
|
|
sees that it was modified.
|
|
|
</li>
|
|
|
<li>
|
|
|
Current jobs, pools, and fair shares can be examined through the
|
|
|
- JobTracker's web interface, at http://<jobtracker URL>/scheduler.
|
|
|
+ JobTracker's web interface, at
|
|
|
+ <em>http://<JobTracker URL>/scheduler</em>.
|
|
|
On this interface, it is also possible to modify jobs' priorities or
|
|
|
move jobs from one pool to another and see the effects on the fair
|
|
|
shares (this requires JavaScript).
|
|
@@ -312,24 +505,36 @@
|
|
|
the job has had, but on average it will get its fair share amount.</li>
|
|
|
</ul>
|
|
|
<p>
|
|
|
- In addition, it is possible to turn on an "advanced" view for the web UI,
|
|
|
- by going to http://<jobtracker URL>/scheduler?advanced. This view shows
|
|
|
- four more columns used for calculations internally:
|
|
|
+ In addition, it is possible to view an "advanced" version of the web
|
|
|
+ UI by going to <em>http://<JobTracker URL>/scheduler?advanced</em>.
|
|
|
+ This view shows two more columns:
|
|
|
</p>
|
|
|
<ul>
|
|
|
<li><em>Maps/Reduce Weight</em>: Weight of the job in the fair sharing
|
|
|
calculations. This depends on priority and potentially also on
|
|
|
job size and job age if the <em>sizebasedweight</em> and
|
|
|
<em>NewJobWeightBooster</em> are enabled.</li>
|
|
|
- <li><em>Map/Reduce Deficit</em>: The job's scheduling deficit in machine-
|
|
|
- seconds - the amount of resources it should have gotten according to
|
|
|
- its fair share, minus how many it actually got. Positive deficit means
|
|
|
- the job will be scheduled again in the near future because it needs to
|
|
|
- catch up to its fair share. The scheduler schedules jobs with higher
|
|
|
- deficit ahead of others. Please see the Implementation section of
|
|
|
- this document for details.</li>
|
|
|
</ul>
|
|
|
</section>
|
|
|
+ <section>
|
|
|
+ <title>Metrics</title>
|
|
|
+ <p>
|
|
|
+ The fair scheduler can export metrics using the Hadoop metrics interface.
|
|
|
+ This can be enabled by adding an entry to <code>hadoop-metrics.properties</code>
|
|
|
+ to enable the <code>fairscheduler</code> metrics context. For example, to
|
|
|
+ simply retain the metrics in memory so they may be viewed in the <code>/metrics</code>
|
|
|
+ servlet:
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ <code>fairscheduler.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext</code>
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Metrics are generated for each pool and job, and contain the same information that
|
|
|
+ is visible on the <code>/scheduler</code> web page.
|
|
|
+ </p>
|
|
|
+ </section>
|
|
|
+
|
|
|
+ <!--
|
|
|
<section>
|
|
|
<title>Implementation</title>
|
|
|
<p>There are two aspects to implementing fair scheduling: Calculating
|
|
@@ -359,13 +564,31 @@
|
|
|
This capacity is divided among the jobs in that pool according again to
|
|
|
their weights.
|
|
|
</p>
|
|
|
- <p>Finally, when limits on a user's running jobs or a pool's running jobs
|
|
|
+ <p>When limits on a user's running jobs or a pool's running jobs
|
|
|
are in place, we choose which jobs get to run by sorting all jobs in order
|
|
|
of priority and then submit time, as in the standard Hadoop scheduler. Any
|
|
|
jobs that fall after the user/pool's limit in this ordering are queued up
|
|
|
and wait idle until they can be run. During this time, they are ignored
|
|
|
from the fair sharing calculations and do not gain or lose deficit (their
|
|
|
fair share is set to zero).</p>
|
|
|
+ <p>
|
|
|
+ Preemption is implemented by periodically checking whether jobs are
|
|
|
+ below their minimum share or below half their fair share. If a job has
|
|
|
+ been below its share for sufficiently long, it is allowed to kill
|
|
|
+ other jobs' tasks. The tasks chosen are the most-recently-launched
|
|
|
+ tasks from over-allocated jobs, to minimize the amount of wasted
|
|
|
+ computation.
|
|
|
+ </p>
|
|
|
+ <p>
|
|
|
+ Finally, the fair scheduler provides several extension points where
|
|
|
+ the basic functionality can be extended. For example, the weight
|
|
|
+ calculation can be modified to give a priority boost to new jobs,
|
|
|
+ implementing a "shortest job first" policy which reduces response
|
|
|
+ times for interactive jobs even further.
|
|
|
+ These extension points are listed in
|
|
|
+ <a href="#Scheduler+Parameters+in+mapred-site.xml">Advanced Parameters</a>.
|
|
|
+ </p>
|
|
|
</section>
|
|
|
+ -->
|
|
|
</body>
|
|
|
</document>
|