Fixed a race condition in writing the log index file that caused tasks to 'fail'.
Removed duplicate chmods of the job log dir that were vulnerable to race conditions between tasks. Also improved the messages emitted when symlink creation fails.
There is a rare race condition in the Linux task controller when concurrent task processes try to create the job log directory at the same time.
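For illustration, directory creation can be made tolerant of this race with the usual re-check pattern (a sketch, not the task controller's actual code):
{code}
import java.io.File;
import java.io.IOException;

// File.mkdirs() returns false both when creation fails and when another
// process created the directory first, so re-check before failing.
static void createJobLogDir(File jobLogDir) throws IOException {
  if (!jobLogDir.mkdirs() && !jobLogDir.isDirectory()) {
    throw new IOException("Cannot create job log directory " + jobLogDir);
  }
}
{code}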
{quote}
Error Message
Queue "q1" does not exist
Stacktrace
java.io.IOException: Queue "q1" does not exist
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:354)
at org.apache.hadoop.mapred.TestCapacityScheduler$FakeJobInProgress.<init>(TestCapacityScheduler.java:172)
at org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:794)
at org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:818)
at org.apache.hadoop.mapred.TestCapacityScheduler.submitJobAndInit(TestCapacityScheduler.java:825)
at org.apache.hadoop.mapred.TestCapacityScheduler.testMultiTaskAssignmentInMultipleQueues(TestCapacityScheduler.java:1109)
{quote}
An exception is now thrown when the queue name is invalid.
We would like to record and present the jobtracker metrics on a per-queue basis.
Observing a lot of jvmValidate exceptions in TT logs during a gridmix run:
{noformat}
2011-04-28 02:00:37,578 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 46121, call
statusUpdate(attempt_201104270735_5993_m_003305_0, org.apache.hadoop.mapred.MapTaskStatus@1840a9c,
org.apache.hadoop.mapred.JvmContext@1d4ab6b) from 127.0.0.1:50864: error: java.io.IOException: JvmValidate Failed.
Ignoring request from task: attempt_201104270735_5993_m_003305_0, with JvmId:
jvm_201104270735_5993_m_103399012gsbl20430: java.io.IOException: JvmValidate Failed. Ignoring request from task:
attempt_201104270735_5993_m_003305_0, with JvmId: jvm_201104270735_5993_m_103399012gsbl20430: --
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1386)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1384)
{noformat}
Added 2 new config parameters:
mapreduce.reduce.shuffle.catch.exception.stack.regex
mapreduce.reduce.shuffle.catch.exception.message.regex
Added a new configuration option: mapreduce.reduce.shuffle.maxfetchfailures, and removed a no longer used option: mapred.reduce.copy.backoff.
The cleanup thread in the distributed cache handles IOExceptions and the like correctly, but to be a bit more defensive it would be good to monitor the thread and regularly check that it is still alive, so that the distributed cache does not fill up the entire disk on the node.
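A minimal sketch of such a watchdog check (cleanupThread and createCleanupThread() are assumed names, not the actual distributed cache API):
{code}
private Thread cleanupThread;  // the existing cleanup thread

// Called periodically to verify the cleanup thread is still running,
// restarting it if it died despite its own exception handling.
synchronized void checkAndRestartCleanupThread() {
  if (cleanupThread == null || !cleanupThread.isAlive()) {
    LOG.warn("Distributed cache cleanup thread is not alive; restarting it");
    cleanupThread = createCleanupThread();  // hypothetical factory method
    cleanupThread.setDaemon(true);
    cleanupThread.start();
  }
}
{code}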
Gain some insight into blacklist increments/decrements by enhancing the debug logging.
Added mapreduce.tasktracker.distributedcache.checkperiod to the task tracker; it defines the period to wait between distributed cache cleanup runs. The default is 1 minute.
This JIRA is to provide more useful log information for debugging the "Too many fetch-failures" error.
Looking at the JobTracker node, we see messages like this:
"2010-12-14 00:00:06,911 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #8 for task
attempt_201011300729_189729_m_007458_0".
It would be useful to see which reducer is reporting the error here.
So, I propose we add the following to these log messages:
1. reduce task ID
2. TaskTracker nodenames for both the mapper and the reducer
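For illustration, the enhanced message could look like this (all variable names are hypothetical):
{code}
LOG.info("Failed fetch notification #" + notificationNo
    + " for map task " + mapTaskId + " running on " + mapTrackerName
    + ", reported by reduce task " + reduceTaskId + " on " + reduceTrackerName);
{code}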
The information on why a specific TaskTracker got blacklisted is not stored anywhere. The JobTracker web UI shows the detailed reason string only until the TT gets unblacklisted; after that it is lost.
TaskTracker.validateJVM() throws an NPE when setupWorkDir() throws an IOException, because taskFinal.setJvmContext() has not been executed yet.
src/test/system/aop/org/apache/hadoop/mapred/TaskAspect.aj:72 [warning] advice defined in org.apache.hadoop.mapred.TaskAspect has not been applied [Xlint:adviceDidNotMatch]
After the fix in MR-2429, the call to ping in TaskAspect needs to be fixed.
Currently the TT doesn't check that the jvmid is relevant during communication with the Child via TaskUmbilicalProtocol.
Job error details are not displayed in the JobHistory servlet, e.g. errors like 'Counter limit exceeded for a job'.
jobdetails.jsp has 'Failure Info', but this is missing in jobdetailshistory.jsp.
Currently, the userlogs directory on a TaskTracker is placed under hadoop.log.dir, as <hadoop.log.dir>/userlogs. I am proposing to spread these userlogs across the configured mapred.local.dirs to strengthen TaskTracker reliability w.r.t. disk failures.
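A hedged sketch of the idea, picking one of the configured local dirs per job (the hash-based selection policy is illustrative only):
{code}
String[] localDirs = conf.getStrings("mapred.local.dir");
int i = (jobId.toString().hashCode() & Integer.MAX_VALUE) % localDirs.length;
File userLogDir = new File(localDirs[i], "userlogs" + File.separator + jobId);
{code}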
At present, TaskTracker doesn't handle disk failures properly both at startup and runtime.
(1) Currently TaskTracker doesn't come up if any of the mapred-local-dirs is on a bad disk. TaskTracker should ignore that particular mapred-local-dir, start up, and use only the remaining good mapred-local-dirs.
(2) If a disk goes bad while TaskTracker is running, currently TaskTracker doesn't do anything special. This results in either
(a) TaskTracker continuing to try to use the bad disk, resulting in lots of task failures and possibly job failures (because multiple TTs have bad disks), and eventually these TTs getting graylisted for all jobs. This needs a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk. OR
(b) The health-check script identifying the disk as bad and the TT getting blacklisted. This also needs a manual restart of the TT with a modified mapred-local-dirs configuration that avoids the bad disk.
This JIRA is to make TaskTracker more fault-tolerant to disk failures, solving (1) and (2): the TT should start as long as at least one of the mapred-local-dirs is on a good disk, and the TT should adjust its in-memory list of mapred-local-dirs to avoid using bad ones.
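A sketch of the startup half of this, filtering mapred-local-dirs through the existing DiskChecker utility (the surrounding method is illustrative):
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.DiskChecker;
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

// Keep only the local dirs that pass a basic disk check; refuse to start
// only when none of them are usable.
static List<String> filterGoodLocalDirs(String[] localDirs) throws DiskErrorException {
  List<String> good = new ArrayList<String>();
  for (String dir : localDirs) {
    try {
      DiskChecker.checkDir(new File(dir));
      good.add(dir);
    } catch (DiskErrorException e) {
      LOG.warn("Ignoring bad mapred-local-dir " + dir + ": " + e.getMessage());
    }
  }
  if (good.isEmpty()) {
    throw new DiskErrorException("No usable mapred-local-dirs");
  }
  return good;
}
{code}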
With this patch we check for that and print a message in the logs, so you find out about it at submission time.
If a 'global' file is specified as a 'file' by one job, subsequent jobs cannot override this source file to be an 'archive' (until the TT cleans up its cache or the TT restarts).
The same applies the other way around: 'archive' to 'file'.
In case of an accidental submission using the wrong type, some of the tasks for the second job will end up seeing the source file as an archive, others as a file.
Problem where the task browser UI can't retrieve the stdxxx printouts of streaming jobs that abend in the Unix code, in the common case where the containing job doesn't reuse JVMs.
There is a deadlock while localizing resources on the TaskTracker.
Fix unit-test failures: TestBadRecords (NPE due to rearranged MapTask code) and TestTaskTrackerMemoryManager (need hostname in output-string pattern).
The job client strips the file system from the user's job jar, which causes breakage when it isn't the default file system.
We are passing fs.default.name as viewfs:/// in core-site.xml on the Oozie server.
The default file system in our configuration is also viewfs:///.
We are using an hdfs:// path for our application.
It gives the following error:
IllegalArgumentException: Wrong FS:
hdfs://nn/user/strat_ci/oozie-oozi/0000002-110217014830452-oozie-oozi-W/hadoop1--map-reduce/map-reduce-launcher.jar,
expected: viewfs:/
Mapred assumes hdfs as the default fs even when defined otherwise.
If you're using a custom RecordReader/InputFormat setup with an InputSplit that does NOT extend FileSplit, then any exceptions you throw in your RecordReader.nextKeyValue() function are silently ignored.
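A minimal reproduction of the setup, assuming the new org.apache.hadoop.mapreduce API; the split below deliberately does not extend FileSplit:
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// A trivial custom split that is not a FileSplit.
public class RangeSplit extends InputSplit implements Writable {
  private long start, end;

  public RangeSplit() {}  // required for deserialization
  public RangeSplit(long start, long end) { this.start = start; this.end = end; }

  @Override public long getLength() { return end - start; }
  @Override public String[] getLocations() { return new String[0]; }

  public void write(DataOutput out) throws IOException {
    out.writeLong(start);
    out.writeLong(end);
  }
  public void readFields(DataInput in) throws IOException {
    start = in.readLong();
    end = in.readLong();
  }
}
{code}
With a split like this, an exception thrown from the matching RecordReader.nextKeyValue() is swallowed instead of failing the task attempt.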
From Luke Lu:
Here is a summary of why the failed map task was considered "successful" (Thanks to Mahadev, Arun and Devaraj
for insightful discussions).
1. The map task was hanging BEFORE being initialized (probably in localization, but it doesn't matter in this case).
Its state is UNASSIGNED.
2. The jt decided to kill it due to timeout and scheduled a cleanup task on the same node.
3. The cleanup task has the same attempt id (by design) but runs in a different JVM. Its initial state is
FAILED_UNCLEAN.
4. The JVM of the original attempt was being killed while proceeding to setupWorkDir, and threw an
IllegalStateException from FileSystem.getLocal, which caused taskFinal.taskCleanup to be called in Child and
triggered the NPE because the task was not yet initialized (committer is null). Before the NPE, however, it sent a
statusUpdate to the TT, and in tip.reportProgress changed the task state (at that point FAILED_UNCLEAN) to UNASSIGNED.
5. The cleanup attempt succeeded and reported done to the TT. In tip.reportDone, the isCleanup() check returned false due to
the UNASSIGNED state, and the task state was set to SUCCEEDED.
HADOOP-3136 changed the default o.a.h.mapred.JobQueueTaskScheduler to assign multiple tasks per TaskTracker heartbeat; the capacity-scheduler should do the same.
JobContext provides a read-only view of the job's info, hence all the read-only fields in JobContext are set in the constructor. Job extends JobContext. When a Job is created, the jobid is not known, so there is no way to set the JobID once the Job has been created. The JobID is obtained only when the JobClient queries the JobTracker for a job-id, which happens later, i.e. upon job submission.
Test case TestHdfsProxy.testHdfsProxyInterface has been temporarily disabled for this release, due to failure in the Hudson automated test environment.
As part of fixing a datanode process hang, this part of the code was introduced in 0.20.204 to clean up all the waiting threads:
{code}
- try {
-   readPool.awaitTermination(10, TimeUnit.SECONDS);
- } catch (InterruptedException e) {
-   LOG.info("Exception occured in doStop:" + e.getMessage());
- }
- readPool.shutdownNow();
{code}
This was clearly meant for production, but all the unit tests use MiniDFSCluster and MiniMRCluster, whose shutdown waits on this part of the code. Because of this, we saw an increase in unit-test run times, so we are removing this code.
The test makes assumptions about timing issues that hold true in workstation environments but not in Hudson auto-test.
Since we have multiple JIRAs in trunk for common and hdfs, I am creating another JIRA for this issue.
This patch addresses the following:
1. Provides FileUtil APIs for list and listFiles which throw IOException for null cases.
2. Replaces most of the code that used the JDK file API directly with the FileUtil API.
In 20.204, TestHDFSServerPorts was observed to intermittently throw a NullPointerException. This only happens when FSNamesystem.close() is called, which means system termination for the Namenode, so this is not a serious bug for .204. TestHDFSServerPorts is more likely than normal execution to stimulate the race, because it runs two Namenodes in the same JVM, causing more interleaving and more potential to see a race condition.
The race is in FSNamesystem.close(), line 566, we have:
{code}
if (replthread != null) replthread.interrupt();
if (replmon != null) replmon = null;
{code}
Since the interrupted replthread is not waited on, there is a potential race condition with replmon being nulled before replthread is dead, but replthread references replmon in computeDatanodeWork() where the NullPointerException occurs.
The solution is either to wait on replthread or just don't null replmon. The latter is preferred, since none of the sibling Namenode processing threads are waited on in close().
I'll attach a patch for .205.
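A sketch of the preferred shape of close() after the fix:
{code}
// Interrupt the replication thread but leave replmon alone, since the
// thread may still dereference it in computeDatanodeWork() before exiting.
if (replthread != null) replthread.interrupt();
// replmon is intentionally no longer nulled here.
{code}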
The same opcodes are used for different operations between 0.20.security, 0.22 and 0.23. This results in failures to load edit logs on a later release, especially during upgrades.
Our service engineering team, who operate the clusters on a daily basis, find it confusing that after a data node is decommissioned, there is no way to make the cluster forget about it; it always remains in the dead node list.
Consider a large cluster that takes 40 minutes to start up. The datanodes compete to register and send their Initial Block Reports (IBRs) as fast as they can after startup (subject to a small sub-two-minute random delay, which isn't relevant to this discussion).
As each datanode succeeds in sending its IBR, it schedules the starting time for its regular cycle of reports, every hour (or other configured value of dfs.blockreport.intervalMsec). In order to spread the reports evenly across the block report interval, each datanode picks a random fraction of that interval, for the starting point of its regular report cycle. For example, if a particular datanode ends up randomly selecting 18 minutes after the hour, then that datanode will send a Block Report at 18 minutes after the hour every hour as long as it remains up. Other datanodes will start their cycles at other randomly selected times. This code is in DataNode.blockReport() and DataNode.scheduleBlockReport().
The "second Block Report" (2BR), is the start of these hourly reports. The problem is that some of these 2BRs get scheduled sooner rather than later, and actually occur within the startup period. For example, if the cluster takes 40 minutes (2/3 of an hour) to start up, then out of the datanodes that succeed in sending their IBRs during the first 10 minutes, between 1/2 and 2/3 of them will send their 2BR before the 40-minute startup time has completed!
2BRs sent within the startup time actually compete with the remaining IBRs, and thereby slow down the overall startup process. This can be seen in the following data, which shows the startup process for a 3700-node cluster that took about 17 minutes to finish startup:
{noformat}
time starts sum regs sum IBR sum 2nd_BR sum total_BRs/min
0 1299799498 3042 3042 1969 1969 151 151 0 151
1 1299799558 665 3707 1470 3439 248 399 0 248
2 1299799618 3707 224 3663 270 669 0 270
3 1299799678 3707 14 3677 261 930 3 3 264
4 1299799738 3707 23 3700 288 1218 1 4 289
5 1299799798 3707 7 3707 258 1476 3 7 261
6 1299799858 3707 3707 317 1793 4 11 321
7 1299799918 3707 3707 292 2085 6 17 298
8 1299799978 3707 3707 292 2377 8 25 300
9 1299800038 3707 3707 272 2649 25 272
10 1299800098 3707 3707 280 2929 15 40 295
11 1299800158 3707 3707 223 3152 14 54 237
12 1299800218 3707 3707 143 3295 54 143
13 1299800278 3707 3707 141 3436 20 74 161
14 1299800338 3707 3707 195 3631 78 152 273
15 1299800398 3707 3707 51 3682 209 361 260
16 1299800458 3707 3707 25 3707 369 730 394
17 1299800518 3707 3707 3707 166 896 166
18 1299800578 3707 3707 3707 72 968 72
19 1299800638 3707 3707 3707 67 1035 67
20 1299800698 3707 3707 3707 75 1110 75
21 1299800758 3707 3707 3707 71 1181 71
22 1299800818 3707 3707 3707 67 1248 67
23 1299800878 3707 3707 3707 62 1310 62
24 1299800938 3707 3707 3707 56 1366 56
25 1299800998 3707 3707 3707 60 1426 60
{noformat}
This data was harvested from the startup logs of all the datanodes, and correlated into one-minute buckets. Each row of the table represents the progress during one elapsed minute of clock time. It seems that every cluster startup is different, but this one showed the effect fairly well.
The "starts" column shows that all the nodes started up within the first 2 minutes, and the "regs" column shows that all succeeded in registering by minute 6. The IBR column shows a sustained rate of Initial Block Report processing of 250-300/minute for the first 10 minutes.
The question is why, during minutes 11 through 16, the rate of IBR processing slowed down. Why didn't the startup just finish? In the "2nd_BR" column, we see the rate of 2BRs ramping up as more datanodes complete their IBRs. As the rate increases, they become more effective at competing with the IBRs, and slow down the IBR processing even more. After the IBRs finally finish in minute 16, the rate of 2BRs settles down to a steady ~60-70/minute.
In order to decrease competition for locks and other resources and to speed up IBR processing during startup, we propose to delay 2BRs until later in the cycle.
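One hedged way to express the proposal in DataNode.scheduleBlockReport(): keep the per-datanode randomization, but add a floor so the first periodic report cannot land inside the startup window (the floor value shown is illustrative):
{code}
long interval = blockReportInterval;  // dfs.blockreport.intervalMsec
long initialDelay = Math.min(interval / 2, 30 * 60 * 1000L);  // assumed floor
long offset = initialDelay + (long) (Math.random() * (interval - initialDelay));
// the next periodic block report fires "offset" ms from now
lastBlockReport = System.currentTimeMillis() - (interval - offset);
{code}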
The set of JSP pages that the web UI uses are not thread safe. We have observed problems when requesting the Live/Dead/Decommissioning pages from the web UI: an incorrect page is displayed. To be more specific, when requesting the dead node list page, sometimes the live node page is returned; when requesting the decommissioning page, sometimes the dead page is returned.
The root cause of this problem is that JSP pages are not thread safe by default. When multiple requests come in, each is assigned to a different thread, and multiple threads access the same instance of the servlet class generated from the JSP page, so a class variable is shared by multiple threads. The JSP code in the 20 branch, for example dfsnodelist.jsp, has
{code}
<%!
int rowNum = 0;
int colNum = 0;
String sorterField = null;
String sorterOrder = null;
String whatNodes = "LIVE";
...
%>
{code}
declared as class variables. (These variables are declared within <%! code %> directives, which makes them class members.) Multiple threads share the same set of class member variables, and one request would step on another's toes.
However, due to the JSP code refactoring in HADOOP-5857, all of these class member variables were moved to become function-local variables, so this bug does not appear in Apache trunk. Hence we propose a simple fix for this bug on the 20 branch alone, specifically branch-0.20-security.
The simple fix is to add the JSP directive <%@ page isThreadSafe="false" %> to the related JSP pages, dfshealth.jsp and dfsnodelist.jsp, to make them thread safe, i.e. only one request is processed at a time.
We did evaluate the thread-safety issue for the other JSP pages on trunk. We noticed a potential problem when retrieving some statistics from the namenode; for example, we make the call to
{code}
NamenodeJspHelper.getInodeLimitText(fsn);
{code}
in dfshealth.jsp, which eventually calls
{code}
static String getInodeLimitText(FSNamesystem fsn) {
long inodes = fsn.dir.totalInodes();
long blocks = fsn.getBlocksTotal();
long maxobjects = fsn.getMaxObjects();
....
{code}
Some of the function calls are already guarded by the read/write lock, e.g. dir.totalInodes, but others are not. As a result, the web UI results are not 100% thread safe. But after evaluating the pros and cons of adding a giant lock into the JSP pages, we decided not to introduce FSNamesystem read/write locks into the JSPs.
{noformat}
hadoop dfs -touchz /tmp/file1 # create file. OK
hadoop dfs -ls /tmp/file1 # OK
hadoop dfs -ls hftp://namenode:50070/tmp/file1 # FAILED: not seeing the file
{noformat}
In secure mode, when more disks fail than the number of volumes tolerated, the datanode process doesn't exit properly; it just hangs even though the shutdown method is called.
Datanode startup doesn't honor volumes.tolerated in the Hadoop 20 version.
In a big cluster, when the namenode starts up, it takes a long time to process block reports from all datanodes. Because heartbeat processing gets delayed, some datanodes are erroneously marked as dead; later they have to register again, thus wasting time.
It would speed up startup if the checking for dead nodes were disabled while the namenode is in safemode.
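A minimal sketch of such a guard (the method name follows FSNamesystem, but treat this as illustrative):
{code}
// While in safemode, skip dead-datanode detection entirely so that
// heartbeats delayed by block report processing cannot mark live nodes dead.
void heartbeatCheck() {
  if (isInSafeMode()) {
    return;
  }
  // ... existing scan for expired heartbeats ...
}
{code}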
Batch hardlinking during "upgrade" snapshots, cutting time from approximately 8 minutes per volume to approximately 8 seconds. Validated on both Linux and Windows. Depends on prior integration with the patch for HADOOP-7133.
There's a bug in the quota code that causes quotas not to be respected when a file is not an exact multiple of the block size. Here's an example:
{code}
$ hadoop fs -mkdir /test
$ hadoop dfsadmin -setSpaceQuota 384M /test
$ ls dir/ | wc -l # dir contains 101 files
101
$ du -ms dir # each is 3mb
304 dir
$ hadoop fs -put dir /test
$ hadoop fs -count -q /test
none inf 402653184 -550502400 2 101 317718528 hdfs://haus01.sf.cloudera.com:10020/test
$ hadoop fs -stat "%o %r" /test/dir/f30
134217728 3 # three 128mb blocks
{code}
INodeDirectoryWithQuota caches the number of bytes consumed by its children in {{diskspace}}. The quota adjustment code has a bug that causes {{diskspace}} to be updated incorrectly when a file is not an exact multiple of the block size (the value ends up being negative).
This causes the quota checking code to think that the files in the directory consume less space than they actually do, so verifyQuota does not throw a QuotaExceededException even when the directory is over quota. However, the bug isn't visible to users because {{fs count -q}} reports the numbers generated by INode#getContentSummary, which adds up the sizes of the blocks rather than using the cached INodeDirectoryWithQuota#diskspace value.
In FSDirectory#addBlock the disk space consumed is set conservatively to the full block size * the number of replicas:
{code}
updateCount(inodes, inodes.length-1, 0,
fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);
{code}
In FSNameSystem#addStoredBlock we adjust for this conservative estimate by subtracting out the difference between the conservative estimate and what the number of bytes actually stored was:
{code}
//Updated space consumed if required.
INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;
long diff = (file == null) ? 0 :
(file.getPreferredBlockSize() - storedBlock.getNumBytes());
if (diff > 0 && file.isUnderConstruction() &&
cursize < storedBlock.getNumBytes()) {
...
dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
{code}
We do the same in FSDirectory#replaceNode when completing the file, but at file granularity (I believe the intent here is to correct for cases where there's a failure replicating blocks during recovery). Since oldnode is under construction, INodeFile#diskspaceConsumed will use the preferred block size (vs. Block#getNumBytes used by newnode), so we will again subtract out the difference between the full block size and the number of bytes actually stored:
{code}
long dsOld = oldnode.diskspaceConsumed();
...
//check if disk space needs to be updated.
long dsNew = 0;
if (updateDiskspace && (dsNew = newnode.diskspaceConsumed()) != dsOld) {
try {
updateSpaceConsumed(path, 0, dsNew-dsOld);
...
{code}
So in the above example we started with diskspace at 384mb (3 * 128mb) and then subtracted 375mb (to reflect that only 9mb raw was actually used) twice, so for each file the diskspace for the directory becomes -366mb (384mb minus 2 * 375mb). This is why the quota goes negative and yet we can still write more files.
So a directory with lots of single-block files (if a file has multiple blocks, only the final partial block ends up subtracting from the diskspace used) ends up having a quota that's way off.
I think the fix is for FSDirectory#replaceNode not to let the diskspaceConsumed calculations differ when the old and new INodes have the same blocks. I'll work on a patch that also adds a quota test for blocks that are not multiples of the block size, and warns in INodeDirectory#computeContentSummary if the computed size does not match the cached value.
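A hedged sketch of that fix: evaluate both sides over the same block list so a just-completed file cannot produce a spurious delta (the diskspaceConsumed(Block[]) overload is an assumption):
{code}
// Compute old and new consumption over the same set of blocks so the
// preferred-block-size estimate is not subtracted a second time when the
// old (under construction) and new INode reference identical blocks.
Block[] blocks = newnode.getBlocks();
long dsOld = oldnode.diskspaceConsumed(blocks);  // assumed overload
long dsNew = newnode.diskspaceConsumed(blocks);
if (updateDiskspace && dsNew != dsOld) {
  updateSpaceConsumed(path, 0, dsNew - dsOld);
}
{code}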
The HDFS root directory starts out with a default namespace quota of Integer.MAX_VALUE. If you clear this quota (using "hadoop dfsadmin -clrQuota /"), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, and the NN will not come back up from a restart.
HDFS quota counts will be missed between a clear-quota operation and a set-quota.
When setting a quota for a dir, the INodeDirectory is replaced by an INodeDirectoryWithQuota and dir.isQuotaSet() becomes true. When an INodeDirectoryWithQuota is newly created, quota counting is performed. However, when clearing the quota, the quota conf is set to -1 and dir.isQuotaSet() becomes false, while the INodeDirectoryWithQuota is NOT replaced back with an INodeDirectory.
FSDirectory.updateCount only updates the quota count for inodes whose isQuotaSet() is true. So after clearing the quota for a dir, its quota counts are not updated, which is reasonable. But when re-setting a quota for this dir, quota counting is not performed again, so some counts will be missed.
When running hadoop-setup-single-node.sh, the system cannot find the templates configuration directory:
{noformat}
cat: /usr/libexec/../templates/conf/core-site.xml: No such file or directory
cat: /usr/libexec/../templates/conf/hdfs-site.xml: No such file or directory
cat: /usr/libexec/../templates/conf/mapred-site.xml: No such file or directory
cat: /usr/libexec/../templates/conf/hadoop-env.sh: No such file or directory
chown: cannot access `hadoop-env.sh': No such file or directory
chmod: cannot access `hadoop-env.sh': No such file or directory
cp: cannot stat `*.xml': No such file or directory
cp: cannot stat `hadoop-env.sh': No such file or directory
{noformat}
Create a new mechanism to suppress the warning about HADOOP_HOME deprecation.
I'll create a HADOOP_HOME_WARN_SUPPRESS environment variable that suppresses the warning.
The hadoop-config.sh overrides the variable "bin", which makes the scripts use libexec for hadoop-daemon(s).
TestMiniMRDFSCaching fails if test.build.dir is set to something other than build/test.
hadoop-config.sh has been moved to libexec for the binary package, but developers prefer to have hadoop-config.sh in bin. The Hadoop shell scripts should be modified to support both scenarios.
The MetricsSourceAdapter#getAttribute in 0.20.203 is returning the attribute object instead of the value.
Although all metrics in metrics v2 are exposed via the standard JMX mechanisms, most users use Ganglia to collect metrics.
This is to add the eclipse launchers from HADOOP-5911 to the 0.20 security branch.
Eclipse has a notion of "run configuration", which encapsulates what's needed to run or debug an application. I use this quite a bit to start various Hadoop daemons in debug mode, with breakpoints set, to inspect state and what not.
This is simply configuration, so no tests are provided. After running "ant eclipse" and refreshing your project, you should see entries in the Run Configurations and Debug Configurations for launching the various hadoop daemons from within eclipse. There's a template for testing a specific test, and also templates to run all the tests, the job tracker, and a task tracker. It's likely that some parameters need to be further tweaked to have the same behavior as "ant test", but for most tests, this works.
This also requires a small change to build.xml for the eclipse classpath.
Same fix as for HADOOP-7057 for the Hadoop security branch
{noformat}
throw new IOException( "Premeture EOF from inputStream");
{noformat}
Backport HADOOP-6407 into 0.20 based source trees
The javadoc is currently generating 31 warnings.
Much of the Hadoop metrics and status info is available via JMX, especially since 0.20.100 and 0.22+ (HDFS-1318, HADOOP-6728, etc.). For operations staff not familiar with JMX setup, especially JMX with SSL and firewall tunnelling, the usage can be daunting. Using a JMXProxyServlet (a la Tomcat) to translate JMX attributes into JSON output would make a lot of non-Java admins happy.
We could probably use Tomcat's JMXProxyServlet code directly if it already emitted some standard format (JSON or XML, etc.). The code is simple enough to port over, and it can probably integrate with the common HttpServer as one of the default servlets (maybe /jmx) for the pluggable security.
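A rough sketch of such a servlet over the platform MBeanServer (the naive string-built JSON, which does not escape values, and the servlet name are assumptions, not ported Tomcat code):
{code}
import java.io.IOException;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal JMX-to-JSON bridge: dumps every readable attribute of every MBean.
public class JMXJsonServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    resp.setContentType("application/json");
    PrintWriter out = resp.getWriter();
    out.print("{\"beans\":[");
    boolean firstBean = true;
    for (ObjectName name : server.queryNames(null, null)) {
      MBeanAttributeInfo[] attrs;
      try {
        attrs = server.getMBeanInfo(name).getAttributes();
      } catch (Exception e) {
        continue;  // skip beans we cannot introspect
      }
      out.print(firstBean ? "{" : ",{");
      firstBean = false;
      out.print("\"name\":\"" + name + "\"");
      for (MBeanAttributeInfo attr : attrs) {
        if (!attr.isReadable()) continue;
        try {
          Object value = server.getAttribute(name, attr.getName());
          out.print(",\"" + attr.getName() + "\":\"" + value + "\"");
        } catch (Exception e) {
          // some attributes throw at runtime; skip them
        }
      }
      out.print("}");
    }
    out.print("]}");
  }
}
{code}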
Added RPM/DEB packages to build system.
Issue | Component | Notes |
---|---|---|
HADOOP-3344 | build | Changed build procedure for libhdfs to build correctly for different platforms. Build instructions are in the Jira item. |
HADOOP-4253 | conf | Removed from class org.apache.hadoop.fs.RawLocalFileSystem deprecated methods public String getName(), public void lock(Path p, boolean shared) and public void release(Path p). |
HADOOP-4454 | conf | Changed processing of conf/slaves file to allow # to begin a comment. |
HADOOP-4631 | conf | Split hadoop-default.xml into core-default.xml, hdfs-default.xml and mapreduce-default.xml. |
HADOOP-4035 | contrib/capacity-sched | Changed capacity scheduler policy to take note of task memory requirements and task tracker memory availability. |
HADOOP-4445 | contrib/capacity-sched | Changed JobTracker UI to better present the number of active tasks. |
HADOOP-4576 | contrib/capacity-sched | Changed capacity scheduler UI to better present number of running and pending tasks. |
HADOOP-4179 | contrib/chukwa | Introduced Vaidya rule based performance diagnostic tool for Map/Reduce jobs. |
HADOOP-4827 | contrib/chukwa | Improved framework for data aggregation in Chukwa. |
HADOOP-4843 | contrib/chukwa | Introduced Chukwa collection of job history. |
HADOOP-5030 | contrib/chukwa | Changed RPM install location to the value specified by build.properties file. |
HADOOP-5531 | contrib/chukwa | Disabled Chukwa unit tests for 0.20 branch only. |
HADOOP-4789 | contrib/fair-share | Changed fair scheduler to divide resources equally between pools, not jobs. |
HADOOP-4873 | contrib/fair-share | Changed fair scheduler UI to display minMaps and minReduces variables. |
HADOOP-3750 | dfs | Removed deprecated method parseArgs from org.apache.hadoop.fs.FileSystem. |
HADOOP-4029 | dfs | Added name node storage information to the dfshealth page, and moved data node information to a separated page. |
HADOOP-4103 | dfs | Modified dfsadmin -report to report under-replicated blocks, blocks with corrupt replicas, and missing blocks. |
HADOOP-4567 | dfs | Changed GetFileBlockLocations to return topology information for nodes that host the block replicas. |
HADOOP-4572 | dfs | Moved org.apache.hadoop.hdfs.{CreateEditsLog, NNThroughputBenchmark} to org.apache.hadoop.hdfs.server.namenode. |
HADOOP-4618 | dfs | Moved HTTP server from FSNameSystem to NameNode. Removed FSNamesystem.getNameNodeInfoPort(). Replaced FSNamesystem.getDFSNameNodeMachine() and FSNamesystem.getDFSNameNodePort() with new method FSNamesystem.getDFSNameNodeAddress(). Removed constructor NameNode(bindAddress, conf). |
HADOOP-4826 | dfs | Introduced new dfsadmin command saveNamespace to command the name service to do an immediate save of the file system image. |
HADOOP-4970 | dfs | Changed trash facility to use absolute path of the deleted file. |
HADOOP-5468 | documentation | Reformatted HTML documentation for Hadoop to use submenus at the left column. |
HADOOP-3497 | fs | Changed the semantics of file globbing with a PathFilter (using the globStatus method of FileSystem). Previously, the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b. With this change /a/b does match. |
HADOOP-4234 | fs | Changed KFS glue layer to allow applications to interface with multiple KFS metaservers. |
HADOOP-4422 | fs/s3 | Modified Hadoop file system to no longer create S3 buckets. Applications can create buckets for their S3 file systems by other means, for example, using the JetS3t API. |
HADOOP-3063 | io | Introduced BloomMapFile subclass of MapFile that creates a Bloom filter from all keys. |
HADOOP-1230 | mapred | Replaced parameters with context objects in Mapper, Reducer, Partitioner, InputFormat, and OutputFormat classes. |
HADOOP-1650 | mapred | Upgraded all core servers to use Jetty 6. |
HADOOP-3923 | mapred | Moved class org.apache.hadoop.mapred.StatusHttpServer to org.apache.hadoop.http.HttpServer. |
HADOOP-3986 | mapred | Removed classes org.apache.hadoop.mapred.JobShell and org.apache.hadoop.mapred.TestJobShell. Removed from JobClient methods static void setCommandLineConfig(Configuration conf) and public static Configuration getCommandLineConfig(). |
HADOOP-4188 | mapred | Removed Task's dependency on concrete file systems by taking list from FileSystem class. Added statistics table to FileSystem class. Deprecated FileSystem method getStatistics(Class<? extends FileSystem> cls). |
HADOOP-4210 | mapred | Changed public class org.apache.hadoop.mapreduce.ID to be an abstract class. Removed from class org.apache.hadoop.mapreduce.ID the methods public static ID read(DataInput in) and public static ID forName(String str). |
HADOOP-4305 | mapred | Improved TaskTracker blacklisting strategy to better exclude faulty tracker from executing tasks. |
HADOOP-4435 | mapred | Changed JobTracker web status page to display the amount of heap memory in use. This changes the JobSubmissionProtocol. |
HADOOP-4565 | mapred | Improved MultiFileInputFormat so that multiple blocks from the same node or same rack can be combined into a single split. |
HADOOP-4749 | mapred | Added a new counter REDUCE_INPUT_BYTES. |
HADOOP-4783 | mapred | Changed history directory permissions to 750 and history file permissions to 740. |
HADOOP-3422 | metrics | Changed names of ganglia metrics to avoid conflicts and to better identify source function. |
HADOOP-4284 | security | Introduced HttpServer method to support global filters. |
HADOOP-4575 | security | Introduced independent HSFTP proxy server for authenticated access to clusters. |
HADOOP-4661 | tools/distcp | Introduced distch tool for parallel ch{mod, own, grp}. |