11 년 전 · 73e0b8e55a
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -36,6 +36,9 @@ Release 2.5.0 - UNRELEASED
 
				     MAPREDUCE-5812. Make job context available to
			
 
				     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
			
 
				 
			
 
				+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
			
 
				+    jeagles)
			
 
				+
			
 
				   OPTIMIZATIONS
			
 
				 
			
 
				   BUG FIXES 
			
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
@@ -0,0 +1,138 @@
 
				+<!---
			
 
				+  Licensed under the Apache License, Version 2.0 (the "License");
			
 
				+  you may not use this file except in compliance with the License.
			
 
				+  You may obtain a copy of the License at
			
 
				+
			
 
				+    http://www.apache.org/licenses/LICENSE-2.0
			
 
				+
			
 
				+  Unless required by applicable law or agreed to in writing, software
			
 
				+  distributed under the License is distributed on an "AS IS" BASIS,
			
 
				+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
			
 
				+  See the License for the specific language governing permissions and
			
 
				+  limitations under the License. See accompanying LICENSE file.
			
 
				+-->
			
 
				+
			
 
				+#set ( $H3 = '###' )
			
 
				+
			
 
				+Hadoop Archives Guide
			
 
				+=====================
			
 
				+
			
 
				+ - [Overview](#Overview)
			
 
				+ - [How to Create an Archive](#How_to_Create_an_Archive)
			
 
				+ - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
			
 
				+ - [Archives Examples](#Archives_Examples)
			
 
				+     - [Creating an Archive](#Creating_an_Archive)
			
 
				+     - [Looking Up Files](#Looking_Up_Files)
			
 
				+ - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
			
 
				+
			
 
				+Overview
			
 
				+--------
			
 
				+
			
 
				+  Hadoop archives are special format archives. A Hadoop archive maps to a file
			
 
				+  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
			
 
				+  archive directory contains metadata (in the form of _index and _masterindex)
			
 
				+  and data (part-\*) files. The _index file contains the name of the files that
			
 
				+  are part of the archive and the location within the part files.
			
 
				+
			
 
				+How to Create an Archive
			
 
				+------------------------
			
 
				+
			
 
				+  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
			
 
				+
			
 
				+  -archiveName is the name of the archive you would like to create. An example
			
 
				+  would be foo.har. The name should have a \*.har extension. The parent argument
			
 
				+  is to specify the relative path to which the files should be archived to.
			
 
				+  Example would be :
			
 
				+
			
 
				+  `-p /foo/bar a/b/c e/f/g`
			
 
				+
			
 
				+  Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to
			
 
				+  parent. Note that this is a Map/Reduce job that creates the archives. You
			
 
				+  would need a map reduce cluster to run this. For a detailed example the later
			
 
				+  sections.
			
 
				+
			
 
				+  If you just want to archive a single directory /foo/bar then you can just use
			
 
				+
			
 
				+  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
			
 
				+
			
 
				+How to Look Up Files in Archives
			
 
				+--------------------------------
			
 
				+
			
 
				+  The archive exposes itself as a file system layer. So all the fs shell
			
 
				+  commands in the archives work but with a different URI. Also, note that
			
 
				+  archives are immutable. So, rename's, deletes and creates return an error.
			
 
				+  URI for Hadoop Archives is
			
 
				+
			
 
				+  `har://scheme-hostname:port/archivepath/fileinarchive`
			
 
				+
			
 
				+  If no scheme is provided it assumes the underlying filesystem. In that case
			
 
				+  the URI would look like
			
 
				+
			
 
				+  `har:///archivepath/fileinarchive`
			
 
				+
			
 
				+Archives Examples
			
 
				+-----------------
			
 
				+
			
 
				+$H3 Creating an Archive
			
 
				+
			
 
				+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
			
 
				+
			
 
				+  The above example is creating an archive using /user/hadoop as the relative
			
 
				+  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
			
 
				+  will be archived in the following file system directory -- /user/zoo/foo.har.
			
 
				+  Archiving does not delete the input files. If you want to delete the input
			
 
				+  files after creating the archives (to reduce namespace), you will have to do
			
 
				+  it on your own. 
			
 
				+
			
 
				+$H3 Looking Up Files
			
 
				+
			
 
				+  Looking up files in hadoop archives is as easy as doing an ls on the
			
 
				+  filesystem. After you have archived the directories /user/hadoop/dir1 and
			
 
				+  /user/hadoop/dir2 as in the example above, to see all the files in the
			
 
				+  archives you can just run:
			
 
				+
			
 
				+  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
			
 
				+
			
 
				+  To understand the significance of the -p argument, lets go through the above
			
 
				+  example again. If you just do an ls (not lsr) on the hadoop archive using
			
 
				+
			
 
				+  `hdfs dfs -ls har:///user/zoo/foo.har`
			
 
				+
			
 
				+  The output should be:
			
 
				+
			
 
				+```
			
 
				+har:///user/zoo/foo.har/dir1
			
 
				+har:///user/zoo/foo.har/dir2
			
 
				+```
			
 
				+
			
 
				+  As you can recall the archives were created with the following command
			
 
				+
			
 
				+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
			
 
				+
			
 
				+  If we were to change the command to:
			
 
				+
			
 
				+  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
			
 
				+
			
 
				+  then a ls on the hadoop archive using
			
 
				+
			
 
				+  `hdfs dfs -ls har:///user/zoo/foo.har`
			
 
				+
			
 
				+  would give you
			
 
				+
			
 
				+```
			
 
				+har:///user/zoo/foo.har/hadoop/dir1
			
 
				+har:///user/zoo/foo.har/hadoop/dir2
			
 
				+```
			
 
				+
			
 
				+  Notice that the archived files have been archived relative to /user/ rather
			
 
				+  than /user/hadoop.
			
 
				+
			
 
				+Hadoop Archives and MapReduce
			
 
				+-----------------------------
			
 
				+
			
 
				+  Using Hadoop Archives in MapReduce is as easy as specifying a different input
			
 
				+  filesystem than the default file system. If you have a hadoop archive stored
			
 
				+  in HDFS in /user/zoo/foo.har then for using this archive for MapReduce input,
			
 
				+  all you need to specify the input directory as har:///user/zoo/foo.har. Since
			
 
				+  Hadoop Archives is exposed as a file system MapReduce will be able to use all
			
 
				+  the logical input files in Hadoop Archives as input.
			
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -92,6 +92,7 @@
 
				       <item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
			
 
				       <item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
			
 
				       <item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
			
 
				+      <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
			
 
				       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
			
 
				     </menu>