<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

#set ( $H3 = '###' )

Hadoop Archives Guide
=====================

- [Overview](#Overview)
- [How to Create an Archive](#How_to_Create_an_Archive)
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- [How to Unarchive an Archive](#How_to_Unarchive_an_Archive)
- [Archives Examples](#Archives_Examples)
    - [Creating an Archive](#Creating_an_Archive)
    - [Looking Up Files](#Looking_Up_Files)
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)

Overview
--------

Hadoop archives are special format archives. A Hadoop archive maps to a file
system directory. A Hadoop archive always has a \*.har extension. A Hadoop
archive directory contains metadata (in the form of _index and _masterindex)
and data (part-\*) files. The _index file contains the names of the files
that are part of the archive and their locations within the part files.
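
To make that layout concrete, here is an illustrative listing of an
archive's underlying directory, viewed through the plain file system rather
than the har:// scheme. The archive name and the single part file are
hypothetical; a larger archive may contain several part files:

`hdfs dfs -ls /user/zoo/foo.har`

```
/user/zoo/foo.har/_index
/user/zoo/foo.har/_masterindex
/user/zoo/foo.har/part-0
```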

How to Create an Archive
------------------------

`Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`

-archiveName is the name of the archive you would like to create, for
example foo.har. The name should have a \*.har extension. The -p (parent)
argument specifies the relative path to which the files should be archived.
For example:

`-p /foo/bar a/b/c e/f/g`

Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to the
parent. Note that it is a MapReduce job that creates the archives, so you
need a MapReduce cluster to run this. For a detailed example, see the later
sections.
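
Putting the pieces together, the complete command for this example would
look like the following (the archive name foo.har and the destination
/outputdir are illustrative):

`hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir`

This archives /foo/bar/a/b/c and /foo/bar/e/f/g into /outputdir/foo.har.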

If you just want to archive a single directory /foo/bar, you can use

`hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`

How to Look Up Files in Archives
--------------------------------

The archive exposes itself as a file system layer, so all the fs shell
commands work on archives, but with a different URI. Also, note that
archives are immutable, so renames, deletes, and creates return an error.

The URI for a Hadoop archive is

`har://scheme-hostname:port/archivepath/fileinarchive`

If no scheme is provided, the underlying file system is assumed. In that
case the URI would look like

`har:///archivepath/fileinarchive`
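
For instance, to read a single file out of an archive, either URI form
works. The archive path, file name, NameNode host, and port below are
hypothetical:

`hdfs dfs -cat har:///user/zoo/foo.har/dir1/a.txt`

`hdfs dfs -cat har://hdfs-namenode:8020/user/zoo/foo.har/dir1/a.txt`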

How to Unarchive an Archive
---------------------------

Since all the fs shell commands work transparently on archives, unarchiving
is just a matter of copying.

To unarchive sequentially:

`hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`

To unarchive in parallel, use DistCp:

`hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`
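
After either copy, you can sanity-check the result by recursively listing
the destination (directory names as in the example above):

`hdfs dfs -ls -R hdfs:/user/zoo/newdir`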

Archives Examples
-----------------

$H3 Creating an Archive

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

The above example creates an archive using /user/hadoop as the relative
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
will be archived in the following file system directory -- /user/zoo/foo.har.
Archiving does not delete the input files. If you want to delete the input
files after creating the archives (to reduce namespace usage), you will have
to do it on your own.
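
For example, once you have verified the archive's contents, the inputs from
the command above could be removed with an ordinary shell command (shown
here only as an illustration; make sure the archive is complete before
deleting anything):

`hdfs dfs -rm -r /user/hadoop/dir1 /user/hadoop/dir2`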

$H3 Looking Up Files

Looking up files in hadoop archives is as easy as doing an ls on the
filesystem. After you have archived the directories /user/hadoop/dir1 and
/user/hadoop/dir2 as in the example above, to see all the files in the
archives you can just run:

`hdfs dfs -ls -R har:///user/zoo/foo.har/`

To understand the significance of the -p argument, let's go through the
above example again. If you do a plain ls (not a recursive ls -R) on the
hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

the output should be:

```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```

As you can recall, the archives were created with the following command:

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

If we were to change the command to:

`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`

then an ls on the hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

would give you

```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```

Notice that the archived files have been archived relative to /user/ rather
than /user/hadoop.

Hadoop Archives and MapReduce
-----------------------------

Using Hadoop Archives in MapReduce is as easy as specifying a different
input file system than the default file system. If you have a hadoop archive
stored in HDFS in /user/zoo/foo.har, then to use this archive for MapReduce
input, all you need to do is specify the input directory as
har:///user/zoo/foo.har. Since a Hadoop archive is exposed as a file system,
MapReduce will be able to use all the logical input files in the archive as
input.
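
As a quick sketch, the stock wordcount example can be run directly against
the archive. The examples jar path and the output directory below are
hypothetical and depend on your installation:

`hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out`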