
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
#set ( $H3 = '###' )

Hadoop Archives Guide
=====================

- [Overview](#Overview)
- [How to Create an Archive](#How_to_Create_an_Archive)
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- [How to Unarchive an Archive](#How_to_Unarchive_an_Archive)
- [Archives Examples](#Archives_Examples)
    - [Creating an Archive](#Creating_an_Archive)
    - [Looking Up Files](#Looking_Up_Files)
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)

Overview
--------

Hadoop archives are special format archives. A Hadoop archive maps to a file
system directory. A Hadoop archive always has a \*.har extension. A Hadoop
archive directory contains metadata (in the form of _index and _masterindex)
and data (part-\*) files. The _index file contains the names of the files that
are part of the archive and their locations within the part files.
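
For instance, listing the archive directory itself on the underlying
filesystem would show this internal layout (a minimal sketch, assuming a
hypothetical archive at /user/zoo/foo.har; the real ls output also includes
permissions and sizes):

```
# Listing the .har directory directly (not via har://) exposes the
# archive's internal metadata and data files.
hdfs dfs -ls /user/zoo/foo.har
# Expected entries (simplified):
#   /user/zoo/foo.har/_index
#   /user/zoo/foo.har/_masterindex
#   /user/zoo/foo.har/part-0
```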

How to Create an Archive
------------------------

`Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`

-archiveName is the name of the archive you would like to create. An example
would be foo.har. The name should have a \*.har extension. The parent argument
specifies the relative path to which the files should be archived.
An example would be:

`-p /foo/bar a/b/c e/f/g`

Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to the
parent. Note that it is a Map/Reduce job that creates the archives, so you
need a MapReduce cluster to run it. For a detailed example, see the later
sections. A complete invocation combining these pieces is sketched below.
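
Putting those pieces together, a complete create command might look like the
following (a sketch built from the usage line above; /foo/bar, the relative
paths a/b/c and e/f/g, the archive name, and /outputdir are all hypothetical):

```
# Archive /foo/bar/a/b/c and /foo/bar/e/f/g, stored relative to the
# parent /foo/bar, into /outputdir/myarchive.har.
# This launches a MapReduce job to write the archive.
hadoop archive -archiveName myarchive.har -p /foo/bar a/b/c e/f/g /outputdir
```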

If you just want to archive a single directory /foo/bar then you can just use

`hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`

How to Look Up Files in Archives
--------------------------------

The archive exposes itself as a file system layer. So all of the fs shell
commands work on archives, but with a different URI. Also, note that
archives are immutable, so renames, deletes, and creates return an error.
The URI for Hadoop Archives is

`har://scheme-hostname:port/archivepath/fileinarchive`

If no scheme is provided it assumes the underlying filesystem. In that case
the URI would look like

`har:///archivepath/fileinarchive`
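
For example, both URI forms below address the same file inside a hypothetical
archive on HDFS (the host nn.example.com, port 8020, and file name are
assumptions for illustration):

```
# Explicit scheme: the underlying filesystem is hdfs on nn.example.com:8020,
# spelled as scheme-hostname per the URI format above.
hdfs dfs -cat har://hdfs-nn.example.com:8020/user/zoo/foo.har/dir1/file.txt

# No scheme: defaults to the cluster's configured underlying filesystem.
hdfs dfs -cat har:///user/zoo/foo.har/dir1/file.txt
```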

How to Unarchive an Archive
---------------------------

Since all the fs shell commands in the archives work transparently,
unarchiving is just a matter of copying.

To unarchive sequentially:

`hdfs dfs -cp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`

To unarchive in parallel, use DistCp:

`hadoop distcp har:///user/zoo/foo.har/dir1 hdfs:/user/zoo/newdir`

Archives Examples
-----------------

$H3 Creating an Archive

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

The above example creates an archive using /user/hadoop as the relative
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
will be archived in the following file system directory -- /user/zoo/foo.har.
Archiving does not delete the input files. If you want to delete the input
files after creating the archives (to reduce namespace), you will have to do
it on your own, for example as sketched below.
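
Once you have verified the archive contents, the original input directories
could be reclaimed with a plain fs shell delete (a sketch using the paths
from the example above; the archive tool itself never performs this step):

```
# Remove the now-archived input directories to reclaim namespace.
# Verify the archive first; the delete goes through the trash unless
# -skipTrash is added.
hdfs dfs -rm -r /user/hadoop/dir1 /user/hadoop/dir2
```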

$H3 Looking Up Files

Looking up files in hadoop archives is as easy as doing an ls on the
filesystem. After you have archived the directories /user/hadoop/dir1 and
/user/hadoop/dir2 as in the example above, to see all the files in the
archives you can just run:

`hdfs dfs -ls -R har:///user/zoo/foo.har/`

To understand the significance of the -p argument, let's go through the above
example again. If you just do a plain ls (not a recursive one) on the hadoop
archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

the output should be:

```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```

As you can recall, the archive was created with the following command:

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

If we were to change the command to:

`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`

then an ls on the hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

would give you

```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```

Notice that the archived files have been archived relative to /user/ rather
than /user/hadoop.

Hadoop Archives and MapReduce
-----------------------------

Using Hadoop Archives in MapReduce is as easy as specifying a different input
filesystem than the default file system. If you have a hadoop archive stored
in HDFS in /user/zoo/foo.har, then to use this archive for MapReduce input,
all you need to do is specify the input directory as har:///user/zoo/foo.har.
Since a Hadoop Archive is exposed as a file system, MapReduce will be able to
use all the logical input files in Hadoop Archives as input.
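
As an illustration, the bundled wordcount example can be pointed directly at
the archive (a sketch; the examples jar path and the output directory are
assumptions that depend on your installation and Hadoop version):

```
# Run the stock wordcount job over the logical files inside the archive.
# The jar location varies by distribution; adjust to match your install.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out
```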