
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.

  ---
  Offline Image Viewer Guide
  ---
  ---
  ${maven.build.timestamp}

Offline Image Viewer Guide

\[ {{{./index.html}Go Back}} \]

%{toc|section=1|fromDepth=0}

* Overview

   The Offline Image Viewer is a tool to dump the contents of hdfs fsimage
   files to human-readable formats in order to allow offline analysis and
   examination of a Hadoop cluster's namespace. The tool is able to
   process very large image files relatively quickly, converting them to
   one of several output formats. The tool handles the layout formats that
   were included with Hadoop versions 16 and up. If the tool is not able
   to process an image file, it will exit cleanly. The Offline Image
   Viewer does not require a Hadoop cluster to be running; it is entirely
   offline in its operation.

   The Offline Image Viewer provides several output processors:

   [[1]] Ls is the default output processor. It closely mimics the format of
   the lsr command. It includes the same fields, in the same order, as
   lsr: directory or file flag, permissions, replication, owner,
   group, file size, modification date, and full path. Unlike the lsr
   command, the root path is included. One important difference
   between the output of the lsr command and this processor is that this
   output is not sorted by directory name and contents. Rather, the
   files are listed in the order in which they are stored in the
   fsimage file. Therefore, it is not possible to directly compare the
   output of the lsr command and this tool. The Ls processor uses
   information contained within the Inode blocks to calculate file
   sizes and ignores the -skipBlocks option.

   [[2]] Indented provides a more complete view of the fsimage's contents,
   including all of the information included in the image, such as
   image version, generation stamp and inode- and block-specific
   listings. This processor uses indentation to organize the output
   in a hierarchical manner. The format is suitable for easy human
   comprehension.

   [[3]] Delimited provides one file per line consisting of the path,
   replication, modification time, access time, block size, number of
   blocks, file size, namespace quota, diskspace quota, permissions,
   username and group name. If run against an fsimage that does not
   contain any of these fields, the field's column will be included,
   but no data recorded. The default record delimiter is a tab, but
   this may be changed via the -delimiter command line argument. This
   processor is designed to create output that is easily analyzed by
   other tools, such as Apache Pig. See the Analyzing Results
   section for further information on using this processor to analyze
   the contents of fsimage files.

   [[4]] XML creates an XML document of the fsimage and includes all of the
   information within the fsimage, similar to the Indented processor. The
   output of this processor is amenable to automated processing and
   analysis with XML tools. Due to the verbosity of the XML syntax,
   this processor will also generate the largest amount of output.

   [[5]] FileDistribution is the tool for analyzing file sizes in the
   namespace image. In order to run the tool one should define a range
   of integers [0, maxSize] by specifying maxSize and a step. The
   range of integers is divided into segments of size step: [0, s[1],
   ..., s[n-1], maxSize], and the processor calculates how many files
   in the system fall into each segment [s[i-1], s[i]). Note that
   files larger than maxSize always fall into the very last segment.
   The output file is formatted as a tab-separated two-column table,
   Size and NumFiles, where Size represents the start of the segment
   and NumFiles is the number of files from the image whose size falls
   into this segment. A sample invocation is sketched after this list.
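
   For example, a minimal FileDistribution run might look like the
   following. This is a sketch: the -maxSize and -step arguments are
   assumed here and are not listed in the Options table below, so confirm
   them against the usage output of bin/hdfs oiv in your release. The
   values shown define a range of 128 GB divided into 2 MB segments.

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.dist -p FileDistribution -maxSize 137438953472 -step 2097152
----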

* Usage

** Basic

   The simplest usage of the Offline Image Viewer is to provide just an
   input and output file, via the -i and -o command-line switches:

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
----

   This will create a file named fsimage.txt in the current directory
   using the Ls output processor. For very large image files, this process
   may take several minutes.

   One can specify which output processor to use via the command-line
   switch -p. For instance:

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
----

   or

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
----

   This will run the tool using either the XML or Indented output
   processor, respectively.

   One command-line option worth considering is -skipBlocks, which
   prevents the tool from explicitly enumerating all of the blocks that
   make up a file in the namespace. This is useful for file systems that
   have very large files. Enabling this option can significantly decrease
   the size of the resulting output, as individual blocks are not
   included. Note, however, that the Ls processor needs to enumerate the
   blocks and so overrides this option.
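
   For instance, to produce an Indented dump that omits the per-block
   listings, one might combine that processor with -skipBlocks:

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented -skipBlocks
----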

** Example

   Consider the following contrived namespace:

----
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:17 /anotherDir
-rw-r--r--   3 theuser supergroup  286631664 2009-03-16 21:15 /anotherDir/biggerfile
-rw-r--r--   3 theuser supergroup       8754 2009-03-16 21:17 /anotherDir/smallFile
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
drwx-wx-wx   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one/two
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:16 /user
drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:19 /user/theuser
----

   Applying the Offline Image Viewer against this file with default
   options would result in the following output:

----
machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt

drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:16 /
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:17 /anotherDir
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:12 /one
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:16 /user
-rw-r--r--   3 theuser supergroup  286631664 2009-03-16 14:15 /anotherDir/biggerfile
-rw-r--r--   3 theuser supergroup       8754 2009-03-16 14:17 /anotherDir/smallFile
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
drwx-wx-wx   - theuser supergroup          0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:12 /one/two
drwxr-xr-x   - theuser supergroup          0 2009-03-16 14:19 /user/theuser
----

   Similarly, applying the Indented processor would generate output that
   begins with:

----
machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt

FSImage
  ImageVersion = -19
  NamespaceID = 2109123098
  GenerationStamp = 1003
  INodes [NumInodes = 12]
    Inode
      INodePath =
      Replication = 0
      ModificationTime = 2009-03-16 14:16
      AccessTime = 1969-12-31 16:00
      BlockSize = 0
      Blocks [NumBlocks = -1]
      NSQuota = 2147483647
      DSQuota = -1
      Permissions
        Username = theuser
        GroupName = supergroup
        PermString = rwxr-xr-x
...remaining output omitted...
----

* Options

*-----------------------:-----------------------------------+
| <<Flag>>              | <<Description>>                   |
*-----------------------:-----------------------------------+
| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
| | process. Required.
*-----------------------:-----------------------------------+
| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
| | specified output processor generates one. If the specified file already
| | exists, it is silently overwritten. Required.
*-----------------------:-----------------------------------+
| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to apply
| | against the image file. Currently valid options are Ls (default), XML,
| | Delimited, Indented and FileDistribution.
*-----------------------:-----------------------------------+
| <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This may
| | save processing time and output file space on namespaces with very
| | large files. The Ls processor reads the blocks to correctly determine
| | file sizes and ignores this option.
*-----------------------:-----------------------------------+
| <<<-printToScreen>>> | Pipe output of processor to console as well as specified
| | file. On extremely large namespaces, this may increase processing time
| | by an order of magnitude.
*-----------------------:-----------------------------------+
| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited processor,
| | replaces the default tab delimiter with the string specified by arg.
*-----------------------:-----------------------------------+
| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
*-----------------------:-----------------------------------+
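
   As an illustration of the flags above, this hypothetical invocation
   combines the Delimited processor with a comma delimiter to produce a
   CSV-style dump:

----
bash$ bin/hdfs oiv -i fsimage -o fsimage.csv -p Delimited -delimiter ','
----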

* Analyzing Results

   The Offline Image Viewer makes it easy to gather large amounts of data
   about the hdfs namespace. This information can then be used to explore
   file system usage patterns or find specific files that match arbitrary
   criteria, along with other types of namespace analysis. The Delimited
   image processor in particular creates output that is amenable to
   further processing by tools such as Apache Pig. Pig is a
   particularly good choice for analyzing these data as it is able to deal
   with the output generated from a small fsimage but also scales up to
   consume data from extremely large file systems.

   The Delimited image processor generates lines of text separated, by
   default, by tabs and includes all of the fields that are common between
   constructed files and files that were still under construction when the
   fsimage was generated. Example scripts are provided demonstrating how
   to use this output to accomplish three tasks: determine the number of
   files each user has created on the file system, find files that were
   created but have never been accessed, and find probable duplicates of
   large files by comparing the size of each file.

   Each of the following scripts assumes you have generated an output file
   named foo using the Delimited processor and will be storing the results
   of the Pig analysis in a file named results.
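
   Such an input file can be generated as described in the Usage section;
   for instance (foo is simply the filename the scripts below assume):

----
bash$ bin/hdfs oiv -i fsimage -o foo -p Delimited
----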

** Total Number of Files for Each User

   This script processes each path within the namespace, groups them by
   the file owner and determines the total number of files each user owns.

----
numFilesOfEachUser.pig:

-- This script determines the total number of files each user has in
-- the namespace. Its output is of the form:
--   username, totalNumFiles

-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                 replication:int,
                                                 modTime:chararray,
                                                 accessTime:chararray,
                                                 blockSize:long,
                                                 numBlocks:int,
                                                 fileSize:long,
                                                 NamespaceQuota:int,
                                                 DiskspaceQuota:int,
                                                 perms:chararray,
                                                 username:chararray,
                                                 groupname:chararray);

-- Grab just the path and username
B = FOREACH A GENERATE path, username;

-- Generate the sum of the number of paths for each user
C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);

-- Save results
STORE C INTO '$outputFile';
----

   This script can be run against Pig with the following command:

----
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
----

   The output file's content will be similar to that below:

----
bart 1
lisa 16
homer 28
marge 2456
----

** Files That Have Never Been Accessed

   This script finds files that were created but whose access times were
   never changed, meaning they were never opened or viewed.

----
neverAccessed.pig:

-- This script generates a list of files that were created but never
-- accessed, based on their AccessTime

-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                 replication:int,
                                                 modTime:chararray,
                                                 accessTime:chararray,
                                                 blockSize:long,
                                                 numBlocks:int,
                                                 fileSize:long,
                                                 NamespaceQuota:int,
                                                 DiskspaceQuota:int,
                                                 perms:chararray,
                                                 username:chararray,
                                                 groupname:chararray);

-- Grab just the path and last time the file was accessed
B = FOREACH A GENERATE path, accessTime;

-- Drop all the paths that don't have the default assigned last-access time
C = FILTER B BY accessTime == '1969-12-31 16:00';

-- Drop the accessTimes, since they're all the same
D = FOREACH C GENERATE path;

-- Save results
STORE D INTO '$outputFile';
----

   This script can be run against Pig with the following command, and its
   output file's content will be a list of files that were created but
   never viewed afterwards:

----
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
----

** Probable Duplicated Files Based on File Size

   This script groups files together based on their size, drops any that
   are less than 100 MB and returns a list of the file size, number of
   files found and a tuple of the file paths. This can be used to find
   likely duplicates within the filesystem namespace.

----
probableDuplicates.pig:

-- This script finds probable duplicate files greater than 100 MB by
-- grouping together files based on their byte size. Files of this size
-- with exactly the same number of bytes can be considered probable
-- duplicates, but should be checked further, either by comparing the
-- contents directly or by another proxy, such as a hash of the contents.
-- The script's output is of the form:
--   fileSize numProbableDuplicates {(probableDup1), (probableDup2)}

-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                 replication:int,
                                                 modTime:chararray,
                                                 accessTime:chararray,
                                                 blockSize:long,
                                                 numBlocks:int,
                                                 fileSize:long,
                                                 NamespaceQuota:int,
                                                 DiskspaceQuota:int,
                                                 perms:chararray,
                                                 username:chararray,
                                                 groupname:chararray);

-- Grab the pathname and filesize
B = FOREACH A GENERATE path, fileSize;

-- Drop files smaller than 100 MB
C = FILTER B BY fileSize > 100L * 1024L * 1024L;

-- Gather all the files of the same byte size
D = GROUP C BY fileSize;

-- Generate path, num of duplicates, list of duplicates
E = FOREACH D GENERATE group AS fileSize, COUNT(C) AS numDupes, C.path AS files;

-- Drop all the files where there is only one of them
F = FILTER E BY numDupes > 1L;

-- Sort by the size of the files
G = ORDER F BY fileSize;

-- Save results
STORE G INTO '$outputFile';
----

   This script can be run against Pig with the following command:

----
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
----

   The output file's content will be similar to that below:

----
1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
----

   Each line includes the file size in bytes that was found to be
   duplicated, the number of duplicates found, and a list of the
   duplicated paths. Files smaller than 100 MB are ignored; above that
   size, an exact byte-for-byte size match gives a reasonable likelihood
   that the files are duplicates.