- ~~ Licensed under the Apache License, Version 2.0 (the "License");
- ~~ you may not use this file except in compliance with the License.
- ~~ You may obtain a copy of the License at
- ~~
- ~~ http://www.apache.org/licenses/LICENSE-2.0
- ~~
- ~~ Unless required by applicable law or agreed to in writing, software
- ~~ distributed under the License is distributed on an "AS IS" BASIS,
- ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~~ See the License for the specific language governing permissions and
- ~~ limitations under the License. See accompanying LICENSE file.
- ---
- Offline Image Viewer Guide
- ---
- ---
- ${maven.build.timestamp}
- Offline Image Viewer Guide
- %{toc|section=1|fromDepth=0}
- * Overview
- The Offline Image Viewer is a tool to dump the contents of HDFS fsimage
- files to a human-readable format and to provide a read-only WebHDFS API
- in order to allow offline analysis and examination of a Hadoop cluster's
- namespace. The tool is able to process very large image files relatively
- quickly. The tool handles the layout formats that were included with Hadoop
- versions 2.4 and up. If you want to handle older layout formats, you can
- use the Offline Image Viewer of Hadoop 2.3 or {{oiv_legacy Command}}.
- If the tool is not able to process an image file, it will exit cleanly.
- The Offline Image Viewer does not require a Hadoop cluster to be running;
- it is entirely offline in its operation.
- The Offline Image Viewer provides several output processors:
- [[1]] Web is the default output processor. It launches an HTTP server
- that exposes a read-only WebHDFS API. Users can investigate the namespace
- interactively by using the HTTP REST API.
- [[2]] XML creates an XML document of the fsimage and includes all of the
- information within the fsimage, similar to the lsr processor. The
- output of this processor is amenable to automated processing and
- analysis with XML tools. Due to the verbosity of the XML syntax,
- this processor will also generate the largest amount of output.
- [[3]] FileDistribution is the tool for analyzing file sizes in the
- namespace image. In order to run the tool one should define a range
- of integers [0, maxSize] by specifying maxSize and a step. The
- range of integers is divided into segments of size step: [0, s[1],
- ..., s[n-1], maxSize], and the processor calculates how many files
- in the system fall into each segment [s[i-1], s[i]). Note that
- files larger than maxSize always fall into the very last segment.
- The output file is formatted as a tab-separated, two-column table:
- Size and NumFiles, where Size represents the start of the segment
- and NumFiles is the number of files from the image whose size falls
- in this segment.
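The segment counting described above can be sketched in Python. This is a minimal illustration of the description in this guide, not the Hadoop implementation: the function name <<<file_distribution>>> and the sample sizes are hypothetical, and placing a file of exactly maxSize bytes in the last segment is an assumption.

```python
from collections import Counter


def file_distribution(sizes, max_size, step):
    """Count files per segment [k*step, (k+1)*step), following the guide's description."""
    if max_size <= 0 or step <= 0:
        raise ValueError("max_size and step must be positive")
    # Index of the last segment; the range [0, max_size] is cut into pieces of size step.
    last = (max_size + step - 1) // step - 1
    counts = Counter()
    for size in sizes:
        # Assumption: sizes at or above max_size are counted in the last segment.
        index = last if size >= max_size else size // step
        counts[index] += 1
    # Each row is (segment start, number of files), like the Size and NumFiles columns.
    return [(k * step, counts.get(k, 0)) for k in range(last + 1)]


# Hypothetical file sizes in bytes, chosen only for illustration.
rows = file_distribution([0, 10, 150, 250, 999], max_size=400, step=100)
for start, num_files in rows:
    print(f"{start}\t{num_files}")
```

Each printed row pairs a segment start with its file count, separated by a tab, which mirrors the two-column layout described above.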
- * Usage
- ** Web Processor
- The Web processor launches an HTTP server which exposes a read-only WebHDFS API.
- Users can specify the address to listen on with the -addr option (the default is
- localhost:5978).
- ----
- bash$ bin/hdfs oiv -i fsimage
- 14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
- started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
- ----
- Users can access the viewer and get the information of the fsimage by
- the following shell command:
- ----
- bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
- Found 2 items
- drwxrwx--- - root supergroup 0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
- drwxr-xr-x - root supergroup 0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user
- ----
- To get the information of all the files and directories, you can simply use
- the following command:
- ----
- bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/
- ----
- Users can also get JSON-formatted FileStatuses via the HTTP REST API.
- ----
- bash$ curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
- HTTP/1.1 200 OK
- Content-Type: application/json
- Content-Length: 252
- {"FileStatuses":{"FileStatus":[
- {"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
- ]}}
- ----
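A response like the one above can be processed with any JSON library. The following Python sketch parses the sample body shown in this guide; the helper name <<<list_entries>>> is hypothetical, and fetching the body from a running viewer (for example with <<<urllib.request.urlopen>>>) is left out so that the example runs without a server.

```python
import json

# Response body copied from the LISTSTATUS example above.
SAMPLE = '''{"FileStatuses":{"FileStatus":[
{"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
]}}'''


def list_entries(body):
    """Return (type, permission, owner, group, pathSuffix) for each FileStatus."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [
        (s["type"], s["permission"], s["owner"], s["group"], s["pathSuffix"])
        for s in statuses
    ]


entries = list_entries(SAMPLE)
for entry in entries:
    print(*entry)
```

Only the fields that appear in the sample response are read, so the sketch makes no assumption about other keys.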
- The Web processor now supports the following operations:
- * {{{./WebHDFS.html#List_a_Directory}LISTSTATUS}}
- * {{{./WebHDFS.html#Status_of_a_FileDirectory}GETFILESTATUS}}
- * {{{./WebHDFS.html#Get_ACL_Status}GETACLSTATUS}}
- ** XML Processor
- The XML processor is used to dump all the contents of the fsimage. Users can
- specify the input and output files via the -i and -o command-line options.
- ----
- bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml
- ----
- This will create a file named fsimage.xml that contains all the information in
- the fsimage. For very large image files, this process may take several
- minutes.
- Applying the Offline Image Viewer with the XML processor results in
- output like the following:
- ----
- <?xml version="1.0"?>
- <fsimage>
- <NameSection>
- <genstampV1>1000</genstampV1>
- <genstampV2>1002</genstampV2>
- <genstampV1Limit>0</genstampV1Limit>
- <lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
- <txid>37</txid>
- </NameSection>
- <INodeSection>
- <lastInodeId>16400</lastInodeId>
- <inode>
- <id>16385</id>
- <type>DIRECTORY</type>
- <name></name>
- <mtime>1392772497282</mtime>
- <permission>theuser:supergroup:rwxr-xr-x</permission>
- <nsquota>9223372036854775807</nsquota>
- <dsquota>-1</dsquota>
- </inode>
- ...remaining output omitted...
- ----
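Because the output is plain XML, it can be analyzed with standard XML tools. The sketch below uses Python's <<<xml.etree.ElementTree>>> on a shortened, well-formed copy of the sample above; the helper name <<<summarize_inodes>>> is hypothetical, and the owner:group:mode split of the permission element is inferred from the sample rather than from a format specification.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sample output above (one inode only).
SAMPLE = """<?xml version="1.0"?>
<fsimage>
<INodeSection>
<lastInodeId>16400</lastInodeId>
<inode>
<id>16385</id>
<type>DIRECTORY</type>
<name></name>
<mtime>1392772497282</mtime>
<permission>theuser:supergroup:rwxr-xr-x</permission>
<nsquota>9223372036854775807</nsquota>
<dsquota>-1</dsquota>
</inode>
</INodeSection>
</fsimage>
"""


def summarize_inodes(xml_text):
    """Collect id, type, name and permission parts for every inode element."""
    root = ET.fromstring(xml_text)
    result = []
    for inode in root.iter("inode"):
        # Assumption: permission is formatted as owner:group:mode, as in the sample.
        owner, group, mode = inode.findtext("permission").split(":")
        result.append({
            "id": int(inode.findtext("id")),
            "type": inode.findtext("type"),
            "name": inode.findtext("name") or "",
            "owner": owner,
            "group": group,
            "mode": mode,
        })
    return result


inodes = summarize_inodes(SAMPLE)
for inode in inodes:
    print(inode)
```

This version loads the whole document into memory. For a very large fsimage.xml, a streaming parser such as <<<ET.iterparse>>> is likely a better fit.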
- * Options
- *-----------------------:-----------------------------------+
- | <<Flag>> | <<Description>> |
- *-----------------------:-----------------------------------+
- | <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file
- | | to process. Required.
- *-----------------------:-----------------------------------+
- | <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename,
- | | if the specified output processor generates one. If
- | | the specified file already exists, it is silently
- | | overwritten. (output to stdout by default)
- *-----------------------:-----------------------------------+
- | <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
- | | apply against the image file. Currently valid options
- | | are Web (default), XML and FileDistribution.
- *-----------------------:-----------------------------------+
- | <<<-addr>>> <address> | Specify the address (host:port) to listen on
- | | (localhost:5978 by default). This option is used with
- | | the Web processor.
- *-----------------------:-----------------------------------+
- | <<<-maxSize>>> <size> | Specify the range [0, maxSize] of file sizes to be
- | | analyzed in bytes (128GB by default). This option is
- | | used with FileDistribution processor.
- *-----------------------:-----------------------------------+
- | <<<-step>>> <size> | Specify the granularity of the distribution in bytes
- | | (2MB by default). This option is used with
- | | FileDistribution processor.
- *-----------------------:-----------------------------------+
- | <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and
- | | exit.
- *-----------------------:-----------------------------------+
- * Analyzing Results
- The Offline Image Viewer makes it easy to gather large amounts of data
- about the HDFS namespace. This information can then be used to explore
- file system usage patterns or find specific files that match arbitrary
- criteria, along with other types of namespace analysis.
- * oiv_legacy Command
- Due to the internal layout changes introduced by the ProtocolBuffer-based
- fsimage ({{{https://issues.apache.org/jira/browse/HDFS-5698}HDFS-5698}}),
- OfflineImageViewer consumes an excessive amount of memory and has lost some
- functions, such as the Indented and Delimited processors. If you want to process
- an image without a large amount of memory, or to use these processors, you can use
- the <<<oiv_legacy>>> command (the same as <<<oiv>>> in Hadoop 2.3).
- ** Usage
- 1. Set <<<dfs.namenode.legacy-oiv-image.dir>>> to an appropriate directory
- to make the standby NameNode or SecondaryNameNode save its namespace in the
- old fsimage format during checkpointing.
- 2. Run the <<<oiv_legacy>>> command against the old-format fsimage.
- ----
- bash$ bin/hdfs oiv_legacy -i fsimage_old -o output
- ----
- ** Options
- *-----------------------:-----------------------------------+
- | <<Flag>> | <<Description>> |
- *-----------------------:-----------------------------------+
- | <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
- | | process. Required.
- *-----------------------:-----------------------------------+
- | <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if
- | | the specified output processor generates one. If the
- | | specified file already exists, it is silently
- | | overwritten. Required.
- *-----------------------:-----------------------------------+
- | <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
- | | apply against the image file. Valid options are
- | | Ls (default), XML, Delimited, Indented, and
- | | FileDistribution.
- *-----------------------:-----------------------------------+
- | <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This
- | | may save processing time and output file space on
- | | namespaces with very large files. The Ls processor
- | | reads the blocks to correctly determine file sizes
- | | and ignores this option.
- *-----------------------:-----------------------------------+
- | <<<-printToScreen>>> | Pipe output of processor to console as well as
- | | specified file. On extremely large namespaces, this
- | | may increase processing time by an order of
- | | magnitude.
- *-----------------------:-----------------------------------+
- | <<<-delimiter>>> <arg>| When used in conjunction with the Delimited
- | | processor, replaces the default tab delimiter with
- | | the string specified by <arg>.
- *-----------------------:-----------------------------------+
- | <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
- *-----------------------:-----------------------------------+
|