- ~~ Licensed under the Apache License, Version 2.0 (the "License");
- ~~ you may not use this file except in compliance with the License.
- ~~ You may obtain a copy of the License at
- ~~
- ~~ http://www.apache.org/licenses/LICENSE-2.0
- ~~
- ~~ Unless required by applicable law or agreed to in writing, software
- ~~ distributed under the License is distributed on an "AS IS" BASIS,
- ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~~ See the License for the specific language governing permissions and
- ~~ limitations under the License. See accompanying LICENSE file.
- ---
- Offline Image Viewer Guide
- ---
- ---
- ${maven.build.timestamp}
- Offline Image Viewer Guide
- %{toc|section=1|fromDepth=0}
- * Overview
- The Offline Image Viewer is a tool to dump the contents of hdfs fsimage
- files to human-readable formats, allowing offline analysis and
- examination of a Hadoop cluster's namespace. The tool is able to
- process very large image files relatively quickly, converting them to
- one of several output formats. The tool handles the layout formats that
- were included with Hadoop versions 16 and up. If the tool is not able
- to process an image file, it will exit cleanly. The Offline Image
- Viewer does not require a Hadoop cluster to be running; it is entirely
- offline in its operation.
- The Offline Image Viewer provides several output processors:
- [[1]] Ls is the default output processor. It closely mimics the format of
- the lsr command. It includes the same fields, in the same order, as
- lsr : directory or file flag, permissions, replication, owner,
- group, file size, modification date, and full path. Unlike the lsr
- command, the root path is included. One important difference
- between the output of the lsr command and this processor is that this
- output is not sorted by directory name and contents. Rather, the
- files are listed in the order in which they are stored in the
- fsimage file. Therefore, it is not possible to directly compare the
- output of the lsr command with that of this tool. The Ls processor uses
- information contained within the Inode blocks to calculate file
- sizes and ignores the -skipBlocks option.
- [[2]] Indented provides a more complete view of the fsimage's contents,
- including all of the information included in the image, such as
- image version, generation stamp and inode- and block-specific
- listings. This processor uses indentation to organize the output
- in a hierarchical manner. This format is suitable for easy human
- comprehension.
- [[3]] Delimited provides one file per line consisting of the path,
- replication, modification time, access time, block size, number of
- blocks, file size, namespace quota, diskspace quota, permissions,
- username and group name. If run against an fsimage that does not
- contain any of these fields, the field's column will be included,
- but no data recorded. The default record delimiter is a tab, but
- this may be changed via the -delimiter command line argument. This
- processor is designed to create output that is easily analyzed by
- other tools, such as {{{http://pig.apache.org}Apache Pig}}. See
- the {{Analyzing Results}} section for further information on using
- this processor to analyze the contents of fsimage files.
- [[4]] XML creates an XML document of the fsimage and includes all of the
- information within the fsimage, similar to the Indented processor. The
- output of this processor is amenable to automated processing and
- analysis with XML tools. Due to the verbosity of the XML syntax,
- this processor will also generate the largest amount of output.
- [[5]] FileDistribution is the tool for analyzing file sizes in the
- namespace image. In order to run the tool one should define a range
- of integers [0, maxSize] by specifying maxSize and a step. The
- range of integers is divided into segments of size step: [0, s[1],
- ..., s[n-1], maxSize], and the processor calculates how many files
- in the system fall into each segment [s[i-1], s[i]). Note that
- files larger than maxSize always fall into the very last segment.
- The output file is formatted as a tab-separated two-column table:
- Size and NumFiles, where Size represents the start of the segment,
- and NumFiles is the number of files from the image whose size falls
- into this segment.
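- The segment arithmetic described above can be sketched in a few lines of
- Python. This is a hypothetical illustration of the binning, not the
- tool's actual implementation:

```python
# Sketch of the FileDistribution binning described above (illustrative,
# not the hdfs oiv implementation). The range [0, maxSize] is divided
# into segments of size `step`; files larger than maxSize always fall
# into the very last segment.

def file_distribution(file_sizes, max_size, step):
    # One bucket per segment start: 0, step, 2*step, ..., max_size.
    num_segments = max_size // step + 1
    counts = [0] * num_segments
    for size in file_sizes:
        counts[min(size // step, num_segments - 1)] += 1
    # Rows mirror the tool's two-column Size / NumFiles output.
    return [(i * step, n) for i, n in enumerate(counts)]
```

- For example, with maxSize 100 and step 25, a file of size 120 is counted
- in the final segment, which starts at 100.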
- * Usage
- ** Basic
- The simplest usage of the Offline Image Viewer is to provide just an
- input and output file, via the -i and -o command-line switches:
- ----
- bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
- ----
- This will create a file named fsimage.txt in the current directory
- using the Ls output processor. For very large image files, this process
- may take several minutes.
- One can specify which output processor to use via the command-line switch -p.
- For instance:
- ----
- bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
- ----
- or
- ----
- bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
- ----
- This will run the tool using either the XML or Indented output
- processor, respectively.
- One command-line option worth considering is -skipBlocks, which
- prevents the tool from explicitly enumerating all of the blocks that
- make up a file in the namespace. This is useful for file systems that
- have very large files. Enabling this option can significantly decrease
- the size of the resulting output, as individual blocks are not
- included. Note, however, that the Ls processor needs to enumerate the
- blocks and so overrides this option.
- ** Example
- Consider the following contrived namespace:
- ----
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:17 /anotherDir
- -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 21:15 /anotherDir/biggerfile
- -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 21:17 /anotherDir/smallFile
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
- drwx-wx-wx - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one/two
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:16 /user
- drwxr-xr-x - theuser supergroup 0 2009-03-16 21:19 /user/theuser
- ----
- Applying the Offline Image Viewer against this file with default
- options would result in the following output:
- ----
- machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:17 /anotherDir
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /user
- -rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 14:15 /anotherDir/biggerfile
- -rw-r--r-- 3 theuser supergroup 8754 2009-03-16 14:17 /anotherDir/smallFile
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
- drwx-wx-wx - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one/two
- drwxr-xr-x - theuser supergroup 0 2009-03-16 14:19 /user/theuser
- ----
- Similarly, applying the Indented processor would generate output that
- begins with:
- ----
- machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
- FSImage
- ImageVersion = -19
- NamespaceID = 2109123098
- GenerationStamp = 1003
- INodes [NumInodes = 12]
- Inode
- INodePath =
- Replication = 0
- ModificationTime = 2009-03-16 14:16
- AccessTime = 1969-12-31 16:00
- BlockSize = 0
- Blocks [NumBlocks = -1]
- NSQuota = 2147483647
- DSQuota = -1
- Permissions
- Username = theuser
- GroupName = supergroup
- PermString = rwxr-xr-x
- ...remaining output omitted...
- ----
- * Options
- *-----------------------:-----------------------------------+
- | <<Flag>> | <<Description>> |
- *-----------------------:-----------------------------------+
- | <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
- | | process. Required.
- *-----------------------:-----------------------------------+
- | <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
- | | specified output processor generates one. If the specified file already
- | | exists, it is silently overwritten. Required.
- *-----------------------:-----------------------------------+
- | <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to apply
- | | against the image file. Currently valid options are Ls (default), XML,
- | | Delimited, Indented, and FileDistribution.
- *-----------------------:-----------------------------------+
- | <<<-skipBlocks>>> | Do not enumerate individual blocks within files. This may
- | | save processing time and output file space on namespaces with very
- | | large files. The Ls processor reads the blocks to correctly determine
- | | file sizes and ignores this option.
- *-----------------------:-----------------------------------+
- | <<<-printToScreen>>> | Pipe output of processor to console as well as specified
- | | file. On extremely large namespaces, this may increase processing time
- | | by an order of magnitude.
- *-----------------------:-----------------------------------+
- | <<<-delimiter>>> <arg>| When used in conjunction with the Delimited processor,
- | | replaces the default tab delimiter with the string specified by arg.
- *-----------------------:-----------------------------------+
- | <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
- *-----------------------:-----------------------------------+
- * Analyzing Results
- The Offline Image Viewer makes it easy to gather large amounts of data
- about the hdfs namespace. This information can then be used to explore
- file system usage patterns or find specific files that match arbitrary
- criteria, along with other types of namespace analysis. The Delimited
- image processor in particular creates output that is amenable to
- further processing by tools such as {{{http://pig.apache.org}Apache
- Pig}}. Pig is a particularly good choice for analyzing these data, as
- it can handle the output generated from a small fsimage but also scales
- up to consume data from extremely large file systems.
- The Delimited image processor generates lines of text separated, by
- default, by tabs and includes all of the fields that are common between
- constructed files and files that were still under construction when the
- fsimage was generated. Example scripts are provided demonstrating how
- to use this output to accomplish three tasks: determine the number of
- files each user has created on the file system, find files that were
- created but never accessed, and find probable duplicates of large files
- by comparing the size of each file.
- Each of the following scripts assumes you have generated an output file
- using the Delimited processor named foo and will be storing the results
- of the Pig analysis in a file named results.
- ** Total Number of Files for Each User
- This script processes each path within the namespace, groups them by
- the file owner and determines the total number of files each user owns.
- ----
- numFilesOfEachUser.pig:
- -- This script determines the total number of files each user has in
- -- the namespace. Its output is of the form:
- -- username, totalNumFiles
- -- Load all of the fields from the file
- A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
- replication:int,
- modTime:chararray,
- accessTime:chararray,
- blockSize:long,
- numBlocks:int,
- fileSize:long,
- NamespaceQuota:int,
- DiskspaceQuota:int,
- perms:chararray,
- username:chararray,
- groupname:chararray);
- -- Grab just the path and username
- B = FOREACH A GENERATE path, username;
- -- Generate the sum of the number of paths for each user
- C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);
- -- Save results
- STORE C INTO '$outputFile';
- ----
- This script can be run with Pig using the following command:
- ----
- bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
- ----
- The output file's content will be similar to that below:
- ----
- bart 1
- lisa 16
- homer 28
- marge 2456
- ----
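- For a quick check without a Pig installation, the same per-user count
- can be sketched in plain Python against the Delimited output. The column
- positions follow the Delimited field order given earlier (path is the
- first field, username the eleventh); this helper is illustrative, not
- part of the tool:

```python
import csv
from collections import Counter

# Illustrative Python equivalent of numFilesOfEachUser.pig: count how
# many paths in the Delimited (tab-separated) output belong to each
# username (column index 10 in the field order described above).

def files_per_user(lines):
    counts = Counter()
    for row in csv.reader(lines, delimiter='\t'):
        counts[row[10]] += 1
    return dict(counts)
```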
- ** Files That Have Never Been Accessed
- This script finds files that were created but whose access times were
- never changed, meaning they were never opened or viewed.
- ----
- neverAccessed.pig:
- -- This script generates a list of files that were created but never
- -- accessed, based on their AccessTime
- -- Load all of the fields from the file
- A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
- replication:int,
- modTime:chararray,
- accessTime:chararray,
- blockSize:long,
- numBlocks:int,
- fileSize:long,
- NamespaceQuota:int,
- DiskspaceQuota:int,
- perms:chararray,
- username:chararray,
- groupname:chararray);
- -- Grab just the path and last time the file was accessed
- B = FOREACH A GENERATE path, accessTime;
- -- Drop all the paths that don't have the default assigned last-access time
- C = FILTER B BY accessTime == '1969-12-31 16:00';
- -- Drop the accessTimes, since they're all the same
- D = FOREACH C GENERATE path;
- -- Save results
- STORE D INTO '$outputFile';
- ----
- This script can be run with Pig using the following command; its
- output file's content will be a list of files that were created but
- never viewed afterwards.
- ----
- bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
- ----
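- The same filter can be sketched in plain Python: keep every path whose
- access-time field (the fourth Delimited column) still holds the default
- epoch timestamp used in the Pig script. Illustrative only, not part of
- the tool:

```python
# Illustrative Python equivalent of neverAccessed.pig: report paths
# whose access time is still the default assigned at creation.

DEFAULT_ACCESS_TIME = '1969-12-31 16:00'

def never_accessed(lines):
    paths = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # path is field 0, access time is field 3
        if fields[3] == DEFAULT_ACCESS_TIME:
            paths.append(fields[0])
    return paths
```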
- ** Probable Duplicated Files Based on File Size
- This script groups files together based on their size, drops any that
- are less than 100 MB, and returns a list of the file size, number of
- files found, and a tuple of the file paths. This can be used to find
- likely duplicates within the filesystem namespace.
- ----
- probableDuplicates.pig:
- -- This script finds probable duplicate files greater than 100 MB by
- -- grouping together files based on their byte size. Files of this size
- -- with exactly the same number of bytes can be considered probable
- -- duplicates, but should be checked further, either by comparing the
- -- contents directly or by another proxy, such as a hash of the contents.
- -- The script's output is of the form:
- -- fileSize numProbableDuplicates {(probableDup1), (probableDup2)}
- -- Load all of the fields from the file
- A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
- replication:int,
- modTime:chararray,
- accessTime:chararray,
- blockSize:long,
- numBlocks:int,
- fileSize:long,
- NamespaceQuota:int,
- DiskspaceQuota:int,
- perms:chararray,
- username:chararray,
- groupname:chararray);
- -- Grab the pathname and filesize
- B = FOREACH A generate path, fileSize;
- -- Drop files smaller than 100 MB
- C = FILTER B by fileSize > 100L * 1024L * 1024L;
- -- Gather all the files of the same byte size
- D = GROUP C by fileSize;
- -- Generate path, num of duplicates, list of duplicates
- E = FOREACH D generate group AS fileSize, COUNT(C) as numDupes, C.path AS files;
- -- Drop all the files where there are only one of them
- F = FILTER E by numDupes > 1L;
- -- Sort by the size of the files
- G = ORDER F by fileSize;
- -- Save results
- STORE G INTO '$outputFile';
- ----
- This script can be run with Pig using the following command:
- ----
- bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
- ----
- The output file's content will be similar to that below:
- ----
- 1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
- 1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
- 1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
- 1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
- 1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
- ----
- Each line includes the file size in bytes that was found to be
- duplicated, the number of duplicates found, and a list of the
- duplicated paths. Files smaller than 100 MB are ignored, since files
- that match at these exact larger sizes are reasonably likely to be
- duplicates.
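- The same grouping can be sketched in plain Python: bucket paths by the
- exact byte size in the seventh Delimited column, keep sizes over 100 MB
- that occur more than once, and sort by size. Illustrative only, not part
- of the tool:

```python
from collections import defaultdict

# Illustrative Python equivalent of probableDuplicates.pig: group paths
# by exact file size (field index 6), drop files of 100 MB or less and
# sizes with only one file, then sort ascending by size.

MIN_SIZE = 100 * 1024 * 1024  # 100 MB threshold, as in the Pig script

def probable_duplicates(lines):
    by_size = defaultdict(list)
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        size = int(fields[6])
        if size > MIN_SIZE:
            by_size[size].append(fields[0])
    return sorted((size, len(paths), paths)
                  for size, paths in by_size.items() if len(paths) > 1)
```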