  1. ~~ Licensed to the Apache Software Foundation (ASF) under one or more
  2. ~~ contributor license agreements. See the NOTICE file distributed with
  3. ~~ this work for additional information regarding copyright ownership.
  4. ~~ The ASF licenses this file to You under the Apache License, Version 2.0
  5. ~~ (the "License"); you may not use this file except in compliance with
  6. ~~ the License. You may obtain a copy of the License at
  7. ~~
  8. ~~ http://www.apache.org/licenses/LICENSE-2.0
  9. ~~
  10. ~~ Unless required by applicable law or agreed to in writing, software
  11. ~~ distributed under the License is distributed on an "AS IS" BASIS,
  12. ~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  13. ~~ See the License for the specific language governing permissions and
  14. ~~ limitations under the License.
  15. ---
  16. Metrics Guide
  17. ---
  18. ---
  19. ${maven.build.timestamp}
  20. %{toc}
  21. Overview
  22. Metrics are statistical information exposed by Hadoop daemons,
  23. used for monitoring, performance tuning, and debugging.
  24. There are many metrics available by default
  25. and they are very useful for troubleshooting.
  26. This page shows the details of the available metrics.
  27. Each section describes one context into which metrics are grouped.
  28. The documentation of the Metrics 2.0 framework is
  29. {{{../../api/org/apache/hadoop/metrics2/package-summary.html}here}}.
  30. jvm context
  31. * JvmMetrics
  32. Each metrics record contains tags such as ProcessName, SessionId
  33. and Hostname as additional information along with metrics.
  34. *-------------------------------------+--------------------------------------+
  35. || Name || Description
  36. *-------------------------------------+--------------------------------------+
  37. |<<<MemNonHeapUsedM>>> | Current non-heap memory used in MB
  38. *-------------------------------------+--------------------------------------+
  39. |<<<MemNonHeapCommittedM>>> | Current non-heap memory committed in MB
  40. *-------------------------------------+--------------------------------------+
  41. |<<<MemNonHeapMaxM>>> | Max non-heap memory size in MB
  42. *-------------------------------------+--------------------------------------+
  43. |<<<MemHeapUsedM>>> | Current heap memory used in MB
  44. *-------------------------------------+--------------------------------------+
  45. |<<<MemHeapCommittedM>>> | Current heap memory committed in MB
  46. *-------------------------------------+--------------------------------------+
  47. |<<<MemHeapMaxM>>> | Max heap memory size in MB
  48. *-------------------------------------+--------------------------------------+
  49. |<<<MemMaxM>>> | Max memory size in MB
  50. *-------------------------------------+--------------------------------------+
  51. |<<<ThreadsNew>>> | Current number of NEW threads
  52. *-------------------------------------+--------------------------------------+
  53. |<<<ThreadsRunnable>>> | Current number of RUNNABLE threads
  54. *-------------------------------------+--------------------------------------+
  55. |<<<ThreadsBlocked>>> | Current number of BLOCKED threads
  56. *-------------------------------------+--------------------------------------+
  57. |<<<ThreadsWaiting>>> | Current number of WAITING threads
  58. *-------------------------------------+--------------------------------------+
  59. |<<<ThreadsTimedWaiting>>> | Current number of TIMED_WAITING threads
  60. *-------------------------------------+--------------------------------------+
  61. |<<<ThreadsTerminated>>> | Current number of TERMINATED threads
  62. *-------------------------------------+--------------------------------------+
  63. |<<<GcInfo>>> | Total GC count and GC time in msec, grouped by the kind of GC. \
  64. | ex.) GcCountPS Scavenge=6, GcTimeMillisPS Scavenge=40,
  65. | GcCountPS MarkSweep=0, GcTimeMillisPS MarkSweep=0
  66. *-------------------------------------+--------------------------------------+
  67. |<<<GcCount>>> | Total GC count
  68. *-------------------------------------+--------------------------------------+
  69. |<<<GcTimeMillis>>> | Total GC time in msec
  70. *-------------------------------------+--------------------------------------+
  71. |<<<LogFatal>>> | Total number of FATAL logs
  72. *-------------------------------------+--------------------------------------+
  73. |<<<LogError>>> | Total number of ERROR logs
  74. *-------------------------------------+--------------------------------------+
  75. |<<<LogWarn>>> | Total number of WARN logs
  76. *-------------------------------------+--------------------------------------+
  77. |<<<LogInfo>>> | Total number of INFO logs
  78. *-------------------------------------+--------------------------------------+
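The JvmMetrics gauges and counters above combine into simple health ratios. The following is a minimal sketch, assuming the record has already been read into a plain dictionary keyed by metric name; the helper names and the sample values are hypothetical and are not part of Hadoop.

```python
def heap_usage_ratio(metrics):
    """Return MemHeapUsedM / MemHeapMaxM, or None when the max is not positive."""
    used_mb = float(metrics["MemHeapUsedM"])
    max_mb = float(metrics["MemHeapMaxM"])
    if max_mb <= 0:
        return None
    return used_mb / max_mb


def mean_gc_time_ms(metrics):
    """Return GcTimeMillis / GcCount, or None when no collection has run yet."""
    count = int(metrics["GcCount"])
    if count == 0:
        return None
    return float(metrics["GcTimeMillis"]) / count


# Hypothetical sample record; the values are made up for illustration.
sample = {
    "MemHeapUsedM": 256.0,
    "MemHeapMaxM": 1024.0,
    "GcCount": 6,
    "GcTimeMillis": 40,
}

if __name__ == "__main__":
    print(heap_usage_ratio(sample))
    print(mean_gc_time_ms(sample))
```

Both helpers return None instead of raising when the denominator is not usable, which keeps a monitoring loop from failing on a freshly started daemon.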
  79. rpc context
  80. * rpc
  81. Each metrics record contains tags such as Hostname
  82. and port (number to which server is bound)
  83. as additional information along with metrics.
  84. *-------------------------------------+--------------------------------------+
  85. || Name || Description
  86. *-------------------------------------+--------------------------------------+
  87. |<<<ReceivedBytes>>> | Total number of received bytes
  88. *-------------------------------------+--------------------------------------+
  89. |<<<SentBytes>>> | Total number of sent bytes
  90. *-------------------------------------+--------------------------------------+
  91. |<<<RpcQueueTimeNumOps>>> | Total number of RPC calls
  92. *-------------------------------------+--------------------------------------+
  93. |<<<RpcQueueTimeAvgTime>>> | Average queue time in milliseconds
  94. *-------------------------------------+--------------------------------------+
  95. |<<<RpcProcessingTimeNumOps>>> | Total number of RPC calls (same as
  96. | RpcQueueTimeNumOps)
  97. *-------------------------------------+--------------------------------------+
  98. |<<<RpcProcessingTimeAvgTime>>> | Average processing time in milliseconds
  99. *-------------------------------------+--------------------------------------+
  100. |<<<RpcAuthenticationFailures>>> | Total number of authentication failures
  101. *-------------------------------------+--------------------------------------+
  102. |<<<RpcAuthenticationSuccesses>>> | Total number of authentication successes
  103. *-------------------------------------+--------------------------------------+
  104. |<<<RpcAuthorizationFailures>>> | Total number of authorization failures
  105. *-------------------------------------+--------------------------------------+
  106. |<<<RpcAuthorizationSuccesses>>> | Total number of authorization successes
  107. *-------------------------------------+--------------------------------------+
  108. |<<<NumOpenConnections>>> | Current number of open connections
  109. *-------------------------------------+--------------------------------------+
  110. |<<<CallQueueLength>>> | Current length of the call queue
  111. *-------------------------------------+--------------------------------------+
  112. |<<<rpcQueueTime>>><num><<<sNumOps>>> | Shows total number of RPC calls
  113. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  114. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  115. *-------------------------------------+--------------------------------------+
  116. |<<<rpcQueueTime>>><num><<<s50thPercentileLatency>>> |
  117. | | Shows the 50th percentile of RPC queue time in milliseconds
  118. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  119. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  120. *-------------------------------------+--------------------------------------+
  121. |<<<rpcQueueTime>>><num><<<s75thPercentileLatency>>> |
  122. | | Shows the 75th percentile of RPC queue time in milliseconds
  123. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  124. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  125. *-------------------------------------+--------------------------------------+
  126. |<<<rpcQueueTime>>><num><<<s90thPercentileLatency>>> |
  127. | | Shows the 90th percentile of RPC queue time in milliseconds
  128. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  129. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  130. *-------------------------------------+--------------------------------------+
  131. |<<<rpcQueueTime>>><num><<<s95thPercentileLatency>>> |
  132. | | Shows the 95th percentile of RPC queue time in milliseconds
  133. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  134. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  135. *-------------------------------------+--------------------------------------+
  136. |<<<rpcQueueTime>>><num><<<s99thPercentileLatency>>> |
  137. | | Shows the 99th percentile of RPC queue time in milliseconds
  138. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  139. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  140. *-------------------------------------+--------------------------------------+
  141. |<<<rpcProcessingTime>>><num><<<sNumOps>>> | Shows total number of RPC calls
  142. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  143. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  144. *-------------------------------------+--------------------------------------+
  145. |<<<rpcProcessingTime>>><num><<<s50thPercentileLatency>>> |
  146. | | Shows the 50th percentile of RPC processing time in milliseconds
  147. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  148. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  149. *-------------------------------------+--------------------------------------+
  150. |<<<rpcProcessingTime>>><num><<<s75thPercentileLatency>>> |
  151. | | Shows the 75th percentile of RPC processing time in milliseconds
  152. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  153. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  154. *-------------------------------------+--------------------------------------+
  155. |<<<rpcProcessingTime>>><num><<<s90thPercentileLatency>>> |
  156. | | Shows the 90th percentile of RPC processing time in milliseconds
  157. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  158. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  159. *-------------------------------------+--------------------------------------+
  160. |<<<rpcProcessingTime>>><num><<<s95thPercentileLatency>>> |
  161. | | Shows the 95th percentile of RPC processing time in milliseconds
  162. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  163. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  164. *-------------------------------------+--------------------------------------+
  165. |<<<rpcProcessingTime>>><num><<<s99thPercentileLatency>>> |
  166. | | Shows the 99th percentile of RPC processing time in milliseconds
  167. | | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
  168. | | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
  169. *-------------------------------------+--------------------------------------+
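The quantile metric names in the table above follow a fixed pattern: a prefix, the interval in seconds, the letter "s", and a suffix. The sketch below expands that pattern for a given interval setting. It assumes that the value of rpc.metrics.percentiles.intervals is a comma-separated list of integers in seconds; the function name is hypothetical.

```python
# Percentiles listed in the rpc table above.
PERCENTILES = (50, 75, 90, 95, 99)


def quantile_metric_names(prefix, intervals):
    """Expand a prefix such as 'rpcQueueTime' into its per-interval metric names.

    `intervals` is assumed to be a comma-separated list of seconds, e.g. "60,300".
    """
    names = []
    for part in intervals.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty entries such as a trailing comma
        base = f"{prefix}{int(part)}s"
        names.append(base + "NumOps")
        names.extend(f"{base}{p}thPercentileLatency" for p in PERCENTILES)
    return names


if __name__ == "__main__":
    for name in quantile_metric_names("rpcQueueTime", "60,300"):
        print(name)
```

Each interval contributes six names per prefix: one NumOps counter and five percentile latencies.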
  170. * RetryCache/NameNodeRetryCache
  171. RetryCache metrics are useful for monitoring NameNode fail-over.
  172. Each metrics record contains the Hostname tag.
  173. *-------------------------------------+--------------------------------------+
  174. || Name || Description
  175. *-------------------------------------+--------------------------------------+
  176. |<<<CacheHit>>> | Total number of RetryCache hits
  177. *-------------------------------------+--------------------------------------+
  178. |<<<CacheCleared>>> | Total number of times RetryCache was cleared
  179. *-------------------------------------+--------------------------------------+
  180. |<<<CacheUpdated>>> | Total number of RetryCache updates
  181. *-------------------------------------+--------------------------------------+
  182. rpcdetailed context
  183. Metrics of the rpcdetailed context are exposed in a unified manner by the RPC
  184. layer. Two metrics are exposed for each RPC based on its name.
  185. Metrics named "(RPC method name)NumOps" indicate the total number of
  186. method calls, and metrics named "(RPC method name)AvgTime" show the
  187. average turnaround time for method calls in milliseconds.
  188. * rpcdetailed
  189. Each metrics record contains tags such as Hostname
  190. and port (number to which server is bound)
  191. as additional information along with metrics.
  192. Metrics for RPCs that have not been called are not included
  193. in the metrics record.
  194. *-------------------------------------+--------------------------------------+
  195. || Name || Description
  196. *-------------------------------------+--------------------------------------+
  197. |<methodname><<<NumOps>>> | Total number of times the method is called
  198. *-------------------------------------+--------------------------------------+
  199. |<methodname><<<AvgTime>>> | Average turnaround time of the method in
  200. | milliseconds
  201. *-------------------------------------+--------------------------------------+
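Because every rpcdetailed metric is a method name followed by NumOps or AvgTime, a consumer can recover the method name by stripping the suffix. The following is a minimal sketch of that split; the function name and the example metric names are hypothetical.

```python
SUFFIXES = ("NumOps", "AvgTime")


def split_rpc_detailed_name(name):
    """Split '<methodname>NumOps' or '<methodname>AvgTime' into (method, suffix).

    Returns None when the name does not follow the rpcdetailed pattern.
    """
    for suffix in SUFFIXES:
        # Require a non-empty method name in front of the suffix.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], suffix
    return None


if __name__ == "__main__":
    # Hypothetical metric names, used only to exercise the function.
    print(split_rpc_detailed_name("MkdirsNumOps"))
    print(split_rpc_detailed_name("MkdirsAvgTime"))
```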
  202. dfs context
  203. * namenode
  204. Each metrics record contains tags such as ProcessName, SessionId,
  205. and Hostname as additional information along with metrics.
  206. *-------------------------------------+--------------------------------------+
  207. || Name || Description
  208. *-------------------------------------+--------------------------------------+
  209. |<<<CreateFileOps>>> | Total number of files created
  210. *-------------------------------------+--------------------------------------+
  211. |<<<FilesCreated>>> | Total number of files and directories created by create
  212. | or mkdir operations
  213. *-------------------------------------+--------------------------------------+
  214. |<<<FilesAppended>>> | Total number of files appended
  215. *-------------------------------------+--------------------------------------+
  216. |<<<GetBlockLocations>>> | Total number of getBlockLocations operations
  217. *-------------------------------------+--------------------------------------+
  218. |<<<FilesRenamed>>> | Total number of rename <<operations>> (NOT number of
  219. | files/dirs renamed)
  220. *-------------------------------------+--------------------------------------+
  221. |<<<GetListingOps>>> | Total number of directory listing operations
  222. *-------------------------------------+--------------------------------------+
  223. |<<<DeleteFileOps>>> | Total number of delete operations
  224. *-------------------------------------+--------------------------------------+
  225. |<<<FilesDeleted>>> | Total number of files and directories deleted by delete
  226. | or rename operations
  227. *-------------------------------------+--------------------------------------+
  228. |<<<FileInfoOps>>> | Total number of getFileInfo and getLinkFileInfo
  229. | operations
  230. *-------------------------------------+--------------------------------------+
  231. |<<<AddBlockOps>>> | Total number of successful addBlock operations
  232. *-------------------------------------+--------------------------------------+
  233. |<<<GetAdditionalDatanodeOps>>> | Total number of getAdditionalDatanode
  234. | operations
  235. *-------------------------------------+--------------------------------------+
  236. |<<<CreateSymlinkOps>>> | Total number of createSymlink operations
  237. *-------------------------------------+--------------------------------------+
  238. |<<<GetLinkTargetOps>>> | Total number of getLinkTarget operations
  239. *-------------------------------------+--------------------------------------+
  240. |<<<FilesInGetListingOps>>> | Total number of files and directories listed by
  241. | directory listing operations
  242. *-------------------------------------+--------------------------------------+
  243. |<<<AllowSnapshotOps>>> | Total number of allowSnapshot operations
  244. *-------------------------------------+--------------------------------------+
  245. |<<<DisallowSnapshotOps>>> | Total number of disallowSnapshot operations
  246. *-------------------------------------+--------------------------------------+
  247. |<<<CreateSnapshotOps>>> | Total number of createSnapshot operations
  248. *-------------------------------------+--------------------------------------+
  249. |<<<DeleteSnapshotOps>>> | Total number of deleteSnapshot operations
  250. *-------------------------------------+--------------------------------------+
  251. |<<<RenameSnapshotOps>>> | Total number of renameSnapshot operations
  252. *-------------------------------------+--------------------------------------+
  253. |<<<ListSnapshottableDirOps>>> | Total number of snapshottableDirectoryStatus
  254. | operations
  255. *-------------------------------------+--------------------------------------+
  256. |<<<SnapshotDiffReportOps>>> | Total number of getSnapshotDiffReport
  257. | operations
  258. *-------------------------------------+--------------------------------------+
  259. |<<<TransactionsNumOps>>> | Total number of Journal transactions
  260. *-------------------------------------+--------------------------------------+
  261. |<<<TransactionsAvgTime>>> | Average time of Journal transactions in
  262. | milliseconds
  263. *-------------------------------------+--------------------------------------+
  264. |<<<SyncsNumOps>>> | Total number of Journal syncs
  265. *-------------------------------------+--------------------------------------+
  266. |<<<SyncsAvgTime>>> | Average time of Journal syncs in milliseconds
  267. *-------------------------------------+--------------------------------------+
  268. |<<<TransactionsBatchedInSync>>> | Total number of Journal transactions batched
  269. | in sync
  270. *-------------------------------------+--------------------------------------+
  271. |<<<BlockReportNumOps>>> | Total number of block reports from DataNodes
  272. | that have been processed
  273. *-------------------------------------+--------------------------------------+
  274. |<<<BlockReportAvgTime>>> | Average time of processing block reports in
  275. | milliseconds
  276. *-------------------------------------+--------------------------------------+
  277. |<<<CacheReportNumOps>>> | Total number of cache reports from DataNodes
  278. | that have been processed
  279. *-------------------------------------+--------------------------------------+
  280. |<<<CacheReportAvgTime>>> | Average time of processing cache reports in
  281. | milliseconds
  282. *-------------------------------------+--------------------------------------+
  283. |<<<SafeModeTime>>> | The interval in milliseconds between FSNamesystem
  284. | startup and the last time the NameNode left safemode. \
  285. | (sometimes not equal to the time in SafeMode,
  286. | see {{{https://issues.apache.org/jira/browse/HDFS-5156}HDFS-5156}})
  287. *-------------------------------------+--------------------------------------+
  288. |<<<FsImageLoadTime>>> | Time loading FS Image at startup in milliseconds
  289. *-------------------------------------+--------------------------------------+
  292. |<<<GetEditNumOps>>> | Total number of edits downloads from SecondaryNameNode
  293. *-------------------------------------+--------------------------------------+
  294. |<<<GetEditAvgTime>>> | Average edits download time in milliseconds
  295. *-------------------------------------+--------------------------------------+
  296. |<<<GetImageNumOps>>> | Total number of fsimage downloads from SecondaryNameNode
  297. *-------------------------------------+--------------------------------------+
  298. |<<<GetImageAvgTime>>> | Average fsimage download time in milliseconds
  299. *-------------------------------------+--------------------------------------+
  300. |<<<PutImageNumOps>>> | Total number of fsimage uploads to SecondaryNameNode
  301. *-------------------------------------+--------------------------------------+
  302. |<<<PutImageAvgTime>>> | Average fsimage upload time in milliseconds
  303. *-------------------------------------+--------------------------------------+
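Several of the namenode counters above come in related pairs, so a ratio of two totals gives a per-operation average. The following is a minimal sketch, assuming the counters have been read into a dictionary; it divides FilesInGetListingOps by GetListingOps to estimate the average number of entries returned per directory listing. The function name and sample values are hypothetical.

```python
def avg_entries_per_listing(metrics):
    """Return FilesInGetListingOps / GetListingOps, or None if no listing ran."""
    listings = int(metrics["GetListingOps"])
    if listings == 0:
        return None
    return int(metrics["FilesInGetListingOps"]) / listings


# Hypothetical counter values, made up for illustration.
sample = {"GetListingOps": 200, "FilesInGetListingOps": 5000}

if __name__ == "__main__":
    print(avg_entries_per_listing(sample))
```

Since both counters are totals, the result is an average over everything counted so far; taking the difference between two samples first gives an average over just that window.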
  304. * FSNamesystem
  305. Each metrics record contains tags such as HAState and Hostname
  306. as additional information along with metrics.
  307. *-------------------------------------+--------------------------------------+
  308. || Name || Description
  309. *-------------------------------------+--------------------------------------+
  310. |<<<MissingBlocks>>> | Current number of missing blocks
  311. *-------------------------------------+--------------------------------------+
  312. |<<<ExpiredHeartbeats>>> | Total number of expired heartbeats
  313. *-------------------------------------+--------------------------------------+
  314. |<<<TransactionsSinceLastCheckpoint>>> | Total number of transactions since
  315. | last checkpoint
  316. *-------------------------------------+--------------------------------------+
  317. |<<<TransactionsSinceLastLogRoll>>> | Total number of transactions since last
  318. | edit log roll
  319. *-------------------------------------+--------------------------------------+
  320. |<<<LastWrittenTransactionId>>> | Last transaction ID written to the edit log
  321. *-------------------------------------+--------------------------------------+
  322. |<<<LastCheckpointTime>>> | Time in milliseconds since epoch of last checkpoint
  323. *-------------------------------------+--------------------------------------+
  324. |<<<CapacityTotal>>> | Current raw capacity of DataNodes in bytes
  325. *-------------------------------------+--------------------------------------+
  326. |<<<CapacityTotalGB>>> | Current raw capacity of DataNodes in GB
  327. *-------------------------------------+--------------------------------------+
  328. |<<<CapacityUsed>>> | Current used capacity across all DataNodes in bytes
  329. *-------------------------------------+--------------------------------------+
  330. |<<<CapacityUsedGB>>> | Current used capacity across all DataNodes in GB
  331. *-------------------------------------+--------------------------------------+
  332. |<<<CapacityRemaining>>> | Current remaining capacity in bytes
  333. *-------------------------------------+--------------------------------------+
  334. |<<<CapacityRemainingGB>>> | Current remaining capacity in GB
  335. *-------------------------------------+--------------------------------------+
  336. |<<<CapacityUsedNonDFS>>> | Current space used by DataNodes for non DFS
  337. | purposes in bytes
  338. *-------------------------------------+--------------------------------------+
  339. |<<<TotalLoad>>> | Current number of connections
  340. *-------------------------------------+--------------------------------------+
  341. |<<<SnapshottableDirectories>>> | Current number of snapshottable directories
  342. *-------------------------------------+--------------------------------------+
  343. |<<<Snapshots>>> | Current number of snapshots
  344. *-------------------------------------+--------------------------------------+
  345. |<<<BlocksTotal>>> | Current number of allocated blocks in the system
  346. *-------------------------------------+--------------------------------------+
  347. |<<<FilesTotal>>> | Current number of files and directories
  348. *-------------------------------------+--------------------------------------+
  349. |<<<PendingReplicationBlocks>>> | Current number of blocks pending to be
  350. | replicated
  351. *-------------------------------------+--------------------------------------+
  352. |<<<UnderReplicatedBlocks>>> | Current number of under-replicated blocks
  353. *-------------------------------------+--------------------------------------+
  354. |<<<CorruptBlocks>>> | Current number of blocks with corrupt replicas.
  355. *-------------------------------------+--------------------------------------+
  356. |<<<ScheduledReplicationBlocks>>> | Current number of blocks scheduled for
  357. | replication
  358. *-------------------------------------+--------------------------------------+
  359. |<<<PendingDeletionBlocks>>> | Current number of blocks pending deletion
  360. *-------------------------------------+--------------------------------------+
  361. |<<<ExcessBlocks>>> | Current number of excess blocks
  362. *-------------------------------------+--------------------------------------+
  363. |<<<PostponedMisreplicatedBlocks>>> | (HA-only) Current number of blocks
  364. | whose replication is postponed
  365. *-------------------------------------+--------------------------------------+
  366. |<<<PendingDataNodeMessageCount>>> | (HA-only) Current number of pending
  367. | block-related messages for later
  368. | processing in the standby NameNode
  369. *-------------------------------------+--------------------------------------+
  370. |<<<MillisSinceLastLoadedEdits>>> | (HA-only) Time in milliseconds since the
  371. | standby NameNode last loaded the edit log.
  372. | Set to 0 on the active NameNode
  373. *-------------------------------------+--------------------------------------+
  374. |<<<BlockCapacity>>> | Current block capacity
  375. *-------------------------------------+--------------------------------------+
  376. |<<<StaleDataNodes>>> | Current number of DataNodes marked stale due to delayed
  377. | heartbeat
  378. *-------------------------------------+--------------------------------------+
  379. |<<<TotalFiles>>> | Current number of files and directories (same as FilesTotal)
  380. *-------------------------------------+--------------------------------------+
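Two common derived values from the FSNamesystem record are the fraction of raw capacity in use and the age of the last checkpoint. The sketch below assumes the record is available as a dictionary, that CapacityUsed and CapacityTotal are both in bytes, and that LastCheckpointTime is in milliseconds since the epoch, as described in the table. The helper names and sample values are hypothetical.

```python
import time


def capacity_used_fraction(metrics):
    """Return CapacityUsed / CapacityTotal, or None when the total is not positive."""
    total = metrics["CapacityTotal"]
    if total <= 0:
        return None
    return metrics["CapacityUsed"] / total


def checkpoint_age_seconds(metrics, now_ms=None):
    """Return seconds elapsed since LastCheckpointTime (milliseconds since epoch)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - metrics["LastCheckpointTime"]) / 1000.0


# Hypothetical sample values, made up for illustration.
sample = {
    "CapacityTotal": 4_000_000_000_000,
    "CapacityUsed": 1_000_000_000_000,
    "LastCheckpointTime": 1_700_000_000_000,
}

if __name__ == "__main__":
    print(capacity_used_fraction(sample))
    print(checkpoint_age_seconds(sample, now_ms=1_700_000_360_000))
```

Passing now_ms explicitly keeps the age calculation deterministic, which is convenient when replaying recorded metrics.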
  381. * JournalNode
  382. The server-side metrics for a journal from the JournalNode's perspective.
  383. Each metrics record contains the Hostname tag as additional information
  384. along with metrics.
  385. *-------------------------------------+--------------------------------------+
  386. || Name || Description
  387. *-------------------------------------+--------------------------------------+
  388. |<<<Syncs60sNumOps>>> | Number of sync operations (1 minute granularity)
  389. *-------------------------------------+--------------------------------------+
  390. |<<<Syncs60s50thPercentileLatencyMicros>>> | The 50th percentile of sync
  391. | | latency in microseconds (1 minute granularity)
  392. *-------------------------------------+--------------------------------------+
  393. |<<<Syncs60s75thPercentileLatencyMicros>>> | The 75th percentile of sync
  394. | | latency in microseconds (1 minute granularity)
  395. *-------------------------------------+--------------------------------------+
  396. |<<<Syncs60s90thPercentileLatencyMicros>>> | The 90th percentile of sync
  397. | | latency in microseconds (1 minute granularity)
  398. *-------------------------------------+--------------------------------------+
  399. |<<<Syncs60s95thPercentileLatencyMicros>>> | The 95th percentile of sync
  400. | | latency in microseconds (1 minute granularity)
  401. *-------------------------------------+--------------------------------------+
  402. |<<<Syncs60s99thPercentileLatencyMicros>>> | The 99th percentile of sync
  403. | | latency in microseconds (1 minute granularity)
  404. *-------------------------------------+--------------------------------------+
  405. |<<<Syncs300sNumOps>>> | Number of sync operations (5 minutes granularity)
  406. *-------------------------------------+--------------------------------------+
  407. |<<<Syncs300s50thPercentileLatencyMicros>>> | The 50th percentile of sync
  408. | | latency in microseconds (5 minutes granularity)
  409. *-------------------------------------+--------------------------------------+
  410. |<<<Syncs300s75thPercentileLatencyMicros>>> | The 75th percentile of sync
  411. | | latency in microseconds (5 minutes granularity)
  412. *-------------------------------------+--------------------------------------+
  413. |<<<Syncs300s90thPercentileLatencyMicros>>> | The 90th percentile of sync
  414. | | latency in microseconds (5 minutes granularity)
  415. *-------------------------------------+--------------------------------------+
  416. |<<<Syncs300s95thPercentileLatencyMicros>>> | The 95th percentile of sync
  417. | | latency in microseconds (5 minutes granularity)
  418. *-------------------------------------+--------------------------------------+
  419. |<<<Syncs300s99thPercentileLatencyMicros>>> | The 99th percentile of sync
  420. | | latency in microseconds (5 minutes granularity)
  421. *-------------------------------------+--------------------------------------+
  422. |<<<Syncs3600sNumOps>>> | Number of sync operations (1 hour granularity)
  423. *-------------------------------------+--------------------------------------+
  424. |<<<Syncs3600s50thPercentileLatencyMicros>>> | The 50th percentile of sync
  425. | | latency in microseconds (1 hour granularity)
  426. *-------------------------------------+--------------------------------------+
  427. |<<<Syncs3600s75thPercentileLatencyMicros>>> | The 75th percentile of sync
  428. | | latency in microseconds (1 hour granularity)
  429. *-------------------------------------+--------------------------------------+
  430. |<<<Syncs3600s90thPercentileLatencyMicros>>> | The 90th percentile of sync
  431. | | latency in microseconds (1 hour granularity)
  432. *-------------------------------------+--------------------------------------+
  433. |<<<Syncs3600s95thPercentileLatencyMicros>>> | The 95th percentile of sync
  434. | | latency in microseconds (1 hour granularity)
  435. *-------------------------------------+--------------------------------------+
  436. |<<<Syncs3600s99thPercentileLatencyMicros>>> | The 99th percentile of sync
  437. | | latency in microseconds (1 hour granularity)
  438. *-------------------------------------+--------------------------------------+
  439. |<<<BatchesWritten>>> | Total number of batches written since startup
  440. *-------------------------------------+--------------------------------------+
  441. |<<<TxnsWritten>>> | Total number of transactions written since startup
  442. *-------------------------------------+--------------------------------------+
  443. |<<<BytesWritten>>> | Total number of bytes written since startup
  444. *-------------------------------------+--------------------------------------+
  445. |<<<BatchesWrittenWhileLagging>>> | Total number of batches written where this
  446. | | node was lagging
  447. *-------------------------------------+--------------------------------------+
  448. |<<<LastWriterEpoch>>> | Current writer's epoch number
  449. *-------------------------------------+--------------------------------------+
  450. |<<<CurrentLagTxns>>> | The number of transactions by which this JournalNode is
  451. | | lagging behind
  452. *-------------------------------------+--------------------------------------+
  453. |<<<LastWrittenTxId>>> | The highest transaction id stored on this JournalNode
  454. *-------------------------------------+--------------------------------------+
  455. |<<<LastPromisedEpoch>>> | The last epoch number for which this node has promised
  456. | | not to accept any lower epoch, or 0 if no promises have been made
  457. *-------------------------------------+--------------------------------------+
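The sync metric names in the table above follow a regular pattern: the prefix
Syncs, a window length in seconds, and either NumOps or a percentile suffix.
As a minimal sketch, a monitoring script could split such a name into its
parts. The helper name parse_sync_metric is hypothetical and not part of
Hadoop, and the pattern is inferred only from the names listed above.

```python
import re

# Pattern inferred from the metric names listed in the table above.
_SYNC_RE = re.compile(r"^Syncs(\d+)s(?:NumOps|(\d+)thPercentileLatencyMicros)$")


def parse_sync_metric(name):
    """Return (window_seconds, percentile), or None if the name does not match.

    percentile is None for the NumOps counter of a window.
    """
    m = _SYNC_RE.match(name)
    if m is None:
        return None
    window = int(m.group(1))
    percentile = int(m.group(2)) if m.group(2) else None
    return window, percentile


if __name__ == "__main__":
    print(parse_sync_metric("Syncs300s95thPercentileLatencyMicros"))  # (300, 95)
    print(parse_sync_metric("Syncs3600sNumOps"))                      # (3600, None)
    print(parse_sync_metric("BytesWritten"))                          # None
```

Grouping readings by the returned window makes it easier to compare, for
example, the 1 minute and 1 hour 99th percentiles side by side.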
  458. * datanode
  459. Each metrics record contains tags such as SessionId and Hostname
  460. as additional information along with metrics.
  461. *-------------------------------------+--------------------------------------+
  462. || Name || Description
  463. *-------------------------------------+--------------------------------------+
  464. |<<<BytesWritten>>> | Total number of bytes written to DataNode
  465. *-------------------------------------+--------------------------------------+
  466. |<<<BytesRead>>> | Total number of bytes read from DataNode
  467. *-------------------------------------+--------------------------------------+
  468. |<<<BlocksWritten>>> | Total number of blocks written to DataNode
  469. *-------------------------------------+--------------------------------------+
  470. |<<<BlocksRead>>> | Total number of blocks read from DataNode
  471. *-------------------------------------+--------------------------------------+
  472. |<<<BlocksReplicated>>> | Total number of blocks replicated
  473. *-------------------------------------+--------------------------------------+
  474. |<<<BlocksRemoved>>> | Total number of blocks removed
  475. *-------------------------------------+--------------------------------------+
  476. |<<<BlocksVerified>>> | Total number of blocks verified
  477. *-------------------------------------+--------------------------------------+
  478. |<<<BlockVerificationFailures>>> | Total number of verification failures
  479. *-------------------------------------+--------------------------------------+
  480. |<<<BlocksCached>>> | Total number of blocks cached
  481. *-------------------------------------+--------------------------------------+
  482. |<<<BlocksUncached>>> | Total number of blocks uncached
  483. *-------------------------------------+--------------------------------------+
  484. |<<<ReadsFromLocalClient>>> | Total number of read operations from local client
  485. *-------------------------------------+--------------------------------------+
  486. |<<<ReadsFromRemoteClient>>> | Total number of read operations from remote
  487. | client
  488. *-------------------------------------+--------------------------------------+
  489. |<<<WritesFromLocalClient>>> | Total number of write operations from local
  490. | client
  491. *-------------------------------------+--------------------------------------+
  492. |<<<WritesFromRemoteClient>>> | Total number of write operations from remote
  493. | client
  494. *-------------------------------------+--------------------------------------+
  495. |<<<BlocksGetLocalPathInfo>>> | Total number of operations to get local path
  496. | names of blocks
  497. *-------------------------------------+--------------------------------------+
  498. |<<<FsyncCount>>> | Total number of fsync operations
  499. *-------------------------------------+--------------------------------------+
  500. |<<<VolumeFailures>>> | Total number of volume failures that have occurred
  501. *-------------------------------------+--------------------------------------+
  502. |<<<ReadBlockOpNumOps>>> | Total number of read operations
  503. *-------------------------------------+--------------------------------------+
  504. |<<<ReadBlockOpAvgTime>>> | Average time of read operations in milliseconds
  505. *-------------------------------------+--------------------------------------+
  506. |<<<WriteBlockOpNumOps>>> | Total number of write operations
  507. *-------------------------------------+--------------------------------------+
  508. |<<<WriteBlockOpAvgTime>>> | Average time of write operations in milliseconds
  509. *-------------------------------------+--------------------------------------+
  510. |<<<BlockChecksumOpNumOps>>> | Total number of blockChecksum operations
  511. *-------------------------------------+--------------------------------------+
  512. |<<<BlockChecksumOpAvgTime>>> | Average time of blockChecksum operations in
  513. | milliseconds
  514. *-------------------------------------+--------------------------------------+
  515. |<<<CopyBlockOpNumOps>>> | Total number of block copy operations
  516. *-------------------------------------+--------------------------------------+
  517. |<<<CopyBlockOpAvgTime>>> | Average time of block copy operations in
  518. | milliseconds
  519. *-------------------------------------+--------------------------------------+
  520. |<<<ReplaceBlockOpNumOps>>> | Total number of block replace operations
  521. *-------------------------------------+--------------------------------------+
  522. |<<<ReplaceBlockOpAvgTime>>> | Average time of block replace operations in
  523. | milliseconds
  524. *-------------------------------------+--------------------------------------+
  525. |<<<HeartbeatsNumOps>>> | Total number of heartbeats
  526. *-------------------------------------+--------------------------------------+
  527. |<<<HeartbeatsAvgTime>>> | Average heartbeat time in milliseconds
  528. *-------------------------------------+--------------------------------------+
  529. |<<<BlockReportsNumOps>>> | Total number of block report operations
  530. *-------------------------------------+--------------------------------------+
  531. |<<<BlockReportsAvgTime>>> | Average time of block report operations in
  532. | milliseconds
  533. *-------------------------------------+--------------------------------------+
  534. |<<<CacheReportsNumOps>>> | Total number of cache report operations
  535. *-------------------------------------+--------------------------------------+
  536. |<<<CacheReportsAvgTime>>> | Average time of cache report operations in
  537. | milliseconds
  538. *-------------------------------------+--------------------------------------+
  539. |<<<PacketAckRoundTripTimeNanosNumOps>>> | Total number of ack round trips
  540. *-------------------------------------+--------------------------------------+
  541. |<<<PacketAckRoundTripTimeNanosAvgTime>>> | Average time from ack send to
  542. | | receive minus the downstream ack time in nanoseconds
  543. *-------------------------------------+--------------------------------------+
  544. |<<<FlushNanosNumOps>>> | Total number of flushes
  545. *-------------------------------------+--------------------------------------+
  546. |<<<FlushNanosAvgTime>>> | Average flush time in nanoseconds
  547. *-------------------------------------+--------------------------------------+
  548. |<<<FsyncNanosNumOps>>> | Total number of fsync operations
  549. *-------------------------------------+--------------------------------------+
  550. |<<<FsyncNanosAvgTime>>> | Average fsync time in nanoseconds
  551. *-------------------------------------+--------------------------------------+
  552. |<<<SendDataPacketBlockedOnNetworkNanosNumOps>>> | Total number of packets
  553. | sent
  554. *-------------------------------------+--------------------------------------+
  555. |<<<SendDataPacketBlockedOnNetworkNanosAvgTime>>> | Average waiting time when
  556. | | sending packets in nanoseconds
  557. *-------------------------------------+--------------------------------------+
  558. |<<<SendDataPacketTransferNanosNumOps>>> | Total number of packets sent
  559. *-------------------------------------+--------------------------------------+
  560. |<<<SendDataPacketTransferNanosAvgTime>>> | Average transfer time when sending
  561. | packets in nanoseconds
  562. *-------------------------------------+--------------------------------------+
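The datanode table mixes two time units: most AvgTime metrics are in
milliseconds, while those whose names contain Nanos are in nanoseconds. The
sketch below shows one way a consumer might normalize readings before
comparing them. The function name avg_time_millis is hypothetical, and the
rule "a name containing Nanos means nanoseconds" is an assumption drawn only
from the descriptions in the table above.

```python
def avg_time_millis(name, value):
    """Convert an *AvgTime reading to milliseconds.

    Assumption: a metric whose name contains "Nanos" is reported in
    nanoseconds; every other *AvgTime metric is already in milliseconds.
    """
    if not name.endswith("AvgTime"):
        raise ValueError("not an AvgTime metric: " + name)
    if "Nanos" in name:
        return value / 1_000_000.0  # nanoseconds -> milliseconds
    return float(value)


if __name__ == "__main__":
    print(avg_time_millis("FsyncNanosAvgTime", 2_500_000))  # 2.5
    print(avg_time_millis("HeartbeatsAvgTime", 3))          # 3.0
```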
  563. ugi context
  564. * UgiMetrics
  565. UgiMetrics is related to user and group information.
  566. Each metrics record contains the Hostname tag as additional information
  567. along with metrics.
  568. *-------------------------------------+--------------------------------------+
  569. || Name || Description
  570. *-------------------------------------+--------------------------------------+
  571. |<<<LoginSuccessNumOps>>> | Total number of successful Kerberos logins
  572. *-------------------------------------+--------------------------------------+
  573. |<<<LoginSuccessAvgTime>>> | Average time for successful Kerberos logins in
  574. | milliseconds
  575. *-------------------------------------+--------------------------------------+
  576. |<<<LoginFailureNumOps>>> | Total number of failed Kerberos logins
  577. *-------------------------------------+--------------------------------------+
  578. |<<<LoginFailureAvgTime>>> | Average time for failed Kerberos logins in
  579. | milliseconds
  580. *-------------------------------------+--------------------------------------+
  581. |<<<getGroupsNumOps>>> | Total number of group resolutions
  582. *-------------------------------------+--------------------------------------+
  583. |<<<getGroupsAvgTime>>> | Average time for group resolution in milliseconds
  584. *-------------------------------------+--------------------------------------+
  585. |<<<getGroups>>><num><<<sNumOps>>> |
  586. | | Total number of group resolutions (<num> seconds granularity). <num> is
  587. | | specified by <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  588. *-------------------------------------+--------------------------------------+
  589. |<<<getGroups>>><num><<<s50thPercentileLatency>>> |
  590. | | Shows the 50th percentile of group resolution time in milliseconds
  591. | | (<num> seconds granularity). <num> is specified by
  592. | | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  593. *-------------------------------------+--------------------------------------+
  594. |<<<getGroups>>><num><<<s75thPercentileLatency>>> |
  595. | | Shows the 75th percentile of group resolution time in milliseconds
  596. | | (<num> seconds granularity). <num> is specified by
  597. | | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  598. *-------------------------------------+--------------------------------------+
  599. |<<<getGroups>>><num><<<s90thPercentileLatency>>> |
  600. | | Shows the 90th percentile of group resolution time in milliseconds
  601. | | (<num> seconds granularity). <num> is specified by
  602. | | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  603. *-------------------------------------+--------------------------------------+
  604. |<<<getGroups>>><num><<<s95thPercentileLatency>>> |
  605. | | Shows the 95th percentile of group resolution time in milliseconds
  606. | | (<num> seconds granularity). <num> is specified by
  607. | | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  608. *-------------------------------------+--------------------------------------+
  609. |<<<getGroups>>><num><<<s99thPercentileLatency>>> |
  610. | | Shows the 99th percentile of group resolution time in milliseconds
  611. | | (<num> seconds granularity). <num> is specified by
  612. | | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
  613. *-------------------------------------+--------------------------------------+
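The percentile metrics in the table above only exist for the intervals listed
in hadoop.user.group.metrics.percentiles.intervals. As a rough sketch, the
names to expect for a given setting can be enumerated as follows. The function
name get_groups_metric_names is hypothetical, and the code assumes the
property value is a comma-separated list of whole seconds such as "60,300".

```python
# Percentiles listed in the UgiMetrics table above.
PERCENTILES = (50, 75, 90, 95, 99)


def get_groups_metric_names(intervals):
    """List the getGroups metric names expected for a property value.

    intervals: assumed to be comma-separated seconds, e.g. "60,300".
    """
    names = []
    for part in intervals.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty entries such as a trailing comma
        num = int(part)
        names.append(f"getGroups{num}sNumOps")
        names.extend(
            f"getGroups{num}s{p}thPercentileLatency" for p in PERCENTILES
        )
    return names


if __name__ == "__main__":
    for metric in get_groups_metric_names("60,300"):
        print(metric)
```

An empty property value yields an empty list, which matches the case where no
interval is configured.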
  614. metricssystem context
  615. * MetricsSystem
  616. MetricsSystem shows the statistics for metrics snapshots and publishes.
  617. Each metrics record contains the Hostname tag as additional information
  618. along with metrics.
  619. *-------------------------------------+--------------------------------------+
  620. || Name || Description
  621. *-------------------------------------+--------------------------------------+
  622. |<<<NumActiveSources>>> | Current number of active metrics sources
  623. *-------------------------------------+--------------------------------------+
  624. |<<<NumAllSources>>> | Total number of metrics sources
  625. *-------------------------------------+--------------------------------------+
  626. |<<<NumActiveSinks>>> | Current number of active sinks
  627. *-------------------------------------+--------------------------------------+
  628. |<<<NumAllSinks>>> | Total number of sinks \
  629. | (BUT usually less than <<<NumActiveSinks>>>,
  630. | see {{{https://issues.apache.org/jira/browse/HADOOP-9946}HADOOP-9946}})
  631. *-------------------------------------+--------------------------------------+
  632. |<<<SnapshotNumOps>>> | Total number of operations to snapshot statistics from
  633. | a metrics source
  634. *-------------------------------------+--------------------------------------+
  635. |<<<SnapshotAvgTime>>> | Average time in milliseconds to snapshot statistics
  636. | from a metrics source
  637. *-------------------------------------+--------------------------------------+
  638. |<<<PublishNumOps>>> | Total number of operations to publish statistics to a
  639. | sink
  640. *-------------------------------------+--------------------------------------+
  641. |<<<PublishAvgTime>>> | Average time in milliseconds to publish statistics to
  642. | a sink
  643. *-------------------------------------+--------------------------------------+
  644. |<<<DroppedPubAll>>> | Total number of dropped publishes
  645. *-------------------------------------+--------------------------------------+
  646. |<<<Sink_>>><instance><<<NumOps>>> | Total number of sink operations for the
  647. | <instance>
  648. *-------------------------------------+--------------------------------------+
  649. |<<<Sink_>>><instance><<<AvgTime>>> | Average time in milliseconds of sink
  650. | operations for the <instance>
  651. *-------------------------------------+--------------------------------------+
  652. |<<<Sink_>>><instance><<<Dropped>>> | Total number of dropped sink operations
  653. | for the <instance>
  654. *-------------------------------------+--------------------------------------+
  655. |<<<Sink_>>><instance><<<Qsize>>> | Current queue length of sink operations \
  656. | (BUT always set to 0 because nothing
  657. | increments this metric, see
  658. | {{{https://issues.apache.org/jira/browse/HADOOP-9941}HADOOP-9941}})
  659. *-------------------------------------+--------------------------------------+
  660. default context
  661. * StartupProgress
  662. StartupProgress metrics show the statistics of NameNode startup.
  663. Four metrics are exposed for each startup phase based on its name.
  664. The startup <phase>s are <<<LoadingFsImage>>>, <<<LoadingEdits>>>,
  665. <<<SavingCheckpoint>>>, and <<<SafeMode>>>.
  666. Each metrics record contains the Hostname tag as additional information
  667. along with metrics.
  668. *-------------------------------------+--------------------------------------+
  669. || Name || Description
  670. *-------------------------------------+--------------------------------------+
  671. |<<<ElapsedTime>>> | Total elapsed time in milliseconds
  672. *-------------------------------------+--------------------------------------+
  673. |<<<PercentComplete>>> | Fraction of NameNode startup completed so far \
  674. | (The max value is not 100 but 1.0)
  675. *-------------------------------------+--------------------------------------+
  676. |<phase><<<Count>>> | Total number of steps completed in the phase
  677. *-------------------------------------+--------------------------------------+
  678. |<phase><<<ElapsedTime>>> | Total elapsed time in the phase in milliseconds
  679. *-------------------------------------+--------------------------------------+
  680. |<phase><<<Total>>> | Total number of steps in the phase
  681. *-------------------------------------+--------------------------------------+
  682. |<phase><<<PercentComplete>>> | Fraction of the phase completed so far \
  683. | (The max value is not 100 but 1.0)
  684. *-------------------------------------+--------------------------------------+
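Because the PercentComplete values top out at 1.0 rather than 100, a dashboard
usually has to scale them. The following sketch builds the per-phase metric
names from the phase list above and converts each reading to a percentage. The
function name phase_progress and the plain-dict input are illustrative
assumptions, not a Hadoop API.

```python
# Startup phases named in the StartupProgress description above.
PHASES = ("LoadingFsImage", "LoadingEdits", "SavingCheckpoint", "SafeMode")


def phase_progress(metrics):
    """Map each phase to a percentage between 0 and 100.

    metrics: dict of metric name -> value, where <phase>PercentComplete
    is a fraction between 0.0 and 1.0. Missing phases are skipped.
    """
    report = {}
    for phase in PHASES:
        fraction = metrics.get(phase + "PercentComplete")
        if fraction is not None:
            report[phase] = round(fraction * 100.0, 1)
    return report


if __name__ == "__main__":
    sample = {
        "LoadingFsImagePercentComplete": 1.0,
        "LoadingEditsPercentComplete": 0.25,
    }
    print(phase_progress(sample))
    # {'LoadingFsImage': 100.0, 'LoadingEdits': 25.0}
```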