dushenda


Quick analysis

Using a combination of commands

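# Install the tools
yum install blktrace -y
# Collect data (trace for 30 seconds)
blktrace -d /dev/nvme0n1 -w 30
# Merge the per-CPU trace files into one binary file
blkparse -i nvme0n1 -d nvme_trace.bin
# Analyze; focus on the "ALL" summary section and the "Device Overhead" breakdown
btt -i nvme_trace.bin

# The same steps as a single pipeline
blktrace -d /dev/sdf -w 5 -o - | blkparse -i - -d btt.bin > blkparse.txt
btt -i btt.bin

Using btrace

# Trace /dev/sda and print the blkparse output
btrace /dev/sda

btrace is actually a script that combines blktrace and blkparse; it is located at /usr/bin/btrace.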

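Theory

An I/O request travels from the application layer down to the underlying block device. In a simplified view of that path, once a request enters the block layer it may go through the following steps:

  • Remap: it may be remapped to another device by DM (Device Mapper) or MD (Multiple Device, software RAID)
  • Split: it may be split into several physical I/Os because it is not aligned to sector boundaries or is too large
  • Merge: it may be merged with other requests that are physically adjacent into a single I/O
  • It is dispatched to the driver by the I/O scheduler according to the scheduling policy
  • The driver submits it to the hardware; it passes through the HBA, the cabling (fibre, network cable and so on) and switches (SAN or network) before reaching the storage device, which sends the result back once the request completes.

blktrace records each step an I/O goes through. The fields of one line of blktrace output are: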

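  • Field 1: 8,0 is the device number (major and minor device ID).
  • Field 2: 3 is the CPU
  • Field 3: 11 is the sequence number
  • Field 4: 0.009507758 is the timestamp, an offset from the start of the trace
  • Field 5: the PID of the process this I/O belongs to
  • Field 6: the event. This field is very important because it shows which step the I/O has reached
  • Field 7: R means Read, W means Write, D means Discard, B means Barrier operation
  • Field 8: 223490+56 is the starting block number and the number of blocks, in other words the offset and size
  • Field 9: the process name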

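The blkparse output carries detailed information about every I/O request event, and understanding these fields is the key to the analysis:

| Field | Meaning |
| --- | --- |
| Device number (major, minor) | 8,0 usually refers to /dev/sda |
| CPU ID | The CPU core that handled this event |
| Sequence number | The sequence number of the event |
| Timestamp | When the event happened (usually relative time) |
| Process ID (PID) | The process that issued the I/O |
| Event type (Action) | The core field: the stage the I/O request is in |
| RWBS descriptor | The I/O type: R (read) / W (write) / B (barrier) / S (synchronous) |
| Sector information | 2048 + 8 means the starting sector number and the number of consecutive sectors |
| Process name | The name of the process that issued the I/O |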

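The event type (Action) is the key to understanding the I/O path. It records each stage of a request from creation to completion:

  • Q (Queued): the I/O request enters the block layer.
  • G (Get request): a request structure is allocated.
  • I (Inserted): the request is inserted into the I/O scheduler queue.
  • D (Issued): the request is issued to the device driver.
  • C (Completed): the request completes.

From the timestamps of these events, the time spent in each stage of the I/O path can be computed, for example:

  • D2C: the time the request spends in the driver and hardware; the key metric for judging hardware performance.
  • I2D: the time the request waits in the I/O scheduler queue; reflects scheduler performance
  • Q2C: the total time of the whole I/O request; approximately the await reported by iostat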
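The sixth field is especially useful: each letter stands for one stage the I/O request has passed through.

Q - an I/O request is about to be created
|
G - the I/O request has been created
|
I - the request enters the I/O scheduler queue
|
D - the request enters the driver
|
C - the request has completed

The I/O path is divided into segments, and each segment starts with a timestamp. The difference between the start of one segment and the start of the next is the time spent in that segment.

The service time we care about so much, the metric that reflects how fast the block device handles requests, is the time from D to C, abbreviated D2C.

The await reported by iostat covers the whole I/O, from the moment the request is created until it completes, that is, from Q to C, abbreviated Q2C.

Linux has I/O schedulers, and I2D is the key metric for how efficiently the scheduler works.

This is only part of what blktrace outputs. We also get the offset and size. From the offsets we can tell which blocks of the device the application touched during a given period, and so draw an access-pattern plot of the block device.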
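
A minimal sketch of pulling the offset and size out of blkparse lines, assuming the field layout described above. The four lines are "I" events in the format of the sample output in this post; the function name `offsets_and_sizes` is hypothetical.

```python
from collections import Counter

# Four "I" (insert) write events in blkparse text format.
LINES = [
    "8,16 7 2147 0.999400390 630169 I W 447379872 + 8 [kworker/u482:0]",
    "8,16 7 2148 0.999400653 630169 I W 447380040 + 8 [kworker/u482:0]",
    "8,16 7 2149 0.999401057 630169 I W 447380088 + 16 [kworker/u482:0]",
    "8,16 7 2150 0.999401364 630169 I W 447380176 + 8 [kworker/u482:0]",
]


def offsets_and_sizes(lines, action="I"):
    """Return (timestamp, start_sector, sector_count) for each matching event."""
    records = []
    for line in lines:
        f = line.split()
        # Skip lines without the "sector + count" part (for example plug/unplug events).
        if len(f) >= 10 and f[5] == action and f[8] == "+":
            records.append((float(f[3]), int(f[7]), int(f[9])))
    return records


records = offsets_and_sizes(LINES)
size_histogram = Counter(size for _, _, size in records)
print(records[0])       # first (time, offset, size) point of the access trace
print(size_histogram)   # how many requests of each size
```

The (timestamp, start_sector) pairs are what an access-pattern plot would draw, and the counter is the raw material for an I/O size histogram.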

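The size, together with the seventh field (read or write), also gives us a histogram of I/O sizes. The goal of this article is to obtain exactly this information from blktrace.

Using the tools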

Next, a brief introduction to the tools. All three commands (blktrace, blkparse and btt) ship in the blktrace package.

To watch a disk's activity in real time, run:

blktrace -d /dev/sdb -o - | blkparse -i -

This prints a continuous stream of output until you press Ctrl+C.

Alternatively, collect the trace first and analyze all the collected data afterwards. The collection command is:

blktrace -d /dev/sdb

Note that this does not write a single file. It writes one file per CPU, as shown below:

-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.0
-rw-r--r-- 1 manu manu 823K Jul 6 19:58 sdb.blktrace.1
-rw-r--r-- 1 manu manu 2.8M Jul 6 19:58 sdb.blktrace.10
-rw-r--r-- 1 manu manu 1.9M Jul 6 19:58 sdb.blktrace.11
-rw-r--r-- 1 manu manu 474K Jul 6 19:58 sdb.blktrace.12
-rw-r--r-- 1 manu manu 271K Jul 6 19:58 sdb.blktrace.13
-rw-r--r-- 1 manu manu 578K Jul 6 19:58 sdb.blktrace.14
-rw-r--r-- 1 manu manu 375K Jul 6 19:58 sdb.blktrace.15
-rw-r--r-- 1 manu manu 382K Jul 6 19:58 sdb.blktrace.16
-rw-r--r-- 1 manu manu 478K Jul 6 19:58 sdb.blktrace.17
-rw-r--r-- 1 manu manu 839K Jul 6 19:58 sdb.blktrace.18
-rw-r--r-- 1 manu manu 848K Jul 6 19:58 sdb.blktrace.19
-rw-r--r-- 1 manu manu 1.6M Jul 6 19:58 sdb.blktrace.2
-rw-r--r-- 1 manu manu 652K Jul 6 19:58 sdb.blktrace.20
-rw-r--r-- 1 manu manu 738K Jul 6 19:58 sdb.blktrace.21
-rw-r--r-- 1 manu manu 594K Jul 6 19:58 sdb.blktrace.22
-rw-r--r-- 1 manu manu 527K Jul 6 19:58 sdb.blktrace.23
-rw-r--r-- 1 manu manu 1005K Jul 6 19:58 sdb.blktrace.3
-rw-r--r-- 1 manu manu 1.2M Jul 6 19:58 sdb.blktrace.4
-rw-r--r-- 1 manu manu 511K Jul 6 19:58 sdb.blktrace.5
-rw-r--r-- 1 manu manu 2.3M Jul 6 19:58 sdb.blktrace.6
-rw-r--r-- 1 manu manu 1.3M Jul 6 19:58 sdb.blktrace.7
-rw-r--r-- 1 manu manu 2.1M Jul 6 19:58 sdb.blktrace.8
-rw-r--r-- 1 manu manu 1.1M Jul 6 19:58 sdb.blktrace.9

With these files in place, blkparse -i sdb parses the collected data:

8,16   7     2147     0.999400390 630169  I   W 447379872 + 8 [kworker/u482:0]
8,16 7 2148 0.999400653 630169 I W 447380040 + 8 [kworker/u482:0]
8,16 7 2149 0.999401057 630169 I W 447380088 + 16 [kworker/u482:0]
8,16 7 2150 0.999401364 630169 I W 447380176 + 8 [kworker/u482:0]
8,16 7 2151 0.999401521 630169 I W 453543312 + 8 [kworker/u482:0]
8,16 7 2152 0.999401843 630169 I W 453543328 + 8 [kworker/u482:0]
8,16 7 2153 0.999402195 630169 U N [kworker/u482:0] 14
8,16 6 5648 0.999403047 16921 C W 347875880 + 8 [0]
8,16 6 5649 0.999406293 16921 D W 301856632 + 8 [ceph-osd]
8,16 6 5650 0.999421040 16921 C W 354834456 + 8 [0]
8,16 6 5651 0.999423900 16921 D W 301857280 + 8 [ceph-osd]
8,16 7 2154 0.999442195 630169 A W 425409840 + 8 <- (8,22) 131806512
8,16 7 2155 0.999442601 630169 Q W 425409840 + 8 [kworker/u482:0]
8,16 7 2156 0.999444277 630169 G W 425409840 + 8 [kworker/u482:0]
8,16 7 2157 0.999445177 630169 P N [kworker/u482:0]
8,16 7 2158 0.999446341 630169 I W 425409840 + 8 [kworker/u482:0]
8,16 7 2159 0.999446773 630169 UT N [kworker/u482:0] 1
8,16 6 5652 0.999452685 16921 C W 354834520 + 8 [0]
8,16 6 5653 0.999455613 16921 D W 301857336 + 8 [ceph-osd]
8,16 6 5654 0.999470425 16921 C W 393228176 + 8 [0]
8,16 6 5655 0.999474127 16921 D W 411554968 + 8 [ceph-osd]
8,16 6 5656 0.999488551 16921 C W 393228560 + 8 [0]
8,16 6 5657 0.999491549 16921 D W 411556112 + 8 [ceph-osd]
8,16 6 5658 0.999594849 16923 C W 393230152 + 16 [0]
8,16 6 5659 0.999604038 16923 D W 432877368 + 8 [ceph-osd]
8,16 6 5660 0.999610322 16923 C W 487390128 + 8 [0]
8,16 6 5661 0.999614654 16923 D W 432879632 + 8 [ceph-osd]
8,16 6 5662 0.999628284 16923 C W 487391344 + 8 [0]
8,16 6 5663 0.999632014 16923 D W 432879680 + 8 [ceph-osd]
8,16 6 5664 0.999646122 16923 C W 293759504 + 8 [0]

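blkparse only converts what blktrace recorded into a human-readable form. There is so much of it, and it is so noisy, that the key information is impossible to pick out by eye. This is where btt comes in: it analyzes the data collected by blktrace and produces information that is far more useful to people. btt is, in fact, where we are headed.

Getting the latency of each stage

btt generates these statistics readily, so a table like the following is easy to obtain: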

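| Stage | Full name and meaning | How to read it |
| --- | --- | --- |
| Q2Q | Queue to Queue. The interval between two consecutive I/O requests arriving at the block layer; it reflects the request arrival rate. | A short interval means heavy, dense I/O load; a long interval means light load. |
| Q2G | Queue to Get. The time a request waits, after entering the block layer, for a request structure to be allocated. | Usually extremely short. A long value may indicate memory pressure or a bottleneck in the kernel's allocation of data structures. |
| G2I | Get to Insert. The time to prepare the request and insert it into the I/O scheduler queue after the request structure is obtained. | Also very short. It reflects the scheduler's initial per-request overhead. |
| I2D | Insert to Issue. The time the request waits in the I/O scheduler queue and is scheduled (merged, sorted) before being issued to the device driver. | The key metric for the I/O scheduler and OS-level bottlenecks. A long I2D means: 1. the storage device cannot keep up with the request rate and many requests are queued; 2. the scheduling policy may not suit the workload. |
| D2C | Issue to Complete. The time the request actually spends on the physical hardware after being handed to the driver (including time in the device's own cache and on the media). | The core metric for hardware bottlenecks. A long D2C usually means: 1. the device itself is too slow (for example random I/O on a spinning disk); 2. the device is heavily loaded or faulty. |
| Q2C | Queue to Complete. The total time an I/O request spends in the block layer. It is approximately Q2I + I2D + D2C (Q2I can be split further into Q2G + G2I). | Roughly the await value printed by iostat. It reflects the total latency of a request from entering the system to completion. |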

The procedure is as follows.

First, blkparse merges the per-CPU files into a single file:

blkparse -i sdb -d sdb.blktrace.bin

btt can then analyze sdb.blktrace.bin (btt -i sdb.blktrace.bin):

==================== All Devices ====================

ALL MIN AVG MAX N
--------------- ------------- ------------- ------------- -----------

Q2Q 0.000000001 0.000159747 0.025292639 62150
Q2G 0.000000233 0.000001380 0.000056343 52423
G2I 0.000000146 0.000027084 0.005031317 48516
Q2M 0.000000142 0.000000751 0.000021613 9728
I2D 0.000000096 0.001534463 0.022469688 52423
M2D 0.000000647 0.002617691 0.022445412 5821
D2C 0.000046189 0.000779355 0.007860766 62151
Q2C 0.000051089 0.002522832 0.026096657 62151

==================== Device Overhead ====================

DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
( 8, 16) | 0.0461% 0.8380% 0.0047% 51.3029% 30.8921%
---------- | --------- --------- --------- --------- ---------
Overall | 0.0461% 0.8380% 0.0047% 51.3029% 30.8921%

==================== Device Merge Information ====================

DEV | #Q #D Ratio | BLKmin BLKavg BLKmax Total
---------- | -------- -------- ------- | -------- -------- -------- --------
( 8, 16) | 62151 52246 1.2 | 1 20 664 1051700

==================== Device Q2Q Seek Information ====================

DEV | NSEEKS MEAN MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
( 8, 16) | 62151 42079658.0 0 | 0(17159)
---------- | --------------- --------------- --------------- | ---------------
Overall | NSEEKS MEAN MEDIAN | MODE
Average | 62151 42079658.0 0 | 0(17159)

==================== Device D2D Seek Information ====================

DEV | NSEEKS MEAN MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
( 8, 16) | 52246 39892356.2 0 | 0(9249)
---------- | --------------- --------------- --------------- | ---------------
Overall | NSEEKS MEAN MEDIAN | MODE
Average | 52246 39892356.2 0 | 0(9249)

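Note D2C and Q2C: the first is the key metric for block device performance, the second is the time from the client issuing a request to receiving the response. From the output above:

D2C averages 0.000779355 s, about 0.78 ms, and Q2C averages 0.002522832 s, about 2.5 ms.

Both the service time and the await time seen by the client are very short, which is an excellent result. However, D2C accounts for only about 30% of Q2C; more than 51% of the time is spent in I2D.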
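
The percentages in the "Device Overhead" section can be reproduced from the "ALL" table. The sketch below assumes each stage's share is its total time (average × count) divided by the total Q2C time; that weighting is inferred from the numbers in this trace, not taken from the btt documentation.

```python
# (average seconds, sample count) copied from the "ALL" table above.
ALL = {
    "Q2G": (0.000001380, 52423),
    "G2I": (0.000027084, 48516),
    "Q2M": (0.000000751, 9728),
    "I2D": (0.001534463, 52423),
    "D2C": (0.000779355, 62151),
    "Q2C": (0.002522832, 62151),
}


def overhead_percent(stage):
    """Share of total Q2C time spent in `stage`, in percent."""
    avg, n = ALL[stage]
    q2c_avg, q2c_n = ALL["Q2C"]
    return 100.0 * (avg * n) / (q2c_avg * q2c_n)


for stage in ("Q2G", "G2I", "Q2M", "I2D", "D2C"):
    print(f"{stage}: {overhead_percent(stage):.4f}%")

# Average D2C and Q2C in milliseconds, for comparison with the text.
print(f"D2C avg: {ALL['D2C'][0] * 1000:.2f} ms")
print(f"Q2C avg: {ALL['Q2C'][0] * 1000:.2f} ms")
```

With the figures above this lands on roughly 51.3% for I2D and 30.9% for D2C, matching the btt output.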

References / sources

https://bean-li.github.io/blktrace-to-report/

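Core concepts and uses of maps

In bpftrace, the @ symbol defines and operates on map variables. Maps are the core mechanism behind bpftrace's efficient data aggregation; you can think of a map as a small but powerful database that lives in the kernel.

The main job of a map is to store, share and aggregate data across probes. When one event fires you record data in the map; when a related event fires you read or update it. This is what makes more complex tracing logic possible.

| Property | Description |
| --- | --- |
| Persistence | Data in a map survives across probe events, unlike an ordinary variable that is reset after each event (maps behave like global variables). |
| Key/value structure | A map indexes values by key, written @map_name[key] = value. The key can be a single value or a combination of values, such as @a[pid, comm]. |
| Automatic output | By default, when the bpftrace program exits (for example when you press Ctrl-C), every non-empty map is printed to the screen. |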

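Common map operations and examples

| Operation / function | Description | Example |
| --- | --- | --- |
| Assignment | Assign a value to a map directly. | @start_time[tid] = nsecs (record when a thread started) |
| Counting, count() | Count how many times an event occurs. | @syscall_count[comm] = count() (count system calls per process name) |
| Summing, sum() | Accumulate a numeric value. | @total_bytes[pid] = sum(args->ret) (total bytes read per process) |
| Statistics, avg(), min(), max() | Compute the average, minimum or maximum. | @response_time = avg($latency) |
| Histogram, hist() | Very handy: builds a power-of-two histogram that shows the distribution at a glance. | @latency_ns = hist(nsecs - @start[tid]) (visualize the latency distribution of reads) |
| Linear histogram, lhist() | Builds a linear histogram with custom ranges. | @read_sizes = lhist(args->ret, 0, 10000, 1000) (distribution of read sizes) |
| Cleanup, delete() | Remove a specific key/value pair from a map so that memory does not grow without bound. | delete(@start_time[tid]) (remove the start time once the event has been handled) |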

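Differences from other variables

| Variable type | Prefix | Scope and use |
| --- | --- | --- |
| Map variable | @ | Global. Persists and aggregates data across probes. |
| Built-in variable | (none) | Read-only. Provides event context such as pid (process ID), comm (command name) and retval (function return value). |
| Scratch variable | $ | Local and temporary. Used for intermediate calculations within a single probe firing, for example $duration = nsecs - @start[tid] |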

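Map function reference

| Prototype | Purpose and parameters | Typical use |
| --- | --- | --- |
| count() | Count. Counts how many times the event fired. No parameters. | Counting system calls, function calls and so on. |
| sum(int n) | Sum. Accumulates the value of n. | Total bytes read or written, total elapsed time. |
| avg(int n) | Average. Computes the mean of n. | Average latency, average packet size. |
| min(int n) | Minimum. Records the smallest n. | Minimum latency, smallest block size. |
| max(int n) | Maximum. Records the largest n. | Maximum latency, largest block size. |
| stats(int n) | Summary statistics. Returns the count, average and total of n. | A combined summary of one metric. |
| hist(int n) | Logarithmic histogram. Shows the distribution of n in power-of-two buckets such as [4, 8), [8, 16). | Visualizing the distribution of latency or sizes and spotting patterns. |
| lhist(int n, int min, int max, int step) | Linear histogram. Shows the distribution of n in linear buckets from min to max with width step. | When fixed, custom ranges are needed. |
| delete(@m[key]) | Delete a key/value pair. Removes key and its value from map @m. | Cleaning up temporary data so a map does not grow without bound; common with paired probes (such as kprobe/kretprobe). |
| clear(@m) | Clear a map. Removes every key/value pair from map @m. | Resetting statistics periodically from a timer (such as interval). |
| zero(@m) | Zero a map. Resets the value of every key in map @m to 0. | Resetting counts or sums while keeping the keys. |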

Examples

count(): count system calls

Count the number of system calls made under each process name

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Output

@[bash]: 15
@[sshd]: 28
@[snmpd]: 102

  • @[comm]: uses the process name comm as the key.
  • count(): adds 1 to the value of that key every time the event fires

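sum(int n): total bytes read

Accumulate the bytes successfully read by all processes through the read system call

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = sum(args->ret); }'

Output

@bytes: 1048576

  • /args->ret > 0/: a filter that only handles successful reads (return value greater than 0).
  • sum(args->ret): accumulates the return value (the number of bytes read)

avg(int n): average read size

Compute the average number of bytes returned by each successful read system call.

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @avg_size = avg(args->ret); }'

Output

@avg_size: 512

  • /args->ret > 0/: a filter that only handles successful reads (return value greater than 0).
  • avg(args->ret): averages the return value (the number of bytes read)

stats(int n): a full statistical summary

Produce summary statistics for the return value of the read system call.

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @s = stats(args->ret); }'

Output

@s: count 100, average 4096, total 409600, min 1, max 8192

  • /args->ret > 0/: a filter that only handles successful reads (return value greater than 0).
  • stats(args->ret): reports the number of calls together with the average and total of the return value (the number of bytes read)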

hist(int n): logarithmic distribution of bytes read

Show the distribution of the read system call's return value, with buckets split at powers of two.

bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

Output

@bytes:
[0, 1] 12 |@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 0 | |
...
[128, 256) 1 |@

  • hist(args->ret): splits values into buckets of the form [2^x, 2^(x+1)) and counts how many values fall into each bucket
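
To make the power-of-two bucketing concrete, here is a small Python sketch that maps a value to the bucket label it would fall into. It is an illustration of the bucketing idea only, not bpftrace's implementation; the function name `log2_bucket` is hypothetical, and labels are printed as plain numbers (no K/M suffixes).

```python
def log2_bucket(n):
    """Return the label of the power-of-two bucket that holds n (n >= 0)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative values")
    if n < 2:
        return f"[{n}]"  # 0 and 1 each get their own bucket
    low = 1 << (n.bit_length() - 1)  # largest power of two <= n
    return f"[{low}, {low * 2})"


for value in (0, 1, 3, 4, 200, 4096):
    print(value, "->", log2_bucket(value))
```

A 200-byte read, for example, lands in [128, 256), and a 4096-byte read starts the [4096, 8192) bucket.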

lhist(int n, int min, int max, int step): linear distribution of bytes read

Use a linear histogram for the return value of read, ranging from 0 to 2000 with a step of 200.

bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = lhist(args->ret, 0, 2000, 200); }'

Output

@bytes:
(..., 0) 0 | |
[0, 200) 66 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400) 2 |@ |
...
[2000, ...) 39 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |

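  • lhist(args->ret, 0, 2000, 200): splits the range 0 to 2000 into buckets 200 wide and counts how many values fall into each; values below 0 or at or above 2000 go into the two open-ended buckets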
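
The linear bucketing can be sketched the same way. This is an illustrative Python model of the bucket labels shown in the output above, not bpftrace code; `linear_bucket` is a hypothetical name.

```python
def linear_bucket(n, lo, hi, step):
    """Return the label of the linear bucket that holds n."""
    if n < lo:
        return f"(..., {lo})"   # underflow bucket
    if n >= hi:
        return f"[{hi}, ...)"   # overflow bucket
    start = lo + ((n - lo) // step) * step
    end = min(start + step, hi)  # last bucket is clipped at hi
    return f"[{start}, {end})"


for value in (-1, 150, 250, 1999, 4096):
    print(value, "->", linear_bucket(value, 0, 2000, 200))
```

With the parameters from the example, a failed read (return value -1) falls into (..., 0) and any read of 2000 bytes or more falls into [2000, ...).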

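delete(@m[key]): clean up temporary data

After computing a function's duration, delete the temporary key that stored the start time so the map does not grow without bound.

bpftrace -e 'kprobe:vfs_read
{ @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/
{ $duration_ns = nsecs - @start[tid];
@ns = hist($duration_ns);
delete(@start[tid]); }'

  • delete() is typically used with paired probes (such as kprobe/kretprobe) to release the entry as soon as the operation finishes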

clear(@m) and zero(@m): reset a map

Print and clear the system call counts every 5 seconds

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }'
Output
@[Relay(89295)]: 3
@[mini_init]: 4
@[gmain]: 5
@[Relay(909)]: 6
@[Relay(893)]: 6
@[systemd-udevd]: 37
@[chronyd]: 40
@[bpftrace]: 118

  • clear(@m) deletes every key/value pair in the map, whereas zero(@m) resets every value to 0 but keeps the keys
    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); zero(@); }'
    Output
    @[gmain]: 5
    @[Relay(893)]: 6
    @[Relay(909)]: 6
    @[Relay(89295)]: 9
    @[mini_init]: 12
    @[chronyd]: 44
    @[init]: 103
    @[bpftrace]: 114
    @[GnsPortTracker]: 242
    @[top]: 350
    @[grep]: 664
    @[sh]: 923
    @[Xwayland]: 1978
    @[ps]: 2778
    @[libuv-worker]: 4028
    @[ls]: 4256
    @[node]: 4722
    @[mini_init]: 0 <--------- the line to look at: nothing was called in this interval, yet the key was kept
    @[Relay(109)]: 3
    @[Relay(909)]: 6
    @[Relay(89295)]: 6
    @[Relay(893)]: 6
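
The difference between the two resets can be modelled with an ordinary Python dict. This is only an analogy for the semantics described above (bpftrace maps live in the kernel); the helper names are hypothetical.

```python
def clear_map(m):
    """Model of clear(@m): drop every key/value pair."""
    m.clear()


def zero_map(m):
    """Model of zero(@m): keep the keys, reset every value to 0."""
    for key in m:
        m[key] = 0


cleared = {"mini_init": 4, "chronyd": 40}
zeroed = {"mini_init": 4, "chronyd": 40}

clear_map(cleared)
zero_map(zeroed)

print(cleared)  # no keys left, so nothing would be printed for idle processes
print(zeroed)   # keys survive with value 0, like the mini_init line above
```

After zero(), a process that makes no calls in the next interval still shows up with a count of 0, which is what the second output demonstrates.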

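Map characteristics

  • Automatic printing: by default, when the bpftrace program terminates (for example when the user presses Ctrl-C), every non-empty map variable is printed.
  • Filters: /<filter>/ sets a condition, and the action that follows runs only when the condition holds. This makes scripts more efficient and their output more focused
  • Combining variables: map functions are usually combined with built-in variables (such as comm, pid, nsecs) or scratch variables (prefixed with $) to build more complex tracing logic

What is bpftrace

bpftrace is a high-level tracing tool, built on eBPF, for tracing both kernel-space and user-space code

Listing the kernel functions that can be traced

bpftrace -l | grep kprobe

Kernel function tracing examples

Was the function called

Trace two functions in net/ipv4/netfilter/ip_tables.c: compat_do_ipt_get_ctl and do_ipt_get_ctl
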
~# bpftrace -e 'kprobe:do_ipt_get_ctl { printf("function was called!\n"); }'
Attaching 1 probe...
function was called!
function was called!

The signature of compat_do_ipt_get_ctl is:

static int compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)

Calling command, PID and arguments

Create a file named test.bpf

#include <net/sock.h>

kprobe:do_ipt_get_ctl
{
  // arg0 is the first argument (struct sock *sk); print the caller and the socket's address family
  printf("called by %s (pid: %d). and: %d\n", comm, pid, ((struct sock *)arg0)->__sk_common.skc_family);
}

Running it gives:

~# bpftrace test.bpf
/bpftrace/include/stdarg.h:52:1: warning: null character ignored [-Wnull-character]
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:209:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:256:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:310:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/jump_label.h:23:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
Attaching 1 probe...
called by iptables-legacy (pid: 2981). and: 2
called by iptables-legacy (pid: 2981). and: 2

The value 2 is explained by the address family definitions:

/usr/src/linux-headers-4.19.0-8-common/include/linux/socket.h
160 /* Supported address families. */
161 #define AF_UNSPEC 0
162 #define AF_UNIX 1 /* Unix domain sockets */
163 #define AF_LOCAL 1 /* POSIX name for AF_UNIX */
164 #define AF_INET 2 /* Internet IP Protocol */
165 #define AF_AX25 3 /* Amateur Radio AX.25 */
166 #define AF_IPX 4 /* Novell IPX */

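Printing arguments

bpftrace -e 'kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); }'

Printing return values

Use the built-in variable retval

Kernel function return values

Use kretprobe to trace the return value of a kernel function. For example, trace the return value of vfs_read (the number of bytes read, or an error code):

bpftrace -e 'kretprobe:vfs_read { printf("vfs_read returned: %d\n", retval); }'

User-space function return values

Use uretprobe to trace the return value of a user-space function. For example, trace the return value of a function named myfunc:

bpftrace -e 'uretprobe:/path/to/binary:myfunc { printf("myfunc returned: %d\n", retval); }'

Combining entry and return probes

To measure how long a function takes, or to relate its arguments to its return value, record a timestamp in the entry probe (kprobe/uprobe), keyed by tid (the built-in variable holding the thread ID), and compute the difference in the return probe. For example:
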
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } 
kretprobe:vfs_read /@start[tid]/ {
$duration = nsecs - @start[tid];
printf("vfs_read took %d ns, returned %d\n", $duration, retval);
delete(@start[tid]);
}'
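
The entry/return pairing above can be modelled in plain Python to show why the map is keyed by thread ID and why the entry is deleted afterwards. The event tuples and the function name `pair_latencies` are hypothetical; this is an analogy for the probe logic, not bpftrace code.

```python
def pair_latencies(events):
    """Pair entry/return events per thread.

    events: iterable of (kind, tid, nsecs), kind is "entry" or "return".
    Returns (durations, leftover_start) where durations is a list of
    (tid, elapsed_ns) and leftover_start holds entries never matched.
    """
    start = {}       # plays the role of @start[tid]
    durations = []
    for kind, tid, ts in events:
        if kind == "entry":
            start[tid] = ts
        elif kind == "return" and tid in start:   # the /@start[tid]/ filter
            durations.append((tid, ts - start.pop(tid)))  # pop = delete()
    return durations, start


events = [
    ("entry", 1, 100),
    ("entry", 2, 150),    # thread 2 never returns during the trace
    ("return", 1, 400),
    ("return", 3, 500),   # return without a recorded entry is ignored
]
durations, leftover = pair_latencies(events)
print(durations)
print(leftover)
```

Calls still in flight when tracing stops stay behind in the start map, which is why leftover @start entries can appear in bpftrace's final output.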

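Filters

A filter can be added to a return probe, for example to handle only a specific process or only a range of return values, using the /<filter>/ syntax. For example, print only the vfs_read calls whose return value is greater than 0:
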
bpftrace -e 'kretprobe:vfs_read /retval > 0/ { printf("vfs_read returned: %d\n", retval); }'

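wireshark

Remote capture

Wireshark itself can capture remotely from two places in its UI. This is not recommended during performance analysis (packets may be lost while being relayed over ssh, for example), but it is fine for connectivity checks.

ssh + tcpdump + wireshark

Capture and decode live

ssh <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' | wireshark -k -i -
Through a jump host
ssh -J <jumpserver ip> <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' | wireshark -k -i -
Capture and save to a file
ssh <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' > /tmp/packets.pcap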

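tcpdump

Common capture commands

Filter on TCP flags, for example capture SYN packets to see whether the handshake succeeds

tcpdump -ni any 'tcp[tcpflags] == tcp-syn'
Check the TCP state. The command below lists sockets that have sent a SYN; if no reply ever arrives, retrans/s in sar -n ETCP 1 keeps rising
watch -n 0.1 'ss -tpn state syn-sent'
Capture SYN and SYN+ACK packets
tcpdump -ni any 'tcp[tcpflags] == tcp-syn or tcp[13] = 18'
Capture SYN packets and packets with RST set
tcpdump -ni any 'tcp[tcpflags] == tcp-syn or tcp[13] & 4 != 0'
Capture SYN packets and ICMP destination unreachable messages
tcpdump -ni any 'tcp[tcpflags] == tcp-syn or icmp[0] = 3'
Capture SYN, SYN+ACK, RST and ICMP destination unreachable packets (tcp[13] is the byte at offset 13 of the TCP header, which holds the flag bits; icmp[0] = 3 matches ICMP type 3, which includes port unreachable)
tcpdump -ni any '(tcp[tcpflags] == tcp-syn or tcp[13] = 18) or tcp[13] & 4 != 0 or icmp[0] = 3'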
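
The magic numbers 18 and 4 in these filters are just bit combinations of the TCP flags byte. A small Python sketch of that arithmetic, using the standard flag bit values from the TCP header; the helper names are hypothetical.

```python
# Bit values of the flags in byte 13 of the TCP header.
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def flags_byte(*names):
    """Combine flag names into the value of tcp[13]."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return value


def decode(value):
    """List the flag names set in a tcp[13] value."""
    return [name for name, bit in FLAGS.items() if value & bit]


print(flags_byte("SYN", "ACK"))    # the 18 used in tcp[13] = 18
print(decode(18))
print(bool(flags_byte("RST", "ACK") & 4))  # tcp[13] & 4 != 0 matches any packet with RST set
```

An equality test such as tcp[13] = 18 matches only packets with exactly SYN and ACK set, while a mask test such as tcp[13] & 4 != 0 matches RST regardless of the other flags.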
TCP flag capture examples
tcpdump 'tcp[tcpflags] == tcp-syn'
tcpdump 'tcp[tcpflags] == tcp-rst'
tcpdump 'tcp[tcpflags] == tcp-fin'
Capture packets by size
tcpdump less 32
tcpdump greater 32
tcpdump 'len <= 102'
Capture traffic on port 8080 continuously, with each file at most 100 MB, keeping at most 10 files and overwriting the oldest once they are all full
tcpdump -i any -C 100 -W 10 -w my_capture.pcap port 8080
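
A quick way to size the disk space such a ring of capture files can occupy, as a sketch. It assumes the default -C unit of 1,000,000 bytes per unit; the function name is hypothetical.

```python
def ring_capacity_bytes(file_size_millions, file_count):
    """Upper bound on disk usage for tcpdump -C <size> -W <count>.

    Assumes -C is given in units of 1,000,000 bytes (the default unit).
    """
    return file_size_millions * 1_000_000 * file_count


capacity = ring_capacity_bytes(100, 10)
print(f"{capacity} bytes, about {capacity / 1e9:.1f} GB")
```

For -C 100 -W 10 the capture never holds more than about 1 GB, so the retained history depends on how fast port 8080 traffic fills that space.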
Capture 10000 packets destined for port 80
tcpdump -i any -c 10000 -w http_requests.pcap dst port 80

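Common options

| Option | Description | Example |
| --- | --- | --- |
| -i <interface> | Interface to capture on. any means all interfaces. | -i any |
| port <port> | Filter traffic on a specific port (TCP/UDP). | port 8080 |
| -C <size> | Rotate the output file by size, in units of 1,000,000 bytes. | -C 100 |
| -W <count> | Limit the number of files; combined with -C it gives a ring of files that overwrites the oldest. | -W 10 |
| -w <file> | Write the raw captured packets to the given file. | -w my_capture.pcap |
| -c <count> | Exit automatically after capturing the given number of packets. | -c 10000 |
| -s <length> | Set the snapshot length in bytes captured from each packet; -s 0 captures whole packets. A small value is handy for long-running captures where only the headers are analyzed. | -s 0 |

Other options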

Capture Commands

Command Example usage Explanation
-i any tcpdump -i any Capture from all interfaces; may require superuser (sudo/su)
-i eth0 tcpdump -i eth0 Capture from the interface eth0
-c count tcpdump -i eth0 -c 5 Exit after receiving count (5) packets
-r captures.pcap tcpdump -i eth0 -r captures.pcap Read and analyze saved capture file captures.pcap
tcp tcpdump -i eth0 tcp Show TCP packets only
udp tcpdump -i eth0 udp Show UDP packets only
icmp tcpdump -i eth0 icmp Show ICMP packets only
ip tcpdump -i eth0 ip Show IPv4 packets only
ip6 tcpdump -i eth0 ip6 Show IPv6 packets only
arp tcpdump -i eth0 arp Show ARP packets only
rarp tcpdump -i eth0 rarp Show RARP packets only
slip tcpdump -i eth0 slip Show SLIP packets only
-I tcpdump -i eth0 -I Set interface as monitor mode
-K tcpdump -i eth0 -K Don’t verify checksum
-p tcpdump -i eth0 -p Don’t capture in promiscuous mode

Filter Commands

Filter expression Explanation
src host 127.0.0.1 Filter by source IP/hostname 127.0.0.1
dst host 127.0.0.1 Filter by destination IP/hostname 127.0.0.1
host 127.0.0.1 Filter by source or destination = 127.0.0.1
ether src 01:23:45:AB:CD:EF Filter by source MAC 01:23:45:AB:CD:EF
ether dst 01:23:45:AB:CD:EF Filter by destination MAC 01:23:45:AB:CD:EF
ether host 01:23:45:AB:CD:EF Filter by source or destination MAC 01:23:45:AB:CD:EF
src net 127.0.0.1 Filter by source network location 127.0.0.1
dst net 127.0.0.1 Filter by destination network location 127.0.0.1
net 127.0.0.1 Filter by source or destination network location 127.0.0.1
net 127.0.0.1/24 Filter by source or destination network location 127.0.0.1 with the tcpdump subnet mask of length 24
src port 80 Filter by source port = 80
dst port 80 Filter by destination port = 80
port 80 Filter by source or destination port = 80
src portrange 80-400 Filter by source port value between 80 and 400
dst portrange 80-400 Filter by destination port value between 80 and 400
portrange 80-400 Filter by source or destination port value between 80 and 400
ether broadcast Filter for Ethernet broadcasts
ip broadcast Filter for IPv4 broadcasts
ether multicast Filter for Ethernet multicasts
ip multicast Filter for IPv4 multicasts
ip6 multicast Filter for IPv6 multicasts
ip src host mydevice Filter by IPv4 source hostname mydevice
arp dst host mycar Filter by ARP destination hostname mycar
rarp src host 127.0.0.1 Filter by RARP source 127.0.0.1
ip6 dst host mywatch Filter by IPv6 destination hostname mywatch
tcp dst port 8000 Filter by destination TCP port = 8000
udp src portrange 1000-2000 Filter by source UDP ports in 1000–2000
sctp port 22 Filter by source or destination SCTP port = 22

Display Commands

Example Explanation
tcpdump -i eth0 -A Print each packet (minus its link level header) in ASCII. Handy for capturing web pages.

tcpdump -D Print the list of the network interfaces available on the system and on which tcpdump can capture packets.
tcpdump -i eth0 -e Print the link-level header on each output line, such as MAC layer addresses for protocols such as Ethernet and IEEE 802.11.
tcpdump -i eth0 -F /path/to/params.conf Use the file params.conf as input for the filter expression. (Ignore other expressions on the command line.)
tcpdump -i eth0 -n Don’t convert addresses (i.e., host addresses, port numbers, etc.) to names.
tcpdump -i eth0 -S Print absolute, rather than relative, TCP sequence numbers. (Absolute TCP sequence numbers are longer.)
tcpdump -i eth0 --time-stamp-precision=nano When capturing, set the timestamp precision for the capture to tsp:
• micro for microsecond (default)
• nano for nanosecond.
tcpdump -i eth0 -t Omit the timestamp on each output line.
tcpdump -i eth0 -tt Print the timestamp, as seconds since January 1, 1970, 00:00:00, UTC, and fractions of a second since that time, on each dump line.
tcpdump -i eth0 -ttt Print a delta (microsecond or nanosecond resolution depending on the --time-stamp-precision option) between the current and previous line on each output line. The default is microsecond resolution.
tcpdump -i eth0 -tttt Print a timestamp as hours, minutes, seconds, and fractions of a second since midnight, preceded by the date, on each dump line.
tcpdump -i eth0 -ttttt Print a delta (microsecond or nanosecond resolution depending on the --time-stamp-precision option) between the current and first line on each dump line. The default is microsecond resolution.
tcpdump -i eth0 -u Print undecoded network file system (NFS) handles.
tcpdump -i eth0 -v Produce verbose output.
When writing to a file (-w option) and at the same time not reading from a file (-r option), report to standard error, once per second, the number of packets captured.
tcpdump -i eth0 -vv Additional verbose output than -v
tcpdump -i eth0 -vvv Additional verbose output than -vv
tcpdump -i eth0 -x Print the headers and data of each packet (minus its link level header) in hex.
tcpdump -i eth0 -xx Print the headers and data of each packet, including its link level header, in hex.
tcpdump -i eth0 -X Print the headers and data of each packet (minus its link level header) in hex and ASCII.
tcpdump -i eth0 -XX Print the headers and data of each packet, including its link level header, in hex and ASCII.

Output Commands

Command Example Explanation
-w captures.pcap tcpdump -i eth0 -w captures.pcap Output capture to a file captures.pcap
-d tcpdump -i eth0 -d Dump the compiled packet-matching code in human-readable form to standard output and stop
-L tcpdump -i eth0 -L Display data link types for the interface
-q tcpdump -i eth0 -q Quick/quiet output. Print less protocol information, so output lines are shorter.
-U tcpdump -i eth0 -U -w out.pcap Without -w option
Make the printed output packet-buffered, so the description of each packet is written as soon as it is printed.
With -w option
Write each packet to the output file out.pcap as soon as it is saved rather than only when the output buffer fills.

Miscellaneous Commands

Operator Syntax Example Description
AND and, && tcpdump -n src 127.0.0.1 and dst port 21 Combine filtering options joined by “and”
OR or, || tcpdump dst 127.0.0.1 or src port 22 Match any of the conditions joined by “or”
EXCEPT not, ! tcpdump dst 127.0.0.1 and not icmp Negate the condition prefixed by “not”
LESS less, <= tcpdump dst host 127.0.0.1 and less 128 Shows packets whose length is less than or equal to 128 bytes.
less 128 is equivalent to len <= 128.
GREATER greater, >= tcpdump dst host 127.0.0.1 and greater 64 Shows packets whose length is greater than or equal to 64 bytes.
greater 64 is equivalent to len >= 64.
EQUAL =, == tcpdump 'host 127.0.0.1 and len = 0' Show packets with zero length

Example Usage

Example Explanation
tcpdump -r outfile.pcap src host 10.0.2.15 Print all packets in the file outfile.pcap coming from the host with IP address 10.0.2.15
tcpdump -i any ip and not tcp port 80 Listen on any network interface for IPv4 packets other than HTTP traffic (TCP port 80)
tcpdump -i eth0 -n -c 30 -w pv01.pcap 'len > 32' Save 30 packets of length exceeding 32 bytes to pv01.pcap without DNS resolution on the eth0 network interface
tcpdump -AtuvX icmp Capture ICMP traffic and print ICMP packets in hex and ASCII and the following features:
With:
• headers
• data
• undecoded NFS handles
Without:
• link level headers
• timestamps.
tcpdump 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)' Print all IPv4 HTTP packets to and from port 80, i.e. print only packets that contain data, not, for example, SYN and FIN packets and ACK-only packets.
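
The arithmetic inside that last filter is: IPv4 total length, minus the IP header length, minus the TCP header length. Below is a Python sketch of the same computation on raw bytes, with a hypothetical `build_packet` helper that fabricates a minimal IPv4+TCP packet (no options, all other header fields zero) purely for the demonstration.

```python
import struct


def tcp_payload_length(ip_packet: bytes) -> int:
    """Mirror of: ip[2:2] - ((ip[0]&0xf)<<2) - ((tcp[12]&0xf0)>>2)."""
    ip_header_len = (ip_packet[0] & 0x0F) << 2            # IHL field, in 32-bit words, times 4
    total_length = struct.unpack("!H", ip_packet[2:4])[0]  # ip[2:2]
    tcp = ip_packet[ip_header_len:]
    tcp_header_len = (tcp[12] & 0xF0) >> 2                 # data offset, in words, times 4
    return total_length - ip_header_len - tcp_header_len


def build_packet(payload: bytes) -> bytes:
    """Hypothetical helper: 20-byte IPv4 header + 20-byte TCP header + payload."""
    ip = bytearray(20)
    ip[0] = 0x45                                    # version 4, IHL 5 words
    ip[2:4] = struct.pack("!H", 40 + len(payload))  # total length
    tcp = bytearray(20)
    tcp[12] = 0x50                                  # data offset 5 words
    return bytes(ip) + bytes(tcp) + payload


print(tcp_payload_length(build_packet(b"")))       # a bare SYN/ACK/FIN carries no data
print(tcp_payload_length(build_packet(b"GET /")))  # a segment with data
```

The filter keeps a packet only when this value is non-zero, which is how it drops handshake and ACK-only segments.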

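References / sources

Built-in variables

Built-in variables are the foundation of bpftrace's power: they give a script easy access to the context in which an event fired.

| Category | Variable | Description and typical use |
| --- | --- | --- |
| Process/thread | pid | Current process ID. Used to filter events for a specific process. |
| Process/thread | tid | Current thread ID. Used for finer, thread-level analysis. |
| Process/thread | comm | Current process name (at most 16 bytes). Often used to aggregate and filter by process name. |
| Process/thread | uid | Current user ID. Used to analyze a specific user's activity. |
| Time | nsecs | Nanosecond timestamp since system boot. Often used to compute function durations and intervals between events. |
| CPU/system | cpu | ID of the CPU on which the event fired |
| CPU/system | curtask | Address of the current process's kernel task_struct (as an unsigned 64-bit integer), for advanced kernel debugging. |
| Probe context | arg0, arg1, … argN | Arguments of a kprobe/uprobe. Used to read the arguments of the probed function (not available for tracepoints). |
| Probe context | args | The argument struct of a tracepoint. Access a tracepoint field as args->field_name. |
| Probe context | retval | Return value in a kretprobe/uretprobe. Used to read the return value of the probed function. |
| Probe context | func | Name of the function probed by the kprobe/uprobe that fired |
| Probe context | kstack / ustack | The kernel stack and user stack at the moment of the event. Used to analyze code paths and locate performance bottlenecks |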

Examples

Measuring function execution time

# Measure the execution time of vfs_read in microseconds
~ » bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
$duration_us = (nsecs - @start[tid]) / 1000;
@us = hist($duration_us);
delete(@start[tid]);
}'
Attaching 2 probes...
^C

@start[551]: 84374532046638
@start[323672]: 84374600291711
@us:
[0] 45 |@@@@@ |
[1] 452 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 264 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 338 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 43 |@@@@ |
[16, 32) 25 |@@ |
[32, 64) 4 | |
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 6 | |
[1K, 2K) 0 | |
[2K, 4K) 3 | |
[4K, 8K) 1 | |

Tracing the system calls of a specific process

~ » bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "top"/ {
printf("PID %d is opening: %s\n", pid, str(args->filename));
}'
Attaching 1 probe...
PID 323556 is opening: /proc/uptime
PID 323556 is opening: /proc
PID 323556 is opening: /proc/uptime
PID 323556 is opening: /proc/1/stat
PID 323556 is opening: /proc/1/statm
PID 323556 is opening: /proc/2/stat
PID 323556 is opening: /proc/2/statm
PID 323556 is opening: /proc/7/stat

Analyzing kernel call paths

# Count the call stacks that reach ip_output. Note that @[kstack] is keyed by the whole stack, so the two different call chains below are recorded as separate keys
~ » bpftrace -e 'kprobe:ip_output { @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
ip_output+5
__ip_queue_xmit+400
ip_queue_xmit+25
__tcp_transmit_skb+2666
__tcp_send_ack.part.0+198
tcp_send_ack+32
__tcp_ack_snd_check+66
tcp_rcv_established+660
tcp_v4_do_rcv+362
tcp_v4_rcv+3671
ip_protocol_deliver_rcu+55
ip_local_deliver_finish+138
ip_local_deliver+115
ip_sublist_rcv_finish+137
ip_sublist_rcv+410
ip_list_rcv+314
__netif_receive_skb_list_core+639
netif_receive_skb_list_internal+463
napi_complete_done+126
netvsc_poll+1451
__napi_poll+49
net_rx_action+680
handle_softirqs+244
__irq_exit_rcu+120
irq_exit_rcu+18
sysvec_hyperv_callback+180
asm_sysvec_hyperv_callback+31
pv_native_safe_halt+15
arch_cpu_idle+13
default_idle_call+46
do_idle+517
cpu_startup_entry+49
rest_init+218
arch_call_rest_init+18
start_kernel+1241
x86_64_start_reservations+37
__pfx_reserve_bios_regions+0
secondary_startup_64_no_verify+381
]: 3
@[
ip_output+5
__ip_queue_xmit+400
ip_queue_xmit+25
__tcp_transmit_skb+2666
__tcp_send_ack.part.0+198
tcp_send_ack+32
__tcp_ack_snd_check+66
tcp_rcv_established+660
tcp_v4_do_rcv+362
tcp_v4_rcv+3671
ip_protocol_deliver_rcu+55
ip_local_deliver_finish+138
ip_local_deliver+115
ip_sublist_rcv_finish+137
ip_sublist_rcv+410
ip_list_rcv+314
__netif_receive_skb_list_core+639
netif_receive_skb_list_internal+463
napi_complete_done+126
netvsc_poll+1451
__napi_poll+49
net_rx_action+680
handle_softirqs+244
__irq_exit_rcu+120
irq_exit_rcu+18
sysvec_hyperv_callback+180
asm_sysvec_hyperv_callback+31
pv_native_safe_halt+15
arch_cpu_idle+13
default_idle_call+46
do_idle+517
cpu_startup_entry+49
start_secondary+281
secondary_startup_64_no_verify+381
]: 8

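Other notes

  • argX versus args: arg0, arg1, … are used with kprobe/uprobe and are plain integer values. args is a struct used only with tracepoints, and its members are accessed as args->field_name.
  • Viewing tracepoint arguments: run bpftrace -lv tracepoint:name to see which arguments a tracepoint provides. For example, to see the arguments of the write system call: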
    ~ » bpftrace -lv tracepoint:syscalls:sys_enter_write
    tracepoint:syscalls:sys_enter_write
    int __syscall_nr
    unsigned int fd
    const char * buf
    size_t count

.gdbinit

set $lastcs = -1

define hook-stop
  # There doesn't seem to be a good way to detect if we're in 16- or
  # 32-bit mode, but in 32-bit mode we always run with CS == 8 in the
  # kernel and CS == 35 in user space
  if $cs == 8 || $cs == 35
    if $lastcs != 8 && $lastcs != 35
      set architecture i386
    end
    x/i $pc
  else
    if $lastcs == -1 || $lastcs == 8 || $lastcs == 35
      set architecture i8086
    end
    # Translate the segment:offset into a physical address
    printf "[%4x:%4x] ", $cs, $eip
    x/i $cs*16+$eip
  end
  set $lastcs = $cs
end

# These two lines need to be commented out
# echo + target remote localhost:25000\n
# target remote localhost:25000

echo + symbol-file kernel\n
symbol-file kernel

launch.json

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug xv6 x86",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/kernel", // change to kernel or bootblock
      "miDebuggerServerAddress": "localhost:25000",
      "miDebuggerPath": "/usr/bin/gdb",
      "stopAtEntry": true,
      "cwd": "${workspaceFolder}",
      "setupCommands": [
        {
          "text": "set architecture i386:x86-64"
        },
        {
          "text": "set disassemble-next-line auto"
        },
        {
          "description": "Enable pretty-print for gdb",
          "text": "-enable-pretty-printing",
          "ignoreFailures": true
        }
      ],
      "preLaunchTask": "xv6build",
      "logging": {
        "trace": true,
        "traceResponse": true,
        "engineLogging": true
      }
    }
  ]
}

tasks.json

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "xv6build",
      "type": "shell",
      "isBackground": true,
      "command": "make qemu-nox-gdb",
      "problemMatcher": {
        "pattern": {
          "regexp": ".",
          "file": 1,
          "location": 2,
          "message": 3
        },
        "background": {
          "beginsPattern": ".* Now run 'gdb'.",
          "endsPattern": "."
        }
      }
    }
  ]
}

Common Linux performance query commands

Purpose: a single-page, production-ready cheatsheet for Linux/SRE triage. Optimized for fast on-call use: concise flags, copy-paste recipes, brief notes, and clear risk callouts.

Tip: Use your editor/browser search to jump to any command by its number, e.g., “41. nslookup”.

Binaries & ELF

Cheat Card
  • Linked libs: ldd /path/to/bin (security caveat: may execute code in rare cases)
  • ELF headers: readelf -h /bin/ls; sections: readelf -S /bin/ls
  • Symbols (prefer readelf): readelf -Ws /bin/ls | grep ' FUNC '; dynamic: readelf -Ws -d /bin/ls
  • Disassemble: objdump -d /bin/ls | less (add -M intel for Intel syntax)
  • Symbols via nm: nm -D /bin/ls | grep symbol_name
  • Requires: binutils (readelf/objdump/nm)

1. ldd

List shared library dependencies of executables and shared objects.
  • Basic: ldd /path/to/bin
  • Security caveat: may execute code in rare cases; avoid on untrusted binaries.
  • Alternative: LD_TRACE_LOADED_OBJECTS=1 /lib64/ld-linux-x86-64.so.2 /path/to/bin (still uses loader)

Text & Data Utilities

Cheat Card
  • Search recursively: grep -RIn --exclude-dir .git 'pattern' .; context: -C2
  • Edit in-place: sed -i.bak -E 's/old/new/g' file (backup)
  • Summarize data: awk -F, '{a[$1]+=$2} END{for(k in a) print k,a[k]}' file.csv
  • JSON parse: jq -r '.items[].metadata.name' file.json
  • Compare dirs: diff -ruN dir_old dir_new | less -R
  • Transform text: tr -s ' ' | cut -d, -f1,3 | xargs -n1 echo
  • Safe temp: mktemp -d for dirs; files: mktemp
  • Reverse lines: rev <file (quick visual check)

2. grep

  • search for one or more expressions: grep -E 'hello|world' temp
  • search for one or more words: grep -Ew 'hello|world' temp
  • search for suffix matches: grep -E 'hello(world|lolo)' temp
  • search for suffixes matching regex: grep -E 'hello[0-9]{3,}' temp
  • recursive search in tree: grep -RIn --exclude-dir .git --exclude='*.log' 'pattern' .
  • fixed strings (fast) and ignore case: grep -Fni 'literal text' file
  • context lines: grep -R --color -n -C2 'pattern' . (or -A after, -B before)
  • binary-skip and file names only: grep -rI -l 'pattern' .

3. sed

What it does: stream editor for non-interactive find/replace, line edits, and range selections.

  • In-place with backup: sed -i.bak -E 's/old/new/g' file
  • Delete matching lines: sed -i '/pattern/d' file
  • Print lines between markers: sed -n '/BEGIN/,/END/p' file
  • Replace with capture groups: sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/' file
  • Insert before/after match:
    • Before: sed '/pattern/i\\inserted before' file
    • After: sed '/pattern/a\\appended after' file
  • Trim trailing spaces: sed -i 's/[ \t]\+$//' file
  • Multiple edits: sed -E -e 's/foo/bar/g' -e '/tmp/d' file
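
To make two of the recipes above concrete, here is a small sketch that runs the marker-range print and the capture-group rewrite on a scratch file. The file contents are invented for the demo; only the sed expressions come from the list above.

```shell
# Scratch input; the contents are made up for this demo.
tmp=$(mktemp)
printf '%s\n' 'header' 'BEGIN' 'date 2024-03-15' 'END' 'footer' > "$tmp"

# Print only the BEGIN..END range, then reorder YYYY-MM-DD to DD-MM-YYYY.
result=$(sed -n '/BEGIN/,/END/p' "$tmp" |
  sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3-\2-\1/')
printf '%s\n' "$result"
rm -f "$tmp"
```

Trying an expression on a temp file like this before adding -i is a cheap way to avoid an unwanted in-place edit.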

4. awk

What it does: text processing and quick data summarization using fields and expressions.

  • Default FS is whitespace; set CSV FS: awk -F, '...' file.csv
  • Select fields: awk '{print $1, $3}' file
  • Filter rows: awk '$5 > 100 {print $1, $5}' file
  • Sum a column: awk '{s+=$3} END{print s}' file
  • Group and sum by key: awk '{a[$1]+=$2} END{for (k in a) print k, a[k]}' file
  • Pretty print: awk '{printf "%-20s %10d\n", $1, $2}' file
  • Count unique values: awk '{c[$1]++} END{for (k in c) print k, c[k]}' file
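
As an illustration of the group-and-sum recipe, the sketch below aggregates a small invented CSV (host,bytes). The data and variable names are assumptions made for the example.

```shell
# Made-up CSV: host,bytes
data='web1,100
web2,250
web1,50
db1,400'

# Sum column 2 per key in column 1; sort because awk's for-in order is unspecified.
totals=$(printf '%s\n' "$data" |
  awk -F, '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' | sort)
printf '%s\n' "$totals"
```

Piping through sort matters whenever the output is compared or diffed, since awk does not guarantee the iteration order of an array.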

Networking

Cheat Card
  • Ports→PIDs: ss -ltnp; established only: ss -tn state established
  • TCP detail: ss -i dst <ip> (rtt, cwnd, retrans)
  • Path/source IP: ip route get <dest>; counters: ip -s link show <iface>
  • Latency/loss: mtr -ezbw <dest>; quick traceroute ICMP: traceroute -I <dest>
  • Targeted capture: tcpdump -ni <iface> tcp port 443 (or port 53, icmp)
  • DNS: resolvectl query <name> or dig <name> A +short

5. ping

  • Compat: Linux; Root: may require CAP_NET_RAW depending on system; Requires: iputils-ping.

  • -4: ping IPv4 only

  • -6: ping IPv6 only

  • -A: adapts to roundtrip time

  • -b: allow pinging broadcast addresses

  • -I: ping through an interface

  • -M: set PMTU strategy

  • -s: set packetsize (default is 56B)

  • -t: set IP time-to-live

  • ping 224.0.0.1: ping multicast address

Notes:
  • Comparing the min/avg/max/mdev rtt values in the summary shows whether there are large variations causing jitter, which matters especially for real-time applications.
  • ping reports duplicate packets. Duplicates should never occur and are usually caused by inappropriate link-level retransmissions.
  • ping reports damaged packets, which suggest broken hardware in the network.

6. ip

  • Compat: Linux; Root: not required for reads; Requires: iproute2.

  • ip addr: Show information for all addresses

  • ip addr show dev wlo1: Display information only for device wlo1

  • ip link: Show information for all interfaces

  • ip link show dev wlo1: Display information only for device wlo1

  • ip -s link: Display interface statistics (packets dropped, received, sent, etc.)

  • Quick recipes:

    • Path and source IP: ip route get <dest>
    • Interface counters: ip -s link show <iface> (rx/tx errors, drops)
    • Neighbors/ARP: ip neigh and ip neigh show dev <iface>
    • Multicast: ip maddr or ip maddr show dev <iface>

Example

# Query path and chosen source IP
ip route get 8.8.8.8
# Expect: 8.8.8.8 via 192.168.1.1 dev wlo1 src 192.168.1.23

  • ip route: List all of the route entries in the kernel

  • ip route add: Add a route entry to the kernel routing table

  • ip route replace: Replace an existing route (add if not present)

  • ip maddr: Display multicast information for all devices

  • ip maddr show dev wlo1: Display multicast information only for device wlo1

  • ip neigh show dev wlo1: Show neighbor (ARP/NDP) entries and their reachability state for device wlo1

7. arp

  • Compat: Legacy; prefer ip neigh; Requires: net-tools.

  • arp: show all ARP table entries

  • arp -d address: delete ARP entry for address

  • arp -s address hw_addr: set up a new table entry

8. arping

  • Compat: Linux; Root/CAP_NET_RAW required; Package: arping (iputils-arping on some distros).

  • arping -I wlo1 192.168.0.1: send ARP requests to host

  • arping -D -I wlo1 192.168.0.15: duplicate address detection (checks whether the IP address is already in use on the link)

9. ethtool

  • Compat: Linux; Root for changing settings, read stats usually ok; Requires: ethtool.

  • ethtool -S wlo1: print NIC and driver statistics

10. ss

  • Compat: Linux; Modern replacement for netstat; Requires: iproute2.

  • ss -a: show all sockets

  • ss -o: show all sockets with timer information

  • ss -p: show process using the socket

  • ss -t|-u|-4|-6: restrict output to TCP, UDP, IPv4, or IPv6 sockets

  • ss -ltnp: list listening TCP sockets with PIDs

  • ss -tn state established: show established TCP only

  • ss -tn sport = :443 or ss -tn dport = :443: filter by port

  • ss -s: summary stats (TCP states, mem)

  • ss -i: show internal TCP information. Fields include:

    • ts: show string “ts” if the timestamp option is set
    • sack: show string “sack” if the sack option is set
    • ecn: show string “ecn” if the explicit congestion notification option is set
    • ecnseen: show string “ecnseen” if the saw ecn flag is found in received packets
    • fastopen: show string “fastopen” if the fastopen option is set
    • cong_alg: the congestion algorithm name, the default congestion algorithm is “cubic”
    • wscale:<snd_wscale>:<rcv_wscale>: if window scale option is used, this field shows the send scale factor and receive scale factor
    • rto:<icsk_rto>: tcp re-transmission timeout value, the unit is millisecond
    • backoff:<icsk_backoff>: used for exponential backoff re-transmission, the actual re-transmission timeout value is icsk_rto << icsk_backoff
    • rtt:<rtt>/<rttvar>: rtt is the average round trip time, rttvar is the mean deviation of rtt, their units are millisecond
    • ato:<ato>: ack timeout, unit is millisecond, used for delay ack mode
    • mss:<mss>: max segment size
    • cwnd:<cwnd>: congestion window size
    • pmtu:<pmtu>: path MTU value
    • ssthresh:<ssthresh>: tcp congestion window slow start threshold
    • bytes_acked:<bytes_acked>: bytes acked
    • bytes_received:<bytes_received>: bytes received
    • segs_out:<segs_out>: segments sent out
    • segs_in:<segs_in>: segments received
    • send <send_bps>bps: egress bps
    • lastsnd:<lastsnd>: how long time since the last packet sent, the unit is millisecond
    • lastrcv:<lastrcv>: how long time since the last packet received, the unit is millisecond
    • lastack:<lastack>: how long time since the last ack received, the unit is millisecond
  • ss -A tcp,udp: dump the TCP and UDP socket tables
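
The backoff field described above combines with rto through a left shift. A minimal sketch of that arithmetic, using sample numbers rather than values read from a real socket:

```shell
# Effective retransmission timeout = icsk_rto << icsk_backoff (milliseconds).
# rto and backoff are sample numbers, not taken from ss output.
rto=204
backoff=3
effective_rto=$(( rto << backoff ))
echo "rto:$rto backoff:$backoff -> ${effective_rto} ms"
```

Each additional backoff step doubles the timeout, so a socket showing a non-zero backoff is already waiting noticeably longer than its rto value alone suggests.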

37. tcpdump

Compat: Linux; Root/CAP_NET_RAW required for captures; Requires: tcpdump.

What it does: capture packets for inspection and troubleshooting.

  • Interface and no name resolution: tcpdump -ni <iface>
  • Host or subnet: tcpdump -ni <iface> host <ip>; tcpdump -ni <iface> net 10.0.0.0/8
  • Ports/protocols: tcpdump -ni <iface> tcp port 443 or udp port 53
  • SYNs only (new TCP handshakes):
    # New TCP handshakes only (SYN without ACK)
    tcpdump -ni <iface> 'tcp[tcpflags] & (tcp-syn) != 0 and tcp[tcpflags] & (tcp-ack) == 0'
  • DNS queries: tcpdump -ni <iface> port 53
  • ICMP reachability: tcpdump -ni <iface> icmp
    # Requires: tcpdump
    # Capture full packets to a file
    tcpdump -ni <iface> -s 0 -w capture.pcap

    # Rotate captures every 5m, keep 6 files
    tcpdump -ni <iface> -s 0 -G 300 -W 6 -w 'cap-%Y%m%d%H%M%S.pcap'

38. mtr

Compat: Linux; May need root/CAP_NET_RAW for certain probe types; Requires: mtr.

What it does: combines ping and traceroute to visualize latency and loss per hop.

  • Run with extra info: mtr -ezbw <dest>
  • Report mode (one-off): mtr -ezbwrc 10 <dest>

39. traceroute

  • Compat: Linux; Requires: traceroute; TCP mode may need CAP_NET_RAW/root.

  • traceroute -I: use ICMP echo for probes

  • traceroute -T: use TCP SYN for probes

40. nicstat

  • Compat: Linux; Not widely packaged; Consider sar -n/ethtool -S alternatives.

  • nicstat prints out network statistics for all network cards (NICs), including packets, kilobytes per second, average packet sizes and more.

  • nicstat -t: show TCP statistics

  • nicstat: show network interface stats. The package may need a third-party repo or a source build on some distros.

Metrics reference
  • Time - The time corresponding to the end of the sample shown, in HH:MM:SS format (24-hour clock).
  • Int - The interface name.
  • rKB/s, InKB - Kilobytes/second read (received).
  • wKB/s, OutKB - Kilobytes/second written (transmitted).
  • rMbps, RdMbps - Megabits/second read (received).
  • wMbps, WrMbps - Megabits/second written (transmitted).
  • rPk/s, InSeg, InDG - Packets (TCP Segments, UDP Datagrams)/second read (received).
  • wPk/s, OutSeg, OutDG - Packets (TCP Segments, UDP Datagrams)/second written (transmitted).
  • rAvs - Average size of packets read (received).
  • wAvs - Average size of packets written (transmitted).
  • %Util - Percentage utilization of the interface. For full-duplex interfaces, this is the greater of rKB/s or wKB/s as a percentage of the interface speed. For half-duplex interfaces, rKB/s and wKB/s are summed.
  • %rUtil, %wUtil - Percentage utilization for bytes read and written, respectively.
  • Sat - Saturation. This is the number of errors/second seen for the interface, an indicator that the interface may be approaching saturation. The statistic is combined from a number of kernel statistics. Use the ‘-x’ option to see the individual statistics (those mentioned below) when diagnosing a network issue.
  • IErr - Packets received that could not be processed because they contained errors
  • OErr - Packets that were not successfully transmitted because of errors
  • Coll - Ethernet collisions during transmit.
  • NoCP - No-can-puts. This is when an incoming packet can not be put to the process reading the socket. This suggests the local process is unable to process incoming packets in a timely manner.
  • Defer - Defer Transmits. Packets without collisions where first transmit attempt was delayed because the medium was busy.
  • Reset - tcpEstabResets. The number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
  • AttF - tcpAttemptFails - The number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
  • %ReTX - Percentage of TCP segments retransmitted - that is, the number of TCP segments transmitted containing one or more previously transmitted octets.
  • InConn - tcpPassiveOpens - The number of times that TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state.
  • OutCon - tcpActiveOpens - The number of times that TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state.
  • Drops - tcpHalfOpenDrop + tcpListenDrop + tcpListenDropQ0. tcpListenDrop and tcpListenDropQ0 are the number of connections dropped from the completed connection queue and the incomplete connection queue, respectively. tcpHalfOpenDrop is the number of connections dropped after the initial SYN packet was received.
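
The %Util rule above (maximum of read and write for full duplex, sum for half duplex) can be sketched as a small helper. This assumes 1 KB = 1024 bytes and a link speed given in Mbit/s; util_pct is a hypothetical function name, and nicstat's own rounding may differ.

```shell
# util_pct RKBPS WKBPS SPEED_MBPS DUPLEX -> percent utilization (one decimal).
# Assumes 1 KB = 1024 bytes; the inputs below are invented sample rates.
util_pct() {
  awk -v r="$1" -v w="$2" -v mbps="$3" -v duplex="$4" 'BEGIN {
    kb = (duplex == "half") ? r + w : (r > w ? r : w)   # sum for half, max for full duplex
    printf "%.1f", kb * 1024 * 8 / (mbps * 1000000) * 100
  }'
}
full=$(util_pct 6000 2000 100 full)
half=$(util_pct 6000 2000 100 half)
echo "full-duplex: $full%  half-duplex: $half%"
```

The gap between the two results shows why the same traffic looks much closer to saturation on a half-duplex link.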

41. nslookup

  • Compat: Legacy; prefer dig/resolvectl; Requires: dnsutils/bind-utils.

Query Internet name servers interactively.

  • nslookup <domain>
  • Note: legacy tool. Prefer dig for detailed queries or resolvectl on systemd-based systems.
  • Quick equivalents: dig <domain> A +short; resolvectl query <domain>

42. host

  • Compat: Linux; Requires: bind9-host/bind-utils.

host is a simple utility for performing DNS lookups. It is normally used to convert names to IP addresses and vice versa.

  • host <domain>
  • Examples: host -t A <domain>; reverse lookup: host <ip>
  • Tip: for more control, use dig (if installed) or resolvectl.

43. iwconfig

  • Compat: Legacy; prefer iw; Requires: wireless-tools.

  • iwconfig wlo1: show WLAN config (example output below)

  • Note: iwconfig is legacy (wireless-tools). Prefer iw for modern drivers, e.g., iw dev, iw dev wlo1 link.

wlo1      IEEE 802.11  ESSID:"NETGEAR97"  
Mode:Managed Frequency:2.462 GHz Access Point: C4:04:15:58:60:C7
Bit Rate=72.2 Mb/s Tx-Power=20 dBm
Retry short limit:7 RTS thr=2347 B Fragment thr:off
Power Management:off
Link Quality=70/70 Signal level=-32 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:22932 Missed beacon:0

44. brctl

  • Compat: Legacy; prefer ip link and bridge; Requires: bridge-utils.

  • brctl is used to set up, maintain, and inspect the Ethernet bridge configuration in the Linux kernel. Legacy: prefer ip link add name br0 type bridge and the bridge (iproute2) tooling.

Kernel & Tracing

Cheat Card
  • Kernel logs: dmesg -T -l err,crit,alert,emerg
  • Syscalls: strace -ttT -p <pid> -f -e trace=network,file
  • Modules: lsmod | head, modprobe <name> (caution), sysctl -a | grep tcp
  • Optional advanced:

# perf (if installed)
perf top
perf record -g -p <pid>; perf report

# bpftrace one-liner (Requires: bpftrace)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }'

11. dmesg

  • Compat: Linux; May be restricted by kernel.dmesg_restrict; Requires: util-linux.

  • dmesg --level=<LEVEL> where <LEVEL> is:

    • emerg - system is unusable.
    • alert - action must be taken immediately.
    • crit - critical conditions.
    • err - error conditions.
    • warn - warning conditions.
    • notice - normal but significant condition.
    • info - informational.
    • debug - debug-level messages.
  • dmesg -k: print kernel messages

  • dmesg --facility=<FACILITY> (short form: -f <FACILITY>) where <FACILITY> is:

    • kern: Kernel messages.
    • user: User-level messages.
    • mail: Mail system.
    • daemon: System daemons.
    • auth: Security/authorization messages.
    • syslog: Internal syslogd messages.
    • lpr: Line printer subsystem.
    • news: Network news subsystem.
  • dmesg -T: human readable timestamps

12. lsmod

  • Compat: Linux; Lists modules without root; Requires: kmod.

  • Show loaded kernel modules and sizes/dependencies.

  • Quick peek: lsmod | head

  • Module info (version, params): modinfo <module>

13. modprobe

  • Compat: Linux; Root required; Caution: can destabilize systems; Requires: kmod.

Add or remove modules from the Linux kernel.
  • Load: modprobe <module>; with params: modprobe <module> key=value
  • Unload: modprobe -r <module> (fails if in use)
  • Caution: loading/unloading modules can destabilize systems; prefer persistent config and ensure module compatibility.

Disk & Filesystems

Cheat Card
  • Space/inodes: df -h and df -i; biggest dirs: du -xhd1 /path | sort -h
  • IO saturation: iostat -xz 1; per-proc IO: pidstat -d 1, iotop -oPa
  • Devices/FS: lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINT,MODEL; mounts: findmnt
  • Mount ops: mount --bind olddir newdir; remount ro: mount -o remount,ro /mnt

Inventory and health
  • Device tree: lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT,MODEL
  • Identify filesystem UUID/TYPE: blkid
  • SMART check (if supported): smartctl -H /dev/sdX and smartctl -a /dev/sdX (Requires: smartmontools)
  • NVMe info: nvme list; nvme smart-log /dev/nvme0 (Requires: nvme-cli)

Notes
  • iostat quick view (Requires: sysstat): iostat -xz 1 (watch await, %util, r/s, w/s)
  • findmnt: show mount hierarchy or lookup by target: findmnt /mount/point

14. dd (DANGER: DESTRUCTIVE — READ FIRST)

  • Compat: Linux; Root required for raw devices; Highly destructive when writing; Requires: coreutils.

  • Danger: dd will overwrite data with no confirmation. Double-check devices (e.g., /dev/sdX) and consider read-only or safer alternatives first. Use lsblk, blkid to verify targets.

  • Safer tips: for copies, consider pv to visualize throughput; for imaging, dcfldd; for testing, prefer non-destructive reads.

# Danger: wipes target disk. Verify device with lsblk/blkid.
dd if=/dev/zero of=/dev/sda bs=4k status=progress
# Verify a drive is zeroed: strip NUL bytes and count what remains (0 means all zeros)
[ "$(dd if=/dev/sda bs=1M status=none | tr -d '\0' | wc -c)" -eq 0 ] && echo "All zeros"
# Fill a file with random data (example size)
dd if=/dev/urandom of=myfile bs=6703104 count=1 status=progress
# Danger: clone a partition to another (same size/align). Verify both!
dd if=/dev/sda3 of=/dev/sdb3 bs=4096 status=progress conv=fsync
# Danger: write an image to a USB device. Verify device path first!
dd if=/path/to/bootimage.img of=/dev/sdc bs=4M status=progress conv=fsync
# Quick write/read benchmark on a scratch file (overwrites bigfile if it already exists)
dd if=/dev/zero of=/home/$USER/bigfile bs=1M count=1000 oflag=dsync status=progress
# The read may be served from the page cache, so treat the figure as an upper bound
dd if=/home/$USER/bigfile of=/dev/null bs=1M status=progress
# Sequential device read throughput sample (approx 1 GiB)
dd if=/dev/sda of=/dev/null bs=1024k count=1024 status=progress
# Create a swapfile (example: 8 GiB), then mkswap + swapon
dd if=/dev/zero of=swapfile bs=1MiB count=$((8*1024)) status=progress
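
The zero-check idea from the examples above can be tried safely on a scratch file instead of a real disk. The sketch below deletes every NUL byte and counts what is left; the file is a temp file created only for this demo.

```shell
# Non-destructive demo on a scratch file instead of a real disk.
zeros=$(mktemp)
dd if=/dev/zero of="$zeros" bs=4096 count=4 status=none

# Delete every NUL byte and count what is left: 0 means the input was all zeros.
nonzero=$(tr -d '\0' < "$zeros" | wc -c | tr -d ' ')

# Flip one byte in place and check again.
printf 'x' | dd of="$zeros" bs=1 seek=100 conv=notrunc status=none
nonzero_after=$(tr -d '\0' < "$zeros" | wc -c | tr -d ' ')
echo "before: $nonzero non-zero bytes, after: $nonzero_after"
rm -f "$zeros"
```

Counting the remaining bytes avoids pattern-matching on a hex dump, where offsets and spacing would always produce a match.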

15. jq

  • Compat: Linux; Requires: jq package.

What it does: parse/query/transform JSON on the command line.

  • Pretty-print: jq . file.json
  • Extract field list: jq -r '.items[].metadata.name' file.json
  • Filter by condition: jq '.[] | select(.status=="RUNNING")' file.json
  • Transform and count: jq '[.[] | .level] | group_by(.) | map({level: .[0], count: length})' file.json
  • Sort and top N: jq 'sort_by(.time) | reverse | .[0:5]' file.json
  • From journald (journald exports PRIORITY as a string, so convert it before comparing):
    # Requires: jq. Show high-priority messages from journald
    journalctl -o json | jq -r 'select(.PRIORITY != null and (.PRIORITY | tonumber) <= 3) | .MESSAGE'
  • Keys and length: jq 'keys, length' file.json

16. diff

  • Compat: Linux; Requires: diffutils.

  • unified diff: diff -u old.txt new.txt

  • recursive dirs: diff -ruN dir_old dir_new

  • ignore whitespace changes: diff -u -w old new

  • handle CRLF: diff -u --strip-trailing-cr a b

  • color (if supported): diff --color=auto -u a b

  • apply a patch: patch -p1 < change.diff

17. uname

  • Compat: Linux; Requires: coreutils.

  • uname -a: print all system information (kernel name, hostname, kernel release and version, architecture)

18. sync/fsync

  • Compat: Linux; sync is user command; fsync is a syscall.

  • fsync is a syscall that flushes a file’s in-memory data and metadata to storage. From the shell, use sync (flush all dirty data) or, where coreutils supports it, sync -f <path> (flush only the filesystem containing that path).

19. mkswap

  • Compat: Linux; Root required; Requires: util-linux.

  • -c: check if blocks are corrupted

  • -p: set pagesize

20. fsck

  • Compat: Linux; Root required; Avoid on mounted filesystems; Requires: e2fsprogs for ext*.

  • check for file system consistency:

    • The superblock is checked for inconsistencies in:
      • File system size
      • Number of inodes
      • Free-block count
      • Free-inode count
    • Each inode is checked for inconsistencies in:
      • Format and type
      • Link count
      • Duplicate block
      • Bad block numbers
      • Inode size
  • see: https://docs.oracle.com/cd/E19455-01/805-7228/6j6q7uf0e/index.html

  • Caution: avoid running fsck on a mounted filesystem (except with specific fs support); prefer read-only mounts or maintenance windows.

Extended notes
  • ext* specifics: e2fsck checks ext2/3/4; use -f to force, -n for read-only, -p for preen (auto-fix safe issues). Requires: e2fsprogs.
  • Bad blocks (DANGER): badblocks scans devices for bad sectors; write-mode is destructive. Prefer read-only first.

Examples

# Read-only badblocks scan (non-destructive)
sudo badblocks -sv /dev/sdX

# DANGER: write-mode destructive scan — data loss
sudo badblocks -wsv /dev/sdX

# ext* filesystem check (read-only)
sudo e2fsck -fn /dev/sdXN

21. mount

  • Compat: Linux; Root required unless user mounts configured; Requires: util-linux.

  • mount -a [-t type] [-O optlist]: mount all filesystems listed in /etc/fstab, optionally restricted by type or by option list

  • -o <opts>: pass mount options, overriding the settings in fstab

  • mount --bind olddir newdir: remount part of the hierarchy elsewhere

  • mount --move: move mounted tree to another place

  • Caution: --bind/--move and remounts can impact running services; ensure correct fstab for persistence and have rollback plan.

22. umount

  • Compat: Linux; Root required for system mounts; Requires: util-linux.

  • unmount from a mountpoint

23. chown

  • chown root:staff /u: change owner and group

24. sysctl

  • Compat: Linux; Root required for -w; Persistence via /etc/sysctl.d; Requires: procps.

  • configure kernel parameters at runtime

  • sysctl -a | grep "tcp"

  • Caution: sysctl -w changes take effect immediately; persist only via /etc/sysctl.d/*.conf after validation.

  • Read a key: sysctl net.ipv4.tcp_congestion_control

  • Set a key (runtime): sysctl -w vm.swappiness=10

  • Persist: create /etc/sysctl.d/99-local.conf with vm.swappiness = 10, then sysctl --system

25. iotop

  • Compat: Linux; Root required; Needs kernel taskstats/delay accounting; Python tool.

  • iotop -o: only show threads doing I/O

  • iotop -p <PID1>,<PID2>,...: list of processes to monitor

  • iotop -a: show accumulated I/O since iotop started rather than the current rate

26. netstat

  • Compat: Legacy; prefer ss; Requires: net-tools.
  • Deprecated in many distros; prefer ss.
  • Common mappings:
    • netstat -tulpn -> ss -tulpn
    • netstat -anp -> ss -anp
    • netstat -s -> ss -s

Processes & Scheduling

Cheat Card
  • Top offenders: ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head
  • Threads view: top -H or ps -Lp <pid> -o pid,tid,pcpu,comm
  • Target processes:

# Preview before signaling
pgrep -a <name>

# Then send a scoped, safe signal (example: TERM)
pkill -TERM -u <user> -f '<exact-pattern>'

  • Over time: pidstat -u 1 -p <pid> (CPU) and pidstat -d 1 (IO)
  • Find PIDs: pidof <proc>; list threads: ps -Lp <pid>
  • Niceness: start with nice -n 10 cmd; adjust with renice -n 10 -p <pid>
  • Locks: /proc/locks shows current file locks (read-only)
  • Sessions: logged-in users: who; recent logins: last | head
  • Schedule: list with crontab -l; edit with crontab -e

27. top

  • Compat: Linux; Requires: procps.

  • Dynamic process view with CPU, memory, and load summaries.

  • Key CPU line fields: us (user), sy (system), ni, id (idle), wa (iowait), hi/si (IRQ/softIRQ), st (steal).

  • Key per-proc fields: %CPU, %MEM, VIRT (virtual), RES (resident), SHR (shared), TIME+ (CPU time).

  • top -E m|g: scale as mega|giga bytes

  • top -H: thread-mode

  • top -i: toggle idle-process display (with default settings this hides tasks that used no CPU since the last update)

  • top -o RES|VIRT|SWAP, etc: sort by attribute

  • top -O: print all field names that can be used with -o, then exit

  • top -p pid1,pid2,...: monitor only these PIDs

  • top -1: show per-CPU stats

28. vmstat

  • Compat: Linux; Requires: procps.

Useful to get so/si information

  • Report virtual memory statistics
  • vmstat -a: show active/inactive memory
  • vmstat --stats: various statistics

Interpretation tips
  • r runnable > number of CPUs indicates run-queue contention.
  • b blocked processes (often IO wait); correlate with %wa in top/mpstat.
  • si/so swap in/out: sustained non-zero values indicate memory pressure.
  • Use vmstat 1 for near-real-time view.
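
The first and third tips can be turned into a rough filter over vmstat samples. The sketch below works on two invented rows in the vmstat 1 column order and an assumed CPU count; it flags single samples, so it does not capture the "sustained" part of the swap tip.

```shell
# Sample rows in vmstat column order (r b swpd free buff cache si so ...); values are invented.
sample='5 1 0 812340 10240 901120 0 0 12 40 310 620 35 10 50 5 0
9 2 2048 102400 8192 300000 120 340 900 1500 800 1900 60 25 5 10 0'
ncpu=8   # assumed CPU count; on a live system this could come from nproc

flags=$(printf '%s\n' "$sample" | awk -v ncpu="$ncpu" '{
  msg = ""
  if ($1 > ncpu)        msg = msg " run-queue"   # r exceeds the CPU count
  if ($7 > 0 || $8 > 0) msg = msg " swapping"    # si or so is non-zero
  print "sample " NR ":" (msg == "" ? " ok" : msg)
}')
printf '%s\n' "$flags"
```

On a real host the same awk program can read from vmstat 1 after skipping the two header lines.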

29. strace

  • Compat: Linux; May be restricted by ptrace scope; Requires: strace.

Trace system calls and signals.
  • Attach to a PID: strace -ttT -p <pid> -f -e trace=network,file,fsync,clock,nanosleep
  • Run a program under strace: strace -o strace.log -s 200 -vv -f -ttT your_cmd --arg
  • Syscall time summary: strace -c -p <pid>
  • Filter a path: strace -ttT -e trace=file -P /etc/resolv.conf -p <pid>
  • Notes: -f follows forks; -ttT adds timestamps and syscall durations; -s increases string size.

30. slabtop

  • Compat: Linux; Requires: procps.

  • slabtop: display kernel slab cache information in real time

  • Sort by size: slabtop -s c; one-shot: slabtop -o

31. uptime

  • Compat: Linux; Requires: procps.

  • information about how long the system has been up, and load averages

32. htop

  • Compat: Linux; Requires: htop package.

  • like top, but prettier

33. ps

  • Compat: Linux; Requires: procps.

Cheat Card
  • Top CPU: ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head
  • Top RSS: ps -eo pid,user,rss,cmd --sort=-rss | head
  • Tree view: ps -ejH (or ps axjf)
  • By command: ps -C nginx -o pid,ppid,cmd,%mem,%cpu
  • Threads of a PID: ps -Lp <pid> -o pid,tid,pcpu,comm

  • ps aux: show all processes

  • ps axjf - print process tree

  • ps a - Lift the BSD-style “only yourself” restriction

  • ps -A - select all processes

  • ps -d - select all processes except session leaders

  • ps g - select all processes including session leaders

  • ps Ta - all process associated with this terminal

  • ps r - restrict to running processes

  • ps --pid pidlist - restrict to pidlist processes

  • ps -s|--sid - select by session ID

  • ps t ttylist - select by TTY list

  • ps U userlist - select by effective user ID; ps -U userlist - select by real user ID

  • ps s - display signals

  • ps f - ASCII art process hierarchy

  • ps ax -o rss,pid,user,pcpu,command --sort -%cpu: sort by %cpu

  • ps ax -o rss,pid,user,pcpu,command --sort -rss: sort by rss

Process states:
  • D - uninterruptible sleep (usually IO)
  • I - idle kernel thread
  • R - running or runnable (on run queue)
  • S - interruptible sleep (waiting for an event to complete)
  • T - stopped by job control signal
  • t - stopped by debugger during tracing
  • W - paging (not valid since the 2.6.xx kernel)
  • X - dead (should never be seen)
  • Z - defunct (“zombie”) process, terminated but not reaped by its parent
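
A quick way to see how processes are distributed across these states is to count the first character of the STAT column. The sketch below uses an invented list of STAT values so that the result is reproducible; on a live system the input would come from ps -eo stat= instead.

```shell
# Invented STAT column values, in the form ps -eo stat= prints them.
stats='Ss
R+
S
D
Z
S<
I
D+'

# Keep only the first character (the state itself) and count each one.
counts=$(printf '%s\n' "$stats" | cut -c1 | sort | uniq -c | awk '{print $2 "=" $1}')
printf '%s\n' "$counts"
```

A growing D or Z count across repeated runs is usually the interesting signal, since those processes are stuck on IO or waiting to be reaped.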

see STANDARD FORMAT SPECIFIERS in man ps

CPU

Cheat Card
  • CPU saturation: mpstat -P ALL 1 (sys/iowait/irq/soft)
  • Per-core view in top: top -1; over time per PID: pidstat -u 1 -p <pid>
  • Interrupt spikes: mpstat -I CPU 1

34. mpstat

  • Compat: Linux; Requires: sysstat.

The mpstat command writes to standard output activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported.

Interpretation tips
  • High %iowait: CPUs idle while waiting on disk IO (check iostat).
  • High %irq/%soft: heavy interrupts/softirqs (often network or storage).
  • High %steal: hypervisor stealing time (noisy neighbor in a VM).
  • Compare per-core: hotspots can be isolated to specific cores (affinity).

  • CPU: Processor number. The keyword all indicates that statistics are calculated as averages among all processors.

  • %usr: Show the percentage of CPU utilization that occurred while executing at the user level (application).

  • %nice: Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.

  • %sys: Show the percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts.

  • %iowait: Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

  • %irq: Show the percentage of time spent by the CPU or CPUs to service hardware interrupts.

  • %soft: Show the percentage of time spent by the CPU or CPUs to service software interrupts.

  • %steal: Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.

  • %guest: Show the percentage of time spent by the CPU or CPUs to run a virtual processor.

  • %gnice: Show the percentage of time spent by the CPU or CPUs to run a niced guest.

  • mpstat -I: report interrupt stats

    • Number of interrupts per CPU

    • Number of times a particular interrupt occurred

Memory

Cheat Card
  • Snapshot: free -h --wide; paging: vmstat -a 1 (si/so)
  • Per-proc memory: ps -eo pid,user,rss,cmd --sort=-rss | head; deep dive: pmap -x <pid>
  • OOM evidence: dmesg -T | grep -i oom or journalctl -k -g OOM

35. free

  • Compat: Linux; Requires: procps.

  • used - Used memory (calculated as total - free - buffers - cache)

  • free - Unused memory (MemFree and SwapFree in /proc/meminfo)

  • shared - Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)

  • buffers - Memory used by kernel buffers (Buffers in /proc/meminfo)

  • cache - Memory used by the page cache and slabs (Cached and SReclaimable in /proc/meminfo)

  • buff/cache - Sum of buffers and cache

  • available - Estimation of how much memory is available for starting new applications, without swapping. Unlike the data provided by the cache or free fields, this field takes into account page cache and also that not all reclaimable memory slabs will be reclaimed due to items being in use (MemAvailable in /proc/meminfo, available on kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as free)

  • free -l: show low-high memory breakdown

  • free --wide: wide mode; show buffers and cache in separate columns

Interpretation tips
  • available approximates memory free for new apps without swapping; don’t confuse free with usable memory.
  • High buff/cache is normal; it’s the page cache and reclaimable slabs.
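
To make the used formula from the field list concrete, the sketch below applies it to an invented /proc/meminfo excerpt. The numbers are made up, and newer procps releases may compute used differently, so treat this as an illustration of the formula as written above rather than of any particular free version.

```shell
# Invented /proc/meminfo excerpt (values in kB).
meminfo='MemTotal:       32768000 kB
MemFree:        23000000 kB
Buffers:          200000 kB
Cached:          7000000 kB
SReclaimable:     350000 kB'

# used = total - free - buffers - cache, where cache = Cached + SReclaimable.
used_kb=$(printf '%s\n' "$meminfo" | awk '
  { v[$1] = $2 }
  END { print v["MemTotal:"] - v["MemFree:"] - v["Buffers:"] - v["Cached:"] - v["SReclaimable:"] }')
echo "used: ${used_kb} kB"
```

Replacing the sample text with the contents of /proc/meminfo gives the live figure for comparison with the used column.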

Examples

  • Human-readable snapshot: free -h --wide
  • Example output (from plain free -h, which merges buffers and cache into buff/cache):

              total        used        free      shared  buff/cache   available
Mem:           31Gi       2.1Gi        22Gi       312Mi       7.2Gi        28Gi
Swap:           8Gi          0B         8Gi
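
The available column maps to MemAvailable in /proc/meminfo, so the headroom can be checked from a script. A minimal sketch, assuming the usual "Key: value kB" layout of /proc/meminfo; the function name mem_available_pct is made up for this example:

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo by default; another file can be passed for testing.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }   # value in kB
        /^MemAvailable:/ { avail = $2 }   # value in kB
        END {
            if (total > 0) printf "%.1f\n", avail * 100 / total
        }
    ' "${1:-/proc/meminfo}"
}
```

Calling mem_available_pct with no argument reads the live system; on kernels without MemAvailable the result is 0.0, which should be treated as "unknown" rather than "no memory left".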

36. sar

  • Compat: Linux; Requires: sysstat; history needs sadc enabled.

Cheat Card

  • CPU load/queue: sar -q 1 5; memory: sar -r 1 5
  • IO bw/ops: sar -b 1 5; per-device: sar -d 1 5 (watch await, %util)
  • Network: sar -n DEV 1 5; TCP: sar -n TCP,ETCP 1 5
  • Paging: sar -B 1 5 (watch pgsteal/s, pgscank/s, pgscand/s, majflt/s)


Field reference
  • sar -B: report paging stats

    • pgpgin/s - Total number of kilobytes the system paged in from disk per second.
    • pgpgout/s - Total number of kilobytes the system paged out to disk per second.
    • fault/s - Number of page faults (major + minor) made by the system per second. This is not a count of page faults that generate I/O, because some page faults can be resolved without I/O.
    • majflt/s - Number of major faults the system has made per second, those which have required loading a memory page from disk.
    • pgfree/s - Number of pages placed on the free list by the system per second.
    • pgscank/s - Number of pages scanned by the kswapd daemon per second.
    • pgscand/s - Number of pages scanned directly per second.
    • pgsteal/s - Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.
    • %vmeff - Calculated as pgsteal / pgscan, this is a metric of the efficiency of page reclaim. If it is near 100% then almost every page coming off the tail of the inactive list is being reaped. If it gets too low (e.g. less than 30%) then the virtual memory is having some difficulty. This field is displayed as zero if no pages have been scanned during the interval of time.
  • sar -b: Report I/O and transfer rate statistics.

    • tps - Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.
    • rtps - Total number of read requests per second issued to physical devices.
    • wtps - Total number of write requests per second issued to physical devices.
    • bread/s - Total amount of data read from the devices in blocks per second. Blocks are equivalent to sectors and therefore have a size of 512 bytes.
    • bwrtn/s - Total amount of data written to devices in blocks per second.
  • sar -d: report activity for each block device

    • tps - Total number of transfers per second that were issued to physical devices. A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.
    • rkB/s - Number of kilobytes read from the device per second.
    • wkB/s - Number of kilobytes written to the device per second.
    • areq-sz - The average size (in kilobytes) of the I/O requests that were issued to the device. Note: In previous versions, this field was known as avgrq-sz and was expressed in sectors.
    • aqu-sz - The average queue length of the requests that were issued to the device. Note: In previous versions, this field was known as avgqu-sz.
    • await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
    • svctm - The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
    • %util - Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
  • sar -F: display statistics for currently mounted filesystems:

    • MBfsfree - Total amount of free space in megabytes (including space available only to privileged user).
    • MBfsused - Total amount of space used in megabytes.
    • %fsused - Percentage of filesystem space used, as seen by a privileged user.
    • %ufsused - Percentage of filesystem space used, as seen by an unprivileged user.
    • Ifree - Total number of free file nodes in filesystem.
    • Iused - Total number of file nodes used in filesystem.
    • %Iused - Percentage of file nodes used in filesystem.
  • sar -m: power management statistics:

    With the CPU keyword, statistics about the CPU are reported. The following value is displayed:

    • MHz - Instantaneous CPU clock frequency in MHz.

    With the FAN keyword, statistics about fans speed are reported. The following values are displayed:

    • rpm - Fan speed expressed in revolutions per minute.
    • drpm - This field is calculated as the difference between current fan speed (rpm) and its low limit (fan_min).
    • DEVICE - Sensor device name.

    With the FREQ keyword, statistics about CPU clock frequency are reported. The following value is displayed:

    • wghMHz - Weighted average CPU clock frequency in MHz. Note that the cpufreq-stats driver must be compiled in the kernel for this option to work.

    With the IN keyword, statistics about voltage inputs are reported. The following values are displayed:

    • inV - Voltage input expressed in Volts.
    • %in - Relative input value. A value of 100% means that voltage input has reached its high limit (in_max) whereas a value of 0% means that it has reached its low limit (in_min).
    • DEVICE - Sensor device name.

    With the USB keyword, the sar command takes a snapshot of all the USB devices currently plugged into the system. At the end of the report, sar will display a summary of all those USB devices. The following values are displayed:

    • BUS - Root hub number of the USB device.
    • idvendor - Vendor ID number (assigned by USB organization).
    • idprod - Product ID number (assigned by Manufacturer).
    • maxpower - Maximum power consumption of the device (expressed in mA).
    • manufact - Manufacturer name.
    • product - Product name.
  • sar -n DEV:

    • IFACE - Name of the network interface for which statistics are reported.
    • rxpck/s - Total number of packets received per second.
    • txpck/s - Total number of packets transmitted per second.
    • rxkB/s - Total number of kilobytes received per second.
    • txkB/s - Total number of kilobytes transmitted per second.
    • rxcmp/s - Number of compressed packets received per second (for cslip etc.).
    • txcmp/s - Number of compressed packets transmitted per second.
    • rxmcst/s - Number of multicast packets received per second.
    • %ifutil - Utilization percentage of the network interface. For half-duplex interfaces, utilization is calculated using the sum of rxkB/s and txkB/s as a percentage of the interface speed. For full-duplex, this is the greater of rxkB/s or txkB/s.
  • sar -n EDEV:

    • IFACE - Name of the network interface for which statistics are reported.
    • rxerr/s - Total number of bad packets received per second.
    • txerr/s - Total number of errors that happened per second while transmitting packets.
    • coll/s - Number of collisions that happened per second while transmitting packets.
    • rxdrop/s - Number of received packets dropped per second because of a lack of space in linux buffers.
    • txdrop/s - Number of transmitted packets dropped per second because of a lack of space in linux buffers.
    • txcarr/s - Number of carrier-errors that happened per second while transmitting packets.
    • rxfram/s - Number of frame alignment errors that happened per second on received packets.
    • rxfifo/s - Number of FIFO overrun errors that happened per second on received packets.
    • txfifo/s - Number of FIFO overrun errors that happened per second on transmitted packets.
  • sar -n ICMP:

    • imsg/s - The total number of ICMP messages which the entity received per second [icmpInMsgs]. Note that this counter includes all those counted by ierr/s.
    • omsg/s - The total number of ICMP messages which this entity attempted to send per second [icmpOutMsgs]. Note that this counter includes all those counted by oerr/s.
    • iech/s - The number of ICMP Echo (request) messages received per second [icmpInEchos].
    • iechr/s - The number of ICMP Echo Reply messages received per second [icmpInEchoReps].
    • oech/s - The number of ICMP Echo (request) messages sent per second [icmpOutEchos].
    • oechr/s - The number of ICMP Echo Reply messages sent per second [icmpOutEchoReps].
    • itm/s - The number of ICMP Timestamp (request) messages received per second [icmpInTimestamps].
    • itmr/s - The number of ICMP Timestamp Reply messages received per second [icmpInTimestampReps].
    • otm/s - The number of ICMP Timestamp (request) messages sent per second [icmpOutTimestamps].
    • otmr/s - The number of ICMP Timestamp Reply messages sent per second [icmpOutTimestampReps].
    • iadrmk/s - The number of ICMP Address Mask Request messages received per second [icmpInAddrMasks].
    • oadrmk/s - The number of ICMP Address Mask Request messages sent per second [icmpOutAddrMasks].
    • oadrmkr/s - The number of ICMP Address Mask Reply messages sent per second [icmpOutAddrMaskReps].
  • sar -n EICMP: Extended ICMP stats (errors, dest unreachable, time exceeded). Focus on spikes in ierr/s and oerr/s, and patterns in unreachable/time-exceeded when debugging path issues.

  • sar -n EIP: Extended IPv4 stats (header errors, addr errors, discards, no routes, reassembly, fragment fails). Use to spot header errors and routing/no-route conditions.

  • sar -n IP6: IPv6 per-protocol counters (receive/deliver/forward, multicast, fragmentation). Check for anomalies similar to IPv4.

  • sar -n EIP6: Extended IPv6 errors and routing stats (header/addr errors, discards, no routes, reassembly/frag). Useful for IPv6-specific troubleshooting.

  • sar -n SOCK:

    • totsck - Total number of sockets used by the system.
    • tcpsck - TCP sockets in use; tcp-tw - TIME_WAIT sockets.
  • sar -n SOFT:

    • total/s - The total number of network frames processed per second.
    • dropd/s - The total number of network frames dropped per second because there was no room on the processing queue.
    • squeezd/s - The number of times the softirq handler function terminated per second because its budget was consumed or the time limit was reached, but more work could have been done.
    • rx_rps/s - The number of times the CPU has been woken up per second to process packets via an inter-processor interrupt.
    • flw_lim/s - The number of times the flow limit has been reached per second. Flow limiting is an optional RPS feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
  • sar -n TCP:

    • active/s - The number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state per second [tcpActiveOpens].
    • passive/s - The number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state per second [tcpPassiveOpens].
    • iseg/s - The total number of segments received per second, including those received in error [tcpInSegs]. This count includes segments received on currently established connections.
    • oseg/s - The total number of segments sent per second, including those on current connections but excluding those containing only retransmitted octets [tcpOutSegs].
  • sar -n ETCP:

    • atmptf/s - The number of times per second TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times per second TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state [tcpAttemptFails].
    • estres/s - The number of times per second TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state [tcpEstabResets].
    • retrans/s - The total number of segments retransmitted per second - that is, the number of TCP segments transmitted containing one or more previously transmitted octets [tcpRetransSegs].
    • isegerr/s - The total number of segments received in error (e.g., bad TCP checksums) per second [tcpInErrs].
    • orsts/s - The number of TCP segments sent per second containing the RST flag [tcpOutRsts].
  • sar -n UDP:

    • idgm/s - The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].
    • odgm/s - The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].
    • noport/s - The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].
    • idgmerr/s - The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].
  • sar -n UDP6:

    • idgm6/s - The total number of UDP datagrams delivered per second to UDP users [udpInDatagrams].
    • odgm6/s - The total number of UDP datagrams sent per second from this entity [udpOutDatagrams].
    • noport6/s - The total number of received UDP datagrams per second for which there was no application at the destination port [udpNoPorts].
    • idgmer6/s - The number of received UDP datagrams per second that could not be delivered for reasons other than the lack of an application at the destination port [udpInErrors].
  • sar -q:

    • runq-sz - Run queue length (number of tasks waiting for run time).
    • plist-sz - Number of tasks in the task list.
    • ldavg-1 - System load average for the last minute. The load average is calculated as the average number of runnable or running tasks (R state), and the number of tasks in uninterruptible sleep (D state) over the specified interval.
    • ldavg-5 - System load average for the past 5 minutes.
    • ldavg-15 - System load average for the past 15 minutes.
    • blocked - Number of tasks currently blocked, waiting for I/O to complete.
  • sar -r:

    • kbmemfree - Amount of free memory available in kilobytes.
    • kbavail - Estimate of how much memory in kilobytes is available for starting new applications, without swapping. The estimate takes into account that the system needs some page cache to function well, and that not all reclaimable memory slabs will be reclaimable, due to items being in use. The impact of those factors will vary from system to system.
    • kbmemused - Amount of used memory in kilobytes (calculated as total installed memory - kbmemfree - kbbuffers - kbcached - kbslab).
    • %memused - Percentage of used memory.
    • kbbuffers - Amount of memory used as buffers by the kernel in kilobytes.
    • kbcached - Amount of memory used to cache data by the kernel in kilobytes.
    • kbcommit - Amount of memory in kilobytes needed for current workload. This is an estimate of how much RAM/swap is needed to guarantee that the system never runs out of memory.
    • %commit - Percentage of memory needed for current workload in relation to the total amount of memory (RAM+swap). This number may be greater than 100% because the kernel usually overcommits memory.
    • kbactive - Amount of active memory in kilobytes (memory that has been used more recently and is usually not reclaimed unless absolutely necessary).
    • kbinact - Amount of inactive memory in kilobytes (memory that has been less recently used and is more eligible to be reclaimed for other purposes).
    • kbdirty - Amount of memory in kilobytes waiting to get written back to the disk.
    • kbanonpg - Amount of non-file backed pages in kilobytes mapped into userspace page tables.
    • kbslab - Amount of memory in kilobytes used by the kernel to cache data structures for its own use.
    • kbkstack - Amount of memory in kilobytes used for kernel stack space.
    • kbpgtbl - Amount of memory in kilobytes dedicated to the lowest level of page tables.
    • kbvmused - Amount of memory in kilobytes of used virtual address space.
  • sar -S:

    • kbswpfree - Amount of free swap space in kilobytes.
    • kbswpused - Amount of used swap space in kilobytes.
    • %swpused - Percentage of used swap space.
    • kbswpcad - Amount of cached swap memory in kilobytes. This is memory that once was swapped out and has been swapped back in, but is still also in the swap area (if memory is needed it doesn’t need to be swapped out again because it is already in the swap area; this saves I/O).
    • %swpcad - Percentage of cached swap memory in relation to the amount of used swap space.
  • sar -u:

    • %user - Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field includes time spent running virtual processors.
    • %usr - Percentage of CPU utilization that occurred while executing at the user level (application). Note that this field does NOT include time spent running virtual processors.
    • %nice - Percentage of CPU utilization that occurred while executing at the user level with nice priority.
    • %system - Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field includes time spent servicing hardware and software interrupts.
    • %sys - Percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this field does NOT include time spent servicing hardware or software interrupts.
    • %iowait - Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
    • %steal - Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
    • %irq - Percentage of time spent by the CPU or CPUs to service hardware interrupts.
    • %soft - Percentage of time spent by the CPU or CPUs to service software interrupts.
    • %guest - Percentage of time spent by the CPU or CPUs to run a virtual processor.
    • %gnice - Percentage of time spent by the CPU or CPUs to run a niced guest.
    • %idle - Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
  • sar -v:

    • dentunusd - Number of unused cache entries in the directory cache.
    • file-nr - Number of file handles used by the system.
    • inode-nr - Number of inode handlers used by the system.
    • pty-nr - Number of pseudo-terminals used by the system.
  • sar -W: Report swapping statistics. The following values are displayed:

    • pswpin/s - Total number of swap pages the system brought in per second.
    • pswpout/s - Total number of swap pages the system brought out per second.
  • sar -w: Report task creation and system switching activity.

    • proc/s - Tasks created per second; cswch/s - context switches per second.
  • sar -y: Report TTY devices activity. The following values are displayed:

    • rcvin/s - Number of receive interrupts per second for current serial line. Serial line number is given in the TTY column.
    • xmtin/s - Number of transmit interrupts per second for current serial line.
    • framerr/s - Number of frame errors per second for current serial line.
    • prtyerr/s - Number of parity errors per second for current serial line.
    • brk/s - Number of breaks per second for current serial line.
    • ovrun/s - Number of overrun errors per second for current serial line.
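
The %vmeff field from sar -B above is a plain ratio, so it can be recomputed by hand when only the raw counters are at hand. A minimal sketch, assuming "pgscan" means pgscank/s + pgscand/s (an assumption based on the field descriptions, not something the text above states outright); the function name vmeff is made up for this example:

```shell
# vmeff PGSCANK PGSCAND PGSTEAL
# Print page-reclaim efficiency as a percentage with two decimals.
# Prints 0.00 when nothing was scanned, matching how sar reports that case.
vmeff() {
    awk -v k="$1" -v d="$2" -v s="$3" 'BEGIN {
        scanned = k + d
        if (scanned == 0) { print "0.00"; exit }
        printf "%.2f\n", s * 100 / scanned
    }'
}
```

For example, vmeff 800 200 950 reports 95.00, which per the description above means almost every scanned page was reclaimed.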

45. pidstat

  • Compat: Linux; Requires: sysstat.

  • Monitors individual tasks currently being managed by the Linux kernel.

Cheat Card

  • CPU by PID: pidstat -u 1 -p <pid> (watch %usr/%system/%wait)
  • IO by PID: pidstat -d 1 -p <pid> (check kB_rd/s, kB_wr/s, iodelay)
  • Memory faults: pidstat -r 1 -p <pid> (watch majflt/s)
  • Threads: pidstat -t -u 1 -p <pid>

  • pidstat -d:

    • Key fields: kB_rd/s, kB_wr/s, iodelay (IO wait), kB_ccwr/s (cancelled writes).
  • pidstat -R: Report realtime priority and scheduling policy information. The following values may be displayed:

    • Key fields: prio, policy.
  • pidstat -r: Report page faults and memory utilization.

    When reporting statistics for individual tasks, the following values may be displayed:

    • Key fields: majflt/s (major faults), RSS, %MEM.

    When reporting global statistics for tasks and all their children, the following values may be displayed:

    • With children: majflt-nr, minflt-nr summarize faults.
  • pidstat -s: Report stack utilization. The following values may be displayed:

    • Key fields: StkRef (used), StkSize (reserved).
  • pidstat -t: Also display statistics for threads associated with the selected tasks (each process is listed together with its threads).

  • pidstat -u: Report CPU utilization.

    When reporting statistics for individual tasks, the following values may be displayed:

    • Key fields: %usr, %system, %wait, %CPU, CPU.

    When reporting global statistics for tasks and all their children, the following values may be displayed:

    • With children: usr-ms, system-ms, guest-ms summarize CPU time.

46. lsof

  • Compat: Linux; May require root to see all descriptors; Requires: lsof.

# List all open files
lsof

# Processes using a file? (fuser equivalent)
lsof /path/to/file

# Open files within a directory
lsof +D /path

# Files by user
lsof -u name
lsof -u name1,name2
lsof -u name1 -u name2

# By program name
lsof -c apache

# AND'ing selection conditions (needs -a; without it lsof ORs them)
lsof -a -u www-data -c apache

# By pid
lsof -p 1

# Except certain pids
lsof -p ^1

# TCP and UDP connections
lsof -i
lsof -i tcp # TCP connections
lsof -i udp # UDP connections

# By port
lsof -i :25
lsof -i :smtp
lsof -i udp:53
lsof -i tcp:80

# All network activity by a user
lsof -a -u name1 -i

lsof -N # NFS use
lsof -U # UNIX domain socket use

# List PIDs
lsof -t -i
# Danger: broad kill; preview and scope carefully before use
kill -9 $(lsof -t -i) # Kill all programs w/network activity

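Because the broad kill above is easy to regret, it helps to put a filter between lsof -t and kill. A minimal sketch, not a complete safeguard; the function name filter_pids and the choice of protected PIDs (PID 1 and the current shell) are assumptions made for this example:

```shell
# Read PIDs (one per line) on stdin and print them de-duplicated in numeric
# order, dropping PID 1 and the PID of the current shell.
filter_pids() {
    sort -un | awk -v self="$$" '$1 != 1 && $1 != self'
}
```

A cautious workflow would be lsof -t -i tcp:80 | filter_pids, then reviewing the printed PIDs with ps before passing any of them to kill.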

51. pmap

  • Compat: Linux; Requires: procps; -X needs procps-ng.

  • pmap -X <pid>: show Address, Perm, Offset, Device, Inode, Size, Rss, Pss, Referenced, Anonymous, LazyFree, ShmemPmdMapped, Shared_Hugetlb, Private_Hugetlb, Swap, SwapPss, Locked, THPeligible, Mapping

Common recipes

  • Largest mappings first: pmap -x <pid> | sort -nrk 3 | head (by RSS KB)
  • Totals summary: pmap <pid> (last line shows total)

52. blktrace

  • Compat: Linux; Root required; Needs kernel block trace support; Requires: blktrace.

  • blktrace is a block layer I/O tracing mechanism which provides detailed information about request queue operations up to user space. There are three major components: a kernel component, a utility to record the I/O trace information passed from the kernel to user space, and utilities to analyse and view the trace information.

# Trace block I/O on /dev/sda and parse
sudo blktrace -d /dev/sda -o - | blkparse -i -


Example summary output (printed by blkparse when tracing stops):

CPU0 (8,0):
 Reads Queued:          385,     1540KiB   Writes Queued:           0,        0KiB
 Read Dispatches:        75,     1544KiB   Write Dispatches:        4,       16KiB
 Reads Requeued:          0                Writes Requeued:         0
 Reads Completed:       681,    15168KiB   Writes Completed:       42,     1208KiB
 Read Merges:           315,     1260KiB   Write Merges:            0,        0KiB
 Read depth:             84                Write depth:            21
 IO unplugs:             63                Timer unplugs:           0
CPU1 (8,0):
 Reads Queued:          406,     1624KiB   Writes Queued:          13,      996KiB
 Read Dispatches:        71,     1620KiB   Write Dispatches:       10,      992KiB
 Reads Requeued:          1                Writes Requeued:         0
 Reads Completed:         0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:           336,     1344KiB   Write Merges:            2,      200KiB
 Read depth:             84                Write depth:            21
 IO unplugs:             68                Timer unplugs:           0
CPU2 (8,0):
 Reads Queued:         1531,     6152KiB   Writes Queued:          30,      120KiB
 Read Dispatches:       257,     6152KiB   Write Dispatches:        3,      108KiB
 Reads Requeued:          0                Writes Requeued:         0
 Reads Completed:         0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:          1277,     5108KiB   Write Merges:           24,       96KiB
 Read depth:             84                Write depth:            21
 IO unplugs:            255                Timer unplugs:           0
CPU3 (8,0):
 Reads Queued:         1266,     5852KiB   Writes Queued:          23,       92KiB
 Read Dispatches:       279,     5852KiB   Write Dispatches:       21,       92KiB
 Reads Requeued:          0                Writes Requeued:         0
 Reads Completed:         0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:           987,     3948KiB   Write Merges:            2,        8KiB
 Read depth:             84                Write depth:            21
 IO unplugs:            279                Timer unplugs:           1

Total (8,0):
 Reads Queued:         3588,    15168KiB   Writes Queued:          66,     1208KiB
 Read Dispatches:       682,    15168KiB   Write Dispatches:       38,     1208KiB
 Reads Requeued:          1                Writes Requeued:         0
 Reads Completed:       681,    15168KiB   Writes Completed:       42,     1208KiB
 Read Merges:          2915,    11660KiB   Write Merges:           28,      304KiB
 IO unplugs:            665                Timer unplugs:           1
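
One quick reading of this summary is the read merge ratio: how many of the queued reads were folded into an adjacent request. A minimal sketch that pulls the two counters from the Total section, assuming the "Reads Queued: N, ..." and "Read Merges: N, ..." layout shown above; the function name read_merge_pct is made up for this example:

```shell
# Read a blkparse summary on stdin and print Read Merges as a percentage of
# Reads Queued, using only the lines that follow the "Total" header.
read_merge_pct() {
    awk '
        /^Total/                    { in_total = 1 }
        in_total && /Reads Queued:/ { queued = $3 + 0 }   # "3588," -> 3588
        in_total && /Read Merges:/  { merged = $3 + 0 }
        END { if (queued > 0) printf "%.1f\n", merged * 100 / queued }
    '
}
```

For the Total section above (3588 queued, 2915 merged) this gives 81.2, so most reads in this trace were merged before dispatch.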

53. btrace

  • Compat: Linux; Wrapper script from blktrace; Root required.

  • The btrace script provides a quick and easy way to do live tracing of block devices. It calls blktrace on the specified devices and pipes the output through blkparse for formatting. See blktrace (8) for more in-depth information about how blktrace works.

  • btrace /dev/sda: live-trace /dev/sda and print the formatted blkparse output.

54. tr

  • Compat: Linux; Requires: coreutils.

Translate, squeeze, and/or delete characters from standard input, writing to standard output.

  • tr '\n' ',': convert new lines to commas
  • squeeze repeats: tr -s ' ' < file (collapse runs of spaces)
  • delete chars: tr -d '\r' < file (remove CR)
  • keep only printable: tr -cd '[:print:]\n' < file
  • case convert: tr '[:upper:]' '[:lower:]' < file
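
The recipes above chain naturally. A minimal sketch that turns a CRLF, mixed-case list into one lowercase, comma-separated line; the function name normalize_list is made up for this example:

```shell
# Read lines on stdin; strip carriage returns, lowercase everything, then turn
# newlines into commas (-s also squeezes runs of commas, so blank lines vanish).
normalize_list() {
    tr -d '\r' | tr '[:upper:]' '[:lower:]' | tr -s '\n' ','
}
```

Note that the result keeps a trailing comma, and that -s will also collapse repeated commas already present in the input.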

55. cut

  • Compat: Linux; Requires: coreutils.

  • select CSV fields: cut -d, -f1,3 file.csv

  • ranges: cut -d: -f1-3 /etc/passwd

  • bytes/chars: cut -b1-10 file; cut -c1-20 file

  • complement: cut -d, -f1 --complement file.csv

  • with headers: pair with head -1 to see column indexes

56. xargs

  • Compat: Linux; Requires: findutils; GNU -r may vary on BusyBox.

Build and run argument lists; combine with find and null-terminated records for safety.

  • safe null delim: find . -type f -name '*.log' -print0 | xargs -0 rm -f
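  • limit args per call: xargs -n 1 echo (one argument per invocation; use -I{} instead when the argument must go mid-command, and don't combine -n with -I)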
  • parallelism: xargs -P 4 -n 1 cmd (run 4 at a time)
  • interactive confirm: xargs -p rm (ask before each batch)
  • do nothing on empty input: xargs -r cmd (GNU)
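
To see why the null-delimited form matters, the sketch below counts matching files by running one command per file, which stays correct for names containing spaces. It assumes GNU xargs (for -r); the function name count_logs is made up for this example:

```shell
# count_logs DIR
# Count *.log files under DIR. Each file triggers one tiny command that prints
# a single "x"; wc -c then counts the x's. -0 keeps odd filenames intact and
# -r makes xargs run nothing when find matches no files.
count_logs() {
    find "$1" -type f -name '*.log' -print0 |
        xargs -0 -r -n 1 sh -c 'printf x' |
        wc -c | tr -d ' '
}
```

With a plain newline-delimited pipe, a file named "a b.log" would be split into two arguments; here it is passed through as one.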

Logs & Systemd

Cheat Card

  • Unit status: systemctl status <unit>; failed: systemctl list-units --failed
  • Hot errors: journalctl -xeu <unit>; follow: journalctl -fu <unit>
  • Boot scoping: journalctl -b and -b -1; size: journalctl --disk-usage

Systemd basics

# Unit status and enablement
systemctl status <unit>
systemctl is-active <unit>
systemctl is-enabled <unit>

# Failed units overview
systemctl list-units --failed
journalctl -xe # jump to the latest entries, with explanatory help text

# Restart and verify logs from this boot
systemctl restart <unit>
journalctl -u <unit> -b -n 50

Journal essentials

# Recent errors for a unit and live follow
journalctl -xeu <unit>
journalctl -fu <unit>

# Time window and priority
journalctl -u <unit> --since "1 hour ago" --until now
journalctl -p err..alert -b

# Previous boot
journalctl -b -1

# JSON output piped to jq (Requires: jq)
journalctl -u <unit> -o json | jq -r '.MESSAGE'

Journal management

# Disk usage and vacuum
journalctl --disk-usage
journalctl --vacuum-size=1G
journalctl --vacuum-time=7d

# Make logs persistent (requires root; edit journald.conf)
# /etc/systemd/journald.conf: set Storage=persistent
systemctl restart systemd-journald

# Tip: tune RateLimitIntervalSec/RateLimitBurst to manage log storms

Resolved (DNS)

# Overall resolver status
resolvectl status

# Query using systemd-resolved
resolvectl query example.com

# Flush caches
resolvectl flush-caches

Security & Audit

Cheat Card

  • SELinux mode: getenforce; recent denials: ausearch -m AVC -ts recent
  • AppArmor status: aa-status; switch a profile with aa-complain / aa-enforce
  • Audit rule example: auditctl -w /etc/ssh/sshd_config -p wa -k sshcfg

SELinux

# Current mode and temporary permissive (diagnostic; requires root)
getenforce
setenforce 0 # Caution: reduces enforcement

# Contexts and recent denials
ls -Z
ps -eZ | head
ausearch -m AVC -ts recent
journalctl -t setroubleshoot

# Manage booleans (example: allow httpd network connect)
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
# Requires: selinux-utils/policycoreutils; setroubleshoot (optional)

AppArmor

# Status and service
aa-status
systemctl status apparmor

# Toggle a profile mode
aa-complain /path/to/bin
aa-enforce /path/to/bin
# Requires: apparmor-utils

Auditd

# Service and rules
systemctl status auditd
auditctl -l

# Search recent denials / by PID
ausearch -m avc -ts recent
ausearch -p <pid> -ts recent

# Watch a file for writes/attr changes (key: sshcfg)
auditctl -w /etc/ssh/sshd_config -p wa -k sshcfg

# Summary report
aureport --summary -ts today
# Requires: auditd (auditd, auditctl, ausearch, aureport)

Containers & Namespaces

Cheat Card

  • Enter container namespace: nsenter --target <pid> --mount --uts --ipc --net --pid -- bash
  • Docker triage: docker ps, docker logs --tail=200 -f <id>, docker exec -it <id> sh
  • K8s triage: kubectl get pods -A, kubectl describe pod <pod> -n <ns>, kubectl logs <pod> -n <ns> --previous

nsenter (enter namespaces of a PID)

# Get target PID (e.g., container process)
pidof <proc>

# Enter multiple namespaces of a PID
nsenter --target <pid> --mount --uts --ipc --net --pid -- bash

# Inspect and chroot-like into the process rootfs
ls -l /proc/<pid>/root
nsenter --target <pid> --mount -- chroot /proc/<pid>/root bash

Docker (if present)

# List, exec, inspect PID, and tail logs
docker ps --format '{{.ID}} {{.Names}} {{.Status}}'
docker exec -it <id|name> bash # or sh
docker inspect -f '{{.State.Pid}}' <id>
docker logs --tail=200 -f <id>

Kubernetes (if present)

# Pods and events
kubectl get pods -A -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail

# Describe, logs, and exec
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --tail=200
kubectl logs <pod> -n <ns> --previous
kubectl exec -it <pod> -n <ns> -- bash

CRI/containerd (if present)

# List, inspect, and logs via crictl
crictl ps -a
crictl inspect <id>
crictl logs <id>

Notes

  • Without runtime CLIs, use nsenter with a PID taken from ps or systemctl status.
  • Requires: docker or podman for Docker-like commands; kubectl; crictl for containerd/CRI.

Incident Playbooks

High CPU

# Top CPU processes and hot threads
ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head
top -H
ps -Lp <pid> -o pid,tid,pcpu,comm

# Per-process CPU over time; optional perf if available
pidstat -u 1 -p <pid>
perf top # if installed

High IO wait / Disk latency

# Device saturation and per-process IO
iostat -xz 1 # watch await, %util, r/s, w/s
pidstat -d 1
iotop -oPa

# Device/FS inventory and kernel errors
lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINT,MODEL
dmesg -T | grep -Ei 'error|reset|blk|nvme'

# Optional deep dive
blktrace -d /dev/sdX -o - | blkparse -i -

Memory leak / OOM

# Snapshot and top RSS processes
free -h
ps aux --sort=-rss | head

# Per-process mappings and over-time faults
pmap -x <pid> | sort -nrk 3 | head
pidstat -r 1 -p <pid>
smem -r # if installed

# OOM evidence
dmesg -T | grep -i oom || journalctl -k -g OOM
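
When a leak is spread over many worker processes, the per-process view from ps can hide it. A minimal sketch that totals RSS per command name, assuming input lines of the form "RSS_KB COMMAND" (for example from ps -eo rss=,comm=); the function name rss_by_cmd is made up for this example, and command names containing spaces are cut at the first word:

```shell
# Read "RSS_KB COMMAND" lines on stdin; print total RSS (kB) per command,
# largest first, ties broken alphabetically.
rss_by_cmd() {
    awk '{ sum[$2] += $1 } END { for (c in sum) print sum[c], c }' |
        sort -k1,1nr -k2,2
}
```

Typical use would be ps -eo rss=,comm= | rss_by_cmd | head, repeated a few minutes apart to see which command keeps growing.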

Packet loss / High latency

# Path and end-to-end latency
ip route get <dest>
mtr -ezbw <dest>

# Interface health and TCP details
ip -s link show <iface>
ethtool -S <iface>
ss -i dst <dest>

# Targeted capture samples
tcpdump -ni <iface> host <dest> and icmp
tcpdump -ni <iface> tcp port 443 and 'tcp[tcpflags] & tcp-syn != 0'

DNS failures

# Resolve via systemd-resolved (or dig if available)
resolvectl query example.com
resolvectl status
dig @8.8.8.8 example.com A +time=2 +tries=1

# Reachability and captures
ss -u 'sport = :53 or dport = :53'
tcpdump -ni <iface> port 53

# Config checks
ls -l /etc/resolv.conf
resolvectl flush-caches
# Check firewall rules as appropriate
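
To tell a broken resolver apart from a broken stub configuration, it can help to query each configured nameserver directly. The sketch below only extracts the nameserver entries from a resolv.conf-style file. The function name list_nameservers is hypothetical, and the sample file contents are invented.

```shell
#!/usr/bin/env bash
# Print the nameserver addresses listed in a resolv.conf-style file.
# The path defaults to /etc/resolv.conf; pass another file to test offline.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Demo on an invented sample file.
sample=$(mktemp)
printf '# sample\nnameserver 127.0.0.53\noptions edns0\nnameserver 10.0.0.2\n' > "$sample"
list_nameservers "$sample"
rm -f "$sample"

# On a live host (needs dig and network access), each resolver can be tried in turn:
#   for ns in $(list_nameservers); do dig @"$ns" example.com A +time=2 +tries=1 +short; done
```

If /etc/resolv.conf points at 127.0.0.53, the upstream servers are held by systemd-resolved and show up in resolvectl status instead.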

TLS handshake issues

# Inspect handshake/cert chain (TLS1.2 example)
openssl s_client -connect host:443 -servername host -tls1_2 -showcerts

# Check expiry/subject/issuer quickly
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -dates -subject -issuer

# App behavior (SNI, ALPN, protocols)
curl -v https://host/

# If proxy/MTLS: verify CA path and client certs; check time skew
timedatectl

Disk full / Inode exhaustion

# Space vs inodes
df -h
df -i

# Find biggest dirs on same filesystem
du -xhd1 /path | sort -h

# Deleted-but-open files
lsof +L1
journalctl --vacuum-size=1G # cull journal size

# Many small files
find /path -xdev -type f | wc -l
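
The space and inode checks above can be reduced to a quick filter that prints only the filesystems over a threshold. This is a minimal sketch that assumes the POSIX layout of `df -P` (percentage in column 5, mount point in column 6) and mount points without spaces. The function name over_threshold and the sample rows are invented.

```shell
#!/usr/bin/env bash
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` style output on stdin; the first line is the header.
over_threshold() {
  awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # "95%" -> "95"
        if (pct + 0 >= limit) print $6, $5
      }'
}

# Demo on invented sample output.
over_threshold 90 <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 41152736 39095100 2057636 95% /
tmpfs 8033908 0 8033908 0% /dev/shm
/dev/sdb1 103081248 92773124 10308124 90% /data
EOF
```

On a live host, `df -P | over_threshold 90` covers space. With GNU df, `df -Pi | over_threshold 90` should apply the same filter to inode usage, since the percentage stays in column 5.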

Syscall slowness

# Trace syscalls and timings
strace -ttT -p <pid> -f -e trace=network,file,fsync,clock_gettime,clock_nanosleep,nanosleep

# Optional CPU hotspot profiling
perf record -g -p <pid>; perf report
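
A long strace capture is easier to read once the slow calls are pulled out. The sketch below relies on the `<seconds>` suffix that strace -T appends to each completed call and prints the calls at or above a threshold, slowest first. The function name slow_syscalls and the sample trace lines are invented for illustration.

```shell
#!/usr/bin/env bash
# Print syscalls slower than a threshold (seconds) from `strace -T` output.
# Lines without a trailing <duration> (e.g. "<unfinished ...>") are skipped.
slow_syscalls() {
  awk -v min="${1:-0.1}" '$NF ~ /^<[0-9]+\.[0-9]+>$/ {
        t = substr($NF, 2, length($NF) - 2)   # strip the angle brackets
        if (t + 0 >= min) print t, $0
      }' |
    sort -k1,1nr
}

# Demo on invented sample lines.
slow_syscalls 0.1 <<'EOF'
12:00:00.000100 read(3, "data", 4096) = 4096 <0.000045>
12:00:00.100200 fsync(5) = 0 <0.250113>
12:00:00.400300 connect(7, {sa_family=AF_INET}, 16) = 0 <1.002000>
12:00:01.500000 nanosleep({tv_sec=0, tv_nsec=1000}, <unfinished ...>
EOF
```

For a live process, write the trace to a file with strace's -o option and run `slow_syscalls 0.05 < trace.txt`. For aggregate counts and total time per syscall, strace -c gives a built-in summary.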

Container restart loops

# Docker restart loops
docker ps --filter 'status=restarting'
docker logs <id> --tail=200

# Kubernetes restart loops
kubectl get pods -A
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous

# Node/agent issues
journalctl -u kubelet
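
On a busy cluster, the output of kubectl get pods -A is long, and the restarting pods are easy to miss. The sketch below filters that output by restart count. It assumes the default column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name restarting_pods and the sample rows are invented.

```shell
#!/usr/bin/env bash
# Print pods whose restart count is at or above a minimum.
# Reads `kubectl get pods -A` style output on stdin; column 5 is RESTARTS.
restarting_pods() {
  awk -v min="${1:-3}" 'NR > 1 && $5 + 0 >= min { print $1 "/" $2, $4, $5 }'
}

# Demo on invented sample output.
restarting_pods 3 <<'EOF'
NAMESPACE     NAME                       READY   STATUS             RESTARTS      AGE
default       api-7c9d8f6b5-x2x4k        0/1     CrashLoopBackOff   12 (2m ago)   40m
kube-system   coredns-5d78c9869d-abcde   1/1     Running            0             3d
default       worker-0                   1/1     Running            4             2h
EOF
```

On a live cluster this would be `kubectl get pods -A | restarting_pods 5`, followed by kubectl describe and kubectl logs --previous on each pod it prints.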

  • modify priority of running process

panic

The code below fails with the following panic:

lapicid 2: panic: mycpu called with interrupts enabled
80103937 8010394f 80104a7d 80105b51 8010589e 0 0 0 0 0

int
getcpuid()
{
  return cpuid();
}
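  • Cause: mycpu() must be called with interrupts disabled (it checks the IF flag on entry), but a caller invoked it with interrupts enabled, which triggers the panic. The original note names the start of scheduler as the caller; in the code shown here, getcpuid() reaches mycpu() through cpuid().
  • Fix: disable interrupts before calling mycpu() (pushcli) and restore them afterwards (popcli). The modified function is shown below.
  • Note: this ensures that reading and comparing the current CPU's LAPIC id cannot be disturbed by an interrupt. If other places call mycpu() directly without disabling interrupts, handle them the same way or make sure the caller has already disabled interrupts.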
    int
    getcpuid()
    {
      pushcli();          // disable interrupts (nestable)
      int id = cpuid();   // cpuid() calls mycpu(), now safe
      popcli();           // restore the previous interrupt state
      return id;
    }

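What audit is for

It monitors files, commands, network activity and more, and generates audit reports.

Installing and starting audit

Install the audit tools:
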
yum install audit

Once auditd is configured, start the service so that it collects audit information and stores it in the log files. Run the following command as root to start auditd:

service auditd start

Configure auditd to start at boot:

systemctl enable auditd

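You can temporarily disable auditd with the # auditctl -e 0 command and re-enable it with # auditctl -e 1.

Other actions can be performed on auditd with service auditd <action>, where <action> is one of:

  • stop: stop auditd.
  • restart: restart auditd.
  • reload or force-reload: reload the auditd configuration from /etc/audit/auditd.conf.
  • rotate: rotate the log files in the /var/log/audit/ directory.
  • resume: resume logging of audit events after it was suspended, for example when the disk partition holding the audit logs ran out of free space.
  • condrestart or try-restart: restart auditd only if it is already running.
  • status: show the running status of auditd.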

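Configuring rules

As an example, watch the /home/test_audit/ directory (a file works the same way) with the permission filter rwxa (r=read, w=write, x=execute, a=attribute change), and tag the rule with the key dushnda_watch:
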
auditctl -w /home/test_audit/ -p rwxa -k dushnda_watch

After configuring, list the rules:

[root@172 ~]# auditctl -l
-w /home/test_audit -p rwxa -k dushnda_watch
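
Once several rules are loaded, it is handy to list only those carrying a given key, for example before deleting them. The sketch below filters `auditctl -l` output for the `-k <key>` form shown above; rules that the listing prints in another form (such as `-F key=...`) would not match. The function name rules_for_key and the second sample rule are invented for illustration.

```shell
#!/usr/bin/env bash
# Print the rules from `auditctl -l` output (stdin) that carry a given key.
rules_for_key() {
  awk -v key="$1" '{
        for (i = 1; i < NF; i++)
          if ($i == "-k" && $(i + 1) == key) { print; break }
      }'
}

# Demo: the first rule is the one from this post, the second is a made-up example.
rules_for_key dushnda_watch <<'EOF'
-w /home/test_audit -p rwxa -k dushnda_watch
-w /etc/passwd -p wa -k identity
EOF
```

On a live system this would be run as root: `auditctl -l | rules_for_key dushnda_watch`.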

Next, change some permissions and add or modify files, then view the log with ausearch and the reports with aureport:

ausearch -i -k dushnda_watch

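Each type= entry here is a single record; see reference [1] for what each one means. The focus here is on file operations: this log shows that the file was opened with vim (the syscall record) and that the file's current permission mode is 644.

Removing a path watch
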
auditctl -W /home/test_audit -p rwxa -k dushnda_watch

Here auditctl -d deletes a rule that was added with auditctl -a, auditctl -W deletes a watch that was added with auditctl -w, and auditctl -D deletes all rules.

References

[1] https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/8/html/security_hardening/auditing-the-system_security-hardening#linux-audit_auditing-the-system

[2] https://deepinout.com/linux-cmd/linux-audit-system-related-cmd/linux-cmd-auditctl.html

Debugging with GDB

Makefile in the current directory:

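# Recipe lines must start with a tab character.

# Copy the unstripped kernel image (with symbols) for gdb
cpvmlinux:
	cp /home/dsd/Code/linux-5.18.10/vmlinux vmlinux

# Copy the bootable kernel image for QEMU
cpimage:
	cp /home/dsd/Code/linux-5.18.10/arch/x86/boot/bzImage ./bzImage

# Pack initramfs_dir into a gzip-compressed newc cpio archive
initramfs:
	cd ./initramfs_dir && find . -print0 | cpio -ov --null --format=newc | gzip -9 > ../initramfs.img

# Boot the kernel normally
run:
	qemu-system-x86_64 \
		-kernel bzImage \
		-initrd initramfs.img \
		-m 1G \
		-nographic \
		-append "earlyprintk=serial,ttyS0 console=ttyS0"

# Boot halted (-S) with a gdb stub on TCP port 9000; KASLR is disabled
debug:
	qemu-system-x86_64 \
		-kernel bzImage \
		-initrd initramfs.img \
		-m 1G \
		-nographic \
		-append "earlyprintk=serial,ttyS0 console=ttyS0 nokaslr" \
		-S \
		-gdb tcp::9000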

Create a .gdbinit file in this directory:

target remote :9000  
break start_kernel
continue
step
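
In the root/.gdbinit file, add the line add-auto-load-safe-path /home/dsd/Code/qemu_linux_x86_5.18_space/.gdbinit so that GDB is allowed to load the project-local file.

Then start GDB with either of the following commands:
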
gdb vmlinux
gdb-multiarch vmlinux --tui

Debugging with VS Code

Under WSL there can be permission problems; apply the fix a few directory levels further out as well.

# chown takes an owner and chmod takes a mode; the original command mixed the two
chown <usr> *
chmod 755 *

Generate the compilation database; this adds the file compile_commands.json to the root of the Linux source tree:

./scripts/clang-tools/gen_compile_commands.py

Configure .vscode/launch.json:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "qemu-kernel-gdb",
            "type": "cppdbg",
            "request": "launch",
            "miDebuggerServerAddress": "127.0.0.1:9000",
            "program": "${workspaceRoot}/vmlinux",
            "args": [],
            "stopAtEntry": false,
            "cwd": "${fileDirname}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                }
            ]
        }
    ]
}

Configure .vscode/c_cpp_properties.json:

{
    "configurations": [
        {
            "name": "Linux",
            "includePath": [
                "${workspaceFolder}/**"
            ],
            "defines": [],
            "compilerPath": "/usr/bin/gcc",
            "cStandard": "c11",
            "cppStandard": "gnu++14",
            "intelliSenseMode": "linux-gcc-x64",
            "compileCommands": "${workspaceFolder}/compile_commands.json"
        }
    ],
    "version": 4
}

Start debugging.

Appendix: launch.json template for debugging a QEMU virtual machine

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "name",
            "type": "cppdbg",
            "request": "launch",
            "program": "debug.elf", // path to vmlinux, the file being debugged <-- change this
            "args": [], // launch arguments; add your own if you need logging. gdbserver already has its arguments, so this can stay empty
            "stopAtEntry": true, // stops automatically at main; set to false if not needed
            "cwd": "${workspaceRoot}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb", // debugger mode
            "logging": {
                "moduleLoad": false,
                "engineLogging": false,
                "trace": false
            },
            "setupCommands": [ // commands sent to gdb at startup
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description": "set architecture aarch64",
                    "text": "set architecture aarch64",
                    "ignoreFailures": true
                }
            ],
            "miDebuggerPath": "gdb-multiarch", // toolchain gdb; for ARM targets use the multi-arch build
            "miDebuggerServerAddress": "localhost:1234" // remote gdb server address and port
        }
    ]
}