The _config.next.yml file

# Mermaid tag
mermaid:
  enable: true
  # Available themes: default | dark | forest | neutral
  theme:
    light: forest
    dark: dark

Examples

flowchart

A-- This is the text! ---B

sequenceDiagram
Alice->>John: Hello John, how are you?
John-->>Alice: Great!
Alice-)John: See you later!

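Symptom

Both NICs sit on the same subnet. When the server and our host (IP 192.168.0.2) ping each other, route priority means that one of the two NICs cannot reach the host.

Reverse path filtering

Think of this mechanism as a strict "gatekeeper". When the server receives a packet on a network interface (say eth1), the gatekeeper consults the routing table and asks: "If we had to reply to this packet, which interface would the reply leave through?"

Normal case: if the best outgoing interface is the one the packet arrived on (eth1), the check passes and the packet is accepted and processed.
Your case: the server has two IPs on the same LAN, and the gateway may be reachable through only one interface (or the system considers only one path to be the best). When a packet arrives on eth1 but the routing table says the best path for the reply is eth0, the gatekeeper rejects the packet and the ping fails. The key point is that the system checks whether the reverse path is the best path, not merely whether the source is reachable.

Diagnosis and fix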

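1. Check the current rp_filter setting

First, look at the current reverse path filtering settings. Run the following in a terminal:

cat /proc/sys/net/ipv4/conf/all/rp_filter
cat /proc/sys/net/ipv4/conf/eth0/rp_filter  # replace eth0 with your actual interface name
cat /proc/sys/net/ipv4/conf/eth1/rp_filter  # replace eth1 with your actual interface name

What the values mean:
- 0: source validation is off.
- 1: strict mode. The packet is dropped unless the reverse path is the best path, even if the source is reachable through another path. This is usually the root cause.
- 2: loose mode. The source address only has to be reachable through some interface; it does not have to be the best path. This is the safer choice in multipath environments.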
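One detail worth keeping in mind when reading these three files: for source validation on a given interface the kernel uses the larger of the `all` value and that interface's own value. The sketch below models that rule; the helper names and the `root` parameter are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path

MODES = {0: "off", 1: "strict", 2: "loose"}


def read_rp_filter(name, root="/proc/sys/net/ipv4/conf"):
    """Read one rp_filter value, e.g. name="all" or name="eth0"."""
    return int((Path(root) / name / "rp_filter").read_text().strip())


def effective_rp_filter(all_value, iface_value):
    """Model of the kernel rule: the larger of the two settings applies."""
    return max(all_value, iface_value)


def describe(all_value, iface_value):
    """Human-readable name of the mode that is actually enforced."""
    return MODES[effective_rp_filter(all_value, iface_value)]
```

For example, `describe(0, 1)` still reports strict mode, which is why setting only `all` to 0 may not be enough on an interface that keeps the value 1.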

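If these values are set to 1 (strict mode), that is very likely why one of your IPs is unreachable.

2. Disable reverse path filtering

Important: disabling this feature weakens the system's defence against IP spoofing. Make sure the server sits in a trusted internal network.

Method 1: temporary (lost on reboot)

Useful for quickly confirming the diagnosis. Run as root:

echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth0/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth1/rp_filter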

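Method 2: permanent (recommended)

Edit the system configuration file so that the setting survives reboots.

Edit /etc/sysctl.conf:
vi /etc/sysctl.conf

Add or change the following lines at the end of the file:

net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
# The kernel applies the larger of the "all" value and the per-interface value,
# so also set each existing interface explicitly if it still holds 1
# net.ipv4.conf.eth0.rp_filter = 0
# net.ipv4.conf.eth1.rp_filter = 0

Apply the configuration immediately:
sysctl -p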

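3. A compromise: loose mode

If the security risk of disabling the check entirely is a concern, the compromise is loose mode: set the value to 2. This permits asymmetric paths (the packet comes in through A and the reply leaves through B) while keeping basic source address validation. The procedure is the same as above; only the value in the configuration file changes to 2:

net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2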
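The difference between the three modes can be sketched with a toy routing table. This is a simplified model, not the kernel's FIB lookup: the route entries, metrics, and function names are hypothetical, and route selection is reduced to longest prefix first, then lowest metric.

```python
import ipaddress

# Hypothetical routing table: both NICs have a connected route to the same /24.
ROUTES = [
    {"dst": "192.168.0.0/24", "dev": "eth0", "metric": 100},
    {"dst": "192.168.0.0/24", "dev": "eth1", "metric": 101},
]


def best_dev(routes, addr):
    """Interface of the best route to addr, or None if addr is unreachable."""
    ip = ipaddress.ip_address(addr)
    matches = [r for r in routes if ip in ipaddress.ip_network(r["dst"])]
    if not matches:
        return None
    best = min(
        matches,
        key=lambda r: (-ipaddress.ip_network(r["dst"]).prefixlen, r["metric"]),
    )
    return best["dev"]


def accepts(routes, src, in_dev, mode):
    """Would a packet from src arriving on in_dev pass the reverse path check?"""
    if mode == 0:          # off: no source validation
        return True
    out_dev = best_dev(routes, src)
    if mode == 1:          # strict: reply must leave through the arrival interface
        return out_dev == in_dev
    if mode == 2:          # loose: source only has to be reachable somehow
        return out_dev is not None
    raise ValueError(f"unknown rp_filter mode: {mode}")
```

With this table, a ping from 192.168.0.2 that arrives on eth1 is rejected in strict mode, because the best reply route is eth0, and accepted in loose mode.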

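Other possibilities

Reverse path filtering is the most common cause, but if the problem persists after the change, check the following as well:

Firewall rules: check whether iptables or firewalld has a rule that blocks ICMP requests for a specific IP.
The routing table itself: use ip route show or route -n to confirm that the routes for both IP addresses are correct.
ARP issues: on a LAN, use arp -a or ip neighbour to check that the ARP entries are correct.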

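ARM-related abbreviations

Posted on 2025-12-20

Reposted from: https://zhuanlan.zhihu.com/p/1985799288740136627

A guide to ARM-related abbreviations (organised by category)
This post collects the abbreviations commonly seen in the ARM architecture, covering CPU / Core / AMBA / CoreSight / GIC / memory system / heterogeneous computing and acceleration / security / virtualization / SoC interconnect. It is meant as a long-term reference for people working on kernels, drivers, SoCs, performance analysis, and chip architecture.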

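1. ARM architecture and instruction sets (Architecture & ISA)

Abbreviation	Full name	Description
ARM	Advanced RISC Machines	The Arm company / umbrella name for the architecture
ISA	Instruction Set Architecture	Instruction set architecture
AArch32	ARM Architecture 32-bit	32-bit execution state, compatible with ARMv7
AArch64	ARM Architecture 64-bit	64-bit execution state of ARMv8 and later
ARMv7-A	ARM Architecture v7 Application	Classic 32-bit application processor architecture
ARMv8-A	ARM Architecture v8 Application	Introduced AArch64
ARMv9-A	ARM Architecture v9 Application	Introduced SVE2 / CCA
Thumb	Thumb Instruction Set	16-bit compressed instruction set
Thumb-2	Thumb-2 Instruction Set	Mixed 16/32-bit instructions
NEON	Advanced SIMD	ARM vector SIMD extension
SVE	Scalable Vector Extension	SIMD with a variable vector length
SVE2	SVE v2	Targets general-purpose computing and DSP
SME	Scalable Matrix Extension	ARMv9 matrix extension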

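2. ARM CPU core families (Processor Cores)
Cortex series

Abbreviation	Full name	Description
Cortex-A	Cortex Application	Application processors (Linux / Android)
Cortex-R	Cortex Real-time	Real-time processors
Cortex-M	Cortex Microcontroller	Microcontrollers

Common Cortex-A cores

Core	Characteristics	Typical use
A53	Little core, low power	Phones / embedded
A55	Successor to A53	big.LITTLE
A57	High performance	Servers / phones
A72	High IPC	Main core in SoCs
A73	Optimised energy efficiency	Mobile platforms
A75	ARMv8.2	DynamIQ
A76	High single-core performance	Flagship SoCs
A78	High energy efficiency	Newer-generation mobile
X1 / X2 / X3	Cortex-X	Maximum performance

Neoverse (servers)

Core	Positioning
N1	General-purpose server
N2	ARMv9 data centre
V1	SVE vector-optimised
E1	High-throughput edge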

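3. Exception levels and privilege model (Exception Levels)

Abbreviation	Full name	Description
EL0	Exception Level 0	User mode
EL1	Exception Level 1	Kernel mode
EL2	Exception Level 2	Hypervisor
EL3	Exception Level 3	Secure Monitor
PL	Privilege Level	Privilege level
SPSR	Saved Program Status Register	Holds the processor state saved on exception entry and restored on exception return
ESR	Exception Syndrome Register	Cause of the exception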

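4. Memory system (Memory System & MMU)

Abbreviation	Full name	Description
MMU	Memory Management Unit	Memory management unit
SMMU	System Memory Management Unit	System (I/O) memory management unit
MPU	Memory Protection Unit	Memory protection for systems without an MMU
TLB	Translation Lookaside Buffer	Address translation cache
ASID	Address Space ID	Identifies a process address space
VA	Virtual Address	Virtual address
PA	Physical Address	Physical address
IPA	Intermediate Physical Address	Intermediate address used under virtualization
TTBR	Translation Table Base Register	Page table base address
MAIR	Memory Attribute Indirection Register	Memory attributes
ATS	Address Translation Services	PCIe address translation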

5. AMBA buses and interconnect (Bus & Interconnect)
AMBA bus protocols

Abbreviation	Full name	Description
AMBA	Advanced Microcontroller Bus Architecture	ARM bus architecture family
AXI	Advanced eXtensible Interface	High-performance bus
AXI4	AXI version 4	Mainstream SoC bus
AXI-Lite	AXI Lite	Register access
AHB	Advanced High-performance Bus	Medium-speed bus
APB	Advanced Peripheral Bus	Low-speed peripherals
CHI	Coherent Hub Interface	Cache-coherent bus
ACE	AXI Coherency Extensions	Coherency extensions

Interconnect and coherency

Abbreviation	Full name	Description
CCN	Cache Coherent Network	Coherent network
CMN	Coherent Mesh Network	Mesh architecture
DVM	Distributed Virtual Memory	TLB coherency
Snoop	Cache Snoop	Cache probing

6. GIC interrupt controller (Generic Interrupt Controller)

Abbreviation	Full name	Description
GIC	Generic Interrupt Controller	ARM interrupt controller
GICv2	GIC version 2	Common on ARMv7
GICv3	GIC version 3	Mainstream on ARMv8
GICv4	GIC version 4	Direct injection of virtual interrupts
SGI	Software Generated Interrupt	Software interrupt
PPI	Private Peripheral Interrupt	Per-CPU private interrupt
SPI	Shared Peripheral Interrupt	Shared interrupt
ITS	Interrupt Translation Service	MSI mapping
LPI	Locality-specific Peripheral Interrupt	Scales to large numbers of interrupts

7. CoreSight debug and trace (Debug & Trace)

Abbreviation	Full name	Description
CoreSight	ARM Debug & Trace architecture	Debug infrastructure
JTAG	Joint Test Action Group	Hardware debug interface
SWD	Serial Wire Debug	Reduced-pin debug
ETM	Embedded Trace Macrocell	Instruction trace
PTM	Program Trace Macrocell	Program trace
STM	System Trace Macrocell	System event trace
ITM	Instrumentation Trace Macrocell	Software instrumentation
DWT	Data Watchpoint and Trace	Data monitoring
CTI	Cross Trigger Interface	Cross-module triggering
TPIU	Trace Port Interface Unit	Trace output

8. Security architecture (Security)

Abbreviation	Full name	Description
TrustZone	ARM TrustZone	Security isolation
Secure World	Secure world	TEE
Normal World	Normal world	REE
TEE	Trusted Execution Environment	Secure OS
OP-TEE	Open Portable TEE	Open-source TEE
CCA	Confidential Compute Architecture	ARMv9
RME	Realm Management Extension	Realm isolation

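9. Virtualization (Virtualization)

Abbreviation	Description
HYP	Hypervisor mode
VHE	Virtualization Host Extensions
Stage-2	Second-stage address translation
VGIC	Virtual GIC
VCPU	Virtual CPU
VMID	Virtual Machine ID

10. Common SoC / system-level abbreviations

Abbreviation	Description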
SoC	System on Chip
PMU	Performance Monitor Unit
DVFS	Dynamic Voltage and Frequency Scaling
QoS	Quality of Service
NoC	Network on Chip
SRAM	Static RAM
DRAM	Dynamic RAM
LPDDR	Low Power DDR

11. Software and ecosystem

Abbreviation	Description
PSCI	Power State Coordination Interface
SMC	Secure Monitor Call
HVC	Hypervisor Call
UEFI	Unified Extensible Firmware Interface
ACPI	Advanced Configuration and Power Interface
DT / DTS	Device Tree / Source

Overview

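An introduction to several performance analysis tools for Windows.

Tools

Perfmon (Performance Monitor): the performance monitoring tool that ships with Windows. It shows real-time data for system metrics such as CPU, memory, disk, and network, and can also save the data to a file for later analysis.
Sysinternals Suite: a collection of system tools provided by Microsoft, including many that are useful for performance monitoring, such as Process Monitor, DiskMon, and TCPView.
Windows Performance Toolkit: a set of advanced performance monitoring tools for performance analysis and troubleshooting, including xperf, WPR, and WPA.
PerfView: an open-source CPU and memory performance analysis tool that also includes .NET-specific analysis such as GC analysis, JIT analysis, and even request statistics for ASP.NET.
System Informer: a system resource monitoring tool that supports Windows 10 and Windows 11.

Tool overview

Perfmon

Ships with Windows. For detailed usage, see the official documentation.

Sysinternals

A tool set that is ready to use once downloaded from the official site. It includes commonly used tools such as Process Monitor.

Windows Performance Toolkit

This is the collection of performance tools for Windows; see the official site. The Windows Performance Toolkit consists of two independent tools: Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA). The older command-line tool Xperf is still supported, but Xperfview is not. All recordings must be opened and analysed with WPA.

Collect the data with WPR, then analyse it with WPA.

PerfView

PerfView download page. PerfView is an open-source CPU and memory performance analysis tool that also includes .NET-specific analysis such as GC analysis, JIT analysis, and even request statistics for ASP.NET. PerfView is a Windows application, but it can also analyse data collected on Linux (see the reference). It needs no installation and is a single .exe of about 14 MB, so it is easy to deploy on the machine under investigation, for example a production server. Collecting performance data does not require restarting the application or the server, and the collected logs (.etl files) can be copied to another Windows machine for analysis, so the impact on the business is very small.

System Informer

A system resource monitoring tool that supports Windows 10 and Windows 11; see the official site.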

Learning TCP: setting up the ns-3 simulation environment

Posted on 2025-12-15

Reference: https://github.com/ituring/tcp-book/?tab=readme-ov-file

Prerequisite downloads

VirtualBox (the latest version is fine)

https://www.virtualbox.org/

VirtualBox Graphical User Interface
Version 7.0.22 r165102 (Qt5.15.2)
Copyright © 2024 Oracle and/or its affiliates

Vagrant (the latest version is fine)

https://developer.hashicorp.com/vagrant

PS C:\Users\dushenda> vagrant.exe -v
Vagrant 2.4.9

PowerShell

https://learn.microsoft.com/en-us/powershell/scripting/install/install-powershell-on-windows?view=powershell-7.5

PS C:\Users\dushenda> pwsh.exe -v
PowerShell 7.1.4

X server

Download and install VcXsrv:

Go to https://sourceforge.net/projects/vcxsrv/
Download and install VcXsrv Windows X Server

Configure VcXsrv:

Start "XLaunch"
Select "Multiple windows"
Set Display number to "0" or "-1" (automatic)
Select "Start no client"
Under "Extra settings", tick "Disable access control" (important!)
Finish the wizard; VcXsrv then runs in the system tray

Set the PowerShell environment variable

# Set DISPLAY before starting Vagrant
$env:DISPLAY = "localhost:0.0"

# Or try
$env:DISPLAY = "127.0.0.1:0.0"

Modify the Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.define "guest1" do |guest1|
    guest1.vm.box = "ubuntu/xenial64"
    
    # Key settings: enable X11 forwarding
    guest1.ssh.forward_x11 = true
    guest1.ssh.forward_agent = true
    guest1.ssh.forward_x11_trusted = true
    guest1.ssh.keep_alive = true
    
    # Network configuration
    guest1.vm.network "private_network", type: "dhcp"
  end
end

Install the Wireshark VM

Set up the environment

$ git clone https://github.com/ituring/tcp-book.git
$ cd tcp-book/wireshark/vagrant
$ vagrant up

Log in to the VM

Use the following command to SSH into the guest OS. After the login message is shown, the prompt changes to vagrant@guest1:~$.

Run it from PowerShell:

$ vagrant ssh guest1

> Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-139-generic x86_64)
>
> * Documentation:  https://help.ubuntu.com
> * Management:     https://landscape.canonical.com
> * Support:        https://ubuntu.com/advantage
>
> Get cloud support with Ubuntu Advantage Cloud Guest:
> http://www.ubuntu.com/business/services/cloud
>
> 0 packages can be updated.
> 0 updates are security updates.
>
> New release '18.04.1 LTS' available.
> Run 'do-release-upgrade' to upgrade to it.

vagrant@guest1:~$

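Start Wireshark.

vagrant@guest1:~$ wireshark

Install the ns-3 VM

Once VirtualBox and Vagrant are ready, clone this GitHub repository into any directory, open its ns3/vagrant directory, and run vagrant up. This installs Ubuntu 16.04 in a virtual machine and builds ns-3 on it. As of April 1, 2019, the CUBIC and BBR modules used in Chapters 5 and 6 do not support ns-3.28 or later, so the book uses ns-3.27. Building the ns-3 environment takes quite a while, so please be patient.

$ git clone https://github.com/ituring/tcp-book.git
$ cd tcp-book/ns3/vagrant
$ vagrant up

Then run vagrant ssh from PowerShell.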

Linux memory subsystem: Page Cache

Posted on 2025-12-13

What is Page Cache

$ cat /proc/meminfo
...
Buffers:            1224 kB
Cached:           111472 kB
SwapCached:        36364 kB
Active:          6224232 kB
Inactive:         979432 kB
Active(anon):    6173036 kB
Inactive(anon):   927932 kB
Active(file):      51196 kB
Inactive(file):    51500 kB
...
Shmem:             10000 kB
...
SReclaimable:      43532 kB
...

Page Cache = Buffers + Cached + SwapCached = Active(file) + Inactive(file) + Shmem + SwapCached
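As a quick sanity check, the two sides of this identity can be computed from the sample output above. The sketch below embeds only the fields shown in that sample; the parser and its names are illustrative assumptions, and on a live system the two sums usually agree only approximately.

```python
SAMPLE_MEMINFO = """\
Buffers:            1224 kB
Cached:           111472 kB
SwapCached:        36364 kB
Active(file):      51196 kB
Inactive(file):    51500 kB
...
Shmem:             10000 kB
"""


def parse_meminfo(text):
    """Return {field name: value in kB} for lines shaped like 'Name:  123 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def page_cache_by_type(m):
    return m["Buffers"] + m["Cached"] + m["SwapCached"]


def page_cache_by_lru(m):
    return m["Active(file)"] + m["Inactive(file)"] + m["Shmem"] + m["SwapCached"]


meminfo = parse_meminfo(SAMPLE_MEMINFO)
print(page_cache_by_type(meminfo), page_cache_by_lru(meminfo))
```

For this sample both sums come to 149060 kB.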

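The fields of /proc/meminfo are documented at https://www.kernel.org/doc/Documentation/filesystems/proc.rst

Within Page Cache, Active(file) + Inactive(file) are the file-backed pages (memory pages that correspond to a file). This is the part that deserves the most attention, because the memory consumed by mmap() mappings and by buffered I/O falls into it.

SwapCached appears once a swap partition is enabled: anonymous pages from Inactive(anon) + Active(anon) are swapped out to disk and later swapped back in, and the memory allocated when they are read back is SwapCached. Because the copy in the swap file still exists after the page has been read back into memory, SwapCached can also be regarded as file-backed and therefore as part of Page Cache. The purpose, again, is to reduce I/O.

SwapCached exists only when swap is enabled. Disabling swap is recommended in production, because the I/O produced by swapping easily causes performance jitter.

Now for the output of free:

buff/cache = Buffers + Cached + SReclaimable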

$ free -k
              total        used        free      shared  buff/cache   available
Mem:        7926580     7277960      492392       10000      156228      430680
Swap:       8224764      380748     7844016
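The relation between free and /proc/meminfo can be checked against the two samples shown in this post. The sketch below is illustrative: the parser is a hypothetical helper, and the three meminfo values are copied from the earlier /proc/meminfo sample.

```python
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:        7926580     7277960      492392       10000      156228      430680
Swap:       8224764      380748     7844016
"""

# Values in kB, copied from the /proc/meminfo sample earlier in this post.
BUFFERS, CACHED, SRECLAIMABLE = 1224, 111472, 43532


def parse_free_mem(text):
    """Map the column headers of `free` to the values on the Mem: line."""
    lines = text.splitlines()
    headers = lines[0].split()
    values = [int(v) for v in lines[1].split()[1:]]  # drop the "Mem:" label
    return dict(zip(headers, values))


mem = parse_free_mem(FREE_OUTPUT)
print(mem["buff/cache"], BUFFERS + CACHED + SRECLAIMABLE)
```

Both numbers are 156228 kB for this sample, and the shared column (10000) equals the Shmem field.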

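Metric	Source (see `/proc/meminfo`)	Purpose
`Buffers`	`Buffers` value	Temporarily holds raw disk block data to speed up direct reads and writes to the disk (block device).
`Cached`	`Cached` value	Caches data read from files (the page cache), so that later accesses to the same files are served from memory instead of disk.
`SReclaimable`	`SReclaimable` value	The reclaimable part of the Slab allocator: memory used by kernel object caches (dentries, inodes, and so on) that can be reclaimed.

In practice, tools such as vmstat show how these change. For example, when dd writes a large amount of data directly to a disk partition (dd if=/dev/urandom of=/dev/sdb1 …), Buffers grows noticeably, because this operates on raw disk blocks. When reading or writing ordinary files (dd if=/dev/urandom of=/tmp/file …), Cached grows instead, because the data is cached by the file system.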

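Why Page Cache exists

First, the two ways to avoid using Page Cache:

Option 1: the application maintains its own cache for finer-grained control. MySQL does this; see the MySQL Buffer Pool.
Option 2: use Direct I/O to bypass Page Cache.

The point of Page Cache is speed: with standard I/O and memory mapping, data is written to Page Cache first, which reduces the number of I/O operations and improves read and write efficiency.

Let's look at a concrete example. First generate a new 1 GiB file, then clear Page Cache so that the file contents are not in memory, and compare the time taken by the first and the second read. The steps are as follows.

dd if=/dev/zero of=./dd.out bs=4096 count=$((1024*256))

To clear Page Cache, run sync first to flush dirty pages to disk, then drop the cache.

sync && echo 3 > /proc/sys/vm/drop_caches

Read the file twice and time each read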

$ time cat /home/yafang/test/dd.out &> /dev/null
real  0m5.733s
user  0m0.003s
sys  0m0.213s

$ time cat /home/yafang/test/dd.out &> /dev/null 
real  0m0.132s
user  0m0.001s
sys  0m0.130s

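As this walkthrough shows, the second read takes far less time than the first. The first read fetches the contents from disk, and disk I/O is slow; by the second read the contents are already in memory from the first read, so the data comes straight from memory, which is much faster than disk. This is why Page Cache exists: it reduces I/O and speeds up the application's I/O.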
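The same comparison can be scripted. The sketch below times two sequential reads of a file; the function names and the temporary-file helper are illustrative assumptions. It does not drop the cache itself, so to see a cold first read on a real system you would run `sync && echo 3 > /proc/sys/vm/drop_caches` as root beforehand.

```python
import os
import tempfile
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def read_twice(path):
    """First read may hit the disk; the second is normally served from Page Cache."""
    first = timed_read(path)
    second = timed_read(path)
    return first, second


def make_test_file(size_bytes):
    """Create a zero-filled temporary file (hypothetical helper for the demo)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size_bytes)
    return path
```

In the measurement above, the ratio between the two runs is 5.733 s / 0.132 s, roughly a 43-fold difference.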

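How Page Cache works

How Page Cache is created

Page Cache is created in two different ways:

Buffered I/O (standard I/O);
Memory-Mapped I/O.

Both buffered I/O and mmap create Page Cache, but they differ. With standard I/O, a write (write(2)) first fills a user buffer (memory backed by userspace pages), and the data is then copied from the user buffer into the kernel buffer (memory backed by Page Cache pages). A read (read(2)) first copies from the kernel buffer into the user buffer, and the application then reads from the user buffer. In other words, there is no mapping between the buffer and the file contents.

With memory-mapped I/O, the Page Cache pages are mapped directly into the user address space, and the user reads and writes the contents of those pages directly.

Memory-mapped I/O is therefore somewhat more efficient than standard I/O, since it avoids the copy between user space and kernel space. This is the main reason many application developers find that memory-mapped I/O performs better than standard I/O.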
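The two access paths look like this from an application. This is a minimal sketch with hypothetical function names; note that in Python, slicing a mapping still copies the bytes out into a new object, so the example illustrates the two APIs rather than the copy savings themselves.

```python
import mmap
import os
import tempfile


def read_buffered(path, offset, length):
    """Standard I/O: read(2) copies data from Page Cache into a user buffer."""
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        return f.read(length)


def read_mapped(path, offset, length):
    """Memory-mapped I/O: the file's Page Cache pages are mapped into the process."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return mapping[offset:offset + length]
```

Both functions return the same bytes for the same range; the difference lies in how the kernel makes the data available to the process.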

To show how Page Cache is "born", standard I/O makes the best example, since it is the path we use most often.

Page Cache reclaim

Linux I/O performance: the blktrace tools

Posted on 2025-12-11

Quick analysis

Command sequence

# Install the tools
yum install blktrace -y
# Collect data (trace for 30 seconds)
blktrace -d /dev/nvme0n1 -w 30
# Merge the per-CPU data
blkparse -i nvme0n1 -d nvme_trace.bin
# Analyse; focus on the "ALL" statistics and the "Device Overhead" breakdown
btt -i nvme_trace.bin

# Combined into one pipeline
blktrace -d /dev/sdf -w 5 -o - | blkparse -i - -d btt.bin > blkparse.txt
btt -i btt.bin

Using btrace

# Trace /dev/sda and print blkparse output
btrace /dev/sda

btrace is in fact a script that combines blktrace and blkparse; it lives at /usr/bin/btrace.

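Theory

An I/O request travels from the application layer down to the underlying block device. Simplifying that path, once an I/O request enters the block layer it may go through the following steps:

Remap: it may be remapped to another device by DM (Device Mapper) or MD (Multiple Device, software RAID)
Split: it may be split into several physical I/Os because it is not aligned to sector boundaries or because it is too large
Merge: it may be merged with another I/O request that is physically adjacent
It is sent to the driver by the I/O scheduler according to the scheduling policy
The driver submits it to the hardware; it passes through the HBA, the cable (fibre, network cable, and so on), and the switch (SAN or network) before reaching the storage device, which completes the request and sends the result back.

blktrace records each of the steps an I/O goes through. Each line of its output has the following fields:

Field 1: 8,0 is the device number (major and minor device ID).
Field 2: 3 is the CPU.
Field 3: 11 is the sequence number.
Field 4: 0.009507758 is the time stamp, an offset from the start of the trace.
Field 5: PID, the ID of the process this I/O belongs to.
Field 6: Event. This field is very important: it shows which step the I/O has reached.
Field 7: R means Read, W means Write, D means Discard, B means Barrier operation.
Field 8: 223490+56 is the starting block number and the number of blocks, what we usually call offset and size.
Field 9: the process name.

The blkparse output contains detailed information about every I/O request event. Understanding these fields is the key to the analysis:

Field	Meaning
Device number (major, minor)	For example, `8,0` usually refers to `/dev/sda`
CPU ID	Number of the CPU core that handled the event
Sequence number	Sequence number of the event
Time stamp	When the event happened (usually a relative time)
Process ID (PID)	ID of the process that issued the I/O
Event type (Action)	The core field: the stage the I/O request is in
RWBS descriptor	Describes the I/O type: R (read) / W (write) / B (barrier) / S (synchronous)
Sector information	For example `2048 + 8`: starting sector number and number of consecutive sectors
Process name	Name of the process that issued the I/O

The event type (Action) is the key to understanding the I/O path. It records each stage of a request from creation to completion:

Q (Queued): the I/O request enters the block layer.
G (Get request): a request structure is allocated.
I (Inserted): the request is inserted into the I/O scheduler queue.
D (Issued): the request is issued to the device driver.
C (Completed): the request completes.

From the time stamps of these events, the time spent in each stage of the I/O path can be computed, for example:

D2C: time the request spends in the driver and the hardware; the key indicator of hardware performance.
I2D: time the request waits in the I/O scheduler queue; reflects scheduler performance.
Q2C: total time of the I/O request; approximately the await value in iostat.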
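The stage arithmetic is just a subtraction of time stamps. Here is a minimal sketch for a single request; the time stamps are made up for illustration, and `Decimal` is used so that nanosecond-resolution values subtract exactly.

```python
from decimal import Decimal

# Stage name -> (start action, end action)
STAGES = {
    "Q2G": ("Q", "G"),
    "G2I": ("G", "I"),
    "I2D": ("I", "D"),
    "D2C": ("D", "C"),
    "Q2C": ("Q", "C"),
}


def stage_latencies(events):
    """events maps an action letter to its time stamp in seconds (Decimal)."""
    return {
        name: events[end] - events[start]
        for name, (start, end) in STAGES.items()
        if start in events and end in events
    }


# Hypothetical time stamps for one request, in seconds since trace start.
example = {
    "Q": Decimal("0.000100"),
    "G": Decimal("0.000102"),
    "I": Decimal("0.000105"),
    "D": Decimal("0.000905"),
    "C": Decimal("0.001705"),
}
latencies = stage_latencies(example)
print(latencies)
```

For a request that is neither merged nor split, Q2G + G2I + I2D + D2C adds up to Q2C, as it does for these sample values.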

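The sixth field is especially useful: each letter stands for a stage the I/O request has gone through.

Q: an I/O request is about to be generated
|
G: the I/O request is generated
|
I: the I/O request enters the I/O scheduler queue
|
D: the I/O request enters the driver
|
C: the I/O request has completed

Note that the I/O path is divided into many segments, and a time stamp is recorded at the start of each one. From the start times of two consecutive segments we get the time spent in each segment of the I/O path.

The service time we care so much about, the metric that reflects the processing capability of the block device, is the time from D to C, or D2C for short.

The await value in iostat output is the time from generating the request to its completion, that is from Q to C, or Q2C for short.

Linux has an I/O scheduler, and I2D is the important indicator of how efficient the scheduler is.

This is only part of what blktrace outputs. We also get the offset and size. From the offsets we can tell which blocks of the device the application accessed during a period of time, and plot the access pattern of the block device.

From the size and the seventh field (read or write) we can build a histogram of I/O sizes. In this post, blktrace is used to obtain exactly this information.

Using the tools

Next, a brief introduction to using these tools. All three commands belong to the blktrace package; they are one family.

The following command shows live activity on a disk:

blktrace -d /dev/sdb -o - | blkparse -i -

It keeps printing output until you press Ctrl+C.

Alternatively, collect the data first and analyse everything once collection has finished. The collection command is:

blktrace -d /dev/sdb

Note that this command does not write a single file. It writes one file per CPU, as shown below: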

-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.0
-rw-r--r-- 1 manu manu  823K Jul  6 19:58 sdb.blktrace.1
-rw-r--r-- 1 manu manu  2.8M Jul  6 19:58 sdb.blktrace.10
-rw-r--r-- 1 manu manu  1.9M Jul  6 19:58 sdb.blktrace.11
-rw-r--r-- 1 manu manu  474K Jul  6 19:58 sdb.blktrace.12
-rw-r--r-- 1 manu manu  271K Jul  6 19:58 sdb.blktrace.13
-rw-r--r-- 1 manu manu  578K Jul  6 19:58 sdb.blktrace.14
-rw-r--r-- 1 manu manu  375K Jul  6 19:58 sdb.blktrace.15
-rw-r--r-- 1 manu manu  382K Jul  6 19:58 sdb.blktrace.16
-rw-r--r-- 1 manu manu  478K Jul  6 19:58 sdb.blktrace.17
-rw-r--r-- 1 manu manu  839K Jul  6 19:58 sdb.blktrace.18
-rw-r--r-- 1 manu manu  848K Jul  6 19:58 sdb.blktrace.19
-rw-r--r-- 1 manu manu  1.6M Jul  6 19:58 sdb.blktrace.2
-rw-r--r-- 1 manu manu  652K Jul  6 19:58 sdb.blktrace.20
-rw-r--r-- 1 manu manu  738K Jul  6 19:58 sdb.blktrace.21
-rw-r--r-- 1 manu manu  594K Jul  6 19:58 sdb.blktrace.22
-rw-r--r-- 1 manu manu  527K Jul  6 19:58 sdb.blktrace.23
-rw-r--r-- 1 manu manu 1005K Jul  6 19:58 sdb.blktrace.3
-rw-r--r-- 1 manu manu  1.2M Jul  6 19:58 sdb.blktrace.4
-rw-r--r-- 1 manu manu  511K Jul  6 19:58 sdb.blktrace.5
-rw-r--r-- 1 manu manu  2.3M Jul  6 19:58 sdb.blktrace.6
-rw-r--r-- 1 manu manu  1.3M Jul  6 19:58 sdb.blktrace.7
-rw-r--r-- 1 manu manu  2.1M Jul  6 19:58 sdb.blktrace.8
-rw-r--r-- 1 manu manu  1.1M Jul  6 19:58 sdb.blktrace.9

With the output in place, analyse the collected data with blkparse -i sdb:

8,16   7     2147     0.999400390 630169  I   W 447379872 + 8 [kworker/u482:0]
8,16   7     2148     0.999400653 630169  I   W 447380040 + 8 [kworker/u482:0]
8,16   7     2149     0.999401057 630169  I   W 447380088 + 16 [kworker/u482:0]
8,16   7     2150     0.999401364 630169  I   W 447380176 + 8 [kworker/u482:0]
8,16   7     2151     0.999401521 630169  I   W 453543312 + 8 [kworker/u482:0]
8,16   7     2152     0.999401843 630169  I   W 453543328 + 8 [kworker/u482:0]
8,16   7     2153     0.999402195 630169  U   N [kworker/u482:0] 14
8,16   6     5648     0.999403047 16921  C   W 347875880 + 8 [0]
8,16   6     5649     0.999406293 16921  D   W 301856632 + 8 [ceph-osd]
8,16   6     5650     0.999421040 16921  C   W 354834456 + 8 [0]
8,16   6     5651     0.999423900 16921  D   W 301857280 + 8 [ceph-osd]
8,16   7     2154     0.999442195 630169  A   W 425409840 + 8 <- (8,22) 131806512
8,16   7     2155     0.999442601 630169  Q   W 425409840 + 8 [kworker/u482:0]
8,16   7     2156     0.999444277 630169  G   W 425409840 + 8 [kworker/u482:0]
8,16   7     2157     0.999445177 630169  P   N [kworker/u482:0]
8,16   7     2158     0.999446341 630169  I   W 425409840 + 8 [kworker/u482:0]
8,16   7     2159     0.999446773 630169 UT   N [kworker/u482:0] 1
8,16   6     5652     0.999452685 16921  C   W 354834520 + 8 [0]
8,16   6     5653     0.999455613 16921  D   W 301857336 + 8 [ceph-osd]
8,16   6     5654     0.999470425 16921  C   W 393228176 + 8 [0]
8,16   6     5655     0.999474127 16921  D   W 411554968 + 8 [ceph-osd]
8,16   6     5656     0.999488551 16921  C   W 393228560 + 8 [0]
8,16   6     5657     0.999491549 16921  D   W 411556112 + 8 [ceph-osd]
8,16   6     5658     0.999594849 16923  C   W 393230152 + 16 [0]
8,16   6     5659     0.999604038 16923  D   W 432877368 + 8 [ceph-osd]
8,16   6     5660     0.999610322 16923  C   W 487390128 + 8 [0]
8,16   6     5661     0.999614654 16923  D   W 432879632 + 8 [ceph-osd]
8,16   6     5662     0.999628284 16923  C   W 487391344 + 8 [0]
8,16   6     5663     0.999632014 16923  D   W 432879680 + 8 [ceph-osd]
8,16   6     5664     0.999646122 16923  C   W 293759504 + 8 [0]
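Lines in this format are easy to pick apart with a script. The sketch below parses a few lines copied from the output above and computes Q2G and G2I for the request at sector 425409840. The parser is an illustrative assumption that only handles the default blkparse layout shown here, and the byte size assumes 512-byte sectors.

```python
from decimal import Decimal

SAMPLE = """\
8,16   7     2154     0.999442195 630169  A   W 425409840 + 8 <- (8,22) 131806512
8,16   7     2155     0.999442601 630169  Q   W 425409840 + 8 [kworker/u482:0]
8,16   7     2156     0.999444277 630169  G   W 425409840 + 8 [kworker/u482:0]
8,16   7     2157     0.999445177 630169  P   N [kworker/u482:0]
8,16   7     2158     0.999446341 630169  I   W 425409840 + 8 [kworker/u482:0]
"""


def parse_line(line):
    """Split one blkparse line into its leading fields."""
    tok = line.split()
    event = {
        "dev": tok[0], "cpu": int(tok[1]), "seq": int(tok[2]),
        "time": Decimal(tok[3]), "pid": int(tok[4]),
        "action": tok[5], "rwbs": tok[6],
        "sector": None, "nsectors": None,
    }
    # Only events that carry "<sector> + <count>" have sector information.
    if len(tok) >= 10 and tok[8] == "+":
        event["sector"] = int(tok[7])
        event["nsectors"] = int(tok[9])
    return event


def timeline(text, sector):
    """Map action letter -> time stamp for all events touching one start sector."""
    events = (parse_line(line) for line in text.splitlines())
    return {e["action"]: e["time"] for e in events if e["sector"] == sector}


t = timeline(SAMPLE, 425409840)
print("Q2G:", t["G"] - t["Q"], "G2I:", t["I"] - t["G"])
```

For this request, Q2G is 0.000001676 s and G2I is 0.000002064 s, and its 8 sectors correspond to 4096 bytes.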

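Note that blkparse only converts the blktrace output into a form people can read. There is still too much of it, and it is too noisy, to pick out the key information by eye. This is where btt comes in: it analyses the data collected by blktrace and produces information that is far more useful. btt is in fact our final destination.

Getting the latency of each stage

btt generates these statistics readily, and the following table is easy to obtain:

Stage	Full name and meaning	How to read the metric
Q2Q	Queue to Queue: the interval between one I/O arriving at the block layer and the next one arriving; it reflects the arrival rate of I/O requests.	A short interval means a heavy, dense I/O load; a long interval means a light load.
Q2G	Queue to Get: time an I/O request waits, after entering the block layer, for the system to allocate a `request` structure for it.	Usually extremely short. A long time may indicate memory pressure or a bottleneck in the kernel when allocating data structures.
G2I	Get to Insert: time taken, after the `request` structure is obtained, to prepare it and insert it into the I/O scheduler queue.	Also very short. It reflects the initial overhead of the I/O scheduler handling a request.
I2D	Insert to Issue: time a request spends waiting in the I/O scheduler queue and being scheduled (merged, sorted) before it is issued to the device driver.	The key indicator for the I/O scheduler and the OS-level load. A long I2D means: 1. the storage device cannot keep up with the request rate and many requests are queued; 2. the I/O scheduling policy may not suit the current load.
D2C	Issue to Complete: time a request takes on the physical hardware after being issued to the device driver (including time in the device's own cache and on the medium).	The most direct indicator of a hardware bottleneck. A long D2C usually means: 1. the storage device itself is too slow (for example, random I/O on a spinning disk); 2. the device may be heavily loaded or faulty.
Q2C	Queue to Complete: total time an I/O request spends in the block layer. It is approximately `Q2I + I2D + D2C` (Q2I can be split further into Q2G + G2I).	Roughly equivalent to the `await` value printed by `iostat`. It reflects the total latency of an I/O request from entering the system to completion.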

The procedure is as follows.

First, blkparse merges the per-CPU files into a single file:

blkparse -i sdb -d sdb.blktrace.bin

btt can then analyse sdb.blktrace.bin:

==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.000159747   0.025292639       62150
Q2G               0.000000233   0.000001380   0.000056343       52423
G2I               0.000000146   0.000027084   0.005031317       48516
Q2M               0.000000142   0.000000751   0.000021613        9728
I2D               0.000000096   0.001534463   0.022469688       52423
M2D               0.000000647   0.002617691   0.022445412        5821
D2C               0.000046189   0.000779355   0.007860766       62151
Q2C               0.000051089   0.002522832   0.026096657       62151

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 16) |   0.0461%   0.8380%   0.0047%  51.3029%  30.8921%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0461%   0.8380%   0.0047%  51.3029%  30.8921%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8, 16) |    62151    52246     1.2 |        1       20      664  1051700

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 16) |           62151      42079658.0               0 | 0(17159)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           62151      42079658.0               0 | 0(17159)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8, 16) |           52246      39892356.2               0 | 0(9249)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |           52246      39892356.2               0 | 0(9249)

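Note: of D2C and Q2C, one is the key indicator of block-device performance and the other is the time from the client issuing a request to receiving the response. We can see that

D2C averages 0.000779355 s, about 0.78 ms, and Q2C averages 0.002522832 s, about 2.5 ms.

Both the service time and the await time perceived by the client are very short, which is an excellent result. However, D2C accounts for only about 30% of Q2C; more than 51% of the time is spent in I2D.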
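The percentages can be reproduced from the "ALL" table. The sketch below assumes that each stage's share is its total time (average × count) divided by the total Q2C time; that weighting is an inference that happens to match the Device Overhead row above, not something taken from the btt documentation.

```python
# (average seconds, count) per stage, copied from the "ALL" table above.
BTT_ALL = {
    "Q2G": (0.000001380, 52423),
    "G2I": (0.000027084, 48516),
    "Q2M": (0.000000751, 9728),
    "I2D": (0.001534463, 52423),
    "D2C": (0.000779355, 62151),
    "Q2C": (0.002522832, 62151),
}


def overhead_percent(stage, stats=BTT_ALL):
    """Share of total Q2C time spent in one stage, in percent."""
    avg, count = stats[stage]
    q2c_avg, q2c_count = stats["Q2C"]
    return 100.0 * (avg * count) / (q2c_avg * q2c_count)


for stage in ("Q2G", "G2I", "Q2M", "I2D", "D2C"):
    print(f"{stage}: {overhead_percent(stage):.4f}%")
```

Under this assumption I2D comes out at about 51.30% and D2C at about 30.89%, in line with the btt report.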

References / sources

https://bean-li.github.io/blktrace-to-report/

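Linux: capturing network packets

Posted on 2025-12-10

wireshark

Remote capture

The two approaches below capture on a remote machine and analyse in Wireshark. They are not recommended during performance analysis but are fine for connectivity checks (the SSH transport may drop packets, among other issues).

ssh + tcpdump, parsed by Wireshark

Capture and parse directly

ssh <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' | wireshark -k -i -

Through a jump host

ssh -J <jumpserver ip> <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' | wireshark -k -i -

Capture and save directly

ssh <ip> 'tcpdump -ni any -s0 -U -w - udp port 53' > /tmp/packets.pcap

tcpdump

Common capture commands

Filter on TCP flags, for example capture SYN packets to see whether the handshake succeeds

tcpdump -ni any 'tcp[tcpflags] == tcp-syn'

Check the TCP state. The command below lists connections that have sent a SYN; if no reply ever arrives, retrans/s in sar -n ETCP 1 keeps rising

watch -n 0.1 'ss -tpn state syn-sent'

Capture SYN and SYN+ACK packets

tcpdump -ni any 'tcp[tcpflags] == tcp-syn or tcp[13]=18'

Capture SYN packets and any packet with the RST flag set

tcpdump -ni any 'tcp[tcpflags] == tcp-syn or tcp[13] & 4!=0'

Capture SYN packets and ICMP destination-unreachable messages

tcpdump -ni any 'tcp[tcpflags] == tcp-syn or icmp[0] = 3'

Capture SYN, SYN+ACK, RST, and ICMP destination-unreachable packets (tcp[13] is the byte at offset 13 in the TCP header, which holds the flags)

tcpdump -ni any '(tcp[tcpflags] == tcp-syn or tcp[13]=18) or tcp[13] & 4!=0 or icmp[0] = 3'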
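The numbers 18 and 4 in these filters are bit masks over the TCP flags byte. The sketch below builds a minimal 20-byte TCP header and decodes byte 13; the port numbers and other field values are arbitrary placeholders, and the helper names are illustrative.

```python
import struct

# Bit values of the flags in byte 13 of the TCP header.
TCP_FLAGS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}


def decode_flags(flags_byte):
    """Names of the flags set in the byte at tcp[13]."""
    return [name for name, bit in TCP_FLAGS.items() if flags_byte & bit]


def build_tcp_header(flags_byte, sport=12345, dport=80):
    """Minimal TCP header without options (placeholder values for other fields)."""
    data_offset = 5 << 4  # header length of 5 32-bit words, stored in the high nibble
    return struct.pack("!HHIIBBHHH", sport, dport, 0, 0,
                       data_offset, flags_byte, 65535, 0, 0)


header = build_tcp_header(TCP_FLAGS["SYN"] | TCP_FLAGS["ACK"])
print(header[13], decode_flags(header[13]))
```

SYN (0x02) plus ACK (0x10) gives 18, which is what `tcp[13]=18` matches exactly, while `tcp[13] & 4 != 0` matches any packet whose RST bit is set, whatever the other flags are.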

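TCP flag capture examples

tcpdump 'tcp[tcpflags] == tcp-syn'
tcpdump 'tcp[tcpflags] == tcp-rst'
tcpdump 'tcp[tcpflags] == tcp-fin'

Capture by packet size

tcpdump less 32
tcpdump greater 32
tcpdump 'len <= 102'

Continuously capture traffic on port 8080, with each file at most 100 MB and at most 10 files kept; once they are full, the oldest file is overwritten

tcpdump -i any -C 100 -W 10 -w my_capture.pcap port 8080

Capture 10000 packets sent to port 80

tcpdump -i any -c 10000 -w http_requests.pcap dst port 80

Common options

Option	Description	Example
`-i <interface>`	Interface to capture on. `any` means all interfaces.	`-i any`
`port <port>`	Filter traffic on a specific port (TCP/UDP).	`port 8080`
`-C <size>`	Split the capture by file size. The unit is normally MB (M).	`-C 100`
`-W <count>`	Limit the total number of files; together with `-C` this gives a rotating buffer.	`-W 10`
`-w <file>`	Write the raw captured packets to the given file.	`-w my_capture.pcap`
`-c <count>`	Exit automatically after capturing the given number of packets.	`-c 10000`
`-s <length>`	Set the capture length of each packet (snapshot length in bytes); `-s 0` captures complete packets. (A small snapshot length is handy for long-running captures where only the headers need to be analysed.)	`-s 0`

Other options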

Capture Commands

Command	Example usage	Explanation
`-i any`	`tcpdump -i any`	Capture from all interfaces; may require superuser (`sudo/su`)
`-i eth0`	`tcpdump -i eth0`	Capture from the interface `eth0`
`-c count`	`tcpdump -i eth0 -c 5`	Exit after receiving `count (5)` packets
`-r captures.pcap`	`tcpdump -r captures.pcap`	Read and analyze saved capture file `captures.pcap`
`tcp`	`tcpdump -i eth0 tcp`	Show TCP packets only
`udp`	`tcpdump -i eth0 udp`	Show UDP packets only
`icmp`	`tcpdump -i eth0 icmp`	Show ICMP packets only
`ip`	`tcpdump -i eth0 ip`	Show IPv4 packets only
`ip6`	`tcpdump -i eth0 ip6`	Show IPv6 packets only
`arp`	`tcpdump -i eth0 arp`	Show ARP packets only
`rarp`	`tcpdump -i eth0 rarp`	Show RARP packets only
`slip`	`tcpdump -i eth0 slip`	Show SLIP packets only
`-I`	`tcpdump -i eth0 -I`	Set interface as monitor mode
`-K`	`tcpdump -i eth0 -K`	Don’t verify checksum
`-p`	`tcpdump -i eth0 -p`	Don’t capture in promiscuous mode

Filter Commands

Filter expression	Explanation
`src host 127.0.0.1`	Filter by source IP/hostname `127.0.0.1`
`dst host 127.0.0.1`	Filter by destination IP/hostname `127.0.0.1`
`host 127.0.0.1`	Filter by source or destination = `127.0.0.1`
`ether src 01:23:45:AB:CD:EF`	Filter by source MAC `01:23:45:AB:CD:EF`
`ether dst 01:23:45:AB:CD:EF`	Filter by destination MAC `01:23:45:AB:CD:EF`
`ether host 01:23:45:AB:CD:EF`	Filter by source or destination MAC `01:23:45:AB:CD:EF`
`src net 127.0.0.1`	Filter by source network location `127.0.0.1`
`dst net 127.0.0.1`	Filter by destination network location `127.0.0.1`
`net 127.0.0.1`	Filter by source or destination network location `127.0.0.1`
`net 127.0.0.1/24`	Filter by source or destination network location `127.0.0.1` with the tcpdump subnet mask of length `24`
`src port 80`	Filter by source port = 80
`dst port 80`	Filter by destination port = 80
`port 80`	Filter by source or destination port = 80
`src portrange 80-400`	Filter by source port value between 80 and 400
`dst portrange 80-400`	Filter by destination port value between 80 and 400
`portrange 80-400`	Filter by source or destination port value between 80 and 400
`ether broadcast`	Filter for Ethernet broadcasts
`ip broadcast`	Filter for IPv4 broadcasts
`ether multicast`	Filter for Ethernet multicasts
`ip multicast`	Filter for IPv4 multicasts
`ip6 multicast`	Filter for IPv6 multicasts
`ip src host mydevice`	Filter by IPv4 source hostname `mydevice`
`arp dst host mycar`	Filter by ARP destination hostname `mycar`
`rarp src host 127.0.0.1`	Filter by RARP source `127.0.0.1`
`ip6 dst host mywatch`	Filter by IPv6 destination hostname `mywatch`
`tcp dst port 8000`	Filter by destination TCP port = 8000
`udp src portrange 1000-2000`	Filter by source UDP ports in 1000–2000
`sctp port 22`	Filter by source or destination port = 22

Display Commands

Example	Explanation
`tcpdump -i eth0 -A`	Print each packet (minus its link level header) in ASCII. Handy for capturing web pages.
`tcpdump -D`	Print the list of the network interfaces available on the system and on which tcpdump can capture packets.
`tcpdump -i eth0 -e`	Print the link-level header on each output line, such as MAC layer addresses for protocols such as Ethernet and IEEE 802.11.
`tcpdump -i eth0 -F /path/to/params.conf`	Use the file `params.conf` as input for the filter expression. (Ignore other expressions on the command line.)
`tcpdump -i eth0 -n`	Don’t convert addresses (i.e., host addresses, port numbers, etc.) to names.
`tcpdump -i eth0 -S`	Print absolute, rather than relative, TCP sequence numbers. (Absolute TCP sequence numbers are longer.)
`tcpdump -i eth0 --time-stamp-precision=nano`	When capturing, set the timestamp precision for the capture to `tsp`: • `micro` for microsecond (default) • `nano` for nanosecond.
`tcpdump -i eth0 -t`	Omit the timestamp on each output line.
`tcpdump -i eth0 -tt`	Print the timestamp, as seconds since January 1, 1970, 00:00:00, UTC, and fractions of a second since that time, on each dump line.
`tcpdump -i eth0 -ttt`	Print a delta (microsecond or nanosecond resolution depending on the `--time-stamp-precision` option) between the current and previous line on each output line. The default is microsecond resolution.
`tcpdump -i eth0 -tttt`	Print a timestamp as hours, minutes, seconds, and fractions of a second since midnight, preceded by the date, on each dump line.
`tcpdump -i eth0 -ttttt`	Print a delta (microsecond or nanosecond resolution depending on the `--time-stamp-precision` option) between the current and first line on each dump line. The default is microsecond resolution.
`tcpdump -i eth0 -u`	Print undecoded network file system (NFS) handles.
`tcpdump -i eth0 -v`	Produce verbose output. When writing to a file (`-w` option) and at the same time not reading from a file (`-r` option), report to standard error, once per second, the number of packets captured.
`tcpdump -i eth0 -vv`	Additional verbose output than `-v`
`tcpdump -i eth0 -vvv`	Additional verbose output than `-vv`
`tcpdump -i eth0 -x`	Print the headers and data of each packet (minus its link level header) in hex.
`tcpdump -i eth0 -xx`	Print the headers and data of each packet, including its link level header, in hex.
`tcpdump -i eth0 -X`	Print the headers and data of each packet (minus its link level header) in hex and ASCII.
`tcpdump -i eth0 -XX`	Print the headers and data of each packet, including its link level header, in hex and ASCII.

Output Commands

Command	Example	Explanation
`-w captures.pcap`	`tcpdump -i eth0 -w captures.pcap`	Output capture to a file `captures.pcap`
`-d`	`tcpdump -i eth0 -d`	Dump the compiled packet-matching code in human-readable form to standard output, then stop
`-L`	`tcpdump -i eth0 -L`	Display data link types for the interface
`-q`	`tcpdump -i eth0 -q`	Quick/quiet output. Print less protocol information, so output lines are shorter.
`-U`	`tcpdump -i eth0 -U -w out.pcap`	Make output packet-buffered. Without `-w`, each packet's description is printed as soon as it is captured. With `-w`, each packet is written to the output file `out.pcap` in real time rather than only when the output buffer fills.

Miscellaneous Commands

Operator	Syntax	Example	Description
`AND`	`and, &&`	`tcpdump -n src 127.0.0.1 and dst port 21`	Combine filtering options joined by “and”
`OR`	`or, ||`	`tcpdump dst 127.0.0.1 or src port 22`	Match any of the conditions joined by “or”
`EXCEPT`	`not, !`	`tcpdump dst 127.0.0.1 and not icmp`	Negate the condition prefixed by “not”
`LESS`	`less`, `len <=`	`tcpdump dst host 127.0.0.1 and less 128`	Shows packets of at most 128 bytes; `less 128` is equivalent to `len <= 128`.
`GREATER`	`greater`, `len >=`	`tcpdump dst host 127.0.0.1 and greater 64`	Shows packets of at least 64 bytes; `greater 64` is equivalent to `len >= 64`.
`EQUAL`	`=, ==`	`tcpdump 'host 127.0.0.1 and len = 0'`	Show packets with zero length

Example Usage

Example	Explanation
`tcpdump -r outfile.pcap src host 10.0.2.15`	Print all packets in the file `outfile.pcap` coming from the host with IP address 10.0.2.15
`tcpdump -i any ip and not tcp port 80`	Listen on any network interface for IPv4 packets other than HTTP (TCP port 80)
`tcpdump -i eth0 -n -w pv01.pcap -c 30 'len > 32'`	Save 30 packets of length exceeding 32 bytes to `pv01.pcap` without DNS resolution on the `eth0` network interface
`tcpdump -AtuvX icmp`	Capture ICMP traffic and print ICMP packets in hex and ASCII and the following features: With: • headers • data • undecoded NFS handles Without: • link level headers • timestamps.
`tcpdump 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'`	Print all IPv4 HTTP packets to and from port 80, i.e. print only packets that contain data, not, for example, SYN and FIN packets and ACK-only packets.
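
The byte-offset arithmetic in the last filter can be checked by hand. The following is a minimal Python sketch of the same calculation, assuming a raw IPv4 packet that carries TCP; `tcp_payload_length` and `build_packet` are hypothetical helper names introduced for illustration, and the test packet leaves checksums and most header fields at zero.

```python
import struct


def tcp_payload_length(ip_packet: bytes) -> int:
    """Mirror the filter's arithmetic on a raw IPv4 packet carrying TCP."""
    total_length = struct.unpack("!H", ip_packet[2:4])[0]  # ip[2:2]
    ip_header_len = (ip_packet[0] & 0x0F) << 2             # (ip[0]&0xf)<<2
    tcp = ip_packet[ip_header_len:]                        # tcp[] is indexed from here
    tcp_header_len = (tcp[12] & 0xF0) >> 2                 # (tcp[12]&0xf0)>>2
    return total_length - ip_header_len - tcp_header_len


def build_packet(payload: bytes) -> bytes:
    """Hypothetical helper: minimal IPv4 + TCP headers (20 bytes each) plus payload."""
    ip_header = bytearray(20)
    ip_header[0] = 0x45                                    # version 4, IHL 5 (5 * 4 = 20 bytes)
    struct.pack_into("!H", ip_header, 2, 40 + len(payload))  # total length field
    ip_header[9] = 6                                       # protocol number for TCP
    tcp_header = bytearray(20)
    tcp_header[12] = 0x50                                  # data offset 5 (5 * 4 = 20 bytes)
    return bytes(ip_header) + bytes(tcp_header) + payload


print(tcp_payload_length(build_packet(b"")))       # 0: an ACK-only style packet, filtered out
print(tcp_payload_length(build_packet(b"GET /")))  # 5: carries data, so the filter matches
```

The filter keeps a packet only when this value is non-zero, which is why bare SYN, FIN, and ACK segments are dropped.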

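References / reposted from

bpftrace Techniques 1: Common Commands

Posted on 2025-12-10

What is bpftrace

bpftrace is a high-level tracing language for Linux, built on eBPF, that can trace both kernel-space and user-space code.

Listing the kernel functions that can be traced

bpftrace -l | grep kprobe

Kernel function tracing examples

Checking whether a function is called

Trace two functions in net/ipv4/netfilter/ip_tables.c: compat_do_ipt_get_ctl and do_ipt_get_ctl.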

~# bpftrace -e 'kprobe:do_ipt_get_ctl { printf("function was called!\n"); }'
Attaching 1 probe...
function was called!
function was called!

The signature of compat_do_ipt_get_ctl is:

static int compat_do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)

Caller command, PID, and arguments

Create a file named test.bpf:

#include <net/sock.h>

kprobe:do_ipt_get_ctl
{
    printf("called by %s (pid: %d). and: %d\n", comm, pid, ((struct sock *)arg0)->__sk_common.skc_family);
}

Running it gives the following:

~# bpftrace test.bpf
/bpftrace/include/stdarg.h:52:1: warning: null character ignored [-Wnull-character]       
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:209:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:256:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/bitops.h:310:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/jump_label.h:23:2: error: 'asm goto' constructs are not supported yet
/lib/modules/4.19.0-8-amd64/source/arch/x86/include/asm/signal.h:24:2: note: array 'sig' declared here
Attaching 1 probe...
called by iptables-legacy (pid: 2981). and: 2
called by iptables-legacy (pid: 2981). and: 2

The value 2 is AF_INET, as the address family definitions show:

/usr/src/linux-headers-4.19.0-8-common/include/linux/socket.h
160 /* Supported address families. */
161 #define AF_UNSPEC   0
162 #define AF_UNIX     1   /* Unix domain sockets      */
163 #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
164 #define AF_INET     2   /* Internet IP Protocol     */
165 #define AF_AX25     3   /* Amateur Radio AX.25      */
166 #define AF_IPX      4   /* Novell IPX           */

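Printing arguments

bpftrace -e 'kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); }'

Printing return values

Use the built-in variable retval.

Kernel function return values

Use kretprobe to trace the return value of a kernel function. For example, trace the return value of vfs_read (the number of bytes read, or an error code):

bpftrace -e 'kretprobe:vfs_read { printf("vfs_read returned: %d\n", retval); }'

User-space function return values

Use uretprobe to trace the return value of a user-space function. For example, trace the return value of a function named myfunc:

bpftrace -e 'uretprobe:/path/to/binary:myfunc { printf("myfunc returned: %d\n", retval); }'

Combining entry and return probes

To measure how long a function takes, or to correlate its arguments with its return value, record a timestamp at the function entry (kprobe/uprobe), keyed by tid (a built-in variable holding the thread ID), and compute the difference in the return probe. For example: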

bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } 
                kretprobe:vfs_read /@start[tid]/ { 
                    $duration = nsecs - @start[tid]; 
                    printf("vfs_read took %d ns, returned %d\n", $duration, retval); 
                    delete(@start[tid]); 
                }'

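Filter conditions

You can add a filter to a return probe with the /<filter>/ syntax, for example to handle only calls from a particular process or with return values in a given range. The example below prints only vfs_read calls that returned more than 0. In a kretprobe, retval is an unsigned 64-bit integer, so it is cast to int64 here; otherwise negative error codes would also pass the filter:

bpftrace -e 'kretprobe:vfs_read /(int64)retval > 0/ { printf("vfs_read returned: %d\n", retval); }'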
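
One detail is easy to miss when filtering on return values: a negative error code stored in an unsigned 64-bit register looks like a very large positive number. The sketch below models this in Python, assuming retval is exposed as an unsigned 64-bit value (as bpftrace documents for kretprobe); `as_u64` and `as_i64` are hypothetical helper names.

```python
MASK_64 = (1 << 64) - 1


def as_u64(value: int) -> int:
    """Reinterpret a Python int as an unsigned 64-bit register value."""
    return value & MASK_64


def as_i64(raw: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return raw - (1 << 64) if raw >= (1 << 63) else raw


EAGAIN = 11                  # errno value; the kernel returns -EAGAIN on failure
raw = as_u64(-EAGAIN)        # what an unsigned 64-bit retval would hold

print(raw)                   # 18446744073709551605
print(raw > 0)               # True: an unsigned comparison lets the error through
print(as_i64(raw) > 0)       # False: a signed comparison rejects it
```

This is the reason for casting to a signed type before comparing a return value against 0.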

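bpftrace Techniques 2: Map Variables and Map Functions

Posted on 2025-12-10

Core concepts and uses of maps

In bpftrace, the @ symbol defines and operates on map variables. Maps are the core mechanism bpftrace uses for efficient data aggregation; you can think of a map as a small but capable database that lives in the kernel.

A map's main job is to store, share, and aggregate data across probes. When one event fires you can record data in a map; when a related event fires later you can read or update that data. This is what makes more complex tracing logic possible.

Property	Description
Persistence	Data in a map persists across probe events, unlike ordinary variables, which are reset after each event. In effect, maps are global variables.
Key-value structure	A map indexes values by key, in the form `@map_name[key] = value`. A key can be a single value or a combination of several values (such as `@a[pid, comm]`).
Automatic output	By default, when a bpftrace program exits (for example when you press `Ctrl-C`), all non-empty maps are printed to the screen.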

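Common map operations and examples

Operation/function	Description	Example
Assignment	Assign a value to a map directly.	`@start_time[tid] = nsecs` (record a thread's start time)
Counting (`count()`)	Count how many times an event occurs.	`@syscall_count[comm] = count()` (count system calls per process name)
Summing (`sum()`)	Accumulate a numeric value.	`@total_bytes[pid] = sum(args->ret)` (total bytes read per process)
Statistics (`avg()`, `min()`, `max()`)	Compute the average, minimum, or maximum.	`@response_time = avg($latency)`
Histogram (`hist()`)	Very useful: produces a power-of-2 histogram that shows the distribution of the data at a glance.	`@latency_ns = hist(nsecs - @start[tid])` (visualize the latency distribution of read operations)
Linear histogram (`lhist()`)	Produces a linear histogram with custom buckets.	`@read_sizes = lhist(args->ret, 0, 10000, 1000)` (distribution of read sizes)
Cleanup (`delete()`)	Remove a specific key-value pair from a map so that memory does not grow without bound.	`delete(@start_time[tid])` (remove the start time once the event has been handled)

Differences from other variable types

Variable type	Prefix	Scope and purpose
Map variables	`@`	Global. Persist and aggregate data across probes.
Built-in variables	none	Read-only. Provide event context such as `pid` (process ID), `comm` (command name), and `retval` (function return value).
Scratch variables	`$`	Local and temporary. Hold intermediate results within a single probe invocation, for example `$duration = nsecs - @start[tid]`.

Map function reference

Prototype	Purpose and parameters	Typical use
`count()`	Counting. Counts how many times the event fired. Takes no parameters.	Counting system calls, function calls, and so on.
`sum(int n)`	Summing. Accumulates the value of `n`.	Total bytes read or written, total time spent.
`avg(int n)`	Average. Computes the average of `n`.	Average latency, average packet size.
`min(int n)`	Minimum. Records the smallest value of `n`.	Tracking the minimum latency or smallest block size.
`max(int n)`	Maximum. Records the largest value of `n`.	Tracking the maximum latency or largest block size.
`stats(int n)`	Summary statistics. Returns the count, average, and total of `n`.	Getting an overview of one metric in a single call.
`hist(int n)`	Logarithmic histogram. Shows the distribution of `n` in power-of-2 buckets (such as `[4, 8)`, `[8, 16)`).	Showing how latencies or sizes are distributed, which makes patterns easy to spot.
`lhist(int n, int min, int max, int step)`	Linear histogram. Shows the distribution of `n` in linear buckets from `min` to `max` with width `step`.	Analysis that needs fixed, custom bucket boundaries.
`delete(@m[key])`	Delete a key-value pair. Removes `key` and its value from the map `@m`.	Cleaning up temporary data so the map does not grow without bound; common with paired probes (such as kprobe/kretprobe).
`clear(@m)`	Clear a map. Removes all key-value pairs from the map `@m`.	Periodically resetting statistics in a timer probe (such as `interval`).
`zero(@m)`	Zero a map. Resets the value of every key in the map `@m` to 0.	Resetting counts or sums while keeping the keys.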

Examples

count(): counting system calls

Count the number of system calls made under each process name.

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

Output

@[bash]: 15
@[sshd]: 28
@[snmpd]: 102

@[comm]: uses the process name comm as the key.
count(): adds 1 to the key's value each time the event fires.

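sum(int n): total bytes read

Accumulate the number of bytes successfully read by all processes through the read system call.

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @bytes = sum(args->ret); }'

Output

@bytes: 1048576

/args->ret > 0/: a filter that handles only successful reads (return value greater than 0).
sum(args->ret): accumulates the return value (the number of bytes read).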

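avg(int n): average read size

Compute the average number of bytes returned by each successful read system call.

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @avg_size = avg(args->ret); }'

Output

@avg_size: 512

/args->ret > 0/: a filter that handles only successful reads (return value greater than 0).
avg(args->ret): averages the return value (the number of bytes read).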

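stats(int n): summary statistics

Collect summary statistics on the return value of the read system call.

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret > 0/ { @s = stats(args->ret); }'

Output

@s: count 100, average 4096, total 409600

/args->ret > 0/: a filter that handles only successful reads (return value greater than 0).
stats(args->ret): reports the number of calls together with the average and total of the return value (the number of bytes read). Use min() and max() separately if the extremes are needed.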

hist(int n): logarithmic distribution of read sizes

Show the distribution of the read system call's return value in power-of-2 buckets.

bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = hist(args->ret); }'

Output

@bytes:
[0, 1]                12 |@@@@@@@@@@@@@@@@@@@@                                |
[2, 4)                18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[4, 8)                 0 |                                                    |
...
[128, 256)             1 |@

hist(args->ret): splits the values into buckets bounded by powers of 2, [2^x, 2^(x+1)), and counts how many values fall into each bucket.
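
The bucketing rule can be sketched in a few lines of Python. This is a simplified model rather than bpftrace's implementation: `log2_bucket` and `histogram` are hypothetical names, and the bucket labels follow the sample output shown in this post (the exact label format may differ between bpftrace versions).

```python
from collections import Counter


def log2_bucket(n: int) -> str:
    """Return the power-of-2 bucket label that a value falls into."""
    if n < 0:
        return "(..., 0)"              # negative values, e.g. failed reads
    if n < 2:
        return "[0, 1]"                # 0 and 1 share a bucket in the sample output
    low = 1 << (n.bit_length() - 1)    # largest power of 2 that is <= n
    return f"[{low}, {low * 2})"


def histogram(values):
    """Count how many values land in each bucket."""
    return Counter(log2_bucket(v) for v in values)


for label, count in histogram([0, 1, 3, 3, 5, 200]).items():
    print(f"{label:<12} {count}")
```

Because each bucket is twice as wide as the previous one, a handful of rows can cover values from a few bytes to many megabytes.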

lhist(int n, int min, int max, int step): linear distribution of read sizes

Build a linear histogram of the read return value, covering 0 to 2000 with a step of 200.

bpftrace -e 'tracepoint:syscalls:sys_exit_read { @bytes = lhist(args->ret, 0, 2000, 200); }'

Output

@bytes:
(..., 0)                0 |                                                    |
[0, 200)              66 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400)             2 |@                                                   |
...
[2000, ...)            39 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |

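lhist(args->ret, 0, 2000, 200): splits the range 0 to 2000 into buckets 200 wide and counts how many values fall into each bucket; values below 0 or at 2000 and above go into the two open-ended buckets.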
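
The linear bucketing can be modelled the same way. A minimal Python sketch, assuming the range is an exact multiple of the step as in the example above; `linear_bucket` is a hypothetical name and the labels follow the sample output in this post.

```python
def linear_bucket(n: int, lo: int, hi: int, step: int) -> str:
    """Return the fixed-width bucket label that a value falls into."""
    if n < lo:
        return f"(..., {lo})"          # underflow bucket
    if n >= hi:
        return f"[{hi}, ...)"          # overflow bucket
    start = lo + ((n - lo) // step) * step
    return f"[{start}, {start + step})"


for value in (-1, 150, 200, 1999, 2000, 65536):
    print(value, linear_bucket(value, 0, 2000, 200))
```

Unlike hist(), every bucket has the same width, so the overflow bucket can end up holding a large share of the samples when the range is chosen too narrowly.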

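delete(@m[key]): cleaning up temporary data

After computing how long a function took, delete the temporary key that stored the start time so the map does not grow without bound.

bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
             kretprobe:vfs_read /@start[tid]/ {
                 $duration_ns = nsecs - @start[tid];
                 @ns = hist($duration_ns);
                 delete(@start[tid]);
             }'

delete() is typically used with paired probes (such as kprobe and kretprobe) to release the entry as soon as the operation completes.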
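
The entry/return pairing can be pictured as a dictionary keyed by thread ID. The sketch below is a plain-Python model of that bookkeeping, not bpftrace code; `on_entry`, `on_return`, and the sample timestamps are made up for illustration.

```python
start = {}        # plays the role of @start[tid]
durations = []    # stands in for the values fed to hist()


def on_entry(tid: int, now_ns: int) -> None:
    """Entry probe: remember when this thread entered the function."""
    start[tid] = now_ns


def on_return(tid: int, now_ns: int) -> None:
    """Return probe: compute the duration, then drop the temporary key."""
    if not start.get(tid, 0):      # same idea as the /@start[tid]/ filter
        return                     # entry was not seen, so skip this return
    durations.append(now_ns - start[tid])
    del start[tid]                 # same idea as delete(@start[tid])


on_entry(1, 100)
on_entry(2, 150)
on_return(1, 400)
on_return(3, 500)                  # no matching entry: ignored
on_return(2, 450)

print(durations)  # [300, 300]
print(start)      # {} : nothing is left behind
```

Without the delete step, one entry per thread would stay in the map for as long as the trace runs.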

clear(@m) and zero(@m): resetting a map

Print and clear the system call counts every 5 seconds.

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); clear(@); }'

Output

@[Relay(89295)]: 3
@[mini_init]: 4
@[gmain]: 5
@[Relay(909)]: 6
@[Relay(893)]: 6
@[systemd-udevd]: 37
@[chronyd]: 40
@[bpftrace]: 118

clear(@m) deletes every key-value pair in the map, whereas zero(@m) resets the value of every key to 0 but keeps the keys.

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:5 { print(@); zero(@); }'

Output

@[gmain]: 5
@[Relay(893)]: 6
@[Relay(909)]: 6
@[Relay(89295)]: 9
@[mini_init]: 12
@[chronyd]: 44
@[init]: 103
@[bpftrace]: 114
@[GnsPortTracker]: 242
@[top]: 350
@[grep]: 664
@[sh]: 923
@[Xwayland]: 1978
@[ps]: 2778
@[libuv-worker]: 4028
@[ls]: 4256
@[node]: 4722
@[mini_init]: 0   <--------- note this entry: no calls in this interval, but the key was kept
@[Relay(109)]: 3
@[Relay(909)]: 6
@[Relay(89295)]: 6
@[Relay(893)]: 6
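
The difference between the two resets can be modelled with an ordinary Python dictionary. This is only an analogy for the behaviour shown above; the function names mirror the bpftrace ones, and the sample counts are made up.

```python
def clear(m: dict) -> None:
    """Model of clear(@m): drop every key-value pair."""
    m.clear()


def zero(m: dict) -> None:
    """Model of zero(@m): keep every key, reset each value to 0."""
    for key in m:
        m[key] = 0


counts_a = {"mini_init": 4, "chronyd": 40}
counts_b = dict(counts_a)

clear(counts_a)
zero(counts_b)

print(counts_a)  # {}
print(counts_b)  # {'mini_init': 0, 'chronyd': 0}
```

After zero(), a process that made no calls in the next interval still shows up with a count of 0, which matches the `@[mini_init]: 0` line in the output above.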

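Map characteristics

Automatic printing: by default, when a bpftrace program terminates (for example when the user presses Ctrl-C), all non-empty map variables are printed.
Filters: /<filter>/ sets a condition, and the action that follows runs only when the condition is met, which makes scripts more efficient and their output more focused.
Combining with variables: map functions are often combined with built-in variables (such as comm, pid, nsecs) or scratch variables (prefixed with $) to implement more complex tracing logic.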