Am I reading the iostat command output correctly?
The iostat command comes from the same sysstat package family:
# rpm -qf `which iostat`
sysstat-11.7.3-6.el8.x86_64
It mainly reads its data from /proc/diskstats:
# cat /proc/diskstats
259 0 nvme1n1 147 0 6536 2888 0 0 0 0 0 798 2888 0 0 0 0
259 1 nvme0n1 33608 16 1588184 382879 2001 376 142595 20368 0 195069 403248 0 0 0 0
259 2 nvme0n1p1 41 0 328 402 0 0 0 0 0 331 402 0 0 0 0
259 3 nvme0n1p2 33524 16 1585472 382298 2001 376 142595 20368 0 195013 402667 0 0 0 0
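As a rough sketch of how per-second rates can be derived from these raw counters, you can sample the completed-I/O fields twice, one second apart, and take the delta (field 4 after the major/minor numbers is reads completed, field 8 is writes completed, per the kernel's iostats documentation; nvme0n1 is just the example device from above):

# s1=$(awk '$3=="nvme0n1" {print $4+$8}' /proc/diskstats)
# sleep 1
# s2=$(awk '$3=="nvme0n1" {print $4+$8}' /proc/diskstats)
# echo "tps: $((s2-s1))"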
- If you run iostat without any options, e.g. iostat 1 2, the first argument (1) is the interval (the amount of time between each report) and the second (2) is the count (the number of reports generated); see the example after the note below.
NOTE:
- If you don't specify the count, it will run forever.
- The first report is the average from boot until now, so it's a good idea to ignore it; we are mostly interested in the second and all subsequent reports.
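For example (the rate and CPU numbers below are purely illustrative, not from a real run; the totals match the /proc/diskstats counters shown earlier):

# iostat 1 2
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.20    0.00    0.45    0.10    0.00   98.25

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           3.50        82.10        16.00     794092      71297
...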
Now let's look at each column bit by bit:
- Device: The name of the device or partition, as listed under the /dev directory.
- tps: Transactions per second, i.e. the number of I/O requests issued to the device per second.
- kB_read/s, kB_wrtn/s: The amount of data read from/written to the device, in kilobytes per second.
- kB_read, kB_wrtn: The total amount of data read/written over the sample interval. (When the POSIXLY_CORRECT environment variable is set, these columns appear as Blk_read/s, Blk_wrtn/s, Blk_read and Blk_wrtn instead, expressed in 512-byte sectors.)
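For example, converting the 1588184 sectors-read counter for nvme0n1 from the /proc/diskstats output above into kilobytes (the value the first report's kB_read column would show):

# echo "$((1588184 * 512 / 1024)) kB read"
794092 kB read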
The default output is not very useful because it does not tell us:
- the ratio of read to write requests
- the average size of the read/write requests
- how long each I/O takes
- how busy the device is
To solve this we need to use -t, -k and -x, along with -p, with iostat (the full command is shown after this list):
- -t prints a timestamp with every sample
- -x prints extended statistics
- -k reports throughput in kilobytes for more human-friendly output (-m reports in megabytes)
- -p prints a per-partition breakdown
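Putting these together (nvme0n1 is the example device from earlier; substitute your own):

# iostat -tkx -p nvme0n1 1 2

This prints a timestamped, extended, per-partition report every second, twice, with the following columns.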
- Device: The name of the device or partition, as listed under the /dev directory.
- r/s, w/s: The number of read/write requests (after merges) completed by the device/storage per second. Remember this accounts for completed I/O only, not I/O that is still waiting to be processed at the disk or scheduler level.
- rkB/s, wkB/s: The number of kilobytes transferred between the host and the storage/device per second.
- rrqm/s, wrqm/s: The number of read/write requests merged per second that were queued to the device.
- %rrqm, %wrqm: The percentage of read/write requests merged by the I/O scheduler before being sent to the device.
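In other words, assuming sysstat computes the percentage the obvious way: %rrqm = rrqm/s / (rrqm/s + r/s) × 100. For example, 25 merges/s against 75 completed reads/s gives 25 / (25 + 75) = 25%.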
You can check the scheduler in use on the system as follows; to know more about schedulers, check the following doc: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/managing_storage_devices/index#available-disk-schedulers_setting-the-disk-scheduler

# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
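The active scheduler is the one shown in brackets. You can switch it at runtime (the change lasts until reboot) by writing one of the other listed names back:

# echo mq-deadline > /sys/block/nvme0n1/queue/scheduler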
If you read the iostat man page, it uses the word "device" a lot, but the path an I/O takes to the device actually consists of several sub-parts: the I/O scheduler queue, the driver, and the storage itself. The remaining columns make more sense with that in mind.
rareq-sz/wareq-sz: The average size of the read/write requests issued to the device (in kilobytes in this sysstat version; multiply by 2 for 512-byte sectors). If you want to calculate the average read I/O size yourself, in sectors it is (rkB/s × 2) / (r/s), and likewise (wkB/s × 2) / (w/s) for writes. If you want reads and writes combined, it is (rkB/s + wkB/s) × 2 / (r/s + w/s).
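Illustrative numbers: with rkB/s = 400 and r/s = 100, the average read request is 400 / 100 = 4 kB, i.e. (400 × 2) / 100 = 8 sectors.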
aqu-sz: There are two levels of queues being maintained: one at the scheduler level and one at the device level. The driver does not maintain a queue of its own, but it keeps track of all the I/O requests passing through it. So aqu-sz is the average number of requests in the I/O scheduler queue (sized by nr_requests) plus the number of requests outstanding on the storage (the LUN queue depth). The scheduler queue size can be checked at:
/sys/block/nvme0n1/queue/nr_requests
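The device-level limit is not exposed the same way for every transport; for SCSI disks (sda here is a hypothetical example device) the LUN queue depth can be read from sysfs:

# cat /sys/block/sda/device/queue_depth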
r_await/w_await: The average time, in milliseconds, for a read/write request to be completed by the storage. This includes the time spent in the scheduler queue plus the time the storage spent servicing the request.
svctm: The average service time (in milliseconds) for I/O requests that were issued to the device. Warning! Do not trust this field any more. This field will be removed in a future sysstat version.
%util: Percentage of elapsed time during which I/O requests were issued to the device. Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs (i.e. anything that is not a single-disk device), this number does not reflect their performance limits.
So the bottom line: as long as the aqu-sz value remains below the LUN queue depth, I/O is passed quickly from the scheduler to the driver, which passes it to the HBA. The moment it exceeds that, you will start seeing I/O issues, and you can't do much, as your storage is not capable of handling this much traffic (so either move to much faster storage or ask your developers to change the I/O pattern of the application).
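A crude way to keep an eye on this is to pull just the aqu-sz column out of iostat. This sketch assumes aqu-sz is the 12th column of the -x layout in this sysstat version; the Device header row it prints lets you verify that on your system:

# iostat -x 1 | awk '$1=="Device" || $1=="nvme0n1" {print $1, $12}'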