Forum for asking questions related to block device drivers

Thu Apr 11 19:02:41 EDT 2013

On Thu, Apr 11, 2013 at 4:48 PM, neha naik <nehanaik27 at gmail.com> wrote:
> HI Greg,
>    Thanks a lot. Everything you said made complete sense to me but when i
> tried running with following options my read is so slow (basically with
> direct io, that with 1MB/s it will just take 32minutes to read 32MB data)
> yet my write is doing fine. Should i use some other options of dd (though i
> understand that with direct we bypass all caches, but direct doesn't
> guarantee that everything is written when call returns to user for which i
> am using fdatasync).
>
> time dd if=/dev/shm/image of=/dev/sbd0 bs=4096 count=262144 oflag=direct
> conv=fdatasync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 17.7809 s, 60.4 MB/s
>
> real    0m17.785s
> user    0m0.152s
> sys    0m1.564s
>
>
> I interrupted the dd for read because it was taking too much time with 1MB/s
> :
> time dd if=/dev/pdev0 of=/dev/null bs=4096 count=262144 iflag=direct
> conv=fdatasync
> ^C150046+0 records in
> 150045+0 records out
> 614584320 bytes (615 MB) copied, 600.197 s, 1.0 MB/s
>
>
> real    10m0.201s
> user    0m2.576s
> sys    0m0.000s

Before reading the below, please note that rotating disks are divided
into zones, each with a constant number of sectors/track.  In the below
I discuss 1 track as holding 1MB of data.  I believe that is roughly
accurate for an outer track with near 3" of diameter.  An inner track
with roughly 2" of diameter would only have 2/3rds of 1MB of data.  I
am ignoring that for simplicity's sake.  You can worry about it
yourself separately.

====
When you use iflag=direct, you are telling the kernel, I know what I'm
doing, just do it.

So let's do some math and see if we can figure it out.  I assume you
are working with rotating media as your backing store for the LVM
volumes.

A rotating disk at 6000 RPM takes 10 milliseconds per revolution.
(I'm using this because the math is easy.  Check the specs for your
drives.)

With iflag=direct, you have taken control of interacting with a
rotating disk that can only read a given sector once per rotation.
That is, the relevant sectors are only below the read head once every
10 msecs.

So, you are saying, give me 4KB every time the data rotates below the
read head.  That happens about 100 times per second, so per my logic
you should be seeing 400KB/sec read rate.

You are actually getting 1.0MB/sec, roughly 2.5 times that.  Thus my
question is: what is happening in your setup that you are getting 10KB
per rotation instead of the 4KB you asked for?  (The answer could be
that you have 15K rpm drives instead of the 6K rpm drives I calculated
for.  At 250 rotations per second, 4KB per rotation comes to about
1MB/sec.)
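The arithmetic above is easy to redo for other spindle speeds.  Here is
a small Python sketch of the "one block per rotation" model.  The
function names are mine, and the model assumes the drive delivers
exactly one requested block each time the sector passes under the head,
with no read-ahead anywhere.

```python
def ms_per_revolution(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm


def one_block_per_rev_rate(rpm, block_bytes):
    """Bytes/sec if exactly one block is read per rotation (assumed model)."""
    revolutions_per_sec = rpm / 60
    return block_bytes * revolutions_per_sec


if __name__ == "__main__":
    for rpm in (6000, 15000):
        rate = one_block_per_rev_rate(rpm, 4096)
        print(f"{rpm} rpm: {ms_per_revolution(rpm):.1f} ms/rev, "
              f"{rate / 1000:.1f} kB/s with 4096-byte reads")
```

With 4096-byte reads this gives about 410 kB/s at 6000 rpm and about
1 MB/s at 15000 rpm, which is why a 15K rpm drive would explain the
1.0 MB/s in the quoted dd output.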

My laptop is giving 20MB/sec with bs=4KB, which implies I'm getting 50x
the speed I expect from the above theory.  I have to assume some form
of read-ahead is going on and reading something like 256KB at a time.
That logic may be in my laptop's disk and not the kernel (I don't know
for sure).
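Going the other way, you can take an observed rate and ask how much
data must be coming off the platter per rotation.  A rough sketch,
again assuming a 6000 rpm drive and the one-request-per-rotation model
(the helper names are hypothetical, not from any tool):

```python
def bytes_per_revolution(observed_bytes_per_sec, rpm):
    """Data that must be transferred per rotation to sustain a rate."""
    return observed_bytes_per_sec * 60 / rpm


def speedup_over_model(observed_bytes_per_sec, rpm, block_bytes):
    """Observed rate divided by the one-block-per-rotation prediction."""
    predicted = block_bytes * rpm / 60
    return observed_bytes_per_sec / predicted


if __name__ == "__main__":
    # Rates taken from the thread: 1.0 MB/s (pdev0) and 20 MB/s (laptop).
    for label, rate in (("pdev0", 1_000_000), ("laptop", 20_000_000)):
        per_rev = bytes_per_revolution(rate, 6000)
        ratio = speedup_over_model(rate, 6000, 4096)
        print(f"{label}: {per_rev / 1000:.0f} kB per rotation, "
              f"{ratio:.1f}x the 4096-byte model")
```

For the laptop figure this works out to about 200 kB per rotation,
which is in the same ballpark as a 256KB read-ahead.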

Arlie recommended 1 MB reads.  That should be a lot faster because a
disk track is roughly 1 MB, so you are telling the disk: As you spin,
when you get to the sector I care about, do a continuous read for a
full rotation (1MB).  By the time you ask for the next 1MB, I would
expect it will be too late to get the very next sector, so the drive
would do a full rotation looking for your sector, then do a continuous
1MB read.

So, if my logic is right the drive itself is doing:

rotation 1: searching for first sector of read
rotation 2: read 1MB continuously
rotation 3: searching for first sector of next read
rotation 4: read 1MB continuously
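The search/read pattern above can be put into numbers.  This sketch
charges every request one full "search" rotation plus the fraction of a
rotation needed to read the data.  The one-full-rotation penalty and
the 1MB track are the simplifications used in this mail, not measured
values, and the function names are made up for the example.

```python
def request_time_ms(rpm, track_bytes, request_bytes):
    """One search rotation plus the time to read the request (assumed model)."""
    rev_ms = 60_000 / rpm
    return rev_ms + rev_ms * request_bytes / track_bytes


def throughput(rpm, track_bytes, request_bytes):
    """Bytes/sec under the search-then-read model."""
    return request_bytes * 1000 / request_time_ms(rpm, track_bytes, request_bytes)


def efficiency(rpm, track_bytes, request_bytes):
    """Fraction of the drive's streaming rate (one track per rotation)."""
    streaming = track_bytes * rpm / 60
    return throughput(rpm, track_bytes, request_bytes) / streaming


if __name__ == "__main__":
    MiB = 1024 * 1024
    for size in (4096, 64 * 1024, MiB):
        print(f"bs={size:>8}: {throughput(6000, MiB, size) / 1e6:6.2f} MB/s, "
              f"{efficiency(6000, MiB, size):.1%} of streaming rate")
```

With 1MB requests the model lands on exactly half the streaming rate,
which is the 50% of wasted rotations described here.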

I just checked my laptop's drive, and with bs=1MB it actually achieves
more or less max transfer rate, so for it at least with 1MB reads the
cpu / drive controller is able to keep up with the rotating disk and
not have the 50% wasted rotations that I would actually expect.
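If you want to repeat this kind of block-size comparison outside of dd,
something like the following Python sketch should work on Linux.
timed_read and its parameters are names I made up for the example.  It
assumes block_size is a multiple of the device's logical sector size
(O_DIRECT needs aligned buffers and sizes) and that you have permission
to read the device.

```python
import mmap
import os
import time


def timed_read(path, block_size, total_bytes, direct=True):
    """Read sequentially in block_size chunks; return (bytes_read, seconds)."""
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux only: bypass the page cache
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which satisfies O_DIRECT alignment.
    buf = mmap.mmap(-1, block_size)
    done = 0
    start = time.perf_counter()
    try:
        while done < total_bytes:
            n = os.readv(fd, [buf])
            if n == 0:  # end of file or device
                break
            done += n
    finally:
        elapsed = time.perf_counter() - start
        os.close(fd)
        buf.close()
    return done, elapsed
```

A call such as timed_read("/dev/sdX", 4096, 64 * 1024 * 1024), with
/dev/sdX standing in for your own device, followed by the same call
with a 1048576-byte block size, gives the two numbers to compare.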

Again it appears something is doing some read ahead.  Let's assume my
laptop's disk does a 256KB readahead every time it gets a read
request.  So when it gets that 1MB request, it actually reads
1MB+256KB, but it returns the first 1MB to the cpu as soon as it has
it.  Thus when the 1MB is returned to the cpu, the drive is still
working on the next 256KB and putting it in on-disk cache.  If 256KB
is 1/4 of a track's data, then it takes the disk about 2.5 msecs to
read that data from the rotating platter into the drive's internal
controller cache.  If during that 2.5 msecs the cpu issues the next
1MB read request, the disk will just continue reading and not have any
dead time.
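The 2.5 msec window falls straight out of the track size and the
rotation time.  A minimal sketch, assuming the 6000 rpm drive, the 1MB
track and the guessed 256KB read-ahead from this mail (none of these
are confirmed for any real drive, and the function names are mine):

```python
def readahead_window_ms(rpm, track_bytes, readahead_bytes):
    """Time the drive spends reading its read-ahead data, in milliseconds."""
    rev_ms = 60_000 / rpm
    return rev_ms * readahead_bytes / track_bytes


def keeps_streaming(host_gap_ms, rpm, track_bytes, readahead_bytes):
    """True if the next request arrives before the read-ahead finishes."""
    return host_gap_ms <= readahead_window_ms(rpm, track_bytes, readahead_bytes)


if __name__ == "__main__":
    MiB = 1024 * 1024
    window = readahead_window_ms(6000, MiB, 256 * 1024)
    print(f"read-ahead window: {window} ms")
    for gap in (1.0, 5.0):
        print(f"host gap {gap} ms -> no dead time: "
              f"{keeps_streaming(gap, 6000, MiB, 256 * 1024)}")
```

If the host takes longer than the window to issue the next request, the
model falls back to the search rotation case described earlier.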

If you want to understand exactly what is happening you would need to
monitor exactly what is going back and forth across the sata bus.  Is
the kernel doing a read-ahead even with direct io?  Is the drive doing
some kind of read ahead? etc.

If you are going to work with direct io, hopefully the above gives you
a new way to think about things.
Greg