Hi Greg,
I am using an SSD underneath. However, my problem is not exactly related to the disk cache. I think I should give some more background.

These are my key points:

1. Read on my passthrough driver on top of LVM is slower than read on just the LVM volume (with or without any kind of direct I/O).
2. Read on my passthrough driver (on top of LVM) is slower than write on my passthrough driver (on top of LVM).
3. If I disable LVM readahead (we can do that for any block device) then its read performance becomes almost equal to the read performance of my passthrough driver. This suggests that LVM readahead was helping LVM's performance. But if it helps LVM's performance, then it should also help the performance of my passthrough driver (which is sitting on top of it). This led me to think that I am doing something in my device driver which is possibly either disabling the LVM readahead, or LVM readahead gets switched off when it is not interacting with the kernel directly.
Given this, I am thinking there may be some issue with how I have written my device driver (or rather, how I have used the API). I am using the 'merge_bvec_fn' of the LVM device underneath it, which I think should have merged the I/Os (since we are doing sequential I/O). But that is clearly not the case: when I print the pages that come to my driver, I see that 'make_request' gets called with one page each time. Shouldn't the I/O be getting merged using the LVM merge function, or does it not work like that? That is, should each driver write its own 'merge_bvec_fn' and not rely on the driver beneath it to take care of that? Or is there some problem in how I pass the request down to LVM (should I be calling something else or passing some kind of flag)?
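To make the question concrete, here is a simplified sketch of the shape of my driver on the 3.x kernel I am using ('sbd_dev', 'lower_bdev', the sector offset and the merge callback are placeholders to show the structure, not my exact code):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* minimal state: the LVM device we sit on and our offset into it */
struct sbd_dev {
	struct block_device *lower_bdev;
	sector_t start_sector;
	struct request_queue *queue;
};

/* bio-based entry point, registered with blk_queue_make_request()
 * (void return on 3.2+; older kernels return int) */
static void sbd_make_request(struct request_queue *q, struct bio *bio)
{
	struct sbd_dev *dev = q->queuedata;

	/* redirect the bio to the LVM device underneath and resubmit it */
	bio->bi_bdev = dev->lower_bdev;
	bio->bi_sector += dev->start_sector;
	generic_make_request(bio);
}

/*
 * Is something like this needed on my queue (registered with
 * blk_queue_merge_bvec()) so that bio_add_page() on my device asks
 * the LVM device below how much it is willing to take in one bio?
 */
static int sbd_merge_bvec(struct request_queue *q,
			  struct bvec_merge_data *bvm,
			  struct bio_vec *biovec)
{
	struct sbd_dev *dev = q->queuedata;
	struct request_queue *lq = bdev_get_queue(dev->lower_bdev);
	struct bvec_merge_data lbvm = *bvm;

	if (!lq->merge_bvec_fn)
		return biovec->bv_len;	/* lower queue imposes no limit */

	lbvm.bi_bdev = dev->lower_bdev;
	lbvm.bi_sector += dev->start_sector;
	return lq->merge_bvec_fn(lq, &lbvm, biovec);
}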
Regards,
Neha

On Thu, Apr 11, 2013 at 5:02 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
On Thu, Apr 11, 2013 at 4:48 PM, neha naik <nehanaik27@gmail.com> wrote:
> HI Greg,
> Thanks a lot. Everything you said made complete sense to me but when i
> tried running with following options my read is so slow (basically with
> direct io, that with 1MB/s it will just take 32minutes to read 32MB data)
> yet my write is doing fine. Should i use some other options of dd (though i
> understand that with direct we bypass all caches, but direct doesn't
> guarantee that everything is written when call returns to user for which i
> am using fdatasync).
>
> time dd if=/dev/shm/image of=/dev/sbd0 bs=4096 count=262144 oflag=direct
> conv=fdatasync
> time dd if=/dev/pdev0 of=/dev/null bs=4096 count=262144
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 17.7809 s, 60.4 MB/s
>
> real 0m17.785s
> user 0m0.152s
> sys 0m1.564s
>
>
> I interrupted the dd for read because it was taking too much time with 1MB/s:
> time dd if=/dev/pdev0 of=/dev/null bs=4096 count=262144 iflag=direct
> conv=fdatasync
> ^C150046+0 records in
> 150045+0 records out
> 614584320 bytes (615 MB) copied, 600.197 s, 1.0 MB/s
>
>
> real 10m0.201s
> user 0m2.576s
> sys 0m0.000s

Before reading the below, please note that rotating disks are made of zones with a constant number of sectors per track. In the below I discuss one track as holding 1MB of data. I believe that is roughly accurate for an outer track with nearly 3" of diameter. An inner track with roughly 2" of diameter would only hold about 2/3 of 1MB of data. I am ignoring that for simplicity's sake. You can worry about it yourself separately.

====
When you use iflag=direct, you are telling the kernel: I know what I'm doing, just do it.

So let's do some math and see if we can figure it out. I assume you are working with rotating media as your backing store for the LVM volumes.

A rotating disk at 6000 RPM takes 10 milliseconds per revolution. (I'm using this because the math is easy. Check the specs for your drives.)

With iflag=direct, you have taken control of interacting with a rotating disk that can only read data once every rotation. That is, the relevant sectors are only below the read head once every 10 msecs.

So, you are saying: give me 4KB every time the data rotates below the read head. That happens about 100 times per second, so per my logic you should be seeing a 400KB/sec read rate.

You are actually getting roughly two and a half times that. Thus my question is: what is happening in your setup that you are getting 10KB per rotation instead of the 4KB you asked for? (The answer could be that you have 15K RPM drives instead of the 6K RPM drives I calculated for.)

My laptop is giving 20MB/sec with bs=4KB, which implies I'm getting 50x the speed I expect from the above theory. I have to assume some form of read-ahead is going on, reading 256KB at a time. That logic may be in my laptop's disk and not the kernel (I don't know for sure).

Arlie recommended 1 MB reads. That should be a lot faster because a disk track is roughly 1 MB, so you are telling the disk: as you spin, when you get to the sector I care about, do a continuous read for a full rotation (1MB). By the time you ask for the next 1MB, I would expect it will be too late to get the very next sector, so the drive would do a full rotation looking for your sector, then do a continuous 1MB read.

So, if my logic is right, the drive itself is doing:

rotation 1: searching for first sector of read
rotation 2: read 1MB continuously
rotation 3: searching for first sector of next read
rotation 4: read 1MB continuously

I just checked my laptop's drive, and with bs=1MB it actually achieves more or less the max transfer rate, so for it at least, with 1MB reads the cpu / drive controller is able to keep up with the rotating disk and not have the 50% wasted rotations that I would actually expect.
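
To put rough numbers on that model (using the 6000 RPM and 1MB-per-track guesses from above, not your drive's actual specs), a little back-of-the-envelope program:

#include <stdio.h>

int main(void)
{
	/* assumed geometry, not measured values */
	double ms_per_rev   = 60.0 * 1000.0 / 6000.0;	/* 6000 RPM -> 10 ms per revolution */
	double revs_per_sec = 1000.0 / ms_per_rev;	/* 100 rotations per second */

	/* one 4KB request serviced per rotation (no readahead anywhere) */
	printf("4KB per rotation:       %.0f KB/sec\n", 4.0 * revs_per_sec);

	/* 1MB requests, but every other rotation wasted re-finding the start */
	printf("1MB every 2nd rotation: %.0f MB/sec\n", revs_per_sec / 2.0);

	/* 1MB per rotation, if readahead keeps the head streaming */
	printf("1MB every rotation:     %.0f MB/sec\n", revs_per_sec);

	return 0;
}

That prints 400 KB/sec, 50 MB/sec and 100 MB/sec, so with my guesses the wasted rotations alone would cut 1MB reads to about half of the streaming rate.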

Again it appears something is doing some read ahead. Let's assume my laptop's disk does a 256KB readahead every time it gets a read request. So when it gets that 1MB request, it actually reads 1MB+256KB, but it returns the first 1MB to the cpu as soon as it has it. Thus when the 1MB is returned to the cpu, the drive is still working on the next 256KB and putting it in its on-disk cache. If 256KB is 1/4 of a track's data, then it takes the disk about 2.5 msecs to read that data from the rotating platter into the drive's internal controller cache. If during that 2.5 msecs the cpu issues the next 1MB read request, the disk will just continue reading and not have any dead time.

If you want to understand exactly what is happening, you would need to monitor exactly what is going back and forth across the SATA bus. Is the kernel doing a read-ahead even with direct io? Is the drive doing some kind of read ahead? Etc.

If you are going to work with direct io, hopefully the above gives you a new way to think about things.

Greg